The dark side of big data

Everybody (1, 2, 3) is talking about big data and the opportunities that must not be missed. So it would be fatal to squander them, for instance by imposing too many regulations, right?

First things first, what is big data about?

Control. While data analysts used to look into a murky pond, waiting to observe a fish every now and then, big data promises clear water and a thorough understanding of ongoing processes – what are all the fish doing? At its heart, the term describes nothing more than making the most of as much data as possible. Even though it is natural to want as much knowledge as possible, one should keep in mind that information always carries the danger of being exploited. Obviously, burglars are not supposed to know when families are on holiday. But likewise, a government should not keep track of who is voting against it. The more data you collect on processes, the better you can control them. And as the amount of data that modern societies produce keeps growing significantly, the threat of this data being misused grows accordingly.

What kind of digital data do we produce?

Only a few years ago, most of us didn’t produce any digital data at all. These days, we generate data actively as well as passively. When we search online, shop online, book trips online, communicate online (which includes uploading images and videos) or organise contacts and calendars online, we upload our data actively – some people generously upload all of their personal data. In addition, a lot of web pages use tools that analyse our browsing behaviour, and mobile apps create usage statistics or sometimes even spy on personal data. Most disconcerting of all, our friends and colleagues share information about us, commonly without thinking about it: by uploading all their contacts (often automatically, including profile pictures), by mentioning us in chatrooms or by sharing pictures of us. This happens passively, as we don’t share the data willingly.

By accumulating passive data alone, it would be possible to obtain an accurate picture of most people, composed of interests, social contacts and images. Once active data is taken into account as well, digital fingerprints become alarmingly complete. Let’s say you want to go on a sightseeing trip. Already during planning you produce data such as: When are you going where and on what budget? How long did you look at which websites and, as a result, what are you interested in? Which of your presumably closest friends did you ask to join you, and are these friends actually joining you? If not, did they make up an excuse or are they genuinely unable to come? And then, when you are actually travelling, you are tracked non-stop, and your automatic uploads reveal what you are looking at and whom you are meeting.

Most of this information is available straight away, but from a big data perspective much more can be deduced: Based on how you describe the trip in chat records, do you like it? Are you going to end your relationship? Does your activity correlate with that of someone you know? Is there a chance you are having an affair, are you susceptible to blackmail? …

If you are lucky, you want to go to Paris and some service provider tries to make as much money out of you as possible. If you are unlucky, you want to see Tibet and a whole government might be working against you.

I would like to finish this section by quoting a friend of mine:

“Not having a [company name] account does not mean that they have no information about you. It’s more like you don’t have a password to access your data.”
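
To make this point a bit more tangible, here is a minimal Python sketch of how contact lists uploaded by other people could add up to a profile of somebody who never registered. Everything in it is an assumption made for illustration: the names, phone numbers, attributes and the function `build_shadow_profiles` are hypothetical and do not describe how any real service works.

```python
from collections import defaultdict

# Hypothetical address books uploaded by three members of some service.
# The person behind "+100000001" never signed up herself.
uploaded_contacts = {
    "alice": [{"name": "Dana", "phone": "+100000001", "email": "dana@example.org"}],
    "bob": [{"name": "Dana K.", "phone": "+100000001", "employer": "Example Corp"}],
    "carol": [
        {"name": "D.", "phone": "+100000001", "birthday": "05-17"},
        {"name": "Erin", "phone": "+100000002"},
    ],
}


def build_shadow_profiles(uploads):
    """Merge all uploaded contact entries that share a phone number."""
    profiles = defaultdict(lambda: {"known_by": set(), "attributes": {}})
    for uploader, contacts in uploads.items():
        for entry in contacts:
            profile = profiles[entry["phone"]]
            # Every upload also reveals a social connection.
            profile["known_by"].add(uploader)
            for key, value in entry.items():
                if key != "phone":
                    profile["attributes"].setdefault(key, set()).add(value)
    return dict(profiles)


profiles = build_shadow_profiles(uploaded_contacts)
dana = profiles["+100000001"]
print(sorted(dana["known_by"]))
print({key: sorted(values) for key, values in dana["attributes"].items()})
```

None of the three uploaders shared much on their own, yet the merged entry contains an e-mail address, an employer, a birthday and a small social graph for a person who never agreed to anything.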

How could such data be exploited?

Some experts claim that there is no bad data, only bad uses of data. This logic is tempting, but then you could also say that there are no bad weapons, only bad people. Admittedly, there are countries relying on this logic, but the consequences are well known. I guess that most people would feel better knowing that the embarrassing picture taken at last year’s New Year’s Eve party does not exist, rather than hoping that nobody is going to prey on it.

For now, assume that all your digital data is accessible to a single entity, say your government. In this case you give your government incredible control over yourself and also over your relatives. A mere measurement of how defiant you are could be dangerous for you, depending on where you live. Furthermore, someone gazing at your data is not required to remain inactive. They might try to manipulate your opinion by showing you only the content on (social) media platforms that you are supposed to see.

Let’s try another thought experiment – surveillance of education. I believe that a biased education system is especially dangerous, as it makes it possible to manipulate people’s awareness, which in turn makes it possible to promote radical beliefs. Edsger W. Dijkstra once said:

“It is not the task of the University to offer what society asks for, but to give what society needs.”

The more data-driven and controlled life becomes, the more self-censorship will emerge, ultimately affecting research. Imagine you are forced to wear a GPS fitness tracking device and your employer, say the O.R.-well University, has access to your data. Such a state of affairs would immediately affect the behaviour of students, which would have to conform to expectations. Quick learners who skip a lecture every now and then might be punished for absence, as might students who leave their dorm after a certain time, and persons who show non-heterosexual patterns might be penalized. Is this scenario ridiculously far-fetched? Well, apparently we are there already.

The good news is that there is no single entity that has immediate access to all the data we produce. Some popular corporations that collect lots of personal data are: Google / Alphabet, Facebook, VK, Amazon, Microsoft, Dropbox, Apple and WeChat. There are also companies behind the scenes – called data brokers – that are just as interested in your data.

The bad news is that data monopolies get bigger and bigger, buying as many competitors as possible. And worse, governments are very successful at spying on these corporations, both legally and illegally. It is somewhat paradoxical that our lives become increasingly global while at the same time our digital trails become more and more centralised. Clearly, a more decentralised distribution of data would be more robust against exploitation.

But we can trust our governments. Why would they want to exploit our data?

Looking at history, I am afraid that there is no reason at all to trust any government. Seemingly stable states can acquire fascist characteristics surprisingly fast, but potentially dangerous data will not simply disappear. I am not aware of any region of the earth with a history free of human rights violations. Minority groups were and still are suppressed in many places. A frequently cited example in favour of data privacy is the creation of the so-called ‘pink lists’ in Germany. In the 19th century, prior to Nazi Germany, lists of homosexuals were maintained without the intention to suppress homosexuals. When the Nazis obtained those files years later, they were used to systematically kill the listed individuals. Sure, this is an extreme case, but the possibilities offered by big data are also far greater. And it is the pattern that worries me: supposedly harmless data becoming very dangerous later.

Large-scale, in-depth data on a society could be used for subtle and effective propaganda, not to mention persecution. I doubt that another French Revolution would be possible once there is a government exploiting big data. For more inspiration on this matter you might want to read George Orwell’s “1984” or Dave Eggers’ “The Circle”.

Anything else?

It is not only governments that are eager to collect data. Companies, especially those that are data-driven, such as insurance companies, sense new profits in acquiring new data. This isn’t necessarily bad and is maybe more of a personal opinion, but I don’t like some of the implications. There is a trend towards usage-based insurance. For example, you get cheaper car insurance when you drive safely and accept that your driving behaviour is fully tracked. The recorded data is very sensitive, which raises data privacy concerns. And of course, insurance companies don’t simply hand out free discounts. In the end, customers who do not agree to the tracking will pay higher prices, which I think is discriminatory. The same concept applies to sharing your fitness data with your health insurance via fitness tracking devices. Ultimately, you adjust your behaviour in order to function as some entities expect you to, whilst being transparent – or you pay the price. There are also concerns regarding data misinterpretation, which is briefly covered on the second webpage linked in this section.

What will be possible in the future?

Currently, machine learning techniques are more popular than ever before and one research breakthrough is followed by another, which is great! These methods are well suited to interpreting big data, and as their capabilities increase, potential threats become more concrete. Very frequently, for instance, I see pictures of children posted on social media platforms or in non-encrypted chatrooms. As a result, children growing up now probably have a perfect face recognition descriptor trained on them without ever having touched a digital device themselves and without having had any choice. In theory, this could be used to identify billions of people in the vast amounts of images uploaded in the future, and also in video data such as surveillance camera footage.

Further, I think there is a danger of becoming slaves to our data. People love ratings. We are not only rating restaurants and hotels, but also doctors and teachers. Why not let big data do the job of rating for us? The more data, the more accurate the rating. But what if you grew up in the wrong neighbourhood, had criminal friends and there are pictures showing you drunk? You might not get a job, as you are rated as potentially dangerous – this is related to the car insurance example above. In such a future, you would make every decision based on how nice your digital trail would look. Freedom? I don’t think so.

For those who have concerns about the technological singularity, please include big data in your nightmares. A lot of bright minds are discussing the danger of machines becoming more intelligent than us and the related threats. I believe this threat is amplified by big data. Moreover, with the emerging Internet of Things, the amount of data we produce will increase sharply in the near future. This is gold for data-driven companies, and big players are already buying their way in, whilst security agencies are not unaware of upcoming ways to spy. Bruce Schneier just published an article sharing similar concerns, but focusing on the Internet of Things.

Aren’t these concerns rather theoretical?

I agree that most of my concerns relate to potential threats. Yet, the world is changing fast and I am pretty sure that due to the Internet of Things, robotics and AI it will look as different in 60 years as it did 60 years ago. There will be awesome new possibilities, but we should not be ignorant of the dangers involved. Instead, we should try to find secure solutions as soon as possible. If it were North Korea we were talking about, everybody would understand that big data might be a bad idea – it’s always the foreign that is dangerous. But given that resources are becoming scarce, the climate is changing and conflicts are out of control, it might take less than we think to end up with suspicious governments fueled by people’s fear.

What can we do?

Everybody can do their bit. Here are a few things I consider important:

Awareness: Start having discussions on this topic, raise your voice. Without awareness there is little chance of being represented by politicians. Or maybe you are even politically active yourself.
Education, education, education! I genuinely believe that many problems would not exist if we had a more comprehensive education system. Douglas C. Engelbart once said something very true:

“The key thing about all the world’s big problems is that they have to be dealt with collectively. If we don’t get collectively smarter, we’re doomed.”

When only 5% are aware of big data and even fewer people understand the implications, we have a problem. Education is the foundation for both meaningful discussion and awareness.
Decentralization: I understand that offering a variety of services brings important synergy effects and that it is very difficult to find an appropriate amount of regulation. And, maybe most important from a user’s perspective: having as much data as possible with a single provider makes things easy. Yet, distributing data over many service providers clearly increases protection from exploitation. Sometimes, paying a little money for a service guarantees encryption, and you switch from being a product to using a product.
Self-hosting: Even better than distributing your data would be to host as many services as possible yourself or to have them hosted by a person you trust. Perhaps you know somebody with server administration expertise who can help you out, for instance by setting up a family cloud.
Open-Source: With open-source software, it is not you who is transparent but the software you use, which is a major advantage. Judging by the user agreements of some closed-source products, you can’t really know what’s happening with the data on your own devices. The code of popular open-source projects, on the other hand, is inspected by many independent people, which builds trust. Have a look at prism-break; there might be open-source products that suit you well.
Security audits: For some products, especially commercial ones, there are understandable reasons for not publishing the source code. In this case, the company should at least permit security audits to provide some level of transparency.
No backdoors! The discussion about forcing companies to implement backdoors in products reappears again and again. This is snake oil. First, it leads to one entity – most likely some security agency, which is typically only superficially controlled by democratic institutions – having access to all your data. Second, backdoors themselves pose a security risk. How could you know that they are not used for industrial espionage or by hackers? Choose your products wisely. By preferring products of companies that care about data privacy, you put pressure on those that treat your data irresponsibly. Some vendors, for instance, have a record of including backdoors, which should not be tolerated by customers.
Encryption: Last but not least, encrypt your data. It will be much harder to exploit your data if it is encrypted; maybe even impossible. I can’t see any reason why you would choose an unencrypted messenger over an encrypted one. Modern ones are easy to use, and if none of your friends uses one yet, you could at least install both. I am sure someone will follow your example. You wouldn’t simply share your house key with strangers or allow them to listen to your phone calls – why allow them to read your digital communication?

Conclusion

This article covered my big data and data-privacy concerns at the same time. Synergy. There definitely are some potential threats and I hope that I was able to raise some awareness with this post. I am not trying to say that you should delete all your accounts with service providers, but you should know which data is generated, what can be deduced from it and how it could be used in the future.

I am far from the only person concerned about data privacy. Robin Doherty wrote a nice article about the problems with the “I have nothing to hide” argument, and Philip E. Agre elaborates on pro-privacy arguments. Also, John Oliver dedicated an episode of his famous Last Week Tonight show to Government Surveillance. Furthermore, Reporters Without Borders and The Guardian provide special pages covering digital surveillance.

Let’s bring this article to a close with the words of Edward Snowden:

I don’t want to live in a world where everything that I say, everything I do, everyone I talk to, every expression of creativity or love or friendship is recorded.