That anonymizing large data sets is extremely hard should be widely known by now: a number of researchers have already de-anonymized a variety of metadata sets and published the results of their investigations.
Still, more evidence is welcome, as the question of whether it is at all possible to make data truly anonymous while still keeping it useful is raised again and again.
The latest example has been brought to us by a team of researchers led by Yves-Alexandre de Montjoye, a graduate student at MIT’s Media Lab.
They analyzed 3 months of credit card records (stripped of names and account numbers) for 1.1 million people, and concluded that they could uniquely reidentify 90% of individuals if they also had four pieces of information that showed their movements on particular days – the type of information that is often easily deducible from posts made on Instagram, Facebook, Twitter and other social networks.
“Knowing the price of a transaction increases the risk of reidentification by 22%, on average,” they also found. Women and high-income individuals are more reidentifiable in credit card metadata than men and low-income people, respectively.
The researchers found credit card records “to be as reidentifiable as mobile phone data and their unicity to be robust to coarsening or noise.” They also noted: “Like credit card and mobile phone metadata, Web browsing or transportation data sets are generated as side effects of human interaction with technology, are subjected to the same idiosyncrasies of human behavior, and are also sparse and high-dimensional (for example, in the number of Web sites one can visit or the number of possible entry-exit combinations of metro stations). This means that these data can probably be relatively easily reidentified if released in a simply anonymized form and that they can probably not be anonymized by simply coarsening of the data.”
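To make the idea of unicity more concrete, here is a minimal sketch in Python of how such a measure might be estimated on a toy data set. It is not the researchers' code: the function names `unicity` and `coarsen`, the `(day, shop)` record format and the bucket sizes are all assumptions made for illustration. Each person's trace is a set of (day, shop) points, and a person counts as unique if nobody else's trace contains the same k known points.

```python
import random


def unicity(traces, k, seed=0):
    """Estimate the share of users singled out by k known points.

    traces maps a user id to a set of (day, shop) tuples (a hypothetical
    format). For each user with at least k points, k of their points are
    drawn at random, and we count how many traces contain all of them.
    Users with fewer than k points are skipped.
    """
    rng = random.Random(seed)
    eligible = [u for u, pts in traces.items() if len(pts) >= k]
    if not eligible:
        return 0.0
    unique = 0
    for user in eligible:
        known = set(rng.sample(sorted(traces[user]), k))
        matches = sum(1 for pts in traces.values() if known <= pts)
        if matches == 1:  # only the user's own trace matches
            unique += 1
    return unique / len(eligible)


def coarsen(traces, day_bucket, shop_bucket):
    """Lower the resolution by grouping days and shop ids into buckets."""
    return {
        user: {(day // day_bucket, shop // shop_bucket) for day, shop in pts}
        for user, pts in traces.items()
    }


# Toy data, invented for this example: three people, two purchases each.
toy = {
    "a": {(1, 10), (2, 20)},
    "b": {(1, 10), (2, 21)},
    "c": {(5, 30), (6, 31)},
}

print(unicity(toy, k=2))                   # every trace is distinct
print(unicity(coarsen(toy, 1, 10), k=2))   # "a" and "b" now collide
```

In this toy case, grouping shops into buckets of ten makes two of the three traces indistinguishable. On real data sets, which are sparse and high-dimensional, the researchers report that coarsening helps far less than one might hope.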
It seems that most data can be “personal” if combined with the right amount of other relevant data. And, as we know, we live in a world where data about us and our activities and interests is constantly collected both online and offline.
“From a technical perspective, our results emphasize the need to move, when possible, to more advanced and probably interactive individual (33) or group (34) privacy-conscientious technologies, as well as the need for more research in computational privacy. From a policy perspective, our findings highlight the need to reform our data protection mechanisms beyond PII and anonymity and toward a more quantitative assessment of the likelihood of reidentification,” the researchers concluded. “Finding the right balance between privacy and utility is absolutely crucial to realizing the great potential of metadata.”