Anonymization of data isn’t enough to guarantee privacy
As COVID-19 spreads to 50 states this month, the U.S. government is in talks with tech companies including Google and Facebook about possibly using location data from Americans’ cellphones to track the spread of the coronavirus, The Washington Post has reported.
Public health officials are exploring whether the data could help them understand how the virus is spreading, which hospitals are overwhelmed, and whether Americans are practicing appropriate social distancing, according to the Post report.
If the plan goes into effect, officials said, the data set would be anonymized by removing personally identifying information, such as names and individuals’ locations. Similar protections guard other data sets, like census results and medical and school records. They’re also used to protect information that companies collect, like browsing histories or movie preferences.
Anonymized data sets can have important applications. Banks can use big data to train algorithms to spot fraudulent activity, for example, or social scientists can use census data to measure health outcomes in specific geographic areas or among socioeconomic groups.
But anonymization isn’t enough to guarantee privacy. Decades of research show that large data sets can often be deanonymized and used to reveal sensitive information about individual people.
In 1996, William Weld, then governor of Massachusetts, collapsed while receiving an honorary degree at Bentley University, near Boston. Weld was taken to the hospital, where, as for any other patient, hospital staff created medical records that detailed what kind of care he had received, what tests were run, his diagnosis, and his prescription. Those records were private — at least they were supposed to be.
At the time, state employees in Massachusetts, including Weld, got health insurance through the Group Insurance Commission (GIC), which decided it would sell those records to researchers. GIC removed identifiers such as patients’ names and Social Security numbers to allow researchers to benefit from the data.
But Latanya Sweeney, then a graduate student in computer science at MIT, figured out that the data wasn’t really so private. Sweeney purchased voter registration records, which include voters’ names, sex, date of birth, and zip code, for $20. When she matched up the voter records with the health records, she found six possible patient records that could belong to Weld. Only three of those belonged to men, and only one shared Weld’s zip code.
Sweeney exposed a fundamental flaw in anonymous data: It isn’t really anonymous. Even if you remove direct identifiers like someone’s name, address, or Social Security number, when you combine the information in one data set with information in another, you can often find people and learn a lot about them.
Sweeney found that 87 percent of the U.S. population could be identified by just three data points: zip code, date of birth, and gender.
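To see why three ordinary fields can single someone out, it helps to count how many records share each combination of zip code, date of birth, and gender. The following is a minimal Python sketch, not Sweeney's method or data: the records are invented for illustration, and the function name `unique_fraction` is hypothetical.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, date_of_birth, gender). Not real data.
records = [
    ("02138", "1950-01-15", "M"),
    ("02138", "1950-01-15", "F"),
    ("02139", "1962-03-14", "F"),
    ("02139", "1962-03-14", "F"),
    ("02140", "1980-11-02", "M"),
]

def unique_fraction(rows):
    """Fraction of rows whose combination of fields appears exactly once."""
    if not rows:
        return 0.0
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)

fraction = unique_fraction(records)
print(f"{fraction:.0%} of records are unique on (zip, birth date, gender)")
```

Any record counted as unique here could, in principle, be tied to exactly one person by anyone holding a second data set with the same three fields plus a name.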
The weaknesses of anonymization have been borne out time and time again. In 2006, Netflix released a data set containing the anonymous movie ratings of about 500,000 subscribers, and researchers from the University of Texas were able to re-identify reviewers in it by using information from IMDb.com.
In 2013, researchers in Europe studied location data from 1.5 million people and figured out that the data was so specific to individual habits that they could identify 95 percent of the people with only four location data points.
But despite warnings from academics, little has been done to ensure data privacy or protect against deanonymization. If anything, the problem has only gotten worse since Sweeney’s first experiment in 1996.
Deanonymization happens when a person puts together multiple data sets and starts to join pieces of information together. Like a trail of bread crumbs, each new piece of information leads the researcher further down the path until the anonymous user is eventually identified.
Sweeney’s unmasking of William Weld was fairly low-tech. She merged two data sets with similar columns, just as you might in Excel. There’s not much fancy computing necessary to do that, but the results can be staggering. In 2013, Sweeney spent $50 to purchase patient health data from the state of Washington and matched those records with news reports that included the word “hospitalized.” Those two sets of data allowed her to identify 43 percent of the patients.
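A merge of this kind can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: both tables are invented, the column names (`zip`, `dob`, `sex`) and the function `link` are hypothetical, and a match is claimed only when exactly one voter shares all three fields with a health record.

```python
# Hypothetical "anonymized" health records: names removed, other fields kept.
health_records = [
    {"zip": "02138", "dob": "1950-01-15", "sex": "M", "diagnosis": "condition A"},
    {"zip": "02139", "dob": "1962-03-14", "sex": "F", "diagnosis": "condition B"},
]

# Hypothetical public voter roll: names included.
voter_roll = [
    {"name": "Pat Example", "zip": "02138", "dob": "1950-01-15", "sex": "M"},
    {"name": "Sam Sample", "zip": "02140", "dob": "1971-09-30", "sex": "F"},
]

KEYS = ("zip", "dob", "sex")  # columns the two tables have in common

def link(health, voters, keys=KEYS):
    """Join the two tables on the shared columns; keep unambiguous matches."""
    index = {}
    for voter in voters:
        index.setdefault(tuple(voter[k] for k in keys), []).append(voter["name"])
    matches = []
    for record in health:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # exactly one voter fits this health record
            matches.append((names[0], record["diagnosis"]))
    return matches

reidentified = link(health_records, voter_roll)
print(reidentified)
```

In this toy example, one health record lines up with one named voter, so the "anonymous" diagnosis now has a name attached. The second record stays unlinked because no voter shares all three fields with it.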
Other types of re-identification efforts are more complicated. In the Netflix example, researchers created similarity scores for each piece of data. If the points in two data sets had a high enough score, the researchers would call it a match. That system identified 80 percent of Netflix users in the data set. The more data sets you have, the more you can refine that score until you are nearly certain that you’ve identified a specific person, says Nitin Kohli, a graduate student at UC Berkeley who studies data privacy and security. Kohli describes this as a “snowballing effect.” The more data sets you can amass, the more you can learn about someone.
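The scoring idea can be illustrated with a simplified Python sketch. It is not the algorithm the Netflix researchers used: the scoring rule, the one-star tolerance, the 0.75 threshold, the function names, and all of the ratings below are assumptions made up for illustration.

```python
def similarity(anon_ratings, public_ratings, tolerance=1):
    """Share of the public profile's movies that the anonymized record
    rated within `tolerance` stars of the public rating."""
    if not public_ratings:
        return 0.0
    hits = sum(
        1
        for movie, stars in public_ratings.items()
        if movie in anon_ratings and abs(anon_ratings[movie] - stars) <= tolerance
    )
    return hits / len(public_ratings)

def best_match(anon_users, public_ratings, threshold=0.75):
    """Return (user_id, score) for the top-scoring anonymized record,
    or (None, score) if even the top score is below the threshold."""
    scored = {uid: similarity(r, public_ratings) for uid, r in anon_users.items()}
    if not scored:
        return (None, 0.0)
    uid = max(scored, key=scored.get)
    return (uid, scored[uid]) if scored[uid] >= threshold else (None, scored[uid])

# Hypothetical anonymized ratings, keyed by an opaque user ID.
anon_users = {
    "user_17": {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1},
    "user_42": {"Movie A": 3, "Movie C": 1, "Movie E": 5},
}
# Hypothetical ratings one named person posted publicly.
public_profile = {"Movie A": 5, "Movie B": 3, "Movie C": 4, "Movie D": 1}

match = best_match(anon_users, public_profile)
print(match)
```

Each additional data set adds more fields to compare, which pushes the scores of true matches up and those of coincidental matches down.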
Deanonymized health data could be used by insurers to discriminate against patients. Anonymized web browsing data has been combined with publicly available information from Twitter to re-identify who did which searches. Location data could be used to track people’s movements, monitor where they pray, who they see, or whether they’re involved in political groups.
In his paper on deanonymization, “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization,” Paul Ohm writes, “Data can either be useful or perfectly anonymous but never both.” Ohm, who is also a member of The Markup’s board of directors, points out that identifying Netflix users may not be intrinsically dangerous but that every piece of identified data adds up to something more dangerous: “Our enemies will find it easier to connect us to facts that they can use to blackmail, harass, defame, frame, or discriminate against us.”
Deanonymization is possible because of how connected all our data is. When one family installs a doorbell with a camera, they allow data to be collected on their neighbors and friends. If you put your DNA into a genetic database, that has implications for your entire family. Even if your parents or siblings never opted into that database, their information is still exposed.
But the problem is much larger than any individual choices users can make. It’s not reasonable to expect people to stop using search engines, credit cards, or cellphones.
Privacy researchers argue that we need structural protections.
One option that’s currently being tested by the Census Bureau is called differential privacy. There is always randomness in statistical sampling. Pollsters, for example, can’t go out and interview every single voter. Instead, they pick a representative sample of the electorate and extrapolate from their answers what people in the rest of the country are thinking. They control for that randomness by including a standard error rate in their calculations.
Differential privacy protects individuals in the data set by intentionally introducing mathematical randomness, also called noise, into the data set. The amount of noise can be shared publicly, just like an error rate, but no one can know which part of any published statistic is noise and which part reflects real people. This solution gives researchers access to the database, but it also protects the privacy of the individuals in the set.
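One common textbook way to add such noise is the Laplace mechanism, sketched below in Python for a single count. This is an assumption chosen for illustration, not a description of the Census Bureau's system: the privacy parameter `epsilon`, the true count of 120, and the function names are all hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise calibrated to the privacy parameter.

    Adding or removing one person changes a count by at most 1, so the
    noise scale is 1 / epsilon. Smaller epsilon means more noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    sensitivity = 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
true_count = 120         # hypothetical: residents of one block with some trait

releases = [noisy_count(true_count, epsilon=0.5, rng=rng) for _ in range(10_000)]
average = sum(releases) / len(releases)
print(f"one release: {releases[0]:.1f}, average of many: {average:.1f}")
```

Any single published number is off by an unknown amount, which is what shields individuals, while the noise averages out to zero, which is why aggregate analysis still works. The scale of the noise (here `1 / epsilon`) is the part that can be disclosed publicly.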
The data may not be as pure as it would have been, but no data set is absolutely perfect. There’s bias in how data gets collected: what information is valuable, who is represented, and what kinds of questions are asked. Adding a bias for privacy, especially when privacy is so easily eroded, might not be the worst solution.
Originally published as "When Is Anonymous Not Really Anonymous?" with the Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.