Posts tagged ‘de-anonymization’

One more re-identification demonstration, and then I’m out

What should we do about re-identification? Back when I started this blog in grad school seven years ago, I subtitled it “The end of anonymous data and what to do about it,” anticipating that I’d work on re-identification demonstrations as well as technical and policy solutions. As it turns out, I’ve looked at the former much more often than the latter. That said, my recent paper A Precautionary Approach to Big Data Privacy with Joanna Huey and Ed Felten tackles the “what to do about it” question head-on. We present a comprehensive set of recommendations for policy makers and practitioners.

One more re-identification demonstration, and then I’m out. Overall, I’ve moved on in terms of my research interests to other topics like web privacy and cryptocurrencies. That said, there’s one fairly significant re-identification demonstration I hope to do some time this year. This is something I started in grad school, obtained encouraging preliminary results on, and then put on the back burner. Stay tuned.

Machine learning and re-identification. I’ve argued that the algorithms used in re-identification turn up everywhere in computer science. I’m still interested in these algorithms from this broader perspective. My recent collaboration on de-anonymizing programmers using coding style is a good example. It uses more sophisticated machine learning than most of my earlier work on re-identification, and the potential impact is more in forensics than in privacy.

Privacy and ethical issues in big data. There’s a new set of thorny challenges in big data — privacy-violating inferences, fairness of machine learning, and ethics in general. I’m collaborating with technology ethics scholar Solon Barocas on these topics. Here’s an abstract we wrote recently, just to give you a flavor of what we’re doing:

How to do machine learning ethically

Every now and then, a story about inference goes viral. You may remember the one about Target advertising to customers who were determined to be pregnant based on their shopping patterns. The public reacts by showing deep discomfort about the power of inference and says it’s a violation of privacy. On the other hand, the company in question protests that there was no wrongdoing — after all, they had only collected innocuous information on customers’ purchases and hadn’t revealed that data to anyone else.

This common pattern reveals a deep disconnect between what people seem to care about when they cry privacy foul and the way the protection of privacy is currently operationalized. The idea that companies shouldn’t make inferences based on data they’ve legally and ethically collected might be disturbing and confusing to a data scientist.

And yet, we argue that doing machine learning ethically means accepting and adhering to boundaries on what’s OK to infer or predict about people, as well as how learning algorithms should be designed. We outline several categories of inference that run afoul of privacy norms. Finally, we explain why ethical considerations sometimes need to be built in at the algorithmic level, rather than being left to whoever is deploying the system. While we identify a number of technical challenges that we don’t quite know how to solve yet, we also provide some guidance that will help practitioners avoid these hazards.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

March 23, 2015 at 8:20 am

Good and bad reasons for anonymizing data

Ed Felten and I recently wrote a response to a poorly reasoned defense of data anonymization. This doesn’t mean, however, that there’s never a place for anonymization. Here’s my personal view on some good and bad reasons for anonymizing data before sharing it.

Good: We’re using anonymization to keep honest people honest. We’re only providing the data to insiders (employees) or semi-insiders (research collaborators), and we want to help them resist the temptation to peep.

Probably good: We’re sharing data only with a limited set of partners. These partners have a reputation to protect; they have also signed legal agreements that specify acceptable uses, retention periods, and audits.

Possibly good: We de-identified the data at a big cost in utility — for example, by making high-dimensional data low-dimensional via “vertical partitioning” — but it still enables some useful data analysis. (There are significant unexplored research questions here, and technically sound privacy guarantees may be possible.)

Reasonable: The data needed to be released no matter what; techniques like differential privacy didn’t produce useful results on our dataset. We released de-identified data and decided to hope for the best.

Reasonable: The auxiliary data needed for de-anonymization doesn’t currently exist publicly and/or on a large scale. We’re acting on the assumption that it won’t materialize in a relevant time-frame and are willing to accept the risk that we’re wrong.

Ethically dubious: The privacy harm to individuals is outweighed by the greater good to society. Related: de-anonymization is not as bad as many other privacy risks that consumers face.

Sometimes plausible: The marginal benefit of de-anonymization (compared to simply using the auxiliary dataset for marketing or whatever purpose) is so low that even the small cost of skilled effort is a sufficient deterrent. Adversaries will prefer other means of acquiring equivalent data — through purchase, if they are lawful, or hacking, if they’re not.[*]

Bad: Since there aren’t many reports of de-anonymization except research demonstrations, it’s safe to assume it isn’t happening.

It’s surprising how often this argument is advanced considering that it’s a complete non-sequitur: malfeasors who de-anonymize are obviously not going to brag about it. The next argument is a self-interested version that takes this fact into account.

Dangerously rational: There won’t be a PR fallout from releasing anonymized data because researchers no longer have the incentive for de-anonymization demonstrations, whereas if malfeasors do it they won’t publicize it (elaborated here).

Bad: The expertise needed for de-anonymization is such a rare skill that it’s not a serious threat (addressed here).

Bad: We simulated some attacks and estimated that only 1% of records are at risk of being de-anonymized. (Completely unscientific; addressed here.)

Qualitative risk assessment is valuable; quantitative methods can be a useful heuristic to compare different choices of anonymization parameters if one has already decided to release anonymized data for other reasons, but can’t be used as a justification of the decision.

[*] This is my restatement of one of Yakowitz’s arguments in Tragedy of the Data Commons.


July 9, 2014 at 8:05 am

Reidentification as Basic Science

This essay originally appeared on the Bill of Health blog as part of a conversation on the law, ethics and science of reidentification demonstrations.

What really drives reidentification researchers? Do we publish these demonstrations to alert individuals to privacy risks? To shame companies? For personal glory? If our goal is to improve privacy, are we doing it in the best way possible?

In this post I’d like to discuss my own motivations as a reidentification researcher, without speaking for anyone else. Certainly I care about improving privacy outcomes, in the sense of making sure that companies, governments and others don’t get away with mathematically unsound promises about the privacy of consumers’ data. But there is a quite different goal I care about at least as much: reidentification algorithms. These algorithms are my primary object of study, and so I see reidentification research partly as basic science.

Let me elaborate on why reidentification algorithms are interesting and important. First, they yield fundamental insights about people — our interests, preferences, behavior, and connections — as reflected in the datasets collected about us. Second, as is the case with most basic science, these algorithms turn out to have a variety of applications other than reidentification, both for good and bad. Let us consider some of these.

First and foremost, reidentification algorithms are directly applicable in digital forensics and intelligence. Analyzing the structure of a terrorist network (say, based on surveillance of movement patterns and meetings) to assign identities to nodes is technically very similar to social network deanonymization. A reidentification researcher that I know who is a U.S. citizen tells me he has been contacted more than once by intelligence agencies to apply his expertise to their data.

Homer et al’s work on identifying individuals in DNA mixtures is another great example of how forensics algorithms are inextricably linked to privacy-infringing applications. In addition to DNA and network structure, writing style and location trails are other attributes that have been utilized both in reidentification and forensics.

It is not a coincidence that the reidentification literature often uses the word “fingerprint” — this body of work has generalized the notion of a fingerprint beyond physical attributes to a variety of other characteristics. Just like physical fingerprints, there are good uses and bad, but regardless, finding generalized fingerprints is a contribution to human knowledge. A fundamental question is how much information (i.e., uniqueness) there is in each of these types of attributes or characteristics. Reidentification research is gradually helping answer this question, but much remains unknown.

It is not only people that are fingerprintable; so are various physical devices. A wonderful set of (unrelated) research papers has shown that many types of devices, objects, and software systems, even supposedly identical ones, have unique fingerprints: blank paper, digital cameras, RFID tags, scanners and printers, and web browsers, among others. The techniques are similar to reidentification algorithms, and once again straddle security-enhancing and privacy-infringing applications.

Even more generally, reidentification algorithms are classification algorithms for the case when the number of classes is very large. Classification algorithms categorize observed data into one of several classes, i.e., categories. They are at the core of machine learning, but typical machine-learning applications rarely need to consider more than several hundred classes. Thus, reidentification science is helping develop our knowledge of how best to extend classification algorithms as the number of classes increases.

Moving on, research on reidentification and other types of “leakage” of information reveals a problem with the way data-mining contests are run. Most commonly, some elements of a dataset are withheld, and contest participants are required to predict these unknown values. Reidentification allows contestants to bypass the prediction process altogether by simply “looking up” the true values in the original data! For an example and more elaborate explanation, see this post on how my collaborators and I won the Kaggle social network challenge. Demonstrations of information leakage have spurred research on how to design contests without such flaws.

If reidentification can cause leakage and make things messy, it can also clean things up. In a general form, reidentification is about connecting common entities across two different databases. Quite often in real-world datasets there is no unique identifier, or it is missing or erroneous. Just about every programmer who does interesting things with data has dealt with this problem at some point. In the research world, William Winkler of the U.S. Census Bureau has authored a survey of “record linkage”, covering well over a hundred papers. I’m not saying that the high-powered machinery of reidentification is necessary here, but the principles are certainly useful.

In my brief life as an entrepreneur, I utilized just such an algorithm for the back-end of the web application that my co-founders and I built. The task in question was to link a (musical) artist profile from last.fm to the corresponding Wikipedia article based on discography information (linking by name alone fails in any number of interesting ways.) On another occasion, for the theory of computing blog aggregator that I run, I wrote code to link authors of papers uploaded to arXiv to their DBLP profiles based on the list of coauthors.
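Here is a minimal sketch of that kind of linkage, assuming each record is reduced to a set of attribute values (coauthor names, album titles) and that the best match is the one with the highest Jaccard overlap. The function names, the 0.5 threshold, and the toy records are all hypothetical; this is not the code behind either project mentioned above.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |a & b| / |a | b| (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_records(left, right, threshold=0.5):
    """Link each left record to the right record with the largest overlap.

    left, right: dicts mapping a record id to a collection of attribute values.
    Returns {left_id: right_id} for best matches scoring at least `threshold`.
    """
    links = {}
    for left_id, left_attrs in left.items():
        best_id, best_score = None, 0.0
        for right_id, right_attrs in right.items():
            score = jaccard(left_attrs, right_attrs)
            if score > best_score:
                best_id, best_score = right_id, score
        if best_id is not None and best_score >= threshold:
            links[left_id] = best_id
    return links


# Hypothetical toy data: two authors with the same surname, told apart
# only by who they have written papers with.
arxiv_authors = {
    "arxiv:J. Smith (1)": {"A. Jones", "B. Lee", "C. Wu"},
    "arxiv:J. Smith (2)": {"D. Patel", "E. Kim"},
}
dblp_profiles = {
    "dblp:John Smith 0001": {"D. Patel", "E. Kim", "F. Chen"},
    "dblp:Jane Smith": {"A. Jones", "B. Lee", "C. Wu", "G. Roy"},
}

links = link_records(arxiv_authors, dblp_profiles)
print(links)
```

The threshold is what keeps a record unlinked when nothing overlaps enough, which matters in practice because many records have no counterpart in the other database.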

There is more, but I’ll stop here. The point is that these algorithms are everywhere.

If the algorithms are the key, why perform demonstrations of privacy failures? To put it simply, algorithms can’t be studied in a vacuum; we need concrete cases to test how well they work. But it’s more complicated than that. First, as I mentioned earlier, keeping the privacy conversation intellectually honest is one of my motivations, and these demonstrations help. Second, in the majority of cases, my collaborators and I have chosen to examine pairs of datasets that were already public, and so our work did not uncover the identities of previously anonymous subjects, but merely helped to establish that this could happen in other instances of “anonymized” data sharing.

Third, and I consider this quite unfortunate, reidentification results are taken much more seriously if researchers do uncover identities, which naturally gives us an incentive to do so. I’ve seen this in my own work — the Netflix paper is the most straightforward and arguably the least scientifically interesting reidentification result that I’ve co-authored, and yet it received by far the most attention, all because it was carried out on an actual dataset published by a company rather than demonstrated hypothetically.

My primary focus on the fundamental research aspect of reidentification guides my work in an important way. There are many, many potential targets for reidentification — despite all the research, data holders often (rationally) act like nothing has changed and continue to make data releases with “PII” removed. So which dataset should I pick to work on?

Focusing on the algorithms makes it a lot easier. One of my criteria for picking a reidentification question to work on is that it must lead to a new algorithm. I’m not at all saying that all reidentification researchers should do this, but for me it’s a good way to maximize the impact I can hope for from my research, while minimizing controversies about the privacy of the subjects in the datasets I study.

I hope this post has given you some insight into my goals, motivations, and research outputs, and an appreciation of the fact that there is more to reidentification algorithms than their application to breaching privacy. It will be useful to keep this fact in the back of our minds as we continue the conversation on the ethics of reidentification.

Thanks to Vitaly Shmatikov for reviewing a draft.


May 27, 2013 at 6:16 am

New Developments in Deanonymization

This post is a roundup of developments in deanonymization in the last few months. Let’s start with two stories relating to how a malicious website can silently discover the identity of a visitor, which is an insidious type of privacy breach that I’ve written about quite a bit (1, 2, 3, 4, 5, 6).

Firefox bug exposed your identity. The first is a vulnerability resulting from a Firefox bug in the implementation of functions like exec and test. The bug allows a website to learn the URL of an embedded iframe from some other domain. How can this lead to uncovering the visitor’s identity? Because twitter.com/lists redirects to twitter.com/<username>/lists. This allows a malicious website to open a hidden iframe pointing to twitter.com/lists, query the URL after redirection, and learn the visitor’s Twitter handle (if they are logged in). [1,2]

This is very similar to a previous bug in Firefox that led to the same type of vulnerability. The URL redirect that was exploited there was google.com/profiles/me  → user-specific URL. It would be interesting to find and document all such generic-URL → user-specific-URL redirects in major websites. I have a feeling this won’t be the last time such redirection will be exploited.
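To see why a leaked post-redirect URL is as good as a leaked identity, here is a small sketch that recovers the account name from a URL of the form described above. It assumes the redirect target has exactly the shape /<username>/lists; the function name and the example handle are made up for illustration.

```python
from urllib.parse import urlparse


def handle_from_redirect(final_url, suffix="lists"):
    """Return the username embedded in a user-specific URL, or None.

    Assumes the path has exactly two segments, /<username>/<suffix>.
    A generic URL such as /lists (no redirect happened) yields None.
    """
    segments = [s for s in urlparse(final_url).path.split("/") if s]
    if len(segments) == 2 and segments[1] == suffix:
        return segments[0]
    return None


print(handle_from_redirect("https://twitter.com/example_user/lists"))
print(handle_from_redirect("https://twitter.com/lists"))
```

Nothing clever is needed once the URL is readable across origins, which is why the same-origin restriction on reading a frame’s URL is the only real barrier.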

Visitor deanonymization in the wild. The second story is an example of visitor deanonymization happening in the wild. It appears that the technique utilizes a tracking cookie from a third-party domain to which the visitor previously gave their email and other information; in other words, #3 in my five-fold categorization of ways in which identity can be attached to browsing logs.

I don’t consider this instance to be particularly significant — I’m sure there are other implementations in the wild — and it’s not technically novel, but this is the first time as far as I know that it’s gotten significant attention from the public, even if only in tech circles. I see this as a first step in a feedback loop of changing expectations about online anonymity emboldening more sites to deanonymize visitors, thus further lowering the expectation of privacy.

Deanonymization of mobility traces. Let’s move on to the more traditional scenario of deanonymization of a dataset by combining it with an auxiliary, public dataset which has users’ identities. Srivatsa and Hicks have a new paper with demonstrations of deanonymization of mobility traces, i.e., logs of users’ locations over time. They use public social networks as auxiliary information, based on the insight that pairs of people who are friends are more likely to meet with each other physically. The deanonymization of Bluetooth contact traces of attendees of a conference based on their DBLP co-authorship graph is cute.

This paper adds to the growing body of evidence that anonymization of location traces can be reversed, even if the data is obfuscated by introducing errors (noise).

So many datasets, so little time. Speaking of mobility traces, Jason Baldridge points me to a dataset containing mobility traces (among other things) of 5 million “anonymous” users in the Ivory Coast recently released by telecom operator Orange. A 250-word research proposal is required to get access to the data, which is much better from a privacy perspective than a 1-click download. It introduces some accountability without making it too onerous to get the data.

In general, the incentive for computer science researchers to perform practical demonstrations of deanonymization has diminished drastically. Our goal has always been to showcase new techniques and improve our understanding of what’s possible, and not to name and shame. Even if the Orange dataset were more easily downloadable, I would think that the incentive for deanonymization researchers would be low, now that the Srivatsa and Hicks paper exists and we know for sure that mobility traces can be deanonymized, even though the experiments in the paper are on a far smaller scale.

Head in the sand: rational?! I gave a talk at a privacy workshop recently taking a look back at how companies have reacted to deanonymization research. My main point was that there’s a split between the take-your-data-and-go-home approach (not releasing data because of privacy concerns) and the head-in-the-sand approach (pretending the problem doesn’t exist). Unfortunately but perhaps unsurprisingly, there has been very little willingness to take a middle ground, engaging with data privacy researchers and trying to adopt technically sophisticated solutions.

Interestingly, head-in-the-sand might be rational from companies’ point of view. On the one hand, researchers don’t have the incentive for deanonymization anymore. On the other hand, if malicious entities do it, naturally they won’t talk about it in public, so there will be no PR fallout. Regulators have not been very aggressive in investigating anonymized data releases in the absence of a public outcry, so that may be a negligible risk.

Some have questioned whether deanonymization in the wild is actually happening. I think it’s a bit silly to assume that it isn’t, given the economic incentives. Of course, I can’t prove this and probably never can. No company doing it will publicly talk about it, and the privacy harms are so indirect that tying them to a specific data release is next to impossible. I can only offer anecdotes to explain my position: I have been approached multiple times by organizations who wanted me to deanonymize a database they’d acquired, and I’ve had friends in different industries mention casually that what they do on a daily basis to combine different databases together is essentially deanonymization.

[1] For a discussion of why a social network profile is essentially equivalent to an identity, see here and the epilog here.
[2] Mozilla pulled Firefox 16 as a result and quickly fixed the bug.


December 17, 2012 at 8:59 am

Is Writing Style Sufficient to Deanonymize Material Posted Online?

I have a new paper appearing at IEEE S&P with Hristo Paskov, Neil Gong, John Bethencourt, Emil Stefanov, Richard Shin and Dawn Song on Internet-scale authorship identification based on stylometry, i.e., analysis of writing style. Stylometric identification exploits the fact that we all have a ‘fingerprint’ based on our stylistic choices and idiosyncrasies with the written word. To quote from my previous post speculating on the possibility of Internet-scale authorship identification:

Consider two words that are nearly interchangeable, say ‘since’ and ‘because’. Different people use the two words in a differing proportion. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.
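The arithmetic behind that quote can be sketched in a few lines. The code below measures one such marker in a text and bounds how much a yes/no marker can reveal. It assumes an idealized model in which a marker splits the population into two groups and markers are independent of each other; real markers are correlated, so the final count is only a lower bound.

```python
import math
import re


def marker_ratio(text, word_a="since", word_b="because"):
    """Share of word_a among all uses of word_a or word_b (None if neither occurs)."""
    words = re.findall(r"[a-z']+", text.lower())
    a, b = words.count(word_a), words.count(word_b)
    return a / (a + b) if (a + b) else None


def binary_entropy(p):
    """Bits revealed by a yes/no marker that a fraction p of people exhibit."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


sample = "Since you asked, I left because it rained and because I was tired."
print(marker_ratio(sample))      # 1 'since' vs 2 'because'

# A marker is most informative when it splits people evenly (1 bit),
# and less so when nearly everyone falls on the same side.
print(binary_entropy(0.5))
print(binary_entropy(0.9))

# Bits needed to single out one author among 100,000, and the minimum
# number of independent 0.5-bit markers that would supply them.
bits_needed = math.log2(100_000)
print(bits_needed, math.ceil(bits_needed / 0.5))
```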

The basic idea that people have distinctive writing styles is very well-known and well-understood, and there is an extremely long line of research on this topic. This research began in modern form in the early 1960s when statisticians Mosteller and Wallace determined the authorship of the disputed Federalist papers, and were featured in TIME magazine. It is never easy to make a significant contribution in a heavily studied area. No surprise, then, that my initial blog post was written about three years ago, and the Stanford-Berkeley collaboration began in earnest over two years ago.

Impact. So what exactly did we achieve? Our research has dramatically increased the number of authors that can be distinguished using writing-style analysis: from about 300 to 100,000. More importantly, the accuracy of our algorithms drops off gently as the number of authors increases, so we can be confident that they will continue to perform well as we scale the problem even further. Our work is therefore the first time that stylometry has been shown to have serious implications for online anonymity.[1]

Anonymity and free speech have been intertwined throughout history. For example, anonymous discourse was essential to the debates that gave birth to the United States Constitution. Yet a right to anonymity is meaningless if an anonymous author’s identity can be unmasked by adversaries. While there have been many attempts to legally force service providers and other intermediaries to reveal the identity of anonymous users, courts have generally upheld the right to anonymity. But what if authors can be identified based on nothing but a comparison of the content they publish to other web content they have previously authored?

Experiments. Our experimental methodology is set up to directly address this question. Our primary data source was the ICWSM 2009 Spinn3r Blog Dataset, a large collection of blog posts made available to researchers by Spinn3r.com, a provider of blog-related commercial data feeds. To test the identifiability of an author, we remove k random posts (typically 3) from the corresponding blog, treat those posts as if they were anonymous, and apply our algorithm to try to determine which blog they came from. In these experiments, the labeled (identified) and unlabeled (anonymous) texts are drawn from the same context. We call this post-to-blog matching.
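The hold-out step of post-to-blog matching can be sketched as follows. This is an illustration of the setup only, assuming blogs are given as a dict from a blog id to a list of posts; the function name, the fixed seed, and the rule for skipping short blogs are assumptions, not details from the paper.

```python
import random


def hold_out_posts(blogs, k=3, seed=0):
    """Split each blog into labeled posts and k held-out 'anonymous' posts.

    blogs: dict mapping blog id -> list of post texts.
    Blogs with k or fewer posts are kept fully labeled and not tested.
    Returns (labeled, anonymous), two dicts keyed by blog id.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    labeled, anonymous = {}, {}
    for blog_id, posts in blogs.items():
        if len(posts) <= k:
            labeled[blog_id] = list(posts)
            continue
        held = set(rng.sample(range(len(posts)), k))
        labeled[blog_id] = [p for i, p in enumerate(posts) if i not in held]
        anonymous[blog_id] = [p for i, p in enumerate(posts) if i in held]
    return labeled, anonymous


# Hypothetical toy corpus.
blogs = {
    "blog_a": [f"a-post-{i}" for i in range(10)],
    "blog_b": [f"b-post-{i}" for i in range(5)],
    "blog_c": ["c-post-0", "c-post-1"],  # too short to test
}
labeled, anonymous = hold_out_posts(blogs, k=3)
print({b: len(p) for b, p in labeled.items()})
print({b: len(p) for b, p in anonymous.items()})
```

The classifier is then trained on `labeled` only, and each entry of `anonymous` is a test case whose correct answer is its own key.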

In some applications of stylometric authorship recognition, the context for the identified and anonymous text might be the same. This was the case in the famous study of the Federalist Papers: each author hid his name from some of his papers, but wrote about the same topic. In the blogging scenario, an author might decide to selectively distribute a few particularly sensitive posts anonymously through a different channel. But in other cases, the unlabeled text might be political speech, whereas the only available labeled text by the same author might be a cooking blog, i.e., the labeled and unlabeled text might come from different contexts. Context encompasses much more than topic: the tone might be formal or informal; the author might be in a different mental state (e.g., more emotional) in one context versus the other, etc.

We feel that it is crucial for authorship recognition techniques to be validated in a cross-context setting. Previous work has fallen short in this regard because of the difficulty of finding a suitable dataset. We were able to obtain about 2,000 pairs (and a few triples, etc.) of blogs, each pair written by the same author, by looking at a dataset of 3.5 million Google profiles and searching for users who listed more than one blog in the ‘websites’ field.[2] We are thankful to Daniele Perito for sharing this dataset. We added these blogs to the Spinn3r blog dataset to bring the total to 100,000. Using this data, we performed experiments as follows: remove one of a pair of blogs written by the same author, and use it as unlabeled text. The goal is to find the other blog written by the same author. We call this blog-to-blog matching. Note that although the number of blog pairs is only a few thousand, we match each anonymous blog against all 99,999 other blogs.

Results. Our baseline result is that in the post-to-blog experiments, the author was correctly identified 20% of the time. This means that when our algorithm uses three anonymously published blog posts to rank the possible authors in descending order of probability, the top guess is correct 20% of the time.

But it gets better from there. In 35% of cases, the correct author is one of the top 20 guesses. Why does this matter? Because in practice, algorithmic analysis probably won’t be the only step in authorship recognition, and will instead be used to produce a shortlist for further investigation. A manual examination may incorporate several characteristics that the automated analysis does not, such as choice of topic (our algorithms are scrupulously “topic-free”). Location is another signal that can be used: for example, if we were trying to identify the author of the once-anonymous blog Washingtonienne we’d know that she almost certainly resides in or around Washington, D.C. Alternately, a powerful adversary such as law enforcement may require Blogger, WordPress, or another popular blog host to reveal the login times of the top suspects, which could be correlated with the timing of posts on the anonymous blog to confirm a match.
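The two numbers above are top-1 and top-20 accuracy. A minimal sketch of how such a figure is computed from ranked guesses, using made-up names and rankings rather than anything from our data:

```python
def top_k_accuracy(rankings, true_authors, k):
    """Fraction of test cases whose true author is among the top k guesses.

    rankings: one list per test case, candidates in descending order of score.
    true_authors: the correct author for each test case, in the same order.
    """
    if len(rankings) != len(true_authors):
        raise ValueError("rankings and true_authors must have the same length")
    hits = sum(
        1 for ranked, truth in zip(rankings, true_authors) if truth in ranked[:k]
    )
    return hits / len(true_authors)


# Hypothetical rankings for four anonymous texts over three candidate authors.
rankings = [
    ["ann", "bob", "cat"],
    ["bob", "ann", "cat"],
    ["cat", "bob", "ann"],
    ["ann", "cat", "bob"],
]
true_authors = ["ann", "ann", "ann", "bob"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(rankings, true_authors, k))
```

Top-k accuracy can only grow with k, so the useful question is how short the shortlist can be made while still containing the true author most of the time.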

We can also improve the accuracy significantly over the baseline of 20% for authors for whom we have more than an average number of labeled or unlabeled blog posts. For example, with 40–50 labeled posts to work with (the average is 20 posts per author), the accuracy goes up to 30–35%.

An important capability is confidence estimation, i.e., modifying the algorithm to also output a score reflecting its degree of confidence in the prediction. We measure the efficacy of confidence estimation via the standard machine-learning metrics of precision and recall. We find that we can improve precision from 20% to over 80% with only a halving of recall. In plain English, what these numbers mean is: the algorithm does not always attempt to identify an author, but when it does, it finds the right author 80% of the time. Overall, it identifies 10% (half of 20%) of authors correctly, i.e., 10,000 out of the 100,000 authors in our dataset. Strong as these numbers are, it is important to keep in mind that in a real-life deanonymization attack on a specific target, it is likely that confidence can be greatly improved through methods discussed above — topic, manual inspection, etc.
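To make the precision and recall trade-off concrete, here is a sketch of a classifier that abstains below a confidence threshold, followed by the arithmetic from the paragraph above. The function name and toy predictions are hypothetical, and the arithmetic uses exactly 80% precision for illustration even though the measured figure is "over 80%".

```python
def precision_recall(predictions, truths, threshold):
    """Precision and recall when guesses below `threshold` are withheld.

    predictions: list of (guessed_author, confidence) pairs.
    truths: the correct author for each prediction, in the same order.
    Precision = correct guesses / guesses made.
    Recall    = correct guesses / all test cases.
    """
    attempted = correct = 0
    for (guess, confidence), truth in zip(predictions, truths):
        if confidence < threshold:
            continue  # abstain: no guess is made for this author
        attempted += 1
        if guess == truth:
            correct += 1
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(truths)
    return precision, recall


# Hypothetical toy predictions.
predictions = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.2), ("e", 0.95)]
truths = ["a", "x", "c", "y", "e"]
print(precision_recall(predictions, truths, threshold=0.0))  # always guess
print(precision_recall(predictions, truths, threshold=0.5))  # guess when confident

# The numbers from the text: 100,000 authors, recall 10%, precision 80%.
total_authors = 100_000
correct = round(total_authors * 0.10)   # authors identified correctly
attempted = round(correct / 0.80)       # guesses the algorithm actually made
print(correct, attempted)
```

In other words, at that operating point the algorithm stays silent on roughly seven out of eight authors and is right about four out of five of the rest.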

We confirmed that our techniques work in a cross-context setting (i.e., blog-to-blog experiments), although the accuracy is lower (~12%). Confidence estimation works really well in this setting as well and boosts accuracy to over 50% with a halving of recall. Finally, we also manually verified that in cross-context matching we find pairs of blogs that are hard for humans to match based on topic or writing style; we describe three such pairs in an appendix to the paper. For detailed graphs as well as a variety of other experimental results, see the paper.

We see our results as establishing early lower bounds on the efficacy of large-scale stylometric authorship recognition. Having cracked the scale barrier, we expect accuracy improvements to come easier in the future. In particular, we report experiments in the paper showing that a combination of two very different classifiers works better than either, but there is a lot more mileage to squeeze from this approach, given that ensembles of classifiers are known to work well for most machine-learning problems. Also, there is much work to be done in terms of analyzing which aspects of writing style are preserved across contexts, and using this understanding to improve accuracy in that setting.

Techniques. Now let’s look in more detail at the techniques I’ve hinted at above. The author identification task proceeds in two steps: feature extraction and classification. In the feature extraction stage, we reduce each blog post to a sequence of about 1,200 numerical features (a “feature vector”) that acts as a fingerprint. These features fall into various lexical and grammatical categories. Two example features: the frequency of uppercase words, the number of words that occur exactly once in the text. While we mostly used the same set of features that the authors of the Writeprints paper did, we also came up with a new set of features that involved analyzing the grammatical parse trees of sentences.
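The two example features can be sketched as below. This is an illustration, not our extraction code: it assumes "uppercase words" means all-caps words longer than one letter, tokenizes with a simple regular expression, and computes only these two of the roughly 1,200 features.

```python
import re
from collections import Counter


def stylometric_features(text):
    """Compute two toy stylometric features for one post.

    uppercase_word_freq: share of words written entirely in capitals
        (single letters such as 'I' are ignored; this is an assumption).
    hapax_count: number of distinct words that occur exactly once,
        compared case-insensitively.
    """
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return {"uppercase_word_freq": 0.0, "hapax_count": 0}
    uppercase = sum(1 for w in words if len(w) > 1 and w.isupper())
    counts = Counter(w.lower() for w in words)
    hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "uppercase_word_freq": uppercase / len(words),
        "hapax_count": hapax,
    }


features = stylometric_features("I REALLY like tea. I like TEA a lot.")
print(features)
```

A full feature vector is just a longer list of numbers of this kind, computed the same way for every post so that posts can be compared position by position.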

An important component of feature extraction is ensuring that our analysis is purely stylistic. We do this in two ways: first, we preprocess the blog posts to filter out signatures, markup, or anything that might not be directly entered by a human. Second, we restrict our features to those that bear little resemblance to the topic of discussion. In particular, our word-based features are limited to stylistic “function words” that we list in an appendix to the paper.
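A sketch of what topic-free word features look like in practice: count only words from a fixed list of function words and ignore everything else. The seven-word list below is a stand-in chosen for illustration; the actual list is the one in the paper’s appendix and is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical, much-shortened function-word list for illustration only.
FUNCTION_WORDS = ("the", "of", "and", "because", "since", "but", "which")


def function_word_vector(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each function word in `text`.

    Words outside the vocabulary (topic words such as 'cat' or 'election')
    still count toward the total length but contribute no feature of their own.
    """
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    return [counts[w] / total if total else 0.0 for w in vocabulary]


vector = function_word_vector("The cat sat because the dog barked")
print(dict(zip(FUNCTION_WORDS, vector)))
```

Swapping "cat" and "dog" for any other nouns leaves the vector unchanged, which is the sense in which such features ignore the topic.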

In the classification stage, we algorithmically “learn” a characterization of each author (from the set of feature vectors corresponding to the posts written by that author). Given a set of feature vectors from an unknown author, we use the learned characterizations to decide which author it most likely corresponds to. For example, viewing each feature vector as a point in a high-dimensional space, the learning algorithm might try to find a “hyperplane” that separates the points corresponding to one author from those of every other author, and the decision algorithm might determine, given a set of hyperplanes corresponding to each known author, which hyperplane best separates the unknown author from the rest.
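The hyperplane picture can be sketched with a one-vs-rest perceptron, the simplest learner that fits it. This is a stand-in, not one of the classifiers from the paper: the data is two-dimensional and made up, the epoch count is an assumption, and the decision rule (highest average score across the unknown author’s vectors) is one reasonable choice among several.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_one_vs_rest(vectors, labels, epochs=200):
    """Learn one separating hyperplane (weights, bias) per author.

    For each author, that author's vectors are the positive class and
    everyone else's are the negative class (perceptron update rule).
    """
    dim = len(vectors[0])
    models = {}
    for author in sorted(set(labels)):
        w, b = [0.0] * dim, 0.0
        for _ in range(epochs):
            for x, label in zip(vectors, labels):
                target = 1 if label == author else -1
                if target * (dot(w, x) + b) <= 0:  # misclassified: nudge the plane
                    w = [wi + target * xi for wi, xi in zip(w, x)]
                    b += target
        models[author] = (w, b)
    return models


def predict_author(models, unknown_vectors):
    """Return the author whose hyperplane scores the unknown vectors highest on average."""
    def average_score(model):
        w, b = model
        return sum(dot(w, x) + b for x in unknown_vectors) / len(unknown_vectors)
    return max(models, key=lambda author: average_score(models[author]))


# Hypothetical 2-D feature vectors, three posts per author.
vectors = [
    [0, 0], [0, 1], [1, 0],   # alice
    [5, 0], [5, 1], [6, 0],   # bob
    [0, 5], [1, 5], [0, 6],   # carol
]
labels = ["alice"] * 3 + ["bob"] * 3 + ["carol"] * 3

models = train_one_vs_rest(vectors, labels)
unknown = [[5.3, 0.3], [5.2, 0.5]]   # two posts by one unknown author
print(predict_author(models, unknown))  # -> bob
```

With 100,000 authors there are 100,000 such hyperplanes to learn and evaluate, which is why the speed of the classifier matters as much as its accuracy.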

We made several innovations that allowed us to achieve the accuracy levels that we did. First, contrary to some previous authors who hypothesized that only relatively straightforward “lazy” classifiers work for this type of problem, we were able to avoid various pitfalls and use more high-powered machinery. Second, we developed new techniques for confidence estimation, including a measure very similar to “eccentricity” used in the Netflix paper. Third, we developed techniques to improve the performance (speed) of our classifiers, detailed in the paper. This is a research contribution by itself, but it also enabled us to rapidly iterate the development of our algorithms and optimize them.
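For readers who haven’t seen it, “eccentricity” in the Netflix paper is the gap between the best and second-best match scores, measured in standard deviations of all the scores. The sketch below shows that definition only; the measure we actually use in this paper is similar but not identical, and the names and the threshold value here are hypothetical.

```python
import statistics

def eccentricity(scores):
    """Gap between the best and second-best score, in units of the std dev."""
    if len(scores) < 2:
        raise ValueError("need at least two candidate scores")
    ordered = sorted(scores, reverse=True)
    sigma = statistics.pstdev(scores)
    if sigma == 0:
        return 0.0  # all candidates tie, so nothing stands out
    return (ordered[0] - ordered[1]) / sigma

def best_match(scores_by_author, threshold=1.5):
    """Return the top-scoring author, or None when the match does not stand out."""
    if eccentricity(list(scores_by_author.values())) < threshold:
        return None  # abstain rather than guess
    return max(scores_by_author, key=scores_by_author.get)
```

Abstaining on low-confidence cases is what makes it possible to trade coverage for precision.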

In an earlier article, I noted that we don’t yet have as rigorous an understanding of deanonymization algorithms as we would like. I see this paper as a significant step in that direction. In my series on fingerprinting, I pointed out that in numerous domains, researchers have considered classification/deanonymization problems with tens of classes, with implications for forensics and security-enhancing applications, but that to explore the privacy-infringing/surveillance applications the methods need to be tweaked to be able to deal with a much larger number of classes. Our work shows how to do that, and we believe that insights from our paper will be generally applicable to numerous problems in the privacy space.

Concluding thoughts. We’ve thrown open the doors for the study of writing-style based deanonymization that can be carried out on an Internet-wide scale, and our research demonstrates that the threat is already real. We believe that our techniques are valuable by themselves as well.

The good news for authors who would like to protect themselves against deanonymization is that manually changing one’s style appears to be enough to throw off these attacks. Developing fully automated methods to hide traces of one’s writing style remains a challenge. For now, few people are aware of the existence of these attacks and defenses; all the sensitive text that has already been anonymously written is also at risk of deanonymization.

[1] A team from Israel has studied authorship recognition with 10,000 authors. While this is interesting and impressive work, and bears some similarities to ours, they do not restrict themselves to stylistic analysis, and therefore the method is comparatively limited in scope. Incidentally, they have been in the news recently for some related work.

[2] Although the fraction of users who listed even a single blog in their Google profile was small, there were more than 2,000 users who listed multiple blogs. We did not use the full number that was available.

To stay on top of future posts, subscribe to the RSS feed or follow me on Google+.

February 20, 2012 at 9:40 am 7 comments

No Two Digital Cameras Are the Same: Fingerprinting Via Sensor Noise

The previous article looked at how pieces of blank paper can be uniquely identified. This article continues the fingerprinting theme to another domain, digital cameras, and ends by speculating on the possibility of applying the technique on an Internet-wide scale.

For various kinds of devices like digital cameras and RFID chips, even supposedly identical units that come out of a manufacturing plant behave slightly differently in characteristic ways, and can therefore be distinguished based on their output or behavior. How could this be? The unifying principle is this:

Microscopic physical irregularities due to natural structure and/or manufacturing defects cause observable, albeit tiny, behavioral differences.

Digital camera identification belongs to a class of techniques that exploits ‘pattern noise’ in the ‘sensor arrays’ that capture images. The same techniques can be used to fingerprint a scanner by analyzing pixel-level patterns in the images scanned by it, but that’ll be the focus of a later article.

A long-exposure dark frame [source]. Three ‘hot pixels’ and some other sensor noise can be seen.

A photo taken in the absence of any light doesn’t look completely black; a variety of factors introduce noise. There is random noise that varies in every image, but there is also ‘pattern noise’ due to inherent structural defects or irregularities in the physical sensor array. The key property of the latter kind of noise is that it manifests the same way in every image taken by the camera.[1] Thus, the total noise vector produced by a camera is not identical between images, nor is it completely independent.

The pixel-level noise components in images taken by the same camera are correlated with each other.

Nevertheless, separating the pattern noise from random noise and the image itself — after all, a good camera will seek to minimize the strength or ‘power’ of the noise in relation to the image — is a very difficult task, and is the primary technical challenge that camera fingerprinting techniques must address.

Security vs. privacy. A quick note about the applications of camera fingerprinting. We saw in the previous article that there are security-enhancing and privacy-infringing applications of document fingerprinting. In fact, this is almost always the case with fingerprinting techniques. [2]

Camera fingerprinting can be used on the one hand for detecting forgeries (e.g., photoshopped images), and to aid criminal investigations by determining who (or rather, which camera) might have taken a picture. On the other hand, it could potentially also be used for unmasking individuals who wish to disseminate photos anonymously online.

Sadly, most papers studying fingerprinting study only the former type of application, which is why we’ll have to speculate a bit on the privacy impact, even though the underlying math of fingerprinting is the same.

Most fingerprinting techniques have both security-enhancing and privacy-infringing applications. The underlying principles are the same but they are applied slightly differently.

Another point to note is that because of the focus on forensics, most of the work in this area so far has studied distinguishing different camera models. But there are some preliminary results on distinguishing ‘identical’ cameras, and it appears that the same techniques will work.

In more detail. Let’s look at what I think is the most well-known paper on sensor pattern noise fingerprinting, by Binghamton University researchers Jan Lukáš, Jessica Fridrich, and Miroslav Goljan. [3] Here’s how it works: the first step is to build a reference pattern of a camera from multiple known images taken with it, so that later an unsourced image can be compared against these reference patterns. The authors suggest using at least 50 images, but for good measure, they use 320 in their experiments. In the forensics context, the investigator probably has physical possession of the camera and therefore can generate an unlimited number of images. We’ll discuss what this requirement means in the privacy-breach context later.

There are two steps to build the reference pattern. First, for each image, a denoising filter is applied, and the denoised image is subtracted from the original to leave only the noise. Next, the noise is averaged across all the reference images — this way the random noise cancels out and leaves the pattern noise.

Comparing a new image to a reference pattern, to test if it came from that camera, is easy: extract the noise from the test image, and compare this noise pixel-by-pixel with the reference noise. The noise from the test image includes random noise, so the match won’t be close to perfect, but nevertheless the correlation between the two noise patterns will be roughly equal to the contribution of pattern noise towards the total noise in the test image. On the other hand, if the test image didn’t come from the same camera, the correlation will be close to zero.
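The pipeline described in the last two paragraphs (extract the noise, average it into a reference pattern, then correlate) can be sketched as follows. This is a toy version under loud assumptions: a 3x3 mean filter stands in for the wavelet denoiser the authors use, images are plain lists of grayscale rows, and `synthetic_camera` and `fake_photo` are hypothetical helpers that fabricate test data, not anything from the paper.

```python
import math
import random

def denoise(img):
    """Stand-in denoiser: 3x3 mean filter (the paper uses a wavelet filter)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [img[rr][cc]
                      for rr in range(max(0, r - 1), min(h, r + 2))
                      for cc in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = sum(window) / len(window)
    return out

def noise_residual(img):
    """Original minus denoised image, flattened to one list of pixel values."""
    smooth = denoise(img)
    return [p - s for row, srow in zip(img, smooth) for p, s in zip(row, srow)]

def reference_pattern(images):
    """Average the residuals so random noise cancels and pattern noise stays."""
    residuals = [noise_residual(img) for img in images]
    return [sum(px) / len(residuals) for px in zip(*residuals)]

def correlation(a, b):
    """Pearson correlation between two equal-length noise vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def synthetic_camera(seed, size=24):
    """Hypothetical sensor: a fixed +/-1 pattern-noise grid per camera."""
    rng = random.Random(seed)
    return [[rng.choice([-1.0, 1.0]) for _ in range(size)] for _ in range(size)]

def fake_photo(pattern, rng):
    """Synthetic photo: flat grey scene + fixed pattern noise + random noise."""
    return [[128.0 + p + rng.gauss(0, 1.0) for p in row] for row in pattern]
```

With synthetic data like this, a test photo from the same "camera" correlates strongly with the reference pattern, while a photo from a different one correlates near zero. Real photos are much harder because scene content leaks into the residual, which is exactly the challenge noted earlier.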

The authors experimented with nine cameras, of which two were from the same brand and model (Olympus Camedia C765). In addition, two other cameras had the same type of sensor. There was not a single error in their 2,700 tests, including those involving the two ‘identical’ cameras: in each case, the algorithm correctly identified which of the nine cameras a given image came from. By extrapolating the correlation curves, they conservatively estimate that for a False Accept Rate of 10⁻³, their method achieves a False Reject Rate of anywhere between 10⁻² and 10⁻¹⁰ or even less depending on the camera model and camera settings.

The takeaway from this seems to be that distinguishing between cameras of different models can be performed with essentially perfect accuracy. Distinguishing between cameras of the same model also seems to have very high accuracy, but it is hard to generalize because of the small sample size.

Improvements. Impressive as the above numbers are, there are at least two major ways in which this result can be, and has been, improved. First, the Binghamton paper is focused on a specific signal, sensor noise. But there are several stages in the image acquisition and processing pipeline in the camera, each of which could leave idiosyncratic effects on the image. This paper out of Turkey incorporates many such effects by considering all patterns of certain types that occur in the lower order (least significant) bits of the image, which seems like a rather powerful technique.

The effects other than sensor noise seem to help more with identifying the camera model than the specific device, but to the extent that the former is a component of the latter, it is useful. They achieve a 97.5% accuracy among 16 test cameras — but with cellphone cameras with pictures at a resolution of just 640×480.

Second is the effect of the scene itself on the noise. Denoising transformations are not perfect — sharp boundaries look like noise. The Binghamton researchers picked their denoising filter (a wavelet transform) to minimize this problem, but a recent paper by Chang-Tsun Li claims to do it better, and shows even better numerical results: with 6 cameras (all different models), accurate (over 99%) identification for image fragments cropped to just 256 x 512.

What does this mean for privacy? I said earlier that there is a duality between security and privacy, but let’s examine the relationship in more detail. In privacy-infringing applications like mass surveillance, the algorithm need not always produce an answer, and it can occasionally be wrong when it does. The penalty for errors is much lower. On the other hand, the matching algorithm in surveillance-like applications needs to handle a far larger number of candidate cameras. The key point is:

The parameters of fingerprinting algorithms can usually be tweaked to handle a larger number of classes (i.e., devices) at the expense of accuracy.

My intuition is that state-of-the-art techniques, configured slightly differently, should allow probabilistic deanonymization from among tens of thousands of different cameras. A Flickr or Picasa profile with a few dozen images should suffice to fingerprint a camera.[4] Combined with metadata such as location, this puts us within striking distance of Internet-scale source-camera identification from anonymous images. I really hope there will be some serious research on this question.

Finally, a word on defenses. If you find yourself in a position where you wish to anonymously publicize a sensitive photograph you took, but your camera is publicly tied to your identity because you’ve previously shared pictures on social networks (and who hasn’t), how do you protect yourself?

Compressing the image is one possibility, because that destroys the ‘lower-order’ bits that fingerprinting crucially depends on. However, it would have to be way more aggressive than most camera defaults (JPEG quality factor ~60% according to one of the studies, whereas defaults are ~95%). A different strategy is rotating the image slightly in order to ‘desynchronize’ it, throwing off the fingerprint matching. An attack that defeats this will have to be much more sophisticated and will have a far higher error rate.

The deanonymization threat here is analogous to writing-style fingerprinting: there are simple defenses, albeit not foolproof, but sadly most users are unaware of the problem, let alone solutions.

[1] That was a bit simplified; mathematically, there is an additive component (dark signal nonuniformity) and a multiplicative component (photoresponse nonuniformity). The former is easy to correct for, and higher-end cameras do, but the latter isn’t.

[2] Much has been said about the tension between security and privacy at a social/legal/political level, but I’m making a relatively uncontroversial technical statement here.

[3] Fridrich is incidentally one of the pioneers of speedcubing i.e., speed-solving the Rubik’s cube.

[4] The Binghamton paper uses 320 images per camera for building a fingerprint (and recommends at least 50); the Turkey paper uses 100, and Li’s paper 50. I suspect that if more than one image taken from the unknown camera is available, then the number of reference images can be brought down by a corresponding factor.

September 19, 2011 at 9:25 am 5 comments

Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge

The title of this post is also the title of a new paper of mine with Elaine Shi and Ben Rubinstein. You can grab a PDF or a web-friendly HTML version generated using my Project Luther software.

A brief de-anonymization history. As early as the first version of my Netflix de-anonymization paper with Vitaly Shmatikov back in 2006, a colleague suggested that de-anonymization can in fact be used to game machine-learning contests, by simply “looking up” the attributes of de-anonymized users instead of predicting them. We off-handedly threw a paragraph into our paper discussing this possibility, and a New Scientist writer seized on it as an angle for her article.[1] Nothing came of it, of course; we had no interest in gaming the Netflix Prize.

During the years 2007-2009, Shmatikov and I worked on de-anonymizing social networks. The paper that resulted (PDF, HTML) showed how to take two graphs representing social networks and map the nodes to each other based on the graph structure alone—no usernames, no nothing. As you might imagine, this was a phenomenally harder technical challenge than our Netflix work. (Backstrom, Dwork and Kleinberg had previously published a paper on social network de-anonymization; the crucial difference was that we showed how to put two social network graphs together rather than search for a small piece of graph-structured auxiliary information in a large graph.)

The context for these two papers is that data mining on social networks—whether online social networks, telephone call networks, or any type of network of links between individuals—can be very lucrative. Social networking websites would benefit from outsourcing “anonymized” graphs to advertisers and such; we showed that the privacy guarantees are questionable-to-nonexistent since the anonymization can be reversed. No major social network has gone down this path (as far as I know), quite possibly in part because of the two papers, although smaller players often fly under the radar.

The Kaggle contest. Kaggle is a platform for machine learning competitions. They ran the IJCNN social network challenge to promote research on link prediction. The contest dataset was created by crawling an online social network—which was later revealed to be Flickr—and partitioning the obtained edge set into a large training set and a smaller test set of edges augmented with an equal number of fake edges. The challenge was to predict which edges were real and which were fake. Node identities in the released data were obfuscated.

There are many, many anonymized databases out there; I come across a new one every other week. I pick a de-anonymization project if it will advance the art significantly (yes, de-anonymization is still partly an art), or if it is fun. The Kaggle contest was a bit of both, and so when my collaborators invited me to join them, it was too juicy to pass up.

The Kaggle contest is actually much more suitable to game through de-anonymization than the Netflix Prize would have been. As we explain in the paper:

One factor that greatly affects both [the privacy risk and the risk of gaming]—in opposite directions—is whether the underlying data is already publicly available. If it is, then there is likely no privacy risk; however, it furnishes a ready source of high-quality data to game the contest.

The first step was to do our own crawl of Flickr; this turned out to be relatively easy. The two graphs (the Kaggle graph and our Flickr crawl) were 95% similar, as we were later able to determine. The difference is primarily due to Flickr users adding and deleting contacts between Kaggle’s crawl and ours. Armed with the auxiliary data, we set about the task of matching up the two graphs based on the structure. To clarify: our goal was to map the nodes in the Kaggle training and test dataset to real Flickr nodes. That would allow us to simply look up the pairs of nodes in the test set in the Flickr graph to see whether or not the edge exists.
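Once a node mapping exists, the lookup really is that simple. Here is a minimal sketch, assuming edges are directed pairs and that unmapped endpoints are left to ordinary link prediction; the function name and data layout are hypothetical.

```python
def label_test_edges(test_edges, mapping, flickr_edges):
    """Return True/False per test edge, or None when an endpoint is unmapped."""
    labels = []
    for u, v in test_edges:
        if u not in mapping or v not in mapping:
            labels.append(None)  # no answer: fall back to link prediction
            continue
        # Assumption: edges are directed, so (u, v) and (v, u) are distinct.
        labels.append((mapping[u], mapping[v]) in flickr_edges)
    return labels
```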

De-anonymization. Our effort validated the broad strategy in my paper with Shmatikov, which consists of two steps: “seed finding” and “propagation.” In the former step we somehow de-anonymize a small number of nodes; in the latter step we use these as “anchors” to propagate the de-anonymization to more and more nodes. In this step the algorithm feeds on its own output.

Let me first describe propagation because it is simpler.[2] As the algorithm progresses, it maintains a (partial) mapping between the nodes in the true Flickr graph and the Kaggle graph. We iteratively try to extend the mapping as follows: pick an arbitrary as-yet-unmapped node in the Kaggle graph, find the “most similar” node in the Flickr graph, and if they are “sufficiently similar,” they get mapped to each other.

Similarity between a Kaggle node and a Flickr node is defined as cosine similarity between the already-mapped neighbors of the Kaggle node and the already-mapped neighbors of the Flickr node (nodes mapped to each other are treated as identical for the purpose of cosine comparison).

In the diagram, the blue nodes have already been mapped. The similarity between A and B is 2 / (√3·√3) = ⅔. Whether or not edges exist between A and A’ or B and B’ is irrelevant.
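Here is a minimal sketch of the similarity measure and a single propagation step. It treats the graphs as undirected adjacency dictionaries and uses a bare similarity threshold in place of the “sufficiently similar” heuristics from the paper; the function names and the threshold value are assumptions for illustration.

```python
import math

def similarity(kaggle_node, flickr_node, kaggle_adj, flickr_adj, mapping):
    """Cosine similarity between the already-mapped neighbor sets of two nodes."""
    mapped_flickr = set(mapping.values())
    # Translate the Kaggle node's mapped neighbors into Flickr identifiers,
    # so that nodes mapped to each other count as identical.
    k_nbrs = {mapping[n] for n in kaggle_adj[kaggle_node] if n in mapping}
    f_nbrs = {n for n in flickr_adj[flickr_node] if n in mapped_flickr}
    if not k_nbrs or not f_nbrs:
        return 0.0
    return len(k_nbrs & f_nbrs) / math.sqrt(len(k_nbrs) * len(f_nbrs))

def propagate_once(kaggle_adj, flickr_adj, mapping, threshold=0.5):
    """Try to map one more Kaggle node; return True if the mapping grew."""
    used = set(mapping.values())
    candidates = [f for f in flickr_adj if f not in used]
    if not candidates:
        return False
    for k in kaggle_adj:
        if k in mapping:
            continue
        best = max(candidates,
                   key=lambda f: similarity(k, f, kaggle_adj, flickr_adj, mapping))
        if similarity(k, best, kaggle_adj, flickr_adj, mapping) >= threshold:
            mapping[k] = best  # later steps can use this node as an anchor
            return True
    return False
```

Calling `propagate_once` in a loop until it returns False is the sense in which the algorithm feeds on its own output: every newly mapped node can raise the similarity scores of its neighbors.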

There are many heuristics that go into the “sufficiently similar” criterion, which are described in our paper. Due to the high percentage of common edges between the graphs, we were able to use a relatively pure form of the propagation algorithm; the one in my paper with Shmatikov, in contrast, was filled with many more messy heuristics.

Those elusive seeds. Seed identification was far more challenging. In the earlier paper, we didn’t do seed identification on real graphs; we only showed it possible under certain models for error in auxiliary information. We used a “pattern-search” technique, as did the Backstrom et al. paper. It wasn’t clear whether this method would work, for reasons I won’t go into.

So we developed a new technique based on “combinatorial optimization.” At a high level, this means that instead of finding seeds one by one, we try to find them all at once! The first step is to find a set of k (we used k=20) nodes in the Kaggle graph and k nodes in our Flickr graph that are likely to correspond to each other (in some order); the next step is to find this correspondence.

The latter step is the hard one, and basically involves solving an NP-hard problem of finding a permutation that minimizes a certain weighting function. During the contest I basically stared at this page of numbers for a couple of hours, and then wrote down the mapping, which to my great relief turned out to be correct! But later we were able to show how to solve it in an automated and scalable fashion using simulated annealing, a well-known technique to approximately solve NP-hard problems for small enough problem sizes. This method is one of the main research contributions in our paper.
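To give a flavor of the annealing step, here is a minimal sketch. The cost function is an assumption on my part for illustration: it supposes we have a k×k table of pairwise scores (say, neighborhood similarities) among the candidate nodes in each graph, and looks for the permutation under which the two tables agree best. The exact weighting function, cooling schedule, and parameters we used are in the paper; the names and defaults below are hypothetical.

```python
import math
import random

def mismatch(perm, kaggle_sim, flickr_sim):
    """Total disagreement between pairwise scores under a candidate mapping."""
    k = len(perm)
    return sum(abs(kaggle_sim[i][j] - flickr_sim[perm[i]][perm[j]])
               for i in range(k) for j in range(i + 1, k))

def anneal(kaggle_sim, flickr_sim, steps=20000, t_start=1.0, t_end=1e-3, seed=0):
    """Search permutations by swapping two entries; keep the best one seen."""
    rng = random.Random(seed)
    k = len(kaggle_sim)
    perm = list(range(k))
    rng.shuffle(perm)
    cost = mismatch(perm, kaggle_sim, flickr_sim)
    best, best_cost = perm[:], cost
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)  # geometric cooling
        i, j = rng.sample(range(k), 2)
        perm[i], perm[j] = perm[j], perm[i]
        new_cost = mismatch(perm, kaggle_sim, flickr_sim)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = perm[:], cost
        else:
            perm[i], perm[j] = perm[j], perm[i]  # reject: undo the swap
    return best, best_cost
```

The returned list maps Kaggle candidate i to Flickr candidate `best[i]`. Like any annealing run, it comes with no guarantee of optimality, so the result should be sanity-checked, for example by seeing whether propagation from those seeds takes off.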

After carrying out seed identification, and then propagation, we had de-anonymized about 65% of the edges in the contest test set and the accuracy was about 95%. The main reason we didn’t succeed on the other third of the edges was that one or both of the nodes had a very small number of contacts/friends, resulting in too little information to de-anonymize. Our task was far from over: combining de-anonymization with regular link prediction also involved nontrivial research insights, for which I will again refer you to the relevant section of the paper.

Lessons. The main question that our work raises is where this leaves us with respect to future machine-learning contests. One necessary step that would help a lot is to amend contest rules to prohibit de-anonymization and to require source code submission for human verification, but as we explain in the paper:

The loophole in this approach is the possibility of overfitting. While source-code verification would undoubtedly catch a contestant who achieved their results using de-anonymization alone, the more realistic threat is that of de-anonymization being used to bridge a small gap. In this scenario, a machine learning algorithm would be trained on the test set, the correct results having been obtained via de-anonymization. Since successful [machine learning] solutions are composites of numerous algorithms, and consequently have a huge number of parameters, it should be possible to conceal a significant amount of overfitting in this manner.

As with the privacy question, there are no easy answers. It has been over a decade since Latanya Sweeney’s work provided the first dramatic demonstration of the privacy problems with data anonymization; we still aren’t close to fixing things. I foresee a rocky road ahead for machine-learning contests as well. I expect I will have more to say about this topic on this blog; stay tuned.

[1] Amusingly, it was a whole year after that before anyone paid any attention to the privacy claims in that paper.

[2] The description is from my post on the Kaggle forum which also contains a few additional details.

March 9, 2011 at 12:30 pm 4 comments


About 33bits.org

I'm an assistant professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
