Data Privacy: The Story of a Paradigm Shift

February 25, 2010 at 10:44 pm

Let’s take a break from the Ubercookies series. I’m at the IPAM data privacy workshop in LA, and I want to tell you about the kind of unusual scientific endeavor that it represents. I’ve recently started to write about the process of doing science, what’s good and what’s bad about it, and I expect to have more to say on this topic in this blog.

While “paradigm shift” has become a buzzword, the original sense in which Kuhn used it refers to a specific scientific process. I’ve had the rare experience of witnessing such a paradigm shift unfold, and I may even have played a small part. I am going to tell that story. I hope it will give you a “behind-the-scenes” look into how science works.

I will sidestep the question of whether data privacy is a science. I think it is a science to the extent that computer science is a science. At any rate, I think this narrative provides a nice illustration of Kuhn’s ideas.

First I need to spend some time setting up the scene and the actors. (I’m going to take some liberties and simplify things for the benefit of the broader audience, and I hope my colleagues will forgive me for it.)

The scene. Privacy research is incredibly multidisciplinary, and this workshop represents one extreme of the spectrum: the math behind data privacy. The mathematical study of privacy in databases centers on one question:

If you have a bunch of data collected from individuals, and you want to let other people do something useful with the data, such as learning correlations, how do you do it without revealing individual information?

There are roughly 3 groups that investigate this question and are represented here:

  • computer scientists with a background in cryptography / theoretical CS
  • computer scientists with a background in databases and data mining
  • statisticians.

This classification is neither exhaustive nor strict, but it will suffice for my current purposes.

One of the problems with science and math research is that different communities studying different aspects of the same problem (or even studying the same problem from different perspectives) don’t meet very often. For one, there is a good deal of friction in overcoming the language barriers (different names for, and ways of thinking about, the same things). For another, academics are rewarded primarily for publishing in their own communities. That is why the organizers deserve a ton of credit for bridging the barriers and getting people together.

The paradigms. There is a fundamental, inescapable tension between the utility of data and the privacy of the participants. That’s the one thing that theorists and practitioners can agree on :-) Given that fact, there are two ways to go about building a theory of privacy protection, which I will call utility-first and privacy-first. Statisticians and database people tend to prefer the former paradigm, and cryptographers the latter, but this is not a clean division.

Utility-first hopes to preserve the statistical computations that we would want to do if we didn’t have to worry about privacy, and then asks, “how can we improve the privacy of participants while still doing all these things?” Data anonymization is one natural technique that comes out of this world view: if you are only doing simple syntactic transformations to the data, the utility of the data is not affected very much.
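To make “simple syntactic transformations” concrete, here is a toy sketch in the utility-first spirit; the records, field names, and the particular generalizations are made up purely for illustration.

    # A toy example of anonymization by syntactic transformation: suppress the
    # direct identifier and generalize the quasi-identifiers, keeping the
    # attribute analysts actually want. (Records and fields are made up.)
    records = [
        {"name": "Alice", "zip": "90024", "age": 34, "diagnosis": "asthma"},
        {"name": "Bob", "zip": "90095", "age": 58, "diagnosis": "diabetes"},
    ]

    def anonymize(record):
        return {
            "zip": record["zip"][:3] + "**",          # generalize ZIP code to a prefix
            "age": f"{(record['age'] // 10) * 10}s",  # bucket age into decades
            "diagnosis": record["diagnosis"],         # keep the useful attribute; the name is dropped
        }

    print([anonymize(r) for r in records])

The utility barely suffers, which is the whole appeal; whether the privacy survives is another question.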

On the other hand, privacy-first says, “let’s first figure out a rigorously provable way to assure the privacy of participants, and then go about figuring out what types of computations can be carried out under this rubric.” The community has collectively decided, with good reason, that differential privacy is the right rubric to use. To explain it properly would require many Greek symbols, so I won’t.
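If you want a flavor of it without the Greek symbols, here is a minimal sketch of the standard Laplace mechanism applied to a counting query; the function name, data, and epsilon value below are purely illustrative.

    import numpy as np

    def private_count(values, predicate, epsilon):
        """Answer a counting query with Laplace noise of scale 1/epsilon;
        adding or removing one record changes a count by at most 1."""
        true_count = sum(1 for v in values if predicate(v))
        return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

    # Noisy count of people over 40; smaller epsilon means more noise and more privacy.
    ages = [23, 45, 31, 52, 67, 29, 41, 38]
    print(private_count(ages, lambda age: age > 40, epsilon=0.5))

The guarantee comes from calibrating the noise to how much any one record can move the answer, not from inspecting what the data looks like.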

Privacy-first and utility-first are scientific paradigms, not theories. Neither is falsifiable. We can argue that one is better than the other, but that is a judgement call.

An important caveat must be noted here. The terms do not refer to the social values of putting the utility of the data before the privacy of the participants, or vice versa. Those values are external to the model and are constraints enforced by reality. Instead, we are merely talking about which paradigm gives us better analytical techniques to achieve both the utility and privacy requirements to the extent possible.

The shift. With utility-first, you have strong, well-understood guarantees on the usefulness of the data, but typically only a heuristic analysis of privacy. What this translates to is an upper bound on privacy. With privacy-first, you have strong, well-understood privacy guarantees, but you only know how to perform certain types of computations on the data. So you have a lower bound on utility.

That’s where things get interesting. Utility-first starts to look worse as time goes on, as we discover more and more inferential techniques for breaching the privacy of participants. Privacy-first starts to look better with time, as we discover that more and more types of data-mining can be carried out due to innovative algorithms. And that is exactly how things have played out over the last few years.

I was at a similarly themed workshop at Bertinoro, Italy back in 2005, with much the same audience in attendance. Back then, the two views were about equally prevalent; the first papers on differential privacy were being written or had just been written (of course, the paradigm itself was not new). Fast forward 5 years, and the proponents of one view have started to win over the other, although we quibble to no small extent over the details. Overall, though, the shift has happened in a swift and amicable way, with both sides now largely agreeing on differential privacy.

Why did privacy-first win? I can see many reasons. The privacy protections of the utility-first techniques kept getting broken (a Kuhnian “crisis”?); the de-anonymization research that I and others worked on played a big part here. Another reason might be the way the cryptographic community operates: once they decide that a paradigm is worth investigating, they tend to jump in on it all at once and pick the bones clean. That ensured that within a few years, a huge number of results of the form “how to compute X with differential privacy” were published. A third reason might very well be the fact that these interdisciplinary workshops exist, giving us an opportunity to change each other’s minds.

The fallout. While the debate in theoretical circles seems largely over, the ripple effects are going to be felt “downstream” for a long time to come. Differential privacy is only slowly penetrating other areas of research where privacy is a peripheral rather than a fundamental object of study. As for law and policy, Ohm’s paper on the failure of anonymization has certainly created a bang there.

That leaves the most important contingent: practitioners. Technology companies have been quick to learn the lessons — differential privacy was invented by Microsoft researchers — and have been studying questions like sharing search logs with differential privacy assurances and building programming systems incorporating differential privacy (see PINQ, developed at Microsoft Research, and Airavat, funded by Google).

Other sectors, especially medical informatics, have been far slower to adapt, and it is not clear if they ever will. Multiple speakers at this workshop dealing with applications in different sectors talked about their efforts at anonymizing high-dimensional data (good luck with that). The problems are compounded by the fact that differential privacy isn’t yet at a point where it is easily usable in applications, and that in many cases the upshot of the theory has been to prove that the simultaneous utility and privacy requirements simply cannot be met. It will probably be the better part of a decade before differential privacy makes any real headway into real-world usage.

Summary. I hope I’ve shown you what scientific “paradigms” are, and how they are adopted and discarded. Paradigm shifts are important turning points for scientific disciplines and often have big consequences for society as a whole. Finally, science is not a cold sequence of deductions but is done by real people with real motivations; the scientific process has a significant social and cultural component, even if the output of science is objective.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.



7 Comments

  • 1. Suresh  |  February 26, 2010 at 4:15 am

    I like your analysis of lower bounds vs upper bounds. But there’s plausibly another way of explaining why differential privacy “wins”: it’s because in an adversarial attack model, you have to have clearly modelled adversaries, and differential privacy does that, by using the crypto trick of merely limiting the power of the adversary. Whereas all the anonymization work was a lot fuzzier about the kinds of adversarial models.

    Maybe that merely amounts to what you were saying about upper bounds on privacy?

    • 2. Arvind  |  February 26, 2010 at 4:25 am

      Yes, absolutely. I was halfway through writing a paragraph on “adversarial thinking” and how differential privacy embodies it because of the crypto background. But then I deleted it because the post was getting too long :) I decided to kinda hint at it instead by talking about upper bounds.

      The other reason I didn’t write about it is that people have been asking me for examples of real-life adversaries, and I owe them a whole post, so I thought this would fit in better with that post.

      The most common objections to differential privacy come from people who don’t agree that adversaries should be modeled in a worst-case manner. I think we’re gonna need some more high-profile inferential privacy breaches to change their minds :)

      • 3. Suresh  |  February 26, 2010 at 4:28 am

        Too true. People should pay more attention to the history of security breaches. And as more and more data gets into the cloud, this will only get worse and worse. Arguably the event that kicked off the original surge of interest in privacy was the demonstration by Sweeney that the MA governor’s info could be hacked ;)

  • 4. noamnisan  |  February 26, 2010 at 5:32 am

    I like your viewpoint of looking at Differential Privacy as a paradigm. Its effect is noticed even without using any of the field’s technical results, just the general point of view: add some noise and prove that sufficiently little information is leaked and that a sufficiently good answer still emerges. It is the latter that may be first to be picked up by practitioners, it seems.

  • 5. cowherd  |  February 28, 2010 at 7:41 pm

    “The most common objections to differential privacy come from people who don’t agree that adversaries should be modeled in a worst-case manner.”

    People in the “real” world don’t care about security; they care about risk. Risk accounts for the likelihood of a threat (among other factors). Unless the exposure or consequence are much higher than the inverse of the likelihood, in practice the worst case model will continue to be ignored as more cost than it is worth.

    • 6. Arvind  |  March 4, 2010 at 8:50 pm

      I think reality has consistently shown that people tend to underestimate privacy risks. It’s hard to measure the harm to individuals, because it is the aggregate of a million tiny harms, so let’s look only at examples where the risk to the data curator did indeed materialize.

      1. AOL had to fire top executives after the search data release
      2. Netflix got sued and got bad press
      3. Decode Genetics had its grand plans in Iceland (i.e., the Health Database Act of 1998) shut down by the Supreme Court, partly because of the re-identification risk. The company later went bankrupt.
      4. After the Homer et al. paper, a whole bunch of genetics research databases went behind closed doors and researchers now have to jump through hoops to get them and everyone complains about it.

      I could go on. If you look outside re-identification, it gets even worse. Google got a huge amount of bad press due to privacy problems in Buzz even though they fixed it in 4 days.

      I’m not arguing for the worst-case model as being for the good of humanity. I’m arguing for it purely on the basis of self-interest.

  • […] the right paradigm in the privacy context. The theoretical study of database privacy seems to be doing rather well by borrowing methods from cryptography, and I’ve argued in support of adversarial thinking […]





