An open letter to Netflix from the authors of the de-anonymization paper

March 15, 2010 at 4:53 pm

Dear Netflix,

Today is a sad day. It is also a day of hope.

It is a sad day because the second Netflix challenge had to be cancelled. We never thought it would come to this. One of us has publicly referred to the dampening of research as the “worst possible outcome” of privacy studies. As researchers, we are true believers in the power of data to benefit mankind.

We published the initial draft of our de-anonymization study just two weeks after the dataset for the first Netflix Prize became public. Since we had the math to back up our claims, we assumed that lessons would be learned, and that if there were to be a second data release, it would either involve only customers who opted in, or a privacy-preserving data analysis mechanism. That was three and a half years ago.

Instead, you brushed off our claims, calling them “absolutely without merit,” among other things. It has taken negative publicity and an FTC investigation to stop things from getting worse. Some may make the argument that even if the privacy of some of your customers is violated, the benefit to mankind outweighs it, but the “greater good” argument is a very dangerous one. And so here we are.

We were pleasantly surprised to read the plain, unobfuscated language in the blog post announcing the cancellation of the second contest. We hope that this signals a change in your outlook with respect to privacy. We are happy to see that you plan to “continue to explore ways to collaborate with the research community.”

Running something like the Netflix Prize competition without compromising privacy is a hard problem, and you need the help of privacy researchers to do it right. Fortunately, there has been a great deal of research on “differential privacy,” some of it specific to recommender systems. But there are practical challenges, and overcoming them will likely require setting up an online system for data analysis rather than an “anonymize and release” approach.
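
To give a concrete flavor of the building block involved (a minimal illustrative sketch, not a proposal for any specific Netflix system; every name and number in it is invented), here is the standard Laplace mechanism applied to a simple counting query over a ratings table. Adding or removing one subscriber's record changes such a count by at most one, so Laplace noise with scale 1/epsilon is enough for an epsilon-differential-privacy guarantee:

import numpy as np

def dp_count(records, predicate, epsilon):
    """Answer 'how many records satisfy predicate?' with epsilon-differential privacy."""
    # A counting query has sensitivity 1: one person's record changes the
    # true count by at most 1, so Laplace noise with scale 1/epsilon suffices.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical usage: ratings as (user_id, movie_id, stars) tuples.
ratings = [(1, 42, 5), (2, 42, 4), (3, 7, 1)]
print(dp_count(ratings, lambda r: r[1] == 42 and r[2] >= 4, epsilon=0.5))

An online analysis system would answer many such queries while keeping track of the cumulative privacy budget they consume.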

Data privacy researchers will be happy to work with you rather than against you. We believe that this can be a mutually beneficial collaboration. We need someone with actual data and an actual data-mining goal in order to validate our ideas. You will be able to move forward with the next competition, and just as importantly, it will enable you to become a leader in privacy-preserving data analysis. One potential outcome could be an enterprise-ready system which would be useful to any company or organization that outsources analysis of sensitive customer data.

It’s not often that a moral imperative aligns with business incentives. We hope that you will take advantage of this opportunity.

Arvind Narayanan and Vitaly Shmatikov


For background, see our paper and FAQ.

To stay on top of future posts on 33bits.org, subscribe to the RSS feed or follow me on Twitter.



19 Comments

  • 1. Bertil Hatt  |  March 16, 2010 at 12:00 pm

    Hard to say anything much—except I’m glad you are trying to address what appears to be the most challenging and important issue today.

    One approach I’ve explored (more because I’m lazy & can’t code than anything else, so it is valueless) is to cut off the data download and give access through a cloud solution (preferably one that could bill testers for CPU usage) that explicitly prevents some known ways to reveal personal information. Testers accept that you, and a known, limited set of privacy experts, can oversee their queries to prevent new infringements — and document them.

    You would be moderators for data-journalism, in a way.
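
    (A purely hypothetical sketch of what such a gatekeeper could look like, with every name and limit below invented for illustration: each tester's query runs server-side through a wrapper that logs it for the privacy reviewers and charges it against a per-tester budget, so nothing touches the data unobserved.)

    import logging
    from collections import defaultdict

    logging.basicConfig(level=logging.INFO)

    class AuditedQueryGateway:
        """Hypothetical gatekeeper: queries run server-side, are logged for
        reviewers, and are charged against a per-tester budget."""

        def __init__(self, dataset, per_tester_budget=100):
            self.dataset = dataset
            self.remaining = defaultdict(lambda: per_tester_budget)

        def run(self, tester_id, query_fn):
            if self.remaining[tester_id] <= 0:
                raise PermissionError("query budget exhausted; ask the reviewers for more")
            self.remaining[tester_id] -= 1
            logging.info("tester=%s query=%s", tester_id,
                         getattr(query_fn, "__name__", "anonymous"))
            # Only the aggregate result of query_fn leaves the system, never raw records.
            return query_fn(self.dataset)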

  • 2. Dan Kaminsky  |  March 17, 2010 at 4:51 am

    Arvind, Vitaly,

    I hate to be the bearer of bad news, but you just don’t get to have it both ways.

    You’ve made the point, painfully but definitively, that no data dump can be shown to be anonymized a priori, due to unavoidable correlations with external data sources. In other words, no data is a closed system.

    The result of which is, yes, you have won. There simply will not be further anonymized data dumps.

    Look. I know this isn’t the direction you wanted things to go in. I know you see some entire super fascinating research path, where you attempt to find cooler and better ways to make water not wet. But I’m going to tell you, from a cryptographic perspective, you’ve simply made your case that any system patterned enough to be predicted is patterned enough to be correlated with external data sources.

    Sure. I might not be right. But keep in mind, the bar for you to be correct before was offensive (“find one way to correlate”) and the new bar you’re hinting at is defensive (“prevent any way to correlate”). The latter bar is miserably higher, and from a business perspective, even participating in some sort of grand challenge to clear it is ridiculous for NetFlix when simply not releasing the data is a viable option.

    You’re not going to find someone who clamors louder for accurate data — but really? Maybe try to get your next round of data from someone else. After all, a second deanonymization at NetFlix would be orders of magnitude more painful, and realistically, you’re not going to succeed at “anon done right” anytime soon.

    –Dan

    • 3. Arvind  |  March 17, 2010 at 5:03 am

      Dan,

      We are not claiming that there is a better anonymization method; we are quite aware that there is none. This isn’t news to us — in fact we’ve been making that point at every opportunity for the last 3-4 years.

      That is why we said in the letter that protecting privacy will “likely require setting up an online system for data analysis rather than an “anonymize and release” approach.”

      • 4. Dan Kaminsky  |  March 17, 2010 at 6:53 am

        Arvind,

        Oh, come on. You can’t say, “We are not claiming that there is a better anonymization method, we are quite aware that there is none.” — when you write:

        ==
        Running something like the Netflix Prize competition without compromising privacy is a hard problem, and you need the help of privacy researchers to do it right. Fortunately, there has been a great deal of research on “differential privacy,” some of it specific to recommender systems
        ==

        Arvind, what do you think Differential Privacy *is*? I read Dwork’s paper — cool work, btw! — and here’s what I find:

        ===
        A privacy guarantee that limits risk incurred by joining encourages participation in the dataset, increasing social utility. This is the starting point on our path to differential privacy.
        ===

        Sure looks to me like you’re saying, in that paragraph, that NetFlix should support research into something you just said wasn’t going to work.

        Look. There’s not going to be an online system for data analysis. I’ve got my scraping code, and so do you, and neither of us is going to be stopped by any server-side limitations. The reality is that you can’t solve this issue with technology, only with contract law, and even then only by drastically limiting to whom you hand the data.

        So that’s the reality. Small groups of vetted people will be the only people who get to play with the real bits. They won’t get to talk about it, and they won’t get to publish. NetFlix, and those like them, will get somewhat worse results from this — but the results won’t be *obviously* worse, especially compared to any competition which will be equally constrained.

        I don’t say this because it’s the future. I say this because it’s the present.

        • 5. Adam Kramer  |  March 18, 2010 at 1:09 am

          I think that the middle ground here lies in the continuum between “public” and “private.” Contract law and a terms-of-service agreement can do a lot towards informing people what their data can be used for*, but they can also do a lot in terms of restricting what a data set can be used for by a team of researchers.

          Releasing a data set only when a researcher accepts (or signs, perhaps in blood) a ToS that explicitly prohibits sharing the data (or reverse-engineering userids) is one example. Releasing slightly different data to different researchers in order to track whose data gets “leaked” also seems quite reasonable–after all, the researcher doesn’t have to sign or download the data if they don’t want the risk of being held accountable for its use.

          Sufficient contract law in these cases can change the problem from a company betraying users’ privacy to a researcher doing so (and we researchers all had IRB approval to store and study the Netflix data, right? It is, after all, research with human subjects published under your university byline), or to a criminal actually stealing data from researchers. We already have infrastructure in place to deal with these cases, and they still (potentially) allow anybody access to the data. It is analogous to leaving your house unlocked–if someone walks in and takes your TV, they are still a thief.

          *: I know that nobody reads ToS; that is a separate issue which I believe HAS a solution.
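
          (To illustrate the “slightly different data per researcher” idea above with a purely hypothetical sketch, not anything any company actually does: a release script could flip a few innocuous ratings in each researcher's copy and record the resulting fingerprint, so a leaked copy can later be matched back to its recipient.)

          import hashlib
          import random

          def fingerprinted_copy(ratings, researcher_id, n_marks=20):
              """Return a per-researcher copy of the ratings with a few entries
              perturbed by one star; the marked positions act as a fingerprint."""
              rng = random.Random(hashlib.sha256(researcher_id.encode()).hexdigest())
              marked = rng.sample(range(len(ratings)), k=min(n_marks, len(ratings)))
              copy = list(ratings)
              for i in marked:
                  user, movie, stars = copy[i]
                  copy[i] = (user, movie, max(1, min(5, stars + rng.choice([-1, 1]))))
              return copy, sorted(marked)

          def likely_source(leaked, copies_by_researcher):
              """Guess whose copy leaked by counting exact row matches."""
              return max(copies_by_researcher,
                         key=lambda rid: sum(a == b for a, b in
                                             zip(leaked, copies_by_researcher[rid])))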

          • 6. Dan Kaminsky  |  March 18, 2010 at 2:25 am

            Adam,

            You’re right that this is an issue of contract law, not of technology. That’s why I brought up contract law.

            But here’s the problem. EULAs and ToS’s and the like just aren’t respected when they’re a click-through arrangement. Some kid will click through, write a new version of Arvind and Vitaly’s paper, and claim academic freedom. And in a battle between a couple of kids and a large corp, the kids always win.

            Look. I’m a big fan of privacy work. I’m in the middle of supporting a privacy effort right now, so it’s not like I’m not part of this world myself. But, when I read things like:

            ===
            Data privacy researchers will be happy to work with you rather than against you. We believe that this can be a mutually beneficial collaboration. We need someone with actual data and an actual data-mining goal in order to validate our ideas.
            ===

            …all I see is some guy at NetFlix, probably reading this very thread, cracking up as I write out the words he desperately wishes he could have written himself: “Well then, maybe you shouldn’t have _TOTALLY F***ED US_.”

            What possible motivation could NetFlix have to work with Data Privacy researchers? They’re not out to build a more anonymizable dataset. They almost certainly see, as we all do, that that is not possible. Their goal was simply to have a better prediction engine.

            They got 10% better with openness. OK. Suppose they had simply hired and vetted ten teams — $50K upfront, $500K if you’re the best. How well do you think they would have done? 5%? 8%? Pretty close. Close enough. And no bad press.

            The reality is that the first open competition with real, solid data, has ended with a coda that says “Don’t do this, the privacy people might get the lawyers to fire you.” And you (should) know, every large company is really run by the lawyers. So the game is probably over.

            It is sad, though. Social science was about to have more, and better, data than the hard sciences.

            –Dan

        • 7. Arvind  |  March 18, 2010 at 9:11 am

          Privacy is not the same as anonymization, and differential privacy is not a better way of anonymizing data. Rather, I see it as a way of bypassing the need for anonymization. For example, see this paper.

          • 8. Dan Kaminsky  |  March 18, 2010 at 9:50 am

            Arvind,

            Oh come on. The first thing I saw when I clicked that link was:

            ===
            Unlike prior privacy work concerned with cryptographically securing the computation of recommendations, differential privacy constrains a computation in a way that precludes any inference about the underlying records from its output.
            ===

            In other words, a better way of anonymizing data, such that its release does not impact privacy — something we both know won’t actually work.

            Look. I’m totally sympathetic to your position. I’ll even grant that, if you hadn’t done all this, someone else would have, and people probably shouldn’t be angry at you for this inevitable happening. But, it’s where we are.

            • 9. Thomas  |  April 2, 2010 at 4:04 am

              Hi Dan,

              You seem to misunderstand differential privacy — I recommend sitting down and reading one of these papers. I don’t know what you mean when you say it “won’t actually work”. If you read the mathematical definition, you will see that it offers quite a strong privacy guarantee. And there are many papers showing that for a surprisingly large number of tasks, it does work — you get useful algorithms (including for Netflix-style recommender systems) that -provably- preserve differential privacy.
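
              For reference, the definition (my informal paraphrase, not a quote from any one paper) says that a randomized mechanism M is epsilon-differentially private if, for every pair of datasets D and D' differing in a single person's record and every set S of possible outputs,

              \[
                \Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,].
              \]

              Smaller epsilon means the presence or absence of any one record can barely change what the mechanism outputs, which is exactly the strong guarantee referred to above.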

            • 10. AnonCryptographer  |  April 4, 2010 at 2:16 am

              I’m with Dan Kaminsky. I don’t have the guts to use my real name, but I’ve read about differential privacy, and I’m quite skeptical that it’ll work for this. I don’t think the problem is that I fail to understand differential privacy; I think the problem is that anonymization is darn hard.

              But whatever. The burden of proof is not on me; the burden of proof is on you, to prove convincingly that it is possible to anonymize/privatize this kind of data, without loss of utility. So far, you haven’t done so. The ball’s in your court. Let’s see the proof.

              For instance, how about demonstrating how you would have anonymized the first Netflix data set. Prove that your method preserves anonymity, and doesn’t eliminate utility. Then we can talk.

      • 11. Marcin Wojnarski | TunedIT  |  March 24, 2010 at 10:03 am

        … protecting privacy will likely require setting up an online system for data analysis rather than an “anonymize and release” approach …

        TunedIT (http://tunedit.org) is a challenge platform where online data analysis is possible, without releasing the data. Participants may submit executable code of their learning algorithm, which is then trained and tested on a dataset that’s kept secret on the server.

        This approach was used recently in the Advanced Track of the RSCTC Discovery Challenge, which concerned DNA microarray data analysis (http://tunedit.org/challenge/RSCTC-2010-A) – online scoring of algorithms allowed for more precise evaluation, because every algorithm could be trained multiple times on the same dataset using different splits – something like a cross-validation or leave-one-out approach. This was important because microarray data are very sparse by nature and a single train+test evaluation doesn’t work very well.
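
        (A rough, hypothetical illustration of the submit-code model, not TunedIT's actual API: a participant uploads something like the module below, and the server, which alone holds the hidden dataset, runs the repeated train/test splits itself.)

        # Hypothetical participant submission: the server imports this module,
        # calls build_model(), and never ships the hidden data to the participant.
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        def build_model():
            return RandomForestClassifier(n_estimators=200, random_state=0)

        # Server-side evaluation (sketch): repeated splits on the secret data,
        # i.e. cross-validation, instead of a single train+test run.
        def evaluate(submission, X_secret, y_secret):
            scores = cross_val_score(submission.build_model(), X_secret, y_secret, cv=10)
            return scores.mean()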

        • 12. Arvind  |  March 24, 2010 at 4:16 pm

          That is very interesting. Thanks for letting us know!

  • 13. Jennifer Lee  |  March 17, 2010 at 9:14 am

    Setting up an online system for data analysis is an interactive way to preserve database privacy, isn’t it? I’m currently working in the non-interactive direction (anonymize and release), but it’s good to keep an open mind about the benefits of the interactive setting.

    I still can’t tell which method is better. I guess I’ll have to go through more trial and error, as well as monitor current developments, to find out :). Anonymization is a hard problem. The fact that there may never be a full privacy guarantee just reminds me not to expect too much from what I’m doing.

    Thanks for the post again!

    • 14. Dan Kaminsky  |  March 18, 2010 at 2:27 am

      No, you just scrape all the data you want out of the interactive system. You can try to stop this, but then you end up with the DRM problem.

  • 15. FXPAL Blog » Blog Archive » Whither data privacy?  |  March 17, 2010 at 12:02 pm

    […] On Friday Netflix canceled the sequel to its Netflix prize due to privacy concerns. The announcement of the cancellation has had a mixed reception from both researchers and the public. Narayanan and Shmatikov, the researchers who exposed the privacy issues in the original Netflix prize competition data, write “Today is a sad day. It is also a day of hope.” […]

  • 16. Six Lines | Netflix Privacy Research  |  March 23, 2010 at 3:39 pm

    […] Narayanan and Vitaly Shmatikov, the authors of the Netflix dataset de-anonymization paper, wrote an open letter to Netflix about their recent cancellation of their second contest due to privacy concerns. It’s worth […]

  • […] third reason is the “greater good.” I’ve opposed that line of reasoning when used to justify reneging on an explicit privacy promise. But when it […]

  • 18. Mimi Yin  |  May 26, 2010 at 4:57 pm

    Theoretical and practical evaluations of differential privacy aside, has anyone tried to put a number to the differential privacy guarantee? Meaning, has anyone tried to quantify “your record will be almost indiscernible” or “the presence of your record will have negligible impact on the answers given”?

    The organization I’m working with (The Common Data Project) is experimenting with using differential privacy to “automate” the release of sensitive data to the public.

    One hurdle we’ve run up against is that there doesn’t seem to be an agreed-upon “magic number” to quantify the “negligible impact” or “almost indiscernible” language in the guarantee.

    Without such a quantity, we’re concerned that the differential privacy guarantee runs into some of the same problems that plague current definitions (i.e. vagueness, which translates into difficulty of enforcement).

    Does anyone here happen to know of any papers or discussions on this particular aspect of differential privacy? Any leads would be greatly appreciated!

    We’ve documented our quest for the quantifiable guarantee in excruciating detail here: http://bit.ly/bj8gWF

  • 19. Arvind  |  May 27, 2010 at 5:59 pm

    Yes, there is no agreed-upon value of epsilon. That is indeed one stumbling block when implementing differential privacy.

    However, that’s a different problem from the difficulty with other definitions. Not having a magic value of epsilon does not make differential privacy any more difficult to enforce once you do pick an epsilon.
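
    A rough way to build intuition for a candidate epsilon (a back-of-the-envelope illustration, not a recommendation): epsilon bounds the factor by which any single record can change the probability of any output,

    \[
      \frac{\Pr[M(D) \in S]}{\Pr[M(D') \in S]} \;\le\; e^{\varepsilon},
      \qquad e^{0.1} \approx 1.11, \quad e^{\ln 2} = 2, \quad e^{1} \approx 2.72,
    \]

    so epsilon = 0.1 caps the change at roughly 11%, while epsilon = ln 2 already allows a factor of two.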

    It would seem that a discussion group for everyone implementing or thinking of implementing differential privacy would be valuable.



