
Data-mining Contests and the Deanonymization Dilemma: a Two-stage Process Could Be the Way Out

Anonymization, once the silver bullet of privacy protection in consumer databases, has been shown to be fundamentally inadequate by the work of many computer scientists including myself. One of the best defenses is to control the distribution of the data: strong acceptable-use agreements including prohibition of deanonymization and limits on data retention.

These measures work well when outsourcing data to another company or a small set of entities. But what about scientific research and data mining contests involving personal data? Prizes are big and only getting bigger, and by their very nature involve wide data dissemination. Are legal restrictions meaningful or enforceable in this context?

I believe that having participants sign and fax a data-use agreement is much better from the privacy perspective than being able to download the data with a couple of clicks. However, I am sympathetic to the argument I hear from contest organizers that every extra step will result in a big drop-off in the participation rate. Basic human psychology suggests that instant gratification is crucial.

That is a dilemma. But the more I think about it, the more I feel that a two-stage process could be a way to get the best of both worlds. Here's how it would work.

For the first stage, the current minimally intrusive process is retained, but the contestants don’t get to download the full data. Instead, there are two possibilities.

  • Release data on only a subset of users, minimizing the quantitative risk. [1]
  • Release a synthetic dataset created to mimic the characteristics of the real data. [2]

For the second stage, there are various possibilities, not mutually exclusive:

  • Require contestants to sign a data-use agreement.
  • Restrict the contest to a shortlist of best performers from the first stage.
  • Switch to an “online computation model” where participants upload code to the server (or make database queries over the network) and obtain results, rather than download data. One company recently announced a contest that conformed to this structure: a synthetic data release followed by a semi-final and a final round in which selected contestants upload code to be evaluated against real data. The reason for this structure appears to be partly privacy and partly the fact that the organizers are trying to improve the performance of their live system, and performance needs to be judged in terms of impact on real users.
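
A minimal sketch of the online computation model makes the logging point concrete. The class and method names here are illustrative, not taken from any real contest platform: contestants never see the raw data; they submit queries, get back aggregates, and every query leaves an auditable trail.

```python
import time

class QueryServer:
    """Sketch of an online computation model: contestants never download
    the data; they submit queries and receive aggregate results, and every
    query is logged so it can be audited later."""

    def __init__(self, private_data):
        self._data = private_data  # stays on the server
        self.audit_log = []        # the deterrent: a reviewable query trail

    def count(self, predicate, submitter):
        """Counting query: how many records satisfy `predicate`?"""
        self.audit_log.append(
            (time.time(), submitter, getattr(predicate, "__name__", "<fn>"))
        )
        return sum(1 for row in self._data if predicate(row))

server = QueryServer([{"age": 34}, {"age": 51}, {"age": 29}])

def over_40(row):
    return row["age"] > 40

print(server.count(over_40, "team_alpha"))  # 1
print(len(server.audit_log))                # 1
```

A real deployment would add authentication, rate limits, and restrictions on which query shapes are allowed, but the log itself is what makes malicious querying risky for a contestant.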

In the long run, I really hope that an online model will take root. The privacy benefits are significant: high-tech machinery like differential privacy works better in this setting. Even if such techniques are not employed, and contestants could in theory extract all the data by issuing malicious queries, the fact that queries are logged and might be audited should serve as a strong deterrent against such mischief.
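
To illustrate why differential privacy fits the query setting, here is a sketch of the standard Laplace mechanism applied to a counting query. This is an instructional sketch, not a vetted implementation, and the function name and data are mine:

```python
import math
import random

def dp_count(data, predicate, epsilon, rng=None):
    """Counting query answered with the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon yields epsilon-differential privacy for a single query.
    """
    rng = rng or random.Random()
    true_count = sum(1 for row in data if predicate(row))
    u = rng.random() - 0.5
    if u == -0.5:  # avoid log(0) on the measure-zero edge case
        u = 0.0
    # Inverse-CDF sample from Laplace(0, 1/epsilon)
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

data = [{"age": a} for a in (34, 51, 29, 62, 45)]
noisy = dp_count(data, lambda r: r["age"] > 40, epsilon=1.0)
# `noisy` is the true count (3) plus Laplace(0, 1) noise
```

The server answers every query this way, and the total privacy loss across queries is tracked via epsilon: something that is only possible when queries pass through the server rather than running on downloaded data.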

The advantages of the online model go beyond privacy. For example, I served on the Heritage Health Prize advisory board, and we discussed mandating a limit on the amount of computation that contestants were allowed. The motivation was to rule out algorithms that needed so much hardware firepower that they couldn’t be deployed in practice, but the stipulation had to be rejected as unenforceable. In an online model, enforcement would not be a problem. Another potential benefit is the possibility of collaboration between contestants at the code level, almost like an open-source project.

[1] Obtaining informed consent from the subset whose data is made publicly available would essentially eliminate the privacy risk, but the caveat is the possibility of selection bias.

[2] Creating a synthetic dataset from a real one without leaking individual data points and at the same time retaining the essential characteristics of the data is a serious technical challenge, and whether or not it is feasible will depend on the nature of the specific dataset.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

June 14, 2011 at 6:54 pm

The Surprising Effectiveness of Prizes as Catalysts of Innovation

Although the strategic use of prizes to foster sci-tech innovation has a long history, it has exploded in the last two decades—35% annual growth on average, or doubling every 2.3 years.[1] Much has been said on the topic, but I have yet to see a clear answer to the core mystery:
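
The doubling-time figure is just arithmetic on the growth rate:

```python
import math

# 35% average annual growth implies a doubling time of ln(2)/ln(1.35) years
doubling_time = math.log(2) / math.log(1.35)
print(round(doubling_time, 1))  # 2.3
```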

Why do prizes work?

Specifically, why are they more effective than simply hiring people to do the work? The question is more complex than it sounds, and a valid explanation must address the following:

  • Why shouldn’t government and industry research funding be switched over entirely to a prize-based model?
  • Why did the prize revolution happen in the last two decades, and not earlier?
  • How do prizes succeed in spite of the massive duplication of effort that you’d expect due to numerous contestants trying to solve the same problem?

Prizes exploit the productivity-reward imbalance

In many fields there is a huge disparity, an order of magnitude or more, between the productivity of the top performers and the median performers. The structure of the corporation, having co-evolved with the industrial revolution to harness workers building railroads or producing textiles, is fundamentally limited in its ability to reward employees in creative endeavors in proportion to their contribution, or even to measure that contribution. Academia is a little better due to the precedence of fame over monetary reward, but has its own problems.

Enter prizes. The winner-take-all structure gives individuals or small organizations of exceptional caliber a chance to earn prestige as well as cash that they don’t otherwise have a shot at.

Given that the best innovators are more likely to feel that an academic or corporate job under-rewards them, self-selected prize contestants are likely to skew toward high performers.

Prizes channel existing research funding

The Netflix prize attracted 34,000 contestants. At an average of just 1 hour (valued at $100) per contestant, the monetary value of the time spent on the contest dwarfs the prize amount. And the majority of contestants, or at least the ones with a serious chance of winning, were already employed as researchers. This effect is broadly true: for example, contestants spent a total of over $100 million in pursuit of the Ansari X Prize, which carried a $10 million award.
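
To make the comparison concrete: the post states only the contestant count and the $100/hour valuation; the $1 million Netflix Grand Prize amount is my addition.

```python
contestants = 34_000
hours_each = 1                 # the deliberately conservative estimate above
value_per_hour = 100           # dollars
netflix_prize = 1_000_000      # dollars; the Netflix Grand Prize was $1M

time_value = contestants * hours_each * value_per_hour
print(time_value)                  # 3400000
print(time_value / netflix_prize)  # 3.4

# Ansari X Prize: over $100M spent chasing a $10M award
print(100_000_000 / 10_000_000)    # 10.0
```

So even at one hour per contestant, the time invested was worth over three times the Netflix purse, and the X Prize multiple was roughly ten.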

The real funding for prize-winning efforts comes from government grants and corporate research labs. The prize itself serves mainly to legitimize the task as a research goal.

This is in no way meant to be a criticism of prizes—sure, prizes direct attention away from other problems, but one expects that on average, problems for which prizes are offered are more important than others.

Nor does the ability of prizes to spur effort far in excess of the monetary award necessarily mean that contestant behavior is irrational, since the prestige and media attention are typically worth far more than the cash, and because failure to win the prize doesn’t mean the effort is wasted.

That said, the well-known human tendency to systematically overestimate one’s own abilities certainly has a role in explaining the power of prizes to attract talent. According to the same McKinsey report cited in footnote [1], “many of the participants that we interviewed were absolutely convinced they were going to win [the Ansari X Prize], if not this year, then surely the next.”

What about democratization?

The openness of prizes is often advanced as a key reason for their superiority over traditional research funding. There are two very different components to this assertion: the first is that prizes encourage hybridization of expertise from different fields, given that researchers often fall into the trap of collaborating only within their own communities. There is evidence for this from a study of InnoCentive.

The second argument is that prizes allow even non-expert members of the general public, who might otherwise never be involved in research, to participate. I find this argument unconvincing, and there is little evidence to support it, if you ignore anecdotes from the 19th century when science funding was meager by today’s standards. However, crowdsourcing to the public seems a good strategy for prizes that are more about problem solving than original research; at least one recent contest may prove a good example, depending on how it pans out.

The Internet as an enabler

Now let’s look at the three auxiliary questions I posed above. My explanation of prize effectiveness (self-selection, redirection of funding, and interdisciplinary collaboration) answers them comfortably. If all research funding were switched to a prize-based model, it would defeat the purpose, since prizes work by redirecting existing research funding.

The rapid growth of the sector since 1990 is an obvious indication that the Internet had something to do with it. But how exactly? I think there are several reasons. First, the Internet makes it easier for experts in different physical locations or areas of expertise to team up and collaborate.

Second, increased reach, shorter cycles and improved economies of scale in most markets in the Internet era have exacerbated the performance-reward imbalance, as well as making the imbalance more obvious to all involved. This is a factor fueling the startup revolution as well.

Finally, and perhaps crucially, I believe the Internet has largely nullified one of the key disadvantages of prizes, which is duplication of effort. The Netflix prize, for one, was marked by a remarkable degree of sharing, and sponsors of new contests are increasingly tweaking the process to ensure that teams build on each other’s ideas.

These factors are only going to accelerate in the future, which suggests that the torrid growth of prizes in number and amount is going to continue for some time to come. There are now many companies dedicated to running these contests: InnoCentive is the leader, and Kaggle is a startup focused on the data-mining space. Exciting times.

[1] My numbers are based on this McKinsey report which seems by far the most comprehensive study of prizes and is well worth reading for anyone interested in the subject. The aggregate purse of prizes over $100,000 grew from $50MM to $302MM from 1991 to 2008, during which period the share of “inducement prizes,” the kind we’re concerned with here, showed remarkable growth from 3% of the total to 78%.
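
A quick consistency check using the footnote’s own numbers. The inference that the 35%/2.3-year figures refer to the inducement-prize purse specifically is mine, not stated in the excerpt:

```python
inducement_1991 = 0.03 * 50     # $MM: 3% of the $50MM 1991 purse
inducement_2008 = 0.78 * 302    # $MM: 78% of the $302MM 2008 purse
years = 2008 - 1991             # 17

growth = (inducement_2008 / inducement_1991) ** (1 / years) - 1
print(round(growth * 100))  # 35
```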

Thanks to Steve Lohr for pointers to research when he interviewed me for his NYTimes Bits piece, and to @dan_munz and other Twitter followers for useful suggestions.


June 6, 2011 at 3:03 pm


I’m an associate professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
