
One more re-identification demonstration, and then I’m out

What should we do about re-identification? Back when I started this blog in grad school seven years ago, I subtitled it “The end of anonymous data and what to do about it,” anticipating that I’d work on re-identification demonstrations as well as technical and policy solutions. As it turns out, I’ve looked at the former much more often than the latter. That said, my recent paper A Precautionary Approach to Big Data Privacy with Joanna Huey and Ed Felten tackles the “what to do about it” question head-on. We present a comprehensive set of recommendations for policy makers and practitioners.

One more re-identification demonstration, and then I’m out. Overall, I’ve moved on in terms of my research interests to other topics like web privacy and cryptocurrencies. That said, there’s one fairly significant re-identification demonstration I hope to do some time this year. This is something I started in grad school, obtained encouraging preliminary results on, and then put on the back burner. Stay tuned.

Machine learning and re-identification. I’ve argued that the algorithms used in re-identification turn up everywhere in computer science. I’m still interested in these algorithms from this broader perspective. My recent collaboration on de-anonymizing programmers using coding style is a good example. It uses more sophisticated machine learning than most of my earlier work on re-identification, and the potential impact is more in forensics than in privacy.

Privacy and ethical issues in big data. There’s a new set of thorny challenges in big data — privacy-violating inferences, fairness of machine learning, and ethics in general. I’m collaborating with technology ethics scholar Solon Barocas on these topics. Here’s an abstract we wrote recently, just to give you a flavor of what we’re doing:

How to do machine learning ethically

Every now and then, a story about inference goes viral. You may remember the one about Target advertising to customers who were determined to be pregnant based on their shopping patterns. The public reacts with deep discomfort at the power of inference and calls it a violation of privacy. On the other hand, the company in question protests that there was no wrongdoing — after all, they had only collected innocuous information on customers’ purchases and hadn’t revealed that data to anyone else.

This common pattern reveals a deep disconnect between what people seem to care about when they cry privacy foul and the way the protection of privacy is currently operationalized. The idea that companies shouldn’t make inferences based on data they’ve legally and ethically collected might be disturbing and confusing to a data scientist.

And yet, we argue that doing machine learning ethically means accepting and adhering to boundaries on what’s OK to infer or predict about people, as well as how learning algorithms should be designed. We outline several categories of inference that run afoul of privacy norms. Finally, we explain why ethical considerations sometimes need to be built in at the algorithmic level, rather than being left to whoever is deploying the system. While we identify a number of technical challenges that we don’t quite know how to solve yet, we also provide some guidance that will help practitioners avoid these hazards.


March 23, 2015 at 8:20 am

Good and bad reasons for anonymizing data

Ed Felten and I recently wrote a response to a poorly reasoned defense of data anonymization. This doesn’t mean, however, that there’s never a place for anonymization. Here’s my personal view on some good and bad reasons for anonymizing data before sharing it.

Good: We’re using anonymization to keep honest people honest. We’re only providing the data to insiders (employees) or semi-insiders (research collaborators), and we want to help them resist the temptation to peep.

Probably good: We’re sharing data only with a limited set of partners. These partners have a reputation to protect; they have also signed legal agreements that specify acceptable uses, retention periods, and audits.

Possibly good: We de-identified the data at a big cost in utility — for example, by making high-dimensional data low-dimensional via “vertical partitioning” — but it still enables some useful data analysis. (There are significant unexplored research questions here, and technically sound privacy guarantees may be possible.)

Reasonable: The data needed to be released no matter what; techniques like differential privacy didn’t produce useful results on our dataset. We released de-identified data and decided to hope for the best.

Reasonable: The auxiliary data needed for de-anonymization doesn’t currently exist publicly and/or on a large scale. We’re acting on the assumption that it won’t materialize in a relevant time-frame and are willing to accept the risk that we’re wrong.

Ethically dubious: The privacy harm to individuals is outweighed by the greater good to society. Related: de-anonymization is not as bad as many other privacy risks that consumers face.

Sometimes plausible: The marginal benefit of de-anonymization (compared to simply using the auxiliary dataset for marketing or whatever purpose) is so low that even the small cost of skilled effort is a sufficient deterrent. Adversaries will prefer other means of acquiring equivalent data — through purchase, if they are lawful, or hacking, if they’re not.[*]

Bad: Since there aren’t many reports of de-anonymization except research demonstrations, it’s safe to assume it isn’t happening.

It’s surprising how often this argument is advanced considering that it’s a complete non sequitur: malfeasors who de-anonymize are obviously not going to brag about it. The next argument is a self-interested version that takes this fact into account.

Dangerously rational: There won’t be a PR fallout from releasing anonymized data because researchers no longer have the incentive for de-anonymization demonstrations, whereas if malfeasors do it they won’t publicize it (elaborated here).

Bad: The expertise needed for de-anonymization is such a rare skill that it’s not a serious threat (addressed here).

Bad: We simulated some attacks and estimated that only 1% of records are at risk of being de-anonymized. (Completely unscientific; addressed here.)

Qualitative risk assessment is valuable; quantitative methods can be a useful heuristic to compare different choices of anonymization parameters if one has already decided to release anonymized data for other reasons, but can’t be used as a justification of the decision.

[*] This is my restatement of one of Yakowitz’s arguments in Tragedy of the Data Commons.


July 9, 2014 at 8:05 am

Personalized coupons as a vehicle for perfect price discrimination

Given the pervasive tracking and profiling of our shopping and browsing habits, one would expect that retailers would be very good at individualized price discrimination —  figuring out what you or I would be willing to pay for an item using data mining, and tailoring prices accordingly. But this doesn’t seem to be happening. Why not?

This mystery isn’t new. Mathematician Andrew Odlyzko predicted a decade ago that data-driven price discrimination would become much more common and effective (paper, interview). Back then, he was far ahead of his time. But today, behavioral advertising at least has gotten good enough that it’s often creepy. The technology works; the impediment to price discrimination lies elsewhere. [1]

It looks like consumers’ perception of unfairness of price discrimination is surprisingly strong, which is why firms balk at overt price discrimination, even though covert price discrimination is all too common. But the covert form of price discrimination is not only less efficient, it also (ironically) has significant social costs — see #3 below for an example. Is there a form of pricing that allows for perfect discrimination (i.e., complete tailoring to individuals), in a way that consumers find acceptable? That would be the holy grail.

In this post, I will argue that the humble coupon, reborn in a high-tech form, could be the solution. Here’s why.

1. Coupons tap into shopper psychology. Customers love them.

Coupons, like sales, introduce unpredictability and rewards into shopping, which provides a tiny dopamine spike that gets us hooked. JC Penney’s recent misadventure in trying to eliminate sales and coupons provides an object lesson:

“It may be a decent deal to buy that item for $5. But for someone like me, who’s always looking for a sale or a coupon — seeing that something is marked down 20 percent off, then being able to hand over the coupon to save, it just entices me. It’s a rush.”

Some startups have exploited this to the hilt, introducing “gamification” into commerce. Shopkick is a prime example. I see this as a very important trend.

2. Coupons aren’t perceived as unfair.

Given the above, shoppers have at best a dim awareness of coupons as a price discrimination mechanism. Even when they do recognize it, however, coupons aren’t perceived as unfair to nearly the same degree as listing different prices for different consumers, even though the result in either case is identical. [2]

3. Traditional coupons are not personalized.

While customers may have different reasons for liking coupons, from firms’ perspective the way in which traditional coupons aid price discrimination is pretty simple: by forcing customers to waste their time. Econ texts tend to lay it out bluntly. For example, R. Preston McAfee:

Individuals generally value their time at approximately their wages, so that people with low wages, who tend to be the most price-sensitive, also have the lowest value of time. … A thrifty shopper may be able to spend an hour sorting through the coupons in the newspaper and save $20 on a $200 shopping expedition … This is a good deal for a consumer who values time at less than $20 per hour, and a bad deal for the consumer that values time in excess of $20 per hour. Thus, relatively poor consumers choose to use coupons, which permits the seller to have a price cut that is approximately targeted at the more price-sensitive group.

Clearly, for this to be effective, coupon redemption must be deliberately made time-consuming.

To the extent that there is coupon personalization, it seems to be for changing shopper behavior (e.g., getting them to try out a new product) rather than a pricing mechanism. The NYT story from last year about Target targeting pregnant women falls into this category. That said, these different forms of personalization aren’t entirely distinct, which is a point I will return to in a later article.

4. The traditional model doesn’t work well any more.

Paper coupons have a limited future. As for digital coupons, there is a natural progression toward interfaces that make it easier to acquire and redeem them. In particular, as more shoppers start to pay using their phones in stores, I anticipate coupon redemption being integrated into payment apps, thus becoming almost frictionless.

An interesting side-effect of smartphone-based coupon redemption is that it gives the shopper more privacy, avoiding the awkwardness of pulling out coupons from a purse or wallet. This will further open up coupons to a wealthier demographic, making them even less effective at discriminating between wealthier shoppers and less affluent ones.

5. The coupon is being reborn in a data-driven, personalized form.

With behavioral profiling, companies can determine how much a consumer will pay for a product, and deliver coupons selectively so that each customer’s discount reflects what they are willing to pay. The key difference is that while in the past, customers decided whether or not to look for, collect, and use a coupon, in the new model companies will determine who gets which coupons.

In the extreme, coupons will be available for all purchases, and smart shopping software on our phones or browsers will automatically search, aggregate, manage, and redeem these coupons, showing coupon-adjusted prices when browsing for products. More realistically, the process won’t be completely frictionless, since that would lose the psychological benefit. Coupons will probably also merge with “rewards,” “points,” discounts, and various other incentives.
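
To make the mechanism concrete, here is a deliberately simplified sketch of how such a coupon engine might pick discounts. Everything in it is hypothetical: the willingness-to-pay estimate is assumed to come from some profiling model that isn't shown, and the names and numbers are invented for illustration.

```python
# Hypothetical sketch of a coupon engine: the list price stays uniform and
# overt, while a personalized discount quietly moves the effective price
# toward a predicted willingness to pay (WTP). The WTP estimate is assumed
# to come from a profiling model not shown here; all numbers are made up.
from dataclasses import dataclass

@dataclass
class Offer:
    list_price: float       # the overt price everyone sees
    coupon_discount: float  # the covert, personalized part (a fraction)

def personalized_coupon(list_price, predicted_wtp, marginal_cost, step=0.05):
    """Pick the smallest discount (in 5% steps) that brings the effective
    price down to the predicted WTP without selling below marginal cost."""
    discount = 0.0
    while (list_price * (1 - discount) > predicted_wtp
           and list_price * (1 - discount - step) >= marginal_cost):
        discount += step
    return Offer(list_price, round(discount, 2))

# Same $100 list price, three shoppers, three effective prices.
for wtp in (95.0, 80.0, 60.0):
    offer = personalized_coupon(100.0, wtp, marginal_cost=40.0)
    effective = offer.list_price * (1 - offer.coupon_discount)
    print(f"WTP ${wtp:.0f} -> {offer.coupon_discount:.0%} coupon, pays ${effective:.0f}")
```

The point of the sketch is simply that the list price stays uniform and overt while the coupon does the discriminating.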

There have been rumblings of this shift here and there for a few years now, and it seems to be happening gradually. Google’s acquisition of Incentive Targeting a few months ago seems significant, and at the very least demonstrates that tech companies, and not just retailers, are eyeing this space. As digital feudalism takes root, it could accelerate the trend of individualized shopping experiences.

In summary, personalized coupons offer a vehicle for realizing the full potential of data mining for commerce by tailoring prices in a way that consumers seem to find acceptable. Neither coupons nor price discrimination should be viewed in isolation — together with rewards and various other incentive schemes, they are part of the trend of individualized, data mining-driven commerce that’s here to stay.

Footnotes

[1] Since I’m eschewing some academic terminology in this post, here are a few references and points of clarification. My interest is in first-degree price discrimination. Any price discrimination requires market power; my assumption is that this is the case in practice because competition is always imperfect, so we should expect quite a bit of first-degree price discrimination. The observed level is puzzlingly low.

The impact of technology on the ability to personalize prices is complex, and behavioral profiling is only one aspect. Technology also makes competition less perfect by allowing firms to customize products to a greater degree, so that there are no exact substitutes. Finally, technology hinders first-degree price discrimination to an extent by allowing consumers to compare prices between different retailers more easily. The interaction between these effects is analyzed in this paper.

Technology also increases the incentive to price discriminate. As production becomes more and more automated, marginal costs drop relative to fixed costs. In the extreme, digital goods have essentially zero marginal cost. When marginal production costs are low, firms will try to tailor prices since any sale above marginal cost increases profits.

My use of the terms overt and covert is rooted in the theory of price fairness in psychology and behavioral economics, and relates to the presentation of the transaction. While it is somewhat related to first- vs. second/third-degree price discrimination, it is better understood as a separate axis, one that is not captured by theories of rational firms and consumers.

[2] An exception is when non-coupon customers are made aware that others are getting a better deal. This happens, for example, when there is a prominent coupon-code form field in an online shopping checkout flow. See here for a study.

Thanks to Sebastian Gold for reviewing a draft, and to Justin Brickell for interesting conversations that led me to this line of thinking.


June 25, 2013 at 7:09 am

Reidentification as Basic Science

This essay originally appeared on the Bill of Health blog as part of a conversation on the law, ethics and science of reidentification demonstrations.

What really drives reidentification researchers? Do we publish these demonstrations to alert individuals to privacy risks? To shame companies? For personal glory? If our goal is to improve privacy, are we doing it in the best way possible?

In this post I’d like to discuss my own motivations as a reidentification researcher, without speaking for anyone else. Certainly I care about improving privacy outcomes, in the sense of making sure that companies, governments and others don’t get away with mathematically unsound promises about the privacy of consumers’ data. But there is a quite different goal I care about at least as much: reidentification algorithms. These algorithms are my primary object of study, and so I see reidentification research partly as basic science.

Let me elaborate on why reidentification algorithms are interesting and important. First, they yield fundamental insights about people — our interests, preferences, behavior, and connections — as reflected in the datasets collected about us. Second, as is the case with most basic science, these algorithms turn out to have a variety of applications other than reidentification, both for good and bad. Let us consider some of these.

First and foremost, reidentification algorithms are directly applicable in digital forensics and intelligence. Analyzing the structure of a terrorist network (say, based on surveillance of movement patterns and meetings) to assign identities to nodes is technically very similar to social network deanonymization. A reidentification researcher that I know who is a U.S. citizen tells me he has been contacted more than once by intelligence agencies to apply his expertise to their data.

Homer et al.’s work on identifying individuals in DNA mixtures is another great example of how forensics algorithms are inextricably linked to privacy-infringing applications. In addition to DNA and network structure, writing style and location trails are other attributes that have been utilized both in reidentification and forensics.

It is not a coincidence that the reidentification literature often uses the word “fingerprint” — this body of work has generalized the notion of a fingerprint beyond physical attributes to a variety of other characteristics. Just like physical fingerprints, there are good uses and bad, but regardless, finding generalized fingerprints is a contribution to human knowledge. A fundamental question is how much information (i.e., uniqueness) there is in each of these types of attributes or characteristics. Reidentification research is gradually helping answer this question, but much remains unknown.

It is not only people that are fingerprintable — so are various physical devices. A wonderful set of (unrelated) research papers has shown that many types of devices, objects, and software systems, even supposedly identical ones, have unique fingerprints: blank paper, digital cameras, RFID tags, scanners and printers, and web browsers, among others. The techniques are similar to reidentification algorithms, and once again straddle security-enhancing and privacy-infringing applications.

Even more generally, reidentification algorithms are classification algorithms for the case when the number of classes is very large. Classification algorithms categorize observed data into one of several classes, i.e., categories. They are at the core of machine learning, but typical machine-learning applications rarely need to consider more than several hundred classes. Thus, reidentification science is helping develop our knowledge of how best to extend classification algorithms as the number of classes increases.

Moving on, research on reidentification and other types of “leakage” of information reveals a problem with the way data-mining contests are run. Most commonly, some elements of a dataset are withheld, and contest participants are required to predict these unknown values. Reidentification allows contestants to bypass the prediction process altogether by simply “looking up” the true values in the original data! For an example and more elaborate explanation, see this post on how my collaborators and I won the Kaggle social network challenge. Demonstrations of information leakage have spurred research on how to design contests without such flaws.

If reidentification can cause leakage and make things messy, it can also clean things up. In a general form, reidentification is about connecting common entities across two different databases. Quite often in real-world datasets there is no unique identifier, or it is missing or erroneous. Just about every programmer who does interesting things with data has dealt with this problem at some point. In the research world, William Winkler of the U.S. Census Bureau has authored a survey of “record linkage”, covering well over a hundred papers. I’m not saying that the high-powered machinery of reidentification is necessary here, but the principles are certainly useful.

In my brief life as an entrepreneur, I utilized just such an algorithm for the back-end of the web application that my co-founders and I built. The task in question was to link a (musical) artist profile from last.fm to the corresponding Wikipedia article based on discography information (linking by name alone fails in any number of interesting ways.) On another occasion, for the theory of computing blog aggregator that I run, I wrote code to link authors of papers uploaded to arXiv to their DBLP profiles based on the list of coauthors.
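
For readers who haven't bumped into this problem, here is a minimal sketch of the basic principle: match entities across two sources by similarity of an associated attribute set, in the spirit of the coauthor-based linking just described. This is an illustration written for this post, not the actual code behind either project, and the toy data and threshold are invented.

```python
# Illustrative sketch of record linkage without a shared identifier: match
# entities across two sources by Jaccard similarity of an attribute set
# (coauthor lists, discographies, etc.). Toy data and threshold are invented.

def jaccard(a, b):
    """Set overlap normalized by union; 1.0 means identical sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def link(source, target, threshold=0.3):
    """For each source entity, return the best-matching target entity,
    or None if no candidate clears the similarity threshold."""
    links = {}
    for s_id, s_attrs in source.items():
        best_id, best_score = None, threshold
        for t_id, t_attrs in target.items():
            score = jaccard(s_attrs, t_attrs)
            if score > best_score:
                best_id, best_score = t_id, score
        links[s_id] = best_id
    return links

# Toy example: arXiv authors vs. DBLP profiles, keyed by coauthor sets.
arxiv = {"A. Smith": {"B. Jones", "C. Lee", "D. Patel"}}
dblp = {"Alice Smith": {"B. Jones", "C. Lee", "E. Wong"},
        "Andrew Smith": {"F. Chen"}}
print(link(arxiv, dblp))  # {'A. Smith': 'Alice Smith'}
```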

There is more, but I’ll stop here. The point is that these algorithms are everywhere.

If the algorithms are the key, why perform demonstrations of privacy failures? To put it simply, algorithms can’t be studied in a vacuum; we need concrete cases to test how well they work. But it’s more complicated than that. First, as I mentioned earlier, keeping the privacy conversation intellectually honest is one of my motivations, and these demonstrations help. Second, in the majority of cases, my collaborators and I have chosen to examine pairs of datasets that were already public, and so our work did not uncover the identities of previously anonymous subjects, but merely helped to establish that this could happen in other instances of “anonymized” data sharing.

Third, and I consider this quite unfortunate, reidentification results are taken much more seriously if researchers do uncover identities, which naturally gives us an incentive to do so. I’ve seen this in my own work — the Netflix paper is the most straightforward and arguably the least scientifically interesting reidentification result that I’ve co-authored, and yet it received by far the most attention, all because it was carried out on an actual dataset published by a company rather than demonstrated hypothetically.

My primary focus on the fundamental research aspect of reidentification guides my work in an important way. There are many, many potential targets for reidentification — despite all the research, data holders often (rationally) act like nothing has changed and continue to make data releases with “PII” removed. So which dataset should I pick to work on?

Focusing on the algorithms makes it a lot easier. One of my criteria for picking a reidentification question to work on is that it must lead to a new algorithm. I’m not at all saying that all reidentification researchers should do this, but for me it’s a good way to maximize the impact I can hope for from my research, while minimizing controversies about the privacy of the subjects in the datasets I study.

I hope this post has given you some insight into my goals, motivations, and research outputs, and an appreciation of the fact that there is more to reidentification algorithms than their application to breaching privacy. It will be useful to keep this fact in the back of our minds as we continue the conversation on the ethics of reidentification.

Thanks to Vitaly Shmatikov for reviewing a draft.


May 27, 2013 at 6:16 am

Price Discrimination and the Illusion of Fairness

In my previous article I pointed out that online price discrimination is suspiciously absent in directly observable form, even though covert price discrimination is everywhere. Now let’s talk about why that might be.

By “covert” I don’t mean that the firm is trying to keep price discrimination a secret. Rather, I mean that the differential treatment isn’t made explicit — for example, it isn’t based directly on a customer attribute — thereby avoiding the perception of unfairness or discrimination. A common example is selective distribution of coupons instead of listing different prices. Such discounting may be publicized, but it is still covert.

The perception of fairness

The perception of fairness or unfairness, then, is at the heart of what’s going on. Going back to the WSJ piece, I found it interesting to see the reaction of the customer to whom Staples quoted $1.50 more for a stapler based on her ZIP code: “How can they get away with that?” she asks. To which my initial reaction was, “Get away with what, exactly? Supply and demand? Econ 101?”

Even though some of us might not feel the same outrage, I think all of us share at least a vague sense of unease about overt price discrimination. So I decided to dig deeper into the literature in psychology, marketing, and behavioral economics on the topic of price fairness and understand where this perception comes from. What I found surprised me.

First, the fairness heuristic is quite elaborate and complex. In a vast literature spanning several decades, early work such as the “principle of dual entitlement” by Kahneman and coauthors established some basics. Quoting Anderson and Simester: “This theory argues that customers have perceived fairness levels for both firm profits and retail prices. Although firms are entitled to earn a fair profit, customers are also entitled to a fair price. Deviations from a fair price can be justified only by the firm’s need to maintain a fair profit. According to this argument, it is fair for retailers to raise the price of snow shovels if the wholesale price increases, but it is not fair to do so if a snowstorm leads to excess demand.”

Much later work has added to and refined that model. A particularly impressive and highly cited 2004 paper reviews the literature and proposes an elaborate framework with four different classes of inputs to explain how people decide whether pricing is fair or unfair in various situations. Some of the findings are quite surprising. For example: in the case of differential pricing to the buyer’s disadvantage, “trust in the seller has a U-shaped effect on price fairness perceptions.”

The illusion of fairness

Sounds like we have a well-honed and sophisticated decision procedure, then? Quite the opposite, actually. The fairness heuristic seems to be rather fragile, even if complex.

Let’s start with an example. Andrew Odlyzko, in his brilliant essay on price discrimination — all the more for the fact that it was published back in 2003 [1] — has this to say about Coca Cola’s ill-fated plans for price-adjusting vending machines: “In retrospect, Coca Cola’s main problem was that news coverage always referred to its work as leading to vending machines that would raise prices in warm weather. Had it managed to control publicity and present its work as leading to machines that would lower prices in cold weather, it might have avoided the entire controversy.”

We know how to explain the public’s reaction to the Coca Cola announcement using behavioral economics — given the way the plan was presented (or framed), customers take the lower price as the “reference price,” so the increase seems unfair, whereas Odlyzko’s suggested framing would anchor the higher price as the reference price. Of course, just because we can explain how the fairness heuristic works doesn’t make it logical or consistent, let alone properly grounded in social justice.

More generally, every aspect of our mental price fairness assessment heuristic seems similarly vulnerable to hijacking by tweaking the presentation of the transaction without changing the essence of price discrimination. Companies have of course gotten wise to this; there’s even academic literature on it. One of the techniques proposed in this paper is “reference group signaling” — getting a customer to change the set of other customers to whom they mentally compare themselves. [2]

The perception of fairness, then, can be more properly called the illusion of fairness.

The fragility of the fairness heuristic becomes less surprising considering that we apparently share it with other primates. This hilarious clip from a TED talk shows a capuchin monkey reacting poorly, to put it mildly, to differential treatment in a monkey-commerce setting (although the jury may still be out on the significance of this experiment). If our reaction to pricing schemes is partly or largely due to brain circuitry that evolved millions of years ago, we shouldn’t expect it to fare well when faced with the complexities of modern business.

Lose-lose

Given that the prime impediment to pervasive online price discrimination is a moral principle that is fickle and easily circumventable, one can expect companies to do exactly that, since they can reap most of the benefits of price discrimination without the negative PR. Indeed, it is my belief that more covert price discrimination is going on than is generally recognized, and that it is accelerating due to some technological developments.

This is a problem because price discrimination does raise ethical concerns, and these concerns are every bit as significant when it is covert. [3] However, since it is much less transparent, there’s less of an opportunity for public debate.

There are two directions in which I want to take this series of articles from this point: first a look at how new technology is enabling powerful forms of tailoring and covert price discrimination, and second, a discussion of what can be done to make price discrimination more transparent and how to have an informed policy discussion about its benefits and dangers.

[1] I had the pleasure of sitting next to Professor Odlyzko at a conference dinner once, and I  expressed my admiration of the prescience of his article. He replied that he’d worked it all out in his head circa 1996 but took a few years to put it down on paper. I could only stare at him wordlessly.

[2] I’m struck by the similarities between price fairness perceptions and privacy perceptions. The aforementioned 2004 price fairness framework can be seen as serving a roughly analogous function to contextual integrity, which is (in part) a theory of consumer privacy expectations. Both these theories are the result of “reverse engineering,” if you will, of the complex mental models in their respective domains using empirical behavioral evidence. Continuing the analogy, privacy expectations are also fragile, highly susceptible to framing, and liable to be exploited by companies. Acquisti and Grossklags, among others, have done some excellent empirical work on this.

[3] In fact, crude ways of making customers reveal their price sensitivity lead to a much higher social cost than overt price discrimination. I will take this up in more detail in a future post.

Thanks to Alejandro Molnar, Joseph Bonneau, Solon Barocas, and many others for insightful conversations on this topic.


January 22, 2013 at 10:24 am

New Developments in Deanonymization

This post is a roundup of developments in deanonymization in the last few months. Let’s start with two stories relating to how a malicious website can silently discover the identity of a visitor, which is an insidious type of privacy breach that I’ve written about quite a bit (1, 2, 3, 4, 5, 6).

Firefox bug exposed your identity. The first is a vulnerability resulting from a Firefox bug in the implementation of functions like exec and test. The bug allows a website to learn the URL of an embedded iframe from some other domain. How can this lead to uncovering the visitor’s identity? Because twitter.com/lists redirects to twitter.com/<username>/lists. This allows a malicious website to open a hidden iframe pointing to twitter.com/lists, query the URL after redirection, and learn the visitor’s Twitter handle (if they are logged in). [1,2]

This is very similar to a previous bug in Firefox that led to the same type of vulnerability. The URL redirect that was exploited there was google.com/profiles/me  → user-specific URL. It would be interesting to find and document all such generic-URL → user-specific-URL redirects in major websites. I have a feeling this won’t be the last time such redirection will be exploited.

Visitor deanonymization in the wild. The second story is an example of visitor deanonymization happening in the wild. It appears that the technique utilizes a tracking cookie from a third-party domain to which the visitor previously gave their email address and other information; in other words, #3 in my five-fold categorization of ways in which identity can be attached to browsing logs.

I don’t consider this instance to be particularly significant — I’m sure there are other implementations in the wild — and it’s not technically novel, but this is the first time as far as I know that it’s gotten significant attention from the public, even if only in tech circles. I see this as a first step in a feedback loop of changing expectations about online anonymity emboldening more sites to deanonymize visitors, thus further lowering the expectation of privacy.

Deanonymization of mobility traces. Let’s move on to the more traditional scenario of deanonymization of a dataset by combining it with an auxiliary, public dataset which has users’ identities. Srivatsa and Hicks have a new paper with demonstrations of deanonymization of mobility traces, i.e., logs of users’ locations over time. They use public social networks as auxiliary information, based on the insight that pairs of people who are friends are more likely to meet with each other physically. The deanonymization of Bluetooth contact traces of attendees of a conference based on their DBLP co-authorship graph is cute.
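
To illustrate the insight at toy scale, here is a brute-force sketch (not the authors' algorithm, which has to scale to realistic networks): it scores each possible assignment of identities to anonymous users by how many observed co-locations fall on friendship edges in the auxiliary social graph.

```python
# Toy sketch of the core insight: friends in the auxiliary social graph are
# more likely to co-occur physically, so we keep the identity assignment that
# places the most co-location pairs on friendship edges. Brute force only
# works at toy scale; the paper uses far more scalable inference.
from itertools import permutations

def best_assignment(colocations, anon_ids, social_edges, real_ids):
    """Brute-force the identity assignment that maximizes agreement between
    co-location pairs (anonymous IDs) and friendship edges (real IDs)."""
    best, best_score = None, -1
    for perm in permutations(real_ids, len(anon_ids)):
        mapping = dict(zip(anon_ids, perm))
        score = sum((mapping[a], mapping[b]) in social_edges
                    or (mapping[b], mapping[a]) in social_edges
                    for a, b in colocations)
        if score > best_score:
            best, best_score = mapping, score
    return best

anon_ids = ["u1", "u2", "u3"]
colocations = [("u1", "u2"), ("u2", "u3")]           # who met whom (anonymized)
social_edges = {("alice", "bob"), ("bob", "carol")}  # public friendship graph
print(best_assignment(colocations, anon_ids, social_edges,
                      ["alice", "bob", "carol"]))
# -> {'u1': 'alice', 'u2': 'bob', 'u3': 'carol'}
```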

This paper adds to the growing body of evidence that anonymization of location traces can be reversed, even if the data is obfuscated by introducing errors (noise).

So many datasets, so little time. Speaking of mobility traces, Jason Baldridge points me to a dataset containing mobility traces (among other things) of 5 million “anonymous” users in the Ivory Coast recently released by telecom operator Orange. A 250-word research proposal is required to get access to the data, which is much better from a privacy perspective than a 1-click download. It introduces some accountability without making it too onerous to get the data.

In general, the incentive for computer science researchers to perform practical demonstrations of deanonymization has diminished drastically. Our goal has always been to showcase new techniques and improve our understanding of what’s possible, and not to name and shame. Even if the Orange dataset were more easily downloadable, I would think that the incentive for deanonymization researchers would be low, now that the Srivatsa and Hicks paper exists and we know for sure that mobility traces can be deanonymized, even though the experiments in the paper are on a far smaller scale.

Head in the sand: rational?! I gave a talk at a privacy workshop recently taking a look back at how companies have reacted to deanonymization research. My main point was that there’s a split between the take-your-data-and-go-home approach (not releasing data because of privacy concerns) and the head-in-the-sand approach (pretending the problem doesn’t exist). Unfortunately but perhaps unsurprisingly, there has been very little willingness to take a middle ground, engaging with data privacy researchers and trying to adopt technically sophisticated solutions.

Interestingly, head-in-the-sand might be rational from companies’ point of view. On the one hand, researchers don’t have the incentive for deanonymization anymore. On the other hand, if malicious entities do it, naturally they won’t talk about it in public, so there will be no PR fallout. Regulators have not been very aggressive in investigating anonymized data releases in the absence of a public outcry, so that may be a negligible risk.

Some have questioned whether deanonymization in the wild is actually happening. I think it’s a bit silly to assume that it isn’t, given the economic incentives. Of course, I can’t prove this and probably never can. No company doing it will publicly talk about it, and the privacy harms are so indirect that tying them to a specific data release is next to impossible. I can only offer anecdotes to explain my position: I have been approached multiple times by organizations who wanted me to deanonymize a database they’d acquired, and I’ve had friends in different industries mention casually that what they do on a daily basis to combine different databases together is essentially deanonymization.

[1] For a discussion of why a social network profile is essentially equivalent to an identity, see here and the epilog here.
[2] Mozilla pulled Firefox 16 as a result and quickly fixed the bug.


December 17, 2012 at 8:59 am

Data-mining Contests and the Deanonymization Dilemma: a Two-stage Process Could Be the Way Out

Anonymization, once the silver bullet of privacy protection in consumer databases, has been shown to be fundamentally inadequate by the work of many computer scientists including myself. One of the best defenses is to control the distribution of the data: strong acceptable-use agreements including prohibition of deanonymization and limits on data retention.

These measures work well when outsourcing data to another company or a small set of entities. But what about scientific research and data mining contests involving personal data? Prizes are big and only getting bigger, and by their very nature involve wide data dissemination. Are legal restrictions meaningful or enforceable in this context?

I believe that having participants sign and fax a data-use agreement is much better from the privacy perspective than being able to download the data with a couple of clicks. However, I am sympathetic to the argument that I hear from contest organizers that every extra step will result in a big drop-off in the participation rate. Basic human psychology suggests that instant gratification is crucial.

That is a dilemma. But the more I think about it, the more I’m starting to feel that a two-step process could be a way to get the best of both worlds. Here’s how it would work.

For the first stage, the current minimally intrusive process is retained, but the contestants don’t get to download the full data. Instead, there are two possibilities.

  • Release data on only a subset of users, minimizing the quantitative risk. [1]
  • Release a synthetic dataset created to mimic the characteristics of the real data. [2]

For the second stage, there are various possibilities, not mutually exclusive:

  • Require contestants to sign a data-use agreement.
  • Restrict the contest to a shortlist of best performers from the first stage.
  • Switch to an “online computation model” where participants upload code to the server (or make database queries over the network) and obtain results, rather than download data.

Overstock.com recently announced a contest that conformed to this structure—a synthetic data release followed by a semi-final and a final round in which selected contestants upload code to be evaluated against data. The reason for this structure appears to be partly privacy and partly the fact that they are trying to improve the performance of their live system, and performance needs to be judged in terms of impact on real users.

In the long run, I really hope that an online model will take root. The privacy benefits are significant: high-tech machinery like differential privacy works better in this setting. Even if such techniques are not employed, and even though there is the theoretical possibility of contestants extracting all the data by issuing malicious queries, the fact that queries are logged and might be audited should serve as a strong deterrent against such mischief.
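
As a rough illustration of what I mean by the online model, and emphatically not a description of any existing contest infrastructure, here is a sketch of a query server that answers only aggregate queries, logs every query for auditing, and refuses queries that touch too few records. A real deployment would add authentication, rate limiting, and ideally differential privacy; the names and thresholds below are invented.

```python
# Conceptual sketch of an audited query server: contestants never download
# records; they submit aggregate queries, every query is logged for later
# auditing, and queries touching too few records are refused.
import datetime
import statistics

class AuditedQueryServer:
    def __init__(self, records, min_group_size=20, log_path="query_audit.log"):
        self.records = records                # list of dicts, held server-side
        self.min_group_size = min_group_size  # crude defense against narrow queries
        self.log_path = log_path

    def _log(self, contestant, description, n_matched):
        with open(self.log_path, "a") as f:
            stamp = datetime.datetime.utcnow().isoformat()
            f.write(f"{stamp}\t{contestant}\t{description}\t{n_matched}\n")

    def mean(self, contestant, field, predicate, description):
        """Mean of `field` over records matching `predicate`; the query is
        logged whether or not it is answered."""
        matched = [r[field] for r in self.records if predicate(r)]
        self._log(contestant, description, len(matched))
        if len(matched) < self.min_group_size:
            raise ValueError("query touches too few records; refused")
        return statistics.mean(matched)

# Hypothetical usage: average rating among users with more than 100 ratings.
# server = AuditedQueryServer(records)
# server.mean("team42", "avg_rating", lambda r: r["n_ratings"] > 100,
#             "mean avg_rating, heavy raters")
```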

The advantages of the online model go beyond privacy. For example, I served on the Heritage Health Prize advisory board, and we discussed mandating a limit on the amount of computation that contestants were allowed. The motivation was to rule out algorithms that needed so much hardware firepower that they couldn’t be deployed in practice, but the stipulation had to be rejected as unenforceable. In an online model, enforcement would not be a problem. Another potential benefit is the possibility of collaboration between contestants at the code level, almost like an open-source project.

[1] Obtaining informed consent from the subset whose data is made publicly available would essentially eliminate the privacy risk, but the caveat is the possibility of selection bias.

[2] Creating a synthetic dataset from a real one without leaking individual data points and at the same time retaining the essential characteristics of the data is a serious technical challenge, and whether or not it is feasible will depend on the nature of the specific dataset.


June 14, 2011 at 6:54 pm

Price Discrimination is All Around You

This is the first in a series of articles that will show how we’re at a turning point in the history of price discrimination and discuss the consequences. This article presents numerous examples of traditional price discrimination that you see today, many of which are funny, sad, or downright devious.

Price discrimination, more euphemistically known as differential pricing and dynamic pricing, exploits the fact that in any transaction each customer has a different “willingness to pay.”

What is “willingness to pay,” and how does the seller determine it? To illustrate, let me quote a hilarious story by Steve Blank on selling enterprise software. The protagonist is one Sandy Kurtzig.

Sandy Kurtzig

Since it was the first non-IBM enterprise software on IBM mainframes, [when] she got her first potential order, she didn’t know how to price it. It must have been back in the mid-’70s. She’s [with] this buyer, has a P.O. on his desk, negotiating pricing with Sandy.

So, Sandy said she goes into the buyer who says, “How much is it?”

And Sandy gulped and picked the biggest number she thought anybody would ever rationally pay. And said, “$75,000”. And she said all the buyer did was write down $75,000.

And she realized, shit, she left money on the table. … And she said, “Per year.”

And the buyer wrote down, “Per year.”

And she went, oh, crap what else? She said, “There’s maintenance.”

He said, “How much?”

“25 percent per year.”

And he said, “That’s too much.”

She said, “15 percent.”

And he said, “OK.”

Sadly, not all transactions are as much fun as pricing enterprise software ;-) The price usually has to be determined without meeting the buyer face to face. There are three types of price discrimination based on how the price is determined:

  1. Each buyer is charged a custom price. (Traditionally, there has never been enough data to do this.)
  2. Price depends on an attribute of the buyer such as age or gender.
  3. Different price for different categories of buyers, with the seller somehow getting the buyer to reveal which category they fall into. As we’ll see, hilarity frequently ensues.

Additionally, each buyer may be sold the same product, or it could be customized to each segment—in the extreme case, to each buyer. This is called product differentiation.

Alright. Time to dive into some examples.

1. Student discounts at movies, museums, etc. are one of the simplest types of price discrimination. Students are generally poorer and more price sensitive, so the business hopes to attract more of them by making it cheaper.

Why museums and movies, and not say grocery stores? Two reasons: first, if the grocery store tried it, they’d quickly run into the problem of resale by the group that qualifies for the lower price. (It could manifest as parents sending their kids to get groceries.) The museum doesn’t have this problem because they ask for a student ID.

Second, grocery stores set prices pretty close to their marginal cost anyway, so there’s not as much of a scope for variable pricing. With museums, on the other hand, it costs them next to nothing to admit an extra visitor. All of their costs are fixed costs.

Prevention of resale and low marginal costs relative to fixed costs are two important ingredients for price discrimination.

2. Ladies’ night at bars is another simple example of price discrimination based on an attribute (gender). Rather than women having a lower willingness to pay, it is perhaps more accurate to say that men are more desperate to get in :-)

Interestingly, this is one of the few examples whose legality is questionable. Wikipedia has a good survey. Also, it is not a “pure” example since the point of ladies’ night is not just to get more women through the door but also, indirectly, to get more men through the door.

3. A less obvious example is the variation of gas prices (and other commodities) within the same chain across locations. This is because people in richer ZIP codes are willing to pay more on average.

An important caveat: some of the variation is typically explainable by differences in marginal cost (such as rent) between different locations, but not all of it.

4. Financial aid at universities is a rather complex case of price discrimination. Instead of charging different rates to different students, the seller has a base rate and gives discounts (aid) to qualifying students.

Discounting is a frequently used form of “concealed” price discrimination.

You can see aid programs in humanitarian/political terms or in economic terms; the two paradigms are not in conflict with each other. In the economic view, students with higher scores receive aid because they have more college options and are therefore more price-sensitive. Poorer students and minorities receive aid because they are less able/willing to pay.

In the examples so far, the attribute(s) that factor into discrimination are either obvious (gender, race, location) or it is in the buyer’s interest to disclose them to the seller (student status, financial need). Now let’s look at examples where the seller has to be crafty in getting the buyer to disclose it.

5. Car prices vary greatly between market segments, far more than can be explained by differences in marginal cost. Car buyers segment themselves because owning a higher-end car is a status symbol.

Product differentiation is frequently used to get buyers to segment themselves.

The same principle applies to numerous other product categories like wine and coffee. But in those cases you’re at least getting a nominally superior product for the higher price. Let’s look at examples where buyers voluntarily pay more for the same product.

6. Dell.com used to ask customers if they were home users, small businesses, or other categories. The prices for the same products varied according to the category you declared. There was no legally binding reason to be honest about your disclosure, and no enforcement mechanism.

Now for a more devious example.

7. “Staples brazenly sends out different office supply catalogs with different prices to the same customers. The price-sensitive buyers know which to buy from. The inattentive ones pay extra.” [source]

A similar example: restaurants with long menus sometimes highlight some popular choices on the first page. The same items are available in the long-form menu for cheaper, if only you knew where they’re buried.

These examples illustrate an extremely common form of price discrimination:

Buyers who are willing to jump through hoops demonstrate their high price-sensitivity and therefore get lower prices.

This theme is so fundamental that it has been practiced for thousands of years in the form of haggling.

8. The jumping-through-hoops principle suggests that it makes economic sense for the seller to make discounts hard to get. Nowhere is this more apparent than with Black Friday deals—stand in ridiculously long lines all night to get fabulous discounts. Wealthier customers who don’t bother doing so will get much less of a discount during regular store hours, even on Black Friday.

9. More examples of hard-to-get discounts: woot.com, mailing-list deals and Southwest Airlines DING. Many of these involve artificial scarcity and time-limitations to make them more difficult to get, thus ensuring that those who take advantage are buyers who might otherwise not buy at all.

10. Perhaps the most extreme example of roping in buyers who might otherwise not buy is deliberately crippling your own product, known in economics as damaged goods.

IBM did this with its popular LaserPrinter by adding chips that slowed down the printing to about half the speed of the regular printer. The slowed printer sold for about half the price, under the IBM LaserPrinter E name.

That example and more like it are from here. And a more poignant example from railways of long ago:

It is not because of the few thousand francs which would have to be spent to put a roof over the third-class carriages or to upholster the third-class seats that some company or other has open carriages with wooden benches. What the company is trying to do is to prevent the passengers who can pay the second class fare from traveling third class; it hits the poor, not because it wants to hurt them, but to frighten the rich. And it is again for the same reason that the companies, having proved almost cruel to the third-class passengers and mean to the second-class ones, become lavish in dealing with first-class passengers. Having refused the poor what is necessary, they give the rich what is superfluous.

These examples should make clear that:

Getting buyers to reveal their willingness to pay often has significant social costs.

11. There are endless examples of clever tricks to learn the customer’s price-sensitivity in the airline industry. The price for the same seat can vary greatly depending on a variety of factors. The most well-known one is that you get lower prices if your trip spans a weekend, because it probably means you’re not a business traveler.

12. First class and business class seating on airlines is also price discrimination, but of a very different kind. Here it’s not different prices for the same product but different prices for slightly different products. Buyers segment themselves due to product differentiation, a phenomenon we’ve seen before with cars.

The first class/economy price spread can often be as high as 10x, which illustrates the wide range of customers’ willingness to pay. For a variety of reasons, most other markets haven’t managed to attain such a high price spread.

The “holy grail” of price discrimination is to achieve dramatically higher price spreads in most markets.

Aaaaaand we’re done with the examples!

Note that this is far from a complete list—I haven’t covered clearance sales, loyalty programs and frequent flyer miles, hi-lo pricing, drug prices that vary by country, and so forth, but I hope I’ve convinced you that price discrimination in some form already happens in nearly every market.

But here’s the kicker: I’ve deliberately left out what I consider the most important class of examples, because I’m going to devote a whole article to it. I will argue that this emerging form of price discrimination is going to explode in popularity and dwarf anything we’ve seen so far. Feel free to guess what I’m thinking about in the comments, and stay tuned!

Many thanks to Justin Brickell, Alejandro Molnar and Adam Bossy for useful discussions and comments. Thanks also to my Twitter followers for putting up with my ‘tweetathon’ on this topic two months ago and providing feedback.


June 2, 2011 at 2:48 pm

The Linkability of Usernames: a Step Towards “Uber-Profiles”

Daniele Perito, Claude Castelluccia, Mohamed Ali Kaafar, and Pere Manils have a neat paper, How Unique and Traceable are Usernames?, that addresses the following question:

Suppose you find the same username on different online services, what is the probability that these usernames refer to the same physical person?

The background for this investigation is that there is tremendous commercial value in linking together every piece of online information about an individual. While the academic study of constructing “uber-profiles” by linking social profiles is new (see Large Online Social Footprints—An Emerging Threat for another example), commercial firms have long been scraping profiles, aggregating them, and selling them on the grey market. Well-known public-facing aggregators such as Spokeo mainly use public records, but online profiles are quickly becoming part of the game.

Paul Ohm has even talked of a “database of ruin.” No matter what moral view one takes of this aggregation, the technical questions are fascinating.

The research on Record Linkage could fill an encyclopedia (see here for a survey) but most of it studies traditional data types such as names and addresses. This paper is thus a nice complement.

Usernames are particularly useful for carrying out linkage across different sites for two reasons:

  • They are almost always available, especially on systems with pseudonymous accounts.
  • When comparing two databases of profiles, usernames are a good way to quickly find candidate matches before exploring other attributes.

The mathematical heavy-lifting that the authors do is described by the following:

… we devise an analytical model to estimate the uniqueness of a user name, which can in turn be used to assign a probability that a single username, from two different online services, refers to the same user

and

we extend this model to cases when usernames are different across many online services … experimental data shows that users tend to choose closely related usernames on different services.

For example, my Google handle is ‘randomwalker’ and my Twitter username is ‘random_walker’. Perito et al.’s model can calculate how obscure the username ‘random_walker’ is, as well as how likely it is that ‘random_walker’ is a mutation of ‘randomwalker’, and come up with a combined score representing the probability that the two accounts refer to the same person. Impressive.

The authors also present experimental results. For example, they find that with a sample of 20,000 usernames drawn from a real dataset, their algorithms can find the right match about 60% of the time with a negligible error rate (i.e., 40% of the time it doesn’t produce a match, but it almost never errs.) That said, I find the main strength of the paper to be in the techniques more than the numbers.

Their models know all about the underlying natural language patterns, such as the fact that ‘random_walker’ is more meaningful than say ‘rand_omwalker’. This is achieved using what are called Markov models. I really like this class of techniques; I used Markov models many years ago in my paper on password cracking with Vitaly Shmatikov to model how people pick passwords.
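
To give a flavor of how a Markov model turns username structure into an "obscurity" score, here is a toy character-bigram version. It is a drastic simplification of the model in the paper, trained on a made-up corpus, but it shows why a name built from common character transitions carries fewer bits per character than one built from rare transitions.

```python
# Toy character-bigram Markov model for scoring how "surprising" (and hence
# how identifying) a username is. A simplified stand-in, not the model from
# the paper; the training corpus is made up.
import math
from collections import Counter

def train_bigrams(usernames):
    """Count character bigrams over a training corpus, with ^ and $ as
    start/end markers; returns bigram and unigram counts."""
    pairs, singles = Counter(), Counter()
    for u in usernames:
        chars = "^" + u.lower() + "$"
        for a, b in zip(chars, chars[1:]):
            pairs[(a, b)] += 1
            singles[a] += 1
    return pairs, singles

def surprisal_bits(username, pairs, singles, alphabet_size=40):
    """Negative log2 probability of the username under the bigram model,
    with add-one smoothing. Higher means rarer, hence more identifying."""
    chars = "^" + username.lower() + "$"
    return sum(-math.log2((pairs[(a, b)] + 1) / (singles[a] + alphabet_size))
               for a, b in zip(chars, chars[1:]))

# Made-up corpus; a real model would be trained on millions of usernames.
corpus = ["randomwalker", "random_walker", "john1985", "musicfan", "jsmith"]
pairs, singles = train_bigrams(corpus)
for name in ["random_walker", "xq7_zzv_qx42w"]:
    bits = surprisal_bits(name, pairs, singles)
    print(f"{name}: {bits:.1f} bits total, {bits / len(name):.2f} bits/char")
```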

The setting studied by Perito et al. is when two or more offline databases of usernames are available. Another question worth considering is determining the identity of a person behind a username via automated web searches. See my post on de-anonymizing Lending Club data for an empirical analysis of this.

There is a lot to be said about the psychology behind username choice. Ben Gross’s dissertation is a fascinating look at the choice of identifiers for self-representation. I myself am very attached to ‘randomwalker’; I’m not sure why that is.

A philosophical question related to this research is whether it is better to pick a unique username or a common one. The good thing about a unique username is that you stand out from the crowd. The bad thing about a unique username is that you stand out from the crowd. The question gets even more interesting (and consequential) if you’re balancing Googlability and anonymity in the context of naming your child, but that’s a topic for another day.


February 16, 2011 at 5:19 pm

An open letter to Netflix from the authors of the de-anonymization paper

Dear Netflix,

Today is a sad day. It is also a day of hope.

It is a sad day because the second Netflix challenge had to be cancelled. We never thought it would come to this. One of us has publicly referred to the dampening of research as the “worst possible outcome” of privacy studies. As researchers, we are true believers in the power of data to benefit mankind.

We published the initial draft of our de-anonymization study just two weeks after the dataset for the first Netflix Prize became public. Since we had the math to back up our claims, we assumed that lessons would be learned, and that if there were to be a second data release, it would either involve only customers who opted in, or a privacy-preserving data analysis mechanism. That was three and a half years ago.

Instead, you brushed off our claims, calling them “absolutely without merit,” among other things. It has taken negative publicity and an FTC investigation to stop things from getting worse. Some may make the argument that even if the privacy of some of your customers is violated, the benefit to mankind outweighs it, but the “greater good” argument is a very dangerous one. And so here we are.

We were pleasantly surprised to read the plain, unobfuscated language in the blog post announcing the cancellation of the second contest. We hope that this signals a change in your outlook with respect to privacy. We are happy to see that you plan to “continue to explore ways to collaborate with the research community.”

Running something like the Netflix Prize competition without compromising privacy is a hard problem, and you need the help of privacy researchers to do it right. Fortunately, there has been a great deal of research on “differential privacy,” some of it specific to recommender systems. But there are practical challenges, and overcoming them will likely require setting up an online system for data analysis rather than an “anonymize and release” approach.

Data privacy researchers will be happy to work with you rather than against you. We believe that this can be a mutually beneficial collaboration. We need someone with actual data and an actual data-mining goal in order to validate our ideas. You will be able to move forward with the next competition, and just as importantly, it will enable you to become a leader in privacy-preserving data analysis. One potential outcome could be an enterprise-ready system which would be useful to any company or organization that outsources analysis of sensitive customer data.

It’s not often that a moral imperative aligns with business incentives. We hope that you will take advantage of this opportunity.

Arvind Narayanan and Vitaly Shmatikov


For background, see our paper and FAQ.


March 15, 2010 at 4:53 pm



About 33bits.org

I’m an associate professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
