One more re-identification demonstration, and then I’m out

What should we do about re-identification? Back when I started this blog in grad school seven years ago, I subtitled it “The end of anonymous data and what to do about it,” anticipating that I’d work on re-identification demonstrations as well as technical and policy solutions. As it turns out, I’ve looked at the former much more often than the latter. That said, my recent paper A Precautionary Approach to Big Data Privacy with Joanna Huey and Ed Felten tackles the “what to do about it” question head-on. We present a comprehensive set of recommendations for policy makers and practitioners.

One more re-identification demonstration, and then I’m out. Overall, I’ve moved on in terms of my research interests to other topics like web privacy and cryptocurrencies. That said, there’s one fairly significant re-identification demonstration I hope to do some time this year. This is something I started in grad school, obtained encouraging preliminary results on, and then put on the back burner. Stay tuned.

Machine learning and re-identification. I’ve argued that the algorithms used in re-identification turn up everywhere in computer science. I’m still interested in these algorithms from this broader perspective. My recent collaboration on de-anonymizing programmers using coding style is a good example. It uses more sophisticated machine learning than most of my earlier work on re-identification, and the potential impact is more in forensics than in privacy.

Privacy and ethical issues in big data. There’s a new set of thorny challenges in big data — privacy-violating inferences, fairness of machine learning, and ethics in general. I’m collaborating with technology ethics scholar Solon Barocas on these topics. Here’s an abstract we wrote recently, just to give you a flavor of what we’re doing:

How to do machine learning ethically

Every now and then, a story about inference goes viral. You may remember the one about Target advertising to customers who were determined to be pregnant based on their shopping patterns. The public reaction reveals deep discomfort with the power of inference and frames it as a violation of privacy. The company in question, on the other hand, protests that there was no wrongdoing — after all, it had only collected innocuous information on customers’ purchases and hadn’t revealed that data to anyone else.

This common pattern reveals a deep disconnect between what people seem to care about when they cry privacy foul and the way the protection of privacy is currently operationalized. The idea that companies shouldn’t make inferences based on data they’ve legally and ethically collected might be disturbing and confusing to a data scientist.

And yet, we argue that doing machine learning ethically means accepting and adhering to boundaries on what’s OK to infer or predict about people, as well as how learning algorithms should be designed. We outline several categories of inference that run afoul of privacy norms. Finally, we explain why ethical considerations sometimes need to be built in at the algorithmic level, rather than being left to whoever is deploying the system. While we identify a number of technical challenges that we don’t quite know how to solve yet, we also provide some guidance that will help practitioners avoid these hazards.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

March 23, 2015 at 8:20 am Leave a comment

Good and bad reasons for anonymizing data

Ed Felten and I recently wrote a response to a poorly reasoned defense of data anonymization. This doesn’t mean, however, that there’s never a place for anonymization. Here’s my personal view on some good and bad reasons for anonymizing data before sharing it.

Good: We’re using anonymization to keep honest people honest. We’re only providing the data to insiders (employees) or semi-insiders (research collaborators), and we want to help them resist the temptation to peep.

Probably good: We’re sharing data only with a limited set of partners. These partners have a reputation to protect; they have also signed legal agreements that specify acceptable uses, retention periods, and audits.

Possibly good: We de-identified the data at a big cost in utility — for example, by making high-dimensional data low-dimensional via “vertical partitioning” — but it still enables some useful data analysis. (There are significant unexplored research questions here, and technically sound privacy guarantees may be possible.)

Reasonable: The data needed to be released no matter what; techniques like differential privacy didn’t produce useful results on our dataset. We released de-identified data and decided to hope for the best.

Reasonable: The auxiliary data needed for de-anonymization doesn’t currently exist publicly and/or on a large scale. We’re acting on the assumption that it won’t materialize in a relevant time-frame and are willing to accept the risk that we’re wrong.

Ethically dubious: The privacy harm to individuals is outweighed by the greater good to society. Related: de-anonymization is not as bad as many other privacy risks that consumers face.

Sometimes plausible: The marginal benefit of de-anonymization (compared to simply using the auxiliary dataset for marketing or whatever purpose) is so low that even the small cost of skilled effort is a sufficient deterrent. Adversaries will prefer other means of acquiring equivalent data — through purchase, if they are lawful, or hacking, if they’re not.[*]

Bad: Since there aren’t many reports of de-anonymization except research demonstrations, it’s safe to assume it isn’t happening.

It’s surprising how often this argument is advanced considering that it’s a complete non sequitur: malfeasors who de-anonymize are obviously not going to brag about it. The next argument is a self-interested version that takes this fact into account.

Dangerously rational: There won’t be a PR fallout from releasing anonymized data because researchers no longer have the incentive for de-anonymization demonstrations, whereas if malfeasors do it they won’t publicize it (elaborated here).

Bad: The expertise needed for de-anonymization is such a rare skill that it’s not a serious threat (addressed here).

Bad: We simulated some attacks and estimated that only 1% of records are at risk of being de-anonymized. (Completely unscientific; addressed here.)

Qualitative risk assessment is valuable; quantitative methods can be a useful heuristic to compare different choices of anonymization parameters if one has already decided to release anonymized data for other reasons, but can’t be used as a justification of the decision.

[*] This is my restatement of one of Yakowitz’s arguments in Tragedy of the Data Commons.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

July 9, 2014 at 8:05 am Leave a comment

How to prepare a technical talk

I used to suck at giving technical talks. I would usually confuse my audience, and often confuse myself. By the time I became a prof, I sucked a lot less. These days I enjoy giving technical talks and lectures more than non-technical ones, and my students seem to like them better as well.

So something had changed; I’d developed a process. The other day I sat down to see if I could extract what this process was. It turned out to be surprisingly formulaic, like an algorithm, so I’d like to share it with you. I’m guessing this is obvious to most professors who teach technical topics, but I hope it will be helpful to those who’re relatively new to the game.

There are three steps. They’re simple but not easy.

  1. Identify the atomic concepts
  2. Draw the dependency graph
  3. Find a topological ordering of the graph

Identify atomic concepts. The key word here is atomic. The idea is to introduce only one key concept at a time and give the audience time to internalize it before moving on to the next one.

This is hard for two reasons. First, concepts that seem atomic to an expert are often an amalgam of different concepts. Second, it’s audience-specific. You have to have a good mental model of which concepts are already familiar to your audience.

Draw the dependency graph. Occasionally I use a whiteboard for this, but usually it’s in my head. This is a tricky step because it’s easy to miss dependencies. When the topic I’m teaching is the design of a technical system, I ask myself questions like, “what could go wrong in this component?” and “why wasn’t this alternative design used?” This helps me flesh out the internal logic of the system in the form of a graph.

Find a topological ordering. This is just a fancy way of saying we want to order the concepts so that each concept only depends on the ones already introduced. Sometimes this is straightforward, but sometimes the dependency graph has cycles!

Of the topics I’ve taught recently, Bitcoin seems especially difficult in this regard. Each concept is bootstrapped off of the others, but somehow the system magically works when you put everything together. What I do in these cases is introduce intermediate steps that don’t exist in the actual design I’m teaching, and remove them later [1].
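
To make the "algorithm" analogy concrete, here is a minimal sketch (the concepts and dependencies are made up for illustration, loosely in the spirit of a Bitcoin lecture): topologically sort the dependency graph, and fall back to cycle-breaking when no linear order exists.

```python
from graphlib import TopologicalSorter, CycleError

# Toy dependency graph: each concept maps to the concepts it depends on,
# i.e., the ones that must be introduced before it.
concepts = {
    "hash functions": set(),
    "digital signatures": set(),
    "transactions": {"digital signatures"},
    "blocks": {"hash functions", "transactions"},
    "proof of work": {"hash functions", "blocks"},
}

try:
    order = list(TopologicalSorter(concepts).static_order())
    print("Lecture order:", " -> ".join(order))
except CycleError as e:
    # No linear ordering exists; the cycle is reported so it can be broken
    # by introducing an intermediate concept and dropping the back-edge.
    print("Cycle among concepts:", e.args[1])
```

Breaking a cycle in practice means exactly what footnote [1] below describes: add an intermediate node, remove the offending edge, and re-run the ordering.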

Think of a technical topic as a skyscraper. When it’s presented in a paper, it’s analogous to unveiling a finished building. The audience can admire it and check that it’s stable/correct (say, by verifying theorems or other technical arguments.) But just as staring at a building doesn’t help you learn how to build one, the presentation in a typical paper is all but useless for pedagogical purposes. Having dependencies between concepts is perfectly acceptable in papers, because papers are not meant to be read in a single pass.

The instructor’s role, then, is to reverse engineer how the final concept might plausibly be built up step by step. This is analogous to showing the scaffolding of the building and explaining each step in its construction. Talks and lectures, unlike papers, must necessarily have this linear form because the audience can’t keep state in their heads.

[1] This process introduces new nodes in the dependency graph and removes some edges so that it is no longer cyclic.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter or Google+.

November 26, 2013 at 6:29 am Leave a comment

How to pick your first research project

At Princeton I get to advise many gifted graduate and undergraduate students in doing research. Combining my experience as a mentor with reflecting on my experience as a student, I’d like to offer some guidance on how to pick your first research project.

I’m writing this post because selecting a research problem to work on is significantly harder than actually solving it. I mean the previous sentence quite literally and without exaggeration. As an undergraduate and early-stage graduate researcher, I repeatedly spent months at a time working on research problems only to have to abandon my efforts because I found out I was barking up the wrong tree. Scientific research, it turns out, is largely about learning to ask the right questions.

The good news is that three simple criteria will help you avoid most of the common pitfalls.

1. Novelty. Original research is supposed to be, well, original. There are two components to novelty. The first is to make sure the problem you’re trying to solve hasn’t already been solved. This is way trickier than it seems — you might miss previous research because you’re using different names for concepts compared to the standard terminology. But the issue is deeper: two ideas may be equivalent without sounding at all the same at a superficial level. Your advisor’s help will be crucial here.

The other aspect to novelty is that you should have a convincing answer to the question “why has this problem not been solved yet?” Often this might involve a dataset that only recently became available, or some clever insight you’ve come up with that you suspect others missed. In practice, one often has an insight and then looks for a problem to apply it to. This means you have to put in a good bit of creative thinking even to pick a research question, and you must be able to estimate the difficulty level of solving it.

If your answer to the question is, “because the others who tried it weren’t smart enough,” you should probably think twice. It may not be prudent to have the success of your first project ride on your intellectual abilities being truly superlative.

2. Relevance. You must try to ensure that you select a problem that matters, one whose solution will impact the world directly or indirectly (and hopefully for the better). Again, your advisor’s help will be essential. (That said, professional researchers do produce massive volumes of research papers that no one cares about.) I encourage my students to pick subproblems of my ongoing long-term research projects. This is a safe way to pick a problem that’s relevant.

3. Measurable results. This one becomes automatic as you get experienced, but for beginning researchers it can be confusing. The output of your research should be measurable and reproducible; ideally you should be able to formulate your goals as a testable hypothesis. Measurability means that many interesting projects that are novel and make the world better are nevertheless unsuitable for research. (They may be ideal for a startup or a hobby project instead.) “Build a website for illiterate kids in poor countries to learn effectively” is an example of a task that’s hard to frame as a research question.

Irrelevant criteria. Let me also point out what’s not on this list. First, the general life advice you often hear, to do something you’re passionate about, is unfortunately a terrible way to pick a research problem. If you start from something you’re passionate about, the chance that it will meet the three criteria above is pretty slim. Often one has to consider a dozen or more research ideas before settling on one to work on.

You should definitely pick a research area you’re passionate about. But getting emotionally invested in a specific idea or research problem before you’ve done the due diligence is a classic mistake, one that I made a lot as a student.

Second, the scope or importance of the problem is another criterion you shouldn’t fret much about for your first project. Your goal is as much to learn the process of research as to produce results. You probably have a limited amount of time in which you want to evaluate if this whole research thing is the right fit for you. While you should definitely pick a useful and relevant research task, it should be something that you have a reasonable chance of carrying to fruition. Don’t worry about curing cancer just yet.

Note that the last point is at odds with advice given to more experienced researchers. Richard Hamming, in a famous talk titled “You and your research,” advised researchers to pick the most important problem that they have a shot at solving. I’ve written a version of the current post for those who’re in it for the long haul, and my advice there is to embrace risk and go for the big hits.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter or Google+.

November 1, 2013 at 3:58 pm Leave a comment

Academic publishing as (ruinous) competition: Is there a way out?

Aaron Johnson invited me to speak as part of a panel on academic publishing at PETS 2013. This is a rough transcript of my talk, written from memory.

Aaron mentioned he was looking for one more speaker for this panel, so that we could hear the view of someone naive and inexperienced, and asked if I was available. I said, “Great, I do that every day!” So that will be the tone of my comments today. I don’t have any concrete proposals that can be implemented next year or in two years. Instead these are blue-sky thoughts on how things could work someday and hopeful suggestions for moving in that direction. [1]

I just finished my first year as a faculty member at Princeton. It’s still a bit surreal. I wasn’t expecting to have an academic career. In fact, back in grad school, especially the latter half, whenever someone asked me what I wanted to do after I graduated, my answer always was, “I don’t know for sure yet, but there’s one career I’m sure I don’t want — academia.”

I won’t go into the story of why that was and how it changed. But it led to some unusual behavior. I ranted a lot about academia on Twitter, as Aaron already mentioned when he introduced me. Also, many times I “published” stuff by putting up a blog post. For instance I had a series of posts on the ability of a malicious website to deanonymize visitors (1, 2, 3, 4, 5, 6). People encouraged me to turn it into a paper, and I could have done that without much extra effort. But I refused, because my primary goal was to quickly disseminate the information, and I felt my blog posts had accomplished that adequately. True, I wouldn’t get academic karma, but why would I care? I wasn’t going to be an academic!

When I eventually decided I wanted to apply for academic positions, I talked to a professor whose opinion I greatly respected. He expressed skepticism that I’d get any interviews, given that I’d been blogging instead of writing papers. I remember thinking, “oh shit, I’ve screwed up my career, haven’t I?” So I feel extremely lucky that my job search turned out successfully.

At this point a sane person would have decided to quit while they were ahead, and start playing the academic game. But I guess sanity has never really been one of my strong points. So in the last year I’ve been thinking a lot about what the process of research collaboration and publishing would look like if we somehow magically didn’t have to worry at all about furthering our individual reputations.

Polymath

Something that’s very close to my ideal model of collaboration is the Polymath project. I was fascinated when I heard about it a few years ago. It was started by mathematician Tim Gowers in a blog post titled “Is massively collaborative mathematics possible?” [2] He and Terry Tao are the leaders of the project. They’re among the world’s top mathematicians. There have been several of these collaborations so far and they’ve been quite successful, solving previously open math problems. So I’ve been telling computer scientists about these efforts and asking if our community could produce something like this. [3]

To me there are three salient aspects of Polymath. The first is that the collaboration happens online, in blog posts and comments, rather than phone or physical meetings. When I tell people this they are usually enthusiastic and willing to try something like that. The second aspect is that it is open, in that there is no vetting of participants. Now people are a bit unsure, and say, “hmm, what’s the third?” Well, the third aspect is that there’s no keeping score of who contributed what. To which they react, “whoa, whoa, wait, what??!!”

I’m sure we can all see the problem here. Gowers and Tao are famous and don’t have to worry about furthering their careers. The other participants who contribute ideas seem to do it partly altruistically and partly because of the novelty of it. But it’s hard to imagine this process being feasible on a bigger scale.

Misaligned incentives

Let’s take a step back and ask why there’s this gap between doing good research and getting credit for it. In almost every industry, every human endeavor, we’ve tried to set things up so that the incentives for individuals and the broader societal goals of the activity align with each other. But sometimes individual incentives get misaligned with the societal goals, and that leads to problems.

Let’s look at a few examples. Individual traders play the stock market with the hope of getting rich. But at the same time, it helps companies hedge against risk and improves overall financial stability. At least that’s the theory. We’ve seen it go wrong. Similarly, copyright is supposed to align the desire of creators to make money with the goal of the maximum number of people enjoying the maximum number of creative works. That’s gotten out of whack because of digital technology.

My claim is that we’re seeing the same problem in academic research. There’s a metaphor that explains what’s going on in research really well, and to me it is the root of all of the ills that I want to talk about. And that metaphor is publishing as competition. What do I mean by that? Well, peer review is a contest. Succeeding at this contest is the immediate incentive that we as researchers have. And we hope that this will somehow lead to science that benefits humanity.

To be clear, I’m far from the first one to make this observation. Let me quote someone who’s much better qualified to talk about this. Oded Goldreich, I’m sure most of you know of him, has a paper titled “On Struggle and Competition in Scientific Fields.” Here’s my favorite quote from the paper. He’s talking about the flagship theory conferences.

Eventually, FOCS/STOC may become a pure competition, defined as a competition having no aim but its own existence (i.e., the existence of a competition). That is, pure competitions serve no scientific purpose. Did FOCS/STOC reach this point or is close to it? Let me leave this question open, and note that my impression is that things are definitely evolving towards this direction. In any case, I think we should all be worried about the potential of such an evolution.

I don’t know enough about the theory community to have an opinion on how big a problem this is. Still, I’m sure we can agree with the sentiment of the last sentence.

But here’s the very next paragraph. I think it gives us hope.

Other TOC conferences seem to suffer less from the aforementioned phenomena. This is mainly because they “count” less as evidence of importance (i.e., publications in them are either not counted by other competitions or their effect on these competitions is less significant). Thus, the vicious cycle described above is less powerful, and consequently these conferences may still serve the intended scientific purposes.

We see the same thing in the security and privacy community. Something I’ve seen commonly is a situation where you have a neat result, but nothing earth-shattering, and it’s not good enough as it is for a top tier venue. So what do you do? You pad it with bullshit and submit it, and it gets in. Another trend that this encourages is deliberately making a bad or inaccurate model so that you can solve a harder problem. But PETS publications and participants seem to suffer less from these effects. That’s why I’m happy to be discussing this issue with this group of people.

Paper as final output

It seems like we’re at an impasse. We can agree that publishing-as-competition has all these problems, but hiring committees and tenure committees need competitions to identify good research and good researchers. But I claim that publishing as competition fails even at the supposed goal of identifying useful research.

The reason for that is simple. Publishing as competition encourages or even forces viewing the paper as the final output. But it’s not! The hard work begins, not ends when the paper is published. This is unlike the math and theory communities, where the paper is in fact the final output. If publishing-as-competition is so bad for theory, it’s much worse for us.

In security and privacy research, the paper is the starting point. Our goal is not to prove theorems but to more directly impact the world in some way.  By creating privacy technologies, for example. For research to have impact, authors have to do a variety of things after publication depending on the nature of the research. Build technology and get people to adopt it. Explain the work to policymakers or to other researchers who are building upon it. Or even just evangelize your ideas. Some people claim that ideas should stand on their own merit and compete with other ideas on a level playing field. I find this quite silly. I lean toward the view expressed in this famous quote you’ve probably heard: “if your ideas are any good you’ll have to shove them down people’s throats.”

The upshot of this is that impact is heavily shortchanged in the publication-as-competition model. This is partly because of what I’ve talked about: we have no incentive to do any more work after getting the paper published. But an equally important reason is that the community can’t judge the impact of research at the point of publication. Deciding who “wins the prizes” at the point of publication, before the ideas have a chance to prove themselves, has disastrous consequences.

So I hope I’ve convinced you that publication-as-competition is at the root of many of our problems. Let me give one more example. Many of us like the publish-then-filter model, where reviews are done in the open on publicly posted papers with anyone being able to comment. One major roadblock to moving to this model is that it screws up the competition aspect. The worry is that papers that receive a lot of popular attention will be reviewed favorably, and so forth. We want papers to be reviewed on a level playing field. But if the worth of a paper can’t be judged at publication time, that means all this fairness is toward an outcome that is meaningless anyway. Do we still want to keep this model at all costs?

A way forward?

So far I’ve done a lot of complaining. Let me offer some suggestions now. I want to give two sets of suggestions that are complementary. The first is targeted at committees, whether tenure committees, hiring committees, award committees, or even program committees to an extent, and to the community in general. The second is targeted at authors.

Here’s my suggestion for committees and the community: we can and should develop ways to incentivize and measure real impact. Let me give you four examples. I have more that I’d be happy to discuss later. First, retrospective awards. That is, “best paper from this conference 10 years ago” or some such. I’ve been hearing more about these of late, and I think that’s good news. The idea is that impact is easier to evaluate 10 years after publication.

Second, overlay journals. These are online journals that are a way of “blessing” papers that have already been published or made public. There is a lag between initial publication and inclusion in the overlay journal, and that’s a good thing. Recently the math community has come up with a technical infrastructure for running overlay journals. I’m very excited about this. [4]

There are two more that are related. These are specific to our research field. For papers that are about a new tool, I think we should look at adoption numbers as an important component of the review process. Finally, such papers should also have an “incentives” section or subsection. Because all too often we write papers that we imagine unspecified parties will implement and deploy, but it turns out there isn’t the slightest economic incentive for any company or organization to do so.

I think we should also find ways to measure contributions through blog posts and sharing data and code in publications. This seems more tricky. I’d be happy to hear suggestions on how to do it.

Next, this is what I want to say to authors: the supposed lack of incentives for nontraditional ways of publishing is greatly exaggerated. I say this from my personal experience. I said earlier that I was very lucky that my job search turned out well. That’s true, but it wasn’t all luck. I found out to my surprise that my increased visibility through blogging and especially the policy work that came out of it made a huge difference to my prospects. If I’d had three times as many publications and no blog, I probably would have had about the same chances. I’m sure some departments didn’t like my style, but there are definitely others that truly value it.

My Bitcoin experiment

I have one other personal experience to share with you. This is an experiment I’ve been doing over the last month or so. I’d been thinking about the possibility of designing a prediction market on top of Bitcoin that doesn’t have a central point of control. Some of you may know the sad story of Intrade. So I tweeted my interest in this problem, and asked if others had put thought into it. Several people responded. I started an email thread for this group, and we went to work.

12,000 words and several conference calls later, we’re very happy with where we are, and we’ve started writing a paper presenting our design. What’s even better is who the participants are — Jeremy Clark at Carleton, Joe Bonneau who did his Ph.D. with Ross Anderson and is currently at Google, and Andrew Miller at UMD who is Jon Katz’s Ph.D. student. All these people are better qualified to write this paper than I am. By being proactive and reaching out online, I was able to assemble and work with this amazing team. [5]

But this experiment didn’t go all the way. While I used Twitter to find the participants and was open to accepting anyone, the actual collaboration is being done through traditional channels. My original intent was to do it in public, but I realized quite early on that we had something publication-worthy and became risk-averse.

I plan to do another experiment, this time with the explicit goal of doing it in public. This is again a Bitcoin-related paper that I want to write. Oddly enough, there is no proper tutorial of Bitcoin, nor is there a survey of the current state of research. I think combining these would make a great paper. The nature of the project makes it ideal to do online. I haven’t figured out the details yet, but I’m going to launch it on my blog and see how it goes. You’re all welcome to join me in this experiment. [6]

So that’s basically what I wanted to share with you today. I think the current model of publication as competition has gone too far, and the consequences are starting to get ruinous. It’s time we put a stop to it. I believe that committees on one hand, and authors on the other both have the incentive to start changing things unilaterally. But if the two are combined, the results can be especially powerful. In fact, I hope that it can lead to a virtuous cycle. Thank you.

[1] Aaron didn’t actually say that, of course. You probably got that. But who knows if nuances come across in transcripts.

[2] At this point I polled the room to see who’d heard of Polymath before. Only three hands went up (!)

[3] There is one example that’s closer to computer science that I’m aware of: this book on homotopy type theory written in a similar spirit as the Polymath project.

[4] During my talk I incorrectly cited the URL for this infrastructure as selectedpapers.net. That is a somewhat related but different project. It is actually the Episciences project.

[5] Since the talk, we’ve had another excellent addition to the team: Josh Kroll at Princeton, who recently published a neat paper on the economics of Bitcoin mining with Ian Davey and Ed Felten.

[6] Something that I meant to mention at the end but ran out of time for is Michael Nielsen’s excellent book Reinventing Discovery: The New Era of Networked Science. If you find the topic of this post at all interesting, you should absolutely read this book.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter or Google+.

July 15, 2013 at 7:13 am 10 comments

Personalized coupons as a vehicle for perfect price discrimination

Given the pervasive tracking and profiling of our shopping and browsing habits, one would expect that retailers would be very good at individualized price discrimination —  figuring out what you or I would be willing to pay for an item using data mining, and tailoring prices accordingly. But this doesn’t seem to be happening. Why not?

This mystery isn’t new. Mathematician Andrew Odlyzko predicted a decade ago that data-driven price discrimination would become much more common and effective (paper, interview). Back then, he was far ahead of his time. But today, behavioral advertising at least has gotten good enough that it’s often creepy. The technology works; the impediment to price discrimination lies elsewhere. [1]

It looks like consumers’ perception of unfairness of price discrimination is surprisingly strong, which is why firms balk at overt price discrimination, even though covert price discrimination is all too common. But the covert form of price discrimination is not only less efficient, it also (ironically) has significant social costs — see #3 below for an example. Is there a form of pricing that allows for perfect discrimination (i.e., complete tailoring to individuals), in a way that consumers find acceptable? That would be the holy grail.

In this post, I will argue that the humble coupon, reborn in a high-tech form, could be the solution. Here’s why.

1. Coupons tap into shopper psychology. Customers love them.

Coupons, like sales, introduce unpredictability and rewards into shopping, which provides a tiny dopamine spike that gets us hooked. JC Penney’s recent misadventure in trying to eliminate sales and coupons provides an object lesson:

“It may be a decent deal to buy that item for $5. But for someone like me, who’s always looking for a sale or a coupon — seeing that something is marked down 20 percent off, then being able to hand over the coupon to save, it just entices me. It’s a rush.”

Some startups have exploited this to the hilt, introducing “gamification” into commerce. Shopkick is a prime example. I see this as a very important trend.

2. Coupons aren’t perceived as unfair.

Given the above, shoppers have at best a dim perception of coupons as a price discrimination mechanism. Even when they do notice, however, coupons aren’t perceived as unfair to nearly the same degree as listing different prices for different consumers, even if the result in either case is identical. [2]

3. Traditional coupons are not personalized.

While customers may have different reasons for liking coupons, from firms’ perspective the way in which traditional coupons aid price discrimination is pretty simple: by forcing customers to waste their time. Econ texts tend to lay it out bluntly. For example, R. Preston McAfee:

Individuals generally value their time at approximately their wages, so that people with low wages, who tend to be the most price-sensitive, also have the lowest value of time. … A thrifty shopper may be able to spend an hour sorting through the coupons in the newspaper and save $20 on a $200 shopping expedition … This is a good deal for a consumer who values time at less than $20 per hour, and a bad deal for the consumer that values time in excess of $20 per hour. Thus, relatively poor consumers choose to use coupons, which permits the seller to have a price cut that is approximately targeted at the more price-sensitive group.

Clearly, for this to be effective, coupon redemption must be deliberately made time-consuming.

To the extent that there is coupon personalization, it seems to be for changing shopper behavior (e.g., getting them to try out a new product) rather than a pricing mechanism. The NYT story from last year about Target targeting pregnant women falls into this category. That said, these different forms of personalization aren’t entirely distinct, which is a point I will return to in a later article.

4. The traditional model doesn’t work well any more.

Paper coupons have a limited future. As for digital coupons, there is a natural progression toward interfaces that make it easier to acquire and redeem them. In particular, as more shoppers start to pay using their phones in stores, I anticipate coupon redemption being integrated into payment apps, thus becoming almost frictionless.

An interesting side-effect of smartphone-based coupon redemption is that it gives the shopper more privacy, avoiding the awkwardness of pulling out coupons from a purse or wallet. This will further open up coupons to a wealthier demographic, making them even less effective at discriminating between wealthier shoppers and less affluent ones.

5. The coupon is being reborn in a data-driven, personalized form.

With behavioral profiling, companies can determine how much a consumer will pay for a product, and deliver coupons selectively so that each customer’s discount reflects what they are willing to pay. The key difference is that while in the past, customers decided whether or not to look for, collect, and use a coupon, in the new model companies will determine who gets which coupons.

In the extreme, coupons will be available for all purchases, and smart shopping software on our phones or browsers will automatically search, aggregate, manage, and redeem these coupons, showing coupon-adjusted prices when browsing for products. More realistically, the process won’t be completely frictionless, since that would lose the psychological benefit. Coupons will probably also merge with “rewards,” “points,” discounts, and various other incentives.
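
To make the mechanism concrete, here is a toy sketch (the prices and willingness-to-pay estimates are entirely made up; a real system would derive them from behavioral models): issue each shopper a coupon just large enough to bring the net price down to their estimated willingness to pay, but never below marginal cost.

```python
# Toy sketch of coupon-based price discrimination. The willingness-to-pay
# (WTP) estimates would come from a behavioral model in practice; here
# they are invented numbers for illustration.
LIST_PRICE = 50.00
MARGINAL_COST = 12.00

estimated_wtp = {"shopper_a": 48.00, "shopper_b": 31.00, "shopper_c": 22.00}

def coupon_for(wtp, list_price=LIST_PRICE, marginal_cost=MARGINAL_COST):
    """Discount just large enough that the shopper's net price equals
    their estimated WTP, but never pricing below marginal cost."""
    target_price = max(wtp, marginal_cost)
    return max(0.0, round(list_price - target_price, 2))

for shopper, wtp in estimated_wtp.items():
    print(shopper, "gets a coupon worth", coupon_for(wtp))
```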

There have been rumblings of this shift here and there for a few years now, and it seems to be happening gradually. Google’s acquisition of Incentive Targeting a few months ago seems significant, and at the very least demonstrates that tech companies, and not just retailers, are eyeing this space. As digital feudalism takes root, it could accelerate the trend of individualized shopping experiences.

In summary, personalized coupons offer a vehicle for realizing the full potential of data mining for commerce by tailoring prices in a way that consumers seem to find acceptable. Neither coupons nor price discrimination should be viewed in isolation — together with rewards and various other incentive schemes, they are part of the trend of individualized, data mining-driven commerce that’s here to stay.

Footnotes

[1] Since I’m eschewing some academic terminology in this post, here are a few references and points of clarification. My interest is in first-degree price discrimination. Any price discrimination requires market power; my assumption is that this holds in practice because competition is always imperfect, and so we should expect quite a bit of first-degree price discrimination. The observed level is puzzlingly low.

The impact of technology on the ability to personalize prices is complex, and behavioral profiling is only one aspect. Technology also makes competition less perfect by allowing firms to customize products to a greater degree, so that there are no exact substitutes. Finally, technology hinders first-degree price discrimination to an extent by allowing consumers to compare prices between different retailers more easily. The interaction between these effects is analyzed in this paper.

Technology also increases the incentive to price discriminate. As production becomes more and more automated, marginal costs drop relative to fixed costs. In the extreme, digital goods have essentially zero marginal cost. When marginal production costs are low, firms will try to tailor prices since any sale above marginal cost increases profits.

My use of the terms overt and covert is rooted in the theory of price fairness in psychology and behavioral economics, and relates to the presentation of the transaction. While it is somewhat related to first- vs. second/third-degree price discrimination, it is better understood as a separate axis, one that is not captured by theories of rational firms and consumers.

[2] An exception is when non-coupon customers are made aware that others are getting a better deal. This happens, for example, when there is a prominent coupon-code form field in an online shopping checkout flow. See here for a study.

Thanks to Sebastian Gold for reviewing a draft, and to Justin Brickell for interesting conversations that led me to this line of thinking.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter or Google+.

June 25, 2013 at 7:09 am 8 comments

Reidentification as Basic Science

This essay originally appeared on the Bill of Health blog as part of a conversation on the law, ethics and science of reidentification demonstrations.

What really drives reidentification researchers? Do we publish these demonstrations to alert individuals to privacy risks? To shame companies? For personal glory? If our goal is to improve privacy, are we doing it in the best way possible?

In this post I’d like to discuss my own motivations as a reidentification researcher, without speaking for anyone else. Certainly I care about improving privacy outcomes, in the sense of making sure that companies, governments and others don’t get away with mathematically unsound promises about the privacy of consumers’ data. But there is a quite different goal I care about at least as much: reidentification algorithms. These algorithms are my primary object of study, and so I see reidentification research partly as basic science.

Let me elaborate on why reidentification algorithms are interesting and important. First, they yield fundamental insights about people — our interests, preferences, behavior, and connections — as reflected in the datasets collected about us. Second, as is the case with most basic science, these algorithms turn out to have a variety of applications other than reidentification, both for good and bad. Let us consider some of these.

First and foremost, reidentification algorithms are directly applicable in digital forensics and intelligence. Analyzing the structure of a terrorist network (say, based on surveillance of movement patterns and meetings) to assign identities to nodes is technically very similar to social network deanonymization. A reidentification researcher that I know who is a U.S. citizen tells me he has been contacted more than once by intelligence agencies to apply his expertise to their data.

Homer et al’s work on identifying individuals in DNA mixtures is another great example of how forensics algorithms are inextricably linked to privacy-infringing applications. In addition to DNA and network structure, writing style and location trails are other attributes that have been utilized both in reidentification and forensics.

It is not a coincidence that the reidentification literature often uses the word “fingerprint” — this body of work has generalized the notion of a fingerprint beyond physical attributes to a variety of other characteristics. Just like physical fingerprints, there are good uses and bad, but regardless, finding generalized fingerprints is a contribution to human knowledge. A fundamental question is how much information (i.e., uniqueness) there is in each of these types of attributes or characteristics. Reidentification research is gradually helping answer this question, but much remains unknown.
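
To put rough numbers on this, here is a back-of-the-envelope sketch (the attribute prevalences are hypothetical and the independence assumption is a simplification): an attribute value shared by a fraction p of the population carries about -log2(p) bits of identifying information, and roughly 33 bits suffice to single out one person among everyone on Earth.

```python
import math

WORLD_POPULATION = 7e9
print("Bits needed to single out one person:",
      round(math.log2(WORLD_POPULATION), 1))  # ~32.7

def bits(fraction_sharing_value):
    """Identifying information (in bits) of an attribute value shared
    by the given fraction of the population."""
    return -math.log2(fraction_sharing_value)

# Hypothetical prevalences, purely for illustration.
attributes = {"ZIP code": 1 / 30000, "birth date": 1 / 25000, "gender": 1 / 2}
for name, p in attributes.items():
    print(f"{name}: ~{bits(p):.1f} bits")
print(f"Combined (if independent): ~{sum(bits(p) for p in attributes.values()):.1f} bits")
```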

It is not only people that are fingerprintable — so are various physical devices. A wonderful set of (unrelated) research papers has shown that many types of devices, objects, and software systems, even supposedly identical ones, have unique fingerprints: blank paper, digital cameras, RFID tags, scanners and printers, and web browsers, among others. The techniques are similar to reidentification algorithms, and once again straddle security-enhancing and privacy-infringing applications.

Even more generally, reidentification algorithms are classification algorithms for the case when the number of classes is very large. Classification algorithms categorize observed data into one of several classes, i.e., categories. They are at the core of machine learning, but typical machine-learning applications rarely need to consider more than several hundred classes. Thus, reidentification science is helping develop our knowledge of how best to extend classification algorithms as the number of classes increases.

Moving on, research on reidentification and other types of “leakage” of information reveals a problem with the way data-mining contests are run. Most commonly, some elements of a dataset are withheld, and contest participants are required to predict these unknown values. Reidentification allows contestants to bypass the prediction process altogether by simply “looking up” the true values in the original data! For an example and more elaborate explanation, see this post on how my collaborators and I won the Kaggle social network challenge. Demonstrations of information leakage have spurred research on how to design contests without such flaws.

If reidentification can cause leakage and make things messy, it can also clean things up. In a general form, reidentification is about connecting common entities across two different databases. Quite often in real-world datasets there is no unique identifier, or it is missing or erroneous. Just about every programmer who does interesting things with data has dealt with this problem at some point. In the research world, William Winkler of the U.S. Census Bureau has authored a survey of “record linkage”, covering well over a hundred papers. I’m not saying that the high-powered machinery of reidentification is necessary here, but the principles are certainly useful.

In my brief life as an entrepreneur, I utilized just such an algorithm for the back-end of the web application that my co-founders and I built. The task in question was to link a (musical) artist profile from last.fm to the corresponding Wikipedia article based on discography information (linking by name alone fails in any number of interesting ways.) On another occasion, for the theory of computing blog aggregator that I run, I wrote code to link authors of papers uploaded to arXiv to their DBLP profiles based on the list of coauthors.
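
For the curious, here is a minimal sketch of one such linkage heuristic (not the code I actually used; the data and threshold below are made up, and real record linkage needs far more care with normalization and blocking): score candidate matches by set overlap (album titles in one case, coauthor names in the other) and accept the best match only if it clears a threshold.

```python
def jaccard(a, b):
    """Set-overlap similarity in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def link(record_items, candidates, threshold=0.3):
    """Return the candidate whose item set best matches record_items,
    or None if nothing clears the threshold."""
    best = max(candidates, key=lambda name: jaccard(record_items, candidates[name]))
    return best if jaccard(record_items, candidates[best]) >= threshold else None

# Hypothetical example: link an artist profile to an encyclopedia entry
# by discography overlap rather than by name.
profile_albums = {"Album One", "Album Two", "Live 1999"}
wiki_entries = {
    "Artist A": {"Album One", "Album Two", "Greatest Hits"},
    "Artist B": {"Other Record", "Different Album"},
}
print(link(profile_albums, wiki_entries))  # -> "Artist A"
```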

There is more, but I’ll stop here. The point is that these algorithms are everywhere.

If the algorithms are the key, why perform demonstrations of privacy failures? To put it simply, algorithms can’t be studied in a vacuum; we need concrete cases to test how well they work. But it’s more complicated than that. First, as I mentioned earlier, keeping the privacy conversation intellectually honest is one of my motivations, and these demonstrations help. Second, in the majority of cases, my collaborators and I have chosen to examine pairs of datasets that were already public, and so our work did not uncover the identities of previously anonymous subjects, but merely helped to establish that this could happen in other instances of “anonymized” data sharing.

Third, and I consider this quite unfortunate, reidentification results are taken much more seriously if researchers do uncover identities, which naturally gives us an incentive to do so. I’ve seen this in my own work — the Netflix paper is the most straightforward and arguably the least scientifically interesting reidentification result that I’ve co-authored, and yet it received by far the most attention, all because it was carried out on an actual dataset published by a company rather than demonstrated hypothetically.

My primary focus on the fundamental research aspect of reidentification guides my work in an important way. There are many, many potential targets for reidentification — despite all the research, data holders often (rationally) act like nothing has changed and continue to make data releases with “PII” removed. So which dataset should I pick to work on?

Focusing on the algorithms makes it a lot easier. One of my criteria for picking a reidentification question to work on is that it must lead to a new algorithm. I’m not at all saying that all reidentification researchers should do this, but for me it’s a good way to maximize the impact I can hope for from my research, while minimizing controversies about the privacy of the subjects in the datasets I study.

I hope this post has given you some insight into my goals, motivations, and research outputs, and an appreciation of the fact that there is more to reidentification algorithms than their application to breaching privacy. It will be useful to keep this fact in the back of our minds as we continue the conversation on the ethics of reidentification.

Thanks to Vitaly Shmatikov for reviewing a draft.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter or Google+.

May 27, 2013 at 6:16 am Leave a comment

Privacy technologies course roundup: Wiki, student projects, HotPETs

In earlier posts about the privacy technologies course I taught at Princeton during Fall 2012, I described how I refuted privacy myths, and presented an annotated syllabus. In this concluding post I will offer some additional tidbits about the course.

Wiki. I referred to a Wiki a few times in my earlier post, and you might wonder what that was about. The course included an online Wiki discussion component, and this was in fact the centerpiece. Students were required to participate in the online discussion of the day’s readings before coming to class. The in-class discussion would use the Wiki discussion as a starting point.

The advantages of this approach are: (1) it gives the instructor a great degree of control in shaping the discussion of each paper; (2) the instructor can more closely monitor individual students’ progress; and (3) class discussion can focus on particularly tricky and/or contentious points, instead of rehashing the obvious.

Student projects. Students picked a variety of final projects for the class, and on the whole exceeded my expectations. Here are two very different projects, in brief.

Nora Taranto, a History of Science major, wrote a policy paper about the privacy implications of the shift to electronic medical records. Nora writes:

I wrote a paper about the privacy implications of patient-care institutions (in the United States) using electronic medical record (EMR) systems more and more frequently.  This topic had particular relevance given the huge number of privacy breaches that occurred in 2012 alone.  Meanwhile, there is a simultaneous criticism coming from care providers about the usability of such EMR systems.  As such, many different communities—in the information privacy sphere, in the medical community, in the general public, etc.—have many different things to say.  But, given the several privacy breaches that occurred within a couple of weeks in April 2012 and together implicated over a million individuals, concerns have been raised in particular about how secure EMR systems are.  These concerns are especially worrisome given the federal government’s push for their adoption nationwide beginning in 2009, when the American Recovery and Reinvestment Act granted funds to hospitals explicitly for the purpose of EMR implementation.

So I looked into the benefits and costs of such systems, with a particular slant towards the privacy benefits/costs.  Overall, these systems do have a number of protective mechanisms at their disposal, some preventative and others reactive.  While these protective barriers are all necessary, they are not sufficient to guarantee the patient his or her privacy rights in the modern day.  These protective mechanisms—authentication schemes, encryption, and data logs/anomaly-detection—need to be expanded and further developed to provide an adequate amount of protection for personal health information.  While the government is, at the moment, encouraging the adoption of EMR systems for maximal penetration, medical institutions ought to use caution in considering which systems to implement and ought to hold themselves to a higher standard.  Moreover, greater regulatory oversight of EMR systems on the market would help institutions maintain this cautious approach.

Abu Saparov, Ajay Roopakalu, and Rafi Shamim, also undergraduates, designed and implemented an alternative to centralized key distribution. They write:

Our project for the course was to create and implement a decentralized public key distribution protocol and show how it could be used. One of the initial goals of our project was to experience first-hand some of the things that made the design of a usable and useful privacy application so hard. Early on in the process, we decided to try to build some type of application that used cryptography to enhance the privacy of communication with friends. Some of the reasons that we chose this general topic were the fact that all of us had experience with network programming and that we thought some of the things that cryptography can achieve are uniquely cool. We were also somewhat motivated by the prospect of using our application to talk with each other and our other friends after we graduate. We eventually gravitated towards two ideas: (1) a completely peer-to-peer chat system that is encrypted from end-to-end, and (2) a “dumb” social network that allows users to share posts that only their friends (and not the server) can see. During the semester, our focus shifted to designing and implementing the underlying key distribution mechanism upon which these two systems could be built.

When we began to flesh out the designs for our two ideas, we realized that the act of retrieving a friend’s public cryptographic keys was the first challenge to solve. Certificate authorities are the most common way to obtain public keys, but require a large degree of trust to be placed in a small number of authorities. Web of Trust is another option, and is completely decentralized, but often proves difficult in practice because of the need for manual key signing. We decided to make our own decentralized protocol that exposes an easily usable API for clients to use in order to obtain public keys. Our protocol defines an overlay network that features regular nodes, as well as supernodes that are able to prove their trustworthiness, although the details of this are controllable through a policy delegate. The idea is for supernodes to share the task of remembering and verifying public keys through a majority vote of neighboring supernodes. Users running other nodes can ask the supernodes for a friend’s public key. In order to trick someone, an adversary would have to control over half of the supernodes from which a user requested a key. Our decision to go with an overlay network created a variety of issues such as synchronizing information between supernodes, being able to detect and report malicious supernodes, and getting new nodes incorporated into the network. These and the countless other design problems we faced definitely allowed us to appreciate the difficulty of writing a privacy application, but unfortunately, we were not fully able to test every element of our protocol and its implementation. After creating the protocol, we implemented small, bare-bones applications for our initial ideas of peer-to-peer chat and an encrypted social network.
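
To illustrate the majority-vote idea at the heart of their design, here is a highly simplified sketch (the data structures and threshold are my own for illustration, not the students' actual protocol): a client asks several supernodes for a friend's public key and accepts an answer only if a strict majority of the responses agree.

```python
from collections import Counter

def lookup_key(user_id, supernodes):
    """Ask every reachable supernode for user_id's public key and accept
    the answer only if a strict majority of responses agree."""
    responses = [node.get(user_id) for node in supernodes]
    responses = [r for r in responses if r is not None]
    if not responses:
        return None
    key, votes = Counter(responses).most_common(1)[0]
    return key if votes > len(responses) / 2 else None

# Hypothetical network of three supernodes, one of which is malicious.
supernodes = [
    {"alice": "PUBKEY_A"},
    {"alice": "PUBKEY_A"},
    {"alice": "PUBKEY_EVIL"},   # compromised node returns a forged key
]
print(lookup_key("alice", supernodes))  # -> "PUBKEY_A"
```

The real design, as the students note, also has to handle supernode discovery, synchronization, and the detection and reporting of malicious supernodes, all of which this sketch ignores.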

Master’s students Chris Eubank, Marcela Melara, and Diego Perez-Botero did a project on mobile web tracking which, with some further work, turned into a research paper that Chris will speak about at W2SP tomorrow.

Finally, I’m happy to say that I will be discussing the syllabus and my experiences teaching this class at HotPETs this year, in Bloomington, IN, in July.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter or Google+.

May 23, 2013 at 2:14 pm 1 comment

What Happened to the Crypto Dream? Now in a new and improved paper form!

Last October I gave a talk titled “What Happened to the Crypto Dream?” where I looked at why crypto seems to have done little for personal privacy. The reaction from the audience (physical and online) was quite encouraging — not that everyone agreed, but they seemed to find it thought provoking — and several people asked me if I’d turn it into a paper. So when Prof. Alessandro Acquisti invited me to contribute an essay to the “On the Horizon” column in IEEE S&P magazine, I jumped at the chance, and suggested this topic.

Thanks to some fantastic feedback from colleagues and many improvements to the prose by the editors, I’m happy with how the essay has turned out. Here it is in two parts: Part 1, Part 2.

While I’m not saying anything earth shaking, I do make a somewhat nuanced argument — I distinguish between “crypto for security” and “crypto for privacy,” and further subdivide the latter into a spectrum between what I call “Cypherpunk Crypto” and “Pragmatic Crypto.” I identify different practical impediments that apply to those two flavors (in the latter case, a complex of related factors), and lay out a few avenues for action that can help privacy-enhancing crypto move in a direction more relevant to practice.

I’m aware that this is a contentious topic, especially since some people feel that the time is ripe for a resurgence of the cypherpunk vision. I’m happy to hear your reactions.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter or Google+.

April 29, 2013 at 12:06 pm Leave a comment

Privacy technologies: An annotated syllabus

Last semester I taught a course on privacy technologies here at Princeton. Earlier I discussed how I refuted privacy myths that students brought into class. In this post I’d like to discuss the contents of the course. I hope it will be useful to other instructors who are interested in teaching this topic as well as for students undertaking self-study of privacy technologies. Beware: this post is quite long.

What should be taught in a class on privacy technologies? Before we answer that, let’s take a step back and ask, how does one go about figuring out what should be taught in any class?

I’ve seen two approaches. The traditional, default, overwhelmingly common approach is to think of it in terms of “covering content” without much consideration of what students are getting out of it. The content that’s deemed relevant is often determined by what the fashionable research areas happen to be, by historical accident, or by some combination thereof.

A contrasting approach, promoted by authors like Bain, applies a laser focus on the skills that students will acquire and how they will apply them later in life. On teaching orientation day at Princeton, our instructor, who clearly subscribed to this approach, had each professor describe what students would do in the class they were teaching, then wrote down only the verbs from these descriptions. The point was that our thinking had to be centered around skills that students would take home.

I prefer a middle ground. It should be apparent from my description of the traditional approach above that I’m not a fan. On the other hand, I have to wonder what skills our teaching coach would have suggested for a course on cosmology — avoiding falling into black holes? Alright, I’m exaggerating to make a point. The verbs in question are words like “synthesize” and “evaluate,” so there would be no particular difficulty in applying them to cosmology. But my point is that in a cosmology course, I’m not sure the instructor should start from these verbs.

Sometimes we want students to be exposed to knowledge primarily because it is beautiful, and being able to perceive that beauty inspires us, instills in us a love of further learning, and I dare say satisfies a fundamental need. To me a lot of the crypto “magic” that goes into privacy technologies falls into that category (not that it doesn’t have practical applications).

With that caveat, however, I agree with the emphasis on skills and life impact. I thought of my students primarily as developers of privacy technologies (and more generally, of technological systems that incorporate privacy considerations), but also as users and scholars of privacy technologies.

I organized the course into sections: a short introductory section followed by five sections that alternated in the level of math/technical depth. Every time we studied a technology, we also discussed its social/economic/political aspects. I had a great deal of discretion in guiding where the conversation around the papers went by giving students questions/prompts on the class Wiki. Let us now jump in. The italicized text is from the course page; the rest is my annotation.

0. Intro

Goals of this section: Why are we here? Who cares about privacy? What might the future look like?

In addition to helping flesh out the foundational assumptions of this course that I discussed in the previous post, pairing these opposing views with each other helped make the point that there are few absolutes in this class, that privacy scholars may disagree with each other, and that the instructor doesn’t necessarily agree with the viewpoints in the assigned reading, much less expect students to.

1. Cryptography: power and limitations

Goals. Travel back in time to the 80s and early 90s, understand the often-euphoric vision that many crypto pioneers and hobbyists had for the impact it would have. Understand how cryptographic building blocks were thought to be able to support this restructuring of society. Reason about why it didn’t happen.

Understand the motivations and mathematical underpinnings of the modern research on privacy-preserving computations. Experiment with various encryption tools, discover usability problems and other limitations of crypto.

I think the Chaum paper is a phenomenal and underutilized resource for teaching. My goal was to really immerse students in an alternate reality where the legal underpinnings of commerce were replaced by cryptography, much as Chaum envisioned (and even going beyond that). I created a couple of e-commerce scenarios for Wiki discussion and had them reason about how various functions would be accomplished.
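
To give a flavor of the cryptographic machinery Chaum’s alternate reality rests on, here is a toy sketch of an RSA blind signature, the primitive behind his untraceable e-cash: the bank signs a coin without seeing it, and the unblinded signature still verifies. This is my own illustration with textbook RSA and tiny parameters, not something assigned in the course, and it is not secure as written.

    # Toy RSA blind signature (the primitive behind Chaum's e-cash).
    # Textbook RSA, tiny primes, no padding: for intuition only, not secure.
    import hashlib
    import secrets
    from math import gcd

    p, q = 61, 53
    n = p * q                           # bank's RSA modulus
    e = 17                              # public exponent
    d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (Python 3.8+)

    def h(msg: bytes) -> int:
        return int.from_bytes(hashlib.sha256(msg).digest(), "big") % n

    # The user blinds the coin so the bank signs it without seeing it.
    m = h(b"one digital coin, serial 12345")
    while True:
        r = secrets.randbelow(n - 2) + 2
        if gcd(r, n) == 1:
            break
    blinded = (m * pow(r, e, n)) % n

    blind_sig = pow(blinded, d, n)      # bank signs the blinded value

    # The user unblinds; the result is a valid signature on the coin itself.
    sig = (blind_sig * pow(r, -1, n)) % n
    assert pow(sig, e, n) == m          # verifiable with the bank's public key alone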

My own views on this topic are set forth in this talk (now a paper; coming soon). In general I aimed to shield students from my viewpoints, and saw my role as helping them discover (and be able to defend) their own. At least in this instance I succeeded. Some students took the position that the cypherpunk dream is just around the corner.

  • The ‘Garbled Circuit Protocol’ (Yao’s theorem on secure two-party computation) and its implications (lecture)

This is one of those topics that sadly suffer from a lack of good expository material, so I lectured on it instead.
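
For readers who want at least a taste of the construction, here is a toy sketch of a single garbled AND gate, the basic building block in Yao’s protocol. It uses authenticated encryption (Fernet from the Python cryptography package) so the evaluator can tell which table row decrypts correctly; real garbling schemes are more careful and far more efficient. This is my own simplification for intuition, not the lecture’s presentation.

    # Toy garbled AND gate. The garbler picks a random label for each (wire, bit)
    # pair and encrypts the correct output label under the two matching input
    # labels; the evaluator, holding one label per input wire, learns only the
    # output label. A gross simplification of Yao's protocol, for intuition only.
    import random
    from cryptography.fernet import Fernet, InvalidToken

    labels = {(w, b): Fernet.generate_key() for w in ("a", "b", "out") for b in (0, 1)}

    table = []
    for x in (0, 1):
        for y in (0, 1):
            inner = Fernet(labels[("b", y)]).encrypt(labels[("out", x & y)])
            table.append(Fernet(labels[("a", x)]).encrypt(inner))
    random.shuffle(table)  # hide which row corresponds to which inputs

    def evaluate(label_a, label_b):
        """Try every row; authenticated encryption makes wrong rows fail loudly."""
        for row in table:
            try:
                return Fernet(label_b).decrypt(Fernet(label_a).decrypt(row))
            except InvalidToken:
                continue
        raise ValueError("no row decrypted")

    # E.g., inputs a=1, b=0: the evaluator recovers the label for out = 1 AND 0 = 0.
    assert evaluate(labels[("a", 1)], labels[("b", 0)]) == labels[("out", 0)]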

One of the exercises here was to install and use various crypto tools and rediscover the usability problems. The difficulties were even worse than I’d anticipated.

2. Data collection and data mining, economics of personal data, behavioral economics of privacy

Goals. Jump forward in time to the present day and immerse ourselves in the world of ubiquitous data collection and surveillance. Discover what kinds of data collection and data mining are going on, and why. Discuss how and why the conversation has shifted from Government surveillance to data collection by private companies in the last 20 years.

Theme: first-party data collection.

Theme: third-party data collection.

Theme: why companies act the way they do.

Theme: why people act the way they do.

This section is rather self-explanatory. After the math-y flavor of the first section, this one has a good amount of economics, behavioral economics, and policy. One of the thought exercises was to project current trends into the future and imagine what ubiquitous tracking might lead to in five or ten years.

3. Anonymity and De-anonymization

Important note: communications anonymity (e.g., Tor) and data anonymity/de-anonymization (e.g., identifying people in digital databases) are technically very different, but we will discuss them together because they raise some of the same ethical questions. Also, Bitcoin lies somewhere in between the two.

Tor and Bitcoin (especially the latter) were the hardest but also the most rewarding parts of the class, both for them and for me. Together they took up 4 classes. Bitcoin is extremely challenging to teach because it is technically intricate, the ecosystem is rapidly changing, and a lot of the information is in random blog/forum posts.

In a way, I was betting on Bitcoin by deciding to teach it — if it had died with a whimper, their knowledge of it would be much less relevant. In general I think instructors should choose to make such bets more often; most curricula are very conservative. I’m glad I did.

It was a challenge to figure out which deanonymization paper to assign. I went with the DNA one because I wanted them to see that deanonymization isn’t a fact about data, but a fact about the world. Another thing I liked about this paper is that students would have to extract the not-too-complex statistical methodology from the bioinformatics discussion in which it is embedded. This didn’t go as well as I’d hoped.

I’ve co-authored a few deanonymization papers, but they’re not very well written and/or are poorly suited for pedagogical purposes. The Kaggle paper is one exception, which I made optional.

This is another pair of papers with opposing views. Since the latter paper was optional, and knowing that most students wouldn’t have read it, I used the Wiki prompts to bring up many of the issues its author raises.

4. Lightweight Privacy Technologies and New Approaches to Information Privacy

While cryptography is the mechanism of choice for cypherpunk privacy and anonymity tools like Tor, it is too heavy a weapon in other contexts like social networking. In the latter context, it’s not so much users deploying privacy tools to protect themselves against all-powerful adversaries but rather a service provider attempting to cater to a more nuanced understanding of privacy that users bring to the system. The goal of this section is to consider a diverse spectrum of ideas applicable to this latter scenario that have been proposed in recent years in the fields of CS, HCI, law, and more. The technologies here are “lightweight” in comparison to cryptographic tools like Tor.

5. Purely technological approaches revisited

This final section doesn’t have a coherent theme (and I admitted as much in class). My goal with the first two papers was to contrast a privacy problem which seems amenable to a purely or primarily technological formulation and solution (statistical queries over databases of sensitive personal information) with one where such attempts have been less successful (the decentralized, own-your-data approach to social networking and e-commerce).

Differential privacy is another topic that is sorely lacking in expository material, especially from the point of view of students who’ve never done crypto before. So this was again a lecture.
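
For readers in the same boat, here is a minimal sketch of the Laplace mechanism applied to a counting query, the standard first example. The dataset and parameters are made up, and the sketch is my own illustration rather than material from the lecture.

    # Laplace mechanism for a counting query. A count has sensitivity 1 (adding or
    # removing one person changes it by at most 1), so Laplace noise with scale
    # 1/epsilon makes this single query epsilon-differentially private.
    import numpy as np

    def dp_count(records, predicate, epsilon):
        """Return a noisy count of the records satisfying the predicate."""
        true_count = sum(1 for r in records if predicate(r))
        sensitivity = 1.0
        return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

    # Hypothetical example: how many people in the database are over 40?
    ages = [23, 45, 31, 67, 52, 38, 29, 71]
    print(dp_count(ages, lambda age: age > 40, epsilon=0.5))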

These two essays aren’t directly related to privacy. One of the recurring threads in this course is the debate between purely technological and legal or other approaches to privacy; the theme here is to generalize it to a context broader than privacy. The Barlow essay asserts the exceptionalism of Cyberspace as an unregulable medium, whereas the Grimmelmann paper provides a much more nuanced view of the relationship between the law and new technological frontiers.

I’m making available the entire set of Wiki discussion prompts for the class (HTML/PDF). I consider this integral to the syllabus, for it shapes the discussion very significantly. I really hope other instructors and students find this useful as a teaching/study guide. For reference, each set of prompts (one set per class) took me about three hours to write on average.

There are many more things I want to share about this class: the major take-home ideas, the rationale for the Wiki discussion format, the feedback I got from students, a description of a couple of student projects, some thoughts on the sociology of different communities studying privacy and how that impacted the class, and finally, links to similar courses that are being taught elsewhere. I’ll probably close this series with a round-up post including as many of the above topics as I can.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter or Google+.

April 16, 2013 at 4:02 am 2 comments


About 33bits.org

I’m an associate professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
