Posts tagged ‘web’

Unlikely Outcomes? A Distributed Discussion on Decentralized Personal Data Architectures

Recent years have seen a mushrooming of decentralized social networks, personal data stores and other alternatives to the current paradigm of centralized services. In our academic paper last year, A Critical Look at Decentralized Personal Data Architectures, my coauthors and I challenged the feasibility and desirability of these alternatives (I also gave a talk about this work). Based on the feedback, we realized it would be useful to spell out some of our assumptions and implicit viewpoints, add context to our work, clarify points that were unclear, and engage with our critics on some of the more contentious claims.

We found the perfect opportunity to do this through an invitation from the Unlike Us Reader, produced by the Institute of Network Cultures. It’s a magazine run by a humanities-oriented group of people with a long-standing interest in digital culture, though they also attract some politically oriented developers. The Unlike Us conference, from which this edited volume stems, is also very interesting. [1]

Three of the five original authors — Solon, Vincent and I — teamed up with the inimitable Seda Gürses for an interview-style conversation (PDF). Seda is unique among privacy researchers — one of her interests is to understand and reconcile the often maddeningly divergent viewpoints of the different communities that study privacy, so she was the ideal person to play the role of interlocutor. Seda solicited feedback from about two dozen people in the hobbyist, activist and academic communities, and synthesized the responses into major themes. Then the three of us took turns responding to the prompts, which Solon, with Seda’s help, organized into a coherent whole. A majority of the commenters consented to making their feedback public, and Seda has collected the discussion into an online appendix.

This was an unusual opportunity, and I’m grateful to everyone who made it happen, particularly Seda and Solon, who put in an immense amount of work. I thoroughly enjoyed taking part. Research proceeds at such a pace that we rarely have the opportunity to look back and cogitate about the process; when we do, we’re often surprised by what we find. For example, here’s something I noted with amusement in one of my responses:

My interest in decentralized social networking apparently dates to 2009, as I just discovered by digging through my archives. I’d signed up to give a talk on pitfalls of social networking privacy at a Stanford workshop, and while preparing for it I discovered the rich academic literature and the various hobbyist efforts in the decentralized model. My slides from that talk seem to anticipate several of the points we made about decentralized social networking in the paper (albeit in bullet-point form), along with the conclusion that they were “unlikely to disrupt walled gardens.” Funnily enough, I’d completely forgotten about having given this talk when we were writing the paper.

I would recommend reading this text as a companion to our original paper. Read it for extra context and clarifications, a discussion of controversial points, and as a way of stimulating thinking about the future prospects of alternative architectures. It may also be an interesting read as an example of how people writing an article together can have different views, and as a bit of a behind-the-scenes look at the research process.

[1] In particular, the latest edition of the conference that just concluded had a panel titled “Are you distributed? The Federated Web Show” moderated by Seda, with Vincent as one of the participants. It touched upon many of the same themes as our work.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter or Google+.

March 27, 2013 at 7:44 am 1 comment

A Critical Look at Decentralized Personal Data Architectures

I have a new paper with the above title, currently under peer review, with Vincent Toubiana, Solon Barocas, Helen Nissenbaum and Dan Boneh (the Adnostic gang). We argue that distributed social networking, personal data stores, vendor relationship management, etc. — movements that we see as closely related in spirit, and which we collectively term “decentralized personal data architectures” — aren’t quite the panacea that they’ve been made out to be.

The paper is only a synopsis of our work so far — in our notes we have over 80 projects, papers and proposals that we’ve studied, so we intend to follow up with a more complete analysis. For now, our goal is to kick off a discussion and give the community something to think about. The paper was a lot of fun to write, and we hope you will enjoy reading it. We recognize that many of our views and conclusions may be controversial, and we welcome comments.


While the Internet was conceived as a decentralized network, the most widely used web applications today tend toward centralization. Control increasingly rests with centralized service providers who, as a consequence, have also amassed unprecedented amounts of data about the behaviors and personalities of individuals.

Developers, regulators, and consumer advocates have looked to alternative decentralized architectures as the natural response to threats posed by these centralized services.  The result has been a great variety of solutions that include personal data stores (PDS), infomediaries, Vendor Relationship Management (VRM) systems, and federated and distributed social networks.  And yet, for all these efforts, decentralized personal data architectures have seen little adoption.

This position paper attempts to account for these failures, challenging the accepted wisdom in the web community on the feasibility and desirability of these approaches. We start with a historical discussion of the development of various categories of decentralized personal data architectures. Then we survey the main ideas to illustrate the common themes among these efforts. We tease apart the design characteristics of these systems from the social values that they (are intended to) promote. We use this understanding to point out numerous drawbacks of the decentralization paradigm, some inherent and others incidental. We end with recommendations for designers of these systems for working towards goals that are achievable, but perhaps more limited in scope and ambition.


February 21, 2012 at 8:27 am 3 comments

“You Might Also Like:” Privacy Risks of Collaborative Filtering

I have a new paper titled “You Might Also Like:” Privacy Risks of Collaborative Filtering with Joe Calandrino, Ann Kilzer, Ed Felten and Vitaly Shmatikov. We developed new “statistical inference” techniques and used them to show how the public outputs of online recommender systems, such as the “You Might Also Like” lists you see on many websites, can reveal individual purchases and preferences. Joe spoke about it at the IEEE S&P conference at Oakland earlier today.

Background: inference and statistical inference. The paper is about techniques for inference. At its core, inference is a simple concept: deducing that some event has occurred based on its effect on other observable events or objects, often seemingly unrelated ones. Think Sherlock Holmes, whether something simple such as the idea of a smoking gun, now so well known that it’s a cliché, or something more subtle like the curious incident of the dog in the night time.

Today, inference has evolved a great deal, and in our data-rich world, inference often means statistical inference. Detection of extrasolar planets is a good example of making deductions from the faintest clues: A planet orbiting a star makes the star wobble slightly, which affects the velocity of the star with respect to the Earth. And this relative velocity can be deduced from the displacement in the parent star’s spectral lines due to the Doppler effect, thus inferring the existence of a planet. Crazy!

Web privacy. But back to the paper: what we did was to develop and apply inference techniques in the web context, specifically recommender systems, in a way that no one had thought of before. As you may have noticed, just about every website publicly shows relationships between related items—products, videos, books, news articles, etc.— and these relationships are derived from purchases or views, which are private information. What if the public listings could be reverse engineered, so that we can infer a user’s purchases from them? As the abstract says:

Many commercial websites use recommender systems to help customers locate products and content. Modern recommenders are based on collaborative filtering: they use patterns learned from users’ behavior to make recommendations, usually in the form of related-items lists. The scale and complexity of these systems, along with the fact that their outputs reveal only relationships between items (as opposed to information about users), may suggest that they pose no meaningful privacy risk.

In this paper, we develop algorithms which take a moderate amount of auxiliary information about a customer and infer this customer’s transactions from temporal changes in the public outputs of a recommender system. Our inference attacks are passive and can be carried out by any Internet user. We evaluate their feasibility using public data from popular websites Hunch, LibraryThing, and Amazon.

The screenshot below shows an example of a related-items list on Amazon. There are up to 100 items in such lists.

Consider a user Alice who’s made numerous purchases, some of which she has reviewed publicly. Now she makes a new purchase which she considers sensitive. But this new item, because of her purchasing it, has a nonzero probability of entering the “related items” list of each of the items she has purchased in the past, including the ones she has reviewed publicly. And even if it is already in the related-items list of some of those items, it might improve its rank on those lists because of her purchase. By aggregating dozens or hundreds of these observations, the attacker has a chance of inferring that Alice purchased something, as well as the identity of the item she purchased.

It’s a subtle technique, and the paper has more details than you can shake a stick at if you want to know more.
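To make the aggregation idea concrete, here is a minimal sketch in Python. It is a simplified illustration of the Alice example, not the algorithm from the paper: the function name, the `min_support` threshold and the toy catalog are all assumptions made up for this example. It compares two snapshots of the related-items lists for items the target is known to own, and counts how many of those lists a candidate item newly entered or rose in.

```python
from collections import Counter

def infer_new_items(before, after, aux_items, min_support=2):
    """Score candidates by how many auxiliary items' related-items lists
    they entered, or moved up in, between two snapshots.

    before/after: dict mapping item -> ordered related-items list (best first).
    aux_items: items the target is publicly known to have bought.
    min_support: hypothetical threshold on corroborating lists.
    """
    support = Counter()
    for aux in aux_items:
        old_rank = {item: i for i, item in enumerate(before.get(aux, []))}
        for rank, item in enumerate(after.get(aux, [])):
            # Absent before counts as "entered"; a smaller index counts as "rose".
            if item not in old_rank or rank < old_rank[item]:
                support[item] += 1
    # Keep only candidates whose movement shows up in several lists.
    return [(item, n) for item, n in support.most_common() if n >= min_support]

# Toy snapshots (invented data): Alice publicly reviewed these three items.
before = {
    "cookbook": ["blender", "apron", "whisk"],
    "tent": ["sleeping bag", "lantern", "stove"],
    "novel": ["bookmark", "reading lamp"],
}
after = {
    "cookbook": ["blender", "self-help book", "apron"],
    "tent": ["self-help book", "sleeping bag", "lantern"],
    "novel": ["bookmark", "reading lamp", "self-help book"],
}

result = infer_new_items(before, after, ["cookbook", "tent", "novel"])
print(result)
```

A single list changing proves little, since lists shift for many reasons. The signal comes from the same item moving in several lists that have nothing in common except the target user, which is why the sketch requires a minimum number of corroborating lists.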

We evaluated the attacks we developed against several websites of a diverse nature. Numerically, our best results are against Hunch, a recommendation and personalization website. There is a tradeoff between the number of inferences and their accuracy. When optimized for accuracy, our algorithm inferred a third of the test users’ secret answers to Hunch questions with no error. Conversely, if asked to predict the secret answer to every secret question, the algorithm had an accuracy of around 80%.
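The tradeoff can be illustrated with a small sketch, assuming each inference comes with a confidence score and the attacker only commits to guesses above a threshold. The function name and the numbers below are invented for illustration and are not results from the paper.

```python
def yield_and_accuracy(predictions, threshold):
    """predictions: list of (confidence, is_correct) pairs.
    Returns (fraction of cases answered, accuracy among those answered)."""
    answered = [ok for conf, ok in predictions if conf >= threshold]
    if not answered:
        return 0.0, None
    return len(answered) / len(predictions), sum(answered) / len(answered)

# Made-up toy data: (confidence score, whether the guess was right).
toy = [(0.95, True), (0.90, True), (0.70, True),
       (0.60, False), (0.40, True), (0.30, False)]

strict = yield_and_accuracy(toy, 0.9)   # answer only when very confident
lenient = yield_and_accuracy(toy, 0.0)  # answer every question
print(strict, lenient)
```

With a strict threshold the attacker answers fewer questions but makes fewer mistakes; with no threshold every question gets an answer at a lower accuracy.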

Impact. It is important to note that we’re not claiming that these sites have serious flaws, or even, in most cases, that they should be doing anything different. Hunch had an API that provided exact numerical correlations between pairs of items; on the other sites, our attacks worked only on a small proportion of users, although that is sufficient to demonstrate the concept. (Hunch has since eliminated this feature of the API, for reasons unrelated to our research.) We also found that users of larger sites are much safer, because the statistical aggregates are computed from a larger set of users.
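A rough way to see why size helps, sketched under assumptions: suppose item similarity is the cosine similarity of binary purchase vectors (the sites’ actual formulas are not public, so this is only a stand-in), and suppose each item has n buyers, a tenth of whom bought both. The code below measures how far one extra purchase moves the similarity score.

```python
import math

def cosine_similarity(co_buyers, buyers_a, buyers_b):
    """Cosine similarity between two items' binary purchase vectors."""
    return co_buyers / math.sqrt(buyers_a * buyers_b)

def shift_from_one_purchase(n):
    """Change in similarity(A, B) when one existing buyer of A also buys B.
    Assumes n buyers per item, 10% of whom bought both (toy numbers)."""
    co = n // 10
    before = cosine_similarity(co, n, n)
    after = cosine_similarity(co + 1, n, n + 1)
    return after - before

small = shift_from_one_purchase(10)       # a niche item on a small site
large = shift_from_one_purchase(10_000)   # a popular item on a large site
print(small, large)
```

In this toy model the single purchase moves the score far more when the items have few buyers, so it is much more likely to produce a visible change in a related-items list.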

But here’s why we think this paper is important:

  • Our attack applies to a wide variety of sites—essentially every site with an online catalog of some sort. While we discuss various ways to mitigate the attack in the paper, there is no bulletproof “fix.”
  • It undermines the widely accepted dichotomy between “personally identifiable” individual records and “safe,” large-scale, aggregate statistics. Furthermore, it demonstrates that the dynamics of aggregate outputs (i.e., their variation with time) constitute a new vector for privacy breaches. Dynamic behavior of high-dimensional aggregates like item similarity lists falls beyond the protections offered by any existing privacy technology, including differential privacy.
  • It underscores the fact that modern systems have vast “surfaces” for attacks on privacy, making it difficult to protect fine-grained information about their users. Unintentional leaks of private information are akin to side-channel attacks: it is very hard to enumerate all aspects of the system’s publicly observable behavior which may reveal information about individual users.

That last point is especially interesting to me. We’re leaving digital breadcrumbs online all the time, whether we like it or not. And while algorithms to piece these trails together might seem sophisticated today, they will probably look mundane in a decade or two if history is any indication. The conversation around privacy has always centered around the assumption that we can build technological tools to give users—at least informed users—control over what they reveal about themselves, but our work suggests that there might be fundamental limits to those tools.

See also: Joe Calandrino’s post about this paper.


May 24, 2011 at 6:11 pm 4 comments


I’m an associate professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
