Posts tagged ‘conference’

Unlikely Outcomes? A Distributed Discussion on Decentralized Personal Data Architectures

In recent years there has been a mushrooming of decentralized social networks, personal data stores and other such alternatives to the current paradigm of centralized services. In our academic paper A Critical Look at Decentralized Personal Data Architectures, published last year, my coauthors and I challenged the feasibility and desirability of these alternatives (I also gave a talk about this work). Based on the feedback, we realized it would be useful to explicate some of our assumptions and implicit viewpoints, add context to our work, clarify some points that were unclear, and engage with our critics on some of the more contentious claims.

We found the perfect opportunity to do this via an invitation to contribute to the Unlike Us Reader, produced by the Institute of Network Cultures — a publication run by a humanities-oriented group with a long-standing interest in digital culture, one that also attracts politically oriented developers. The Unlike Us conference, from which this edited volume stems, is also very interesting. [1]

Three of the five original authors — Solon, Vincent and I — teamed up with the inimitable Seda Gürses for an interview-style conversation (PDF). Seda is unique among privacy researchers — one of her interests is to understand and reconcile the often maddeningly divergent viewpoints of the different communities that study privacy, so she was the ideal person to play the role of interlocutor. Seda solicited feedback from about two dozen people in the hobbyist, activist and academic communities, and synthesized the responses into major themes. Then the three of us took turns responding to the prompts, which Solon, with Seda’s help, organized into a coherent whole. A majority of the commenters consented to making their feedback public, and Seda has collected the discussion into an online appendix.

This was an unusual opportunity, and I’m grateful to everyone who made it happen, particularly Seda and Solon who put in an immense amount of work. My participation was very enjoyable. Research proceeds at such a pace that we rarely have the opportunity to look back and cogitate about the process; when we do, we’re often surprised by what we find. For example, here’s something I noted with amusement in one of my responses:

My interest in decentralized social networking apparently dates to 2009, as I just discovered by digging through my archives. I’d signed up to give a talk on pitfalls of social networking privacy at a Stanford workshop, and while preparing for it I discovered the rich academic literature and the various hobbyist efforts in the decentralized model. My slides from that talk seem to anticipate several of the points we made about decentralized social networking in the paper (albeit in bullet-point form), along with the conclusion that they were “unlikely to disrupt walled gardens.” Funnily enough, I’d completely forgotten about having given this talk when we were writing the paper.

I would recommend reading this text as a companion to our original paper. Read it for extra context and clarifications, a discussion of controversial points, and as a way of stimulating thinking about the future prospects of alternative architectures. It may also be an interesting read as an example of how people writing an article together can have different views, and as a bit of a behind-the-scenes look at the research process.

[1] In particular, the latest edition of the conference that just concluded had a panel titled “Are you distributed? The Federated Web Show” moderated by Seda, with Vincent as one of the participants. It touched upon many of the same themes as our work.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter or Google+.

March 27, 2013 at 7:44 am 1 comment

Web Privacy Measurement: Genesis of a Community

Last week I participated in the Web Privacy Measurement conference at Berkeley. It was a unique event because the community is quite new and this was our very first gathering. The WSJ Data Transparency hackathon is closely related; the Berkeley conference can be thought of as an academic counterpart. So it was doubly fascinating for me — both for the content and because of my interest in the sociology of research communities.

A year ago I explained that there is an information asymmetry when it comes to online privacy, leading to a “market for lemons.” The asymmetry exists for two main reasons: one is that companies don’t disclose what data they collect about you and what they do with it; the second is that even if they do, end users don’t have the capacity to aggregate and process that information and make decisions on the basis of it.

The Web Privacy Measurement community essentially exists to mitigate this asymmetry. The primary goal is to ferret out what is happening to your data online; a secondary one is to make this information useful by pushing for change, building tools for opt-out and control, comparing different players, and so on. The size of the community is an indication of how big the problem has gotten.

Before anyone starts trotting out the old line, “see, the market can solve everything!”, let me point out that the event schedule demonstrates, if anything, the opposite. The majority of what is produced here is intended wholly or partly for the consumption of regulators. Like many others, I found the “What privacy measurement is useful for policymakers?” panel to be the most interesting one. And let’s not forget that most of this is government-funded research to begin with.

This community is very different from the others that I’ve belonged to. The mix of backgrounds is extraordinary: researchers mainly from computing and law, and a small number from other disciplines. Most of the researchers are academics, but a few work for industrial research labs, a couple are independent, and one or two work in government. There were also people from companies that make privacy-focused products/services, lawyers, hobbyists, scholars in the humanities, and ad-industry representatives. Overall, the community has a moderately adversarial relationship with industry, naturally, and a positive relationship with the press, regulators and privacy advocates.

The make-up is somewhat similar to the (looser-knit) group of researchers and developers building decentralized architectures for personal data, a direction that my coauthors and I have taken a skeptical view of in this recent paper. In both cases, the raison d’être of the community is to correct the imbalance of power between corporations and the public. There is even some overlap between the two groups of people.

The big difference is that the decentralization community, typified by Diaspora, mostly tries to mount a direct challenge and overthrow the existing order, whereas our community is content to poke, measure, and expose, and hand over our findings to regulators and other interested parties. So our potential upside is lower — we’re not trying to put a stop to online tracking, for example — but the chance that we’ll succeed in our goals is much higher.

Exciting times. I’m curious to see how things evolve. But this week I’m headed to PLSC, which remains my favorite privacy-related conference.

Thanks to Aleecia McDonald for reviewing a draft.

June 4, 2012 at 8:47 am Leave a comment

Web Crawlers and Privacy: The Need to Reboot Robots.txt

This is a position paper I co-authored with Pete Warden and will be discussing at the upcoming IAB/IETF/W3C Internet privacy workshop this week.

Privacy norms, rules and expectations in the real world go far beyond the “public/private” dichotomy. Yet in the realm of web crawler access control, we are tied to this binary model via the robots.txt allow/deny rules. This position paper describes some of the resulting problems and argues that it is time for a more sophisticated standard.

The problem: privacy of public data. The first author has argued that individuals often expect privacy constraints on data that is publicly accessible on the web. Some examples of such constraints relevant to the web-crawler context are:

  • Data should not be archived beyond a certain period (or at all).
  • Crawling a small number of pages is allowed, but large-scale aggregation is not.
  • “Linkage of personal information to other databases is prohibited.

Currently there is no way to specify such restrictions in a machine-readable form. As a result, sites resort to hacks such as identifying and blocking crawlers whose behavior they don’t like, without clearly defining acceptable behavior. Other sites specify restrictions in the Terms of Service and bring legal action against violators. This is clearly not a viable solution — for operators of web-scale crawlers, manually interpreting and encoding the ToS restrictions of every site is prohibitively expensive.
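As a sketch of what such machine-readable restrictions might look like, here is a minimal parser for a hypothetical extended robots.txt. The directive names (Max-Retention-Days, Max-Pages, No-Linkage) are invented for illustration and are not part of any existing standard — only Allow/Disallow (and a handful of extensions) exist in practice.

```python
# Sketch of a parser for a hypothetical extended robots.txt.
# The retention/aggregation/linkage directives below are invented
# for illustration; they are not part of any existing standard.

HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Max-Retention-Days: 30
Max-Pages: 1000
No-Linkage: true
"""

def parse_extended_robots(text):
    """Parse 'Field: value' lines into per-user-agent policy dicts."""
    policies = {}
    current = None
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            current = policies.setdefault(value, {})
        elif current is not None:
            # A field may repeat (e.g. multiple Disallow lines), so collect lists.
            current.setdefault(field, []).append(value)
    return policies

policy = parse_extended_robots(HYPOTHETICAL_ROBOTS_TXT)["*"]
print(policy["Max-Retention-Days"])  # ['30']
```

A compliant crawler would consult the resulting policy dict before archiving or aggregating pages, in addition to honoring the usual Disallow rules.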

There are two reasons why the problem has become pressing: first, there is an ever-increasing quantity of behavioral data about users that is valuable to marketers — in fact, there is even a black market for this data — and second, crawlers have become very cheap to set up and operate.

The desire for control over web content is by no means limited to user privacy concerns. Publishers concerned about copyright are equally in search of a better mechanism for specifying fine-grained restrictions on the collection, storage and dissemination of web content. Many site owners would also like to limit the acceptable uses of data for competitive reasons.

The solution space. Broadly, there are three levels at which access/usage rules may be specified: site-level, page-level and DOM element-level. Robots.txt is an example of a site-level mechanism, and one possible solution is to extend robots.txt. A disadvantage of this approach, however, is that the file may grow too large, especially on sites with user-generated content that may wish to specify per-user policies.

A page-level mechanism thus sounds much more suitable. While there is already a “robots” META tag, it is part of the robots exclusion standard and has the same limitations on functionality. A different META tag is probably the ideal place for a new standard.
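To illustrate the page-level approach, here is a sketch of a crawler-side reader for META tags using Python’s standard-library HTMLParser. The “crawl-policy” tag name and its directives are hypothetical, invented here for illustration; only the “robots” META tag exists today.

```python
# Sketch: reading page-level crawl policies from META tags.
# The standard "robots" META tag carries only coarse directives
# (noindex, noarchive, nofollow); the "crawl-policy" name and its
# directives are hypothetical, invented for illustration.
from html.parser import HTMLParser

PAGE = """
<html><head>
  <meta name="robots" content="noarchive">
  <meta name="crawl-policy" content="max-retention-days=30, no-linkage">
</head><body>profile page</body></html>
"""

class MetaPolicyParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.policies = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = attrs.get("name", "").lower()
            if name in ("robots", "crawl-policy"):
                # Split the comma-separated directive list into tokens.
                self.policies[name] = [
                    item.strip() for item in attrs.get("content", "").split(",")
                ]

parser = MetaPolicyParser()
parser.feed(PAGE)
print(parser.policies["robots"])        # ['noarchive']
print(parser.policies["crawl-policy"])  # ['max-retention-days=30', 'no-linkage']
```

Since the crawler has to fetch and parse the page anyway, reading one extra META tag adds essentially no overhead, which is part of the appeal of the page-level approach.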

Taking it one step further, tagging at the DOM element-level using microformats to delineate personal information has also been proposed. A possible disadvantage of this approach is the overhead of parsing pages that crawlers will have to incur in order to be compliant.
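A sketch of what element-level tagging could look like: an extractor that drops text inside any element carrying a hypothetical “nocrawl” class, in the spirit of microformats. The class name is invented for illustration; an actual microformat proposal would define its own vocabulary.

```python
# Sketch: element-level opt-out via a hypothetical "nocrawl" class.
# The crawler discards text inside any element carrying that class.
# (Simplified: assumes well-nested tags and ignores void elements.)
from html.parser import HTMLParser

PAGE = """
<html><body>
  <p>Public bio text.</p>
  <div class="vcard nocrawl"><span>alice@example.com</span></div>
  <p>More public text.</p>
</body></html>
"""

class SelectiveExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a nocrawl subtree
        self.text = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.skip_depth or "nocrawl" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.text.append(data.strip())

extractor = SelectiveExtractor()
extractor.feed(PAGE)
print(extractor.text)  # ['Public bio text.', 'More public text.']
```

This also makes the parsing-overhead concern concrete: unlike a site-level or page-level policy, the crawler must walk the full DOM of every page to honor element-level restrictions.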

Conclusion. While the need to move beyond the current robots.txt model is apparent, it is not yet clear what should replace it. The challenge in developing a new standard lies in accommodating the diverse requirements of website operators and precisely defining the semantics of each type of constraint without making it too cumbersome to write a compliant crawler. In parallel with this effort, the development of legal doctrine under which the standard is more easily enforceable is likely to prove invaluable.

December 5, 2010 at 7:54 pm 5 comments

Conferences: The Good, the Bad and the Ugly

I attended a couple of conferences this week that are outside my usual community. Taking stock of and interacting with a new crowd is always a very interesting experience.

The first was the IAPP Practical Privacy Series. The International Association of Privacy Professionals came about because Chief Privacy Officer (and equivalent) positions emerged over the last decade and became ubiquitous. The role can be broadly described as “privacy compliance.” A big part of the initial impetus seems to have been HIPAA compliance, but the IAPP composition has now diversified greatly, because virtually every company is sitting on a pile of consumer data. There was even someone from Starbucks.

I spoke about anonymization. I was trying to answer the question, “I need to share/sell my data and you’re telling me that anonymization is broken. So what should I do?”. It’s always a fun challenge to make computer science accessible to a non-tech audience (largely lawyers in this case). I think I managed reasonably well.

Next was the ACM Computers, Freedom and Privacy conference (which goes on until Friday). As I understand it, CFP was born at a time when “Cyberspace” was analogous to the Wild West, and there was a big need for self-governance and figuring out the emerging norms. The landscape is of course very different now, since the Internet isn’t a band of outlaws anymore but integrated into normal society. The conference has accordingly morphed somewhat, although a lot of the old crowd still definitely comes here.

The quality of the events I attended was highly variable. I checked out the “unconferences,” but only a couple had a meaningful level of participation, and the one I went to seemed to devolve pretty quickly into a penis-waving contest. The session I liked best was a tutorial by Mike Godwin (of Godwin’s law, now counsel for the Wikimedia Foundation) on cyberlaw, mainly First Amendment law.

CFP has parallel sessions. I had a great experience with that format at the Privacy Law Scholars Conference, but this time I’m not so sure — I’m regularly finding conflicts among the sessions I want to attend.

I’m bummed about the fact that there is really no mechanism for me to learn about conferences that are relevant to my interests but are outside my community. (I only learned about the IAPP workshop because I was invited to speak, and CFP purely coincidentally.) Do other researchers face this problem as well? I’m curious to hear about how people keep abreast. I mean, it’s 2010, and this is exactly the kind of problem that social media is supposed to be great at solving, but it’s not really working for me.

June 17, 2010 at 7:24 am 4 comments

Privacy Law Scholars Conference

I had a great time at the Privacy Law Scholars Conference in Berkeley last week, perhaps more so than at any CS conference I’ve attended. A major reason was that there were — get this — no talks. Well, just one keynote speech. The format centered around 75-minute discussion sessions (which seem to be called workshops), with 5 parallel tracks; in each session, you pick which track you want to attend. You are supposed to have read the paper beforehand, and usually everyone in the room has something to say and gets a chance to do so.

This seems way more sensible to me than the format of CS conferences, where there is only one track. I can’t imagine that anyone would genuinely want to attend all the talks. Ideally, for any given talk, half the people should skip it and spend their time networking instead, but in my experience this never happens. Worse, the talks are only 20-30 minutes long; while this is enough time to motivate the paper and inspire the listeners to go read it afterward, it is never enough to explain the whole paper. Sometimes speakers don’t get this concept, and the results are not pretty.

Anyway, I was surprised by the ease with which I could read law papers and participate in the discussions, even if my understanding was (obviously) not nearly as deep as that of a law scholar. This is something to ponder — while legalese is dense and frequently obfuscated, law papers are a breeze to read, at least based on my small sample size.

There is one paper, by Paul Ohm, that I particularly enjoyed: it is about re-examining privacy laws and regulatory strategies in the light of re-identification techniques. This generated a lot of interest at the conference, and I found the discussion fascinating. A major reason I started 33bits was to be able to play a part in informing these developments; it seems that this blog has indeed helped, which is highly gratifying. I learnt a lot about privacy and anonymity in general, and I look forward to writing more about it in future posts, to the extent that I can do so without talking about specific workshop discussions, which are confidential.

June 10, 2009 at 8:16 pm 8 comments


I’m an associate professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
