Is Making Public Data “More Public” a Privacy Violation?

April 5, 2010 at 6:11 pm 13 comments

What on earth does more public mean? Technologists draw a simple distinction between data that is public and data that is not. Under this view, the notion of making data more public is meaningless. But common sense tells us otherwise: it’s hard to explain the opposition to public surveillance if you assume that it’s OK to collect, store and use “public” information indiscriminately.

There are entire philosophical theories devoted to understanding what one can and cannot do with public data in different contexts. Recently, danah boyd argued in her SXSW keynote in support of “privacy through obscurity” and how technology is destroying this comfort. According to boyd, most public data is “quasi-public” and technologists don’t have the right to “publicize” it.

Some examples. One can debate the point in the abstract, but there is no question that companies and individuals have repeatedly been bitten when applying the “it’s already public” rule. Let’s look at some examples (the list and the discussion are largely concerned with data on the web).

  1. The availability of the California Birth Index on the web caused considerable consternation about a decade ago, despite the fact that birth records in the state are public and anyone’s birth record can be obtained through official channels, albeit in a cumbersome manner.
  2. IRSeek planned to launch a search engine for IRC in 2007 by monitoring and indexing public channels (chatrooms). There was a predictable privacy outcry and they were forced to shut down.
  3. The Infochimps guys crawled the Twitter graph back in 2008 and posted it on their site. Twitter forced them to take the dataset down.
  4. The story was repeated with Pete Warden and Facebook; this time it was nastier and involved the threat of a lawsuit.
  5. MySpace recently started selling user data in bulk on Infochimps. As MySpace has pointed out, the data is already public, but privacy concerns have nevertheless been raised.
  6. One reason for the backlash against Google Buzz was auto-connect: it connected your activity on Google Reader and other services and streamed it to your friends. Your Google Reader activities were already public, but Buzz took it further by broadcasting it.
  7. Spokeo is facing similar criticism. As Snopes explains, “Spokeo displays listings that sometimes contain more personal information than many people are comfortable having made publicly accessible through a single, easy-to-use search site.”

The last four examples are all from the past couple of months. For some reason the issue has suddenly started cropping up all the time. The current situation is bad for everyone: data trustees and data analysts have no clear guidelines in place, and users/consumers are in a position of constantly having to fight back against a loss of privacy. We need to figure out some ground rules to decide what uses of public data on the web are acceptable.

Why not “none?” I don’t agree with a blanket argument against using data for purposes other than originally intended, for many reasons. The first is that users’ privacy expectations, when they go beyond the public/private dichotomy, are generally poorly articulated, frequently unreasonable and occasionally self-contradictory. (An unfortunate but inevitable consequence of the complexity of technology.) The second reason is that these complex privacy rules, even if they can be figured out, often need to be communicated to the machine.

The third reason is the “greater good.” I’ve opposed that line of reasoning when used to justify reneging on an explicit privacy promise. But when it comes to a promise that was never actually made but merely intuitively understood (or mis-understood) by users, I think the question is different, and my stance is softer. Privacy needs to be weighed against the benefit to society from “publicizing” data — disseminating, aggregating and analyzing it.

In the next article of this series, I will give a rigorous technical characterization of what constitutes publicizing data. My hope is that this will go a long way towards determining what is and is not a violation of privacy. In the meantime, I look forward to hearing different opinions.

Thanks to Pete Warden and Vimal Jeyakumar for comments on a draft.



13 Comments

  • 1. MacSmiley  |  April 5, 2010 at 6:26 pm

    Can you say, Identity Theft?

  • 2. harryjohnston  |  April 5, 2010 at 11:13 pm

    There was a somewhat similar situation way back in 1995, when USENET was publicly archived for the first time. That is, posts going back several years (IIRC) were suddenly available on the web. (Look up Dejanews in Wikipedia.)

    I recall some folks complaining about it at the time, although the controversy didn’t result in the service being shut down. According to the Wikipedia article, it did result in opt-out mechanisms being developed.

  • 3. George  |  April 6, 2010 at 2:57 am

    There are a lot of recent examples of ostensibly private data being made public. One of the more interesting ones to me is probably http://www.dirtyphonebook.com. I don’t have anything to hide so I’m not worried but I know a lot of people are mad about their private details being exposed.

    I don’t understand how Boyd can claim there is a condition called “quasi-public”. Sounds sorta made up to me.

    • 4. Arvind  |  April 6, 2010 at 3:00 am

      Thanks for that link. As for boyd’s claim, I think her meaning is clear enough: people often post things publicly that they intend only their friends to read. In fact the majority of online posts probably fall into this category. One can argue whether or not technologists have a duty to respect that, but I don’t think there’s anything made-up here.

  • 5. Luc  |  April 14, 2010 at 8:29 pm

    Considering that you do your purchases in public, is it hard to argue that credit card companies could publicly give access to your purchase history (date and items)?

    Is a conversation between two people at the shopping mall public or private?

    In life, we develop a construct of privacy. We create our expectations and we adjust ourselves to situations by lowering our voice or raising the music. Technology, on the other hand, evolves more rapidly than our ability to change our behavior. We still “expect” an email to be private. We “expect” that our IRC conversations are not “overheard”.

    I don’t deny that our “public” information can be useful for commercial or scientific purposes. What is often lacking is the knowledge that our information in a certain situation CAN be “searched, archived, and distributed beyond the expected audience” before we expose the information.

    As information sharing is linked to many revenue models, it is often protected (poor Pete), not clearly stated and often hidden in the fine print of a ten-page policy.

    Good luck with the model.

  • 6. R. Sinohara  |  May 14, 2010 at 1:06 pm

    About identity thievery, isn’t making hard-to-find but still available information easier to find going to make people more aware of how public such information is and thus get them to be more careful, and make authentication methods better?

    • 7. Arvind  |  May 14, 2010 at 7:00 pm

      One can hope so. Some people I know are working on just that. It is one thing for (say) a civil liberties organization to publicize data to raise awareness, and quite another for someone with a commercial motive to do so. I’m analyzing the latter in this article.

  • 8. R. Sinohara  |  May 14, 2010 at 1:10 pm

    Is this binary distinction between public and non-public really the best way to go? That kinda seems to be the problem.

    I really believe there is a difference between some data being available and somebody going around publishing it.

    Doesn’t ‘publish’ originally mean actively going around distributing information (which is quantifiable)? That seems different from passively giving access to data.

  • 9. Stephen Wilson  |  May 29, 2010 at 11:16 pm

    “Technologists draw a simple distinction between data that is public and data that is not”. This is a great example of where technologists as a class tend not to understand privacy protection, because the public-private distinction is not important at law. In information privacy law, the terms “public” and “private” are generally avoided, and instead what matters is whether or not information is personally identifiable.

    Obviously there are shades of grey in gauging identifiability, but it’s a much more tangible and measurable quality than ‘publicness’. As you say there are “entire philosophical theories devoted to understanding what one can and cannot do with public data in different contexts”. That is, the public-private dichotomy is intractable, and technologists should learn to avoid it altogether.

    Technologists need to understand this: Most information privacy law prohibits the collection of personal information without a good reason or consent, and the arbitrary secondary use of information collected for an original purpose. These principles are basically blind to whether the source data is “public” or “private”. So a whole lot of personal information might be available in “public” over open wifi networks, but it does not follow that Google and others may collect that information and do with it as they like.

    The fact that information privacy law doesn’t care if information is “public” might be counterintuitive to many technologists, but that’s the way the law is. And ignorance of the law is no excuse!

    Stephen Wilson, Lockstep, Australia.

    • 10. Arvind  |  May 30, 2010 at 12:30 am

      Stephen,

      You make some good points, but let me clarify a couple of things.

      Firstly, my article is absolutely not about whether or not making data more public is a violation of the law. Rather, it is about whether or not it is a violation of users’ expectations. You can have a privacy disaster even if you are perfectly compliant with information privacy law. That is largely the setting that I’m concerned with.

      You say,

      “The public-private dichotomy is intractable, and technologists should learn to avoid it altogether.” You also say that all secondary uses of data are, or should be, prohibited.

      I think that is just silly. Today’s web would simply cease to exist without the secondary usage of data. (If the law prohibits it, then it will be worked around.) Far from avoiding the public-private distinction, it is crucial for technologists to learn to navigate it.

      Finally, nobody is arguing that it was OK for Google to collect WiFi data. Bringing that up is a red herring and is counterproductive.

      • 11. Stephen Wilson  |  May 30, 2010 at 1:09 am

        Hi Arvind.

        I should explain that I’ve been engaged in a lot of dialogues lately which I think expose a systemic misunderstanding of privacy *law* by technologists. The Google wifi example is obviously front and centre there. So I took some of the points in your blog and without knowing if you were supporting the views or just mentioning them, I used the points to advance my thesis. I only mentioned Google to sharpen my point, not to join you to that issue in any way. No offence intended!

        I don’t think secondary usage should be prohibited. I just want technologists to recognise that privacy law is actually very clear on Collection and on Use & Disclosure. You might not be saying otherwise … but the publicness of any particular information does not in and of itself mean that it escapes information privacy law.

        Your blog opened with a line about technologists seeing a simple distinction, and I merely used that as a jumping-off point for my argument that that particular distinction is unhelpful. In fact, it betrays a category error that makes it difficult for technologists to engage in the privacy debate. My thesis is that technologists and privacy policy wonks have very different frames of reference.

        Again, sorry to have used your blog as a platform!

        Cheers,

        Stephen.

        • 12. Arvind  |  May 30, 2010 at 6:03 pm

          No worries. Looks like we’re talking about different things and aren’t really contradicting each other.

          “the publicness of any particular information does not in and of itself mean that it escapes information privacy law.”

          Yes, I’m not disputing that.

          Thank you for your comments. No offense taken.




About 33bits.org

I'm an assistant professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.
