Posts tagged ‘law’

Bad Internet Law: What Techies Can Do About It

From the dangerous copyright lobby-sponsored PROTECT IP to a variety of misguided social networking safety laws, the spectre of bad Internet law is rearing its ugly head with increasing frequency. And at the e-G8 forum, Sarkozy and others talked about even more ambitious plans to “civilize” the Internet that will surely have repercussions in the U.S. as well. Three things are common to these efforts: a general ignorance of technological reality, an attempt to preserve pre-Internet era norms and business models that don’t necessarily make sense anymore, and severe chilling effects on free speech and innovation.

The bad news is that fighting specific laws as they come up is an uphill battle. What has changed in the last ten years is that the Internet has thoroughly permeated society, and therefore the interest groups pushing these laws are much more determined to get their way. The good news is that lawmakers are reasonably receptive to arguments from both sides. So far, however, they are not hearing nearly enough of our side of the story. It’s time for techies to step up and get more actively involved in policy if we hope to preserve what we’ve come to see as our way of life. Here’s how you can make a difference.

1. Stick to your strengths—explain technology. The primary reason Washington is prone to making bad tech law is that lawmakers don’t understand tech, and don’t understand how bits are different from atoms. Not only is educating policymakers on tech more effective; as a technologist you’ll also have more credibility if you stick to doing that rather than opining on specific policy measures.

2. Don’t go it alone. Giving equal weight to every citizen’s input on individual issues may or may not be a good idea in theory, but it certainly doesn’t work that way in practice. Money, expertise, connections and familiarity with the system all count. You’ll find it much easier to be heard and to make a difference if you combine your efforts with an existing tech policy group. You’ll also learn the ropes much more quickly by networking. Organizations like the EFF are always looking for help from outside technologists.

3. Charity begins at home—talk to your policy people. If you work at a large tech company, you’re already in a great position: your company has a policy group, a.k.a. lobbyists. Help them with their understanding of tech and business constraints, and have them explain the policy issues they’re involved in. Engineers often view the in-house policy and legal groups as a bunch of lawyers trying to impose arbitrary rules. This attitude hurts in the long run.

4. Learn to navigate the Three Letter Agencies. “The Government” is not a monolithic entity. To a first approximation there are the two Houses of Congress, a variety of Agencies, Departments and Commissions, the state legislatures and the state Attorneys General. They differ in their responsibilities, agendas, means of citizen participation and their receptiveness to input on technology. It can be bewildering at first, but don’t worry too much about it; you can pick it up as you go along. Weird but true: most Internet issues in the House are handled by the “Energy and Commerce” committee!

While I have focused on bad Internet laws, since that is where the tech/politics disconnect is most obvious, there are certainly many laws and regulations that have a largely positive, or at least mixed, reception in technology circles. Net neutrality is a prominent example; I am myself involved in the Do Not Track project. These are good opportunities to get involved as well, since there is always a shortage of technical folks. I would suggest picking one or two issues, even though it might be tempting to speak out about everything you have an opinion on.

To those of you who are about to post something like, “What’s the point? Congresspeople are all bought and paid for and aren’t going to listen to us anyway,” I have two things to say:

  • Tech policy is certainly hard because of the huge chasm between technologists and policymakers, but cynicism is unwarranted. Lawmakers are willing to listen and you will have an impact if you stick with it.
  • If you’re not interested, that’s your prerogative. But please refrain from discouraging others who are fighting for your rights. Defeatism and apathy are part of the problem.

Finally, here are some tech policy blogs and resources if you feel like “lurking” before you’re ready to jump in.

Thanks to Pete Warden and Jonathan Mayer for comments on a draft.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

June 7, 2011 at 4:56 pm

The Unsung Success of CAN-SPAM

In today’s debate around Do Not Track, detractors frequently make a comparison to the CAN-SPAM Act and how it failed to stop spam. Indeed, in 2010 an average of 183 billion spam emails were sent per day, so clearly the law was thoroughly ineffective.

Or was it?

Decrying CAN-SPAM as ineffective based on the total number, or even the percentage, of spam emails betrays a lack of understanding of what the Act was intended to do and how laws operate in general. Clearly, the Act does nothing to deter spammers in Ukraine or China; it’s not like the legislators were oblivious to this. To understand the positive effects that CAN-SPAM has had, it is necessary to go back to 2003 and see why spam filters weren’t working very well back then.

[Figure: emails plotted as points in a feature space. Typically thousands of dimensions are used, but only three are shown here.]

To a first approximation, a spam filter, like all machine learning-based “classifiers,” works by representing an email as a point in a multi-dimensional space and looking at which side of a surface (such as a “hyperplane”) it falls on. The hyperplane is “learned” by looking at already-classified emails. When you click the “report spam” button, you’re “training” this classifier, and it tweaks the hyperplane to become slightly more accurate in the future.

For emails that look obviously like spam, the classifier will never make a mistake, no matter how many millions of them it sees. The emails that it has trouble with are those that have some properties of ham and some properties of spam — those close to the boundary.
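
For readers who want to make the geometry concrete, here is a minimal sketch of such a classifier in Python. It is purely illustrative and assumes the scikit-learn library; real spam filters use far richer features and far more training data. The vectorizer maps each email to a point in a high-dimensional space, the linear model learns a separating hyperplane, and an incremental update with a newly reported message plays the role of clicking the “report spam” button.

```python
# Minimal sketch of a hyperplane-based spam classifier (illustrative only; assumes scikit-learn).
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)   # email -> point in a high-dimensional space
classifier = SGDClassifier(loss="hinge")            # learns a separating hyperplane (linear SVM)

emails = ["Cheap V1agra, act now!!", "October widget industry newsletter attached"]
labels = ["spam", "ham"]
classifier.partial_fit(vectorizer.transform(emails), labels, classes=["spam", "ham"])

# Clicking "report spam" corresponds to one more incremental training step:
reported = ["You gotta check this out!!"]
classifier.partial_fit(vectorizer.transform(reported), ["spam"])

# Classify a new message by checking which side of the hyperplane it falls on:
print(classifier.predict(vectorizer.transform(["Your account statement is ready"])))
```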

It is difficult for spammers to make their emails look legitimate, because ultimately they need to sell you a penis-enlargement product or whatever other scam they’re peddling. Back in the good old days when spam filters were hand-coded, they’d use tricks like replacing the word Viagra with Vi@gra. But the magic of machine learning ensures that modern filters will automatically update themselves very quickly.

Ham that looks like spam is much more of a problem. E-mail marketing is a grey area, and marketers will do anything they can to entice you to open their messages. Why honestly title your email “October widget industry newsletter” when you can instead title it “You gotta check this out!!” Compounding this problem is the fact that people get much more upset by false positives (legitimate messages getting lost) than false negatives (spam getting through to the inbox).

It now becomes obvious how CAN-SPAM made honest people honest (and the bad guys easier to prosecute) and how that changed the game. The rules basically say, “don’t lie.” If you look at a corpus of email today, you’ll find that the spectrum that used to exist is gone — there’s obviously legitimate e-mail (that intends to comply) and obviously illegitimate e-mail (that doesn’t care). The blue dots in the picture have been forced to migrate up — or risk being in violation. As you can imagine, spam filters have a field day in this type of situation.

And I can prove it. Instead of looking at how much spam is sent, let’s look at how much spam is getting through. Obviously this is harder to measure, but there is a simple proxy: search volume. The logic is straightforward: people who have a spam problem will search for it, in the hope of doing something about it.
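
For the curious, here is a rough sketch of how one could pull such a proxy today. It is my own illustration using the third-party pytrends library (an unofficial Google Trends client); the chart below was not necessarily produced this way, and the query term is just an assumption.

```python
# Sketch: relative search interest in a spam-related query as a proxy for how much
# spam is reaching inboxes. Assumes the unofficial `pytrends` library
# (pip install pytrends); the query term is a hypothetical choice.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US")
pytrends.build_payload(kw_list=["spam filter"], timeframe="2004-01-01 2010-12-31")
interest = pytrends.interest_over_time()              # weekly values, scaled 0-100
print(interest["spam filter"].resample("A").mean())   # yearly averages show the trend
```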

[Figure: search volume for spam-related queries over time. Note: data is not available before 2004.]

A-ha! A five-fold decrease since CAN-SPAM was passed. That doesn’t prove that the decrease is necessarily due to the Act, but it does prove that those who claim spam is still a major problem have no clue what they’re talking about.

There’s unsolicited email that is legitimate under CAN-SPAM; most people would consider it spam as well. Here’s where another provision of the Act comes in: one-click unsubscribe. Michael Dayah reports on an experiment showing that for this type of spam, unsubscribing is almost completely effective.

Incidentally, his view of CAN-SPAM concurs with mine:

The CAN-SPAM act then strongly bifurcated spammers. Some came into the light and followed the rules, using relevant subjects, no open relays, understandable language, and an unsubscribe link that supposedly functioned. Others went underground, doing their best to skirt the content filtering with nonsense text and day-old Chinese landing domains.

I would go so far as to say that the Act is a good model for the interplay between law and technology in solving a difficult problem. I’m not sure to what extent the lawmakers anticipated the developments that followed its passage, but CAN-SPAM is completely undeserving of the negative, even derisive reputation that it has acquired.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

December 20, 2010 at 5:37 pm

“Do Not Track” Explained

While the debate over online behavioral advertising and tracking has been going on for several years, it has recently intensified due to media coverage — for example, the Wall Street Journal What They Know series — and attention from Congress. The problems are clear; what can be done? Since purely technological solutions don’t seem to exist, it is time to consider legislative remedies.

One of the simplest and potentially most effective proposals is Do Not Track (DNT), which would give users a way to opt out of behavioral tracking universally. It is a way to move past the arms race between tracking technologies and defense mechanisms, focusing on the actions of the trackers rather than their tools. A variety of consumer groups and civil liberties organizations have expressed support for Do Not Track; Jon Leibowitz, chairman of the Federal Trade Commission, has also indicated that DNT is on the agency’s radar.

Not a list. While Do Not Track is named in analogy to the Do Not Call registry, and the two are similar in spirit, they are very different in implementation. Early DNT proposals envisaged a registry of users, or a registry of tracking domains; both are needlessly complicated.

The user-registry approach has various shortcomings, at least one of which is fatal: there are no universally recognized user identifiers in use on the Web. Tracking is based on ad-hoc identification mechanisms, including cookies, that the ad networks deploy; by mandating a global, robust identifier, a user registry would in one sense exacerbate the very problem it attempts to solve. It also leaves little flexibility for the user to configure DNT on a site-by-site basis.

The domain-registry approach involves requiring ad networks to register domains used for tracking with a central authority. Users would have the ability to download this list of domains and configure their browser to block them. This strategy has multiple problems, including: (i) the centralization required makes it fickle; (ii) it is not clear how to block tracking domains without blocking ads altogether, since displaying an ad requires contacting the server that hosts it; and (iii) it requires a level of consumer vigilance that is unreasonable to expect — for example, making sure that the domain list is kept up-to-date by every piece of installed web-enabled software.

The header approach. Today, consensus has been emerging around a far simpler DNT mechanism: have the browser signal to websites the user’s wish to opt out of tracking, specifically via an HTTP header such as “X-Do-Not-Track”. The header is sent out with every web request — this includes the page the user wishes to view, as well as each of the objects and scripts embedded within the page, including ads and trackers. It is trivial to implement in the web browser — indeed, there is already a Firefox add-on that implements such a header.
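
To make the mechanism concrete, here is a minimal sketch of what the header looks like on the wire and how a cooperating server might honor it. This is my own illustration, not taken from any proposal or add-on; the `requests` and Flask libraries, and the header value “1”, are assumptions.

```python
# Minimal sketch of the DNT header mechanism (illustrative only; the header name and
# value are not standardized here -- "X-Do-Not-Track: 1" is an assumption).

# Client side: the browser (or an add-on) attaches the header to every request.
import requests
requests.get("http://ads.example/banner.js", headers={"X-Do-Not-Track": "1"})

# Server side: a cooperating ad server checks the header before setting tracking state.
from flask import Flask, request, make_response

app = Flask(__name__)

@app.route("/banner.js")
def banner():
    resp = make_response("/* ad content */")
    if request.headers.get("X-Do-Not-Track") != "1":
        resp.set_cookie("uid", "some-tracking-identifier")  # track only if the user has not opted out
    return resp
```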

The header-based approach also has the advantage of requiring no centralization or persistence. But in order for it to be meaningful, advertisers will have to respect the user’s preference not to be tracked. How would this be enforced? There is a spectrum of possibilities, ranging from self-regulation via the Network Advertising Initiative, to supervised self-regulation or “co-regulation,” to direct regulation.

At the very least, by standardizing the mechanism and meaning of opt-out, the DNT header promises a greatly simplified way for users to opt out compared to the current cookie mechanism. Opt-out cookies are not robust; they are not supported by all ad networks, and are interpreted variously by those that do (no tracking vs. no behavioral advertising). The DNT header avoids these limitations and is also future-proof, in that a newly emergent ad network requires no new user action.

In the rest of this article, I will discuss the technical aspects of the header-based Do Not Track proposal. I will discuss four issues: the danger of a tiered web, how to define tracking, detecting violations, and finally user-empowerment tools. Throughout this discussion I will make a conceptual distinction between content providers or publishers (2nd party) and ad networks (3rd party).

Tiered web. Harlan Yu has raised a concern that DNT will lead to a tiered web in which sites will require users to disable DNT to access certain features or content. This type of restriction, if widespread, could substantially undermine the effectiveness of DNT.

There are two questions to address here: how likely is it that DNT will lead to a tiered web, and what, if anything, should be done to prevent it. The latter is a policy question — should DNT regulation prevent sites from tiering service — so I will restrict myself to the former.

Examining ad blocking allows us to predict how publishers, whether acting by themselves or due to pressure from advertisers, might react to DNT. From the user’s perspective, assuming DNT is implemented as a browser plug-in, ad blocking and DNT would be equivalent: install the plug-in and, as necessary, disable it for certain sites. And from the site’s perspective, ad blocking would result in a far greater decline in revenue than merely preventing behavioral ads. We should therefore expect that DNT will be at least as well tolerated by websites as ad blocking.

This is encouraging, since there are very few mainstream sites today that refuse to serve content to visitors with ad blocking enabled. Ad blocking is quite popular (indeed, the most popular extensions for both Firefox and Chrome are ad blockers). A few sites have experimented with tiering for ad-blocking users, but rescinded soon after due to user backlash. Public perception is another factor that is likely to skew things even further in favor of DNT being well-tolerated: access to content in exchange for watching ads sounds like a much more palatable bargain than access in exchange for giving up privacy.

One might nonetheless speculate about what a tiered web might look like if the ad industry, for whatever reason, decided to take a hard stance against DNT. It is once again easy to look to existing technologies, since we already have a tiered web: logged-in vs. anonymous browsing. To reiterate, I do not believe that disabling DNT as a requirement for service will become anywhere near as prevalent as logging in as a requirement for service. I bring up login only to make the comforting observation that there seems to be a healthy equilibrium among sites that require login always, some of the time, or never.

Defining tracking. It is beyond the scope of this article to give a complete definition of tracking. Any viable definition will necessarily be complex and comprise both technological and policy components. Eliminating loopholes and at the same time avoiding collateral damage — for example, to web analytics or click-fraud detection — will be a tricky proposition. What I will do instead is bring up a list of questions that will need to be addressed by any such definition:

  • How are 2nd parties and 3rd parties delineated? Does DNT affect 2nd-party data collection in any manner, or only 3rd parties?
  • Are only specific uses of tracking (primarily, targeted advertising) covered, or is all cross-site tracking covered by default, save possibly for specific exceptions?
  • For the use cases covered (i.e., prohibited) under DNT, can 3rd parties collect any individual data at all, or should no data be collected? What about aggregate statistical data?
  • If individual data can be collected, what categories? How long can it be retained, and for what purposes can it be used?

Detecting violations. The majority of ad networks will likely have an incentive to comply voluntarily with DNT. Nonetheless, it would be useful to build technological tools to detect tracking or behavioral advertising carried out in violation of DNT. It is important to note that since some types of tracking might be permitted by DNT, the tools in question are merely aids to determine when a further investigation is warranted.

There are a variety of passive (“fingerprinting”) and active (“tagging”) techniques to track users. Tagging is trivially detectable, since it requires modifying the state of the browser. As for fingerprinting, everything except the IP address and the user-agent string requires extra API calls and network activity that is in principle detectable. In summary, some crude tracking methods might be able to pass under the radar, while the finer-grained and more reliable methods are detectable.

Detection of impermissible behavioral advertising is significantly easier. Intuitively, two users with DNT enabled should see roughly the same distribution of advertisements on the same web page, no matter how different their browsing histories. In a single page view, there could be differences due to fluctuating inventories, A/B testing, and randomness, but in the aggregate, two DNT users should see the same ads. The challenge would be in automating as much of this testing process as possible.
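
As an illustration of what such a tool might do (my own sketch, with hypothetical numbers), one could tally the ads shown to two DNT-enabled profiles with very different browsing histories over many loads of the same page, then run a standard independence test; a significant difference is a signal that further investigation is warranted.

```python
# Sketch: compare the ads seen by two DNT-enabled profiles on the same page.
# The counts are hypothetical tallies from many repeated page loads; assumes SciPy.
from scipy.stats import chi2_contingency

ads = ["ad_A", "ad_B", "ad_C", "ad_D"]
profile_1 = [52, 31, 10, 7]   # profile with browsing history X
profile_2 = [48, 35, 12, 5]   # profile with browsing history Y

chi2, p_value, dof, expected = chi2_contingency([profile_1, profile_2])
if p_value < 0.01:
    print("Ad distributions differ significantly; investigate possible targeting")
else:
    print("No evidence that browsing history affected the ads on this page")
```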

User empowerment technologies. As noted earlier, there is already a Firefox add-on that implements a DNT HTTP header. It should be fairly straightforward to create one for each of the other major browsers. If for some reason this were not possible for a specific browser, an HTTP proxy (for instance, based on privoxy) is another viable solution, and it is independent of the browser.
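
As an aside, here is how short such a proxy-based solution can be. This sketch assumes the Python-based mitmproxy tool rather than privoxy itself, and reuses the hypothetical header name and value from above.

```python
# dnt_header.py -- sketch of a browser-independent proxy that adds the opt-out header
# to every outgoing request. Assumes mitmproxy (run: mitmproxy -s dnt_header.py);
# a privoxy action file could achieve the same effect.
def request(flow):
    flow.request.headers["X-Do-Not-Track"] = "1"
```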

A useful feature for the add-ons would be the ability to enable/disable DNT on a site-by-site basis. This capability could be very powerful, with the caveat that the user interface needs to be carefully designed to avoid usability problems. The user could choose to allow all trackers on a given 2nd party domain, or allow tracking by a specific 3rd party on all domains, or some combination of these. One might even imagine lists of block/allow rules similar to the Adblock Plus filter lists, reflecting commonly held perceptions of trust.
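
To give a flavor of what site-by-site rules might look like, here is a toy sketch; the rule format is entirely hypothetical and not taken from any existing add-on.

```python
# Toy sketch of site-by-site DNT rules (hypothetical format).
# A rule can name a publisher (2nd party), a tracker (3rd party), or a specific pair;
# the most specific matching rule wins, and the default is to send the DNT header.
RULES = {
    ("news.example", "*"):           "allow-tracking",  # trust all trackers on this publisher
    ("*", "analytics.example"):      "allow-tracking",  # trust this 3rd party on all sites
    ("news.example", "ads.example"): "block-tracking",  # ...except this particular pair
}

def send_dnt_header(publisher, tracker):
    """Return True if the DNT header should be sent for this (publisher, tracker) pair."""
    for key in [(publisher, tracker), (publisher, "*"), ("*", tracker)]:
        if key in RULES:
            return RULES[key] == "block-tracking"
    return True  # default: the user opts out of tracking

print(send_dnt_header("news.example", "ads.example"))       # True: header sent
print(send_dnt_header("news.example", "cdn.example"))        # False: tracking allowed here
print(send_dnt_header("blog.example", "analytics.example"))  # False: trusted 3rd party
```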

To prevent fingerprinting, web browsers should attempt to minimize the amount of information leaked by web requests and APIs. There are three contexts in which this could be implemented: by default, as part of the existing private browsing mode, or in a new “anonymous browsing mode.” While minimizing information leakage benefits all users, it helps DNT users in particular by making it harder to implement silent tracking mechanisms. Both Mozilla and reportedly the Chrome team are already making serious efforts in this direction, and I would encourage other browser vendors to do the same.

A final avenue for user empowerment that I want to highlight is the possibility of achieving some form of browser history-based targeting without tracking. This gives me an opportunity to plug Adnostic, a Stanford-NYU collaborative effort which was developed with just this motivation. Our whitepaper describes the design as well as a prototype implementation.

This article is the result of several conversations with Jonathan Mayer and Lee Tien, as well as discussions with Peter Eckersley, Sid Stamm, John Mitchell, Dan Boneh and others. Elie Bursztein also deserves thanks for originally bringing DNT to my attention. Any errors, omissions and opinions are my own.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

September 20, 2010 at 4:13 pm

Myths and Fallacies of “Personally Identifiable Information”

I have a new paper (PDF) with Vitaly Shmatikov in the June issue of the Communications of the ACM. We talk about the technical and legal meanings of “personally identifiable information” (PII) and argue that the term means next to nothing and must be greatly de-emphasized, if not abandoned, in order to have a meaningful discourse on data privacy. Here are the main points:

The notion of PII is found in two very different types of laws: data breach notification laws and information privacy laws. In the former, the spirit of the term is to encompass information that could be used for identity theft. We have absolutely no issue with the sense in which PII is used in this category of laws.

On the other hand, in laws and regulations aimed at protecting consumer privacy, the intent is to compel data trustees who want to share or sell data to scrub “PII” in a way that prevents the possibility of re-identification. As readers of this blog know, this is essentially impossible to do in a foolproof way without losing the utility of the data. Our paper elaborates on this and explains why “PII” has no technical meaning, given that virtually any non-trivial information can potentially be used for re-identification.

What we are gunning for is the get-out-of-jail-free card, a.k.a. “safe harbor,” particularly in the HIPAA (health information privacy) context. In current practice, data owners can absolve themselves of responsibility by performing a syntactic “de-identification” of the data (although this isn’t the spirit of the law). Even your genome is not considered identifying!

Meaningful privacy protection is possible if account is taken of the specific types of computations that will be performed on the data (e.g., collaborative filtering, fraud detection, etc.). It is virtually impossible to guarantee privacy by considering the data alone, without carefully defining and analyzing its desired uses.

We are well aware of the burden that this imposes on data trustees, many of whom find even the current compliance requirements onerous. Often there is no one available who understands computer science or programming, and there is no budget to hire someone who does. That is certainly a conundrum, and it isn’t going to be fixed overnight. However, the current situation is a farce and needs to change.

Given that technologically sophisticated privacy protection mechanisms require a fair bit of expertise (although we hope that they will become commoditized in a few years), one possible way forward is by introducing stronger acceptable-use agreements. Such agreements would dictate what the collector or recipient of the data can and cannot do with it. They should be combined with some form of informed consent, where users (or, in the health care context, patients) acknowledge their understanding that there is a re-identification risk. But the law needs to change to pave the way for this more enlightened approach.

Thanks to Vitaly Shmatikov for comments on a draft of this post.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

June 21, 2010 at 8:12 pm

Conferences: The Good, the Bad and the Ugly aspects

I attended a couple of conferences this week that are outside my usual community. Taking stock of and interacting with a new crowd is always a very interesting experience.

The first was the IAPP Practical Privacy Series. The International Association of Privacy Professionals came about as a result of the fact that the Chief Privacy Officer (and equivalent) positions have suddenly emerged — over the last decade — and become ubiquitous. The role can be broadly described as “privacy compliance.” A big part of the initial impetus seems to have been HIPAA compliance, but the IAPP composition has now diversified greatly, because virtually every company is sitting on a pile of consumer data. There was even someone from Starbucks.

I spoke about anonymization. I was trying to answer the question, “I need to share/sell my data and you’re telling me that anonymization is broken. So what should I do?”. It’s always a fun challenge to make computer science accessible to a non-tech audience (largely lawyers in this case). I think I managed reasonably well.

Next was the ACM Computers, Freedom and Privacy conference (which goes on until Friday). As I understand it, CFP was born at a time when “Cyberspace” was analogous to the Wild West, and there was a big need for self-governance and figuring out the emerging norms. The landscape is of course very different now, since the Internet isn’t a band of outlaws anymore but integrated into normal society. The conference has accordingly morphed somewhat, although a lot of the old crowd still definitely comes here.

The quality of the events I attended was highly variable. I checked out the “unconferences,” but only a couple had a meaningful level of participation, and the one I went to seemed to devolve pretty quickly into a penis-waving contest. The session I liked best was a tutorial by Mike Godwin (of Godwin’s law, now counsel for the Wikimedia Foundation) on Cyberlaw, mainly First Amendment law.

CFP has parallel sessions. I had a great experience with that format at the Privacy Law Scholars Conference, but this time I’m not so sure — I’m regularly finding conflicts among the sessions I want to attend.

I’m bummed about the fact that there is really no mechanism for me to learn about conferences that are relevant to my interests but are outside my community. (I only learned about the IAPP workshop because I was invited to speak, and CFP purely coincidentally.) Do other researchers face this problem as well? I’m curious to hear about how people keep abreast. I mean, it’s 2010, and this is exactly the kind of problem that social media is supposed to be great at solving, but it’s not really working for me.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

June 17, 2010 at 7:24 am

The Internet has no Delete Button: Limits of the Legal System in Protecting Anonymity

It is futile to try to stay anonymous by getting your name or data purged from the Internet, once it is already out there. Attempts at such censorship have backfired repeatedly and spectacularly, giving rise to the term Streisand effect. A recent lawsuit provides the latest demonstration: two convicted German killers (who have completed their prison sentences) are attempting to prevent Wikipedia from identifying them.

The law in Germany tries to “protect the name and likenesses of private persons from unwanted publicity.” Of course, the Wikimedia foundation is based in the United States, and this attempt runs head-on into the First Amendment, the right to Free Speech. European countries have a variety of restrictions on speech—Holocaust denial is illegal, for instance. But there is little doubt about how U.S. courts will see the issue; Jennifer Granick of the EFF has a nice write-up.

The aspect that interests me is that even if there weren’t a Free Speech issue, it would be utterly impossible for the court system to keep the names of these men from the Internet. I wonder if the German judge who awarded a judgment against the Wikimedia foundation was aware that it would achieve exactly the “unwanted publicity” that the law was intended to avoid. He would probably have ruled as he did in any case, but it is interesting to speculate.

Legislators, on the other hand, would do well to be aware of the limitations of censorship, and the need to update laws to reflect the rules of the information age. There are always alternatives, although they usually involve trade-offs. In this instance, perhaps one option is a state-supplied alternate identity, analogous to the Witness Protection Program?

Returning to the issue of enforceability, the European doctrine apparently falls under “rights of the personality,” specifically the “right to be forgotten,” according to this paper that discusses the transatlantic clash. I find the very name rather absurd; it reminds me of attempting not to think of an elephant (try it!).

The above paper, written from the European perspective, laments the irreconcilable differences between the two viewpoints on the issue of Free Speech vs. Privacy. However, there is no discussion of enforceability. The author does suspect, in the final paragraph, that the European doctrine will become rather meaningless due to the Internet, but he believes this to be purely a consequence of the fact that the U.S. courts have put Free Speech first.

I don’t buy it—even if the U.S. courts joined Europe in recognizing a “right to be forgotten,” it would still be essentially unenforceable. Copyright-based rather than privacy-based censorship attempts offer us a lesson here. Copyright law has international scope, due to being standardized by the WIPO, and yet the attempt to take down the AACS encryption key was pitifully unsuccessful.

Taking down a repeat offender (such as a torrent tracker) or a large file (the Windows 2000 source code leak) might be easier. But if we’re talking about a small piece of data, the only factor that seems to matter is the level of public interest in the sensitive information. The only times when censorship of individual facts has been (somewhat) successful in the face of public sentiment is within oppressive regimes with centralized Internet filters.

There are many laws, particularly privacy laws, that need to be revamped for the digital age. What might appear obvious to technologists might be much less apparent to law scholars, lawmakers and the courts. I’ve said it before on this blog, but it bears repeating: there is an acute need for greater interdisciplinary collaboration between technology and the law.

November 28, 2009 at 5:22 am

Oklahoma Abortion Law: Bloggers get it Wrong

The State of Oklahoma just passed legislation requiring that detailed information about every abortion performed in the state be submitted to the State Department of Health. Reports based on this data are to be made publicly available. The controversy around the law gained steam rapidly after bloggers revealed that even though names and addresses of mothers obtaining abortions were not collected, the women could nevertheless be re-identified from the published data based on a variety of other required attributes such as the date of abortion, age and race, county, etc.

As a computer scientist studying re-identification, I naturally had this brought to my attention. I was as indignant on hearing about it as the next smug Californian, and I promptly wrote up a blog post analyzing the serious risk of re-identification based on the answers to the 37 questions that each mother must anonymously report. Just before posting it, however, I decided to give the text of the law a more careful reading, and realized that the bloggers had been misinterpreting the law all along.

While it is true that the law requires submitting a detailed form to the Department of Health, the only information made public is annual reports with statistical tallies of the number of abortions performed under very broad categories, which present a negligible to non-existent re-identification risk.

I’m not defending the law; that is outside my sphere of competence. There do appear to be other serious problems with it, outlined in a lawsuit aimed at stopping the law from going into effect. The text of this complaint, as Paul Ohm notes, does not raise the “public posting” claim. Besides, the wording of the law is very ambiguous, and I can certainly see why it might have been misinterpreted.

But I do want to lament the fact that bloggers and special interest groups can start a controversy based on a careless (or, less often, deliberate) misunderstanding, and have it amplified by an emerging category of news outlets like the Huffington Post, which have the credibility of blogs but a readership approaching traditional media. At this point the outrage becomes self-sustaining, and the factual inaccuracies become impossible to combat. I’m reminded of the affair of the gay sheep.

October 9, 2009 at 6:24 pm

Privacy Law Scholars Conference

I had a great time at the Privacy Law Scholars Conference in Berkeley last week, perhaps more so than at any CS conference I’ve attended. A major reason was that there were — get this — no talks. Well, just one keynote speech. The format centered around 75-minute discussion sessions (which seem to be called workshops), with 5 parallel tracks; in each session, you pick which track you want to attend. You are supposed to have read the paper beforehand, and usually everyone in the room has something to say and gets a chance to do so.

This seems way more sensible to me than the format of CS conferences, where there is only one track. I can’t imagine that anyone would genuinely want to attend all the talks. Ideally, for any given talk, half the people should skip it and spend their time networking instead, but in my experience this never happens. Worse, the talks are only 20-30 minutes long; while this is enough time to motivate the paper and inspire the listeners to go read it afterward, it is never enough to explain the whole paper. Sometimes speakers don’t get this concept, and the results are not pretty.

Anyways, I was surprised by the ease with which I could read law papers and participate in the discussions, even if my understanding was (obviously) not nearly as deep as that of a law scholar. This is something to ponder — while legalese is dense and frequently obfuscated, law papers are a breeze to read, at least based on my small sample size.

There is one paper, by Paul Ohm, that I particularly enjoyed: it is about re-examining privacy laws and regulatory strategies in the light of re-identification techniques. This generated a lot of interest at the conference, and I found the discussion fascinating. A major reason I started 33bits was to be able to play a part in informing these developments; it seems that this blog has indeed helped, which is highly gratifying. I learnt a lot about privacy and anonymity in general, and I look forward to writing more about it in future posts, to the extent that I can do so without talking about specific workshop discussions, which are confidential.

June 10, 2009 at 8:16 pm

Article about Netflix paper in law journal

David Molnar pointed me to an article in the Shidler Journal of Law that prominently cites the Netflix dataset de-anonymization paper. I’m very happy to see this; when we wrote our paper, we were hoping to see the legal community analyze the implications of our work for privacy laws. As the article notes:

Re-identification of anonymized data with individual consumers may expose companies to increased liability. If data is re-identified, this may be due to the failure of companies to take reasonable precautions to protect consumer data. In addition, companies may violate their own privacy policies by releasing anonymous information to third parties that can be easily re-identified with individual users.

New lines will need to be drawn defining what is acceptable data-release policy, and in a way that takes into account the actual re-identification risk instead of relying on syntactic crutches such as removing “personally identifiable” information. Perhaps there will need to be a constant process of evaluating and responding to continuing improvements in re-identification algorithms.

Perhaps the ability of third parties to discover information about an individual’s movie rankings is not too disturbing, as movie rankings are not generally considered to be sensitive information. But because these same techniques can lead to the re-identification of data, far greater privacy concerns are implicated.

Indeed, since we wrote our paper, there have been several high-profile cases in the news or in the courts where our re-identification techniques can be used to cause much more sensitive privacy breaches, including the Google-Viacom lawsuit involving YouTube viewer logs and the targeted advertising companies Phorm and NebuAd. While the lessons of our paper have begun to propagate “downstream” to the realms of law, advocacy and policy, this has come too late to make a difference in the above examples.

Part of the reason I started this blog was the hope of accelerating this process by reaching out to people outside the computer science community. While our papers might be couched in technical language, the results of our research are general enough to be easily accessible to a broad audience, and I hope that this blog will become a central point for disseminating information more broadly.

September 30, 2008 at 9:46 pm


About 33bits.org

I’m an associate professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
