Posts tagged ‘free speech’

Is Writing Style Sufficient to Deanonymize Material Posted Online?

I have a new paper appearing at IEEE S&P with Hristo Paskov, Neil Gong, John Bethencourt, Emil Stefanov, Richard Shin and Dawn Song on Internet-scale authorship identification based on stylometry, i.e., analysis of writing style. Stylometric identification exploits the fact that we all have a ‘fingerprint’ based on our stylistic choices and idiosyncrasies with the written word. To quote from my previous post speculating on the possibility of Internet-scale authorship identification:

Consider two words that are nearly interchangeable, say ‘since’ and ‘because’. Different people use the two words in a differing proportion. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.

The basic idea that people have distinctive writing styles is very well-known and well-understood, and there is an extremely long line of research on this topic. This research began in modern form in the early 1960s when statisticians Mosteller and Wallace determined the authorship of the disputed Federalist papers, and were featured in TIME magazine. It is never easy to make a significant contribution in a heavily studied area. No surprise, then, that my initial blog post was written about three years ago, and the Stanford-Berkeley collaboration began in earnest over two years ago.

Impact. So what exactly did we achieve? Our research has dramatically increased the number of authors that can be distinguished using writing-style analysis: from about 300 to 100,000. More importantly, the accuracy of our algorithms drops off gently as the number of authors increases, so we can be confident that they will continue to perform well as we scale the problem even further. Our work is therefore the first time that stylometry has been shown to have serious implications for online anonymity.[1]

Anonymity and free speech have been intertwined throughout history. For example, anonymous discourse was essential to the debates that gave birth to the United States Constitution. Yet a right to anonymity is meaningless if an anonymous author’s identity can be unmasked by adversaries. While there have been many attempts to legally force service providers and other intermediaries to reveal the identity of anonymous users, courts have generally upheld the right to anonymity. But what if authors can be identified based on nothing but a comparison of the content they publish to other web content they have previously authored?

Experiments. Our experimental methodology is set up to directly address this question. Our primary data source was the ICWSM 2009 Spinn3r Blog Dataset, a large collection of blog posts made available to researchers by Spinn3r.com, a provider of blog-related commercial data feeds. To test the identifiability of an author, we remove k randomly chosen posts (typically 3) from the corresponding blog, treat those posts as anonymous, and apply our algorithm to determine which blog they came from. In these experiments, the labeled (identified) and unlabeled (anonymous) texts are drawn from the same context. We call this post-to-blog matching.
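For concreteness, here is a minimal sketch of one trial of the post-to-blog evaluation loop described above. It is not our actual experimental code; `extract_features` and `rank_blogs` are hypothetical placeholders for the feature-extraction and classification stages discussed later in this post.

```python
import random

def post_to_blog_trial(blogs, extract_features, rank_blogs, k=3):
    """One trial of the post-to-blog experiment: hold out k posts from a
    randomly chosen blog, treat them as anonymous, and check whether the
    matching algorithm puts the true blog at (or near) the top.

    `blogs` maps a blog id to its list of posts. `extract_features` and
    `rank_blogs` are hypothetical placeholders, not functions from the paper.
    """
    target = random.choice(list(blogs))
    held_out = random.sample(blogs[target], k)              # the "anonymous" posts
    labeled = {blog: [p for p in posts
                      if blog != target or p not in held_out]
               for blog, posts in blogs.items()}             # everything else stays labeled

    anon_vectors = [extract_features(post) for post in held_out]
    ranking = rank_blogs(anon_vectors, labeled)              # blog ids, best match first
    return ranking.index(target)                             # 0 means the top guess was correct
```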

In some applications of stylometric authorship recognition, the context for the identified and anonymous text might be the same. This was the case in the famous study of the Federalist papers: each author withheld his name from some of his papers but wrote about the same topics. In the blogging scenario, an author might decide to selectively distribute a few particularly sensitive posts anonymously through a different channel. But in other cases, the unlabeled text might be political speech, whereas the only available labeled text by the same author might be a cooking blog, i.e., the labeled and unlabeled text might come from different contexts. Context encompasses much more than topic: the tone might be formal or informal; the author might be in a different mental state (e.g., more emotional) in one context versus the other, etc.

We feel that it is crucial for authorship recognition techniques to be validated in a cross-context setting. Previous work has fallen short in this regard because of the difficulty of finding a suitable dataset. We were able to obtain about 2,000 pairs (and a few triples, etc.) of blogs, each pair written by the same author, by looking at a dataset of 3.5 million Google profiles and searching for users who listed more than one blog in the ‘websites’ field.[2] We are thankful to Daniele Perito for sharing this dataset. We added these blogs to the Spinn3r blog dataset to bring the total to 100,000. Using this data, we performed experiments as follows: remove one of a pair of blogs written by the same author, and use it as unlabeled text. The goal is to find the other blog written by the same author. We call this blog-to-blog matching. Note that although the number of blog pairs is only a few thousand, we match each anonymous blog against all 99,999 other blogs.

Results. Our baseline result is that in the post-to-blog experiments, the author was correctly identified 20% of the time. This means that when our algorithm uses three anonymously published blog posts to rank the possible authors in descending order of probability, the top guess is correct 20% of the time.

But it gets better from there. In 35% of cases, the correct author is one of the top 20 guesses. Why does this matter? Because in practice, algorithmic analysis probably won’t be the only step in authorship recognition, and will instead be used to produce a shortlist for further investigation. A manual examination may incorporate several characteristics that the automated analysis does not, such as choice of topic (our algorithms are scrupulously “topic-free”). Location is another signal that can be used: for example, if we were trying to identify the author of the once-anonymous blog Washingtonienne, we’d know that she almost certainly resides in or around Washington, D.C. Alternatively, a powerful adversary such as law enforcement may require Blogger, WordPress, or another popular blog host to reveal the login times of the top suspects, which could be correlated with the timing of posts on the anonymous blog to confirm a match.

We can also improve the accuracy significantly over the baseline of 20% for authors for whom we have more than an average number of labeled or unlabeled blog posts. For example, with 40–50 labeled posts to work with (the average is 20 posts per author), the accuracy goes up to 30–35%.

An important capability is confidence estimation, i.e., modifying the algorithm to also output a score reflecting its degree of confidence in the prediction. We measure the efficacy of confidence estimation via the standard machine-learning metrics of precision and recall. We find that we can improve precision from 20% to over 80% with only a halving of recall. In plain English, what these numbers mean is: the algorithm does not always attempt to identify an author, but when it does, it finds the right author 80% of the time. Overall, it identifies 10% (half of 20%) of authors correctly, i.e., 10,000 out of the 100,000 authors in our dataset. Strong as these numbers are, it is important to keep in mind that in a real-life deanonymization attack on a specific target, it is likely that confidence can be greatly improved through methods discussed above — topic, manual inspection, etc.
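To make the precision/recall trade-off concrete, here is a minimal sketch of how abstaining below a confidence threshold is scored, using the framing above (precision over attempted identifications, recall over all authors). The input arrays are hypothetical: one confidence score and one correctness flag per anonymous author in the experiment.

```python
import numpy as np

def precision_recall(confidences, correct, threshold):
    """Precision and recall when the algorithm abstains below `threshold`.

    confidences[i] -- confidence score attached to the guess for the i-th author
    correct[i]     -- whether that guess named the right author

    Precision: of the guesses we keep, the fraction that are right.
    Recall:    of all authors, the fraction we both attempt and get right
               (the 'identifies 10% of authors' framing used above).
    """
    conf = np.asarray(confidences, dtype=float)
    ok = np.asarray(correct, dtype=bool)
    attempted = conf >= threshold
    if not attempted.any():
        return 0.0, 0.0
    precision = ok[attempted].mean()
    recall = (ok & attempted).sum() / len(ok)
    return float(precision), float(recall)

# Sweeping the threshold traces out the precision/recall trade-off curve:
# for t in np.linspace(all_confidences.min(), all_confidences.max(), 20):
#     print(t, precision_recall(all_confidences, all_correct, t))
```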

We confirmed that our techniques work in a cross-context setting (i.e., blog-to-blog experiments), although the accuracy is lower (~12%). Confidence estimation works well in this setting too, boosting accuracy to over 50% with a halving of recall. Finally, we also manually verified that in cross-context matching we find pairs of blogs that are hard for humans to match based on topic or writing style; we describe three such pairs in an appendix to the paper. For detailed graphs as well as a variety of other experimental results, see the paper.

We see our results as establishing early lower bounds on the efficacy of large-scale stylometric authorship recognition. Having cracked the scale barrier, we expect accuracy improvements to come more easily in the future. In particular, we report experiments in the paper showing that a combination of two very different classifiers works better than either alone, but there is a lot more mileage to squeeze from this approach, given that ensembles of classifiers are known to work well for most machine-learning problems. Also, there is much work to be done in terms of analyzing which aspects of writing style are preserved across contexts, and using this understanding to improve accuracy in that setting.
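For intuition, one generic way to combine two classifiers is to standardize their per-author match scores and take a weighted average; the sketch below shows that recipe as an illustration, not the specific combination rule used in the paper.

```python
import numpy as np

def combine_scores(scores_a, scores_b, weight=0.5):
    """Blend two classifiers' per-author match scores into a single ranking.

    scores_a and scores_b are scores over the same list of candidate authors,
    produced by two different classifiers. Each is standardized first so that
    classifiers with different score scales contribute comparably; the blend
    is a plain weighted average, shown only as one common ensembling choice.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return weight * a + (1 - weight) * b    # rank authors by this combined score
```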

Techniques. Now let’s look in more detail at the techniques I’ve hinted at above. The author identification task proceeds in two steps: feature extraction and classification. In the feature extraction stage, we reduce each blog post to a sequence of about 1,200 numerical features (a “feature vector”) that acts as a fingerprint. These features fall into various lexical and grammatical categories. Two example features: the frequency of uppercase words, and the number of words that occur exactly once in the text. While we mostly used the same set of features that the authors of the Writeprints paper did, we also came up with a new set of features that involved analyzing the grammatical parse trees of sentences.

An important component of feature extraction is ensuring that our analysis is purely stylistic. We do this in two ways: first, we preprocess the blog posts to filter out signatures, markup, or anything that might not be directly entered by a human. Second, we restrict our features to those that bear little resemblance to the topic of discussion. In particular, our word-based features are limited to stylistic “function words” that we list in an appendix to the paper.
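As an illustration only, here is a toy version of a few such features: the frequency of uppercase words, the proportion of words that occur exactly once, and relative frequencies of a handful of function words. The function-word list below is a tiny made-up subset, not the list from the paper’s appendix, and the real system computes roughly 1,200 features, including the parse-tree features mentioned above.

```python
import re
from collections import Counter

# A tiny illustrative subset of topic-free "function words"; the actual list
# used in the paper (given in its appendix) is much longer.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "because",
                  "since", "although", "however", "upon", "whilst"]

def lexical_features(text):
    """Map a blog post to a small vector of stylistic features.

    A toy version of the feature-extraction stage: the real system computes
    about 1,200 lexical, grammatical, and parse-tree features.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    n = max(len(tokens), 1)
    counts = Counter(t.lower() for t in tokens)

    features = [
        sum(t.isupper() and len(t) > 1 for t in tokens) / n,   # frequency of uppercase words
        sum(1 for c in counts.values() if c == 1) / n,         # words occurring exactly once
        sum(len(t) for t in tokens) / n,                       # average word length
    ]
    # Relative frequency of each function word ('since' vs. 'because', etc.)
    features += [counts[w] / n for w in FUNCTION_WORDS]
    return features
```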

In the classification stage, we algorithmically “learn” a characterization of each author (from the set of feature vectors corresponding to the posts written by that author). Given a set of feature vectors from an unknown author, we use the learned characterizations to decide which author it most likely corresponds to. For example, viewing each feature vector as a point in a high-dimensional space, the learning algorithm might try to find a “hyperplane” that separates the points corresponding to one author from those of every other author, and the decision algorithm might determine, given a set of hyperplanes corresponding to each known author, which hyperplane best separates the unknown author from the rest.
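The hyperplane description above maps naturally onto a one-vs-rest linear classifier. The sketch below uses scikit-learn’s LinearSVC as a stand-in to illustrate the idea; it is not the actual classifier from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rank_authors(train_vectors, train_authors, anon_vectors):
    """One-vs-rest linear classification: learn one separating hyperplane per
    author, then rank authors by how strongly the anonymous posts fall on
    their side of their hyperplane.

    train_vectors -- feature vectors for posts whose authors are known
    train_authors -- author label for each training vector
    anon_vectors  -- feature vectors of the anonymous posts (one unknown author)
    """
    clf = LinearSVC()                        # multi-class => one hyperplane per author
    clf.fit(train_vectors, train_authors)
    # Signed distance of each anonymous post to each author's hyperplane,
    # averaged over the posts; higher means a stronger match.
    scores = clf.decision_function(anon_vectors).mean(axis=0)
    order = np.argsort(scores)[::-1]
    return [(clf.classes_[i], float(scores[i])) for i in order]
```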

We made several innovations that allowed us to achieve the accuracy levels that we did. First, contrary to some previous authors who hypothesized that only relatively straightforward “lazy” classifiers work for this type of problem, we were able to avoid various pitfalls and use more high-powered machinery. Second, we developed new techniques for confidence estimation, including a measure very similar to “eccentricity” used in the Netflix paper. Third, we developed techniques to improve the performance (speed) of our classifiers, detailed in the paper. This is a research contribution by itself, but it also enabled us to rapidly iterate the development of our algorithms and optimize them.
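For the curious, the idea behind such a measure is to ask how far the best candidate’s score stands out from the runner-up’s, relative to the spread of all candidates’ scores. The sketch below is in the spirit of the Netflix paper’s eccentricity, not the exact formula we use.

```python
import numpy as np

def eccentricity(scores):
    """Confidence of the top match: the gap between the best and second-best
    candidate scores, measured in standard deviations of all scores.

    A small value means the top two candidates are nearly indistinguishable,
    so the algorithm should abstain rather than name an author.
    """
    s = np.sort(np.asarray(scores, dtype=float))[::-1]
    if len(s) < 2 or s.std() == 0:
        return 0.0
    return float((s[0] - s[1]) / s.std())
```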

In an earlier article, I noted that we don’t yet have as rigorous an understanding of deanonymization algorithms as we would like. I see this paper as a significant step in that direction. In my series on fingerprinting, I pointed out that in numerous domains, researchers have considered classification/deanonymization problems with tens of classes, with implications for forensics and security-enhancing applications, but that to explore the privacy-infringing/surveillance applications the methods need to be tweaked to be able to deal with a much larger number of classes. Our work shows how to do that, and we believe that insights from our paper will be generally applicable to numerous problems in the privacy space.

Concluding thoughts. We’ve thrown open the doors for the study of writing-style based deanonymization that can be carried out on an Internet-wide scale, and our research demonstrates that the threat is already real. We believe that our techniques are valuable by themselves as well.

The good news for authors who would like to protect themselves against deanonymization is that manually changing one’s style appears to be enough to throw off these attacks. Developing fully automated methods to hide traces of one’s writing style remains a challenge. For now, few people are aware of the existence of these attacks and defenses, and all the sensitive text that has already been written anonymously remains at risk of deanonymization.

[1] A team from Israel has studied authorship recognition with 10,000 authors. While this is interesting and impressive work that bears some similarities to ours, they do not restrict themselves to stylistic analysis, and their method is therefore comparatively limited in scope. Incidentally, they have been in the news recently for some related work.

[2] Although the fraction of users who listed even a single blog in their Google profile was small, there were more than 2,000 users who listed multiple. We did not use the full number that was available.


February 20, 2012 at 9:40 am

Insights on fighting “Protect IP” from a Q&A with Congresswoman Lofgren

Summary. Appeals to free speech and chilling effects are at best temporary measures in the fight against Protect IP and domain seizures. Even if we win this time, it will keep coming back in modified form; the only way to defeat it for good is to convince Washington that artists are in fact thriving, that piracy is not the real problem, and that takedown efforts are not in the interest of society. We in the tech world know this, but we are doing a poor job of making ourselves heard in Washington, and this needs to change.

As most of you know, the Protect IP Act is a horrendous piece of proposed legislation sponsored by the “content industry” that gives branches of the Government powers to seize domain names at will, force websites to remove links, etc. Congresswoman Zoe Lofgren has been one of the very few legislators fighting the good fight, speaking out against this grave threat to free speech.

I was invited to a brown bag lunch with Rep. Lofgren at Mozilla today. (Mozilla has gotten involved in this because of the events surrounding the Mafiaafire add-on and Homeland Security.) I asked the Congresswoman this question (paraphrased):

“Does the strategy of domain-name seizures even have a prayer of achieving the intended outcome, or is it going to lead to something similar to the Streisand effect, as we’ve seen happen repeatedly on the Internet? Tools for circumvention of censorship in dictatorial regimes, that we can all get behind and that the U.S. government has often funded, may be morally different from tools for circumvention of anti-infringement efforts, but they are technologically identical.” [Princeton professor and now FTC chief technologist Ed Felten has pointed this out in a related context.]

In response, Rep. Lofgren pivoted to the point that seemed to be her favorite theme of the day—the tech world needs to come up with ways to monetize online content, she said. Unless that happens, it’s not looking good for our side in the long run.

At first I was slightly annoyed by her not addressing my question, but after she pivoted a couple more times to the same point in answer to other questions, I started to pay close attention.

What the Congresswoman was saying was this:

  1. The only way to convince Washington to drop this issue for good is to show that artists and musicians can get paid on the Internet.
  2. Currently, Washington is not seeing any evidence of this. The Congresswoman believes that new technology needs to be developed to let artists get paid. I believe she is entirely wrong about this; see below.
  3. The arguments that have been raised by tech companies and civil liberties groups in Washington all center around free speech; there is nothing wrong with that but it is not a viable strategy in the long run because the issue is going to keep coming back.

Let’s zoom in on point 2 above. We techies all say we have the answers. New technology is not needed, we say. The dinosaurs of the content industries need to adapt their business models. Piracy is not correlated with a decrease in sales. Piracy happens not because it is cheaper, but because it is more convenient. Businesses need to compete with piracy rather than trying to outlaw it. Artists who’ve understood this are already thriving.

Washington is willing to listen to this. But no one is telling it to them.

There are a million blog posts that make the points above. But those don’t have an impact in Congress. “You vote up articles on Reddit all day,” Rep. Lofgren said. “Guess what, we don’t check Reddit in Washington.” Yes, she actually said that. The exact wording might be off, but she said words to that effect. She also pointed out that the tech industry spends by far the least effort on lobbying. The entire industry has fewer representatives, apparently, than individual companies from many other sectors do.

A lot of information that we consider common knowledge is not available in Washington. It needs to be in a digestible form; for example, academic studies with concrete numbers that can be cited will be particularly useful. But a simple and important first step is to start communicating with policymakers. In my dealings with them, I’ve found them more willing to listen than I would have thought. So here’s my plea to the community to redirect some of the energy that we expend writing blog posts and expressing outrage into something more constructive.


May 19, 2011 at 10:50 pm

The Master Switch and the Centralization of the Internet

One of the most important trends in the recent evolution of the Internet has been the move towards centralization and closed platforms. I’m interested in this question in the context of social networks—analyzing why no decentralized social network has yet taken off, whether one ever will, and whether a decentralized social network is important for society and freedom. With this in mind, I read Tim Wu’s ‘The Master Switch: The Rise and Fall of Information Empires,’ a powerful book that will influence policy debates for some time to come. My review follows.

‘The Master Switch’ has two parts. The former discusses the history of communications media through the twentieth century and shows evidence for “The Cycle” of open innovation → closed monopoly → disruption. The latter, shorter part is more speculative and argues that the same fate will befall the Internet, absent aggressive intervention.

The first part of the book is unequivocally excellent. There are so many historical facts, grand and small, buried in there. Wu makes his case well for the claim that radio, telephony, film and television have all taken much the same path.

A point that Wu drives home repeatedly is that while free speech in law is always spoken of in the context of Governmental controls, the private entities that own or control the medium of speech play a far bigger role in practice in determining how much freedom of speech society has. In the U.S., we are used to regulating Governmental barriers to speech but not private ones, and a lot of the book is about exposing the problems with this approach.

An interesting angle the author takes is to look at the motives of the key men who shaped the “information industries” of the past. This is apposite given the enormous impact on history that each of these few has had, and I felt it added a layer of understanding compared to a purely factual account.

But let’s cut to the chase—the argument about the future of the Internet. I wasn’t sure whether I agreed or disagreed until I realized Wu is making two different claims, a weak one and a strong one, and does not separate them clearly.

The weak claim is simply that an open Internet is better for society in the long run than a closed one. Open and closed here are best understood via the exemplars of Google and Apple. Wu argues this reasonably well, and in any case not much argument is needed—most of us would consider it obvious on the face of it.

The strong claim, and the one that is used to justify intervention, is that a closed Internet will have such crippling effects on innovation and such chilling effects on free speech that it is our collective duty to learn from history and do something before the dystopian future materializes. This is where I think Wu’s argument falls short.

To begin with, Wu doesn’t have a clear reason why the Internet will follow the previous technologies, except, almost literally, “we can’t be sure it won’t.” He overstates the similarities and downplays the differences.

Second, I believe Wu doesn’t fully understand technology and the Internet in some key ways. Bizarrely, he appears to believe that the Internet’s predilection for decentralization is due to our cultural values rather than technological and business realities prevalent when these systems were designed.

Finally, Wu has a tendency to see things in black and white, in terms of good and evil, which I find annoying, and more importantly, oversimplified. He quotes this sentence approvingly: “Once we replace the personal computer with a closed-platform device such as the iPad, we replace freedom, choice and the free market with oppression, censorship and monopoly.” He also says that “no one denies that the future will be decided by one of two visions,” in the context of iOS and Android. It isn’t clear why he thinks they can’t coexist the way the Mac and PC have.

Regardless of whether one buys his dystopian prognostications, Wu’s paradigm of the “separations principle” is to be taken seriously. It is far broader than even net neutrality. There appear to be two key pillars: a separation of platforms and content, and limits on corporate structures to facilitate this, mainly vertical but also horizontal, as in the case of media conglomerates.

Interestingly, Wu wants the separations principle to be more of a societal-corporate norm than Governmental regulation. That said, he does call for more powers to the FCC, which is odd given that he is clear on the role that State actors have played in the past in enabling and condoning monopoly abuse:

Again and again in the histories I have recounted, the state has shown itself an inferior arbiter of what is good for the information industries. The federal government’s role in radio and television from the 1920s to the 1960s, for instance, was nothing short of a disgrace. In the service of chain broadcasting, it wrecked a vibrant, decentralized AM marketplace. At the behest of the ascendant radio industry, it blocked the arrival and prospects of FM radio, and then it put the brakes on television, reserving it for the NBC-CBS duopoly. Finally, from the 1950s through the 1960s, it did everything in its power to prevent cable television from challenging the primacy of the networks.

To his credit, Wu does seem to be aware of the contradiction, and appears to argue that Government agencies can learn and change. It does seem like a stretch, however.

In summary, Wu deserves major kudos both for the historical treatment and for some very astute insights about the Internet. For example, in the last 2-3 years, Apple, Facebook, and Twitter have all made dramatic moves toward centralization, control and closed platforms. Wu seems to have foreseen this general trend more clearly than most techies did.[1] The book does have drawbacks, and I don’t agree that the Internet will go the way of past monopolies without intervention. It should be very interesting to see what moves Wu will make now that he will be advising the FTC.

[1] While the book was published in late 2010, I assume that Wu’s ideas are much older.


March 23, 2011 at 7:51 pm

The Internet has no Delete Button: Limits of the Legal System in Protecting Anonymity

It is futile to try to stay anonymous by getting your name or data purged from the Internet, once it is already out there. Attempts at such censorship have backfired repeatedly and spectacularly, giving rise to the term Streisand effect. A recent lawsuit provides the latest demonstration: two convicted German killers (who have completed their prison sentences) are attempting to prevent Wikipedia from identifying them.

The law in Germany tries to “protect the name and likenesses of private persons from unwanted publicity.” Of course, the Wikimedia Foundation is based in the United States, and this attempt runs head-on into the First Amendment, the right to Free Speech. European countries have a variety of restrictions on speech; Holocaust denial is illegal, for instance. But there is little doubt about how U.S. courts will see the issue; Jennifer Granick of the EFF has a nice write-up.

The aspect that interests me is that even if there weren’t a Free Speech issue, it would be utterly impossible for the court system to keep the names of these men off the Internet. I wonder if the German judge who awarded a judgment against the Wikimedia Foundation was aware that it would achieve exactly the “unwanted publicity” that the law was intended to avoid. He would probably have ruled as he did in any case, but it is interesting to speculate.

Legislators, on the other hand, would do well to be aware of the limitations of censorship, and the need to update laws to reflect the rules of the information age. There are always alternatives, although they usually involve trade-offs. In this instance, perhaps one option is a state-supplied alternate identity, analogous to the Witness Protection Program?

Returning to the issue of enforceability, the European doctrine apparently falls under “rights of the personality,” specifically the “right to be forgotten,” according to this paper that discusses the transatlantic clash. I find the very name rather absurd; it reminds me of attempting not to think of an elephant (try it!).

The above paper, written from the European perspective, laments the irreconcilable differences between the two viewpoints on the issue of Free Speech vs. Privacy. However, there is no discussion of enforceability. The author does suspect, in the final paragraph, that the European doctrine will become rather meaningless due to the Internet, but he believes this to be purely a consequence of the fact that the U.S. courts have put Free Speech first.

I don’t buy it: even if the U.S. courts joined Europe in recognizing a “right to be forgotten,” it would still be essentially unenforceable. Copyright-based rather than privacy-based censorship attempts offer us a lesson here. Copyright law has international scope, due to being standardized by WIPO, and yet the attempt to take down the AACS encryption key was pitifully unsuccessful.

Taking down a repeat offender (such as a torrent tracker) or a large file (the Windows 2000 source code leak) might be easier. But if we’re talking about a small piece of data, the only factor that seems to matter is the level of public interest in the sensitive information. The only times censorship of individual facts has been (somewhat) successful in the face of public sentiment have been within oppressive regimes with centralized Internet filters.

There are many laws, particularly privacy laws, that need to be revamped for the digital age. What might appear obvious to technologists might be much less apparent to law scholars, lawmakers and the courts. I’ve said it before on this blog, but it bears repeating: there is an acute need for greater interdisciplinary collaboration between technology and the law.

November 28, 2009 at 5:22 am


About 33bits.org

I’m an associate professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.

