The Linkability of Usernames: a Step Towards “Uber-Profiles”

February 16, 2011 at 5:19 pm 2 comments

Daniele PeritoClaude Castelluccia, Mohamed Ali Kaafar, and Pere Manils have a neat paper How Unique and Traceable are Usernames? that addresses the following question:

Suppose you find the same username on different online services, what is the probability that these usernames refer to the same physical person?

The background for this investigation is that there is tremendous commercial value in linking together every piece of online information about an individual. While the academic study of constructing “uber-profiles” by linking social profiles is new (see Large Online Social Footprints—An Emerging Threat for another example), commercial firms have long been scraping profiles, aggregating them, and selling them on the grey market. Well-known public-facing aggregators such as Spokeo mainly use public records, but online profiles are quickly becoming part of the game.

Paul Ohm has even talked of a “database of ruin.” No matter what moral view one takes of this aggregation, the technical questions are fascinating.

The research on Record Linkage could fill an encyclopedia (see here for a survey) but most of it studies traditional data types such as names and addresses. This paper is thus a nice complement.

Usernames are particularly useful for carrying out linkage across different sites for two reasons:

  • They are almost always available, especially on systems with pseudonymous accounts.
  • When comparing two databases of profiles, usernames are a good way to quickly find candidate matches before exploring other attributes.

The mathematical heavy-lifting that the authors do is described by the following:

… we devise an analytical model to estimate the uniqueness of a user name, which can in turn be used to assign a probability that a single username, from two different online services, refers to the same user


we extend this model to cases when usernames are different across many online services … experimental data shows that users tend to choose closely related usernames on different services.

For example, my Google handle is ‘randomwalker’ and my twitter username is ‘random_walker’. Perito et al’s model can calculate how obscure the username ‘random_walker’ is, as well as how likely it is that ‘random_walker’ is a mutation of ‘randomwalker’, and come up with a combined score representing the probability that the two accounts refer to the same person. Impressive.

The authors also present experimental results. For example, they find that with a sample of 20,000 usernames drawn from a real dataset, their algorithms can find the right match about 60% of the time with a negligible error rate (i.e., 40% of the time it doesn’t produce a match, but it almost never errs.) That said, I find the main strength of the paper to be in the techniques more than the numbers.

Their models know all about the underlying natural language patterns, such as the fact that ‘random_walker’ is more meaningful than say ‘rand_omwalker’. This is achieved using what are called Markov models. I really like this class of techniques; I used Markov models many years ago in my paper on password cracking with Vitaly Shmatikov to model how people pick passwords.

The setting studied by Perito et al. is when two or more offline databases of usernames are available. Another question worth considering is determining the identity of a person behind a username via automated web searches. See my post on de-anonymizing Lending Club data for an empirical analysis of this.

There is a lot to be said about the psychology behind username choice. Ben Gross’s dissertation is a fascinating look at the choice of identifiers for self-representation. I myself am very attached to ‘randomwalker’; I’m not sure why that is.

A philosophical question related to this research is whether it is better to pick a unique username or a common one. The good thing about a unique username is that you stand out from the crowd. The bad thing about a unique username is that you stand out from the crowd. The question gets even more interesting (and consequential) if you’re balancing Googlability and anonymity in the context of naming your child, but that’s a topic for another day.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

Entry filed under: Uncategorized. Tags: , , , , , .

A Cryptographic Approach to Location Privacy One Click Frauds and Identity Leakage: Two Trends on a Collision Course

2 Comments Add your own

  • 1. W  |  March 16, 2011 at 8:49 pm

    One pretty big leak is gravatar. Even websites that tell you that they never publish your email publish it’s md5 hash. At minimum this associates the user with a unique ID corresponding to their email address. And if the email address has low enough complexity it can even be recovered from the hash.

    • 2. Arvind  |  March 16, 2011 at 10:53 pm

      Excellent point. I think the vast majority of email addresses are recoverable from the hash.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Trackback this post  |  Subscribe to the comments via RSS Feed


I’m an associate professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.

Me, elsewhere

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 265 other subscribers

%d bloggers like this: