Your Morning Commute is Unique: On the Anonymity of Home/Work Location Pairs

May 13, 2009 at 6:42 am 24 comments

Philippe Golle and Kurt Partridge of PARC have a cute paper (pdf) on the anonymity of geo-location data. They analyze data from the U.S. Census and show that for the average person, knowing their approximate home and work locations — to a block level — identifies them uniquely.

Even if we look at the much coarser granularity of a census tract — tracts correspond roughly to ZIP codes; there are on average 1,500 people per census tract — for the average person, there are only around 20 other people who share the same home and work location. There’s more: 5% of people are uniquely identified by their home and work locations even if it is known only at the census tract level. One reason for this is that people who live and work in very different areas (say, different counties) are much more easily identifiable, as one might expect.
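
To make the idea of an anonymity set concrete, here is a minimal sketch in Python. It is not the authors’ code or methodology: the tract labels and the seven workers are made up, and `anonymity_set_sizes` is a name chosen purely for illustration. It counts how many people share each (home, work) pair, then reports the median set size (reading “the average person” as the median person) and the fraction of people who are unique.

```python
from collections import Counter
from statistics import median


def anonymity_set_sizes(pairs):
    """For each person, count how many people (including them) share their pair."""
    counts = Counter(pairs)
    return [counts[p] for p in pairs]


# Toy data (hypothetical): one (home_tract, work_tract) pair per worker.
workers = [
    ("H1", "W1"), ("H1", "W1"), ("H1", "W1"),
    ("H1", "W2"),
    ("H2", "W1"), ("H2", "W1"),
    ("H3", "W9"),
]

sizes = anonymity_set_sizes(workers)
median_size = median(sizes)
unique_fraction = sum(1 for s in sizes if s == 1) / len(sizes)

print("anonymity set sizes:", sizes)
print("median set size:", median_size)
print("fraction uniquely identified:", round(unique_fraction, 3))
```

In this toy population the two workers with an unusual home/work combination end up in anonymity sets of size 1, which mirrors the observation that people who live and work in very different areas are the easiest to pick out.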

The paper is timely, because Location Based Services are proliferating rapidly. To understand the privacy threats, we need to ask the two usual questions:

who has access to anonymized location data?
how can they get access to auxiliary data linking people to location pairs, which they can then use to carry out re-identification?

The authors don’t say much about these questions, but that’s probably because there are too many possibilities to list! In this post I will examine a few.

GPS navigation. This is the most obvious application that comes to mind, and probably the most privacy-sensitive: there have been many controversies around tracking of vehicle movements, such as NYC cab drivers threatening to strike. The privacy goal is to keep the location trail of the user/vehicle unknown even to the service provider — unlike in the context of social networks, people often don’t even trust the service provider. There are several papers on anonymizing GPS-related queries, but there doesn’t seem to be much you can do to hide the origin and destination except via charmingly unrealistic cryptographic protocols.

The accuracy of GPS is a few tens or few hundreds of feet, which is the same order of magnitude as a city block. So your daily commute is pretty much unique. If you took a (GPS-enabled) cab home from work at a certain time, there’s a good chance the trip can be tied to you. If you made a detour to stop somewhere, the location of your stop can probably be determined. This is true even if there is no record tying you to a specific vehicle.

Location based social networking. Pretty soon, every smartphone will be capable of running applications that transmit location data to web services. Google Latitude and Loopt are two of the major players in this space, providing some very nifty social networking functionality on top of location awareness. It is quite tempting for service providers to outsource research/data-mining by sharing de-identified data. I don’t know if anything of the sort is being done yet, but I think it is clear that de-identification would offer very little privacy protection in this context. If a pair of locations is uniquely identifying, a trail is emphatically so.

The same threat also applies to data being subpoena’d, so data retention policies need to take into consideration the uselessness of anonymizing location data.

I don’t know if cellular carriers themselves collect a location trail from phones as a matter of course. Any idea?

Plain old web browsing. Every website worth the name identifies you with a cookie, whether you log in or not. So if you browse the web from a laptop or mobile phone from both home and work, your home and work IP addresses can be tied together based on the cookie. There are a number of free or paid databases for turning IP addresses into geographical locations. These are generally accurate up to the city level, but beyond that the accuracy is shaky.
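
As a rough sketch of the linking step, assume a site keeps a log of (cookie ID, IP address) entries. The log format, the cookie names and the function name `ips_by_cookie` are all hypothetical, and the addresses come from the ranges reserved for documentation. Grouping by cookie is enough to tie two network locations to the same browser:

```python
from collections import defaultdict


def ips_by_cookie(log):
    """Group the distinct IP addresses observed for each cookie ID."""
    seen = defaultdict(set)
    for cookie_id, ip in log:
        seen[cookie_id].add(ip)
    return seen


# Hypothetical access log: (cookie_id, ip_address) entries.
access_log = [
    ("cookie-a", "192.0.2.10"),    # e.g. a home connection
    ("cookie-a", "198.51.100.7"),  # e.g. an office connection
    ("cookie-a", "192.0.2.10"),
    ("cookie-b", "203.0.113.5"),
]

# Keep only the cookies that were seen from more than one address.
linked = {
    cookie: sorted(ips)
    for cookie, ips in ips_by_cookie(access_log).items()
    if len(ips) > 1
}
print(linked)
```

Turning each address into a location would then require one of the IP geolocation databases mentioned above, which this sketch does not attempt.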

A more accurate location fix can be obtained by IDing WiFi access points. This is a curious technological marvel that is not widely known. Skyhook, Inc. has spent years wardriving the country (and abroad) to map out the MAC addresses of wireless routers. Given the MAC address of an access point, their database can tell you where it is located. There are browser add-ons that query Skyhook’s database and determine the user’s current location. Note that you don’t have to be browsing wirelessly — all you need is at least one WiFi access point within range. This information can then be transmitted to websites which can provide location-based functionality; Opera, in particular, has teamed up with Skyhook and is “looking forward to a future where geolocation data is as assumed part of the browsing experience.” The protocol by which the browser communicates geolocation to the website is being standardized by the W3C.

The good news from the privacy standpoint is that the accurate geolocation technologies like the Skyhook plug-in (and a competing offering that is part of Google Gears) require user consent. However, I anticipate that once the plug-ins become common, websites will entice users to enable access by (correctly) pointing out that their location can only be determined to within a few hundred meters, and users will leave themselves vulnerable to inference attacks that make use of location pairs rather than individual locations.

Image metadata. An increasing number of cameras these days have (GPS-based) geotagging built-in and enabled by default. Even more awesome is the Eye-Fi card, which automatically uploads pictures you snap to Flickr (or any of dozens of other image sharing websites you can pick from) by connecting to available WiFi access points nearby. Some versions of the card do automatic geotagging in addition.

If you regularly post pseudonymously to (say) Flickr, then the geolocations of your pictures will probably reveal prominent clusters around the places you frequent, including your home and work. This can be combined with auxiliary data to tie the pictures to your identity.
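
A crude way to find such clusters, sketched below under stated assumptions, is to snap each photo’s coordinates to a coarse grid and count photos per cell. The coordinates are invented, `top_cells` is an illustrative name, and rounding to two decimal places (cells on the order of a kilometre across) is an arbitrary choice; a real attack would presumably use a proper clustering method.

```python
from collections import Counter


def top_cells(points, k=2, precision=2):
    """Bin (lat, lon) points on a grid and return the k most frequent cells."""
    cells = Counter(
        (round(lat, precision), round(lon, precision)) for lat, lon in points
    )
    return [cell for cell, _ in cells.most_common(k)]


# Hypothetical geotags extracted from one pseudonymous photo stream.
photo_locations = [
    (37.4441, -122.1612), (37.4438, -122.1598),
    (37.4442, -122.1603), (37.4436, -122.1609),  # four photos in one cell
    (37.4021, -122.1489), (37.4018, -122.1502),
    (37.4024, -122.1495),                        # three photos in another
    (36.6002, -121.8947),                        # a one-off trip
]

frequent = top_cells(photo_locations)
print(frequent)
```

The two most frequent cells are natural candidates for home and work, while the one-off trip drops out.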

Now let us turn to the other major question: what are the sources of auxiliary data that might link location pairs to identities? The easiest approach is probably to buy data from Acxiom, or another provider of direct-marketing address lists. Knowing approximate home and work locations, all that the attacker needs to do is to obtain data corresponding to both neighborhoods and do a “join,” i.e., find the (hopefully) unique common individual. This should be easy with Acxiom, which lets you filter the list by “DMA code, census tract, state, MSA code, congressional district, census block group, county, ZIP code, ZIP range, radius, multi-location radius, carrier route, CBSA (whatever that is), area code, and phone prefix.”
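
The “join” is nothing more than a set intersection. The sketch below assumes, hypothetically, that each purchased list can be reduced to a set of stable person identifiers; the identifiers and the function name `join_lists` are made up, and I have not checked what format any real marketing list uses.

```python
def join_lists(home_list, work_list):
    """Return the people who appear in both lists (a set intersection)."""
    return set(home_list) & set(work_list)


# Hypothetical records: people listed for the home neighborhood and
# people listed for the work neighborhood.
home_area = {"person-17", "person-42", "person-58"}
work_area = {"person-03", "person-42", "person-99"}

candidates = join_lists(home_area, work_area)
print(candidates)
```

When the intersection contains a single person, the location pair has been re-identified; when it contains a handful, the attacker is left with a short list of candidates.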

Google and Facebook also know my home and work addresses, because I gave them that information. I expect that other major social networking sites also have such information on tens of millions of users. When one of these sites is the adversary — such as when you’re trying to browse anonymously — the adversary already has access to the auxiliary data. Google’s power in this context is amplified by the fact that they own DoubleClick, which lets them tie together your browsing activity on any number of different websites that are tracked by DoubleClick cookies.

Finally, while I’ve talked about image data being the target of de-anonymization, it may equally well be used as the auxiliary information that links a location pair to an identity — a non-anonymous Flickr account with sufficiently many geotagged photos probably reveals an identifiable user’s home and work locations. (Some attack techniques that I describe on this blog, such as crawling image metadata from Flickr to reveal people’s home and work locations, are computationally expensive to carry out on a large scale but not algorithmically hard; such attacks, as can be expected, will rapidly become more feasible with time.)

Summary. A number of devices in our daily lives transmit our physical location to service providers whom we don’t necessarily trust, and who might keep this data around or transmit it to third parties we don’t know about. The average user simply doesn’t have the patience to analyze and understand the privacy implications, making anonymity a misleadingly simple way to assuage their concerns. Unfortunately, anonymity breaks down very quickly when more than one location is associated with a person, as is usually the case.

Entry filed under: Uncategorized. Tags: anonymity, blog_dape, location, privacy, re-identification.


24 Comments

1. Domy Gryfino | May 13, 2009 at 12:09 pm

Detailed analysis. A little bit too long for me – I read a few paragraphs and the summary and I’m going for other articles. I like that geek stuff :)
2. Mark | May 13, 2009 at 6:07 pm

I’m not sure about uniquely identifying – I know several people who live together and work in the same office.

Similarly, I personally worked in the same building as another individual (different company/floor) who just coincidentally happens to live in the condo directly across from me.
- 3. Arvind | May 14, 2009 at 3:13 am
  
  Mark, read the post again: it’s not uniquely identifying for everyone, just for the average person (i.e., for more than 50% of the people). The good thing about having a lot of data is that we can measure things instead of making guesses based on intuition informed by sample sizes of 5 or 10. The paper looked at the entire U.S. private sector working population, more than 100 million people.
4. Simon Hawkin | May 13, 2009 at 11:31 pm

Well, anonymity and privacy are being phased out in our society, which is too bad. It does hit at the core of the society. We will survive but what will come out of it is not clear at this point.
5. Sean Murphy | May 13, 2009 at 11:54 pm

Another well written and thought provoking post. What I take away from your blog is that there are a number of relatively simple “hashing functions” that will allow firms to uniquely identify us from data that on the surface wouldn’t seem to represent a privacy risk but is. Data that’s becoming easier to collect all of the time.
- 6. Arvind | May 14, 2009 at 3:09 am
  
  Sean, I wouldn’t call it a hashing function but other than that that’s a good summary. Sometimes I use the term fingerprint, which is similar to a hash function but not the same.
7. Michael Hudson | May 14, 2009 at 12:27 am

According to the Census website, a CBSA is a segment of area usually spanning a few counties.
http://www.census.gov/population/www/metroareas/metroarea.html

I guess if you were a marketer augmenting your campaigns with census data, it would make sense to tie to some of these areas.
- 8. Arvind | May 14, 2009 at 4:52 am
  
  Interesting, thanks.
9. Ashwin Nanjappa | May 14, 2009 at 7:57 am

Thanks for sharing. This is probably your most eye opening post personally for me yet :-)
10. Anonymity in an increasingly connected world | 'Pataphysical science in the home | May 14, 2009 at 11:22 am

[…] was reading this article Your Morning Commute is Unique: On the Anonymity of Home/Work Location Pairs, by Arvind Narayanan, and found it quite interesting. (Thanks to @jamespage for pointing to this […]
11. gangbox | May 14, 2009 at 1:44 pm

I work in construction – so I typically change employers about 20 times a year. So my commutes are ALWAYS different – I might have one commuting pattern for a week, but then I’ll have a totally different pattern for the next three weeks.

Which means that I’m untrackable – even with GPS!

Casual labor WIN!
- 12. Anonymous Coward | August 15, 2010 at 9:22 pm
  
  Actually, the more you change your commute pattern, the more it would stand out among others’ patterns, thereby making you more unique and discernible from others in the crowd (assuming that different patterns can be associated with the same person). It may be best to adopt a pattern which is “maximally average”…
13. professor | May 16, 2009 at 4:25 pm

I teach in a small college. I only commute twice a week to teach and the rest of the week I work from home and in cafes. I never wake up in time to even meet the morning commuters.

I take the New York City subways, so there’s no GPS to even talk about!

Labor of knowledge also WIN!
14. Kotlina Klodzka | May 19, 2009 at 7:04 am

Yeah, most people don’t care but I’m kinda paranoid about anonymity. Even my home is marked in the GPS in my phone 2 blocks away, just in case.
15. foobar | May 22, 2009 at 1:07 pm

“I don’t know if cellular carriers themselves collect a location trail from phones as a matter of course. Any idea?”

I’m pretty sure they do. For example, in France they keep the logs for two years. Law enforcement agencies have access to these logs without a subpoena.
It’s very surprising when a policeman knows every location you were at in the last 2 years…

“Yeah, most people don’t care but I’m kinda paranoid about anonymity. Even my home is marked in the gps in my phone 2 blocks away, just in case.”

But you still have a phone…
- 16. Arvind | May 22, 2009 at 4:12 pm
  
  Yup, that indeed seems to be the case in Europe. In fact, there is an E.U. directive mandating retention in all member countries.
  
  I’m not sure what the situation is in the U.S. The general rule is that Americans trust their government less than Europeans do but their corporations more. Since the purpose of mandatory telecom data retention is to help law enforcement, I would expect things to be much more privacy-friendly on this side of the pond.
17. Signalfire » Scrubbed geo-location data not so anonymous after all | May 25, 2009 at 6:43 pm

[…] data, it seems, might end up being a similar land mine, according to the 33 Bits of Entropy blog, which provides further analysis of the findings. A PDF of the original […]
18. Odchudzanie | June 17, 2009 at 11:56 am

Unfortunately this is the direction the world is heading – no anonymity and everything is government controlled. Now I’m no conspiracy theory freak, I don’t believe in bitter old men sitting in a dark room and plotting world domination.
This seems like a logical next step on our civilizational ladder.
- 19. Arvind | June 17, 2009 at 4:23 pm
  
  Everything is government controlled, huh? Are you sure you’re not a conspiracy theorist? :-)
  - 20. Lars | November 23, 2009 at 9:56 am
    
    Sorry for disregarding your smiley, but having a heap of data all in one place is too huge a problem to be disregarded. The problem is not whether this is government or not, although being government, with all those pesky special powers, helps a lot in misusing that data. The problem is the creation of single points of failure for data disclosure protection.
    
    Assuming we completely trust “the government” to always act benignly, one cannot but observe that failures happen. The problem with disclosure failures, which makes them especially critical, is that they cannot be recovered from. Given sufficiently critical data, disclosure might be much worse than tainted or manipulated data. (I think secret services understood that part a long time ago.)
    
    The main reason I am arguing thus is that privacy arguments are often brushed aside by assigning them the “conspiracy theory” label. One may avoid being labeled in that way by avoiding questions of government/corporation trust and sticking to the technical consequences. (Does anybody need to mention WikiLeaks as one operation that emphasizes the difficulty of getting the information djinni back in his bottle?)
21. Álvaro Del Hoyo | March 19, 2010 at 9:49 am

Arvind,

News related to these ruminations on location data you posted

http://arstechnica.com/science/news/2010/02/cell-phones-show-human-movement-predictable-93-of-the-time.ars

Telcos and data locations….Europe…yes, mainly commercial and lawful interception uses, but other uses allowed by law

1. Commercial…informed previous consent…added value services, adverts based on location data…for instance, friends geotagging, daddy following kids on Saturday night ;-p

2. Lawful interception….interception and data retention…two similar, but different things

Differences:

a) interception is communications diversion to law enforcement agencies after subpoena; data retention is archiving of information given traffic and location data and handover to law enforcement agencies after subpoena

b) interception includes content (voice, text, images,…) plus traffic and location data of intercepted communications being diverted from telco network systems to law enforcement agencies (let’s say it is online); data retention does not include content, only traffic and location data

Similarities, resemblances:

a) Subpoena required for detection, investigation of serious crimes whatever this term means…there is legal definition but many doubts arise ;-p

b) Law enforcement agencies do not have access to telco systems…there are different interception and retained data handover interfaces designed by ETSI…have you seen the Cryptome leaks about well-known companies’ spy services for law enforcement agencies, with police accessing data directly ;-p

The data retention term varies among EU members…the Directive allows a term of between 6 and 24 months…in Spain it is 12 months…but afterwards the data must be kept for an additional 3-year term with access blocked, so that only the telco’s Security Manager can access it in case users sue the company over a privacy breach

Recently, the German data retention transposition law was declared unconstitutional. Ireland is asking the Court of Justice of the European Union to rule on the validity of the Directive. But right now telcos are obeying national transposition laws.

Then, not related to location data, but to traffic data:

a) Cell simulators, triggerfishers,… Scanners employed by law enforcement agencies that pretend to be a mobile cell tower near people and capture their IMSI, after which the agencies ask for subpoenas so that telcos identify the users.

b) So-called three-strikes laws: after telcos detect you using P2P networks three times, they are obliged to cut your internet access. The HADOPI law in France is the only baby so far. There is no Directive asking for it.

c) Deep Packet Inspection is the next battle for free speech, communications secrecy and private life

Other traffic and location data usages: invoicing, interconnection payments between operators, fraud detection, customer care, networks security -in terms of availability, SLAs and business continuity-.

And what about USA:

http://news.cnet.com/Gonzales-pressures-ISPs-on-data-retention/2100-1028_3-6077654.html?tag=st.nl

Data retention was an issue in the Obama-Hillary Clinton debates, but I have not followed it since then.

A Spanish IT lawyer and infosec consultant grateful for your excellent work and blog.

Regards
- 22. Arvind | March 22, 2010 at 7:51 pm
  
  Thanks very much for the info.!
23. Álvaro Del Hoyo | March 23, 2010 at 11:10 pm

Would love a post about Google Analytics/Urchin cookies…utma plus users logged in for website managers….and Google itself?
24. Some thoughts on Color with a capital C | 'Pataphysical science in the home | June 14, 2011 at 3:54 pm

[…] Another surprising oversight given the data-driven nature of the founders is that “for the average person, knowing their approximate home and work locations — to a block level — identifies them uniquely.” […]