Myths and Fallacies of “Personally Identifiable Information”

June 21, 2010 at 8:12 pm

I have a new paper (PDF) with Vitaly Shmatikov in the June issue of the Communications of the ACM. We talk about the technical and legal meanings of “personally identifiable information” (PII) and argue that the term means next to nothing and must be greatly de-emphasized, if not abandoned, in order to have a meaningful discourse on data privacy. Here are the main points:

The notion of PII is found in two very different types of laws: data breach notification laws and information privacy laws. In the former, the spirit of the term is to encompass information that could be used for identity theft. We have absolutely no issue with the sense in which PII is used in this category of laws.

On the other hand, in laws and regulations aimed at protecting consumer privacy, the intent is to compel data trustees who want to share or sell data to scrub “PII” in a way that prevents the possibility of re-identification. As readers of this blog know, this is essentially impossible to do in a foolproof way without losing the utility of the data. Our paper elaborates on this and explains why “PII” has no technical meaning, given that virtually any non-trivial information can potentially be used for re-identification.
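To make the re-identification point concrete, here is a minimal sketch in Python. It is not taken from the paper: the records, the attribute names, and both helper functions are hypothetical, and the world population figure (about 6.8 billion in 2010) is an approximation. It illustrates two things: how few bits of information are needed to single out one person, and how attributes that look harmless on their own can become unique in combination.

```python
import math
from collections import Counter


def bits_to_identify(population: int) -> float:
    """Bits of information needed to single out one person among `population`."""
    return math.log2(population)


def unique_fraction(records, attributes):
    """Fraction of records whose combination of `attributes` appears only once."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Hypothetical "de-identified" records: names and ID numbers already removed.
records = [
    {"zip": "78701", "birth_year": 1975, "sex": "F"},
    {"zip": "78701", "birth_year": 1975, "sex": "M"},
    {"zip": "78701", "birth_year": 1982, "sex": "F"},
    {"zip": "78705", "birth_year": 1982, "sex": "F"},
    {"zip": "78705", "birth_year": 1982, "sex": "F"},
    {"zip": "78705", "birth_year": 1990, "sex": "M"},
]

if __name__ == "__main__":
    # Assumed world population, roughly the 2010 figure.
    print("bits to identify one person:", bits_to_identify(6_800_000_000))
    for attrs in (["sex"], ["zip"], ["zip", "birth_year", "sex"]):
        print(attrs, "unique fraction:", unique_fraction(records, attrs))
```

In this toy dataset no record is unique on sex alone or on ZIP code alone, yet four of the six records are unique on the combination of ZIP code, birth year and sex. Anyone who knows those three facts about a person from another source can pick out that person's row, even though none of the three fields would be labeled “PII.” The bit count comes out just under 33 for the assumed population.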

What we are gunning for is the get-out-of-jail-free card, a.k.a. “safe harbor,” particularly in the HIPAA (health information privacy) context. In current practice, data owners can absolve themselves of responsibility by performing a syntactic “de-identification” of the data (although this isn’t the spirit of the law). Even your genome is not considered identifying!

Meaningful privacy protection is possible if we take into account the specific types of computations that will be performed on the data (e.g., collaborative filtering or fraud detection). It is virtually impossible to guarantee privacy by considering the data alone, without carefully defining and analyzing its desired uses.

We are well aware of the burden that this imposes on data trustees, many of whom find even the current compliance requirements onerous. Often there is no one available who understands computer science or programming, and there is no budget to hire someone who does. That is certainly a conundrum, and it isn’t going to be fixed overnight. However, the current situation is a farce and needs to change.

Given that technologically sophisticated privacy protection mechanisms require a fair bit of expertise (although we hope that they will become commoditized in a few years), one possible way forward is by introducing stronger acceptable-use agreements. Such agreements would dictate what the collector or recipient of the data can and cannot do with it. They should be combined with some form of informed consent, where users (or, in the health care context, patients) acknowledge their understanding that there is a re-identification risk. But the law needs to change to pave the way for this more enlightened approach.

Thanks to Vitaly Shmatikov for comments on a draft of this post.


6 Comments

1. Jeff Granger | June 22, 2010 at 2:28 am

Hi Arvind,

This is an interesting article. In my company’s experience it is a common misconception that simple de-identification of data protects privacy.

[One thing – your link to the PDF is broken. It references http://userweb.cs.utexas.edu/users//shmat/shmat_cacm10.pdf, which has one too many slashes.]

All the best,

Jeff
- 2. Arvind | June 22, 2010 at 2:34 am
  
  Thanks for your comment. The link should work — webservers generally ignore extra slashes. It does work for me. But for good measure I’ve removed the slash. Let me know if it works for you now.
  - 3. Jeff Granger | June 22, 2010 at 12:01 pm
    
    Yep. Works ok now. Thanks.
4. Alvaro Del Hoyo | August 26, 2010 at 10:00 am

Arvind,

Thank you so much.

In Europe we have the same problem. We have a pair of legal concepts that match your concept of personally identifiable information.

Art. 2(a), Directive 95/46/EC

‘personal data’ shall mean any information relating to an identified or identifiable natural person (‘data subject’); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity;

In Spain there is an exception: identifying the identifiable person should not require disproportionate periods of time or activities, whatever that means.

But what is our economic, cultural or social identity? Being classified as high, middle or low income per year, even if the classification could be wrong? Being classified as illiterate or not? Our nickname or avatar on social networks or in online game communities?

Really funny pair of legal concepts.

Under EU law, I guess online advertising companies, like any stakeholder in online activities, are processing personal data simply because they handle traffic data and/or deploy cookies.

Apparently there is no need to apply the 33 bits of entropy formula.

What do you think? ;-p

Regards
5. Alvaro Del Hoyo | August 26, 2010 at 11:44 am

Maybe this document is of interest to you… it does not resolve the problem in any case ;-p

wp136_en.pdf
- 6. Arvind | August 30, 2010 at 6:39 am
  
  Thanks! That’s interesting. I will give it a more thorough read when I get a chance.

33 Bits of Entropy