For those following this blog closely, you know CDP is developing the concept of a datatrust. Slightly less obvious may be the fact that we’re actually planning on building a datatrust too. As such, we have been following interesting privacy ideas and technologies for a few years now by attending conferences, reading papers and talking to interesting folks.
Privacy Challenges with Aggregates
One of the key realizations that led to the creation of CDP was that the data that is valuable for analysis (generally aggregate statistical data) is not, in principle, the data that concerns privacy advocates (identifiable, personal information). While that is true in many cases, the details are a bit more complicated.
I often quote the following statistic from Latanya Sweeney’s famous 2000 paper:
87% of people in the US can be uniquely identified by a combination of their zip code, their gender and their date of birth.
(Since then there has been some debate around that fact – Philippe Golle at PARC said in 2006 it’s really 63%.) But the fact remains that often your seemingly innocuous demographic data can actually be as unique a fingerprint as your social security number. A while back I wrote about how it is standard practice for pollsters to match a set of “anonymous” survey responses against any number of public databases that also contain those demographic details, and so tie your name(s), address(es), income, title(s) and car(s) to your “anonymous” survey results. (Anonymous is in quotes because it’s CDP humor – we believe the term is used incorrectly most of the time.) It’s like the Statistician’s Edition of Trivial Pursuit. In fact, zip code, gender and birth date are just an example – the more characteristics of any person (or place or object or anything really) you collect, the more likely it is that the set is unique. How unique is the set of purchases on your monthly credit card statement?
This reality poses a potentially showstopping problem for the datatrust: sure, aggregates are probably fine most of the time, but if we want to store and allow analysis of highly sensitive data, how can we be sure identities won’t be derived from even the aggregates?
PINQ: Privacy Integrated Queries
Enter PINQ, just made public by Frank McSherry at Microsoft Research under the standard MSR license. PINQ, or Privacy Integrated Queries, is an implementation of a concept called differential privacy. (Cynthia Dwork’s paper seems to be a good overview before diving into the math behind it; Frank’s paper speaks to PINQ in particular. There’s also a tutorial for those who want to get their hands dirty.) PINQ provides a layer between the data analyst and the datastore that ensures no privacy disclosures.
Wait, one better: It guarantees no privacy disclosures. How could that be?
Here’s an example: Imagine you record the heights of each of the people in your subway car in the morning, and calculate the average height. Then imagine that you also recorded each person’s zip code, gender and birth date. According to Sweeney above, if you calculated the “average” height of each combination of zip code, gender and birth date, you would not only know the exact height of 87% of the people on the car, but, with the help of some other public databases, you’d also know who they were.
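To make that concrete, here is a minimal Python sketch of the subway car example. The riders, their details and the function name are all invented for illustration. It averages height for each combination of zip code, gender and birth date, then picks out the combinations that contain only one person, where the "average" is simply that person's exact height.

```python
from collections import defaultdict

# Hypothetical subway-car records (made up for illustration):
# (zip code, gender, birth date, height in inches)
riders = [
    ("10001", "F", "1975-03-02", 64.0),
    ("10001", "M", "1980-11-17", 70.5),
    ("10002", "F", "1975-03-02", 62.0),
    ("10002", "F", "1975-03-02", 66.0),
]


def average_height_by_group(records):
    """Return {(zip, gender, birth date): (average height, group size)}."""
    groups = defaultdict(list)
    for zip_code, gender, birth_date, height in records:
        groups[(zip_code, gender, birth_date)].append(height)
    return {key: (sum(hs) / len(hs), len(hs)) for key, hs in groups.items()}


averages = average_height_by_group(riders)

# A group of one is not an aggregate at all: its "average" is one
# person's exact height, keyed by details found in public databases.
exposed = {key: avg for key, (avg, size) in averages.items() if size == 1}

for key, height in exposed.items():
    print(key, "-> exact height", height)
```

Only the two riders who share a zip code, gender and birth date are hidden by the averaging; everyone else is exposed.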
Here’s where differential privacy helps. Take the height data you recorded (zip codes and all) and put it behind a differential privacy software wall. The wall adds just the right amount of noise to the results, so you the analyst can query for statistical representations of all different combinations of the characteristics, and you’ll get an answer and a measure of the accuracy of the response. Instead of finding out that the average height was 5′ 5.23″, you might find out that the average height was 5′ 4″ +/- 0.75″. (I’m making these numbers up and over-simplifying.)
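Here is a rough sketch of how a noisy average like that could be produced, using the Laplace mechanism, a standard differential privacy technique. This is not PINQ's actual API (PINQ is a C# library built on LINQ). The function names, the assumed height bounds of 48 to 84 inches and the epsilon value are all assumptions for illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_average(values, lower, upper, epsilon, rng):
    """Average of `values` with Laplace noise (illustrative sketch only)."""
    # Clamp each value so one person can change the sum by a bounded amount.
    clamped = [min(max(v, lower), upper) for v in values]
    # Split the privacy budget evenly between the sum and the count.
    eps_each = epsilon / 2.0
    # Adding or removing one person changes the sum by at most this much...
    sum_sensitivity = max(abs(lower), abs(upper))
    noisy_sum = sum(clamped) + laplace_noise(sum_sensitivity / eps_each, rng)
    # ...and changes the count by exactly 1.
    noisy_count = len(clamped) + laplace_noise(1.0 / eps_each, rng)
    # Guard against a tiny or negative noisy count.
    return noisy_sum / max(noisy_count, 1.0)


rng = random.Random(2009)  # seeded only so the sketch is repeatable
heights = [64.0, 70.5, 62.0, 66.0, 68.5] * 40  # 200 made-up riders
print("true average :", sum(heights) / len(heights))
print("noisy average:", noisy_average(heights, 48.0, 84.0, 1.0, rng))
```

A smaller epsilon means more noise and stronger privacy; a larger one means the reverse. With only a handful of riders the noise would swamp the answer, which is exactly the point when someone asks about a group of one.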
A Programmatic Privacy Guarantee
The guarantee of differential privacy in the above example is that if you removed any one person from the subway car dataset and asked PINQ again for the average height, the answer would be essentially the same (with the same level of accuracy) as the answer when they were included in the set.
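One way to see what the guarantee means in practice: for a query answered with Laplace noise, every possible answer is almost as likely with a given person in the dataset as without them. The sketch below checks this numerically for a simple count of riders (a count rather than an average, to keep the arithmetic simple). It assumes the Laplace mechanism with a made-up epsilon of 0.1 and says nothing about how PINQ itself is implemented; all names are hypothetical.

```python
import math


def laplace_pdf(x, center, scale):
    # Density of a Laplace distribution centered on the true answer.
    return math.exp(-abs(x - center) / scale) / (2.0 * scale)


def worst_case_ratio(answer_with, answer_without, sensitivity, epsilon, outputs):
    """Largest likelihood ratio between the two datasets over `outputs`."""
    scale = sensitivity / epsilon
    worst = 1.0
    for x in outputs:
        p = laplace_pdf(x, answer_with, scale)
        q = laplace_pdf(x, answer_without, scale)
        worst = max(worst, p / q, q / p)
    return worst


epsilon = 0.1  # assumed privacy parameter
candidate_outputs = [x / 2.0 for x in range(0, 161)]  # 0.0, 0.5, ..., 80.0

# 40 riders with one particular person on the car, 39 without them.
ratio = worst_case_ratio(40, 39, 1.0, epsilon, candidate_outputs)
print("worst-case likelihood ratio:", ratio)
print("bound e^epsilon            :", math.exp(epsilon))
```

No answer is more than a factor of e^epsilon more likely in one case than the other, so seeing the answer tells an attacker very little about whether that person was on the car.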
For the analyst trying to understand the big picture, PINQ offers accurate answers and privacy. For the attacker (to use the security lingo) seeking an individual’s data, PINQ offers answers so inaccurate they are useless.
What every prototype needs: Road Miles
Does it work? I’ve chatted with Frank a lot, and there seems to be a growing consensus in the research communities that it does; based on what I have seen I am very optimistic. However, at least right now, the guarantee is less of a concern than usability: How hard is it to understand a dataset when all you can extract from it are noisy aggregates? We’re hoping that it is more useful than not having any access to certain sensitive datasets, but we don’t know yet.
So what will we be doing for the next few months? Taking PINQ out on the highway. And trying to figure out what role it can play in the datatrust. We’ll keep you posted!