PINQ Privacy Demo

January 7th, 2010 by Tony Gibbon

Editor’s Note: Tony Gibbon is developing a datatrust demo as an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Tony’s work, like Grant’s, could have interesting implications for CDP’s mission, as it would use technologies that could enable more disclosure of personal data for public re-use.  We’re happy to have him guest blogging about the demo here.

Back in August, Alex wrote about the PINQ privacy technology and noted that we would be trying to figure out what role it could play in the datatrust.  The goal was to build a demo of PINQ in action and, in the process, get a better understanding of PINQ and its challenges and quirks.  We settled on a quick-and-dirty interactive demo that tries to answer the following questions.

What does PINQ bring to the table?

Before we look at the benefits of PINQ, let’s first take a look at the shortcomings of one of the ways data is often released with an example taken from the CDC website.

This probably isn’t the best example of a compelling dataset, but it is a good example of the lack of flexibility of many datasets that are available—namely that the data is pre-bucketed and there is a limit to how far you are able to drill down on the data.

On one hand, the limitation makes sense:  If the CDC allowed you (or your prospective insurance company) to view disease information at street level, the potential consequences would be quite frightening.  On the other hand, it is also potentially limiting the value of the data.  For example, counties are not necessarily homogeneous.  Depending on the dataset, a researcher may legitimately wish to drill down without wanting to invade anyone’s privacy, for example to compare urban vs. suburban incidence.

This is where PINQ shines: it addresses both concerns.  PINQ allows you to execute an arbitrary aggregate query (meaning I can ask how many people are wearing pink, but I can’t ask PINQ to list the names of the people wearing pink) while still protecting privacy.

Let’s turn to the demo.  (Note: the data points in the demo were generated randomly and do not actually indicate people or residences, much less anything about their health.)  The quickest, most visual arbitrary query we came up with is drawing a rectangle on a map and counting the data points that fall inside, so we placed hundreds of “sick” people on a map to let users count them.  (Keep in mind that a PINQ query need not be limited to location on a map.  It could be numerical like age, textual like name, involve multiple fields, and so on.)

Now let’s attempt to answer the researcher’s question.  Is there a higher incidence of this mysterious disease in urban or suburban areas?  For the sake of simplicity, we’ll pretend he’s particularly interested in two similarly populated, conveniently rectangular areas: one in Seattle and the other in a nearby suburb as shown below:

An arbitrary query such as this one is clearly not possible with pre-bucketed data like the diabetes-by-county table above.  Let’s take a look at what PINQ spits out.

We get an “answer” and a likely range.  (The likely range is actually an input to the query, but that’s a topic for another post.)  So what does this mean? Are there really 311.3 people in Seattle with the mysterious disease?  Why are there partial people?

PINQ adds a random amount of noise to each answer, which prevents us from measuring the impact of any single record in the dataset.  The PINQ answer indicates that about 311 people (plus or minus noise) in Seattle have the disease.  The noise, though randomly generated, is likely to fall within a particular range, in this case ±30.  So the actual number for Seattle is likely to be within 30 of 311, while the actual number of people in the nearby suburb with the disease is likely to be within 30 of 177.
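PINQ itself is a C# layer over LINQ, but the mechanism it uses for counts, Laplace noise scaled by a privacy parameter, can be sketched in a few lines of Python.  Everything below (coordinates, the "Seattle" rectangle, the epsilon value) is made up for illustration:

```python
import random

def laplace_noise(scale, rng):
    # The difference of two Exp(1) draws is Laplace-distributed with the given scale.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def noisy_count(points, in_region, epsilon, rng):
    # Adding or removing one point changes the true count by at most 1,
    # so Laplace noise with scale 1/epsilon masks any single record.
    true_count = sum(1 for p in points if in_region(p))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Made-up (longitude, latitude) points standing in for the demo's "sick" people.
rng = random.Random(7)
points = [(rng.uniform(-122.5, -122.0), rng.uniform(47.3, 47.8)) for _ in range(500)]

# A rectangle standing in for the box drawn over Seattle on the map.
def seattle(p):
    return -122.45 <= p[0] <= -122.25 and 47.5 <= p[1] <= 47.7

print(noisy_count(points, seattle, epsilon=0.1, rng=rng))
```

With epsilon = 0.1 the noise has scale 10, and about 95% of draws land within ±30 of the true count, which is the kind of “likely range” the demo reports.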

Given these numbers (and ignoring the oversimplification and silliness of his question), the researcher could conclude that the incidence in the urban area is higher than the suburban area.  As a bonus, since this is a demo and no one’s privacy is at stake, we can look at the actual data and real numbers:

The answers from PINQ were in fact pretty close to the real answer.  We got a little unlucky with the Seattle answer as the actual random noise for that query was slightly greater than the likely range, but our conclusion was the same as if we had been given the real data.

But what about the evil insurance company/employer/neighbor?

By now, you’re hopefully starting to see the potential value of allowing people to execute arbitrary queries rather than relying on pre-bucketed data, but what about the potential harm?  Let’s imagine there’s a high correlation between having this disease and having high medical costs.  While you might want your data included in this dataset so it could be studied by someone researching a cure, you probably don’t want it used to discriminate against you.

To examine this further, let’s zoom in and ask about the disease at my house.  PINQ only allows questions with aggregate answers, so instead of asking “does Tony have the disease?” we’ll ask, “how many people at Tony’s house have the disease?”

You’ll notice that, unlike the CDC map, PINQ doesn’t try to stop me from asking this potentially harmful, privacy-infringing question.  (I don’t actually live there.)  PINQ doesn’t care whether the actual answer is big or small, or whether I ask about a large or small area; it just adds enough noise to ensure that the presence or absence of a single record (in this case, a person) doesn’t have a discernible effect on the answers.

PINQ’s answer was “about 2.4, with likely noise within +/-5” (I dialed down the likely noise to +/-5 for this example).  As with all PINQ answers, we have to interpret this answer in the context of my initial question: “Does Tony have the disease?”  Since the added noise is likely to be between -5 and 5, the real answer is likely to be between 0 and 7, inclusive, and we can’t draw any strong conclusions about my health because the noise overwhelms the real answer.
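To see why the noise makes a single-address answer meaningless, here is a hypothetical Python sketch comparing a house whose one resident is on record as infected with a house that has no record at all.  The scale 5/ln(20) ≈ 1.67 puts about 95% of the noise within ±5, matching the dialed-down setting above.  (A real deployment would charge privacy budget for every query; the repeated draws here are only to visualize the overlapping distributions.)

```python
import math
import random

def laplace_noise(scale, rng):
    # The difference of two Exp(1) draws is Laplace-distributed with the given scale.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

# Pick the scale so ~95% of noise draws fall within +/-5:
# P(|noise| < t) = 1 - exp(-t / scale), so scale = 5 / ln(20) ~= 1.67.
scale = 5 / math.log(20)

rng = random.Random(42)
neighbor_true = 1  # one infected resident on record
my_true = 0        # no record at my address at all

# Draw several noisy answers for each address just to see how much they overlap.
neighbor_answers = [neighbor_true + laplace_noise(scale, rng) for _ in range(5)]
my_answers = [my_true + laplace_noise(scale, rng) for _ in range(5)]
print(neighbor_answers)
print(my_answers)
```

The two lists of answers are drawn from distributions centered only one unit apart with noise five times wider, so no single answer tells the two houses apart.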

Another way of looking at this is that we get similarly inconclusive answers when we try to attack the privacy of both the infected and the healthy.  Below I’ve made the diseased areas visible on the map and we can compare the results of querying me and my neighbor, only one of whom is infected:

Keep in mind that my address may not be in the dataset because I’m healthy or because I chose not to submit my information.  In either case, the noise causes the answer at my house to be indistinguishable from the answer at my neighbor’s address, and our decisions to be included in or excluded from the dataset do not affect our privacy.  Equally important, as the first example showed, the addition of this privacy-preserving noise does not preclude the extraction of useful answers from the dataset.

You can play with the demo here (requires Silverlight).


4 Responses to “PINQ Privacy Demo”

  1. Tim Schmidt says:

    PINQ and systems like it would be a paradigm shift in epidemiology. Preservation of privacy while providing access to large, census-like datasets would change the game of health-exposure association investigations. Instead of cumbersome and underpowered individual studies of, say, the association between proximity to a major roadway and asthma, a huge dataset would answer the question in one fell swoop.

    Of course, the technology will come long before the social change. In order for PINQ to truly shine, it needs large, high-definition datasets. I’m not sure the CDC even has those. Current methods of data collection may obscure definition early on out of concern for privacy (e.g. local departments of public health may aggregate neighborhood data and delete any address identifiers). Thus, the public is going to have to become comfortable with the idea that all their important personal information (including health information) will be held in a handful of computers somewhere. That comfort will come with an understanding of how privacy is safeguarded in such a system and that will take time.

    I have one question: Is it possible to allow any search query (of any specificity/definition) without opening up the possibility of eliminating the shroud of noise through a huge amount of sampling?  For example, if I compare Tony’s house to his neighbor’s one billion times, would I get statistically significantly different averages? It seems like limiting the number of queries a particular user can make is burdensome. Might it be necessary to impose a limit on how far one can “zoom into the data”?

  2. Tony Gibbon says:

    Tim, your question is a good segue into one of our current challenges.

    For the sake of simplicity, in the post I described the result of the query as either ‘yes, we can draw a conclusion’ or ‘there’s too much noise.’ In actuality, we’re getting some information about “Tony’s house,” even though it may not be enough to answer our question. Also, when you ask PINQ a question, you specify the standard deviation of the noise distribution.

    If we just allowed people to go nuts with PINQ, they could ask for a very small amount of noise (e.g. likely range of +/- 0.5), or as you correctly suggested, a person could take a sample (though it doesn’t take a huge amount to start to see a trend) and combine the small inconclusive pieces of information into something conclusive.
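    Tim’s averaging attack is easy to simulate.  The sketch below (hypothetical Python, not PINQ itself) asks the same single-house question many times with noise “likely within +/-5” and averages the answers; the noise washes out, which is exactly why repeated queries have to draw down a privacy budget:

```python
import random

def laplace_noise(scale, rng):
    # The difference of two Exp(1) draws is Laplace-distributed with the given scale.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

rng = random.Random(0)
true_answer = 1   # "how many people at this house have the disease?"
scale = 5 / 3     # roughly the "likely within +/-5" setting from the post

# A single noisy answer is inconclusive...
single = true_answer + laplace_noise(scale, rng)

# ...but averaging many repeated answers recovers the true value.
n = 100_000
average = sum(true_answer + laplace_noise(scale, rng) for _ in range(n)) / n
print(single, round(average, 3))
```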

    So, each time you ask a question some information is leaking out, and one of our big challenges with the datatrust is figuring out how to manage this flow, including among multiple users.

    Limiting how far one can zoom into the data, as you suggest, is problematic in that we would need to know what data we’re trying to protect. For instance, a smaller area on a map doesn’t necessarily mean fewer people (think Brooklyn vs. Alaska), so we can’t necessarily limit “rectangle size”. In the same vein, we can’t disallow queries with “not enough” people in them, because you’d be able to learn about the dataset from which queries we weren’t allowing.

    There are some nice solutions that allow practically unlimited queries for datasets that are easily segmented (like maps). In one, you precompute the noisy answers at various resolutions, and when someone asks a question, the appropriate segments are summed. The answers are reused, so no additional information is leaked by subsequent queries.
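    A minimal sketch of that precompute-and-reuse idea (hypothetical Python, using a single fixed resolution rather than the multi-resolution scheme described): each grid cell’s count is noised exactly once, and rectangle queries are answered by summing the cached cells, so asking twice returns the identical answer and leaks nothing new:

```python
import random

def laplace_noise(scale, rng):
    # The difference of two Exp(1) draws is Laplace-distributed with the given scale.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

class NoisyGrid:
    """One-time noisy histogram over a fixed grid; answers are cached and reused."""
    def __init__(self, points, cell_size, epsilon, rng):
        self.cell_size = cell_size
        counts = {}
        for x, y in points:
            key = (int(x // cell_size), int(y // cell_size))
            counts[key] = counts.get(key, 0) + 1
        # Noise every occupied cell once, up front.  (A full implementation
        # would also noise empty cells so absence itself leaks nothing.)
        self.noisy = {k: c + laplace_noise(1 / epsilon, rng) for k, c in counts.items()}

    def rect_count(self, x0, y0, x1, y1):
        # Sum the cached noisy cells the rectangle fully covers; no fresh noise
        # is drawn, so repeating the query returns the same answer.
        s = self.cell_size
        return sum(v for (i, j), v in self.noisy.items()
                   if x0 <= i * s and (i + 1) * s <= x1
                   and y0 <= j * s and (j + 1) * s <= y1)

rng = random.Random(3)
points = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(1000)]
grid = NoisyGrid(points, cell_size=10, epsilon=0.5, rng=rng)
print(grid.rect_count(0, 0, 50, 50))
```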

  3. Aaron Beach says:

    I’ve been working on integrating anonymization into social network applications (Facebook apps). The common questions asked in these applications are “get friends”, “get movies”, or “get profile” (basic name/education/employment info). I was wondering how PINQ/differential privacy could be applied to such questions. Most applications do not care about the data distribution or noise; they simply want a list of strings (movie names / friends) to integrate into the application.

    So, for example, if I like movies A, B, C, and D, which together identify me, could I use PINQ to anonymize this set and return something to the application that doesn’t identify me?

  4. Tony Gibbon says:

    Hi Aaron, thanks for the question – I’m not sure I follow the scenario you’re referring to for the Facebook app.

    PINQ and differential privacy function by returning noisy answers on aggregate questions. If, for example, I’m creating the super-duper-movie-recommender app, my app (as you point out) needs specific users’ lists of movies in order to function, so I don’t see an application of differential privacy there. I also don’t really see any need for anonymization at the point of the “get movies” question. It reminds me of signing up for an account at Amazon—I give them my address and while I certainly don’t want them to release it, they obviously need my exact address in order to provide the service.

    Perhaps after you have a database of users and their favorite movies, you could use an anonymization method. Let’s say my app has a very simple, naïve recommender algorithm, which takes my two favorite movies and provides a list of recommendations by taking the set of people who share my favorites, and then for each movie, counting the number of people in that set who also like that movie.

    Let’s say your two favorite movies are Star Wars and The Muppet Movie. Those are in my list too! So your recommendations would include all of my favorite movies with frequency at least 1 (depending on how many other people in the set liked that movie).

    Here’s where something like differential privacy could possibly come into play. Let’s say one of my favorites is Follow That Bird and that creates a unique combination so it would only have frequency 1. If we use differential privacy to add noise to these counts, the noisy frequencies mean that a frequency near 1 is indistinguishable from a frequency near 0, thus making it impossible to deduce whether or not Follow That Bird is actually a recommendation. On the other hand, if 100 people in your set also like Return of the Jedi, its noisy frequency will be near 100. You won’t be able to tell if it’s actually 100 people vs. 99 or 101, but you’ll know that a lot more people also like it as compared to the movies with much lower noisy answers.
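    Here’s a toy Python sketch of those noisy frequencies (all names and counts invented for illustration): with Laplace noise on each count, the frequency-100 movie still stands out, while the frequency-1 movie is indistinguishable from one with frequency 0:

```python
import random

def laplace_noise(scale, rng):
    # The difference of two Exp(1) draws is Laplace-distributed with the given scale.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

# Hypothetical users who share my two favorites (Star Wars, The Muppet Movie),
# and the other movies each of them likes.
cohort_likes = (
    [["Return of the Jedi"]] * 100  # 100 people in the set also like Jedi...
    + [["Follow That Bird"]]        # ...but only one person likes Follow That Bird.
)

rng = random.Random(11)
freq = {}
for likes in cohort_likes:
    for movie in likes:
        freq[movie] = freq.get(movie, 0) + 1

epsilon = 0.5
noisy_freq = {m: c + laplace_noise(1 / epsilon, rng) for m, c in freq.items()}
print(noisy_freq)
```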

    That’s a toy example to demonstrate differential privacy vs. ‘make this set private.’ You should also check out Graham Cormode’s work on “Anonymization and Uncertainty in Social Network Data”, which deals with introducing uncertainty into a graph and then allows for more ‘hands-on’ analysis than just returning noisy aggregates.
