Posts Tagged ‘PINQ’

A big update for the Common Data Project

Tuesday, June 29th, 2010

There’s been a lot going on at the Common Data Project, and it can be hard to keep track.  Here’s a quick recap.

Our Mission

The Common Data Project’s mission is to encourage and enable the disclosure of personal data for public use and research.

We live in a world where data is obviously valuable — companies make millions from data, nonprofits seek new ways to be more accountable, advocates push governments to make their data open.  But even as more data becomes accessible, even more valuable data remains locked up and unavailable to researchers, nonprofit organizations, businesses, and the general public.

We are working on creating a datatrust, a nonprofit data bank, that would incorporate new technologies for open data and new standards for collecting and sharing personal data.

We’ve refined what that means, what the datatrust is and what the datatrust is not.

Our Work

We’ve been working in partnership with Shan Gao Ma (SGM), a consultancy started by CDP founder, Alex Selkirk, that specializes in large-scale data collection systems, to develop a prototype of the datatrust.  The datatrust is a new technology platform that allows the release of sensitive data in “raw form” to the public with a measurable and therefore enforceable privacy guarantee.

In addition to this real privacy guarantee, the datatrust eliminates the need to “scrub” data before it’s released.  Right now, any organization that wants to release sensitive data has to spend a lot of time scrubbing and de-identifying data, using techniques that are frankly inexact and possibly ineffective.  The datatrust, in other words, could make real-time data possible.

Furthermore, the data that is released can be accessed in flexible, creative ways.  Right now, sensitive data is aggregated and released as statistics.  A public health official may have access to data that shows how many people are “obese” in a county, but she can’t “ask” how many people are “obese” within a 10-mile radius of a McDonald’s.

We have a demo of PINQ

An illustration of how you can safely query a sensitive data set through differential privacy: a relatively new, quantitative approach to protecting privacy.

We’ve also developed an accompanying  privacy risk calculator.

To help us visualize the consequences of tweaking different levers in differential privacy.

For CDP, improved privacy technology is only one part of the datatrust concept.

We’ve also been working on a number of organizational and policy issues:

A Quantifiable Privacy Guarantee: We are working through how differential privacy can actually yield a “measurable privacy guarantee” that is meaningful to the layman. (Thus far, it has been only a theoretical possibility. A specific “quantity” for the so-called “measurable privacy guarantee” has yet to be agreed upon by the research community.)

Building Community and Self-Governance: We’re wrapping up a blog series looking at online information-sharing communities and self-governance structures and how lessons learned from the past few years of experimentation in user-generated and user-monitored content can apply to a data-sharing community built around a datatrust.

We’ve also started outlining the governance questions we have to answer as we move forward, including who builds the technology, who governs the datatrust, and how we will monitor and prevent the datatrust from veering from its mission.  We know that this is an organization that must be transparent if it is to be trusted, and we are working on creating the kind of infrastructure that will make transparency inevitable.

Licensing Personal Information: We proposed a “Creative Commons” style license for sharing personal data and we’re following the work of others developing licenses for data. In particular, what does it mean to “give up” personal information to a third-party?

Privacy Policies: We published a guide to reading online privacy policies for the curious layman: An analysis of their pitfalls and ambiguities which was re-published up by the IAPP and picked up by the popular technology blog, Read Write Web.

We’ve also started researching the issues we need to address to develop our own privacy policy.  In particular, we’ve been working on figuring out how we will deal with government requests for information.  We did some research into existing privacy law, both constitutional and statutory, but in many ways, we’ve found more questions than answers.  We’re interested in watching the progress of the Digital Due Process coalition as they work on reforming the Electronic Communications Privacy Act, but we anticipate that the datatrust will have to deal with issues that are more complex than an individual’s expectation of privacy in emails more than 180 days old.

Education: We regularly publish in-depth essays and news commentary on our blog: covering topics such as: the risk of re-identification with current methods of anonymization and the value of open datasets that are available for creative reuse.

We have a lot to work on, but we’re excited to move forward!

Recap and Proposal: 95/5, The Statistically Insignificant Privacy Guarantee

Wednesday, May 26th, 2010

Image from: xkcd.

In our search for a privacy guarantee that is both measurable and meaningful to the general public, we’ve traveled a long way in and out of the nuances of PINQ and differential privacy: A relatively new, quantitative approach to protecting privacy. Here’s a short summary of where we’ve been followed by a proposal built around the notion of statistical significance for where we might want to go.

The “Differential Privacy” Privacy Guarantee

Differential privacy guarantees that no matter what questions are asked and how answers to those questions are crossed with outside data, your individual record will remain “almost indiscernible” in a data set protected by differential privacy. (The corollary to that is that the impact of your individual record on the answers given out by differential privacy will be “negligeable.”)

For a “quantitative” approach to protecting privacy, the differential privacy guarantee is remarkably NOT quantitative.

So I began by proposing the idea that the probability of a single record being present in a data set should equal the probability of that single record not being present in that data set (50/50).

I introduced the idea of worst-case scenario where a nosy neighbor asks a pointed question that essentially reduces to a “Yes or no? Is my neighbor in this data set?” sort of question and I proposed that the nosy neighbor should get an equivocal (50/50) answer: “Maybe yes, but then again, (equally) maybe no.”

(In other words, “almost indiscernible” is hard to quantify. But completely indiscernible is easy to quantify.)

We took this 50/50 definition and tried to bring it to bear on the reality of how differential privacy applies noise to “real answers” to produce identity-obfuscating “noisy aswers.”

I quickly discovered that no matter what, differential privacy’s noisy answers always imply that one answer is more likely than another.

My latest post was a last gasp explaining why there really is no way to deliver on the completely invisible, completely non-discernible 50/50 privacy guarantee (even if we abandoned Laplace).

(But I haven’t given up on quantifying the privacy guarantee.)

Now we’re looking at statistical significance as a way to draw a quantitative boundary around a differential privacy guarantee.

Below is a proposal that we’re looking for feedback on. We’re also curious to know if anyone else tried to come up with a way to quantify the differential privacy guarantee?

What is Statistical Significance? Is it appropriate for our privacy guarantee?

In statistics, a result is called statistically significant if it is unlikely to have occurred by chance. Applied to our privacy guarantee, you might ask the question this way: When you get an answer about a protected data set, are the implications of that “differentially private” answer (as in implications about what the “real answer” might be) significant or are they simply the product of chance?

Is this an appropriate way to define a quantifiable privacy guarantee, we’re not sure.

Thought Experiment: Tossing a Weighted Coin

You have a coin. You know that one side is heavier than the other side. You have only 1 chance to spin the coin and draw a conclusion about which side is heavier.

At what weight distribution split does the result of that 1 coin spin start to be statistically significant?

Well, if you take the “conventional” definition of statistical significance where results start to be statistically significant when you have less than a 5% chance of being wrong, the boundary in our weighted coin example would be 95/5 where 95% of the weight is on one side of the coin and 5% is on the other.

What does this have to do with differential privacy?

Mapped onto differential privacy, the weight distribution split is the moral equivalent of the probability split between two possible “real answers.”

The 1 coin toss is the moral equivalent of being able to ask 1 question of the data set.

With a sample size of 1 question, the probability split between two possible, adjacent “real answers” would need to be at least 95/5 before the result of that 1 question was statistically significant.

That in turn means that at 95/5, the presence or absence of a single individual’s record in a data set won’t have a statistically significant impact on the noisy answer given out through differential privacy.

(Still 95% certainty doesn’t sound very good.)

Postscript Obviously, we don’t want to be a situation where asking just 1 question of a data set brings it to the brink of violating the privacy guarantee. However, thinking in terms of 1 question is helpful way to figure out the “total” amount of privacy risk the system can tolerate. And since the whole point of differential privacy is that it offers a quantitative way to track privacy risk, we can take that “total” amount and divide it by the number of questions we want to be able to dole out per data set and arrive at a per-question risk threshold.

Really? 50/50 privacy guarantee is truly impossible?

Monday, May 24th, 2010

At the end of my last post, we came to the rather sad conclusion that as far as differential privacy is concerned, it is not possible to offer a 50/50, “you might as well not be in the data set” privacy guarantee because, well, the Laplace distribution curves used to apply identity-obfuscating noise in differential privacy are too…curvy.

No matter how much noise you add, answers you get out of differential privacy will always imply that one number is more likely to be the “real answer” than another. (Which as we know from our “nosy-neighbor-worst-case-scenario,” can translate into revealing the presence of an individual in a data set: The very thing differential privacy is supposed to protect against.)

Still, “50/50 is impossible” is predicated on the nature of the Laplace curves. What would happen if we got rid of them? Are there any viable alternatives?

Apparently, no. 50/50 truly is impossible.

There are a few ways to understand why and how.

The first is a mental sleight of hand. A 50/50 guarantee is impossible because that would mean that the presence of an individual’s data literally has ZERO impact on the answers given out by PINQ, which would effectively cancel out differential privacy’s ability to provide more or less accurate answers.

Back to our worst-case scenario, in a 50/50 world, a PINQ answer of 3.7 would not only equally imply that the real answer was 0 as that it was 1, it would also equally imply that the real answer was 8, as that it was 18K or 18MM. Differential privacy answers would effectively be completely meaningless.

Graphically speaking, to get 50/50, the currently pointy noise distribution curves would have to be perfectly horizontal, stretching out to infinity in both directions on the number line.

What about a bounded flat curve?

(If pressed, this is probably the way most people would understand what is meant when someone says an answer has a noise level or margin of error of +/-50.)

Well, if you were to apply noise with a rectangular curve, in our worst-case scenario, with +/-50 noise, there would be a 1 in 100 chance that you get an answer that definitively tells you the real answer.

If the real answer is 0, with a rectangular noise level +/- 50 would yield answers from -50 to +50.

If the real answer is 1, a rectangular noise level +/-50 would yield answers from -49 to +51.

If you get a PINQ answer of 37, you’re set. It’s equally likely that the answer is 0 as that the answer is 1. 50/50 achieved.

If you get a PINQ answer of 51, well you’ll know for sure that the real answer is 1, not 0. And there’s a 1 in a 100 chance that you’ll get an answer of 51.

Meaning there’s a 1% chance that in the worst-case scenario you’ll get 100% “smoking gun” confirmation of that someone is definitely present in a data set.

As it turns out, rectangular curves are a lot dumber than those pointy Laplace things because they don’t have asymptotes to plant a nagging seed of doubt. In PINQ, all noise distribution curves have an asymptote of zero (as in zero likelihood of being chosen as a noisy answer).

In plain English, that means that every number on the real number line has a chance (no matter how tiny) of being chosen as a noisy answer, no matter what the “real answer” is. In other words, there are no “smoking guns.”

So now we’re back to where we left off in our last post, trying to pick an arbitrary arbitrary probability split for our privacy guarantee.

Or maybe not. Could statistical significance come and save the day?

Could we quantify our privacy guarantee by saying that the presence or absence of a single record will not affect the answers we give out to a statistically significant degree?

Update: PINQ Demo Revisited

Tuesday, May 4th, 2010

Here’s Take Two on our PINQ “Differential Privacy In Action” Demo.

Along with a general paring down of the visual interface, we’ve refined how you interact with the application as well as tried to visualize how PINQ is applying noise to each answer.

  • The demo app is no longer modal. Meaning, you don’t have to click a button to switch between zooming in and out of the map, panning around the map and drawing boxes to define query areas. All of this functionality is accessible from the keyboard.
  • You no longer draw boxes to define query areas. Instead, clicking “Ask a Question” plops a box on the map that you can move and resize with the mouse.
  • Additionally, the corresponding PINQ answers update in real-time as you move and resize the query boxes.
  • New thumbnail graphics next to each answer reflect how PINQ generates noisy answers and provide a more immediate sense of the “scale of noise” being applied. (A more detailed explanation of these pointy curves is forthcoming.)

The demo has proven enormously helpful as an aid in explaining our work and our goals. We continue to improve it every time we make use of it, so stay tuned for more to come!

Live Demo:


In the mix — open data issues, bad econ stats, Facebook gaydar, and fraud detection in data

Friday, April 30th, 2010

1) It’s definitely become trendy for cities to open up their data, and I appreciated this article about Vancouver for its substantive points:

  • It’s important that data not only be open but be available in real time.  In all my conversations with people who work with data, though, whenever you have sensitive data, there’s going to be a significant time lag between when the data is collected and when it is “cleaned up” and made presentable for the public so as to avoid inadvertent disclosure.  This is why we think something like PINQ, a filter using differential privacy, could be revolutionary in making data available more quickly — it won’t need to be scrubbed for privacy reasons.
  • Licensing is an issue — although the city claims the data is public domain, there are terms of use that restrict use of the data by things like OpenStreetMaps.  It discusses the possibility of using the Public Domain Dedication and License, which is a project of Open Data Commons.  Alex heard some interesting discussion on this issue from Jordan Hatcher at the OkCon this past weekend.  This is a really fascinating issue, and I’m curious to see where else this gets picked up.

2) Existing economic statistics are riddled with problems.  I can’t say this enough — if existing ways of collecting and analyzing data are not quite good enough, we need to be open to new ones.

3) This is an old article, but highlights an issue Mimi and I have been thinking a lot about recently: How can data, even when shared according to your precise directions, reveal more than you intended? In this case, researchers found you could more or less determine the sexual orientation of people on Facebook based on their friends, even if they hadn’t indicated it themselves.  Privacy is definitely about control, yet how do you control something you don’t even know you’re revealing?

4) This past week, the Supreme Court heard a case involving the right to privacy of those who sign petitions to put initiatives on the ballot.  There is a lot of stuff going on in this case, gay rights, the experience of those in California who were targeted for supporting Prop 8, the difference between voting and legislating, etc., but overall, it’s a perfect illustration of how complicated our understanding of public and private has gotten.  We leave those lists open to scrutiny so we can prevent fraud — people signing “Mickey Mouse” — but public when you can go look at the list at the clerks’ office and public when you can post information online for millions to see are two different things.  There may be reasons we want to make these names public other than to prevent fraud (Justice Scalia thinks so), but are there other ways fraud could be detected among signatories that would not require an open examination of all petition signers’ names?  Could modern technology help us detect odd patterns, fake names and more without revealing individual identities?

Can differential privacy be as good as tossing a coin?

Tuesday, April 20th, 2010

At the end of my last post, I had reasoned my way to understanding how differential privacy is capable of doing a really good job of erasing almost all traces of an individual in a dataset, no matter how much “external information” you are armed with and no matter how pointed your questions are.

Now, I’m going to attempt to explain why we can’t quite clear the final hurdle to truly and completely eradicate an individual’s presence from a dataset.

  • If coins are actually weighted such that one side is just ever-so-slightly heavier than the other side.
  • And such a coin is spun by a platonically balanced machine.
  • And the coin falls with the head’s side facing up.
  • And I only get one “spin” to decide which side is heavier.
  • Probabilistically, (by an extremely slim margin) I’m better off claiming that the tail’s side is heavier.

Translate this slightly weighted coin toss example into the world of differential privacy and PINQ and we have an explanation for why complete non-discernibility is also non-possible.

I have a question. I know ahead of time that the only two valid answers are 0 and 1. PINQ gives me 1.7.

Probabilistically, I’m better off betting that 1 is the real answer.

In fact, PINQ doesn’t even have to give me an answer so close to the real answer. Even if I were to ask my question with a lot of noise, if PINQ says -10,000,000,374, then probabilistically, I’m still better off claiming that 0 is the real answer. (I’d be a gigantic fool for thinking I’ve actually gotten any real information out of PINQ to help me make my bet. But lacking any other additional information, I’d be an even gigantic-er fool to bet in the other direction, even if only by a virtually non-existent slim margin.)

The only answer that would give me absolutely zero “new information” about the “real answer” is 0.5 (where the two distribution curves for 0 and 1 intersect). An answer of 0.5 makes no implications about whether 0 or 1 is the “real answer.” Both are equally likely. 50/50 odds.

But most of the time…and I really mean most of the time, PINQ is going to give me an answer that implies either 0 or 1, no matter how much noise I add.

Does this matter? you ask.

It’s easy to argue that if PINQ gives out answers that imply the “real answer” over “the only other possible answer” by a margin of, say, 0.000001%, who could possibly accuse us of false advertising if we claimed to guarantee total non-discernibility of individual records?

(As it turns out, coin tosses aren’t really a 50/50 proposition. they’re actually more of 51/49 proposition. So perhaps the way you would answer the “Does it matter?” question depends on whether you’d be the kind of person to take “The Strategy of Coin Flipping” seriously.)

Nevertheless, a real problem arises when you try to actually draw a definitive line in the sand about when it’s no longer okay for us to claim total non-discernibility in our privacy guarantee.

If 50/50 odds are the ideal when it comes to true and complete non-discernibility, then is 49/51 still okay? 45/55? What about 33/66? That seems like too much. 33/66 means that if the only two possible answers are 0 and 1, PINQ is going to be twice as likely to give me an answer that implies 1 than as to give me answer that implies 0.

Yet still I wonder, does this really count as discernment?

Technically speaking, sure.

But what if discernment in the real world can really only happen over time with multiple tries?

If I ask a question and I get 4 as an answer. Rationally, I can know that a “real answer” of 1 is twice as likely to yield a PINQ answer of 4 as a “real answer” of 0. But I’m not sure if viewed through the lens of human psychology, that makes a whole lot of sense.

After all, there are those psychology studies that show that people need to see 3 options before they feel comfortable making a decision. Maybe it takes “best out of 3” for people to ever feel like they can “discern” any kind of pattern. (I know I’ve read this in multiple places, but Google is failing me right now.)

Here’s psychologist Dan Gilbert on how we evaluate numbers (including odds and value) based on context and repeated past experience.

These two threads on the difference between the probability of a coin landing heads n-times versus the probability of the next coin landing heads after it has already landed n-times further illustrates how context and experience cloud our judgement around probabilities.

If my instincts are correct, what does all this mean for our poor, beleaguered privacy guarantee?

Completely not there versus almost not there.

Wednesday, April 14th, 2010

Picture taken by Stephan Delange

In my last post where I tried to quantify the concept of “discernibility” I left off at the point where I said I was going to try out my “50/50” definition on the PINQ implementation of differential privacy.

It turned out to be a rather painful process. Both because I can be rather literal-minded in an unhelpful way at times and because it is plain hard to figure this stuff out.

To backtrack a bit, let’s first make some rather obvious statements to get a running start in preparation for wading through some truly non-obvious ones.

Crossing the discernibility line.

In the extreme case, we know that if there was no privacy protection whatsoever and the datatrust just gave out straight answers, then we would definitely cross the “discernibility line” and violate our privacy guarantee. So if we go back to my pirate friend again and ask, “How many people with skeletons in their closet wear an eye-patch and live in my building?” If you (my rather distinctive eye-patch wearing neighbor) exist in the data set, the answer will be 1. If you are not in the data set, the answer will be 0.

With no privacy protection, the presence or absence of your record in the data set makes a huge difference to the answers I get and are therefore extremely discernible.

Thankfully, PINQ doesn’t give straight answers. It adds “noise” to answers to obfuscate them.

Now when I ask, “How many people in this data set of people with skeletons in their closet wear an eye-patch and live in my building?” PINQ counts the number of people who meet these criteria and then decides to either “remove” some of those people or “add” some “fake” people to give me a “noisy” answer to my question.

How it chooses to do so is governed by a distribution curve developed and named for the French marquis Pierre-Simon La Place. (I don’t know why it has to be this particular curve, but I am curious to learn why.)

You can see the curve illustrated below in two distinct postures that illustrate very little privacy protection and quite a lot of privacy protection, respectively.

  • The point of the curve is centered on the “real answer.”
  • The width of the curve shows the range of possible “noisy answers” PINQ will choose from.
  • The height of the curve shows the relative probability of one noisy answer being chosen over another noisy answer.

A quiet curve with few “fake” answers for PINQ to choose from:

A noisy curve with many “fake” answers for PINQ to choose from:

More noise equals less discernibility.

It’s easy to wave your hands around and see in your mind’s eye how if you randomly add and remove people from “real answers” to questions, as you turn up the amount of noise you’re adding, the presence or absence of a particular record becomes increasingly irrelevant and therefore increasingly indiscernible. This in turn means that it will also be increasingly difficult to confidently isolate and identify a particular individual in the data set precisely because you can’t really ever get a “straight” answer out of PINQ that is accurate down to the individual.

With differential privacy, I can’t ever know that my eye-patch wearing neighbor has a skeleton in his closet. I can only conclude that he might or might not be in the dataset to varying degrees of certainty depending on how much noise is applied to the “real answer.”

Below, you can see how if you get a noisy answer of 2, it is about 7x more likely that the “real answer” is 1, than that the “real answer” is 0. A flatter, more noisy curve would yield a substantially smaller margin.

But wait a minute, we started out saying that our privacy guarantee, guarantees that individuals will be completely non-discernible. Is non-discernible the same thing as hardly discernible?

Clearly not.

Is complete indiscernibility even possible with differential privacy?

Apparently not…

On the question of “Discernibility”

Tuesday, April 13th, 2010

Where's Waldo?Where’s Waldo?

In my last post about PINQ and meaningful privacy guarantees, we defined “privacy guarantee” as a guarantee that the presence or absence of a single record will not be discernible.

Sounds reasonable enough, until you ask yourself, what exactly do we mean by “discernible”? And by “exactly”, I mean, “quantitatively” what do we mean by “discernible”? After all, differential privacy’s central value proposition is that it’s going to bring quantifiable, accountable math to bear on privacy, an area of policy that heretofore has been largely preoccupied with placing limitations on collecting and storing data or fine-print legalese and bald-faced marketing.

However, PINQ (a Microsoft Research implementation of differential privacy we’ve been working with) doesn’t have a built-in mathematical definition of “discernible” either. A human being (aka one of us) has to do that.

A human endeavors to come up with a machine definition of discernibility.

At our symposium last Fall, we talked about using a legal-ish framework for addressing this very issue of discernibility: Reasonable Suspicion, Probable Cause, Preponderence of Evidence, Clear and Convincing Evidence, Beyond a Reasonable Doubt.

Even if we decided to use such a framework, we would still need to figure out how these legal concepts translate into something quantifiable that PINQ can work with.

“Not Discernible” means seeing 50/50.

My initial reaction when I first starting thinking about this problem was that clearly, discernibility or lack thereof needed to revolve around some concept of 50/50, as in “odds of,” “chances are.”

Whatever answer you got out of PINQ, you should never get even a hint of an idea that any one number was more likely to be the real answer than the numbers to either of side of that number. (In other words, x and x+/-1 should be equally likely candidates for “real answerhood.”)

Testing discernibility with a “Worst-Case Scenario”

I ask a rather “pointed” question about my neighbor, one that essentially amounts to “Is so-and-so in this data set? Yes or no?” without actually naming names (or social security numbers, email addresses, cell phone numbers or any other unique identifiers). e.g. “How many people in this data set of ‘people with skeletons in their closet’ wear an eye-patch and live in my building?” Ideally, I should walk away with an answer that says,

“You know what, your guess is as good as mine, it is just as likely that the answer is 0, as it is that the answer is 1.”

In such a situation, I would be comfortable saying that I have received ZERO ADDITIONAL INFORMATION on the question of a certain eye-patched individual in my building and whether or not he has skeletons in his closets. I may as well have tossed a coin. My pirate neighbor is truly invisible in the dataset, if indeed he’s in there at all.

Armed with this idea, I set out to understand how this might be implemented with differential privacy...

Would PINQ solve the problems with the Census data?

Friday, February 5th, 2010

Frank McSherry, the researcher behind PINQ, has responded to our earlier blog post about the problems found in certain Census datasets and how PINQ might deal with those problems.

Would PINQ solve the problems with the Census data?

No.  But it might help in the future.

The immediate problem facing the Census Bureau is that they want to release a small sample of raw data, a Public Use Microdata Sample or PUMS, about 1/20 of the larger dataset they use for their own aggregates, that is supposed to be a statistical sample of the general population.  To release that data, the Bureau has to protect the confidentiality of people in the PUMS, and they do so, in part, by manipulating the data.  Some of their efforts, though, seem to have altered the data so seriously that it no longer accurately reflects the general population.

PINQ would not solve the immediate problem of allowing the Census Bureau to release a 1/20 sample of their data.  PINQ only allows researchers to query for aggregates.

However, if Census data were released behind PINQ, the Bureau would not have to swap or synthesize data to protect privacy; PINQ would do that.  Presumably, if the danger of violating confidentiality were removed, the Census could release more than 1/20 sample of the data. Furthermore, unlike the Bureau’s disclosure avoidance procedures, PINQ is transparent in describing the range of noise that is being added.  Currently, the Bureau can’t even tell you what it did to protect privacy without potentially violating it.

The mechanism for accessing data through PINQ, of course, would be very different than what researchers are used to today.  Now, with raw data, researchers like to “look at the data” and “fit a line to the data.”  A lot of these things can be approximated with PINQ, but most researchers reflexively pull back when asked to rethink how they approach data.  There are almost certainly research objectives that cannot be met with PINQ alone.  But the objectives that can be met should not be held back by the unavailability of high quality statistical information. Researchers able to express how and why their analyses respect privacy should be rewarded with good data, incentivizing creative rethinking of research processes.

With this research published, it may be easier to argue that the choice between PUMS (and other microdata) and PINQ is not between raw data/noisy aggregates, but rather bad data/noisy aggregates. If and when it becomes a choice between these two, any serious scientist would reject bad data and accept noisy aggregates.

Did the NYTimes Netflix Data Graphic Reveal the Netflix Preferences of Individual Users?

Tuesday, January 12th, 2010

Slate has an interesting slant on the New York Times graphic everyone’s been raving about — the most popular Netflix movies by zip code all over the country.  It really is great and fun to play with, but as Slate points out, some of the zip codes with rather anomalous lists may be pointing to individual users.  For example, 11317 has this top-ten list:

  1. Wall-E
  2. Indiana Jones and the Temple of Doom
  3. Oz: Season 3: Disc 1
  4. Watchmen
  5. The Midnight Meat Train
  6. Man, Woman, and the Wall
  7. Traffic
  8. Romancing the Stone
  9. Crocodile Dundee 2
  10. Godzilla’s Revenge

11317 is the zip code for LaGuardia Airport, which doesn’t have any residents.  That means this list may very well represent the Netflix renting habit of a small group or even a single subscriber who has his or her DVDs mailed there.

Slate finds some other zip codes that may represent a single subscriber, but doesn’t point out the privacy problem here, despite the fact that Netflix is already in hot water about its data releases.

We’ve said a lot about what “anonymization” means and what a privacy guarantee should include, so I won’t say more here.  Instead, I just want to point out that the Slate article helps illustrate the problem PINQ is trying to avoid.  As Tony points out in his post, PINQ won’t give you answers that would be changed by the presence of a single record.  Of course, because PINQ gives aggregate answers, you wouldn’t be asking questions phrased exactly as, “What are the top ten most popular Netflix movies for 11317?”  But if you tried to ask, “How many people in 11317 had viewed “The Midnight Meat Train?”, it would add sufficient noise that you would never know that the single person using LaGuardia airport as an address had viewed it.

Get Adobe Flash player