Would PINQ solve the problems with the Census data?

February 5th, 2010 by Grace Meng

Frank McSherry, the researcher behind PINQ, has responded to our earlier blog post about the problems found in certain Census datasets and how PINQ might deal with those problems.

Would PINQ solve the problems with the Census data?

No.  But it might help in the future.

The immediate problem facing the Census Bureau is that they want to release a small sample of raw data, a Public Use Microdata Sample or PUMS, about 1/20 of the larger dataset they use for their own aggregates, that is supposed to be a statistical sample of the general population.  To release that data, the Bureau has to protect the confidentiality of people in the PUMS, and they do so, in part, by manipulating the data.  Some of their efforts, though, seem to have altered the data so seriously that it no longer accurately reflects the general population.

PINQ would not solve the immediate problem of allowing the Census Bureau to release a 1/20 sample of their data.  PINQ only allows researchers to query for aggregates.

However, if Census data were released behind PINQ, the Bureau would not have to swap or synthesize data to protect privacy; PINQ would do that.  Presumably, if the danger of violating confidentiality were removed, the Census could release more than a 1/20 sample of the data. Furthermore, unlike the Bureau’s disclosure avoidance procedures, PINQ is transparent in describing the range of noise that is being added.  Currently, the Bureau can’t even tell you what it did to protect privacy without potentially violating it.

The mechanism for accessing data through PINQ, of course, would be very different than what researchers are used to today.  Now, with raw data, researchers like to “look at the data” and “fit a line to the data.”  A lot of these things can be approximated with PINQ, but most researchers reflexively pull back when asked to rethink how they approach data.  There are almost certainly research objectives that cannot be met with PINQ alone.  But the objectives that can be met should not be held back by the unavailability of high quality statistical information. Researchers able to express how and why their analyses respect privacy should be rewarded with good data, incentivizing creative rethinking of research processes.

With this research published, it may be easier to argue that the choice between PUMS (and other microdata) and PINQ is not a choice between raw data and noisy aggregates, but rather between bad data and noisy aggregates. If and when it becomes a choice between these two, any serious scientist would reject bad data and accept noisy aggregates.

Is Public the new Private?

February 3rd, 2010 by Mimi Yin

Publicy (Publi[c] + [Priva]cy)
When the public, not the private, is the default.

In a world where so much more is out in the public, will people just stop worrying about privacy completely? Maybe in another five years, people simply won’t care if their names and addresses come up when someone searches for “people who have STDs” or “people who are 40 year-old virgins.”

40 Year-Old Virgin

For some of us, that’s hard to believe.

But I wonder if even for the people who are “most public” about their lives, the end of all privacy is equally scary. After all, aren’t Twitter, MySpace and Facebook simply opportunities for all of us to craft public personas we want others to see? Which implicitly includes controlling what people don’t see.

At the end of the day, the line between public and private has to do with control. Just because we’re now all sharing volumes more than we used to, doesn’t mean that we’re any more willing to share the skeletons in our closets.

Can we trust Census data?

February 3rd, 2010 by Grace Meng

Yesterday, the Freakonomics blog at the New York Times reported that a group of researchers had discovered serious errors in PUMS (public-use microdata samples) files released by the U.S. Census Bureau.  When compared to aggregate data released by the Census, the PUMS files revealed up to 15% discrepancies for the 65-and-older population.  As Justin Wolfers explains, PUMS files are small samples of the much larger, confidential data used by the Census for the general statistics it releases. These samples are crucial to researchers and policymakers looking to measure trends that the Census itself has not calculated.

When I read this, the first thought I had was, “Hallelujah!”  Not because I felt gleeful about the Census Bureau’s mistakes, but because this little post in the New York Times articulated something we’ve been trying to communicate for a while: current methods of data collection (and especially data release) are not perfect.

People love throwing around statistics, and increasingly people love debunking statistics, but that kind of scrutiny is normally directed at surveys conducted by people who are not statisticians.  Most people generally hear words like “statistical sampling” and “disclosure avoidance procedure” and assume that those people surely know what they’re doing.

But you don’t have to have training in statistics to read this paper and understand what happened. The Census Bureau, unlike many organizations and businesses that claim to “anonymize” datasets, knows that individual identities cannot be kept confidential simply by removing “identifiers” like name and address, which is why they use techniques like “data swapping” and “synthetic data.” It doesn’t take a mathematician to understand that when you’re making up data, you might have trouble maintaining the accuracy of the overall microdata sample.

To the Bureau’s credit, it does acknowledge where inaccuracies exist.  But as the researchers found, the Bureau is unwilling to correct its mistakes because doing so could reveal how they altered the data in the first place and thus compromise someone’s identity.  Which gets to the heart of the problem:

Newer techniques, such as swapping or blanking, retain detail and provide better protection of respondents’ confidentiality. However, the effects of the new techniques are less transparent to data users and mistakes can easily be overlooked.

The problems with current methods of data collection aren’t limited to the Census PUMS files either.  The weaknesses outlined by this former employee could apply to so many organizations.

This is why we have to work on new ways to collect, analyze, and release sensitive data.

In the mix: Your unique(ish) browser fingerprint…and…No $$ for privacy.

January 29th, 2010 by Grace Meng

1) EFF’s Panopticlick project lets you see how much your browser reveals and whether that might potentially “identify” you, based on their calculation of how identifiable a set of bits might be.

Can someone with a better grasp of math than I have explain to me how their information theory works? Right now, they have let’s say 10,000 people who’ve contributed their browser info. Bruce Schneier found out he was unique in 120,000. But if millions of people tested their browsers, would his configuration really be that unique? (Lots of skepticism in the comments to Schneier’s post, too.)
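
For what it’s worth, here is a minimal sketch of the arithmetic as I understand it. The usual information-theory measure is surprisal: a trait shared by 1 in N browsers carries log2(N) bits of identifying information. The function name and the 7 billion world population figure are my own assumptions for illustration, not numbers taken from EFF.

```python
import math


def surprisal_bits(one_in_n: float) -> float:
    """Bits of identifying information in a trait shared by 1 in `one_in_n` browsers."""
    return math.log2(one_in_n)


# Schneier's reported result: unique among roughly 120,000 tested browsers.
bits_schneier = surprisal_bits(120_000)

# Assumption: about 7 billion people in the world (a round, hypothetical figure).
bits_needed_world = surprisal_bits(7_000_000_000)

print(f"1 in 120,000   -> {bits_schneier:.1f} bits")
print(f"1 in 7 billion -> {bits_needed_world:.1f} bits")
```

Run as written, this gives roughly 16.9 bits for 1 in 120,000 and a little under 33 bits for 1 in 7 billion. A sample of 120,000 visitors can only ever show that a browser is at least that rare, so the skeptics have a point: whether the configuration would stay unique among millions is something only a larger sample can answer.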

2) New initiative by advertising groups to reveal that they are tracking information — a small “i” icon:

What a quote: “‘This is not the full solution, but this moves the ball forward,’ he said.”

Well, that’s the understatement of the century. Full solution to what? The advertising industry keeping regulators off their backs? Helping users understand how targeted advertising finds them? Really, neither is the real problem. Regulators should be focusing on establishing industry guidelines for how service providers and 3rd party advertising partners store and share data.

3) Should government data be in more user-friendly formats than XML?

Or should we leave usability to disinterested 3rd parties? If the government starts releasing user-friendly data, will that simply open the door for agencies to “spin” their data to make themselves look good? Actually, right now, how do we really know the data that’s being released hasn’t been “edited” in some way? Who’s vetting these releases and what’s the process?

4) Ten years and no one is really making any money off of “privacy”?

Perhaps no one has successfully “sold” privacy (as its own thing) because we haven’t yet agreed on what a “privacy product” would look like. As Mimi says, “If someone was selling something that would guarantee that I would never get any SPAM (mail or email) for the rest of my life, I would totally sign up for that.” But that might not equal “privacy” for someone else.

Yay, it’s Data Privacy Day!

January 28th, 2010 by Grace Meng

As sponsored by, among others, Google, Microsoft, Lexis-Nexis, and AT&T.

Lexis-Nexis, for those of you who are not lawyers and journalists, is an amazing tool for doing research on court decisions, regulations, statutes, and other legal matters.  It is also a great way to investigate people, comb through property records, and more!  In a way, though, the information it stores is pretty private, at least to the extent that it’s so expensive to access, it’s not available to the vast majority of people.  Which makes me wonder, how much is Lexis-Nexis worried that its product is becoming less valuable because more and more of its information is available elsewhere for free?

Which leads me to the crux of the problem.  Privacy, a word for which very few people can agree on a definition, is nevertheless a real issue these days.  But the reason it’s become such a pressing concern isn’t only because surveillance technology has gotten better or more pervasive.  It’s also because more information is available everywhere.  Re-identification from supposedly anonymized databases wouldn’t be so easy if other data sources, like DMV records, weren’t so readily available.  In addition, the Internet is teeming with information we want to provide ourselves, through Facebook, PatientsLikeMe, Mint.com, which we do not just because we’re exhibitionists, but because we get value from sharing that information and seeing what others have shared as well.

We want privacy.  We want information.  How are we going to reconcile these two very legitimate desires?  Will there be trade-offs?  Can we really have it all?

We’re definitely not in the camp of “We’ll never have privacy, let’s throw out the data!”, nor the camp of “Privacy’s gone anyway.”  So yes, we do think we can have a lot, if not “all.”  And to do that, we need to move beyond talking about privacy and information in the abstract.  We need to look at specific areas — like electronic health records, campaign finance, government transparency — and be concrete about what we lose and what we gain with every decision we make.

Data Privacy Day may be “an international celebration of the dignity of the individual expressed through personal information,” but let’s be honest.  Dealing with these questions will be interesting, but it isn’t going to be a party.

In the mix

January 26th, 2010 by Grace Meng

Is a nonprofit structure better than a for-profit one for preserving mission? Or vice versa? (SocialEdge)

How did the Department of Interior determine the “high value” of national volunteer opportunities, recreation opportunities, wildland fires and acres burned, and herd data on wild horses and burros? (Sunlight Foundation Reporting)

Again, how did government agencies determine which data sets were “high value”? (CDT)

It’s so much harder to count flu cases than you would think (WSJ The Numbers Guy)

And apparently, more fun to count things in your life than you would think (via FlowingData)

In the mix

January 19th, 2010 by Grace Meng

Unboxed — A Data Explosion Is Remaking Retailing (NYTimes)

Microsoft Cuts Bing IP Address Storage to 6 Months (CNET)

Starbucks Receipts Used for NYC Calorie Study (NYTimes)

Did the NYTimes Netflix Data Graphic Reveal the Netflix Preferences of Individual Users?

January 12th, 2010 by Grace Meng

Slate has an interesting slant on the New York Times graphic everyone’s been raving about — the most popular Netflix movies by zip code all over the country.  It really is great and fun to play with, but as Slate points out, some of the zip codes with rather anomalous lists may be pointing to individual users.  For example, 11317 has this top-ten list:

  1. Wall-E
  2. Indiana Jones and the Temple of Doom
  3. Oz: Season 3: Disc 1
  4. Watchmen
  5. The Midnight Meat Train
  6. Man, Woman, and the Wall
  7. Traffic
  8. Romancing the Stone
  9. Crocodile Dundee 2
  10. Godzilla’s Revenge

11317 is the zip code for LaGuardia Airport, which doesn’t have any residents.  That means this list may very well represent the Netflix renting habit of a small group or even a single subscriber who has his or her DVDs mailed there.

Slate finds some other zip codes that may represent a single subscriber, but doesn’t point out the privacy problem here, despite the fact that Netflix is already in hot water about its data releases.

We’ve said a lot about what “anonymization” means and what a privacy guarantee should include, so I won’t say more here.  Instead, I just want to point out that the Slate article helps illustrate the problem PINQ is trying to avoid.  As Tony points out in his post, PINQ won’t give you answers that would be changed by the presence of a single record.  Of course, because PINQ gives aggregate answers, you wouldn’t be asking questions phrased exactly as, “What are the top ten most popular Netflix movies for 11317?”  But if you tried to ask, “How many people in 11317 had viewed “The Midnight Meat Train?”, it would add sufficient noise that you would never know that the single person using LaGuardia airport as an address had viewed it.

Why do we need a datatrust? Part II

January 11th, 2010 by Grace Meng

In my first post on available public data sets, I described some of the limitations of Data.gov and the U.S. Census website.  There’s not as much as you’d like on Data.gov, and the Census site is shockingly tiresome to navigate.

Other government agencies, though, do things a little differently, albeit with varying degrees of success.

3.  The Internal Revenue Service: They take so much, yet give so little.

The IRS website, compared to the Census site, is very well organized and easy to follow.  Where the Census site feels like people have kept adding bits and pieces over the years, the IRS site feels like a cohesive whole.  A small link for “Tax Stats” on the home page takes you here, where the data is neatly categorized by type of taxpayer and tax form. The IRS is statutorily required to provide statistics, but only the Office of Tax Analysis in the Secretary of the Treasury’s Office and the Congressional Joint Committee on Taxation are allowed to receive detailed tax return files.  Other agencies and individuals may only receive information in aggregate to protect privacy, which is also statutorily required.   The information it does provide is crunched by the Statistics of Income Program (SOI), which calculates statistics from 500,000 out of 200 million tax returns.

The IRS obviously has access to a wealth of information, and it’s published some interesting numbers.  One that I found particularly interesting was this table on adjusted gross income for the top 400 returns.

As you can see, the cut-off AGI for the top 400 returns has gone from $24,421,000 in 1992 to $86,380,000 in 2000.  (Click on the image for a larger version.)  Capital gains as a percentage of AGI has gone from 33% in 1992 to 64% in 2000.  The average tax rate has gone from 26.4% to 22.3%.  All very interesting, useful data from which one can draw a range of conclusions or start new research.

But there’s a lot we don’t know.

  • How has data changed from 2000 to now?
  • How might the returns correlate with specific changes in legislation?
  • How do the trends in the top 400 returns compare to the bottom 400?

Not to mention any other questions we might have of the underlying microdata.  The SOI program is clearly doing a great deal of work calculating and packaging data to be “anonymous” for the public, but no one else gets to play with that data themselves, and data on something like the top 400 returns ends up being almost ten years old.  Tax policy is one of the most significant ways in which the U.S. government seeks to shape American society; it is why we have tax credits and deductions for mortgage interest payments for homeowners but nothing equivalent for renters.  Yet we, the public, don’t have access to data that would help us determine if the way we are being taxed is actually shaping our society in the ways we want.

4. Agency for Healthcare Research & Quality, Medical Expenditure Panel Survey (MEPS): Fascinating Data in a More Flexible Format

The Agency for Healthcare Research & Quality (AHRQ) collects precisely the kind of data we’re all struggling to understand as Congress proposes healthcare reform.  The Medical Expenditure Panel Survey collects data on the specific health services that Americans use, how frequently they use them, the cost of these services, and how they are paid for, as well as data on the cost, scope, and breadth of health insurance held by and available to U.S. workers.  The data AHRQ provides is much more flexible than IRS data, as you can use MEPSnet to create your own tables and statistics.

But that doesn’t mean you can ask a question like, “How much are single people aged 25-45 paying for health insurance in Miami?”  “How much is reasonable to pay for XYZ procedure in Minneapolis?”  I assume MEPSnet is useful for researchers who are skilled at working with data, but it’s not a real option for ordinary, interested individuals who are looking for some quantitative, data-driven answers to important questions.

MEPS also includes data that isn’t publicly released for reasons of confidentiality.  To access that data, you must be a qualified researcher and travel to a data center.

5.  EPA: Great Tools for Personalized Queries if You Don’t Need Personal Information

The EPA’s site, in many ways, is what I imagine a truly transparent, user-focused agency site could be like.  It has much more microdata available, and with much more consumer-oriented search possible. For example, MyEnvironment allows you to type in your zipcode and get a cross-section of many of their datasets all at once:

There are also some mechanisms for inputting data, such as reporting violations, which makes the EPA one of the few agencies I’ve seen where data doesn’t only flow in one direction.

But the reason so much data can be made available, in such searchable ways, is that the vast majority of the EPA’s microdata is not “personal.”  They’re measures of things like air quality and locations of regulated facilities.  They don’t have to worry about revealing personal tax information or personal medical expenditure information.  We’d love to see if similar data tools could be created for more sensitive data if better guarantees could be made around privacy than exist today.

Our dreams for data

So what would we love to see?

  • More “queryable” data—we’ll be able to ask the questions we want to ask, rather than accept the aggregates & statistics as presented.
  • More microdata available more quickly—we’ll get to analyze actual responses to surveys and not wait for the microdata to be “scrubbed” for privacy reasons.
  • More longitudinal data available—we’ll be able to do more studies of the same subjects over time and make more of it easily available to the public, rather than only in locked-up data centers.
  • More centralized, accessible data—we’ll be able to go to one place and be able to immediately see and have access to a lot of data.
  • More user-friendly data—we, as ordinary citizens, will be able to get data-specific answers to important, personal questions.

As I’ve stated previously, I don’t mean to poo-poo the data that these agencies and others have made available.  It takes a great deal of time, effort, and resources to make this kind of data available, especially if you have to clean it up (i.e., make it “private” for public consumption), which is why it’s such a big deal when a government, whether federal, state, or local, makes a real commitment to making data available. We at the Common Data Project are working on a datatrust because we think certain technologies could reduce the costs of making data available by making privacy something more measurable and guaranteeable.

We may not be able to make all our data dreams come true immediately, but we definitely don’t want to let up on the push for better data.

PINQ Privacy Demo

January 7th, 2010 by Tony Gibbon

Editor’s Note: Tony Gibbon is developing a datatrust demo as an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Tony’s work, like Grant’s, could have interesting implications for CDP’s mission, as it would use technologies that could enable more disclosure of personal data for public re-use.  We’re happy to have him guest blogging about the demo here.

Back in August, Alex wrote about the PINQ privacy technology and noted that we would be trying to figure out what role it could play in the datatrust.  The goal was to build a demo of PINQ in action and get a better understanding of PINQ and its challenges and quirks in the process.  We settled on a quick-and-dirty interactive demo to try to demonstrate the answers to the following.

What does PINQ bring to the table?

Before we look at the benefits of PINQ, let’s first take a look at the shortcomings of one of the ways data is often released with an example taken from the CDC website.

This probably isn’t the best example of a compelling dataset, but it is a good example of the lack of flexibility of many datasets that are available—namely that the data is pre-bucketed and there is a limit to how far you are able to drill down on the data.

On one hand, the limitation makes sense:  If the CDC allowed you (or your prospective insurance company) to view disease information at street level, the potential consequences would be quite frightening.  On the other hand, they are also potentially limiting the value of the data.  For example, each county is not necessarily homogeneous.  Depending on the dataset, a researcher may legitimately wish to drill down without wanting to invade anyone’s privacy, for example to compare urban vs. suburban incidence.

This is where PINQ shines—it works in both these cases.  PINQ allows you to execute an arbitrary aggregate query (meaning I can ask how many people are wearing pink, but I can’t ask PINQ to list the names of people wearing pink) while still protecting privacy.

Let’s turn to the demo.  (Note: the data points in the demo were generated randomly and do not actually indicate people or residences, much less anything about their health.)  The quickest, most visual arbitrary query we came up with is drawing a rectangle on a map and counting each data point that falls inside, so we placed hundreds of “sick” people on a map to let users count them.  (Keep in mind that the arbitrariness of a PINQ query need not be limited to location on a map.  It could be numerical like age, textual like name, include multiple fields etc.)
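
To make the rectangle query concrete, here is a minimal sketch of the exact count that sits underneath it, before any privacy noise is added. This is not the demo’s code: the `Rect` class, the `count_in_rect` function, and the coordinates are all hypothetical names and values made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """A map rectangle in (longitude, latitude) degrees. Hypothetical helper."""
    west: float
    south: float
    east: float
    north: float

    def contains(self, lon: float, lat: float) -> bool:
        return self.west <= lon <= self.east and self.south <= lat <= self.north


def count_in_rect(points, rect):
    """Exact number of data points inside the rectangle (no noise yet)."""
    return sum(1 for lon, lat in points if rect.contains(lon, lat))


# Made-up data points standing in for the randomly generated "sick" people.
points = [(-122.33, 47.61), (-122.35, 47.62), (-122.20, 47.61), (-122.10, 47.70)]
seattle = Rect(west=-122.40, south=47.55, east=-122.25, north=47.70)

exact = count_in_rect(points, seattle)
print(exact)
```

The same shape works for any other filter (age, name, several fields at once): the query is a predicate plus a count. What PINQ changes is that the analyst never sees `exact` directly, only a noisy version of it.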

Now let’s attempt to answer the researcher’s question.  Is there a higher incidence of this mysterious disease in urban or suburban areas?  For the sake of simplicity, we’ll pretend he’s particularly interested in two similarly populated, conveniently rectangular areas: one in Seattle and the other in a nearby suburb as shown below:

An arbitrary query such as this one is clearly not possible with pre-bucketed data such as the CDC’s diabetes-by-county map.  Let’s take a look at what PINQ spits out.

We get an “answer” and a likely range.  (The likely range is actually an input to the query, but that’s a topic for another post.)  So what does this mean? Are there really 311.3 people in Seattle with the mysterious disease?  Why are there partial people?

PINQ adds a random amount of noise to each answer, which prevents us from being able to measure the impact of a single record in the dataset.  The PINQ answer indicates that about 311 people (plus or minus noise) in Seattle have the disease.  The noise, though randomly generated, is likely to fall within a particular range, in this case 30.  So the actual number is likely to be within 30 of 311, while the actual number of those in the nearby suburb with the disease is likely to be within 30 of 177.
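
Here is a rough sketch of how a noisy count with a “likely range” could be produced. It rests on assumptions that are mine, not statements about the demo’s internals: the noise is Laplace-distributed (the usual choice for differentially private counts), “likely range” means the noise falls inside it 95% of the time, and the true count of 300 is a made-up number.

```python
import math
import random


def laplace_scale_for_range(likely_range: float, coverage: float = 0.95) -> float:
    """Laplace scale b such that P(|noise| <= likely_range) == coverage.

    For Laplace noise, P(|noise| <= t) = 1 - exp(-t / b).
    """
    return likely_range / math.log(1.0 / (1.0 - coverage))


def noisy_count(true_count: float, likely_range: float, rng: random.Random,
                coverage: float = 0.95) -> float:
    """Return the true count plus Laplace noise calibrated to the likely range."""
    b = laplace_scale_for_range(likely_range, coverage)
    # The difference of two independent exponentials with mean b is Laplace(0, b).
    noise = rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
    return true_count + noise


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [noisy_count(300, 30, rng) for _ in range(20_000)]
within = sum(abs(s - 300) <= 30 for s in samples) / len(samples)
print(f"share of answers within 30 of the truth: {within:.3f}")
```

Under those assumptions, a likely range of 30 corresponds to a Laplace scale of about 10, and roughly 1 answer in 20 lands outside the range. The fractional answers, the “partial people,” fall out naturally because the noise is a real number rather than a whole one.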

Given these numbers (and ignoring the oversimplification and silliness of his question), the researcher could conclude that the incidence in the urban area is higher than the suburban area.  As a bonus, since this is a demo and no one’s privacy is at stake, we can look at the actual data and real numbers:

The answers from PINQ were in fact pretty close to the real answer.  We got a little unlucky with the Seattle answer as the actual random noise for that query was slightly greater than the likely range, but our conclusion was the same as if we had been given the real data.

But what about the evil insurance company/ employer/ neighbor?

By now, you’re hopefully starting to see the potential value of allowing people to execute arbitrary queries rather than relying on pre-bucketed data, but what about the potential harm?  Let’s imagine there’s a high correlation between having this disease and having high medical costs.  While you might want your data included in this dataset so it could be studied by someone researching a cure, you probably don’t want it used to discriminate against you.

To examine this further, let’s zoom in and ask about the disease at my house.  PINQ only allows questions with aggregate answers, so instead of asking “does Tony have the disease?” we’ll ask, “how many people at Tony’s house have the disease?”

You’ll notice that, unlike the CDC map, PINQ doesn’t try to stop me from asking this potentially harmful, privacy-infringing question.  (I don’t actually live there.)  PINQ doesn’t care if the actual answer is big or small, or if I ask about a large or small area; it just adds enough noise to ensure the presence or absence of a single record (in this case, a person) doesn’t have an effect on your answers.

PINQ’s answer was “about 2.4, with likely noise within +/- 5.”  (I dialed down the likely noise to +/- 5 for this example.)  As with all PINQ answers, we have to interpret this answer in the context of my initial question: “Does Tony have the disease?”  Since the noise added is likely to be between -5 and 5, and a real count can be neither negative nor fractional, the real answer is likely to be between 0 and 7, inclusive, and we can’t draw any strong conclusions about my health because the noise overwhelms the real answer.
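
The “between 0 and 7” arithmetic can be spelled out in a few lines. This is only an illustration of the reasoning; `plausible_counts` is a hypothetical helper, not part of PINQ.

```python
import math


def plausible_counts(answer: float, likely_range: float) -> tuple[int, int]:
    """Whole-number counts consistent with a noisy answer and its likely range.

    A real count cannot be negative or fractional, so the interval
    [answer - range, answer + range] is rounded inward and clamped at zero.
    """
    low = max(0, math.ceil(answer - likely_range))
    high = max(low, math.floor(answer + likely_range))
    return low, high


low, high = plausible_counts(2.4, 5)
print(low, high)
```

For the answer of 2.4 with a likely range of 5, the raw interval is -2.6 to 7.4, which leaves the whole numbers 0 through 7. Since 0 is still in that set, the answer cannot tell us whether anyone at the address is sick.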

Another way of looking at this is that we get similarly inconclusive answers when we try to attack the privacy of both the infected and the healthy.  Below I’ve made the diseased areas visible on the map and we can compare the results of querying me and my neighbor, only one of whom is infected:

Keep in mind that my address may not be in the dataset because I’m healthy or because I chose not to submit my information.  In either case, the noise causes the answer at my house to be indistinguishable from the answer at my neighbor’s address, and our decisions to be included or excluded from the dataset do not affect our privacy.  Of equal importance from the first example, the addition of this privacy preserving noise does not preclude the extraction of potentially useful answers from the dataset.
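
One way to put a number on “indistinguishable” is to ask how much more likely a given noisy answer is when the true count is 1 than when it is 0. The sketch below again assumes Laplace noise calibrated so that +/- 5 covers 95% of the noise; both the calibration and the observed answer of 2.4 are used purely for illustration.

```python
import math


def laplace_pdf(x: float, mean: float, scale: float) -> float:
    """Density of a Laplace distribution centered on `mean`."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)


# Assumption: "likely noise within +/- 5" means 95% of the noise falls in that range.
scale = 5 / math.log(20)

observed = 2.4
# Likelihood of seeing 2.4 if one person at the address is sick vs. if nobody is.
ratio = laplace_pdf(observed, 1, scale) / laplace_pdf(observed, 0, scale)

# With Laplace noise, changing one record can shift the likelihood of any
# answer by at most this factor, no matter what answer comes back.
bound = math.exp(1 / scale)

print(f"likelihood ratio: {ratio:.2f}, worst case: {bound:.2f}")
```

Under these assumptions the ratio never exceeds about 1.8, whatever answer comes back. That is weak evidence either way, which is why the answers at my house and at my neighbor’s look alike to an attacker.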

You can play with the demo here (requires Silverlight).
