Yesterday, the Freakonomics blog at the New York Times reported that a group of researchers had discovered serious errors in PUMS (public-use microdata samples) files released by the U.S. Census Bureau. When compared to aggregate data released by the Census, the PUMS files showed discrepancies of up to 15% for the 65-and-older population. As Justin Wolfers explains, PUMS files are small samples of the much larger, confidential data used by the Census for the general statistics it releases. These samples are crucial to researchers and policymakers looking to measure trends that the Census itself has not calculated.
When I read this, the first thought I had was, “Hallelujah!” Not because I felt gleeful about the Census Bureau’s mistakes, but because this little post in the New York Times articulated something we’ve been trying to communicate for a while: current methods of data collection (and especially data release) are not perfect.
People love throwing around statistics, and increasingly people love debunking statistics, but that kind of scrutiny is normally directed at surveys conducted by people who are not statisticians. Most people hear words like “statistical sampling” and “disclosure avoidance procedure” and assume that the experts behind them surely know what they’re doing.
But you don’t have to have training in statistics to read this paper and understand what happened. The Census Bureau, unlike many organizations and businesses that claim to “anonymize” datasets, knows that individual identities cannot be kept confidential simply by removing “identifiers” like name and address, which is why it uses techniques like “data swapping” and “synthetic data.” It doesn’t take a mathematician to understand that when you’re making up data, you might have trouble maintaining the accuracy of the overall microdata sample.
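To make the trade-off concrete, here is a toy sketch of data swapping in Python. It is not the Census Bureau's actual procedure, which is not public; the population is made up, and the names `swap_ages` and `count_by` are hypothetical helpers invented for this illustration. The point is only that trading values between records can leave one-way totals untouched while quietly changing cross-tabulations, such as age by sex.

```python
import random
from collections import Counter


def swap_ages(records, fraction, seed=0):
    """Return a copy of `records` in which about `fraction` of them trade ages with a partner.

    Hypothetical helper for illustration only, not a real disclosure-avoidance routine.
    """
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]  # copy so the original records stay intact
    n_pairs = int(len(swapped) * fraction / 2)
    chosen = rng.sample(range(len(swapped)), 2 * n_pairs)
    # Pair up the chosen records and exchange their ages.
    for i, j in zip(chosen[::2], chosen[1::2]):
        swapped[i]["age"], swapped[j]["age"] = swapped[j]["age"], swapped[i]["age"]
    return swapped


def count_by(records, *keys):
    """Count records by the given fields, e.g. count_by(rows, "age", "sex")."""
    return Counter(tuple(r[k] for k in keys) for r in records)


# Made-up toy population in which women skew older than men.
rng = random.Random(42)
population = (
    [{"sex": "F", "age": rng.randint(60, 90)} for _ in range(500)]
    + [{"sex": "M", "age": rng.randint(50, 80)} for _ in range(500)]
)
released = swap_ages(population, fraction=0.2)

# Swapping only moves ages between records, so one-way age counts are preserved.
same_marginal = count_by(population, "age") == count_by(released, "age")
# Age-by-sex counts can differ, because an age may move from a woman to a man.
same_joint = count_by(population, "age", "sex") == count_by(released, "age", "sex")
print("age totals unchanged:", same_marginal)
print("age-by-sex totals unchanged:", same_joint)
```

A data user who only checks the overall age totals would see nothing wrong here. Any distortion shows up only when the released file is compared against a cross-tabulation computed from the original data, which is roughly the kind of comparison the researchers made against the Census aggregates.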
To the Bureau’s credit, it does acknowledge where inaccuracies exist. But as the researchers found, the Bureau is unwilling to correct its mistakes because doing so could reveal how it altered the data in the first place and thus compromise someone’s identity. Which gets to the heart of the problem:
“Newer techniques, such as swapping or blanking, retain detail and provide better protection of respondents’ confidentiality. However, the effects of the new techniques are less transparent to data users and mistakes can easily be overlooked.”
The problems with current methods of data collection aren’t limited to the Census PUMS files either. The weaknesses outlined by this former employee could apply to so many organizations.
This is why we have to work on new ways to collect, analyze, and release sensitive data.