Posts Tagged ‘census’

In the mix…nonprofit technology failures; not counting religion; medical privacy after death; and the business of open data

Friday, August 20th, 2010

1) Impressive nonprofit transparency around technology failures. It might seem odd for us to highlight technology failures when we’re hoping to make CDP and its technology useful to nonprofits, but the transparency demonstrated by these nonprofits talking openly about their mistakes is precisely the kind of transparency we hope to support.  If nonprofits, or any other organization, is going to share more of their data with the public, they have to be willing to share the bad with the good, all in the hope of actually doing better.

2) I was really surprised to find out the U.S. Census doesn’t ask about religion.  It’s a sensitive subject, but is it really more sensitive than race and ethnicity, which the U.S. Census asks about quite openly?  The article goes through why having a better count of different religions could be useful to a lot of people. What are other things we’re afraid to count, and how might that be holding us back from important knowledge?

3) How long should we protect people’s privacy around their medical history? HHS proposes to remove protections that prevent researchers and archivists from accessing medical records for people who have been dead for 50 years; CDT thinks this is a bad idea.  Is there a way that this information can be made available without revealing individual identity?  That’s the essential problem the datatrust is trying to solve.

4) It may be counterintuitive, but open data can foster industry and business. Clay Johnson, formerly at the Sunlight Foundation, writes about how weather data, collected by the U.S. government, became open data, thereby creating a whole new industry around weather prediction.  As he points out, though, that $1.5 billion industry is now not that excited by the National Weather Service expanding into providing data directly to citizens.

We at CDP have been talking about how the datatrust might change the business of data.  We think that it could enable all kinds of new business and new services, but it will likely change how data is bought and sold.  Already, the business of buying and selling data has changed so much in the past 10 years.  Exciting years ahead.

Would PINQ solve the problems with the Census data?

Friday, February 5th, 2010

Frank McSherry, the researcher behind PINQ, has responded to our earlier blog post about the problems found in certain Census datasets and how PINQ might deal with those problems.

Would PINQ solve the problems with the Census data?

No.  But it might help in the future.

The immediate problem facing the Census Bureau is that they want to release a small sample of raw data, a Public Use Microdata Sample or PUMS, about 1/20 of the larger dataset they use for their own aggregates, that is supposed to be a statistical sample of the general population.  To release that data, the Bureau has to protect the confidentiality of people in the PUMS, and they do so, in part, by manipulating the data.  Some of their efforts, though, seem to have altered the data so seriously that it no longer accurately reflects the general population.

PINQ would not solve the immediate problem of allowing the Census Bureau to release a 1/20 sample of their data.  PINQ only allows researchers to query for aggregates.

However, if Census data were released behind PINQ, the Bureau would not have to swap or synthesize data to protect privacy; PINQ would do that.  Presumably, if the danger of violating confidentiality were removed, the Census could release more than 1/20 sample of the data. Furthermore, unlike the Bureau’s disclosure avoidance procedures, PINQ is transparent in describing the range of noise that is being added.  Currently, the Bureau can’t even tell you what it did to protect privacy without potentially violating it.

The mechanism for accessing data through PINQ, of course, would be very different than what researchers are used to today.  Now, with raw data, researchers like to “look at the data” and “fit a line to the data.”  A lot of these things can be approximated with PINQ, but most researchers reflexively pull back when asked to rethink how they approach data.  There are almost certainly research objectives that cannot be met with PINQ alone.  But the objectives that can be met should not be held back by the unavailability of high quality statistical information. Researchers able to express how and why their analyses respect privacy should be rewarded with good data, incentivizing creative rethinking of research processes.

With this research published, it may be easier to argue that the choice between PUMS (and other microdata) and PINQ is not between raw data/noisy aggregates, but rather bad data/noisy aggregates. If and when it becomes a choice between these two, any serious scientist would reject bad data and accept noisy aggregates.

Can we trust Census data?

Wednesday, February 3rd, 2010

Yesterday, the Freakanomics blog at the New York Times reported that a group of researchers had discovered serious errors in PUMS (public-use microdata samples) files released by the U.S. Census Bureau.  When compared to aggregate data released by the Census, the PUMS files revealed up to 15% discrepancies for the 65-and-older population.  As Justin Wolfers explains, PUMS files are small samples of the much larger, confidential data used by the Census for the general statistics it releases. These samples are crucial to researchers and policymakers looking to measure trends that the Census itself has not calculated.

When I read this, the first thought I had was, “Hallelujah!”  Not because I felt gleeful about the Census Bureau’s mistakes, but because this little post in the New York Times articulated something we’ve been trying to communicate for awhile: current methods of data collection (and especially data release) are not perfect.

People love throwing around statistics, and increasingly people love debunking statistics, but that kind of scrutiny is normally directed at surveys conducted by people who are not statisticians.  Most people generally hear words like “statistical sampling” and “disclosure avoidance procedure” and assume that those people surely know what they’re doing.

But you don’t have to have training in statistics to read this paper and understand what happened. The Census Bureau, unlike many organizations and businesses that claim to “anonymize” datasets, knows that individual identities cannot be kept confidential simply by removing “identifiers” like name and address, which is why they use techniques like “data swapping” and “synthetic data.” It doesn’t take a mathematician to understand that when you’re making up data, you might have trouble maintaining the accuracy of the overall microdata sample.

To the Bureau’s credit, it does acknowledge where inaccuracies exist.  But as the researchers found, the Bureau is unwilling to correct its mistakes because doing so could reveal how they altered the data in the first place and thus compromise someone’s identity.  Which gets to the heart of the problem:

Newer techniques, such as swapping or blanking, retain detail and provide better protection of respondents’ confidentiality. However, the effects of the new techniques are less transparent to data users and mistakes can easily be overlooked.

The problems with current methods of data collection aren’t limited to the Census PUMS files either.  The weaknesses outlined by this former employee could apply to so many organizations.

This is why we have to work on new ways to collect, analyze, and release sensitive data.

Why do we need a datatrust? Isn’t there so much data out there already?

Tuesday, January 5th, 2010

In the past couple of years, and even more in the past couple of months, there’s been an explosion of data being made available online.  The Obama administration has announced a commitment to transparency with the Open Government Initiative, including, a central clearinghouse for raw data sets made available by federal agencies.  Local governments, like New York City and Washington, D.C., are also putting data online and holding contests for best applications of that data.  There are easier ways to access data that’s always been publicly available, like Property Shark for real estate records and for local information on everything from crime reports to restaurant health code violations.

So why do we need a “datatrust”?

Because the data isn’t actually so accessible.

Don’t get me wrong, there is certainly more data available than there ever has been before.  But if you actually sit down and look at some of the data sets online now, you’ll start to see that a great deal of work remains to be done.

Recently, I decided to do a survey of U.S. federal agency websites and the data they provide.  As an ordinary, interested citizen with reasonable research skills, this is what I found:

  • Often presented in a disorganized manner, so that it’s difficult to determine what’s available and where.
  • Largely available only as aggregates and statistics, which may or may not answer the questions we have.
  • When microdata/underlying data is available, only made available for researchers whose applications are approved, after registration, and/or after signing confidentiality agreements.
  • No easy query interface for non-researchers.

So let’s take a look at some specific sites.

1. Well-intentioned but incomplete. is supposed to be centralized place for “raw,” downloadable federal government data.  But there are an uneven number of datasets, as well as uneven participation among agencies.  Over 50% of the 809 data sets are from the Environmental Protection Agency (EPA).  This may be because there is someone super-enthusiastic about this project at the EPA, or because EPA data on issues like air quality is less personal and arguably less sensitive, but for whatever reason, those looking for EPA data are likely to be much happier than those looking for something else. does include some human-subject data, such as the American Time Use Survey (Labor), HHA Medicare Cost Report Data (Health and Human Services), Residential Energy Consumption (Energy), and Individuals Granted Asylum by Region and Country of Nationality (Homeland Security).  But it does not include such major microdata sets as Nat’l Health & Nutrition Survey (NHANES), U.S. Census PUMs, & Medical Expenditure Survey (MEPS)., an older site, is more comprehensive, but it isn’t focused on microdata and raw data sets.

Most of all, there is no easy way to query these datasets.  They’re intended to be available for developers and those who know how to write programs that can query XML, CSV, Shapefile databases, which is all well and good, but they’re not actually providing information to less skilled but interested citizens like myself.

2.  U.S. Census: A LOT of Data, but Completely Disorganized

Let’s start with the home page, which looks like this.  A lot of words, and not much guidance to what means what.

Now, let’s say I’m curious about the demographics of my Brooklyn neighborhood.  I might decide to go to “People & Households,” which takes me here:

I’ll try “Data by Subject,” which takes me here:

It’s hard to know precisely which of these categories will take me to what I want, some basic demographic information on my neighborhood.  I tried clicking on Population Profile and Small Area Income and Poverty Estimates, which didn’t pan out.  “Community” sounds right, so I’ll click on “American Community Survey,” which takes me here.

If I click on Access Data, it gives me these choices:

And if I click on American FactFinder, I end up here:

Okay, I don’t really know what any of this means.  Thematic maps, reference maps, custom table???

But let’s say I’d started with “American FactFinder” on the home page, which is linked in the far left-hand column.  If I’d started there, I would have found this:

I can see there’s a little window at the top where I can get a Fact Sheet for my community.  Hmm, that seems easy!  Why didn’t I get here earlier? But let’s just click on “American Community Survey–Learn More” and see if that takes me back where I was before:

Ack, where am I?  Why is this different from the other ACS page?
If I go back and click on “Get Data” under American Community Survey, I would go back to the ACS page I first saw

The organizing principle is not completely devoid of logic, but there are endless loops within loops of links on the Census site.  You can lose your way really quickly and find yourself unable to even retrace your steps.  The home page does have boxes on the right where you can enter a city/town, county or zip for “Population” and you can select a state for “QuickFacts,” but the box where you can enter a city/town, county or zip for community “Fact Sheets” is only found if you click on American FactFinder.  Why?

There was a part of me that hoped I was just stupid because I was inexperienced.  But my friends who use Census data regularly for work tell me they also have trouble finding what they need.  I’m sure there are reasons why you can’t just query all the data, but what are they?  And how should we deal with them?  Should we just put up with them or try to find a solution to make data more available?

In Part II of this post, I’ll analyze the data available from the IRS, the Agency for Healthcare Research & Quality, and the EPA.

The politics of being counted

Monday, July 13th, 2009

The 2010 U.S. Census has been in the news a lot lately.

A national association of Latino clergy recently announced a campaign to persuade a million of its members to boycott the 2010 U.S. Census.  They hope their boycott puts pressure on the federal government to pass legalization legislation, but they also claim that they don’t want federal money allocated and used to harass illegal immigrants.

Republican Representative Michele Bachmann of Minnesota also announced she and her family will be boycotting the census.  In her case, she’s refusing to answer questions because she thinks it’s outrageous that the government wants to know how long it takes you to get to work, and that ACORN, along with many other organizations and businesses, is involved in helping to carry out the 2010 census.  Also, she’s angry that the census doesn’t ask you if you’re a U.S. citizen.  (Which isn’t quite right.  It does ask you if you’re a U.S. citizen; it doesn’t ask you if you have legal status or not.)

In contrast, one group got their wish to be counted.  The Census Bureau recently announced the 2010 U.S. Census will release data on same-sex marriage.  Data on same-sex marriages has been collected for a long time, but the last administration interpreted the Defense of Marriage Act as prohibiting the release of that data.  The initial plans for the 2010 census were to “edit” the responses to recategorize same-sex marriages as “unmarried partners.”  In 1990, the bureau simply changed the gender of one person.  So the new policy means responses will be accepted as they are.

Some people want to be counted.  Some people don’t.

I’m firmly in the “count me” camp.

As Republican Representatives Patrick McHenry, Lynne Westmoreland and John Mica pointed out to Rep. Bachmann, refusing to respond to the Census is “illogical, illegal, and not in the best interest of our country.” The League of United Latin American citizens, in contrast to the Latino clergy group, is participating in a coalition of media, community groups, labor unions, and churches to urge participation in the census.  Clearly, being on the same side of the political spectrum or sharing a specific policy agenda doesn’t mean you’ll agree about the census.

It’s not that numbers are apolitical.  The Census determines how apportion federal funds and representatives for the House.  The LA Times article cites a study that argues the illegal population in California led to California gaining 3 seats in the House of Representatives, while Indiana, Mississippi, and Michigan to lose seats.  It’s all about power, which means it’s all about politics.

But political debates shouldn’t be about whether or not to be counted.  Debates should be about whether certain proposals will do what they claim, or even about whether the numbers are accurate.  To refuse to be counted altogether, when the numbers will determine so much?  It’s like refusing to vote.

Get Adobe Flash player