In the past couple of years, and even more in the past couple of months, there’s been an explosion of data being made available online. The Obama administration has announced a commitment to transparency with the Open Government Initiative, including Data.gov, a central clearinghouse for raw data sets made available by federal agencies. Local governments, like New York City and Washington, D.C., are also putting data online and holding contests for best applications of that data. There are easier ways to access data that’s always been publicly available, like Property Shark for real estate records and Everyblock.com for local information on everything from crime reports to restaurant health code violations.
So why do we need a “datatrust”?
Because the data isn’t actually so accessible.
Don’t get me wrong, there is certainly more data available than there ever has been before. But if you actually sit down and look at some of the data sets online now, you’ll start to see that a great deal of work remains to be done.
Recently, I decided to survey U.S. federal agency websites and the data they provide. I approached them as an ordinary, interested citizen with reasonable research skills, and this is what I found about the data:
- Often presented in a disorganized manner, so that it’s difficult to determine what’s available and where.
- Largely available only as aggregates and statistics, which may or may not answer the questions we have.
- When microdata or other underlying data is available, restricted to researchers whose applications are approved, or released only after registration and/or a signed confidentiality agreement.
- No easy query interface for non-researchers.
So let’s take a look at some specific sites.
1. Data.gov: Well-Intentioned but Incomplete
Data.gov is supposed to be a centralized place for “raw,” downloadable federal government data. But the number of data sets varies widely from agency to agency, and participation is uneven. Over 50% of the 809 data sets are from the Environmental Protection Agency (EPA). This may be because there is someone super-enthusiastic about this project at the EPA, or because EPA data on issues like air quality is less personal and arguably less sensitive, but for whatever reason, those looking for EPA data are likely to be much happier than those looking for something else.
Data.gov does include some human-subject data, such as the American Time Use Survey (Labor), HHA Medicare Cost Report Data (Health and Human Services), Residential Energy Consumption (Energy), and Individuals Granted Asylum by Region and Country of Nationality (Homeland Security). But it does not include such major microdata sets as the National Health and Nutrition Examination Survey (NHANES), the U.S. Census Public Use Microdata Samples (PUMS), and the Medical Expenditure Panel Survey (MEPS). Fedstats.gov, an older site, is more comprehensive, but it isn’t focused on microdata and raw data sets.
Most of all, there is no easy way to query these data sets. They’re intended for developers and others who know how to write programs that can read XML, CSV, and Shapefile data, which is all well and good, but they’re not actually providing information to less skilled but interested citizens like myself.
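To give a sense of what “knowing how to write programs” means here, this is a minimal sketch of the kind of script it takes to answer even a simple question from a raw CSV file. Everything in it is an assumption for illustration: the sample rows, the column names, and the `count_by_agency` helper are all invented, and a real download would have its own layout.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for a downloaded catalog file. The column
# names and rows are invented for illustration, not taken from Data.gov.
SAMPLE_CSV = """title,agency,format
Air quality readings,Agency A,CSV
Water quality readings,Agency A,XML
Time use survey,Agency B,CSV
Energy consumption survey,Agency C,CSV
"""


def count_by_agency(csv_text):
    """Count how many rows (data sets) each agency contributes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["agency"] for row in reader)


counts = count_by_agency(SAMPLE_CSV)
total = sum(counts.values())

# Print each agency's share of the (hypothetical) catalog, largest first.
for agency, n in counts.most_common():
    print(f"{agency}: {n} of {total} ({n / total:.0%})")
```

None of this is hard for a programmer, but it is a long way from typing a question into a box, and that is the gap I mean.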
2. U.S. Census: A LOT of Data, but Completely Disorganized
Let’s start with the home page, which looks like this. A lot of words, and not much guidance as to what means what.
Now, let’s say I’m curious about the demographics of my Brooklyn neighborhood. I might decide to go to “People & Households,” which takes me here:
I’ll try “Data by Subject,” which takes me here:
It’s hard to know precisely which of these categories will take me to what I want: some basic demographic information on my neighborhood. I tried clicking on Population Profile and Small Area Income and Poverty Estimates, which didn’t pan out. “Community” sounds right, so I’ll click on “American Community Survey,” which takes me here.
If I click on Access Data, it gives me these choices:
And if I click on American FactFinder, I end up here:
Okay, I don’t really know what any of this means. Thematic maps, reference maps, custom table???
But let’s say I’d started with “American FactFinder” on the home page, which is linked in the far left-hand column. If I’d started there, I would have found this:
I can see there’s a little window at the top where I can get a Fact Sheet for my community. Hmm, that seems easy! Why didn’t I get here earlier? But let’s just click on “American Community Survey–Learn More” and see if that takes me back where I was before:
Ack, where am I? Why is this different from the other ACS page?
If I go back and click on “Get Data” under American Community Survey, I would go back to the ACS page I first saw.
The organizing principle is not completely devoid of logic, but there are endless loops within loops of links on the Census site. You can lose your way really quickly and find yourself unable to even retrace your steps. The home page does have boxes on the right where you can enter a city/town, county or zip for “Population” and you can select a state for “QuickFacts,” but the box where you can enter a city/town, county or zip for community “Fact Sheets” is only found if you click on American FactFinder. Why?
There was a part of me that hoped I was just stupid because I was inexperienced. But my friends who use Census data regularly for work tell me they also have trouble finding what they need. I’m sure there are reasons why you can’t just query all the data, but what are they? And how should we deal with them? Should we just put up with them or try to find a solution to make data more available?
In Part II of this post, I’ll analyze the data available from the IRS, the Agency for Healthcare Research & Quality, and the EPA.