Posts Tagged ‘open government’

Common Data Project looking for a partner organization to open up access to sensitive data.

Wednesday, June 30th, 2010

Looking for a partner...

The Common Data Project is looking for a partner organization to develop and test a pilot version of the datatrust: a technology platform for collecting, sharing and disclosing sensitive information that provides a new way to guarantee privacy.

Funders are increasingly interested in developing ways for nonprofit organizations to make more use of data and make their data more public. We would like to apply with a partner organization for a handful of promising funding opportunities.

We at CDP have developed technology and expertise that would enable a partner organization to:

  1. Collect sensitive data from members, donors and other stakeholders in a safe and responsible manner;
  2. Open data to the public to answer policy questions, be more transparent and accountable, and inform public discourse.

We are looking for an organization that is both passionate about its mission and deeply invested in the value of open data to provide us with a targeted issue to address.

We are especially interested in working with data that is currently inaccessible or locked down for privacy reasons.

We can imagine, in particular, a couple of different scenarios in which an organization could use the datatrust in interesting ways, but ultimately, we are looking to work out a specific scenario together.

  • A data exchange to share sensitive information between members.
  • An advocacy tool for soliciting private information from members so that organizational policy positions can be backed up with hard data.
  • A means of sharing sensitive data with allies without violating individual privacy.

If you’re interested in learning more about working with us, please contact Alex Selkirk at alex [dot] selkirk [at] commondataproject [dot] org.

In the mix — open data issues, bad econ stats, Facebook gaydar, and fraud detection in data

Friday, April 30th, 2010

1) It’s definitely become trendy for cities to open up their data, and I appreciated this article about Vancouver for its substantive points:

  • It’s important that data not only be open but be available in real time.  In all my conversations with people who work with data, though, whenever you have sensitive data, there’s going to be a significant time lag between when the data is collected and when it is “cleaned up” and made presentable for the public so as to avoid inadvertent disclosure.  This is why we think something like PINQ, a filter using differential privacy, could be revolutionary in making data available more quickly — it won’t need to be scrubbed for privacy reasons.
  • Licensing is an issue — although the city claims the data is public domain, there are terms of use that restrict use of the data by projects like OpenStreetMap.  The article discusses the possibility of using the Public Domain Dedication and License, which is a project of Open Data Commons.  Alex heard some interesting discussion on this issue from Jordan Hatcher at OKCon this past weekend.  This is a really fascinating issue, and I’m curious to see where else this gets picked up.
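As a rough illustration of the differential privacy idea mentioned in the first point above, here is a minimal Python sketch of a noisy count. It is not PINQ's actual API (PINQ is a C#/LINQ library). The function name `noisy_count`, the sample records, and the `epsilon` value are all assumptions made up for illustration. The technique shown is the standard Laplace mechanism: a count changes by at most 1 when one person is added or removed, so adding Laplace noise with scale 1/epsilon masks any single individual's presence.

```python
import random


def noisy_count(records, predicate, epsilon, rng=None):
    """Return a count of matching records with Laplace noise added.

    Hypothetical helper for illustration only, not part of PINQ.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon.
    """
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    scale = 1.0 / epsilon
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


# Made-up records: one row per person, ages 0 through 99.
people = [{"age": age} for age in range(100)]

# Ask how many people are 65 or older, without returning the exact count.
answer = noisy_count(people, lambda person: person["age"] >= 65, epsilon=0.5)
print(round(answer, 1))
```

Smaller values of `epsilon` mean more noise and stronger privacy. Because the protection comes from the noise rather than from manual scrubbing, the raw data would not need to be cleaned before questions could be asked of it.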

2) Existing economic statistics are riddled with problems.  I can’t say this enough — if existing ways of collecting and analyzing data are not quite good enough, we need to be open to new ones.

3) This is an old article, but highlights an issue Mimi and I have been thinking a lot about recently: How can data, even when shared according to your precise directions, reveal more than you intended? In this case, researchers found you could more or less determine the sexual orientation of people on Facebook based on their friends, even if they hadn’t indicated it themselves.  Privacy is definitely about control, yet how do you control something you don’t even know you’re revealing?

4) This past week, the Supreme Court heard a case involving the right to privacy of those who sign petitions to put initiatives on the ballot.  There is a lot going on in this case (gay rights, the experience of those in California who were targeted for supporting Prop 8, the difference between voting and legislating, etc.), but overall, it’s a perfect illustration of how complicated our understanding of public and private has gotten.  We leave those lists open to scrutiny so we can prevent fraud — people signing “Mickey Mouse” — but public when you can go look at the list at the clerk’s office and public when you can post information online for millions to see are two different things.  There may be reasons we want to make these names public other than to prevent fraud (Justice Scalia thinks so), but are there other ways fraud could be detected among signatories that would not require an open examination of all petition signers’ names?  Could modern technology help us detect odd patterns, fake names and more without revealing individual identities?

Can we reconcile the goals of increased government transparency and more individual privacy?

Tuesday, April 13th, 2010

I really appreciate the Sunlight Foundation’s continuing series on new data sets being made public by the federal government as part of the Open Government Directive.  Yesterday, I found out the Centers for Medicare & Medicaid Services will be releasing all kinds of new goodies.  As the Sunlight Foundation points out, the data so far is lacking granularity — comparisons of Medicare spending by state, rather than county.  But still all very exciting.

Yet not a single mention of privacy.  Even though, according to the blogger, the new claims database will include data for 5% of Medicare recipients.  After “strip[ping] all personal identification data out,” the database will “present it by service type (inpatient, outpatient, home health, prescription drug, etc.)” As privacy advocates have noted, that’s probably not going to do enough to anonymize it.

I don’t really mind not hearing about privacy every time someone talks about a database.  But it’s sort of funny.  Every day, I read a bunch of blogs on open data and government transparency, as well as a bunch of blogs on privacy issues.  But I rarely read about both issues in the same place.  Shouldn’t we all be talking to each other more?

In the mix

Tuesday, January 26th, 2010

Is a nonprofit structure better than a for-profit one for preserving mission? Or vice versa? (SocialEdge)

How did the Department of Interior determine the “high value” of national volunteer opportunities, recreation opportunities, wildland fires and acres burned, and herd data on wild horses and burros? (Sunlight Foundation Reporting)

Again, how did government agencies determine which data sets were “high value”? (CDT)

It’s so much harder to count flu cases than you would think (WSJ The Numbers Guy)

And apparently, more fun to count things in your life than you would think (via FlowingData)

Why do we need a datatrust? Part II

Monday, January 11th, 2010

In my first post on available public data sets, I described some of the limitations of Data.gov and the U.S. Census website.  There’s not as much as you’d like on Data.gov, and the Census site is shockingly tiresome to navigate.

Other government agencies, though, do things a little differently, albeit with varying degrees of success.

3.  The Internal Revenue Service: They take so much, yet give so little.

The IRS website, compared to the Census site, is very well organized and easy to follow.  Where the Census site feels like people have kept adding bits and pieces over the years, the IRS site feels like a cohesive whole.  A small link for “Tax Stats” on the home page takes you here, where the data is neatly categorized by type of taxpayer and tax form.  The IRS is statutorily required to provide statistics, but only the Office of Tax Analysis in the Secretary of the Treasury’s Office and the Congressional Joint Committee on Taxation are allowed to receive detailed tax return files.  Other agencies and individuals may only receive information in aggregate to protect privacy, which is also statutorily required.  The information it does provide is crunched by the Statistics of Income Program (SOI), which calculates statistics from 500,000 out of 200 million tax returns.

The IRS obviously has access to a wealth of information, and it’s published some interesting numbers.  One that I found particularly interesting was this table on adjusted gross income for the top 400 returns.

As you can see, the cut-off AGI for the top 400 returns has gone from $24,421,000 in 1992 to $86,380,000 in 2000.  Capital gains as a percentage of AGI has gone from 33% in 1992 to 64% in 2000.  The average tax rate has gone from 26.4% to 22.3%.  All very interesting, useful data from which one can draw a range of conclusions or start new research.
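To put those changes in perspective, here is a small Python sketch that works through the arithmetic. It uses only the figures quoted above and treats the dollar amounts as given, with no inflation adjustment (an assumption on my part, since the table's dollar basis isn't stated here). The variable names are just for illustration.

```python
# Figures quoted above for the top 400 returns.
cutoff_agi_1992 = 24_421_000
cutoff_agi_2000 = 86_380_000

# How many times larger the cut-off was in 2000 than in 1992.
growth_factor = cutoff_agi_2000 / cutoff_agi_1992

# Compound annual growth rate over the eight years between the two figures.
years = 2000 - 1992
annual_growth = growth_factor ** (1 / years) - 1

# Changes expressed in percentage points.
capital_gains_share_change = 64 - 33
average_tax_rate_change = 22.3 - 26.4

print(f"Cut-off AGI grew by a factor of {growth_factor:.2f}")
print(f"Compound annual growth: {annual_growth:.1%}")
print(f"Capital gains share of AGI: +{capital_gains_share_change} points")
print(f"Average tax rate: {average_tax_rate_change:+.1f} points")
```

In other words, the cut-off roughly three-and-a-half-folded while the average tax rate fell by about four percentage points.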

But there’s a lot we don’t know.

  • How has data changed from 2000 to now?
  • How might the returns correlate with specific changes in legislation?
  • How do the trends in the top 400 returns compare to the bottom 400?

Not to mention any other questions we might have of underlying microdata.  The SOI program is clearly doing a great deal of work calculating and packaging data to be “anonymous” for the public, but no one else gets to play with that data themselves, and data on something like the top 400 returns ends up being almost ten years old.  Tax policy is one of the most significant ways in which the U.S. government seeks to shape American society — why we have tax credits and deductions for mortgage interest payments for homeowners but nothing equivalent for renters.  Yet we, the public, don’t have access to data that would help us determine if the way we are being taxed is actually shaping our society in the ways we want.

4. Agency for Healthcare Research & Quality, Medical Expenditure Panel Survey (MEPS): Fascinating Data in a More Flexible Format

The Agency for Healthcare Research & Quality (AHRQ) collects precisely the kind of data we’re all struggling to understand as Congress proposes healthcare reform.  The Medical Expenditure Panel Survey collects data on the specific health services that Americans use, how frequently they use them, the cost of these services, and how they are paid for, as well as data on the cost, scope, and breadth of health insurance held by and available to U.S. workers.  The data AHRQ provides is much more flexible than IRS data, as you can use MEPSnet to create your own tables and statistics.

But that doesn’t mean you can ask a question like, “How much are single people aged 25-45 paying for health insurance in Miami?”  “How much is reasonable to pay for XYZ procedure in Minneapolis?”  I assume MEPSnet is useful for researchers who are skilled at working with data, but it’s not a real option for ordinary, interested individuals who are looking for some quantitative, data-driven answers to important questions.

MEPS also includes data that isn’t publicly released for reasons of confidentiality.  To access that data, you must be a qualified researcher and travel to a data center.

5.  EPA: Great Tools for Personalized Queries if You Don’t Need Personal Information

The EPA’s site, in many ways, is what I imagine a truly transparent, user-focused agency site could be like.  It has much more microdata available, and with much more consumer-oriented search possible. For example, MyEnvironment allows you to type in your zipcode and get a cross-section of many of their datasets all at once:

There are also some mechanisms for inputting data, such as reporting violations, which makes the EPA one of the few agencies I’ve seen where data doesn’t only flow in one direction.

But the reason so much data can be made available, in such searchable ways, is that the vast majority of the EPA’s microdata is not “personal.”  They’re measures of things like air quality and locations of regulated facilities.  They don’t have to worry about revealing personal tax information or personal medical expenditure information.  We’d love to see if similar data tools could be created for more sensitive data if better guarantees could be made around privacy than exist today.

Our dreams for data

So what would we love to see?

  • More “queryable” data—we’ll be able to ask the questions we want to ask, rather than accept the aggregates & statistics as presented.
  • More microdata available more quickly—we’ll get to analyze actual responses to surveys and not wait for the microdata to be “scrubbed” for privacy reasons.
  • More longitudinal data available—we’ll be able to do more studies of the same subjects over time and make more of it easily available to the public, rather than only in locked-up data centers.
  • More centralized, accessible data—we’ll be able to go to one place and be able to immediately see and have access to a lot of data.
  • More user-friendly data—we, as ordinary citizens, will be able to get data-specific answers to important, personal questions.

As I’ve stated previously, I don’t mean to pooh-pooh the data that these agencies and others have made available.  It takes a great deal of time, effort, and resources to make this kind of data available, especially if you have to clean it up (i.e., make it “private” for public consumption), which is why it’s such a big deal when a government, whether federal, state, or local, makes a real commitment to making data available.  We at the Common Data Project are working on a datatrust because we think certain technologies could reduce the costs of making data available by making privacy something more measurable and guaranteeable.

We may not be able to make all our data dreams come true immediately, but we definitely don’t want to let up on the push for better data.

Why do we need a datatrust? Isn’t there so much data out there already?

Tuesday, January 5th, 2010

In the past couple of years, and even more in the past couple of months, there’s been an explosion of data being made available online.  The Obama administration has announced a commitment to transparency with the Open Government Initiative, including Data.gov, a central clearinghouse for raw data sets made available by federal agencies.  Local governments, like New York City and Washington, D.C., are also putting data online and holding contests for best applications of that data.  There are easier ways to access data that’s always been publicly available, like Property Shark for real estate records and other sites for local information on everything from crime reports to restaurant health code violations.

So why do we need a “datatrust”?

Because the data isn’t actually so accessible.

Don’t get me wrong, there is certainly more data available than there ever has been before.  But if you actually sit down and look at some of the data sets online now, you’ll start to see that a great deal of work remains to be done.

Recently, I decided to do a survey of U.S. federal agency websites and the data they provide.  As an ordinary, interested citizen with reasonable research skills, this is what I found:

  • Often presented in a disorganized manner, so that it’s difficult to determine what’s available and where.
  • Largely available only as aggregates and statistics, which may or may not answer the questions we have.
  • When microdata/underlying data is available, only made available for researchers whose applications are approved, after registration, and/or after signing confidentiality agreements.
  • No easy query interface for non-researchers.

So let’s take a look at some specific sites.

1. Data.gov: Well-intentioned but incomplete.

Data.gov is supposed to be a centralized place for “raw,” downloadable federal government data.  But the number of datasets is uneven, as is participation among agencies.  Over 50% of the 809 data sets are from the Environmental Protection Agency (EPA).  This may be because there is someone super-enthusiastic about this project at the EPA, or because EPA data on issues like air quality is less personal and arguably less sensitive, but for whatever reason, those looking for EPA data are likely to be much happier than those looking for something else.

Data.gov does include some human-subject data, such as the American Time Use Survey (Labor), HHA Medicare Cost Report Data (Health and Human Services), Residential Energy Consumption (Energy), and Individuals Granted Asylum by Region and Country of Nationality (Homeland Security).  But it does not include such major microdata sets as the National Health and Nutrition Examination Survey (NHANES), the U.S. Census PUMS, and the Medical Expenditure Panel Survey (MEPS).  Another, older site is more comprehensive, but it isn’t focused on microdata and raw data sets.

Most of all, there is no easy way to query these datasets.  They’re intended to be available for developers and those who know how to write programs that can query XML, CSV, and Shapefile data, which is all well and good, but they’re not actually providing information to less skilled but interested citizens like myself.
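To give a sense of what “writing a program that can query CSV” involves, here is a minimal Python sketch. The catalog rows, column names, and function names below are all made up for illustration and do not reflect the actual layout of any government download. Even a question as simple as “how many datasets does each agency have?” requires something like this.

```python
import csv
import io

# Hypothetical catalog extract; real files would be downloaded, not inlined.
SAMPLE_CATALOG = """\
agency,title,format
EPA,Air Quality Readings,CSV
EPA,Regulated Facility Locations,Shapefile
EPA,Toxics Release Summary,XML
Labor,American Time Use Survey,CSV
Energy,Residential Energy Consumption,CSV
"""


def count_by_agency(csv_text):
    """Count how many catalog rows belong to each agency."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["agency"]] = counts.get(row["agency"], 0) + 1
    return counts


def titles_for(csv_text, agency):
    """List the dataset titles published by one agency."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["title"] for row in reader if row["agency"] == agency]


print(count_by_agency(SAMPLE_CATALOG))
print(titles_for(SAMPLE_CATALOG, "EPA"))
```

None of this is hard for a developer, but it is a real barrier for someone who just wants an answer.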

2.  U.S. Census: A LOT of Data, but Completely Disorganized

Let’s start with the home page, which looks like this.  A lot of words, and not much guidance to what means what.

Now, let’s say I’m curious about the demographics of my Brooklyn neighborhood.  I might decide to go to “People & Households,” which takes me here:

I’ll try “Data by Subject,” which takes me here:

It’s hard to know precisely which of these categories will take me to what I want, some basic demographic information on my neighborhood.  I tried clicking on Population Profile and Small Area Income and Poverty Estimates, which didn’t pan out.  “Community” sounds right, so I’ll click on “American Community Survey,” which takes me here.

If I click on Access Data, it gives me these choices:

And if I click on American FactFinder, I end up here:

Okay, I don’t really know what any of this means.  Thematic maps, reference maps, custom table???

But let’s say I’d started with “American FactFinder” on the home page, which is linked in the far left-hand column.  If I’d started there, I would have found this:

I can see there’s a little window at the top where I can get a Fact Sheet for my community.  Hmm, that seems easy!  Why didn’t I get here earlier? But let’s just click on “American Community Survey–Learn More” and see if that takes me back where I was before:

Ack, where am I?  Why is this different from the other ACS page?
If I go back and click on “Get Data” under American Community Survey, I would go back to the ACS page I first saw.

The organizing principle is not completely devoid of logic, but there are endless loops within loops of links on the Census site.  You can lose your way really quickly and find yourself unable to even retrace your steps.  The home page does have boxes on the right where you can enter a city/town, county or zip for “Population” and you can select a state for “QuickFacts,” but the box where you can enter a city/town, county or zip for community “Fact Sheets” is only found if you click on American FactFinder.  Why?

There was a part of me that hoped I was just stupid because I was inexperienced.  But my friends who use Census data regularly for work tell me they also have trouble finding what they need.  I’m sure there are reasons why you can’t just query all the data, but what are they?  And how should we deal with them?  Should we just put up with them or try to find a solution to make data more available?

In Part II of this post, I’ll analyze the data available from the IRS, the Agency for Healthcare Research & Quality, and the EPA.
