In the mix…nonprofit technology failures; not counting religion; medical privacy after death; and the business of open data

Friday, August 20th, 2010

1) Impressive nonprofit transparency around technology failures. It might seem odd for us to highlight technology failures when we’re hoping to make CDP and its technology useful to nonprofits, but the transparency demonstrated by these nonprofits talking openly about their mistakes is precisely the kind of transparency we hope to support. If nonprofits, or any other organizations, are going to share more of their data with the public, they have to be willing to share the bad with the good, all in the hope of actually doing better.

2) I was really surprised to find out the U.S. Census doesn’t ask about religion.  It’s a sensitive subject, but is it really more sensitive than race and ethnicity, which the U.S. Census asks about quite openly?  The article goes through why having a better count of different religions could be useful to a lot of people. What are other things we’re afraid to count, and how might that be holding us back from important knowledge?

3) How long should we protect people’s privacy around their medical history? HHS proposes to remove protections that prevent researchers and archivists from accessing medical records for people who have been dead for 50 years; CDT thinks this is a bad idea.  Is there a way that this information can be made available without revealing individual identity?  That’s the essential problem the datatrust is trying to solve.

4) It may be counterintuitive, but open data can foster industry and business. Clay Johnson, formerly at the Sunlight Foundation, writes about how weather data, collected by the U.S. government, became open data, thereby creating a whole new industry around weather prediction.  As he points out, though, that $1.5 billion industry is now not that excited by the National Weather Service expanding into providing data directly to citizens.

We at CDP have been talking about how the datatrust might change the business of data.  We think that it could enable all kinds of new business and new services, but it will likely change how data is bought and sold.  Already, the business of buying and selling data has changed so much in the past 10 years.  Exciting years ahead.

Common Data Project looking for a partner organization to open up access to sensitive data.

Wednesday, June 30th, 2010

Looking for a partner...

The Common Data Project is looking for a partner organization to develop and test a pilot version of the datatrust: a technology platform for collecting, sharing and disclosing sensitive information that provides a new way to guarantee privacy.

Funders are increasingly interested in developing ways for nonprofit organizations to make more use of data and make their data more public. We would like to apply with a partner organization for a handful of promising funding opportunities.

We at CDP have developed technology and expertise that would enable a partner organization to:

  1. Collect sensitive data from members, donors and other stakeholders in a safe and responsible manner;
  2. Open data to the public to answer policy questions, be more transparent and accountable, and inform public discourse.

We are looking for an organization that is both passionate about its mission and deeply invested in the value of open data to provide us with a targeted issue to address.

We are especially interested in working with data that is currently inaccessible or locked down for privacy reasons.

We can imagine, in particular, a couple of different scenarios in which an organization could use the datatrust in interesting ways, but ultimately, we are looking to work out a specific scenario together.

  • A data exchange to share sensitive information between members.
  • An advocacy tool for soliciting private information from members so that organizational policy positions can be backed up with hard data.
  • A way to share sensitive data with allies in a way that doesn’t violate individual privacy.

If you’re interested in learning more about working with us, please contact Alex Selkirk at alex [dot] selkirk [at] commondataproject [dot] org.

In the mix…Linkedin v. Facebook, online identities, and diversity in online communities

Friday, May 14th, 2010

1) Is Linkedin better than Facebook with privacy? I’m not sure this is the right question to ask. I’m also not sure the measures Cline uses to evaluate “better privacy” get to the heart of the problem.  The existence of a privacy seal of approval, the level of detail in the privacy policy, the employment of certified privacy professionals … none of these factors address what users are struggling to understand, that is, what’s happening to their information.  73% of adult Facebook users think they only share content with friends, but only 42% have customized their privacy settings.

Ultimately, Linkedin and Facebook are apples to oranges. As Cline points out himself, people on Linkedin are in a purely professional setting. People who share information on Linkedin do so for a specific, limited purpose: to promote themselves professionally. In contrast, people on Facebook have to navigate being friends with parents, kids, co-workers, college buddies, and acquaintances. Every decision to share information is much more complicated: who will see it, what will they think, how will it reflect on the user? Facebook’s constant changes to how user information is shared make these decisions even more complicated. Who can keep track?

In this sense, Linkedin is definitely easier to use.  If privacy is about control, then Linkedin is definitely easier to control.  But does this mean something like Facebook, where people share in a more generally social context, will always be impossible to navigate?

2) Mark Zuckerberg thinks everyone should have a single identity (via Michael Zimmer).  Well, that would certainly be one way to deal with it.

3) But most people, even the “tell-all” generation, don’t really want to go there.

4) In a not unrelated vein, Sunlight Labs has a new app that allows you to link data on campaign donations to people who email you through Gmail. At least with regard to government transparency, Sunlight Labs seems to agree with Mark Zuckerberg. I think information about who I’ve donated money to should be public (go ahead, look me up), but it does unnerve me a little to think that I could email someone on Craigslist about renting an apartment and have this information just pop up. I don’t know, does the fact that it unnerves me mean that it’s wrong? Maybe not.

5) Finally, a last bit on the diversity of online communities: it may be more necessary than I claimed, though with a slightly different slant on diversity. A new study found that the healthiest communities are “diverse” in that new members are constantly being added. Although they were looking at chat rooms, which to me seems like the loosest form of community, the finding makes a lot of sense to me. A breast cancer survivors’ forum may not care whether they have a lot of men, but they do need to attract new participants to stay vibrant.

Can we reconcile the goals of increased government transparency and more individual privacy?

Tuesday, April 13th, 2010

I really appreciate the Sunlight Foundation’s continuing series on new data sets being made public by the federal government as part of the Open Government Directive. Yesterday, I found out the Centers for Medicare & Medicaid Services will be releasing all kinds of new goodies. As the Sunlight Foundation points out, the data so far is lacking granularity, offering comparisons of Medicare spending by state rather than by county. But still all very exciting.

Yet not a single mention of privacy.  Even though, according to the blogger, the new claims database will include data for 5% of Medicare recipients.  After “strip[ping] all personal identification data out,” the database will “present it by service type (inpatient, outpatient, home health, prescription drug, etc.)” As privacy advocates have noted, that’s probably not going to do enough to anonymize it.
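To make the worry concrete, here is a rough Python sketch of why stripping names may not be enough. Everything in it is an assumption for illustration: the records are made up, the field names are hypothetical, and the `unique_share` helper is mine, not anything from the actual claims database. It simply measures how many records become one-of-a-kind as more of the remaining fields are combined.

```python
from collections import Counter

# Entirely made-up records with names already stripped.
# The fields are hypothetical and do not reflect the real database layout.
records = [
    {"zip": "11201", "birth_year": 1938, "sex": "F", "service": "inpatient"},
    {"zip": "11201", "birth_year": 1938, "sex": "F", "service": "outpatient"},
    {"zip": "11201", "birth_year": 1941, "sex": "M", "service": "home health"},
    {"zip": "11215", "birth_year": 1938, "sex": "F", "service": "inpatient"},
    {"zip": "11215", "birth_year": 1938, "sex": "F", "service": "prescription drug"},
]


def unique_share(rows, fields):
    """Fraction of rows whose combination of `fields` appears exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


share_by_sex = unique_share(records, ["sex"])
share_by_demographics = unique_share(records, ["zip", "birth_year", "sex"])
share_with_service = unique_share(records, ["zip", "birth_year", "sex", "service"])

print(share_by_sex, share_by_demographics, share_with_service)
```

In this toy set, only one record in five is unique on demographics alone, but every record is unique once service type is added. Anyone who already knows those few facts about a person could pick out that person's row, even though no name appears anywhere.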

I don’t really mind not hearing about privacy every time someone talks about a database. But it’s sort of funny. Every day, I read a bunch of blogs on open data and government transparency, as well as a bunch of blogs on privacy issues. But I rarely read about both issues in the same place. Shouldn’t we all be talking to each other more?

In the mix: Your unique(ish) browser fingerprint…and…No $$ for privacy.

Friday, January 29th, 2010

1) EFF’s Panopticlick project lets you see how much your browser reveals and whether that might potentially “identify” you, based on their calculation of how identifiable a set of bits might be.

Can someone with a better grasp of math than I have explain to me how their information theory works? Right now, they have let’s say 10,000 people who’ve contributed their browser info. Bruce Schneier found out he was unique in 120,000. But if millions of people tested their browsers, would his configuration really be that unique? (Lots of skepticism in the comments to Schneier’s post, too.)
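For what it’s worth, here is my rough reading of the arithmetic, as a Python sketch and not EFF’s actual code. The idea, as I understand it, is that if one browser in N shares your configuration, seeing that configuration conveys log2(N) bits of identifying information. The function names are mine, and the larger population figure is a made-up number for illustration.

```python
import math


def surprisal_bits(one_in_n):
    """Bits of identifying information if one in `one_in_n` browsers matches."""
    return math.log2(one_in_n)


def expected_matches(population, one_in_n):
    """Expected number of browsers sharing a configuration of that rarity."""
    return population / one_in_n


# "Unique in 120,000" can demonstrate at most log2(120,000) bits.
schneier_bits = surprisal_bits(120_000)

# Bits needed to single out one browser in a hypothetical population of a billion.
needed_bits = surprisal_bits(1_000_000_000)

# If the true rate really were 1 in 120,000, this many of those browsers would match.
lookalikes = expected_matches(1_000_000_000, 120_000)

print(round(schneier_bits, 2), round(needed_bits, 2), round(lookalikes))
```

On this reading, a sample can never demonstrate more bits than log2 of its own size, so being unique among 120,000 testers shows roughly 17 bits, short of the roughly 30 needed to single out one browser in a billion. Whether a configuration stays unique as millions more people test depends on its true frequency, which a small sample cannot pin down.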

2) New initiative by advertising groups to reveal that they are tracking information — a small “i” icon:

What a quote: “‘This is not the full solution, but this moves the ball forward,’ he said.”

Well, that’s the understatement of the century. Full solution to what? The advertising industry keeping regulators off their backs? Helping users understand how targeted advertising finds them? Really, neither is the real problem. Regulators should be focusing on establishing industry guidelines for how service providers and 3rd party advertising partners store and share data.

3) Should government data be in more user-friendly formats than XML?

Or should we leave usability to disinterested 3rd parties? If the government starts releasing user-friendly data, will that simply open the door for agencies to “spin” their data to make themselves look good? Actually, right now, how do we really know the data that’s being released hasn’t been “edited” in some way? Who’s vetting these releases and what’s the process?

4) Ten years and no one is really making any money off of “privacy”?

Perhaps no one has successfully “sold” privacy (as its own thing) because we haven’t yet agreed on what a “privacy product” would look like. As Mimi says, “If someone was selling something that would guarantee that I would never get any SPAM (mail or email) for the rest of my life, I would totally sign up for that.” But that might not equal “privacy” for someone else.

In the mix

Tuesday, January 26th, 2010

Is a nonprofit structure better than a for-profit one for preserving mission? Or vice versa? (SocialEdge)

How did the Department of Interior determine the “high value” of national volunteer opportunities, recreation opportunities, wildland fires and acres burned, and herd data on wild horses and burros? (Sunlight Foundation Reporting)

Again, how did government agencies determine which data sets were “high value”? (CDT)

It’s so much harder to count flu cases than you would think (WSJ The Numbers Guy)

And apparently, more fun to count things in your life than you would think (via FlowingData)

Why do we need a datatrust? Part II

Monday, January 11th, 2010

In my first post on available public data sets, I described some of the limitations of Data.gov and the U.S. Census website. There’s not as much as you’d like on Data.gov, and the Census site is shockingly tiresome to navigate.

Other government agencies, though, do things a little differently, albeit with varying degrees of success.

3.  The Internal Revenue Service: They take so much, yet give so little.

The IRS website, compared to the Census site, is very well organized and easy to follow. Where the Census site feels like people have kept adding bits and pieces over the years, the IRS site feels like a cohesive whole. A small link for “Tax Stats” on the home page takes you here, where the data is neatly categorized by type of taxpayer and tax form. The IRS is statutorily required to provide statistics, but only the Office of Tax Analysis in the Secretary of the Treasury’s Office and the Congressional Joint Committee on Taxation are allowed to receive detailed tax return files. Other agencies and individuals may only receive information in aggregate to protect privacy, which is also statutorily required. The information it does provide is crunched by the Statistics of Income Program (SOI), which calculates statistics from 500,000 out of 200 million tax returns.

The IRS obviously has access to a wealth of information, and it’s published some interesting numbers.  One that I found particularly interesting was this table on adjusted gross income for the top 400 returns.

As you can see, the cut-off AGI for the top 400 returns has gone from $24,421,000 in 1992 to $86,380,000 in 2000. Capital gains as a percentage of AGI has gone from 33% in 1992 to 64% in 2000. The average tax rate has gone from 26.4% to 22.3%. All very interesting, useful data from which one can draw a range of conclusions or start new research.

But there’s a lot we don’t know.

  • How has the data changed from 2000 to now?
  • How might the returns correlate with specific changes in legislation?
  • How do the trends in the top 400 returns compare to the bottom 400?

Not to mention any other questions we might have of the underlying microdata. The SOI program is clearly doing a great deal of work calculating and packaging data to be “anonymous” for the public, but no one else gets to play with that data themselves, and data on something like the top 400 returns ends up being almost ten years old. Tax policy is one of the most significant ways in which the U.S. government seeks to shape American society; that is why we have tax credits and deductions for mortgage interest payments for homeowners but nothing equivalent for renters. Yet we, the public, don’t have access to data that would help us determine if the way we are being taxed is actually shaping our society in the ways we want.

4. Agency for Healthcare Research & Quality, Medical Expenditure Panel Survey (MEPS): Fascinating Data in a More Flexible Format

The Agency for Healthcare Research & Quality (AHRQ) collects precisely the kind of data we’re all struggling to understand as Congress proposes healthcare reform. The Medical Expenditure Panel Survey collects data on the specific health services that Americans use, how frequently they use them, the cost of these services, and how they are paid for, as well as data on the cost, scope, and breadth of health insurance held by and available to U.S. workers. The data AHRQ provides is much more flexible than IRS data, as you can use MEPSnet to create your own tables and statistics.

But that doesn’t mean you can ask a question like, “How much are single people aged 25-45 paying for health insurance in Miami?”  “How much is reasonable to pay for XYZ procedure in Minneapolis?”  I assume MEPSnet is useful for researchers who are skilled at working with data, but it’s not a real option for ordinary, interested individuals who are looking for some quantitative, data-driven answers to important questions.

MEPS also includes data that isn’t publicly released for reasons of confidentiality.  To access that data, you must be a qualified researcher and travel to a data center.

5.  EPA: Great Tools for Personalized Queries if You Don’t Need Personal Information

The EPA’s site, in many ways, is what I imagine a truly transparent, user-focused agency site could be like.  It has much more microdata available, and with much more consumer-oriented search possible. For example, MyEnvironment allows you to type in your zipcode and get a cross-section of many of their datasets all at once:

There are also some mechanisms for inputting data, such as reporting violations, which makes the EPA one of the few agencies I’ve seen where data doesn’t only flow in one direction.

But the reason so much data can be made available, in such searchable ways, is that the vast majority of the EPA’s microdata is not “personal.”  They’re measures of things like air quality and locations of regulated facilities.  They don’t have to worry about revealing personal tax information or personal medical expenditure information.  We’d love to see if similar data tools could be created for more sensitive data if better guarantees could be made around privacy than exist today.

Our dreams for data

So what would we love to see?

  • More “queryable” data: we’ll be able to ask the questions we want to ask, rather than accept the aggregates & statistics as presented.
  • More microdata available more quickly: we’ll get to analyze actual responses to surveys and not wait for the microdata to be “scrubbed” for privacy reasons.
  • More longitudinal data available: we’ll be able to do more studies of the same subjects over time and make more of it easily available to the public, rather than only in locked-up data centers.
  • More centralized, accessible data: we’ll be able to go to one place and be able to immediately see and have access to a lot of data.
  • More user-friendly data: we, as ordinary citizens, will be able to get data-specific answers to important, personal questions.

As I’ve stated previously, I don’t mean to pooh-pooh the data that these agencies and others have made available. It takes a great deal of time, effort, and resources to make this kind of data available, especially if you have to clean it up (i.e., make it “private” for public consumption), which is why it’s such a big deal when a government, whether federal, state, or local, makes a real commitment to making data available. We at the Common Data Project are working on a datatrust because we think certain technologies could reduce the costs of making data available by making privacy something more measurable and guaranteeable.

We may not be able to make all our data dreams come true immediately, but we definitely don’t want to let up on the push for better data.

Why do we need a datatrust? Isn’t there so much data out there already?

Tuesday, January 5th, 2010

In the past couple of years, and even more in the past couple of months, there’s been an explosion of data being made available online. The Obama administration has announced a commitment to transparency with the Open Government Initiative, including Data.gov, a central clearinghouse for raw data sets made available by federal agencies. Local governments, like New York City and Washington, D.C., are also putting data online and holding contests for best applications of that data. There are easier ways to access data that’s always been publicly available, like Property Shark for real estate records and for local information on everything from crime reports to restaurant health code violations.

So why do we need a “datatrust”?

Because the data isn’t actually so accessible.

Don’t get me wrong, there is certainly more data available than there ever has been before.  But if you actually sit down and look at some of the data sets online now, you’ll start to see that a great deal of work remains to be done.

Recently, I decided to do a survey of U.S. federal agency websites and the data they provide.  As an ordinary, interested citizen with reasonable research skills, this is what I found:

  • Often presented in a disorganized manner, so that it’s difficult to determine what’s available and where.
  • Largely available only as aggregates and statistics, which may or may not answer the questions we have.
  • When microdata/underlying data is available, only made available for researchers whose applications are approved, after registration, and/or after signing confidentiality agreements.
  • No easy query interface for non-researchers.

So let’s take a look at some specific sites.

1. Data.gov: Well-intentioned but incomplete.

Data.gov is supposed to be a centralized place for “raw,” downloadable federal government data. But the number of datasets is uneven, as is participation among agencies. Over 50% of the 809 data sets are from the Environmental Protection Agency (EPA). This may be because there is someone super-enthusiastic about this project at the EPA, or because EPA data on issues like air quality is less personal and arguably less sensitive, but for whatever reason, those looking for EPA data are likely to be much happier than those looking for something else. Data.gov does include some human-subject data, such as the American Time Use Survey (Labor), HHA Medicare Cost Report Data (Health and Human Services), Residential Energy Consumption (Energy), and Individuals Granted Asylum by Region and Country of Nationality (Homeland Security). But it does not include such major microdata sets as the National Health and Nutrition Examination Survey (NHANES), U.S. Census PUMS, and the Medical Expenditure Panel Survey (MEPS)., an older site, is more comprehensive, but it isn’t focused on microdata and raw data sets.

Most of all, there is no easy way to query these datasets. They’re intended for developers and those who know how to write programs that can query XML, CSV, and Shapefile data, which is all well and good, but they’re not actually providing information to less skilled but interested citizens like myself.
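To give a sense of what “writing a program to query a CSV” involves, here is a minimal Python sketch using only the standard library. The column names and rows are invented for illustration and do not come from any real federal data file, and the two helper functions are hypothetical names of my own.

```python
import csv
import io

# Invented sample standing in for a downloaded CSV file; not real agency data.
SAMPLE_CSV = """state,county,pm25_annual_mean
NY,Kings,11.2
NY,Queens,10.4
CA,Los Angeles,14.9
"""


def rows_where(csv_text, column, value):
    """Return the rows (as dicts) whose `column` equals `value`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[column] == value]


def average(rows, column):
    """Average a numeric column across the given rows."""
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)


ny_rows = rows_where(SAMPLE_CSV, "state", "NY")
ny_average = average(ny_rows, "pm25_annual_mean")

print(len(ny_rows), round(ny_average, 2))
```

Even this toy version assumes you know the column names in advance, what the values mean, and how to filter and average them yourself, which is a lot to ask of someone who just wants an answer about their own neighborhood.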

2.  U.S. Census: A LOT of Data, but Completely Disorganized

Let’s start with the home page, which looks like this.  A lot of words, and not much guidance to what means what.

Now, let’s say I’m curious about the demographics of my Brooklyn neighborhood.  I might decide to go to “People & Households,” which takes me here:

I’ll try “Data by Subject,” which takes me here:

It’s hard to know precisely which of these categories will take me to what I want, some basic demographic information on my neighborhood.  I tried clicking on Population Profile and Small Area Income and Poverty Estimates, which didn’t pan out.  “Community” sounds right, so I’ll click on “American Community Survey,” which takes me here.

If I click on Access Data, it gives me these choices:

And if I click on American FactFinder, I end up here:

Okay, I don’t really know what any of this means.  Thematic maps, reference maps, custom table???

But let’s say I’d started with “American FactFinder” on the home page, which is linked in the far left-hand column.  If I’d started there, I would have found this:

I can see there’s a little window at the top where I can get a Fact Sheet for my community.  Hmm, that seems easy!  Why didn’t I get here earlier? But let’s just click on “American Community Survey–Learn More” and see if that takes me back where I was before:

Ack, where am I?  Why is this different from the other ACS page?
If I go back and click on “Get Data” under American Community Survey, I would go back to the ACS page I first saw.

The organizing principle is not completely devoid of logic, but there are endless loops within loops of links on the Census site.  You can lose your way really quickly and find yourself unable to even retrace your steps.  The home page does have boxes on the right where you can enter a city/town, county or zip for “Population” and you can select a state for “QuickFacts,” but the box where you can enter a city/town, county or zip for community “Fact Sheets” is only found if you click on American FactFinder.  Why?

There was a part of me that hoped I was just stupid because I was inexperienced.  But my friends who use Census data regularly for work tell me they also have trouble finding what they need.  I’m sure there are reasons why you can’t just query all the data, but what are they?  And how should we deal with them?  Should we just put up with them or try to find a solution to make data more available?

In Part II of this post, I’ll analyze the data available from the IRS, the Agency for Healthcare Research & Quality, and the EPA.

In the mix

Wednesday, September 9th, 2009

OpenID Pilot Program to be Announced by U.S. Government (ReadWriteWeb)

Stimulus Funding Map is “Slick as Hell” (FlowingData)

Why Anonymized Data Isn’t (Slashdot)

In the mix

Thursday, July 2nd, 2009

Got a Minute? Set Some Government Data Free with Transparency Corps (ReadWriteWeb)

Social Network Users Reportedly Concerned About Privacy, But Behavior Says Otherwise (ReadWriteWeb)

Bloomberg Releasing City Data Online in Hopes Developers Will Create New and Better Mobile Apps (NY Daily News)

Ad industry groups agree to privacy guidelines (CNET News)

