Posts Tagged ‘data’

In the mix

Tuesday, January 26th, 2010

Is a nonprofit structure better than a for-profit one for preserving mission? Or vice versa? (SocialEdge)

How did the Department of Interior determine the “high value” of national volunteer opportunities, recreation opportunities, wildland fires and acres burned, and herd data on wild horses and burros? (Sunlight Foundation Reporting)

Again, how did government agencies determine which data sets were “high value”? (CDT)

It’s so much harder to count flu cases than you would think (WSJ The Numbers Guy)

And apparently, more fun to count things in your life than you would think (via FlowingData)

Why do we need a datatrust? Part II

Monday, January 11th, 2010

In my first post on available public data sets, I described some of the limitations of Data.gov and the U.S. Census website.  There’s not as much as you’d like on Data.gov, and the Census site is shockingly tiresome to navigate.

Other government agencies, though, do things a little differently, albeit with varying degrees of success.

3.  The Internal Revenue Service: They take so much, yet give so little.

The IRS website, compared to the Census site, is very well organized and easy to follow.  Where the Census site feels like people have kept adding bits and pieces over the years, the IRS site feels like a cohesive whole.  A small link for “Tax Stats” on the home page takes you here, where the data is neatly categorized by type of taxpayer and tax form. The IRS is statutorily required to provide statistics, but only the Office of Tax Analysis in the Secretary of the Treasury’s Office and the Congressional Joint Committee on Taxation are allowed to receive detailed tax return files.  Other agencies and individuals may only receive information in aggregate to protect privacy, which is also statutorily required.  The information it does provide is crunched by the Statistics of Income Program (SOI), which calculates statistics from 500,000 out of 200 million tax returns.
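To make the sampling idea concrete, here is a minimal sketch of how a total can be estimated from a small sample of returns. It assumes a simple random sample, which is not necessarily how SOI draws its sample, and it uses synthetic incomes rather than real tax data. The function name `estimate_total` and the scaled-down population are illustrative assumptions.

```python
import random


def estimate_total(sample_values, population_size):
    """Scale a simple random sample up to an estimated population total."""
    if not sample_values:
        raise ValueError("sample must not be empty")
    # Each sampled return stands in for this many returns in the population.
    weight = population_size / len(sample_values)
    return weight * sum(sample_values)


# Synthetic stand-in for a population of returns, scaled down 1000x
# (200,000 instead of 200 million) so the example runs quickly.
rng = random.Random(0)
population = [rng.lognormvariate(10.5, 1.0) for _ in range(200_000)]

# Same sampling fraction as 500,000 out of 200 million.
sample = rng.sample(population, 500)

estimate = estimate_total(sample, len(population))
actual = sum(population)
print(f"sampling fraction: {len(sample) / len(population):.2%}")
print(f"estimated total / actual total: {estimate / actual:.3f}")
```

Each sampled return carries a weight of 400 here, which is why a small sample can still say something useful about aggregates while revealing very little about any individual return.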

The IRS obviously has access to a wealth of information, and it’s published some interesting numbers.  One that I found particularly interesting was this table on adjusted gross income for the top 400 returns.

According to the table, the cut-off AGI for the top 400 returns went from $24,421,000 in 1992 to $86,380,000 in 2000.  Capital gains as a percentage of AGI went from 33% in 1992 to 64% in 2000.  The average tax rate went from 26.4% to 22.3%.  All very interesting, useful data from which one can draw a range of conclusions or start new research.
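For a sense of scale, the arithmetic behind those cut-off figures can be sketched as follows. The dollar amounts are the ones quoted above; the helper names are illustrative, and the figures are treated as nominal dollars with no inflation adjustment, which is an assumption.

```python
def growth_multiple(start, end):
    """How many times larger the end value is than the start value."""
    return end / start


def compound_annual_growth_rate(start, end, years):
    """Constant yearly growth rate that would turn start into end."""
    return (end / start) ** (1 / years) - 1


cutoff_1992 = 24_421_000
cutoff_2000 = 86_380_000

multiple = growth_multiple(cutoff_1992, cutoff_2000)
cagr = compound_annual_growth_rate(cutoff_1992, cutoff_2000, 2000 - 1992)

print(f"cut-off grew {multiple:.2f}x, about {cagr:.1%} per year")
```

The cut-off grew roughly three and a half times over eight years, which works out to a compound rate of roughly 17% per year.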

But there’s a lot we don’t know.

  • How has data changed from 2000 to now?
  • How might the returns correlate with specific changes in legislation?
  • How do the trends in the top 400 returns compare to the bottom 400?

Not to mention any other questions we might have of the underlying microdata.  The SOI program is clearly doing a great deal of work calculating and packaging data to be “anonymous” for the public, but no one else gets to play with that data themselves, and data on something like the top 400 returns ends up being almost ten years old.  Tax policy is one of the most significant ways in which the U.S. government seeks to shape American society; it is why we have tax credits and deductions for mortgage interest payments for homeowners but nothing equivalent for renters.  Yet we, the public, don’t have access to data that would help us determine if the way we are being taxed is actually shaping our society in the ways we want.

4. Agency for Healthcare Research & Quality, Medical Expenditure Panel Survey (MEPS): Fascinating Data in a More Flexible Format

The Agency for Healthcare Research & Quality (AHRQ) collects precisely the kind of data we’re all struggling to understand as Congress proposes healthcare reform.  The Medical Expenditure Panel Survey collects data on the specific health services that Americans use, how frequently they use them, the cost of these services, and how they are paid for, as well as data on the cost, scope, and breadth of health insurance held by and available to U.S. workers.  The data AHRQ provides is much more flexible than IRS data, as you can use MEPSnet to create your own tables and statistics.

But that doesn’t mean you can ask a question like, “How much are single people aged 25-45 paying for health insurance in Miami?” or “How much is reasonable to pay for XYZ procedure in Minneapolis?”  I assume MEPSnet is useful for researchers who are skilled at working with data, but it’s not a real option for ordinary, interested individuals who are looking for some quantitative, data-driven answers to important questions.
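As a rough illustration of what answering the Miami question directly from microdata might look like, here is a minimal sketch. Everything in it is hypothetical: the record fields, the `average_premium` function, and the minimum group size are assumptions made for illustration, not MEPS's actual schema or disclosure rules.

```python
from statistics import mean

# Hypothetical rule: refuse to answer when too few people match the query.
MIN_CELL_SIZE = 10


def average_premium(records, city, min_age, max_age, marital_status):
    """Average annual premium for a subgroup, or None if the group is too small."""
    matches = [
        r["annual_premium"]
        for r in records
        if r["city"] == city
        and min_age <= r["age"] <= max_age
        and r["marital_status"] == marital_status
    ]
    if len(matches) < MIN_CELL_SIZE:
        return None  # suppressed: too few respondents to report safely
    return mean(matches)


# Made-up records, purely to exercise the function.
records = [
    {"city": "Miami", "age": 25 + i, "marital_status": "single", "annual_premium": 3600}
    for i in range(12)
] + [
    {"city": "Minneapolis", "age": 30, "marital_status": "single", "annual_premium": 2900}
]

print(average_premium(records, "Miami", 25, 45, "single"))
print(average_premium(records, "Minneapolis", 25, 45, "single"))
```

Suppressing small groups is a crude safeguard rather than a privacy guarantee, which is part of why microdata like this tends to stay locked away.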

MEPS also includes data that isn’t publicly released for reasons of confidentiality.  To access that data, you must be a qualified researcher and travel to a data center.

5.  EPA: Great Tools for Personalized Queries if You Don’t Need Personal Information

The EPA’s site, in many ways, is what I imagine a truly transparent, user-focused agency site could be like.  It has much more microdata available, and with much more consumer-oriented search possible. For example, MyEnvironment allows you to type in your zipcode and get a cross-section of many of their datasets all at once.

There are also some mechanisms for inputting data, such as reporting violations, which makes the EPA one of the few agencies I’ve seen where data doesn’t only flow in one direction.

But the reason so much data can be made available, in such searchable ways, is that the vast majority of the EPA’s microdata is not “personal.”  They’re measures of things like air quality and locations of regulated facilities.  They don’t have to worry about revealing personal tax information or personal medical expenditure information.  We’d love to see if similar data tools could be created for more sensitive data if better guarantees could be made around privacy than exist today.

Our dreams for data

So what would we love to see?

  • More “queryable” data: we’ll be able to ask the questions we want to ask, rather than accept the aggregates & statistics as presented.
  • More microdata available more quickly: we’ll get to analyze actual responses to surveys and not wait for the microdata to be “scrubbed” for privacy reasons.
  • More longitudinal data available: we’ll be able to do more studies of the same subjects over time and make more of it easily available to the public, rather than only in locked-up data centers.
  • More centralized, accessible data: we’ll be able to go to one place and immediately see and access a lot of data.
  • More user-friendly data: we, as ordinary citizens, will be able to get data-specific answers to important, personal questions.

As I’ve stated previously, I don’t mean to pooh-pooh the data that these agencies and others have made available.  It takes a great deal of time, effort, and resources to make this kind of data available, especially if you have to clean it up (i.e., make it “private” for public consumption), which is why it’s such a big deal when a government, whether federal, state, or local, makes a real commitment to making data available. We at the Common Data Project are working on a datatrust because we think certain technologies could reduce the costs of making data available by making privacy something more measurable and guaranteeable.

We may not be able to make all our data dreams come true immediately, but we definitely don’t want to let up on the push for better data.

Crowdsourcing data?

Thursday, September 10th, 2009

Sometimes, news just seems to coalesce around one topic.

A few weeks ago, the New York Times had a thoughtful piece on patients sharing their data online to push for more efficient research.  Dr. Amy Farber, after being diagnosed with a rare but deadly disease called LAM, founded the LAM Treatment Alliance and LAMsight, “a Web site that allows patients to report information about their health, then turns those reports into databases that can be mined for observations about the disease.”

In a completely different arena, we also had news that Google Maps is using GPS information from mobile phones to improve traffic data.  Google had used data from local highway authorities for traffic data on major highways, but now, GPS data from users of Google Maps with the My Location feature will provide data for local roads as well.

Pretty exciting stuff. Crowdsourcing isn’t new.  But thus far, it’s been used mostly for things that are subjective. Like Hot or Not.  Customer reviews.  It’s also been primarily voluntary. You choose to write a review and share your data. Or if it’s involuntary, it’s not something that is accessible to the public (e.g., search results, credit card data, mortgage data, etc.).

What’s exciting now is that we’re starting to get into discussions about crowdsourcing for stuff like

  1. Medical research – where people are trying to extract objective conclusive results from data.
  2. Traffic data – where data is automatically collected (opt-in/opt-out, whatever) and made available to the public.

The two most common objections are around the supposed inaccuracy of self-reported data and the privacy risks of providing so much individualized information.

But as Ian Eslick, the MIT doctoral student developing LAMsight, points out,

No one expects that observational research using online patient data will replace experimental controlled trials…“There’s an idea that data collected from a clinic is good and data collected from patients is bad,” he said. “Different data is effective at different purposes, and different data can lead to different kinds of error.”

And as the people behind Google Maps explain, they worked hard to increase accuracy by making participation as easy as possible.

The issue of privacy is a little trickier.  Google says you can opt-out of contributing your data easily, and Google promises that even those who contribute data can trust that their data will remain anonymous, “Even though the vehicle carrying a phone is anonymous, we don’t want anybody to be able to find out where that anonymous vehicle came from or where it went — so we find the start and end points of every trip and permanently delete that data so that even Google ceases to have access to it.”
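The trimming idea in that quote can be sketched in a few lines. This is only an illustration of the general approach, not Google's actual method: the `Ping` record, the `trim_trip` function, and the two-minute window are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ping:
    seconds: float  # time since the trip started
    lat: float
    lon: float


def trim_trip(pings, trim_seconds=120.0):
    """Drop location pings recorded near the start and end of a trip.

    The two-minute default is a hypothetical choice, not a documented value.
    """
    if not pings:
        return []
    ordered = sorted(pings, key=lambda p: p.seconds)
    start, end = ordered[0].seconds, ordered[-1].seconds
    return [
        p for p in ordered
        if start + trim_seconds <= p.seconds <= end - trim_seconds
    ]


# A made-up ten-minute trip with one ping per minute.
trip = [Ping(seconds=60.0 * i, lat=40.70 + 0.001 * i, lon=-74.00) for i in range(11)]
kept = trim_trip(trip)
print(f"kept {len(kept)} of {len(trip)} pings")
```

Only the middle of the trip survives, so the kept pings still describe traffic speed on the road without showing where the vehicle set out from or where it ended up. A trip shorter than twice the window is dropped entirely.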

There are certain to be some people who won’t feel comfortable with Google’s promises. Yet I doubt they will have much impact on Google’s ability to deliver this service. The bigger issue for me is how privacy may be holding back smaller, less established players from developing potentially valuable services based on crowdsourced data collection.

In other words, is our currently ad-hoc and unsatisfactory approach to privacy inadvertently stifling competition by making it nearly impossible for startups to compete with the establishment wherever sensitive personal information is involved?

What data would you like to gain access to that might face similar privacy challenges?

In the mix

Wednesday, September 9th, 2009

OpenID Pilot Program to be Announced by U.S. Government (ReadWriteWeb)

Stimulus Funding Map is “Slick as Hell” (FlowingData)

Why Anonymized Data Isn’t (Slashdot)

Bringing the power of data to nonprofit organizations

Tuesday, August 4th, 2009

Over the last six months, I’ve had the privilege to interview a dozen people working with various nonprofit organizations, as well as a few agencies, about how they work with data.  They’ve candidly shared with me the data they collect (or try to collect) and the challenges they face in getting as much out of data as possible.  I’ve talked to people who work locally, nationally, and internationally; with people who do everything from workforce development to HIV/AIDS treatment in Africa.

Businesses have always known that data is valuable.  They’ve also had the money and resources to use the latest tools to collect and analyze data.  Walmart was at the forefront of using computers to track its inventory; Google and other internet companies are now at the forefront of using cookies to gather more than we ever believed was possible.

Nonprofits have been a little slower to recognize that data is not just for people who are trying to make a profit.  But as nonprofits compete for funding and donors seek more accountability for what nonprofits do with their money, it’s become almost trendy for nonprofits to try to think more like businesses about their data.  Whether they’re national advocacy organizations or more localized neighborhood groups providing basic services, nonprofits are starting to realize that data might be valuable for their missions, too.

In the course of interviewing these nonprofits, though, it’s become increasingly obvious to me that nonprofits might have a chance to one-up business in changing the way data gets collected, analyzed, and used.

The reason we at CDP are interested in learning more about the ways nonprofits use data now is that we think they could be major users of and contributors to the “datatrust,” a safe and secure place to share, and not just hoard, sensitive information.

This would probably come as a surprise to many of the people I talked to.  The few that were very proud of their data collection felt as proprietary towards their data as Google or Microsoft would. And the ones that weren’t so proud were struggling with yellowing paper files or inflexible Excel spreadsheets.  The thought of being at the forefront of anything would be mind-boggling.

But the one thing they all had in common was that they wanted more data.  Almost everyone could think of some data source that wasn’t available to them, whether from government agencies or administrative courts. In many cases, the reason for withholding that data was to protect the privacy of individuals in that data set.  Many of them could also think of things they wanted to count but were having trouble counting now, from the best ways to improve outcomes to better understanding their target populations.

We at CDP believe that the best way to get data is to give data (see our experimental online data collection forum!).  And many nonprofit organizations are in a great position to get data by giving data.

First, nonprofits have limited resources.  One organization, unless it’s incredibly wealthy, can only collect so much information.  A safe, secure place for allies to share information, i.e., crowd-source, could help nonprofits get answers to long-standing questions.

Is that immigration judge really denying 99% of all asylum cases before him?  How long is it taking the New York State Department of Labor to process wage claims?  Given that much of this information isn’t available anyway, any information would be better than no information.

And more information could give nonprofits increased leverage to demand more information from government agencies.  Some nonprofits already have strong networks of members or allies.  A better way to collect data is all they need to maximize resources they already have.

Second, nonprofits have fundamentally different goals than businesses.  Their mission, whether it’s to save the whales or to provide job training to former inmates, is about the public good.  Given that they are run with donations from the public, many nonprofits have taken this to heart and decided that they need to be more transparent.  Even though transparency often seems to be limited to disclosure of annual IRS filings, a datatrust could bring transparency to a new level.  A nonprofit could choose not only to declare its job training program a success in its annual report, but also to disclose the data through the datatrust for others to analyze.  Transparency could push nonprofits to be better at what they do, which would benefit all of us.

Certainly, a CDP datatrust won’t solve all nonprofit data problems.  We’re not trying to get into the business of nonprofit data management.  But there are amazing opportunities to harness the power of online data collection to make the world a better place, and not just target advertising more accurately.

We’re still thinking it through.  We’re continuing our interviews and learning more with each one about the particular goals and challenges nonprofits face in using data.  And the more we learn, the more exciting it is to think about what could happen when the power of data is available for all of us, and not just major corporations.

In the mix

Wednesday, July 15th, 2009

Hacker Exposes Private Twitter Documents (NYT Bits Blog)

Code Red: How software companies could screw up Obama’s healthcare reform (Washington Monthly)

Collect Data About Yourself with Twitter (FlowingData)

The Nike Experiment: How the Shoe Giant Unleashed the Power of Personal Metrics (Wired)

In the mix

Thursday, July 2nd, 2009

Got a Minute? Set Some Government Data Free with Transparency Corps (ReadWriteWeb)

Social Network Users Reportedly Concerned About Privacy, But Behavior Says Otherwise (ReadWriteWeb)

Bloomberg Releasing City Data Online in Hopes Developers Will Create New and Better Mobile Apps (NY Daily News)

Ad industry groups agree to privacy guidelines (CNET News)

In the mix

Wednesday, June 24th, 2009

Online participatory study of bipolar disorder.  (MoodChart)

The Day Facebook Changed Forever. (ReadWriteWeb)

Unhealthy Accounting of the Uninsured. (Wall Street Journal)

In the mix

Wednesday, May 27th, 2009

Data.gov: Unlocking the Federal Filing Cabinets. (NYT Bits)

On the Anonymity of Home/Work Location Pairs. (Schneier on Security)

Do People Care About Data Correlation? (Kim Cameron’s Identity Blog)

In the mix

Wednesday, May 20th, 2009

Site Lets Writers Sell Digital Copies. (NY Times)

Linked Data is Blooming: Why You Should Care (ReadWriteWeb)

Mint Considers Selling Anonymized Data From Its Users (ReadWriteWeb)

The Growing Popularity of Popularity Lists (The Numbers Guy/Wall Street Journal)
