Archive for the ‘Interesting Uses of Data’ Category

Can we trust Census data?

Wednesday, February 3rd, 2010

Yesterday, the Freakonomics blog at the New York Times reported that a group of researchers had discovered serious errors in PUMS (public-use microdata samples) files released by the U.S. Census Bureau.  When compared to aggregate data released by the Census, the PUMS files showed discrepancies of up to 15% for the 65-and-older population.  As Justin Wolfers explains, PUMS files are small samples of the much larger, confidential data used by the Census for the general statistics it releases. These samples are crucial to researchers and policymakers looking to measure trends that the Census itself has not calculated.

When I read this, the first thought I had was, “Hallelujah!”  Not because I felt gleeful about the Census Bureau’s mistakes, but because this little post in the New York Times articulated something we’ve been trying to communicate for a while: current methods of data collection (and especially data release) are not perfect.

People love throwing around statistics, and increasingly people love debunking statistics, but that kind of scrutiny is normally directed at surveys conducted by people who are not statisticians.  Most people hear terms like “statistical sampling” and “disclosure avoidance procedure” and assume that the professionals using them surely know what they’re doing.

But you don’t have to have training in statistics to read this paper and understand what happened. The Census Bureau, unlike many organizations and businesses that claim to “anonymize” datasets, knows that individual identities cannot be kept confidential simply by removing “identifiers” like name and address, which is why they use techniques like “data swapping” and “synthetic data.” It doesn’t take a mathematician to understand that when you’re making up data, you might have trouble maintaining the accuracy of the overall microdata sample.
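For readers who want to see why swapping can hurt accuracy, here is a minimal sketch in Python. It is not the Census Bureau’s actual procedure; the toy records, the field names, and the swap_attribute helper are all hypothetical, made up purely for illustration. The idea it shows is that exchanging one attribute between randomly paired records keeps the overall totals for that attribute intact while letting its relationship to other attributes drift.

```python
import random
from collections import Counter


def swap_attribute(records, key, fraction, seed=0):
    """Copy `records`, then exchange the value of `key` between random pairs.

    Roughly `fraction` of the records take part in a swap.
    """
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]
    n_pairs = int(len(swapped) * fraction / 2)
    chosen = rng.sample(range(len(swapped)), 2 * n_pairs)
    for i, j in zip(chosen[::2], chosen[1::2]):
        swapped[i][key], swapped[j][key] = swapped[j][key], swapped[i][key]
    return swapped


# Hypothetical toy records: every "65+" person starts out in the "north" region.
records = [
    {"age_group": "65+" if i % 4 == 0 else "under 65",
     "region": "north" if i % 2 == 0 else "south"}
    for i in range(40)
]
released = swap_attribute(records, "age_group", fraction=0.5)


def crosstab(rows):
    """Count records for each (age_group, region) combination."""
    return Counter((r["age_group"], r["region"]) for r in rows)


# The totals per age group are unchanged by swapping...
print(Counter(r["age_group"] for r in records))
print(Counter(r["age_group"] for r in released))
# ...but the age-by-region breakdown can drift away from the original.
print(crosstab(records))
print(crosstab(released))
```

Anyone comparing only the age totals would see nothing wrong, which is one way an error in the altered file could go unnoticed until someone checks a finer breakdown.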

To the Bureau’s credit, it does acknowledge where inaccuracies exist.  But as the researchers found, the Bureau is unwilling to correct its mistakes because doing so could reveal how they altered the data in the first place and thus compromise someone’s identity.  Which gets to the heart of the problem:

Newer techniques, such as swapping or blanking, retain detail and provide better protection of respondents’ confidentiality. However, the effects of the new techniques are less transparent to data users and mistakes can easily be overlooked.

The problems with current methods of data collection aren’t limited to the Census PUMS files either.  The weaknesses outlined by this former employee could apply to so many organizations.

This is why we have to work on new ways to collect, analyze, and release sensitive data.

In the mix: Your unique(ish) browser fingerprint…and…No $$ for privacy.

Friday, January 29th, 2010

1) EFF’s Panopticlick project lets you see how much your browser reveals and whether that might potentially “identify” you, based on their calculation of how identifiable a set of bits might be.

Can someone with a better grasp of math than I have explain to me how their information theory works? Right now, they have let’s say 10,000 people who’ve contributed their browser info. Bruce Schneier found out he was unique in 120,000. But if millions of people tested their browsers, would his configuration really be that unique? (Lots of skepticism in the comments to Schneier’s post, too.)
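A rough way to think about the question, sketched in Python under stated assumptions: if a configuration turns up once in N browsers, information theory assigns it log2(N) bits of "surprisal." The function names below are my own, and the sketch assumes the observed one-in-N rate carries over to a larger population, which a sample cannot guarantee.

```python
import math


def surprisal_bits(one_in_n):
    """Bits of identifying information for a one-in-N configuration."""
    return math.log2(one_in_n)


def expected_matches(population, one_in_n):
    """Expected number of browsers sharing the configuration,
    assuming the one-in-N rate holds across the whole population."""
    return population / one_in_n


# "Unique in 120,000" works out to a little under 17 bits.
print(round(surprisal_bits(120_000), 2))

# If the same rate held among 1,200,000 visitors, about 10 would match.
print(expected_matches(1_200_000, 120_000))
```

So being unique among 120,000 testers does not mean being unique among millions. A sample of that size can only show that a configuration is rare, not how rare it would turn out to be in a much larger crowd.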

2) New initiative by advertising groups to reveal that they are tracking information — a small “i” icon:

What a quote: “‘This is not the full solution, but this moves the ball forward,’ he said.”

Well, that’s the understatement of the century. Full solution to what? The advertising industry keeping regulators off their backs? Helping users understand how targeted advertising finds them? Really, neither is the real problem. Regulators should be focusing on establishing industry guidelines for how service providers and 3rd party advertising partners store and share data.

3) Should government data be in more user-friendly formats than XML?

Or should we leave usability to disinterested 3rd parties? If the government starts releasing user-friendly data, will that simply open the door for agencies to “spin” their data to make themselves look good? Actually, right now, how do we really know the data that’s being released hasn’t been “edited” in some way? Who’s vetting these releases and what’s the process?

4) Ten years and no one is really making any money off of “privacy”?

Perhaps no one has successfully “sold” privacy (as its own thing) because we haven’t yet agreed on what a “privacy product” would look like. As Mimi says, “If someone was selling something that would guarantee that I would never get any SPAM (mail or email) for the rest of my life, I would totally sign up for that.” But that might not equal “privacy” for someone else.

In the mix

Tuesday, January 26th, 2010

Is a nonprofit structure better than a for-profit one for preserving mission? Or vice versa? (SocialEdge)

How did the Department of Interior determine the “high value” of national volunteer opportunities, recreation opportunities, wildland fires and acres burned, and herd data on wild horses and burros? (Sunlight Foundation Reporting)

Again, how did government agencies determine which data sets were “high value”? (CDT)

It’s so much harder to count flu cases than you would think (WSJ The Numbers Guy)

And apparently, more fun to count things in your life than you would think (via FlowingData)

In the mix

Tuesday, January 19th, 2010

Unboxed — A Data Explosion Is Remaking Retailing (NYTimes)

Microsoft Cuts Bing IP Address Storage to 6 Months (CNET)

Starbucks Receipts Used for NYC Calorie Study (NYTimes)

Why do we need a datatrust? Part II

Monday, January 11th, 2010

In my first post on available public data sets, I described some of the limitations of Data.gov and the U.S. Census website.  There’s not as much as you’d like on Data.gov, and the Census site is shockingly tiresome to navigate.

Other government agencies, though, do things a little differently, albeit with varying degrees of success.

3.  The Internal Revenue Service: They take so much, yet give so little.

The IRS website, compared to the Census site, is very well organized and easy to follow.  Where the Census site feels like people have kept adding bits and pieces over the years, the IRS site feels like a cohesive whole.  A small link for “Tax Stats” on the home page takes you here, where the data is neatly categorized by type of taxpayer and tax form. The IRS is statutorily required to provide statistics, but only the Office of Tax Analysis in the Secretary of the Treasury’s Office and the Congressional Joint Committee on Taxation are allowed to receive detailed tax return files.  Other agencies and individuals may only receive information in aggregate to protect privacy, which is also statutorily required.  The information it does provide is crunched by the Statistics of Income Program (SOI), which calculates statistics from 500,000 out of 200 million tax returns.

The IRS obviously has access to a wealth of information, and it’s published some interesting numbers.  One that I found particularly interesting was this table on adjusted gross income for the top 400 returns.

As you can see, the cut-off AGI for the top 400 returns has gone from $24,421,000 in 1992 to $86,380,000 in 2000.  Capital gains as a percentage of AGI has gone from 33% in 1992 to 64% in 2000.  The average tax rate has gone from 26.4% to 22.3%.  All very interesting, useful data from which one can draw a range of conclusions or start new research.
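To put those changes side by side, here is a small arithmetic sketch in Python. It uses only the figures quoted above; the helper names are my own, and the dollar amounts are taken as quoted, with no inflation adjustment.

```python
def fold_change(old, new):
    """How many times larger the new value is than the old one."""
    return new / old


def point_change(old_pct, new_pct):
    """Difference between two percentages, in percentage points."""
    return new_pct - old_pct


agi_cutoff_1992 = 24_421_000
agi_cutoff_2000 = 86_380_000

# Cut-off AGI for the top 400 returns: roughly a 3.5-fold increase.
print(round(fold_change(agi_cutoff_1992, agi_cutoff_2000), 2))

# Capital gains share of AGI: up 31 percentage points.
print(point_change(33, 64))

# Average tax rate: down about 4.1 percentage points.
print(round(point_change(26.4, 22.3), 1))
```

Seen this way, the cut-off more than tripled over eight years while the average tax rate on those returns fell by about four points.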

But there’s a lot we don’t know.

  • How has data changed from 2000 to now?
  • How might the returns correlate with specific changes in legislation?
  • How do the trends in the top 400 returns compare to the bottom 400?

Not to mention any other questions we might have of the underlying microdata.  The SOI program is clearly doing a great deal of work calculating and packaging data to be “anonymous” for the public, but no one else gets to play with that data themselves, and data on something like the top 400 returns ends up being almost ten years old.  Tax policy is one of the most significant ways in which the U.S. government seeks to shape American society; it is why we have tax credits and deductions for mortgage interest payments for homeowners but nothing equivalent for renters.  Yet we, the public, don’t have access to data that would help us determine if the way we are being taxed is actually shaping our society in the ways we want.

4. Agency for Healthcare Research & Quality, Medical Expenditure Panel Survey (MEPS): Fascinating Data in a More Flexible Format

The Agency for Healthcare Research & Quality (AHRQ) collects precisely the kind of data we’re all struggling to understand as Congress proposes healthcare reform.  The Medical Expenditure Panel Survey collects data on the specific health services that Americans use, how frequently they use them, the cost of these services, and how they are paid for, as well as data on the cost, scope, and breadth of health insurance held by and available to U.S. workers.  The data AHRQ provides is much more flexible than IRS data, as you can use MEPSnet to create your own tables and statistics.

But that doesn’t mean you can ask a question like, “How much are single people aged 25-45 paying for health insurance in Miami?”  “How much is reasonable to pay for XYZ procedure in Minneapolis?”  I assume MEPSnet is useful for researchers who are skilled at working with data, but it’s not a real option for ordinary, interested individuals who are looking for some quantitative, data-driven answers to important questions.

MEPS also includes data that isn’t publicly released for reasons of confidentiality.  To access that data, you must be a qualified researcher and travel to a data center.

5.  EPA: Great Tools for Personalized Queries if You Don’t Need Personal Information

The EPA’s site, in many ways, is what I imagine a truly transparent, user-focused agency site could be like.  It has much more microdata available, and it supports much more consumer-oriented searching. For example, MyEnvironment allows you to type in your ZIP code and get a cross-section of many of their datasets all at once.

There are also some mechanisms for inputting data, such as reporting violations, which makes the EPA one of the few agencies I’ve seen where data doesn’t only flow in one direction.

But the reason so much data can be made available, in such searchable ways, is that the vast majority of the EPA’s microdata is not “personal.”  They’re measures of things like air quality and locations of regulated facilities.  They don’t have to worry about revealing personal tax information or personal medical expenditure information.  We’d love to see if similar data tools could be created for more sensitive data if better guarantees could be made around privacy than exist today.

Our dreams for data

So what would we love to see?

  • More “queryable” data—we’ll be able to ask the questions we want to ask, rather than accept the aggregates & statistics as presented.
  • More microdata available more quickly—we’ll get to analyze actual responses to surveys and not wait for the microdata to be “scrubbed” for privacy reasons.
  • More longitudinal data available—we’ll be able to do more studies of the same subjects over time and make more of it easily available to the public, rather than only in locked-up data centers.
  • More centralized, accessible data—we’ll be able to go to one place and be able to immediately see and have access to a lot of data.
  • More user-friendly data—we, as ordinary citizens, will be able to get data-specific answers to important, personal questions.

As I’ve stated previously, I don’t mean to pooh-pooh the data that these agencies and others have made available.  It takes a great deal of time, effort, and resources to make this kind of data available, especially if you have to clean it up (i.e., make it “private” for public consumption), which is why it’s such a big deal when a government, whether federal, state, or local, makes a real commitment to making data available. We at the Common Data Project are working on a datatrust because we think certain technologies could reduce the costs of making data available by making privacy something more measurable and guaranteeable.

We may not be able to make all our data dreams come true immediately, but we definitely don’t want to let up on the push for better data.

Why do we need a datatrust? Isn’t there so much data out there already?

Tuesday, January 5th, 2010

In the past couple of years, and even more in the past couple of months, there’s been an explosion of data being made available online.  The Obama administration has announced a commitment to transparency with the Open Government Initiative, including Data.gov, a central clearinghouse for raw data sets made available by federal agencies.  Local governments, like New York City and Washington, D.C., are also putting data online and holding contests for best applications of that data.  There are easier ways to access data that’s always been publicly available, like Property Shark for real estate records and Everyblock.com for local information on everything from crime reports to restaurant health code violations.

So why do we need a “datatrust”?

Because the data isn’t actually so accessible.

Don’t get me wrong, there is certainly more data available than there ever has been before.  But if you actually sit down and look at some of the data sets online now, you’ll start to see that a great deal of work remains to be done.

Recently, I decided to do a survey of U.S. federal agency websites and the data they provide.  I approached the sites as an ordinary, interested citizen with reasonable research skills, and this is what I found about the data:

  • Often presented in a disorganized manner, so that it’s difficult to determine what’s available and where.
  • Largely available only as aggregates and statistics, which may or may not answer the questions we have.
  • When microdata/underlying data is available, only made available for researchers whose applications are approved, after registration, and/or after signing confidentiality agreements.
  • No easy query interface for non-researchers.

So let’s take a look at some specific sites.

1.  Data.gov: Well-intentioned but incomplete.

Data.gov is supposed to be a centralized place for “raw,” downloadable federal government data.  But the number of datasets is uneven, as is participation among agencies.  Over 50% of the 809 data sets are from the Environmental Protection Agency (EPA).  This may be because there is someone super-enthusiastic about this project at the EPA, or because EPA data on issues like air quality is less personal and arguably less sensitive, but for whatever reason, those looking for EPA data are likely to be much happier than those looking for something else.

Data.gov does include some human-subject data, such as the American Time Use Survey (Labor), HHA Medicare Cost Report Data (Health and Human Services), Residential Energy Consumption (Energy), and Individuals Granted Asylum by Region and Country of Nationality (Homeland Security).  But it does not include such major microdata sets as the National Health and Nutrition Examination Survey (NHANES), U.S. Census PUMS, and the Medical Expenditure Panel Survey (MEPS).  Fedstats.gov, an older site, is more comprehensive, but it isn’t focused on microdata and raw data sets.

Most of all, there is no easy way to query these datasets.  They’re intended to be available for developers and those who know how to write programs that can query XML, CSV, and Shapefile data, which is all well and good, but they’re not actually providing information to less skilled but interested citizens like myself.
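To give a sense of what “writing a program to query a CSV file” involves, here is a minimal sketch using Python’s standard csv module. The column names and the three rows are made-up placeholders, not real Data.gov or EPA data, and rows_for_zip is a hypothetical helper; a real dataset would have its own layout that you would need to look up first.

```python
import csv
import io

# Made-up placeholder rows standing in for a downloaded CSV file.
SAMPLE_CSV = """zip_code,pollutant,reading
00001,ozone,10
00001,pm25,20
00002,ozone,30
"""


def rows_for_zip(csv_text, zip_code):
    """Return every row whose zip_code column matches the one asked for."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["zip_code"] == zip_code]


for row in rows_for_zip(SAMPLE_CSV, "00001"):
    print(row["pollutant"], row["reading"])
```

Even a question as simple as “show me the rows for my ZIP code” requires knowing the file format, the column names, and a programming language, which is exactly the barrier for non-developers.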

2.  U.S. Census: A LOT of Data, but Completely Disorganized

Let’s start with the home page, which looks like this.  A lot of words, and not much guidance as to what means what.

Now, let’s say I’m curious about the demographics of my Brooklyn neighborhood.  I might decide to go to “People & Households,” which takes me here:

I’ll try “Data by Subject,” which takes me here:

It’s hard to know precisely which of these categories will take me to what I want, some basic demographic information on my neighborhood.  I tried clicking on Population Profile and Small Area Income and Poverty Estimates, which didn’t pan out.  “Community” sounds right, so I’ll click on “American Community Survey,” which takes me here.

If I click on Access Data, it gives me these choices:

And if I click on American FactFinder, I end up here:

Okay, I don’t really know what any of this means.  Thematic maps, reference maps, custom table???

But let’s say I’d started with “American FactFinder” on the home page, which is linked in the far left-hand column.  If I’d started there, I would have found this:

I can see there’s a little window at the top where I can get a Fact Sheet for my community.  Hmm, that seems easy!  Why didn’t I get here earlier? But let’s just click on “American Community Survey–Learn More” and see if that takes me back where I was before:

Ack, where am I?  Why is this different from the other ACS page?
If I go back and click on “Get Data” under American Community Survey, I would go back to the ACS page I first saw.

The organizing principle is not completely devoid of logic, but there are endless loops within loops of links on the Census site.  You can lose your way really quickly and find yourself unable to even retrace your steps.  The home page does have boxes on the right where you can enter a city/town, county or zip for “Population” and you can select a state for “QuickFacts,” but the box where you can enter a city/town, county or zip for community “Fact Sheets” is only found if you click on American FactFinder.  Why?

There was a part of me that hoped I was just stupid because I was inexperienced.  But my friends who use Census data regularly for work tell me they also have trouble finding what they need.  I’m sure there are reasons why you can’t just query all the data, but what are they?  And how should we deal with them?  Should we just put up with them or try to find a solution to make data more available?

In Part II of this post, I’ll analyze the data available from the IRS, the Agency for Healthcare Research & Quality, and the EPA.

What kind of institution do we want to be? Part II

Tuesday, December 15th, 2009

As described in the first post, banks and credit unions could be useful models for the datatrust because of their function of holding valuable assets for account holders.  Public libraries and museums are very different, but their function, of providing the public access to valuable social assets, is also relevant to the datatrust.

A. We want to be an online public library of useful, personal data, because no democracy can function properly without broad access to information.

Image by FreeFoto, available under a Creative Commons Non-Commercial No-Derivative Works License.

Although public libraries now enjoy a warm-and-fuzzy status right up there with puppies and babies, the public library system was not established in the U.S. without controversy.  At the time, the only people who owned books were the rich, and many argued that the poor would not know how to take care of the books they borrowed.  The system was largely established through the efforts of Andrew Carnegie and others who believed in both public libraries and public schools, and who held that democracy could not function without public access to information.

Librarians are now champions for intellectual freedom.  As a profession, librarians have developed strong principles around the confidentiality of library users, and they were on the front lines in resisting the USA PATRIOT Act’s provisions around FBI access to library records. Although libraries are often underfunded and can seem out of date, the current recession has made obvious what has been going on for a while, that people really do use the library. And when they do, they don’t abuse the privilege.  Many communities feel invested in their local branches, and the respect people have for libraries translates into a respect for their holdings.

We hope the establishment of our datatrust can follow a similar path.  Everyone may not agree now that this kind of access to information is necessary.  But we strongly believe that the status quo, where large corporations and government agencies have access but the public does not, stifles the free flow of information that really is crucial for a functioning democracy.  We hope that the datatrust can grow to engender the same kind of respect and to be a valuable member of many communities.

Of course, the information in books is qualitatively different from personal data about an individual.  If a book gets lost, it’s not as great a loss as if personal data gets misused.  Which leads us to the next point.

B. We want to make data available to the public because it is too valuable to be kept in a locked safe, the way museums make great art available.

Museums are interesting institutions to us because they showcase extremely valuable pieces that would be safest from damage and theft if kept locked up in a vault, yet are put on public display because the value afforded to the public outweighs the risk of damage and theft.  Although they have a greater reputation for elitism than public libraries, museums also operate on the belief that certain assets, like great art or historical artifacts, should belong to society at large rather than to a private collector.  Thus, when a private collector does donate his or her collection to a museum, he or she gains the reputational benefit of having done something altruistic.  At the same time, access to the public comes with protective measures for security—guards, technology, velvet ropes, and more.

Personal data, to us at CDP, is also too valuable to keep locked up.  Arguably, personal data is currently kept by many private collectors or corporations.  They gain value from that data, but that value is not shared with the public.  Unlike art, which is usually made by an individual, personal data is collected from large swaths of the general population, and yet we don’t have access to that data.  Like museums, we will want to think of security measures to minimize any risk, but we do acknowledge that there will be some risk, known and unknown, in our project.  But that risk is so much outweighed by the potential benefits to society that we think it’s a worthwhile experiment.

Museums also add value to their holdings by curating them.  That’s an important challenge for us, as information is only valuable when it’s organized.

In the mix

Wednesday, December 2nd, 2009

EFF Launches New “Terms of (Ab)use” Page (EFF)

Eight Million Reasons for New Surveillance Oversight (Slight Paranoia)

Everyman Offers New Directions in Online Maps (NYTimes)

Creative Commons-style licenses for personal information, Part III: What are the challenges?

Monday, November 30th, 2009

In the first two posts, I described how personal information licenses might work and why they might be useful in shifting the debate around how personal information is collected and used. Sharing information could be cool!  People could exercise choices!  Companies could be pressured to offer similar choices!

Unfortunately, it wouldn’t be that easy.  There are certainly obstacles and challenges to creating a system of personal information licenses for common use, which I describe below.

1) Personal information isn’t property—why do you want to propertize it?

The short answer is, we don’t. We’re well aware that there is a history of academic debate, pro and con, over whether making personal information personal property would make it easier to protect individual privacy.  Although the issues are certainly interesting, we don’t want to step into that debate and we don’t think we have to for the licenses we’re imagining.

First, let’s examine how personal information is viewed today.  I can’t own a fact.  I can’t own the fact that I’m 32, but I can have copyright in an essay in which I state I am 32 and I can have copyright in a database that includes the fact I am 32 if I’m creative in building the structure of that database (in the U.S.).  We can understand the reasoning behind this.  We want to live in a world where facts are “free” to be used and reused without any need to pay a licensing fee.

But the simple declaration, “You can’t own a fact” doesn’t begin to describe the many ways in which people are collecting data, selling it, renting it, and otherwise making money off of it.  When a company sells a mailing list, it may not “own” the fact that I live at XYZ Avenue in Brooklyn, but it certainly is using it to its advantage.  Why, then, should the fact that I can’t own the fact of where I live keep me from sharing that data as I like and trying to control it in new ways?

The digital revolution is forcing us to think beyond property/not property.  Facts have become valuable even when they’re not technically “owned” by anyone.  I haven’t come up with some snappy new terms to use, but the issue should no longer be defined solely around “property/not property.”

Some new businesses seem to be working off this model. BlueKai and KindClicks, while collecting personal information for market research, provide individuals with a way of stating their preferences and monetizing their data.  KindClicks, for example, allows individuals who contribute data to then donate the money they make off their data to the charity of their choice.  BlueKai collects data through cookies, but provides a link on their site by which users can see what information has been gathered about them. Those who want to opt out can.  Those who choose to participate in BlueKai’s registry can then choose to donate a portion of their “earnings” to charity.

We don’t actually want to model these companies’ ways of valuing data.  CDP’s mission isn’t to make sure that everyone gets a dollar here or a dollar there every time their data is accessed.  To us, the value of such information is immense to the public and yet not easily measurable in dollars.  But we do want to explore the idea that we could just take control of our data and obtain value from it, even if it’s the non-monetary, social value of providing something useful to the public.

2) How would these licenses be enforceable? What about existing terms of use on online forums, social networks?

This is a big question.  I’m not sure what kind of dataset could be licensable and the extent to which a license could cover facts within that dataset.  Could the license really encourage new forms of sharing if there was no way to prevent people from using individual facts within that dataset outside of the license terms?  How useful would such a license be?

Arguably, Creative Commons licenses are not easily enforced.  They certainly have an easier case for arguing that they are enforceable; there has been one case I know of where a court upheld the terms of the license.  But the vast majority of people using photographs, art, and other work outside the terms of the license do it with impunity.  Most CC license holders never find out their Flickr photo was used outside the license terms, and most wouldn’t have the resources to do anything about it even if they did find out.  Yet CC licenses have still managed to impact societal norms on intellectual property.

Personal information licenses may still have an effect, then, on societal norms about how information is collected and shared regardless of how much the licenses are litigated.  Even the process of litigation may help us as a society have a smarter conversation about current practices.

As to the objection that the licenses wouldn’t work in the face of existing terms of use for social networks and other sites: the fact that I might not be able to “license” information that I myself put on Facebook just underscores why creative, proactive, even aggressive strategies might be necessary.

3) Why would you encourage people to put their personal information out in public?  Isn’t it irresponsible to encourage people to provide information that could increase risk of identity theft and fraud?

I don’t want to dismiss this concern out of hand.  But as my father likes to say, everything in life has good and bad.  There are many things we do that are risky, and we try our best to minimize those risks, both as a society when we pass laws and as individuals when we take more particular, personal measures.  Driving is a very dangerous activity.  It is also a very valuable one.  Many governments have decided to legislatively require the wearing of seatbelts.  Many of us personally make the decision to practice other safe driving techniques that aren’t legally required.

We think it’s imperative that we, as a society, think hard and carefully about how to minimize the risks of personal information being used, collected, and exchanged.  Creative Commons-style licenses for personal information sharing may or may not be the best way to address today’s privacy problems.  I’m curious to hear if you think the risks outweigh the benefits and why.  But to shut down the idea solely because the risk exists — that is not going to help push the conversation forward.

CONCLUSION

Licenses for making personal information more widely available for research and public use—would they work?  Maybe, maybe not.  Worth exploring?  Most definitely.

We’d love to hear your questions and comments.

In the mix

Wednesday, November 25th, 2009

Firefox’s Plan to Kick the Login’s Butt (ReadWriteWeb)

The Cost of Getting Sick: GE (GE)

Happy Thanksgiving, everyone!
