Posts Tagged ‘data’

The CDP Private Map Maker v0.2

Wednesday, April 27th, 2011

We’ve released version 0.2 of the CDP Private Map Maker – A new way to release sensitive map data! (Requires Silverlight.)


Speedy, but is it safe?

Today, releasing sensitive data safely on a map is not a trivial task. The common anonymization methods tend either to be manual and time-consuming, or to produce a very low-resolution map.

Compared to current manual anonymization methods, which can take months if not years, our map maker leverages differential privacy to generate a map programmatically in much less time. For the sample datasets included, this process took a couple of minutes.

However, speed is not the map maker’s most important feature; safety is, through the ability to quantify privacy risk.

Accounting for Privacy Risk, Literally and Figuratively

We’re still leveraging the same differential privacy principles we’ve been working with all along. Differential privacy not only allows us to (mostly) automate the process of generating the maps, it also allows us to quantitatively balance the accuracy of the map against the privacy risk incurred when releasing the data.  (The purpose of the post is not to discuss whether differential privacy works–it’s an area of privacy research that has been around for several years and there are others better equipped to defend its capabilities.)

Think of it as a form of accounting. Rather than buying what appears to be cost-effective and hoping for the best, you can actually see the price of each item (privacy risk) AND know how accurate it will be.

Previous implementations of differential privacy (including our own) have done this accounting in code. The new map maker provides a graphical user interface so you can play with the settings yourself.
More details on how this works below.
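
In the meantime, to make the accounting analogy concrete, here is a toy sketch (in Python, purely illustrative and not the map maker’s actual code) of what doing the accounting in code can look like: each noisy count query spends part of a fixed privacy budget (epsilon), and in exchange you know roughly how accurate its answer will be. The class and the numbers below are invented for the example, assuming the standard Laplace mechanism for counts.

    class PrivacyAccountant:
        """Toy ledger: tracks how much privacy 'currency' (epsilon) has been spent."""

        def __init__(self, total_epsilon):
            self.total_epsilon = total_epsilon
            self.spent = 0.0

        def charge(self, epsilon):
            """Record the cost of one noisy count query and return its rough +/-."""
            if self.spent + epsilon > self.total_epsilon:
                raise ValueError("privacy budget exhausted")
            self.spent += epsilon
            # With Laplace noise of scale 1/epsilon, about 95% of noisy answers
            # fall within +/- 3/epsilon of the true count.
            return 3.0 / epsilon

    accountant = PrivacyAccountant(total_epsilon=1.0)
    print(accountant.charge(0.5))  # spend half the budget, get answers good to about +/- 6

The point is not the code itself but the property it illustrates: every query has a visible price, and the ledger tells you when you’ve spent too much to safely answer more.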

Compared to v0.1

Version 0.2 updates our first test-drive of differential privacy.  Our first iteration allowed you to query the number of people in an arbitrary region of the map, returning meaningful results about the area as a whole without exposing individuals in the dataset.

The flexibility that application provided, compared to pre-bucketed data, is great if you have a specific question, but the workflow of looking at a blank map and choosing an area to query doesn’t align with how people often use maps and data. We generally like to see the data at a high level, and then dig deeper as needed.

In this round, we’re aiming for a more intuitive user experience. Our two target users are:

  1. Data Releaser: The person releasing the data, who wants to make intelligent decisions about how to balance privacy risk and data utility.
  2. Data User: The person trying to make use of the data, who would like to have a general overview of a data set before delving in with more specific questions.

As a result, we’ve flipped our workflow on its head. Rather than providing a blank map for you to query, the map maker now immediately produces populated maps at different levels of accuracy and privacy risk.

We’ve also added the ability to upload your own datasets and choose your own privacy settings to see how the private map maker works.

However, please do not upload genuinely sensitive data to this demo.

v0.2 is for demonstration purposes only. Our hope is to create a forum where organizations with real data release scenarios can begin to engage with the differential privacy research community. If you’re interested in a more serious experiment with real data, please contact us.

Any data you do upload is publicly available to other users until it is deleted. (You can delete any uploaded dataset through the map maker interface.) The sample data sets provided cannot be deleted and were synthetically generated; please do not use the sample data for any purpose other than seeing how the map maker works. The data is fake.

You can play with the demo here. (Requires Silverlight.)

Finally, a subtle but significant change we should call out: our previous map demo leveraged an implementation of differential privacy called PINQ, developed at Microsoft Research. Creating the grids for this map maker required a different workflow, so we wrote our own implementation to add noise to the cell counts, using the same fundamentals of differential privacy.

More Details on How the Private Map Maker Works

How exactly do we generate the maps?

One Option – Nudge Each Data Point a Little

The key to differential privacy is adding random noise to each answer. It only returns aggregates, so we can’t ask it to ‘make a data point private’, but what if we added noise to each data point by moving it slightly? The person consuming the map then wouldn’t know exactly where the data point originated, making it private, right?

The problem with this process is that we can’t automate adding this random noise because external factors might cause the noise to be ineffective.  Consider the red data point below.

If we nudge it randomly, there’s a pretty good chance we’ll nudge it right into the water. Since there aren’t residences in the middle of Manhasset Bay, this could significantly narrow down the possibilities for the actual origin of the data point. (One of the more problematic scenarios is pictured above.) And water isn’t the only issue—if we’re dealing with residences, nudging into a strip mall, school, etc. could cause the same problem. Because of these external factors, the process is manual and time consuming. On top of that, unlike differential privacy, there’s no mathematical measure of how much information is being divulged—you’re relying on manual review to catch any privacy issues.

Another Option – Grids

As a compromise between querying a blank map, and the time consuming (and potentially error prone) process of nudging data points, we decided to generate grid squares based on noisy answers—the darker the grid square, the higher the answer.  The grid is generated simply by running one differential privacy-protected query for each square.  Here’s an example grid from a fake dataset:

“But Tony!” you say, “Weren’t you just telling us how much better arbitrary questions are as compared to the bucketing we often see?”  First, this isn’t meant to necessarily replace the ability to ask arbitrary questions, but instead provides another tool allowing you to see the data first.  And second, compared to the way released data is often currently pre-bucketed, we’re able to offer more granular grids.
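
For the technically curious, here is a rough sketch in Python of how a grid of noisy counts like the one above can be produced. This is not our actual implementation; the function, its parameters, and the epsilon value are invented for illustration, and it assumes the standard Laplace mechanism for differentially private counts.

    import numpy as np

    def private_grid(points, lat_edges, lon_edges, epsilon):
        """Count the points falling in each grid cell, then add Laplace noise
        to every cell count before release."""
        # Exact counts per cell. Each person lands in exactly one cell, so adding
        # or removing one person changes only one count by one (sensitivity 1).
        counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                      bins=[lat_edges, lon_edges])
        # One noisy answer per cell; smaller epsilon means more noise, more privacy.
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
        return counts + noise

    # Example: a 10x10 grid over a small bounding box, filled with synthetic points.
    rng = np.random.default_rng(0)
    fake_points = rng.uniform([40.70, -73.80], [40.80, -73.70], size=(500, 2))
    noisy_counts = private_grid(fake_points,
                                np.linspace(40.70, 40.80, 11),
                                np.linspace(-73.80, -73.70, 11),
                                epsilon=0.5)

Shading each cell by its noisy count gives a map like the one shown: darker squares mean higher noisy answers, and no individual point ever has to be moved or inspected by hand.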

Choosing a Map

Now comes the manual part. There are two variables you can adjust when choosing a map: grid size and margin of error.  While this step is manual, most of the work is done for you, so it’s much less time-intensive than moving data points around. For demonstration purposes, we currently generate several options which you can select from in the gallery view. You could release any of the pre-generated maps, as they are all protected by differential privacy with the given +/-, but some are not useful and others may be wasting privacy currency.

Grid size is simply the area of each cell.  Since a cell is the smallest area you can compare (with either another cell or 0), you must set it to accommodate the minimum resolution required for your analysis.  For example, using the map to allocate resources at the borough level vs. the block level requires different resolutions to be effective. You also have to consider the density of the dataset. If your analysis is at the block level, but the dataset is so sparse that there’s only about one point per block, the noise will protect those individuals, and the map will be uniformly noisy.

Margin of error specifies a range that the noisy answer will likely fall within.  The higher the margin of error, the less the noisy answer tells us about specific data points within the cell.  A cell with answer 20 +/- 3 means the real answer is likely between 17 and 23, while an answer of 20 +/- 50 means the real answer is likely between -30 and 70, and thus it’s reasonably likely that there are no data points within that cell at all.
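
For those who like the math, here is one way to read those +/- numbers, assuming Laplace noise with scale 1/epsilon (we can’t promise the demo defines its ranges exactly this way). The range that captures about 95% of the noisy answers works out to roughly 3/epsilon:

    import math

    def margin_of_error(epsilon, confidence=0.95):
        """Half-width of the interval containing `confidence` of the Laplace noise
        (scale 1/epsilon) added to a count query."""
        scale = 1.0 / epsilon
        return -scale * math.log(1.0 - confidence)

    print(round(margin_of_error(1.0)))    # ~3: a tight +/- spends a lot of privacy currency
    print(round(margin_of_error(0.06)))   # ~50: much stronger privacy, much noisier map

The trade-off is linear: shrinking the +/- by a factor of ten means spending roughly ten times as much privacy budget on that map.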

To select a map, first pan and zoom the map to show the portion you’re interested in, and then click the target icon for a dataset.

Map Maker Target Button

When you click the target, a gallery with previews of the nine pre-generated options is displayed.

As an example, let’s imagine that I’m doing block level analysis, so I’m only interested in the third column:

This sample dataset has a fairly small amount of data, such that in the top cell (+/- 50) and to some extent the middle cell (+/- 9), the noise overwhelms the data. In this case, we would have to consider tuning down the privacy protection towards the +/- 3 cell in order to have a useful map at that resolution. (For this demo, the noise level is hard-coded.)  The other option is to sacrifice resolution (moving left in the gallery view), so that there are more data points in a given square and they aren’t drowned out by higher noise levels.

Once you have selected a grid, you can pan and zoom the map to the desired scale. The legend is currently dynamic such that it will adjust as necessary to the magnitude of the data in your current view.

In the mix

Tuesday, January 26th, 2010

Is a nonprofit structure better than a for-profit one for preserving mission? Or vice versa? (SocialEdge)

How did the Department of Interior determine the “high value” of national volunteer opportunities, recreation opportunities, wildland fires and acres burned, and herd data on wild horses and burros? (Sunlight Foundation Reporting)

Again, how did government agencies determine which data sets were “high value”? (CDT)

It’s so much harder to count flu cases than you would think (WSJ The Numbers Guy)

And apparently, more fun to count things in your life than you would think (via FlowingData)

Why do we need a datatrust? Part II

Monday, January 11th, 2010

In my first post on available public data sets, I described some of the limitations of Data.gov and the U.S. Census website.  There’s not as much as you’d like on Data.gov, and the Census site is shockingly tiresome to navigate.

Other government agencies, though, do things a little differently, albeit with varying degrees of success.

3.  The Internal Revenue Service: They take so much, yet give so little.

The IRS website, compared to the Census site, is very well organized and easy to follow.  Where the Census site feels like people have kept adding bits and pieces over the years, the IRS site feels like a cohesive whole.  A small link for “Tax Stats” on the home page takes you here, where the data is neatly categorized by type of taxpayer and tax form. The IRS is statutorily required to provide statistics, but only the Office of Tax Analysis in the Secretary of the Treasury’s Office and the Congressional Joint Committee on Taxation are allowed to receive detailed tax return files.  Other agencies and individuals may only receive information in aggregate, a restriction that is also statutorily required to protect privacy.  The information the IRS does provide is crunched by the Statistics of Income Program (SOI), which calculates statistics from 500,000 out of 200 million tax returns.

The IRS obviously has access to a wealth of information, and it’s published some interesting numbers.  One that I found particularly interesting was this table on adjusted gross income for the top 400 returns.

As you can see, the cut-off AGI for the top 400 returns has gone from $24,421,000 in 1992 to $86,380,000 in 2000.  (Click on the image for a larger version.)  Capital gains as a percentage of AGI has gone from 33% in 1992 to 64% in 2000.  The average tax rate has gone from 26.4% to 22.3%.  All very interesting, useful data from which one can draw a range of conclusions or start new research.

But there’s a lot we don’t know.

  • How has data changed from 2000 to now?
  • How might the returns correlate with specific changes in legislation?
  • How do the trends in the top 400 returns compare to the bottom 400?

Not to mention any other questions we might have of the underlying microdata.  The SOI program is clearly doing a great deal of work calculating and packaging data to be “anonymous” for the public, but no one else gets to play with that data themselves, and data on something like the top 400 returns ends up being almost ten years old.  Tax policy is one of the most significant ways in which the U.S. government seeks to shape American society; it’s why we have tax credits and deductions for mortgage interest payments for homeowners but nothing equivalent for renters.  Yet we, the public, don’t have access to data that would help us determine if the way we are being taxed is actually shaping our society in the ways we want.

4. Agency for Healthcare Research & Quality, Medical Expenditure Panel Survey (MEPS): Fascinating Data in a More Flexible Format

The Agency for Healthcare Research & Quality (AHRQ) collects precisely the kind of data we’re all struggling to understand as Congress proposes healthcare reform.  The Medical Expenditure Panel Survey collects data on the specific health services that Americans use, how frequently they use them, the cost of these services, and how they are paid for, as well as data on the cost, scope, and breadth of health insurance held by and available to U.S. workers.  The data AHRQ provides is much more flexible than IRS data, as you can use MEPSnet to create your own tables and statistics.

But that doesn’t mean you can ask questions like, “How much are single people aged 25-45 paying for health insurance in Miami?” or “How much is reasonable to pay for XYZ procedure in Minneapolis?”  I assume MEPSnet is useful for researchers who are skilled at working with data, but it’s not a real option for ordinary, interested individuals who are looking for some quantitative, data-driven answers to important questions.

MEPS also includes data that isn’t publicly released for reasons of confidentiality.  To access that data, you must be a qualified researcher and travel to a data center.

5.  EPA: Great Tools for Personalized Queries if You Don’t Need Personal Information

The EPA’s site, in many ways, is what I imagine a truly transparent, user-focused agency site could be like.  It has much more microdata available, and with much more consumer-oriented search possible. For example, MyEnvironment allows you to type in your zipcode and get a cross-section of many of their datasets all at once:

There are also some mechanisms for inputting data, such as reporting violations, which makes the EPA one of the few agencies I’ve seen where data doesn’t only flow in one direction.

But the reason so much data can be made available, in such searchable ways, is that the vast majority of the EPA’s microdata is not “personal.”  They’re measures of things like air quality and locations of regulated facilities.  They don’t have to worry about revealing personal tax information or personal medical expenditure information.  We’d love to see if similar data tools could be created for more sensitive data if better guarantees could be made around privacy than exist today.

Our dreams for data

So what would we love to see?

  • More “queryable” data—we’ll be able to ask the questions we want to ask, rather than accept the aggregates & statistics as presented.
  • More microdata available more quickly—we’ll get to analyze actual responses to surveys and not wait for the microdata to be “scrubbed” for privacy reasons.
  • More longitudinal data available—we’ll be able to do more studies of the same subjects over time and make more of it easily available to the public, rather than only in locked-up data centers.
  • More centralized, accessible data—we’ll be able to go to one place and be able to immediately see and have access to a lot of data.
  • More user-friendly data—we, as ordinary citizens, will be able to get data-specific answers to important, personal questions.

As I’ve stated previously, I don’t mean to pooh-pooh the data that these agencies and others have made available.  It takes a great deal of time, effort, and resources to make this kind of data available, especially if you have to clean it up (i.e., make it “private” for public consumption), which is why it’s such a big deal when a government, whether federal, state, or local, makes a real commitment to making data available. We at the Common Data Project are working on a datatrust because we think certain technologies could reduce the costs of making data available by making privacy something more measurable and guaranteeable.

We may not be able to make all our data dreams come true immediately, but we definitely don’t want to let up on the push for better data.

Crowdsourcing data?

Thursday, September 10th, 2009

Sometimes, news just seems to coalesce around one topic.

A few weeks ago, the New York Times had a thoughtful piece on patients sharing their data online to push for more efficient research.  Dr. Amy Farber, after being diagnosed with a rare but deadly disease called LAM, founded the LAM Treatment Alliance and LAMsight, “a Web site that allows patients to report information about their health, then turns those reports into databases that can be mined for observations about the disease.”

In a completely different arena, we also had news that Google Maps is using GPS information from mobile phones to improve traffic data.  Google had used data from local highway authorities for traffic data on major highways, but now, GPS data from users of Google Maps with the My Location feature will provide data for local roads as well.

Pretty exciting stuff. Crowdsourcing isn’t new.  But thus far, it’s been used mostly for things that are subjective. Like Hot or Not.  Customer reviews.  It’s also been primarily voluntary. You choose to write a review and share your data. Or if it’s involuntary, it’s not something that is accessible to the public (e.g. search results, credit card data, mortgage data, etc.).

What’s exciting now is that we’re starting to get into discussions about crowdsourcing for stuff like

  1. Medical research – where people are trying to extract objective conclusive results from data.
  2. Traffic data – where data is automatically collected (opt-in/opt-out, whatever) and made available to the public.

The two most common objections are around the supposed inaccuracy of self-reported data and the privacy risks of providing so much individualized information.

But as Ian Eslick, the MIT doctoral student developing LAMsight points out,

No one expects that observational research using online patient data will replace experimental controlled trials… “There’s an idea that data collected from a clinic is good and data collected from patients is bad,” he said. “Different data is effective at different purposes, and different data can lead to different kinds of error.”

And as the people behind Google Maps explain, they worked hard to increase accuracy by making participation as easy as possible.

The issue of privacy is a little trickier.  Google says you can opt out of contributing your data easily, and Google promises that even those who contribute data can trust that their data will remain anonymous: “Even though the vehicle carrying a phone is anonymous, we don’t want anybody to be able to find out where that anonymous vehicle came from or where it went — so we find the start and end points of every trip and permanently delete that data so that even Google ceases to have access to it.”

There are certain to be some people who won’t feel comfortable with Google’s promises. Yet I doubt they will have much impact on Google’s ability to deliver this service. The bigger issue for me is this: how might privacy concerns be holding back smaller, less established players from developing potentially valuable services based on crowdsourced data collection?

In other words, is our currently ad-hoc and unsatisfactory approach to privacy inadvertently stifling competition by making it nearly impossible for startups to compete with the establishment wherever sensitive personal information is involved?

What data would you like to gain access to that might face similar privacy challenges?

In the mix

Wednesday, September 9th, 2009

OpenID Pilot Program to be Announced by U.S. Government (ReadWriteWeb)

Stimulus Funding Map is “Slick as Hell” (FlowingData)

Why Anonymized Data Isn’t (Slashdot)

Bringing the power of data to nonprofit organizations

Tuesday, August 4th, 2009

Over the last six months, I’ve had the privilege to interview a dozen people working with various nonprofit organizations, as well as a few agencies, about how they work with data.  They’ve candidly shared with me the data they collect (or try to collect) and the challenges they face in getting as much out of data as possible.  I’ve talked to people who work locally, nationally, and internationally; with people who do everything from workforce development to HIV/AIDS treatment in Africa.

Businesses have always known that data is valuable.  They’ve also had the money and resources to use the latest tools to collect and analyze data.  Walmart was at the forefront of using computers to track its inventory; Google and other internet companies are now at the forefront of using cookies to gather more than we ever believed was possible.

Nonprofits have been a little slower to recognize that data is not just for people who are trying to make a profit.  But as nonprofits compete for funding and donors seek more accountability for what nonprofits do with their money, it’s become almost trendy for nonprofits to try to think more like businesses about their data.  Whether they’re national advocacy organizations or more localized neighborhood groups providing basic services, nonprofits are starting to realize that data might be valuable for their missions, too.

In the course of interviewing these nonprofits, though, it’s become increasingly obvious to me that nonprofits might have a chance to one-up business in changing the way data gets collected, analyzed, and used.

The reason we at CDP are interested in learning more about the ways nonprofits use data now is because we think they could be major users and contributors to the “datatrust,” a safe and secure place to share, and not just hoard, sensitive information.

This would probably come as a surprise to many of the people I talked to.  The few that were very proud of their data collection felt as proprietary towards their data as Google or Microsoft would. And the ones that aren’t so proud are struggling with yellowing paper files or inflexible Excel spreadsheets.  The thought of being at the forefront of anything would be mind-boggling.

But the one thing they all had in common was that they wanted more data.  Almost everyone could think of some data source that wasn’t available to them, whether from government agencies or administrative courts. In many cases, the reason for withholding that data was to protect the privacy of individuals in that data set.  Many of them could also think of things they wanted to count but were having trouble counting now, from the best ways to improve outcomes to better understanding their target populations.

We at CDP believe that the best way to get data is to give data (see our experimental online data collection forum!).  And many nonprofit organizations are in a great position to get data by giving data.

First, nonprofits have limited resources.  One organization, unless it’s incredibly wealthy, can only collect so much information.  A safe, secure place for allies to share information, i.e., crowd-source, could help nonprofits get answers to long-standing questions.

Is that immigration judge really denying 99% of all asylum cases before him?  How long is it taking the New York State Department of Labor to process wage claims?  Given that much of this information isn’t available anyway, any information would be better than no information.

And more information could give nonprofits increased leverage to demand more information from government agencies.  Some nonprofits already have strong networks of members or allies.  A better way to collect data is all they need to maximize resources they already have.

Second, nonprofits have fundamentally different goals than businesses.  Their mission, whether it’s to save the whales or to provide job training to former inmates, is about the public good.  Given that they are run with donations from the public, many nonprofits have taken this to heart and decided that they need to be more transparent.  Even though transparency often seems to be limited to disclosure of annual IRS filings, a datatrust could bring transparency to a new level.  A nonprofit could choose not only to declare their job training program a success in its annual report, it could also choose to disclose the data through the datatrust for others to analyze.  Transparency could push nonprofits to be better at what they do, which would benefit all of us.

Certainly, a CDP datatrust won’t solve all nonprofit data problems.  We’re not trying to get into the business of nonprofit data management.  But there are amazing opportunities to harness the power of online data collection to make the world a better place, and not just target advertising more accurately.

We’re still thinking it through.  We’re continuing our interviews and learning with each one more about the particular goals and challenges nonprofits face in using data.  And the more we learn, the more exciting it is to think about what could happen when the power of data is available for all of us, and not just major corporations.

In the mix

Wednesday, July 15th, 2009

Hacker Exposes Private Twitter Documents (NYT Bits Blog)

Code Red: How software companies could screw up Obama’s healthcare reform (Washington Monthly)

Collect Data About Yourself with Twitter (Flowing Data)

The Nike Experiment: How the Shoe Giant Unleashed the Power of Personal Metrics (Wired)

In the mix

Thursday, July 2nd, 2009

Got a Minute? Set Some Government Data Free with Transparency Corps (ReadWriteWeb)

Social Network Users Reportedly Concerned About Privacy, But Behavior Says Otherwise (ReadWriteWeb)

Bloomberg Releasing City Data Online in Hopes Developers Will Create New and Better Mobile Apps (NY Daily News)

Ad industry groups agree to privacy guidelines (CNET News)

In the mix

Wednesday, June 24th, 2009

Online participatory study of bipolar disorder.  (MoodChart)

The Day Facebook Changed Forever. (ReadWriteWeb)

Unhealthy Accounting of the Uninsured. (Wall Street Journal)

In the mix

Wednesday, May 27th, 2009

Data.gov: Unlocking the Federal Filing Cabinets. (NYT Bits)

On the Anonymity of Home/Work Location Pairs. (Schneier on Security)

Do People Care About Data Correlation? (Kim Cameron’s Identity Blog)

