Archive for the ‘Building the Datatrust’ Category

The CDP Private Map Maker v0.2

Wednesday, April 27th, 2011

We’ve released version 0.2 of the CDP Private Map Maker – A new way to release sensitive map data! (Requires Silverlight.)

Speedy, but is it safe?

Today, releasing sensitive data safely on a map is not a trivial task. The common anonymization methods tend to either be manual and time consuming, or create a very low resolution map.

Compared to current manual anonymization methods, which can take months if not years, our map maker leverages differential privacy to generate a map programmatically in much less time. For the sample datasets included, this process took a couple of minutes.

However, speed is not the map maker’s most important feature. Safety is, through the ability to quantify privacy risk.

Accounting for Privacy Risk, Literally and Figuratively

We’re still leveraging the same differential privacy principles we’ve been working with all along. Differential privacy not only allows us to (mostly) automate the process of generating the maps, it also allows us to quantitatively balance the accuracy of the map against the privacy risk incurred when releasing the data.  (The purpose of this post is not to discuss whether differential privacy works–it’s an area of privacy research that has been around for several years and there are others better equipped to defend its capabilities.)

Think of it as a form of accounting. Rather than buying what appears to be cost-effective and hoping for the best, you can actually see the price of each item (privacy risk) AND know how accurate it will be.

Previous implementations of differential privacy (including our own) have done this accounting in code. The new map maker provides a graphical user interface so you can play with the settings yourself.
More details on how this works below.

Compared to v0.1

Version 0.2 updates our first test-drive of differential privacy.  Our first iteration allowed you to query the number of people in an arbitrary region of the map, returning meaningful results about the area as a whole without exposing individuals in the dataset.

The flexibility of that application, compared to pre-bucketed data, is great if you have a specific question, but the workflow of looking at a blank map and choosing an area to query doesn’t align with how people often use maps and data.  We generally like to see the data at a high level, and then dig deeper as needed.

In this round, we’re aiming for a more intuitive user experience. Our two target users are:

  1. Data Releaser The person releasing the data who wants to make intelligent decisions about how to balance privacy risk and data utility.
  2. Data User The person trying to make use of the data, who would like to have a general overview of a data set before delving in with more specific questions.

As a result, we’ve flipped our workflow on its head. Rather than providing a blank map for you to query, the map maker now immediately produces populated maps at different levels of accuracy and privacy risk.

We’ve also added the ability to upload your own datasets and choose your own privacy settings to see how the private map maker works.

However, please do not upload genuinely sensitive data to this demo.

v0.2 is for demonstration purposes only. Our hope is to create a forum where organizations with real data release scenarios can begin to engage with the differential privacy research community. If you’re interested in a more serious experiment with real data, please contact us.

Any data you do upload is available publicly to other users until it is deleted. (You can delete any uploaded dataset through the map maker interface.) The sample data sets provided cannot be deleted and were synthetically generated. Please do not use the sample data for any purpose other than seeing how the map maker works: the data is fake.

You can play with the demo here. (Requires Silverlight.)

Finally, a subtle but significant change we should call out: our previous map demo leveraged an implementation of differential privacy called PINQ, developed at Microsoft Research.  Creating the grids for this map maker required a different workflow, so we wrote our own implementation to add noise to the cell counts, using the same fundamentals of differential privacy.

More Details on How the Private Map Maker Works

How exactly do we generate the maps? One option – Nudge each data point a little

The key to differential privacy is adding random noise to each answer.  Since it only returns aggregates, we can’t ask it to ‘make a data point private’, but what if we added noise to each data point by moving it slightly?  The person consuming the map then wouldn’t know exactly where the data point originated, making it private, right?
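As a simplified sketch of the noisy-aggregate idea (the function names and interface here are ours, purely for illustration), an epsilon-differentially-private counting query might look like this:

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample of zero-mean Laplace noise via inverse-CDF sampling."""
    u = random.random() - 0.5  # uniform on (-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(points, predicate, epsilon):
    """Answer a counting query with epsilon-differential privacy.

    A count has sensitivity 1 (adding or removing any one person changes
    the answer by at most 1), so Laplace noise with scale 1/epsilon is
    enough to mask each individual's presence.
    """
    true_count = sum(1 for p in points if predicate(p))
    return true_count + laplace_noise(1.0 / epsilon)
```

Smaller values of epsilon mean more noise and stronger privacy; the consumer only ever sees the noisy aggregate, never the underlying points.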

The problem with this process is that we can’t automate adding this random noise because external factors might cause the noise to be ineffective.  Consider the red data point below.

If we nudge it randomly, there’s a pretty good chance we’ll nudge it right into the water.  Since there aren’t residences in the middle of Manhasset Bay, this could significantly narrow down the possibilities for the actual origin of the data point.  (One of the more problematic scenarios is pictured above.)  And water isn’t the only issue—if we’re dealing with residences, nudging into a strip mall, school, etc. could cause the same problem.  Because of these external factors, the process is manual and time consuming.   On top of that, unlike differential privacy, there’s no mathematical measure of how much information is being divulged—you’re relying on manual review to catch any privacy issues.

Another Option – Grids

As a compromise between querying a blank map, and the time consuming (and potentially error prone) process of nudging data points, we decided to generate grid squares based on noisy answers—the darker the grid square, the higher the answer.  The grid is generated simply by running one differential privacy-protected query for each square.  Here’s an example grid from a fake dataset:

“But Tony!” you say, “Weren’t you just telling us how much better arbitrary questions are as compared to the bucketing we often see?”  First, this isn’t meant to necessarily replace the ability to ask arbitrary questions, but instead provides another tool allowing you to see the data first.  And second, compared to the way released data is often currently pre-bucketed, we’re able to offer more granular grids.
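The grid-generation step described above can be sketched as follows. This is a simplified illustration of the approach, not the map maker’s actual code; note that every cell must be noised, including empty ones, since the presence or absence of a cell would itself leak information:

```python
import math
import random

def laplace_noise(scale):
    """Zero-mean Laplace noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_grid(points, bounds, cell_size, epsilon):
    """Return a grid of epsilon-DP cell counts over a bounding box.

    Each person falls in exactly one cell, so each cell count has
    sensitivity 1 and gets Laplace noise with scale 1/epsilon.
    """
    x0, y0, x1, y1 = bounds
    nx = max(1, math.ceil((x1 - x0) / cell_size))
    ny = max(1, math.ceil((y1 - y0) / cell_size))
    counts = [[0] * ny for _ in range(nx)]
    for x, y in points:
        i = min(int((x - x0) // cell_size), nx - 1)
        j = min(int((y - y0) // cell_size), ny - 1)
        counts[i][j] += 1
    # Noise every cell, empty or not, so the grid's shape reveals
    # nothing about where data does or doesn't exist.
    return [[counts[i][j] + laplace_noise(1.0 / epsilon) for j in range(ny)]
            for i in range(nx)]
```

Shading each square by its noisy count then produces the kind of map shown in the gallery.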

Choosing a Map

Now comes the manual part. There are two variables you can adjust when choosing a map: grid size and margin of error.  While this step is manual, most of the work is done for you, so it’s much less time-intensive than moving data points around. For demonstration purposes, we currently generate several options which you can select from in the gallery view. You could release any of the pre-generated maps, as they are all protected by differential privacy with the given +/- margin, but some are not useful and others may be wasting privacy currency.

Grid size is simply the area of each cell.  Since a cell is the smallest area you can compare (with either another cell or 0), you must set it to accommodate the minimum resolution required for your analysis.  For example, using the map to allocate resources at the borough level vs. the block level requires different resolutions to be effective. You also have to consider the density of the dataset. If your analysis is at the block level, but the dataset is so sparse that there’s only about one point per block, the noise will protect those individuals, and the map will be uniformly noisy.

Margin of error specifies a range that the noisy answer will likely fall within.  The higher the margin of error, the less the noisy answer tells us about specific data points within the cell.  A cell with answer 20 +/- 3 means the real answer is likely between 17 and 23.  While an answer of 20 +/- 50 means the real answer is likely between -30 and 70, and thus it’s reasonably likely that there are no data points within that cell at all.
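For Laplace noise (the distribution typically used in differential privacy implementations), the margin of error and the privacy parameter epsilon are directly convertible. A rough sketch, with hypothetical function names:

```python
import math

def margin_of_error(epsilon, confidence=0.95):
    """Margin m such that Laplace(1/epsilon) noise stays within +/- m
    with the given probability; for Laplace, P(|noise| > m) = exp(-m * epsilon)."""
    return math.log(1.0 / (1.0 - confidence)) / epsilon

def epsilon_for_margin(margin, confidence=0.95):
    """Inverse: the privacy cost required to achieve a target margin."""
    return math.log(1.0 / (1.0 - confidence)) / margin
```

At 95% confidence, margin_of_error(1.0) comes out to ln(20), roughly 3; smaller epsilons (stronger privacy) give proportionally wider margins.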

To select a map, first pan and zoom the map to show the portion you’re interested in, and then click the target icon for a dataset.

Map Maker Target Button

When you click the target, a gallery with previews of the nine pre-generated options is displayed.

As an example, let’s imagine that I’m doing block level analysis, so I’m only interested in the third column:

This sample dataset has a fairly small amount of data, such that in the top cell (+/- 50) and to some extent the middle cell (+/- 9), the noise overwhelms the data. In this case, we would have to consider tuning down the privacy protection towards the +/- 3 cell, in order to have a useful map at that resolution. (For this demo, the noise level is hard-coded.)  The other option is to sacrifice resolution (moving left in the gallery view), so that each square contains more data points and the data won’t be drowned out by higher noise levels.

Once you have selected a grid, you can pan and zoom the map to the desired scale. The legend is currently dynamic such that it will adjust as necessary to the magnitude of the data in your current view.

Whitepaper 2.0: A moral and practical argument for public access to private data.

Monday, April 4th, 2011

It’s here! The Common Data Project’s White Paper version 2.0.

This is our most comprehensive moral and practical argument to date for the creation of a public datatrust that provides public access to today’s growing store of sensitive personal information.

At this point, there can be no doubt that sensitive personal data, in aggregate, is and will continue to be an invaluable resource for commerce and society. However, today, the private sector holds a near monopoly on such data. We believe that it is time We, The People gain access to our own data; access that will enable researchers, policymakers and NGOs acting in the public interest to make decisions in the same data-informed ways businesses have for decades.

Access to sensitive personal information will be the next “Digital Divide” and our work is perhaps best described as an effort to bridge that gap.

Still, we recognize that there are many hurdles to overcome. Currently, highly valuable data, from online behavioral data to personal financial and medical records, is siloed and, in the name of privacy, inaccessible. Valuable data is kept out of the reach of the public and in many cases unavailable even to the businesses, organizations and government agencies that collect the data in the first place. Many of these data holders have business reasons or public mandates to share the data they have, but can’t, or can only do so in a severely limited manner and through a time-consuming process.

We believe there are technological and policy solutions that can remedy this situation and our white paper attempts to sketch out these solutions in the form of a “datatrust.”

We set out to answer the major questions and open issues that challenge the viability of the datatrust idea.

  1. Is public access to sensitive personal information really necessary?
  2. If it is, why isn’t this already a solved problem?
  3. How can you open up sensitive data to the public without harming the individuals represented in that data?
  4. How can any organization be trusted to hold such sensitive data?
  5. Assuming this is possible and there is public will to pull it off, will such data be useful?
  6. All existing anonymization methodologies degrade the utility of data; how will the datatrust strike a balance between utility and privacy?
  7. How will the data be collated, managed and curated into a usable form?
  8. How will the quality of the data be evaluated and maintained?
  9. Who has a stake in the datatrust?
  10. The datatrust’s purported mission is to serve the interests of society; will you and I, as members of society, have a say in how the datatrust is run?

You can read the full paper here.

Comments, reactions and feedback are all welcome. You can post your thoughts here or write us directly at info at commondataproject dot org.

Google buys Metaweb: Can corporations acquire the halo effect of the underdog?

Monday, July 26th, 2010

Google recently bought Metaweb, a major semantic web company.  The value of Metaweb to Google is obvious — as ReadWriteWeb notes, “For the most part,…Google merely serves up links to Web pages; knowing more about what is behind those links could allow the search giant to provide better, more contextual results.” But what does the purchase mean for Metaweb?

Big companies buy small companies all the time.  Some entrepreneurs create their start-ups with that goal in mind — get something going and then make a killing when Google buys it.  But what do you think of a company when it seems to be doing something different and then is bought by Google?

Metaweb was never a nonprofit, but like Wikipedia, it has had a similar, community-driven vibe.  Freebase, its database of entities, is crowd-sourced, open, and free.  Google promises that Freebase will remain free, but will the community of people who contribute to Freebase feel the same contributing free labor to a mega-corporation?  Is there anything keeping Google from changing its mind in the future about keeping Freebase free?  How will the culture of Metaweb change as its technologies evolve within Google?

This isn’t to say that Metaweb’s goals have necessarily been compromised by its purchase by Google.   Many people feel like this is the best thing that could have happened to the semantic web.

(Though a few feel, “They didn’t make it big. In fact, this means they failed at their mission of better organizing the world’s information so that rich apps could be built around it. They never got to the APPS part. FAIL!”, and at least one person is concerned Google bought Freebase to kill it.)

But what did you think when Google bought the nonprofit Gapminder, of Hans Rosling’s famous TED talk?

Or when eBay bought a 25% stake in Craigslist?

Or outside the tech world, when Unilever bought Ben & Jerry’s?

Can a company or organization maintain any high-minded mission to be driven by principles other than profit when they’re bought by a major publicly held corporation?

This isn’t just an abstract question for us.  One of the biggest reasons why we chose to be a 501(c)(3) nonprofit organization is that we wanted to make sure no one running the Common Data Project would be tempted to sell its greatest asset, the data it hopes to bring together, for short-term profit.  As a nonprofit, CDP is not barred from making profits, but no profits can inure to the benefit of any private individual or shareholder.  Also as a nonprofit, should CDP dissolve, it cannot merely sell its assets to the highest bidder but must transfer them to another nonprofit organization with a similar mission.

We’re still working on understanding the legal distinctions between IRS-recognized tax-exempt organizations and for-profit businesses.  We were surprised when we first found out that Gapminder, a Swedish nonprofit, had been bought by Google.  Swedish nonprofit law may differ from U.S. nonprofit law.  But it appears Hans Rosling did not stand to make a dime.  Google only bought the software and the website, and whatever that was worth went to the organization itself.  So in a way, the experience of Gapminder supports the idea that being a nonprofit does make a difference in restricting the profit motives of individuals.  Alex Selkirk, as the founder and President of CDP, will never make a windfall through CDP.

The fact that CDP is not profit-driven, and will never be profit-driven makes a difference to us.  Does it make a difference to you?

The Datatrust Product: What it is. What it’s not.

Monday, June 21st, 2010

The datatrust has always been a big-tent project, but over the last few months, we’ve done a lot of paring down.  We’re getting closer to something that feels like a product and less like a vague hope for a better future!

The following is an attempt to describe the datatrust “technology product” by way of comparison with existing websites and services. The “Governance and Policies” aspect of the datatrust was covered in a separate post.

Sensitive information about us. They have it, we don’t.

Today, most of the sensitive data about us (e.g. medical records, personal finance data, online search history) is inaccessible to us and to those who represent the public: elected officials, government agencies, advocacy groups, researchers.

Our Mission: Democratizing Access to Sensitive Data

While a significant movement has grown up around opening up government data, there are few efforts to gain public access to sensitive “personal information” data, most of it held in the private sector.

Our goal for the datatrust is to create an open marketplace for information to democratize access to some of the most sensitive and valuable data there is, to help us answer difficult policy and societal questions.

Which brings us to the question: What is a datatrust?

A datatrust will be an online service that allows organizations to make sensitive data available to the public and provides researchers, policymakers and application developers with a way to directly query that data.

We believe the datatrust is only possible with 1) technical innovations that will allow us to provide a quantifiable and enforceable privacy guarantee; and 2) governance and policy innovations that will inspire public confidence.

The datatrust will include a data catalog, a registry of queries and their privacy risks, and a collaboration network for both data donors and data users.

We realize that as a new breed of service, the datatrust is difficult to conceptualize. So, we thought it might be helpful to compare it to some existing websites and services.

A Data Catalog

Like existing open-data catalogs, the datatrust will provide ways to browse and search a “catalog” of available data.

A Query-able Database of “Raw Data”

Unlike sites that publish only pre-digested aggregate reports, the datatrust will release data in “raw” form.

Unlike data download sites, datatrust data will not be viewable or downloadable.

Instead, the datatrust will provide a way to directly query raw data.

An “Automated” Privacy Filter

Unlike most open government data releases, the datatrust will not rely on labor-intensive and subjective anonymization methods. Existing methods like scrubbing, swapping or synthesizing data limit the accuracy and usefulness of the data.

By contrast, the datatrust will make use of new privacy technologies to provide a measurable and enforceable privacy guarantee, one that treats individual privacy as a valuable asset with a quantifiable limit on re-use.

As a result, the datatrust will keep track of the amount of privacy risk incurred by each query.

Privacy protection will happen on-the-fly, thereby automating the “anonymization” aspect of releasing data.
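To illustrate what query-level privacy accounting could look like (a hypothetical sketch, not the datatrust’s actual design), a minimal privacy-budget ledger using basic sequential composition might be:

```python
class PrivacyBudget:
    """Track cumulative privacy cost for one dataset (hypothetical sketch).

    Under basic sequential composition, the epsilons of successive
    queries against the same data simply add up, so a dataset's total
    exposure can be capped by refusing queries once the budget is spent.
    """

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0
        self.ledger = []  # (query description, epsilon) pairs

    def charge(self, description, epsilon):
        """Record a query's privacy cost; refuse it if the budget is exhausted."""
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted for this dataset")
        self.spent += epsilon
        self.ledger.append((description, epsilon))
        return self.total - self.spent
```

The public registry of queries described above is essentially this ledger made visible to everyone.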

An Open Collaboration Network

Because the datatrust will maintain an open history of all queries and data users, it will also become an important open registry of how data is being used and analyzed. This in turn can become the foundation for a community of data donors and data users, who will collaborate on collecting and analyzing data for research and data-driven software applications.

Like Amazon, the datatrust will do a better job of describing and browsing data sets as well as eliciting user feedback and data-mining actual usage (as opposed to self-reported usage) to help users find relevant data sets.

Like Wikipedia, the datatrust will depend on an invested and active community to curate and manage the data.

Unlike Wikipedia and Yelp (but like Facebook and LinkedIn), the datatrust will require its users to maintain real and active identities in order to build a quality rating system for evaluating data and data use, based on actual usage and individual reputations (as opposed to explicit user ratings).

Not A Generic Set of Tools for Working With Data

Unlike Swivel, the datatrust is not a generic tool set for working with and visualizing data.

Unlike Ning (a consumer platform for creating your own social network), the datatrust is not a consumer platform for creating your own data-sharing networks. It is also not a developer toolkit for building data-driven services.

Unlike Freebase, the datatrust is not Wikipedia for structured data.

Not A Data-Driven Service for Consumers

You should not expect to come to the datatrust to find out if people like you are also experiencing worse than average allergies this year.

Unlike Mint or Patients Like Me, the datatrust is not a personal data-sharing service focused on offering a consumer service (a personal finance manager in the case of Mint) or sharing a specific kind of data (tracking chronic diseases in the case of Patients Like Me).

But application builders, as well as researchers, may find the datatrust useful in allowing them to provide services and collect data in new ways from larger groups of people, thanks to the measurable privacy guarantee provided by the datatrust.

The datatrust is just about data.

The datatrust is a sensitive data release engine and we will build tools insofar as it helps our Data Donors get more data to Data Users. However, it stops short of directly serving consumers. We think that is better left to those with a passion for a specific cause and the domain expertise to serve their constituents well.

Governing the Datatrust: Answering the question, “Why should I trust you with my data?”

Thursday, June 3rd, 2010

Progress on defining the datatrust is accelerating–we can almost smell it!

For a refresher, the datatrust is an online service that will allow organizations to open sensitive data to the public and provide researchers, policymakers and application developers with a way to directly query the data, all without compromising individual privacy. Read more.

For the past two years, we’ve been working on figuring out exactly what the datatrust will be, not just in technical terms, but also in policy terms.

We’ve been thinking through what promises the datatrust will make, how those promises will be enforced, and how best we can build a datatrust that is governed, not by the whim of a dictator, but by a healthy synergy between the user community, the staff, and the board.

The policies we’re writing and the infrastructure we’re building are still a work in progress.  But for an overview of the decisions we’ve made and outstanding issues, take a look at “Datatrust Governance and Policies: Questions, Concerns, and Bright Ideas”.

Here’s a short summary of our overall strategy.

  1. Make a clear and enforceable promise around privacy.
  2. Keep the datatrust simple. We will never be all things to all people. The functions it does have will be small enough to be managed and monitored easily by a small staff, the user community, and the board.
  3. Have many decision-makers. It’s more important that we do the right thing than that we do them quickly. We will create a system of checks and balances, in which authority to maintain and monitor the datatrust will be entrusted to several, separate parties, including the staff, the user community, and the board.
  4. Monitor, report and review, regularly. We will regularly review what we’re monitoring and how we’re doing it, and release the results to the public.
  5. Provide an escape valve. Develop explicit, enforceable policies on what the datatrust can and can’t do with the data. Prepare a “living will” to safely dispose of the data if the organization can no longer meet its obligations to its user community and the general public.

We definitely have a lot of work to do, but it’s exciting to be narrowing down the issues.  We’d love to hear what you think!

P.S. You can read more about the technical progress we’re making on the datatrust by visiting our Projects page.

Ten Things We Learned About Communities

Tuesday, June 1st, 2010

After 8 posts and several thousand words on how communities encourage participation, define membership, sustain networks, and govern themselves, what have we learned?

Dimitri Damasceno Creative Commons Attribution ShareAlike 2.0 (Generic)

We started this study because the datatrust we are working to build will depend on an invested and active community.  We want data donors, data borrowers, and data curators to interact as members of a community that are empowered to manage data, monitor the community, and hold the datatrust accountable to its mission.

So here are the findings we think are most relevant to the datatrust:

What motivates high-quality participation?

1. People are motivated to participate by rewards, but also by a desire to enhance their reputations.

Do communities need to have a mission?

2.  A shared ethos, culture, or mission is important if you want members of the community to be invested in the community and its survival as an institution.

3.  A shared ethos, culture, or mission also makes it harder to have a very large and diverse community of people with different tastes and goals.

Should we require real identities?

4. People care more about their reputations when their real identities are on the line.

Can a community get too big?

5. If a large social network is to maintain a sense of small-scale community, it needs to reinforce a feeling of smaller communities within the social network.

Does diversity matter, in what way and why?

6. Diversity isn’t necessary for a successful community, but it’s important if the community’s goals require participation from a broad and diverse range of people.

Should you have to “pay to play”?

7.  We have always anticipated instituting a clear quid pro quo in the datatrust community – if you donate data, you get access to data.  Although we value the clarity of that exchange, will it limit our ability to grow?

Do more privacy controls = more control over privacy?

8.  People need to understand intuitively where information is going and to whom for privacy controls to be meaningful.

Is self-governance worth it?

9.  Decentralization of power and transparency can go a long way in helping an organization build trust.

10.  But you will have to put up with people who argue about what color to paint the bike shed.
