There’s been a lot going on at the Common Data Project, and it can be hard to keep track. Here’s a quick recap.
The Common Data Project’s mission is to encourage and enable the disclosure of personal data for public use and research.
We live in a world where data is obviously valuable — companies make millions from data, nonprofits seek new ways to be more accountable, advocates push governments to make their data open. But even as more data becomes accessible, even more valuable data remains locked up and unavailable to researchers, nonprofit organizations, businesses, and the general public.
We are working on creating a datatrust, a nonprofit data bank, that would incorporate new technologies for open data and new standards for collecting and sharing personal data.
We’ve refined what that means: what the datatrust is, and what it is not.
We’ve been working in partnership with Shan Gao Ma (SGM), a consultancy started by CDP founder Alex Selkirk that specializes in large-scale data collection systems, to develop a prototype of the datatrust. The datatrust is a new technology platform that allows the release of sensitive data in “raw form” to the public with a measurable, and therefore enforceable, privacy guarantee.
In addition to this real privacy guarantee, the datatrust eliminates the need to “scrub” data before it’s released. Right now, any organization that wants to release sensitive data has to spend a lot of time scrubbing and de-identifying it, using techniques that are frankly inexact and possibly ineffective. By removing that step, the datatrust could make real-time data release possible.
Furthermore, the data that is released can be accessed in flexible, creative ways. Right now, sensitive data is aggregated and released as statistics. A public health official may have access to data that shows how many people are “obese” in a county, but she can’t “ask” how many people are “obese” within a 10-mile radius of a McDonald’s.
We have a demo of PINQ.
It illustrates how you can safely query a sensitive data set through differential privacy, a relatively new, quantitative approach to protecting privacy.
We’ve also developed an accompanying privacy risk calculator.
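PINQ itself is a C# library from Microsoft Research, but the core idea behind these demos, answering count queries with calibrated Laplace noise so that no single person’s presence can be inferred, can be sketched in a few lines of Python. Everything below (the record fields, the epsilon value, the `dp_count` helper) is illustrative only, not CDP’s or PINQ’s actual implementation:

```python
import math
import random

def dp_count(records, predicate, epsilon):
    """Differentially private count (illustrative sketch).

    A count query has sensitivity 1: adding or removing one person
    changes the true answer by at most 1, so adding Laplace noise with
    scale 1/epsilon yields epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    scale = 1.0 / epsilon
    # Sample Laplace(0, scale) noise via inverse transform sampling.
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# Hypothetical records echoing the obesity example above;
# field names are made up for illustration.
residents = [
    {"obese": True,  "miles_to_mcdonalds": 4.2},
    {"obese": False, "miles_to_mcdonalds": 12.8},
    {"obese": True,  "miles_to_mcdonalds": 9.1},
]
noisy_answer = dp_count(
    residents,
    lambda r: r["obese"] and r["miles_to_mcdonalds"] <= 10,
    epsilon=0.1,
)
```

The key point is that the analyst gets a flexible, ad hoc query (“obese within a 10-mile radius”) rather than a fixed, pre-aggregated statistic, while the noise bounds what any answer can reveal about one individual.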
For CDP, improved privacy technology is only one part of the datatrust concept.
We’ve also been working on a number of organizational and policy issues:
A Quantifiable Privacy Guarantee: We are working through how differential privacy can actually yield a “measurable privacy guarantee” that is meaningful to the layman. (Thus far, it has been only a theoretical possibility. A specific “quantity” for the so-called “measurable privacy guarantee” has yet to be agreed upon by the research community.)
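One reason differential privacy can support a measurable guarantee at all is its sequential composition property: the epsilons of successive queries simply add, so a data holder can enforce a hard cumulative budget. The sketch below shows that bookkeeping; the class name and budget values are hypothetical, not part of any CDP system:

```python
class PrivacyBudget:
    """Tracks cumulative privacy loss under sequential composition,
    where the epsilons of individual queries add up."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Record one query's privacy cost, refusing it if the budget
        would be exceeded."""
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

    def remaining(self):
        return self.total - self.spent

# Hypothetical usage: a data set with a total budget of 1.0,
# spent across two queries.
budget = PrivacyBudget(1.0)
budget.charge(0.1)
budget.charge(0.2)
```

A concrete, auditable number like `budget.remaining()` is one candidate for the kind of layman-meaningful “quantity” discussed above, though, as noted, the research community has not settled on what value of epsilon constitutes an acceptable guarantee.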
Building Community and Self-Governance: We’re wrapping up a blog series looking at online information-sharing communities and self-governance structures and how lessons learned from the past few years of experimentation in user-generated and user-monitored content can apply to a data-sharing community built around a datatrust.
We’ve also started outlining the governance questions we have to answer as we move forward, including who builds the technology, who governs the datatrust, and how we will monitor and prevent the datatrust from veering from its mission. We know that this is an organization that must be transparent if it is to be trusted, and we are working on creating the kind of infrastructure that will make transparency inevitable.
Licensing Personal Information: We proposed a “Creative Commons”-style license for sharing personal data, and we’re following the work of others developing licenses for data. In particular, what does it mean to “give up” personal information to a third party?
Privacy Policies: We published a guide to reading online privacy policies for the curious layman, an analysis of their pitfalls and ambiguities that was republished by the IAPP and picked up by the popular technology blog Read Write Web.
Education: We regularly publish in-depth essays and news commentary on our blog, myplaceinthecrowd.org, covering topics such as the risk of re-identification under current anonymization methods and the value of open datasets that are available for creative reuse.
We have a lot to work on, but we’re excited to move forward!