Posts Tagged ‘Datatrust’

Whitepaper 2.0: A moral and practical argument for public access to private data.

Monday, April 4th, 2011

It’s here! The Common Data Project’s White Paper version 2.0.

This is our most comprehensive moral and practical argument to date for the creation of a public datatrust that provides public access to today’s growing store of sensitive personal information.

At this point, there can be no doubt that sensitive personal data, in aggregate, is and will continue to be an invaluable resource for commerce and society. Today, however, the private sector holds a near monopoly on such data. We believe it is time We, the People gained access to our own data: access that will enable researchers, policymakers and NGOs acting in the public interest to make decisions in the same data-informed ways businesses have for decades.

Access to sensitive personal information will be the next “Digital Divide” and our work is perhaps best described as an effort to bridge that gap.

Still, we recognize that there are many hurdles to overcome. Currently, highly valuable data, from online behavioral data to personal financial and medical records, is siloed and, in the name of privacy, inaccessible. Valuable data is kept out of the reach of the public and in many cases is unavailable even to the businesses, organizations and government agencies that collect the data in the first place. Many of these data holders have business reasons or public mandates to share the data they have, but can't, or do so only in a severely limited manner and through a time-consuming process.

We believe there are technological and policy solutions that can remedy this situation and our white paper attempts to sketch out these solutions in the form of a “datatrust.”

We set out to answer the major questions and open issues that challenge the viability of the datatrust idea.

  1. Is public access to sensitive personal information really necessary?
  2. If it is, why isn’t this already a solved problem?
  3. How can you open up sensitive data to the public without harming the individuals represented in that data?
  4. How can any organization be trusted to hold such sensitive data?
  5. Assuming this is possible and there is public will to pull it off, will such data be useful?
  6. Given that all existing anonymization methodologies degrade the utility of data, how will the datatrust strike a balance between utility and privacy?
  7. How will the data be collated, managed and curated into a usable form?
  8. How will the quality of the data be evaluated and maintained?
  9. Who has a stake in the datatrust?
  10. The datatrust’s purported mission is to serve the interests of society; will you and I, as members of society, have a say in how the datatrust is run?

You can read the full paper here.

Comments, reactions and feedback are all welcome. You can post your thoughts here or write us directly at info at commondataproject dot org.

A big update for the Common Data Project

Tuesday, June 29th, 2010

There’s been a lot going on at the Common Data Project, and it can be hard to keep track.  Here’s a quick recap.

Our Mission

The Common Data Project’s mission is to encourage and enable the disclosure of personal data for public use and research.

We live in a world where data is obviously valuable — companies make millions from data, nonprofits seek new ways to be more accountable, advocates push governments to make their data open.  But even as more data becomes accessible, even more valuable data remains locked up and unavailable to researchers, nonprofit organizations, businesses, and the general public.

We are working on creating a datatrust, a nonprofit data bank, that would incorporate new technologies for open data and new standards for collecting and sharing personal data.

We’ve refined what that means, what the datatrust is and what the datatrust is not.

Our Work

We’ve been working in partnership with Shan Gao Ma (SGM), a consultancy started by CDP founder Alex Selkirk that specializes in large-scale data collection systems, to develop a prototype of the datatrust.  The datatrust is a new technology platform that allows the release of sensitive data in “raw form” to the public with a measurable and therefore enforceable privacy guarantee.

In addition to this real privacy guarantee, the datatrust eliminates the need to “scrub” data before it’s released.  Right now, any organization that wants to release sensitive data has to spend a lot of time scrubbing and de-identifying data, using techniques that are frankly inexact and possibly ineffective.  The datatrust, in other words, could make real-time data possible.

Furthermore, the data that is released can be accessed in flexible, creative ways.  Right now, sensitive data is aggregated and released as statistics.  A public health official may have access to data that shows how many people are “obese” in a county, but she can’t “ask” how many people are “obese” within a 10-mile radius of a McDonald’s.

We have a demo of PINQ: an illustration of how you can safely query a sensitive data set through differential privacy, a relatively new, quantitative approach to protecting privacy.

We’ve also developed an accompanying privacy risk calculator to help us visualize the consequences of tweaking different levers in differential privacy.
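As a rough sketch of the kind of arithmetic such a calculator might perform (the function name and the single-number summary here are our illustrative assumptions, not the actual calculator), differential privacy bounds how much any released answer can shift an adversary's belief about one individual:

```python
import math

def posterior_odds_bound(prior_odds: float, epsilon: float) -> float:
    # An epsilon-differentially-private release can multiply an adversary's
    # odds that your record is in the dataset by at most e^epsilon.
    return math.exp(epsilon) * prior_odds

# Tweaking the epsilon "lever": small epsilon barely moves the odds.
for eps in (0.01, 0.1, 1.0):
    print(f"epsilon={eps}: odds can grow by at most {math.exp(eps):.3f}x")
```

This is what makes the guarantee "measurable": the bound holds no matter what the query asks or what else the adversary already knows.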

For CDP, improved privacy technology is only one part of the datatrust concept.

We’ve also been working on a number of organizational and policy issues:

A Quantifiable Privacy Guarantee: We are working through how differential privacy can actually yield a “measurable privacy guarantee” that is meaningful to the layman. (Thus far, it has been only a theoretical possibility. A specific “quantity” for the so-called “measurable privacy guarantee” has yet to be agreed upon by the research community.)

Building Community and Self-Governance: We’re wrapping up a blog series looking at online information-sharing communities and self-governance structures and how lessons learned from the past few years of experimentation in user-generated and user-monitored content can apply to a data-sharing community built around a datatrust.

We’ve also started outlining the governance questions we have to answer as we move forward, including who builds the technology, who governs the datatrust, and how we will monitor and prevent the datatrust from veering from its mission.  We know that this is an organization that must be transparent if it is to be trusted, and we are working on creating the kind of infrastructure that will make transparency inevitable.

Licensing Personal Information: We proposed a “Creative Commons” style license for sharing personal data and we’re following the work of others developing licenses for data. In particular, what does it mean to “give up” personal information to a third-party?

Privacy Policies: We published a guide to reading online privacy policies for the curious layman: an analysis of their pitfalls and ambiguities, which was republished by the IAPP and picked up by the popular technology blog Read Write Web.

We’ve also started researching the issues we need to address to develop our own privacy policy.  In particular, we’ve been working on figuring out how we will deal with government requests for information.  We did some research into existing privacy law, both constitutional and statutory, but in many ways, we’ve found more questions than answers.  We’re interested in watching the progress of the Digital Due Process coalition as they work on reforming the Electronic Communications Privacy Act, but we anticipate that the datatrust will have to deal with issues that are more complex than an individual’s expectation of privacy in emails more than 180 days old.

Education: We regularly publish in-depth essays and news commentary on our blog, covering topics such as the risk of re-identification with current methods of anonymization and the value of open datasets that are available for creative reuse.

We have a lot to work on, but we’re excited to move forward!

The meaning of membership

Thursday, April 1st, 2010

BSA Member Card, Focht, Flickr/Creative Commons License Attribution-Noncommercial-No Derivative Works

We’ve been talking about a “datatrust” for a while now: why we think we need one, how we envision it as a long-lasting institution, and what kind of technologies we might employ for it to provide measurable guarantees around privacy.

But we’re now starting to get down to the nitty-gritty.  How will it actually work?  What will it mean to an actual researcher, nonprofit organization, policy-maker?  To you?

First and foremost, we imagine the datatrust as a member-based data bank where organizations and individuals can safely contribute personal information to inform research and public policy.

The member-based part is key.  We plan to be both non-partisan and absolutely transparent.  We have no particular academic or policy ax to grind. Our only goal is to maximize the quantity, quality and diversity of sensitive data that is made available to the public.  To ensure that decisions aren’t made with even an unconscious bias, we plan to build a decentralized structure that relies on the participation and contribution of members to build and sustain the datatrust.

But the word membership can mean a lot of different things.  When my local public radio station exhorts me to be a member, membership doesn’t seem to come with anything more than a tote bag.  In contrast, if you’re a Wikipedian, it means you’ve actually written or edited an entry, and the more you participate, the more access and privileges you get, including the right to vote for members of the Wikimedia Foundation Board.

So for the past couple of months, I’ve been looking at member-based communities.  Not all of them would call themselves member-based communities, but they all have in common a structure that requires participation from a large group of people.  Some are nonprofits, some are businesses running social networks; most are online, a few are not.  Over the next couple of posts, I’m going to summarize how these communities work, what motivates the members, how the communities monitor themselves, and how diverse they are, because all of these issues will inform the decisions we make in creating our datatrust.

Here are the ten communities included in this study:

MySpace is one of the world’s largest social networks with about 125 million users, though Facebook has in the last year surpassed MySpace in number of users and pageviews, both in the U.S. and the rest of the world.  The look and feel of MySpace is very different from Facebook’s, since MySpace users are allowed to customize their pages.  There’s also been a lot of press about the demographic differences between MySpace and Facebook, but those differences are probably disappearing as Facebook simply grows and grows.  MySpace remains more popular than Facebook as a site for bands and music.

Facebook is the world’s largest social network with about 400 million users.  Despite its popularity and recent news that it even surpassed Google in Internet traffic, it’s also been the center of controversy, particularly regarding user privacy and terms of use, with each major change made to the site.

Yelp is a social network-based user review site for local businesses in multiple cities in the U.S.  It’s growing much faster than older sites like Citysearch, and it’s spawned offline events where really avid reviewers meet and socialize.  It has also attracted controversy, with accusations that it extorts businesses to take out ads in return for highlighting good reviews or pulling bad ones.  Although Yelp has denied these accusations, a class-action lawsuit was recently filed against it.

Flickr is a popular social network-based photo-sharing site.  Unlike many photo-sharing sites like Kodak Gallery or Photobucket, Flickr has emphasized sharing photos with the general public and organization by crowdsourcing via tags. Although it does have some services for printing photos and mugs, its main service is photo-hosting and storage, particularly for bloggers and photographers.  In addition to hosting photos, Flickr also manages projects like “The Commons” with the Library of Congress and other institutions interested in putting their public domain photos in wider circulation.

Slashdot is a news aggregator for self-professed nerds with estimated traffic of 5.5 million users per month.  It shares news stories contributed by its users, who also comment on the stories and moderate the comments.  Useful contribution is rewarded with karma points, which increase the privileges each user gets.

Wikipedia is “the free encyclopedia anyone can edit,” run by the nonprofit Wikimedia Foundation.  The number of named accounts for writers and editors is at about 11 million; about 300,000 have edited Wikipedia more than ten times.  Despite early skepticism, Wikipedia has become one of the most trafficked sites online and has expanded into multiple countries around the world.  Wikipedia has clearly developed a community of avid and enthusiastic users who contribute without monetary compensation, but in its tenth year, it is evaluating the lack of diversity among Wikipedians (only 13% of contributors are women, for one) and what steps it should take to provide access to a free encyclopedia all over the world. Wikipedia has also instituted a number of changes over the years to deal with vandalism and inaccuracies.

Open Source Software – rather than look at one particular open source project, for this study, I focused on the book Producing Open Source Software by Karl Fogel, which describes how projects should work.  Obviously, actual projects will vary widely, but we decided this was an area worth looking at because the open source movement has spent years figuring out how to structure shared work.

The Sierra Club is one of the oldest grassroots environmental organizations in the U.S.  It has 1.3 million members, but because it is not a primarily online organization, it isn’t easy to evaluate the activities of its members online.  However, it recently created a series of social media sites for online networking among Sierra Club members and supporters and our report focuses primarily on this aspect of their member activities.

The Park Slope Food Coop is a local cooperative grocery store in Park Slope, Brooklyn.  (DISCLAIMER: I’ve been a member since 2005, and my research on how it works is based on my experiences there.)  Unlike many coops, membership is predicated on work.  All of its approximately 15,000 members are required to work a two-hour-and-45-minute shift every four weeks, which reduces labor costs and thus reduces prices.  Despite being a place many people love to hate, it continues to thrive and attract new members.

Habitat for Humanity International is a major nonprofit organization that seeks to eliminate poverty housing and homelessness by building decent housing around the world.  (DISCLAIMER: I volunteered for Habitat for Humanity in high school and college and participated in a fundraising bike trip in 1999.)  Like the Sierra Club, it is also an offline organization, but its website provided more detailed information on how its affiliates work and I drew on my personal experience in trying to understand how Habitat encourages and retains volunteers.

In the mix

Wednesday, March 31st, 2010

1) Exciting news!  A diverse coalition of left-leaning and right-leaning organizations, as well as a bunch of big corporations, has formed around the goal of revising the Electronic Communications Privacy Act.  This law, from 1986, clearly didn’t anticipate the world we live in now, the extent to which we use emails, the “expectation of privacy” we have in email, and the extent to which we store our data and our documents in the cloud.  This law will greatly impact our work at the Common Data Project, but even without a professional stake in this, I’d be pretty excited.  After all, we all (except my mom who doesn’t use computers) have a personal stake in this.

2) The full text of danah boyd’s talk at SXSW is available on her blog.  This is my favorite line:

For the parents and educators in the room… Many of you are struggling to help young people navigate this new world of privacy and publicity, but many of you are confused yourself. The worst thing you can do is start a sentence with “back in my day.” Back in your day doesn’t matter.

It’s an obvious but useful point for privacy and information issues in general.  The ECPA, from back in the day of 1986, can’t deal with today.  It’s time to really think: which of our assumptions about privacy still hold true?

3) David Brooks’s column this week got me thinking.  If we agree with him, which I do, that a country’s success cannot be measured simply with things like GDP, what else should we measure and how? My friends who work in social sciences are initially skeptical when I talk about the data collection potential of something like the Common Data Project’s datatrust.  They’re distrustful of self-reported data, even as they acknowledge that their existing methodologies are imperfect.  But with things that are hard to measure, self-reporting is often the only way to go.  The datatrust and the Internet, with measurable guarantees of privacy, could dramatically change how self-reported data is collected, analyzed, and published.

4) Facebook data destroyed: Pete Warden, who had created a database from 210 million public Facebook profiles, was prepared to release the data to social scientists who were fascinated by the potential to research social connections, particularly as mashed up with census data on income, mobility and employment.  But then Facebook said he had violated its terms of use, and unable to defend a potential lawsuit, he destroyed the data.

Argh, isn’t there a better way?  The decision to make one’s profile public may not equal a decision to consent to be in such a database, and Warden’s planned “anonymization” was unlikely to be very robust, but this situation is a perfect example of why the Common Data Project was founded: to create a new norm, with strong privacy and sharing standards, that makes such data truly, safely available.

What kind of institution do we want to be? Part II

Tuesday, December 15th, 2009

As described in the first post, banks and credit unions could be useful models for the datatrust because of their function of holding valuable assets for account holders.  Public libraries and museums are very different, but their function, of providing the public access to valuable social assets, is also relevant to the datatrust.

A. We want to be an online public library of useful, personal data, because no democracy can function properly without broad access to information.

Image by FreeFoto, available under a Creative Commons Non-Commercial No-Derivative Works License.

Although public libraries now enjoy warm-and-fuzzy status right up there with puppies and babies, the public library system was not established in the U.S. without controversy.  The only people who owned books were the rich, and many argued that the poor would not know how to take care of the books they borrowed.  The system was largely established through the efforts of Andrew Carnegie and others who believed in both public libraries and public schools, and in the idea that democracy could not function without public access to information.

Librarians are now champions for intellectual freedom.  As a profession, librarians have developed strong principles around the confidentiality of library users, and they were on the front lines in resisting the USA PATRIOT Act’s provisions around FBI access to library records. Although they are often underfunded and can seem out of date, the current recession has made obvious what has been going on for awhile, that people really do use the library. And when they do, they don’t abuse the privilege.  Many communities feel invested in their local branches, and the respect people have for libraries translates into a respect for their holdings.

We hope the establishment of our datatrust can follow a similar path.  Everyone may not agree now that this kind of access to information is necessary.  But we strongly believe that the status quo, where large corporations and government agencies have access but the public does not, stifles the free flow of information that really is crucial for a functioning democracy.  We hope that the datatrust can grow to engender the same kind of respect and to be a valuable member of many communities.

Of course, the information in books is qualitatively different from personal data about an individual.  If a book gets lost, it’s not as great a loss as if personal data gets misused.  Which leads us to the next point.

B. We want to make data available to the public because it is too valuable to be kept in a locked safe, the way museums make great art available.


Museums are interesting institutions to us because they showcase extremely valuable pieces that would be safest from damage and theft if kept locked up in a vault, yet are put on public display because the value afforded to the public outweighs the risk of damage and theft.  Although they have a greater reputation for elitism than public libraries, museums also operate on the belief that certain assets, like great art or historical artifacts, should belong to society at large rather than to a private collector.  Thus, when a private collector does donate his or her collection to a museum, he or she gains the reputational benefit of having done something altruistic.  At the same time, access to the public comes with protective measures for security—guards, technology, velvet ropes, and more.

Personal data, to us at CDP, is also too valuable to keep locked up.  Arguably, personal data is currently kept by many private collectors, or corporations.  They gain value from that data, but that value is not shared with the public.  Unlike art, which is usually made by an individual, personal data is collected from large swaths of the general population, and yet we don’t have access to that data.  Like museums, we will want to think of security measures to minimize any risk, but we do acknowledge that there will be some risk, known and unknown, in our project.  But that risk is so far outweighed by the potential benefits to society that we think it’s a worthwhile experiment.

Museums also add value to their holdings by curating them.  That’s an important challenge for us, as information is only valuable when it’s organized.

Datatrust Prototype

Tuesday, December 8th, 2009

Editor’s Note: Grant Baillie is developing a datatrust prototype as an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Grant’s work could have interesting implications for CDP’s mission, as it would use technologies that could enable more disclosure of personal data for public re-use.  We’re glad to have him guest-blogging about the prototype on our blog, and we’re looking forward to hearing more as he moves forward.

This post is mostly the contents of a short talk I gave at the CDP Symposium last month. In a way, it was a little like the qualifying oral you have to give in some Ph.D. programs, where you stand up in front of a few professors who know way more than you do, and tell them what you think your research project is going to be.

That is the point we’re at with the Datatrust Prototype: We are ready to move forward with an actual, real, project. This proposal is the result of discussions Mimi, Alex and I have been having over the course of the past couple of months, with questions and insights on PINQ thrown in by Tony, and answers to some of those questions from Frank.

The talk can be broken up into three tersely titled sections:

Why? Basic motivation for the project.

What? What exactly the prototype will be.

Not: Potential features of a Datatrust that are out of scope for this first prototype.


We need a (concrete) thing

Partly this is to have something to demo to future clients/partner organizations, of course. However, we also need the beginnings of a real datatrust so that different people’s abstract concepts of what a datatrust is can begin to converge.

We need understanding of some technical issues

1. Understanding Privacy: People have been looking at this problem (i.e. releasing aggregated data in such a way that individuals’ privacy isn’t compromised) for over 40 years. After some well-publicized disasters involving ad hoc approaches (e.g. “just add some noise to the data,” or “remove all identifying data like names or social security numbers”), a group of researchers came up with a mathematical model where there is a measure of privacy, ε (epsilon).

In the model, there’s a clear message: you have to pay (in privacy) if you want greater accuracy (i.e. less noise) in your answers. In particular, the PINQ C# API can calculate the privacy cost ε of each query on a dataset. So, one can imagine having different users of a datatrust using up allocations of privacy they have been assigned, hence the term “Privacy Budget.” (Frank dislikes this term because there are many privacy strategies possible other than a simple, fixed budget.) In any case, by creating the prototype, we are hoping to gain an intuitive understanding of this mathematical concept of privacy, as well as obtain insight on more practical matters, like how to allocate privacy so that the datatrust is still useful.
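A minimal Python sketch of the budget idea (illustrative only: PINQ itself is a C# API, and the class and function names here are invented for this example). Each noisy query spends part of a fixed ε allocation, and higher ε buys a more accurate answer:

```python
import math
import random

class PrivacyBudget:
    """Tracks a fixed epsilon allocation; each query spends part of it."""
    def __init__(self, total_epsilon: float):
        self.remaining = total_epsilon

    def spend(self, epsilon: float) -> None:
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon

def noisy_count(records, predicate, budget: PrivacyBudget, epsilon: float) -> float:
    """Count matching records plus Laplace noise of scale 1/epsilon.
    Higher epsilon means less noise (more accuracy) but more budget spent."""
    budget.spend(epsilon)
    true_count = sum(1 for r in records if predicate(r))
    # Inverse-CDF sample of Laplace(0, 1/epsilon) noise.
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
    return true_count + noise
```

Because the ε costs add up across queries, the total leakage of a whole session is bounded by the initial allocation, which is what makes the guarantee enforceable rather than merely aspirational.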

One way of understanding privacy is to think of it as a probability, (i.e. of leaking data) or measure of risk. You could even imagine an organization buying insurance against loss of individual data, based on the mathematical bounds supplied by PINQ. The downside of this approach is that we humans don’t seem to have a good intuitive grasp of things like probability and risk (writ large, for example, in the financial meltdown last year).

Another approach that might be helpful is to notice that privacy behaves in the same way as a currency (for example, it is additive). Here, you can imagine people earning or trading currency, for example. With actual money, we have a couple of thousand years’ worth of experience built into evaluations like a house being worth a million Snickers bars: How long will it take us to have similar intuition with a privacy currency?

2. PINQ vs SQL: Here, by “SQL” I’m talking of traditional persistent data storage mechanisms in general. In most specific cases we are talking about SQL-based databases (although in the data analysis world there are other possibilities, like SAS).

  • SQL has been around for over 35 years, and is based on a mathematical model of its own. It basically provides a set of building blocks for querying and updating a database.
  • PINQ is a wrapper that basically protects the privacy of an underlying SQL database. It allows you to run SQL-like queries to get result sets, but then only lets you see statistical information about these sets. Even this information will come at some privacy cost, depending on how accurate you want the answer to be. PINQ will add random noise to any answer it gives you; if you want to ask for more accurate answers, i.e. less noise added (on average), you have to pay more privacy currency.
Note: At this point in the talk, I went into a detailed discussion of a particular case of how PINQ privacy protection is supposed to work. However, I’m going to defer this to an upcoming post.

PINQ provides building blocks that are similar to SQL’s, with the caveat that the only data you can get out is aggregated (i.e. numbers, averages, and other statistical information). Also, some SQL operations cannot be supported by PINQ because they cannot be privacy protected at all.

In any case, both PINQ and SQL support an infinite number of questions, since you can ask about arbitrary functions of the input records. However, because they have somewhat different query building blocks, it is at least theoretically possible that there are real-world data analyses that cannot be replicated exactly in PINQ, or can only be done in a cumbersome or (privacy) expensive way. So, it will be good to focus on more concrete use cases, in order to see whether this is the case or not.
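To make the “wrapper” idea concrete, here is a hypothetical Python analogue (PINQ is really C#, and these method names are ours, not PINQ’s): callers can compose SQL-like filters freely, but the only way to read anything out is a noisy aggregate, never the raw rows.

```python
import math
import random

class ProtectedDataset:
    def __init__(self, rows):
        self._rows = list(rows)  # held privately; never returned to callers

    def where(self, predicate):
        # Filtering costs nothing by itself: no data leaves the wrapper yet.
        return ProtectedDataset(r for r in self._rows if predicate(r))

    def noisy_count(self, epsilon: float) -> float:
        # The only exit point: an aggregate with Laplace(1/epsilon) noise added.
        u = random.random() - 0.5
        noise = -(1.0 / epsilon) * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
        return len(self._rows) + noise
```

This is what lets the public-health official from an earlier post ask flexible questions, e.g. `ProtectedDataset(people).where(lambda p: p["bmi"] >= 30).noisy_count(0.1)`, without ever seeing an individual record.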

3. Efficient Queries: It’s not uncommon for database-based software projects to grind to a halt at some point when it becomes clear that the database isn’t handling the full data set as well as is needed. Then various experts are rushed in to tune and rewrite the queries so that they perform better. In the case of PINQ, there is an additional measure of query performance: that of privacy used. Frank’s PINQ tutorial already has one example of a query that can be tuned to use privacy budget more efficiently. Hopefully, by working through specific use cases, CDP can start building expertise in query optimization.
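One concrete optimization of this kind, well established in the differential privacy literature (the function names below are illustrative): a histogram over k disjoint buckets asked as k separate noisy counts pays for each count under sequential composition, while a PINQ-style Partition pays ε only once, because each record can influence only the one bucket it falls into.

```python
def naive_histogram_cost(num_buckets: int, eps_per_count: float) -> float:
    # k independent noisy counts: epsilons add up (sequential composition).
    return num_buckets * eps_per_count

def partitioned_histogram_cost(num_buckets: int, eps_per_count: float) -> float:
    # Disjoint buckets queried via Partition: each record is touched by only
    # one sub-query, so the whole histogram costs eps once (parallel composition).
    return eps_per_count
```

For a 50-bucket age histogram at ε = 0.1 per count, the naive plan spends 5.0 of the privacy budget where the partitioned plan spends 0.1, the kind of 50x saving that makes this tuning worth learning.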


Target: A researcher working for a single organization. We’re going to imagine that some organization has a dataset containing personal information, but they want to be able to do some data analysis and release statistical aggregates to the public without compromising any individual’s privacy.

A Mockup of a Rich Dataset: Hopefully, I’ve given enough incentive for why we want a reasonably “real-world” structure to our data. I’m proposing that we choose a subset of the schema available as part of the National Health and Nutrition Examination Survey (NHANES):

NHANES Survey Content Cover Page

This certainly satisfies the “rich” requirement: NHANES combines an interesting and extensive mix of medical and sociological information (The above cover page image comes from the description of the data collected, a 12-page PDF file). Clearly, we wouldn’t want to mock up the entire dataset, but even a small subset should make for some reasonably complex analyses.

Queries: We will supply at least a canned set of queries over the sample data. A scenario I have in mind is being able to have something like Tony’s demo, but with a more complex data set. A core requirement of the prototype is to be able to reproduce the published aggregations done with the real NHANES dataset. Some kind of geographical layout, like the demo, would be compelling, too.


Account management: This includes issues of tracking privacy allocation and expenditures on a per-user basis, possibly having some measure of trust to allow this. There may be some infrastructure for different users in the prototype, but for the most part we’ll be assuming a single, global user.

Collaborative queries: In the future, we could imagine having users contribute to a library of well-known queries for a given data set. The problem with public access like this is that it basically means that all privacy budget is effectively shared, since query results are shared, so for this first cut at the problem we are not going to tackle this.

Multiple Datasets, Updates: For now, we will assume a single data set, with no updates. (The former can raise security concerns, especially if data sets aren’t co-hosted, while the latter is an area where I’m not sure what the mathematical constraints are.)

Sneaky code (though maybe we have a service): There is a known issue at the moment with PINQ executing arbitrary C# code to do queries. As things stand, it is possible for query code to save all the records it sees to a file on disk. We may work around this by having the datatrust be a service (i.e. effectively restricting the allowed queries so no user-supplied code is run).

Deployment issues (e.g. who owns the data): Our prototype will just have PINQ and the simulated database running on the same machine, even though more general configurations are possible. We also explicitly don’t tackle whether the database is running on a CDP server or the organization that owns the data.

Open Source Ideological Purity: While it would be nice for CDP to be able to deploy on an open source platform, serious issues might lie in wait for a deployment on Mono (the open source C# environment). In that case, it is quite possible to run PINQ on top of, say, Microsoft SQL Server instead.

Remixing Creative Commons licenses for personal information, Part II — What good would that do?

Wednesday, November 25th, 2009

The scenarios of data sharing I outlined in my first blog post may not sound too exciting to you.  So what if one person uploads a dataset on her blog, making it public, and then says it’s available for reuse?  How does that make the world a better place?

Although personal information licenses, a la Creative Commons, wouldn’t solve all data-collection problems today, they could shape and shift the debate in several important ways:

1) Create a proactive way for people to take control of their information.

Right now, we as users generally are told, “Take it or leave it.”  We can agree with the terms of use that govern the use of our personal information, or not. A few companies are trying to offer more choices—Firefox has a “Private Browsing” option, Google offers some choices in what interests are tracked.  But a user almost never gets a choice in how his or her information is used once it’s collected.  A set of licenses could be a way to assert control instead of waiting for the choices to be offered.  As many privacy advocates have noted, it’s problematic that most privacy choices are offered as an opt-out rather than an opt-in.  A set of licenses would create a way to “opt-in” before being asked.  Even if the licenses turned out to be difficult to enforce, if the licenses became popular and widespread, it would be harder to ignore that people do have preferences that are not being considered or honored.

2) Create a grassroots way for people to actively share their information for causes they explicitly support.


We’ve all seen campaigns that are organized around human-interest stories, true stories about real people that are meant to humanize a campaign and give it urgency.  The current healthcare debate, for example, inspired a host of organizations to ask people to “share their stories,” the Obama administration’s site being one of the best-organized ones.

It had the following “Submission Terms“:


By submitting your story, you agree that the story, along with any pictures or video you submit along with the story (the “Submission”), is non-confidential and may be freely used and disclosed, in whole or in part and in any manner or media, by or on behalf of Democratic National Committee (“DNC”) in support of health care reform.

You acknowledge that such use will be without acknowledgment or compensation to you.

You grant DNC a perpetual, irrevocable, sublicensable, royalty-free license to publish, reproduce, distribute, display, perform, adapt, create derivative works of and otherwise use the Submission.

Despite the all-or-nothing language, the Obama site was still able to solicit a great number of stories.  But the terms underscore a perennial problem for lesser-known organizations.  How do people trust an organization with their stories?

A more decentralized set of licenses could allow people to essentially tag their information across the internet and flag that it’s been provided in support of a specific cause, without giving their stories explicitly to another organization.  Individuals could also choose to tag their information in support of specific research projects.

The licenses could be an organizing tool, a way for organizations or people without established reputations to gather useful information without asking people to sign away the rights to their stories.  Or the licenses could be a research tool, enabling new forms of data collection.  Already, sociologists are exploring the possibilities of broadening research beyond the couple hundred subjects that can be managed through more traditional methods.  At Harvard, a graduate student in psychology created an iPhone application that allows research subjects in a study on happiness to rate their happiness in real time, rather than through recollection with an interviewer later.

Would the existence of standard licenses for sharing personal information make organizing around real stories easier?  Could it make personal information-based research easier?  Could it make people who support such causes or research, but are uncertain about existing privacy guarantees, more willing to try?  We think it’s certainly worth exploring.

3) Make sharing cool (and good).


Creative Commons is not without controversy, but almost everyone would agree that what the organization did manage to do was make sharing work cool.  The licenses created an easy way for people who shared the same view of intellectual property to band together and display their commitment.  They also made it easier to advertise and sell this ethos of IP to others.

We wonder if a set of licenses for sharing personal information might not be able to do the same.  We want to promote sharing information as a virtue, a civic act of generosity, and a way to enable all of us to have more information for decisions.  We want donating information to feel like donating blood.

4) Raise the bar on use of personal information in research, marketing, and other contexts.

It may seem like we’re encouraging less use and reuse of information by imagining a system where people put licenses on information they already make public (see screenshots from the first post.)  But what the licenses would make clear, which is not clear now, is that there is a difference between something being put out for the public, for general use and enjoyment, and something being put out for someone else’s reuse, gain, and potential profit.  Those who use the license would be signaling clearly their willingness to make their information available for research and other public uses.

About a year ago, researchers at the Berkman Center for Internet and Society at Harvard released a dataset of Facebook profile information for an entire class of college students at an “anonymous, northeastern American university.”  As Michael Zimmer pointed out, however, the dataset was hardly “anonymous.”  He was quickly able to deduce that the university in question was Harvard.  Although some have argued that some of these profiles were already “public,” Zimmer argues (and we agree) that having a public profile does not equal consent to being a research subject:

This leads to the second point: just because users post information on Facebook doesn’t mean they intend for it to be scraped, aggregated, coded, dissected, and distributed. Creating a Facebook account and posting information on the social networking site is a decision made with the intent to engage in a social community, to connect with people, share ideas and thoughts, communicate, be human. Just because some of the profile information is publicly available (either consciously by the user, or due to a failure to adjust the default privacy settings), doesn’t mean there are no expectations of privacy with the data. This is contextual integrity 101.

By creating a license that allows people to clearly signal when they do consent to being “scraped, aggregated, coded, dissected, and distributed,” we would also make clearer that when people don’t clearly signal their consent, that consent cannot be assumed.

5) Ultimately create new scenarios in which licenses can be used.

So far, the scenarios I’ve outlined in which a license could be applied are where information is being displayed openly, as on a website.  But the licenses could eventually apply to more closed systems, where the individual’s decision to share data is not itself public.

CDP is working on building a datatrust, a new kind of institution and trusted entity to store sensitive, personal information and make it publicly accessible for research.  Individuals and institutions could choose to donate data to the datatrust, knowing that they are contributing to public knowledge on a range of issues.  CDP will likely use a system of licenses that allows each data donor to pre-determine his or her preferences for how the data is accessed, rather than a single “terms of use” that applies to everyone, take it or leave it.

Similarly, if the licenses were to become popular, other organizations and companies that collect information from their members or account holders would be under pressure to offer this set of choices or licenses when people sign up for accounts that require them to provide personal information.

What have we been doing?

Monday, October 19th, 2009

We’ve been silent for a while on the blog, but that’s because we’ve been distracted by actual work building out the datatrust (both the technology and the organization).

Here’s a brief rundown of what we’re doing.

Grace is multi-tasking on 3 papers.

Personal Data License: We’re conducting a thought experiment to think through what the world might look like if there were an easy way for individuals to release personal information on their own terms.

Organizational Structures: We’ve conducted a brief survey of organizational structures we think are interesting models for the datatrust: “trusted” entities from banks to public libraries, and “member-based” organizations from credit unions to Wikipedia. We tried to answer the question: what institutional structures can serve as practical defenses against abuses of power as the datatrust becomes a significant repository of highly sensitive personal information?

Snapshot of Publicly Available Data Sources: A cursory overview of some of the more interesting data sets that government agencies already make available to the public, answering the question: how will the datatrust be different from, and better than, the myriad data sources we already have access to today?

We also now have 2 new contributors to CDP: Tony Gibbon and Grant Baillie.

A couple of months ago, Alex wrote about a new anonymization technology coming out of Microsoft Research: PINQ. It’s an elegant, simple solution, but perhaps not the most intuitive way for most people to think about guaranteeing privacy.

Tony is working on a demonstration of PINQ in action so that you and I can see how our privacy is protected and therefore believe *that* it works. Along the way, we’re figuring out what makes intuitive sense about the way PINQ works, what doesn’t, and what we’ll need to extend so that researchers using the datatrust can do their work in a way that makes sense.

Grant is working on a prototype of the datatrust itself which involves working out such issues as:

  • What data schemas will we support? We plan to begin with a star schema.
  • How broadly do we support query structures?
  • Managing anonymizing noise levels.
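On the last point, the core tradeoff behind noise levels can be sketched as follows (a standard Laplace-mechanism calculation, not CDP-specific code):

```python
import math

def laplace_scale(sensitivity, epsilon):
    # The Laplace mechanism adds noise with scale b = sensitivity / epsilon;
    # a smaller epsilon (stronger privacy) means wider noise.
    return sensitivity / epsilon

def noise_stddev(scale):
    # The standard deviation of a Laplace(0, b) variable is b * sqrt(2).
    return scale * math.sqrt(2.0)

# For a counting query (sensitivity 1), stronger privacy costs accuracy:
for eps in (1.0, 0.5, 0.1):
    b = laplace_scale(1.0, eps)
    print(f"epsilon={eps}: scale={b:.1f}, stddev={noise_stddev(b):.2f}")
```

Managing noise levels, then, largely comes down to choosing per-query epsilons that keep answers useful while staying inside the overall privacy budget.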

To help us answer some of these questions, we’ve gathered a list of data sources we think we’d like to support in this first iteration. (e.g. IRS tax data, Census data) (More to come on that.)

We will be blogging about all of these projects in the coming week, so stay tuned!
