Archive for the ‘CDP Announcements’ Category

Common Data Project looking for a partner organization to open up access to sensitive data.

Wednesday, June 30th, 2010

Looking for a partner...

The Common Data Project is looking for a partner organization to develop and test a pilot version of the datatrust: a technology platform for collecting, sharing and disclosing sensitive information that provides a new way to guarantee privacy.

Funders are increasingly interested in developing ways for nonprofit organizations to make more use of data and make their data more public. We would like to apply with a partner organization for a handful of promising funding opportunities.

We at CDP have developed technology and expertise that would enable a partner organization to:

  1. Collect sensitive data from members, donors and other stakeholders in a safe and responsible manner;
  2. Open data to the public to answer policy questions, be more transparent and accountable, and inform public discourse.

We are looking for an organization that is both passionate about its mission and deeply invested in the value of open data to provide us with a targeted issue to address.

We are especially interested in working with data that is currently inaccessible or locked down for privacy reasons.

We can imagine, in particular, a couple of different scenarios in which an organization could use the datatrust in interesting ways, but ultimately, we are looking to work out a specific scenario together.

  • A data exchange to share sensitive information between members.
  • An advocacy tool for soliciting private information from members so that organizational policy positions can be backed up with hard data.
  • A way to share sensitive data with allies in a way that doesn’t violate individual privacy.

If you’re interested in learning more about working with us, please contact Alex Selkirk at alex [dot] selkirk [at] commondataproject [dot] org.

A big update for the Common Data Project

Tuesday, June 29th, 2010

There’s been a lot going on at the Common Data Project, and it can be hard to keep track.  Here’s a quick recap.

Our Mission

The Common Data Project’s mission is to encourage and enable the disclosure of personal data for public use and research.

We live in a world where data is obviously valuable — companies make millions from data, nonprofits seek new ways to be more accountable, advocates push governments to make their data open.  But even as more data becomes accessible, even more valuable data remains locked up and unavailable to researchers, nonprofit organizations, businesses, and the general public.

We are working on creating a datatrust, a nonprofit data bank, that would incorporate new technologies for open data and new standards for collecting and sharing personal data.

We’ve refined what that means, what the datatrust is and what the datatrust is not.

Our Work

We’ve been working in partnership with Shan Gao Ma (SGM), a consultancy started by CDP founder, Alex Selkirk, that specializes in large-scale data collection systems, to develop a prototype of the datatrust.  The datatrust is a new technology platform that allows the release of sensitive data in “raw form” to the public with a measurable and therefore enforceable privacy guarantee.

In addition to this real privacy guarantee, the datatrust eliminates the need to “scrub” data before it’s released.  Right now, any organization that wants to release sensitive data has to spend a lot of time scrubbing and de-identifying data, using techniques that are frankly inexact and possibly ineffective.  The datatrust, in other words, could make real-time data possible.

Furthermore, the data that is released can be accessed in flexible, creative ways.  Right now, sensitive data is aggregated and released as statistics.  A public health official may have access to data that shows how many people are “obese” in a county, but she can’t “ask” how many people are “obese” within a 10-mile radius of a McDonald’s.

We have a demo of PINQ: an illustration of how you can safely query a sensitive data set through differential privacy, a relatively new, quantitative approach to protecting privacy.

We’ve also developed an accompanying privacy risk calculator to help us visualize the consequences of tweaking different levers in differential privacy.

For CDP, improved privacy technology is only one part of the datatrust concept.

We’ve also been working on a number of organizational and policy issues:

A Quantifiable Privacy Guarantee: We are working through how differential privacy can actually yield a “measurable privacy guarantee” that is meaningful to the layman. (Thus far, it has been only a theoretical possibility. A specific “quantity” for the so-called “measurable privacy guarantee” has yet to be agreed upon by the research community.)

Building Community and Self-Governance: We’re wrapping up a blog series looking at online information-sharing communities and self-governance structures and how lessons learned from the past few years of experimentation in user-generated and user-monitored content can apply to a data-sharing community built around a datatrust.

We’ve also started outlining the governance questions we have to answer as we move forward, including who builds the technology, who governs the datatrust, and how we will monitor and prevent the datatrust from veering from its mission.  We know that this is an organization that must be transparent if it is to be trusted, and we are working on creating the kind of infrastructure that will make transparency inevitable.

Licensing Personal Information: We proposed a “Creative Commons” style license for sharing personal data and we’re following the work of others developing licenses for data. In particular, what does it mean to “give up” personal information to a third-party?

Privacy Policies: We published a guide to reading online privacy policies for the curious layman: an analysis of their pitfalls and ambiguities, which was re-published by the IAPP and picked up by the popular technology blog Read Write Web.

We’ve also started researching the issues we need to address to develop our own privacy policy.  In particular, we’ve been working on figuring out how we will deal with government requests for information.  We did some research into existing privacy law, both constitutional and statutory, but in many ways, we’ve found more questions than answers.  We’re interested in watching the progress of the Digital Due Process coalition as they work on reforming the Electronic Communications Privacy Act, but we anticipate that the datatrust will have to deal with issues that are more complex than an individual’s expectation of privacy in emails more than 180 days old.

Education: We regularly publish in-depth essays and news commentary on our blog, myplaceinthecrowd.org, covering topics such as the risk of re-identification with current methods of anonymization and the value of open datasets that are available for creative reuse.

We have a lot to work on, but we’re excited to move forward!

Governing the Datatrust: Answering the question, “Why should I trust you with my data?”

Thursday, June 3rd, 2010

Progress on defining the datatrust is accelerating–we can almost smell it!

For a refresher, the datatrust is an online service that will allow organizations to open sensitive data to the public and provide researchers, policymakers and application developers with a way to directly query the data, all without compromising individual privacy. Read more.

For the past two years, we’ve been working on figuring out exactly what the datatrust will be, not just in technical terms, but also in policy terms.

We’ve been thinking through what promises the datatrust will make, how those promises will be enforced, and how best we can build a datatrust that is governed, not by the whim of a dictator, but by a healthy synergy between the user community, the staff, and the board.

The policies we’re writing and the infrastructure we’re building are still a work in progress.  But for an overview of the decisions we’ve made and outstanding issues, take a look at “Datatrust Governance and Policies: Questions, Concerns, and Bright Ideas”.

Here’s a short summary of our overall strategy.

  1. Make a clear and enforceable promise around privacy.
  2. Keep the datatrust simple. We will never be all things to all people. The functions it does have will be small enough to be managed and monitored easily by a small staff, the user community, and the board.
  3. Have many decision-makers. It’s more important that we do the right thing than that we do it quickly. We will create a system of checks and balances, in which authority to maintain and monitor the datatrust will be entrusted to several, separate parties, including the staff, the user community, and the board.
  4. Monitor, report and review, regularly. We will regularly review what we’re monitoring and how we’re doing it. Release results to the public.
  5. Provide an escape valve. Develop explicit, enforceable policies on what the datatrust can and can’t do with the data. Prepare a “living will” to safely dispose of the data if the organization can no longer meet its obligations to its user community and the general public.

We definitely have a lot of work to do, but it’s exciting to be narrowing down the issues.  We’d love to hear what you think!

P.S. You can read more about the technical progress we’re making on the datatrust by visiting our Projects page.

Update: PINQ Demo Revisited

Tuesday, May 4th, 2010

Here’s Take Two on our PINQ “Differential Privacy In Action” Demo.

Along with a general paring down of the visual interface, we’ve refined how you interact with the application as well as tried to visualize how PINQ is applying noise to each answer.

  • The demo app is no longer modal. Meaning, you don’t have to click a button to switch between zooming in and out of the map, panning around the map and drawing boxes to define query areas. All of this functionality is accessible from the keyboard.
  • You no longer draw boxes to define query areas. Instead, clicking “Ask a Question” plops a box on the map that you can move and resize with the mouse.
  • Additionally, the corresponding PINQ answers update in real-time as you move and resize the query boxes.
  • New thumbnail graphics next to each answer reflect how PINQ generates noisy answers and provide a more immediate sense of the “scale of noise” being applied. (A more detailed explanation of these pointy curves is forthcoming.)

The demo has proven enormously helpful as an aid in explaining our work and our goals. We continue to improve it every time we make use of it, so stay tuned for more to come!

Live Demo: http://demos.commondataproject.org/PINQDemo.html


CDP @ Open Knowledge Conference 2010: A Recap

Thursday, April 29th, 2010

Going into the Open Knowledge Conference, I didn’t know what to expect. Grace had read about them earlier, and we hoped we’d find like-minded people and organizations at the conference, but we didn’t have any personal contacts or references.

As it turned out, the talks and people’s interests overlapped significantly with the work we do, and vice-versa. Here are a few highlights:

  • Rufus Pollock started things off with some background on the Open Knowledge Foundation’s work, which is working towards making knowledge in a broad sense available publicly. This turns out to extend quite a bit beyond our interests in sensitive data (for example, it turns out that lack of bibliographic information often prevents copyright expiration from a practical perspective, because no one can apply the statutes which often include calculations based on author birthdate, author death date and other lesser-known facts, as well as lots of rules that vary by already inconsistent jurisdictions.)
  • Chris Taggart is championing an effort to bring more local government data on-line in the UK.
  • Peter Murray-Rust from Cambridge University made a case for sharing data, and for publishing scientists to clearly state their desire to publishers for the data to be available (which is apparently another copyright issue). He was involved in the creation of the Panton Principles for Open Data in Science (named after a pub in Cambridge).
  • Sören Auer gave a couple of talks on DBpedia.org, which is extracting structured data from Wikipedia. Apparently in Germany, for historical reasons open government data, and open data in general, does not have the public support that it has in the UK and US.
  • After chatting with Sören, I got a chance to chat with Hugh Williams, another attendee, from OpenLink Software, to learn more about how DBpedia’s 300 million RDF triples are hosted on a single instance of their Virtuoso server, an RDFDB variant. This is possible thanks to 64-bit architectures, and was not feasible in the early days of RDFDB when I was working at Epinions. I’m curious to learn more about how a MapReduce-type mechanism could sit on top of an RDFDB store.
  • Jordan Hatcher gave a really interesting talk (a shorter version of this talk from the OSSAT). One thought it prompted is that the way in which we’re proposing to “release” sensitive data to the public is more akin to the way online companies use data to drive their services and less like open government efforts where the data is literally given away. We’re never going to actually hand over any data. We’re only ever going to provide “noisy” descriptions of the data in response to queries. (This topic deserves its own post and we’ll definitely want to chat with him once we have our thoughts better organized.)
  • Jeni Tennison gave an interesting talk on the technical/practical challenges of scaling Open Data, which made me think (in relation to Jordan Hatcher’s talk) that we should consider a scenario where we allow for distributed storage of data behind the datatrust API, as this may simplify some of the legal constraints that we will run into.
  • Thomas Schandl gave a neat demo of Pool Party, which is a nice thesaurus system for managing linked data, and could be useful for managing a datastore with distinct and diverse data sources.
  • Stuart Harrison gave a talk on the data that the local UK government that he works for (Lichfield District) is releasing to try and help engage with the community. They have been able to release a fair bit of data, although privacy and sensitivity of data do seem to be becoming part of the challenges they are facing in doing so. It would be interesting to follow up with him as well.
  • Victor Henning & Jan Reichelt gave an interesting presentation about Mendeley – a self-proclaimed Last.fm for research papers. It seems to me that they are already or will soon be running into interesting questions around who owns the data they collect from their users, as well as expectations around user privacy. Their site says “academic software for research papers” but they seemed to be saying that they would be selling their data in some form.
  • Karin Christiansen gave an interesting talk about the issue of transparency in international aid. Apparently there are real challenges identifying corruption, redundant aid and measuring impact because there’s no centralized view of where everyone’s aid goes. For example, apparently there are 27 different departments/commissions/etc within the US government dispersing international development aid. Apparently a major donation will change hands 6 times before reaching its intended destination, so tracking the money can be very hard. She is the director of http://publishwhatyoufund.org/ which is hoping to address this. This was an interesting talk and an interesting problem, though I didn’t see an immediate CDP-relevancy.
  • Helen Turvy from the Shuttleworth Foundation made an announcement that to my ears said “if you are involved in making data available to the public somewhere on this planet, we want to help you”. Unfortunately I didn’t get a chance to chat with her at the conference, but we definitely need to follow up with her. Her characterization of the kinds of projects the Shuttleworth Foundation funded contrasted sharply with other foundations we’ve looked at in that they are happy to support general purpose “the more data the better” solutions, as opposed to projects that address a specific problem (e.g. homelessness, pollution). As an all-purpose solution to making sensitive data safe for public access, we’ve been hard-pressed to find funders like Shuttleworth.
  • Another item that came up during the day, possibly more than once though I’m not sure from where, was the idea that increasingly organizations, and/or parts of the government are starting to think about having data be “open by default” – in order to save money dealing with Freedom of Information Act requests!! (The UK has a similar concept to the US one by the sound of things.) This is exciting because if the datatrust can provide a cheap way for organizations to meet disclosure obligations, cost might actually help drive adoption.

Finally, my talk went well (many thanks to Mimi and Grace) and the new demo looked great (many thanks there to Tony). We’ll have a post up on the new demo shortly. The fact that we were talking about releasing sensitive data made us fairly unique at the conference and, for many attendees, very interesting for future stages of open data initiatives.

I got a chance to chat with several different people running into sensitive data disclosure challenges, most of whom today run into an all-or-nothing decision point: some governing body ends up deciding whether the data in question can be disclosed or not. Allowing a differential-privacy-style analysis of the data, with no actual records being disclosed, is not part of the discussion. As a result, valuable data is not being opened up for reasons that we hope to soon show are no longer technically valid.

To fellow OKCon folks, we look forward to being a more active part of the community, and to bringing more attention to the sensitive data scenarios! As I said during my short talk, anyone with interesting sensitive data sharing scenarios, please contact us so we can see if our work can be of use to you.

PINQ Privacy Demo

Thursday, January 7th, 2010

Editor’s Note: Tony Gibbon is developing a datatrust demo as an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Tony’s work, like Grant’s, could have interesting implications for CDP’s mission, as it would use technologies that could enable more disclosure of personal data for public re-use.  We’re happy to have him guest blogging about the demo here.

Back in August, Alex wrote about the PINQ privacy technology and noted that we would be trying to figure out what role it could play in the datatrust.  The goal was to build a demo of PINQ in action and get a better understanding of PINQ and its challenges and quirks in the process.  We settled on a quick-and-dirty interactive demo to try to demonstrate the answers to the following.

What does PINQ bring to the table?

Before we look at the benefits of PINQ, let’s first take a look at the shortcomings of one of the ways data is often released with an example taken from the CDC website.

This probably isn’t the best example of a compelling dataset, but it is a good example of the lack of flexibility of many datasets that are available—namely that the data is pre-bucketed and there is a limit to how far you are able to drill down on the data.

On one hand, the limitation makes sense:  If the CDC allowed you (or your prospective insurance company) to view disease information at street level, the potential consequences would be quite frightening.  On the other hand, they are also potentially limiting the value of the data.  For example, each county is not necessarily homogeneous.  Depending on the dataset, a researcher may legitimately wish to drill down without wanting to invade anyone’s privacy, for example to compare urban vs. suburban incidence.

This is where PINQ shines—it works in both these cases.  PINQ allows you to execute an arbitrary aggregate query (meaning I can ask how many people are wearing pink, but I can’t ask PINQ to list the names of people wearing pink) while still protecting privacy.

Let’s turn to the demo.  (Note: the data points in the demo were generated randomly and do not actually indicate people or residences, much less anything about their health.)  The quickest, most visual arbitrary query we came up with is drawing a rectangle on a map and counting each data point that falls inside, so we placed hundreds of “sick” people on a map to let users count them.  (Keep in mind that the arbitrariness of a PINQ query need not be limited to location on a map.  It could be numerical like age, textual like name, include multiple fields etc.)
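To make the rectangle query concrete, here is a minimal sketch in plain Python. It is not the demo’s code and does not use PINQ itself; it assumes Laplace-distributed noise, which is the usual mechanism for differentially private counts. The function names, the `epsilon` value and the random data are all made up for illustration.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centred on 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def true_count(points, box):
    # box = (west, south, east, north); count the points that fall inside it.
    west, south, east, north = box
    return sum(1 for x, y in points
               if west <= x <= east and south <= y <= north)

def noisy_count(points, box, epsilon, rng):
    # Adding or removing one person changes a count by at most 1,
    # so Laplace noise with scale 1/epsilon is used for a count query.
    return true_count(points, box) + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(2010)
# Hypothetical data: 500 randomly placed "sick" people on a 10 x 10 map.
people = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(500)]
query_box = (2.0, 2.0, 6.0, 6.0)
print("noisy answer:", round(noisy_count(people, query_box, 0.1, rng), 1))
```

The person asking only ever sees the output of `noisy_count`; the exact count stays behind the query interface, which is also why the answers come back with fractional people.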

Now let’s attempt to answer the researcher’s question.  Is there a higher incidence of this mysterious disease in urban or suburban areas?  For the sake of simplicity, we’ll pretend he’s particularly interested in two similarly populated, conveniently rectangular areas: one in Seattle and the other in a nearby suburb as shown below:

An arbitrary query such as this one is clearly not possible with data that is pre-bucketed such as the diabetes by county.  Let’s take a look at what PINQ spits out.

We get an “answer” and a likely range.  (The likely range is actually an input to the query, but that’s a topic for another post.)  So what does this mean? Are there really 311.3 people in Seattle with the mysterious disease?  Why are there partial people?

PINQ adds a random amount of noise to each answer, which prevents us from being able to measure the impact of a single record in the dataset.  The PINQ answer indicates that about 311 people (plus or minus noise) in Seattle have the disease.  The noise, though randomly generated, is likely to fall within a particular range, in this case 30.  So the actual number is likely to be within 30 of 311, while the actual number of those in the nearby suburb with the disease is likely to be within 30 of 177.
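How wide the likely range is depends on how much noise is added. As a rough sketch, assume the noise is Laplace-distributed (the usual choice for differentially private counts) and assume, purely for illustration, that “likely” means 95% of the time; the demo’s actual threshold isn’t stated here. Under those assumptions the relationship between the range and the noise scale can be worked out like this:

```python
import math

def laplace_scale_for_range(likely_range, confidence):
    # For Laplace noise with scale b, P(|noise| <= r) = 1 - exp(-r / b).
    # Solving for b gives the scale that keeps the noise within
    # `likely_range` with probability `confidence`.
    return likely_range / math.log(1.0 / (1.0 - confidence))

def prob_within(likely_range, scale):
    # Probability that Laplace noise with this scale stays inside the range.
    return 1.0 - math.exp(-likely_range / scale)

# Hypothetical reading of the demo: a likely range of 30 at 95% confidence.
scale = laplace_scale_for_range(30, 0.95)
print("noise scale:", scale)
print("chance of landing inside the range:", prob_within(30, scale))
print("chance of landing outside the range:", 1.0 - prob_within(30, scale))
```

The point of the sketch is that “likely” is not “always”: some fraction of answers will land outside the stated range, and a narrower range means less noise and therefore weaker privacy protection.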

Given these numbers (and ignoring the oversimplification and silliness of his question), the researcher could conclude that the incidence in the urban area is higher than the suburban area.  As a bonus, since this is a demo and no one’s privacy is at stake, we can look at the actual data and real numbers:

The answers from PINQ were in fact pretty close to the real answer.  We got a little unlucky with the Seattle answer as the actual random noise for that query was slightly greater than the likely range, but our conclusion was the same as if we had been given the real data.

But what about the evil insurance company/ employer/ neighbor?

By now, you’re hopefully starting to see potential value of allowing people to execute arbitrary queries rather than relying on pre-bucketed data, but what about the potential harm?  Let’s imagine there’s a high correlation between having this disease and having high medical costs.  While you might want your data included in this dataset so it could be studied by someone researching a cure, you probably don’t want it used to discriminate against you.

To examine this further, let’s zoom in and ask about the disease at my house.  PINQ only allows questions with aggregate answers, so instead of asking “does Tony have the disease?” we’ll ask, “how many people at Tony’s house have the disease?”

You’ll notice, unlike the CDC map, PINQ doesn’t try to stop me from asking this potentially harmful, privacy-infringing question.  (I don’t actually live there.)  PINQ doesn’t care if the actual answer is big or small, or if I ask about a large or small area, it just adds enough noise to ensure the presence or absence of a single record (in this case person) doesn’t have an effect on your answers.

PINQ’s answer was “about 2.4, with likely noise within  +/- 5”  (I dialed down the likely noise to +/-5 for this example).  As with all PINQ answers, we have to interpret this answer in the context of my initial question: “Does Tony have the disease?”  Since the noise added is likely to be between -5 and +5, and a count of people can’t be negative or fractional, the real answer is likely to be between 0 and 7, inclusive, and we can’t draw any strong conclusions about my health because the noise overwhelms the real answer.
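The arithmetic behind that 0-to-7 range, as a small Python sketch (the function name is made up for illustration):

```python
import math

def plausible_counts(noisy_answer, likely_noise):
    # The real count is a whole number of people and can't be negative,
    # so clip the interval [answer - noise, answer + noise] accordingly.
    low = max(0, math.ceil(noisy_answer - likely_noise))
    high = math.floor(noisy_answer + likely_noise)
    return list(range(low, high + 1))

print(plausible_counts(2.4, 5))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Every value from “nobody here has it” up to “seven people here have it” is consistent with the answer, which is exactly why it says so little about any one household.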

Another way of looking at this is that we get similarly inconclusive answers when we try to attack the privacy of both the infected and the healthy.  Below I’ve made the diseased areas visible on the map and we can compare the results of querying me and my neighbor, only one of whom is infected:

Keep in mind that my address may not be in the dataset because I’m healthy or because I chose not to submit my information.  In either case, the noise causes the answer at my house to be indistinguishable from the answer at my neighbor’s address, and our decisions to be included or excluded from the dataset do not affect our privacy.  Of equal importance from the first example, the addition of this privacy preserving noise does not preclude the extraction of potentially useful answers from the dataset.
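The “indistinguishable” claim can be made a little more precise. Differential privacy bounds how much more likely any given answer is when one person’s record is present versus absent: the ratio is at most e^epsilon. The sketch below checks this for a count with Laplace noise of scale 1/epsilon. It is plain Python rather than PINQ, and the `epsilon` value is a hypothetical setting, not the one used in the demo.

```python
import math

def laplace_pdf(x, center, scale):
    # Density of a Laplace distribution centred on `center`.
    return math.exp(-abs(x - center) / scale) / (2.0 * scale)

def likelihood_ratio(observed, count_a, count_b, epsilon):
    # How much more likely `observed` is if the true count is count_a
    # rather than count_b, when noise of scale 1/epsilon is added.
    scale = 1.0 / epsilon
    return (laplace_pdf(observed, count_a, scale)
            / laplace_pdf(observed, count_b, scale))

epsilon = 0.6  # hypothetical privacy setting
ratio = likelihood_ratio(2.4, 1, 0, epsilon)  # infected house vs. healthy house
print("likelihood ratio:", ratio)
print("upper bound e^epsilon:", math.exp(epsilon))
```

Whatever answer comes back, it is never more than e^epsilon times likelier under “one infected person lives here” than under “nobody infected lives here”, so an attacker’s beliefs can only shift by a bounded amount.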

You can play with the demo here (requires Silverlight).

The Common Data Project’s first symposium

Friday, November 13th, 2009


This past weekend, the Common Data Project held its first “symposium,” an informal but very productive gathering of friends who are involved in various projects related to CDP’s work.  We’re scattered across the country, so we felt lucky that we were able to convene in San Francisco, share what we’ve been working on, and learn from each other.

In two days, we managed to cover a dizzying array of topics:

  • Public data sets today and how they could be better, presented by Grace Meng;
  • Discussion on lessons from other institutions in organizational trust-building;
  • Demo of PINQ, a new technology implementing differential privacy, built by Tony Gibbon;
  • Review of what would be needed to build a datatrust prototype by Grant Baillie;
  • Intense Q&A session with Frank McSherry, Tony Gibbon, and Grant Baillie on how PINQ does and doesn’t protect privacy;
  • Brainstorms around CDP’s potential participation in the BigApps contest and the Conference on Ethical Guidance for Research and Application of Pervasive and Autonomous Information Technology (PAIT) in Cincinnati in March, led by Alex Selkirk and Mimi Yin;
  • An overview of how far CDP has come and where it might go, learning from case studies of other organizations, by business professor Geoff Desa.

Not all of us are working on projects that will be immediately used by CDP, but we’re all thinking about issues and ideas that we’re sure will eventually be extremely relevant to CDP’s mission.  In many ways, we have a daunting list of things we need to accomplish.  But after this weekend, we also have a surer sense of where we need to go next.  Stay tuned over the next couple of weeks for more detailed blog posts on some of our ideas.

A big thanks to Rebecca Widiss for donating her talent in group facilitation, and to everyone for contributing their time and effort to help push CDP forward.

“How to Read a Privacy Policy” published in IAPP newsletter

Tuesday, October 20th, 2009


We’re pleased to announce that our report, “How to Read a Privacy Policy,” has been published in the October newsletter of Inside 1to1: Privacy, a publication produced by the International Association of Privacy Professionals (IAPP) and the Peppers & Rogers Group.

Our report, first published on our website in July, provides a “how to read” guide for the user who is curious about what is happening to his or her data online, but has little understanding of the technological and legal mechanisms at work.  The report walks through seven questions meant to pinpoint the issues CDP believes are most crucial for a user’s privacy, from questions on how “personal information” is defined to the kind of choices offered to users regarding how their information is shared.

We’d love to hear what you think!

What have we been doing?

Monday, October 19th, 2009

We’ve been silent for a while on the blog, but that’s because we’ve been distracted by actual work building out the datatrust (both the technology and the organization).

Here’s a brief rundown of what we’re doing.

Grace is multi-tasking on 3 papers.

Personal Data License: We’re conducting a thought experiment to think through what the world might look like if there were an easy way for individuals to release personal information on their own terms.

Organizational Structures: We’ve conducted a brief survey of a few organizational structures we think are interesting models for the datatrust: “trusted” entities from banks to public libraries, and “member-based” organizations from credit unions to Wikipedia. We tried to answer the question: What institutional structures can be practical defenses against abuses of power as the datatrust becomes a significant repository of highly sensitive personal information?

Snapshot of Publicly Available Data Sources: A cursory overview of some of the more interesting data sets that are available to the public from government agencies, to answer the question: How is the datatrust going to be different / better than the myriad data sources we already have access to today?

We also now have 2 new contributors to CDP: Tony Gibbon and Grant Baillie.

A couple of months ago, Alex wrote about a new anonymization technology coming out of Microsoft Research: PINQ. It’s an elegant, simple solution, but perhaps not the most intuitive way for most people to think about guaranteeing privacy.

Tony is working on a demonstration of PINQ in action so that you and I can see how our privacy is protected and therefore believe *that* it works. Along the way, we’re figuring out what makes intuitive sense about the way PINQ works and what doesn’t and what we’ll need to extend so that researchers using the datatrust will be able to do their work in a way that makes sense.

Grant is working on a prototype of the datatrust itself which involves working out such issues as:

  • What data schemas will we support? We think this one to begin with: Star Schema.
  • How broadly do we support query structures?
  • Managing anonymizing noise levels.

To help us answer some of these questions, we’ve gathered a list of data sources we think we’d like to support in this first iteration (e.g. IRS tax data, Census data). More to come on that.

We will be blogging about all of these projects in the coming week, so stay tuned!

What does it mean to be a 501(c)(3) nonprofit organization?

Thursday, August 13th, 2009


The Common Data Project is pleased to announce that we have been officially recognized by the IRS as a 501(c)(3) tax-exempt organization! In other words, your donation to CDP is now tax-deductible, and you can donate to us here.

But what does it really mean to be a nonprofit organization? And what does it mean for us at the Common Data Project?

A nonprofit organization is an organization that is motivated by goals other than the making of a profit. It’s a pretty circular definition, I know. Being a nonprofit doesn’t mean that an organization can’t make money. Yale University, for example, is a nonprofit, and the people who manage its endowment have made it very, very rich. And an organization certainly isn’t a nonprofit just because it offers services for free. Many of Google’s services are free, and it is clearly not a nonprofit.

The definition of the type of nonprofit recognized by the Internal Revenue Service as tax-exempt under Section 501(c)(3) of the Internal Revenue Code is a little more specific. The organization must be organized for one or more tax exempt purposes, which include “charitable, religious, educational, scientific, literary, testing for public safety, fostering national or international amateur sports competition, and preventing cruelty to children or animals.”

We applied for recognition as an organization with a primarily educational purpose, as we work to change public perception and understanding of privacy issues and how they impact our ability to share information. But the IRS’s definition of a 501(c)(3) tax-exempt organization doesn’t quite encompass who we are and why we are a nonprofit.

First and foremost, the datatrust we envision must have a public-serving mission, rather than a profit-driven motive.

We know that we could be a business that provides information services. We have no illusions that a nonprofit necessarily does more “good” than a business.  But we plan to build a datatrust, a repository for anonymized datasets that are available for useful and innovative applications by the general public.  We are trying to create a completely new model for data collection, where the people who donate data also get the value of data in return.  A datatrust that is built on the goals of sharing, transparency, and accountability cannot accept the donations of people and organizations and then monetize that data for profit.

Of course, it’s not enough to declare that the datatrust will benefit the public, nor is it enough to be recognized as a 501(c)(3) organization by the IRS.

We’ll have to work hard to create a datatrust that everyone can believe in. In the same way museums, public libraries, and even online spaces like Wikipedia imbue their users with a feeling of public sharing and respect, we hope the datatrust will engender a sense of community.

You can read more about our goals here, but ultimately, we hope to have a continuing dialogue with you on our goals and our plans as we keep working to create a trustworthy, transparent datatrust organization.
