Archive for the ‘CDP Announcements’ Category

The Common Data Project’s first symposium

Friday, November 13th, 2009

This past weekend, the Common Data Project held its first “symposium,” an informal but very productive gathering of friends who are involved in various projects related to CDP’s work.  We’re scattered across the country, so we felt lucky that we were able to convene in San Francisco, share what we’ve been working on, and learn from each other.

In two days, we managed to cover a dizzying array of topics:

  • Public data sets today and how they could be better, presented by Grace Meng;
  • Discussion on lessons from other institutions in organizational trust-building;
  • Demo of PINQ, a new technology implementing differential privacy, built by Tony Gibbon;
  • Review of what would be needed to build a datatrust prototype by Grant Baillie;
  • Intense Q&A session with Frank McSherry, Tony Gibbon, and Grant Baillie on how PINQ does and doesn’t protect privacy;
  • Brainstorms around CDP’s potential participation in the BigApps contest and the Conference on Ethical Guidance for Research and Application of Pervasive and Autonomous Information Technology (PAIT) in Cincinnati in March, led by Alex Selkirk and Mimi Yin;
  • An overview of how far CDP has come and where it might go, learning from case studies of other organizations, by business professor Geoff Desa.

Not all of us are working on projects that will be immediately used by CDP, but we’re all thinking about issues and ideas that we’re sure will eventually be extremely relevant to CDP’s mission.  In many ways, we have a daunting list of things we need to accomplish.  But after this weekend, we also have a surer sense of where we need to go next.  Stay tuned over the next couple of weeks for more detailed blog posts on some of our ideas.

A big thanks to Rebecca Widiss for donating her talent in group facilitation, and to everyone for contributing their time and effort to help push CDP forward.

“How to Read a Privacy Policy” published in IAPP newsletter

Tuesday, October 20th, 2009

We’re pleased to announce that our report, “How to Read a Privacy Policy,” has been published in the October newsletter of Inside 1to1: Privacy, a publication produced by the International Association of Privacy Professionals (IAPP) and the Peppers & Rogers Group.

Our report, first published on our website in July, provides a “how to read” guide for the user who is curious about what is happening to his or her data online, but has little understanding of the technological and legal mechanisms at work.  The report walks through seven questions meant to pinpoint the issues CDP believes are most crucial for a user’s privacy, from questions on how “personal information” is defined to the kind of choices offered to users regarding how their information is shared.

We’d love to hear what you think!

What have we been doing?

Monday, October 19th, 2009

We’ve been silent for a while on the blog, but that’s because we’ve been distracted by actual work building out the datatrust (both the technology and the organization).

Here’s a brief rundown of what we’re doing.

Grace is multi-tasking on 3 papers.

Personal Data License: We’re conducting a thought experiment to think through what the world might look like if there were an easy way for individuals to release personal information on their own terms.

Organizational Structures: We’ve conducted a brief survey of a few organizational structures we think are interesting models for the datatrust: “trusted” entities, from banks to public libraries, and “member-based” organizations, from credit unions to Wikipedia. We tried to answer the question: What institutional structures can be practical defenses against abuses of power as the datatrust becomes a significant repository of highly sensitive personal information?

Snapshot of Publicly Available Data Sources: A cursory overview of some of the more interesting data sets that government agencies make available to the public, to answer the question: How is the datatrust going to be different from, or better than, the myriad data sources we already have access to today?

We also now have 2 new contributors to CDP: Tony Gibbon and Grant Baillie.

A couple of months ago, Alex wrote about a new anonymization technology coming out of Microsoft Research: PINQ. It’s an elegant, simple solution, but perhaps not the most intuitive way for most people to think about guaranteeing privacy.

Tony is working on a demonstration of PINQ in action so that you and I can see how our privacy is protected and therefore believe *that* it works. Along the way, we’re figuring out what makes intuitive sense about the way PINQ works, what doesn’t, and what we’ll need to extend so that researchers using the datatrust will be able to do their work in a way that makes sense.

Grant is working on a prototype of the datatrust itself which involves working out such issues as:

  • What data schemas will we support? We think we’ll begin with just one: the star schema.
  • How broadly do we support query structures?
  • Managing anonymizing noise levels.

To help us answer some of these questions, we’ve gathered a list of data sources we’d like to support in this first iteration (e.g., IRS tax data and Census data). More to come on that.
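To make “managing anonymizing noise levels” a little more concrete, here is a minimal sketch of a noisy count query in the spirit of differential privacy. It is not PINQ’s API and not the datatrust prototype: the function name `noisy_count`, the `epsilon` parameter, and the sample records are all hypothetical, chosen only for illustration. The sketch assumes the standard Laplace mechanism, in which a count (which changes by at most 1 when one person is added or removed) gets noise with scale 1/epsilon.

```python
import random


def noisy_count(records, predicate, epsilon, rng):
    """Return the number of matching records plus Laplace noise.

    Hypothetical illustration only. A smaller epsilon means more noise
    (more privacy, less accuracy); a larger epsilon means less noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")

    true_count = sum(1 for record in records if predicate(record))

    # Adding or removing one person changes a count by at most 1,
    # so the noise scale is 1 / epsilon.
    scale = 1.0 / epsilon

    # The difference of two independent exponential samples with mean
    # `scale` follows a Laplace distribution centered at 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


if __name__ == "__main__":
    rng = random.Random()
    # Made-up records standing in for a donated data set.
    people = [{"age": age} for age in range(100)]
    answer = noisy_count(people, lambda p: p["age"] >= 50, epsilon=0.5, rng=rng)
    print("Noisy count of people aged 50 or over:", round(answer, 2))
```

The point of the sketch is the trade-off a researcher would have to reason about: each answer is slightly wrong on purpose, and how wrong depends on a single setting that also controls how much any one person’s data can affect the result.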

We will be blogging about all of these projects in the coming week, so stay tuned!

What does it mean to be a 501(c)(3) nonprofit organization?

Thursday, August 13th, 2009

The Common Data Project is pleased to announce that we have been officially recognized by the IRS as a 501(c)(3) tax-exempt organization! In other words, your donation to CDP is now tax-deductible, and you can donate to us here.

But what does it really mean to be a nonprofit organization? And what does it mean for us at the Common Data Project?

A nonprofit organization is an organization that is motivated by goals other than the making of a profit. It’s a pretty circular definition, I know. Being a nonprofit doesn’t mean that an organization can’t make money. Yale University, for example, is a nonprofit, and the people who manage its endowment have made it very, very rich. And an organization certainly isn’t a nonprofit just because it offers services for free. Many of Google’s services are free, and it is clearly not a nonprofit.

The definition of the type of nonprofit recognized by the Internal Revenue Service as tax-exempt under Section 501(c)(3) of the Internal Revenue Code is a little more specific. The organization must be organized for one or more tax exempt purposes, which include “charitable, religious, educational, scientific, literary, testing for public safety, fostering national or international amateur sports competition, and preventing cruelty to children or animals.”

We applied for recognition as an organization with a primarily educational purpose, as we work to change public perception and understanding of privacy issues and how they impact our ability to share information. But the IRS’s definition of a 501(c)(3) tax-exempt organization doesn’t quite encompass who we are and why we are a nonprofit.

First and foremost, the datatrust we envision must have a public-serving mission, rather than a profit-driven motive.

We know that we could be a business that provides information services. We have no illusions that a nonprofit necessarily does more “good” than a business.  But we plan to build a datatrust, a repository for anonymized datasets that are available for useful and innovative applications by the general public.  We are trying to create a completely new model for data collection, where the people who donate data also get the value of data in return.  A datatrust that is built on the goals of sharing, transparency, and accountability cannot accept the donations of people and organizations and then monetize that data for profit.

Of course, it’s not enough to declare that the datatrust will benefit the public, nor is it enough to be recognized as a 501(c)(3) organization by the IRS.

We’ll have to work hard to create a datatrust that everyone can believe in. In the same way museums, public libraries, and even online spaces like Wikipedia imbue their users with a feeling of public sharing and respect, we hope the datatrust will engender a sense of community.

You can read more about our goals here, but ultimately, we hope to have a continuing dialogue with you on our goals and our plans as we keep working to create a trustworthy, transparent datatrust organization.

NYC Big Apps Competition Brainstorm

Thursday, July 30th, 2009

NYC announced that it too is holding a competition to build applications using (to-be-released) city data. We’re thinking of entering our own app. You can see some of the ideas that are floating around here.

As of yet, it is unclear what criteria will be used to judge the applications, but we decided to go ahead and hold a brainstorm session anyway with a handful of people who are “in the know” about city agencies and who make use of city data in their day-to-day work.

Here are the notes from our meet-up.

Mind Map: Environmental Health

1. Most Persistent Theme: “My Neighborhood”

– What do people actually think of as “in their neighborhood?”

– What services do they use in their neighborhood?

– What’s lacking in their neighborhood?

– What’s aggravating people in their neighborhood? (Complaints, violations, crime.)

2. “Real-time” experience of data is cool.

– What’s around me right now?

– How are my tax dollars being spent, right here?

– Optimize my biking / subway route given conditions right now.

3. Maps are cool.

4. Crossing data sets is cool.

– ER wait times with crime blotter?

– Local immunization with school attendance records?

– Congestion maps with air quality monitors?

5. Design the application around the assumption that the city data is going to be incomplete, and use that as an opportunity to invite individuals to contribute data to help fill in the gaps.

WE ALSO CAME UP WITH SOME GREAT QUESTIONS TO ASK OF THE CITY.

1. Will there be a “guide” of sorts for what data is available? What data has already been released? What data hasn’t been released, but could be released? (For the competition, will there be a point person for answering questions related to releasing data?) What data is most up-to-date and complete?

2. What are the boundaries for what data is and isn’t available? What about data that crosses city boundaries, e.g. MTA data (MTA) or hospital data (NY State)?

3. How “raw” will the data be? (Unprocessed database files? Summary reports?)

4. What are the city’s privacy standards for releasing data?

– How will you ensure that individual identities won’t be revealed, even if city data is cross-referenced with other data sets?

– Many public records include individual identities and addresses. In a world where collating public records about any given individual across city agencies and the public domain was time-consuming and difficult, this was not a privacy issue. However, given the ease with which such a task is done today, how will the city protect individuals from having “complete” profiles of their lives assembled and made available online for public scrutiny?

NEXT STEPS

We have submitted our questions to NYCDataRFEI@nycedc.com, but got a “server rejected your message” error email in response. We called and left a message but haven’t heard back. Answers, which were promised for Friday Jul 24th, have yet to be posted to their website.

We have also decided to take the idea of “Subscribe to updates about the services I use in my neighborhood” and push it to the next level of design to see what roadblocks we run into. It will also help us zero in on what data to request of the city in September. We still don’t know what we’ll actually end up submitting to the competition, but either way, we will learn much by pushing one idea to the next stage of development, even if we end up having to go back to the drawing board.

Healthcare Stories (and Data) for America

Wednesday, July 22nd, 2009

Healthcare For Everyone: An online data collection forum.

As part of our work, we’re experimenting with ways to motivate people to donate data about themselves, either to further a cause they believe in or simply to better understand their situation relative to others. We’ve built a demo site around the issue of healthcare and would love for you to try it out and give us some feedback!

Of course, we’re not the first people to have a go at organizing people around healthcare online. The Obama administration’s recently launched Healthcare Stories for America is particularly well-done, with an interactive map and community-driven mechanism for highlighting especially interesting contributions.

Healthcare Stories for America

Still, we wonder if we shouldn’t take it up a level.

Our belief is that to make a compelling case for any issue, especially one as complex and multi-faceted as healthcare, you need both:

  1. Real stories to humanize the problem; and
  2. Hard data to contextualize those stories and provide handles for understanding the size and shape of the problem.

It is no accident that the interactive map is the most prominent way to browse “Healthcare Stories for America.” We think there should be even more “data-driven” ways to soak up what people are contributing. Far from being a turn-off, we believe that asking people to give more data can make community forums more compelling and more engaging, provided people understand what they’re giving up, who they’re giving it up to, and how it will be used, and see an immediate pay-off for themselves.

The basic assumption of the site is simple: Give data. Get more data.

The visualization below is a static example graph of how we imagine more “data-driven” ways to consume and make sense of the kind of information people are sharing on forums like “Healthcare Stories.” See full-sized graph here.

Static Example Visualization

Hard data is also especially important when there is profound disagreement over the nature of a problem. This is the drama that is playing out in Washington right now over healthcare: hard-won, at times grudging agreement that reform is necessary, counteracted by entrenched disagreement over exactly what the problem is and how it should be addressed.

So please take a look at what we’ve done. Browse around. Contribute your story. Our privacy guarantee is simple. Either you choose to make your information public, or you don’t. If you don’t, we will never release your data except in aggregate form, like this and this.

Let us know: What’s compelling to you? What’s not? What would you like to see more of? Less of? What questions do you have about how the site works? Feel free to post your comments and questions here or email us directly at info [at] commondataproject [dot] org.

Introducing a new blogger, Ilya Marritz

Tuesday, January 20th, 2009

We’re pleased to announce that Ilya Marritz will be contributing to our blog.  Ilya is a journalist based in Brooklyn, and he reports for public radio on energy, the environment, and the economy.  We’ve always planned for this blog to become a forum for engaged and thoughtful debate on how information-sharing and privacy issues are relevant to all of us. We’re excited to be adding a new voice and perspective, and we look forward to hearing your thoughts on Ilya’s posts as well.

Trying to “show, not tell” CDP’s values

Tuesday, January 6th, 2009

Let’s be honest—it’s not easy to explain what we at the Common Data Project are trying to do.

It’s been a year since we incorporated as a nonprofit organization, and over the past year, we’ve had conversations with a lot of people, from media professors to actuaries, about why we decided to found this organization.  Different people have been excited about possibilities in different areas.  A friend who works in housing advocacy saw possibilities in addressing the subprime mortgage crisis; a law professor saw possibilities in analyzing federal tax policy.  It’s what makes our work exciting—that it can be applicable to so many contexts—but it’s also what makes it difficult to explain in simple terms.

So we’ve decided to follow our grade school English teacher’s advice: “SHOW, don’t tell.”  Instead of trying to describe what we want to do, we hope to demonstrate our information and privacy values through the launch of a new web-based application.

The site will be focused on the issue of healthcare reform, and we will be giving people a new way to voice their support for comprehensive, effective healthcare reform in this country.  It’s an issue that we’re passionate about, and we know other people are passionate about it too.  Even before the Obama transition team began holding community discussions on healthcare, we were amazed at how much people were already talking about healthcare in deeply personal ways.  There is already so much organized energy around this issue, with groups and communities working together to accomplish their goals, that we could see a real value in providing a new outlet for that energy.  Although the issue touches upon health, one of the most sensitive and private areas of people’s lives, it’s also an area in which the value of sharing information is so obvious that people have been trying new, imaginative things to make that sharing happen.

So what does all this have to do with “real privacy, more data”?  Stay tuned for more.

DIMACS Workshop on Internet Privacy

Thursday, September 25th, 2008

Intuitive as a door

Slide from our presentation; image from Harpeth Presbyterian Church

The Common Data Project recently attended the DIMACS Workshop on Internet Privacy at Rutgers University.  Since we’d already introduced the basic idea of a datatrust at the last DIMACS workshop we attended in February, we decided to do a presentation on a more specific aspect of our work—how an individual user might interact with the datatrust.  We want to create a new paradigm, a completely new way for individuals to collect their own personal information and share it with others—whether friends, researchers, or businesses—in ways individuals dictate.  Alex emphasized how such a model must be more intuitive than the opt-in/opt-out models available today, and walked through how this might be possible.

Given that the topic “Internet Privacy” covers a range of issues, the workshop drew a diverse group of participants. We heard a presentation by Adam Smith at Penn State University on differential privacy, a new area of research that we’ve been interested in for some time now, with the hope that it could be useful to our datatrust.  Daniel Howe from NYU and Felipe Saint-Jean from Yale presented on TrackMeNot and Private Web Search, two different approaches to obscuring identification by search engines, leading to an intense discussion on the ethics of purposefully messing with the business model of Google and the other search engines.  EJ Jung from the University of Iowa gave a fascinating talk on the ways controls have been placed on access to data in the Medical Image File Archive (MIFAR) at the Radiology Department.  We found her talk particularly compelling, as her project deals very practically with existing data and the obvious needs of doctors, researchers, and patients.  Solon Barocas at NYU, who also spoke on our panel, shared his research on how data-mining is used by political campaigns for voter profiling, which raises interesting and possibly troubling implications for democracy.

We were also struck by Naftaly Minsky’s presentation on preventing servers from abusing their clients, as he discussed the possibility of hypothetical “trusted third parties” to act as intermediaries between individuals with information and businesses and other organizations that seek information.  His description of the “trusted third party” seemed to us somewhat similar to our conception of a datatrust.  We’re looking forward to exploring further how his research, as well as the other research we learned about, could shape our work.

Upcoming CDP Presentation at DIMACS

Tuesday, September 16th, 2008

The Common Data Project is excited to announce we will be presenting at the DIMACS conference this week.  Officially called the “Workshop on Internet Privacy: Facilitating Seamless Data Movement with Appropriate Control,” the conference is organized by Dan Boneh, Ed Felten, and Helen Nissenbaum.

Alex Selkirk will be speaking on a panel on Thursday, September 18, called, “Aggregation, Mining, Profiling: Who should be in control?”  We’re looking forward to the feedback we’ll get at the conference, as we’re eager to share our ideas and learn from others who are on the program.  We’ll provide more information on our presentation after the conference, and we look forward to hearing your thoughts.

