Archive for the ‘CDP Announcements’ Category

Common Data Project Enters Knight News Challenge

Thursday, June 21st, 2012

Knight Foundation

We’ve entered the Knight News Challenge, a competition run by the Knight Foundation with the following goal:

The Knight News Challenge accelerates media innovation by funding breakthrough ideas in news and information. Winners receive a share of $5 million in funding and support from Knight’s network of influential peers and advisors to help advance their ideas.

The proposal process is all about brevity, so we had to encapsulate the next phase of datatrust development in a few short sentences.

The CDP Private Map Maker v0.2

Wednesday, April 27th, 2011

We’ve released version 0.2 of the CDP Private Map Maker – A new way to release sensitive map data! (Requires Silverlight.)


Speedy, but is it safe?

Today, releasing sensitive data safely on a map is not a trivial task. The common anonymization methods tend to either be manual and time consuming, or create a very low resolution map.

Compared to current manual anonymization methods, which can take months if not years, our map maker leverages differential privacy to generate a map programmatically in much less time. For the sample datasets included, this process took a couple of minutes.

However, speed is not the map maker’s most important feature; safety is, through the ability to quantify privacy risk.

Accounting for Privacy Risk, Literally and Figuratively

We’re still leveraging the same differential privacy principles we’ve been working with all along. Differential privacy not only allows us to (mostly) automate the process of generating the maps, it also allows us to quantitatively balance the accuracy of the map against the privacy risk incurred when releasing the data.  (The purpose of this post is not to discuss whether differential privacy works; it’s an area of privacy research that has been around for several years, and there are others better equipped to defend its capabilities.)

Think of it as a form of accounting. Rather than buying what appears to be cost-effective and hoping for the best, you can actually see the price of each item (privacy risk) AND know how accurate it will be.

Previous implementations of differential privacy (including our own) have done this accounting in code. The new map maker provides a graphical user interface so you can play with the settings yourself.
More details on how this works below.
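To make the accounting concrete, here’s a rough “price list” for a single count query protected with Laplace noise (an illustrative Python sketch, not the map maker’s actual code; the exact numbers depend on implementation details): each choice of privacy cost epsilon buys a predictable accuracy.

```python
import math

# For a count query (sensitivity 1) protected with Laplace noise, the noise
# scale is 1/epsilon and P(|noise| > t) = exp(-t * epsilon), so the 95%
# margin of error is ln(20) / epsilon.
def margin_of_error(epsilon, confidence=0.95):
    return math.log(1.0 / (1.0 - confidence)) / epsilon

for epsilon in (0.05, 0.1, 0.5, 1.0):
    print(f"privacy cost {epsilon}: answers accurate to about "
          f"+/- {margin_of_error(epsilon):.0f} (95% of the time)")
```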

Compared to v0.1

Version 0.2 updates our first test-drive of differential privacy.  Our first iteration allowed you to query the number of people in an arbitrary region of the map, returning meaningful results about the area as a whole without exposing individuals in the dataset.

The flexibility that application provided as compared to pre-bucketed data is great if you have a specific question, but the workflow of looking at a blank map and choosing an area to query doesn’t align with how people often use maps and data.  We generally like to see the data at a high level, and then dig deeper as needed.

In this round, we’re aiming for a more intuitive user experience. Our two target users are:

  1. Data Releaser: The person releasing the data, who wants to make intelligent decisions about how to balance privacy risk and data utility.
  2. Data User: The person trying to make use of the data, who would like a general overview of a data set before delving in with more specific questions.

As a result, we’ve flipped our workflow on its head. Rather than providing a blank map for you to query, the map maker now immediately produces populated maps at different levels of accuracy and privacy risk.

We’ve also added the ability to upload your own datasets and choose your own privacy settings to see how the private map maker works.

However, please do not upload genuinely sensitive data to this demo.

v0.2 is for demonstration purposes only. Our hope is to create a forum where organizations with real data release scenarios can begin to engage with the differential privacy research community. If you’re interested in a more serious experiment with real data, please contact us.

Any data you do upload is available publicly to other users until it is deleted. (You can delete any uploaded dataset through the map maker interface.) The sample data sets provided cannot be deleted and were synthetically generated; please do not use the sample data for any purpose other than seeing how the map maker works. The data is fake.

You can play with the demo here. (Requires Silverlight.)

Finally, a subtle but significant change we should call out: our previous map demo leveraged an implementation of differential privacy called PINQ, developed at Microsoft Research.  Creating the grids for this map maker required a different workflow, so we wrote our own implementation to add noise to the cell counts, using the same fundamentals of differential privacy.
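Roughly speaking, the primitive looks something like this (an illustrative Python sketch, not the code the map maker actually runs): a true count plus Laplace noise calibrated to the privacy parameter epsilon.

```python
import numpy as np

def noisy_count(true_count, epsilon):
    """A differentially private count: the true count plus Laplace noise
    with scale 1/epsilon (a count has sensitivity 1, since adding or
    removing one person changes it by at most 1)."""
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

print(noisy_count(42, epsilon=0.5))  # e.g. 44.7; the exact value varies per run
```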

More Details on How the Private Map Maker Works

How exactly do we generate the maps?

One Option – Nudge each data point a little

The key to differential privacy is adding random noise to each answer.  Differential privacy only returns aggregates, so we can’t ask it to ‘make a data point private’, but what if we added noise to each data point by moving it slightly?  The person consuming the map then wouldn’t know exactly where the data point originated, making it private, right?

The problem with this process is that we can’t automate adding this random noise because external factors might cause the noise to be ineffective.  Consider the red data point below.

If we nudge it randomly, there’s a pretty good chance we’ll nudge it right into the water.  Since there aren’t residences in the middle of Manhasset Bay, this could significantly narrow down the possibilities for the actual origin of the data point.  (One of the more problematic scenarios is pictured above.)  And water isn’t the only issue—if we’re dealing with residences, nudging into a strip mall, school, etc. could cause the same problem.  Because of these external factors, the process is manual and time consuming.  On top of that, unlike differential privacy, there’s no mathematical measure of how much information is being divulged—you’re relying on manual review to catch any privacy issues.

Another Option – Grids

As a compromise between querying a blank map, and the time consuming (and potentially error prone) process of nudging data points, we decided to generate grid squares based on noisy answers—the darker the grid square, the higher the answer.  The grid is generated simply by running one differential privacy-protected query for each square.  Here’s an example grid from a fake dataset:

“But Tony!” you say, “Weren’t you just telling us how much better arbitrary questions are as compared to the bucketing we often see?”  First, this isn’t necessarily meant to replace the ability to ask arbitrary questions; it’s another tool that lets you see the data first.  And second, compared to the way released data is often pre-bucketed today, we’re able to offer more granular grids.
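For the mechanics behind those grids, here’s a minimal sketch in Python (hypothetical code using numpy, not the map maker’s own Silverlight implementation): bin the points into cells, then release one Laplace-noised count per cell. Because the cells are disjoint, each record lands in exactly one cell, so the whole grid can be released for a single privacy charge (a standard property of differential privacy for disjoint partitions).

```python
import numpy as np

def private_grid(lats, lons, cell_size_deg, epsilon):
    """Bin points into a lat/lon grid and add Laplace noise to every cell
    count. Only the noisy grid is ever released, never the points."""
    lat_edges = np.arange(lats.min(), lats.max() + cell_size_deg, cell_size_deg)
    lon_edges = np.arange(lons.min(), lons.max() + cell_size_deg, cell_size_deg)
    counts, _, _ = np.histogram2d(lats, lons, bins=[lat_edges, lon_edges])
    return counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)

# Fake points for illustration only
lats = np.random.uniform(40.6, 40.9, size=2000)
lons = np.random.uniform(-74.1, -73.8, size=2000)
grid = private_grid(lats, lons, cell_size_deg=0.01, epsilon=0.5)  # darker = higher
```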

Choosing a Map

Now comes the manual part. There are two variables you can adjust when choosing a map: grid size and margin of error.  While this step is manual, most of the work is done for you, so it’s much less time-intensive than moving data points around. For demonstration purposes, we currently generate several options which you can select from in the gallery view. You could release any of the pre-generated maps, as they are all protected by differential privacy at the stated +/-, but some are not useful and others may waste privacy currency.

Grid size is simply the area of each cell.  Since a cell is the smallest area you can compare (with either another cell or 0), you must set it to accommodate the minimum resolution required for your analysis.  For example, using the map to allocate resources at the borough level vs. the block level requires different resolutions to be effective. You also have to consider the density of the dataset. If your analysis is at the block level, but the dataset is so sparse that there’s only about one point per block, the noise will protect those individuals, and the map will be uniformly noisy.

Margin of error specifies a range that the noisy answer will likely fall within.  The higher the margin of error, the less the noisy answer tells us about specific data points within the cell.  A cell with answer 20 +/- 3 means the real answer is likely between 17 and 23, while an answer of 20 +/- 50 means the real answer is likely between -30 and 70, and thus it’s reasonably likely that there are no data points within that cell at all.
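Assuming Laplace noise and a 95% notion of “likely”, the relationship between the margin of error and the privacy spent can be sketched like this (illustrative Python, not the demo’s exact settings):

```python
import math

def likely_range(noisy_answer, margin):
    """The interval the true answer likely falls within."""
    return (noisy_answer - margin, noisy_answer + margin)

def epsilon_for_margin(margin, confidence=0.95):
    """Laplace noise with scale 1/epsilon stays within +/- margin with the
    given confidence when epsilon = ln(1 / (1 - confidence)) / margin."""
    return math.log(1.0 / (1.0 - confidence)) / margin

print(likely_range(20, 3))     # (17, 23): the cell almost certainly holds data
print(likely_range(20, 50))    # (-30, 70): an empty cell is entirely plausible
print(epsilon_for_margin(3))   # ~1.0  -> tighter answers spend more privacy
print(epsilon_for_margin(50))  # ~0.06 -> looser answers spend less
```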

To select a map, first pan and zoom the map to show the portion you’re interested in, and then click the target icon for a dataset.

Map Maker Target Button

When you click the target, a gallery with previews of the nine pre-generated options is displayed.

As an example, let’s imagine that I’m doing block level analysis, so I’m only interested in the third column:

This sample dataset has a fairly small amount of data, such that in the top cell (+/- 50) and to some extent the middle cell (+/- 9), the noise overwhelms the data. In this case, we would have to consider tuning down the privacy protection towards the +/- 3 cell in order to have a useful map at that resolution. (For this demo, the noise level is hard-coded.)  The other option is to sacrifice resolution (moving left in the gallery view), so that there are more data points in a given square and they won’t be drowned out by higher noise levels.
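A quick back-of-the-envelope check makes the trade-off concrete (the figures below are illustrative, not taken from the sample dataset):

```python
def relative_noise(expected_points_per_cell, margin):
    """How big the noise is relative to the signal in a typical cell."""
    return margin / expected_points_per_cell

print(relative_noise(5, 50))    # 10.0 -> block-level cells: noise swamps the data
print(relative_noise(500, 50))  # 0.1  -> coarser cells: the map stays informative
```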

Once you have selected a grid, you can pan and zoom the map to the desired scale. The legend is currently dynamic such that it will adjust as necessary to the magnitude of the data in your current view.

Whitepaper 2.0: A moral and practical argument for public access to private data.

Monday, April 4th, 2011

It’s here! The Common Data Project’s White Paper version 2.0.

This is our most comprehensive moral and practical argument to date for the creation of a public datatrust that provides public access to today’s growing store of sensitive personal information.

At this point, there can be no doubt that sensitive personal data, in aggregate, is and will continue to be an invaluable resource for commerce and society. However, today, the private sector holds a near monopoly on such data. We believe that it is time We, The People gain access to our own data; access that will enable researchers, policymakers and NGOs acting in the public interest to make decisions in the same data-informed ways businesses have for decades.

Access to sensitive personal information will be the next “Digital Divide” and our work is perhaps best described as an effort to bridge that gap.

Still, we recognize that there are many hurdles to overcome. Currently, highly valuable data, from online behavioral data to personal financial and medical records, is siloed and, in the name of privacy, inaccessible. Valuable data is kept out of the reach of the public and in many cases unavailable even to the businesses, organizations and government agencies that collect the data in the first place. Many of these data holders have business reasons or public mandates to share the data they have, but can’t, or do so only in a severely limited manner and through a time-consuming process.

We believe there are technological and policy solutions that can remedy this situation and our white paper attempts to sketch out these solutions in the form of a “datatrust.”

We set out to answer the major questions and open issues that challenge the viability of the datatrust idea.

  1. Is public access to sensitive personal information really necessary?
  2. If it is, why isn’t this already a solved problem?
  3. How can you open up sensitive data to the public without harming the individuals represented in that data?
  4. How can any organization be trusted to hold such sensitive data?
  5. Assuming this is possible and there is public will to pull it off, will such data be useful?
  6. All existing anonymization methodologies degrade the utility of data; how will the datatrust strike a balance between utility and privacy?
  7. How will the data be collated, managed and curated into a usable form?
  8. How will the quality of the data be evaluated and maintained?
  9. Who has a stake in the datatrust?
  10. The datatrust’s purported mission is to serve the interests of society; will you and I, as members of society, have a say in how the datatrust is run?

You can read the full paper here.

Comments, reactions and feedback are all welcome. You can post your thoughts here or write us directly at info at commondataproject dot org.

Common Data Project at IAPP Privacy Academy 2010

Monday, September 13th, 2010

We will be giving a Lightning Talk on “Low-Impact Data-Mining” and running two breakout sessions at the IT Privacy Foo Camp – Preconference Session, Wednesday Sept 29.

Below is a preview of our slides and handout for the conference. Unlike our previous presentations, we won’t be talking about CDP and the Datatrust at all. Instead, we’ll focus on how SGM (Shan Gao Ma) helps companies minimize the privacy impact of their data-mining.

More specifically, we’ll be stepping through the symbiotic documentation system we’ve created between the product development/data science folks collecting and making use of the data and the privacy/legal folks trying to regulate and monitor compliance with privacy policies. We will be using the SGM Data Dictionary as a case study in the breakout sessions.

Still, we expect that many of the issues we’ve been grappling with from the datatrust perspective (e.g. public perception, trust, ownership of data, meaningful privacy guarantees) will come up, as they are universal issues central to any meaningful discussion about privacy today.


Handout

What is data science?

An introduction to data-mining from O’Reilly Radar that provides a good explanation of how data-mining is distinct from previous uses of data and provides plenty of examples of how data-mining is changing products and services today.

The “Anonymous” Promise and De-identification

  1. How you can be re-identified: Zip code + Birth date + Gender = Identity
  2. The limits of anonymization: Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, by Paul Ohm.

Differential Privacy: A Programmatic Way to Enforce Your Privacy Guarantee?

  1. A Microsoft Research Implementation: PINQ
  2. CDP’s write-up about PINQ.
  3. A deeper look at how differential privacy’s mathematical guarantee might translate into layman’s terms.

Paradigms of Data Ownership: Individuals vs Companies

  1. Markets and Privacy by Kenneth C. Laudon
  2. Privacy as Property by Lawrence Lessig
  3. CDP explores the advantages and challenges of a “Creative Commons-style” model for licensing personal information.
  4. CDP’s Guide to How to Read a Privacy Policy

Common Data Project looking for a partner organization to open up access to sensitive data.

Wednesday, June 30th, 2010

Looking for a partner...

The Common Data Project is looking for a partner organization to develop and test a pilot version of the datatrust: a technology platform for collecting, sharing and disclosing sensitive information that provides a new way to guarantee privacy.

Funders are increasingly interested in developing ways for nonprofit organizations to make more use of data and make their data more public. We would like to apply with a partner organization for a handful of promising funding opportunities.

We at CDP have developed technology and expertise that would enable a partner organization to:

  1. Collect sensitive data from members, donors and other stakeholders in a safe and responsible manner;
  2. Open data to the public to answer policy questions, be more transparent and accountable, and inform public discourse.

We are looking for an organization that is both passionate about its mission and deeply invested in the value of open data to provide us with a targeted issue to address.

We are especially interested in working with data that is currently inaccessible or locked down for privacy reasons.

We can imagine, in particular, a couple of different scenarios in which an organization could use the datatrust in interesting ways, but ultimately, we are looking to work out a specific scenario together.

  • A data exchange to share sensitive information between members.
  • An advocacy tool for soliciting private information from members so that organizational policy positions can be backed up with hard data.
  • A way to share sensitive data with allies in a way that doesn’t violate individual privacy.

If you’re interested in learning more about working with us, please contact Alex Selkirk at alex [dot] selkirk [at] commondataproject [dot] org.

A big update for the Common Data Project

Tuesday, June 29th, 2010

There’s been a lot going on at the Common Data Project, and it can be hard to keep track.  Here’s a quick recap.

Our Mission

The Common Data Project’s mission is to encourage and enable the disclosure of personal data for public use and research.

We live in a world where data is obviously valuable — companies make millions from data, nonprofits seek new ways to be more accountable, advocates push governments to make their data open.  But even as more data becomes accessible, even more valuable data remains locked up and unavailable to researchers, nonprofit organizations, businesses, and the general public.

We are working on creating a datatrust, a nonprofit data bank, that would incorporate new technologies for open data and new standards for collecting and sharing personal data.

We’ve refined what that means, what the datatrust is and what the datatrust is not.

Our Work

We’ve been working in partnership with Shan Gao Ma (SGM), a consultancy started by CDP founder, Alex Selkirk, that specializes in large-scale data collection systems, to develop a prototype of the datatrust.  The datatrust is a new technology platform that allows the release of sensitive data in “raw form” to the public with a measurable and therefore enforceable privacy guarantee.

In addition to this real privacy guarantee, the datatrust eliminates the need to “scrub” data before it’s released.  Right now, any organization that wants to release sensitive data has to spend a lot of time scrubbing and de-identifying data, using techniques that are frankly inexact and possibly ineffective.  The datatrust, in other words, could make real-time data possible.

Furthermore, the data that is released can be accessed in flexible, creative ways.  Right now, sensitive data is aggregated and released as statistics.  A public health official may have access to data that shows how many people are “obese” in a county, but she can’t “ask” how many people are “obese” within a 10-mile radius of a McDonald’s.
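With a datatrust-style interface, that kind of arbitrary aggregate question could be answered with a noisy count, along the lines of this sketch (hypothetical Python with made-up field names; the prototype itself builds on PINQ rather than this code):

```python
import math
import numpy as np

def miles_between(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in miles (haversine formula)."""
    r = 3959.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def noisy_count_within_radius(records, center, radius_miles, epsilon):
    """Answer 'how many obese people live within radius_miles of center?'
    with Laplace noise, so no individual record is exposed."""
    true_count = sum(
        1 for rec in records
        if rec["obese"] and miles_between(rec["lat"], rec["lon"], *center) <= radius_miles
    )
    return true_count + np.random.laplace(scale=1.0 / epsilon)

# Hypothetical records; the real datatrust would hold the sensitive data itself.
records = [{"lat": 47.61, "lon": -122.33, "obese": True},
           {"lat": 47.45, "lon": -122.30, "obese": False}]
print(noisy_count_within_radius(records, center=(47.62, -122.35),
                                radius_miles=10, epsilon=0.1))
```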

We have a demo of PINQ: an illustration of how you can safely query a sensitive data set through differential privacy, a relatively new, quantitative approach to protecting privacy.

We’ve also developed an accompanying privacy risk calculator to help us visualize the consequences of tweaking different levers in differential privacy.
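Conceptually, the calculator models something like the following (a simplified sketch, not the calculator’s actual code): the two main levers are how much noise each answer gets and how many answers you hand out, and under basic composition the total privacy cost adds up.

```python
import math

def privacy_risk(epsilon_per_query, num_queries, confidence=0.95):
    """Two levers: noise per answer and number of answers. Under basic
    composition the total privacy cost adds up linearly, while each
    answer's 95% margin of error shrinks as epsilon per query grows."""
    total_epsilon = epsilon_per_query * num_queries
    margin = math.log(1.0 / (1.0 - confidence)) / epsilon_per_query
    return total_epsilon, margin

for eps in (0.05, 0.25, 1.0):
    total, margin = privacy_risk(eps, num_queries=20)
    print(f"epsilon/query {eps}: total privacy cost {total:.1f}, "
          f"answers good to about +/- {margin:.0f}")
```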

For CDP, improved privacy technology is only one part of the datatrust concept.

We’ve also been working on a number of organizational and policy issues:

A Quantifiable Privacy Guarantee: We are working through how differential privacy can actually yield a “measurable privacy guarantee” that is meaningful to the layman. (Thus far, it has been only a theoretical possibility. A specific “quantity” for the so-called “measurable privacy guarantee” has yet to be agreed upon by the research community.)
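One candidate translation, using the standard definition of epsilon-differential privacy (our gloss, not a community-agreed standard): epsilon caps how much more likely any particular released answer becomes because your record was included.

```python
import math

# Epsilon-differential privacy: for any particular output of the system,
#   P(output | my record included) <= exp(epsilon) * P(output | my record excluded)
# and vice versa. So epsilon caps how much my presence can change what gets released.
for epsilon in (0.01, 0.1, 0.5, 1.0):
    factor = math.exp(epsilon)
    print(f"epsilon {epsilon}: any given answer is at most {factor:.2f}x more likely "
          f"with my record in the data than without it")
```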

Building Community and Self-Governance: We’re wrapping up a blog series looking at online information-sharing communities and self-governance structures and how lessons learned from the past few years of experimentation in user-generated and user-monitored content can apply to a data-sharing community built around a datatrust.

We’ve also started outlining the governance questions we have to answer as we move forward, including who builds the technology, who governs the datatrust, and how we will monitor and prevent the datatrust from veering from its mission.  We know that this is an organization that must be transparent if it is to be trusted, and we are working on creating the kind of infrastructure that will make transparency inevitable.

Licensing Personal Information: We proposed a “Creative Commons” style license for sharing personal data and we’re following the work of others developing licenses for data. In particular, what does it mean to “give up” personal information to a third-party?

Privacy Policies: We published a guide to reading online privacy policies for the curious layman, an analysis of their pitfalls and ambiguities, which was re-published by the IAPP and picked up by the popular technology blog Read Write Web.

We’ve also started researching the issues we need to address to develop our own privacy policy.  In particular, we’ve been working on figuring out how we will deal with government requests for information.  We did some research into existing privacy law, both constitutional and statutory, but in many ways, we’ve found more questions than answers.  We’re interested in watching the progress of the Digital Due Process coalition as they work on reforming the Electronic Communications Privacy Act, but we anticipate that the datatrust will have to deal with issues that are more complex than an individual’s expectation of privacy in emails more than 180 days old.

Education: We regularly publish in-depth essays and news commentary on our blog, myplaceinthecrowd.org, covering topics such as the risk of re-identification with current methods of anonymization and the value of open datasets that are available for creative reuse.

We have a lot to work on, but we’re excited to move forward!

Governing the Datatrust: Answering the question, “Why should I trust you with my data?”

Thursday, June 3rd, 2010

Progress on defining the datatrust is accelerating–we can almost smell it!

For a refresher, the datatrust is an online service that will allow organizations to open sensitive data to the public and provide researchers, policymakers and application developers with a way to directly query the data, all without compromising individual privacy. Read more.

For the past two years, we’ve been working on figuring out exactly what the datatrust will be, not just in technical terms, but also in policy terms.

We’ve been thinking through what promises the datatrust will make, how those promises will be enforced, and how best we can build a datatrust that is governed, not by the whim of a dictator, but by a healthy synergy between the user community, the staff, and the board.

The policies we’re writing and the infrastructure we’re building are still a work in progress.  But for an overview of the decisions we’ve made and outstanding issues, take a look at “Datatrust Governance and Policies: Questions, Concerns, and Bright Ideas”.

Here’s a short summary of our overall strategy.

  1. Make a clear and enforceable promise around privacy.
  2. Keep the datatrust simple. We will never be all things to all people. The functions it does have will be small enough to be managed and monitored easily by a small staff, the user community, and the board.
  3. Have many decision-makers. It’s more important that we do the right things than that we do them quickly. We will create a system of checks and balances, in which authority to maintain and monitor the datatrust will be entrusted to several separate parties, including the staff, the user community, and the board.
  4. Monitor, report and review, regularly. We will regularly review what we’re monitoring and how we’re doing it, and release the results to the public.
  5. Provide an escape valve. Develop explicit, enforceable policies on what the datatrust can and can’t do with the data. Prepare a “living will” to safely dispose of the data if the organization can no longer meet its obligations to its user community and the general public.

We definitely have a lot of work to do, but it’s exciting to be narrowing down the issues.  We’d love to hear what you think!

P.S. You can read more about the technical progress we’re making on the datatrust by visiting our Projects page.

Update: PINQ Demo Revisited

Tuesday, May 4th, 2010

Here’s Take Two on our PINQ “Differential Privacy In Action” Demo.

Along with a general paring down of the visual interface, we’ve refined how you interact with the application as well as tried to visualize how PINQ is applying noise to each answer.

  • The demo app is no longer modal, meaning you don’t have to click a button to switch between zooming in and out of the map, panning around the map, and drawing boxes to define query areas. All of this functionality is accessible from the keyboard.
  • You no longer draw boxes to define query areas. Instead, clicking “Ask a Question” plops a box on the map that you can move and resize with the mouse.
  • Additionally, the corresponding PINQ answers update in real-time as you move and resize the query boxes.
  • New thumbnail graphics next to each answer reflect how PINQ generates noisy answers and provide a more immediate sense of the “scale of noise” being applied. (A more detailed explanation of these pointy curves is forthcoming.)

The demo has proven enormously helpful as an aid in explaining our work and our goals. We continue to improve it every time we make use of it, so stay tuned for more to come!

Live Demo: http://demos.commondataproject.org/PINQDemo.html

Screenshots:

CDP @ Open Knowledge Conference 2010: A Recap

Thursday, April 29th, 2010

Going into the Open Knowledge Conference, I didn’t know what to expect. Grace had read about the Open Knowledge Foundation earlier, and we hoped we’d find like-minded people and organizations at the conference, but we didn’t have any personal contacts or references.

As it turned out, the talks and people’s interests overlapped significantly with the work we do, and vice-versa. Here are a few highlights:

  • Rufus Pollock started things off with some background on the Open Knowledge Foundation’s work, which aims to make knowledge in a broad sense publicly available. This extends quite a bit beyond our interests in sensitive data (for example, it turns out that a lack of bibliographic information often prevents copyright expiration from a practical perspective, because no one can apply the statutes, which often include calculations based on author birthdate, author death date and other lesser-known facts, as well as lots of rules that vary across already inconsistent jurisdictions).
  • Chris Taggart is championing an effort to bring more local government data on-line in the UK.
  • Peter Murray-Rust from Cambridge University made a case for sharing data, and for publishing scientists to state clearly to publishers their desire for the data to be available (which is apparently another copyright issue). He was involved in the creation of the Panton Principles for Open Data in Science (named after a pub in Cambridge).
  • Sören Auer gave a couple of talks on DBpedia.org, which is extracting structured data from Wikipedia. Apparently in Germany, for historical reasons, open government data, and open data in general, does not have the public support that it has in the UK and US.
  • After chatting with Sören, I got a chance to chat with Hugh Williams of OpenLink Software, another attendee, to learn more about how DBpedia’s 300 million RDF triples are hosted on a single instance of their Virtuoso server, an RDFDB variant – possible thanks to 64-bit architectures, something that was not feasible in the early days of RDFDB when I was working at Epinions. I’m curious to learn more about how a MapReduce-type mechanism might work sitting on top of an RDFDB store.
  • Jordan Hatcher gave a really interesting talk (a shorter version of this talk from the OSSAT), the upshot being that the way in which we’re proposing to “release” sensitive data to the public is more akin to the way online companies use data to drive their services and less like open government efforts where the data is literally given away. We’re never going to actually hand over any data; we’re only ever going to provide “noisy” descriptions of the data in response to queries. (This topic deserves its own post and we’ll definitely want to chat with him once we have our thoughts better organized.)
  • Jeni Tennison gave an interesting talk on the technical/practical challenges of scaling Open Data, which made me think (in relation to Jordan Hatcher’s talk) that we should consider a scenario where we allow for distributed storage of data behind the datatrust API, as this may simplify some of the legal constraints that we will run into.
  • Thomas Schandl gave a neat demo of Pool Party, which is a nice thesaurus system for managing linked data, and could be useful for managing a datastore with distinct and diverse data sources.
  • Stuart Harrison gave a talk on the data that the local UK government he works for (Lichfield District) is releasing to try to help engage with the community. They have been able to release a fair bit of data, although privacy and the sensitivity of data do seem to be becoming part of the challenges they face in doing so. It would be interesting to follow up with him as well.
  • Victor Henning & Jan Reichelt gave an interesting presentation about Mendeley – a self-proclaimed Last.fm for research papers. It seems to me that they are already or will soon be running into interesting questions around who owns the data they collect from their users, as well as expectations around user privacy. Their site says “academic software for research papers” but they seemed to be saying that they would be selling their data in some form.
  • Karin Christiansen gave an interesting talk about the issue of transparency in international aid. Apparently there are real challenges identifying corruption and redundant aid and measuring impact, because there’s no centralized view of where everyone’s aid goes. For example, apparently there are 27 different departments/commissions/etc. within the US government disbursing international development aid, and a major donation will change hands 6 times before reaching its intended destination, so tracking the money can be very hard. She is the director of http://publishwhatyoufund.org/, which is hoping to address this. This was an interesting talk and an interesting problem, though I didn’t see an immediate relevance to CDP.
  • Helen Turvy from the Shuttleworth Foundation made an announcement that to my ears said “if you are involved in making data available to the public somewhere on this planet, we want to help you”. Unfortunately I didn’t get a chance to chat with her at the conference, but we definitely need to follow up with her. Her characterization of the kinds of projects the Shuttleworth Foundation funds contrasted sharply with other foundations we’ve looked at, in that they are happy to support general-purpose “the more data the better” solutions, as opposed to projects that address a specific problem (e.g. homelessness, pollution). As an all-purpose solution for making sensitive data safe for public access, we’ve been hard-pressed to find funders like Shuttleworth.
  • Another item that came up during the day, possibly more than once though I’m not sure from where, was the idea that organizations and parts of government are increasingly starting to think about having data be “open by default” – in order to save money dealing with Freedom of Information Act requests! (The UK has a similar concept to the US one, by the sound of things.) This is exciting because if the datatrust can provide a cheap way for organizations to meet disclosure obligations, cost might actually help drive adoption.

Finally, my talk went well (many thanks to Mimi and Grace) and the new demo looked great (many thanks there to Tony) – we’ll have a post up on the new demo shortly. The fact that we were talking about releasing sensitive data made us fairly unique at the conference, and, to many, very interesting for future stages of open data initiatives.

I got a chance to chat with several different people running into sensitive data disclosure challenges, most of which today come down to an all-or-nothing decision point: some governing body ends up deciding whether the data in question can be disclosed or not. Allowing a differential-privacy-style analysis of the data, with no actual records being disclosed, is not part of the discussion. As a result, valuable data is not being opened up for reasons that we hope to soon show are no longer technically valid.

To fellow OKCon folks, we look forward to being a more active part of the community and to bringing more attention to sensitive data scenarios! As I said during my short talk, anyone with interesting sensitive data sharing scenarios, please contact us so we can see if our work can be of use to you.

PINQ Privacy Demo

Thursday, January 7th, 2010

Editor’s Note: Tony Gibbon is developing a datatrust demo as an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Tony’s work, like Grant’s, could have interesting implications for CDP’s mission, as it would use technologies that could enable more disclosure of personal data for public re-use.  We’re happy to have him guest blogging about the demo here.

Back in August, Alex wrote about the PINQ privacy technology and noted that we would be trying to figure out what role it could play in the datatrust.  The goal was to build a demo of PINQ in action and get a better understanding of PINQ and its challenges and quirks in the process.  We settled on a quick-and-dirty interactive demo to try to demonstrate the answers to the following.

What does PINQ bring to the table?

Before we look at the benefits of PINQ, let’s first take a look at the shortcomings of one of the ways data is often released, with an example taken from the CDC website.

This probably isn’t the best example of a compelling dataset, but it is a good example of the lack of flexibility of many datasets that are available—namely that the data is pre-bucketed and there is a limit to how far you are able to drill down on the data.

On one hand, the limitation makes sense: if the CDC allowed you (or your prospective insurance company) to view disease information at street level, the potential consequences would be quite frightening.  On the other hand, they are also potentially limiting the value of the data.  For example, each county is not necessarily homogeneous.  Depending on the dataset, a researcher may legitimately wish to drill down without wanting to invade anyone’s privacy—for example, to compare urban vs. suburban incidence.

This is where PINQ shines—it works in both these cases.  PINQ allows you to execute an arbitrary aggregate query (meaning I can ask how many people are wearing pink, but I can’t ask PINQ to list the names of people wearing pink) while still protecting privacy.

Let’s turn to the demo.  (Note: the data points in the demo were generated randomly and do not actually indicate people or residences, much less anything about their health.)  The quickest, most visual arbitrary query we came up with is drawing a rectangle on a map and counting each data point that falls inside, so we placed hundreds of “sick” people on a map to let users count them.  (Keep in mind that the arbitrariness of a PINQ query need not be limited to location on a map.  It could be numerical like age, textual like name, include multiple fields etc.)
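Under the hood, such a query amounts to counting the points that fall inside the rectangle and adding noise before the answer leaves the system. A rough sketch of the idea (illustrative Python; the demo itself is Silverlight running on PINQ):

```python
import numpy as np

def noisy_count_in_rectangle(points, lat_range, lon_range, epsilon):
    """Aggregate-only query: how many points fall inside the rectangle?
    Only the count plus Laplace noise is released, never the points."""
    true_count = sum(
        1 for lat, lon in points
        if lat_range[0] <= lat <= lat_range[1] and lon_range[0] <= lon <= lon_range[1]
    )
    return true_count + np.random.laplace(scale=1.0 / epsilon)

# Randomly generated "sick" points, as in the demo (they indicate nothing real)
points = list(zip(np.random.uniform(47.5, 47.7, 500),
                  np.random.uniform(-122.45, -122.2, 500)))
print(noisy_count_in_rectangle(points, lat_range=(47.58, 47.66),
                               lon_range=(-122.36, -122.30), epsilon=0.1))
```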

Now let’s attempt to answer the researcher’s question.  Is there a higher incidence of this mysterious disease in urban or suburban areas?  For the sake of simplicity, we’ll pretend he’s particularly interested in two similarly populated, conveniently rectangular areas: one in Seattle and the other in a nearby suburb as shown below:

An arbitrary query such as this one is clearly not possible with data that is pre-bucketed, such as the diabetes data bucketed by county.  Let’s take a look at what PINQ spits out.

We get an “answer” and a likely range.  (The likely range is actually an input to the query, but that’s a topic for another post.)  So what does this mean? Are there really 311.3 people in Seattle with the mysterious disease?  Why are there partial people?

PINQ adds a random amount of noise to each answer, which prevents us from being able to measure the impact of a single record in the dataset.  The PINQ answer indicates that about 311 people (plus or minus noise) in Seattle have the disease.  The noise, though randomly generated, is likely to fall within a particular range, in this case 30.  So the actual number is likely to be within 30 of 311, while the actual number of those in the nearby suburb with the disease is likely to be within 30 of 177.

Given these numbers (and ignoring the oversimplification and silliness of his question), the researcher could conclude that the incidence in the urban area is higher than the suburban area.  As a bonus, since this is a demo and no one’s privacy is at stake, we can look at the actual data and real numbers:

The answers from PINQ were in fact pretty close to the real answer.  We got a little unlucky with the Seattle answer as the actual random noise for that query was slightly greater than the likely range, but our conclusion was the same as if we had been given the real data.

But what about the evil insurance company/employer/neighbor?

By now, you’re hopefully starting to see potential value of allowing people to execute arbitrary queries rather than relying on pre-bucketed data, but what about the potential harm?  Let’s imagine there’s a high correlation between having this disease and having high medical costs.  While you might want your data included in this dataset so it could be studied by someone researching a cure, you probably don’t want it used to discriminate against you.

To examine this further, let’s zoom in and ask about the disease at my house.  PINQ only allows questions with aggregate answers, so instead of asking “does Tony have the disease?” we’ll ask, “how many people at Tony’s house have the disease?”

You’ll notice that, unlike the CDC map, PINQ doesn’t try to stop me from asking this potentially harmful, privacy-infringing question.  (I don’t actually live there.)  PINQ doesn’t care if the actual answer is big or small, or if I ask about a large or small area; it just adds enough noise to ensure that the presence or absence of a single record (in this case, a person) doesn’t have a discernible effect on the answers.

PINQ’s answer was “about 2.4, with likely noise within +/- 5” (I dialed down the likely noise to +/- 5 for this example).  As with all PINQ answers, we have to interpret this answer in the context of my initial question: “Does Tony have the disease?”  Since the noise added is likely to be between -5 and 5, the real answer is likely to be between 0 and 7, inclusive, and we can’t draw any strong conclusions about my health because the noise overwhelms the real answer.

Another way of looking at this is that we get similarly inconclusive answers when we try to attack the privacy of both the infected and the healthy.  Below I’ve made the diseased areas visible on the map and we can compare the results of querying me and my neighbor, only one of whom is infected:

Keep in mind that my address may not be in the dataset because I’m healthy or because I chose not to submit my information.  In either case, the noise causes the answer at my house to be indistinguishable from the answer at my neighbor’s address, and our decisions to be included in or excluded from the dataset do not affect our privacy.  Equally important, as the first example showed, the addition of this privacy-preserving noise does not preclude the extraction of potentially useful answers from the dataset.
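That indistinguishability is easy to see in simulation. Here’s a sketch under the same Laplace-noise assumption as above (not the demo’s actual code): run the “how many at this address” query against two datasets that differ only in whether my record is present, and any single answer is equally plausible from either.

```python
import numpy as np

def noisy_count(true_count, epsilon):
    return true_count + np.random.laplace(scale=1.0 / epsilon)

epsilon = 0.6   # Laplace scale ~1.7, so the noise is likely within about +/- 5

# Two neighboring worlds: my record is in the dataset, or it is not.
print("with my record:   ", [round(noisy_count(1, epsilon), 1) for _ in range(5)])
print("without my record:", [round(noisy_count(0, epsilon), 1) for _ in range(5)])
# Any single answer printed above is entirely plausible in either world,
# so the released answer cannot reveal whether I am in the data.
```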

You can play with the demo here (requires Silverlight).

