Archive for the ‘For Nerds and Geeks’ Category

In The Mix…predicting the future; releasing healthcare claims; and $1.5 million awarded to data privacy research

Tuesday, November 30th, 2010

Some people out there think they can predict the future by scraping content off the web. Does it work simply because web 2.0 technologies are great at creating echo chambers? Is this just another way of amplifying that echo chamber and generating yet more self-fulfilling trend prophecies? See the Future with a Search (MIT Technology Review)

The U.S. Office of Personnel Management wants to create a huge database containing the healthcare claims of millions of people. Many are concerned about how the data will be protected and used. More federal health database details coming following privacy alarm (Computer World)

Researchers at Purdue were awarded $1.5 million to investigate how well current techniques for anonymizing data are working and whether there’s a need for better methods. It would be interesting to know what they think of differential privacy. They appear to be doing the dirty work of figuring out whether theoretical re-identification is more than just a theory. National Science Foundation Funds Purdue Data-Anonymization Project (Threat Post)

Update: PINQ Demo Revisited

Tuesday, May 4th, 2010

Here’s Take Two on our PINQ “Differential Privacy In Action” Demo.

Along with a general paring down of the visual interface, we’ve refined how you interact with the application as well as tried to visualize how PINQ is applying noise to each answer.

  • The demo app is no longer modal: you no longer have to click a button to switch between zooming in and out of the map, panning around the map, and drawing boxes to define query areas. All of this functionality is accessible from the keyboard.
  • You no longer draw boxes to define query areas. Instead, clicking “Ask a Question” plops a box on the map that you can move and resize with the mouse.
  • Additionally, the corresponding PINQ answers update in real-time as you move and resize the query boxes.
  • New thumbnail graphics next to each answer reflect how PINQ generates noisy answers and provide a more immediate sense of the “scale of noise” being applied. (A more detailed explanation of these pointy curves is forthcoming.)

The demo has proven enormously helpful as an aid in explaining our work and our goals. We continue to improve it every time we make use of it, so stay tuned for more to come!

Live Demo: http://demos.commondataproject.org/PINQDemo.html

Screenshots:

Program Manager Job Opening at Shan Gao Ma LLC

Monday, May 3rd, 2010

Hello CDP Blog Readers  –

If you’ve been following our progress working with differential privacy and PINQ, you’ll notice that my company, Shan Gao Ma LLC (SGM), has been donating time investigating how to apply some of these new privacy technologies for practical use.

I am writing on behalf of SGM today to announce that we are looking to hire a new Program Manager to add to our team. How is this relevant to CDP, you ask? It is relevant because we are looking for someone who will help drive the technical explorations that Grant and Tony have been doing to ultimately put forth a viable technical plan for the CDP datatrust.

Why is SGM helping the Common Data Project?

In 2007, I left Microsoft, where I had been working with large-scale data collection systems for many years, to start the Common Data Project.

My motivation was fairly simple, yet ambitious. Through my work, I had come to feel that before long there would be a real need for an independent, trusted third party to enable real public access to sensitive data that is today trapped within corporations and government agencies. CDP is an attempt to create that trusted third party.

Shortly after creating CDP, I began working again, as a consultant, and I created Shan Gao Ma, a for-profit company which over the past three years has grown into a small consultancy specializing in collecting sensitive data and helping clients navigate privacy issues. In other words, I grapple with the same privacy issues I dealt with at Microsoft and I see, even more clearly, the need for the solutions put forth by CDP.

Today, I continue to fund and donate time to CDP, both personally and through Tony and Grant’s time from Shan Gao Ma, as we lay the groundwork for broader financial support.

The Future

Growing our team is exciting – it’s a sign of happy clients and an indication of the great work we’ve been doing – but at our size it can also be challenging to find the right fit. I wrote a bit of background on how we work to go along with the responsibilities and candidate requirements. If you are interested or simply have some questions, don’t hesitate to drop us a line.

Please send questions, refer folks you think would be interested, and help us get to the next stage!

Completely not there versus almost not there.

Wednesday, April 14th, 2010


Picture taken by Stephan Delange

In my last post, where I tried to quantify the concept of “discernibility,” I left off at the point where I said I was going to try out my “50/50” definition on the PINQ implementation of differential privacy.

It turned out to be a rather painful process, both because I can be rather literal-minded in an unhelpful way at times and because this stuff is plain hard to figure out.

To backtrack a bit, let’s first make some rather obvious statements to get a running start in preparation for wading through some truly non-obvious ones.

Crossing the discernibility line.

In the extreme case, we know that if there were no privacy protection whatsoever and the datatrust just gave out straight answers, then we would definitely cross the “discernibility line” and violate our privacy guarantee. So let’s go back to my pirate friend again and ask, “How many people with skeletons in their closet wear an eye-patch and live in my building?” If you (my rather distinctive eye-patch-wearing neighbor) exist in the data set, the answer will be 1. If you are not in the data set, the answer will be 0.

With no privacy protection, the presence or absence of your record in the data set makes a huge difference to the answers I get and is therefore extremely discernible.

Thankfully, PINQ doesn’t give straight answers. It adds “noise” to answers to obfuscate them.

Now when I ask, “How many people in this data set of people with skeletons in their closet wear an eye-patch and live in my building?” PINQ counts the number of people who meet these criteria and then decides to either “remove” some of those people or “add” some “fake” people to give me a “noisy” answer to my question.

How it chooses to do so is governed by a distribution curve developed by and named for the French marquis Pierre-Simon Laplace. (I don’t know why it has to be this particular curve, but I am curious to learn why.)

You can see the curve illustrated below in two distinct postures that illustrate very little privacy protection and quite a lot of privacy protection, respectively.

  • The peak of the curve is centered on the “real answer.”
  • The width of the curve shows the range of possible “noisy answers” PINQ will choose from.
  • The height of the curve shows the relative probability of one noisy answer being chosen over another noisy answer.

A quiet curve with few “fake” answers for PINQ to choose from:

A noisy curve with many “fake” answers for PINQ to choose from:
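
To make those two postures a bit more concrete, here’s a minimal sketch, in Python rather than PINQ’s C# and with noise scales I made up for illustration, of what answers drawn from a quiet curve versus a noisy curve look like when the real answer is 1:

```python
import random

def noisy_count(real_answer, scale):
    """Return the real answer plus Laplace noise of the given scale."""
    # The difference of two independent exponential draws is
    # Laplace-distributed around zero with that scale.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return real_answer + noise

real_answer = 1  # my eye-patch-wearing neighbor is in the data set

# Quiet curve: answers cluster tightly around 1.
print([round(noisy_count(real_answer, scale=0.5), 1) for _ in range(5)])

# Noisy curve: answers wander so widely that 0 and 1 look much the same.
print([round(noisy_count(real_answer, scale=5.0), 1) for _ in range(5)])
```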

More noise equals less discernibility.

It’s easy to wave your hands around and see in your mind’s eye how, if you randomly add and remove people from the “real answers” to questions, then as you turn up the amount of noise you’re adding, the presence or absence of a particular record becomes increasingly irrelevant and therefore increasingly indiscernible. This in turn means that it will also be increasingly difficult to confidently isolate and identify a particular individual in the data set, precisely because you can’t ever get a “straight” answer out of PINQ that is accurate down to the individual.

With differential privacy, I can’t ever know that my eye-patch wearing neighbor has a skeleton in his closet. I can only conclude that he might or might not be in the dataset to varying degrees of certainty depending on how much noise is applied to the “real answer.”

Below, you can see how, if you get a noisy answer of 2, it is about 7x more likely that the “real answer” is 1 than that the “real answer” is 0. A flatter, noisier curve would yield a substantially smaller margin.
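
For the curious, that “about 7x” is just the ratio of the two curve heights at the noisy answer you happened to get. Here’s a quick back-of-the-envelope check in Python; the noise scale of 0.5 is my assumption, picked because it reproduces the roughly 7x figure, and the flatter scale of 5 shows how the margin shrinks:

```python
import math

def laplace_height(x, center, scale):
    """Height of a Laplace curve centered on `center`, evaluated at `x`."""
    return math.exp(-abs(x - center) / scale) / (2 * scale)

noisy_answer = 2
for scale in (0.5, 5.0):  # a pointy curve vs. a flat, noisy curve
    ratio = laplace_height(noisy_answer, 1, scale) / laplace_height(noisy_answer, 0, scale)
    print(f"scale {scale}: a real answer of 1 is {ratio:.1f}x more likely than 0")
# scale 0.5 prints about 7.4x; scale 5.0 prints only about 1.2x
```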

But wait a minute, we started out saying that our privacy guarantee guarantees that individuals will be completely non-discernible. Is non-discernible the same thing as hardly discernible?

Clearly not.

Is complete indiscernibility even possible with differential privacy?

Apparently not…

On the question of “Discernibility”

Tuesday, April 13th, 2010

Where’s Waldo?

In my last post about PINQ and meaningful privacy guarantees, we defined “privacy guarantee” as a guarantee that the presence or absence of a single record will not be discernible.

Sounds reasonable enough, until you ask yourself, what exactly do we mean by “discernible”? And by “exactly,” I mean “quantitatively”: what do we mean by “discernible”? After all, differential privacy’s central value proposition is that it’s going to bring quantifiable, accountable math to bear on privacy, an area of policy that heretofore has been largely preoccupied with placing limitations on collecting and storing data, or with fine-print legalese and bald-faced marketing.

However, PINQ (a Microsoft Research implementation of differential privacy we’ve been working with) doesn’t have a built-in mathematical definition of “discernible” either. A human being (aka one of us) has to do that.

A human endeavors to come up with a machine definition of discernibility.

At our symposium last Fall, we talked about using a legal-ish framework for addressing this very issue of discernibility: Reasonable Suspicion, Probable Cause, Preponderance of Evidence, Clear and Convincing Evidence, Beyond a Reasonable Doubt.

Even if we decided to use such a framework, we would still need to figure out how these legal concepts translate into something quantifiable that PINQ can work with.

“Not Discernible” means seeing 50/50.

My initial reaction when I first started thinking about this problem was that clearly, discernibility or the lack thereof needed to revolve around some concept of 50/50, as in “odds of,” “chances are.”

Whatever answer you got out of PINQ, you should never get even a hint of an idea that any one number was more likely to be the real answer than the numbers to either side of that number. (In other words, x and x+/-1 should be equally likely candidates for “real answerhood.”)

Testing discernibility with a “Worst-Case Scenario”

I ask a rather “pointed” question about my neighbor, one that essentially amounts to “Is so-and-so in this data set? Yes or no?” without actually naming names (or social security numbers, email addresses, cell phone numbers or any other unique identifiers). For example: “How many people in this data set of ‘people with skeletons in their closet’ wear an eye-patch and live in my building?” Ideally, I should walk away with an answer that says,

“You know what, your guess is as good as mine, it is just as likely that the answer is 0, as it is that the answer is 1.”

In such a situation, I would be comfortable saying that I have received ZERO ADDITIONAL INFORMATION on the question of a certain eye-patched individual in my building and whether or not he has skeletons in his closet. I may as well have tossed a coin. My pirate neighbor is truly invisible in the dataset, if indeed he’s in there at all.

Armed with this idea, I set out to understand how this might be implemented with differential privacy...

PINQ Privacy Demo

Thursday, January 7th, 2010

Editor’s Note: Tony Gibbon is developing a datatrust demo as an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Tony’s work, like Grant’s, could have interesting implications for CDP’s mission, as it would use technologies that could enable more disclosure of personal data for public re-use.  We’re happy to have him guest blogging about the demo here.

Back in August, Alex wrote about the PINQ privacy technology and noted that we would be trying to figure out what role it could play in the datatrust.  The goal was to build a demo of PINQ in action and get a better understanding of PINQ and its challenges and quirks in the process.  We settled on a quick-and-dirty interactive demo to try to demonstrate the answers to the following.

What does PINQ bring to the table?

Before we look at the benefits of PINQ, let’s first take a look at the shortcomings of one of the ways data is often released with an example taken from the CDC website.

This probably isn’t the best example of a compelling dataset, but it is a good example of the lack of flexibility of many datasets that are available—namely that the data is pre-bucketed and there is a limit to how far you are able to drill down on the data.

On one hand, the limitation makes sense:  If the CDC allowed you (or your prospective insurance company) to view disease information at street level, the potential consequences would be quite frightening.  On the other hand, the CDC is also potentially limiting the value of the data.  For example, each county is not necessarily homogenous.  Depending on the dataset, a researcher may legitimately wish to drill down without wanting to invade anyone’s privacy—for example, to compare urban vs. suburban incidence.

This is where PINQ shines—it works in both these cases.  PINQ allows you to execute an arbitrary aggregate query (meaning I can ask how many people are wearing pink, but I can’t ask PINQ to list the names of people wearing pink) while still protecting privacy.
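
To make “arbitrary aggregate query” concrete, here is a rough sketch in plain Python (with illustrative field names of my own, not PINQ’s actual C# API) of the distinction: you can ask for a noisy count of the records matching any predicate you dream up, but you can never get the records themselves back.

```python
import random

# Hypothetical records; in the demo these would be points on a map.
records = [
    {"name": "Alice", "shirt": "pink"},
    {"name": "Bob", "shirt": "blue"},
]

def noisy_count(rows, predicate, scale=2.0):
    """Count the rows matching an arbitrary predicate, then add Laplace noise."""
    real = sum(1 for row in rows if predicate(row))
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return real + noise

# Allowed: "how many people are wearing pink?"
print(noisy_count(records, lambda row: row["shirt"] == "pink"))

# Not allowed: "list the names of the people wearing pink" -- an interface
# like PINQ's simply never exposes a query that returns individual rows.
```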

Let’s turn to the demo.  (Note: the data points in the demo were generated randomly and do not actually indicate people or residences, much less anything about their health.)  The quickest, most visual arbitrary query we came up with is drawing a rectangle on a map and counting each data point that falls inside, so we placed hundreds of “sick” people on a map to let users count them.  (Keep in mind that the arbitrariness of a PINQ query need not be limited to location on a map.  It could be numerical like age, textual like name, include multiple fields etc.)

Now let’s attempt to answer the researcher’s question.  Is there a higher incidence of this mysterious disease in urban or suburban areas?  For the sake of simplicity, we’ll pretend he’s particularly interested in two similarly populated, conveniently rectangular areas: one in Seattle and the other in a nearby suburb as shown below:

An arbitrary query such as this one is clearly not possible with data that is pre-bucketed, such as the diabetes data by county.  Let’s take a look at what PINQ spits out.

We get an “answer” and a likely range.  (The likely range is actually an input to the query, but that’s a topic for another post.)  So what does this mean? Are there really 311.3 people in Seattle with the mysterious disease?  Why are there partial people?

PINQ adds a random amount of noise to each answer, which prevents us from being able to measure the impact of a single record in the dataset.  The PINQ answer indicates that about 311 people (plus or minus noise) in Seattle have the disease.  The noise, though randomly generated, is likely to fall within a particular range, in this case 30.  So the actual number is likely to be within 30 of 311, while the actual number of those in the nearby suburb with the disease is likely to be within 30 of 177.
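
A rough rule of thumb for connecting the two numbers (my own reading of how Laplace noise behaves, not a statement about the demo’s exact settings): noise with scale b lands within about 3b of zero roughly 95% of the time, so a “likely range” of 30 corresponds to a noise scale of about 10.

```python
import math

def prob_noise_within(bound, scale):
    """Chance that Laplace noise of the given scale stays within +/- bound."""
    return 1 - math.exp(-bound / scale)

scale = 10  # assumed: a "likely range" of 30 behaves like a ~95% interval
print(prob_noise_within(30, scale))      # about 0.95
print(1 - prob_noise_within(30, scale))  # about 0.05: once in a while the noise exceeds the range
```

That occasional excursion beyond the likely range is exactly the kind of bad luck described below for the Seattle answer.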

Given these numbers (and ignoring the oversimplification and silliness of his question), the researcher could conclude that the incidence in the urban area is higher than the suburban area.  As a bonus, since this is a demo and no one’s privacy is at stake, we can look at the actual data and real numbers:

The answers from PINQ were in fact pretty close to the real answer.  We got a little unlucky with the Seattle answer as the actual random noise for that query was slightly greater than the likely range, but our conclusion was the same as if we had been given the real data.

But what about the evil insurance company/employer/neighbor?

By now, you’re hopefully starting to see the potential value of allowing people to execute arbitrary queries rather than relying on pre-bucketed data, but what about the potential harm?  Let’s imagine there’s a high correlation between having this disease and having high medical costs.  While you might want your data included in this dataset so it could be studied by someone researching a cure, you probably don’t want it used to discriminate against you.

To examine this further, let’s zoom in and ask about the disease at my house.  PINQ only allows questions with aggregate answers, so instead of asking “does Tony have the disease?” we’ll ask, “how many people at Tony’s house have the disease?”

You’ll notice that, unlike the CDC map, PINQ doesn’t try to stop me from asking this potentially harmful, privacy-infringing question.  (I don’t actually live there.)  PINQ doesn’t care if the actual answer is big or small, or if I ask about a large or small area; it just adds enough noise to ensure the presence or absence of a single record (in this case, a person) doesn’t have a discernible effect on the answers.

PINQ’s answer was “about 2.4, with likely noise within +/- 5” (I dialed down the likely noise to +/-5 for this example).  As with all PINQ answers, we have to interpret this answer in the context of my initial question: “Does Tony have the disease?”  Since the noise added is likely to be between -5 and 5, the real answer is likely to be between 0 and 7, inclusive, and we can’t draw any strong conclusions about my health because the noise overwhelms the real answer.

Another way of looking at this is that we get similarly inconclusive answers when we try to attack the privacy of both the infected and the healthy.  Below I’ve made the diseased areas visible on the map and we can compare the results of querying me and my neighbor, only one of whom is infected:

Keep in mind that my address may not be in the dataset because I’m healthy or because I chose not to submit my information.  In either case, the noise causes the answer at my house to be indistinguishable from the answer at my neighbor’s address, and our decisions to be included in or excluded from the dataset do not affect our privacy.  Equally important, as the first example showed, the addition of this privacy-preserving noise does not preclude the extraction of potentially useful answers from the dataset.
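
For anyone who would rather see that than take it on faith, here is a small simulation sketch (plain Python with an assumed noise scale, not the demo itself): querying a house with one infected resident and a house with none produces answer distributions that overlap so much that a guess based on a single noisy answer is only modestly better than a coin flip.

```python
import random

def noisy_count(real, scale=2.5):
    # Laplace noise as the difference of two exponential draws.
    return real + random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

trials = 100_000
infected_house = [noisy_count(1) for _ in range(trials)]  # my neighbor's house
healthy_house = [noisy_count(0) for _ in range(trials)]   # my house

# Fraction of answers that "look more like 1 than 0" for each house:
print(sum(a > 0.5 for a in infected_house) / trials)  # roughly 0.59
print(sum(a > 0.5 for a in healthy_house) / trials)   # roughly 0.41
```

Turning the noise scale up pushes both of those fractions toward 0.50, i.e. toward a pure coin flip.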

You can play with the demo here (requires Silverlight).

What does a privacy guarantee mean to you? Harm v. Obscurity

Friday, December 18th, 2009

Left: Senator Joseph McCarthy. Right: The band, Kajagoogoo.

At the Symposium in November we spent quite a bit of time trying to wrap our collective brain around the PINQ privacy technology, what it actually guarantees and how it does so.

I’ve attempted to condense our several hours of discussion into a series of blog posts.

We began our discussion with the question: What does a privacy guarantee mean to you?

There was a range of answers to this question. They all boiled down to one of the following two:

  1. Nothing bad will come of this information I give you: It won’t be used against me (discrimination, fraud investigations, psychological warfare). It won’t be used to harass me (spam).
  2. The absence or presence of a single record cannot be discerned.

Let’s just say definition 1 is the “layperson” definition, which is more focused on the consequences of giving up personal data.

And definition 2 is the “technologist” definition, which is more focused on the mechanism behind how to actually fulfill the layperson’s guarantee in a meaningful, calculable way.

Q. What does PINQ guarantee?

Some context: PINQ is a layer of code that sits between a data set and anyone trying to ask questions of that data, and it guarantees privacy, in a measurable way, to the individuals represented in the data.

The privacy PINQ guarantees is broader than the layperson’s understanding of privacy. Not only does PINQ guard against re-identification, targeting, and in short, any kind of harm resulting from exposing your data, it prevents any and all things in the universe from changing as a direct result of your individual data contribution.

Sounds like cosmic wizardry. Not really; it’s simply a clever bit of armchair thinking.

If you want to guarantee that nothing in the world will change as a result of someone contributing their data to a data set, then you simply need to make sure that no one asking questions of that data set will get answers that are discernibly affected by the presence or absence of any one person.

Therefore, if you define a privacy guarantee as “the absence or presence of a single record cannot be discerned,” meaning the inclusion of your data in a data set will have no discernible impact on the answers people get out of that data set, you also end up guaranteeing that nothing bad can ever happen to you if you contribute your data. In fact, absolutely nothing (good or bad) will happen to anyone as a direct result of your contributing your data, because with PINQ as the gatekeeper, your particular data record might as well not be there!
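
For the mathematically inclined, “no discernible impact” has a precise form in differential privacy: for every possible answer, the probability of seeing it with your record included and with it excluded differ by at most a small factor, e^ε, where ε is the privacy parameter. Here is a tiny numeric check of that bound for a noisy count; it is a sketch built on my own assumed numbers, not a peek into PINQ’s internals.

```python
import math

def laplace_height(x, center, scale):
    return math.exp(-abs(x - center) / scale) / (2 * scale)

epsilon = 0.1
scale = 1 / epsilon  # Laplace scale that makes a simple count epsilon-differentially private

with_me, without_me = 312, 311  # real counts differing by exactly my one record

for noisy_answer in (290, 311, 330):
    ratio = laplace_height(noisy_answer, with_me, scale) / laplace_height(noisy_answer, without_me, scale)
    # The ratio always stays between e^-epsilon and e^epsilon (allowing for float rounding).
    assert math.exp(-epsilon) - 1e-9 <= ratio <= math.exp(epsilon) + 1e-9
    print(f"answer {noisy_answer}: probability ratio {ratio:.3f} (bound {math.exp(epsilon):.3f})")
```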

What is the practical fallout of such a guarantee?

Not only will you not be targeted to receive SPAM as a result of contributing your data to a dataset, no one else will be targeted to receive SPAM as a result of you contributing your data to a data set.

Not only will you not be discriminated against by future employers or insurance companies as a result of contributing your data to a dataset, no one else will be discriminated against as a result of you contributing your data to a dataset.

Does this mean that my data doesn’t matter? Why then would I bother to contribute?

Now is a good time to point out that PINQ’s privacy guarantee is expansive, but in a very specific way. Nothing in the universe will change as a result of any one person’s data. However, the aggregate effect of everyone’s data will absolutely make a difference. It’s the same logic behind avoiding life’s little vices like telling white lies, littering or chewing gum in class. One person littering isn’t such a big deal. But what if everyone littered?

Still, is such an expansive privacy guarantee necessary?

It turns out, it’s incredibly hard to narrow a privacy guarantee to prevent just “harm,” because harm is a subjective concept with cultural and social overtones. One person’s spam is another’s helpful notification.

All privacy guarantees today, which largely focus on preventing harm, are by and large based on “good intentions” and “theoretical best practices,” not technical mechanisms that are measurable and “provable.”

However, change, as a function of how much any one person’s data discernibly affects the answers to questions asked of a data set, is readily measurable.

Up to this point, we’ve been engaged in a simple thought experiment that requires nothing more than a few turns of logic. How exactly PINQ keeps track of whether the absence or presence of a single record is discernible in the answers it gives out and the extent to which it’s discernible is a different matter and requires actual “innovation” and “technology.” Stay tuned for more on that.

Datatrust Prototype

Tuesday, December 8th, 2009

Editor’s Note: Grant Baillie is developing a datatrust prototype as an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Grant’s work could have interesting implications for CDP’s mission, as it would use technologies that could enable more disclosure of personal data for public re-use.  We’re glad to have him guest-blogging about the prototype on our blog, and we’re looking forward to hearing more as he moves forward.

This post is mostly the contents of a short talk I gave at the CDP Symposium last month. In a way, it was a little like the qualifying oral you have to give in some Ph.D. programs, where you stand up in front of a few professors who know way more than you do, and tell them what you think your research project is going to be.

That is the point we’re at with the Datatrust Prototype: We are ready to move forward with an actual, real, project. This proposal is the result of discussions Mimi, Alex and I have been having over the course of the past couple of months, with questions and insights on PINQ thrown in by Tony, and answers to some of those questions from Frank.

The talk can be broken up into three tersely titled sections:

Why? Basic motivation for the project.

What? What exactly the prototype will be.

Not: Potential features of a Datatrust that are out of scope for this first prototype.

Why?

We need a (concrete) thing

Partly this is to have something to demo to future clients/partner organizations, of course. However, we also need the beginnings of a real datatrust so that different people’s abstract concepts of what a datatrust is can begin to converge.

We need understanding of some technical issues

1. Understanding Privacy: People have been looking at this problem (i.e. releasing aggregated data in such a way that individuals’ privacy isn’t compromised) for over 40 years. After some well-publicized disasters involving ad-hoc approaches (e.g. “Just add some noise to the data”, or “remove all identifying data like names or social security numbers”) a bunch of researchers (like our friends at research.microsoft.com) came up with a mathematical model where there is a measure of privacy, ε (epsilon).

In the model, there’s a clear message: you have to pay (in privacy) if you want greater accuracy (i.e. less noise) in your answers. In particular, the PINQ C# API can calculate the privacy cost ε of each query on a dataset. So, one can imagine having different users of a datatrust using up allocations of privacy they have been assigned, hence the term “Privacy Budget”. (Frank dislikes this term because there are many privacy strategies possible other than a simple, fixed budget). In any case, by creating the prototype, we are hoping to gain an intuitive understanding of this mathematical concept of privacy, as well as obtain insight on more practical matters like how to allocate privacy so that the datatrust is still useful.

One way of understanding privacy is to think of it as a probability (i.e. of leaking data) or a measure of risk. You could even imagine an organization buying insurance against loss of individual data, based on the mathematical bounds supplied by PINQ. The downside of this approach is that we humans don’t seem to have a good intuitive grasp of things like probability and risk (writ large, for example, in the financial meltdown last year).

Another approach that might be helpful is to notice that privacy behaves in the same way as a currency (for example, it is additive). Here, you can imagine people earning or trading currency, for example. With actual money, we have a couple of thousand years’ worth of experience built into evaluations like a house being worth a million Snickers bars: how long will it take us to develop similar intuition for a privacy currency?
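
That additivity is easy to write down. Here is a minimal sketch of a fixed “privacy budget” ledger, my own toy bookkeeping rather than PINQ’s actual accounting machinery, just to show what spending privacy currency, and running out of it, could look like:

```python
class PrivacyBudget:
    """Toy ledger: each query spends some epsilon, and spending is additive."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted; query refused")
        self.spent += epsilon
        return self.total - self.spent  # remaining budget

budget = PrivacyBudget(total_epsilon=1.0)
print(budget.charge(0.1))  # a cheap, very noisy query: 0.9 left
print(budget.charge(0.5))  # a more accurate, more expensive query: 0.4 left
# budget.charge(0.5)       # would raise: a fixed budget is a hard cap
```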

2. PINQ vs SQL: Here, by “SQL” I’m talking about traditional persistent data storage mechanisms in general. In most specific cases we are talking about SQL-based databases (although in the data analysis world there are other possibilities, like SAS).

  • SQL has been around for over 35 years, and is based on a mathematical model of its own. It basically provides a set of building blocks for querying and updating a database.
  • PINQ is a wrapper that basically protects the privacy of an underlying SQL database. It allows you to run SQL-like queries to get result sets, but then only lets you see statistical information about these sets. Even this information will come at some privacy cost, depending on how accurate you want the answer to be. PINQ will add random noise to any answer it gives you; if you want to ask for more accurate answers, i.e. less noise added (on average), you have to pay more privacy currency.
Note: At this point in the talk, I went into a detailed discussion of a particular case of how PINQ privacy protection is supposed to work. However, I’m going to defer this to an upcoming post.

PINQ provides building blocks that are similar to SQL’s, with the caveat that the only data you can get out is aggregated (i.e. numbers, averages, and other statistical information). Also, some SQL operations cannot be supported by PINQ because they cannot be privacy protected at all.

In any case, both PINQ and SQL support an infinite number of questions, since you can ask about arbitrary functions of the input records. However, because they have somewhat different query building blocks, it is at least theoretically possible that there are real-world data analyses that cannot be replicated exactly in PINQ, or can only be done in a cumbersome or (privacy) expensive way. So, it will be good to focus on more concrete use cases, in order to see whether this is the case or not.

3. Efficient Queries: It’s not uncommon for database-based software projects to grind to a halt at some point when it becomes clear that the database isn’t handling the full data set as well as is needed. Then various experts are rushed in to tune and rewrite the queries so that they perform better. In the case of PINQ, there is an additional measure of query performance: that of privacy used. Frank’s PINQ tutorial already has one example of a query that can be tuned to use the privacy budget more efficiently. Hopefully, by working through specific use cases, CDP can start building expertise in query optimization.
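
One concrete flavor of that tuning, as I understand the general differential privacy accounting rules (this is a sketch of those rules, not of Frank’s specific example): asking for several counts one by one charges you for each of them, while a single query partitioned into disjoint buckets can charge you only once, because any individual record lands in at most one bucket.

```python
# Sequential composition: k separate queries over the same records cost k * epsilon.
epsilon_per_query = 0.1
separate_counts = 12  # e.g. one count per month, asked as twelve independent queries
print(separate_counts * epsilon_per_query)  # 1.2 units of privacy spent

# Parallel composition: one partitioned query (each record falls in exactly one
# month bucket) can cost just epsilon_per_query in total.
print(epsilon_per_query)  # 0.1 units of privacy spent
```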

What?

Target: A researcher working for a single organization. We’re going to imagine that some organization has a dataset containing personal information, but they want to be able to do some data analysis and release statistical aggregates to the public without compromising any individual’s privacy.

A Mockup of a Rich Dataset: Hopefully, I’ve given enough incentive for why we want a reasonably “real-world” structure to our data. I’m proposing that we choose a subset of the schema available as part of the National Health and Nutrition Examination Survey (NHANES):

NHANES Survey Content Cover Page

This certainly satisfies the “rich” requirement: NHANES combines an interesting and extensive mix of medical and sociological information (The above cover page image comes from the description of the data collected, a 12-page PDF file). Clearly, we wouldn’t want to mock up the entire dataset, but even a small subset should make for some reasonably complex analyses.

Queries: We will supply at least a canned set of queries over the sample data. A scenario I have in mind is being able to have something like Tony’s demo, but with a more complex data set. A core requirement of the prototype is to be able to reproduce the published aggregations done with the real NHANES dataset. Some kind of geographical layout, like the demo, would be compelling, too.

Not

Account management: This includes issues of tracking privacy allocation and expenditures on a per-user basis, possibly having some measure of trust to allow this. There may be some infrastructure for different users in the prototype, but for the most part we’ll be assuming a single, global user.

Collaborative queries: In the future, we could imagine having users contribute to a library of well-known queries for a given data set. The problem with public access like this is that it basically means that all privacy budget is effectively shared, since query results are shared, so for this first cut at the problem we are not going to tackle this.

Multiple Datasets, Updates: For now, we will assume a single data set, with no updates. (The former can raise security concerns, especially if data sets aren’t co-hosted, while the latter is an area where I’m not sure what the mathematical constraints are.)

Sneaky code (though maybe we have a service): There is a known issue with having PINQ execute arbitrary C# code to do queries: at the moment, it is possible to have your code save all the records it sees to a file on disk. We may work around this by having the datatrust be a service (i.e. effectively restricting the allowed queries so no user-supplied code is run).

Deployment issues (e.g. who owns the data): Our prototype will just have PINQ and the simulated database running on the same machine, even though more general configurations are possible. We also explicitly don’t tackle whether the database is running on a CDP server or the organization that owns the data.

Open Source Ideological Purity: While it would be nice for CDP to be able to deploy on an open source platform, it is clear that serious issues might lie in wait for deploying on Mono (the open source C# environment). In that case, it is quite possible to switch to running PINQ on top of, say, Microsoft SQL Server.

PINQ: Programmatic Privacy

Friday, August 28th, 2009

For those following this blog closely, you know CDP is developing the concept of a datatrust. Slightly less obvious may be the fact that we’re actually planning on building a datatrust too. As such, we have been following interesting privacy ideas and technologies for a few years now, by attending conferences, reading papers and talking to interesting folks.

PINQ Diagram

PINQ is a layer between the analyst and the datastore

Privacy Challenges with Aggregates

One of the key realizations that led to the creation of CDP was that the data that is valuable for analysis (generally aggregate statistical data) is not, in principle, the data that concerns privacy advocates (identifiable, personal information). While that is true in many cases, the details are a bit more complicated.

I often quote the following statistic from Latanya Sweeney’s famous 2000 paper:

87% of people in the US can be uniquely identified by a combination of their zip code, their gender and their date of birth.

(Since then there has been some debate around that fact – Philippe Golle at PARC said in 2006 it’s really 63%.) But the fact remains that often your seemingly innocuous demographic data can actually be as unique a fingerprint as your social security number. A ways back I wrote about how it is standard practice for pollsters to tie a set of “anonymous” survey responses to any number of public databases that also contain those demographic details and tie your name(s), address(es), income, title(s), car(s) to your “anonymous” survey results. (Anonymous is in quotes because it’s CDP humor – we believe the term is used incorrectly most of the time.)  It’s like the Statistician’s Edition of Trivial Pursuit. In fact, zip code, gender and birth date are just an example – the more characteristics of any person (or place or object or anything really) you collect, the more likely it is that the set is unique. How unique is the set of purchases on your monthly credit card statement?
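
To see why those three fields make such a good fingerprint, here is a toy sketch of the linking trick itself, with entirely made-up data and a deliberately tiny “public” database: join an “anonymous” survey to a voter-roll-style list on zip code, gender and date of birth, and the names fall out.

```python
# "Anonymous" survey: no names, just demographics plus a sensitive answer.
survey = [
    {"zip": "98101", "gender": "F", "dob": "1971-03-02", "answer": "has condition X"},
    {"zip": "98101", "gender": "M", "dob": "1985-11-19", "answer": "does not"},
]

# Public records (voter rolls, marketing lists, ...) carry the same demographics plus names.
public = [
    {"name": "Jane Doe", "zip": "98101", "gender": "F", "dob": "1971-03-02"},
    {"name": "John Roe", "zip": "98101", "gender": "M", "dob": "1985-11-19"},
]

# Join on the (zip, gender, dob) fingerprint: the "anonymous" rows pick up names.
lookup = {(p["zip"], p["gender"], p["dob"]): p["name"] for p in public}
for row in survey:
    key = (row["zip"], row["gender"], row["dob"])
    print(lookup.get(key, "no unique match"), "->", row["answer"])
```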

This reality poses a potentially showstopping problem for the datatrust: sure, aggregates are probably fine most of the time, but if we want to store and allow analysis of highly sensitive data, how can we be sure identities won’t be derived from even the aggregates?

PINQ: Privacy Integrated Queries

Enter PINQ, just made public by Frank McSherry at Microsoft Research under the standard MSR license. PINQ, or Privacy Integrated Queries, is an implementation of a concept called Differential Privacy. (Cynthia Dwork’s paper seems to be a good overview before diving into the math behind it; Frank’s paper speaks to PINQ in particular. There’s also a tutorial for those who want to get their hands dirty.) PINQ provides a layer between the data analyst and the datastore that ensures no privacy disclosures.

Wait, one better: It guarantees no privacy disclosures. How could that be?

Here’s an example: Imagine you record the heights of each of the people in your subway car in the morning, and calculate the average height. Then imagine that you also recorded each person’s zip code, gender and birth date. According to Sweeney above, if you calculated the “average” height of each combination of zip code, gender and birth date, you would not only know the exact height of 87% of the people on the car, but, with the help of some other public databases, you’d also know who they were.

Here’s where differential privacy helps. Take the height data you recorded (zip codes and all) and put it behind a differential privacy software wall. By adding just the right amount of noise to the results, you, the analyst, can query for statistical representations of all different combinations of the characteristics, and you’ll get an answer and a measure of the accuracy of the response. Instead of finding out that the average height was 5′ 5.23″, you might find out that the average height was 5′ 4″ +/- 0.75″. (I’m making these numbers up and over-simplifying.)
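
To give a flavor of how a noisy average could be produced, here is one common differential privacy recipe, heavily simplified and with numbers and bounds I made up (PINQ’s own averaging operator may differ in its details): clamp every height into a publicly known range, then divide a noisy sum by a noisy count.

```python
import random

def laplace(scale):
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

heights_cm = [158, 162, 171, 175, 180, 184, 190]  # made-up subway car
low, high = 100, 220                              # publicly known clamping bounds

clamped = [min(max(h, low), high) for h in heights_cm]

epsilon = 0.5  # total privacy cost, split between the sum and the count
# Removing one rider changes the shifted sum by at most (high - low) and the
# count by at most 1, so each piece gets noise scaled to its own sensitivity.
noisy_sum = sum(h - low for h in clamped) + laplace((high - low) / (epsilon / 2))
noisy_count = max(len(clamped) + laplace(1 / (epsilon / 2)), 1.0)  # guard the toy division

# With only a handful of riders the noise dominates; useful answers need bigger crowds.
print(low + noisy_sum / noisy_count)  # a noisy estimate of the average height, in cm
```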

A Programmatic Privacy Guarantee

The guarantee of differential privacy in the above example is that if you remove any one person from the subway car dataset and ask PINQ again for the average height, the answer would be statistically the same (with the same level of accuracy) as the answer when they were included in the set.

For the analyst trying to understand the big picture, PINQ offers accurate answers and privacy. For the attacker (to use the security-lingo) seeking an individual’s data, PINQ offers answers so inaccurate they are useless.

What every prototype needs: Road Miles

Does it work? I’ve chatted with Frank a lot, and there seems to be a growing consensus in the research communities that it does; based on what I have seen I am very optimistic. However, at least right now, the guarantee is less of a concern than usability: How hard is it to understand a dataset when all you can extract from it are noisy aggregates? We’re hoping that it is more useful than not having any access to certain sensitive datasets, but we don’t know yet.

PINQ Workflow: Query, Execution, Noise, Results

So what will we be doing for the next few months? Taking PINQ out on the highway. And trying to figure out what role it can play in the datatrust. We’ll keep you posted!

Free the Law!

Thursday, August 27th, 2009

Working as a journalist, I am routinely astonished at how hard it is to lay eyes on our laws.

Everyone knows it’s illegal to jaywalk in New York City, but can you find the statute and read it back to me? Go on, Google it.

Didn’t think so.

Now a bunch of websites are trying to rectify the problem of hard-to-locate laws and legal records, as Katherine Mangu-Ward notes in the Wall Street Journal.

Openregs.com is an unofficial, user-friendly alternative to the official online government register, regulations.gov.

The official US Government stimulus-tracker, recovery.gov, can be clunky, but recovery.org, the creation of a for-profit business, is easier to navigate.

Most provocative of all, recapthelaw.org accepts donated legal records obtained by users of PACER, the maddeningly unsearchable official database of US court documents.

Some of these unofficial pages are still developing and have their own shortcomings. But 15 years after the internet gave people direct access to pornography, isn’t it about time we could also read our own laws in the privacy of our homes?

