Editor’s Note: Grant Baillie is developing a datatrust prototype as an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project. Grant’s work could have interesting implications for CDP’s mission, as it would use technologies that could enable more disclosure of personal data for public re-use. We’re glad to have him guest-blogging about the prototype on our blog, and we’re looking forward to hearing more as he moves forward.
This post is mostly the contents of a short talk I gave at the CDP Symposium last month. In a way, it was a little like the qualifying oral you have to give in some Ph.D. programs, where you stand up in front of a few professors who know way more than you do, and tell them what you think your research project is going to be.
That is the point we’re at with the Datatrust Prototype: we are ready to move forward with an actual, real project. This proposal is the result of discussions Mimi, Alex and I have been having over the past couple of months or so, with questions and insights on PINQ thrown in by Tony, and answers to some of those questions from Frank.
The talk can be broken up into three tersely titled sections:
Why? Basic motivation for the project.
What? What exactly the prototype will be.
Not: Potential features of a Datatrust that are out of scope for this first prototype.
Why?
We need a (concrete) thing
Partly this is to have something to demo to future clients/partner organizations, of course. However, we also need the beginnings of a real datatrust so that different people’s abstract concepts of what a datatrust is can begin to converge.
We need understanding of some technical issues
1. Understanding Privacy: People have been looking at this problem (i.e. releasing aggregated data in such a way that individuals’ privacy isn’t compromised) for over 40 years. After some well-publicized disasters involving ad-hoc approaches (e.g. “Just add some noise to the data”, or “remove all identifying data like names or social security numbers”) a bunch of researchers (like our friends at research.microsoft.com) came up with a mathematical model where there is a measure of privacy, ε (epsilon).
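For reference, the model in question is generally known as differential privacy. The talk did not spell out the definition, so the notation below is an assumption taken from the standard statement: a randomized mechanism M is ε-differentially private if, for any two datasets D and D′ that differ in one individual’s record, and any set S of possible outputs,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S]
```

The smaller ε is, the less any single person’s record can change what an observer sees.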
In the model, there’s a clear message: you have to pay (in privacy) if you want greater accuracy (i.e. less noise) in your answers. In particular, the PINQ C# API can calculate the privacy cost ε of each query on a dataset. So, one can imagine different users of a datatrust using up allocations of privacy they have been assigned, hence the term “Privacy Budget”. (Frank dislikes this term because there are many possible privacy strategies other than a simple, fixed budget.) In any case, by creating the prototype, we are hoping to gain an intuitive understanding of this mathematical concept of privacy, as well as insight into more practical matters like how to allocate privacy so that the datatrust is still useful.
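To make the simple fixed-budget idea concrete, here is a minimal sketch in Python. It is not PINQ’s API: the class and method names are hypothetical, and it only shows the bookkeeping, where each query’s ε is added to a running total and a query is refused once the allocation would be overdrawn.

```python
class PrivacyBudgetExceeded(Exception):
    """Raised when a query would cost more epsilon than remains."""


class PrivacyBudget:
    """Hypothetical fixed budget: each query's epsilon is added to a running total."""

    def __init__(self, total_epsilon: float) -> None:
        if total_epsilon < 0:
            raise ValueError("total_epsilon must be non-negative")
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return self.total - self.spent

    def charge(self, epsilon: float) -> None:
        """Record the cost of one query, refusing it if the budget would be overdrawn."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise PrivacyBudgetExceeded(
                f"query costs {epsilon}, only {self.remaining} left"
            )
        self.spent += epsilon  # costs are additive across queries


# One user with an allocation of 1.0 runs two queries.
budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.25)
budget.charge(0.5)
```

After these two queries the user has 0.25 left, so a third query costing 0.5 would be refused. Other strategies (the ones Frank has in mind) would replace the refusal rule, not the additive accounting.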
One way of understanding privacy is to think of it as a probability (i.e. of leaking data) or a measure of risk. You could even imagine an organization buying insurance against loss of individual data, based on the mathematical bounds supplied by PINQ. The downside of this approach is that we humans don’t seem to have a good intuitive grasp of things like probability and risk (writ large, for example, in the financial meltdown last year).
Another approach that might be helpful is to notice that privacy behaves in the same way as a currency (for example, it is additive). Here, you can imagine people earning or trading currency. With actual money, we have a couple of thousand years’ worth of experience built into evaluations like a house being worth a million Snickers bars: how long will it take us to have similar intuition with a privacy currency?
2. PINQ vs SQL: Here, by “SQL” I mean traditional persistent data storage mechanisms in general. In most specific cases we are talking about SQL-based databases (although in the data analysis world there are other possibilities, like SAS).
- SQL has been around for over 35 years, and is based on a mathematical model of its own. It basically provides a set of building blocks for querying and updating a database.
- PINQ is a wrapper that basically protects the privacy of an underlying SQL database. It allows you to run SQL-like queries to get result sets, but then only lets you see statistical information about these sets. Even this information will come at some privacy cost, depending on how accurate you want the answer to be. PINQ will add random noise to any answer it gives you; if you want to ask for more accurate answers, i.e. less noise added (on average), you have to pay more privacy currency.
PINQ provides building blocks that are similar to SQL’s, with the caveat that the only data you can get out is aggregated (i.e. numbers, averages, and other statistical information). Also, some SQL operations cannot be supported by PINQ because they cannot be privacy protected at all.
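The trade-off between noise and privacy cost can be illustrated with a small Python sketch. This is not PINQ code, and the function name and parameters are made up for illustration. It assumes the standard Laplace mechanism for counts: adding or removing one record changes a count by at most 1, so Laplace noise with scale 1/ε is enough for a cost of ε.

```python
import random


def noisy_count(records, predicate, epsilon, rng=random):
    """Count the records matching predicate, then add Laplace noise of scale 1/epsilon.

    A larger epsilon (more privacy spent) means less noise on average.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    # The difference of two exponentials with rate epsilon is
    # Laplace-distributed with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


ages = [23, 35, 41, 52, 67, 70]
cheap_answer = noisy_count(ages, lambda age: age > 50, epsilon=0.1)    # noisy, low cost
costly_answer = noisy_count(ages, lambda age: age > 50, epsilon=2.0)   # sharper, high cost
```

The caller never sees the matching records themselves, only the noisy number. On average the answer at ε = 0.1 is off by about 10, while the answer at ε = 2.0 is off by about 0.5.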
In any case, both PINQ and SQL support an infinite number of questions, since you can ask about arbitrary functions of the input records. However, because they have somewhat different query building blocks, it is at least theoretically possible that there are real-world data analyses that cannot be replicated exactly in PINQ, or can only be done in a cumbersome or (privacy) expensive way. So, it will be good to focus on more concrete use cases, in order to see whether this is the case or not.
3. Efficient Queries: It’s not uncommon for database-based software projects to grind to a halt at some point when it becomes clear that the database isn’t handling the full data set as well as is needed. Then various experts are rushed in to tune and rewrite the queries so that they perform better. In the case of PINQ, there is an additional measure of query performance, that of privacy used. Frank’s PINQ tutorial already has one example of a query that can be tuned to use privacy budget more efficiently. Hopefully, by working through specific use cases, CDP can start building expertise in query optimization.
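One general way a query can be tuned for privacy cost, sketched here in Python, is to split the data into disjoint groups before counting. This is an illustration of the usual composition rules for ε, not necessarily the example in Frank’s tutorial, and all names are hypothetical. Separate filtered queries over the same data each charge their ε, so costs add up. Queries over disjoint groups touch each record only once, so the overall cost is just the largest single cost.

```python
from collections import defaultdict


def partition(records, key):
    """Split records into disjoint groups; every record lands in exactly one group."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return dict(groups)


def cost_of_overlapping_queries(epsilons):
    """Queries that may touch the same records: the costs add up."""
    return sum(epsilons)


def cost_of_disjoint_queries(epsilons):
    """Queries over disjoint groups: the cost is the largest single cost."""
    return max(epsilons, default=0.0)


def age_band(age):
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"


ages = [8, 15, 23, 35, 41, 52, 67, 70]
groups = partition(ages, key=age_band)
per_query_epsilon = 0.25

# One filtered count per band, each run against the whole dataset.
filtered_cost = cost_of_overlapping_queries([per_query_epsilon] * len(groups))
# One count per band after partitioning into disjoint groups.
partitioned_cost = cost_of_disjoint_queries([per_query_epsilon] * len(groups))
```

With three age bands, the filtered version costs 0.75 while the partitioned version costs 0.25 for the same three counts at the same accuracy.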
What?
Target: A researcher working for a single organization. We’re going to imagine that some organization has a dataset containing personal information, but they want to be able to do some data analysis and release statistical aggregates to the public without compromising any individual’s privacy.
A Mockup of a Rich Dataset: Hopefully, I’ve given enough motivation for why we want a reasonably “real-world” structure to our data. I’m proposing that we choose a subset of the schema available as part of the National Health and Nutrition Examination Survey (NHANES):
This certainly satisfies the “rich” requirement: NHANES combines an interesting and extensive mix of medical and sociological information (The above cover page image comes from the description of the data collected, a 12-page PDF file). Clearly, we wouldn’t want to mock up the entire dataset, but even a small subset should make for some reasonably complex analyses.
Queries: We will supply at least a canned set of queries over the sample data. A scenario I have in mind is being able to have something like Tony’s demo, but with a more complex data set. A core requirement of the prototype is to be able to reproduce the published aggregations done with the real NHANES dataset. Some kind of geographical layout, like the demo, would be compelling, too.
Not
Account management: This includes issues of tracking privacy allocation and expenditures on a per-user basis, possibly having some measure of trust to allow this. There may be some infrastructure for different users in the prototype, but for the most part we’ll be assuming a single, global user.
Collaborative queries: In the future, we could imagine having users contribute to a library of well-known queries for a given data set. The problem with public access like this is that all privacy budget is effectively shared, since query results are shared, so for this first cut at the problem we are not going to tackle this.
Multiple Datasets, Updates: For now, we will assume a single data set, with no updates. (The former can raise security concerns, especially if data sets aren’t co-hosted, while the latter is an area where I’m not sure what the mathematical constraints are.)
Sneaky code (though maybe we have a service): There is a known issue at the moment with having PINQ execute arbitrary C# code to do queries: it is possible to have your code save all the records it sees to a file on disk. We may work around this by having the datatrust be a service (i.e. effectively restricting the allowed queries so no user-supplied code is run).
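As a rough illustration of the service idea, the Python sketch below only runs queries chosen by name from a fixed table, so callers never get to supply code of their own. Everything here is hypothetical: the query names, the record fields (which are not real NHANES variables), and the function itself. Noise addition is deliberately left out to keep the focus on the restriction; a real service would return noisy answers only.

```python
# Fixed table of allowed queries, defined by the service operator.
# Field names are made up for illustration.
CANNED_QUERIES = {
    "count_adults": lambda record: record["age"] >= 18,
    "count_smokers": lambda record: record["smoker"],
}


def run_canned_query(name, records):
    """Run one of the predefined counting queries, selected by name only.

    NOTE: returns the exact count for clarity; a real datatrust service
    would add noise and charge a privacy cost before answering.
    """
    try:
        predicate = CANNED_QUERIES[name]
    except KeyError:
        raise ValueError(f"unknown query: {name!r}") from None
    return sum(1 for record in records if predicate(record))


people = [
    {"age": 30, "smoker": True},
    {"age": 12, "smoker": False},
    {"age": 45, "smoker": False},
]
adults = run_canned_query("count_adults", people)
```

Because the only input from the user is a name, there is no opportunity for a query to write the records it sees to disk.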
Deployment issues (e.g. who owns the data): Our prototype will just have PINQ and the simulated database running on the same machine, even though more general configurations are possible. We also explicitly don’t tackle whether the database runs on a CDP server or on a server belonging to the organization that owns the data.
Open Source Ideological Purity: While it would be nice for CDP to be able to deploy on an open source platform, it is clear that serious issues might lie in wait for deploying on Mono (the open source C# environment). In that case, it is quite possible to switch to running PINQ on top of, say, Microsoft SQL Server.