Posts Tagged ‘Privacy’

Datatrust Prototype

Tuesday, December 8th, 2009

Editor’s Note: Grant Baillie is developing a datatrust prototype as an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Grant’s work could have interesting implications for CDP’s mission, as it would use technologies that could enable more disclosure of personal data for public re-use.  We’re glad to have him guest-blogging about the prototype on our blog, and we’re looking forward to hearing more as he moves forward.

This post is mostly the contents of a short talk I gave at the CDP Symposium last month. In a way, it was a little like the qualifying oral you have to give in some Ph.D. programs, where you stand up in front of a few professors who know way more than you do, and tell them what you think your research project is going to be.

That is the point we’re at with the Datatrust Prototype: We are ready to move forward with an actual, real project. This proposal is the result of discussions Mimi, Alex and I have been having over the course of the past couple of months or so, with questions and insights on PINQ thrown in by Tony, and answers to some of those questions from Frank.

The talk can be broken up into three tersely titled sections:

Why? Basic motivation for the project.

What? What exactly the prototype will be.

Not: Potential features of a Datatrust that are out of scope for this first prototype.

Why?

We need a (concrete) thing

Partly this is to have something to demo to future clients/partner organizations, of course. However, we also need the beginnings of a real datatrust so that different people’s abstract concepts of what a datatrust is can begin to converge.

We need understanding of some technical issues

1. Understanding Privacy: People have been looking at this problem (i.e. releasing aggregated data in such a way that individuals’ privacy isn’t compromised) for over 40 years. After some well-publicized disasters involving ad-hoc approaches (e.g. “Just add some noise to the data”, or “remove all identifying data like names or social security numbers”) a bunch of researchers (like our friends at research.microsoft.com) came up with a mathematical model where there is a measure of privacy, ε (epsilon).

In the model, there’s a clear message: you have to pay (in privacy) if you want greater accuracy (i.e. less noise) in your answers. In particular, the PINQ C# API can calculate the privacy cost ε of each query on a dataset. So, one can imagine having different users of a datatrust using up allocations of privacy they have been assigned, hence the term “Privacy Budget”. (Frank dislikes this term because there are many privacy strategies possible other than a simple, fixed budget.) In any case, by creating the prototype, we are hoping to gain an intuitive understanding of this mathematical concept of privacy, as well as obtain insight on more practical matters like how to allocate privacy so that the datatrust is still useful.
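
To make the “Privacy Budget” idea concrete, here is a minimal sketch in Python of the simplest possible strategy: a single fixed allocation of ε that each query draws down. It is only an illustration of the bookkeeping. PINQ itself is a C# library, and the names here (PrivacyBudget, charge, BudgetExceeded) are hypothetical, not part of its API.

```python
class BudgetExceeded(Exception):
    """Raised when a query would cost more privacy than remains."""


class PrivacyBudget:
    """Hypothetical fixed budget: a total epsilon that queries draw down."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self):
        return self.total - self.spent

    def charge(self, epsilon):
        """Record the cost of one query, or refuse it if funds are short."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise BudgetExceeded(
                f"query costs {epsilon}, only {self.remaining} left"
            )
        self.spent += epsilon  # the costs of successive queries add up
        return self.remaining


budget = PrivacyBudget(1.0)
budget.charge(0.25)      # a cheap, noisy query
budget.charge(0.5)       # a more accurate, more expensive one
try:
    budget.charge(0.5)   # would overdraw the allocation, so it is refused
except BudgetExceeded as err:
    print("refused:", err)
```

Other strategies (per-user allocations, budgets that replenish over time) would replace the single counter with something richer, which is presumably part of why a plain “budget” is not the whole story.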

One way of understanding privacy is to think of it as a probability (i.e. of leaking data), or a measure of risk. You could even imagine an organization buying insurance against loss of individual data, based on the mathematical bounds supplied by PINQ. The downside of this approach is that we humans don’t seem to have a good intuitive grasp of things like probability and risk (writ large, for example, in the financial meltdown last year).

Another approach that might be helpful is to notice that privacy behaves in the same way as a currency (for example, it is additive). Here, you can imagine people earning or trading currency, for example. With actual money, we have a couple of thousand years’ worth of experience built into evaluations like a house being worth a million Snickers bars: How long will it take us to have similar intuition with a privacy currency?

2. PINQ vs SQL: Here, by “SQL” I’m talking of traditional persistent data storage mechanisms in general. In most specific cases we are talking about SQL-based databases (although in the data analysis world there are other possibilities, like SAS).

  • SQL has been around for over 35 years, and is based on a mathematical model of its own. It basically provides a set of building blocks for querying and updating a database.
  • PINQ is a wrapper that basically protects the privacy of an underlying SQL database. It allows you to run SQL-like queries to get result sets, but then only lets you see statistical information about these sets. Even this information will come at some privacy cost, depending on how accurate you want the answer to be. PINQ will add random noise to any answer it gives you; if you want to ask for more accurate answers, i.e. less noise added (on average), you have to pay more privacy currency.

Note: At this point in the talk, I went into a detailed discussion of a particular case of how PINQ privacy protection is supposed to work. However, I’m going to defer this to an upcoming post.

PINQ provides building blocks that are similar to SQL’s, with the caveat that the only data you can get out is aggregated (i.e. numbers, averages, and other statistical information). Also, some SQL operations cannot be supported by PINQ because they cannot be privacy protected at all.
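
The trade-off between noise and privacy cost can be sketched in a few lines of Python. This is not PINQ code. It assumes the standard approach for a private count, which is to add Laplace noise with scale 1/ε. The helper names and the made-up list of ages are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Count the matching records, then add noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1, which is
    why a scale of 1/epsilon is the usual choice for a count.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon, trials=2000, seed=0):
    """Average distance between the noisy and the true answer."""
    rng = random.Random(seed)
    ages = [20 + (i % 60) for i in range(1000)]  # made-up records
    truth = sum(1 for a in ages if a >= 65)
    total = 0.0
    for _ in range(trials):
        total += abs(noisy_count(ages, lambda a: a >= 65, epsilon, rng) - truth)
    return total / trials


# Paying ten times more privacy buys roughly ten times less noise.
print(mean_abs_error(0.1), mean_abs_error(1.0))
```

The expected error of such a count is 1/ε, so the two printed values should come out near 10 and near 1 respectively.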

In any case, both PINQ and SQL support an infinite number of questions, since you can ask about arbitrary functions of the input records. However, because they have somewhat different query building blocks, it is at least theoretically possible that there are real-world data analyses that cannot be replicated exactly in PINQ, or can only be done in a cumbersome or (privacy) expensive way. So, it will be good to focus on more concrete use cases, in order to see whether this is the case or not.

3. Efficient Queries: It’s not uncommon for database-based software projects to grind to a halt at some point when it becomes clear that the database isn’t handling the full data set as well as is needed. Then various experts are rushed in to tune and rewrite the queries so that they perform better. In the case of PINQ, there is an additional measure of query performance, that of privacy used. Frank’s PINQ tutorial already has one example of a query that can be tuned to use privacy budget more efficiently. Hopefully, by working through specific use cases, CDP can start building expertise in query optimization.
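
As one hedged illustration of what tuning for privacy might look like (not necessarily the example in Frank’s tutorial), consider a histogram. If each group is counted with a separate filtered query over the whole data set, the costs add. If the records are first split into disjoint parts, each person can affect only one part, so the overall cost is the largest per-part cost instead of the sum. The Python below sketches just that accounting. All function names are hypothetical.

```python
import random
from collections import defaultdict


def laplace_noise(scale, rng):
    """Sample Laplace noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def histogram_by_separate_counts(records, key, groups, epsilon, rng):
    """One filtered count per group. Returns (noisy counts, epsilon spent)."""
    counts, spent = {}, 0.0
    for group in groups:
        true_count = sum(1 for r in records if key(r) == group)
        counts[group] = true_count + laplace_noise(1.0 / epsilon, rng)
        spent += epsilon  # every query is charged against the whole data set
    return counts, spent


def histogram_by_partition(records, key, groups, epsilon, rng):
    """Split the records into disjoint parts first, then count each part."""
    parts = defaultdict(list)
    for r in records:
        parts[key(r)].append(r)  # each record lands in exactly one part
    counts = {
        group: len(parts[group]) + laplace_noise(1.0 / epsilon, rng)
        for group in groups
    }
    return counts, epsilon  # disjoint parts: charge the maximum, not the sum


rng = random.Random(1)
regions = ["N", "S", "E", "W"]
people = [{"region": name} for name in regions * 25]  # made-up records
_, naive_cost = histogram_by_separate_counts(
    people, lambda p: p["region"], regions, 0.1, rng
)
_, tuned_cost = histogram_by_partition(
    people, lambda p: p["region"], regions, 0.1, rng
)
print(naive_cost, tuned_cost)
```

Both versions return answers with the same amount of noise. The only difference is how much privacy each one is charged for.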

What?

Target: A researcher working for a single organization. We’re going to imagine that some organization has a dataset containing personal information, but they want to be able to do some data analysis and release statistical aggregates to the public without compromising any individual’s privacy.

A Mockup of a Rich Dataset: Hopefully, I’ve given enough motivation for why we want a reasonably “real-world” structure to our data. I’m proposing that we choose a subset of the schema available as part of the National Health and Nutrition Examination Survey (NHANES):

NHANES Survey Content Cover Page

This certainly satisfies the “rich” requirement: NHANES combines an interesting and extensive mix of medical and sociological information (The above cover page image comes from the description of the data collected, a 12-page PDF file). Clearly, we wouldn’t want to mock up the entire dataset, but even a small subset should make for some reasonably complex analyses.

Queries: We will supply at least a canned set of queries over the sample data. A scenario I have in mind is being able to have something like Tony’s demo, but with a more complex data set. A core requirement of the prototype is to be able to reproduce the published aggregations done with the real NHANES dataset. Some kind of geographical layout, like the demo, would be compelling, too.

Not

Account management: This includes issues of tracking privacy allocation and expenditures on a per-user basis, possibly having some measure of trust to allow this. There may be some infrastructure for different users in the prototype, but for the most part we’ll be assuming a single, global user.

Collaborative queries: In the future, we could imagine having users contribute to a library of well-known queries for a given data set. The problem with public access like this is that it basically means that all privacy budget is effectively shared, since query results are shared, so for this first cut at the problem we are not going to tackle this.

Multiple Datasets, Updates: For now, we will assume a single data set, with no updates. (The former can raise security concerns, especially if data sets aren’t co-hosted, while the latter is an area where I’m not sure what the mathematical constraints are.)

Sneaky code (though maybe we have a service): There is a known issue at the moment with PINQ executing arbitrary C# code to do queries: it is currently possible to have your code save all the records it sees to a file on disk. We may work around this by having the datatrust be a service (i.e. effectively restricting the allowed queries so no user-supplied code is run).
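
A rough sketch of the service idea, in Python and not in PINQ’s C#: callers choose a query by name and pass plain values, so no code of theirs ever sees a record. The registry and query names are hypothetical, and the noise and budget steps are left out to keep the sketch short.

```python
# Hypothetical registry of the only queries the service is willing to run.
CANNED_QUERIES = {
    "count_age_at_least": lambda record, min_age: record["age"] >= min_age,
    "count_in_region": lambda record, region: record["region"] == region,
}


def run_query(records, name, argument):
    """Run a canned query chosen by name. Callers never supply code."""
    if name not in CANNED_QUERIES:
        raise ValueError(f"unknown query: {name!r}")
    predicate = CANNED_QUERIES[name]
    # A real service would add noise and charge the privacy budget here
    # before returning anything to the caller.
    return sum(1 for r in records if predicate(r, argument))


people = [
    {"age": 70, "region": "N"},
    {"age": 30, "region": "S"},
    {"age": 66, "region": "S"},
]
print(run_query(people, "count_age_at_least", 65))
```

The price of this restriction is flexibility: anything not in the registry simply cannot be asked.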

Deployment issues (e.g. who owns the data): Our prototype will just have PINQ and the simulated database running on the same machine, even though more general configurations are possible. We also explicitly don’t tackle whether the database is running on a CDP server or the organization that owns the data.

Open Source Ideological Purity: While it would be nice for CDP to be able to deploy on an open source platform, it is clear that serious issues might lie in wait for deploying on Mono (the open source C# environment). In that case, it is quite possible to switch to running PINQ on top of, say, Microsoft SQL Server.

In the mix

Wednesday, November 25th, 2009

Firefox’s Plan to Kick the Login’s Butt (ReadWriteWeb)

The Cost of Getting Sick: GE (GE)

Happy Thanksgiving, everyone!

Remixing Creative Commons licenses for personal information, Part II — What good would that do?

Wednesday, November 25th, 2009

The scenarios of data sharing I outlined in my first blog post may not sound too exciting to you.  So what if one person uploads a dataset on her blog, making it public, and then says it’s available for reuse?  How does that make the world a better place?

It’s possible that although personal information licenses, a la Creative Commons, wouldn’t solve all data-collection problems today, they could shape and shift the debate in several important ways:

1) Create a proactive way for people to take control of their information.

Right now, we as users generally are told, “Take it or leave it.”  We can agree with the terms of use that govern the use of our personal information, or not. A few companies are trying to offer more choices—Firefox has a “Private Browsing” option, Google offers some choices in what interests are tracked.  But a user almost never gets a choice in how his or her information is used once it’s collected.  A set of licenses could be a way to assert control instead of waiting for the choices to be offered.  As many privacy advocates have noted, it’s problematic that most privacy choices are offered as an opt-out rather than an opt-in.  A set of licenses would create a way to “opt-in” before being asked.  Even if the licenses turned out to be difficult to enforce, if the licenses became popular and widespread, it would be harder to ignore that people do have preferences that are not being considered or honored.

2) Create a grassroots way for people to actively share their information for causes they explicitly support.

Obama's Healthcare Stories for America

We’ve all seen campaigns that are organized around human-interest stories, true stories about real people that are meant to humanize a campaign and give it urgency.  The current healthcare debate, for example, inspired a host of organizations to ask people to “share their stories,” the Obama administration’s site being one of the best-organized ones.

It had the following “Submission Terms”:

By submitting your story, you agree that the story, along with any pictures or video you submit along with the story (the “Submission”), is non-confidential and may be freely used and disclosed, in whole or in part and in any manner or media, by or on behalf of Democratic National Committee (“DNC”) in support of health care reform.

You acknowledge that such use will be without acknowledgment or compensation to you.

You grant DNC a perpetual, irrevocable, sublicensable, royalty-free license to publish, reproduce, distribute, display, perform, adapt, create derivative works of and otherwise use the Submission.

Despite the all-or-nothing language, the Obama site was still able to solicit a great number of stories.  But the terms underscore a perennial problem for lesser-known organizations.  How do people trust an organization with their stories?

A more decentralized set of licenses could allow people to essentially tag their information across the internet and flag that it’s been provided in support of a specific cause, without giving their stories explicitly to another organization.  Individuals could also choose to tag their information in support of specific research projects.

The licenses could be an organizing tool, a way for organizations or people without established reputations to gather useful information without asking people to sign away the rights to their stories.  Or the licenses could be a research tool, enabling new forms of data collection.  Already, sociologists are exploring the possibilities of broadening research beyond the couple hundred subjects that can be managed through more traditional methods.  At Harvard, a graduate student in psychology created an iPhone application that allows research subjects in a study on happiness to rate their happiness in real time, rather than through recollection with an interviewer later.

Would the existence of standard licenses for sharing personal information make organizing around real stories easier?  Could it make personal information-based research easier?  Could it make people who support such causes or research, but are uncertain about existing privacy guarantees, more willing to try?  We think it’s certainly worth exploring.

3) Make sharing cool (and good).

Creative Commons is not without controversy, but almost everyone would agree that what the organization did manage to do was make sharing work cool.  The licenses created an easy way for people who shared the same view of intellectual property to band together and display their commitment.  They also made it easier to advertise and sell this ethos of IP to others.

We wonder if a set of licenses for sharing personal information might not be able to do the same.  We want to promote sharing information as a virtue, a civic act of generosity, and a way to enable all of us to have more information for decisions.  We want donating information to feel like donating blood.

4) Raise the bar on use of personal information in research, marketing, and other contexts.

It may seem like we’re encouraging less use and reuse of information by imagining a system where people put licenses on information they already make public (see screenshots from the first post).  But what the licenses would make clear, which is not clear now, is that there is a difference between something being put out for the public, for general use and enjoyment, and something being put out for someone else’s reuse, gain, and potential profit.  Those who use the license would be signaling clearly their willingness to make their information available for research and other public uses.

About a year ago, researchers at the Berkman Center for Internet & Society at Harvard released a dataset of Facebook profile information for an entire class of college students at “an anonymous, northeastern American university.”  As Michael Zimmer pointed out, however, the dataset was hardly “anonymous.”  He was quickly able to deduce that the university in question was Harvard.  Although some have argued that some of these profiles were already “public,” Zimmer argues (and we agree) that having a public profile does not equal consent to being a research subject:

This leads to the second point: just because users post information on Facebook doesn’t mean they intend for it to be scraped, aggregated, coded, dissected, and distributed. Creating a Facebook account and posting information on the social networking site is a decision made with the intent to engage in a social community, to connect with people, share ideas and thoughts, communicate, be human. Just because some of the profile information is publicly available (either consciously by the user, or due to a failure to adjust the default privacy settings), doesn’t mean there are no expectations of privacy with the data. This is contextual integrity 101.

By creating a license that allows people to clearly signal when they do consent to being “scraped, aggregated, coded, dissected, and distributed,” we would also make clearer that when people don’t clearly signal their consent, that consent cannot be assumed.

5) Ultimately create new scenarios in which licenses can be used.

So far, the scenarios I’ve outlined in which a license could be applied are where information is being displayed openly, as on a website.  But the licenses could eventually apply to more closed systems, where the individual’s decision to share data is not itself public.

CDP is working on building a datatrust, a new kind of institution and trusted entity to store sensitive, personal information and make it publicly accessible for research.  Individuals and institutions could choose to donate data to the datatrust, knowing that they are contributing to public knowledge on a range of issues.  CDP will likely use a system of licenses that allow each data donor to pre-determine his or her preferences on how their data is accessed rather than a single “terms of use” that applies to everyone, take it or leave it.

Similarly, if the licenses were to become popular, other organizations and companies that collect information from their members or account holders would be under pressure to offer these set choices or licenses when people sign up for accounts that require them to provide personal information.

Taxonomy of data

Thursday, November 19th, 2009

I haven’t yet posted Parts II and III of our series on the idea of creating Creative Commons-type sharing licenses for personal information, but Bruce Schneier posted today on a proposed taxonomy of data, and I thought it was worth sharing now.  Although the taxonomy he’s discussing is limited to social networking data, it’s a helpful way to understand why it’s so hard to come up with rules around personal information in general.

Here is his taxonomy on social networking data:

  1. Service data. Service data is the data you need to give to a social networking site in order to use it. It might include your legal name, your age, and your credit card number.
  2. Disclosed data. This is what you post on your own pages: blog entries, photographs, messages, comments, and so on.
  3. Entrusted data. This is what you post on other people’s pages. It’s basically the same stuff as disclosed data, but the difference is that you don’t have control over the data — someone else does.
  4. Incidental data. Incidental data is data the other people post about you. Again, it’s basically the same stuff as disclosed data, but the difference is that 1) you don’t have control over it, and 2) you didn’t create it in the first place.
  5. Behavioral data. This is data that the site collects about your habits by recording what you do and who you do it with.

As I noted in my first license blog post, our idea is focusing strictly on “disclosed data,” data an individual actively chooses to release.  It doesn’t address the messiness around how the other types of data are being used and reused, except in that we hope explicitly talking about individual preferences around “disclosed data” can help all of us understand what really matters to people (and what doesn’t) when they talk about the need for privacy around other forms of data.

Remixing Creative Commons licenses for personal information, Part I

Wednesday, November 18th, 2009

Creative Commons, in creating its licenses, did a very sexy thing.  It didn’t repeal the Sonny Bono Copyright Term Extension Act, it didn’t change technology.  Yet it managed to shift the social norm around intellectual property.  It’s now cool to share.  And they did this, not by forcing people to give up their rights, but by offering a set of choices by which those rights can be exercised in a way that encourages collaboration and ultimately benefits the public.

Imitation being the sincerest form of flattery, we at CDP have been playing around with the idea of creating personal information licenses, a la Creative Commons. Right now, we live in a paradoxical world where 1) people have little control over how their information is used and reused, and 2) lots of valuable, fascinating raw data is locked up because of the danger of violating privacy.  Big corporations get a lot of value out of their data-mining; researchers and regular individuals, not so much.  Modern privacy problems aren’t exactly analogous to modern intellectual property problems, but we think Creative Commons-type licenses could have a lot to offer in addressing these two issues.  We’re certainly not the first to think along these lines, but we want to add our voice to the ongoing discussion.

Over the next couple of posts, I’m going to lay out how such licenses might work, the scenarios in which people might choose to license their personal information, what such licenses could accomplish, and the challenges and obstacles such licenses would face.

What choices would the licenses offer?

Imagine a set of licenses with a specific, pre-determined set of choices.  Anyone who wants to signal their willingness to make their personal information available to the public could choose among these licenses and display their choice prominently, wherever their information is provided, whether it’s an online forum, a social network, or even a personal website or blog.

The choices could include the following:

A)   NOTIFICATION:

  1. First ask my permission before using the information.
  2. Tell me that you are going to use my information.
  3. I don’t care.

B)   COMMERCIAL/NON-COMMERCIAL USE:

  1. I’m okay with non-commercial academic use for research and/or publication.
  2. I’m okay with non-commercial governmental use.
  3. I’m okay with all uses.

C)   LEVEL OF PRIVACY:

  1. If I’ve provided any of this information, strip my information of classic identifiers (as enumerated, most likely, name, email address, etc.), though with no guarantee that this equals “anonymous.”
  2. If I have not provided any identifiers, do not try to re-identify me.
  3. [intermediary option of better anonymization, should the technology develop]
  4. I don’t care.

What kind of “personal information” could be licensed?

The license could be attached to any personal information the individual has gathered and displayed.  It could apply to:

Fertility Forum

Specifics of a medical condition, as shared on an online forum.

ashtonkutcher

An individual’s profile information on Facebook, MySpace, or other social networking site.

An individual’s personal website and/or blog.

As these examples make clear, we’re not talking about slapping a license on “all personal information” about a person in the abstract universe, but about placing a license on specific bits of data collected and displayed by an individual online.  A set of information, a dataset, even arguably a database.  It’s an open question, what might be “licensable,” what might even be worth licensing.

Which brings us to the question, is it worth licensing information that’s already out there, in public view?  Would a license end up restricting rather than enabling more information sharing?  Why would it be useful to license information in the above examples?

All good questions that I’m going to try to address in Posts II and III…

Geeks Go Shopping

Tuesday, November 17th, 2009

Web curator Jason Kottke shares the items his readers bought after clicking on Amazon links he posted.

Weird that Amazon makes this information available. But good, I suppose, that it’s anonymous.

Double weird: people are still buying VHS. And Amazon is still selling it.

From Star Wars to Jedi – Making of a Saga [VHS]

In the mix

Friday, November 6th, 2009

Cuil’s Famous Privacy Policy No Longer Protects Privacy (michaelzimmer.org)

Google’s Privacy Dashboard Doesn’t Tell Us Anything We Didn’t Know Before (ReadWriteWeb)

What have we been doing?

Monday, October 19th, 2009

We’ve been silent for a while on the blog, but that’s because we’ve been distracted by actual work building out the datatrust (both the technology and the organization).

Here’s a brief rundown of what we’re doing.

Grace is multi-tasking on 3 papers.

Personal Data License: We’re conducting a thought experiment to think through what the world might look like if there was an easy way for individuals to release personal information on their own terms.

Organizational Structures: We’ve conducted a brief survey of a few organizational structures we think are interesting models for the datatrust: “trusted” entities from banks to public libraries, and “member-based” organizations from credit unions to Wikipedia. We tried to answer the question: What institutional structures can be practical defenses against abuses of power as the datatrust becomes a significant repository of highly sensitive personal information?

Snapshot of Publicly Available Data Sources: A cursory overview of some of the more interesting data sets that are available to the public from government agencies, to answer the question: How is the datatrust going to be different / better than the myriad data sources we already have access to today?

We also now have 2 new contributors to CDP: Tony Gibbon and Grant Baillie.

A couple of months ago, Alex wrote about a new anonymization technology coming out of Microsoft Research: PINQ. It’s an elegant, simple solution, but perhaps not the most intuitive way for most people to think about guaranteeing privacy.

Tony is working on a demonstration of PINQ in action so that you and I can see how our privacy is protected and therefore believe *that* it works. Along the way, we’re figuring out what makes intuitive sense about the way PINQ works and what doesn’t and what we’ll need to extend so that researchers using the datatrust will be able to do their work in a way that makes sense.

Grant is working on a prototype of the datatrust itself which involves working out such issues as:

  • What data schemas will we support? We think this one to begin with: Star Schema.
  • How broadly do we support query structures?
  • Managing anonymizing noise levels.

To help us answer some of these questions, we’ve gathered a list of data sources we think we’d like to support in this first iteration. (e.g. IRS tax data, Census data) (More to come on that.)

We will be blogging about all of these projects in the coming week, so stay tuned!

What does it take to be an IAPP-certified privacy professional? What should it take?

Wednesday, September 9th, 2009

UPDATE: I recently was referred to this thoughtful blog post on a similar topic, “Nurturing an Accountable Privacy Profession.” Well worth a read.

A few weeks ago, I was very relieved to find out I had passed the IAPP exam to be a “Certified Information Privacy Professional” or CIPP.  I got this certificate and even a pin, which is more than I ever got for passing the bar exams of New York and California.

So what exactly did I need to know to become a CIPP?

To be certified in corporate privacy law, you’re expected to know what’s covered in the CIPP Body of Knowledge, primarily major U.S. privacy laws and regulations and “the legal requirements for the responsible transfer of sensitive personal data to/from the United States, the European Union and other jurisdictions.”

You’re also expected to pass the Certification Foundation, required for all three certifications offered by IAPP.  That covers basic privacy law, both in the U.S. and abroad, information security principles and practices, and “online privacy,” which includes an overview of the technologies used by online companies to collect information and the particular issues to be considered in this context.

So what do you think?  Should you be able to pass an all-objective, 180-question, three-hour exam (counting the CIPP and Certification Foundation exams together) on the above topics and be able to call yourself a “privacy professional”?

There are no sample questions available online, and I was too cheap to take a prep course, but if I remember correctly, a typical question on the exam went something like this:

The Gramm-Leach-Bliley Act authorizes financial institutions to share consumer information with third parties if:

a. The information is not personally identifiable.

b. The consumer is informed and given the opportunity to opt-out.

c.  Any information without notice if it is shared with affiliated companies.

d.  All of the above.

The answer would be “C,” since the consumer is only required to be given notice if the third party is “non-affiliated.”  My sample is poorly constructed, and there are also questions that require you to analyze a fact pattern, but essentially, the exam covers existing laws, practices, and technologies.

It doesn’t ever ask you, “What would you do if you were advising RealAge and they told you they wanted to sell answers from a health questionnaire to pharmaceutical companies?”  Or, “Is Facebook doing enough to prevent third parties from misusing images of Facebook members in their ads?”

IAPP presumably doesn’t ask you these questions because there’s no “objectively” right answer.  There may, one day, be an objectively legal answer, depending on if and when legislation gets passed.  Still, it’s obvious that in the field of privacy, the most interesting aspects are not what laws do exist, but what laws should exist, what practices should be used, what innovations, both technological and social, should be promoted to protect privacy in meaningful ways.  But the exam only covers what is, not what could be or what should be.

Privacy may be an ancient concept, but it’s a very modern, very new, very undefined profession, which perhaps is even more reason for the IAPP to exist.  We as a society, particularly in the U.S., are struggling to figure out what privacy means and what we need to do to protect it.  While the medical profession has the Hippocratic Oath dating back to the 4th century B.C., and the legal profession’s adherence to the concept of attorney-client privilege goes back at least as far as the 16th century, the privacy profession has no clear guiding principle.  We don’t know yet what it should be.

I’m not really criticizing the IAPP for having a test that doesn’t quite encompass the dynamic, constantly changing field of privacy.  It’s not like other professions do better.  The bar exam certainly doesn’t screen out incompetent, unethical people from practicing law, even if you are actually required to pass an ethics exam.  And the IAPP does provide resources to its members for tracking changes in privacy law and policy.  But I’m curious to see where the IAPP goes as it tries to “professionalize” the profession, whether the certification exam will change and what expectations will be set for IAPP-certified privacy professionals.  Perhaps in another 100 years, or hopefully sooner, we’ll have a code of conduct for privacy professionals.

In the mix

Wednesday, September 9th, 2009

OpenID Pilot Program to be Announced by U.S. Government (ReadWriteWeb)

Stimulus Funding Map is “Slick as Hell” (FlowingData)

Why Anonymized Data Isn’t (Slashdot)
