Archive for the ‘Protecting Privacy in Meaningful Ways’ Category

In the mix

Tuesday, January 19th, 2010

Unboxed — A Data Explosion Is Remaking Retailing (NYTimes)

Microsoft Cuts Bing IP Address Storage to 6 Months (CNET)

Starbucks Receipts Used for NYC Calorie Study (NYTimes)

Did the NYTimes Netflix Data Graphic Reveal the Netflix Preferences of Individual Users?

Tuesday, January 12th, 2010

Slate has an interesting slant on the New York Times graphic everyone’s been raving about — the most popular Netflix movies by zip code all over the country.  It really is great and fun to play with, but as Slate points out, some of the zip codes with rather anomalous lists may be pointing to individual users.  For example, 11317 has this top-ten list:

  1. Wall-E
  2. Indiana Jones and the Temple of Doom
  3. Oz: Season 3: Disc 1
  4. Watchmen
  5. The Midnight Meat Train
  6. Man, Woman, and the Wall
  7. Traffic
  8. Romancing the Stone
  9. Crocodile Dundee 2
  10. Godzilla’s Revenge

11317 is the zip code for LaGuardia Airport, which doesn’t have any residents.  That means this list may very well represent the Netflix renting habit of a small group or even a single subscriber who has his or her DVDs mailed there.

Slate finds some other zip codes that may represent a single subscriber, but doesn’t point out the privacy problem here, despite the fact that Netflix is already in hot water about its data releases.

We’ve said a lot about what “anonymization” means and what a privacy guarantee should include, so I won’t say more here.  Instead, I just want to point out that the Slate article helps illustrate the problem PINQ is trying to avoid.  As Tony points out in his post, PINQ won’t give you answers that would be changed by the presence of a single record.  Of course, because PINQ gives aggregate answers, you wouldn’t be asking questions phrased exactly as, “What are the top ten most popular Netflix movies for 11317?”  But if you tried to ask, “How many people in 11317 had viewed “The Midnight Meat Train?”, it would add sufficient noise that you would never know that the single person using LaGuardia airport as an address had viewed it.

PINQ Privacy Demo

Thursday, January 7th, 2010

Editor’s Note: Tony Gibbon is developing a datatrust demo as an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Tony’s work, like Grant’s, could have interesting implications for CDP’s mission, as it would use technologies that could enable more disclosure of personal data for public re-use.  We’re happy to have him guest blogging about the demo here.

Back in August, Alex wrote about the PINQ privacy technology and noted that we would be trying to figure out what role it could play in the datatrust.  The goal was to build a demo of PINQ in action and get a better understanding of PINQ and its challenges and quirks in the process.  We settled on a quick-and-dirty interactive demo to try to demonstrate the answers to the following.

What does PINQ bring to the table?

Before we look at the benefits of PINQ, let’s first take a look at the shortcomings of one of the ways data is often released with an example taken from the CDC website.

This probably isn’t the best example of a compelling dataset, but it is a good example of the lack of flexibility of many datasets that are available—namely that the data is pre-bucketed and there is a limit to how far you are able to drill down on the data.

On one hand, the limitation makes sense:  If the CDC allowed you (or your prospective insurance company) to view disease information at street level, the potential consequences are quite frightening.  On the other hand, they are also potentially limiting the value of the data.  For example, each county is not necessarily homogenous.  Depending on the dataset, a researcher may legitimately wish to drill down without wanting to invade anyone’s privacy—for example to compare urban vs. suburban incidence.

This is where PINQ shines—it works in both these cases.  PINQ allows you to execute an arbitrary aggregate query (meaning I can ask how many people are wearing pink, but I can’t ask PINQ to list the names of people wearing pink) while still protecting privacy.

Let’s turn to the demo.  (Note: the data points in the demo were generated randomly and do not actually indicate people or residences, much less anything about their health.)  The quickest, most visual arbitrary query we came up with is drawing a rectangle on a map and counting each data point that falls inside, so we placed hundreds of “sick” people on a map to let users count them.  (Keep in mind that the arbitrariness of a PINQ query need not be limited to location on a map.  It could be numerical like age, textual like name, include multiple fields etc.)

Now let’s attempt to answer the researcher’s question.  Is there a higher incidence of this mysterious disease in urban or suburban areas?  For the sake of simplicity, we’ll pretend he’s particularly interested in two similarly populated, conveniently rectangular areas: one in Seattle and the other in a nearby suburb as shown below:

An arbitrary query such as this one is clearly not possible with data that is pre-bucketed such as the diabetes by county.  Let’s take a look at what PINQ spits out.

We get an “answer” and a likely range.  (The likely range is actually an input to the query, but that’s a topic for another post.)  So what does this mean? Are there really 311.3 people in Seattle with the mysterious disease?  Why are there partial people?

PINQ adds a random amount of noise to each answer, which prevents us from being able to measure the impact of a single record in the dataset.  The PINQ answer indicates that about 311 people (plus or minus noise) in Seattle have the disease.  The noise, though randomly generated, is likely to fall within a particular range, in this case 30.  So the actual number is likely to be within 30 of 311, while the actual number of those in the nearby suburb with the disease is likely to be within 30 of 177.

Given these numbers (and ignoring the oversimplification and silliness of his question), the researcher could conclude that the incidence in the urban area is higher than the suburban area.  As a bonus, since this is a demo and no one’s privacy is at stake, we can look at the actual data and real numbers:

The answers from PINQ were in fact pretty close to the real answer.  We got a little unlucky with the Seattle answer as the actual random noise for that query was slightly greater than the likely range, but our conclusion was the same as if we had been given the real data.

But what about the evil insurance company/ employer/ neighbor?

By now, you’re hopefully starting to see potential value of allowing people to execute arbitrary queries rather than relying on pre-bucketed data, but what about the potential harm?  Let’s imagine there’s a high correlation between having this disease and having high medical costs.  While you might want your data included in this dataset so it could be studied by someone researching a cure, you probably don’t want it used to discriminate against you.

To examine this further, let’s zoom in and ask about the disease at my house.  PINQ only allows questions with aggregate answers, so instead of asking “does Tony have the disease?” we’ll ask, “how many people at Tony’s house have the disease?”

You’ll notice, unlike the CDC map, PINQ doesn’t try to stop me from asking this potentially harmful, privacy-infringing question.  (I don’t actually live there.)  PINQ doesn’t care if the actual answer is big or small, or if I ask about a large or small area, it just adds enough noise to ensure the presence or absence of a single record (in this case person) doesn’t have an effect on your answers.

PINQ’s answer was “about 2.4, with likely noise within  +/- 5”  (I dialed down the likely noise to +/-5 for this example).  As with all PINQ answers, we have to interpret this answer in the context of my initial question: “Does Tony have the disease?”  Since the noise added is likely to be within 5 and -5, the real answer is likely to be between 0 and 7, inclusive, and we can’t draw any strong conclusions about my health because the noise overwhelms the real answer.

Another way of looking at this is that we get similarly inconclusive answers when we try to attack the privacy of both the infected and the healthy.  Below I’ve made the diseased areas visible on the map and we can compare the results of querying me and my neighbor, only one of whom is infected:

Keep in mind that my address may not be in the dataset because I’m healthy or because I chose not to submit my information.  In either case, the noise causes the answer at my house to be indistinguishable from the answer at my neighbor’s address, and our decisions to be included or excluded from the dataset do not affect our privacy.  Of equal importance from the first example, the addition of this privacy preserving noise does not preclude the extraction of potentially useful answers from the dataset.

You can play with the demo here (requires Silverlight).

What does a privacy guarantee mean to you? Harm v. Obscurity

Friday, December 18th, 2009


Left: Senator Joseph McCarthy. Right: The band, Kajagoogoo.

At the Symposium in November we spent quite a bit of time trying to wrap our collective brain around the PINQ privacy technology, what it actually guarantees and how it does so.

I’ve attempted to condense our several hours of discussion into series of blog posts.

We began our discussion with the question: What does a privacy guarantee mean to you?

There was a range of answers to this question. They all boiled down to one of the following two:

  1. Nothing bad will come of this information I give you: It won’t be used against me (discrimination, fraud investigations, psychological warfare). It won’t be used to harass me (spam).
  2. The absence or presence of a single record cannot be discerned.

Let’s just say definition 1 is the “layperson” definition, which is more focused on the consequences of giving up personal data.

And definition 2 is the “technologist’” definition, which is more focused on the mechanism behind how to actually fulfill the layperson’s guarantee in a meaningful, calculable way.

Q. What does PINQ guarantee?

Some context: PINQ is a layer of code that sits between data and anyone trying to ask questions of that data that guarantees privacy in a measurable way to the individuals represented in the data.

The privacy PINQ guarantees is broader than the layperson’s understanding of privacy. Not only does PINQ guard against re-identification, targeting, and in short, any kind of harm resulting from exposing your data, it prevents any and all things in the universe from changing as a direct result of your individual data contribution.

Sounds like cosmic wizardry. Not really, it’s simply a clever bit of armchair thinking.

If you want to guarantee that nothing in the world will change as a result of someone contributing their data to a data set, then you simply need to make sure that no one asking questions of that data set will get answers that are discernibly affected by the presence or absence of any one person.

Therefore, if you define privacy guarantee as “the absence or presence of a single record cannot be discerned,” meaning the inclusion of your data in a data set will have no discernible impact on the answers people get out of that data set, you also end up guaranteeing that nothing bad can ever happen to you if you contribute your data because in fact, absolutely nothing (good or bad) will happen to anyone as a direct result of you contributing your data, because with PINQ as the gatekeeper, your particular data record might as well not be there!

What is the practical fallout of such a guarantee?

Not only will you not be targeted to receive SPAM as a result of contributing your data to a dataset, no one else will be targeted to receive SPAM as a result of you contributing your data to a data set.

Not only will you not be discriminated against by future employers, insurance companies as a result of contributing your data to a dataset, no one else will be discriminated against as a result of contributing your data to a dataset.

Does this mean that my data doesn’t matter? Why then would I bother to contribute?

Now is a good time to point out that PINQ’s privacy guarantee is expansive, but in a very specific way. Nothing in the universe will change as a result of any one person’s data. However, the aggregate effect of everyone’s data will absolutely make a difference. It’s the same logic behind avoiding life’s little vices like telling white lies, littering or chewing gum in class. One person littering isn’t such a big deal. But what if everyone littered?

Still, is such an expansive privacy guarantee necessary?

It turns out, it’s incredibly hard to narrow a privacy guarantee to prevent just “harm,” because harm is a subjective concept with cultural and social overtones. One person’s spam is another’s helpful notification.

All privacy guarantees today which largely focus on prevent harm are by in large based on “good intentions” and “theoretical best practices,” not technical mechanisms that are measurable and “provable.”

However, change, as a function of how much any one person’s data is discernibly affecting answers to questions asked of a data set is readily measurable.

Up to this point, we’ve been engaged in a simple thought experiment that requires nothing more than a few turns of logic. How exactly PINQ keeps track of whether the absence or presence of a single record is discernible in the answers it gives out and the extent to which it’s discernible is a different matter and requires actual “innovation” and “technology.” Stay tuned for more on that.

Wow, new privacy features!

Friday, December 11th, 2009

Wow, so many companies rolling out new privacy features lately!

Facebook rolled out its new “simplified” privacy settingsGoogle introduced Google Dashboard, a central location from which to manage your profile data, which supplements Google Ads Preferences.  And Yahoo released a beta version of the Ad Interest Manager.

Many, many people have reviewed Facebook’s new changes, and pointed out some of the “bait-and-switch” Facebook has done for some new, and I think better, controls.  I don’t have much more to say about that.

But it’s interesting to me that Google and Yahoo have chosen similar strategies around privacy issues, though with some differences in execution.  Both companies haven’t actually changed their data collection practices, and cynics have argued that they’re both just trying to stave off government regulation.  Still, I think that it makes a difference when companies actually make clear and visible what they are doing with user data.

“Is this everything?”

Both Google and Yahoo indicate in different ways that the user who is looking at Dashboard or Ad Interest Manager is not getting the full data story.

Google’s Dashboard is supposed to be a central place where a user can manage his or her own data.  In and of itself, it’s not that exciting.  As ReadWriteWeb put it, it doesn’t tell you anything you didn’t know before.  It provides links in one place to the privacy settings for various applications, but it focuses on profile information the user provides, which represents only a tiny bit of the personal information Google is tracking.

Google does, however, provide a link to the question, “Is this everything?” that describes some of their browser-based data collection and a link to the Ads Preferences Manager page.  To me, it feels a little shifty, that the Dashboard promises to be a place for you to control “data that is personally associated with you,” but it doesn’t reveal until you scroll to the bottom that this might not be everything.  Others may feel differently, but this to me goes right at the heart of the problem of how “personal information” is defined.  When I go to the Ads Preferences Manager, I see clearly that Google has associated all kinds of interests with me–how is this not “personally associated” with me?  Google states it’s not linking this data to my personal account data which is why they haven’t put it all in one place, which is good, but it seems too convenient a reason to silo that off.

Yahoo’s strategy is a little different.  It may not be fair to compare Yahoo’s Ad Interest Manager to Google’s Dashboard at this point, given that it’s in such a rudimentary phase.  It’s in beta and doesn’t work yet with all browsers.  (As David Courtney points out in PCWorld, being in beta is a pretty sorry excuse for the fact that it doesn’t work with IE8 and Firefox.)  Depending on how much you use Yahoo, you may not see anything about yourself.

Still, I thought it was interesting that Yahoo highlighted some of the hairy parts of its privacy policy in separate boxes high up on the page.  Starting from the top, Yahoo states clearly in separate boxes with bold headings that there are ways in which your data is collected and analyzed that are not addressed in this Ad Interest Manager.  The box for the Network Advertising Initiative is a little weak; it doesn’t really explain what it means that Yahoo is connected to the NAI.  But the box on “other inputs,” shows prominently that even as you manage your settings on this page, there may be other sources of data Yahoo is using to find out more about you.

zoombox

Yahoo also reveals that the information they’re tracking from you is collected from a wide range of sources, including both Yahoo account information like Mail and non-account websites like its Front Page.  Unlike Google, Yahoo doesn’t ask you to click around to find out that some of “everything” is elsewhere.

zoombox2

Turning “interests” on and off

Google and Yahoo are very similar here.  Google’s Ad Preferences Manager indicates which interests have been associated with you with a clear link to how they can be removed, with a button for opting out from tracking altogether.

Googleopt

Yahoo’s Ad Interest Manager has a different design, but the button for opting out altogether is similarly visible.

Yahooopt

We’re using cookies!

Compared to the other issues, this is the most obvious difference between Google and Yahoo.

Google has this on its Ads Preferences Manager:

Googlecookie

So you can see that some string of numbers and letters has somehow been attached to your computer, but you’re not told what this means in terms of what Google knows about you.

In contrast, Yahoo shows this at the bottom of the Ad Interest Manager:

YahooAd3

Yahoo knows I’m a woman!  Between 26 and 35!  The location is actually wrong, as I am in Brooklyn, NY, but I did live in San Francisco 5 years ago when I first signed up for a Yahoo account.  Still, Yahoo is very explicitly showing, and not just telling, that it knows geographical information, age, gender, and the make and operating system of your computer.  I’m impressed—they must know this is going to scare some people.

Does any of this even matter?

I prefer the Yahoo design in many ways — the boxes and verticality of the manager to me are easier to read and understand than the horizontal spareness of the Google design.  But in the end, the design differences between Google and Yahoo’s new privacy tools may not even matter.  I don’t know how many people will actually see either Manager.  You still have to be curious enough about privacy to click on “Privacy Policy,” which takes you to Yahoo! Privacy, at which point, in the top right-hand corner, you see a link to “Opt-out” of interest-based advertising.  The same is true with Google. And neither company has actually changed much about their data collection practices.  They’re just being more open about them.

But I am impressed and heartened that both companies have started to reveal more about what they’re tracking and in ways that are more visually understandable than a long, boring, legalistic privacy policy.  I hope Yahoo is feeling competitive with Google on privacy issues and vice-versa.  I’d love to see a race to the top.

Datatrust Prototype

Tuesday, December 8th, 2009

Editor’s Note: Grant Baillie is developing a datatrust prototype as an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Grant’s work could have interesting implications for CDP’s mission, as it would use technologies that could enable more disclosure of personal data for public re-use.  We’re glad to have him guest-blogging about the prototype on our blog, and we’re looking forward to hearing more as he moves forward.

This post is mostly the contents of a short talk I gave at the CDP Symposium last month. In a way, it was a little like the qualifying oral you have to give in some Ph.D. programs, where you stand up in front of a few professors who know way more than you do, and tell them what you think your research project is going to be.

That is the point we’re at with the Datatrust Prototype: We are ready to move forward with an actual, real, project. This proposal is the result of discussions Mimi, Alex and I have been having over the course of the past couple month or so, with questions and insights on PINQ thrown in by Tony, and answers to some of those questions from Frank.

The talk can be broken up into three tersely titled sections:

Why? Basic motivation for the project.

What? What exactly the prototype will be.

Not Potential features of a Datatrust that are out of scope for this first prototype.

Why?

We need a (concrete) thing

Partly this is to have something to demo to future clients/partner organizations, of course. However, we also need the beginnings of a real datatrust so that different people’s abstract concepts of what a datatrust is can begin to converge.

We need understanding of some technical issues

1. Understanding Privacy: People have been looking at this problem (i.e. releasing aggregated data in such a way that individuals’ privacy isn’t compromised) for over 40 years. After some well-publicized disasters involving ad-hoc approaches (e.g. “Just add some noise to the data”, or “remove all identifying data like names or social security numbers”) a bunch of researchers (like our friends at research.microsoft.com) came up with a mathematical model where there is a measure of privacy, ε (epsilon).

In the model, there’s a clear message: you have to pay (in privacy) if you want greater accuracy (i.e. less noise) in your answers. In particular, the PINQ C# API can
calculate the privacy cost ε of each query on a dataset. So, one can imagine having different users of a datatrust using up allocations of privacy they have been assigned, hence the term “Privacy Budget”. (Frank dislikes this term because there are many privacy strategies possible other than a simple, fixed budget). In any case, by creating the prototype, we are hoping to gain an intuitive understanding of this mathematical concept of privacy, as well as obtain insight on more practical matters like how to allocate privacy so that the datatrust is still useful.

One way of understanding privacy is to think of it as a probability, (i.e. of leaking data) or measure of risk. You could even imagine an organization buying insurance against loss of individual data, based on the mathematical bounds supplied by PINQ. The downside of this approach is that we humans don’t seem to have a good intuitive grasp of things like probability and risk (writ large, for example, in the financial meltdown last year).

Another approach that might be helpful is to notice that privacy behaves in the same way as a currency (for example, it is additive). Here, you can imagine people earning or trading currency, for example. With actual money, we have a couple of thousands of years worth of experience built into evaluations like a house being worth a million Snickers bars: How long will it take us to have similar intuition with a privacy currency?

2. PINQ vs SQL: Here, by “SQL” I’m talking of traditional persistent data storage mechanisms in general. In most specific cases we are talking about SQL-based databases (although in the data analysis world there are other possibilities, like SAS).

  • SQL has been around for over 35 years, and is based on a mathematical model of its own. It basically provides a set of building blocks for querying and updating a database.
  • PINQ is a wrapper that basically protects the privacy of an underlying SQL database. It allows you to run SQL-like queries to get result sets, but then only lets you see statistical information about these sets. Even this information will come at some privacy cost, depending on how accurate you want the answer to be. PINQ will add random noise to any answer it gives you; if you want to ask for more accurate answers, i.e. less noise added (on average), you have to pay more privacy currency.
Note: At this point in the talk, I went in to a detailed discussion of a particular case of how PINQ privacy protection is supposed to work. However, I’m going to defer this to an upcoming post.

PINQ provides building blocks that are similar to SQL’s, with the caveat that the only data you can get out is aggregated (i.e. numbers, averages, and other statistical information). Also, some SQL operations cannot be supported by PINQ because they cannot be privacy protected at all.

In any case, both PINQ and SQL support an infinite number of questions, since you can ask about arbitrary functions of the input records. However, because they have somewhat different query building blocks, it is at least theoretically possible that there are real-world data analyses that cannot be replicated exactly in PINQ, or can only be done in a cumbersome or (privacy) expensive way. So, it will be good to focus on more concrete uses cases, in order to see whether this is the case or not.

3. Efficent Queries: It’s not uncommon for database-based software projects to grind to a halt at some point when it becomes clear that the database isn’t handling the full data set as well as is needed. Then various experts are rushed in to tune and rewrite the queries so that they perform better. In the case of PINQ, there is an additional measure of query performance, that of privacy used. Frank’s PINQ tutorial already has one example of a query that can be tuned to use privacy budget more efficiently. Hopefully, by working through specific use cases, CDP can start building expertise in query optimization.

What?

Target: A researcher working for a single organization. We’re going to imagine that some organization has a dataset containing personal information, but they want to be able to do some data analysis and release statistical aggregates to the public without compromising any individual’s privacy.

A Mockup of a Rich Dataset: Hopefully, I’ve given enough incentive for why we want a reasonably “real-world” structure to our data. I’m proposing that we choose a subset of the schema available as part of the National Health and Nutrition Examination Survey (NHANES):

NHANES Survey Content Cover Page

This certainly satisfies the “rich” requirement: NHANES combines an interesting and extensive mix of medical and sociological information (The above cover page image comes from the description of the data collected, a 12-page PDF file). Clearly, we wouldn’t want to mock up the entire dataset, but even a small subset should make for some reasonably complex analyses.

Queries: We will supply at least a canned set of queries over the sample data. A scenario I have in mind is being able to have something like Tony’s demo, but with a more complex data set. A core requirement of the prototype is to be able to reproduce the published aggregations done with the real NHANES dataset. Some kind of geographical layout, like the demo, would be compelling, too.

Not

Account management: This includes issues of tracking privacy allocation and expenditures on a per-user basis, possibly having some measure of trust to allow this. There may be some infrastructure for different users in the prototype, but for the most part we’ll be assuming a single, global user.

Collaborative queries: In the future, we could imagine having users contribute to a library of well-known queries for a given data set. The problem with public access like this is that it basically means that all privacy budget is effectively shared, since query results are shared, so for this first cut at the problem we are not going to tackle this.

Multiple Datasets, Updates: For now, we will assume a single data set, with no updates. (The former can raise security concerns, especially if data sets aren’t co-hosted, while the latter is an area where I’m not sure what the mathematical contraints are).

Sneaky code (though maybe we have a service): There is a known issue at the moment with having PINQ executing arbitrary C# code to do queries. At the moment, it is possible to have your code save all the records it sees to a file on disk. We may work around this by having the datatrust be a service (i.e. effectively restricting the allowed queries so no user-supplied code is run).

Deployment issues (e.g. who owns the data): Our prototype will just have PINQ and the simulated database running on the same machine, even though more general configurations are possible. We also explicitly don’t tackle whether the database is running on a CDP server or the organization that owns the data.

Open Source Ideological Purity: While it would be nice for CDP to be able to deploy on an open source platform, it is clear that serious issues might lie in wait for deploying on Mono (the open source C# environment). In that case, it is quite possible to switch to running PINQ on top of, say, Microsoft SQL Server.

Creative Commons-style licenses for personal information, Part III: What are the challenges?

Monday, November 30th, 2009

In the first two posts, I described how personal information licenses might work and why they might be useful in shifting the debate around how personal information is collected and used. Sharing information could be cool!  People could exercise choices!  Companies could be pressured to offer similar choices!

Unfortunately, it wouldn’t be that easy.  There are certainly obstacles and challenges to creating a system of personal information licenses for common use, which I describe below.

1) Personal information isn’t property—why do you want to propertize it?

The short answer is, we don’t. We’re well aware that there is a history of academic debate on this issue, pro and con around whether making personal information personal property would make it easier to protect individual privacy.  Although the issues are certainly interesting, we don’t want to step into that debate and we don’t think we have to for the licenses we’re imagining.

First, let’s examine how personal information is viewed today.  I can’t own a fact.  I can’t own the fact that I’m 32, but I can have copyright in an essay in which I state I am 32 and I can have copyright in a database that includes the fact I am 32 if I’m creative in building the structure of that database (in the U.S.).  We can understand the reasoning behind this.  We want to live in a world where facts are “free” to be used and reused without any need to pay a licensing fee.

But the simple declaration, “You can’t own a fact” doesn’t begin to describe the many ways in which people are collecting data, selling it, renting it, and otherwise making money off of it.  When a company sells a mailing list, it may not “own” the fact that I live at XYZ Avenue in Brooklyn, but it certainly is using it to its advantage.  Why, then, should the fact that I can’t own the fact of where I live keep me from sharing that data as I like and trying to control it in new ways?

The digital revolution is forcing us to think beyond property/not property.  Facts have become valuable even when they’re not technically “owned” by anyone.  I haven’t come up with some snappy new terms to use, but the issue should no longer be defined solely around “property/not property.”

BlueKai

Some new businesses seem to be working off this model. Blue Kai and KindClicks, while collecting personal information for market research, provide individuals with a way of stating their preferences and monetizing their data.  KindClicks, for example, allows individuals who contribute data to then donate the money they make off their data to the charity of their choice.  BlueKai collects data through cookies, but provides a link on their site by which users can see what information has been gathered about them. Those who want to opt out can.  Those that choose to participate in BlueKai’s registry can then choose to donate a portion of their “earnings” to charity.

We don’t actually want to model these companies’ ways of valuing data.  CDP’s mission isn’t to make sure that everyone gets a dollar here or a dollar there every time their data is accessed.  To us, the value of such information is immense to the public and yet not easily measurable in dollars.  But we do want to explore the idea that we could just take control of our data and obtain value from it, even if it’s the non-monetary, social value of providing something useful to the public.

2) How would these licenses be enforceable? What about existing terms of use on online forums, social networks?

This is a big question.  I’m not sure what kind of dataset could be licensable and the extent to which a license could cover facts within that dataset.  Could the license really encourage new forms of sharing if there was no way to prevent people from using individual facts within that dataset outside of the license terms?  How useful would such a license be?

Arguably, Creative Commons licenses are not easily enforced.  They certainly have an easier case for arguing that they are enforceable; there has been one case I know of where a court upheld the terms of the license.  But the vast majority of people using photographs, art, and other work outside the terms of the license do it without impunity.  Most CC license holders never find out their Flickr photo was used outside the license terms, and most wouldn’t have the resources to do anything about it even if they did find out.  Yet CC licenses have still managed to impact societal norms on intellectual property.

Personal information licenses may still have an effect, then, on societal norms about how information is collected and shared regardless of how much the licenses are litigated.  Even the process of litigation may help us as a society have a smarter conversation about current practices.

As to the objection that the licenses wouldn’t work in the face of existing terms of use for social networks and other sites — the fact that I might not be able to “license” my own information that I put myself on Facebook just underscores why creative, proactive, even aggressive strategies might be necessary.

3) Why would you encourage people to put their personal information out in public?  Isn’t it irresponsible to encourage people to provide information that could increase risk of identity theft and fraud?

I don’t want to dismiss this concern off-hand.  But as my father likes to say, everything in life has good and bad.  There are many things we do that are risky, and we try our best to minimize those risks, both as a society when we pass laws and as individuals when we take more particular, personal measures.  Driving is a very dangerous activity.  It is also a very valuable one.  Many governments have decided to legislatively require the wearing of seatbelts.  Many of us personally make the decision to practice other safe driving techniques that aren’t legally required.

We think it’s imperative that we, as a society, think hard and carefully about how to minimize the risks of personal information being used, collected, and exchanged.  Creative Commons-style licenses for personal information sharing may or may not be the best way to address today’s privacy problems.  I’m curious to hear if you think the risks outweigh the benefits and why.  But to shut down the idea solely because the risk exists — that is not going to help push the conversation forward.

CONCLUSION

Licenses for making personal information more widely available for research and public use—would they work?  Maybe, maybe not.  Worth exploring?  Most definitely.

We’d love to hear your questions and comments.

Remixing Creative Commons licenses for personal information, Part II — What good would that do?

Wednesday, November 25th, 2009

The scenarios of data sharing I outlined in my first blog post may not sound too exciting to you.  So what if one person uploads a dataset on her blog, making it public, and then says it’s available for reuse?  How does that make the world a better place?

It’s possible that although personal information licenses, a la Creative Commons, wouldn’t solve all data-collection problems today, it could shape and shift the debate in several important ways:

1) Create a proactive way for people to take control of their information.

Right now, we as users generally are told, “Take it or leave it.”  We can agree with the terms of use that govern the use of our personal information, or not. A few companies are trying to offer more choices—Firefox has a “Private Browsing” option, Google offers some choices in what interests are tracked.  But a user almost never gets a choice in how his or her information is used once it’s collected.  A set of licenses could be a way to assert control instead of waiting for the choices to be offered.  As many privacy advocates have noted, it’s problematic that most privacy choices are offered as an opt-out rather than an opt-in.  A set of licenses would create a way to “opt-in” before being asked.  Even if the licenses turned out to be difficult to enforce, if the licenses became popular and widespread, it would be harder to ignore that people do have preferences that are not being considered or honored.

2) Create a grassroots way for people to actively share their information for causes they explicitly support.

Obama's Healthcare Stories for America

We’ve all seen campaigns that are organized around human-interest stories, true stories about real people that are meant to humanize a campaign and give it urgency.  The current healthcare debate, for example, inspired a host of organizations to ask people to “share their stories,” the Obama administration’s site being one of the best-organized ones.

It had the following “Submission Terms“:

submission terms

By submitting your story, you agree that the story, along with any pictures or video you submit along with the story (the “Submission”), is non-confidential and may be freely used and disclosed, in whole or in part and in any manner or media, by or on behalf of Democratic National Committee (“DNC”) in support of health care reform.

You acknowledge that such use will be without acknowledgment or compensation to you.

You grant DNC a perpetual, irrevocable, sublicensable, royalty-free license to publish, reproduce, distribute, display, perform, adapt, create derivative works of and otherwise use the Submission.

Despite the all-or-nothing language, the Obama site was still able to solicit a great number of stories.  But the terms underscore a perennial problem for lesser-known organizations.  How do people trust an organization with their stories?

A more decentralized set of licenses could allow people to essentially tag their information across the internet and flag that it’s been provided in support of a specific cause, without giving their stories explicitly to another organization.  Individuals could also choose to tag their information in support of specific research projects.

The licenses could be an organizing tool, a way for organizations or people without established reputations to gather useful information without asking people to sign away the rights to their stories.  Or the licenses could be a research tool, enabling new forms of data collection.  Already, sociologists are exploring the possibilities of broadening research beyond the couple hundred subjects that can be managed through more traditional methods.  At Harvard, a graduate student in psychology created an iPhone application that allows research subjects in a study on happiness to rate their happiness in real time, rather than through recollection with an interviewer later.

Would the existence of standard licenses for sharing personal information make organizing around real stories easier?  Could it make personal information-based research easier?  Could it encourage people who support such causes or research but are uncertain about existing privacy guarantees more willing to try?  We think it’s certainly worth exploring.

3) Make sharing cool (and good).

WhyIGivebutton

Creative Commons is not without controversy, but almost everyone would agree, what the organization did manage to do was making sharing work cool.  The licenses created an easy way for people who shared the same view of intellectual property to band together and display their commitment.  They also made it easier to advertise and sell this ethos of IP to others.

We wonder if a set of licenses for sharing personal information might not be able to do the same.  We want to promote sharing information as a virtue, a civic act of generosity, and a way to enable all of us to have more information for decisions.  We want donating information to feel like donating blood.

4) Raise the bar on use of personal information in research, marketing, and other contexts.

It may seem like we’re encouraging less use and reuse of information by imagining a system where people put licenses on information they already make public (see screenshots from the first post.)  But what the licenses would make clear, which is not clear now, is that there is a difference between something being put out for the public, for general use and enjoyment, and something being put out for someone else’s reuse, gain, and potential profit.  Those who use the license would be signaling clearly their willingness to make their information available for research and other public uses.

About a year ago, researchers at the Berman Center for the Internet and Society at Harvard released a dataset of Facebook profile information for an entire class of college students at an “an anonymous, northeastern American university.”  As Michael Zimmer pointed out, however, the dataset was hardly “anonymous.”  He was quickly able to deduce that the university in question was Harvard.  Although some have argued that some of these profiles were already “public,” Zimmer argues (and we agree) that having a public profile does not equal consent to being a research subject:

This leads to the second point: just because users post information on Facebook doesn’t mean they intend for it to be scraped, aggregated, coded, disected, and distributed. Creating a Facebook account and posting information on the social networking site is a decision made with the intent to engage in a social community, to connect with people, share ideas and thoughts, communicate, be human. Just because some of the profile information is publicly avaiable (either consciously by the user, or due to a failure to adjust the default privacy settings), doesn’t mean there are no expectations of privacy with the data. This is contextual integrity 101.

By creating a license that allows people to clearly signal when they do consent to being “scraped, aggregated, coded, dissected, and distributed,” we would also make clearer that when people don’t clearly signal their consent, that consent cannot be assumed.

5) Ultimately create new scenarios in which licenses can be used.

So far, the scenarios I’ve outlined in which a license could be applied are where information is being displayed openly, as on a website.  But the licenses could eventually apply to more closed systems, where the individual’s decision to share data is not itself public.

CDP is working on building a datatrust, a new kind of institution and trusted entity to store sensitive, personal information and make it publicly accessible for research.  Individuals and institutions could choose to donate data to the datatrust, knowing that they are contributing to public knowledge on a range of issues.  CDP will likely use a system of licenses that allow each data donor to pre-determine his or her preferences on how their data is accessed rather than a single “terms of use” tha applies to everyone, take it or leave it.

Similarly, if the licenses were to become popular, other organizations and companies that collect information from their members or account holders would be under pressure to offer these set choices or licenses when people sign up for accounts that require them to provide personal information.

Watching the Electrons

Monday, November 23rd, 2009

A warning about electrical outlets that has nothing to do with bathtubs.

This from Ontario’s Information and Privacy Commissioner, who has been studying the implications of “smart grid” technology that will enable utilities to micro-monitor usage, with the aim of more efficient electric delivery:

Intimate details of hydro customers’ habits, from when they cook or take showers, to when they go to bed, plus such security issues as whether they have an alarm system engaged, could all be discerned by the data automatically fed by appliances and other devices to the companies providing electric power.

Why does it matter here in the USA? Well, the economic stimulus enacted earlier this year contains at least $4.5 billion specifically for smart grid projects, with the aim of kick-starting even greater private investments.

As in so many other areas (medicine, taste in movies), the tech that promises better, “smarter” outcomes also demands more detailed information.  So what’s our plan for that information?

Taxonomy of data

Thursday, November 19th, 2009

I haven’t yet posted Parts II and III of our series on the idea of creating Creative Commons-type sharing licenses for personal information, but Bruce Schneier posted today on a proposed taxonomy of data, and I thought it was worth sharing now.  Although the taxonomy he’s discussing is limited to social networking data, it’s a helpful way to understand why it’s so hard to come up with rules around personal information in general.

Here is his taxonomy on social networking data:

  1. Service data. Service data is the data you need to give to a social networking site in order to use it. It might include your legal name, your age, and your credit card number.
  2. Disclosed data. This is what you post on your own pages: blog entries, photographs, messages, comments, and so on.
  3. Entrusted data. This is what you post on other people’s pages. It’s basically the same stuff as disclosed data, but the difference is that you don’t have control over the data — someone else does.
  4. Incidental data. Incidental data is data the other people post about you. Again, it’s basically same same stuff as disclosed data, but the difference is that 1) you don’t have control over it, and 2) you didn’t create it in the first place.
  5. Behavioral data. This is data that the site collects about your habits by recording what you do and who you do it with.

As I noted in my first license blog post, our idea is focusing strictly on “disclosed data,” data an individual actively chooses to release.  It doesn’t address the messiness around how the other types of data are being used and reused, except in that we hope explicitly talking about individual preferences around “disclosed data” can help all of us understand what really matters to people (and what doesn’t) when they talk about the need for privacy around other forms of data.

Get Adobe Flash playerPlugin by wpburn.com wordpress themes