Posts Tagged ‘Privacy’

Can we trust Census data?

Wednesday, February 3rd, 2010

Yesterday, the Freakanomics blog at the New York Times reported that a group of researchers had discovered serious errors in PUMS (public-use microdata samples) files released by the U.S. Census Bureau.  When compared to aggregate data released by the Census, the PUMS files revealed up to 15% discrepancies for the 65-and-older population.  As Justin Wolfers explains, PUMS files are small samples of the much larger, confidential data used by the Census for the general statistics it releases. These samples are crucial to researchers and policymakers looking to measure trends that the Census itself has not calculated.

When I read this, the first thought I had was, “Hallelujah!”  Not because I felt gleeful about the Census Bureau’s mistakes, but because this little post in the New York Times articulated something we’ve been trying to communicate for awhile: current methods of data collection (and especially data release) are not perfect.

People love throwing around statistics, and increasingly people love debunking statistics, but that kind of scrutiny is normally directed at surveys conducted by people who are not statisticians.  Most people generally hear words like “statistical sampling” and “disclosure avoidance procedure” and assume that those people surely know what they’re doing.

But you don’t have to have training in statistics to read this paper and understand what happened. The Census Bureau, unlike many organizations and businesses that claim to “anonymize” datasets, knows that individual identities cannot be kept confidential simply by removing “identifiers” like name and address, which is why they use techniques like “data swapping” and “synthetic data.” It doesn’t take a mathematician to understand that when you’re making up data, you might have trouble maintaining the accuracy of the overall microdata sample.

To the Bureau’s credit, it does acknowledge where inaccuracies exist.  But as the researchers found, the Bureau is unwilling to correct its mistakes because doing so could reveal how they altered the data in the first place and thus compromise someone’s identity.  Which gets to the heart of the problem:

Newer techniques, such as swapping or blanking, retain detail and provide better protection of respondents’ confidentiality. However, the effects of the new techniques are less transparent to data users and mistakes can easily be overlooked.

The problems with current methods of data collection aren’t limited to the Census PUMS files either.  The weaknesses outlined by this former employee could apply to so many organizations.

This is why we have to work on new ways to collect, analyze, and release sensitive data.

In the mix: Your unique(ish) browser fingerprint…and…No $$ for privacy.

Friday, January 29th, 2010

1) EFF’s Panopticlick project lets you see how much your browser reveals and whether that might potentially “identify” you, based on their calculation of how identifiable a set of bits might be.

Can someone with a better grasp of math than I have explain to me how their information theory works? Right now, they have let’s say 10,000 people who’ve contributed their browser info. Bruce Schneier found out he was unique in 120,000. But if millions of people tested their browsers, would his configuration really be that unique? (Lots of skepticism in the comments to Schneier’s post, too.)

2) New initiative by advertising groups to reveal that they are tracking information — a small “i” icon:

What a quote: “‘This is not the full solution, but this moves the ball forward,’ he said.”

Well, that’s the understatement of the century. Full solution to what? The advertising industry keeping regulators off their backs? Helping users understanding how targeted advertising finds them? Really, neither are the real problem. Regulators should be focusing on establishing industry guidelines for how service providers and 3rd party advertising partners store and share data.

3) Should government data be in more user-friendly formats than XML?

Or should we leave usability to disinterested 3rd parites? If the government starts releasing user-friendly data, will that simply open the door for agencies to “spin” their data to make themselves look good? Actually, right now, how do we really know the data that’s being released hasn’t been “edited” in some way? Who’s vetting these releases and what’s the process?

4) Ten years and no one is really making any money off of “privacy”?

Perhaps no one has successfully “sold” privacy (as it’s own thing) because we haven’t yet agreed on what that a “privacy product” would look like. As Mimi says, “If someone was selling something that would guarantee that I would never get any SPAM (mail or email) for the rest of my life, I would totally sign up for that.” But that might not equal “privacy” for someone else.

Yay, it’s Data Privacy Day!

Thursday, January 28th, 2010

As sponsored by, among others, Google, Microsoft, Lexis-Nexis, and AT&T.

Lexis-Nexis, for those of you who are not lawyers and journalists, is an amazing tool for doing research on court decsions, regulations, statutes, and other legal matters.  It is also a great way to investigate people, comb through property records, and more!  In a way, though, the information it stores is pretty private, at least to the extent that it’s so expensive to access, it’s not available to the vast majority of people.  Which makes me wonder, how much is Lexis-Nexis worried that its product is becoming less valuable because more and more of their information is available elsewhere for free?

Which leads me to the crux of the problem.  Privacy, a word for which very few people can agree on a definition, is nevertheless a real issue these days.  But the reason it’s become such a pressing concern isn’t only because surveillance technology has gotten better or more pervasive.  It’s also because more information is available everywhere.  Re-identification from supposedly anonymized databases wouldn’t be so easy if other data sources, like DMV records, weren’t so readily available.  In addition, the Internet is teeming with information we want to provide ourselves, through Facebook, PatientsLikeMe, Mint.com, which we do not just because we’re exhibitionists, but because we get value from sharing that information and seeing what others have shared as well.

We want privacy.  We want information.  How are we going to reconcile these two very legitimate desires?  Will there be trade-offs?  Can we really have it all?

We’re definitely not in the camp of “We’ll never have privacy, let’s throw out the data!”, nor the camp of “Privacy’s gone anyway.”  So yes, we do think we can have a lot, if not “all.”  And to do that, we need to move beyond talking about privacy and information in the abstract.  We need to look at specific areas — like electronic health records, campaign finance, government transparency — and be concrete about what we lose and what we gain with every decision we make.

Data Privacy Day may be “an international celebration of the dignity of the individual expressed through personal information,” but let’s be honest.  Dealing with these questions will be interesting, but it isn’t going to be a party.

Did the NYTimes Netflix Data Graphic Reveal the Netflix Preferences of Individual Users?

Tuesday, January 12th, 2010

Slate has an interesting slant on the New York Times graphic everyone’s been raving about — the most popular Netflix movies by zip code all over the country.  It really is great and fun to play with, but as Slate points out, some of the zip codes with rather anomalous lists may be pointing to individual users.  For example, 11317 has this top-ten list:

  1. Wall-E
  2. Indiana Jones and the Temple of Doom
  3. Oz: Season 3: Disc 1
  4. Watchmen
  5. The Midnight Meat Train
  6. Man, Woman, and the Wall
  7. Traffic
  8. Romancing the Stone
  9. Crocodile Dundee 2
  10. Godzilla’s Revenge

11317 is the zip code for LaGuardia Airport, which doesn’t have any residents.  That means this list may very well represent the Netflix renting habit of a small group or even a single subscriber who has his or her DVDs mailed there.

Slate finds some other zip codes that may represent a single subscriber, but doesn’t point out the privacy problem here, despite the fact that Netflix is already in hot water about its data releases.

We’ve said a lot about what “anonymization” means and what a privacy guarantee should include, so I won’t say more here.  Instead, I just want to point out that the Slate article helps illustrate the problem PINQ is trying to avoid.  As Tony points out in his post, PINQ won’t give you answers that would be changed by the presence of a single record.  Of course, because PINQ gives aggregate answers, you wouldn’t be asking questions phrased exactly as, “What are the top ten most popular Netflix movies for 11317?”  But if you tried to ask, “How many people in 11317 had viewed “The Midnight Meat Train?”, it would add sufficient noise that you would never know that the single person using LaGuardia airport as an address had viewed it.

Privacy Problems as Governance Problems at Facebook

Monday, January 4th, 2010

You know that feeling when you’ve been pondering something for awhile and then you read something that articulates what you’ve been thinking about perfectly?  It’s a feeling between relief and joy, and it’s what I felt reading Ed Felten’s critique of Facebook’s new privacy problems:

What Facebook has, in other words, is a governance problem. Users see Facebook as a community in which they are members. Though Facebook (presumably) has no legal obligation to get users’ permission before instituting changes, it makes business sense to consult the user community before making significant changes in the privacy model. Announcing a new initiative, only to backpedal in the face of user outrage, can’t be the best way to maximize long-term profits.

The challenge is finding a structure that allows the company to explore new business opportunities, while at the same time securing truly informed consent from the user community. Some kind of customer advisory board seems like an obvious approach. But how would the members be chosen? And how much information and power would they get? This isn’t easy to do. But the current approach isn’t working either. If your business is based on user buy-in to an online community, then you have to give that community some kind of voice — you have to make it a community that users want to inhabit.

This is a question we at CDP have been asking ourselves recently — how do you create a community that users want to inhabit?  We agree with Ed Felten that privacy in Facebook, as in most online activities, “means not the prevention of all information flow, but control over the content of their story and who gets to read it.”  Our idea of a datatrust is premised on precisely this principle, that people can and should share information in a way that benefits all of society without being asked to relinquish control over their data. Which is why we’re in the process of researching a wide range of online and offline communities, so that when we launch our datatrust, it will be built around a community of users who feel a sense of investment and commitment to our shared mission of making more sensitive data available for public decision-making.

We’d love to know, what communities are you happy to inhabit?  And what makes them worth inhabiting?  What do they do that’s different from Facebook or any other organization?

Wow, new privacy features!

Friday, December 11th, 2009

Wow, so many companies rolling out new privacy features lately!

Facebook rolled out its new “simplified” privacy settingsGoogle introduced Google Dashboard, a central location from which to manage your profile data, which supplements Google Ads Preferences.  And Yahoo released a beta version of the Ad Interest Manager.

Many, many people have reviewed Facebook’s new changes, and pointed out some of the “bait-and-switch” Facebook has done for some new, and I think better, controls.  I don’t have much more to say about that.

But it’s interesting to me that Google and Yahoo have chosen similar strategies around privacy issues, though with some differences in execution.  Both companies haven’t actually changed their data collection practices, and cynics have argued that they’re both just trying to stave off government regulation.  Still, I think that it makes a difference when companies actually make clear and visible what they are doing with user data.

“Is this everything?”

Both Google and Yahoo indicate in different ways that the user who is looking at Dashboard or Ad Interest Manager is not getting the full data story.

Google’s Dashboard is supposed to be a central place where a user can manage his or her own data.  In and of itself, it’s not that exciting.  As ReadWriteWeb put it, it doesn’t tell you anything you didn’t know before.  It provides links in one place to the privacy settings for various applications, but it focuses on profile information the user provides, which represents only a tiny bit of the personal information Google is tracking.

Google does, however, provide a link to the question, “Is this everything?” that describes some of their browser-based data collection and a link to the Ads Preferences Manager page.  To me, it feels a little shifty, that the Dashboard promises to be a place for you to control “data that is personally associated with you,” but it doesn’t reveal until you scroll to the bottom that this might not be everything.  Others may feel differently, but this to me goes right at the heart of the problem of how “personal information” is defined.  When I go to the Ads Preferences Manager, I see clearly that Google has associated all kinds of interests with me–how is this not “personally associated” with me?  Google states it’s not linking this data to my personal account data which is why they haven’t put it all in one place, which is good, but it seems too convenient a reason to silo that off.

Yahoo’s strategy is a little different.  It may not be fair to compare Yahoo’s Ad Interest Manager to Google’s Dashboard at this point, given that it’s in such a rudimentary phase.  It’s in beta and doesn’t work yet with all browsers.  (As David Courtney points out in PCWorld, being in beta is a pretty sorry excuse for the fact that it doesn’t work with IE8 and Firefox.)  Depending on how much you use Yahoo, you may not see anything about yourself.

Still, I thought it was interesting that Yahoo highlighted some of the hairy parts of its privacy policy in separate boxes high up on the page.  Starting from the top, Yahoo states clearly in separate boxes with bold headings that there are ways in which your data is collected and analyzed that are not addressed in this Ad Interest Manager.  The box for the Network Advertising Initiative is a little weak; it doesn’t really explain what it means that Yahoo is connected to the NAI.  But the box on “other inputs,” shows prominently that even as you manage your settings on this page, there may be other sources of data Yahoo is using to find out more about you.

zoombox

Yahoo also reveals that the information they’re tracking from you is collected from a wide range of sources, including both Yahoo account information like Mail and non-account websites like its Front Page.  Unlike Google, Yahoo doesn’t ask you to click around to find out that some of “everything” is elsewhere.

zoombox2

Turning “interests” on and off

Google and Yahoo are very similar here.  Google’s Ad Preferences Manager indicates which interests have been associated with you with a clear link to how they can be removed, with a button for opting out from tracking altogether.

Googleopt

Yahoo’s Ad Interest Manager has a different design, but the button for opting out altogether is similarly visible.

Yahooopt

We’re using cookies!

Compared to the other issues, this is the most obvious difference between Google and Yahoo.

Google has this on its Ads Preferences Manager:

Googlecookie

So you can see that some string of numbers and letters has somehow been attached to your computer, but you’re not told what this means in terms of what Google knows about you.

In contrast, Yahoo shows this at the bottom of the Ad Interest Manager:

YahooAd3

Yahoo knows I’m a woman!  Between 26 and 35!  The location is actually wrong, as I am in Brooklyn, NY, but I did live in San Francisco 5 years ago when I first signed up for a Yahoo account.  Still, Yahoo is very explicitly showing, and not just telling, that it knows geographical information, age, gender, and the make and operating system of your computer.  I’m impressed—they must know this is going to scare some people.

Does any of this even matter?

I prefer the Yahoo design in many ways — the boxes and verticality of the manager to me are easier to read and understand than the horizontal spareness of the Google design.  But in the end, the design differences between Google and Yahoo’s new privacy tools may not even matter.  I don’t know how many people will actually see either Manager.  You still have to be curious enough about privacy to click on “Privacy Policy,” which takes you to Yahoo! Privacy, at which point, in the top right-hand corner, you see a link to “Opt-out” of interest-based advertising.  The same is true with Google. And neither company has actually changed much about their data collection practices.  They’re just being more open about them.

But I am impressed and heartened that both companies have started to reveal more about what they’re tracking and in ways that are more visually understandable than a long, boring, legalistic privacy policy.  I hope Yahoo is feeling competitive with Google on privacy issues and vice-versa.  I’d love to see a race to the top.

Datatrust Prototype

Tuesday, December 8th, 2009

Editor’s Note: Grant Baillie is developing a datatrust prototype as an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Grant’s work could have interesting implications for CDP’s mission, as it would use technologies that could enable more disclosure of personal data for public re-use.  We’re glad to have him guest-blogging about the prototype on our blog, and we’re looking forward to hearing more as he moves forward.

This post is mostly the contents of a short talk I gave at the CDP Symposium last month. In a way, it was a little like the qualifying oral you have to give in some Ph.D. programs, where you stand up in front of a few professors who know way more than you do, and tell them what you think your research project is going to be.

That is the point we’re at with the Datatrust Prototype: We are ready to move forward with an actual, real, project. This proposal is the result of discussions Mimi, Alex and I have been having over the course of the past couple month or so, with questions and insights on PINQ thrown in by Tony, and answers to some of those questions from Frank.

The talk can be broken up into three tersely titled sections:

Why? Basic motivation for the project.

What? What exactly the prototype will be.

Not Potential features of a Datatrust that are out of scope for this first prototype.

Why?

We need a (concrete) thing

Partly this is to have something to demo to future clients/partner organizations, of course. However, we also need the beginnings of a real datatrust so that different people’s abstract concepts of what a datatrust is can begin to converge.

We need understanding of some technical issues

1. Understanding Privacy: People have been looking at this problem (i.e. releasing aggregated data in such a way that individuals’ privacy isn’t compromised) for over 40 years. After some well-publicized disasters involving ad-hoc approaches (e.g. “Just add some noise to the data”, or “remove all identifying data like names or social security numbers”) a bunch of researchers (like our friends at research.microsoft.com) came up with a mathematical model where there is a measure of privacy, ε (epsilon).

In the model, there’s a clear message: you have to pay (in privacy) if you want greater accuracy (i.e. less noise) in your answers. In particular, the PINQ C# API can
calculate the privacy cost ε of each query on a dataset. So, one can imagine having different users of a datatrust using up allocations of privacy they have been assigned, hence the term “Privacy Budget”. (Frank dislikes this term because there are many privacy strategies possible other than a simple, fixed budget). In any case, by creating the prototype, we are hoping to gain an intuitive understanding of this mathematical concept of privacy, as well as obtain insight on more practical matters like how to allocate privacy so that the datatrust is still useful.

One way of understanding privacy is to think of it as a probability, (i.e. of leaking data) or measure of risk. You could even imagine an organization buying insurance against loss of individual data, based on the mathematical bounds supplied by PINQ. The downside of this approach is that we humans don’t seem to have a good intuitive grasp of things like probability and risk (writ large, for example, in the financial meltdown last year).

Another approach that might be helpful is to notice that privacy behaves in the same way as a currency (for example, it is additive). Here, you can imagine people earning or trading currency, for example. With actual money, we have a couple of thousands of years worth of experience built into evaluations like a house being worth a million Snickers bars: How long will it take us to have similar intuition with a privacy currency?

2. PINQ vs SQL: Here, by “SQL” I’m talking of traditional persistent data storage mechanisms in general. In most specific cases we are talking about SQL-based databases (although in the data analysis world there are other possibilities, like SAS).

  • SQL has been around for over 35 years, and is based on a mathematical model of its own. It basically provides a set of building blocks for querying and updating a database.
  • PINQ is a wrapper that basically protects the privacy of an underlying SQL database. It allows you to run SQL-like queries to get result sets, but then only lets you see statistical information about these sets. Even this information will come at some privacy cost, depending on how accurate you want the answer to be. PINQ will add random noise to any answer it gives you; if you want to ask for more accurate answers, i.e. less noise added (on average), you have to pay more privacy currency.
Note: At this point in the talk, I went in to a detailed discussion of a particular case of how PINQ privacy protection is supposed to work. However, I’m going to defer this to an upcoming post.

PINQ provides building blocks that are similar to SQL’s, with the caveat that the only data you can get out is aggregated (i.e. numbers, averages, and other statistical information). Also, some SQL operations cannot be supported by PINQ because they cannot be privacy protected at all.

In any case, both PINQ and SQL support an infinite number of questions, since you can ask about arbitrary functions of the input records. However, because they have somewhat different query building blocks, it is at least theoretically possible that there are real-world data analyses that cannot be replicated exactly in PINQ, or can only be done in a cumbersome or (privacy) expensive way. So, it will be good to focus on more concrete uses cases, in order to see whether this is the case or not.

3. Efficent Queries: It’s not uncommon for database-based software projects to grind to a halt at some point when it becomes clear that the database isn’t handling the full data set as well as is needed. Then various experts are rushed in to tune and rewrite the queries so that they perform better. In the case of PINQ, there is an additional measure of query performance, that of privacy used. Frank’s PINQ tutorial already has one example of a query that can be tuned to use privacy budget more efficiently. Hopefully, by working through specific use cases, CDP can start building expertise in query optimization.

What?

Target: A researcher working for a single organization. We’re going to imagine that some organization has a dataset containing personal information, but they want to be able to do some data analysis and release statistical aggregates to the public without compromising any individual’s privacy.

A Mockup of a Rich Dataset: Hopefully, I’ve given enough incentive for why we want a reasonably “real-world” structure to our data. I’m proposing that we choose a subset of the schema available as part of the National Health and Nutrition Examination Survey (NHANES):

NHANES Survey Content Cover Page

This certainly satisfies the “rich” requirement: NHANES combines an interesting and extensive mix of medical and sociological information (The above cover page image comes from the description of the data collected, a 12-page PDF file). Clearly, we wouldn’t want to mock up the entire dataset, but even a small subset should make for some reasonably complex analyses.

Queries: We will supply at least a canned set of queries over the sample data. A scenario I have in mind is being able to have something like Tony’s demo, but with a more complex data set. A core requirement of the prototype is to be able to reproduce the published aggregations done with the real NHANES dataset. Some kind of geographical layout, like the demo, would be compelling, too.

Not

Account management: This includes issues of tracking privacy allocation and expenditures on a per-user basis, possibly having some measure of trust to allow this. There may be some infrastructure for different users in the prototype, but for the most part we’ll be assuming a single, global user.

Collaborative queries: In the future, we could imagine having users contribute to a library of well-known queries for a given data set. The problem with public access like this is that it basically means that all privacy budget is effectively shared, since query results are shared, so for this first cut at the problem we are not going to tackle this.

Multiple Datasets, Updates: For now, we will assume a single data set, with no updates. (The former can raise security concerns, especially if data sets aren’t co-hosted, while the latter is an area where I’m not sure what the mathematical contraints are).

Sneaky code (though maybe we have a service): There is a known issue at the moment with having PINQ executing arbitrary C# code to do queries. At the moment, it is possible to have your code save all the records it sees to a file on disk. We may work around this by having the datatrust be a service (i.e. effectively restricting the allowed queries so no user-supplied code is run).

Deployment issues (e.g. who owns the data): Our prototype will just have PINQ and the simulated database running on the same machine, even though more general configurations are possible. We also explicitly don’t tackle whether the database is running on a CDP server or the organization that owns the data.

Open Source Ideological Purity: While it would be nice for CDP to be able to deploy on an open source platform, it is clear that serious issues might lie in wait for deploying on Mono (the open source C# environment). In that case, it is quite possible to switch to running PINQ on top of, say, Microsoft SQL Server.

In the mix

Wednesday, November 25th, 2009

Firefox’s Plan to Kick the Login’s Butt (ReadWriteWeb)

The Cost of Getting Sick: GE (GE)

Happy Thanksgiving, everyone!

Remixing Creative Commons licenses for personal information, Part II — What good would that do?

Wednesday, November 25th, 2009

The scenarios of data sharing I outlined in my first blog post may not sound too exciting to you.  So what if one person uploads a dataset on her blog, making it public, and then says it’s available for reuse?  How does that make the world a better place?

It’s possible that although personal information licenses, a la Creative Commons, wouldn’t solve all data-collection problems today, it could shape and shift the debate in several important ways:

1) Create a proactive way for people to take control of their information.

Right now, we as users generally are told, “Take it or leave it.”  We can agree with the terms of use that govern the use of our personal information, or not. A few companies are trying to offer more choices—Firefox has a “Private Browsing” option, Google offers some choices in what interests are tracked.  But a user almost never gets a choice in how his or her information is used once it’s collected.  A set of licenses could be a way to assert control instead of waiting for the choices to be offered.  As many privacy advocates have noted, it’s problematic that most privacy choices are offered as an opt-out rather than an opt-in.  A set of licenses would create a way to “opt-in” before being asked.  Even if the licenses turned out to be difficult to enforce, if the licenses became popular and widespread, it would be harder to ignore that people do have preferences that are not being considered or honored.

2) Create a grassroots way for people to actively share their information for causes they explicitly support.

Obama's Healthcare Stories for America

We’ve all seen campaigns that are organized around human-interest stories, true stories about real people that are meant to humanize a campaign and give it urgency.  The current healthcare debate, for example, inspired a host of organizations to ask people to “share their stories,” the Obama administration’s site being one of the best-organized ones.

It had the following “Submission Terms“:

submission terms

By submitting your story, you agree that the story, along with any pictures or video you submit along with the story (the “Submission”), is non-confidential and may be freely used and disclosed, in whole or in part and in any manner or media, by or on behalf of Democratic National Committee (“DNC”) in support of health care reform.

You acknowledge that such use will be without acknowledgment or compensation to you.

You grant DNC a perpetual, irrevocable, sublicensable, royalty-free license to publish, reproduce, distribute, display, perform, adapt, create derivative works of and otherwise use the Submission.

Despite the all-or-nothing language, the Obama site was still able to solicit a great number of stories.  But the terms underscore a perennial problem for lesser-known organizations.  How do people trust an organization with their stories?

A more decentralized set of licenses could allow people to essentially tag their information across the internet and flag that it’s been provided in support of a specific cause, without giving their stories explicitly to another organization.  Individuals could also choose to tag their information in support of specific research projects.

The licenses could be an organizing tool, a way for organizations or people without established reputations to gather useful information without asking people to sign away the rights to their stories.  Or the licenses could be a research tool, enabling new forms of data collection.  Already, sociologists are exploring the possibilities of broadening research beyond the couple hundred subjects that can be managed through more traditional methods.  At Harvard, a graduate student in psychology created an iPhone application that allows research subjects in a study on happiness to rate their happiness in real time, rather than through recollection with an interviewer later.

Would the existence of standard licenses for sharing personal information make organizing around real stories easier?  Could it make personal information-based research easier?  Could it encourage people who support such causes or research but are uncertain about existing privacy guarantees more willing to try?  We think it’s certainly worth exploring.

3) Make sharing cool (and good).

WhyIGivebutton

Creative Commons is not without controversy, but almost everyone would agree, what the organization did manage to do was making sharing work cool.  The licenses created an easy way for people who shared the same view of intellectual property to band together and display their commitment.  They also made it easier to advertise and sell this ethos of IP to others.

We wonder if a set of licenses for sharing personal information might not be able to do the same.  We want to promote sharing information as a virtue, a civic act of generosity, and a way to enable all of us to have more information for decisions.  We want donating information to feel like donating blood.

4) Raise the bar on use of personal information in research, marketing, and other contexts.

It may seem like we’re encouraging less use and reuse of information by imagining a system where people put licenses on information they already make public (see screenshots from the first post.)  But what the licenses would make clear, which is not clear now, is that there is a difference between something being put out for the public, for general use and enjoyment, and something being put out for someone else’s reuse, gain, and potential profit.  Those who use the license would be signaling clearly their willingness to make their information available for research and other public uses.

About a year ago, researchers at the Berman Center for the Internet and Society at Harvard released a dataset of Facebook profile information for an entire class of college students at an “an anonymous, northeastern American university.”  As Michael Zimmer pointed out, however, the dataset was hardly “anonymous.”  He was quickly able to deduce that the university in question was Harvard.  Although some have argued that some of these profiles were already “public,” Zimmer argues (and we agree) that having a public profile does not equal consent to being a research subject:

This leads to the second point: just because users post information on Facebook doesn’t mean they intend for it to be scraped, aggregated, coded, disected, and distributed. Creating a Facebook account and posting information on the social networking site is a decision made with the intent to engage in a social community, to connect with people, share ideas and thoughts, communicate, be human. Just because some of the profile information is publicly avaiable (either consciously by the user, or due to a failure to adjust the default privacy settings), doesn’t mean there are no expectations of privacy with the data. This is contextual integrity 101.

By creating a license that allows people to clearly signal when they do consent to being “scraped, aggregated, coded, dissected, and distributed,” we would also make clearer that when people don’t clearly signal their consent, that consent cannot be assumed.

5) Ultimately create new scenarios in which licenses can be used.

So far, the scenarios I’ve outlined in which a license could be applied are where information is being displayed openly, as on a website.  But the licenses could eventually apply to more closed systems, where the individual’s decision to share data is not itself public.

CDP is working on building a datatrust, a new kind of institution and trusted entity to store sensitive, personal information and make it publicly accessible for research.  Individuals and institutions could choose to donate data to the datatrust, knowing that they are contributing to public knowledge on a range of issues.  CDP will likely use a system of licenses that allow each data donor to pre-determine his or her preferences on how their data is accessed rather than a single “terms of use” tha applies to everyone, take it or leave it.

Similarly, if the licenses were to become popular, other organizations and companies that collect information from their members or account holders would be under pressure to offer these set choices or licenses when people sign up for accounts that require them to provide personal information.

Taxonomy of data

Thursday, November 19th, 2009

I haven’t yet posted Parts II and III of our series on the idea of creating Creative Commons-type sharing licenses for personal information, but Bruce Schneier posted today on a proposed taxonomy of data, and I thought it was worth sharing now.  Although the taxonomy he’s discussing is limited to social networking data, it’s a helpful way to understand why it’s so hard to come up with rules around personal information in general.

Here is his taxonomy on social networking data:

  1. Service data. Service data is the data you need to give to a social networking site in order to use it. It might include your legal name, your age, and your credit card number.
  2. Disclosed data. This is what you post on your own pages: blog entries, photographs, messages, comments, and so on.
  3. Entrusted data. This is what you post on other people’s pages. It’s basically the same stuff as disclosed data, but the difference is that you don’t have control over the data — someone else does.
  4. Incidental data. Incidental data is data the other people post about you. Again, it’s basically same same stuff as disclosed data, but the difference is that 1) you don’t have control over it, and 2) you didn’t create it in the first place.
  5. Behavioral data. This is data that the site collects about your habits by recording what you do and who you do it with.

As I noted in my first license blog post, our idea is focusing strictly on “disclosed data,” data an individual actively chooses to release.  It doesn’t address the messiness around how the other types of data are being used and reused, except in that we hope explicitly talking about individual preferences around “disclosed data” can help all of us understand what really matters to people (and what doesn’t) when they talk about the need for privacy around other forms of data.

Get Adobe Flash playerPlugin by wpburn.com wordpress themes