Archive for the ‘Public Policy’ Category

Will the “Closing of the Open Web” also close the door on the possibility of individual agency over personal data?

Saturday, March 10th, 2012

Recently there was a spate of posts about the threat posed to the open web by dynamically generated, and therefore closed, black-box web services like Facebook, and by even more hermetically sealed proprietary content delivery platforms like iOS and Android.

I won’t rehash the whole explanation here, as it is well explained in those posts, but a thought experiment worth unraveling is to imagine what the first world would look like today if we had gone straight from proprietary desktop applications to proprietary mobile applications, skipping the age of the browser altogether.

Even if you are not well-versed in the nitty-gritty of HTTP (Hypertext Transfer Protocol, the standard by which data gets passed around the internet), you might still have an intuitive sense that many things we take for granted today would not exist.

Google for one.
Shopping comparison sites for another.
Facebook would be a completely different beast.
Twitter wouldn’t exist.

All of the above rely on an “open web” speaking a “standard language” that anyone (meaning any piece of software that also speaks the “standard language”) can access and understand.
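To make that “standard language” concrete: any program that speaks HTTP and HTML can fetch a public page and read its links, which is essentially the seed of what a search engine or shopping-comparison crawler does. Here is a minimal sketch in TypeScript (the URL is just a placeholder, it assumes Node 18+ for the built-in fetch, and a real crawler would use a proper HTML parser):

  // Minimal sketch: any program that speaks HTTP and HTML can read the open web.
  // Assumes Node 18+ for the built-in fetch; the URL below is just a placeholder.
  async function listLinks(url: string): Promise<string[]> {
    const response = await fetch(url);   // a plain HTTP GET, no permission needed
    const html = await response.text();  // the page arrives as standard HTML
    const links: string[] = [];
    // Crude href extraction; a real crawler would use an HTML parser.
    for (const match of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) {
      const href = match[1];
      if (href) links.push(href);
    }
    return links;
  }

  listLinks("https://example.com").then((links) => console.log(links));

Nothing in that sketch works against content locked inside a paid, native app, which is precisely the point.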


Sharing content via “links” wouldn’t really make sense if the content were locked inside of Apps that each individual had to pay for.

In general there would probably be a lot less content overall as content providers would have had to make the same hard choice desktop software makers have had to make for decades: Which platforms do I build for?

How might the world be different today if open web standards had never taken hold?

Free content might not be the norm.

For example, publishing might not be in its death throes as content-providers would have been able to lock down their content by delivering them in apps users paid for.

Without Google and the ability to follow users across property boundaries on the web, targeted advertising might not be the de rigueur business model for internet startups.

One scenario we haven’t seen spelled out is that the end of the open web also has unfortunate consequences for the idea that users might one day gain control over their own data.

Today, we are for the most part at the mercy of the websites, services and apps we use when it comes to what data is collected about us and how it’s used. Each web domain and service has its own terms of use (Apple recently presented me with a 42-page doozy for iTunes) and gets to decide what to collect, how to make use of it and who to share it with (usually Google, via Google AdWords or Google Analytics, and Facebook, via the Facebook Like button or Facebook Connect).

Today, however, in the browser, theoretical meta-services could be built that allow users to collect all the same data websites collect and decide for themselves what they want to do with it. This possibility exists in the form of browser add-ons, which have the same access to user behavior that websites do, but with the significant difference that an add-on can track the user wherever they decide they would like to collect data about themselves.
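To illustrate (this is a hypothetical sketch, not a description of any existing add-on), the core of such a “meta-layer” could be a short WebExtension background script that records every top-level page the user visits, locally, for the user’s own use. The browser.* namespace below is Firefox-style (Chrome uses chrome.*), and the extension’s manifest would need the “webNavigation” and “storage” permissions plus host permissions.

  // Hypothetical background script for a "my own data" browser add-on.
  // Firefox-style WebExtensions APIs; the manifest would need the
  // "webNavigation" and "storage" permissions plus host permissions.

  interface VisitRecord {
    url: string;
    visitedAt: number;
  }

  browser.webNavigation.onCommitted.addListener(async (details) => {
    // Record only top-level page loads, and store them locally for the
    // user's own use -- roughly the same signal a site's analytics sees.
    if (details.frameId !== 0) return;

    const record: VisitRecord = { url: details.url, visitedAt: Date.now() };
    const { visits = [] } = await browser.storage.local.get("visits");
    await browser.storage.local.set({ visits: [...visits, record] });
  });

The particular code matters less than the vantage point: the add-on rides along with the user across every domain, which is exactly the access no single website has.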

By contrast, each website can only track users within the confines of its own domain; or, if you’re a big player like Google or Facebook, you can track users wherever other websites have agreed to put Google AdWords, make use of Google Analytics or install Facebook Like buttons.

This theoretical possibility, however, does not exist on any of the mobile platforms. There is no way to plug in to iOS or Android as a “meta-layer” across all apps to collect user behavioral data.

Given that these meta-layer user-centered data collection services are largely theoretical, it’s understandable that no one is losing sleep over them. However from our perspective, the decline of access points to collect data on the web presents a very real loss for advancing individual user control over data.

In the end, we still believe that the best way for individuals to regain footing in our privacy/data-collection tug-of-war with online services is for users to engage with data rather than run away from it.

Open Graph, Silk, etc: Let’s stop calling it a privacy problem

Tuesday, October 4th, 2011

The recent announcements of Facebook’s Open Graph update and Amazon Silk have provoked the usual media reaction about privacy. Maybe it’s time to give up on trying to fight data collection and data use issues with privacy arguments.

Briefly, the new features: Facebook is creating more ways for you to passively track your own activity and share it with others. Amazon, in the name of speedier browsing (on their new Kindle device), has launched a service that will capture all of your online browsing activity tied to your identity, and use it to do what sounds like collaborative filtering to predict your browsing patterns and speed them up.

Who cares if they know I'm a dog? (SF Weekly)

Amazon likens what Silk is doing to the role of an Internet Service Provider, which seems reasonable, but since regulators are getting wary of how ISPs leverage the data that passes through them, Amazon may not always enjoy that association.

EPIC (Electronic Privacy Information Center) has sent a letter to the FTC requesting an investigation of Facebook’s Open Graph changes and the new Timeline.

I’m not optimistic about the response. Depending on how the default privacy settings are configured, Open Graph may fall victim to another “Facebook ruined my diamond ring surprise by advertising it on my behalf” kerfuffle, which will result in a half-hearted apology from Zuckerberg and some shuffling around of checkboxes and radio buttons. The watchdogs aren’t as used to keeping tabs on Amazon, which has done a better job of meeting expectations around its use of customer data, so Silk may provoke a bit more soul-searching.

But I doubt it. In an excerpt from his book “Nothing to Hide: The False Tradeoff Between Privacy and Security,” published in the Chronicle of Higher Education earlier this year, Daniel J. Solove does a great job of explaining why we have trouble protecting individual privacy when it is weighed against [national] security. In the course of his argument he makes two points that are useful in thinking about protecting privacy on the internet.

He quotes South Carolina law professor Ann Bartow as saying,

There are not enough privacy “dead bodies” for privacy to be weighed against other harms.

There’s plenty of media chatter monitoring the decay of personal privacy online, but the conversations have been largely theoretical, the stuff of political and social theory. We have yet to have an event that crystallizes the conversation into a debate of moral rights and wrongs.

Whatevers, See No Evil, and the OMG!’s

At one end of the “privacy theory” debate are the Whatevers, whose blasé battle cry of “No one cares about privacy anymore” is bizarrely intended to be reassuring. At the other end are the OMG!’s, who only speak of data collection and online privacy in terms of degrees of personal violation, which, equally bizarrely, has the effect of inducing public equanimity in the face of “fresh violations.”

However, as per usual, the majority of people exist in the middle, where, so long as they “See no evil and Hear no evil,” privacy is a tab in the Settings dialog, not a civil liberties issue. Believe it or not, this attitude hampers both companies trying to get more information out of their users AND civil liberties advocates who desperately want the public to “wake up” to what’s happening. Recently, privacy lost to free speech – but more on that in a minute.

When you look into most of the privacy concerns raised about legitimate websites and software (not viruses, phishing or other malicious efforts), they usually have to do with fairly mundane personal information. Your name or address being disclosed inadvertently. Embarrassing photos. Terms you search for. The websites you visit. Public records digitized and put on the web.

The most legally harmful examples involve identity theft, which while not unrelated to internet privacy, falls squarely in the well-understood territory of criminal activity. What’s less clear is what’s wrong with “legitimate actors” such as Google and Facebook and what they’re doing with our data.

Which brings us to a second point from Solove:

“Legal and policy solutions focus too much on the problems under the Orwellian metaphor—those of surveillance—and aren’t adequately addressing the Kafkaesque problems—those of information processing.”

In other words, who cares if the servers at Google “know” what I’m up to? We can’t as yet really even understand what it means for a computer to “know” something about human activity. Instead, the real question is: what is Google (the company, comprised of human beings) deciding to do with this data?

What are People deciding to do with data?

By and large, the data collection that happens on the internet today is feeding into one flavor or another of “targeted advertising.” Loosely, that means showing you advertisements that are intended for an individual with some of your traits, based on information that has been collected about you. A male. A parent. A music lover. The changes to Facebook’s Open Graph will create a targeting field day. Which, on some level is a perfectly reasonable and predictable extension of age-old advertising and marketing practices.
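Mechanically, there is nothing mysterious about the matching step. A toy sketch (all campaign names and traits below are invented for illustration) might look like this:

  // Toy sketch of trait-based ad targeting: match a user's inferred traits
  // against the audience an advertiser asked for. All names are invented.

  interface Campaign {
    name: string;
    requiredTraits: string[];
  }

  const campaigns: Campaign[] = [
    { name: "headphone-sale", requiredTraits: ["music lover"] },
    { name: "minivan-promo", requiredTraits: ["parent"] },
  ];

  function pickAds(userTraits: Set<string>, available: Campaign[]): Campaign[] {
    return available.filter((c) =>
      c.requiredTraits.every((trait) => userTraits.has(trait))
    );
  }

  // A profile assembled from collected behavior: a male, a parent, a music lover.
  const profile = new Set(["male", "parent", "music lover"]);
  console.log(pickAds(profile, campaigns).map((c) => c.name));
  // -> [ "headphone-sale", "minivan-promo" ]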

In theory, advertising provides social value in bridging information gaps about useful, valuable products; data-driven services like Facebook, Google and Amazon are simply providing the technical muscle to close that gap.

However, Open Graph, Silk and other data-rich services place us at the top of a very long and shallow slide down to a much darker side of information processing, one that has less to do with the processing itself than with manipulation and the balance of power. And it’s the very length and gentle slope of that slide that make it almost impossible for us to talk about what’s really going wrong, and even make it somewhat pleasant to ride down. (Yes, I’m making a slippery slide argument.)

At the top of the slide are issues of values and dehumanization.

Recently employers have been making use of credit checks to screen potential candidates, automatically rejecting applicants with low credit scores. Perhaps this is an ingenious, if crude, way to quickly filter down a flood of job applicants. While its utility remains to be proven, it’s with good reason that we pause to consider the unintended consequences of such a policy. In many areas, we have often chosen to supplement “objective,” statistical evaluations with more humanist, subjective techniques (the college application process being one notable example). We are also a society that likes to believe in second chances.

A bit further down the slide, there are questions of fairness.

Credit card companies have been using purchase histories as a way to decide who to push to pay their debt in full and who to strike a deal with. In other words, they’re figuring out who will be susceptible to “being guilted” and who’s just going to give them the finger when they call. This is a truly ingenious and effective way to lower the cost and increase the effectiveness of debt collection efforts. But is it fair to debtors that some people “get a deal” and others don’t? Surely, such inequalities have always existed. At the very least, it’s problematic that such practices are happening behind closed doors with little to no public oversight, all in the name of protecting individual privacy.

Finally, there are issues of manipulation where information about you is used to get you to do things you don’t actually want to do.

The fast food industry has been micro-engineering the taste, smell and texture of their food products to induce a very real food addiction in the human brain. Surely, this is where online behavioral data-mining is headed, amplified by the power to deliver custom-tailored experiences to individuals.

But it’s just the Same-Old, Same-Old

This last scenario sounds bad, but isn’t this simply more of the same old advertising techniques we love to hate? Is there a bright line test we can apply so we know when we’ve “crossed the line” over into manipulation and lies?

Drawing Lines

Clearly the ethics of data use and manipulation in advertising is something we have been struggling with for a long time and something we will continue to struggle with, probably forever. However, some lines have been drawn, even if they’re not very clear.

While the original defining study on subliminal advertising has since been invalidated, when it was first publicized, the idea of messages being delivered subliminally into people’s minds was broadly condemned. In a world of imperfect definitions of “truth in advertising,” it was immediately clear to the public that subliminal messaging (if it could be done) crossed the line into pure manipulation, and that was unacceptable. It was quickly banned in the UK and Australia, and by the American television networks and the National Association of Broadcasters.

Thought Experiment: If we were to impose a “code of ethics” on data practitioners, what would it look like?

Here’s a real-world, data-driven scenario:

  • Pharmacies sell customer information to drug companies so that they can identify doctors who will be most “receptive” to their marketing efforts.
  • Drug companies spend $1 billion a year advertising online to encourage individuals to “ask your doctor about [insert your favorite drug here]” with vague happy-people-in-sunshine imagery.
  • Drug companies employed some 90,000 salespeople (as of 2005) to visit the best target doctors and sway them to their brands.

Vermont passed a law outlawing the use of the pharmacy data without patient consent on the grounds of individual privacy. Then, this past June 23rd, the Supreme Court decided it was a free-speech problem and struck down the Vermont law.

Privacy as an argument for hemming in questionable data use will probably continue to fail.

The trouble, again, is that theoretical privacy harms are weak sauce compared to data’s power to “bridge information gaps.” If we shut down use of this data on the basis of privacy, we also prevent the government from using the same data to prioritize distribution of vaccines to clinics in high-risk areas.

Ah, but here we’ve stumbled on the real problem…

Let’s shift the conversation from Privacy to Access

Innovative health care cost reduction schemes like care management are starved for data. Privacy concerns about broad, timely analysis of tax returns have prevented effective policy evaluation. Municipalities negotiating with corporations lack data to make difficult economic stimulus decisions. Meanwhile private companies are drowning in data that they are barely scratching the surface of.

At the risk of sounding like a broken record, since we have written volumes about this already:

  • The problem does not lie in the mere fact that data is collected, but in how it is secured and processed and in whose interest it is deployed.
  • Your activity on the internet, captured in increasingly granular detail, is enormously valuable, and can be mined for a broad range of uses that as a society we may or may not approve of.
  • Privacy is an ineffective weapon to wield against the dark side of data use; instead, we should focus our efforts on (1) regulations that require companies to be more transparent about how they’re using data and (2) making personal data into a public resource that is in the hands of many.

 

Kerry-McCain Privacy Bill: What it got right, what’s still missing.

Wednesday, May 11th, 2011

At long last, we have a bill to talk about. Its official name is the “Commercial Privacy Bill of Rights Act of 2011,” and it was introduced by Senators Kerry and McCain.

I was pleasantly surprised by how well many of the concepts and definitions were articulated, especially given some of the vague commentary that I had read before the bill was officially released.

Perhaps most importantly, the bill acknowledges that de-identification doesn’t work, even if it doesn’t make a lot of noise about it.

More generally though, there is a lot that is right about this bill, and it cannot be dismissed as an ill-conceived, knee-jerk reaction to the media hype around privacy issues.

For readers who are interested, I have outlined some of the key points from the bill that jumped out at me, as well as some questions and clarifications. Before getting to that however, I’d like to make three suggestions for additions to the bill.

Transparency, Clear Definitions and Public Access

Lawmakers should legislate more transparency into data collection; they should define what it means to render data “not personally identifiable;” and they should push for commercial data to be made available for public use.

Legislators should look for opportunities to require more transparency of companies and organizations collecting data by establishing new standards for “privacy accounting” practices.

Doing so will encourage greater responsibility on the part of data collectors and provide regulators with more meaningful tools for oversight. Some examples include:

  1. Companies collecting data should be required to identify outside contractors they hire to perform data-related services. Currently in the bill, companies are liable for their contractors when it comes to privacy and security issues. However, we need a more positive carrot to incent companies to keep closer track of who has access to sensitive data and for what purposes. A requirement to publicly account for that information is the best way to encourage more disciplined internal accounting practices.
  2. Data collectors should publicly and specifically state what data they are collecting in plain English. Most privacy policies today are far too vague and high-level because companies don’t want to be limited by their own policies.

For example, the following is taken from the Google Toolbar Privacy Policy:

“Toolbar’s enhanced features, such as PageRank and Sidewiki, operate by sending Google the addresses and other information about sites at the time you visit them.” (Italics mine.)

This raises the question: what exactly is covered by “other information”? How long I remain on a page? Whether I scroll down to the bottom of the page? What personalized content shows up? What comments I leave? The passwords I type in? These are all reasonable examples of the level of specificity at which Google could be more transparent about what data it collects. None of these items are too technical for the general user to understand, and at this granularity I don’t believe such a list would be terribly onerous to keep up to date. We should be able to find a workable middle ground that gives users of online services a more specific idea of what data is being collected about them without overwhelming them with too much technical detail.

Legislators Need to Establish Meaningful Standards for Anonymization

After describing the spirit of the regulations, the bill assigns certain tasks that are either too detailed or too dynamic to “rulemaking proceedings.” One such task is defining the requirements for providing adequate data security. I would like to add an additional, critical task to the responsibilities of those proceedings:

They must define what it means to “render not personally identifiable” (Sec 202a5A) or “anonymise” (sec 701-4) data.

Without a clear legal standard for anonymization, the public will continue to be misled into believing that “anonymous” means their data is no longer linkable to their identity, when in fact there can only ever be degrees of anonymity; complete anonymity does not exist. This is a problem we have been struggling with as well.

Our best guess at a good way to approach a legal definition would be to build up a framework around acceptable levels of risk and require companies and organizations collecting data to quantify the amount of risk they incur when they share data, which is actually possible with something like differential privacy.
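To make that concrete, here is a minimal sketch of the kind of bookkeeping we have in mind: a count query answered with Laplace noise calibrated to a privacy parameter epsilon, with a running tally of the epsilon “spent.” This is an illustration of the general differential privacy technique, not a production implementation or a specific legal proposal; all names and numbers are invented.

  // Minimal sketch of privacy-risk accounting with differential privacy:
  // answer a count query with Laplace noise calibrated to epsilon, and keep
  // a running total of the epsilon "spent." Illustrative only.

  function laplaceNoise(scale: number): number {
    const u = Math.random() - 0.5;
    return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
  }

  class PrivacyAccountant {
    private spent = 0;
    constructor(private readonly budget: number) {}

    // A count query has sensitivity 1: one person changes the result by at most 1.
    noisyCount(trueCount: number, epsilon: number): number {
      if (this.spent + epsilon > this.budget) {
        throw new Error("privacy budget exhausted");
      }
      this.spent += epsilon;                     // the quantified risk incurred
      return trueCount + laplaceNoise(1 / epsilon);
    }
  }

  const accountant = new PrivacyAccountant(1.0); // total risk the policy allows
  console.log(accountant.noisyCount(4213, 0.1)); // e.g. 4220.7

The point for legislation is the accounting, not the noise: epsilon is a number a company could be required to track and report when it shares data.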

Legislators Should Push for Public Access

Entities that collect data from the public should be required to make it publicly available, through something like our proposal for the datatrust.

Businesses of all sorts have, with the advent of technology, become data businesses. They live and die by the data that they come by, though little of it was given to them for the purposes it is now used for. That doesn’t mean we should delete the data, or stop them from gathering it – that data is enormously valuable.

It does mean that the public needs a datastore to compete with the massive private sector data warehouses. The competitive edge that large datasets provide the entities that have them is gigantic, and no amount of notice and security can address that imbalance with the paucity of granular data available in the public realm.

Now for a more detailed look at the bill.

Key Points of the Bill

  1. The bill is about protecting Personally Identifiable Information (PII), which it correctly disambiguates to mean both the unique identifying information itself AND any information that is linked to that identifier.
  2. Though much of the related discussion in the media talks about the bill in terms of its impact to tracking individuals on the internet, the bill is about all commercial entities, online or off.
  3. “Entities” must give notice to users about collecting or using PII – this isn’t particularly shocking, but what may be more complicated will be what constitutes “notice”.
  4. Opt-out for individuals is required for use of information that would otherwise be considered an unauthorized use. (This is a nice thought, but the list of exceptions to the unauthorized use definition seems to be very comprehensive – if anyone has a good example of use that would “otherwise be unauthorized” and is thus addressed by this point, I would be interested to hear it.)
  5. Opt-out for individuals is also required for the use of an individual’s covered information by a third-party for behavioral advertising or marketing. (I guess this means that a news site would need to provide an opt-out for users that prevents ad-networks from setting cookies, for example?)
  6. Opt-in for individuals is required for the use or transfer of sensitive PII (a special category of PII that could cause the individual physical or economic harm, in particular medical information or religious affiliations) for uses other than handling a transaction (does serving an ad count as a transaction? – this is not defined), fighting fraud or preventative security. Opt-in is also required if there is a material change to the previously consented uses and that use creates a risk of economic or physical harm.
  7. Entities need to be accountable for providing adequate security/protection for the PII that they store.
  8. Entities can use the PII that they collect for an enumerated list of purposes, but from my reading, just about any purpose related to their business.
  9. Entities can’t transfer this data to other entities without explicit user consent. Entities may not combine de-identified data with other data “in order to” re-identify it. (It’s unclear what happens if they combine it without the intent of re-identification but with the same effect.)
  10. Entities are liable for the actions of the vendors they contract PII work to.
  11. Individuals must be able to access and update the information entities have about them. (The process of authenticating individuals to ensure they are updating their own information will be a hard nut to crack, and ironically may potentially require additional information be collected about them to do so.)

It’s hard to disagree with the direction of the above points – all are ideas that seem to be doing the right thing for user privacy. However, there are some hidden issues, some of which may be my misunderstanding, but some of which definitely require clarifying the goal of the bill.

Clarifications/Questions

1. Practical Enforcement – While the bill specifies fines and indicates that various rulemaking groups will be created to flesh out the practical implications of the bill, it’s not clear how the new law will actually change the status quo when it comes to enforcement of privacy rules. With no filing and accounting requirements to demonstrate that data collectors are actually following the rules, the FTC will have no way of “being alerted” when they break them, outside of blatant violations such as completely failing to provide notice to end users of the use of PII. Instead, it will be operating blindly, wholly dependent on whistleblowers for any view into the reality of day-to-day data collection practices.

2. Meaningful Notice and Consent – While the bill lays out specific scenarios where “proper notice” and “explicit [individual] consent” will be required, there is no further explication of what “proper notice” and “explicit consent” should consist of.

Today, “proper notice” for online services consists of providing a lengthy legal document that is almost never read, and even more rarely fully understood, by individuals. In the same vein, “explicit consent” is when those same individuals “agree” to the terms laid out in the lengthy document they didn’t read.

We need guidelines that provide formatting and placement requirements for notice and consent, much the way the FDA actually designed “Nutrition Facts” labels for food packaging.
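To illustrate the idea (not to propose an actual standard; every field name and value below is invented), a machine-readable “data collection facts” declaration for a hypothetical toolbar might look like this:

  // Hypothetical "data collection facts" declaration, analogous to a
  // Nutrition Facts label. Field names and values are invented; no such
  // standard currently exists.

  interface DataCollectionFacts {
    service: string;
    itemsCollected: string[]; // specific, plain-English items
    sharedWith: string[];     // named third parties, not "partners"
    retentionDays: number;
    optOutUrl: string;
  }

  const exampleLabel: DataCollectionFacts = {
    service: "Example Toolbar",
    itemsCollected: [
      "URLs of pages you visit",
      "time spent on each page",
      "search terms typed into the toolbar",
    ],
    sharedWith: ["Example Ad Network"],
    retentionDays: 180,
    optOutUrl: "https://example.com/opt-out",
  };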

3. Regulating Ad Networks – In the bill’s attempt to distinguish between third-parties (requires separate notice) and business partners (does not require separate notice), it remains unclear which category ad networks belong to.

Ads served up directly by New York Times on nytimes.com should probably be considered an integral part of the NYT site.

However, should Google AdWords be handled in the same way? Or are they really third party advertisers that should be required to provide users with separate notice before they can set and retrieve cookies?

More disturbingly, the bill seems to imply that online services gain an all-inclusive free pass to track you wherever you go on the web as soon as you “establish a business relationship,” what EFF is calling the “Facebook loophole.” This means that by signing up for a gmail account, you are also agreeing to Google AdWords tracking what you read on blogs and what you buy online.

This is, of course, how privacy agreements work today. But the ostensible goal of this bill is to close such loopholes.

A Step In The Right Direction

The Kerry-McCain Privacy Bill is undeniable evidence of significant progress in public awareness of privacy issues. However, in the final analysis, the bill in its current form is unlikely to practically change how businesses collect, use and manage sensitive personal data.

Should Pharma have access to doctors’ prescription records?

Tuesday, April 26th, 2011

Maine, New Hampshire and Vermont want to pass laws to prevent pharmacies from selling prescription data to drug companies, who in turn use it for “targeted marketing to doctors” or “tailoring their products to better meet the needs of health practitioners” (depending on who you talk to).

This gets at the heart of the issue of imbalance between private and public sectors when it comes to access to sensitive information.

From our perspective, it doesn’t seem like a good idea to limit data usage. If the drug companies are smart, they’re also using the same data to figure out things like what drugs are being prescribed in combination and how that affects the effectiveness of their products.

Instead, we should be thinking of ways to expand access so that for every drug company buying data for marketing and product development, there is an active community of researchers, public advocates and policymakers who have low-cost or free access to the same data.

Comments on Richard Thaler “Show Us the Data. (It’s Ours, After All.)” NYT 4/23/11

Tuesday, April 26th, 2011

Richard Thaler, a professor at the University of Chicago, wrote a piece in the New York Times this weekend with an idea that is dear to CDP’s mission: making data available to the individuals it was collected from.

Particularly because the title of the piece suggests that he is saying exactly what we are saying, I wanted to write a few quick comments to clarify how it is different.

1. It’s great that he’s saying loudly and clearly that the payback for data collection should be the data itself – that’s definitely a key point we’re trying to make with CDP, and not enough people realize how valuable that data is to individuals, and more generally, to the public.

2. However, what Professor Thaler is pushing for is more along the lines of “data portability,” an idea we agree with at an ethical and moral level but one that has some real practical limitations when we start talking about implementation. In my experience, data structures change so rapidly that companies are unable to keep up with how their data is evolving month to month. I find it hard to imagine that entire industries could coordinate a standard that could hold together for very long without undermining the very qualities that make data-driven services powerful and innovative.

3. I’m also not sure why Professor Thaler says that the Kerry-McCain Commercial Privacy Bill of Rights Act of 2011 doesn’t cover this issue. My reading of the bill is that it’s covered in the general sense of access to your information – Section 202(4) reads:

to provide any individual to whom the personally identifiable information that is covered information [covered information is essentially anything that is tied to your identity] pertains, and which the covered entity or its service provider stores, appropriate and reasonable-

(A) access to such information; and

(B) mechanisms to correct such information to improve the accuracy of such information;

Perhaps what he is simply pointing out is the lack of any mention about instituting data standards to enable portability versus simply instituting standards around data transparency.

I have a long post about the bill that is not quite ready to put out there, and it does have a lot of issues, but I didn’t think that was one of them.

 

Response to: “A New Internet Privacy Law?” (New York Times – Opinion, March 18)

Wednesday, April 6th, 2011

There has been scant detailed coverage of the current discussions in Congress around an online privacy bill. The Wall Street Journal has published several pieces on it in their “What They Know” section but I’ve had a hard time finding anything that actually details the substance of the proposed legislation. There are mentions of Internet Explorer 9’s Tracking Protection Lists, and Firefox’s “Do Not Track” functionality, but little else.

Not surprisingly, we’re generally feeling like legislators are barking up the wrong tree by pushing to limit rather than expand legitimate uses of data in hard-to-enforce ways (e.g. “Do Not Track,” data deletion) without actually providing standards and guidance where government regulation could be truly useful and effective (e.g. providing a technical definition of “anonymous” for the industry and standardizing “privacy risk” accounting methods).

Last but not least, we’re dismayed that no one seems to be worried about the lack of public access to all this data.

We sent the following letter to the editor of the New York Times on March 23, 2011, in response to the first appearance of the issue in their pages – an opinion piece titled “A New Internet Privacy Law,” published on March 18, 2011.

 

While it is heartening to see Washington finally paying attention to online privacy, the new regulations appear to miss the point.

What’s needed is more data, more creative re-uses of data and more public access to data.

Instead, current proposals are headed in the direction of unenforceable regulations that hope to limit data collection and use.

So, what *should* regulators care about?

1. Much valuable data analysis can and should be done without identifying individuals. However, there is, as yet, no widely accepted technical definition of “anonymous.” As a result, data is bought, sold and shared with “third parties” with wildly varying degrees of privacy protection. Regulation can help standardize anonymization techniques, which would create a freer, safer market for data-sharing.

2. The data stockpiles being amassed in the private sector have enormous value to the public, yet we have little to no access to it. Lawmakers should explore ways to encourage or require companies to donate data to the public.

The future will be about making better decisions with data, and the public is losing out.

Alex Selkirk
The Common Data Project – Working towards a public trust of sensitive data
http://commondataproject.org

 

Whitepaper 2.0: A moral and practical argument for public access to private data.

Monday, April 4th, 2011

It’s here! The Common Data Project’s White Paper version 2.0.

This is our most comprehensive moral and practical argument to date for the creation of a public datatrust that provides public access to today’s growing store of sensitive personal information.

At this point, there can be no doubt that sensitive personal data, in aggregate, is and will continue to be an invaluable resource for commerce and society. However, today, the private sector holds a near monopoly on such data. We believe that it is time We, The People gain access to our own data; access that will enable researchers, policymakers and NGOs acting in the public interest to make decisions in the same data-informed ways businesses have for decades.

Access to sensitive personal information will be the next “Digital Divide” and our work is perhaps best described as an effort to bridge that gap.

Still, we recognize that there are many hurdles to overcome. Currently, highly valuable data, from online behavioral data to personal financial and medical records, is siloed and, in the name of privacy, inaccessible. Valuable data is kept out of the reach of the public and in many cases is unavailable even to the businesses, organizations and government agencies that collect the data in the first place. Many of these data holders have business reasons or public mandates to share the data they have, but can’t, or do so only in a severely limited manner and through a time-consuming process.

We believe there are technological and policy solutions that can remedy this situation and our white paper attempts to sketch out these solutions in the form of a “datatrust.”

We set out to answer the major questions and open issues that challenge the viability of the datatrust idea.

  1. Is public access to sensitive personal information really necessary?
  2. If it is, why isn’t this already a solved problem?
  3. How can you open up sensitive data to the public without harming the individuals represented in that data?
  4. How can any organization be trusted to hold such sensitive data?
  5. Assuming this is possible and there is public will to pull it off, will such data be useful?
  6. All existing anonymization methodologies degrade the utility of data, how will the datatrust strike a balance between utility and privacy?
  7. How will the data be collated, managed and curated into a usable form?
  8. How will the quality of the data be evaluated and maintained?
  9. Who has a stake in the datatrust?
  10. The datatrust’s purported mission is to serve the interests of society; will you and I, as members of society, have a say in how the datatrust is run?

You can read the full paper here.

Comments, reactions and feedback are all welcome. You can post your thoughts here or write us directly at info at commondataproject dot org.

In The Mix…predicting the future; releasing healthcare claims; and $1.5 million awarded to data privacy

Tuesday, November 30th, 2010

Some people out there think they can predict the future by scraping content off the web. Does it work simply because web 2.0 technologies are great at creating echo chambers? Is this just another way of amplifying that echo chamber and generating yet more self-fulfilling trend prophecies? See the Future with a Search (MIT Technology Review)

The U.S. Office of Personnel Management wants to create a huge database containing the healthcare claims of millions of people. Many are concerned about how the data will be protected and used. More federal health database details coming following privacy alarm (Computer World)

Researchers at Purdue were awarded $1.5 million to investigate how well current techniques for anonymizing data are working and whether there’s a need for better methods. It would be interesting to know what they think of differential privacy. They  appear to be actually doing the dirty work of figuring out whether theoretical re-identification is more than just a theory. National Science Foundation Funds Purdue Data-Anonymization Project (Threat Post)

In the mix…data-sharing’s impact on Alzheimer’s research, the limits of a Do Not Track registry, meaning of data ownership and more

Friday, August 13th, 2010

1)  It’s heartening that an article on how data-sharing led to a breakthrough in Alzheimer’s research is the Most Emailed article on the NYTimes website right now. The reasons for resisting data-sharing are the same in so many contexts:

At first, the collaboration struck many scientists as worrisome — they would be giving up ownership of data, and anyone could use it, publish papers, maybe even misinterpret it and publish information that was wrong.

But Alzheimer’s researchers and drug companies realized they had little choice.

“Companies were caught in a prisoner’s dilemma,” said Dr. Jason Karlawish, an Alzheimer’s researcher at the University of Pennsylvania. “They all wanted to move the field forward, but no one wanted to take the risks of doing it.”

2) Google agonizes on privacy. The Wall Street Journal article discusses a confidential Google document that reveals the disagreements within the company on how it should use its data.  Interestingly, all the scenarios in which Google considers using its data involve targeted advertising; none involve sharing that data with Google users in a broader, more extensive way than they do now.  Google believes it owns the data it’s collected, but it also clearly senses that ownership of such data has implications that are different from ownership of other assets.  There are individuals who are implicated — what claims might they have to how that data is used?

3) Some people have suggested that if people are unhappy with targeted advertising, the government should come up with a Do Not Track registry, similar to the Do Not Call list.  But as Harlan Yu notes, Do Not Track would not be as simple as it sounds. He notes that the challenges involve both technology and policy:

Privacy isn’t a single binary choice but rather a series of individually-considered decisions that each depend on who the tracking party is, how much information can be combined and what the user gets in return for being tracked. This makes the general concept of online Do Not Track—or any blanket opt-out regime—a fairly awkward fit. Users need simplicity, but whether simple controls can adequately capture the nuances of individual privacy preferences is an open question.

4) What happens to a business’s data when it goes bankrupt? The former publisher and partners of a magazine and dating website for gay youth were fighting over ownership of the company’s assets, including its databases.  They recently came to an agreement to destroy the data.  EFF argues that the Bankruptcy Code should be amended to require such outcomes for data assets.  I don’t know enough about bankruptcy law to have an opinion on that, but this conflict illuminates what’s so problematic about the way we treat data and property.  No one can own a fact, but everyone acts like they own data.  Something fundamental needs to be thrashed out.

5) Geotags are revealing more locational info than the photographers intended.

6) The owner of an ISP that resisted an FBI request for information can finally reveal his identity. Nicholas Merrill can now reveal that he was the plaintiff behind an ACLU lawsuit that challenged the legality of national security letters, by which the FBI can request information without a court order or proof of just cause.  In fact, the FBI can even impose a gag order prohibiting the recipient of the NSL from telling anyone about it, which is what happened to Merrill.

In the mix…WSJ’s “What They Know”; data potential in healthcare; and comparing the privacy bills in Congress

Monday, August 9th, 2010

1.  The Wall Street Journal Online has a new feature section, ominously named, “What They Know.” The section highlights articles that focus on technology and tracking.  The tone feels a little overwrought, with language that evokes spies, like “Stalking by Cellphone” and “The Web’s New Gold Mine: Your Secrets.”  Some of their methodology is a little simplistic.  Their study on how much people are “exposed” online was based on simply counting tracking tools, such as cookies and beacons, installed by certain websites.

It is interesting, though, to see that the big, bad wolves of privacy, like Facebook and Google, are pretty low on the WSJ’s exposure scale, while sites people don’t really think about, like dictionary.com, are very high.  The debate around online data collection does need to shift to include companies that aren’t so name-brand.

2.  In response to WSJ’s feature, AdAge published this article by Erin Jo Richey, a digital marketing analyst, addressing whether “online marketers are actually spies.” She argues that she doesn’t know that much about the people she’s tracking, but she does admit she could know more:

I spend most of my time looking at trends and segments of visitors with shared characteristics rather than focusing on profiles of individual browsers. However, if I already know that Mary Smith bought a black toaster with product number 08971 on Monday morning, I can probably isolate the anonymous profile that represents Mary’s visit to my website Monday morning.

3.  A nice graphic illustrates how data could transform healthcare. Are there people making these kinds of detailed arguments for other industries and areas of research and policy?

4.  There are now two proposed privacy bills in Congress, the BEST PRACTICES bill proposed by Representative Bobby Rush and the draft proposed by Representatives Rick Boucher and Cliff Stearns.  CDT has released a clear and concise table breaking down the differences between these two proposed bills and what CDT recommends.  Some things that jumped out at us:

  • Both bills make exceptions for aggregated or de-identified data.  The BEST PRACTICES bill has a more descriptive definition of what that means, stating that it excepts aggregated information and information from which identifying information has been obscured or removed, such that there is no reasonable basis to believe that the information could be used to identify an individual or a computer used by the individual.  CDT supports the BEST PRACTICES exception.
  • Both bills make some, though not sweeping, provisions for consumer access to the information collected about them.  CDT endorses neither, and would support a bill that would generally require covered entities to make available to consumers the covered information possessed about them, along with a reasonable method of correction.  Some companies, including a start-up called Bynamite, have already begun to show consumers what’s being collected, albeit in rather limited ways.  We at the Common Data Project hope this push for access also includes access to the richness of the information collected from all of us, and not just the interests associated with me.  It’ll be interesting to see where this legislation goes, and how it might affect the development of our datatrust.
