Posts Tagged ‘Access to Information’

Kerry-McCain Privacy Bill: What it got right, what’s still missing.

Wednesday, May 11th, 2011

At long last, we have a bill to talk about. Its official name is the “Commercial Privacy Bill of Rights Act of 2011” and it was introduced by Senators Kerry and McCain.

I was pleasantly surprised by how well many of the concepts and definitions were articulated, especially given some of the vague commentary that I had read before the bill was officially released.

Perhaps most importantly, the bill acknowledges that de-identification doesn’t work, even if it doesn’t make a lot of noise about it.

More generally though, there is a lot that is right about this bill, and it cannot be dismissed as an ill-conceived, knee-jerk reaction to the media hype around privacy issues.

For readers who are interested, I have outlined some of the key points from the bill that jumped out at me, as well as some questions and clarifications. Before getting to that however, I’d like to make three suggestions for additions to the bill.

Transparency, Clear Definitions and Public Access

Lawmakers should legislate more transparency into data collection; they should define what it means to render data “not personally identifiable;” and they should push for commercial data to be made available for public use.

Legislators should look for opportunities to require more transparency of companies and organizations collecting data by establishing new standards for “privacy accounting” practices.

Doing so will encourage greater responsibility on the part of data collectors and provide regulators with more meaningful tools for oversight. Some examples include:

  1. Companies collecting data should be required to identify outside contractors they hire to perform data-related services. Currently in the bill, companies are liable for their contractors when it comes to privacy and security issues. However, we need a stronger carrot to encourage companies to keep closer track of who has access to sensitive data and for what purposes. A requirement to publicly account for that information is the best way to encourage more disciplined internal accounting practices.
  2. Data collectors should publicly and specifically state what data they are collecting in plain English. Most privacy policies today are far too vague and high-level because companies don’t want to be limited by their own policies.

For example, the following is taken from the Google Toolbar Privacy Policy:

“Toolbar’s enhanced features, such as PageRank and Sidewiki, operate by sending Google the addresses and other information about sites at the time you visit them.” (Italics mine.)

This raises the question: what exactly is covered by “other information?” How long I remain on a page? Whether I scroll down to the bottom of the page? What personalized content shows up? What comments I leave? The passwords I type in? These are all reasonable examples of the level of specificity at which Google could be more transparent about what data they collect. None of these items are too technical for the general user to understand, and at this granularity, I don’t believe such a list would be terribly onerous to keep up to date. We should be able to find a workable middle ground that gives users of online services a more specific idea of what data is being collected about them without overwhelming them with too much technical detail.

Legislators Need to Establish Meaningful Standards for Anonymization

After describing the spirit of the regulations, the bill assigns certain tasks that are either too detailed or too dynamic to “rulemaking proceedings.” One such task is defining the requirements for providing adequate data security. I would like to add an additional, critical task to the responsibilities of those proceedings:

They must define what it means to “render not personally identifiable” (Sec 202a5A) or “anonymise” (Sec 701-4) data.

Without a clear legal standard for anonymization, the public will continue to be misled into believing that “anonymous” means their data is no longer linkable to their identity. In fact, there can only ever be degrees of anonymity, because complete anonymity does not exist. This is a problem we have been struggling with as well.

Our best guess at a good way to approach a legal definition would be to build a framework around acceptable levels of risk, and to require companies and organizations collecting data to quantify the risk they incur when they share data. Quantifying that risk is actually possible with something like differential privacy.
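To make that idea concrete, here is a minimal sketch of the classic Laplace mechanism from the differential privacy literature (the function names are my own, chosen for illustration). A counting query is answered with calibrated random noise, and the parameter epsilon is exactly the kind of quantified, auditable “privacy risk” number a data holder could be required to report:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from a Laplace(0, scale) distribution via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    """Answer "how many records satisfy predicate?" with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

A smaller epsilon means stronger privacy and noisier answers; the total epsilon “spent” across all queries is the cumulative risk incurred, which is what makes this framework well suited to a legal standard built around acceptable levels of risk.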

Legislators Should Push for Public Access

Entities that collect data from the public should be required to make it publicly available, through something like our proposal for the datatrust.

Businesses of all sorts have, with the advent of technology, become data businesses. They live and die by the data that they come by, though little of it was given to them for the purposes it is now used for. That doesn’t mean we should delete the data, or stop them from gathering it – that data is enormously valuable.

It does mean that the public needs a datastore to compete with the massive private sector data warehouses. The competitive edge that large datasets provide the entities that have them is gigantic, and no amount of notice and security can address that imbalance with the paucity of granular data available in the public realm.

Now for a more detailed look at the bill.

Key Points of the Bill

  1. The bill is about protecting Personally Identifiable Information (PII), which it correctly disambiguates to mean both the unique identifying information itself AND any information that is linked to that identifier.
  2. Though much of the related discussion in the media talks about the bill in terms of its impact on tracking individuals on the internet, the bill applies to all commercial entities, online or off.
  3. “Entities” must give notice to users about collecting or using PII – this isn’t particularly shocking, but what may be more complicated will be what constitutes “notice”.
  4. Opt-out for individuals is required for use of information that would otherwise be considered an unauthorized use. (This is a nice thought, but the list of exceptions to the unauthorized use definition seems to be very comprehensive – if anyone has a good example of use that would “otherwise be unauthorized” and is thus addressed by this point, I would be interested to hear it.)
  5. Opt-out for individuals is also required for the use of an individual’s covered information by a third-party for behavioral advertising or marketing. (I guess this means that a news site would need to provide an opt-out for users that prevents ad-networks from setting cookies, for example?)
  6. Opt-in for individuals is required for the use or transfer of sensitive PII (a special category of PII that could cause the individual physical or economic harm, in particular medical information or religious affiliations) for uses other than handling a transaction (does serving an ad count as a transaction? – this is not defined), fighting fraud or preventative security. Opt-in is also required if there is a material change to the previously consented uses and that use creates a risk of economic or physical harm.
  7. Entities need to be accountable for providing adequate security/protection for the PII that they store.
  8. Entities can use the PII that they collect for an enumerated list of purposes, but from my reading, just about any purpose related to their business.
  9. Entities can’t transfer this data to other entities without explicit user consent. Entities may not combine de-identified data with other data “in order to” re-identify it. (It’s unclear what happens if they combine it without the intent of re-identification, but with the same effect.)
  10. Entities are liable for the actions of the vendors they contract PII work to.
  11. Individuals must be able to access and update the information entities have about them. (The process of authenticating individuals to ensure they are updating their own information will be a hard nut to crack, and ironically may potentially require additional information be collected about them to do so.)

It’s hard to disagree with the direction of the above points – all are ideas that seem to be doing the right thing for user privacy. However, there are some hidden issues, some of which may be my misunderstanding, but some of which definitely require clarifying the goal of the bill.


1. Practical Enforcement – While the bill specifies fines and indicates that various rulemaking groups will be created to flesh out the practical implications of the bill, it’s not clear how the new law will actually change the status quo when it comes to enforcement of privacy rules. With no filing and accounting requirements to demonstrate compliance, outside of blatant violations such as completely failing to notify end users of the use of their PII, the FTC will have no way of “being alerted” when data collectors break the rules. Instead, it will be operating blindly, wholly dependent on whistleblowers for any view into the reality of day-to-day data collection practices.

2. Meaningful Notice and Consent – While the bill lays out specific scenarios where “proper notice” and “explicit [individual] consent” will be required, there is no further explication of what “proper notice” and “explicit consent” should consist of.

Today, “proper notice” for online services consists of providing a lengthy legal document that is almost never read, and even more rarely fully understood, by individuals. In the same vein, “explicit consent” is when those same individuals “agree” to the terms laid out in the lengthy document they didn’t read.

We need guidelines that provide formatting and placement requirements for notice and consent, much the way the FDA actually designed “Nutrition Facts” labels for food packaging.

3. Regulating Ad Networks – In the bill’s attempt to distinguish between third-parties (requires separate notice) and business partners (does not require separate notice), it remains unclear which category ad networks belong to.

Ads served up directly by the New York Times should probably be considered an integral part of the NYT site.

However, should Google AdWords be handled in the same way? Or are they really third party advertisers that should be required to provide users with separate notice before they can set and retrieve cookies?

More disturbingly, the bill seems to imply that online services gain an all-inclusive free pass to track you wherever you go on the web as soon as you “establish a business relationship” with them, what the EFF is calling the “Facebook loophole.” This means that by signing up for a Gmail account, you are also agreeing to let Google AdWords track what you read on blogs and what you buy online.

This is, of course, how privacy agreements work today. But the ostensible goal of this bill is to close such loopholes.

A Step In The Right Direction

The Kerry-McCain Privacy Bill is undeniable evidence of significant progress in public awareness of privacy issues. However, in the final analysis, the bill in its current form is unlikely to practically change how businesses collect, use and manage sensitive personal data.

Should Pharma have access to doctors’ prescription records?

Tuesday, April 26th, 2011

Maine, New Hampshire and Vermont want to pass laws to prevent pharmacies from selling prescription data to drug companies, who in turn use it for “targeted marketing to doctors” or “tailoring their products to better meet the needs of health practitioners” (depending on who you talk to).

This gets at the heart of the issue of imbalance between private and public sectors when it comes to access to sensitive information.

From our perspective, it doesn’t seem like a good idea to limit data usage. If the drug companies are smart, they’re also using the same data to figure out things like what drugs are being prescribed in combination and how that affects the effectiveness of their products.

Instead, we should be thinking of ways to expand access so that for every drug company buying data for marketing and product development, there is an active community of researchers, public advocates and policymakers who have low-cost or free access to the same data.

Comments on Richard Thaler “Show Us the Data. (It’s Ours, After All.)” NYT 4/23/11

Tuesday, April 26th, 2011

Richard Thaler, a professor at the University of Chicago, wrote a piece in the New York Times this weekend with an idea that is dear to CDP’s mission: making data available to the individuals it was collected from.

Particularly because the title of the piece suggests that he is saying exactly what we are saying, I wanted to write a few quick comments to clarify how it is different.

1. It’s great that he’s saying loudly and clearly that the payback for data collection should be the data itself – that’s definitely a key point we’re trying to make with CDP, and not enough people realize how valuable that data is to individuals, and more generally, to the public.

2. However, what Professor Thaler is pushing for is more along the lines of “data portability,” an idea we agree with at an ethical and moral level, but one that has some real practical limitations when we start talking about implementation. In my experience, data structures change so rapidly that companies are unable to keep up with how their own data is evolving month-to-month. I find it hard to imagine that entire industries could coordinate a standard that could hold together for very long without undermining the very qualities that make data-driven services powerful and innovative.

3. I’m also not sure why Professor Thaler says that the Kerry-McCain Commercial Privacy Bill of Rights Act of 2011 doesn’t cover this issue. My reading of the bill is that it’s covered in the general sense of access to your information – Section 202(4) reads:

to provide any individual to whom the personally identifiable information that is covered information [covered information is essentially anything that is tied to your identity] pertains, and which the covered entity or its service provider stores, appropriate and reasonable-

(A) access to such information; and

(B) mechanisms to correct such information to improve the accuracy of such information;

Perhaps what he is simply pointing out is the lack of any mention about instituting data standards to enable portability versus simply instituting standards around data transparency.

I have a long post about the bill that is not quite ready to put out there, and it does have a lot of issues, but I didn’t think that was one of them.


Response to: “A New Internet Privacy Law?” (New York Times – Opinion, March 18)

Wednesday, April 6th, 2011

There has been scant detailed coverage of the current discussions in Congress around an online privacy bill. The Wall Street Journal has published several pieces on it in their “What They Know” section but I’ve had a hard time finding anything that actually details the substance of the proposed legislation. There are mentions of Internet Explorer 9’s Tracking Protection Lists, and Firefox’s “Do Not Track” functionality, but little else.

Not surprisingly, we’re generally feeling like legislators are barking up the wrong tree by pushing to limit rather than expand legitimate uses of data in hard-to-enforce ways (e.g. “Do Not Track,” data deletion) without actually providing standards and guidance where government regulation could be truly useful and effective (e.g. providing a technical definition of “anonymous” for the industry and standardizing “privacy risk” accounting methods).

Last but not least, we’re dismayed that no one seems to be worried about the lack of public access to all this data.

We sent the following letter to the editor of the New York Times on March 23, 2011, in response to the first appearance of the issue in their pages: an opinion piece titled “A New Internet Privacy Law,” published on March 18, 2011.


While it is heartening to see Washington finally paying attention to online privacy, the new regulations appear to miss the point.

What’s needed is more data, more creative re-uses of data and more public access to data.

Instead, current proposals are headed in the direction of unenforceable regulations that hope to limit data collection and use.

So, what *should* regulators care about?

1. Much valuable data analysis can and should be done without identifying individuals. However, there is, as yet, no widely accepted technical definition of “anonymous.” As a result, data is bought, sold and shared with “third parties” with wildly varying degrees of privacy protection. Regulation can help standardize anonymization techniques, which would create a freer, safer market for data-sharing.

2. The data stockpiles being amassed in the private sector have enormous value to the public, yet we have little to no access to it. Lawmakers should explore ways to encourage or require companies to donate data to the public.

The future will be about making better decisions with data, and the public is losing out.

Alex Selkirk
The Common Data Project – Working towards a public trust of sensitive data


Whitepaper 2.0: A moral and practical argument for public access to private data.

Monday, April 4th, 2011

It’s here! The Common Data Project’s White Paper version 2.0.

This is our most comprehensive moral and practical argument to date for the creation of a public datatrust that provides public access to today’s growing store of sensitive personal information.

At this point, there can be no doubt that sensitive personal data, in aggregate, is and will continue to be an invaluable resource for commerce and society. However, today, the private sector holds a near monopoly on such data. We believe that it is time We, The People gain access to our own data; access that will enable researchers, policymakers and NGOs acting in the public interest to make decisions in the same data-informed ways businesses have for decades.

Access to sensitive personal information will be the next “Digital Divide” and our work is perhaps best described as an effort to bridge that gap.

Still, we recognize that there are many hurdles to overcome. Currently, highly valuable data, from online behavioral data to personal financial and medical records, is siloed and, in the name of privacy, inaccessible. Valuable data is kept out of the reach of the public, and in many cases is unavailable even to the businesses, organizations and government agencies that collect the data in the first place. Many of these data holders have business reasons or public mandates to share the data they have, but can’t, or do so only in a severely limited manner and through a time-consuming process.

We believe there are technological and policy solutions that can remedy this situation and our white paper attempts to sketch out these solutions in the form of a “datatrust.”

We set out to answer the major questions and open issues that challenge the viability of the datatrust idea.

  1. Is public access to sensitive personal information really necessary?
  2. If it is, why isn’t this already a solved problem?
  3. How can you open up sensitive data to the public without harming the individuals represented in that data?
  4. How can any organization be trusted to hold such sensitive data?
  5. Assuming this is possible and there is public will to pull it off, will such data be useful?
  6. All existing anonymization methodologies degrade the utility of data; how will the datatrust strike a balance between utility and privacy?
  7. How will the data be collated, managed and curated into a usable form?
  8. How will the quality of the data be evaluated and maintained?
  9. Who has a stake in the datatrust?
  10. The datatrust’s purported mission is to serve the interests of society; will you and I, as members of society, have a say in how the datatrust is run?

You can read the full paper here.

Comments, reactions and feedback are all welcome. You can post your thoughts here or write us directly at info at commondataproject dot org.

Yahoo or Google as a Datatrust? But will Facebook play?

Monday, May 4th, 2009

Time will tell, but it appears that Yahoo! has made it *really* easy (for application developers) to extract publicly available data from all over the interwebs and query it through Yahoo!’s servers.

YQL Execute allows you to build tables of data from other sources online, using JavaScript as a programming language, and run them on Yahoo!’s servers, so the infrastructure needs are very small.

Similarly, Google “just launched a new search feature that makes it easy (for you and me) to find and compare public data.”

Graph from Google Public Data

Image taken from the Google Blog.

This is pretty exciting, as both are huge leaps towards what we’ve envisioned as a “datatrust” in various blog posts and our white paper. Well, except for maybe the “trust” part. (Especially given our experiences with Yahoo here and here.)

A few more points to contemplate:

  1. Now that the Promised Land of collating all the world’s data approaches on the horizon, will that change people’s willingness to make data publicly accessible? What I share on my personal website might not be okay rearing its head in new contexts I never intended. As we’ve said elsewhere, when talking about privacy, context is everything.
  2. What about ownership? Both Yahoo! and Google may only temporarily cache the data insofar as is needed to serve it up. But, in effect, they will become the gatekeepers to all of our public data, data you and I contribute to. So the question remains, What about ownership?
  3. There’s still a lot of data that’s *not* publicly accessible. Possibly some of the most interesting and accurate data out there. How will we get at that? Case in point: Facebook just shut down a new app that allowed you to extract your personal “Facebook Newsfeed” and make it public via an RSS feed, citing, what else? Privacy concerns. (Not to mention the fact that access to Facebook data is generally hamstrung by privacy.)

Data’s endless possibilities

Friday, January 9th, 2009

The New York Times recently published a succinct but meaty article on New York City’s new electronic health record system.  Planned and promoted by the Bloomberg administration, the system includes about 1000 primary care physicians, focused primarily on three of the poorest neighborhoods, and the data they generate about their patients.  As I read it, I found myself counting all the different functions of the system.  I found at least ten:

•    Clean up outdated filing systems;
•    Enable a doctor to compare how one patient is doing compared with his or her other patients;
•    Enable a doctor to compare how one patient is doing compared to patients all over the city;
•    Enable the city’s public health department to monitor disease frequency and outbreaks, like the flu;
•    Enable the city to promote preventative measures, like cancer screening in new ways;
•    Create new financial incentives for doctors to improve their patients’ health, on measures like controlling blood pressure or cholesterol;
•    Provide reports cards to doctors comparing their results with other doctors’;
•    Improve care by less-experienced doctors with advice and information based on a patient’s age, sex, ethnic background, and medical history, including prompts to provide routine tests and vaccinations and warnings on how drugs can potentially interact;
•    Allow doctors to follow up more closely with patients, like reminding them of appointments through new calling and text-messaging systems and being notified if their patients do not fill prescriptions; and
•    Allow patients to access their own records, make appointments electronically, and monitor their own progress on health targets (should the doctor decide to do so).

Pretty amazing, isn’t it?

Data is like that.  Once you collect it, the possibilities are endless.  Reading about this one system for health records made me realize why it’s so hard for me to describe CDP’s goals in one sentence.  We’re not trying to do something singular, like “enable a doctor to compare patients’ data.”  We’re trying to create a place where this function, and innumerable other possibilities can exist, while also being mindful that “endless possibilities” include some scary ones that we need to guard against.

Making personal data more personal

Monday, December 29th, 2008


The New York State Department of Health recently launched a new online tool for researching the prevalence of certain medical conditions by zip code.  It has a terribly boring name—Prevention Quality Indicators in New York State—but what they’re providing is very exciting.

Prevention Quality Indicators or PQIs are a set of measures developed by a federal health agency.  They count the number of people admitted to hospitals for a specific list of twelve conditions, some of which include various complications from diabetes, hypertension, asthma, and urinary tract infections.  All of these are conditions in which good preventative care can help avoid hospitalization or the development of more severe conditions.  As the Department explains, “The PQIs can be used as a starting point for evaluating the overall quality of primary and preventive care in an area. They are sometimes characterized as ‘avoidable hospitalizations,’ but this does not mean that the hospitalizations were unnecessary or inappropriate at the time they occurred.”

It’s not the kind of data that would normally get your average New York resident excited.  Even though it’s personal information—it doesn’t get more personal than health—it’s unlikely to feel very personal to anyone.

That’s what makes numbers and data off-putting for so many people.  Even when the numbers include people like us, we don’t see ourselves in them, so it’s hard to feel like those numbers have anything to say to us personally.  At the same time, so many decisions are being made based on data, huge decisions that affect all of us.  It’s important for democracy that ordinary citizens have a stake in the data, that they not only have access to the data but that they also have an interest in reviewing the data themselves.

What’s interesting to me about this website, then, is its potential for making this obscure piece of government health data much more immediate and personal for ordinary citizens, and not just public health data geeks.  As soon as I heard about this website, the first thing I did was look up my zip code, “11205” in the county of Kings (Brooklyn).  I could then see racial disparities in the admission rate for these conditions in my neighborhood, and even see data on specific hospitals in my area.  Whenever there is a way to organize and access data in a way that is personal to the user, it’s immediately more compelling.

There’s no particular reason for me to wonder what asthma admission rates were in my zip code in 2006.  But I can imagine a mother of a child with asthma coming upon this site, wondering what asthma rates are in her zip code and the ones around it, and maybe seeing patterns that lead her to talk to other parents and elected officials.  And I can imagine other data sets of personal information being made truly relevant and personal in similar ways.

Woo-hoo, more data…from Amazon?

Tuesday, December 9th, 2008

Amazon announced recently that they would begin hosting huge databases of public information on their servers and charging users only for the cost of computing and storage for their own applications.  Although this information is already publicly available, Amazon’s service in hosting the data means scientists, other researchers, and businesses no longer have to create their own infrastructure to store and analyze this data.  It’s the data equivalent of a library—where people can do research without having to house and maintain their own collections.

This is an incredible service Amazon is providing, but it did make me wonder, do we need an Andrew Carnegie of public databases for our time?  Carnegie, of course, was not a saint, and he imposed terms on the towns that applied for his money, but ultimately, he created the public institution of the public library.  Although we now take the idea of a public library for granted, to the point that we’ve let many of them wither away without funding, we’ve come to believe wholeheartedly that public access to information is essential and right.  Even the great collections of private universities support this principle; as nonprofit institutions given tax-exempt status, they are governed by their missions to add knowledge to the world and have simple procedures to grant access to people who are not affiliated with the university.

Here, Amazon is providing public access, but as a private company rather than a public institution or nonprofit organization.  I’m not saying that nonprofits and government entities are morally superior to private companies, or that private companies are incapable of providing a public service.  I actually think that having private and public, for-profit and non-profit approaches to different issues is crucial for creating a truly vibrant marketplace of ideas.  But given the central and increasingly commanding role of data in our lives, it’s essential that we at least ask ourselves the question, “Are there functions, with regard to public access to data, that nonprofits and public institutions could fill better than private companies?”

We at the Common Data Project obviously believe there are good reasons to found a nonprofit organization to make data more public and accessible.  The number one reason, for me, is that the goal of public access to information may not always jibe neatly with the simpler, more straightforward goal of profit for a private company.

But what do you think?

The great story of good data

Wednesday, October 22nd, 2008

I love stories.

You might think, then, that I wouldn’t love data.  Stories and data are often seen as two very different ways of presenting information.  Data is considered cold, impersonal, incomplete.

But much of data’s bad reputation comes from limited data, not data in and of itself.  As Hans Rosling, a Swedish professor, demonstrates in this video, data can tell amazing stories.

It’s long but well-worth watching in its entirety.  It’s a few years old, from the 2006 TED conference, but I would bet it’s almost as riveting on YouTube as it was live at the conference.  In it, Rosling uses animated graphs of UN statistics from 1962 to 2003 to tell stories about our world and how it’s changed in ways that defy easy generalizations.

For example, his Swedish medical students studying global health assumed that there were two kinds of countries in the world—Western countries where family sizes are small and people live longer, and Third World countries where family sizes are large and people die young.  But as his animated graphs show, many countries that are still poor and developing have moved by 2003 into the upper left-hand quadrant, of countries with smaller families and longer life expectancies.  By 2003, Vietnam is in the same place the United States was in 1974.  As he declares, “If we don’t look at the data, we underestimate the tremendous change in Asia.”

In my favorite segment, like a great novelist building a complex character, Rosling breaks down one set of data over and over, showing the much more interesting and complex story behind average income and child survival.


The first graph, comparing GDP per capita among countries in the OECD, East Asia, South Asia, Africa, and Latin America, tells the story we all expect.  The blue dot in the upper right-hand quadrant is OECD countries; the small red dot on the bottom left-hand quadrant is Africa.


But then he shows how different countries within Africa have tremendous variations in GDP per capita, as well as child mortality, despite Western conceptions of a monolithic “Africa and its problems.”


And just when you’re patting yourself on the back for understanding that Africa includes a very diverse range of countries, he shows that even within the countries, the distribution of income is very broad.  The highest income quintile in South Africa is quite high, approaching the average per capita GDP in the United States.

As Rosling says, “Improvement of the world must be highly contextualized!”  And the data is what will allow us to do it.  His demonstration itself shows how data can be limiting, how it can be used to “prove” that all of Africa is poor and sick.  But the solution clearly isn’t to ignore the data but to look at more data. Ultimately, broad, detailed, longitudinal data push us to think harder, rather than rest on our assumptions. Stories still need to be told–how did Mauritius get wealthy and healthy?  Why didn’t Ghana? But without the data, we wouldn’t even know those stories were there.
