Common Data Project Enters Knight News Challenge

June 21st, 2012 by The Common Data Project


We have entered the Knight News Challenge from the Knight Foundation, which has the following goal:

The Knight News Challenge accelerates media innovation by funding breakthrough ideas in news and information. Winners receive a share of $5 million in funding and support from Knight’s network of influential peers and advisors to help advance their ideas.

The proposal process is all about brevity, so we had to encapsulate the next phase of datatrust development in a few short sentences.


Do companies like Acxiom keep you up at night?

June 20th, 2012 by Mimi Yin

IT knows who you are. It knows where you live. It knows what you do.

It peers deeper into American life than the F.B.I. or the I.R.S., or those prying digital eyes at Facebook and Google. If you are an American adult, the odds are that it knows things like your age, race, sex, weight, height, marital status, education level, politics, buying habits, household health worries, vacation dreams — and on and on.

Creepy? The author of this article certainly seems to be trying to make it sound creepy. What isn’t mentioned is that as an unregulated third-party data broker, Acxiom can cross-reference the data it buys from various sources to create a Frankenstein profile of each of us…the very kind of thing Google and Microsoft aren’t allowed to do.

Why is this interesting?

Crunching data to build up demographic and psychological profiles of people (as consumers) is probably inevitable. (A pretty safe bet given that it’s already happening.) And we believe that used in the right way, the ability to create these “comprehensive” profiles could be a net positive for all of us.

What isn’t inevitable is the lack of regulation around transparency and disclosure. We do it with food. We could do it with advertising and marketing offers.

(Plus we know that people tend to do the right thing if they know they’re being watched. And fortunately, corporations are people too.)

This ad was brought to you by your recent purchase of anti-fungal cream.

This phone call from your credit card company was brought to you because based on your purchases, we think you’re more susceptible to feeling guilty about not paying your bills.

Doesn’t sound realistic, does it?

Maybe just a subtle, yet ubiquitous reminder that nothing is mere serendipity in the world of commerce would work better:

*Based on your profile.

(People still smoke, but no one can pretend ignorance of the health risks.)

It’s too early to know what companies should or shouldn’t be allowed to do with data, but what is clear is that we should at least be aware that they’re doing it! (Whatever it is they’re doing.)

Will the “Closing of the Open Web” also close the door on the possibility of individual agency over personal data?

March 10th, 2012 by Alex Selkirk

Recently there was a spate of posts about the threat that dynamically generated, and therefore closed, black-box web services like Facebook, and even more hermetically sealed proprietary content delivery platforms like iOS and Android, pose to an open web.

I won’t rehash the whole explanation here as it is very well explained there, but a thought experiment that is worth unraveling is to imagine what the first-world would look like today if we had gone straight from proprietary desktop applications to proprietary mobile device applications, skipping over the age of the browser altogether.

Even if you are not well-versed in the nitty-gritty of HTTP (Hypertext Transfer Protocol, as in how data gets passed around the internet), you might still have an intuitive sense that many things we take for granted today would not exist.

Google for one.
Shopping comparison sites for another.
Facebook would be a completely different beast.
Twitter wouldn’t exist.

All of the above rely on an “open web” speaking a “standard language” that anyone (meaning any piece of software that also speaks the “standard language”) can access and understand.


Sharing content via “links” wouldn’t really make sense if the content were locked inside of Apps that each individual had to pay for.

In general there would probably be a lot less content overall as content providers would have had to make the same hard choice desktop software makers have had to make for decades: Which platforms do I build for?

How might the world be different today if open web standards had never taken hold?

Free content might not be the norm.

For example, publishing might not be in its death throes, as content providers would have been able to lock down their content by delivering it in apps users paid for.

Without Google and the ability to follow users across property boundaries on the web, targeted advertising might not be the de rigueur business model for internet startups.

One scenario we haven’t seen spelled out is that the end of the open web also has unfortunate consequences for the idea that users might one day gain control over their own data.

Today, we are for the most part at the mercy of the websites, services and apps we use when it comes to what data is collected about us and how it’s used. Each web domain and service has its own terms of use (Apple recently presented me with a 42-page doozy for iTunes) and gets to decide what to collect, how to make use of it and who to share it with (usually with Google via Google AdWords or Google Analytics and Facebook via Facebook Like Button or Facebook Connect).

However today, in the browser, theoretical meta-services could be built that allow users to collect all the same data web sites collect and decide what they want to do with it. This possibility exists in the form of browser add-ons that have the same access to user behavior that web sites do, but with the significant difference that the browser add-on can track users wherever the user decides they would like to collect data about themselves.

By contrast, each web site can only track users within the confines of its domain, or if you’re a big player like Google and Facebook, you can track users wherever other websites have agreed to put Google AdWords, make use of Google Analytics or have installed Facebook Like buttons.

This theoretical possibility, however, does not exist on any of the mobile platforms. There is no way to plug in to iOS or Android as a “meta-layer” across all apps to collect user behavioral data.

Given that these meta-layer user-centered data collection services are largely theoretical, it’s understandable that no one is losing sleep over them. However from our perspective, the decline of access points to collect data on the web presents a very real loss for advancing individual user control over data.

In the end, we still believe that the best way for individuals to regain footing in our privacy/data-collection tug-of-war with online services is for users to engage with data rather than run away from it.

Open Graph, Silk, etc: Let’s stop calling it a privacy problem

October 4th, 2011 by Alex Selkirk

The recent announcements of Facebook’s Open Graph update and Amazon Silk have provoked the usual media reaction about privacy. Maybe it’s time to give up on trying to fight data collection and data use issues with privacy arguments.

Briefly, the new features: Facebook is creating more ways for you to passively track your own activity and share it with others. Amazon, in the name of speedier browsing (on their new Kindle device), has launched a service that will capture all of your online browsing activity tied to your identity, and use it to do what sounds like collaborative filtering to predict your browsing patterns and speed them up.

Who cares if they know I'm a dog? (SF Weekly)

Amazon likens what Silk is doing to the role of an Internet Service Provider, which seems reasonable, but since regulators are getting wary of how ISPs leverage the data that passes through them, Amazon may not always enjoy that association.

EPIC (Electronic Privacy Information Center) has sent a letter to the FTC requesting an investigation of Facebook’s Open Graph changes and the new Timeline.

I’m not optimistic about the response. Depending on how the default privacy settings are configured, Open Graph may fall victim to another “Facebook ruined my diamond ring surprise by advertising it on my behalf” kerfuffle, which will result in a half-hearted apology from Zuckerberg and some shuffling around of checkboxes and radio buttons. The watchdogs aren’t as used to keeping tabs on Amazon, which has done a better job of meeting expectations around its use of customer data, so Silk may provoke a bit more soul-searching.

But I doubt it. In an excerpt from his book “Nothing to Hide: The False Tradeoff Between Privacy and Security” published in the Chronicle of Higher Education earlier this year, Daniel J. Solove does a great job of explaining why we have trouble protecting individual privacy at the cost of [national] security. In the course of his argument he makes two points which are useful in thinking about protecting privacy on the internet.

He quotes South Carolina law professor Ann Bartow as saying,

There are not enough privacy “dead bodies” for privacy to be weighed against other harms.

There’s plenty of media chatter monitoring the decay of personal privacy online, but the conversations have been largely theoretical, the stuff of political and social theory. We have yet to have an event that crystallizes the conversation into a debate of moral rights and wrongs.

Whatevers, See No Evil, and the OMG!’s

At one end of the “privacy theory” debate, there are the Whatevers, whose blasé battle cry of “No one cares about privacy any more,” is bizarrely intended to be reassuring. At the other end are the OMG!’s, who only speak of data collection and online privacy in terms of degrees of personal violation, which equally bizarrely has the effect of inducing public equanimity in the face of “fresh violations.”

However, as per usual, the majority of people exist in the middle where so long as they “See no evil and Hear no evil,” privacy is a tab in the Settings dialog, not a civil liberties issue. Believe it or not this attitude hampers both companies trying to get more information out of their users AND civil liberties advocates who desperately want the public to “wake up” to what’s happening. Recently, privacy lost to free speech – but more on that in a minute.

When you look into most of the privacy concerns that are raised about legitimate web sites and software, (not viruses, phishing or other malicious efforts) they usually have to do with fairly mundane personal information. Your name or address being disclosed inadvertently. Embarrassing photos. Terms you search for. The web sites you visit. Public records digitized and put on the web.

The most legally harmful examples involve identity theft, which while not unrelated to internet privacy, falls squarely in the well-understood territory of criminal activity. What’s less clear is what’s wrong with “legitimate actors” such as Google and Facebook and what they’re doing with our data.

Which brings us to a second point from Solove:

“Legal and policy solutions focus too much on the problems under the Orwellian metaphor—those of surveillance—and aren’t adequately addressing the Kafkaesque problems—those of information processing.”

In other words, who cares if the servers at Google “know” what I’m up to? We can’t as yet really even understand what it means for a computer to “know” something about human activity. Instead, the real question is what is Google (the company, comprised of human beings) deciding to do with this data?

What are People deciding to do with data?

By and large, the data collection that happens on the internet today is feeding into one flavor or another of “targeted advertising.” Loosely, that means showing you advertisements that are intended for an individual with some of your traits, based on information that has been collected about you. A male. A parent. A music lover. The changes to Facebook’s Open Graph will create a targeting field day. Which, on some level is a perfectly reasonable and predictable extension of age-old advertising and marketing practices.

In theory, advertising provides social value in bridging information gaps about useful, valuable products; data-driven services like Facebook, Google and Amazon are simply providing the technical muscle to close that gap.

However, Open Graph, Silk and other data-rich services place us at the top of a very long and shallow slide down to a much darker side of information processing, one that is not about the processing itself but about manipulation and the balance of power. And it’s the very length and gentle slope of that slide that make it almost impossible for us to talk about what’s really going wrong, and make it even somewhat pleasant to ride down. (Yes, I’m making a slippery slide argument.)

At the top of the slide are issues of values and dehumanization.

Recently employers have been making use of credit checks to screen potential candidates, automatically rejecting applicants with low credit scores. Perhaps this is an ingenious, if crude, way to quickly filter down a flood of job applicants. While its utility remains to be proven, it’s with good reason that we pause to consider the unintended consequences of such a policy. In many areas, we have often chosen to supplement “objective,” statistical evaluations with more humanist, subjective techniques (the college application process being one notable example). We are also a society that likes to believe in second chances.

A bit further down the slide, there are questions of fairness.

Credit card companies have been using purchase histories as a way to decide who to push to pay their debt in full and who to strike a deal with. In other words, they’re figuring out who will be susceptible to “being guilted” and who’s just going to give them the finger when they call. This is a truly ingenious and effective way to lower the cost and increase the effectiveness of debt collection efforts. But is it fair to debtors that some people “get a deal” and others don’t? Surely, such inequalities have always existed. At the very least, it’s problematic that such practices are happening behind closed doors with little to no public oversight, all in the name of protecting individual privacy.

Finally, there are issues of manipulation where information about you is used to get you to do things you don’t actually want to do.

The fast food industry has been micro-engineering the taste, smell and texture of their food products to induce a very real food addiction in the human brain. Surely, this is where online behavioral data-mining is headed, amplified by the power to deliver custom-tailored experiences to individuals.

But it’s just the Same-Old, Same-Old

This last scenario sounds bad, but isn’t this simply more of the same old advertising techniques we love to hate? Is there a bright line test we can apply so we know when we’ve “crossed the line” over into manipulation and lies?

Drawing Lines

Clearly the ethics of data use and manipulation in advertising is something we have been struggling with for a long time and something we will continue to struggle with, probably forever. However, some lines have been drawn, even if they’re not very clear.

While the original defining study on subliminal advertising has since been invalidated, when it was first publicized, the idea of messages being delivered subliminally into people’s minds was broadly condemned. In a world of imperfect definitions of “truth in advertising” it was immediately clear to the public that subliminal messaging (if it could be done) crossed the line into pure manipulation, and that was unacceptable. It was quickly banned by the UK, Australia and the American Networks and the National Association of Broadcasters.

Thought Experiment: If we were to impose a “code of ethics” on data practitioners, what would it look like?

Here’s a real-world, data-driven scenario:

  • Pharmacies sell customer information to drug companies so that they can identify doctors who will be most “receptive” to their marketing efforts.
  • Drug companies spend $1 billion a year advertising online to encourage individuals to “ask your doctor about [insert your favorite drug here]” with vague happy-people-in-sunshine imagery.
  • Drug companies employ 90,000 salespeople (in 2005) to visit the best target doctors and sway them to their brands.

Vermont passed a law outlawing the use of the pharmacy data without patient consent on the grounds of individual privacy. Then, this past June 23rd, the Supreme Court decided it was a free-speech problem and struck down the Vermont law.

Privacy as an argument for hemming in questionable data use will probably continue to fail.

The trouble again is that theoretical privacy harms are weak sauce in comparison to data as a way to “bridge information gaps.” If we shut down use of this data on the basis of privacy, that prevents the government from using the same data to prioritize distribution of vaccines to clinics in high-risk areas.

Ah, but here we’ve stumbled on the real problem…

Let’s shift the conversation from Privacy to Access

Innovative health care cost reduction schemes like care management are starved for data. Privacy concerns about broad, timely analysis of tax returns have prevented effective policy evaluation. Municipalities negotiating with corporations lack data to make difficult economic stimulus decisions. Meanwhile private companies are drowning in data that they are barely scratching the surface of.

At the risk of sounding like a broken record, since we have written volumes about this already:

  • The problem does not lie in the mere fact that data is collected, but in how it is secured and processed and in whose interest it is deployed.
  • Your activity on the internet, captured in increasingly granular detail is enormously valuable, and can be mined for a broad range of uses that as a society we may or may not approve of.
  • Privacy is an ineffective weapon to wield against the dark side of data use and instead, we should focus our efforts on (1) regulations that require companies to be more transparent about how they’re using data and (2) making personal data into a public resource that is in the hands of many.


Kerry-McCain Privacy Bill: What it got right, what’s still missing.

May 11th, 2011 by Alex Selkirk

At long last, we have a bill to talk about. Its official name is the “Commercial Privacy Bill of Rights Act of 2011” and it was introduced by Senators Kerry and McCain.

I was pleasantly surprised by how well many of the concepts and definitions were articulated, especially given some of the vague commentary that I had read before the bill was officially released.

Perhaps most importantly, the bill acknowledges that de-identification doesn’t work, even if it doesn’t make a lot of noise about it.

More generally though, there is a lot that is right about this bill, and it cannot be dismissed as an ill-conceived, knee-jerk reaction to the media hype around privacy issues.

For readers who are interested, I have outlined some of the key points from the bill that jumped out at me, as well as some questions and clarifications. Before getting to that however, I’d like to make three suggestions for additions to the bill.

Transparency, Clear Definitions and Public Access

Lawmakers should legislate more transparency into data collection; they should define what it means to render data “not personally identifiable;” and they should push for commercial data to be made available for public use.

Legislators should look for opportunities to require more transparency of companies and organizations collecting data by establishing new standards for “privacy accounting” practices.

Doing so will encourage greater responsibility on the part of data collectors and provide regulators with more meaningful tools for oversight. Some examples include:

  1. Companies collecting data should be required to identify outside contractors they hire to perform data-related services. Currently in the bill, companies are liable for their contractors when it comes to privacy and security issues. However, we need a more positive carrot to incent companies to keep closer track of who has access to sensitive data and for what purposes. A requirement to publicly account for that information is the best way to encourage more disciplined internal accounting practices.
  2. Data collectors should publicly and specifically state what data they are collecting in plain English. Most privacy policies today are far too vague and high-level because companies don’t want to be limited by their own policies.

For example, the following is taken from the Google Toolbar Privacy Policy:

“Toolbar’s enhanced features, such as PageRank and Sidewiki, operate by sending Google the addresses and other information about sites at the time you visit them.” (Italics mine.)

This raises the question, what exactly is covered by “other information?” How long I remain on a page? Whether I scroll down to the bottom of the page? What personalized content shows up? What comments I leave? The passwords I type in? These are all reasonable examples of the level of specificity at which Google could be more transparent about what data they collect. None of these items are too technical for the general user to understand and at this granularity, I don’t believe such a list would be terribly onerous to keep up to date. We should be able to find a workable middle-ground that gives users of online services a more specific idea of what data is being collected about them without overwhelming them with too much technical detail.

Legislators Need to Establish Meaningful Standards for Anonymization

After describing the spirit of the regulations, the bill assigns certain tasks that are either too detailed or too dynamic to “rulemaking proceedings.” One such task is defining the requirements for providing adequate data security. I would like to add an additional, critical task to the responsibilities of those proceedings:

They must define what it means to “render not personally identifiable” (Sec 202a5A) or “anonymise” (sec 701-4) data.

Without a clear legal standard for anonymization the public will continue to be misled into believing that anonymous means their data is no longer linkable to their identity when in fact there can only ever be degrees of anonymity because complete anonymity does not exist. This is a problem we have been struggling with as well.

Our best guess at a good way to approach a legal definition would be to build up a framework around acceptable levels of risk and require companies and organizations collecting data to quantify the amount of risk they incur when they share data, which is actually possible with something like differential privacy.

Legislators Should Push for Public Access

Entities that collect data from the public should be required to make it publicly available, through something like our proposal for the datatrust.

Businesses of all sorts have, with the advent of technology, become data businesses. They live and die by the data that they come by, though little of it was given to them for the purposes it is now used for. That doesn’t mean we should delete the data, or stop them from gathering it – that data is enormously valuable.

It does mean that the public needs a datastore to compete with the massive private sector data warehouses. The competitive edge that large datasets provide the entities that have them is gigantic, and no amount of notice and security can address that imbalance with the paucity of granular data available in the public realm.

Now for a more detailed look at the bill.

Key Points of the Bill

  1. The bill is about protecting Personally Identifiable Information (PII), which it correctly disambiguates to mean both the unique identifying information itself AND any information that is linked to that identifier.
  2. Though much of the related discussion in the media talks about the bill in terms of its impact on tracking individuals on the internet, the bill is about all commercial entities, online or off.
  3. “Entities” must give notice to users about collecting or using PII – this isn’t particularly shocking, but what may be more complicated will be what constitutes “notice”.
  4. Opt-out for individuals is required for use of information that would otherwise be considered an unauthorized use. (This is a nice thought, but the list of exceptions to the unauthorized use definition seems to be very comprehensive – if anyone has a good example of use that would “otherwise be unauthorized” and is thus addressed by this point, I would be interested to hear it.)
  5. Opt-out for individuals is also required for the use of an individual’s covered information by a third-party for behavioral advertising or marketing. (I guess this means that a news site would need to provide an opt-out for users that prevents ad-networks from setting cookies, for example?)
  6. Opt-in for individuals is required for the use or transfer of sensitive PII (a special category of PII that could cause the individual physical or economic harm, in particular medical information or religious affiliations) for uses other than handling a transaction (does serving an ad count as a transaction? – this is not defined), fighting fraud or preventative security. Opt-in is also required if there is a material change to the previously consented uses and that use creates a risk of economic or physical harm.
  7. Entities need to be accountable for providing adequate security/protection for the PII that they store.
  8. Entities can use the PII that they collect for an enumerated list of purposes, but from my reading, just about any purpose related to their business.
  9. Entities can’t transfer this data to other entities without explicit user consent. Entities may not combine de-identified data with other data “in order to” re-identify it. (It is unclear what happens if they combine it without the intent of re-identification, but with the same effect.)
  10. Entities are liable for the actions of the vendors they contract PII work to.
  11. Individuals must be able to access and update the information entities have about them. (The process of authenticating individuals to ensure they are updating their own information will be a hard nut to crack, and ironically may potentially require additional information be collected about them to do so.)

It’s hard to disagree with the direction of the above points – all are ideas that seem to be doing the right thing for user privacy. However, there are some hidden issues, some of which may be my misunderstanding, but some of which definitely require clarifying the goal of the bill.


1. Practical Enforcement – While the bill specifies fines and indicates that various rulemaking groups will be created to flesh out the practical implications of the bill, it’s not clear how the new law will actually change the status quo when it comes to enforcement of privacy rules. With no filing and accounting requirements for data collectors to demonstrate that they are actually complying, outside of blatant violations such as completely failing to give end users notice of the use of PII, the FTC will have no way of “being alerted” when data collectors break the rules. Instead, it will be operating blindly, wholly dependent on whistleblowers for any view into the reality of day-to-day data collection practices.

2. Meaningful Notice and Consent – While the bill lays out specific scenarios where “proper notice” and “explicit [individual] consent” will be required, there is no further explication of what “proper notice” and “explicit consent” should consist of.

Today, “proper notice” for online services consists of providing a lengthy legal document that is almost never read, and even more rarely fully understood by individuals. In the same vein, “Explicit consent” is when those same individuals “agree” to the terms laid out in the lengthy document they didn’t read.

We need guidelines that provide formatting and placement requirements for notice and consent, much the way the FDA actually designed “Nutrition Facts” labels for food packaging.

3. Regulating Ad Networks – In the bill’s attempt to distinguish between third-parties (requires separate notice) and business partners (does not require separate notice), it remains unclear which category ad networks belong to.

Ads served up directly by the New York Times on its own site should probably be considered an integral part of the NYT site.

However, should Google AdWords be handled in the same way? Or are they really third party advertisers that should be required to provide users with separate notice before they can set and retrieve cookies?

More disturbingly, the bill seems to imply that online services gain an all-inclusive free pass to track you wherever you go on the web as soon as you “establish a business relationship,” what EFF is calling the “Facebook loophole.” This means that by signing up for a Gmail account, you are also agreeing to Google AdWords tracking what you read on blogs and what you buy online.

This is, of course, how privacy agreements work today. But the ostensible goal of this bill is to close such loopholes.

A Step In The Right Direction

The Kerry-McCain Privacy Bill is undeniable evidence of significant progress in public awareness of privacy issues. However, in the final analysis, the bill in its current form is unlikely to practically change how businesses collect, use and manage sensitive personal data.

The CDP Private Map Maker v0.2

April 27th, 2011 by Tony Gibbon

We’ve released version 0.2 of the CDP Private Map Maker – A new way to release sensitive map data! (Requires Silverlight.)

Speedy, but is it safe?

Today, releasing sensitive data safely on a map is not a trivial task. The common anonymization methods tend to either be manual and time consuming, or create a very low resolution map.

Compared to current manual anonymization methods, which can take months if not years, our map maker leverages differential privacy to generate a map programmatically in much less time. For the sample datasets included, this process took a couple of minutes.

However, speed is not the map maker’s most important feature. Safety is, through the ability to quantify privacy risk.

Accounting for Privacy Risk, Literally and Figuratively

We’re still leveraging the same differential privacy principles we’ve been working with all along. Differential privacy not only allows us to (mostly) automate the process of generating the maps, it also allows us to quantitatively balance the accuracy of the map against the privacy risk incurred when releasing the data. (The purpose of this post is not to discuss whether differential privacy works. It’s an area of privacy research that has been around for several years and there are others better equipped to defend its capabilities.)

Think of it as a form of accounting. Rather than buying what appears to be cost-effective and hoping for the best, you can actually see the price of each item (privacy risk) AND know how accurate it will be.
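To make the accounting analogy concrete, here is a minimal sketch in Python. It is not the map maker’s actual code. It assumes basic sequential composition (the epsilon values of successive releases simply add up) and Laplace noise on counts with sensitivity 1, neither of which this post spells out. The names `PrivacyBudget` and `expected_count_error` are hypothetical.

```python
class PrivacyBudget:
    """Tracks cumulative privacy risk (epsilon) across data releases."""

    def __init__(self, total_epsilon: float) -> None:
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float) -> None:
        # Assumption: basic sequential composition, so each release is
        # "charged" to the budget by adding its epsilon to the total spent.
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise ValueError("release would exceed the privacy budget")
        self.spent += epsilon

    @property
    def remaining(self) -> float:
        return self.total_epsilon - self.spent


def expected_count_error(epsilon: float, sensitivity: float = 1.0) -> float:
    """Expected absolute error of a count released with Laplace noise.

    Laplace noise with scale b = sensitivity / epsilon has an expected
    absolute value of b, so a smaller epsilon (less risk) costs accuracy.
    """
    return sensitivity / epsilon


# Two hypothetical map releases charged against one budget.
budget = PrivacyBudget(total_epsilon=1.0)
for eps in (0.1, 0.5):
    budget.spend(eps)
    print(f"epsilon={eps}: expected error of about "
          f"{expected_count_error(eps):.1f} people per count")
print(f"remaining budget: {budget.remaining:.2f}")
```

The point of the sketch is only that both columns of the ledger are visible before release: the privacy “price” (epsilon) and the accuracy it buys.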

Previous implementations of differential privacy (including our own) have done this accounting in code. The new map maker provides a graphical user interface so you can play with the settings yourself.
More details on how this works below.

Compared to v0.1

Version 0.2 updates our first test-drive of differential privacy.  Our first iteration allowed you to query the number of people in an arbitrary region of the map, returning meaningful results about the area as a whole without exposing individuals in the dataset.

The flexibility that application provided as compared to pre-bucketed data is great if you have a specific question, but the workflow of looking at a blank map and choosing an area to query doesn’t align with how people often use maps and data.  We generally like to see the data at a high level, and then dig deeper as needed.

In this round, we’re aiming for a more intuitive user experience. Our two target users are:

  1. Data Releaser: The person releasing the data, who wants to make intelligent decisions about how to balance privacy risk and data utility.
  2. Data User: The person trying to make use of the data, who would like to have a general overview of a data set before delving in with more specific questions.

As a result, we’ve flipped our workflow on its head. Rather than providing a blank map for you to query, the map maker now immediately produces populated maps at different levels of accuracy and privacy risk.

We’ve also added the ability to upload your own datasets and choose your own privacy settings to see how the private map maker works.

However, please do not upload actually sensitive data to this demo.

v0.2 is for demonstration purposes only. Our hope is to create a forum where organizations with real data release scenarios can begin to engage with the differential privacy research community. If you’re interested in a more serious experiment with real data, please contact us.

Any data you do upload is available publicly to other users until it is deleted. (You can delete any uploaded dataset through the map maker interface.) The sample datasets provided cannot be deleted and were synthetically generated. The data is fake, so please do not use it for any purpose other than seeing how the map maker works.

You can play with the demo here. (Requires Silverlight.)

Finally, a subtle but significant change we should call out: our previous map demo leveraged an implementation of differential privacy called PINQ, developed at Microsoft Research. Creating the grids for this map maker required a different workflow, so we wrote our own implementation to add noise to the cell counts, using the same fundamentals of differential privacy.

More Details on How the Private Map Maker Works

How exactly do we generate the maps?

One Option – Nudge each data point a little

The key to differential privacy is adding random noise to each answer. It only returns aggregates, so we can’t ask it to ‘make a data point private’, but what if we added noise to each data point by moving it slightly? The person consuming the map then wouldn’t know exactly where the data point originated, making it private, right?

The problem with this process is that we can’t automate adding this random noise because external factors might cause the noise to be ineffective.  Consider the red data point below.

If we nudge it randomly, there’s a pretty good chance we’ll nudge it right into the water. Since there aren’t residences in the middle of Manhasset Bay, this could significantly narrow down the possibilities for the actual origin of the data point. (One of the more problematic scenarios is pictured above.) And water isn’t the only issue: if we’re dealing with residences, nudging into a strip mall, school, etc. could cause the same problem. Because of these external factors, the process is manual and time consuming. On top of that, unlike differential privacy, there’s no mathematical measure of how much information is being divulged, so you’re relying on the manual review to catch any privacy issues.

Another Option – Grids

As a compromise between querying a blank map, and the time consuming (and potentially error prone) process of nudging data points, we decided to generate grid squares based on noisy answers—the darker the grid square, the higher the answer.  The grid is generated simply by running one differential privacy-protected query for each square.  Here’s an example grid from a fake dataset:
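
As an illustration of the grid idea, here is a minimal Python sketch. It is not our actual implementation: the function names, parameters and sample points are hypothetical. It assumes the noise is drawn from a Laplace distribution, that each record falls in exactly one cell, and that adding or removing one record changes a cell count by at most 1.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_grid(points, min_x, min_y, cell_size, cols, rows, epsilon, rng=None):
    """Count the points in each grid cell, then add independent noise to every count.

    Assumption: each cell count has sensitivity 1, so the noise scale is 1 / epsilon.
    """
    rng = rng or random.Random()
    counts = [[0] * cols for _ in range(rows)]
    for x, y in points:
        col = math.floor((x - min_x) / cell_size)
        row = math.floor((y - min_y) / cell_size)
        if 0 <= col < cols and 0 <= row < rows:  # ignore points outside the grid
            counts[row][col] += 1
    scale = 1.0 / epsilon
    return [[count + laplace_noise(scale, rng) for count in row] for row in counts]


# Example: three made-up points on a 3 x 2 grid of unit cells
points = [(0.5, 0.5), (0.6, 0.4), (2.5, 1.5)]
grid = noisy_grid(points, 0.0, 0.0, 1.0, cols=3, rows=2, epsilon=1.0,
                  rng=random.Random(0))
```

Note that a noisy count can come out negative or fractional. A renderer would still map each value to a shade, darker for higher answers.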

“But Tony!” you say, “Weren’t you just telling us how much better arbitrary questions are as compared to the bucketing we often see?”  First, this isn’t meant to necessarily replace the ability to ask arbitrary questions, but instead provides another tool allowing you to see the data first.  And second, compared to the way released data is often currently pre-bucketed, we’re able to offer more granular grids.

Choosing a Map

Now comes the manual part. There are two variables you can adjust when choosing a map: grid size and margin of error. While this step is manual, most of the work is done for you, so it’s much less time-intensive than moving data points around. For demonstration purposes, we currently generate several options which you can select from in the gallery view. You could release any of the pre-generated maps, as they are all protected by differential privacy with the given +/-, but some are not useful and others may be wasting privacy currency.

Grid size is simply the area of each cell. Since a cell is the smallest area you can compare (with either another cell or 0), you must set it to accommodate the minimum resolution required for your analysis. For example, allocating resources at the borough level and allocating them at the block level require different resolutions to be effective. You also have to consider the density of the dataset. If your analysis is at the block level, but the dataset is so sparse that there’s only about one point per block, the noise will protect those individuals, and the map will be uniformly noisy.

Margin of error specifies a range that the noisy answer will likely fall within. The higher the margin of error, the less the noisy answer tells us about specific data points within the cell. A cell with answer 20 +/- 3 means the real answer is likely between 17 and 23, while an answer of 20 +/- 50 means the real answer is likely between -30 and 70, and thus it’s reasonably likely that there are no data points within that cell at all.
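
The arithmetic above, and one possible way to relate a margin of error to a privacy setting, can be sketched in Python. This is an illustration under stated assumptions, not the map maker’s actual formula: it assumes Laplace noise, a count sensitivity of 1, and that “likely” means a 95% confidence level, none of which is specified above. The function names are hypothetical.

```python
import math


def likely_range(noisy_answer, margin):
    """Return the interval the real answer likely falls in."""
    return (noisy_answer - margin, noisy_answer + margin)


def epsilon_for_margin(margin, confidence=0.95, sensitivity=1.0):
    """Epsilon at which Laplace noise stays within +/- margin with the given confidence.

    For Laplace noise with scale b = sensitivity / epsilon,
    P(|noise| > t) = exp(-t / b), so t = b * ln(1 / (1 - confidence)).
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return sensitivity * math.log(1.0 / (1.0 - confidence)) / margin


tight = likely_range(20, 3)    # (17, 23)
loose = likely_range(20, 50)   # (-30, 70)
```

Under these assumptions, a smaller margin corresponds to a larger epsilon, that is, to more privacy currency spent on that map.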

To select a map, first pan and zoom the map to show the portion you’re interested in, and then click the target icon for a dataset.

Map Maker Target Button

When you click the target, a gallery with previews of the nine pre-generated options is displayed.

As an example, let’s imagine that I’m doing block level analysis, so I’m only interested in the third column:

This sample dataset has a fairly small amount of data, such that in the top cell (+/- 50) and to some extent the middle cell (+/- 9), the noise overwhelms the data. In this case, we would have to consider tuning down the privacy protection towards the +/- 3 cell, in order to have a useful map at that resolution. (For this demo, the noise level is hard-coded.) The other option is to sacrifice resolution (moving left in the gallery view), so that there are more data points in a given square and they won’t be drowned out by higher noise levels.
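
The resolution trade-off can be sketched with a few lines of Python. The counts below are made up, the helper names are hypothetical, and treating a cell as “usable” when its count exceeds the margin of error is only a rule of thumb chosen for this illustration. Note that the merging is done on the true counts before any noise is added; summing cells that are already noisy would add their noise together as well.

```python
def coarsen(counts, factor):
    """Merge factor x factor blocks of true cell counts into larger cells."""
    rows, cols = len(counts), len(counts[0])
    merged = []
    for r in range(0, rows, factor):
        merged_row = []
        for c in range(0, cols, factor):
            total = sum(
                counts[rr][cc]
                for rr in range(r, min(r + factor, rows))
                for cc in range(c, min(c + factor, cols))
            )
            merged_row.append(total)
        merged.append(merged_row)
    return merged


def usable_fraction(counts, margin):
    """Fraction of cells whose count is larger than the margin of error."""
    cells = [count for row in counts for count in row]
    return sum(1 for count in cells if count > margin) / len(cells)


# Made-up sparse dataset on a 4 x 4 grid
fine = [
    [1, 2, 0, 1],
    [2, 1, 1, 0],
    [0, 1, 3, 2],
    [1, 0, 2, 3],
]
coarse = coarsen(fine, 2)
```

With a margin of +/- 3, none of the fine cells stand out from the noise, while half of the merged cells do.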

Once you have selected a grid, you can pan and zoom the map to the desired scale. The legend is currently dynamic such that it will adjust as necessary to the magnitude of the data in your current view.

Should Pharma have access to doctors’ prescription records?

April 26th, 2011 by Mimi Yin

Maine, New Hampshire and Vermont want to pass laws to prevent pharmacies from selling prescription data to drug companies, who in turn use it for “targeted marketing to doctors” or “tailoring their products to better meet the needs of health practitioners” (depending on who you talk to).

This gets at the heart of the issue of imbalance between private and public sectors when it comes to access to sensitive information.

From our perspective, it doesn’t seem like a good idea to limit data usage. If the drug companies are smart, they’re also using the same data to figure out things like what drugs are being prescribed in combination and how that affects the effectiveness of their products.

Instead, we should be thinking of ways to expand access so that for every drug company buying data for marketing and product development, there is an active community of researchers, public advocates and policymakers who have low-cost or free access to the same data.

Comments on Richard Thaler “Show Us the Data. (It’s Ours, After All.)” NYT 4/23/11

April 26th, 2011 by Alex Selkirk

Richard Thaler, a professor at the University of Chicago, wrote a piece in the New York Times this weekend with an idea that is dear to CDP’s mission: making data available to the individuals it was collected from.

Particularly because the title of the piece suggests that he is saying exactly what we are saying, I wanted to write a few quick comments to clarify how his proposal differs from ours.

1. It’s great that he’s saying loudly and clearly that the payback for data collection should be the data itself – that’s definitely a key point we’re trying to make with CDP, and not enough people realize how valuable that data is to individuals, and more generally, to the public.

2. However, what Professor Thaler is pushing for is more along the lines of “data portability,” an idea that we agree with at an ethical and moral level but that has some real practical limitations when we start talking about implementation. In my experience, data structures change so rapidly that companies are unable to keep up with how their data is evolving month-to-month. I find it hard to imagine that entire industries could coordinate a standard that could hold together for very long without undermining the very qualities that make data-driven services powerful and innovative.

3. I’m also not sure why Professor Thaler says that the Kerry-McCain Commercial Privacy Bill of Rights Act of 2011 doesn’t cover this issue. My reading of the bill is that it’s covered in the general sense of access to your information – Section 202(4) reads:

to provide any individual to whom the personally identifiable information that is covered information [covered information is essentially anything that is tied to your identity] pertains, and which the covered entity or its service provider stores, appropriate and reasonable-

(A) access to such information; and

(B) mechanisms to correct such information to improve the accuracy of such information;

Perhaps he is simply pointing out the lack of any mention of instituting data standards to enable portability, as opposed to simply instituting standards around data transparency.

I have a long post about the bill that is not quite ready to put out there, and it does have a lot of issues, but I didn’t think that was one of them.


Response to: “A New Internet Privacy Law?” (New York Times – Opinion, March 18)

April 6th, 2011 by Alex Selkirk

There has been scant detailed coverage of the current discussions in Congress around an online privacy bill. The Wall Street Journal has published several pieces on it in their “What They Know” section but I’ve had a hard time finding anything that actually details the substance of the proposed legislation. There are mentions of Internet Explorer 9’s Tracking Protection Lists, and Firefox’s “Do Not Track” functionality, but little else.

Not surprisingly, we’re generally feeling like legislators are barking up the wrong tree by pushing to limit rather than expand legitimate uses of data in hard-to-enforce ways (e.g. “Do Not Track,” data deletion) without actually providing standards and guidance where government regulation could be truly useful and effective (e.g. providing a technical definition of “anonymous” for the industry and standardizing “privacy risk” accounting methods).

Last but not least, we’re dismayed that no one seems to be worried about the lack of public access to all this data.

We sent the following letter to the editor of the New York Times on March 23, 2011, in response to the first appearance of the issue in their pages: an opinion piece titled “A New Internet Privacy Law?” published on March 18, 2011.


While it is heartening to see Washington finally paying attention to online privacy, the new regulations appear to miss the point.

What’s needed is more data, more creative re-uses of data and more public access to data.

Instead, current proposals are headed in the direction of unenforceable regulations that hope to limit data collection and use.

So, what *should* regulators care about?

1. Much valuable data analysis can and should be done without identifying individuals. However, there is, as yet, no widely accepted technical definition of “anonymous.” As a result, data is bought, sold and shared with “third-parties” with wildly varying degrees of privacy protection. Regulation can help standardize anonymization techniques, which would create a freer, safer market for data-sharing.

2. The data stockpiles being amassed in the private sector have enormous value to the public, yet we have little to no access to them. Lawmakers should explore ways to encourage or require companies to donate data to the public.

The future will be about making better decisions with data, and the public is losing out.

Alex Selkirk
The Common Data Project – Working towards a public trust of sensitive data


Whitepaper 2.0: A moral and practical argument for public access to private data.

April 4th, 2011 by The Common Data Project

It’s here! The Common Data Project’s White Paper version 2.0.

This is our most comprehensive moral and practical argument to date for the creation of a public datatrust that provides public access to today’s growing store of sensitive personal information.

At this point, there can be no doubt that sensitive personal data, in aggregate, is and will continue to be an invaluable resource for commerce and society. However, today, the private sector holds a near monopoly on such data. We believe that it is time We, The People gain access to our own data; access that will enable researchers, policymakers and NGOs acting in the public interest to make decisions in the same data-informed ways businesses have for decades.

Access to sensitive personal information will be the next “Digital Divide” and our work is perhaps best described as an effort to bridge that gap.

Still, we recognize that there are many hurdles to overcome. Currently, highly valuable data, from online behavioral data to personal financial and medical records, is silo-ed and, in the name of privacy, inaccessible. Valuable data is kept out of the reach of the public and in many cases unavailable even to the businesses, organizations and government agencies that collect the data in the first place. Many of these data holders have business reasons or public mandates to share the data they have, but either can’t or can do so only in a severely limited manner and through a time-consuming process.

We believe there are technological and policy solutions that can remedy this situation and our white paper attempts to sketch out these solutions in the form of a “datatrust.”

We set out to answer the major questions and open issues that challenge the viability of the datatrust idea.

  1. Is public access to sensitive personal information really necessary?
  2. If it is, why isn’t this already a solved problem?
  3. How can you open up sensitive data to the public without harming the individuals represented in that data?
  4. How can any organization be trusted to hold such sensitive data?
  5. Assuming this is possible and there is public will to pull it off, will such data be useful?
  6. All existing anonymization methodologies degrade the utility of data. How will the datatrust strike a balance between utility and privacy?
  7. How will the data be collated, managed and curated into a usable form?
  8. How will the quality of the data be evaluated and maintained?
  9. Who has a stake in the datatrust?
  10. The datatrust’s purported mission is to serve the interests of society. Will you and I, as members of society, have a say in how the datatrust is run?

You can read the full paper here.

Comments, reactions and feedback are all welcome. You can post your thoughts here or write us directly at info at commondataproject dot org.
