Archive for the ‘Protecting Privacy in Meaningful Ways’ Category

Do companies like Acxiom keep you up at night?

Wednesday, June 20th, 2012

IT knows who you are. It knows where you live. It knows what you do.

It peers deeper into American life than the F.B.I. or the I.R.S., or those prying digital eyes at Facebook and Google. If you are an American adult, the odds are that it knows things like your age, race, sex, weight, height, marital status, education level, politics, buying habits, household health worries, vacation dreams — and on and on.

Creepy? The author of this article certainly seems to be trying to make it sound creepy. What isn’t mentioned is that as an unregulated 3rd party data broker, Acxiom can cross-reference the data they buy from various sources to create a Frankenstein profile of each of us…the very kind of thing Google and Microsoft aren’t allowed to do.

Why is this interesting?

Crunching data to build up demographic and psychological profiles of people (as consumers) is probably inevitable. (A pretty safe bet given that it’s already happening.) And we believe that used in the right way, the ability to create these “comprehensive” profiles could be a net positive for all of us.

What isn’t inevitable is the lack of regulation around transparency and disclosure. We do it with food. We could do it with advertising and marketing offers.

(Plus we know that people tend to do the right thing if they know they’re being watched. And fortunately, corporations are people too.)

This ad was brought to you by your recent purchase of anti-fungal cream.

This phone call from your credit card company was brought to you because based on your purchases, we think you’re more susceptible to feeling guilty about not paying your bills.

Doesn’t sound realistic, does it?

Maybe just a subtle, yet ubiquitous reminder that nothing is mere serendipity in the world of commerce would work better:

*Based on your profile.

(People still smoke, but no one can pretend ignorance of the health risks.)

It’s too early to know what companies should or shouldn’t be allowed to do with data, but what is clear is that we should at least be aware that they’re doing it! (Whatever it is they’re doing.)

Kerry-McCain Privacy Bill: What it got right, what’s still missing.

Wednesday, May 11th, 2011

At long last, we have a bill to talk about. Its official name is the “Commercial Privacy Bill of Rights Act of 2011” and it was introduced by Senators Kerry and McCain.

I was pleasantly surprised by how well many of the concepts and definitions were articulated, especially given some of the vague commentary that I had read before the bill was officially released.

Perhaps most importantly, the bill acknowledges that de-identification doesn’t work, even if it doesn’t make a lot of noise about it.

More generally though, there is a lot that is right about this bill, and it cannot be dismissed as an ill-conceived, knee-jerk reaction to the media hype around privacy issues.

For readers who are interested, I have outlined some of the key points from the bill that jumped out at me, as well as some questions and clarifications. Before getting to that however, I’d like to make three suggestions for additions to the bill.

Transparency, Clear Definitions and Public Access

Lawmakers should legislate more transparency into data collection; they should define what it means to render data “not personally identifiable;” and they should push for commercial data to be made available for public use.

Legislators should look for opportunities to require more transparency of companies and organizations collecting data by establishing new standards for “privacy accounting” practices.

Doing so will encourage greater responsibility on the part of data collectors and provide regulators with more meaningful tools for oversight. Some examples include:

  1. Companies collecting data should be required to identify outside contractors they hire to perform data-related services. Currently in the bill, companies are liable for their contractors when it comes to privacy and security issues. However, we need a more positive carrot to incent companies to keep closer track of who has access to sensitive data and for what purposes. A requirement to publicly account for that information is the best way to encourage more disciplined internal accounting practices.
  2. Data collectors should publicly and specifically state what data they are collecting in plain English. Most privacy policies today are far too vague and high-level because companies don’t want to be limited by their own policies.

For example, the following is taken from the Google Toolbar Privacy Policy:

“Toolbar’s enhanced features, such as PageRank and Sidewiki, operate by sending Google the addresses and other information about sites at the time you visit them.” (Italics mine.)

This begs the question, what exactly is covered by “other information?” How long I remain on a page? Whether I scroll down to the bottom of the page? What personalized content shows up? What comments I leave? The passwords I type in? These are all reasonable examples of the level of specificity at which Google could be more transparent about what data they collect. None of these items are too technical for the general user to understand, and at this granularity, I don’t believe such a list would be terribly onerous to keep up to date. We should be able to find a workable middle-ground that gives users of online services a more specific idea of what data is being collected about them without overwhelming them with too much technical detail.

Legislators Need to Establish Meaningful Standards for Anonymization

After describing the spirit of the regulations, the bill assigns certain tasks that are either too detailed or too dynamic to “rulemaking proceedings.” One such task is defining the requirements for providing adequate data security. I would like to add an additional, critical task to the responsibilities of those proceedings:

They must define what it means to “render not personally identifiable” (Sec 202a5A) or “anonymise” (sec 701-4) data.

Without a clear legal standard for anonymization, the public will continue to be misled into believing that “anonymous” means their data is no longer linkable to their identity. In fact, there can only ever be degrees of anonymity, because complete anonymity does not exist. This is a problem we have been struggling with as well.

Our best guess at a good way to approach a legal definition would be to build up a framework around acceptable levels of risk and require companies and organizations collecting data to quantify the amount of risk they incur when they share data, which is actually possible with something like differential privacy.

Legislators Should Push for Public Access

Entities that collect data from the public should be required to make it publicly available, through something like our proposal for the datatrust.

Businesses of all sorts have, with the advent of technology, become data businesses. They live and die by the data that they come by, though little of it was given to them for the purposes it is now used for. That doesn’t mean we should delete the data, or stop them from gathering it – that data is enormously valuable.

It does mean that the public needs a datastore to compete with the massive private sector data warehouses. The competitive edge that large datasets provide the entities that have them is gigantic, and no amount of notice and security can address that imbalance with the paucity of granular data available in the public realm.

Now for a more detailed look at the bill.

Key Points of the Bill

  1. The bill is about protecting Personally Identifiable Information (PII), which it correctly disambiguates to mean both the unique identifying information itself AND any information that is linked to that identifier.
  2. Though much of the related discussion in the media talks about the bill in terms of its impact to tracking individuals on the internet, the bill is about all commercial entities, online or off.
  3. “Entities” must give notice to users about collecting or using PII – this isn’t particularly shocking, but what may be more complicated will be what constitutes “notice”.
  4. Opt-out for individuals is required for use of information that would otherwise be considered an unauthorized use. (This is a nice thought, but the list of exceptions to the unauthorized use definition seems to be very comprehensive – if anyone has a good example of use that would “otherwise be unauthorized” and is thus addressed by this point, I would be interested to hear it.)
  5. Opt-out for individuals is also required for the use of an individual’s covered information by a third-party for behavioral advertising or marketing. (I guess this means that a news site would need to provide an opt-out for users that prevents ad-networks from setting cookies, for example?)
  6. Opt-in for individuals is required for the use or transfer of sensitive PII (a special category of PII that could cause the individual physical or economic harm, in particular medical information or religious affiliations) for uses other than handling a transaction (does serving an ad count as a transaction? – this is not defined), fighting fraud or preventative security. Opt-in is also required if there is a material change to the previously consented uses and that use creates a risk of economic or physical harm.
  7. Entities need to be accountable for providing adequate security/protection for the PII that they store.
  8. Entities can use the PII that they collect for an enumerated list of purposes, but from my reading, just about any purpose related to their business.
  9. Entities can’t transfer this data to other entities without explicit user consent. Entities may not combine de-identified data with other data “in order to” re-identify it. (It is unclear what happens if they combine it without the intent to re-identify, but with the same effect.)
  10. Entities are liable for the actions of the vendors they contract PII work to.
  11. Individuals must be able to access and update the information entities have about them. (The process of authenticating individuals to ensure they are updating their own information will be a hard nut to crack, and ironically may potentially require additional information be collected about them to do so.)

It’s hard to disagree with the direction of the above points – all are ideas that seem to be doing the right thing for user privacy. However, there are some hidden issues, some of which may be my misunderstanding, but some of which definitely require clarifying the goal of the bill.


1. Practical Enforcement – While the bill specifies fines and indicates that various rule making groups will be created to flesh out the practical implications of the bill, it’s not clear how the new law will actually change the status quo when it comes to enforcement of privacy rules. The bill has no filing and accounting requirements that would make data collectors demonstrate that they are actually complying. Outside of blatant violations, such as completely failing to notify end users of the use of PII, the FTC will have no way of “being alerted” when data collectors break the rules. Instead, they will be operating blindly, wholly dependent on whistle blowers for any view into the reality of day-to-day data collection practices.

2. Meaningful Notice and Consent – While the bill lays out specific scenarios where “proper notice” and “explicit [individual] consent” will be required, there is no further explication of what “proper notice” and “explicit consent” should consist of.

Today, “proper notice” for online services consists of providing a lengthy legal document that is almost never read, and even more rarely fully understood by individuals. In the same vein, “Explicit consent” is when those same individuals “agree” to the terms laid out in the lengthy document they didn’t read.

We need guidelines that provide formatting and placement requirements for notice and consent, much the way the FDA actually designed “Nutrition Facts” labels for food packaging.

3. Regulating Ad Networks – In the bill’s attempt to distinguish between third-parties (requires separate notice) and business partners (does not require separate notice), it remains unclear which category ad networks belong to.

Ads served up directly by the New York Times should probably be considered an integral part of the NYT site.

However, should Google AdWords be handled in the same way? Or are they really third party advertisers that should be required to provide users with separate notice before they can set and retrieve cookies?

More disturbingly, the bill seems to imply that online services gain an all-inclusive free pass to track you wherever you go on the web as soon as you “establish a business relationship,” what EFF is calling the “Facebook loophole.” This means that by signing up for a gmail account, you are also agreeing to Google AdWords tracking what you read on blogs and what you buy online.

This is, of course, how privacy agreements work today. But the ostensible goal of this bill is to close such loopholes.

A Step In The Right Direction

The Kerry-McCain Privacy Bill is undeniable evidence of significant progress in public awareness of privacy issues. However, in the final analysis, the bill in its current form is unlikely to practically change how businesses collect, use and manage sensitive personal data.

The CDP Private Map Maker v0.2

Wednesday, April 27th, 2011

We’ve released version 0.2 of the CDP Private Map Maker – A new way to release sensitive map data! (Requires Silverlight.)

Speedy, but is it safe?

Today, releasing sensitive data safely on a map is not a trivial task. The common anonymization methods tend to either be manual and time consuming, or create a very low resolution map.

Compared to current manual anonymization methods, which can take months if not years, our map maker leverages differential privacy to generate a map programmatically in much less time. For the sample datasets included, this process took a couple of minutes.

However, speed is not the map maker’s most important feature. Safety is, through the ability to quantify privacy risk.

Accounting for Privacy Risk, Literally and Figuratively

We’re still leveraging the same differential privacy principles we’ve been working with all along. Differential privacy not only allows us to (mostly) automate the process of generating the maps, it also allows us to quantitatively balance the accuracy of the map against the privacy risk incurred when releasing the data.  (The purpose of the post is not to discuss whether differential privacy works–it’s an area of privacy research that has been around for several years and there are others better equipped to defend its capabilities.)

Think of it as a form of accounting. Rather than buying what appears to be cost-effective and hoping for the best, you can actually see the price of each item (privacy risk) AND know how accurate it will be.

Previous implementations of differential privacy (including our own) have done this accounting in code. The new map maker provides a graphical user interface so you can play with the settings yourself.
More details on how this works below.
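
To make the accounting metaphor concrete, here is a minimal sketch of a privacy “ledger.” It is not the map maker’s code. The class name `PrivacyBudget`, the total budget of 1.0 and the individual charges are all hypothetical. The sketch assumes the standard sequential-composition property of differential privacy, under which the privacy costs (epsilon values) of successive releases over the same data add up.

```python
class PrivacyBudget:
    """Tracks cumulative privacy cost (epsilon) across data releases.

    Hypothetical helper for illustration; assumes costs simply add up.
    """

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.spent = 0.0
        self.ledger = []  # (description, epsilon) pairs, in order of release

    @property
    def remaining(self):
        return self.total_epsilon - self.spent

    def charge(self, description, epsilon):
        """Record a release costing `epsilon`; refuse it if it would overspend."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:
            raise RuntimeError(
                f"'{description}' costs {epsilon}, but only {self.remaining} is left"
            )
        self.spent += epsilon
        self.ledger.append((description, epsilon))


# Made-up numbers: a low-accuracy map is cheap, a high-accuracy map is expensive.
budget = PrivacyBudget(total_epsilon=1.0)
budget.charge("coarse map", 0.1)
budget.charge("fine map", 0.5)
```

After these two releases, `budget.remaining` is about 0.4, and a further release priced at 0.5 would be refused rather than silently overspending.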

Compared to v0.1

Version 0.2 updates our first test-drive of differential privacy.  Our first iteration allowed you to query the number of people in an arbitrary region of the map, returning meaningful results about the area as a whole without exposing individuals in the dataset.

The flexibility that application provided as compared to pre-bucketed data is great if you have a specific question, but the workflow of looking at a blank map and choosing an area to query doesn’t align with how people often use maps and data.  We generally like to see the data at a high level, and then dig deeper as needed.

In this round, we’re aiming for a more intuitive user experience. Our two target users are:

  1. Data Releaser The person releasing the data who wants to make intelligent decisions about how to balance privacy risk and data utility.
  2. Data User The person trying to make use of the data, who would like to have a general overview of a data set before delving in with more specific questions.

As a result, we’ve flipped our workflow on its head. Rather than providing a blank map for you to query, the map maker now immediately produces populated maps at different levels of accuracy and privacy risk.

We’ve also added the ability to upload your own datasets and choose your own privacy settings to see how the private map maker works.

However, please do not upload actually sensitive data to this demo.

v0.2 is for demonstration purposes only. Our hope is to create a forum where organizations with real data release scenarios can begin to engage with the differential privacy research community. If you’re interested in a more serious experiment with real data, please contact us.

Any data you do upload is available publicly to other users until it is deleted. (You can delete any uploaded dataset through the map maker interface.) The sample data sets provided cannot be deleted, and were synthetically generated – please do not use the sample data for any purpose other than seeing how the map maker works – the data is fake.

You can play with the demo here. (Requires Silverlight.)

Finally, a subtle but significant change we should call out: our previous map demo leveraged an implementation of differential privacy called PINQ, developed at Microsoft Research.  Creating the grids for this map maker required a different workflow, so we wrote our own implementation to add noise to the cell counts, using the same fundamentals of differential privacy.

More Details on How the Private Map Maker Works

How exactly do we generate the maps? One option – Nudge each data point a little

The key to differential privacy is adding random noise to each answer.  It only returns aggregates so we can’t ask it to ‘make a data point private’, but what if we added noise to each data point by moving it slightly?  The person consuming the map then wouldn’t know exactly where the data point originated, making it private, right?

The problem with this process is that we can’t automate adding this random noise because external factors might cause the noise to be ineffective.  Consider the red data point below.

If we nudge it randomly, there’s a pretty good chance we’ll nudge it right into the water.  Since there aren’t residences in the middle of Manhasset Bay, this could significantly narrow down the possibilities for the actual origin of the data point.  (One of the more problematic scenarios is pictured above.)  And water isn’t the only issue—if we’re dealing with residences, nudging into a strip mall, school, etc. could cause the same problem.  Because of these external factors, the process is manual and time consuming.   On top of that, unlike differential privacy, there’s no mathematical measure about how much information is being divulged—you’re relying on the manual review to catch any privacy issues.

Another Option – Grids

As a compromise between querying a blank map, and the time consuming (and potentially error prone) process of nudging data points, we decided to generate grid squares based on noisy answers—the darker the grid square, the higher the answer.  The grid is generated simply by running one differential privacy-protected query for each square.  Here’s an example grid from a fake dataset:
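
As a rough illustration of “one noisy query per square,” the sketch below counts points per cell and adds Laplace noise to each count. It is not the map maker’s actual implementation. The function names, the synthetic points and the `epsilon` value are assumptions, and the Laplace mechanism is used here because it is the textbook way to protect a count query.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_grid(points, cell_size, rows, cols, epsilon, rng=None):
    """Count points per grid cell, then add Laplace noise to every cell.

    Each point falls in exactly one cell, so adding or removing one person
    changes a single count by 1 (sensitivity 1); the noise scale is 1/epsilon.
    """
    rng = rng or random.Random()
    counts = Counter()
    for x, y in points:
        col, row = int(x // cell_size), int(y // cell_size)
        if 0 <= row < rows and 0 <= col < cols:
            counts[(row, col)] += 1
    # Every cell gets noise, including empty ones. Leaving empty cells
    # untouched would reveal that nobody is there.
    return [
        [counts[(r, c)] + laplace_noise(1.0 / epsilon, rng) for c in range(cols)]
        for r in range(rows)
    ]


# Synthetic points for illustration only.
rng = random.Random(0)
points = [(rng.uniform(0, 4), rng.uniform(0, 4)) for _ in range(200)]
grid = noisy_grid(points, cell_size=1.0, rows=4, cols=4, epsilon=0.5, rng=rng)
```

Each entry of `grid` is a noisy count that could drive the shading of one square. Noisy counts can come out negative or fractional, which is expected rather than a bug.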

“But Tony!” you say, “Weren’t you just telling us how much better arbitrary questions are as compared to the bucketing we often see?”  First, this isn’t meant to necessarily replace the ability to ask arbitrary questions, but instead provides another tool allowing you to see the data first.  And second, compared to the way released data is often currently pre-bucketed, we’re able to offer more granular grids.

Choosing a Map

Now comes the manual part. There are two variables you can adjust when choosing a map: grid size and margin of error.  While this step is manual, most of the work is done for you, so it’s much less time-intensive than moving data points around. For demonstration purposes, we currently generate several options which you can select from in the gallery view. You could release any of the maps that are pre-generated as they are all protected by differential privacy with the given +/- –but some are not useful and others may be wasting privacy currency.

Grid size is simply the area of each cell.  Since a cell is the smallest area you can compare (with either another cell or 0), you must set it to accommodate the minimum resolution required for your analysis.  For example, using the map to allocate resources at the borough level requires a different resolution than doing so at the block level. You also have to consider the density of the dataset. If your analysis is at the block level, but the dataset is so sparse that there’s only about one point per block, the noise will protect those individuals, but the map will be uniformly noisy.

Margin of error specifies a range that the noisy answer will likely fall within.  The higher the margin of error, the less the noisy answer tells us about specific data points within the cell.  A cell with answer 20 +/- 3 means the real answer is likely between 17 and 23.  An answer of 20 +/- 50, by contrast, means the real answer is likely between -30 and 70, and thus it’s reasonably likely that there are no data points within that cell at all.
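
For readers who want to see how a margin of error could translate into a privacy cost, here is a hedged sketch. It assumes Laplace noise on a count query and reads “likely” as a 95% confidence level. Both are assumptions, as the demo’s actual settings are not stated here, and the function names are hypothetical. For Laplace noise with scale b, the chance that the noise stays within +/- m is 1 - exp(-m / b).

```python
import math


def laplace_scale_for_margin(margin, confidence=0.95):
    """Scale b such that Laplace(0, b) noise stays within +/- margin
    with probability `confidence`, from P(|noise| <= margin) = 1 - exp(-margin / b).
    """
    return margin / math.log(1.0 / (1.0 - confidence))


def epsilon_for_margin(margin, confidence=0.95, sensitivity=1.0):
    """Privacy cost (epsilon) of releasing a count with that margin of error."""
    return sensitivity / laplace_scale_for_margin(margin, confidence)


# The three margins used in the demo gallery.
for margin in (3, 9, 50):
    print(f"+/- {margin}: epsilon = {epsilon_for_margin(margin):.3f}")
```

Under these assumptions, a wider margin means a smaller epsilon, so the +/- 50 map costs far less privacy than the +/- 3 map.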

To select a map, first pan and zoom the map to show the portion you’re interested in, and then click the target icon for a dataset.

Map Maker Target Button

When you click the target, a gallery with previews of the nine pre-generated options is displayed.

As an example, let’s imagine that I’m doing block level analysis, so I’m only interested in the third column:

This sample dataset has a fairly small amount of data, such that in the top cell (+/- 50) and to some extent the middle cell (+/- 9), the noise overwhelms the data. In this case, we would have to consider tuning down the privacy protection towards the +/- 3 cell, in order to have a useful map at that resolution. (For this demo, the noise level is hard-coded.)  The other option is to sacrifice resolution (moving left in the gallery view), so that each square contains more data points, which are then less likely to be drowned out by higher noise levels.
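
The resolution trade-off can be sketched with a toy grid. The counts below are made up and the helper names are hypothetical. The point is only that merging cells raises the typical count per cell while the margin of error stays the same. The merging is done on the true counts before noise is added, because summing already-noisy cells would add their noise together too.

```python
def coarsen(counts, factor):
    """Merge factor x factor blocks of true cell counts into larger cells."""
    rows, cols = len(counts), len(counts[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    return [
        [
            sum(counts[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            for c in range(0, cols, factor)
        ]
        for r in range(0, rows, factor)
    ]


def mean_count_to_margin(counts, margin):
    """Average true count per cell divided by the margin of error."""
    cells = [value for row in counts for value in row]
    return (sum(cells) / len(cells)) / margin


# Hypothetical sparse dataset: roughly one point per block-level cell.
fine = [
    [1, 0, 2, 1],
    [0, 1, 1, 0],
    [2, 1, 0, 1],
    [1, 1, 0, 1],
]
coarse = coarsen(fine, 2)
```

Here `coarse` has a quarter as many cells, each holding on average four times as many points, so at the same margin of error the data stands out four times more clearly against the noise.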

Once you have selected a grid, you can pan and zoom the map to the desired scale. The legend is currently dynamic such that it will adjust as necessary to the magnitude of the data in your current view.

Response to: “A New Internet Privacy Law?” (New York Times – Opinion, March 18)

Wednesday, April 6th, 2011

There has been scant detailed coverage of the current discussions in Congress around an online privacy bill. The Wall Street Journal has published several pieces on it in their “What They Know” section but I’ve had a hard time finding anything that actually details the substance of the proposed legislation. There are mentions of Internet Explorer 9’s Tracking Protection Lists, and Firefox’s “Do Not Track” functionality, but little else.

Not surprisingly, we’re generally feeling like legislators are barking up the wrong tree by pushing to limit rather than expand legitimate uses of data in hard-to-enforce ways (e.g. “Do Not Track,” data deletion) without actually providing standards and guidance where government regulation could be truly useful and effective (e.g. providing a technical definition of “anonymous” for the industry and standardizing “privacy risk” accounting methods).

Last but not least, we’re dismayed that no one seems to be worried about the lack of public access to all this data.

We sent the following letter to the editor of the New York Times on March 23, 2011, in response to the first appearance of the issue in their pages – an opinion piece titled “A New Internet Privacy Law,” published on March 18, 2011.


While it is heartening to see Washington finally paying attention to online privacy, the new regulations appear to miss the point.

What’s needed is more data, more creative re-uses of data and more public access to data.

Instead, current proposals are headed in the direction of unenforceable regulations that hope to limit data collection and use.

So, what *should* regulators care about?

1. Much valuable data analysis can and should be done without identifying individuals. However, there is as yet, no widely accepted technical definition of “anonymous.” As a result, data is bought, sold and shared with “third-parties” with wildly varying degrees of privacy protection. Regulation can help standardize anonymization techniques which would create a freer, safer market for data-sharing.

2. The data stockpiles being amassed in the private sector have enormous value to the public, yet we have little to no access to it. Lawmakers should explore ways to encourage or require companies to donate data to the public.

The future will be about making better decisions with data, and the public is losing out.

Alex Selkirk
The Common Data Project – Working towards a public trust of sensitive data


In The Mix…predicting the future; releasing healthcare claims; and $1.5 million awarded to data privacy

Tuesday, November 30th, 2010

Some people out there think they can predict the future by scraping content off the web. Does it work simply because web 2.0 technologies are great at creating echo chambers? Is this just another way of amplifying that echo chamber and generating yet more self-fulfilling trend prophecies? See the Future with a Search (MIT Technology Review)

The U.S. Office of Personnel Management wants to create a huge database that contains healthcare claims of millions of. Many are concerned about how the data will be protected and used. More federal health database details coming following privacy alarm (Computer World)

Researchers at Purdue were awarded $1.5 million to investigate how well current techniques for anonymizing data are working and whether there’s a need for better methods. It would be interesting to know what they think of differential privacy. They appear to be actually doing the dirty work of figuring out whether theoretical re-identification is more than just a theory. National Science Foundation Funds Purdue Data-Anonymization Project (Threat Post)

@IAPP Privacy Foo Camp 2010: What Is Anonymous Enough?

Tuesday, October 26th, 2010

Editor’s Note: Becky Pezely is an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Becky’s work, like Tony’s, touches on many of the privacy challenges that CDP hopes to address with the datatrust.  We’re happy to have her guest blogging about IAPP Academy 2010 here.

Several weeks ago we attended the 2010 Global Privacy Summit (IAPP 2010) in Baltimore, Maryland.   

In addition to some engaging high-profile keynotes – including FTC Bureau of Consumer Protection Director David Vladeck – we got to participate in the first ever IAPP Foo Camp.

The Foo Camp was comprised of four discussion topics aimed at covering the top technology concerns facing a wide range of privacy professionals.

The session we ran was titled “Low Impact Data Mining”.  The intention was to discuss, and better understand, the current challenges in managing data within an organization.  All with a lens on managing data in a way that is “low impact” on resources while returning “high (positive) impact” on the business.

The individuals in our group represented a vast array of industries including: financial services, insurance, pharmaceutical, law enforcement, online marketing, health care, retail and telecommunications.  It was fascinating that, even across such a wide range of industries, there could be such a pervasive set of privacy challenges that were common among them.

Starting with:

What is “anonymous enough”?

If all you need is gender, zip code and birthdate to re-identify someone then what data, when released, is truly “anonymous enough”?  Can a baseline be defined, and enforced, within our organization that ensures customer protection?
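
One common way to put a number on this question is to count how many records share each combination of these “quasi-identifier” fields, which is the idea behind k-anonymity. This is not something the session prescribed. The sketch below uses made-up records and hypothetical field and function names, and it only flags uniqueness within a single data set. It says nothing about what an attacker could learn by joining in other data.

```python
from collections import Counter

# Fields assumed to be available to someone trying to re-identify a record.
QUASI_IDENTIFIERS = ("gender", "zip_code", "birthdate")


def group_sizes(records, keys=QUASI_IDENTIFIERS):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(record[k] for k in keys) for record in records)


def smallest_group(records, keys=QUASI_IDENTIFIERS):
    """Size of the smallest group (the 'k' in k-anonymity); 1 means someone is unique."""
    return min(group_sizes(records, keys).values())


def unique_records(records, keys=QUASI_IDENTIFIERS):
    """Records whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(records, keys)
    return [r for r in records if sizes[tuple(r[k] for k in keys)] == 1]


# Made-up records for illustration only.
records = [
    {"gender": "F", "zip_code": "11050", "birthdate": "1980-03-14", "purchase": "A"},
    {"gender": "F", "zip_code": "11050", "birthdate": "1980-03-14", "purchase": "B"},
    {"gender": "M", "zip_code": "11050", "birthdate": "1975-07-02", "purchase": "C"},
]
```

In this toy data set `smallest_group(records)` is 1, because the third record is the only one with its combination of gender, zip code and birthdate, so releasing these fields as they are would single that person out.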

It feels safe to say that this was the root challenge from which all the others stemmed.  Today the release of data is mostly controlled, and subsequently managed, by a trusted person(s). These individuals are the ones responsible for “sanitizing” the data that gets released internally, or externally, to the organization.  They are charged with managing the release of data to fulfill everything from understanding business performance to fulfilling business obligations with partners.  And their primary concern is to know how well they are protecting their customers’ information, not only from the perspective of company policy, but also from a perspective of personal morals. They are the gatekeepers for assessing the level of protection provided based on which data they released to whom, and they want some guarantee that what they are releasing is “anonymous enough” to achieve the level of protection they want.  These gatekeepers want to know when the data they release is “anonymous enough” and how they can employ a definition, or measurement, that guarantees the right level of anonymity for their customers.

This challenge compounds for these individuals, and their organizations, when adding in various other truths of the nature of data today:

The silos are getting joined.

The convention used to be that data within an organization lived in a silo – all on its own and protected – such that anyone looking at the data would only see that set of data.  Now, it’s starting to become the reality that these data sets are getting joined, and it’s not always known where, when, how, or with whom the join originated. Nor is it known where the joined data set is currently stored, since it was modified from its original silo.  Soon that joined data set takes on a life of its own and makes its way around the institution.  Given the likelihood of this occurring, how can the person(s) responsible for being the gatekeeper(s) of the data, and assessing the level of protection provided to customers, do so with any kind of reliable measurement that guarantees the right level of anonymity?

And now there’s data in the public market.

Not only is the data joined with data (from other silos) within the organization, but also with data outside the organization sold in the public market.  This prospect has increased the ability for organizations to produce data that is “high impact” for the business – because they now know WAY MORE about their customers.  But does the benefit outweigh the liability? As the ability to know more about individual customers increases, so does the level of sensitivity and the concern for privacy.  How do organizations successfully navigate mounting privacy concerns as they move from silos, to joined silos, to joined silos combined with public data?

The line between “data analytics” and looking at “raw data” is blurring.

Because the data is richer, and more plentiful, the act of data analysis isn’t as benign as it might once have been.  The definition of “data analytics” has evolved from something high-level (to know, for example, how many new customers are using the service this quarter) to something that  looks a lot more like looking at raw data to target specific parts of their business to specific customers (to, for example, sell <these products> to customers that make <this much money>, are females ages 30 – 35 and live in <this neighborhood> and typically spend <this much> on <these types of products>, etc…).

And the data has different ways of exiting the system.

The truth is, as scary as this data can be, everyone wants to get their hands on it, because the data leads to awareness that is meaningful and valuable for the business.  Thus, the data is shared everywhere – inside and outside the organization.  With that fact comes a whole set of challenges when considering all the ways data might be exiting any given “silo”, such as: Where is all the data going?  How is it getting modified (joined, sanitized, rejoined) and at which point is it no longer the data that needs to be protected by the organization? How much data needs to be released externally to fulfill partner/customer business obligations? Once the data has exited, can the organization’s privacy practices still be enforced?

Brand affects privacy policy.  Privacy policy affects brand.

Privacy is a concern of the whole business, not just the resources that manage the data, nor solely the resources that manage liability.  In the event of a “big oopsie” where there is a data/privacy breach, it will be the communication with customers before, during and after the incident that determines the internal and external impact on the brand and the perception of the organization.  And that communication is dictated by both what the privacy policy enforces and what the brand “allows”.  In today’s age of data, how can an organization have an open dialog with customers about their data if the brand does not support having that kind of conversation?  No surprise that Facebook is the exemplary case for this: Facebook continues to pave a new path, and to draw customers to share and disclose more information about themselves.  As a result, it has experienced backlash from customers when it takes things too far. The line of communication is very open – customers have a clear way to lash back when Facebook has gone too far, and Facebook has a way of visibly standing behind its decision or admitting its mistake.  Either way, it is now commonplace for Facebook’s customers to expect that there will be more incidents like this and that Facebook has a way (apparently suitable enough to keep most customers) of dealing with it.  Their “policy” allowed them to respond this way, and now it’s become a part of who Facebook is.  And now the policy evolves to support this behavior moving forward.

In the discussion of data and privacy, it seems inherently obvious that the mountain of challenges we face is large, complicated and impacts the core of all our businesses.  Nonetheless, it is still fascinating to have been able to witness first-hand – and to now be able to specifically articulate – how similar the challenges are across a diverse group of businesses and how similar the concerns are across job functions.

We want to re-thank everyone from IAPP that joined in on the discussions that we had at Foo Camp and throughout the conference.  We look forward to an opportunity to deep dive into these types of problems.

Post Script: Meanwhile, the challenges, and related questions, around the anonymization of data with some kind of measurable privacy guarantee that came up at Foo Camp are ones that we have been discussing on our blog for quite some time.  These are precisely the sorts of challenges that have motivated us to create a datatrust.  While we typically envision the datatrust being used in scenarios where there isn’t direct access to data, we walked away with specific examples from our discussions at IAPP Foo Camp where direct access to the data is required – particularly to fulfill business obligations – as a type of collateral (or currency).

The concept of data as the new currency of today’s economy has emerged.  Not only did it come up at the IAPP Foo Camp, it also came up back in August, when we heard Marc Davis talk about this at IPP 2010. With all of this in mind, it is interesting to evaluate the possibility of the datatrust being able to act as a special type of data broker in these exchanges.  The idea is that the datatrust is a sanctioned data broker (by the industry, or possibly by the government) that inherently meets federal, local, and municipal regulations and protects the consumers of business partners who want to exchange data as “currency,” while relieving businesses and their partners of the headaches of managing data use/reuse.  The “tax” on using the service is that these aggregates are stored and made available to the public to query in the way we imagine (no direct access to the data) for policy-making and research.  This is something that feels compelling to us and will influence our thinking as we continue to move forward with our work.

Common Data Project at IAPP Privacy Academy 2010

Monday, September 13th, 2010

We will be giving a Lightning Talk on “Low-Impact Data-Mining” and running two breakout sessions at the IT Privacy Foo Camp – Preconference Session, Wednesday Sept 29.

Below is a preview of our slides and handout for the conference. Unlike our previous presentations, we won’t be talking about CDP and the Datatrust at all. Instead, we’ll focus on how SGM helps companies minimize the privacy impact of their data-mining.

More specifically, we’ll be stepping through the symbiotic documentation system we’ve created between the product development/data science folks collecting and making use of the data and the privacy/legal folks trying to regulate and monitor compliance with privacy policies. We will be using the SGM Data Dictionary as a case study in the breakout sessions.

Still, we expect that many of the issues we’ve been grappling with from the datatrust perspective (e.g. public perception, trust, ownership of data, meaningful privacy guarantees) will come up, as they are universal issues that are central to any meaningful discussion about privacy today.


What is data science?

An introduction to data-mining from O’Reilly Radar that provides a good explanation of how data-mining is distinct from previous uses of data and provides plenty of examples of how data-mining is changing products and services today.

The “Anonymous” Promise and De-identification

  1. How you can be re-identified: Zip code + Birth date + Gender = Identity
  2. Promising new technologies for anonymization: Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization by Paul Ohm.
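
The re-identification risk in item 1 can be made concrete with a small sketch. The records below are invented for illustration, and the helper names (`group_sizes`, `unique_rows`) are hypothetical rather than taken from any of the linked sources. The idea is simply to count how many people share each zip code + birth date + gender combination: anyone whose combination appears only once is unique in the data set, even though no name is stored.

```python
from collections import Counter

# Invented sample records with names already removed ("anonymized").
records = [
    {"zip": "10001", "birth_date": "1980-03-14", "gender": "F", "diagnosis": "asthma"},
    {"zip": "10001", "birth_date": "1980-03-14", "gender": "F", "diagnosis": "flu"},
    {"zip": "10001", "birth_date": "1975-11-02", "gender": "M", "diagnosis": "diabetes"},
    {"zip": "94110", "birth_date": "1992-07-30", "gender": "F", "diagnosis": "migraine"},
]

# Assumption: these three fields are the quasi-identifiers of interest.
QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")


def group_sizes(rows, keys=QUASI_IDENTIFIERS):
    """Count how many rows share each combination of quasi-identifiers."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_rows(rows, keys=QUASI_IDENTIFIERS):
    """Return rows whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[k] for k in keys)] == 1]


if __name__ == "__main__":
    exposed = unique_rows(records)
    print(f"{len(exposed)} of {len(records)} records are unique on {QUASI_IDENTIFIERS}")
```

Anyone holding an outside list that pairs names with the same three fields (a voter roll, for instance) could match the unique rows back to individuals, which is why removing names alone may not be enough.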

Differential Privacy: A Programmatic Way to Enforce Your Privacy Guarantee?

  1. A Microsoft Research Implementation: PINQ
  2. CDP’s write-up about PINQ.
  3. A deeper look at how differential privacy’s mathematical guarantee might translate into laymen’s terms.
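
To give a feel for what a programmatic guarantee might look like, here is a minimal sketch of the textbook Laplace mechanism for a counting query. It is not PINQ's API and not CDP's implementation; the function names, the sample data and the choice of `epsilon` are all assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_count(rows, predicate, epsilon, rng=None):
    """Answer a counting query with Laplace noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so a noise scale of 1/epsilon is the usual calibration.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Invented data: ages of ten hypothetical customers.
    ages = [23, 31, 34, 35, 41, 29, 33, 52, 30, 38]
    answer = noisy_count(ages, lambda age: 30 <= age <= 35, epsilon=0.5)
    print(f"Noisy number of customers aged 30-35: {answer:.1f}")
```

A smaller `epsilon` means more noise and a stronger guarantee; a larger one means more accurate answers and a weaker guarantee. The analyst only ever sees the noisy answer, never the rows themselves.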

Paradigms of Data Ownership: Individuals vs Companies

  1. Markets and Privacy by Kenneth C. Laudon
  2. Privacy as Property by Lawrence Lessig
  3. CDP explores the advantages and challenges of a “Creative Commons-style” model for licensing personal information.
  4. CDP’s Guide to How to Read a Privacy Policy

In the mix…nonprofit technology failures; not counting religion; medical privacy after death; and the business of open data

Friday, August 20th, 2010

1) Impressive nonprofit transparency around technology failures. It might seem odd for us to highlight technology failures when we’re hoping to make CDP and its technology useful to nonprofits, but the transparency demonstrated by these nonprofits talking openly about their mistakes is precisely the kind of transparency we hope to support.  If nonprofits, or any other organizations, are going to share more of their data with the public, they have to be willing to share the bad with the good, all in the hope of actually doing better.

2) I was really surprised to find out the U.S. Census doesn’t ask about religion.  It’s a sensitive subject, but is it really more sensitive than race and ethnicity, which the U.S. Census asks about quite openly?  The article goes through why having a better count of different religions could be useful to a lot of people. What are other things we’re afraid to count, and how might that be holding us back from important knowledge?

3) How long should we protect people’s privacy around their medical history? HHS proposes to remove protections that prevent researchers and archivists from accessing medical records for people who have been dead for 50 years; CDT thinks this is a bad idea.  Is there a way that this information can be made available without revealing individual identity?  That’s the essential problem the datatrust is trying to solve.

4) It may be counterintuitive, but open data can foster industry and business. Clay Johnson, formerly at the Sunlight Foundation, writes about how weather data, collected by the U.S. government, became open data, thereby creating a whole new industry around weather prediction.  As he points out, though, that $1.5 billion industry is now not that excited by the National Weather Service expanding into providing data directly to citizens.

We at CDP have been talking about how the datatrust might change the business of data.  We think that it could enable all kinds of new business and new services, but it will likely change how data is bought and sold.  Already, the business of buying and selling data has changed so much in the past 10 years.  Exciting years ahead.

In the mix…data-sharing’s impact on Alzheimer’s research, the limits of a Do Not Track registry, meaning of data ownership and more

Friday, August 13th, 2010

1)  It’s heartening that an article on how data-sharing led to a breakthrough in Alzheimer’s research is the Most Emailed article on the NYTimes website right now. The reasons for resisting data-sharing are the same in so many contexts:

At first, the collaboration struck many scientists as worrisome — they would be giving up ownership of data, and anyone could use it, publish papers, maybe even misinterpret it and publish information that was wrong.

But Alzheimer’s researchers and drug companies realized they had little choice.

“Companies were caught in a prisoner’s dilemma,” said Dr. Jason Karlawish, an Alzheimer’s researcher at the University of Pennsylvania. “They all wanted to move the field forward, but no one wanted to take the risks of doing it.”

2) Google agonizes on privacy. The Wall Street Journal article discusses a confidential Google document that reveals the disagreements within the company on how it should use its data.  Interestingly, all the scenarios in which Google considers using its data involve targeted advertising; none involve sharing that data with Google users in a broader, more extensive way than they do now.  Google believes it owns the data it’s collected, but it also clearly senses that ownership of such data has implications that are different from ownership of other assets.  There are individuals who are implicated — what claims might they have to how that data is used?

3) Some people have suggested that if people are unhappy with targeted advertising, the government should come up with a Do Not Track registry, similar to the Do Not Call list.  But as Harlan Yu notes, Do Not Track would not be as simple as it sounds. He notes that the challenges involve both technology and policy:

Privacy isn’t a single binary choice but rather a series of individually-considered decisions that each depend on who the tracking party is, how much information can be combined and what the user gets in return for being tracked. This makes the general concept of online Do Not Track—or any blanket opt-out regime—a fairly awkward fit. Users need simplicity, but whether simple controls can adequately capture the nuances of individual privacy preferences is an open question.

4) What happens to a business’s data when it goes bankrupt? The former publisher and partners of a magazine and dating website for gay youth were fighting over ownership of the company’s assets, including its databases.  They recently came to an agreement to destroy the data. EFF argues that the Bankruptcy Code should be amended to require such outcomes for data assets.  I don’t know enough about bankruptcy law to have an opinion on that, but this conflict illuminates what’s so problematic about the way we treat data and property.  No one can own a fact, but everyone acts like they own data.  Something fundamental needs to be thrashed out.

5) Geotags are revealing more locational info than the photographers intended.

6) The owner of an ISP that resisted an FBI request for information can finally reveal his identity. Nicholas Merrill can now reveal that he was the plaintiff behind an ACLU lawsuit that challenged the legality of national security letters (NSLs), by which the FBI can request information without a court order or proving just cause.  In fact, the FBI can even impose a gag order prohibiting the recipient of an NSL from telling anyone about it, which is what happened to Merrill.

In the mix…WSJ’s “What They Know”; data potential in healthcare; and comparing the privacy bills in Congress

Monday, August 9th, 2010

1.  The Wall Street Journal Online has a new feature section, ominously named, “What They Know.” The section highlights articles that focus on technology and tracking.  The tone feels a little overwrought, with language that evokes spies, like “Stalking by Cellphone” and “The Web’s New Gold Mine: Your Secrets.”  Some of their methodology is a little simplistic.  Their study on how much people are “exposed” online was based on simply counting tracking tools, such as cookies and beacons, installed by certain websites.

It is interesting, though, to see that the big, bad wolves of privacy, like Facebook and Google, are pretty low on the WSJ’s exposure scale, while sites people don’t really think about, like, are very high.  The debate around online data collection does need to shift to include companies that aren’t so name-brand.

2.  In response to WSJ’s feature, AdAge published this article by Erin Jo Richey, a digital marketing analyst, addressing whether “online marketers are actually spies.” She argues that she doesn’t know that much about the people she’s tracking, but she does admit she could know more:

I spend most of my time looking at trends and segments of visitors with shared characteristics rather than focusing on profiles of individual browsers. However, if I already know that Mary Smith bought a black toaster with product number 08971 on Monday morning, I can probably isolate the anonymous profile that represents Mary’s visit to my website Monday morning.

3.  A nice graphic illustrates how data could transform healthcare. Are there people making these kinds of detailed arguments for other industries and areas of research and policy?

4.  There are now two proposed privacy bills in Congress, the BEST PRACTICES bill proposed by Representative Bobby Rush and the draft proposed by Representatives Rick Boucher and Cliff Stearns.  CDT has released a clear and concise table breaking down the differences between these two proposed bills and what CDT recommends.  Some things that jumped out at us:

  • Both bills make exceptions for aggregated or de-identified data.  The BEST PRACTICES bill has a more descriptive definition of what that means, stating that it excepts aggregated information and information from which identifying information has been obscured or removed, such that there is no reasonable basis to believe that the information could be used to identify an individual or a computer used by the individual.  CDT supports the BEST PRACTICES exception.
  • Both bills make some provisions, though not sweeping ones, for consumer access to the information collected about them.  CDT endorses neither, and would support a bill that would generally require covered entities to make available to consumers the covered information possessed about them along with a reasonable method of correction.  Some companies, including a start-up called Bynamite, have already begun to show consumers what’s being collected, albeit in rather limited ways.  We at the Common Data Project hope this push for access also includes access to the richness of the information collected from all of us, and not just the interests associated with me.  It’ll be interesting to see where this legislation goes, and how it might affect the development of our datatrust.
