Posts Tagged ‘anonymization’

The CDP Private Map Maker v0.2

Wednesday, April 27th, 2011

We’ve released version 0.2 of the CDP Private Map Maker, a new way to release sensitive map data! (Requires Silverlight.)


Speedy, but is it safe?

Today, releasing sensitive data safely on a map is not a trivial task. The common anonymization methods tend to be either manual and time-consuming, or to produce a very low-resolution map.

Compared to current manual anonymization methods, which can take months if not years, our map maker leverages differential privacy to generate a map programmatically in much less time. For the sample datasets included, this process took a couple of minutes.

However, speed is not the map maker’s most important feature; safety is, through the ability to quantify privacy risk.

Accounting for Privacy Risk, Literally and Figuratively

We’re still leveraging the same differential privacy principles we’ve been working with all along. Differential privacy not only allows us to (mostly) automate the process of generating the maps, it also allows us to quantitatively balance the accuracy of the map against the privacy risk incurred when releasing the data. (The purpose of this post is not to discuss whether differential privacy works; it’s an area of privacy research that has been around for several years, and there are others better equipped to defend its capabilities.)

Think of it as a form of accounting. Rather than buying whatever appears cost-effective and hoping for the best, you can actually see the price of each item (privacy risk) AND know how accurate it will be.

Previous implementations of differential privacy (including our own) have done this accounting in code. The new map maker provides a graphical user interface so you can play with the settings yourself.
More details on how this works below.
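To make the accounting metaphor concrete, here is a minimal sketch, in Python, of what this kind of bookkeeping might look like. This is purely illustrative and is not the map maker’s actual code; the class and method names are hypothetical.

class PrivacyLedger:
    """Track the total privacy cost (epsilon) of a series of noisy queries."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon  # the overall budget for this dataset
        self.spent = 0.0

    def charge(self, epsilon, description=""):
        # Record the cost of one query; refuse to answer if it would exceed
        # the budget. Costs of successive queries simply add up.
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError("privacy budget exhausted: " + description)
        self.spent += epsilon

    def remaining(self):
        return self.total_epsilon - self.spent

ledger = PrivacyLedger(total_epsilon=1.0)
ledger.charge(0.1, "population count for the selected region")
ledger.charge(0.1, "count of seniors in the same region")
print(ledger.remaining())  # 0.8 of the budget left for future questions

The point is simply that the “spend” is explicit and quantified before any data is released, which is what makes a graphical accounting interface possible in the first place.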

Compared to v0.1

Version 0.2 updates our first test-drive of differential privacy.  Our first iteration allowed you to query the number of people in an arbitrary region of the map, returning meaningful results about the area as a whole without exposing individuals in the dataset.

The flexibility that application provided, compared to pre-bucketed data, is great if you have a specific question, but the workflow of looking at a blank map and choosing an area to query doesn’t align with how people often use maps and data. We generally like to see the data at a high level, and then dig deeper as needed.

In this round, we’re aiming for a more intuitive user experience. Our two target users are:

  1. Data Releaser: the person releasing the data, who wants to make intelligent decisions about how to balance privacy risk and data utility.
  2. Data User: the person trying to make use of the data, who would like to have a general overview of a data set before delving in with more specific questions.

As a result, we’ve flipped our workflow on its head. Rather than providing a blank map for you to query, the map maker now immediately produces populated maps at different levels of accuracy and privacy risk.

We’ve also added the ability to upload your own datasets and choose your own privacy settings to see how the private map maker works.

However, please do not upload genuinely sensitive data to this demo.

v0.2 is for demonstration purposes only. Our hope is to create a forum where organizations with real data release scenarios can begin to engage with the differential privacy research community. If you’re interested in a more serious experiment with real data, please contact us.

Any data you do upload is available publicly to other users until it is deleted. (You can delete any uploaded dataset through the map maker interface.) The sample datasets provided cannot be deleted and were synthetically generated; please do not use the sample data for any purpose other than seeing how the map maker works. The data is fake.

You can play with the demo here. (Requires Silverlight.)

Finally, a subtle but significant change we should call out: our previous map demo leveraged an implementation of differential privacy called PINQ, developed at Microsoft Research. Creating the grids for this map maker required a different workflow, so we wrote our own implementation to add noise to the cell counts, using the same fundamentals of differential privacy.
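For the technically curious, the core of such an implementation is quite small. The following is a minimal sketch of the standard Laplace mechanism for a single cell count (illustrative Python, not our production code; the function name is hypothetical).

import numpy as np

def noisy_count(true_count, epsilon):
    # A count changes by at most 1 when any one person is added or removed
    # (sensitivity = 1), so Laplace noise with scale 1/epsilon yields an
    # epsilon-differentially private answer for the cell.
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

print(noisy_count(20, epsilon=1.0 / 3))  # epsilon of 1/3 means a noise scale of 3 counts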

More Details on How the Private Map Maker Works

How exactly do we generate the maps?

One Option – Nudge Each Data Point a Little

The key to differential privacy is adding random noise to each answer. It only returns aggregates, so we can’t ask it to ‘make a data point private’, but what if we added noise to each data point by moving it slightly? The person consuming the map then wouldn’t know exactly where the data point originated, making it private, right?

The problem with this process is that we can’t automate adding this random noise because external factors might cause the noise to be ineffective.  Consider the red data point below.

If we nudge it randomly, there’s a pretty good chance we’ll nudge it right into the water. Since there aren’t residences in the middle of Manhasset Bay, this could significantly narrow down the possibilities for the actual origin of the data point. (One of the more problematic scenarios is pictured above.) And water isn’t the only issue: if we’re dealing with residences, nudging into a strip mall, school, etc. could cause the same problem. Because of these external factors, the process is manual and time-consuming. On top of that, unlike differential privacy, there’s no mathematical measure of how much information is being divulged; you’re relying on the manual review to catch any privacy issues.

Another Option – Grids

As a compromise between querying a blank map and the time-consuming (and potentially error-prone) process of nudging data points, we decided to generate grid squares based on noisy answers: the darker the grid square, the higher the answer. The grid is generated simply by running one differential-privacy-protected query for each square. Here’s an example grid from a fake dataset:

“But Tony!” you say, “Weren’t you just telling us how much better arbitrary questions are compared to the bucketing we often see?” First, this isn’t necessarily meant to replace the ability to ask arbitrary questions, but instead provides another tool that lets you see the data first. And second, compared to the way released data is often currently pre-bucketed, we’re able to offer more granular grids.
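As a rough sketch of how a grid like this could be produced (illustrative Python with hypothetical names, not the actual map maker code): bin each point into a square, then release one noisy count per square, including the empty ones, so that the mere presence or absence of an answer gives nothing away.

import numpy as np

def private_grid(points, bounds, cell_size, epsilon_per_cell):
    # points: list of (lat, lon) pairs; bounds: (min_lat, min_lon, max_lat, max_lon).
    # Returns a 2D array of noisy counts, one per grid square.
    min_lat, min_lon, max_lat, max_lon = bounds
    n_rows = int(np.ceil((max_lat - min_lat) / cell_size))
    n_cols = int(np.ceil((max_lon - min_lon) / cell_size))
    true_counts = np.zeros((n_rows, n_cols))
    for lat, lon in points:
        row = int((lat - min_lat) // cell_size)
        col = int((lon - min_lon) // cell_size)
        if 0 <= row < n_rows and 0 <= col < n_cols:
            true_counts[row, col] += 1
    # One Laplace-protected counting query per square (each count has
    # sensitivity 1). Because the squares are disjoint, releasing the whole
    # grid costs only epsilon_per_cell of privacy budget in total.
    noise = np.random.laplace(scale=1.0 / epsilon_per_cell, size=true_counts.shape)
    return true_counts + noise

Rendering the map is then just a matter of shading each square by its noisy count.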

Choosing a Map

Now comes the manual part. There are two variables you can adjust when choosing a map: grid size and margin of error. While this step is manual, most of the work is done for you, so it’s much less time-intensive than moving data points around. For demonstration purposes, we currently generate several options which you can select from in the gallery view. You could release any of the pre-generated maps, as they are all protected by differential privacy at the stated +/- level, but some are not useful and others may waste privacy currency.

Grid size is simply the area of each cell. Since a cell is the smallest area you can compare (with either another cell or 0), you must set it to accommodate the minimum resolution required for your analysis. For example, using the map to allocate resources at the borough level vs. the block level requires different resolutions to be effective. You also have to consider the density of the dataset. If your analysis is at the block level, but the dataset is so sparse that there’s only about one point per block, the noise will protect those individuals, and the map will be uniformly noisy.

Margin of error specifies a range that the noisy answer will likely fall within. The higher the margin of error, the less the noisy answer tells us about specific data points within the cell. A cell with answer 20 +/- 3 means the real answer is likely between 17 and 23. An answer of 20 +/- 50, on the other hand, means the real answer is likely between -30 and 70, and thus it’s reasonably likely that there are no data points within that cell at all.
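To give a feel for the arithmetic (assuming Laplace noise and treating the margin of error as a 95% interval; the demo’s exact convention may differ), here is how the margin relates to the noise scale and, for a counting query, to the privacy cost:

import math

def laplace_scale_for_margin(margin, confidence=0.95):
    # For Laplace noise, P(|noise| > t) = exp(-t / b), so the scale b that keeps
    # the noise within +/- margin with the given probability is
    # margin / ln(1 / (1 - confidence)).
    return margin / math.log(1.0 / (1.0 - confidence))

# For a count (sensitivity 1), epsilon = 1 / b: a tighter margin of error
# spends more privacy budget.
for margin in (3, 9, 50):
    b = laplace_scale_for_margin(margin)
    print(margin, round(b, 2), round(1.0 / b, 2))  # margin, noise scale, epsilon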

To select a map, first pan and zoom the map to show the portion you’re interested in, and then click the target icon for a dataset.

Map Maker Target Button

When you click the target, a gallery with previews of the nine pre-generated options is displayed.

As an example, let’s imagine that I’m doing block level analysis, so I’m only interested in the third column:

This sample dataset has a fairly small amount of data, such that in the top cell (+/- 50), and to some extent the middle cell (+/- 9), the noise overwhelms the data. In this case, we would have to consider tuning down the privacy protection towards the +/- 3 cell in order to have a useful map at that resolution. (For this demo, the noise level is hard-coded.) The other option is to sacrifice resolution (moving left in the gallery view), so that each square holds more data points and the signal isn’t drowned out by higher noise levels.

Once you have selected a grid, you can pan and zoom the map to the desired scale. The legend is currently dynamic such that it will adjust as necessary to the magnitude of the data in your current view.

Response to: “A New Internet Privacy Law?” (New York Times – Opinion, March 18)

Wednesday, April 6th, 2011

There has been scant detailed coverage of the current discussions in Congress around an online privacy bill. The Wall Street Journal has published several pieces on it in their “What They Know” section but I’ve had a hard time finding anything that actually details the substance of the proposed legislation. There are mentions of Internet Explorer 9’s Tracking Protection Lists, and Firefox’s “Do Not Track” functionality, but little else.

Not surprisingly, we’re generally feeling like legislators are barking up the wrong tree by pushing to limit rather than expand legitimate uses of data in hard-to-enforce ways (e.g. “Do Not Track,” data deletion) without actually providing standards and guidance where government regulation could be truly useful and effective (e.g. providing a technical definition of “anonymous” for the industry and standardizing “privacy risk” accounting methods).

Last but not least, we’re dismayed that no one seems to be worried about the lack of public access to all this data.

In response, we sent the following letter to the editor of The New York Times on March 23, 2011, addressing the first appearance of the issue in their pages: an opinion piece titled “A New Internet Privacy Law,” published on March 18, 2011.

 

While it is heartening to see Washington finally paying attention to online privacy, the new regulations appear to miss the point.

What’s needed is more data, more creative re-uses of data and more public access to data.

Instead, current proposals are headed in the direction of unenforceable regulations that hope to limit data collection and use.

So, what *should* regulators care about?

1. Much valuable data analysis can and should be done without identifying individuals. However, there is, as yet, no widely accepted technical definition of “anonymous.” As a result, data is bought, sold and shared with “third parties” with wildly varying degrees of privacy protection. Regulation can help standardize anonymization techniques, which would create a freer, safer market for data-sharing.

2. The data stockpiles being amassed in the private sector have enormous value to the public, yet we have little to no access to it. Lawmakers should explore ways to encourage or require companies to donate data to the public.

The future will be about making better decisions with data, and the public is losing out.

Alex Selkirk
The Common Data Project – Working towards a public trust of sensitive data
http://commondataproject.org

 

In The Mix…predicting the future; releasing healthcare claims; and $1.5 million awarded to data privacy

Tuesday, November 30th, 2010

Some people out there think they can predict the future by scraping content off the web. Does it work simply because web 2.0 technologies are great at creating echo chambers? Is this just another way of amplifying that echo chamber and generating yet more self-fulfilling trend prophecies? See the Future with a Search (MIT Technology Review)

The U.S. Office of Personnel Management wants to create a huge database that contains the healthcare claims of millions of people. Many are concerned about how the data will be protected and used. More federal health database details coming following privacy alarm (Computer World)

Researchers at Purdue were awarded $1.5 million to investigate how well current techniques for anonymizing data are working and whether there’s a need for better methods. It would be interesting to know what they think of differential privacy. They  appear to be actually doing the dirty work of figuring out whether theoretical re-identification is more than just a theory. National Science Foundation Funds Purdue Data-Anonymization Project (Threat Post)

@IAPP Privacy Foo Camp 2010: What Is Anonymous Enough?

Tuesday, October 26th, 2010

Editor’s Note: Becky Pezely is an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Becky’s work, like Tony’s, touches on many of the privacy challenges that CDP hopes to address with the datatrust.  We’re happy to have her guest blogging about IAPP Academy 2010 here.

Several weeks ago we attended the 2010 Global Privacy Summit (IAPP 2010) in Baltimore, Maryland.   

In addition to some engaging high-profile keynotes – including one from FTC Bureau of Consumer Protection Director David Vladeck – we got to participate in the first-ever IAPP Foo Camp.

The Foo Camp consisted of four discussion topics aimed at covering the top technology concerns facing a wide range of privacy professionals.

The session we ran was titled “Low Impact Data Mining”. The intention was to discuss, and better understand, the current challenges in managing data within an organization, all with a lens on managing data in a way that is “low impact” on resources while returning “high (positive) impact” for the business.

The individuals in our group represented a vast array of industries, including financial services, insurance, pharmaceuticals, law enforcement, online marketing, health care, retail and telecommunications. It was fascinating that, even across such a wide range of industries, there could be such a pervasive set of privacy challenges common to them all.

Starting with:

What is “anonymous enough”?

If all you need is gender, zip code and birthdate to re-identify someone, then what data, when released, is truly “anonymous enough”? Can a baseline be defined, and enforced, within an organization that ensures customer protection?

It feels safe to say that this was the root challenge from which all the others stemmed. Today the release of data is mostly controlled, and subsequently managed, by one or more trusted people. These individuals are the ones responsible for “sanitizing” the data that gets released internally or externally to the organization. They are charged with managing the release of data for everything from understanding business performance to fulfilling business obligations with partners. And their primary concern is to know how well they are protecting their customers’ information, not only from the perspective of company policy, but also from the perspective of personal morals. They are the gatekeepers who assess the level of protection provided based on which data they release to whom, and they want some guarantee that what they are releasing is “anonymous enough” to achieve the level of protection they intend. In short, these gatekeepers want to know when the data they release is “anonymous enough” and how they can employ a definition, or measurement, that guarantees the right level of anonymity for their customers.

This challenge compounds for these individuals, and their organizations, when adding in various other truths of the nature of data today:

The silos are getting joined.

The convention used to be that data within an organization sat in a silo – all on its own and protected – such that anyone looking at the data would only see that set of data. Now, it’s becoming the reality that these data sets are getting joined, and it’s not always known where, when, how, or with whom the join originated. Nor is it known where the joined data set is currently stored, since it was modified from its original silo. Soon that joined data set takes on a life of its own and makes its way around the institution. Given the likelihood of this occurring, how can the people responsible for being the gatekeepers of the data, and for assessing the level of protection provided to customers, do so with any kind of reliable measurement that guarantees the right level of anonymity?

And now there’s data in the public market.

Not only is the data joined with data (from other silos) within the organization, but also with data outside the organization sold in the public market. This prospect has increased the ability of organizations to produce data that is “high impact” for the business – because they now know WAY MORE about their customers. But does the benefit outweigh the liability? As the ability to know more about individual customers increases, so does the level of sensitivity and the concern for privacy. How do organizations successfully navigate mounting privacy concerns as they move from silos, to joined silos, to joined silos combined with public data?

The line between “data analytics” and looking at “raw data” is blurring.

Because the data is richer, and more plentiful, the act of data analysis isn’t as benign as it might once have been.  The definition of “data analytics” has evolved from something high-level (to know, for example, how many new customers are using the service this quarter) to something that  looks a lot more like looking at raw data to target specific parts of their business to specific customers (to, for example, sell <these products> to customers that make <this much money>, are females ages 30 – 35 and live in <this neighborhood> and typically spend <this much> on <these types of products>, etc…).

And the data has different ways of exiting the system.

The truth is, as scary as this data can be, everyone wants to get their hands on it, because the data leads to awareness that is meaningful and valuable for the business. Thus, the data is shared everywhere – inside and outside the organization. With that fact comes a whole set of challenges to consider around all the ways data might be exiting any given “silo”, such as: Where is all the data going? How is it getting modified (joined, sanitized, rejoined), and at which point is it no longer data that needs to be protected by the organization? How much data needs to be released externally to fulfill partner/customer business obligations? Once the data has exited, can the organization’s privacy practices still be enforced?

Brand affects privacy policy.  Privacy policy affects brand.

Privacy is a concern of the whole business, not just the people who manage the data, nor solely those who manage liability. In the event of a “big oopsie” where there is a data/privacy breach, it is the communication with customers before, during and after the incident that determines the internal and external impact on the brand and the perception of the organization. And that communication is dictated both by what the privacy policy enforces and by what the brand “allows”. In today’s age of data, how can an organization have an open dialog with customers about their data if the brand does not support having that kind of conversation? No surprise that Facebook is the exemplary case here: Facebook continues to pave a new path, drawing customers to share and disclose more information about themselves. As a result, it has experienced backlash from customers when it takes things too far. The line of communication is very open – customers have a clear way to lash back when Facebook has gone too far, and Facebook has a way of visibly standing behind its decision or admitting its mistake. Either way, it is now commonplace for Facebook’s customers to expect that there will be more incidents like this and that Facebook has a way (apparently suitable enough to keep most customers) of dealing with them. Facebook’s “policy” allowed it to respond this way, and that response has now become a part of who Facebook is; the policy, in turn, evolves to support this behavior moving forward.

In the discussion of data and privacy, it seems inherently obvious that the mountain of challenges we face is large, complicated, and impacts the core of all our businesses. Nonetheless, it is still fascinating to have been able to witness first-hand – and to now be able to specifically articulate – how similar the challenges are across a diverse group of businesses and how similar the concerns are across job functions.

We want to thank again everyone from IAPP who joined in on the discussions we had at Foo Camp and throughout the conference. We look forward to an opportunity to dive deep into these types of problems.

Post Script: Meanwhile, the challenges, and related questions, around the anonymization of data with some kind of measurable privacy guarantee that came up at Foo Camp are ones we have been discussing on our blog for quite some time. These are precisely the sorts of challenges that have motivated us to create a datatrust. While we typically envision the datatrust being used in scenarios where there isn’t direct access to data, we walked away from our discussions at IAPP Foo Camp with specific examples where direct access to the data is required – particularly to fulfill business obligations – as a type of collateral (or currency).

The concept of data as the new currency of today’s economy has emerged. Not only did it come up at the IAPP Foo Camp, it also came up back in August, when we heard Marc Davis talk about it at IPP 2010. With all of this in mind, it is interesting to evaluate the possibility of the datatrust acting as a special type of data broker in these exchanges. The idea is that the datatrust is a sanctioned data broker (by the industry, or possibly by the government) that inherently meets federal, local and municipal regulations and protects the consumers of business partners who want to exchange data as “currency,” while relieving businesses and their partners of the headaches of managing data use/reuse. The “tax” on using the service is that these aggregates are stored and made available for the public to query in the way we imagine (no direct access to the data) for policy-making and research. This is something that feels compelling to us and will influence our thinking as we continue to move forward with our work.

Can we reconcile the goals of increased government transparency and more individual privacy?

Tuesday, April 13th, 2010

I really appreciate the Sunlight Foundation‘s continuing series on new data sets being made public by the federal government as part of the Open Government Directive.  Yesterday, I found out the Centers for Medicare and Medicaid Services will be releasing all kinds of new goodies.  As the Sunlight Foundation points out, the data so far is lacking granularity — comparisons of Medicare spending by state, rather than county.  But still all very exciting.

Yet not a single mention of privacy.  Even though, according to the blogger, the new claims database will include data for 5% of Medicare recipients.  After “strip[ping] all personal identification data out,” the database will “present it by service type (inpatient, outpatient, home health, prescription drug, etc.)” As privacy advocates have noted, that’s probably not going to do enough to anonymize it.

I don’t really mind not hearing about privacy every time someone talks about a database.  But it’s sort of funny.  Everyday, I read a bunch of blogs on open data and government transparency, as well as a bunch of blogs on privacy issues.  But I rarely read about both issues in the same place.  Shouldn’t we all be talking to each other more?

In the mix

Wednesday, September 9th, 2009

OpenID Pilot Program to be Announced by U.S. Government (ReadWriteWeb)

Stimulus Funding Map is “Slick as Hell” (FlowingData)

Why Anonymized Data Isn’t (Slashdot)

In the mix

Wednesday, May 27th, 2009

Data.gov: Unlocking the Federal Filing Cabinets. (NYT Bits)

On the Anonymity of Home/Work Location Pairs. (Schneier on Security)

Do People Care About Data Correlation? (Kim Cameron’s Identity Blog)

In the mix

Wednesday, May 20th, 2009

Site Lets Writers Sell Digital Copies. (NY Times)

Linked Data is Blooming: Why You Should Care (ReadWriteWeb)

Mint Considers Selling Anonymized Data From Its Users (ReadWriteWeb)

The Growing Popularity of Popularity Lists (The Numbers Guy/Wall Street Journal)

