Posts Tagged ‘Privacy’

The CDP Private Map Maker v0.2

Wednesday, April 27th, 2011

We’ve released version 0.2 of the CDP Private Map Maker – A new way to release sensitive map data! (Requires Silverlight.)

Speedy, but is it safe?

Today, releasing sensitive data safely on a map is not a trivial task. The common anonymization methods tend to either be manual and time consuming, or create a very low resolution map.

Compared to current manual anonymization methods, which can take months if not years, our map maker leverages differential privacy to generate a map programmatically in much less time. For the sample datasets included, this process took a couple of minutes.

However, speed is not the map maker’s most important feature. Safety is, through the ability to quantify privacy risk.

Accounting for Privacy Risk, Literally and Figuratively

We’re still leveraging the same differential privacy principles we’ve been working with all along. Differential privacy not only allows us to (mostly) automate the process of generating the maps, it also allows us to quantitatively balance the accuracy of the map against the privacy risk incurred when releasing the data.  (The purpose of the post is not to discuss whether differential privacy works–it’s an area of privacy research that has been around for several years and there are others better equipped to defend its capabilities.)

Think of it as a form of accounting. Rather than buying what appears to be cost-effective and hoping for the best, you can actually see the price of each item (privacy risk) AND know how accurate it will be.

Previous implementations of differential privacy (including our own) have done this accounting in code. The new map maker provides a graphical user interface so you can play with the settings yourself.
More details on how this works below.
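As a rough illustration of what "doing the accounting in code" can look like, here is a minimal Python sketch. The class name `PrivacyBudget`, its methods, and the numbers are hypothetical and are not taken from the map maker or from PINQ. The sketch only assumes the standard rule that the privacy costs (epsilons) of successive differentially private queries on the same data add up.

```python
class PrivacyBudget:
    """Hypothetical ledger that tracks cumulative privacy cost (epsilon)."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    def remaining(self):
        """Privacy budget still available for future queries."""
        return self.total - self.spent

    def charge(self, epsilon):
        """Record the cost of one query; refuse it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon


# Illustrative numbers only (assumed, not from the demo).
budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.25)   # a cheap, noisier query
budget.charge(0.5)    # a more accurate, more expensive query
print("remaining budget:", budget.remaining())
```

The point of a ledger like this is that every release has a visible price, and a release that would overspend the budget is refused rather than quietly allowed.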

Compared to v0.1

Version 0.2 updates our first test-drive of differential privacy.  Our first iteration allowed you to query the number of people in an arbitrary region of the map, returning meaningful results about the area as a whole without exposing individuals in the dataset.

The flexibility that application provided as compared to pre-bucketed data is great if you have a specific question, but the workflow of looking at a blank map and choosing an area to query doesn’t align with how people often use maps and data.  We generally like to see the data at a high level, and then dig deeper as needed.

In this round, we’re aiming for a more intuitive user experience. Our two target users are:

  1. Data Releaser: The person releasing the data who wants to make intelligent decisions about how to balance privacy risk and data utility.
  2. Data User: The person trying to make use of the data, who would like to have a general overview of a data set before delving in with more specific questions.

As a result, we’ve flipped our workflow on its head. Rather than providing a blank map for you to query, the map maker now immediately produces populated maps at different levels of accuracy and privacy risk.

We’ve also added the ability to upload your own datasets and choose your own privacy settings to see how the private map maker works.

However, please do not upload actually sensitive data to this demo.

v0.2 is for demonstration purposes only. Our hope is to create a forum where organizations with real data release scenarios can begin to engage with the differential privacy research community. If you’re interested in a more serious experiment with real data, please contact us.

Any data you do upload is available publicly to other users until it is deleted. (You can delete any uploaded dataset through the map maker interface.) The sample data sets provided cannot be deleted, and were synthetically generated – please do not use the sample data for any purpose other than seeing how the map maker works – the data is fake.

You can play with the demo here. (Requires Silverlight.)

Finally, a subtle but significant change we should call out: our previous map demo leveraged an implementation of differential privacy called PINQ, developed at Microsoft Research.  Creating the grids for this map maker required a different workflow, so we wrote our own implementation to add noise to the cell counts, using the same fundamentals of differential privacy.

More Details on How the Private Map Maker Works

How exactly do we generate the maps? One option – Nudge each data point a little

The key to differential privacy is adding random noise to each answer.  It only returns aggregates so we can’t ask it to ‘make a data point private’, but what if we added noise to each data point by moving it slightly?  The person consuming the map then wouldn’t know exactly where the data point originated, making it private, right?

The problem with this process is that we can’t automate adding this random noise because external factors might cause the noise to be ineffective.  Consider the red data point below.

If we nudge it randomly, there’s a pretty good chance we’ll nudge it right into the water.  Since there aren’t residences in the middle of Manhasset Bay, this could significantly narrow down the possibilities for the actual origin of the data point.  (One of the more problematic scenarios is pictured above.)  And water isn’t the only issue—if we’re dealing with residences, nudging into a strip mall, school, etc. could cause the same problem.  Because of these external factors, the process is manual and time consuming.   On top of that, unlike differential privacy, there’s no mathematical measure about how much information is being divulged—you’re relying on the manual review to catch any privacy issues.

Another Option – Grids

As a compromise between querying a blank map, and the time consuming (and potentially error prone) process of nudging data points, we decided to generate grid squares based on noisy answers—the darker the grid square, the higher the answer.  The grid is generated simply by running one differential privacy-protected query for each square.  Here’s an example grid from a fake dataset:

“But Tony!” you say, “Weren’t you just telling us how much better arbitrary questions are as compared to the bucketing we often see?”  First, this isn’t meant to necessarily replace the ability to ask arbitrary questions, but instead provides another tool allowing you to see the data first.  And second, compared to the way released data is often currently pre-bucketed, we’re able to offer more granular grids.
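The grid generation described above can be sketched in a few lines of Python. This is not the map maker's actual code: the function names, the flat x/y coordinates, and the choice of Laplace noise are assumptions made for illustration. The sketch counts the points in each square and adds independent noise to every cell, including empty ones. Because each data point falls in exactly one cell, adding or removing one person changes one count by at most 1, which is why a noise scale of 1/epsilon is used here.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace distribution centred on 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_grid(points, cell_size, rows, cols, epsilon, rng=None):
    """Return a rows x cols grid of noisy counts (hypothetical helper).

    points: iterable of (x, y) pairs in the same units as cell_size.
    Points that fall outside the grid are ignored.
    """
    rng = rng or random.Random()
    counts = Counter()
    for x, y in points:
        row, col = int(y // cell_size), int(x // cell_size)
        if 0 <= row < rows and 0 <= col < cols:
            counts[(row, col)] += 1
    scale = 1.0 / epsilon  # assumed: each person contributes one point
    # Noise is added to every cell, so an empty cell is not revealed as empty.
    return [
        [counts[(r, c)] + laplace_noise(scale, rng) for c in range(cols)]
        for r in range(rows)
    ]


# Fake data, as in the demo: a handful of made-up points.
fake_points = [(0.5, 0.5), (0.6, 0.4), (2.5, 1.5), (1.2, 0.1)]
for row in noisy_grid(fake_points, cell_size=1.0, rows=2, cols=3, epsilon=0.5):
    print([round(v, 1) for v in row])
```

Noisy counts can come out negative or fractional. In a sketch like this they would be left as they are, or rounded and clamped only for display.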

Choosing a Map

Now comes the manual part. There are two variables you can adjust when choosing a map: grid size and margin of error.  While this step is manual, most of the work is done for you, so it’s much less time-intensive than moving data points around. For demonstration purposes, we currently generate several options which you can select from in the gallery view. You could release any of the maps that are pre-generated as they are all protected by differential privacy with the given +/- –but some are not useful and others may be wasting privacy currency.

Grid size is simply the area of each cell.  Since a cell is the smallest area you can compare (with either another cell or 0), you must set it to accommodate the minimum resolution required for your analysis.  For example, using the map to allocate resources at the borough level requires a different resolution than using it at the block level. You also have to consider the density of the dataset. If your analysis is at the block level, but the dataset is very sparse such that there’s only about one point per block, the noise will protect those individuals, and the map will be uniformly noisy.

Margin of error specifies a range that the noisy answer will likely fall within.  The higher the margin of error, the less the noisy answer tells us about specific data points within the cell.  A cell with answer 20 +/- 3 means the real answer is likely between 17 and 23, while an answer of 20 +/- 50 means the real answer is likely between -30 and 70, and thus it’s reasonably likely that there are no data points within that cell at all.
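To make the trade-off concrete, here is a hedged sketch of how a margin of error could map to a privacy cost. The post does not say which noise distribution or confidence level sits behind the "+/-" figures, so this assumes Laplace noise on a count where one person changes the count by at most 1, and reads "likely" as 95%. Under those assumptions the noise stays within the margin m with probability 1 - exp(-m * epsilon), which gives epsilon = ln(1 / (1 - confidence)) / m. The function names are hypothetical.

```python
import math


def epsilon_for_margin(margin, confidence=0.95):
    """Privacy cost (epsilon) whose Laplace noise stays within +/- margin
    with the given probability. Assumes each person changes a count by at most 1."""
    if margin <= 0 or not 0 < confidence < 1:
        raise ValueError("margin must be > 0 and confidence in (0, 1)")
    return math.log(1.0 / (1.0 - confidence)) / margin


def margin_for_epsilon(epsilon, confidence=0.95):
    """Inverse of epsilon_for_margin: the +/- range implied by a given epsilon."""
    if epsilon <= 0 or not 0 < confidence < 1:
        raise ValueError("epsilon must be > 0 and confidence in (0, 1)")
    return math.log(1.0 / (1.0 - confidence)) / epsilon


# The three margins shown in the gallery, under the assumptions above.
for margin in (3, 9, 50):
    print(f"+/- {margin}: epsilon = {epsilon_for_margin(margin):.3f}")
```

A wider margin corresponds to a smaller epsilon, meaning less privacy currency is spent on that map.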

To select a map, first pan and zoom the map to show the portion you’re interested in, and then click the target icon for a dataset.

Map Maker Target Button

When you click the target, a gallery with previews of the nine pre-generated options is displayed.

As an example, let’s imagine that I’m doing block level analysis, so I’m only interested in the third column:

This sample dataset has a fairly small amount of data, such that in the top cell (+/- 50) and to some extent the middle cell (+/- 9), the noise overwhelms the data. In this case, we would have to consider tuning down the privacy protection towards the +/- 3 cell, in order to have a useful map at that resolution. (For this demo, the noise level is hard-coded.)  The other option is to sacrifice resolution (moving left in the gallery view), so that there are more data points in a given square and they won’t be drowned out by higher noise levels.
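The effect of sacrificing resolution can be seen with a small sketch. The helper names and the tiny grid of counts are made up for illustration. It merges blocks of fine cells into larger cells and compares the average count per cell against a fixed margin of error. Note that the merging is done on the true counts before any noise is added: summing cells that were already noisy would add up their noise as well.

```python
def coarsen(counts, factor):
    """Sum a grid of true counts into factor x factor blocks (hypothetical helper)."""
    rows, cols = len(counts), len(counts[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    return [
        [
            sum(counts[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            for c in range(0, cols, factor)
        ]
        for r in range(0, rows, factor)
    ]


def mean_count(counts):
    """Average number of data points per cell."""
    cells = [v for row in counts for v in row]
    return sum(cells) / len(cells)


# Made-up sparse data: about one point per fine cell.
fine = [
    [1, 0, 2, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 2],
    [0, 1, 1, 0],
]
coarse = coarsen(fine, 2)
margin = 3  # same margin of error applied at both resolutions
for name, grid in (("fine", fine), ("coarse", coarse)):
    print(f"{name}: mean count per cell = {mean_count(grid)}, margin = +/- {margin}")
```

With the same margin, the coarser grid has more points per cell, so the counts stand out more clearly against the noise.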

Once you have selected a grid, you can pan and zoom the map to the desired scale. The legend is currently dynamic such that it will adjust as necessary to the magnitude of the data in your current view.

Comments on Richard Thaler “Show Us the Data. (It’s Ours, After All.)” NYT 4/23/11

Tuesday, April 26th, 2011

Richard Thaler, a professor at the University of Chicago, wrote a piece in the New York Times this weekend with an idea that is dear to CDP’s mission: making data available to the individuals it was collected from.

Particularly because the title of the piece suggests that he is saying exactly what we are saying, I wanted to write a few quick comments to clarify how it is different.

1. It’s great that he’s saying loudly and clearly that the payback for data collection should be the data itself – that’s definitely a key point we’re trying to make with CDP, and not enough people realize how valuable that data is to individuals, and more generally, to the public.

2. However, what Professor Thaler is pushing for is more along the lines of “data portability”, an idea that we agree with at an ethical and moral level but that has some real practical limitations when we start talking about implementation. In my experience, data structures change so rapidly that companies are unable to keep up with how their data is evolving month-to-month. I find it hard to imagine that entire industries could coordinate a standard that could hold together for very long without undermining the very qualities that make data-driven services powerful and innovative.

3. I’m also not sure why Professor Thaler says that the Kerry-McCain Commercial Privacy Bill of Rights Act of 2011 doesn’t cover this issue. My reading of the bill is that it’s covered in the general sense of access to your information – Section 202(4) reads:

to provide any individual to whom the personally identifiable information that is covered information [covered information is essentially anything that is tied to your identity] pertains, and which the covered entity or its service provider stores, appropriate and reasonable-

(A) access to such information; and

(B) mechanisms to correct such information to improve the accuracy of such information;

Perhaps he is simply pointing out the lack of any mention of instituting data standards to enable portability, as opposed to simply instituting standards around data transparency.

I have a long post about the bill that is not quite ready to put out there, and it does have a lot of issues, but I didn’t think that was one of them.


In The Mix…predicting the future; releasing healthcare claims; and $1.5 million awarded to data privacy

Tuesday, November 30th, 2010

Some people out there think they can predict the future by scraping content off the web. Does it work simply because web 2.0 technologies are great at creating echo chambers? Is this just another way of amplifying that echo chamber and generating yet more self-fulfilling trend prophecies? See the Future with a Search (MIT Technology Review)

The U.S. Office of Personnel Management wants to create a huge database that contains the healthcare claims of millions of people. Many are concerned about how the data will be protected and used. More federal health database details coming following privacy alarm (Computer World)

Researchers at Purdue were awarded $1.5 million to investigate how well current techniques for anonymizing data are working and whether there’s a need for better methods. It would be interesting to know what they think of differential privacy. They  appear to be actually doing the dirty work of figuring out whether theoretical re-identification is more than just a theory. National Science Foundation Funds Purdue Data-Anonymization Project (Threat Post)

@IAPP Privacy Foo Camp 2010: What Is Anonymous Enough?

Tuesday, October 26th, 2010

Editor’s Note: Becky Pezely is an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Becky’s work, like Tony’s, touches on many of the privacy challenges that CDP hopes to address with the datatrust.  We’re happy to have her guest blogging about IAPP Academy 2010 here.

Several weeks ago we attended the 2010 Global Privacy Summit (IAPP 2010) in Baltimore, Maryland.   

In addition to some engaging high-profile keynotes – including FTC Bureau of Consumer Protection Director David Vladeck – we got to participate in the first ever IAPP Foo Camp.

The Foo Camp was comprised of four discussion topics aimed at covering the top technology concerns facing a wide range of privacy professionals.

The session we ran was titled “Low Impact Data Mining”.  The intention was to discuss, and better understand, the current challenges in managing data within an organization, all with a lens on managing data in a way that is “low impact” on resources while returning “high (positive) impact” on the business.

The individuals in our group represented a vast array of industries including: financial services, insurance, pharmaceutical, law enforcement, online marketing, health care, retail and telecommunications.  It was fascinating that, even across such a wide range of industries, there could be such a pervasive set of privacy challenges that were common among them.

Starting with:

What is “anonymous enough”?

If all you need is gender, zip code and birthdate to re-identify someone then what data, when released, is truly “anonymous enough”?  Can a baseline be defined, and enforced, within our organization that ensures customer protection?

It feels safe to say that this was the root challenge from which all the others stemmed.  Today the release of data is mostly controlled, and subsequently managed, by a trusted person(s). These individuals are the ones responsible for “sanitizing” the data that gets released internally, or externally, to the organization.  They are charged with managing the release of data to fulfill everything from understanding business performance to fulfilling business obligations with partners.  And their primary concern is to know how well they are protecting their customers’ information, not only from the perspective of company policy, but also from a perspective of personal morals. They are the gatekeepers for assessing the level of protection provided based on which data they released to whom, and they want to have some guarantee that what they are releasing is “anonymous enough” to have the level of protection they want to achieve.  These gatekeepers want to know when the data they release is “anonymous enough” and how they can employ a definition, or measurement, that guarantees the right level of anonymity for their customers.

This challenge compounds for these individuals, and their organizations, when adding in various other truths of the nature of data today:

The silos are getting joined.

The convention used to be that data within an organization was in a silo – all on its own and protected – such that anyone looking at the data would only see that set of data.  Now, it’s starting to become the reality that these data sets are getting joined, and it’s not always known where, when, how, or with whom the join originated. Nor is it known where the joined data set is currently stored since it was modified from its original silo.  Soon that joined data set takes on a life of its own and makes its way around the institution.  Given the likelihood of this occurring, how can the person(s) responsible for being the gatekeeper(s) of the data, and assessing the level of protection provided to customers, do so with any kind of reliable measurement that guarantees the right level of anonymity?

And now there’s data in the public market.

Not only is the data joined with data (from other silos) within the organization, but also with data outside the organization sold in the public market.  This prospect has increased the ability for organizations to produce data that is “high impact” for the business – because they now know WAY MORE about their customers.  But does the benefit outweigh the liability? As the ability to know more about individual customers increases, so does the level of sensitivity and the concern for privacy.  How do organizations successfully navigate mounting privacy concerns as they move from silos, to joined silos, to joined silos combined with public data?

The line between “data analytics” and looking at “raw data” is blurring.

Because the data is richer, and more plentiful, the act of data analysis isn’t as benign as it might once have been.  The definition of “data analytics” has evolved from something high-level (to know, for example, how many new customers are using the service this quarter) to something that  looks a lot more like looking at raw data to target specific parts of their business to specific customers (to, for example, sell <these products> to customers that make <this much money>, are females ages 30 – 35 and live in <this neighborhood> and typically spend <this much> on <these types of products>, etc…).

And the data has different ways of exiting the system.

The truth is, as scary as this data can be, everyone wants to get their hands on it, because the data leads to awareness that is meaningful and valuable for the business.  Thus, the data is shared everywhere – inside and outside the organization.  With that fact, a whole set of challenges emerges when considering all the ways data might be exiting any given “silo”, such as: Where is all the data going?  How is it getting modified (joined, sanitized, rejoined) and at which point is it no longer the data that needs to be protected by the organization? How much data needs to be released externally to fulfill partner/customer business obligations? Once the data has exited, can the organization’s privacy practices still be enforced?

Brand affects privacy policy.  Privacy policy affects brand.

Privacy is a concern of the whole business, not just the resources that manage the data, nor solely the resources that manage liability.  In the event of a “big oopsie” where there is a data/privacy breach, it will be the communication with customers before, during and after the incident that determines the internal and external impact on the brand and the perception of the organization.  And that communication is dictated by both what the privacy policy enforces and what the brand “allows”.  In today’s age of data, how can an organization have an open dialog with customers about their data if the brand does not support having that kind of a conversation?  No surprise that Facebook is the exemplary case for this: Facebook continues to pave a new path, and draw customers, to share and disclose more information about themselves.  As a result they have experienced the backlash from customers when they take it too far. The line of communication is very open – customers have a clear way to lash back when Facebook has gone too far, and Facebook has a way of visibly standing behind their decision or admitting their mistake.  Either way, it is now commonplace for Facebook’s customers to expect that there will be more incidents like this and that Facebook has a way (apparently suitable enough to keep most customers) of dealing with it.  Their “policy” allowed them to respond this way, and now it’s become a part of who Facebook is.  And now the policy evolves to support this behavior moving forward.

In the discussion of data and privacy, it seems inherently obvious that the mountain of challenges we face is large, complicated and impacts the core of all our businesses.  Nonetheless, it is still fascinating to have been able to witness first-hand – and to now be able to specifically articulate – how similar the challenges are across a diverse group of businesses and how similar the concerns are across job-function. 

We want to re-thank everyone from IAPP that joined in on the discussions that we had at Foo Camp and throughout the conference.  We look forward to an opportunity to deep dive into these types of problems.

Post Script: Meanwhile, the challenges, and related questions, around the anonymization of data with some kind of measurable privacy guarantee that came up at Foo Camp are ones that we have been discussing on our blog for quite some time.  These are precisely the sorts of challenges that have motivated us to create a datatrust.  While we typically envision the datatrust being used in scenarios where there isn’t direct access to data, we walked away with specific examples from our discussions at IAPP Foo Camp where direct access to the data is required – particularly to fulfill business obligations – as a type of collateral (or currency).

The concept of data as the new currency of today’s economy has emerged.  Not only did it come up at the IAPP Foo Camp, it also came up back in August where we heard Marc Davis talk about this at IPP 2010. With all of this in mind, it is interesting to evaluate the possibility of the datatrust being able to act as a special type of data broker in these exchanges.  The idea is that the datatrust is a sanctioned data broker (by the industry, or possibly by the government) that inherently meets federal, local, and municipal regulations and protects the consumers of business partners who want to exchange data as “currency,” while relieving businesses and their partners of the headaches of managing data use/reuse.  The “tax” on using the service is that these aggregates are stored and made available to the public to query in the way we imagine (no direct access to the data) for policy-making and research.  This is something that feels compelling to us and will influence our thinking as we continue to move forward with our work.

In the mix…new organizational structures, giant list of data brokers, governments sharing citizens’ financial data, and what IT security has to do with Lady Gaga

Friday, July 9th, 2010

1) More on new kinds of organizational structures for entities that want to form for philanthropic purposes but do not fit into the IRS definition of a nonprofit.

2) CDT shone a spotlight on Spokeo, a data broker, last week.  Who are other data brokers? Don’t be shocked, there are A LOT of them.  What they do, they mainly do out of the spotlight shone on companies like Facebook, but with very real effects.  In 2005, ChoicePoint sold data to identity thieves posing as a legitimate business.

3) The U.S. has come to an agreement with Europe on sharing finance data, which the U.S. argues is an essential tool of counterterrorism.  The article doesn’t say exactly how these investigations work, whether specific suspects are targeted or whether large amounts of financial data are combed for suspicious activity.  It does make me wonder, given how data crosses borders more easily than any other resource, how will Fourth Amendment protections in the U.S. (and similar protections in other countries) apply to these international data exchanges?  There is also this pithy quote:

Giving passengers a way to challenge the sharing of their personal data in United States courts is a key demand of privacy advocates in Europe — though it is not clear under what circumstances passengers would learn that their records were being misused or were inaccurate.

4) Don’t mean to focus so much on scary data stuff, but 41% of IT professionals admit to abusing privileges.  In a related vein, it turns out a disgruntled soldier accused of illegally downloading classified data managed to do it by disguising his CDs as Lady Gaga CDs.  Even better,

He was able to avoid detection not because he kept a poker face, they said, but apparently because he hummed and lip-synched to Lady Gaga songs to make it appear that he was using the classified computer’s CD player to listen to music.

The New York Times is definitely getting cheekier.

In the mix…philanthropic entities, who’s online doing what, data brokers, and data portability

Monday, July 5th, 2010

1) Mimi and I are constantly discussing what it means to be a nonprofit organization, whether it’s a legal definition or a philosophical one.  We both agree, though, that our current system is pretty narrow, which is why it’s interesting to see states considering new kinds of entities, like the low-profit LLC.

2) This graphic of who’s online and what they’re doing isn’t going to tell you anything you don’t already know, but I like the way it breaks down the different ways to be online.  (via FlowingData) At CDP, as we work on creating a community for the datatrust, we want to create avenues for different levels of participation.  I’d be curious to see this updated for 2010, and to see if and how people transition from being passive users to more active users of the internet.

3) CDT has filed a complaint against Spokeo, a data broker, alleging, “Consumers have no access to the data underlying Spokeo’s conclusions, are not informed of adverse determinations based on that data, and have no opportunity to learn who has accessed their profiles.” We’ve been wondering when people would start to look at data businesses, which have even less reason to care about individuals’ privacy than businesses with customers like Google and Facebook.  We’re interested to see what happens.

4) The Data Portability Project is advocating for every site to have a Portability Policy that states clearly what data visitors can take in and take out. The organization believes “a lot more economic value could be created if sites realized the opportunity of an Internet whose sites do not put borders around people’s data.” (via Techcrunch)  It definitely makes sense to create standards, though I do wonder how standards and icons like the ones they propose would be useful to the average internet user.

A big update for the Common Data Project

Tuesday, June 29th, 2010

There’s been a lot going on at the Common Data Project, and it can be hard to keep track.  Here’s a quick recap.

Our Mission

The Common Data Project’s mission is to encourage and enable the disclosure of personal data for public use and research.

We live in a world where data is obviously valuable — companies make millions from data, nonprofits seek new ways to be more accountable, advocates push governments to make their data open.  But even as more data becomes accessible, even more valuable data remains locked up and unavailable to researchers, nonprofit organizations, businesses, and the general public.

We are working on creating a datatrust, a nonprofit data bank, that would incorporate new technologies for open data and new standards for collecting and sharing personal data.

We’ve refined what that means, what the datatrust is and what the datatrust is not.

Our Work

We’ve been working in partnership with Shan Gao Ma (SGM), a consultancy started by CDP founder, Alex Selkirk, that specializes in large-scale data collection systems, to develop a prototype of the datatrust.  The datatrust is a new technology platform that allows the release of sensitive data in “raw form” to the public with a measurable and therefore enforceable privacy guarantee.

In addition to this real privacy guarantee, the datatrust eliminates the need to “scrub” data before it’s released.  Right now, any organization that wants to release sensitive data has to spend a lot of time scrubbing and de-identifying data, using techniques that are frankly inexact and possibly ineffective.  The datatrust, in other words, could make real-time data possible.

Furthermore, the data that is released can be accessed in flexible, creative ways.  Right now, sensitive data is aggregated and released as statistics.  A public health official may have access to data that shows how many people are “obese” in a county, but she can’t “ask” how many people are “obese” within a 10-mile radius of a McDonald’s.

We have a demo of PINQ: an illustration of how you can safely query a sensitive data set through differential privacy, a relatively new, quantitative approach to protecting privacy.

We’ve also developed an accompanying privacy risk calculator to help us visualize the consequences of tweaking different levers in differential privacy.

For CDP, improved privacy technology is only one part of the datatrust concept.

We’ve also been working on a number of organizational and policy issues:

A Quantifiable Privacy Guarantee: We are working through how differential privacy can actually yield a “measurable privacy guarantee” that is meaningful to the layman. (Thus far, it has been only a theoretical possibility. A specific “quantity” for the so-called “measurable privacy guarantee” has yet to be agreed upon by the research community.)

Building Community and Self-Governance: We’re wrapping up a blog series looking at online information-sharing communities and self-governance structures and how lessons learned from the past few years of experimentation in user-generated and user-monitored content can apply to a data-sharing community built around a datatrust.

We’ve also started outlining the governance questions we have to answer as we move forward, including who builds the technology, who governs the datatrust, and how we will monitor and prevent the datatrust from veering from its mission.  We know that this is an organization that must be transparent if it is to be trusted, and we are working on creating the kind of infrastructure that will make transparency inevitable.

Licensing Personal Information: We proposed a “Creative Commons” style license for sharing personal data and we’re following the work of others developing licenses for data. In particular, what does it mean to “give up” personal information to a third-party?

Privacy Policies: We published a guide to reading online privacy policies for the curious layman: an analysis of their pitfalls and ambiguities, which was re-published by the IAPP and picked up by the popular technology blog, Read Write Web.

We’ve also started researching the issues we need to address to develop our own privacy policy.  In particular, we’ve been working on figuring out how we will deal with government requests for information.  We did some research into existing privacy law, both constitutional and statutory, but in many ways, we’ve found more questions than answers.  We’re interested in watching the progress of the Digital Due Process coalition as they work on reforming the Electronic Communications Privacy Act, but we anticipate that the datatrust will have to deal with issues that are more complex than an individual’s expectation of privacy in emails more than 180 days old.

Education: We regularly publish in-depth essays and news commentary on our blog, covering topics such as the risk of re-identification with current methods of anonymization and the value of open datasets that are available for creative reuse.

We have a lot to work on, but we’re excited to move forward!

In the mix…data for coupons, information literacy, most-visited sites

Friday, June 4th, 2010

1) There’s obviously an increasing move to a model of data collection in which the company says, “give us your data and get something in return,” a quid pro quo.  But as Marc Rotenberg at EPIC points out,

The big problem is that these business models are not very stable. Companies set out privacy policies, consumers disclose data, and then the action begins…The business model changes. The companies simply want the data, and the consumer benefit disappears.

It’s not enough to start with compensating consumers for their data.  The persistent, shareable nature of data makes it very different from a transaction involving money, where someone can buy, walk away, and never interact with the company again.  These data-centered companies are creating a network of users whose data are continually used in the business.  Maybe it’s time for a new model of business, where governance plans incorporate ways for users to be involved in decisions about their data.

2) In a related vein, danah boyd argues that transparency should not be an end in itself, and that information literacy needs to be developed in conjunction with information access.  A similar argument can be made about the concept of privacy.  In “real life” (i.e., offline life), no one aims for total privacy.  Every day, we make decisions about what we want to share with whom.  Online, total privacy and “anonymization” are also impossible, no matter what the company promises in its privacy policy.  For our datatrust, we’re going to use PINQ, a technology using differential privacy, that acknowledges privacy is not binary, but something one spends.  So perhaps we’ll need to work on privacy and data literacy as well?

3) Google recently released a list of the most visited sites on the Internet. Two questions jump out: a) Where is Google on this list? and b) Could the list be a proxy for the biggest data collectors online?

Measuring the privacy cost of “free” services.

Wednesday, June 2nd, 2010

There was an interesting pair of pieces on this Sunday’s “On The Media.”

The first was “The Cost of Privacy,” a discussion of Facebook’s new privacy settings, which presumably make it easier for users to clamp down on what’s shared.

A few points that resonated with us:

  1. Privacy is a commodity we all trade for things we want (e.g. celebrity, discounts, free online services).
  2. Going down the path of having us all set privacy controls everywhere we go on the internet is impractical and unsustainable.
  3. If no one is willing to share their data, most of the services we love to get for free would disappear. (Randall Rothenberg)
  4. The services collecting and using data don’t really care about you the individual, they only care about trends and aggregates. (Dr. Paul H. Rubin)

We wish one of the interviewees had gone even further and made the point that, since we all make decisions every day to trade a little bit of privacy in exchange for services, privacy policies really need to be built around notions of buying and paying: what you “buy” are services, and what you pay with are “units” of privacy risk (as in risk of exposure).

  1. Here’s what you get in exchange for letting us collect data about you.
  2. Here’s the privacy cost of what you’re getting (in meaningful and quantifiable terms).

(And no, we don’t believe that deleting data after 6 months and/or listing out all the ways your data will be used is an acceptable proxy for calculating “privacy cost.” Besides, such policies inevitably severely limit the utility of data and stifle innovation to boot.)

Gaining clarity around privacy cost is exactly where we’re headed with the datatrust. What’s going to make our privacy policy stand out is not that our privacy “guarantee” will be 100% ironclad.

We can’t guarantee total anonymity. No one can. Instead, what we’re offering is an actual way to “quantify” privacy risk so that we can track and measure the cost of each use of your data and we can “guarantee” that we will never use more than the amount you agreed to.

This in turn is what will allow us to make some measurable guarantees around the “maximum amount of privacy risk” you will be exposed to by having your data in the datatrust.
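To make the accounting idea concrete, here is a minimal sketch of what "tracking the cost of each use" could look like. It is an illustration only, not the datatrust's or PINQ's actual code: the names `PrivacyLedger`, `BudgetExceeded` and `noisy_count` are hypothetical, and it assumes the simplest rule from differential privacy, where the costs (epsilons) of successive queries add up and a count is answered with Laplace noise of scale 1/epsilon.

```python
import random


class BudgetExceeded(Exception):
    """Raised when a use of the data would exceed the agreed privacy cap."""


class PrivacyLedger:
    """Hypothetical ledger that tracks cumulative privacy cost (epsilon)."""

    def __init__(self, max_epsilon):
        self.max_epsilon = max_epsilon  # the amount of risk agreed to
        self.spent = 0.0                # risk used so far
        self.entries = []               # (purpose, epsilon) for each use

    def charge(self, epsilon, purpose):
        """Record one use of the data, or refuse if it would overspend."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject
        # a charge that lands exactly on the cap.
        if self.spent + epsilon > self.max_epsilon + 1e-12:
            raise BudgetExceeded(
                f"{purpose!r} costs {epsilon}, only {self.remaining()} left"
            )
        self.spent += epsilon
        self.entries.append((purpose, epsilon))

    def remaining(self):
        return max(self.max_epsilon - self.spent, 0.0)


def noisy_count(records, predicate, epsilon, ledger, rng=random):
    """Answer a counting query with Laplace noise, charging the ledger first."""
    ledger.charge(epsilon, "count")
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two exponential draws with rate epsilon is a
    # Laplace draw with scale 1/epsilon, the usual noise for a count
    # (adding or removing one person changes a count by at most 1).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise
```

A smaller epsilon means a cheaper query in privacy terms but a noisier answer, which is the accuracy-versus-risk trade described above. Once the ledger is empty, further queries are refused rather than answered.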

The second segment was on privacy rights and issues of due process vis-a-vis the government and data-mining.

Kevin Bankston from EFF gave a good run-down of how ECPA is laughably ill-equipped to protect individuals using modern-day online services from unprincipled government intrusions.

One point that wasn’t made was that, compared with search and seizure of physical property, the privacy impact of data-mining is easily several orders of magnitude greater. Like most things in the digital realm, it’s incredibly easy to sift through hundreds of thousands of user accounts, whereas it would be impossibly onerous to search 100,000 homes or read 100,000 paper files.

This is why we disagree with the idea that we should apply old standards created for a physical world to the new realities of the digital one.

Instead, we need to look at actual harm and define new standards around limiting the privacy impact of investigative data-mining.

Again, this would require a quantitative approach to measuring privacy risk.

(Just to be clear, I’m not suggesting that we limit the size of the datasets being mined, that would defeat the purpose of data-mining. Rather, I’m talking about process guidelines for how to go about doing low-(privacy) impact data-mining. More to come on this topic.)

In the mix…Everyone’s obsessed with Facebook

Friday, May 7th, 2010

UPDATE: One more Facebook-related bit, a great graphic by Matt McKeon illustrating how Facebook’s default sharing settings have changed over the past five years. Highly recommend that you click through and watch how the wheel changes.

1) I love when other people agree with me, especially on subjects like Facebook’s continuing clashes with privacy advocates. Says danah boyd,

Facebook started out with a strong promise of privacy…You had to be at a university or some network to sign up. That’s part of how it competed with other social networks, by being the anti-MySpace.

2) EFF has a striking post on the changes made to Facebook’s privacy policy over the last five years.

3) There’s a new app for people who are worried about Facebook having their data, but it means you have to hand it over to this company, which also states that it “may use your info to serve up ads that target your interests.” Hmm.

4) Consumer Reports is worried that we’re oversharing, but if we followed all its tips on how to be safe, what would be the point of being on a social network? On its list of things we shouldn’t do:

  • Posting a child’s name in a caption
  • Mentioning being away from home
  • Letting yourself be found by a search engine

What’s the fun of Facebook if you can’t brag about the pina colada you’re drinking on the beach right at that moment? I’m joking, but this list just underscores that we can’t expect to control safety issues solely through consumer choices. Another thing we shouldn’t do is put our full birthdate on display, though given how many people put details about their education, it wouldn’t necessarily be hard to guess which year someone was born. Consumer Reports is clearly focusing on its job, warning consumers, but it’s increasingly obvious privacy is not just a matter of personal responsibility.

5) In a related vein, there’s an interesting Wall St. Journal article on whether the Internet is increasing public humiliation. One WSJ reader, Paul Cooper, had this to say:

The simple rule here is that one should always assume that everything one does will someday be made public. Behave accordingly. Don’t do or say things you don’t want reported or repeated. At least not where anyone can see or hear you doing it. Ask yourself whether you trust the person who wants to take nude pictures of you before you let them take the pictures. It is not society’s job to protect your reputation; it’s your job. If you choose to act like a buffoon, chances are someone is going to notice.

Like I said above, in a world where the word “public” means really really public forever and ever, and “private” means whatever you manage to keep hidden from everyone you know, protecting “privacy” isn’t only a matter of personal responsibility. The Internet easily takes actions that are appropriate in certain contexts and republishes them in other contexts. People change, which is part of the fun of being human. Even if you’re not ashamed of your past, you may not want it following you around in persistent web form.

Perhaps on the bright side, we’ll get to a point where we can all agree everyone has done things that are embarrassing at some point and no one can walk around in self-righteous indignation. We’ve seen norms change elsewhere. When Bill Clinton was running for president, he felt compelled to say that he had smoked marijuana but had never inhaled. When Barack Obama ran for president 16 years later, he could say, “I inhaled–that was the point,” and no one blinked.

6) The draft of a federal online privacy bill has been released. In its comments, Truste notes, “The current draft language positions the traditional privacy policy as the go to standard for ‘notice’ — this is both a good and bad thing.” If nothing else, the “How to Read a Privacy Policy” report we published last year had a similar conclusion, that privacy policies are not going to save us.
