Archive for the ‘Protecting Privacy in Meaningful Ways’ Category

So privacy is about control…but what if you don’t even know what you’re controlling?

Friday, April 16th, 2010

It’s becoming practically a mantra, the way it’s being repeated everywhere: privacy is about control.  And a newish location-based social network seems to be taking this to heart.  As ReadWriteWeb describes, Rally Up has settings that allow you to control how information is being disseminated to your “real friends.”  Definitely interesting.

But then there’s the week’s big news about the Library of Congress archiving Twitter.  Not surprisingly, some people are nervous.  Even if all the Tweets archived were public Tweets, it’s unclear if those people equated the public nature of their Tweets with consent to being archived.  Even those who actively, purposefully, consciously were public in their Tweets may have ended up revealing more information about themselves than they intended.  Reading the pattern of tweets could allow researchers and others to deduce information people didn’t know they were broadcasting.  If that seems implausible, keep in mind that the data is valuable precisely because there’s information in it that’s not immediately obvious.

The difficulty with privacy in our age is that embarrassing things, which people have been doing since the dawn of time, are now so easily memorialized and stored for a very, very long time.  Fifteen years ago, you could do something stupid on spring break and at worst, be a laughingstock among your friends and their friends.  Now, a photo of you doing something stupid could stick around and impact your life years later, when a potential employer is checking you out.

It’s definitely troubling, and not an issue that can be resolved through multiple-choice settings alone.  On the other hand, maybe we’ll all just get used to it once we’ve been living in that world long enough.  Eventually, the generation that has embarrassing photos on Facebook will grow up and be hiring people themselves.  Maybe they won’t care so much when they find a drunken photo of a potential hire.

Completely not there versus almost not there.

Wednesday, April 14th, 2010


Picture taken by Stephan Delange

In my last post, where I tried to quantify the concept of “discernibility,” I left off at the point where I said I was going to try out my “50/50” definition on the PINQ implementation of differential privacy.

It turned out to be a rather painful process, both because I can be rather literal-minded in an unhelpful way at times and because it is plain hard to figure this stuff out.

To backtrack a bit, let’s first make some rather obvious statements to get a running start in preparation for wading through some truly non-obvious ones.

Crossing the discernibility line.

In the extreme case, we know that if there was no privacy protection whatsoever and the datatrust just gave out straight answers, then we would definitely cross the “discernibility line” and violate our privacy guarantee. So let’s go back to my pirate friend again and ask, “How many people with skeletons in their closet wear an eye-patch and live in my building?” If you (my rather distinctive eye-patch wearing neighbor) exist in the data set, the answer will be 1. If you are not in the data set, the answer will be 0.

With no privacy protection, the presence or absence of your record in the data set makes a huge difference to the answers I get and is therefore extremely discernible.

Thankfully, PINQ doesn’t give straight answers. It adds “noise” to answers to obfuscate them.

Now when I ask, “How many people in this data set of people with skeletons in their closet wear an eye-patch and live in my building?” PINQ counts the number of people who meet these criteria and then decides to either “remove” some of those people or “add” some “fake” people to give me a “noisy” answer to my question.

How it chooses to do so is governed by a distribution curve named for the French marquis Pierre-Simon Laplace. (I don’t know why it has to be this particular curve, but I am curious to learn why.)

You can see the curve illustrated below in two distinct postures that illustrate very little privacy protection and quite a lot of privacy protection, respectively.

  • The point of the curve is centered on the “real answer.”
  • The width of the curve shows the range of possible “noisy answers” PINQ will choose from.
  • The height of the curve shows the relative probability of one noisy answer being chosen over another noisy answer.
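The noise-adding step described above can be sketched in a few lines of Python. This is only an illustration of Laplace noise in general, not PINQ's actual code (PINQ is a C#/LINQ library); the names `laplace_noise` and `noisy_count` and the two scale values are assumptions made up for the example. Note that this sketch returns fractional answers rather than whole "fake" people.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample of Laplace noise centered on 0.

    The difference of two independent exponential draws, each with mean
    `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, scale: float, rng: random.Random) -> float:
    """Return the real answer plus Laplace noise (hypothetical helper)."""
    return true_count + laplace_noise(scale, rng)


rng = random.Random(2010)  # fixed seed so the run is repeatable
for scale in (0.5, 5.0):   # a "quiet" curve and a "noisy" curve (assumed values)
    answers = [round(noisy_count(1, scale, rng), 1) for _ in range(5)]
    print(f"scale={scale}: {answers}")
```

A small scale keeps the noisy answers clustered near the real answer of 1; a large scale spreads them widely on either side.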

A quiet curve with few “fake” answers for PINQ to choose from:

A noisy curve with many “fake” answers for PINQ to choose from:

More noise equals less discernibility.

It’s easy to wave your hands around and see in your mind’s eye how if you randomly add and remove people from “real answers” to questions, as you turn up the amount of noise you’re adding, the presence or absence of a particular record becomes increasingly irrelevant and therefore increasingly indiscernible. This in turn means that it will also be increasingly difficult to confidently isolate and identify a particular individual in the data set precisely because you can’t really ever get a “straight” answer out of PINQ that is accurate down to the individual.

With differential privacy, I can’t ever know that my eye-patch wearing neighbor has a skeleton in his closet. I can only conclude that he might or might not be in the dataset to varying degrees of certainty depending on how much noise is applied to the “real answer.”

Below, you can see how if you get a noisy answer of 2, it is about 7x more likely that the “real answer” is 1, than that the “real answer” is 0. A flatter, more noisy curve would yield a substantially smaller margin.
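The comparison above can be reproduced numerically. The sketch below assumes plain Laplace noise and picks a scale of 1/ln(7) purely because it yields a ratio of 7; the scale behind the original illustration is not stated, and the function names are hypothetical.

```python
import math


def laplace_density(noisy_answer: float, real_answer: float, scale: float) -> float:
    """Height of the Laplace curve centered on `real_answer` at `noisy_answer`."""
    return math.exp(-abs(noisy_answer - real_answer) / scale) / (2.0 * scale)


def likelihood_ratio(noisy_answer: float, candidate_a: float,
                     candidate_b: float, scale: float) -> float:
    """How much more likely the noisy answer is under candidate_a than candidate_b."""
    return (laplace_density(noisy_answer, candidate_a, scale)
            / laplace_density(noisy_answer, candidate_b, scale))


quiet_scale = 1.0 / math.log(7.0)  # assumed scale, chosen to give a ratio of 7
flat_scale = 5.0                   # assumed scale for a flatter, noisier curve

print(likelihood_ratio(2, 1, 0, quiet_scale))  # about 7
print(likelihood_ratio(2, 1, 0, flat_scale))   # about 1.22
```

With the flatter curve, a noisy answer of 2 barely favors a real answer of 1 over a real answer of 0.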

But wait a minute, we started out saying that our privacy guarantee promises that individuals will be completely non-discernible. Is non-discernible the same thing as hardly discernible?

Clearly not.

Is complete indiscernibility even possible with differential privacy?

Apparently not…
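One way to see why, assuming the noise is Laplace noise with scale b added to a count: write p(y | c) for the height of the curve at noisy answer y when the real answer is c. Comparing a real answer of 1 against a real answer of 0 gives:

```latex
p(y \mid c) = \frac{1}{2b}\, e^{-|y - c| / b}
\qquad\Longrightarrow\qquad
\frac{p(y \mid 1)}{p(y \mid 0)} = e^{(|y| - |y - 1|)/b},
\qquad
e^{-1/b} \;\le\; \frac{p(y \mid 1)}{p(y \mid 0)} \;\le\; e^{1/b}.
```

For any finite b, the upper bound is greater than 1, so some noisy answers always favor one real answer over the other, if only slightly. The ratio becomes 1 for every y only in the limit as b grows without bound, at which point the answers are pure noise.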

On the question of “Discernibility”

Tuesday, April 13th, 2010

Where’s Waldo?

In my last post about PINQ and meaningful privacy guarantees, we defined “privacy guarantee” as a guarantee that the presence or absence of a single record will not be discernible.

Sounds reasonable enough, until you ask yourself, what exactly do we mean by “discernible”? And by “exactly”, I mean, “quantitatively” what do we mean by “discernible”? After all, differential privacy’s central value proposition is that it’s going to bring quantifiable, accountable math to bear on privacy, an area of policy that heretofore has been largely preoccupied with placing limitations on collecting and storing data or fine-print legalese and bald-faced marketing.

However, PINQ (a Microsoft Research implementation of differential privacy we’ve been working with) doesn’t have a built-in mathematical definition of “discernible” either. A human being (aka one of us) has to do that.

A human endeavors to come up with a machine definition of discernibility.

At our symposium last Fall, we talked about using a legal-ish framework for addressing this very issue of discernibility: Reasonable Suspicion, Probable Cause, Preponderance of Evidence, Clear and Convincing Evidence, Beyond a Reasonable Doubt.

Even if we decided to use such a framework, we would still need to figure out how these legal concepts translate into something quantifiable that PINQ can work with.

“Not Discernible” means seeing 50/50.

My initial reaction when I first started thinking about this problem was that clearly, discernibility or lack thereof needed to revolve around some concept of 50/50, as in “odds of,” “chances are.”

Whatever answer you got out of PINQ, you should never get even a hint of an idea that any one number was more likely to be the real answer than the numbers on either side of that number. (In other words, x and x ± 1 should be equally likely candidates for “real answerhood.”)

Testing discernibility with a “Worst-Case Scenario”

I ask a rather “pointed” question about my neighbor, one that essentially amounts to “Is so-and-so in this data set? Yes or no?” without actually naming names (or social security numbers, email addresses, cell phone numbers or any other unique identifiers). e.g. “How many people in this data set of ‘people with skeletons in their closet’ wear an eye-patch and live in my building?” Ideally, I should walk away with an answer that says,

“You know what, your guess is as good as mine, it is just as likely that the answer is 0, as it is that the answer is 1.”

In such a situation, I would be comfortable saying that I have received ZERO ADDITIONAL INFORMATION on the question of a certain eye-patched individual in my building and whether or not he has skeletons in his closet. I may as well have tossed a coin. My pirate neighbor is truly invisible in the dataset, if indeed he’s in there at all.
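The “zero additional information” idea can be phrased as a small Bayes' rule calculation: if an answer is exactly as likely when my neighbor is in the data set as when he is not, my belief afterwards equals my belief beforehand. The sketch below is an illustration only; the function name and the example numbers are assumptions, not anything PINQ provides.

```python
def posterior_in_dataset(prior: float, likelihood_ratio: float) -> float:
    """Update the belief that a record is in the data set (Bayes' rule, odds form).

    `likelihood_ratio` is how much more likely the observed answer is when the
    record is present than when it is absent.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    posterior_odds = (prior / (1.0 - prior)) * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# A ratio of 1 is the "50/50" case: the answer changes nothing.
print(posterior_in_dataset(0.5, 1.0))  # 0.5

# An arbitrary ratio of 7 moves a coin-toss belief to 7 chances in 8.
print(posterior_in_dataset(0.5, 7.0))  # 0.875
```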

Armed with this idea, I set out to understand how this might be implemented with differential privacy...

Can we reconcile the goals of increased government transparency and more individual privacy?

Tuesday, April 13th, 2010

I really appreciate the Sunlight Foundation’s continuing series on new data sets being made public by the federal government as part of the Open Government Directive.  Yesterday, I found out the Centers for Medicare & Medicaid Services will be releasing all kinds of new goodies.  As the Sunlight Foundation points out, the data so far is lacking granularity — comparisons of Medicare spending by state, rather than county.  But still all very exciting.

Yet not a single mention of privacy.  Even though, according to the blogger, the new claims database will include data for 5% of Medicare recipients.  After “strip[ping] all personal identification data out,” the database will “present it by service type (inpatient, outpatient, home health, prescription drug, etc.)” As privacy advocates have noted, that’s probably not going to do enough to anonymize it.

I don’t really mind not hearing about privacy every time someone talks about a database.  But it’s sort of funny.  Every day, I read a bunch of blogs on open data and government transparency, as well as a bunch of blogs on privacy issues.  But I rarely read about both issues in the same place.  Shouldn’t we all be talking to each other more?

Yea or Nay: Credit Checks on Job Applicants

Monday, April 12th, 2010

Should employers continue to be allowed to check your credit history as a part of the job application process?

The biggest argument against this appears to be the lack of evidence showing a connection between credit history and job performance.

Sort of interesting to think about this in the context of other things employers ask about that may or may not have anything to do with job performance.

  1. Have you ever set a world record in anything?
  2. Do you play World of Warcraft?
  3. You have one fox and two chickens…

In the mix

Monday, April 5th, 2010

1) Slate had an interesting take on the bullying story in Massachusetts and the prosecutor’s anger at Facebook for not providing information, i.e., evidence of the bullying.  Apparently, Facebook provided basic subscriber information, but resisted providing more without a search warrant.  Emily Bazelon points out how this area of law is murky, and references the coalition forming around reforming the Electronic Communications Privacy Act, but her larger point is an extra-legal one.  The evidence of bullying the DA was looking for was at one point public, even if eventually deleted. She points out that kids or parents who are upset could take screenshots and preserve evidence themselves, though it may be hard for them to have the presence of mind to do so.

The case raises a lot of interesting questions about anonymity, privacy, and the values we have online.  Anonymity on the Internet has been a rallying cry for so many people, but I wonder, if something is illegal in the offline world, should it suddenly be legal online because you can be anonymous and avoid prosecution?  (Sexual harassment is a crime in the subway, too!)  We now live in a world where many of us occupy space both online and offline.  We used to think of them as completely separate spaces, and it’s true that the Internet gives us opportunities to do things, both good and bad, that we wouldn’t have offline.  But it’s increasingly obvious that we need to transfer some of the rules we have about the offline world into the online one.  For disability rights advocates, that includes pushing the definition of “public accommodation” to include online stores like Target, and suing them if their sites are not accessible to the blind using screen readers.  For privacy advocates, that includes acknowledging that people have an expectation of privacy in their emails as well as their snail mail.  Free speech in the offline world doesn’t mean you can say anything you want anywhere you want.  Maybe it’s time to be more nuanced about how we protect free speech online as well.

2) It turns out Twitter is pretty good at predicting box office returns – what else might it predict?

3) Cases like this amaze me, because the parties are litigating a question that seems like a no-brainer.  A New Jersey court recently held that an employee had an expectation of privacy in her personal Yahoo account, even if she accessed it on a company computer. Would we ever litigate whether an employee had an expectation of privacy in a piece of personal mail she brought to the office and decided to read at her desk?

4) The New York Times is acknowledging their readers’ online comments in separate articles, namely, this one describing readers’ reactions to federal mortgage aid.  It’s a smart way to give online readers a sense that their comments are being read.  I wonder if this is where the “Letters to the Editor” page is going.  I’ve been wondering, who are these readers who are so happy to be the 136th comment on an article?  But the people who write letters to the editor have always been people who have extra time and energy.  In a way, online comments expand the world of people who are willing to write a letter to the editor.

5) Would we feel differently about government data mining if the government were better at it? Mimi and I went to a talk at the NYU Colloquium on Information Technology and Society where Joel Reidenberg, a law professor at Fordham, talked about how transparency of personal information online is eroding the rule of law.  One of the arguments he made against government data mining was that it doesn’t work, with the example of airport security, its inability to stop the underwear bomber, and its terribly inaccurate no-fly lists.  Well, the Obama administration just announced a new system of airport security checks that uses intelligence-based data mining that is meant to be more targeted.  It’s hard to know now whether the new system will be better and smarter, but it raises a point those opposed to data mining don’t seem to consider — what if the government were better at it?  Could data mining be so precise that it avoids racial profiling?  Are there other dangers to consider, and can they be warded off without shutting down data mining altogether?

In the mix

Wednesday, March 31st, 2010

1) Exciting news!  A diverse coalition of left-leaning and right-leaning organizations, as well as a bunch of big corporations, has formed around the goal of revising the Electronic Communications Privacy Act.  This law, from 1986, clearly didn’t anticipate the world we live in now, the extent to which we use emails, the “expectation of privacy” we have in email, and the extent to which we store our data and our documents in the cloud.  This law will greatly impact our work at the Common Data Project, but even without a professional stake in this, I’d be pretty excited.  After all, we all (except my mom who doesn’t use computers) have a personal stake in this.

2) The full text of danah boyd’s talk at SXSW is available on her blog.  This is my favorite line:

For the parents and educators in the room… Many of you are struggling to help young people navigate this new world of privacy and publicity, but many of you are confused yourself. The worst thing you can do is start a sentence with “back in my day.” Back in your day doesn’t matter.

It’s an obvious but useful point for privacy and information issues in general.  The ECPA from back in the day of 1986 can’t deal with today.  It’s time to really think, which of our assumptions about privacy still hold true?

3) David Brooks’s column this week got me thinking.  If we agree with him, which I do, that a country’s success cannot be measured simply with things like GDP, what else should we measure and how? My friends who work in social sciences are initially skeptical when I talk about the data collection potential of something like the Common Data Project’s datatrust.  They’re distrustful of self-reported data, even as they acknowledge that their existing methodologies are imperfect.  But with things that are hard to measure, self-reporting is often the only way to go.  The datatrust, the Internet, and its measurable guarantees of privacy could dramatically change how self-reported data is collected, analyzed, and published.

4) Facebook data destroyed: Pete Warden, who had created a database from 210 million public Facebook profiles, was prepared to release the data to social scientists who were fascinated by the potential to research social connections, particularly as mashed up with census data on income, mobility and employment.  But then Facebook said he had violated its terms of use, and unable to defend a potential lawsuit, he destroyed the data.

Argh, isn’t there a better way?  The decision to make one’s profile public may not equal a decision to consent to be in such a database, and Warden’s planned “anonymization” was unlikely to be very robust, but this situation is a perfect example of why the Common Data Project was founded: to create a new norm, with strong privacy and sharing standards, that makes such data truly, safely available.

In the mix

Monday, March 22nd, 2010

1) EFF is posting documents as it gets them indicating how the government uses social networks in law enforcement investigations. The Fourth Amendment is what requires the police to have a search warrant when they come to search your house.  The cases interpreting the Fourth Amendment that led to such requirements were based on expectations of privacy that are rooted in physical spaces.  But as we start to live more of our lives in an online space our founding fathers could never have imagined, how should we change the laws protecting our rights?

2) An overview of the history of people challenging the constitutionality of the U.S. Census. Personally, I love filling out the census form.  I wish I’d gotten the American Community Survey.

3) The Transaction Records Access Clearinghouse, a data research organization at Syracuse University studying federal spending, enforcement, and staffing, recently got a $100,000+ bill for a FOIA request. The bill was based on the calculation that 861 man-hours were required to create a description of what is in the U.S. Citizenship and Immigration Services’ database of claims for U.S. citizenship.  As an immigration lawyer, I used to deal with USCIS all the time, and even I am surprised that the agency would need that much time just to figure out what’s in the database.  You almost hope that the bill was calculated just to rebuff TRAC’s FOIA request, because the alternative, that the database is that screwed up, is pretty awful.

4) danah boyd at Microsoft Research gave the keynote at SXSW on “Privacy and Publicity” last week, challenging the idea that personal information is on a binary spectrum of public and private.  It’s great to hear more and more people making this point, which is at the heart of CDP’s mission.

5) Google now has a service that lets you place your own ad on TV.  Really shockingly simple and easy, and fascinating in light of the growing fear that evil advertisers are taking over our lives.  Would it make a difference if we could all become advertisers, too?

Smart Grid Data: Unexpected and Amazing Reuses?

Tuesday, March 16th, 2010

As noted in “In the Mix,” the Center for Democracy and Technology and the Electronic Frontier Foundation recently issued joint comments to the California Public Utilities Commission regarding proposed policies around the use of smart grids and smart meters.

(via Flowing Data.)

And then a few days later, I saw this: EPCOR, a Canadian water utility company, issued a graph plotting water usage during the Olympic men’s hockey final.  Notice the spikes in water consumption (and toilet flushing) immediately after the first period, second period, third period, and finally when Canada wins the gold medal.

Is this our worst nightmare?  That someone will find out when we’re peeing?

That’s a bad joke. Plotting a large area’s water consumption in aggregate is not the same as what some of these smart meters are able to measure in terms of energy consumption.

But I do have a more serious point to make.  One of the points CDT and EFF make repeatedly in their comments is that we should avoid “unnecessary” data collection and destroy any “unnecessary” data.

What exactly does “unnecessary” mean?

Does it mean any purpose that is not related to the work of a utility company?  Who decides what’s unnecessary and should they decide what’s unnecessary and necessary now?

The beauty of data is that its potential value is unknown.  A single dataset, collected for one purpose, can be used for other purposes that are socially beneficial but rather unexpected.  For example, Google Trends was created for advertisers so that they can track what search terms are popular.  The CDC, however, has been using Google Trends to track flu outbreaks, by watching where people are Googling flu symptoms, data which is more quickly collected than reports from doctors.  The reason governments all over the world are pushing for open data is that we don’t know yet all that can be done.  By giving access to everyone, we expect interesting, useful, imaginative things to come out of the data that we might never have imagined.

Data from the smart grids, in particular, will also require smart visualizations that are easy for individual consumers to understand and access.  Data alone isn’t going to change behavior.  You can imagine open data inviting developers to create easy-to-use apps that allow consumers to easily and painlessly identify ways to reduce energy consumption.  Some may even choose to share that information and compete with others, the way several universities have set up competitions between dorms.  As much as Al Gore was embarrassed by news revealing how much energy his mansion used, others may be eager to brag about how little energy they use.

Can we protect privacy while also creating room for imaginative and innovative reuse of data?

There are definitely privacy issues we have to consider.  I agree with a lot of the points made in CDT and EFF’s comments.  That “customer information” shouldn’t be limited to “personally identifying information.”  The misuse and misapplication of phrases like “personal information” is something we’ve been harping on for a while.  That customers should have access to the data collected from them and the power to correct mistakes.  That law enforcement shouldn’t be allowed to troll this information without a warrant, that civil litigants shouldn’t be allowed to access this information without a court order based on a showing of compelling interest and after notifying the customer to provide her with a chance to object.

But rather than talking about barring “unnecessary” data collection and data use, we should be thinking of ways to make the data safely available, regardless of whether someone has decided it’s necessary or not.  The data from smart grids is going to be both dangerous and valuable because it is so fine-grained; we clearly can’t just plop it online.  Anonymizing data is really hard.  So at CDP, we’re working hard at thinking about ways to come up with measurable privacy guarantees and testing technologies like PINQ that promise to provide access to raw data without indicating the existence of any particular individual in a dataset.  Other organizations may have different ideas.  I’m grateful for the existence of organizations that imagine the worst-case scenarios around data collection to protect our civil rights.  I also hope to see the growth of more organizations that try to imagine the best-case scenarios.

In the mix

Friday, March 12th, 2010

1) The CDC recently used shopper-card data to track a salmonella outbreak that sickened 245 in 44 states.  It turned out the pepper in salami made in Rhode Island was the culprit.  Although the CDC began to suspect through interviews and questionnaires that some sort of Italian meat product was the problem, the people they talked to couldn’t remember precisely what they had bought and the shopper-card records helped them identify the actual product.

Great story, right?  Unless you’re the director of Consumers Against Supermarket Privacy Invasion and Numbering, in which case, the story smacks of privacy invasion by the government.  The CDC got the records with the permission of the account holders, but to Katherine Albrecht and several of the commenters to the Yahoo News Story, that didn’t assuage their fears.

Here’s a choice quote: “I’d rather have a few die from poisoning and then they fix the problem then have the entire country enslaved, thank you very much.”

There was at least one person who pointed out that commenting on a Yahoo news story wasn’t going to do much to preserve their privacy either.

2) MySpace is selling bulk user data! I’m with ReadWriteWeb:

I think the world is an awfully unfair mess and I’m hoping that data analysis will help illuminate some of the hows and the whys. Like the way that real-estate redlining was exposed back in the day by cross referencing census data around racial demographics and housing loan data. That illuminated systematic discrimination against black families in applying for home loans in certain parts of town. So too I think we’ll find a lot of undeniable proof of injustices and clues for how we might deal with them in big data today.

We don’t want another AOL debacle on our hands, but we also don’t want to give up on the possibilities of “big data” because we prematurely assume better privacy-creating techniques and standards aren’t available.

3) My, it’s a privacy-obsessed week!  Here’s one person’s argument “why no one cares about privacy.” It’s a good round-up of pithy quotes from people like Judge Posner, new “talk about me” sites like Blippy.com, and surveys demonstrating the change in the public’s attitude over time.  Wow, in 1998, 80% of people in a Harris poll said they were hesitant to shop online because of privacy worries.

Still, articles like this and the comments to the Yahoo CDC-shopper data article show how much our discussion of privacy involves people yelling at each other across a very big divide.  Is the choice really a binary one?  Privacy + a few deaths versus Big Brother + public health data?  I don’t care if the CDC has access to my grocery records; at the same time, I don’t plan to sign up for Blippy.com and broadcast my purchase of kale and four kinds of cheese this morning.  (Oops, I just did.)  Maybe we should stop talking about “privacy” and start talking about specific situations.
