Archive for the ‘Protecting Privacy in Meaningful Ways’ Category

Measuring the privacy cost of “free” services.

Wednesday, June 2nd, 2010

There was an interesting pair of pieces on this Sunday’s “On The Media.”

The first was “The Cost of Privacy,” a discussion of Facebook’s new privacy settings, which presumably make it easier for users to clamp down on what’s shared.

A few points that resonated with us:

  1. Privacy is a commodity we all trade for things we want (e.g. celebrity, discounts, free online services).
  2. Going down the path of having us all set privacy controls everywhere we go on the internet is impractical and unsustainable.
  3. If no one is willing to share their data, most of the services we love to get for free would disappear. (Randall Rothenberg)
  4. The services collecting and using data don’t really care about you the individual; they only care about trends and aggregates. (Dr. Paul H. Rubin)

We wish one of the interviewees had gone even farther to make the point that since we all make decisions every day to trade a little bit of privacy in exchange for services, privacy policies really need to be built around notions of buying and paying where what you “buy” are services and how you pay for them are with “units” of privacy risk (as in risk of exposure).

  1. Here’s what you get in exchange for letting us collect data about you.
  2. Here’s the privacy cost of what you’re getting (in meaningful and quantifiable terms).

(And no, we don’t believe that deleting data after 6 months and/or listing out all the ways your data will be used is an acceptable proxy for calculating “privacy cost.” Besides, such policies inevitably severely limit the utility of data and stifle innovation to boot.)

Gaining clarity around privacy cost is exactly where we’re headed with the datatrust. What’s going to make our privacy policy stand out is not that our privacy “guarantee” will be 100% ironclad.

We can’t guarantee total anonymity. No one can. Instead, what we’re offering is an actual way to “quantify” privacy risk so that we can track and measure the cost of each use of your data and we can “guarantee” that we will never use more than the amount you agreed to.

This in turn is what will allow us to make some measurable guarantees around the “maximum amount of privacy risk” you will be exposed to by having your data in the datatrust.

The second segment was on privacy rights and issues of due process vis-à-vis the government and data-mining.

Kevin Bankston from the EFF gave a good run-down of how ECPA is laughably ill-equipped to protect individuals using modern-day online services from unprincipled government intrusions.

One point that wasn’t made was that unlike search and seizure of physical property, the privacy impact of data-mining is easily several orders of magnitude greater. Like most things in the digital realm, it’s incredibly easy to sift through hundreds of thousands of user accounts whereas it would be impossibly onerous to search 100,000 homes or read 100,000 paper files.

This is why we disagree with the idea that we should apply old standards created for a physical world to the new realities of the digital one.

Instead, we need to look at actual harm and define new standards around limiting the privacy impact of investigative data-mining.

Again, this would require a quantitative approach to measuring privacy risk.

(Just to be clear, I’m not suggesting that we limit the size of the datasets being mined, that would defeat the purpose of data-mining. Rather, I’m talking about process guidelines for how to go about doing low-(privacy) impact data-mining. More to come on this topic.)

Recap and Proposal: 95/5, The Statistically Insignificant Privacy Guarantee

Wednesday, May 26th, 2010

Image from: xkcd.

In our search for a privacy guarantee that is both measurable and meaningful to the general public, we’ve traveled a long way in and out of the nuances of PINQ and differential privacy: a relatively new, quantitative approach to protecting privacy. Here’s a short summary of where we’ve been, followed by a proposal, built around the notion of statistical significance, for where we might want to go.

The “Differential Privacy” Privacy Guarantee

Differential privacy guarantees that no matter what questions are asked and how answers to those questions are crossed with outside data, your individual record will remain “almost indiscernible” in a data set protected by differential privacy. (The corollary is that the impact of your individual record on the answers given out by differential privacy will be “negligible.”)
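To make this concrete, here’s a minimal sketch of the standard Laplace mechanism that differential privacy systems use to noise a counting query. This is my own illustration, not PINQ’s actual code; the function names and the choice of a count query are assumptions for the example.

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling from a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count, epsilon):
    # A counting query has sensitivity 1: adding or removing one record
    # changes the true answer by at most 1, so Laplace noise with
    # scale 1/epsilon yields epsilon-differential privacy.
    return true_count + laplace_noise(1.0 / epsilon)
```

The smaller the epsilon (the stronger the guarantee), the larger the noise scale, and the less any one record can shift the distribution of answers.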

For a “quantitative” approach to protecting privacy, the differential privacy guarantee is remarkably NOT quantitative.

So I began by proposing the idea that the probability of a single record being present in a data set should equal the probability of that single record not being present in that data set (50/50).

I introduced the idea of worst-case scenario where a nosy neighbor asks a pointed question that essentially reduces to a “Yes or no? Is my neighbor in this data set?” sort of question and I proposed that the nosy neighbor should get an equivocal (50/50) answer: “Maybe yes, but then again, (equally) maybe no.”

(In other words, “almost indiscernible” is hard to quantify. But completely indiscernible is easy to quantify.)

We took this 50/50 definition and tried to bring it to bear on the reality of how differential privacy applies noise to “real answers” to produce identity-obfuscating “noisy answers.”

I quickly discovered that no matter what, differential privacy’s noisy answers always imply that one answer is more likely than another.
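This can be checked directly with the Laplace density. The sketch below (my own illustration, not PINQ code) compares the likelihood of a hypothetical noisy answer of 3.7 under two candidate real answers, 0 and 1, at several noise scales:

```python
import math

def laplace_pdf(x, mu, scale):
    # Density of a Laplace distribution centered at mu.
    return math.exp(-abs(x - mu) / scale) / (2.0 * scale)

# No matter how large the noise scale, the density at 3.7 is always
# higher under candidate 1 than under candidate 0. The likelihood
# ratio works out to exp(1/scale): it shrinks toward 1 as the noise
# grows, but never actually reaches 1.
observed = 3.7
ratios = []
for scale in (1.0, 10.0, 1000.0):
    ratio = laplace_pdf(observed, 1, scale) / laplace_pdf(observed, 0, scale)
    ratios.append(ratio)
```

Even at a noise scale of 1000, the ratio stays a hair above 1: the answer always tips the scales, however slightly.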

My latest post was a last gasp explaining why there really is no way to deliver on the completely invisible, completely non-discernible 50/50 privacy guarantee (even if we abandoned Laplace).

(But I haven’t given up on quantifying the privacy guarantee.)

Now we’re looking at statistical significance as a way to draw a quantitative boundary around a differential privacy guarantee.

Below is a proposal that we’re looking for feedback on. We’re also curious to know if anyone else has tried to come up with a way to quantify the differential privacy guarantee.

What is Statistical Significance? Is it appropriate for our privacy guarantee?

In statistics, a result is called statistically significant if it is unlikely to have occurred by chance. Applied to our privacy guarantee, you might ask the question this way: When you get an answer about a protected data set, are the implications of that “differentially private” answer (as in implications about what the “real answer” might be) significant or are they simply the product of chance?

Is this an appropriate way to define a quantifiable privacy guarantee? We’re not sure.

Thought Experiment: Tossing a Weighted Coin

You have a coin. You know that one side is heavier than the other side. You have only 1 chance to spin the coin and draw a conclusion about which side is heavier.

At what weight distribution split does the result of that 1 coin spin start to be statistically significant?

Well, if you take the “conventional” definition of statistical significance where results start to be statistically significant when you have less than a 5% chance of being wrong, the boundary in our weighted coin example would be 95/5 where 95% of the weight is on one side of the coin and 5% is on the other.
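A quick simulation makes the boundary visible. Assuming, purely for illustration, that the heavier side lands facing up with probability equal to its weight share, guessing “the side facing up is the heavier one” after a single spin at a 95/5 split is wrong about 5% of the time:

```python
import random

def heavier_side_lands_up(p):
    # One spin of a coin with weight split p / (1 - p); assume the
    # heavier side lands facing up with probability p.
    return random.random() < p

# At a 95/5 split, the single-spin guess misses ~5% of the time,
# exactly the conventional significance threshold.
random.seed(1)
trials = 100_000
wrong = sum(1 for _ in range(trials) if not heavier_side_lands_up(0.95))
error_rate = wrong / trials  # hovers around 0.05
```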

What does this have to do with differential privacy?

Mapped onto differential privacy, the weight distribution split is the moral equivalent of the probability split between two possible “real answers.”

The 1 coin toss is the moral equivalent of being able to ask 1 question of the data set.

With a sample size of 1 question, the probability split between two possible, adjacent “real answers” would need to be at least 95/5 before the result of that 1 question was statistically significant.

That in turn means that at 95/5, the presence or absence of a single individual’s record in a data set won’t have a statistically significant impact on the noisy answer given out through differential privacy.

(Still, 95% certainty doesn’t sound very good.)

Postscript: Obviously, we don’t want to be in a situation where asking just 1 question of a data set brings it to the brink of violating the privacy guarantee. However, thinking in terms of 1 question is a helpful way to figure out the “total” amount of privacy risk the system can tolerate. And since the whole point of differential privacy is that it offers a quantitative way to track privacy risk, we can take that “total” amount and divide it by the number of questions we want to be able to dole out per data set and arrive at a per-question risk threshold.
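The divide-the-budget idea maps directly onto differential privacy’s sequential composition property. A rough sketch, with names of my own invention:

```python
def per_question_epsilon(total_epsilon, n_questions):
    # Sequential composition: the epsilons spent on successive questions
    # add up, so splitting a fixed total budget evenly across n
    # questions leaves total_epsilon / n for each one.
    return total_epsilon / n_questions

# For a counting query, the Laplace noise scale is 1 / epsilon, so a
# smaller per-question budget directly means noisier answers.
total_budget = 1.0
eps_per_question = per_question_epsilon(total_budget, 100)
noise_scale = 1.0 / eps_per_question  # 100x the noise of one all-in question
```

The trade-off is stark: serving 100 questions from the same budget means each answer carries 100 times the noise of a single all-in question.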

Really? 50/50 privacy guarantee is truly impossible?

Monday, May 24th, 2010

At the end of my last post, we came to the rather sad conclusion that as far as differential privacy is concerned, it is not possible to offer a 50/50, “you might as well not be in the data set” privacy guarantee because, well, the Laplace distribution curves used to apply identity-obfuscating noise in differential privacy are too…curvy.

No matter how much noise you add, answers you get out of differential privacy will always imply that one number is more likely to be the “real answer” than another. (Which as we know from our “nosy-neighbor-worst-case-scenario,” can translate into revealing the presence of an individual in a data set: The very thing differential privacy is supposed to protect against.)

Still, “50/50 is impossible” is predicated on the nature of the Laplace curves. What would happen if we got rid of them? Are there any viable alternatives?

Apparently, no. 50/50 truly is impossible.

There are a few ways to understand why and how.

The first is a mental sleight of hand. A 50/50 guarantee is impossible because that would mean that the presence of an individual’s data literally has ZERO impact on the answers given out by PINQ, which would effectively cancel out differential privacy’s ability to provide more or less accurate answers.

Back to our worst-case scenario: in a 50/50 world, a PINQ answer of 3.7 would imply that the real answer was equally likely to be 0 or 1; it would also imply that the real answer was equally likely to be 8, or 18K, or 18MM. Differential privacy answers would effectively be completely meaningless.

Graphically speaking, to get 50/50, the currently pointy noise distribution curves would have to be perfectly horizontal, stretching out to infinity in both directions on the number line.

What about a bounded flat curve?

(If pressed, this is probably the way most people would understand what is meant when someone says an answer has a noise level or margin of error of +/-50.)

Well, if you were to apply noise with a rectangular curve, in our worst-case scenario, with +/-50 noise, there would be a 1 in 100 chance that you get an answer that definitively tells you the real answer.

If the real answer is 0, with a rectangular noise level +/- 50 would yield answers from -50 to +50.

If the real answer is 1, a rectangular noise level +/-50 would yield answers from -49 to +51.

If you get a PINQ answer of 37, you’re set. It’s equally likely that the answer is 0 as that the answer is 1. 50/50 achieved.

If you get a PINQ answer of 51, well, you’ll know for sure that the real answer is 1, not 0. And there’s a 1 in 100 chance that you’ll get an answer of 51.

Meaning there’s a 1% chance that in the worst-case scenario you’ll get 100% “smoking gun” confirmation that someone is definitely present in a data set.
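The arithmetic can be checked directly. Assuming integer rectangular noise over -50..+50, there are 101 equally likely offsets, so the giveaway answer of 51 turns up 1 time in 101 — in line with the rough 1-in-100 figure above:

```python
# Integer "rectangular" noise of +/-50: 101 equally likely offsets.
noise = range(-50, 51)
answers_if_0 = {0 + n for n in noise}   # -50 .. 50
answers_if_1 = {1 + n for n in noise}   # -49 .. 51

# 51 can only occur when the real answer is 1: a certain giveaway.
smoking_guns = answers_if_1 - answers_if_0
p_giveaway = len(smoking_guns) / len(answers_if_1)  # 1/101, roughly 1%
```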

As it turns out, rectangular curves are a lot dumber than those pointy Laplace things because they don’t have asymptotes to plant a nagging seed of doubt. In PINQ, all noise distribution curves have an asymptote of zero (as in zero likelihood of being chosen as a noisy answer).

In plain English, that means that every number on the real number line has a chance (no matter how tiny) of being chosen as a noisy answer, no matter what the “real answer” is. In other words, there are no “smoking guns.”

So now we’re back to where we left off in our last post, trying to pick an arbitrary probability split for our privacy guarantee.

Or maybe not. Could statistical significance come and save the day?

Could we quantify our privacy guarantee by saying that the presence or absence of a single record will not affect the answers we give out to a statistically significant degree?

In the mix…DNA testing for college kids, Germany trying to get illegally gathered Google data, and the EFF’s privacy bill of rights for social networks

Friday, May 21st, 2010

1) UC Berkeley’s incoming class will all get DNA tests to identify genes that show how well you metabolize alcohol, lactose, and folates. “After the genetic testing, the university will offer a campuswide lecture by Mr. Rine about the three genetic markers, along with other lectures and panels with philosophers, ethicists, biologists and statisticians exploring the benefits and risks of personal genomics.”

Obviously, genetic testing is not something to take lightly, but the objections quoted sounded a little paternalistic. For example, “They may think these are noncontroversial genes, but there’s nothing noncontroversial about alcohol on campus,” said George Annas, a bioethicist at the Boston University School of Public Health. “What if someone tests negative, and they don’t have the marker, so they think that means they can drink more? Like all genetic information, it’s potentially harmful.”

Isn’t this the reasoning of people who preach abstinence-only sex education?

2) Google recently admitted they were collecting wifi information during their Streetview runs.  Germany’s reaction? To ask for the data so they can see if there’s reason to charge Google criminally.  I don’t understand this.  Private information is collected illegally so it should just be handed over to the government?  Are there useful ways to review this data and identify potential illegalities without handing the raw data over to the government?  Another example of why we can’t rest on our laurels — we need to find new ways to look at private data.

3) EFF issued a privacy bill of rights for social network users.  Short and simple.  It’s gotten me thinking, though, about what it means that we’re demanding rights from a private company. Not to get all Rand Paul on people (I really believe in the Civil Rights Act, all of it), but users’ frustrations with Facebook and their unwillingness to actually leave makes clear that the service Facebook is offering is not just a service provided to just a customer.  danah boyd has a suggestion — let’s think of Facebook as a utility and regulate it the way we regulate electric, water, and other similar utilities.

In the mix…Linkedin v. Facebook, online identities, and diversity in online communities

Friday, May 14th, 2010

1) Is Linkedin better than Facebook with privacy? I’m not sure this is the right question to ask. I’m also not sure the measures Cline uses to evaluate “better privacy” get to the heart of the problem.  The existence of a privacy seal of approval, the level of detail in the privacy policy, the employment of certified privacy professionals … none of these factors address what users are struggling to understand, that is, what’s happening to their information.  73% of adult Facebook users think they only share content with friends, but only 42% have customized their privacy settings.

Ultimately, Linkedin and Facebook are apples to oranges.  As Cline points out himself, people on Linkedin are in a purely professional setting.  People who share information on Linkedin do so for a specific, limited purpose — to promote themselves professionally.  In contrast, people on Facebook have to navigate being friends with parents, kids, co-workers, college buddies, and acquaintances.  Every decision to share information is much more complicated — who will see it, what will they think, how will it reflect on the user?  Facebook’s constant changes to how user information is shared make these decisions even more complicated — who can keep track?

In this sense, Linkedin is definitely easier to use.  If privacy is about control, then Linkedin is definitely easier to control.  But does this mean something like Facebook, where people share in a more generally social context, will always be impossible to navigate?

2) Mark Zuckerberg thinks everyone should have a single identity (via Michael Zimmer).  Well, that would certainly be one way to deal with it.

3) But most people, even the “tell-all” generation, don’t really want to go there.

4) In a not unrelated vein, Sunlight Labs has a new app that allows you to link data on campaign donations to people who email you through Gmail.  At least with regards to government transparency, Sunlight Labs seems to agree with Mark Zuckerberg.  I think information about who I’ve donated money to should be public (go ahead, look me up), but it does unnerve me a little to think that I could email someone on Craigslist about renting an apartment and have this information just pop up.  I don’t know, does the fact that it unnerves me mean that it’s wrong?  Maybe not.

5) Finally, a last bit on the diversity of online communities: it may be more necessary than I claimed, though with a slightly different slant on diversity.  A new study found that the healthiest communities are “diverse” in that new members are constantly being added.  Although they were looking at chat rooms, which to me seems like the loosest form of community, the finding makes a lot of sense to me.  A breast cancer survivors’ forum may not care whether they have a lot of men, but they do need to attract new participants to stay vibrant.

In the mix…Everyone’s obsessed with Facebook

Friday, May 7th, 2010

UPDATE: One more Facebook-related bit, a great graphic illustrating how Facebook’s default sharing settings have changed over the past five years by Matt McKeon. Highly recommend that you click through and watch how the wheel changes.

1) I love when other people agree with me, especially on subjects like Facebook’s continuing clashes with privacy advocates. Says danah boyd,

Facebook started out with a strong promise of privacy…You had to be at a university or some network to sign up. That’s part of how it competed with other social networks, by being the anti-MySpace.

2) EFF has a striking post on the changes made to Facebook’s privacy policy over the last five years.

3) There’s a new app for people who are worried about Facebook having their data, but it means you have to hand it over to this company which also states, it “may use your info to serve up ads that target your interests.” Hmm.

4) Consumer Reports is worried that we’re oversharing, but if we followed all its tips on how to be safe, what would be the point of being on a social network? On its list of things we shouldn’t do:

  • Posting a child’s name in a caption
  • Mentioning being away from home
  • Letting yourself be found by a search engine

What’s the fun of Facebook if you can’t brag about the pina colada you’re drinking on the beach right at that moment? I’m joking, but this list just underscores that we can’t expect to control safety issues solely through consumer choices. Another thing we shouldn’t do is put our full birthdate on display, though given how many people put details about their education, it wouldn’t necessarily be hard to guess which year someone was born. Consumer Reports is clearly focusing on its job, warning consumers, but it’s increasingly obvious privacy is not just a matter of personal responsibility.

5) In a related vein, there’s an interesting Wall St. Journal article on whether the Internet is increasing public humiliation. One WSJ reader, Paul Cooper, had this to say:

The simple rule here is that one should always assume that everything one does will someday be made public. Behave accordingly. Don’t do or say things you don’t want reported or repeated. At least not where anyone can see or hear you doing it. Ask yourself whether you trust the person who wants to take nude pictures of you before you let them take the pictures. It is not society’s job to protect your reputation; it’s your job. If you choose to act like a buffoon, chances are someone is going to notice.

Like I said above, in a world where “public” means really, really public forever and ever, and “private” means whatever you manage to keep hidden from everyone you know, protecting privacy isn’t only a matter of personal responsibility. The Internet easily takes actions that are appropriate in certain contexts and republishes them in other contexts. People change, which is part of the fun of being human. Even if you’re not ashamed of your past, you may not want it following you around in persistent web form.

Perhaps on the bright side, we’ll get to a point where we can all agree everyone has done things that are embarrassing at some point and no one can walk around in self-righteous indignation. We’ve seen norms change elsewhere. When Bill Clinton was running for president, he felt compelled to say that he had smoked marijuana but had never inhaled. When Barack Obama ran for president 16 years later, he could say, “I inhaled–that was the point,” and no one blinked.

6) The draft of a federal online privacy bill has been released. In its comments, Truste notes, “The current draft language positions the traditional privacy policy as the go to standard for ‘notice’ — this is both a good and bad thing.” If nothing else, the “How to Read a Privacy Policy” report we published last year had a similar conclusion, that privacy policies are not going to save us.

Building a community: the implications of Facebook’s new features for privacy and community

Thursday, May 6th, 2010

As I described in my last post, the differences between MySpace and Facebook are so stark, they don’t feel like natural competitors to me.  One isn’t necessarily better than the other.  Rather, one is catering to people who are looking for more of a public, party atmosphere, and the other is catering to people who want to feel like they can go to parties that are more exclusive and/or more intimate, even when they have 1000 friends.

But this difference doesn’t mean that one’s personal information on Facebook is necessarily more “private” than on MySpace.  MySpace can feel more public.  There is no visible wall between the site and the rest of the Internet-browsing community.  But Facebook’s desire to make more of its users’ information public is no secret.  For Facebook to maintain its brand, though, it can’t just make all information public by default.  This is a company that grew by promising Harvard students a network just for them, then Ivy League students a network just for them, and even now, it promises a network just for you and the people you want to connect with.

Facebook needs to remain a space where people feel like they can define their connections, rather than be open to anyone and everyone, even as more information is being shared.

And just in time for this post, Facebook rolled out new features that demonstrate how it is trying to do just that.

Facebook’s new system of Connections, for example, links information from people’s personal profiles to community pages, so that everyone who went to Yale Law School, for example, can link to that page. Although you could see other “Fans” of the school on the school’s own page before, the Community page puts every status update that mentions the school in one place, so that you’re encouraged to interact with others who mention the school.  The Community Pages make your presence on Facebook visible in new ways, but primarily to people who went to the same school as you, who grew up in the same town, who have the same interests.

Thus, even as information is shared beyond current friends, Facebook is trying to reassure you that mini-communities still exist.  You are not being thrown into the open.

Social plug-ins similarly “personalize” a Facebook user’s experience by accessing the user’s friends.  If you go to a participating site, you’ll see which stories your friends have recommended.  If you “Like” a story on that site, it will appear as an item in your Facebook newsfeed.  The information that is being shared thus maps onto your existing connections.

The “Personalization” feature is a little different in that it’s not so much about your interactions with other Facebook users, but about your interaction with other websites.  Facebook shares the public information on your profile with certain partners.  For example, if you are logged into Facebook and you go to the music site Pandora, Pandora will access the public information on your profile and play music based on your “Likes.”

This experience is significantly different from the way people explore music on MySpace.  MySpace has taken off as a place for bands to promote themselves because people’s musical preferences are public.  MySpace users actively request to be added to their favorite bands’ pages, they click on music their friends like, and thus browse through new music.  All of these actions are overt.

Pandora, on the other hand, recommends new music to you based on music you’ve already indicated you “Like” on your profile.   But it’s not through any obvious activity on your part.  You may have noted publicly that you “Like” Alicia Keys on your Facebook profile page, but you didn’t decide to actively plug that information into Pandora.  Facebook has done it for you.

Depending on how you feel about Facebook, you may think that’s wonderfully convenient or frighteningly intrusive.

And this is ultimately why Facebook’s changes feel so troubling for many people.

Facebook isn’t ripping down the walls of its convention center and declaring an open party; as Farhad Manjoo at Slate says, Facebook is not tearing down its walls but “expanding them.”

Facebook is making peepholes in certain walls, or letting some people (though not everyone) into the parties users thought were private.

This reinforces the feeling that mini-communities continue to exist within Facebook, something the company should try to do as it’s a major draw for many of its users.

Yet the multiplication of controls on Facebook for adjusting your privacy settings makes clear how difficult it is to share information and maintain this sense of mini-communities.  There are some who suspect Facebook is purposefully making it difficult to opt-out.  But even if we give Facebook the benefit of the doubt, it’s undeniable that the controls as they were, plus the controls that now exist for all the new features, are bewildering.  Just because users have choices doesn’t mean they feel confident about exercising them.

On MySpace, the prevailing ethos of being more public has its own pitfalls.  A teenager posting suggestive photos of herself may not fully appreciate what she’s doing.  At the least, though, she knows her profile is public to the world.

On Facebook, users are increasingly unsure of what information is public and to whom.  That arguably is more unsettling than total disclosure.

In the mix — open data issues, bad econ stats, Facebook gaydar, and fraud detection in data

Friday, April 30th, 2010

1) It’s definitely become trendy for cities to open up their data, and I appreciated this article about Vancouver for its substantive points:

  • It’s important that data not only be open but be available in real time.  In all my conversations with people who work with data, though, whenever you have sensitive data, there’s going to be a significant time lag between when the data is collected and when it is “cleaned up” and made presentable for the public so as to avoid inadvertent disclosure.  This is why we think something like PINQ, a filter using differential privacy, could be revolutionary in making data available more quickly — it won’t need to be scrubbed for privacy reasons.
  • Licensing is an issue — although the city claims the data is public domain, there are terms of use that restrict use of the data by things like OpenStreetMaps.  It discusses the possibility of using the Public Domain Dedication and License, which is a project of Open Data Commons.  Alex heard some interesting discussion on this issue from Jordan Hatcher at the OkCon this past weekend.  This is a really fascinating issue, and I’m curious to see where else this gets picked up.

2) Existing economic statistics are riddled with problems.  I can’t say this enough — if existing ways of collecting and analyzing data are not quite good enough, we need to be open to new ones.

3) This is an old article, but highlights an issue Mimi and I have been thinking a lot about recently: How can data, even when shared according to your precise directions, reveal more than you intended? In this case, researchers found you could more or less determine the sexual orientation of people on Facebook based on their friends, even if they hadn’t indicated it themselves.  Privacy is definitely about control, yet how do you control something you don’t even know you’re revealing?

4) This past week, the Supreme Court heard a case involving the right to privacy of those who sign petitions to put initiatives on the ballot.  There is a lot of stuff going on in this case, gay rights, the experience of those in California who were targeted for supporting Prop 8, the difference between voting and legislating, etc., but overall, it’s a perfect illustration of how complicated our understanding of public and private has gotten.  We leave those lists open to scrutiny so we can prevent fraud — people signing “Mickey Mouse” — but public when you can go look at the list at the clerks’ office and public when you can post information online for millions to see are two different things.  There may be reasons we want to make these names public other than to prevent fraud (Justice Scalia thinks so), but are there other ways fraud could be detected among signatories that would not require an open examination of all petition signers’ names?  Could modern technology help us detect odd patterns, fake names and more without revealing individual identities?

In the mix…Google reveals how many government requests for data it gets, Amazon tries First Amendment privacy argument, and the World Bank opens its databases

Wednesday, April 21st, 2010

1) Google is providing data on how many government requests they get for data. As various people have pointed out, the site has its limitations, but it’s still fascinating.  We’ve been thinking a lot about how attractive our datatrust would be to governments, and how we can best deal with requests and remain transparent.  This seems like a good option and maybe something all companies should consider doing.

2) In related news, Amazon is refusing the state of North Carolina’s request for its customer data. North Carolina wants the names and addresses of every customer and what they bought since 2003!  They want to audit Amazon’s compliance with North Carolina’s state tax laws.  I think NC’s request is nuts–are they really prepared to go through 50 million purchases?  It may just be legal posturing, given Amazon already gave them anonymized data on the purchases of NC residents, but what’s really interesting to me is Amazon’s argument that its customers have First Amendment rights in their purchases.  I heard a similar argument at a talk at NYU a few months ago, that instead of arguing privacy rights, which are not explicitly defined in the Constitution, we should be arguing for freedom of association rights when we seek to protect ourselves from data requests like this.  Interesting to see where this goes.

3) The World Bank is opening up its development data. This is data people used to pay for and now it’s free, so it’s exciting news.  But as with most public data out there, it’s really just indicators, aggregates, statistics, and such, rather than raw data you can query in an open-ended way.  Wouldn’t that be really exciting?

Can differential privacy be as good as tossing a coin?

Tuesday, April 20th, 2010

At the end of my last post, I had reasoned my way to understanding how differential privacy is capable of doing a really good job of erasing almost all traces of an individual in a dataset, no matter how much “external information” you are armed with and no matter how pointed your questions are.

Now, I’m going to attempt to explain why we can’t quite clear the final hurdle to truly and completely eradicate an individual’s presence from a dataset.

  • Suppose coins are actually weighted such that one side is ever-so-slightly heavier than the other side.
  • And such a coin is spun by a platonically balanced machine.
  • And the coin falls with the heads side facing up.
  • And I only get one "spin" to decide which side is heavier.
  • Then, probabilistically (by an extremely slim margin, since the heavier side tends to end up facing down), I'm better off claiming that the tails side is heavier.
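The single-spin reasoning above can be simulated. This is a toy sketch, not real coin physics: the size of the bias and the "heavier side tends to land face-down" rule are assumptions made purely for illustration.

```python
import random

def spin_biased_coin(p_heavy_down=0.51):
    """Spin a coin whose heavier side ends up face-down with probability
    p_heavy_down (an assumed, illustrative bias)."""
    # Randomly decide which side is actually heavier.
    heavier = random.choice(["heads", "tails"])
    # The heavier side faces down slightly more often than not.
    if random.random() < p_heavy_down:
        face_up = "tails" if heavier == "heads" else "heads"
    else:
        face_up = heavier
    return heavier, face_up

def guess_from_one_spin(face_up):
    # Best single-spin strategy: claim the hidden (face-down) side is heavier.
    return "tails" if face_up == "heads" else "heads"

random.seed(0)
trials = 200_000
wins = 0
for _ in range(trials):
    heavier, face_up = spin_biased_coin()
    if guess_from_one_spin(face_up) == heavier:
        wins += 1
rate = wins / trials
print(rate)  # slightly better than 0.5
```

One spin gives you almost nothing, but "almost nothing" is still not zero: the guessing strategy wins slightly more than half the time.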

Translate this slightly weighted coin toss example into the world of differential privacy and PINQ, and we have an explanation for why complete non-discernibility is likewise impossible.

I have a question. I know ahead of time that the only two valid answers are 0 and 1. PINQ gives me 1.7.

Probabilistically, I’m better off betting that 1 is the real answer.

In fact, PINQ doesn’t even have to give me an answer so close to the real answer. Even if I were to ask my question with a lot of noise, if PINQ says -10,000,000,374, then probabilistically, I’m still better off claiming that 0 is the real answer. (I’d be a gigantic fool for thinking I’ve actually gotten any real information out of PINQ to help me make my bet. But lacking any other additional information, I’d be an even gigantic-er fool to bet in the other direction, even if only by a virtually non-existent slim margin.)

The only answer that would give me absolutely zero "new information" about the "real answer" is 0.5 (where the two distribution curves for 0 and 1 intersect). An answer of 0.5 implies nothing about whether 0 or 1 is the "real answer." Both are equally likely. 50/50 odds.

But most of the time…and I really mean most of the time, PINQ is going to give me an answer that implies either 0 or 1, no matter how much noise I add.
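To see why almost every noisy answer leans one way or the other, here's a small simulation (a sketch assuming PINQ-style Laplace noise; the noise scale and trial count are made up): even with a large noise scale, guessing whichever of 0 or 1 is closer to the released value beats a fair coin.

```python
import math
import random

def laplace_noise(scale, rng):
    # Sample Laplace(0, scale) by inverting the CDF.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def likelier_true_answer(noisy):
    """Given a noisy release, which of the two possible true answers
    (0 or 1) is more likely?  The two densities intersect at 0.5."""
    if noisy > 0.5:
        return 1
    if noisy < 0.5:
        return 0
    return None  # exactly 0.5: both answers equally likely

rng = random.Random(42)
scale = 5.0      # a lot of noise relative to the 0-vs-1 gap
true_answer = 1
trials = 100_000
correct = sum(
    likelier_true_answer(true_answer + laplace_noise(scale, rng)) == true_answer
    for _ in range(trials)
)
rate = correct / trials
print(rate)  # noticeably better than 0.5, despite heavy noise
```

Only an answer of exactly 0.5 is uninformative, and that single point has zero probability of occurring; every other answer tilts the odds, however slightly.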

"Does this matter?" you ask.

It’s easy to argue that if PINQ gives out answers that imply the “real answer” over “the only other possible answer” by a margin of, say, 0.000001%, who could possibly accuse us of false advertising if we claimed to guarantee total non-discernibility of individual records?

(As it turns out, coin tosses aren't really a 50/50 proposition; they're actually more of a 51/49 proposition. So perhaps the way you would answer the "Does it matter?" question depends on whether you'd be the kind of person to take "The Strategy of Coin Flipping" seriously.)

Nevertheless, a real problem arises when you try to actually draw a definitive line in the sand about when it’s no longer okay for us to claim total non-discernibility in our privacy guarantee.

If 50/50 odds are the ideal when it comes to true and complete non-discernibility, then is 49/51 still okay? 45/55? What about 33/66? That seems like too much. 33/66 means that if the only two possible answers are 0 and 1, PINQ is going to be twice as likely to give me an answer that implies 1 as to give me an answer that implies 0.

Yet still I wonder, does this really count as discernment?

Technically speaking, sure.

But what if discernment in the real world can really only happen over time with multiple tries?

Suppose I ask a question and get 4 as an answer. Rationally, I can know that a "real answer" of 1 is twice as likely to yield a PINQ answer of 4 as a "real answer" of 0. But I'm not sure that, viewed through the lens of human psychology, this makes a whole lot of sense.
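The "twice as likely" figure can be checked directly. Assuming Laplace noise (which is what PINQ adds) with scale b, observing 4 is exp(1/b) times likelier under a true answer of 1 than under 0, because the two densities differ only through the distance |x − mean|; choosing b = 1/ln 2 makes that ratio exactly 2. A minimal check:

```python
import math

def laplace_density(x, mean, scale):
    """Density of the Laplace distribution centered at `mean`."""
    return math.exp(-abs(x - mean) / scale) / (2 * scale)

# Ratio of likelihoods for the observation x = 4:
#   P(4 | true answer 1) / P(4 | true answer 0)
#     = exp(-3/b) / exp(-4/b) = exp(1/b)
b = 1 / math.log(2)  # scale chosen so the ratio is exactly 2
ratio = laplace_density(4, 1, b) / laplace_density(4, 0, b)
print(ratio)  # 2.0
```

Note that the ratio depends only on the noise scale, not on how far the observation is from the two candidates: 4 and 4,000,000 both imply 1 over 0 by the same factor.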

After all, there are those psychology studies that show that people need to see 3 options before they feel comfortable making a decision. Maybe it takes “best out of 3” for people to ever feel like they can “discern” any kind of pattern. (I know I’ve read this in multiple places, but Google is failing me right now.)

Here’s psychologist Dan Gilbert on how we evaluate numbers (including odds and value) based on context and repeated past experience.

These two threads on the difference between the probability of a coin landing heads n times versus the probability of the next coin landing heads after it has already landed heads n times further illustrate how context and experience cloud our judgement around probabilities.

If my instincts are correct, what does all this mean for our poor, beleaguered privacy guarantee?
