## Really? 50/50 privacy guarantee is truly impossible?

May 24th, 2010 by Mimi Yin

At the end of my last post, we came to the rather sad conclusion that as far as differential privacy is concerned, it is not possible to offer a 50/50, “you might as well not be in the data set” privacy guarantee because, well, the Laplace distribution curves used to apply identity-obfuscating noise in differential privacy are too…curvy.

No matter how much noise you add, answers you get out of differential privacy will always imply that one number is more likely to be the “real answer” than another. (Which as we know from our “nosy-neighbor-worst-case-scenario,” can translate into revealing the presence of an individual in a data set: The very thing differential privacy is supposed to protect against.)

## Still, “50/50 is impossible” is predicated on the nature of the Laplace curves. What would happen if we got rid of them? Are there any viable alternatives?

Apparently, no. 50/50 truly is impossible.

There are a few ways to understand why and how.

The first is a mental sleight of hand. A 50/50 guarantee is impossible because that would mean that the presence of an individual’s data literally has ZERO impact on the answers given out by PINQ, which would effectively cancel out differential privacy’s ability to provide more or less accurate answers.

Back to our worst-case scenario, in a 50/50 world, a PINQ answer of 3.7 would not only equally imply that the real answer was 0 as that it was 1, it would also equally imply that the real answer was 8, as that it was 18K or 18MM. Differential privacy answers would effectively be completely meaningless.

Graphically speaking, to get 50/50, the currently pointy noise distribution curves would have to be perfectly horizontal, stretching out to infinity in both directions on the number line.

## What about a bounded flat curve?

(If pressed, this is probably the way most people would understand what is meant when someone says an answer has a noise level or margin of error of +/-50.)

Well, if you were to apply noise with a rectangular curve, in our worst-case scenario, with +/-50 noise, there would be a 1 in 100 chance that you get an answer that definitively tells you the real answer.

If the real answer is 0, a rectangular noise level of +/-50 would yield answers from -50 to +50.

If the real answer is 1, a rectangular noise level +/-50 would yield answers from -49 to +51.

If you get a PINQ answer of 37, you’re set. It’s equally likely that the answer is 0 as that the answer is 1. 50/50 achieved.

If you get a PINQ answer of 51, well you’ll know for sure that the real answer is 1, not 0. And there’s a 1 in 100 chance that you’ll get an answer of 51.

Meaning there’s a 1% chance that in the worst-case scenario you’ll get 100% “smoking gun” confirmation that someone is definitely present in a data set. As it turns out, rectangular curves are a lot dumber than those pointy Laplace things because they don’t have asymptotes to plant a nagging seed of doubt. In PINQ, all noise distribution curves have an asymptote of zero (as in zero likelihood of being chosen as a noisy answer).
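The rectangular-versus-Laplace comparison above can be checked numerically. The following is an added example, not code from PINQ or from this post; the function names, the Laplace scale of 50 and the 50/50 prior over "real answer is 0" versus "real answer is 1" are assumptions made for illustration.

```python
import math

def rect_pdf(x, real, half_width=50.0):
    """Density of real + rectangular noise on [-half_width, +half_width]."""
    return 1.0 / (2 * half_width) if abs(x - real) <= half_width else 0.0

def laplace_pdf(x, real, scale=50.0):
    """Density of real + Laplace noise; scale=50 is an assumed setting."""
    return math.exp(-abs(x - real) / scale) / (2 * scale)

def posterior_present(x, pdf):
    """P(real answer is 1 | noisy answer x), assuming a 50/50 prior on 0 vs 1."""
    p0, p1 = pdf(x, 0.0), pdf(x, 1.0)
    if p0 + p1 == 0.0:
        raise ValueError("answer cannot occur under either real answer")
    return p1 / (p0 + p1)

def smoking_gun_chance(real_a=0.0, real_b=1.0, half_width=50.0):
    """Chance a rectangular-noise answer for real_b is unreachable from real_a."""
    return min(1.0, abs(real_b - real_a) / (2 * half_width))

def laplace_worst_case(scale=50.0):
    """Upper bound on posterior_present under Laplace noise, for any answer."""
    ratio = math.exp(1.0 / scale)  # largest possible p1/p0
    return ratio / (1.0 + ratio)

if __name__ == "__main__":
    # Noisy answers 37 and 51 are the two cases walked through in the post.
    for x in (37.0, 51.0):
        print(x, posterior_present(x, rect_pdf),
              round(posterior_present(x, laplace_pdf), 4))
    print("smoking-gun chance:", smoking_gun_chance())
    print("Laplace worst case:", round(laplace_worst_case(), 4))
```

With rectangular noise an answer of 51 pushes the posterior to exactly 1, while under Laplace noise the same answer only moves it to about 0.505, and no answer can move it further.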

In plain English, that means that every number on the real number line has a chance (no matter how tiny) of being chosen as a noisy answer, no matter what the “real answer” is. In other words, there are no “smoking guns.”

## So now we’re back to where we left off in our last post, trying to pick an arbitrary probability split for our privacy guarantee.

Or maybe not. Could statistical significance come and save the day?

Could we quantify our privacy guarantee by saying that the presence or absence of a single record will not affect the answers we give out to a statistically significant degree?