Last month CCICADA hosted a workshop at Rutgers on “statistical issues in analyzing information from diverse sources”. For those curious, CCICADA stands for Command, Control, and Interoperability Center for Advanced Data Analysis. Though the specific applications did not necessarily deal with sensitive data, I attended with an eye towards how the analyses presented might fit into the world of the datatrust. Here’s a look at a couple of examples from the workshop:
Cynthia Rudin from MIT gave a talk on her work “Mitigating Manhole Events in New York City Using Machine Learning”. Manholes provide access to the city’s underground electrical system. When the insulation material wears down, there is a risk of a “manhole event”, which can be anything up to a fiery explosion. The power company has finite resources to investigate and fix at-risk manholes, so her system predicts which manholes are most at risk based on information in tickets filed with the power company (e.g. lights flickering at address X, manhole cover smoking at address Y).
Preventing exploding manholes is interesting, but how might this relate to the datatrust? It turns out that when the power company is logging tickets, they’re not doing it with machine learning for manhole events in mind. One of the biggest challenges in using this unstructured data for this purpose was cleaning it—in this case, converting a blob of text into something analyzable. While I’m not sure there’s any need to put manhole event data in a datatrust, naturally I started imagining the challenges around this. First, it’s hard to imagine being able to effectively clean the data once it’s behind the differential privacy wall. The cleaning was an iterative process that involved some manual work with these text blobs.
For us, the takeaway was that some kinds of data will need to be cleaned while you still have direct access to it, before it is placed behind the anonymization wall of the datatrust. This means that the data donors will need to do the cleaning and it can’t be farmed out to the community at large without compromising the privacy guarantee.
Second, the cleaning seemed to be somewhat context-sensitive. That is, for their particular application, they were keeping certain pieces of information in the blob and discarding others. Just as an example, if I were trying to determine the ratio of males to females writing these tickets, I might need a different set of data points extracted from the blob. So, while we’ve spent quite a few words here discussing the challenges around a meaningful privacy guarantee, this was a nice reminder that all of the usual challenges in dealing with data will also apply to sensitive data.
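To make the context-sensitivity point concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the ticket text, the two extraction functions, and the regular expressions are hypothetical and are not taken from the talk. It only shows that two different questions pull different fields out of the same blob.

```python
import re

# Hypothetical ticket text; the real tickets and their format were not shown.
ticket = "Caller reports lights flickering at 12 Example St. Manhole cover smoking nearby."

def extract_for_risk_model(text):
    """Keep only the symptom keywords a manhole-risk model might care about."""
    symptoms = ["flickering", "smoking"]
    return {s: bool(re.search(s, text, re.IGNORECASE)) for s in symptoms}

def extract_for_address_study(text):
    """Keep only the street address and discard the symptoms."""
    match = re.search(r"\bat (\d+ [A-Z][a-z]+ St)\b", text)
    return match.group(1) if match else None

risk_features = extract_for_risk_model(ticket)
address = extract_for_address_study(ticket)
```

If only the output of the first function were stored behind the privacy wall, the second question could no longer be asked, which is why the choice of cleaning matters before the data goes in.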
Of particular relevance to CDP was Graham Cormode from AT&T Research and his talk on “Anonymization and Uncertainty in Social Network Data”. The general purpose of his work, similar to ours, is to allow analysis of sensitive data without infringing on privacy. If you’re a frequent reader, you’ve noticed that we’ve been primarily discussing differential privacy, and specifically PINQ, as a method for managing privacy. Graham presented a different technique for anonymizing data. I’ll set up the problem he’s trying to solve, but I’m not going to get into the details of how he solves it.
Graham’s technique anonymizes graphs, particularly social network interaction graphs. In this case, think of a graph as having a node for every person on Facebook, and a node for each interaction between them. Then there are edges connecting the people to the interactions they take part in. Here is an example of a portion of a graph:
Graham’s anonymization requirement is that we should not be able to learn of the existence of any interaction, and we should be able to “quantify how much background knowledge is needed to break” the protection.
How does he achieve this? The general idea is by some intelligent grouping of the people nodes. I’ll illustrate the general idea with an example of simple grouping—we’ll group Grant and Alex together, meaning we’ll replace both the “Grant node” and the “Alex node” with a “Grant or Alex node”, and we’ll do the same for the “Mimi” and “Grace” nodes. (We would also replace the names with demographic information to allow us to make general conclusions.)
Now, this is reminiscent of one of those logic puzzles, where you have several hints and have to deduce the answer. (One of Mimi and Grace poked Grant or Alex!) Except in this case, if the grouping is done properly, the hints will not be sufficient to deduce any of the individual interactions.
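The simple grouping above can be sketched in a few lines of Python. This is only an illustration of the idea, not Graham’s actual algorithm: the toy interactions below are hypothetical, and the real method chooses groups carefully so that the hints cannot be combined to pin down an individual interaction.

```python
from collections import defaultdict

# Hypothetical toy graph: each interaction node lists the people it connects.
interactions = {
    "poke_1": ["Mimi", "Grant"],
    "message_1": ["Grace", "Alex"],
}

# The simple grouping from the example: each person maps to a group node.
groups = {
    "Grant": "Grant or Alex",
    "Alex": "Grant or Alex",
    "Mimi": "Mimi or Grace",
    "Grace": "Mimi or Grace",
}

def group_graph(interactions, groups):
    """Replace each person node with its group node, keeping the edges."""
    return {
        interaction: sorted(groups[person] for person in people)
        for interaction, people in interactions.items()
    }

def candidate_pairs(grouped_edge, groups):
    """List every pair of individuals consistent with one grouped interaction."""
    members = defaultdict(list)
    for person, label in groups.items():
        members[label].append(person)
    first, second = grouped_edge
    return [(a, b) for a in members[first] for b in members[second] if a != b]

grouped = group_graph(interactions, groups)
possible = candidate_pairs(grouped["poke_1"], groups)
```

After grouping, the two interactions look identical, and the poke is consistent with four different pairs of people, so an observer cannot tell which pair was actually involved.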
You can find a much more complete explanation of the method here in Graham’s paper, but I thought this was a good example to contrast with PINQ’s strategy:
PINQ acts as a wall to the data only allowing noisy aggregates to pass through, while this technique creates a new uncertain version of the dataset which you can then freely look at.
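For comparison, the “wall” side of the contrast can be sketched as follows. This is not PINQ’s API (PINQ is a C# library and I am not reproducing its calls here); it is a minimal Python illustration of a noisy count, assuming Laplace noise with scale 1/epsilon, which is a standard choice for counting queries under differential privacy. The function name noisy_count and its parameters are hypothetical.

```python
import random

def noisy_count(records, predicate, epsilon, rng=random):
    """Return the number of matching records plus Laplace noise of scale 1/epsilon."""
    true_count = sum(1 for record in records if predicate(record))
    scale = 1.0 / epsilon
    # The difference of two independent exponentials with mean `scale`
    # is a Laplace sample centered at zero with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

# Hypothetical records: the analyst never sees these, only the noisy answer.
people = ["Grant", "Alex", "Mimi", "Grace"]
answer = noisy_count(people, lambda name: name.startswith("G"), epsilon=0.5)
```

The analyst only ever receives values like answer, never the records themselves, whereas the grouping approach hands over a whole (deliberately uncertain) dataset to explore.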