This gets at the heart of the issue of imbalance between private and public sectors when it comes to access to sensitive information.
From our perspective, it doesn’t seem like a good idea to limit data usage. If the drug companies are smart, they’re also using the same data to figure out things like what drugs are being prescribed in combination and how that affects the effectiveness of their products.
Instead, we should be thinking of ways to expand access so that for every drug company buying data for marketing and product development, there is an active community of researchers, public advocates and policymakers who have low-cost or free access to the same data.
Particularly because the title of the piece suggests that he is saying exactly what we are saying, I wanted to write a few quick comments to clarify how it is different.
1. It’s great that he’s saying loudly and clearly that the payback for data collection should be the data itself – that’s definitely a key point we’re trying to make with CDP, and not enough people realize how valuable that data is to individuals, and more generally, to the public.
2. However, what Professor Thaler is pushing for is more along the lines of “data portability”. We agree with the idea at an ethical and moral level, but it has some real practical limitations once we start talking about implementation. In my experience, data structures change so rapidly that companies are unable to keep up with how their own data is evolving month to month. I find it hard to imagine that entire industries could coordinate a standard that would hold together for very long without undermining the very qualities that make data-driven services powerful and innovative.
to provide any individual to whom the personally identifiable information that is covered information [covered information is essentially anything that is tied to your identity] pertains, and which the covered entity or its service provider stores, appropriate and reasonable-
(A) access to such information; and
(B) mechanisms to correct such information to improve the accuracy of such information;
Perhaps what he is simply pointing out is the lack of any mention about instituting data standards to enable portability versus simply instituting standards around data transparency.
I have a long post about the bill that is not quite ready to put out there, and it does have a lot of issues, but I didn’t think that was one of them.
1) Impressive nonprofit transparency around technology failures. It might seem odd for us to highlight technology failures when we’re hoping to make CDP and its technology useful to nonprofits, but the transparency demonstrated by these nonprofits talking openly about their mistakes is precisely the kind of transparency we hope to support. If nonprofits, or any other organizations, are going to share more of their data with the public, they have to be willing to share the bad with the good, all in the hope of actually doing better.
2) I was really surprised to find out the U.S. Census doesn’t ask about religion. It’s a sensitive subject, but is it really more sensitive than race and ethnicity, which the U.S. Census asks about quite openly? The article goes through why having a better count of different religions could be useful to a lot of people. What are other things we’re afraid to count, and how might that be holding us back from important knowledge?
3) How long should we protect people’s privacy around their medical history? HHS proposes to remove protections that prevent researchers and archivists from accessing medical records for people who have been dead for 50 years; CDT thinks this is a bad idea. Is there a way that this information can be made available without revealing individual identity? That’s the essential problem the datatrust is trying to solve.
4) It may be counterintuitive, but open data can foster industry and business. Clay Johnson, formerly at the Sunlight Foundation, writes about how weather data, collected by the U.S. government, became open data, thereby creating a whole new industry around weather prediction. As he points out, though, that $1.5 billion industry is now not that excited by the National Weather Service expanding into providing data directly to citizens.
We at CDP have been talking about how the datatrust might change the business of data. We think that it could enable all kinds of new business and new services, but it will likely change how data is bought and sold. Already, the business of buying and selling data has changed so much in the past 10 years. Exciting years ahead.
1) It’s heartening that an article on how data-sharing led to a breakthrough in Alzheimer’s research is the Most Emailed article on the NYTimes website right now. The reasons for resisting data-sharing are the same in so many contexts:
At first, the collaboration struck many scientists as worrisome — they would be giving up ownership of data, and anyone could use it, publish papers, maybe even misinterpret it and publish information that was wrong.
But Alzheimer’s researchers and drug companies realized they had little choice.
“Companies were caught in a prisoner’s dilemma,” said Dr. Jason Karlawish, an Alzheimer’s researcher at the University of Pennsylvania. “They all wanted to move the field forward, but no one wanted to take the risks of doing it.”
2) Google agonizes on privacy. The Wall Street Journal article discusses a confidential Google document that reveals the disagreements within the company on how it should use its data. Interestingly, all the scenarios in which Google considers using its data involve targeted advertising; none involve sharing that data with Google users in a broader, more extensive way than they do now. Google believes it owns the data it’s collected, but it also clearly senses that ownership of such data has implications that are different from ownership of other assets. There are individuals who are implicated — what claims might they have to how that data is used?
3) Some people have suggested that if people are unhappy with targeted advertising, the government should come up with a Do Not Track registry, similar to the Do Not Call list. But as Harlan Yu notes, Do Not Track would not be as simple as it sounds. He notes that the challenges involve both technology and policy:
Privacy isn’t a single binary choice but rather a series of individually-considered decisions that each depend on who the tracking party is, how much information can be combined and what the user gets in return for being tracked. This makes the general concept of online Do Not Track—or any blanket opt-out regime—a fairly awkward fit. Users need simplicity, but whether simple controls can adequately capture the nuances of individual privacy preferences is an open question.
4) What happens to a business’s data when it goes bankrupt? The former publisher and partners of a magazine and dating website for gay youth were fighting over ownership of the company’s assets, including its databases. They recently came to an agreement to destroy the data. EFF argues that the Bankruptcy Code should be amended to require such outcomes for data assets. I don’t know enough about bankruptcy law to have an opinion on that, but this conflict illuminates what’s so problematic about the way we treat data and property. No one can own a fact, but everyone acts like they own data. Something fundamental needs to be thrashed out.
1) Facebook’s in privacy trouble again. Ron Bowes created a downloadable file containing information on 100 million searchable Facebook profiles, including the URL, name, and unique ID. What’s interesting is that it’s not exactly a breach. As Facebook pointed out, the information was already public. What Facebook will likely never admit, though, is that there is a qualitative difference between information that is publicly available, and information that is organized into an easily searchable database. This is what we as a society are struggling to define — if “public” means more public than ever before, how do we balance our societal interests in both privacy and disclosure?
2) Can data mining go mainstream? The article doesn’t actually say much, but it does at least raise an important question. The value of data and data-mining is immense, as corporations and large government agencies know well. Will those tools ever be available to individuals? Smaller businesses and organizations? And what would that mean for them? It’s a big motivator for us at the Common Data Project: if data doesn’t belong to anyone, and it’s been collected from us, shouldn’t we all be benefiting from data?
Although the bill is still being debated and rewritten, some of its provisions indicate that the author of the bill knows a bit more about data and privacy issues than many other Congressional representatives.
The information regulated by the Act goes beyond the traditional, American definition of personally identifiable information. “The definition of ‘covered information’ in the Act does not require such a combination – each data element stands on its own and may not need to be tied to or identify a specific person. If I, as an individual, had an email address that was firstname.lastname@example.org, that would appear to satisfy the definition of covered information even if my name was not associated with it.”
Notice is required when information will be merged or combined with other data.
There’s some limited push to making more information accessible to users: “covered entities, upon request, must provide individuals with access to their personal files.” However, they only have to if “the entity stores such file in a manner that makes it accessible in the normal course of business,” which I’m guessing would apply to much of the data collected by internet companies.
1) Infochimps launched their API. People often ask, are you guys doing something similar? Yes, in that we are also interested in democratizing access to data, but we’re focusing on a narrower area: information that’s too sensitive and too personal to release in the usual channels. In any case, we’re excited to see more movement in this direction.
2) Wikipedia began a trial of a new tool called “Pending Changes.” To deal with glaring inaccuracies and vandalism, Wikipedia made certain entries off-limits for off-the-cuff editing. The trade-off, however, was that first-time editors to these articles couldn’t get that immediate thrill of seeing their edits. Wikipedia’s trying out a compromise, a tab in which these edits are visible as “pending changes.” It’s always fascinating to see all the different spaces in which people in a community can interact online — this is a new one.
3) The Info Law Group posted various groups’ reactions to the privacy bill proposed by Representative Rick Boucher. Here’s Part I, here’s Part II. Fairly predictable, but it still never ceases to amuse me how far apart industry groups are from consumer advocates.
It is worth remembering: We didn’t build libraries for an already literate citizenry. We built libraries to help citizens become literate. Today we build open data portals not because we have a data or public policy literate citizenry, we build them so that citizens may become literate in data, visualization, coding and public policy.
Last month CCICADA hosted a workshop at Rutgers on “statistical issues in analyzing information from diverse sources”. For those curious, CCICADA stands for Command, Control, and Interoperability Center for Advanced Data Analysis. Though the specific applications did not necessarily deal with sensitive data, I attended with an eye towards how the analyses presented might fit into the world of the datatrust. Here’s a look at a couple of examples from the workshop:
Cynthia Rudin from MIT gave a talk on her work “Mitigating Manhole Events in New York City Using Machine Learning”. Manholes provide access to the city’s underground electrical system. When the insulation material wears down, there is risk of a “manhole event” which can range up to a fiery explosion. The power company has finite resources to investigate and fix at-risk manholes, so her system predicts which manholes are most at risk based on information in tickets filed with the power company (e.g. lights flickering at X address, manhole cover smoking at Y).
Preventing exploding manholes is interesting, but how might this relate to the datatrust? It turns out that when the power company is logging tickets, they’re not doing it with machine learning for manhole events in mind. One of the biggest challenges in using this unstructured data for this purpose was cleaning it—in this case, converting a blob of text into something analyzable. While I’m not sure there’s any need to put manhole event data in a datatrust, naturally I started imagining the challenges around this. First, it’s hard to imagine being able to effectively clean the data once it’s behind the differential privacy wall. The cleaning was an iterative process that involved some manual work with these text blobs.
For us, the takeaway was that some kinds of data will need to be cleaned while you still have direct access to it, before it is placed behind the anonymization wall of the datatrust. This means that the data donors will need to do the cleaning and it can’t be farmed out to the community at large without compromising the privacy guarantee.
Second, the cleaning seemed to be somewhat context-sensitive. That is, for their particular application, they were keeping and discarding certain pieces of information in the blob. Just as an example, if I was trying to determine the ratio of males to females writing these tickets, I might need a different set of data points extracted from the blob. So, while we’ve spent quite a few words here discussing the challenges around a meaningful privacy guarantee, this was a nice reminder that all of the challenges in dealing with data will also apply to sensitive data.
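To make the context-sensitivity concrete, here is a minimal sketch in Python. The ticket wording, the field names and both extraction functions are made up for illustration; the talk summary above doesn't describe the real ticket format or the actual cleaning code. The point is only that two different questions pull two different sets of fields out of the same text blob.

```python
import re

# Hypothetical free-text tickets; the real format is an assumption.
TICKETS = [
    "Caller Mr. Smith reports lights flickering at 12 Main St",
    "Ms. Jones says manhole cover smoking at 40 Elm Ave",
]


def extract_for_manhole_risk(ticket):
    """Keep only the symptom and the address; discard who reported it."""
    symptom = re.search(r"(lights flickering|manhole cover smoking)", ticket)
    address = re.search(r"at (\d+ [A-Z][\w ]+)$", ticket)
    return {
        "symptom": symptom.group(1) if symptom else None,
        "address": address.group(1) if address else None,
    }


def extract_for_reporter_ratio(ticket):
    """Keep only the reporter's title, a crude stand-in for male/female."""
    title = re.search(r"\b(Mr|Ms|Mrs)\.", ticket)
    return {"title": title.group(1) if title else None}


for ticket in TICKETS:
    print(extract_for_manhole_risk(ticket), extract_for_reporter_ratio(ticket))
```

Once the first extraction has thrown away the reporter's title, the second question can no longer be answered from the cleaned data, which is why the cleaning has to be decided before the raw blobs go behind the wall.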
Of particular relevance to CDP was Graham Cormode from AT&T Research and his talk on “Anonymization and Uncertainty in Social Network Data”. The general purpose of his work, similar to ours, is to allow analysis of sensitive data without infringing on privacy. If you’re a frequent reader, you’ve noticed that we’ve been primarily discussing differential privacy and specifically PINQ as a method for managing privacy. Graham presented a different technique for anonymizing data. I’ll set up the problem he’s trying to solve, but I’m not going to get into the details of how he solves it.
Graham’s technique anonymizes graphs, particularly social network interaction graphs. In this case, think of a graph as having a node for every person on Facebook, and a node for each way they interact. Then there are edges connecting the people to the interactions. Here is an example of a portion of a graph:
Graham’s anonymization requirement is that we should not be able to learn of the existence of any interaction, and we should be able to “quantify how much background knowledge is needed to break” the protection.
How does he achieve this? The general idea is by some intelligent grouping of the people nodes. I’ll illustrate the general idea with an example of simple grouping—we’ll group Grant and Alex together, meaning we’ll replace both the “Grant node” and the “Alex node” with a “Grant or Alex node”, and we’ll do the same for the “Mimi” and “Grace” nodes. (We would also replace the names with demographic information to allow us to make general conclusions.)
Now, this is reminiscent of one of those logic puzzles, where you have several hints and have to deduce the answer. (One of Mimi and Grace poked Grant or Alex!) Except in this case, if the grouping is done properly, the hints will not be sufficient to deduce any of the individual interactions.
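For readers who think better in code, here is a rough Python sketch of the simple grouping described above. The edge list and the interaction names (poke_1, message_1) are invented for illustration, and this is only the naive relabeling step. It does not check the conditions Graham’s method uses to decide whether a grouping is actually safe.

```python
# Hypothetical toy graph: each pair links a person to an interaction node.
EDGES = [
    ("Mimi", "poke_1"),
    ("Grant", "poke_1"),
    ("Grace", "message_1"),
    ("Alex", "message_1"),
]

# The simple grouping from the text: two people per group.
GROUPS = {
    "Grant": "Grant or Alex",
    "Alex": "Grant or Alex",
    "Mimi": "Mimi or Grace",
    "Grace": "Mimi or Grace",
}


def group_edges(edges, groups):
    """Replace each person with their group label, keeping interaction nodes."""
    return sorted({(groups[person], interaction) for person, interaction in edges})


for group, interaction in group_edges(EDGES, GROUPS):
    print(group, "--", interaction)
```

The published edges say that someone in “Mimi or Grace” took part in poke_1 with someone in “Grant or Alex”, but not which member of each group it was.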
You can find a much more complete explanation of the method here in Graham’s paper, but I thought this was a good example to contrast PINQ’s strategy:
PINQ acts as a wall to the data only allowing noisy aggregates to pass through, while this technique creates a new uncertain version of the dataset which you can then freely look at.
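As a point of comparison, here is a small Python sketch of what a “noisy aggregate” looks like. This is not PINQ’s API (PINQ is built on C# and LINQ). It is just the standard Laplace mechanism for a count, with a function name, toy records and an epsilon value chosen for illustration.

```python
import random


def noisy_count(records, predicate, epsilon, rng):
    """Return a count plus Laplace noise with scale 1/epsilon.

    Adding or removing one record changes a count by at most 1,
    so the noise scale is sensitivity / epsilon = 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    rate = epsilon  # rate of each exponential = 1 / scale
    # The difference of two independent exponentials is Laplace-distributed.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise


# Hypothetical records; the analyst only ever sees the noisy answer.
ages = [23, 35, 41, 29, 52, 38]
rng = random.Random()
print(noisy_count(ages, lambda age: age > 30, epsilon=0.5, rng=rng))
```

A smaller epsilon means more noise and a stronger guarantee. Either way the analyst only sees answers to queries, whereas the grouping approach hands over an altered copy of the whole dataset.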
The big problem is that these business models are not very stable. Companies set out privacy policies, consumers disclose data, and then the action begins…The business model changes. The companies simply want the data, and the consumer benefit disappears.
It’s not enough to start with compensating consumers for their data. The persistent, shareable nature of data makes it very different from a transaction involving money, where someone can buy, walk away, and never interact with the company again. These data-centered companies are creating a network of users whose data are continually used in the business. Maybe it’s time for a new model of business, where governance plans incorporate ways for users to be involved in decisions about their data.
1) UC Berkeley’s incoming class will all get DNA tests to identify genes that show how well you metabolize alcohol, lactose, and folates. “After the genetic testing, the university will offer a campuswide lecture by Mr. Rine about the three genetic markers, along with other lectures and panels with philosophers, ethicists, biologists and statisticians exploring the benefits and risks of personal genomics.”
Obviously, genetic testing is not something to take lightly, but the objections quoted sounded a little paternalistic. For example, “They may think these are noncontroversial genes, but there’s nothing noncontroversial about alcohol on campus,” said George Annas, a bioethicist at the Boston University School of Public Health. “What if someone tests negative, and they don’t have the marker, so they think that means they can drink more? Like all genetic information, it’s potentially harmful.”
Isn’t this the reasoning of people who preach abstinence-only sex education?
2) Google recently admitted they were collecting wifi information during their Streetview runs. Germany’s reaction? To ask for the data so they can see if there’s reason to charge Google criminally. I don’t understand this. Private information is collected illegally so it should just be handed over to the government? Are there useful ways to review this data and identify potential illegalities without handing the raw data over to the government? Another example of why we can’t rest on our laurels — we need to find new ways to look at private data.
3) EFF issued a privacy bill of rights for social network users. Short and simple. It’s gotten me thinking, though, about what it means that we’re demanding rights from a private company. Not to get all Rand Paul on people (I really believe in the Civil Rights Act, all of it), but users’ frustrations with Facebook and their unwillingness to actually leave make clear that what Facebook offers is not just a service provided to a customer. danah boyd has a suggestion: let’s think of Facebook as a utility and regulate it the way we regulate electric, water, and other similar utilities.