1) Impressive nonprofit transparency around technology failures. It might seem odd for us to highlight technology failures when we’re hoping to make CDP and its technology useful to nonprofits, but the transparency demonstrated by these nonprofits talking openly about their mistakes is precisely the kind of transparency we hope to support. If nonprofits, or any other organizations, are going to share more of their data with the public, they have to be willing to share the bad with the good, all in the hope of actually doing better.
2) I was really surprised to find out the U.S. Census doesn’t ask about religion. It’s a sensitive subject, but is it really more sensitive than race and ethnicity, which the U.S. Census asks about quite openly? The article goes through why having a better count of different religions could be useful to a lot of people. What are other things we’re afraid to count, and how might that be holding us back from important knowledge?
3) How long should we protect people’s privacy around their medical history? HHS proposes to remove protections that prevent researchers and archivists from accessing medical records for people who have been dead for 50 years; CDT thinks this is a bad idea. Is there a way that this information can be made available without revealing individual identity? That’s the essential problem the datatrust is trying to solve.
4) It may be counterintuitive, but open data can foster industry and business. Clay Johnson, formerly at the Sunlight Foundation, writes about how weather data, collected by the U.S. government, became open data, thereby creating a whole new industry around weather prediction. As he points out, though, that $1.5 billion industry is now not that excited by the National Weather Service expanding into providing data directly to citizens.
We at CDP have been talking about how the datatrust might change the business of data. We think that it could enable all kinds of new business and new services, but it will likely change how data is bought and sold. Already, the business of buying and selling data has changed so much in the past 10 years. Exciting years ahead.
1) It’s heartening that an article on how data-sharing led to a breakthrough in Alzheimer’s research is the Most Emailed article on the NYTimes website right now. The reasons for resisting data-sharing are the same in so many contexts:
At first, the collaboration struck many scientists as worrisome — they would be giving up ownership of data, and anyone could use it, publish papers, maybe even misinterpret it and publish information that was wrong.
But Alzheimer’s researchers and drug companies realized they had little choice.
“Companies were caught in a prisoner’s dilemma,” said Dr. Jason Karlawish, an Alzheimer’s researcher at the University of Pennsylvania. “They all wanted to move the field forward, but no one wanted to take the risks of doing it.”
2) Google agonizes on privacy. The Wall Street Journal article discusses a confidential Google document that reveals the disagreements within the company on how it should use its data. Interestingly, all the scenarios in which Google considers using its data involve targeted advertising; none involve sharing that data with Google users in a broader, more extensive way than they do now. Google believes it owns the data it’s collected, but it also clearly senses that ownership of such data has implications that are different from ownership of other assets. There are individuals who are implicated — what claims might they have to how that data is used?
3) Some people have suggested that if people are unhappy with targeted advertising, the government should come up with a Do Not Track registry, similar to the Do Not Call list. But as Harlan Yu notes, Do Not Track would not be as simple as it sounds. He notes that the challenges involve both technology and policy:
Privacy isn’t a single binary choice but rather a series of individually-considered decisions that each depend on who the tracking party is, how much information can be combined and what the user gets in return for being tracked. This makes the general concept of online Do Not Track—or any blanket opt-out regime—a fairly awkward fit. Users need simplicity, but whether simple controls can adequately capture the nuances of individual privacy preferences is an open question.
4) What happens to a business’s data when it goes bankrupt? The former publisher and partners of a magazine and dating website for gay youth were fighting over ownership of the company’s assets, including its databases. They recently came to an agreement to destroy the data. EFF argues that the Bankruptcy Code should be amended to require such outcomes for data assets. I don’t know enough about bankruptcy law to have an opinion on that, but this conflict illuminates what’s so problematic about the way we treat data and property. No one can own a fact, but everyone acts like they own data. Something fundamental needs to be thrashed out.
6) The owner of an ISP that resisted an FBI request for information can finally reveal his identity. Nicholas Merrill can now reveal that he was the plaintiff behind an ACLU lawsuit that challenged the legality of national security letters (NSLs), by which the FBI can request information without a court order or proving just cause. In fact, the FBI can even impose a gag order prohibiting the recipient of an NSL from telling anyone about it, which is what happened to Merrill.
1) Facebook’s in privacy trouble again. Ron Bowes created a downloadable file containing information on 100 million searchable Facebook profiles, including the URL, name, and unique ID. What’s interesting is that it’s not exactly a breach. As Facebook pointed out, the information was already public. What Facebook will likely never admit, though, is that there is a qualitative difference between information that is publicly available, and information that is organized into an easily searchable database. This is what we as a society are struggling to define — if “public” means more public than ever before, how do we balance our societal interests in both privacy and disclosure?
2) Can data mining go mainstream? The article doesn’t actually say much, but it does at least raise an important question. The value of data and data-mining is immense, as corporations and large government agencies know well. Will those tools ever be available to individuals? Smaller businesses and organizations? And what would that mean for them? It’s a big motivator for us at the Common Data Project: if data doesn’t belong to anyone, and it’s been collected from us, shouldn’t we all be benefiting from data?
Although the bill is still being debated and rewritten, some of its provisions indicate that the author of the bill knows a bit more about data and privacy issues than many other Congressional representatives.
The information regulated by the Act goes beyond the traditional, American definition of personally identifiable information. “The definition of ‘covered information’ in the Act does not require such a combination – each data element stands on its own and may not need to be tied to or identify a specific person. If I, as an individual, had an email address that was wildwolf432@hotmail.com, that would appear to satisfy the definition of covered information even if my name was not associated with it.”
Notice is required when information will be merged or combined with other data.
There’s some limited push toward making more information accessible to users: “covered entities, upon request, must provide individuals with access to their personal files.” However, they only have to if “the entity stores such file in a manner that makes it accessible in the normal course of business,” which I’m guessing would apply to much of the data collected by internet companies.
1) Infochimps launched their API. People often ask, are you guys doing something similar? Yes, in that we are also interested in democratizing access to data, but we’re focusing on a narrower area: information that’s too sensitive and too personal to release in the usual channels. In any case, we’re excited to see more movement in this direction.
2) Wikipedia began a trial of a new tool called “Pending Changes.” To deal with glaring inaccuracies and vandalism, Wikipedia made certain entries off-limits for off-the-cuff editing. The trade-off, however, was that first-time editors to these articles couldn’t get that immediate thrill of seeing their edits. Wikipedia’s trying out a compromise, a tab in which these edits are visible as “pending changes.” It’s always fascinating to see all the different spaces in which people in a community can interact online — this is a new one.
3) The Info Law Group posted various groups’ reactions to the privacy bill proposed by Representative Rick Boucher. Here’s Part I, here’s Part II. Fairly predictable, but it still never ceases to amuse me how far apart industry groups are from consumer advocates.
4) Great discussion continues on the concept of “data literacy.” I love this guest post from David Eaves on the Open Knowledge Foundation blog, with the awesome line:
It is worth remembering: We didn’t build libraries for an already literate citizenry. We built libraries to help citizens become literate. Today we build open data portals not because we have a data or public policy literate citizenry, we build them so that citizens may become literate in data, visualization, coding and public policy.
Last month CCICADA hosted a workshop at Rutgers on “statistical issues in analyzing information from diverse sources”. For those curious, CCICADA stands for Command, Control, and Interoperability Center for Advanced Data Analysis. Though the specific applications did not necessarily deal with sensitive data, I attended with an eye towards how the analyses presented might fit into the world of the datatrust. Here’s a look at a couple of examples from the workshop:
Exploding Manholes!
Cynthia Rudin from MIT gave a talk on her work “Mitigating Manhole Events in New York City Using Machine Learning”. Manholes provide access to the city’s underground electrical system. When the insulation material wears down, there is risk of a “manhole event” which can range up to a fiery explosion. The power company has finite resources to investigate and fix at-risk manholes, so her system predicts which manholes are most at risk based on information in tickets filed with the power company (e.g. lights flickering at X address, manhole cover smoking at Y).
Preventing exploding manholes is interesting, but how might this relate to the datatrust? It turns out that when the power company is logging tickets, they’re not doing it with machine learning for manhole events in mind. One of the biggest challenges in using this unstructured data for this purpose was cleaning it—in this case, converting a blob of text into something analyzable. While I’m not sure there’s any need to put manhole event data in a datatrust, naturally I started imagining the challenges around this. First, it’s hard to imagine being able to effectively clean the data once it’s behind the differential privacy wall. The cleaning was an iterative process that involved some manual work with these text blobs.
For us, the takeaway was that some kinds of data will need to be cleaned while you still have direct access to it, before it is placed behind the anonymization wall of the datatrust. This means that the data donors will need to do the cleaning and it can’t be farmed out to the community at large without compromising the privacy guarantee.
Second, the cleaning seemed to be somewhat context-sensitive. That is, for their particular application, they were keeping and discarding certain pieces of information in the blob. Just as an example, if I was trying to determine the ratio of males to females writing these tickets, I might need a different set of data points extracted from the blob. So, while we’ve spent quite a few words here discussing the challenges around a meaningful privacy guarantee, this was a nice reminder that all of the challenges in dealing with data will also apply to sensitive data.
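To make the context-sensitivity concrete, here is a minimal sketch in Python. The ticket text, the field names, and the keyword list are all made up for illustration; none of them come from the talk or from the power company's actual data. Using a caller's title as a stand-in for gender is also just an assumption for the example. The point is only that two analyses keep different pieces of the same blob.

```python
import re

# Hypothetical free-text tickets; the real ticket format is not described in the talk.
TICKETS = [
    "Caller Mr. Smith reports lights flickering at 12 Main St",
    "Ms. Jones says manhole cover smoking at 48 Broad Ave",
]

# Assumed keyword list for the manhole-risk analysis.
SYMPTOMS = ("flickering", "smoking")

# Assumed patterns: an address at the end of the ticket, and a caller title.
ADDRESS = re.compile(r"\bat (\d+ [A-Za-z ]+)$")
TITLE = re.compile(r"\b(Mr|Ms|Mrs)\.")


def clean_for_risk(ticket):
    """Keep only what a manhole-risk model might need: symptom and address."""
    symptom = next((s for s in SYMPTOMS if s in ticket.lower()), None)
    match = ADDRESS.search(ticket)
    return {"symptom": symptom, "address": match.group(1) if match else None}


def clean_for_caller_stats(ticket):
    """Keep only the caller's title, discarding symptom and address."""
    match = TITLE.search(ticket)
    return {"title": match.group(1) if match else None}


risk_rows = [clean_for_risk(t) for t in TICKETS]
caller_rows = [clean_for_caller_stats(t) for t in TICKETS]
```

Each cleaning function throws away what the other one needs, which is why a dataset cleaned once for one purpose may not serve a different question later.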
Anonymizing Relationships
Of particular relevance to CDP was Graham Cormode from AT&T research and his talk on “Anonymization and Uncertainty in Social Network Data”. The general purpose of his work, similar to ours, is to allow analysis of sensitive data without infringing on privacy. If you’re a frequent reader, you’ve noticed that we’ve been primarily discussing differential privacy and specifically PINQ as a method for managing privacy. Graham presented a different technique for anonymizing data. I’ll set up the problem he’s trying to solve, but I’m not going to get into the details of how he solves it.
Graham’s technique anonymizes graphs, particularly social network interaction graphs. In this case, think of a graph as having a node for every person on Facebook, and a node for each way they interact. Then there are edges connecting the people to the interactions. Here is an example of a portion of a graph:
Graham’s anonymization requirement is that we should not be able to learn of the existence of any interaction, and we should be able to “quantify how much background knowledge is needed to break” the protection.
How does he achieve this? The general idea is by some intelligent grouping of the people nodes. I’ll illustrate the general idea with an example of simple grouping—we’ll group Grant and Alex together, meaning we’ll replace both the “Grant node” and the “Alex node” with a “Grant or Alex node”, and we’ll do the same for the “Mimi” and “Grace” nodes. (We would also replace the names with demographic information to allow us to make general conclusions.)
Now, this is reminiscent of one of those logic puzzles, where you have several hints and have to deduce the answer. (One of Mimi and Grace poked Grant or Alex!) Except in this case, if the grouping is done properly, the hints will not be sufficient to deduce any of the individual interactions.
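The simple grouping above can be sketched in a few lines of Python. This is only an illustration of the relabeling step, not Graham's method: it flattens the graph into (person, interaction, person) triples, the "message" interaction is invented for the example, and it does nothing to check whether the grouping is actually safe, which is the hard part his paper addresses.

```python
# Hypothetical interaction list; only the four names and the poke come from the example above.
interactions = [
    ("Mimi", "poke", "Grant"),
    ("Grace", "message", "Alex"),
]

# The simple grouping from the text: Grant with Alex, Mimi with Grace.
groups = {
    "Grant": "Grant or Alex",
    "Alex": "Grant or Alex",
    "Mimi": "Mimi or Grace",
    "Grace": "Mimi or Grace",
}


def publish(edges, group_of):
    """Replace each person with their group label before release."""
    return [(group_of[a], kind, group_of[b]) for a, kind, b in edges]


published = publish(interactions, groups)
```

After relabeling, both published rows read "Mimi or Grace" on one side and "Grant or Alex" on the other, so a reader of the published version cannot tell from it alone who poked whom.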
You can find a much more complete explanation of the method here in Graham’s paper, but I thought this was a good example to contrast PINQ’s strategy:
PINQ acts as a wall to the data only allowing noisy aggregates to pass through, while this technique creates a new uncertain version of the dataset which you can then freely look at.
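For contrast, here is a rough sketch of what a "noisy aggregate" looks like, using the standard Laplace mechanism from differential privacy for a counting query. This is not PINQ's API (PINQ is a separate library with its own interface); the function name, the records, and the epsilon value are assumptions chosen for illustration.

```python
import random


def noisy_count(records, predicate, epsilon, rng=random):
    """Return a count plus Laplace noise of scale 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two exponential draws with rate epsilon
    # is distributed as Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Hypothetical records; the analyst never sees this list, only noisy answers.
records = [{"age": a} for a in (23, 35, 41, 52, 67)]
answer = noisy_count(records, lambda r: r["age"] > 40, epsilon=0.5)
```

A smaller epsilon means more noise and stronger protection; the analyst only ever receives `answer`, never `records`.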
What kind of data-mining is the IRS doing within the U.S.? The Right to Financial Privacy Act protects our personal banking data from government searches.
However, should the government be asking for aggregate data from banks about customer account activity that could help them identify suspicious behavior?
The big problem is that these business models are not very stable. Companies set out privacy policies, consumers disclose data, and then the action begins…The business model changes. The companies simply want the data, and the consumer benefit disappears.
It’s not enough to start with compensating consumers for their data. The persistent, shareable nature of data makes it very different from a transaction involving money, where someone can buy, walk away, and never interact with the company again. These data-centered companies are creating a network of users whose data are continually used in the business. Maybe it’s time for a new model of business, where governance plans incorporate ways for users to be involved in decisions about their data.
2) In a related vein, danah boyd argues that transparency should not be an end in itself, and that information literacy needs to be developed in conjunction with information access. A similar argument can be made about the concept of privacy. In “real life” (i.e., offline life), no one aims for total privacy. Every day, we make decisions about what we want to share with whom. Online, total privacy and “anonymization” are also impossible, no matter what the company promises in its privacy policy. For our datatrust, we’re going to use PINQ, a technology using differential privacy, that acknowledges privacy is not binary, but something one spends. So perhaps we’ll need to work on privacy and data literacy as well?
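One way to picture privacy as "something one spends" is a budget that each query draws down. The sketch below assumes the standard differential privacy accounting in which the epsilon costs of successive queries add up; the class and method names are hypothetical and are not PINQ's interface.

```python
class PrivacyBudget:
    """Track a total epsilon allowance that queries draw down."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        """Deduct epsilon, refusing the query if the budget would be overdrawn."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        return self.remaining


budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.25)  # a first query
budget.spend(0.5)   # a second, more precise query
```

Once the remaining allowance is too small for a requested query, the query is refused rather than answered, which is the sense in which privacy here is a quantity and not an on/off switch.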
1) UC Berkeley’s incoming class will all get DNA tests to identify genes that show how well you metabolize alcohol, lactose, and folates. “After the genetic testing, the university will offer a campuswide lecture by Mr. Rine about the three genetic markers, along with other lectures and panels with philosophers, ethicists, biologists and statisticians exploring the benefits and risks of personal genomics.”
Obviously, genetic testing is not something to take lightly, but the objections quoted sounded a little paternalistic. For example, “They may think these are noncontroversial genes, but there’s nothing noncontroversial about alcohol on campus,” said George Annas, a bioethicist at the Boston University School of Public Health. “What if someone tests negative, and they don’t have the marker, so they think that means they can drink more? Like all genetic information, it’s potentially harmful.”
Isn’t this the reasoning of people who preach abstinence-only sex education?
2) Google recently admitted they were collecting Wi-Fi information during their Street View runs. Germany’s reaction? To ask for the data so they can see if there’s reason to charge Google criminally. I don’t understand this. Private information is collected illegally so it should just be handed over to the government? Are there useful ways to review this data and identify potential illegalities without handing the raw data over to the government? Another example of why we can’t rest on our laurels: we need to find new ways to look at private data.
3) EFF issued a privacy bill of rights for social network users. Short and simple. It’s gotten me thinking, though, about what it means that we’re demanding rights from a private company. Not to get all Rand Paul on people (I really believe in the Civil Rights Act, all of it), but users’ frustrations with Facebook and their unwillingness to actually leave make clear that what Facebook offers is more than an ordinary service provided to an ordinary customer. danah boyd has a suggestion: let’s think of Facebook as a utility and regulate it the way we regulate electric, water, and other similar utilities.
1) It’s definitely become trendy for cities to open up their data, and I appreciated this article about Vancouver for its substantive points:
It’s important that data not only be open but be available in real time. In all my conversations with people who work with data, though, whenever you have sensitive data, there’s going to be a significant time lag between when the data is collected and when it is “cleaned up” and made presentable for the public so as to avoid inadvertent disclosure. This is why we think something like PINQ, a filter using differential privacy, could be revolutionary in making data available more quickly — it won’t need to be scrubbed for privacy reasons.
Licensing is an issue: although the city claims the data is public domain, there are terms of use that restrict use of the data by projects like OpenStreetMap. The article discusses the possibility of using the Public Domain Dedication and License, which is a project of Open Data Commons. Alex heard some interesting discussion on this issue from Jordan Hatcher at OKCon this past weekend. This is a really fascinating issue, and I’m curious to see where else this gets picked up.
2) Existing economic statistics are riddled with problems. I can’t say this enough — if existing ways of collecting and analyzing data are not quite good enough, we need to be open to new ones.
3) This is an old article, but highlights an issue Mimi and I have been thinking a lot about recently: How can data, even when shared according to your precise directions, reveal more than you intended? In this case, researchers found you could more or less determine the sexual orientation of people on Facebook based on their friends, even if they hadn’t indicated it themselves. Privacy is definitely about control, yet how do you control something you don’t even know you’re revealing?
4) This past week, the Supreme Court heard a case involving the right to privacy of those who sign petitions to put initiatives on the ballot. There is a lot going on in this case (gay rights, the experience of those in California who were targeted for supporting Prop 8, the difference between voting and legislating, and more), but overall, it’s a perfect illustration of how complicated our understanding of public and private has gotten. We leave those lists open to scrutiny so we can prevent fraud, such as people signing “Mickey Mouse,” but “public” when you can go look at the list at the clerk’s office and “public” when you can post information online for millions to see are two different things. There may be reasons we want to make these names public other than to prevent fraud (Justice Scalia thinks so), but are there other ways fraud could be detected among signatories that would not require an open examination of all petition signers’ names? Could modern technology help us detect odd patterns, fake names and more without revealing individual identities?
1) Google is providing data on how many government requests they get for data. As various people have pointed out, the site has its limitations, but it’s still fascinating. We’ve been thinking a lot about how attractive our datatrust would be to governments, and how we can best deal with requests and remain transparent. This seems like a good option and maybe something all companies should consider doing.
2) In related news, Amazon is refusing the state of North Carolina’s request for its customer data. North Carolina wants the names and addresses of every customer and what they bought since 2003! They want to audit Amazon’s compliance with North Carolina’s state tax laws. I think NC’s request is nuts. Are they really prepared to go through 50 million purchases? It may just be legal posturing, given Amazon already gave them anonymized data on the purchases of NC residents, but what’s really interesting to me is Amazon’s argument that its customers have First Amendment rights in their purchases. I heard a similar argument at a talk at NYU a few months ago, that instead of arguing privacy rights, which are not explicitly defined in the Constitution, we should be arguing for freedom of association rights when we seek to protect ourselves from data requests like this. Interesting to see where this goes.
3) The World Bank is opening up its development data. This is data people used to pay for and now it’s free, so it’s exciting news. But as with most public data out there, it’s really just indicators, aggregates, statistics, and such, rather than raw data you can query in an open-ended way. Wouldn’t that be really exciting?
Asking questions, highlighting news, inviting discussion, and announcing developments related to The Common Data Project's mission:
To encourage and enable the disclosure of personal data for public re-use through the creation of a technology and legal framework for anonymized data-sharing.
We at CDP believe that public access to the rapidly growing stores of privately held personal data is crucial to a healthy democracy, informed policy-making and intelligently regulated markets.