Archive for the ‘Protecting Privacy in Meaningful Ways’ Category

In the mix…nonprofit technology failures; not counting religion; medical privacy after death; and the business of open data

Friday, August 20th, 2010

1) Impressive nonprofit transparency around technology failures. It might seem odd for us to highlight technology failures when we’re hoping to make CDP and its technology useful to nonprofits, but the transparency demonstrated by these nonprofits talking openly about their mistakes is precisely the kind of transparency we hope to support.  If nonprofits, or any other organization, is going to share more of their data with the public, they have to be willing to share the bad with the good, all in the hope of actually doing better.

2) I was really surprised to find out the U.S. Census doesn’t ask about religion.  It’s a sensitive subject, but is it really more sensitive than race and ethnicity, which the U.S. Census asks about quite openly?  The article goes through why having a better count of different religions could be useful to a lot of people. What are other things we’re afraid to count, and how might that be holding us back from important knowledge?

3) How long should we protect people’s privacy around their medical history? HHS proposes to remove protections that prevent researchers and archivists from accessing medical records for people who have been dead for 50 years; CDT thinks this is a bad idea.  Is there a way that this information can be made available without revealing individual identity?  That’s the essential problem the datatrust is trying to solve.

4) It may be counterintuitive, but open data can foster industry and business. Clay Johnson, formerly at the Sunlight Foundation, writes about how weather data, collected by the U.S. government, became open data, thereby creating a whole new industry around weather prediction.  As he points out, though, that $1.5 billion industry is now not that excited by the National Weather Service expanding into providing data directly to citizens.

We at CDP have been talking about how the datatrust might change the business of data.  We think that it could enable all kinds of new business and new services, but it will likely change how data is bought and sold.  Already, the business of buying and selling data has changed so much in the past 10 years.  Exciting years ahead.

In the mix…data-sharing’s impact on Alzheimer’s research, the limits of a Do Not Track registry, meaning of data ownership and more

Friday, August 13th, 2010

1)  It’s heartening that an article on how data-sharing led to a breakthrough in Alzheimer’s research is the Most Emailed article on the NYTimes website right now. The reasons for resisting data-sharing are the same in so many contexts:

At first, the collaboration struck many scientists as worrisome — they would be giving up ownership of data, and anyone could use it, publish papers, maybe even misinterpret it and publish information that was wrong.

But Alzheimer’s researchers and drug companies realized they had little choice.

“Companies were caught in a prisoner’s dilemma,” said Dr. Jason Karlawish, an Alzheimer’s researcher at the University of Pennsylvania. “They all wanted to move the field forward, but no one wanted to take the risks of doing it.”

2) Google agonizes on privacy. The Wall Street Journal article discusses a confidential Google document that reveals the disagreements within the company on how it should use its data.  Interestingly, all the scenarios in which Google considers using its data involve targeted advertising; none involve sharing that data with Google users in a broader, more extensive way than they do now.  Google believes it owns the data it’s collected, but it also clearly senses that ownership of such data has implications that are different from ownership of other assets.  There are individuals who are implicated — what claims might they have to how that data is used?

3) Some people have suggested that if people are unhappy with targeted advertising, the government should come up with a Do Not Track registry, similar to the Do Not Call list.  But as Harlan Yu notes, Do Not Track would not be as simple as it sounds. He notes that the challenges involve both technology and policy:

Privacy isn’t a single binary choice but rather a series of individually-considered decisions that each depend on who the tracking party is, how much information can be combined and what the user gets in return for being tracked. This makes the general concept of online Do Not Track—or any blanket opt-out regime—a fairly awkward fit. Users need simplicity, but whether simple controls can adequately capture the nuances of individual privacy preferences is an open question.

4) What happens to a business’s data when it goes bankrupt? The former publisher and partners of a magazine and dating website for gay youth were fighting over ownership of the company’s assets, including its databases.  They recently came to an agreement to destroy the dataEFF argues that the Bankruptcy Code should be amended to require such outcomes for data assets.  I don’t know enough about bankruptcy law to have an opinion on that, but this conflict illuminates what’s so problematic about the way we treat data and property.  No one can own a fact, but everyone acts like they own data.  Something fundamental needs to be thrashed out.

5) Geotags are revealing more locational info than the photographers intended.

6) The owner of an ISP that resisted an FBI request for information can finally reveal his identity. Nicholas Merrill can now reveal that he was the plaintiff behind an ACLU lawsuit that challenged the legality of national security letter, by which the FBI can request information without a court order or proving just cause.  In fact, the FBI can even impose a gag order prohibiting the recipient of the NSL from telling anyone about the NSL, which is what happened to Merrill.

In the mix…WSJ’s “What They Know”; data potential in healthcare; and comparing the privacy bills in Congress

Monday, August 9th, 2010

1.  The Wall Street Journal Online has a new feature section, ominously named, “What They Know.” The section highlights articles that focus on technology and tracking.  The tone feels a little overwrought, with language that evokes spies, like “Stalking by Cellphone” and “The Web’s New Gold Mine: Your Secrets.”  Some of their methodology is a little simplistic.  Their study on how much people are “exposed” online was based on simply counting tracking tools, such as cookies and beacons, installed by certain websites.

It is interesting, though, to see that the big, bad wolves of privacy, like Facebook and Google, are pretty low on the WSJ’s exposure scale, while sites people don’t really think about, like dictionary.com, are very high.  The debate around online data collection does need to shift to include companies that aren’t so name-brand.

2.  In response to WSJ’s feature, AdAge published this article by Erin Jo Richey, a digital marketing analyst, addressing whether “online marketers are actually spies.” She argues that she doesn’t know that much about the people she’s tracking, but she does admit she could know more:

I spend most of my time looking at trends and segments of visitors with shared characteristics rather than focusing on profiles of individual browsers. However, if I already know that Mary Smith bought a black toaster with product number 08971 on Monday morning, I can probably isolate the anonymous profile that represents Mary’s visit to my website Monday morning.

3.  A nice graphic illustrates how data could transform healthcare. Are there people making these kinds of detailed arguments made for other industries and areas of research and policy?

4.  There are now two proposed privacy bills in Congress, the BEST PRACTICES bill proposed by Representative Bobby Rush and the draft proposed by Representatives Rick Boucher and Cliff Stearns.  CDT has released a clear and concise table breaking down the differences between these two proposed bills and what CDT recommends.  Some things that jumped out at us:

  • Both bills make exceptions for aggregated or de-identified data.  The BEST PRACTICES bill has a more descriptive definition of what that means, stating that it excepts aggregated information and information from which identifying information has been obscured or removed, such that there is no reasonable basis to believe that the information could be used to identify an individual or a computer used by the individual.  CDT supports the BEST PRACTICES exception.
  • Both bills make some, though not sweeping provisions, for consumer access to the information collected about them.  CDT endorses neither, and would support a bill that would generally require covered entities to make available to consumers the covered information possessed about them along with a reasonable method of correction.  Some companies, including a start-up called Bynamite, have already begun to show consumers what’s being collected, albeit in rather limited ways.  We at the Common Data Project hope this push to access also includes access to the richness of the information collected from all of us, and not just the interests asssociated with me.  It’ll be interesting to see where this legislation goes, and how it might affect the development of our datatrust.

In the mix…Facebook “breach” of public data, data-mining for everyone, thinking through the Panton Principles, and BEST PRACTICES Act in Congress

Friday, July 30th, 2010

1) Facebook’s in privacy trouble again. Ron Bowes created a downloadable file containing information on 100 million searchable Facebook profiles, including the URL, name, and unique ID.  What’s interesting is that it’s not exactly a breach.  As Facebook pointed out, the information was already public.  What Facebook will likely never admit, though, is that there is a qualitative difference between information that is publicly available, and information that is organized into an easily searchable database.  This is what we as a society are struggling to define — if “public” means more public than ever before, how do we balance our societal interests in both privacy and disclosure?

2) Can data mining go mainstream? The article doesn’t actually say much, but it does at least raise an important question.  The value of data and data-mining is immense, as corporations and large government agencies know well.  Will those tools every be available to individuals?  Smaller businesses and organizations?  And what would that mean for them?  It’s a big motivator for us at the Common Data Project — if data doesn’t belong to anyone, and it’s been collected from us, shouldn’t we all be benefiting from data?

3) In the same vein is a new blog by Peter Murray-Rust discussing open knowledge/open data issues, focusing on the Panton Principles for open science data.

4) A new data privacy bill has been introduced in Congress called “Building Effective Strategies to Promote Responsibility Accountability Choice Transparency Innovation Consumer Expectations and Safeguards” Act, aka “BEST PRACTICES Act.”  The Information Law Group has posted Part One of FAQs on this proposed bill.

Although the bill is still being debated and rewritten, some of its provisions indicate that the author of the bill knows a bit more about data and privacy issues than many other Congressional representatives.

  • The information regulated by the Act goes beyond the traditional, American definition of personally identifiable information.  “The definition of “covered information” in the Act does not require such a combination – each data element stands on its own and may not need to be tied to or identify a specific person. If I, as an individual, had an email address that was wildwolf432@hotmail.com, that would would appear to satisfy the definition of covered information even if my name was not associated with it.”
  • Notice is required when information will be merged or combined with other data.
  • There’s some limited push to making more information accessible to users: “covered entities, upon request, must provide individuals with access to their personal files.” However, they only have to if “the entity stores such file in a manner that makes it accessible in the normal course of business,” which I’m guessing would apply to much of the data collected by internet companies.

In the mix…new organizational structures, giant list of data brokers, governments sharing citizens’ financial data, and what IT security has to do with Lady Gaga

Friday, July 9th, 2010

1) More on new kinds of organizational structures for entities that want to form for philanthropic purposes but not fit into the IRS definition of a nonprofit.

2) CDT shone a spotlight on Spokeo, a data broker last week.  Who are other data brokers? Don’t be shocked, there are A LOT of them.  What they do, they mainly do out of the spotlight shone on companies like Facebook, but with very real effects.  In 2005, ChoicePoint sold data to identity thieves posing as a legitimate business.

3) The U.S. has come to an agreement with Europe on sharing finance data, which the U.S. argues is an essential tool of counterterrorism.  The article doesn’t say exactly how these investigations work, whether specific suspects are targeted or whether large amounts of financial data are combed for suspicious activity.  It does make me wonder, given how data crosses borders more easily than any other resource, how will Fourth Amendment protections in the U.S. (and similar protections in other countries) apply to these international data exchanges?  There is also this pithy quote:

Giving passengers a way to challenge the sharing of their personal data in United States courts is a key demand of privacy advocates in Europe — though it is not clear under what circumstances passengers would learn that their records were being misused or were inaccurate.

4) Don’t mean to focus so much on scary data stuff, but 41% of IT professionals admit to abusing privileges.  In a related vein, it turns out a disgruntled soldier accused of illegally downloading classified data managed to do it by disguising his CDs as Lady Gaga CDs.  Even better,

He was able to avoid detection not because he kept a poker face, they said, but apparently because he hummed and lip-synched to Lady Gaga songs to make it appear that he was using the classified computer’s CD player to listen to music.

The New York Times is definitely getting cheekier.

In the mix…philanthropic entities, who’s online doing what, data brokers, and data portability

Monday, July 5th, 2010

1) Mimi and I are constantly discussing what it means to be a nonprofit organization, whether it’s a legal definition or a philosophical one.  We both agree, though, that our current system is pretty narrow, which is why it’s interesting to see states considering new kinds of entities, like the low-profit LLC.

2) This graphic of who’s online and what they’re doing isn’t going to tell you anything you don’t already know, but I like the way it breaks down the different ways to be online.  (via FlowingData) At CDP, as we work on creating a community for the datatrust, we want to create avenues for different levels of participation.  I’d be curious to see this updated for 2010, and to see if and how people transition from being passive userd to more active userd of the internet.

3) CDT has filed a complaint against Spokeo, a data broker, alleging, “Consumers have no access to the data underlying Spokeo’s conclusions, are not informed of adverse determinations based on that data, and have no opportunity to learn who has accessed their profiles.” We’ve been wondering when people would start to look at data businesses, which have even less reason to care about individuals’ privacy than businesses with customers like Google and Facebook.  We’re interested to see what happens.

4) The Data Portability Project is advocating for every site to have a Portability Policy that states clearly what data visitors can take in and take out. The organization believes “a lot more economic value could be created if sites realized the opportunity of an Internet whose sites do not put borders around people’s data.” (via Techcrunch)  It definitely makes sense to create standards, though I do wonder how standards and icons like the ones they propose would be useful to the average internet user.

Who has your data and how can the government get it?

Monday, June 28th, 2010

Who has your data? And how can the government get it?

The questions are more complicated than they might seem.

In the last month, we’ve seen Facebook criticized and scrutinized at every turn for the way they collect and share their users’ data.  Much of that criticism was deserved, but what was missing in that discussion were the companies that have your data without even your knowledge, let alone your consent.

The relationship between a user and Facebook is at least relatively straightforward.  The user knows his or her data has been placed in Facebook, and legislation could be updated relatively easily to protect his or her expectation of privacy in that data.

But what about the data consumer service companies share with third parties?

Pharmacies sell prescription data that includes you; cellphone-related businesses sell data that includes you.

So much of the data economy involves companies and businesses that don’t necessarily have you as a customer, and thus even less incentive to protect your interests.

What about data that’s supposedly de-identified or anonymized?  We know that such data can be combined with another dataset to re-identify people.  Could the government seek that kind of data and avoid getting even a subpoena?  Increasingly, the companies that have data about you aren’t even the companies you initially transacted with.  How will existing privacy laws, even proposed reforms by the Digital Due Process coalition, deal with this reality?

These are all questions that consume us at the Common Data Project for good reason.  As an organization dedicated to enabling the safe disclosure of personal information, we are committed to talking about privacy and anonymity in measurable ways, rather than with vague promises.

If you read a typical privacy policy, you’ll see language that goes something like this,

Google only shares personal information with other companies or individuals outside of Google in the following limited circumstances:…

We have a good faith belief that access, use, preservation or disclosure of such information is reasonably necessary to (a) satisfy any applicable law, regulation, legal process or enforceable governmental request

We think the datatrust needs to be do better than that. We want to know exactly what “enforceable government request” means.  We want to think creatively about what individual privacy rights mean when organizations are sharing information with each other. We’ve written up the aspects that seem most directly relevant to our project here, including 1) a quick overview of federal privacy law; 2) implications for data collectors today; and 3) implications for the datatrust.

We ultimately have more questions than answers.  But we definitely can’t assume we know everything there is to know.  Even at the Supreme Court, where the Justices seem to have some trouble understanding how pagers and text messages work, they understand that the world is changing quickly.  (See City of Ontario v. Quon.)  We all need to be asking questions together.

So take a look.  Let us know if there are issues we’re missing. What are some other questions we should be asking?

In the mix…democratizing access to data, data literacy, and predictable responses to proposed privacy bill

Friday, June 18th, 2010

1) Infochimps launched their API. People often ask, are you guys doing something similar?  Yes, in that we are also interested in democratizing access to data, but we’re focusing on a narrower area — information that’s too sensitive and too personal to release in the usual channels. In any case, we’re excited to see more movement in this direction.

2) Wikipedia began a trial of a new tool called “Pending Changes.” To deal with glaring inaccuracies and vandalism, Wikipedia made certain entries off-limits for off-the-cuff editing.  The trade-off, however, was that first-time editors to these articles couldn’t get that immediate thrill of seeing their edits.  Wikipedia’s trying out a compromise, a tab in which these edits are visible as “pending changes.”  It’s always fascinating to see all the different spaces in which people in a community can interact online — this is a new one.

3) The Info Law Group posted various groups’ reactions to the privacy bill proposed by Representative Rick Boucher. Here’s Part I, here’s Part II. Fairly predictable, but it still never ceases to amuse me how far apart industry groups are from consumer advocates.

4) Great discussion continues on the concept of “data literacy.” I love this guest post from David Eaves on the Open Knowledge Foundation blog, with the awesome line:

It is worth remembering: We didn’t build libraries for an already literate citizenry. We built libraries to help citizens become literate. Today we build open data portals not because we have a data or public policy literate citizenry, we build them so that citizens may become literate in data, visualization, coding and public policy.

In the mix…EU data retention laws, Wikipedia growing

Friday, June 11th, 2010

1) Australia thinking about requiring ISPs to record browsing histories (via Truste).

Electronic Frontier Australia (EFA) chair Colin Jacobs said the regime was “a step too far”.

“At some point data retention laws can be reasonable, but highly-personal information such as browsing history is a step too far,” Jacobs said. “You can’t treat everybody like a criminal. That would be like tapping people’s phones before they are suspected of doing any crime.”

Sounds shocking, but the EU already requires it.

2) European privacy officials are pointing out that Microsoft, Google and Yahoo’s methods of “anonymization” are not good enough to comply with EU requirements (via EFF).  As we’ve been saying for awhile, “anonymization” is not a very precise claim.  (Even though they also want ISPs to retain browsing histories for law enforcement–confused? I am.)

3) Wikipedia is adding two new executive roles.  In the process of researching our community study, it really struck me how small Wikipedia‘s staff was compared to the staff of more centralized, less community-run businesses like Yelp and Facebook.  Having two more staff members is not a huge increase, but it does make me wonder, is a larger staff inevitable when an organization tries to assert more editorial control over what the community produces?

Exploding Manholes and Anonymizing Relationships at CCICADA

Thursday, June 10th, 2010

Last month CCICADA hosted a workshop at Rutgers on “statistical issues in analyzing information from diverse sources”.  For those curious, CCICADA stands for Command, Control, and Interoperability Center for Advanced Data Analysis.  Though the specific applications did not necessarily deal with sensitive data, I attended with an eye towards how the analyses presented might fit into the world of the datatrust.   Here’s a look at a couple of examples from the workshop:

Exploding Manholes!

Cynthia Rudin from MIT gave a talk on her work “Mitigating Manhole Events in New York City Using Machine Learning”.  Manholes provide access to the city’s underground electrical system.  When the insulation material wears down, there is risk of a “manhole event” which can range up to a fiery explosion.  The power company has finite resources to investigate and fix at-risk manholes, so her system predicts which manholes are most at risk based on information in tickets filed with the power company (e.g. lights flickering at X address, manhole cover smoking at Y).

Preventing exploding manholes is interesting, but how might this relate to the datatrust? It turns out that when the power company is logging tickets, they’re not doing it with machine learning for manhole events in mind.  One of the biggest challenges in using this unstructured data for this purpose was cleaning it—in this case, converting a blob of text into something analyzable.  While I’m not sure there’s any need to put manhole event data in a datatrust, naturally I started imagining the challenges around this.  First, it’s hard to imagine being able to effectively clean the data once it’s behind the differential privacy wall.  The cleaning was an iterative process that involved some manual work with these text blobs.

For us, the takeaway was that some kinds of data will need to be cleaned while you still have direct access to it, before it is placed behind the anonymization wall of the datatrust.  This means that the data donors will need to do the cleaning and it can’t be farmed out to the community at large without compromising the privacy guarantee.

Second, the cleaning seemed to be somewhat context-sensitive. That is, for their particular application, they were keeping and discarding certain pieces of information in the blob.  Just as an example, if I was trying to determine the ratio of males to females writing these tickets, I might need a different set of data points extracted from the blob.  So, while we’ve spent quite a few words here discussing the challenges around a meaningful privacy guarantee, this was a nice reminder that all of the challenges in dealing with data will also apply to sensitive data.

Anonymizing Relationships

Of particular relevance to CDP was Graham Cormode from AT&T research and his talk on “Anonymization and Uncertainty in Social Network Data”.  The general purpose of his work, similar to ours, is to allow analysis of sensitive data without infringing on privacy.  If you’re a frequent reader, you’ve noticed that we’ve been primarily discussing differential privacy and specifically PINQ as a method for managing privacy.  Graham presented a different technique for anonymizing data.  I’ll set up the problem he’s trying to solve, but I’m not going to get into the details of how he solves it.

Graham’s technique anonymizes graphs, particularly social network interaction graphs.  In this case, think of a graph as having a node for every person on Facebook, and a node for each way they interact.  Then there are edges connecting the people to the interactions.  Here is an example of a portion of a graph:

Graham’s anonymization requirement is that we should not be able to learn of the existence of any interaction, and we should be able to “quantify how much background knowledge is needed to break” the protection.

How does he achieve this?  The general idea is by some intelligent grouping of the people nodes. I’ll illustrate the general idea with an example of simple grouping—we’ll group Grant and Alex together, meaning we’ll replace both the “Grant node” and the “Alex node” with a “Grant or Alex node”, and we’ll do the same for the “Mimi” and “Grace” nodes.  (We would also replace the names with demographic information to allow us to make general conclusions.)

Now, this is reminiscent of one of those logic puzzles, where you have several hints and have to deduce the answer.  (One of Mimi and Grace poked Grant or Alex!)  Except in this case, if the grouping is done properly, the hints will not be sufficient to deduce any of the individual interactions.

You can find a much more complete explanation of the method here in Graham’s paper, but I thought this was a good example to contrast PINQ’s strategy:

PINQ acts as a wall to the data only allowing noisy aggregates to pass through, while this technique creates a new uncertain version of the dataset which you can then freely look at.
Get Adobe Flash playerPlugin by wpburn.com wordpress themes