Archive for the ‘Protecting Privacy in Meaningful Ways’ Category

In the mix…Facebook “breach” of public data, data-mining for everyone, thinking through the Panton Principles, and BEST PRACTICES Act in Congress

Friday, July 30th, 2010

1) Facebook’s in privacy trouble again. Ron Bowes created a downloadable file containing information on 100 million searchable Facebook profiles, including the URL, name, and unique ID.  What’s interesting is that it’s not exactly a breach.  As Facebook pointed out, the information was already public.  What Facebook will likely never admit, though, is that there is a qualitative difference between information that is publicly available, and information that is organized into an easily searchable database.  This is what we as a society are struggling to define — if “public” means more public than ever before, how do we balance our societal interests in both privacy and disclosure?

2) Can data mining go mainstream? The article doesn’t actually say much, but it does at least raise an important question.  The value of data and data-mining is immense, as corporations and large government agencies know well.  Will those tools ever be available to individuals?  To smaller businesses and organizations?  And what would that mean for them?  It’s a big motivator for us at the Common Data Project — if data doesn’t belong to anyone, and it’s been collected from us, shouldn’t we all be benefiting from it?

3) In the same vein is a new blog by Peter Murray-Rust discussing open knowledge/open data issues, focusing on the Panton Principles for open science data.

4) A new data privacy bill has been introduced in Congress called the “Building Effective Strategies To Promote Responsible Accountability, Choice, Transparency, Innovation, Consumer Expectations and Safeguards Act,” aka the “BEST PRACTICES Act.”  The Information Law Group has posted Part One of its FAQs on this proposed bill.

Although the bill is still being debated and rewritten, some of its provisions indicate that the author of the bill knows a bit more about data and privacy issues than many other Congressional representatives.

  • The information regulated by the Act goes beyond the traditional American definition of personally identifiable information.  “The definition of “covered information” in the Act does not require such a combination – each data element stands on its own and may not need to be tied to or identify a specific person. If I, as an individual, had an email address that was wildwolf432@hotmail.com, that would appear to satisfy the definition of covered information even if my name was not associated with it.”
  • Notice is required when information will be merged or combined with other data.
  • There’s some limited push toward making more information accessible to users: “covered entities, upon request, must provide individuals with access to their personal files.” However, they only have to do so if “the entity stores such file in a manner that makes it accessible in the normal course of business,” which I’m guessing would apply to much of the data collected by internet companies.

In the mix..government surveillance, HIPAA updates, and user control over online data

Monday, July 19th, 2010

1) The U.S. government comes up with some, um, interesting names for its surveillance programs.  “Perfect Citizen” sounds like it’s right out of Orwell. As the article points out, there are some major unanswered questions.  How do they collect this data?  Where do they get it?  Do they use it just to look for interesting patterns that then lead them to identify specific individuals, or are all the individuals apparent and visible from the get-go?  And what are the regulations around re-use of this data?

2) Health and Human Services has issued proposed updated regulations to HIPAA, the law regulating how personal health information is shared. CDT has made some comments about how these regulations will affect patient privacy, data security, and enforcement.  HIPAA, to some extent, lays out some useful standards on things like how electronic health information should be transmitted.  But it has also been controversial for suppressing information-sharing, even when sharing is legal and warranted.

So instead of talking about what we can’t do, what if we started talking about what we can do with electronic health data?  I’m not imagining a list of uses where anything outside the list is barred, but rather an outline of the kinds of uses that are useful.  The whole point of electronic health records is to make information more easily shareable, so that care is more continuous and comprehensive and research more efficient and effective.

I love this bit from an interview with a neuroscientist who studies dog brains because, “dogs aren’t covered by Hipaa! Their records aren’t confidential!”

3) A start-up called Bynamite is trying to give users control over the information they share with advertisers online. It’s another take on something we’ve seen from Google and BlueKai, where users get to see what interests have been associated with them.  Like those services, Bynamite allows you to remove interests that don’t pertain to you or that you don’t want to share.  Bynamite then goes further by opting you out of networks that won’t let you make these choices.  That definitely sounds easier than managing P3P, and easier than reading through the policies of all the companies that participate in the National Advertising Initiative.

I agree with Professor Acquisti that all of us, when we use Google or any other free online service, are paying for our use of the service with our personal information, and that Bynamite is trying to make that transaction more explicit.  But I wonder whether the value companies gain from that data is ever made explicit.  Is the price of the transaction fair?  Does one hour of free Google search equal x number of personal data bits?  Can you even put a dollar value on that transaction, given that the true value of all this data is in the aggregate?

The accompanying blog post to this article cites a study demonstrating how hard it is to assign a dollar value to privacy.  The study subjects clearly did value “privacy,” but the price they put on it depended on how much privacy they felt they had to begin with!

In the mix…new organizational structures, giant list of data brokers, governments sharing citizens’ financial data, and what IT security has to do with Lady Gaga

Friday, July 9th, 2010

1) More on new kinds of organizational structures for entities that want to form for philanthropic purposes but don’t fit into the IRS definition of a nonprofit.

2) CDT shone a spotlight on Spokeo, a data broker, last week.  Who are the other data brokers? Don’t be shocked, but there are A LOT of them.  What they do, they mainly do out of the spotlight shone on companies like Facebook, but with very real effects.  In 2005, ChoicePoint sold data to identity thieves posing as a legitimate business.

3) The U.S. has come to an agreement with Europe on sharing financial data, which the U.S. argues is an essential tool of counterterrorism.  The article doesn’t say exactly how these investigations work, whether specific suspects are targeted or whether large amounts of financial data are combed for suspicious activity.  It does make me wonder: given how data crosses borders more easily than any other resource, how will Fourth Amendment protections in the U.S. (and similar protections in other countries) apply to these international data exchanges?  There is also this pithy quote:

Giving passengers a way to challenge the sharing of their personal data in United States courts is a key demand of privacy advocates in Europe — though it is not clear under what circumstances passengers would learn that their records were being misused or were inaccurate.

4) Don’t mean to focus so much on scary data stuff, but 41% of IT professionals admit to abusing privileges.  In a related vein, it turns out a disgruntled soldier accused of illegally downloading classified data managed to do it by disguising his CDs as Lady Gaga CDs.  Even better,

He was able to avoid detection not because he kept a poker face, they said, but apparently because he hummed and lip-synched to Lady Gaga songs to make it appear that he was using the classified computer’s CD player to listen to music.

The New York Times is definitely getting cheekier.

In the mix…philanthropic entities, who’s online doing what, data brokers, and data portability

Monday, July 5th, 2010

1) Mimi and I are constantly discussing what it means to be a nonprofit organization, whether it’s a legal definition or a philosophical one.  We both agree, though, that our current system is pretty narrow, which is why it’s interesting to see states considering new kinds of entities, like the low-profit LLC.

2) This graphic of who’s online and what they’re doing isn’t going to tell you anything you don’t already know, but I like the way it breaks down the different ways to be online.  (via FlowingData) At CDP, as we work on creating a community for the datatrust, we want to create avenues for different levels of participation.  I’d be curious to see this updated for 2010, and to see if and how people transition from being passive users to more active users of the internet.

3) CDT has filed a complaint against Spokeo, a data broker, alleging, “Consumers have no access to the data underlying Spokeo’s conclusions, are not informed of adverse determinations based on that data, and have no opportunity to learn who has accessed their profiles.” We’ve been wondering when people would start to look at data businesses, which have even less reason to care about individuals’ privacy than businesses with customers like Google and Facebook.  We’re interested to see what happens.

4) The Data Portability Project is advocating for every site to have a Portability Policy that states clearly what data visitors can take in and take out. The organization believes “a lot more economic value could be created if sites realized the opportunity of an Internet whose sites do not put borders around people’s data.” (via Techcrunch)  It definitely makes sense to create standards, though I do wonder how standards and icons like the ones they propose would be useful to the average internet user.

Who has your data and how can the government get it?

Monday, June 28th, 2010

Who has your data? And how can the government get it?

The questions are more complicated than they might seem.

In the last month, we’ve seen Facebook criticized and scrutinized at every turn for the way it collects and shares its users’ data.  Much of that criticism was deserved, but what was missing from that discussion was any mention of the companies that have your data without even your knowledge, let alone your consent.

The relationship between a user and Facebook is at least relatively straightforward.  The user knows his or her data has been placed in Facebook, and legislation could be updated relatively easily to protect his or her expectation of privacy in that data.

But what about the data that consumer service companies share with third parties?

Pharmacies sell prescription data that includes you; cellphone-related businesses sell data that includes you.

So much of the data economy involves companies and businesses that don’t necessarily have you as a customer, and thus have even less incentive to protect your interests.

What about data that’s supposedly de-identified or anonymized?  We know that such data can be combined with another dataset to re-identify people.  Could the government seek that kind of data and avoid getting even a subpoena?  Increasingly, the companies that have data about you aren’t even the companies you initially transacted with.  How will existing privacy laws, even proposed reforms by the Digital Due Process coalition, deal with this reality?
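To make the re-identification worry concrete, here’s a minimal sketch of the kind of linkage attack researchers have demonstrated. Everything in it is invented — the records, the fields, and the overlap are purely illustrative — but it shows how joining a “de-identified” dataset with a public one on shared quasi-identifiers can put names back on records.

```python
# Toy linkage attack: "de-identified" records are re-identified by joining
# them with a public dataset on shared quasi-identifiers (ZIP, birth date, sex).
# All records below are made up for illustration.

deidentified_records = [
    {"zip": "11211", "birthdate": "1980-04-02", "sex": "F", "diagnosis": "asthma"},
    {"zip": "11215", "birthdate": "1975-09-17", "sex": "M", "diagnosis": "diabetes"},
]

public_records = [
    {"name": "Jane Doe", "zip": "11211", "birthdate": "1980-04-02", "sex": "F"},
    {"name": "John Roe", "zip": "11215", "birthdate": "1975-09-17", "sex": "M"},
]

def reidentify(anonymous, public):
    """Join the two datasets on the quasi-identifiers they share."""
    key = lambda r: (r["zip"], r["birthdate"], r["sex"])
    names = {key(p): p["name"] for p in public}
    return [{"name": names[key(a)], "diagnosis": a["diagnosis"]}
            for a in anonymous if key(a) in names]

print(reidentify(deidentified_records, public_records))
# [{'name': 'Jane Doe', 'diagnosis': 'asthma'}, {'name': 'John Roe', 'diagnosis': 'diabetes'}]
```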

These are all questions that consume us at the Common Data Project for good reason.  As an organization dedicated to enabling the safe disclosure of personal information, we are committed to talking about privacy and anonymity in measurable ways, rather than with vague promises.

If you read a typical privacy policy, you’ll see language that goes something like this:

Google only shares personal information with other companies or individuals outside of Google in the following limited circumstances:…

We have a good faith belief that access, use, preservation or disclosure of such information is reasonably necessary to (a) satisfy any applicable law, regulation, legal process or enforceable governmental request

We think the datatrust needs to do better than that. We want to know exactly what “enforceable governmental request” means.  We want to think creatively about what individual privacy rights mean when organizations are sharing information with each other. We’ve written up the aspects that seem most directly relevant to our project here, including 1) a quick overview of federal privacy law; 2) implications for data collectors today; and 3) implications for the datatrust.

We ultimately have more questions than answers.  But we definitely can’t assume we know everything there is to know.  Even the Supreme Court Justices, who seem to have some trouble understanding how pagers and text messages work, recognize that the world is changing quickly.  (See City of Ontario v. Quon.)  We all need to be asking questions together.

So take a look.  Let us know if there are issues we’re missing. What are some other questions we should be asking?

In the mix…democratizing access to data, data literacy, and predictable responses to proposed privacy bill

Friday, June 18th, 2010

1) Infochimps launched their API. People often ask, are you guys doing something similar?  Yes, in that we are also interested in democratizing access to data, but we’re focusing on a narrower area — information that’s too sensitive and too personal to release in the usual channels. In any case, we’re excited to see more movement in this direction.

2) Wikipedia began a trial of a new tool called “Pending Changes.” To deal with glaring inaccuracies and vandalism, Wikipedia made certain entries off-limits for off-the-cuff editing.  The trade-off, however, was that first-time editors to these articles couldn’t get that immediate thrill of seeing their edits go live.  Wikipedia’s trying out a compromise: a tab in which these edits are visible as “pending changes.”  It’s always fascinating to see all the different spaces in which people in a community can interact online — this is a new one.

3) The Info Law Group posted various groups’ reactions to the privacy bill proposed by Representative Rick Boucher. Here’s Part I, here’s Part II. Fairly predictable, but it still never ceases to amuse me how far apart industry groups are from consumer advocates.

4) Great discussion continues on the concept of “data literacy.” I love this guest post from David Eaves on the Open Knowledge Foundation blog, with the awesome line:

It is worth remembering: We didn’t build libraries for an already literate citizenry. We built libraries to help citizens become literate. Today we build open data portals not because we have a data or public policy literate citizenry, we build them so that citizens may become literate in data, visualization, coding and public policy.

In the mix…EU data retention laws, Wikipedia growing

Friday, June 11th, 2010

1) Australia is thinking about requiring ISPs to record browsing histories (via Truste).

Electronic Frontier Australia (EFA) chair Colin Jacobs said the regime was “a step too far”.

“At some point data retention laws can be reasonable, but highly-personal information such as browsing history is a step too far,” Jacobs said. “You can’t treat everybody like a criminal. That would be like tapping people’s phones before they are suspected of doing any crime.”

Sounds shocking, but the EU already requires it.

2) European privacy officials are pointing out that Microsoft, Google and Yahoo’s methods of “anonymization” are not good enough to comply with EU requirements (via EFF).  As we’ve been saying for a while, “anonymization” is not a very precise claim.  (Even though they also want ISPs to retain browsing histories for law enforcement. Confused? I am.)

3) Wikipedia is adding two new executive roles.  In the process of researching our community study, it really struck me how small Wikipedia‘s staff was compared to the staff of more centralized, less community-run businesses like Yelp and Facebook.  Having two more staff members is not a huge increase, but it does make me wonder, is a larger staff inevitable when an organization tries to assert more editorial control over what the community produces?

Exploding Manholes and Anonymizing Relationships at CCICADA

Thursday, June 10th, 2010

Last month CCICADA hosted a workshop at Rutgers on “statistical issues in analyzing information from diverse sources”.  For those curious, CCICADA stands for Command, Control, and Interoperability Center for Advanced Data Analysis.  Though the specific applications did not necessarily deal with sensitive data, I attended with an eye towards how the analyses presented might fit into the world of the datatrust.   Here’s a look at a couple of examples from the workshop:

Exploding Manholes!

Cynthia Rudin from MIT gave a talk on her work “Mitigating Manhole Events in New York City Using Machine Learning”.  Manholes provide access to the city’s underground electrical system.  When the insulation material wears down, there is risk of a “manhole event” which can range up to a fiery explosion.  The power company has finite resources to investigate and fix at-risk manholes, so her system predicts which manholes are most at risk based on information in tickets filed with the power company (e.g. lights flickering at X address, manhole cover smoking at Y).

Preventing exploding manholes is interesting, but how might this relate to the datatrust? It turns out that when the power company is logging tickets, they’re not doing it with machine learning for manhole events in mind.  One of the biggest challenges in using this unstructured data for this purpose was cleaning it—in this case, converting a blob of text into something analyzable.  While I’m not sure there’s any need to put manhole event data in a datatrust, naturally I started imagining the challenges around this.  First, it’s hard to imagine being able to effectively clean the data once it’s behind the differential privacy wall.  The cleaning was an iterative process that involved some manual work with these text blobs.
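To give a feel for what that cleaning step involves, here’s a minimal sketch. The ticket format and the regular expressions are invented — this is not the actual Con Edison data or Rudin’s pipeline — but it shows the basic move of turning a free-text blob into structured fields.

```python
import re

# Hypothetical free-text trouble tickets (the real ticket format is not public).
raw_tickets = [
    "6/3 10:42PM lights flickering at 123 W 45 ST, manhole cover smoking MH-0412",
    "6/4 1:15AM customer reports loss of power, 200 BROADWAY",
]

def clean_ticket(blob):
    """Turn one text blob into the structured fields an analysis might need."""
    manhole = re.search(r"MH-\d+", blob)
    address = re.search(r"\d+ [A-Z0-9 ]*(?:ST|AVE|BROADWAY)", blob)
    return {
        "manhole_id": manhole.group(0) if manhole else None,
        "address": address.group(0).strip() if address else None,
        "mentions_smoke": "smok" in blob.lower(),
        "mentions_flicker": "flicker" in blob.lower(),
    }

for ticket in raw_tickets:
    print(clean_ticket(ticket))
```

In practice this kind of extraction gets revised over and over as you look at the raw text, which is exactly the iterative, hands-on work that becomes impossible once the data is only reachable through noisy aggregates.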

For us, the takeaway was that some kinds of data will need to be cleaned while you still have direct access to it, before it is placed behind the anonymization wall of the datatrust.  This means that the data donors will need to do the cleaning and it can’t be farmed out to the community at large without compromising the privacy guarantee.

Second, the cleaning seemed to be somewhat context-sensitive. That is, for their particular application, they were keeping and discarding certain pieces of information in the blob.  Just as an example, if I were trying to determine the ratio of males to females writing these tickets, I might need a different set of data points extracted from the blob.  So, while we’ve spent quite a few words here discussing the challenges around a meaningful privacy guarantee, this was a nice reminder that all of the usual challenges of dealing with data also apply to sensitive data.

Anonymizing Relationships

Of particular relevance to CDP was the talk by Graham Cormode of AT&T Research on “Anonymization and Uncertainty in Social Network Data”.  The general purpose of his work, similar to ours, is to allow analysis of sensitive data without infringing on privacy.  If you’re a frequent reader, you’ll have noticed that we’ve been primarily discussing differential privacy, and specifically PINQ, as a method for managing privacy.  Graham presented a different technique for anonymizing data.  I’ll set up the problem he’s trying to solve, but I’m not going to get into the details of how he solves it.

Graham’s technique anonymizes graphs, particularly social network interaction graphs.  In this case, think of a graph as having a node for every person on Facebook, and a node for each way they interact.  Then there are edges connecting the people to the interactions.  Here is an example of a portion of a graph:
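A minimal sketch of what such a bipartite graph might look like in code, using made-up people and interactions (the same hypothetical names reappear in the grouping example below):

```python
# A toy bipartite interaction graph: one set of nodes for people, another for
# interactions, and edges linking each person to the interactions they took
# part in. All names and interactions are invented for illustration.
people = ["Grant", "Alex", "Mimi", "Grace"]
interactions = ["poke-1", "message-2"]

edges = [
    ("Grant", "poke-1"),     # Grant was involved in poke-1
    ("Mimi", "poke-1"),      # ...and so was Mimi
    ("Alex", "message-2"),
    ("Grace", "message-2"),
]
```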

Graham’s anonymization requirement is that we should not be able to learn of the existence of any interaction, and we should be able to “quantify how much background knowledge is needed to break” the protection.

How does he achieve this?  The general idea is to do some intelligent grouping of the people nodes. I’ll illustrate with an example of simple grouping—we’ll group Grant and Alex together, meaning we’ll replace both the “Grant node” and the “Alex node” with a “Grant or Alex node”, and we’ll do the same for the “Mimi” and “Grace” nodes.  (We would also replace the names with demographic information to allow us to make general conclusions.)

Now, this is reminiscent of one of those logic puzzles, where you have several hints and have to deduce the answer.  (One of Mimi and Grace poked Grant or Alex!)  Except in this case, if the grouping is done properly, the hints will not be sufficient to deduce any of the individual interactions.
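Here’s a rough sketch of that naive grouping step on the toy graph above. The real method chooses groupings far more carefully, so that a stated amount of background knowledge still isn’t enough to pin down any single interaction; this version just hard-codes the groups to show the mechanics.

```python
# Replace each person node with its group node, so an edge now says only that
# *someone* in the group took part in the interaction. Groups are hard-coded
# here purely for illustration.
edges = [
    ("Grant", "poke-1"), ("Mimi", "poke-1"),
    ("Alex", "message-2"), ("Grace", "message-2"),
]
groups = {"Grant": "Grant-or-Alex", "Alex": "Grant-or-Alex",
          "Mimi": "Mimi-or-Grace", "Grace": "Mimi-or-Grace"}

anonymized_edges = sorted({(groups[person], interaction) for person, interaction in edges})
print(anonymized_edges)
# [('Grant-or-Alex', 'message-2'), ('Grant-or-Alex', 'poke-1'),
#  ('Mimi-or-Grace', 'message-2'), ('Mimi-or-Grace', 'poke-1')]
```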

You can find a much more complete explanation of the method here in Graham’s paper, but I thought this was a good example to contrast with PINQ’s strategy:

PINQ acts as a wall in front of the data, only allowing noisy aggregates to pass through, while this technique creates a new, uncertain version of the dataset which you can then freely look at.
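To make the contrast concrete, here’s a minimal sketch of the PINQ-style side of that comparison. This is not PINQ’s actual API, just the underlying differential-privacy idea: the analyst never sees rows, only aggregates perturbed with Laplace noise scaled to 1/epsilon.

```python
import random

def laplace_noise(scale):
    # A Laplace(0, scale) sample, expressed as the difference of two exponentials.
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def noisy_count(records, predicate, epsilon):
    """Differentially private count: the true count plus Laplace(1/epsilon) noise."""
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical interaction records; the analyst only ever sees the noisy answer.
records = [
    {"person": "Grant", "interaction": "poke-1"},
    {"person": "Mimi", "interaction": "poke-1"},
    {"person": "Alex", "interaction": "message-2"},
]
print(noisy_count(records, lambda r: r["interaction"] == "poke-1", epsilon=0.5))
```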

In the mix…data for coupons, information literacy, most-visited sites

Friday, June 4th, 2010

1) There’s obviously an increasing move to a model of data collection in which the company says, “give us your data and get something in return,” a quid pro quo.  But as Marc Rotenberg at EPIC points out,

The big problem is that these business models are not very stable. Companies set out privacy policies, consumers disclose data, and then the action begins…The business model changes. The companies simply want the data, and the consumer benefit disappears.

It’s not enough to start with compensating consumers for their data.  The persistent, shareable nature of data makes it very different from a transaction involving money, where someone can buy, walk away, and never interact with the company again.  These data-centered companies are creating a network of users whose data are continually used in the business.  Maybe it’s time for a new model of business, where governance plans incorporate ways for users to be involved in decisions about their data.

2) In a related vein, danah boyd argues that transparency should not be an end in itself, and that information literacy needs to be developed in conjunction with information access.  A similar argument can be made about the concept of privacy.  In “real life” (i.e., offline life), no one aims for total privacy.  Every day, we make decisions about what we want to share with whom.  Online, total privacy and “anonymization” are also impossible, no matter what a company promises in its privacy policy.  For our datatrust, we’re going to use PINQ, a technology based on differential privacy that acknowledges privacy is not binary, but something one spends.  So perhaps we’ll need to work on privacy and data literacy as well?
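Here’s a minimal sketch of what “spending” privacy looks like in practice. The numbers and the interface are illustrative, not PINQ’s: the point is simply that each query consumes part of a fixed epsilon budget, and once the budget is gone, no further questions get answered.

```python
class PrivacyBudget:
    """Toy epsilon accounting: every query spends part of a fixed budget."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted; no more queries allowed")
        self.remaining -= epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.4)      # first query
budget.spend(0.4)      # second query
try:
    budget.spend(0.4)  # third query: only 0.2 remains, so it is refused
except RuntimeError as error:
    print(error)
```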

3) Google recently released a list of the most visited sites on the Internet. Two questions jump out: a) Where is Google on this list? and b) Could the list be a proxy for the biggest data collectors online?

Mark Zuckerberg: It takes a village to build trust.

Friday, June 4th, 2010

This whole brouhaha over Facebook privacy appears to be stuck revolving around Mark Zuckerberg.

We seem to be stuck in a personal tug-of-war with the CEO of Facebook, frustrated that a 26-year-old personally has so much power over so many.

Meanwhile, Mark Z. is personally reassuring us that we can trust Facebook, which on some level implies we must trust him.

But should any single individual really be entrusted with so much? Especially “a 26 year-old nervous, sweaty guy who dodges the questions.” Harsh, but not a completely invalid point.

As users of Facebook, we all know that it is the content of all our lives and our relationships to each other that make Facebook special. As a result, we feel a sense of entitlement about Facebook policy-making that we don’t feel about services that are in many ways far more intrusive and/or less disciplined about protecting privacy (e.g. ISPs, cellphone providers, search).

Another way of putting it: Facebook is not Apple! As a result, it needs a CEO who is a community leader, not a dictator of cool.

So we start asking questions like, why should Facebook make the big bucks at the expense of my privacy? Shouldn’t I get a piece of that?

(Google’s been doing this for over a decade now, but the privacy exposure over at Google is invisible to the end-user.)

At some point, will we decide we would rather pay for a service than feel like we’re being manipulated by companies who know more about us than we do and can decide whether to use that information to help us or hurt us depending on profit margin?  Here’s another example.

Or are there other ways to counterbalance the corporate monopoly on personal information? We think so.

In order for us to trust Facebook, Facebook needs to stop feeling like a benevolent dictatorship, albeit one that is open to feedback, with a dictator who looks like he’s in need of a regent.

Instead, Facebook the company should consider adopting some significant community-driven governance reforms that would at least give it the patina of a democracy.


(Even if, at the end of the day, it is beholden to its owners and investors.)

For some context, this was the sum total of what Mark Z. had to say about how important decisions are made at Facebook:

We’re a company where there’s a lot of open dialogue. We have crazy dialogue and arguments. Every Friday, I have an open Q&A where people can come and ask me whatever questions they want. We try to do what we think is right, but we also listen to feedback and use it to improve. And we look at data about how people are using the site. In response to the most recent changes we made, we innovated, we did what we thought was right about the defaults, and then we listened to the feedback and then we holed up for two weeks to crank out a new privacy system.

Nothing outrageous. About par for your average web service. (But then again, Facebook isn’t your average web service.)

However, this is what should have been the meat of the discussion about how Facebook is going to address privacy concerns: community agency and decision-making, not Mark Z.’s personal vision of an interwebs brimming with serendipitous happenings.

Facebook the organization needs to be trusted. So it might be best if Mark Z. backed out of the limelight and stopped being the lone face of Facebook.

How might that D8 interview have turned out if he had come on stage with a small group of Facebook users?

What governance changes would make you feel more empowered as a Facebook user?

