Posts Tagged ‘Privacy’

In the mix…new organizational structures, giant list of data brokers, governments sharing citizens’ financial data, and what IT security has to do with Lady Gaga

Friday, July 9th, 2010

1) More on new kinds of organizational structures for entities that want to form for philanthropic purposes but not fit into the IRS definition of a nonprofit.

2) CDT shone a spotlight on Spokeo, a data broker, last week.  Who are the other data brokers? Don’t be shocked, there are A LOT of them.  What they do, they mainly do out of the spotlight shone on companies like Facebook, but with very real effects.  In 2005, ChoicePoint sold data to identity thieves posing as a legitimate business.

3) The U.S. has come to an agreement with Europe on sharing finance data, which the U.S. argues is an essential tool of counterterrorism.  The article doesn’t say exactly how these investigations work, whether specific suspects are targeted or whether large amounts of financial data are combed for suspicious activity.  It does make me wonder, given how data crosses borders more easily than any other resource, how will Fourth Amendment protections in the U.S. (and similar protections in other countries) apply to these international data exchanges?  There is also this pithy quote:

Giving passengers a way to challenge the sharing of their personal data in United States courts is a key demand of privacy advocates in Europe — though it is not clear under what circumstances passengers would learn that their records were being misused or were inaccurate.

4) Don’t mean to focus so much on scary data stuff, but 41% of IT professionals admit to abusing privileges.  In a related vein, it turns out a disgruntled soldier accused of illegally downloading classified data managed to do it by disguising his CDs as Lady Gaga CDs.  Even better,

He was able to avoid detection not because he kept a poker face, they said, but apparently because he hummed and lip-synched to Lady Gaga songs to make it appear that he was using the classified computer’s CD player to listen to music.

The New York Times is definitely getting cheekier.

In the mix…philanthropic entities, who’s online doing what, data brokers, and data portability

Monday, July 5th, 2010

1) Mimi and I are constantly discussing what it means to be a nonprofit organization, whether it’s a legal definition or a philosophical one.  We both agree, though, that our current system is pretty narrow, which is why it’s interesting to see states considering new kinds of entities, like the low-profit LLC.

2) This graphic of who’s online and what they’re doing isn’t going to tell you anything you don’t already know, but I like the way it breaks down the different ways to be online.  (via FlowingData) At CDP, as we work on creating a community for the datatrust, we want to create avenues for different levels of participation.  I’d be curious to see this updated for 2010, and to see if and how people transition from being passive users to more active users of the internet.

3) CDT has filed a complaint against Spokeo, a data broker, alleging, “Consumers have no access to the data underlying Spokeo’s conclusions, are not informed of adverse determinations based on that data, and have no opportunity to learn who has accessed their profiles.” We’ve been wondering when people would start to look at data businesses, which have even less reason to care about individuals’ privacy than businesses with customers like Google and Facebook.  We’re interested to see what happens.

4) The Data Portability Project is advocating for every site to have a Portability Policy that states clearly what data visitors can take in and take out. The organization believes “a lot more economic value could be created if sites realized the opportunity of an Internet whose sites do not put borders around people’s data.” (via Techcrunch)  It definitely makes sense to create standards, though I do wonder how standards and icons like the ones they propose would be useful to the average internet user.

A big update for the Common Data Project

Tuesday, June 29th, 2010

There’s been a lot going on at the Common Data Project, and it can be hard to keep track.  Here’s a quick recap.

Our Mission

The Common Data Project’s mission is to encourage and enable the disclosure of personal data for public use and research.

We live in a world where data is obviously valuable — companies make millions from data, nonprofits seek new ways to be more accountable, advocates push governments to make their data open.  But even as more data becomes accessible, even more valuable data remains locked up and unavailable to researchers, nonprofit organizations, businesses, and the general public.

We are working on creating a datatrust, a nonprofit data bank, that would incorporate new technologies for open data and new standards for collecting and sharing personal data.

We’ve refined what that means, what the datatrust is and what the datatrust is not.

Our Work

We’ve been working in partnership with Shan Gao Ma (SGM), a consultancy started by CDP founder, Alex Selkirk, that specializes in large-scale data collection systems, to develop a prototype of the datatrust.  The datatrust is a new technology platform that allows the release of sensitive data in “raw form” to the public with a measurable and therefore enforceable privacy guarantee.

In addition to this real privacy guarantee, the datatrust eliminates the need to “scrub” data before it’s released.  Right now, any organization that wants to release sensitive data has to spend a lot of time scrubbing and de-identifying data, using techniques that are frankly inexact and possibly ineffective.  The datatrust, in other words, could make real-time data possible.

Furthermore, the data that is released can be accessed in flexible, creative ways.  Right now, sensitive data is aggregated and released as statistics.  A public health official may have access to data that shows how many people are “obese” in a county, but she can’t “ask” how many people are “obese” within a 10-mile radius of a McDonald’s.

We have a demo of PINQ, an illustration of how you can safely query a sensitive data set through differential privacy, a relatively new, quantitative approach to protecting privacy.
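To give a flavor of what a differentially private query looks like, here is a minimal sketch in Python. It is not PINQ’s actual interface. The function names, the toy records, and the epsilon value are all assumptions made up for illustration. It uses the standard Laplace mechanism, which answers a count after adding random noise whose scale is 1/epsilon, applied to the public health example above.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    # Adding or removing one person changes a count by at most 1,
    # so Laplace noise with scale 1/epsilon is enough for a count.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical records: (is_obese, miles_to_nearest_mcdonalds)
records = [(True, 3.2), (False, 12.0), (True, 8.5), (True, 15.1), (False, 1.0)]

rng = random.Random(0)
answer = private_count(
    records,
    lambda r: r[0] and r[1] <= 10.0,  # "obese within 10 miles"
    epsilon=0.5,
    rng=rng,
)
print(round(answer, 2))
```

The analyst never sees the rows, only the noisy answer. A smaller epsilon means more noise and a stronger guarantee; a larger one means a more accurate answer at a higher privacy cost.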

We’ve also developed an accompanying privacy risk calculator to help us visualize the consequences of tweaking different levers in differential privacy.
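As a rough illustration of the kind of levers involved (this is not the calculator itself), the sketch below varies two of them: the epsilon charged per query and the number of queries allowed. It assumes basic sequential composition, where the epsilons of successive queries simply add up, and reports the standard differential privacy bound e^epsilon on how much any one person’s record can change the likelihood of a given output. The function name and the sample values are our own assumptions.

```python
import math

def worst_case_ratio(epsilon_per_query, num_queries):
    # Under basic sequential composition the epsilons of successive
    # queries add up. e**total bounds how much more (or less) likely any
    # output can become because one person's record is in the data.
    total_epsilon = epsilon_per_query * num_queries
    return total_epsilon, math.exp(total_epsilon)

for eps, n in [(0.1, 1), (0.1, 10), (1.0, 10)]:
    total, ratio = worst_case_ratio(eps, n)
    print(f"epsilon={eps} x {n} queries -> total {total:.1f}, ratio <= {ratio:.2f}")
```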

For CDP, improved privacy technology is only one part of the datatrust concept.

We’ve also been working on a number of organizational and policy issues:

A Quantifiable Privacy Guarantee: We are working through how differential privacy can actually yield a “measurable privacy guarantee” that is meaningful to the layman. (Thus far, it has been only a theoretical possibility. A specific “quantity” for the so-called “measurable privacy guarantee” has yet to be agreed upon by the research community.)

Building Community and Self-Governance: We’re wrapping up a blog series looking at online information-sharing communities and self-governance structures and how lessons learned from the past few years of experimentation in user-generated and user-monitored content can apply to a data-sharing community built around a datatrust.

We’ve also started outlining the governance questions we have to answer as we move forward, including who builds the technology, who governs the datatrust, and how we will monitor and prevent the datatrust from veering from its mission.  We know that this is an organization that must be transparent if it is to be trusted, and we are working on creating the kind of infrastructure that will make transparency inevitable.

Licensing Personal Information: We proposed a “Creative Commons” style license for sharing personal data and we’re following the work of others developing licenses for data. In particular, what does it mean to “give up” personal information to a third-party?

Privacy Policies: We published a guide to reading online privacy policies for the curious layman: An analysis of their pitfalls and ambiguities which was re-published by the IAPP and picked up by the popular technology blog, Read Write Web.

We’ve also started researching the issues we need to address to develop our own privacy policy.  In particular, we’ve been working on figuring out how we will deal with government requests for information.  We did some research into existing privacy law, both constitutional and statutory, but in many ways, we’ve found more questions than answers.  We’re interested in watching the progress of the Digital Due Process coalition as they work on reforming the Electronic Communications Privacy Act, but we anticipate that the datatrust will have to deal with issues that are more complex than an individual’s expectation of privacy in emails more than 180 days old.

Education: We regularly publish in-depth essays and news commentary on our blog, myplaceinthecrowd.org, covering topics such as the risk of re-identification with current methods of anonymization and the value of open datasets that are available for creative reuse.

We have a lot to work on, but we’re excited to move forward!

In the mix…data for coupons, information literacy, most-visited sites

Friday, June 4th, 2010

1) There’s obviously an increasing move to a model of data collection in which the company says, “give us your data and get something in return,” a quid pro quo.  But as Marc Rotenberg at EPIC points out,

The big problem is that these business models are not very stable. Companies set out privacy policies, consumers disclose data, and then the action begins…The business model changes. The companies simply want the data, and the consumer benefit disappears.

It’s not enough to start with compensating consumers for their data.  The persistent, shareable nature of data makes it very different from a transaction involving money, where someone can buy, walk away, and never interact with the company again.  These data-centered companies are creating a network of users whose data are continually used in the business.  Maybe it’s time for a new model of business, where governance plans incorporate ways for users to be involved in decisions about their data.

2) In a related vein, danah boyd argues that transparency should not be an end in itself, and that information literacy needs to be developed in conjunction with information access.  A similar argument can be made about the concept of privacy.  In “real life” (i.e., offline life), no one aims for total privacy.  Every day, we make decisions about what we want to share with whom.  Online, total privacy and “anonymization” are also impossible, no matter what the company promises in its privacy policy.  For our datatrust, we’re going to use PINQ, a technology using differential privacy, that acknowledges privacy is not binary, but something one spends.  So perhaps we’ll need to work on privacy and data literacy as well?

3) Google recently released a list of the most visited sites on the Internet. Two questions jump out: a) Where is Google on this list? and b) Could the list be a proxy for the biggest data collectors online?

Measuring the privacy cost of “free” services.

Wednesday, June 2nd, 2010

There was an interesting pair of pieces on this Sunday’s “On The Media.”

The first was “The Cost of Privacy,” a discussion of Facebook’s new privacy settings, which presumably make it easier for users to clamp down on what’s shared.

A few points that resonated with us:

  1. Privacy is a commodity we all trade for things we want (e.g. celebrity, discounts, free online services).
  2. Going down the path of having us all set privacy controls everywhere we go on the internet is impractical and unsustainable.
  3. If no one is willing to share their data, most of the services we love to get for free would disappear. Randall Rothenberg.
  4. The services collecting and using data don’t really care about you the individual, they only care about trends and aggregates. Dr. Paul H. Rubin.

We wish one of the interviewees had gone even farther and made the point that, since we all make decisions every day to trade a little bit of privacy in exchange for services, privacy policies really need to be built around notions of buying and paying, where what you “buy” are services and what you pay with are “units” of privacy risk (as in risk of exposure).

  1. “Here’s what you get in exchange for letting us collect data about you.”
  2. “Here’s the privacy cost of what you’re getting (in meaningful and quantifiable terms).”

(And no, we don’t believe that deleting data after 6 months and/or listing out all the ways your data will be used is an acceptable proxy for calculating “privacy cost.” Besides, such policies inevitably severely limit the utility of data and stifle innovation to boot.)

Gaining clarity around privacy cost is exactly where we’re headed with the datatrust. What’s going to make our privacy policy stand out is not that our privacy “guarantee” will be 100% ironclad.

We can’t guarantee total anonymity. No one can. Instead, what we’re offering is an actual way to “quantify” privacy risk so that we can track and measure the cost of each use of your data and we can “guarantee” that we will never use more than the amount you agreed to.

This in turn is what will allow us to make some measurable guarantees around the “maximum amount of privacy risk” you will be exposed to by having your data in the datatrust.
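One way to picture this kind of accounting is a simple ledger that charges every query against an agreed limit and refuses any query that would overspend it. The sketch below is only an illustration under our own assumptions: the class and method names are hypothetical, the “units” are differential privacy epsilons, and costs are simply added up (basic sequential composition). It is not a description of how the datatrust is actually implemented.

```python
class BudgetExceeded(Exception):
    """Raised when a query would spend more privacy than was agreed to."""

class PrivacyBudget:
    def __init__(self, limit):
        self.limit = limit  # the maximum total epsilon agreed to
        self.spent = 0.0    # the running total charged so far

    def charge(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so float rounding never blocks an exact fit.
        if self.spent + epsilon > self.limit + 1e-12:
            raise BudgetExceeded(
                f"query costs {epsilon}, only {self.limit - self.spent:.2f} left"
            )
        self.spent += epsilon
        return self.limit - self.spent  # what remains after this query

budget = PrivacyBudget(limit=1.0)
for cost in (0.4, 0.4, 0.4):
    try:
        remaining = budget.charge(cost)
        print(f"charged {cost}, {remaining:.1f} remaining")
    except BudgetExceeded as err:
        print(f"refused: {err}")
```

The point of the design is that a refusal happens before the query runs, so the total spent can never pass the limit, which is what makes the “maximum amount of privacy risk” a number that can be stated up front.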


The second segment was on privacy rights and issues of due process vis-a-vis the government and data-mining.

Kevin Bankston from EFF gave a good run-down of how ECPA is laughably ill-equipped to protect individuals using modern-day online services from unprincipled government intrusions.

One point that wasn’t made was that unlike search and seizure of physical property, the privacy impact of data-mining is easily several orders of magnitude greater. Like most things in the digital realm, it’s incredibly easy to sift through hundreds of thousands of user accounts whereas it would be impossibly onerous to search 100,000 homes or read 100,000 paper files.

This is why we disagree with the idea that we should apply old standards created for a physical world to the new realities of the digital one.

Instead, we need to look at actual harm and define new standards around limiting the privacy impact of investigative data-mining.

Again, this would require a quantitative approach to measuring privacy risk.

(Just to be clear, I’m not suggesting that we limit the size of the datasets being mined, that would defeat the purpose of data-mining. Rather, I’m talking about process guidelines for how to go about doing low-(privacy) impact data-mining. More to come on this topic.)

In the mix…Everyone’s obsessed with Facebook

Friday, May 7th, 2010

UPDATE: One more Facebook-related bit, a great graphic illustrating how Facebook’s default sharing settings have changed over the past five years by Matt McKeon. Highly recommend that you click through and watch how the wheel changes.

1) I love when other people agree with me, especially on subjects like Facebook’s continuing clashes with privacy advocates. Says danah boyd,

Facebook started out with a strong promise of privacy…You had to be at a university or some network to sign up. That’s part of how it competed with other social networks, by being the anti-MySpace.

2) EFF has a striking post on the changes made to Facebook’s privacy policy over the last five years.

3) There’s a new app for people who are worried about Facebook having their data, but it means you have to hand it over to this company which also states, it “may use your info to serve up ads that target your interests.” Hmm.

4) Consumer Reports is worried that we’re oversharing, but if we followed all its tips on how to be safe, what would be the point of being on a social network? On its list of things we shouldn’t do:

  • Posting a child’s name in a caption
  • Mentioning being away from home
  • Letting yourself be found by a search engine

What’s the fun of Facebook if you can’t brag about the pina colada you’re drinking on the beach right at that moment? I’m joking, but this list just underscores that we can’t expect to control safety issues solely through consumer choices. Another thing we shouldn’t do is put our full birthdate on display, though given how many people put details about their education, it wouldn’t necessarily be hard to guess which year someone was born. Consumer Reports is clearly focusing on its job, warning consumers, but it’s increasingly obvious privacy is not just a matter of personal responsibility.

5) In a related vein, there’s an interesting Wall St. Journal article on whether the Internet is increasing public humiliation. One WSJ reader, Paul Cooper, had this to say:

The simple rule here is that one should always assume that everything one does will someday be made public. Behave accordingly. Don’t do or say things you don’t want reported or repeated. At least not where anyone can see or hear you doing it. Ask yourself whether you trust the person who wants to take nude pictures of you before you let them take the pictures. It is not society’s job to protect your reputation; it’s your job. If you choose to act like a buffoon, chances are someone is going to notice.

Like I said above, in a world where the word “public” means really really public forever and ever, and “private” means whatever you manage to keep hidden from everyone you know, protecting “privacy” isn’t only a matter of personal responsibility. The Internet easily takes actions that are appropriate in certain contexts and republishes them in other contexts. People change, which is part of the fun of being human. Even if you’re not ashamed of your past, you may not want it following you around in persistent web form.

Perhaps on the bright side, we’ll get to a point where we can all agree everyone has done things that are embarrassing at some point and no one can walk around in self-righteous indignation. We’ve seen norms change elsewhere. When Bill Clinton was running for president, he felt compelled to say that he had smoked marijuana but had never inhaled. When Barack Obama ran for president 16 years later, he could say, “I inhaled–that was the point,” and no one blinked.

6) The draft of a federal online privacy bill has been released. In its comments, Truste notes, “The current draft language positions the traditional privacy policy as the go to standard for ‘notice’ — this is both a good and bad thing.” If nothing else, the “How to Read a Privacy Policy” report we published last year had a similar conclusion, that privacy policies are not going to save us.

Building a community: the implications of Facebook’s new features for privacy and community

Thursday, May 6th, 2010

As I described in my last post, the differences between MySpace and Facebook are so stark, they don’t feel like natural competitors to me.  One isn’t necessarily better than the other.  Rather, one is catering to people who are looking for more of a public, party atmosphere, and the other is catering to people who want to feel like they can go to parties that are more exclusive and/or more intimate, even when they have 1000 friends.

But this difference doesn’t mean that one’s personal information on Facebook is necessarily more “private” than on MySpace.  MySpace can feel more public.  There is no visible wall between the site and the rest of the Internet-browsing community.  But Facebook’s desire to make more of its users’ information public is no secret.  For Facebook to maintain its brand, though, it can’t just make all information public by default.  This is a company that grew by promising Harvard students a network just for them, then Ivy League students a network just for them, and even now, it promises a network just for you and the people you want to connect with.

Facebook needs to remain a space where people feel like they can define their connections, rather than be open to anyone and everyone, even as more information is being shared.

And just in time for this post, Facebook rolled out new features that demonstrate how it is trying to do just that.

Facebook’s new system of Connections, for example, links information from people’s personal profiles to community pages, so that everyone who went to Yale Law School, for example, can link to that page. Although you could see other “Fans” of the school on the school’s own page before, the Community page puts every status update that mentions the school in one place, so that you’re encouraged to interact with others who mention the school.  The Community Pages make your presence on Facebook visible in new ways, but primarily to people who went to the same school as you, who grew up in the same town, who have the same interests.

Thus, even as information is shared beyond current friends, Facebook is trying to reassure you that mini-communities still exist.  You are not being thrown into the open.

Social plug-ins similarly “personalize” a Facebook user’s experience by accessing the user’s friends.  If you go to CNN.com, you’ll see which stories your friends have recommended.  If you “Like” a story on that site, it will appear as an item in your Facebook newsfeed.  The information that is being shared thus maps onto your existing connections.

The “Personalization” feature is a little different in that it’s not so much about your interactions with other Facebook users, but about your interaction with other websites.  Facebook shares the public information on your profile with certain partners.  For example, if you are logged into Facebook and you go to the music site Pandora, Pandora will access public information on your profile and play music based on your “Likes.”

This experience is significantly different from the way people explore music on MySpace.  MySpace has taken off as a place for bands to promote themselves because people’s musical preferences are public.  MySpace users actively request to be added to their favorite bands’ pages, they click on music their friends like, and thus browse through new music.  All of these actions are overt.

Pandora, on the other hand, recommends new music to you based on music you’ve already indicated you “Like” on your profile.   But it’s not through any obvious activity on your part.  You may have noted publicly that you “Like” Alicia Keys on your Facebook profile page, but you didn’t decide to actively plug that information into Pandora.  Facebook has done it for you.

Depending on how you feel about Facebook, you may think that’s wonderfully convenient or frighteningly intrusive.

And this is ultimately why Facebook’s changes feel so troubling for many people.

The company isn’t ripping down the walls of its convention center and declaring an open party. As Farhad Manjoo at Slate says, Facebook is not tearing down its walls but “expanding them.”

Facebook is making peepholes in certain walls, or letting some people (though not everyone) into the parties users thought were private.

This reinforces the feeling that mini-communities continue to exist within Facebook, something the company should try to do as it’s a major draw for many of its users.

Yet the multiplication of controls on Facebook for adjusting your privacy settings makes clear how difficult it is to share information and maintain this sense of mini-communities.  There are some who suspect Facebook is purposefully making it difficult to opt-out.  But even if we give Facebook the benefit of the doubt, it’s undeniable that the controls as they were, plus the controls that now exist for all the new features, are bewildering.  Just because users have choices doesn’t mean they feel confident about exercising them.

On MySpace, the prevailing ethos of being more public has its own pitfalls.  A teenager posting suggestive photos of herself may not fully appreciate what she’s doing.  At the least, though, she knows her profile is public to the world.

On Facebook, users are increasingly unsure of what information is public and to whom.  That arguably is more unsettling than total disclosure.

Can we reconcile the goals of increased government transparency and more individual privacy?

Tuesday, April 13th, 2010

I really appreciate the Sunlight Foundation’s continuing series on new data sets being made public by the federal government as part of the Open Government Directive.  Yesterday, I found out the Centers for Medicare & Medicaid Services will be releasing all kinds of new goodies.  As the Sunlight Foundation points out, the data so far is lacking granularity, with comparisons of Medicare spending by state rather than by county.  But still all very exciting.

Yet not a single mention of privacy.  Even though, according to the blogger, the new claims database will include data for 5% of Medicare recipients.  After “strip[ping] all personal identification data out,” the database will “present it by service type (inpatient, outpatient, home health, prescription drug, etc.)” As privacy advocates have noted, that’s probably not going to do enough to anonymize it.

I don’t really mind not hearing about privacy every time someone talks about a database.  But it’s sort of funny.  Every day, I read a bunch of blogs on open data and government transparency, as well as a bunch of blogs on privacy issues.  But I rarely read about both issues in the same place.  Shouldn’t we all be talking to each other more?

In the mix

Monday, April 5th, 2010

1) Slate had an interesting take on the bullying story in Massachusetts and the prosecutor’s anger at Facebook for not providing information, i.e., evidence of the bullying.  Apparently, Facebook provided basic subscriber information, but resisted providing more without a search warrant.  Emily Bazelon points out how this area of law is murky, and references the coalition forming around reforming the Electronic Communications Privacy Act, but her larger point is an extra-legal one.  The evidence of bullying the DA was looking for was at one point public, even if eventually deleted. She points out that it may be hard for kids or parents who are upset to have the presence of mind to do this, but that they could take screenshots and preserve evidence themselves.

The case raises a lot of interesting questions about anonymity, privacy, and the values we have online.  Anonymity on the Internet has been a rallying cry for so many people, but I wonder, if something is illegal in the offline world, should it suddenly be legal online because you can be anonymous and avoid prosecution?  (Sexual harassment is a crime in the subway, too!)  We now live in a world where many of us occupy space both online and offline.  We used to think of them as completely separate spaces, and it’s true that the Internet gives us opportunities to do things, both good and bad, that we wouldn’t have offline.  But it’s increasingly obvious that we need to transfer some of the rules we have about the offline world into the online one.  For disability rights advocates, that includes pushing the definition of “public accommodation” to include online stores like Target, and suing them if their sites are not accessible to the blind using screen readers.  For privacy advocates, that includes acknowledging that people have an expectation of privacy in their emails as well as their snail mail.  Free speech in the offline world doesn’t mean you can say anything you want anywhere you want.  Maybe it’s time to be more nuanced about how we protect free speech online as well.

2) It turns out Twitter is pretty good at predicting box office returns – what else might it predict?

3) Cases like this amaze me, because the parties are litigating a question that seems like a no-brainer.  A New Jersey court recently held that an employee had an expectation of privacy in her personal Yahoo account, even if she accessed it on a company computer. Would we ever litigate whether an employee had an expectation of privacy in a piece of personal mail she brought to the office and decided to read at her desk?

4) The New York Times is acknowledging its readers’ online comments in separate articles, namely, this one describing readers’ reactions to federal mortgage aid.  It’s a smart way to give online readers a sense that their comments are being read.  I wonder if this is where the “Letters to the Editor” page is going.  I’ve been wondering, who are these readers who are so happy to be the 136th comment on an article?  But the people who write letters to the editor have always been people who have extra time and energy.  In a way, online comments expand the world of people who are willing to write a letter to the editor.

5) Would we feel differently about government data mining if the government were better at it? Mimi and I went to a talk at the NYU Colloquium on Information Technology and Society where Joel Reidenberg, a law professor at Fordham, talked about how transparency of personal information online is eroding the rule of law.  One of the arguments he made against government data mining was that it doesn’t work, with the example of airport security, its inability to stop the underwear bomber, and its terribly inaccurate no-fly lists.  Well, the Obama administration just announced a new system of airport security checks that uses intelligence-based data mining that is meant to be more targeted.  It’s hard to know now whether the new system will be better and smarter, but it raises a point those opposed to data mining don’t seem to consider — what if the government were better at it?  Could data mining be so precise that it avoids racial profiling?  Are there other dangers to consider, and can they be warded off without shutting down data mining altogether?

Would PINQ solve the problems with the Census data?

Friday, February 5th, 2010

Frank McSherry, the researcher behind PINQ, has responded to our earlier blog post about the problems found in certain Census datasets and how PINQ might deal with those problems.

Would PINQ solve the problems with the Census data?

No.  But it might help in the future.

The immediate problem facing the Census Bureau is that they want to release a small sample of raw data, a Public Use Microdata Sample or PUMS, about 1/20 of the larger dataset they use for their own aggregates, that is supposed to be a statistical sample of the general population.  To release that data, the Bureau has to protect the confidentiality of people in the PUMS, and they do so, in part, by manipulating the data.  Some of their efforts, though, seem to have altered the data so seriously that it no longer accurately reflects the general population.

PINQ would not solve the immediate problem of allowing the Census Bureau to release a 1/20 sample of their data.  PINQ only allows researchers to query for aggregates.

However, if Census data were released behind PINQ, the Bureau would not have to swap or synthesize data to protect privacy; PINQ would do that.  Presumably, if the danger of violating confidentiality were removed, the Census could release more than a 1/20 sample of the data. Furthermore, unlike the Bureau’s disclosure avoidance procedures, PINQ is transparent in describing the range of noise that is being added.  Currently, the Bureau can’t even tell you what it did to protect privacy without potentially violating it.
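To show what being transparent about the range of noise can mean in practice, here is a small sketch that computes how far a noisy count is likely to sit from the true count. It assumes Laplace noise with scale sensitivity/epsilon, the usual choice for differentially private counts; the function name, its defaults, and the sample epsilon values are ours, chosen for illustration rather than taken from PINQ.

```python
import math

def noise_bound(epsilon, confidence=0.95, sensitivity=1.0):
    # For Laplace noise with scale b, P(|noise| <= t) = 1 - exp(-t / b),
    # so the bound at a given confidence is t = b * ln(1 / (1 - confidence)).
    scale = sensitivity / epsilon
    return scale * math.log(1.0 / (1.0 - confidence))

for eps in (0.1, 0.5, 1.0):
    print(f"epsilon={eps}: noisy count within +/-{noise_bound(eps):.1f} "
          f"of the true count 95% of the time")
```

Because the bound depends only on epsilon and not on the data, it can be published alongside every answer without revealing anything about the people in the dataset.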

The mechanism for accessing data through PINQ, of course, would be very different than what researchers are used to today.  Now, with raw data, researchers like to “look at the data” and “fit a line to the data.”  A lot of these things can be approximated with PINQ, but most researchers reflexively pull back when asked to rethink how they approach data.  There are almost certainly research objectives that cannot be met with PINQ alone.  But the objectives that can be met should not be held back by the unavailability of high quality statistical information. Researchers able to express how and why their analyses respect privacy should be rewarded with good data, incentivizing creative rethinking of research processes.

With this research published, it may be easier to argue that the choice between PUMS (and other microdata) and PINQ is not between raw data/noisy aggregates, but rather bad data/noisy aggregates. If and when it becomes a choice between these two, any serious scientist would reject bad data and accept noisy aggregates.
