Posts Tagged ‘government data’

Common Data Project looking for a partner organization to open up access to sensitive data.

Wednesday, June 30th, 2010

Looking for a partner...

The Common Data Project is looking for a partner organization to develop and test a pilot version of the datatrust: a technology platform for collecting, sharing and disclosing sensitive information that provides a new way to guarantee privacy.

Funders are increasingly interested in developing ways for nonprofit organizations to make more use of data and make their data more public. We would like to apply with a partner organization for a handful of promising funding opportunities.

We at CDP have developed technology and expertise that would enable a partner organization to:

  1. Collect sensitive data from members, donors and other stakeholders in a safe and responsible manner;
  2. Open data to the public to answer policy questions, be more transparent and accountable, and inform public discourse.

We are looking for an organization that is both passionate about its mission and deeply invested in the value of open data to provide us with a targeted issue to address.

We are especially interested in working with data that is currently inaccessible or locked down for privacy reasons.

We can imagine, in particular, a couple of different scenarios in which an organization could use the datatrust in interesting ways, but ultimately, we are looking to work out a specific scenario together.

  • A data exchange to share sensitive information between members.
  • An advocacy tool for soliciting private information from members so that organizational policy positions can be backed up with hard data.
  • A way to share sensitive data with allies in a way that doesn’t violate individual privacy.

If you’re interested in learning more about working with us, please contact Alex Selkirk at alex [dot] selkirk [at] commondataproject [dot] org.

In the mix…data for coupons, information literacy, most-visited sites

Friday, June 4th, 2010

1) There’s obviously an increasing move to a model of data collection in which the company says, “give us your data and get something in return,” a quid pro quo.  But as Marc Rotenberg at EPIC points out,

The big problem is that these business models are not very stable. Companies set out privacy policies, consumers disclose data, and then the action begins…The business model changes. The companies simply want the data, and the consumer benefit disappears.

It’s not enough to start with compensating consumers for their data.  The persistent, shareable nature of data makes it very different from a transaction involving money, where someone can buy, walk away, and never interact with the company again.  These data-centered companies are creating a network of users whose data are continually used in the business.  Maybe it’s time for a new model of business, where governance plans incorporate ways for users to be involved in decisions about their data.

2) In a related vein, danah boyd argues that transparency should not be an end in itself, and that information literacy needs to be developed in conjunction with information access.  A similar argument can be made about the concept of privacy.  In “real life” (i.e., offline life), no one aims for total privacy.  Every day, we make decisions about what we want to share with whom.  Online, total privacy and “anonymization” are also impossible, no matter what a company promises in its privacy policy.  For our datatrust, we’re going to use PINQ, a technology based on differential privacy that treats privacy not as a binary state but as something one spends.  So perhaps we’ll need to work on privacy and data literacy as well?
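To make the “privacy as something one spends” idea concrete, here is a minimal sketch of the core mechanism behind differential privacy, the approach PINQ is built on. This is an illustrative toy, not PINQ’s actual API: a counting query is answered with calibrated Laplace noise, and each query draws down a finite privacy budget (epsilon). The class and function names are our own invention.

```python
import random

def laplace_noise(scale):
    # Sample from a Laplace distribution as the difference
    # of two independent exponential random variables.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def noisy_count(records, predicate, epsilon):
    """Answer a counting query with differential privacy.

    A count has sensitivity 1 (adding or removing one person changes
    it by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon)

class PrivacyBudget:
    """Each query 'spends' some epsilon; once the budget is gone,
    no further questions get answered. Privacy isn't on or off --
    it is a resource consumed query by query."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
```

The answers are approximately right (good enough for aggregate questions) while any single individual’s presence in the data is mathematically obscured, which is exactly the “not binary, but spent” framing above.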

3) Google recently released a list of the most visited sites on the Internet. Two questions jump out: a) Where is Google on this list? and b) Could the list be a proxy for the biggest data collectors online?

In the mix…Google reveals how many government requests for data it gets, Amazon tries First Amendment privacy argument, and the World Bank opens its databases

Wednesday, April 21st, 2010

1) Google is publishing data on how many government requests for user data it receives. As various people have pointed out, the site has its limitations, but it’s still fascinating.  We’ve been thinking a lot about how attractive our datatrust would be to governments, and how we can best deal with requests and remain transparent.  This seems like a good option and maybe something all companies should consider doing.

2) In related news, Amazon is refusing the state of North Carolina’s request for its customer data. North Carolina wants the names and addresses of every customer and what they bought since 2003!  They want to audit Amazon’s compliance with North Carolina’s state tax laws.  I think NC’s request is nuts–are they really prepared to go through 50 million purchases?  It may just be legal posturing, given Amazon already gave them anonymized data on the purchases of NC residents, but what’s really interesting to me is Amazon’s argument that its customers have First Amendment rights in their purchases.  I heard a similar argument at a talk at NYU a few months ago, that instead of arguing privacy rights, which are not explicitly defined in the Constitution, we should be arguing for freedom of association rights when we seek to protect ourselves from data requests like this.  Interesting to see where this goes.

3) The World Bank is opening up its development data. This is data people used to pay for and now it’s free, so it’s exciting news.  But as with most public data out there, it’s really just indicators, aggregates, statistics, and such, rather than raw data you can query in an open-ended way.  Wouldn’t that be really exciting?

Can we reconcile the goals of increased government transparency and more individual privacy?

Tuesday, April 13th, 2010

I really appreciate the Sunlight Foundation‘s continuing series on new data sets being made public by the federal government as part of the Open Government Directive.  Yesterday, I found out the Centers for Medicaid and Medicare Services will be releasing all kinds of new goodies.  As the Sunlight Foundation points out, the data so far is lacking granularity — comparisons of Medicare spending by state, rather than county.  But still all very exciting.

Yet not a single mention of privacy.  Even though, according to the blogger, the new claims database will include data for 5% of Medicare recipients.  After “strip[ping] all personal identification data out,” the database will “present it by service type (inpatient, outpatient, home health, prescription drug, etc.)” As privacy advocates have noted, that’s probably not going to do enough to anonymize it.

I don’t really mind not hearing about privacy every time someone talks about a database.  But it’s sort of funny.  Every day, I read a bunch of blogs on open data and government transparency, as well as a bunch of blogs on privacy issues.  But I rarely read about both issues in the same place.  Shouldn’t we all be talking to each other more?

In the mix

Tuesday, March 2nd, 2010

1) I’m looking forward to reading this series of blog posts from the Freedom to Tinker blog at Princeton’s Center for Information Technology Policy on what government datasets should look like to facilitate innovation, as the first one is incredibly clear and smart.

2) The NYTimes Bits blog recently interviewed Esther Dyson, “Health Tech Investor and Space Tourist” as the Times calls her, in which she shares her thoughts on why ordinary people might want to track their own data and why we shouldn’t worry so much about privacy.

3) A commenter on the Bits interview with Esther Dyson referenced this new 501(c)(6) nonprofit, CLOUD: Consortium for Local Ownership and Use of Data.  Their site says, “CLOUD has been formed to create standards to give people property rights in their personal information on the Web and in the cloud, including the right to decide how and when others might use personal information and whether others might be allowed to connect personal information with identifying information.”

We’ve been thinking about whether personal information could or should be viewed as personal property, as understood by the American legal system, for a while now.  I’m not quite sure it’s the best or most practical solution, but I’m curious to see where CLOUD goes.

4) The German Federal Constitutional Court has ruled that the law requiring data retention for 6 months is unconstitutional.  Previously, all phone and email records had to be kept for 6 months for law enforcement purposes.  The court criticized the lack of data security and insufficient restrictions on access to the data.

Although Europe has more comprehensive and arguably “stricter” privacy laws, many countries also require data retention for law enforcement purposes.  We in the U.S. might think the Fourth Amendment is going to protect our phone and email records from being poked into unnecessarily by law enforcement, but existing law is even less clear than in Europe.  So much privacy law around telephone and email records is built around antiquated ideas of our “expectations,” with analogies to what’s “inside the envelope” and what’s “outside the envelope,” as if all our communications can be easily analogized to snail mail.  All these issues are clearly simmering to a boil.

5) Google’s introduced a new version of Chrome with more privacy controls that allow you to determine how browser cookies, plug-ins, pop-ups and more are handled on a site-by-site basis.  Of course, those controls won’t necessarily stop a publisher from selling your IP address to a third-party behavioral targeting company!

Can we trust Census data?

Wednesday, February 3rd, 2010

Yesterday, the Freakonomics blog at the New York Times reported that a group of researchers had discovered serious errors in PUMS (public-use microdata samples) files released by the U.S. Census Bureau.  When compared to aggregate data released by the Census, the PUMS files revealed discrepancies of up to 15% for the 65-and-older population.  As Justin Wolfers explains, PUMS files are small samples of the much larger, confidential data used by the Census for the general statistics it releases. These samples are crucial to researchers and policymakers looking to measure trends that the Census itself has not calculated.

When I read this, the first thought I had was, “Hallelujah!”  Not because I felt gleeful about the Census Bureau’s mistakes, but because this little post in the New York Times articulated something we’ve been trying to communicate for a while: current methods of data collection (and especially data release) are not perfect.

People love throwing around statistics, and increasingly people love debunking statistics, but that kind of scrutiny is normally directed at surveys conducted by people who are not statisticians.  Most people generally hear words like “statistical sampling” and “disclosure avoidance procedure” and assume that those people surely know what they’re doing.

But you don’t have to have training in statistics to read this paper and understand what happened. The Census Bureau, unlike many organizations and businesses that claim to “anonymize” datasets, knows that individual identities cannot be kept confidential simply by removing “identifiers” like name and address, which is why they use techniques like “data swapping” and “synthetic data.” It doesn’t take a mathematician to understand that when you’re making up data, you might have trouble maintaining the accuracy of the overall microdata sample.
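The reason stripping names and addresses isn’t enough is the classic linkage attack: quasi-identifiers left in the “anonymized” data (ZIP code, birth date, sex) can be joined against a public dataset that does contain names. Here is a minimal sketch with entirely made-up records; the field names and the `reidentify` helper are hypothetical.

```python
# Hypothetical "anonymized" records: names removed, but ZIP code,
# birth date, and sex (quasi-identifiers) are still present.
anonymized = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "flu"},
    {"zip": "02139", "dob": "1962-03-12", "sex": "M", "diagnosis": "asthma"},
]

# A public dataset (e.g. a voter roll) that carries the same
# quasi-identifiers alongside real names.
public_roll = [
    {"name": "J. Doe", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
]

def reidentify(anon_rows, public_rows):
    # Link the two datasets on the (zip, dob, sex) triple.
    matches = []
    for a in anon_rows:
        key = (a["zip"], a["dob"], a["sex"])
        hits = [p for p in public_rows
                if (p["zip"], p["dob"], p["sex"]) == key]
        if len(hits) == 1:  # a unique match re-identifies the record
            matches.append((hits[0]["name"], a["diagnosis"]))
    return matches
```

A unique match pins a sensitive attribute to a named person, which is exactly why the Census Bureau resorts to swapping and synthetic data rather than relying on removed identifiers.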

To the Bureau’s credit, it does acknowledge where inaccuracies exist.  But as the researchers found, the Bureau is unwilling to correct its mistakes because doing so could reveal how they altered the data in the first place and thus compromise someone’s identity.  Which gets to the heart of the problem:

Newer techniques, such as swapping or blanking, retain detail and provide better protection of respondents’ confidentiality. However, the effects of the new techniques are less transparent to data users and mistakes can easily be overlooked.

The problems with current methods of data collection aren’t limited to the Census PUMS files either.  The weaknesses outlined by this former employee could apply to so many organizations.

This is why we have to work on new ways to collect, analyze, and release sensitive data.

In the mix: Your unique(ish) browser fingerprint…and…No $$ for privacy.

Friday, January 29th, 2010

1) EFF’s Panopticlick project lets you see how much your browser reveals and whether that might potentially “identify” you, based on their calculation of how identifiable a set of bits might be.

Can someone with a better grasp of math than I have explain to me how their information theory works? Right now they have, say, 10,000 people who’ve contributed their browser info. Bruce Schneier found his browser was unique among the roughly 120,000 tested so far. But if millions of people tested their browsers, would his configuration really be that unique? (There’s a lot of skepticism in the comments on Schneier’s post, too.)
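As I understand it (and I may be oversimplifying), the information-theoretic framing goes like this: a browser configuration seen with frequency 1/N carries about log2(N) bits of identifying information, and roughly 33 bits are enough to single out one person among the world’s ~6.8 billion people. A quick sketch of the arithmetic, with a helper name of our own:

```python
import math

def surprisal_bits(frequency):
    """Bits of identifying information carried by an attribute value
    that occurs with the given frequency in the population."""
    return -math.log2(frequency)

# A browser unique among ~120,000 tested carries about 17 bits.
fingerprint_bits = surprisal_bits(1 / 120_000)        # ~16.9 bits

# Singling out one person among ~6.8 billion needs about 33 bits.
world_bits = surprisal_bits(1 / 6_800_000_000)        # ~32.7 bits
```

The caveat, which I think explains the skepticism: being unique in a 120,000-browser sample only establishes a lower bound of ~17 bits, and says nothing about whether the same configuration would stay unique in a sample of millions.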

2) New initiative by advertising groups to reveal that they are tracking information — a small “i” icon:

What a quote: “‘This is not the full solution, but this moves the ball forward,’ he said.”

Well, that’s the understatement of the century. A full solution to what? The advertising industry keeping regulators off its back? Helping users understand how targeted advertising finds them? Really, neither is the real problem. Regulators should be focusing on establishing industry guidelines for how service providers and third-party advertising partners store and share data.

3) Should government data be in more user-friendly formats than XML?

Or should we leave usability to disinterested third parties? If the government starts releasing user-friendly data, will that simply open the door for agencies to “spin” their data to make themselves look good? Actually, right now, how do we really know that the data being released hasn’t been “edited” in some way? Who’s vetting these releases, and what’s the process?

4) Ten years and no one is really making any money off of “privacy”?

Perhaps no one has successfully “sold” privacy (as its own thing) because we haven’t yet agreed on what a “privacy product” would look like. As Mimi says, “If someone was selling something that would guarantee that I would never get any SPAM (mail or email) for the rest of my life, I would totally sign up for that.” But that might not equal “privacy” for someone else.
