Archive for the ‘Interesting Uses of Data’ Category

In the mix — open data issues, bad econ stats, Facebook gaydar, and fraud detection in data

Friday, April 30th, 2010

1) It’s definitely become trendy for cities to open up their data, and I appreciated this article about Vancouver for its substantive points:

  • It’s important that data not only be open but be available in real time.  In all my conversations with people who work with data, though, whenever you have sensitive data, there’s going to be a significant time lag between when the data is collected and when it is “cleaned up” and made presentable for the public so as to avoid inadvertent disclosure.  This is why we think something like PINQ, a filter using differential privacy, could be revolutionary in making data available more quickly — it won’t need to be scrubbed for privacy reasons.
  • Licensing is an issue: although the city claims the data is public domain, there are terms of use that restrict use of the data by projects like OpenStreetMap.  The article discusses the possibility of using the Public Domain Dedication and License, which is a project of Open Data Commons.  Alex heard some interesting discussion on this issue from Jordan Hatcher at OKCon this past weekend.  This is a really fascinating issue, and I’m curious to see where else this gets picked up.

2) Existing economic statistics are riddled with problems.  I can’t say this enough — if existing ways of collecting and analyzing data are not quite good enough, we need to be open to new ones.

3) This is an old article, but highlights an issue Mimi and I have been thinking a lot about recently: How can data, even when shared according to your precise directions, reveal more than you intended? In this case, researchers found you could more or less determine the sexual orientation of people on Facebook based on their friends, even if they hadn’t indicated it themselves.  Privacy is definitely about control, yet how do you control something you don’t even know you’re revealing?

4) This past week, the Supreme Court heard a case involving the right to privacy of those who sign petitions to put initiatives on the ballot.  There is a lot going on in this case (gay rights, the experience of those in California who were targeted for supporting Prop 8, the difference between voting and legislating, etc.), but overall, it’s a perfect illustration of how complicated our understanding of public and private has gotten.  We leave those lists open to scrutiny so we can prevent fraud, such as people signing “Mickey Mouse,” but “public” when you can go look at the list at the clerk’s office and “public” when you can post information online for millions to see are two different things.  There may be reasons we want to make these names public other than to prevent fraud (Justice Scalia thinks so), but are there other ways fraud could be detected among signatories that would not require an open examination of all petition signers’ names?  Could modern technology help us detect odd patterns, fake names and more without revealing individual identities?
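As one hedged illustration of that last question, here is a minimal Python sketch in which an auditor compares keyed hashes of normalized names instead of reading the names themselves.  Everything in it is an assumption made for illustration: the function names, the secret key, and the `KNOWN_FAKES` list are hypothetical, and keyed hashing only pseudonymizes the names (whoever holds the key can still test a guess), so it is a starting point rather than a real privacy guarantee.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical list of obviously fake names an auditor might screen for.
KNOWN_FAKES = {"mickey mouse", "donald duck"}


def normalize(name):
    """Lowercase the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def pseudonym(name, key):
    """Return a keyed SHA-256 digest of the normalized name."""
    return hmac.new(key, normalize(name).encode("utf-8"), hashlib.sha256).hexdigest()


def audit(signatures, key):
    """Count repeated and known-fake signatures using only the digests."""
    tokens = [pseudonym(name, key) for name in signatures]
    counts = Counter(tokens)
    fake_tokens = {pseudonym(name, key) for name in KNOWN_FAKES}
    return {
        "total": len(tokens),
        # Every occurrence beyond the first counts as a duplicate.
        "duplicates": sum(c - 1 for c in counts.values() if c > 1),
        "known_fakes": sum(1 for t in tokens if t in fake_tokens),
    }
```

Only the summary counts would need to leave the clerk’s office; the key and the raw list would stay there.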

In the mix…Google reveals how many government requests for data it gets, Amazon tries First Amendment privacy argument, and the World Bank opens its databases

Wednesday, April 21st, 2010

1) Google is providing data on how many government requests they get for data. As various people have pointed out, the site has its limitations, but it’s still fascinating.  We’ve been thinking a lot about how attractive our datatrust would be to governments, and how we can best deal with requests and remain transparent.  This seems like a good option and maybe something all companies should consider doing.

2) In related news, Amazon is refusing the state of North Carolina’s request for its customer data. North Carolina wants the names and addresses of every customer and what they bought since 2003!  They want to audit Amazon’s compliance with North Carolina’s state tax laws.  I think NC’s request is nuts–are they really prepared to go through 50 million purchases?  It may just be legal posturing, given Amazon already gave them anonymized data on the purchases of NC residents, but what’s really interesting to me is Amazon’s argument that its customers have First Amendment rights in their purchases.  I heard a similar argument at a talk at NYU a few months ago, that instead of arguing privacy rights, which are not explicitly defined in the Constitution, we should be arguing for freedom of association rights when we seek to protect ourselves from data requests like this.  Interesting to see where this goes.

3) The World Bank is opening up its development data. This is data people used to pay for and now it’s free, so it’s exciting news.  But as with most public data out there, it’s really just indicators, aggregates, statistics, and such, rather than raw data you can query in an open-ended way.  Wouldn’t that be really exciting?

The Common Data Project at the Open Knowledge Conference

Monday, April 19th, 2010

We’ll be at the Open Knowledge Conference in London on April 24th!  Alex Selkirk will be giving a lightning talk, “Can We Have Our Cake and Eat It Too?: The Potential of a ‘Datatrust’ to Open Personal Data While Protecting Privacy.”  He’ll walk through an updated version of our datatrust demo that shows how differential privacy, in the form of PINQ, could be used to allow open-ended queries without revealing the presence of any one individual.  (The updated version isn’t quite complete, but for a description of how the old one worked, check out Tony Gibbon’s blog post here.)
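For readers who want a feel for the idea, here is a minimal Python sketch of a noisy count, the basic building block of differential privacy.  It is not PINQ’s actual API and not our demo code; the function name `noisy_count`, the `epsilon` parameter, and the sample records are assumptions made for illustration.  The sketch adds Laplace noise scaled to 1/epsilon, which is the standard choice for a count, since adding or removing one person changes a count by at most 1.

```python
import random


def noisy_count(records, predicate, epsilon, rng=None):
    """Return the number of matching records plus Laplace(0, 1/epsilon) noise.

    Smaller epsilon means more noise and therefore stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    rate = epsilon  # rate of each exponential draw; the Laplace scale is 1/epsilon
    # The difference of two independent exponential draws is Laplace-distributed.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise


if __name__ == "__main__":
    # Hypothetical records: 100 people aged 0 through 99.
    people = [{"age": age} for age in range(100)]
    print(noisy_count(people, lambda p: p["age"] >= 65, epsilon=0.5))
```

Each answer is off by a small random amount, so no single answer confirms whether a particular person is in the data, while averages over large groups stay useful.  A real system would also have to track the total privacy budget spent across queries, which this sketch leaves out.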

All of us here have been wrestling with the demo and how it could be used in real-world scenarios.  We’ve

  • Described the basic principles of differential privacy behind PINQ;
  • Illustrated a demo of PINQ;
  • Outlined what would go into a datatrust prototype; and
  • Imagined how PINQ would enable Census data to be open in new ways.

One of the biggest challenges is defining PINQ’s privacy guarantee for real-world use.  We’ve addressed that in these posts:

And there’s still more to come on that…

We’re really excited to be able to share what we’ve been wrestling with for the last couple of months with the people at the Open Knowledge Conference, who are all invested in open knowledge, “any content, information or data that people are free to use, re-use and redistribute — without any legal, technological or social restriction.”

We also look forward to hearing what others are doing to make information more publicly available.  We’re particularly interested in the panel on community-driven research, as well as the multi-national panel on opening up government data.  It’s a great opportunity to hear from experts working on open government issues from a European perspective.  In all the talk of open government and transparency, we don’t hear much about how governments are going to deal with privacy issues, despite the fact that much of what governments collect is very personal.  We hope to hear about how these experts are dealing with these issues, especially given that the European understanding of privacy seems to be very different from the American one, as evidenced by the Italian Google case.

Can we reconcile the goals of increased government transparency and more individual privacy?

Tuesday, April 13th, 2010

I really appreciate the Sunlight Foundation’s continuing series on new data sets being made public by the federal government as part of the Open Government Directive.  Yesterday, I found out the Centers for Medicare & Medicaid Services will be releasing all kinds of new goodies.  As the Sunlight Foundation points out, the data so far is lacking granularity: it offers comparisons of Medicare spending by state, rather than by county.  But still all very exciting.

Yet not a single mention of privacy.  Even though, according to the blogger, the new claims database will include data for 5% of Medicare recipients.  After “strip[ping] all personal identification data out,” the database will “present it by service type (inpatient, outpatient, home health, prescription drug, etc.)” As privacy advocates have noted, that’s probably not going to do enough to anonymize it.

I don’t really mind not hearing about privacy every time someone talks about a database.  But it’s sort of funny.  Every day, I read a bunch of blogs on open data and government transparency, as well as a bunch of blogs on privacy issues.  But I rarely read about both issues in the same place.  Shouldn’t we all be talking to each other more?

Number of subway passengers from Powell Station = retail revenues?

Thursday, April 8th, 2010

Spinn via Flickr/Creative Commons License Attribution

The Wall Street Journal reports that economists are looking to “oddball data” to see trends before official numbers are released.

We’re obviously a little obsessed with data reuse, and the more imaginative, the better. There’s Ted Egan, the chief economist in the San Francisco Controller’s office, who looks at weekend passenger tallies for the Union Square shopping district rather than wait six months for the state’s official retail revenue numbers.  Then there’s Edward Leamer, the economist who discovered that diesel fuel sales on Interstate Highway 5 are a leading indicator of construction employment in California, while diesel sales on Interstate Highway 80 are an indicator of manufacturing employment.

The people who collected this data surely didn’t imagine it being used this way, which is why we should be really careful about closing off data reuse before we even know what the potential reuses are.  And, as these economists have found, these indicators are often faster and, arguably, more accurate.

(P.S.  I used to live in San Francisco.  I know the Powell St. trolley is not the same as the Powell BART station.  Sorry.)

In the mix

Monday, April 5th, 2010

1) Slate had an interesting take on the bullying story in Massachusetts and the prosecutor’s anger at Facebook for not providing information, i.e., evidence of the bullying.  Apparently, Facebook provided basic subscriber information, but resisted providing more without a search warrant.  Emily Bazelon points out how this area of law is murky, and references the coalition forming around reforming the Electronic Communications Privacy Act, but her larger point is an extra-legal one.  The evidence of bullying the DA was looking for was at one point public, even if eventually deleted. She points out that it may be hard for kids or parents who are upset to have the presence of mind to do this, but that they could take screenshots and preserve evidence themselves.

The case raises a lot of interesting questions about anonymity, privacy, and the values we have online.  Anonymity on the Internet has been a rallying cry for so many people, but I wonder, if something is illegal in the offline world, should it suddenly be legal online because you can be anonymous and avoid prosecution?  (Sexual harassment is a crime in the subway, too!)  We now live in a world where many of us occupy space both online and offline.  We used to think of them as completely separate spaces, and it’s true that the Internet gives us opportunities to do things, both good and bad, that we wouldn’t have offline.  But it’s increasingly obvious that we need to transfer some of the rules we have about the offline world into the online one.  For disability rights advocates, that includes pushing the definition of “public accommodation” to include online stores like Target, and suing them if their sites are not accessible to the blind using screen readers.  For privacy advocates, that includes acknowledging that people have an expectation of privacy in their emails as well as their snail mail.  Free speech in the offline world doesn’t mean you can say anything you want anywhere you want.  Maybe it’s time to be more nuanced about how we protect free speech online as well.

2) It turns out Twitter is pretty good at predicting box office returns — what else might it predict?

3) Cases like this amaze me, because the parties are litigating a question that seems like a no-brainer.  A New Jersey court recently held that an employee had an expectation of privacy in her personal Yahoo account, even if she accessed it on a company computer. Would we ever litigate whether an employee had an expectation of privacy in a piece of personal mail she brought to the office and decided to read at her desk?

4) The New York Times is acknowledging its readers’ online comments in separate articles, namely, this one describing readers’ reactions to federal mortgage aid.  It’s a smart way to give online readers a sense that their comments are being read.  I wonder if this is where the “Letters to the Editor” page is going.  I’ve been wondering, who are these readers who are so happy to be the 136th comment on an article?  But the people who write letters to the editor have always been people who have extra time and energy.  In a way, online comments expand the world of people who are willing to write a letter to the editor.

5) Would we feel differently about government data mining if the government were better at it? Mimi and I went to a talk at the NYU Colloquium on Information Technology and Society where Joel Reidenberg, a law professor at Fordham, talked about how transparency of personal information online is eroding the rule of law.  One of the arguments he made against government data mining was that it doesn’t work, with the example of airport security, its inability to stop the underwear bomber, and its terribly inaccurate no-fly lists.  Well, the Obama administration just announced a new system of airport security checks that uses intelligence-based data mining that is meant to be more targeted.  It’s hard to know now whether the new system will be better and smarter, but it raises a point those opposed to data mining don’t seem to consider — what if the government were better at it?  Could data mining be so precise that it avoids racial profiling?  Are there other dangers to consider, and can they be warded off without shutting down data mining altogether?

The meaning of membership

Thursday, April 1st, 2010

BSA Member Card, Focht, Flickr/Creative Commons License Attribution-Noncommercial-No Derivative Works

We’ve been talking about a “datatrust” for a while now: why we think we need one, how we envision it as a long-lasting institution, and what kind of technologies we might employ for it to provide measurable guarantees around privacy.

But we’re now starting to get down to the nitty-gritty.  How will it actually work?  What will it mean to an actual researcher, nonprofit organization, policy-maker?  To you?

First and foremost, we imagine the datatrust as a member-based data bank where organizations and individuals can safely contribute personal information to inform research and public policy.

The member-based part is key.  We plan to be both non-partisan and absolutely transparent.  We have no particular academic or policy ax to grind. Our only goal is to maximize the quantity, quality and diversity of sensitive data that is made available to the public.  To ensure that decisions aren’t made even with an unconscious bias, we plan to build a decentralized structure that relies on the participation and contribution of members to build and sustain the datatrust.

But the word membership can mean a lot of different things.  When my local public radio station exhorts me to be a member, membership doesn’t seem to come with anything more than a tote bag.  In contrast, if you’re a Wikipedian, it means you’ve actually written or edited an entry, and the more you participate, the more access and privileges you get, including the right to vote for members of the Wikimedia Foundation Board.

So for the past couple of months, I’ve been looking at member-based communities.  Not all of them would call themselves member-based communities, but they all have in common a structure that requires participation from a large group of people.  Some are nonprofits, some are businesses running social networks; most are online, a few are not.  Over the next couple of posts, I’m going to summarize how these communities work, what motivates the members, how the communities monitor themselves, and how diverse they are, because all of these issues will inform the decisions we make in creating our datatrust.

Here are the ten communities included in this study:

MySpace is one of the world’s largest social networks with about 125 million users, though Facebook has in the last year surpassed MySpace in the number of users and pageviews both in the U.S. and the rest of the world.  The look and feel of MySpace is very different from Facebook, since MySpace users are allowed to customize their pages.  There’s also been a lot of press about the demographic differences between MySpace and Facebook, but those differences are probably disappearing as Facebook simply grows and grows.  MySpace remains more popular than Facebook as a site for bands and music.

Facebook is the world’s largest social network with about 400 million users.  Despite its popularity and recent news that it even surpassed Google in Internet traffic, it’s also been the center of controversy, particularly regarding user privacy and terms of use, with each major change made to the site.

Yelp is a social network-based user review site for local businesses in multiple cities in the U.S.  It’s growing much faster than older sites like Citysearch, and it’s spawned offline events where really avid reviewers meet and socialize.  It has also drawn controversy with accusations that it extorts businesses to take out ads in return for highlighting good reviews or pulling bad ones.  Although Yelp has denied these accusations, a class-action lawsuit was recently filed against Yelp.

Flickr is a popular social network-based photo-sharing site.  Unlike many photo-sharing sites like Kodak Gallery or Photobucket, Flickr has emphasized sharing photos with the general public and organization by crowdsourcing via tags. Although it does have some services for printing photos and mugs, its main service is photo-hosting and storage, particularly for bloggers and photographers.  In addition to hosting photos, Flickr also manages projects like “The Commons” with the Library of Congress and other institutions interested in putting their public domain photos in wider circulation.

Slashdot is a news aggregator for self-professed nerds with estimated traffic of 5.5 million users per month.  It shares news stories contributed by its users, who also comment on the stories and moderate the comments.  Useful contribution is rewarded with karma points, which increases the privileges each user gets.

Wikipedia is “the free encyclopedia anyone can edit,” run by the nonprofit Wikimedia Foundation.  The number of named accounts for writers and editors is at about 11 million; about 300,000 have edited Wikipedia more than ten times.  Despite early skepticism, Wikipedia has become one of the most trafficked sites online and has expanded into multiple countries around the world.  Wikipedia has clearly developed a community of avid and enthusiastic users who contribute without monetary compensation, but in its tenth year, it is evaluating the lack of diversity among Wikipedians (only 13% of contributors are women, for one) and what steps it should take to provide access to a free encyclopedia all over the world. Wikipedia has also instituted a number of changes over the years to deal with vandalism and inaccuracies.

Open Source Software – rather than look at one particular open source project, for this study, I focused on the book Producing Open Source Software by Karl Fogel, which describes how projects should work.  Obviously, actual projects will vary widely, but we decided this was an area worth looking at because the open source movement has spent years figuring out how to structure shared work.

The Sierra Club is one of the oldest grassroots environmental organizations in the U.S.  It has 1.3 million members, but because it is not a primarily online organization, it isn’t easy to evaluate the activities of its members online.  However, it recently created a series of social media sites for online networking among Sierra Club members and supporters and our report focuses primarily on this aspect of their member activities.

The Park Slope Food Coop is a local cooperative grocery store in Park Slope, Brooklyn.  (DISCLAIMER: I’ve been a member since 2005, and my research on how it works is based on my experiences there.)  Unlike many coops, membership is predicated on work.  All of its approximately 15,000 members are required to work a two-hour-45-minute shift every four weeks, which reduces labor costs and thus reduces prices.  Despite being a place many people love to hate, it continues to thrive and attract new members.

Habitat for Humanity International is a major nonprofit organization that seeks to eliminate poverty housing and homelessness by building decent housing around the world.  (DISCLAIMER: I volunteered for Habitat for Humanity in high school and college and participated in a fundraising bike trip in 1999.)  Like the Sierra Club, it is also an offline organization, but its website provided more detailed information on how its affiliates work and I drew on my personal experience in trying to understand how Habitat encourages and retains volunteers.

In the mix

Wednesday, March 31st, 2010

1) Exciting news!  A diverse coalition of left-leaning and right-leaning organizations, as well as a bunch of big corporations, has formed around the goal of revising the Electronic Communications Privacy Act.  This law, from 1986, clearly didn’t anticipate the world we live in now, the extent to which we use emails, the “expectation of privacy” we have in email, and the extent to which we store our data and our documents in the cloud.  This law will greatly impact our work at the Common Data Project, but even without a professional stake in this, I’d be pretty excited.  After all, we all (except my mom who doesn’t use computers) have a personal stake in this.

2) The full text of danah boyd’s talk at SXSW is available on her blog.  This is my favorite line:

For the parents and educators in the room… Many of you are struggling to help young people navigate this new world of privacy and publicity, but many of you are confused yourself. The worst thing you can do is start a sentence with “back in my day.” Back in your day doesn’t matter.

It’s an obvious but useful point for privacy and information issues in general.  The ECPA from back in the day of 1986 can’t deal with today.  It’s time to really think, which of our assumptions about privacy still hold true?

3) David Brooks’s column this week got me thinking.  If we agree with him, which I do, that a country’s success cannot be measured simply with things like GDP, what else should we measure and how? My friends who work in social sciences are initially skeptical when I talk about the data collection potential of something like the Common Data Project’s datatrust.  They’re distrustful of self-reported data, even as they acknowledge that their existing methodologies are imperfect.  But with things that are hard to measure, self-reporting is often the only way to go.  The Internet and the datatrust, with its measurable guarantees of privacy, could dramatically change how self-reported data is collected, analyzed, and published.

4) Facebook data destroyed: Pete Warden, who had created a database from 210 million public Facebook profiles, was prepared to release the data to social scientists who were fascinated by the potential to research social connections, particularly as mashed up with census data on income, mobility and employment.  But then Facebook said he had violated its terms of use, and unable to defend a potential lawsuit, he destroyed the data.

Argh, isn’t there a better way?  It’s true that the decision to make one’s profile public on Facebook may not equal a decision to consent to be in such a database, and that Warden’s planned “anonymization” was unlikely to be very robust, but this situation is a perfect example of why the Common Data Project was founded: to create a new norm, with strong privacy and sharing standards, that makes such data truly, safely available.

In the mix

Monday, March 22nd, 2010

1) EFF is posting documents as it gets them indicating how the government uses social networks in law enforcement investigations. The Fourth Amendment is what requires the police to have a search warrant when they come to search your house.  The cases interpreting the Fourth Amendment that led to such requirements were based on expectations of privacy that are rooted in physical spaces.  But as we start to live more of our lives in an online space our founding fathers could never have imagined, how should we change the laws protecting our rights?

2) An overview of the history of people challenging the constitutionality of the U.S. Census. Personally, I love filling out the census form.  I wish I’d gotten the American Community Survey.

3) The Transactional Records Access Clearinghouse, a data research organization at Syracuse University studying federal spending, enforcement, and staffing, recently got a $100,000+ bill for a FOIA request. The bill was based on the calculation that 861 man hours were required to create a description of what is in the U.S. Citizenship and Immigration Services’ database of claims for U.S. citizenship.  As an immigration lawyer, I used to deal with USCIS all the time, and even I am surprised that the agency would need that much time just to figure out what’s in the database.  You almost hope that the bill was calculated just to rebuff TRAC’s FOIA request, because the alternative, that the database is that screwed up, is pretty awful.

4) danah boyd at Microsoft Research gave the keynote at SXSW on “Privacy and Publicity” last week, challenging the idea that personal information is on a binary spectrum of public and private.  It’s great to hear more and more people making this point, which is at the heart of CDP’s mission.

5) Google now has a service that lets you place your own ad on TV.  Really shockingly simple and easy, and fascinating in light of the growing fear that evil advertisers are taking over our lives.  Would it make a difference if we could all become advertisers, too?

Yea or Nay: NYPD Skywatch crime surveillance…coming to a corner near you.

Friday, March 19th, 2010

One of these just showed up nearby. Here’s more info on what these things are.

Not the most subtle device in the world. But maybe that’s just the point?

Mobile crime surveillance units?
