Posts Tagged ‘Data Collection’

Metrocard alibi

Wednesday, November 19th, 2008

Thought-provoking article in The New York Times today—a man suspected of murder was released after his Metrocard records corroborated his alibi, that he had been on a bus, then with friends, and then on the subway around the time of the murder.

So Big Brother ended up in this case being Big Exonerator.  Yes, there are definitely serious privacy implications to the amount of data that is collected and stored through Metrocard, EZPass, and numerous other systems that catalog where we’ve been and where we’re going.  But the danger is not in the data itself.  The danger is in who has access to the data and how that data is used.

When J. Edgar Hoover was keeping secret files on American citizens, he did so without the benefit of the technologies we have today.  When the Bush administration conducted warrantless wiretapping, existing technologies certainly aided this work, but they didn’t motivate it or even ultimately condone it; that was Congress.

In this case, the police made no effort to check the Metrocard data despite Mr. Jones’s alibi.   Luckily, the data didn’t belong to the police but to the transit authority, and his lawyers were not prevented from getting that data from the MTA.

The power government gains from huge amounts of locational data can’t be eliminated, but we can try to balance it by demanding access to that data for individuals as well.

One last election-related post!

Thursday, November 6th, 2008

It’s hard to believe now, but the red state/blue state map only became a standard image in American politics in 2000, when it seemed to illustrate very vividly the sharp divides in the country, on politics, culture, even consumer habits.  Many people, however, used the same data in more granular form to show that the story was more nuanced than that, both in 2000 and 2004.  (UPDATED: And a new one for 2008.)

Now, in 2008, we have this great graphic from the New York Times, using data to tell a story, rather than simply provide a snapshot, of how the country has changed since 2004.


Compare this graphic, showing the counties in which Obama won more votes than Kerry (and the counties in which McCain won more votes than Bush), to the simpler red-blue map of the electoral votes won by each candidate.


If we were able to look even closer, we would be able to see how different issues and concerns may have influenced the decision to vote Democratic from county to county.

Who would be interested in that kind of data?  Not just Democrats wanting to gloat, but also Republicans wanting to analyze where their party is and should go, policymakers trying to understand people’s concerns, community organizers trying to galvanize people, even private individuals wanting to understand their community and their country a little bit better.

Now that the election is over, we can really start thinking about what happens next, for our country and our world.  More data, not just for data’s sake, but for more understanding.

The great story of good data

Wednesday, October 22nd, 2008

I love stories.

You might think, then, that I wouldn’t love data.  Stories and data are often seen as two very different ways of presenting information.  Data is considered cold, impersonal, incomplete.

But much of data’s bad reputation comes from limited data, not data in and of itself.  As Hans Rosling, a Swedish professor, demonstrates in this video, data can tell amazing stories.

It’s long but well worth watching in its entirety.  It’s a few years old, from the 2006 TED conference, but I would bet it’s almost as riveting on YouTube as it was live at the conference.  In it, Rosling uses animated graphs of UN statistics from 1962 to 2003 to tell stories about our world and how it’s changed in ways that defy easy generalizations.

For example, his Swedish medical students studying global health assumed that there were two kinds of countries in the world—Western countries where family sizes are small and people live longer, and Third World countries where family sizes are large and people die young.  But as his animated graphs show, many countries that are still poor and developing have moved by 2003 into the upper left-hand quadrant, of countries with smaller families and longer life expectancies.  By 2003, Vietnam is in the same place the United States was in 1974.  As he declares, “If we don’t look at the data, we underestimate the tremendous change in Asia.”

In my favorite segment, like a great novelist building a complex character, Rosling breaks down one set of data over and over, showing the much more interesting and complex story behind average income and child survival.


The first graph, comparing GDP per capita among countries in the OECD, East Asia, South Asia, Africa, and Latin America, tells the story we all expect.  The blue dot in the upper right-hand quadrant is OECD countries; the small red dot in the bottom left-hand quadrant is Africa.


But then he shows how different countries within Africa have tremendous variations in GDP per capita, as well as child mortality, despite Western conceptions of a monolithic “Africa and its problems.”


And just when you’re patting yourself on the back for understanding that Africa includes a very diverse range of countries, he shows that even within the countries, the distribution of income is very broad.  The highest income quintile in South Africa is quite high, approaching the average per capita GDP in the United States.

As Rosling says, “Improvement of the world must be highly contextualized!”  And the data is what will allow us to do it.  His demonstration itself shows how data can be limiting, how it can be used to “prove” that all of Africa is poor and sick.  But the solution clearly isn’t to ignore the data but to look at more data. Ultimately, broad, detailed, longitudinal data push us to think harder, rather than rest on our assumptions. Stories still need to be told–how did Mauritius get wealthy and healthy?  Why didn’t Ghana? But without the data, we wouldn’t even know those stories were there.

Amazon’s red and blue book-buying map

Wednesday, October 15th, 2008

Sorry, it’s another semi-political post!


We at the Common Data Project are definitely interested in more than politics, but this Amazon map of political book-buying state by state was too interesting not to blog about. It illustrates so many things I believe in.

One: Information-sharing can be fun.

People love patterns, and even more, knowing where they fit into them. The Amazon customers who are most likely to be drawn to this map are those who have bought political books, books that fall into the red, blue, or purple categories. No one is likely to be outraged that his purchase of Thomas Friedman’s book in the last 60 days got counted in designing this map. Although there’s a lot of data collection that Amazon prefers to keep on the down-low, this kind of tracking is refreshingly open and explicit. We know it’s being collected, and most of all, we get something in return. We all get to enjoy the data as well.

Two: Data has limited value if there is limited context.

As pretty as this map is, it doesn’t really provide much information. Junk Charts lays out a lot of the deficiencies that limit our ability to draw any meaningful conclusions. Providing the map with just the states colored in, but without real sales numbers, doesn’t give you a real sense of which books are selling better, in the same way that the 2004 election red-blue maps with their wide swaths of red in the middle didn’t provide real information about population density and how close the election had actually been, nor how seemingly blue or red states actually contained significant pockets of people who had voted for the other guy. How many people in South Dakota bought a “red” book? Ten, twenty, or a hundred thousand?

The paucity of information on how books were rated red, blue or purple drove me crazy, too. Every place I clicked to “Learn more,” it took me to the same very short four paragraphs. It says that the categorization was based on the book’s own promotional materials and the tags readers added to them, but I still wonder who categorized these books and precisely how they did so. Would all the authors necessarily have labeled their books as blue or red?

And if they were categorizing books as purple, as neither obviously liberal nor conservative, why didn’t they include them in the percentage calculations by state?

Three: Underlying data should always be available for alternative analyses.

A lot of people are wary of data; they’ve heard too many times how numbers can be twisted to serve any purpose. We at the Common Data Project make no promises that data = truth, only that when data is truly open and available, conclusions based on that data can then be prodded, tested, and possibly refuted.

In this case, I’m not quite sure if Amazon does have a conclusion to assert, but the decisions it made about which data to include and exclude have shaped the map presented. One conclusion you might draw from a cursory glance might be the same one drawn by one of the commenters to the Junk Charts post—that people only read books they’re likely to already agree with. Imagine now if we could test that conclusion, if we could count how many readers in each state bought both “red” and “blue” books, or if there were readers who would consider themselves “conservative” but bought “liberal” books. Maybe there’s a very active and large political book club in Wyoming buying books from across the spectrum!
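To make that thought experiment concrete, here is a minimal sketch of the count I have in mind. Everything in it is an assumption for illustration: Amazon has not released purchase records in this form, the rows are invented, and the function name `cross_spectrum_share` is hypothetical.

```python
from collections import defaultdict

# Hypothetical purchase records: (customer_id, state, book_category).
# These rows are invented for illustration; no real Amazon data is used.
purchases = [
    ("c1", "WY", "red"),
    ("c1", "WY", "blue"),
    ("c2", "WY", "red"),
    ("c3", "SD", "blue"),
    ("c3", "SD", "blue"),
    ("c4", "SD", "red"),
    ("c4", "SD", "blue"),
]


def cross_spectrum_share(records):
    """Return, per state, the fraction of buyers who bought both red and blue books."""
    # Collect the set of categories each customer bought, keyed by state.
    categories = defaultdict(set)
    for customer, state, category in records:
        categories[(state, customer)].add(category)

    buyers = defaultdict(int)  # distinct buyers per state
    both = defaultdict(int)    # buyers with at least one red AND one blue book
    for (state, _customer), bought in categories.items():
        buyers[state] += 1
        if {"red", "blue"} <= bought:
            both[state] += 1

    return {state: both[state] / buyers[state] for state in buyers}


shares = cross_spectrum_share(purchases)
print(shares)
```

A state-level share like this still hides the absolute numbers, so it would need to be read alongside the buyer counts, for the same reasons the colored map does.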

It may very well be true that people who identify as conservative buy “red” books, while people who identify as liberal buy “blue” books, but the map as provided doesn’t provide enough information to truly test that conclusion or propose interesting hypotheses of why that’s happening.

Still, I had a good enough time playing around with the map that I was reminded of a book I’ve been meaning to read, which is probably Amazon’s ultimate goal anyway!

Freep this poll!

Thursday, October 9th, 2008

Have you ever been asked to “Freep this poll”?

The word “freep” comes from the “Free Republic,” an online forum for conservatives whose members are regularly informed of online polls and told to go vote en masse.  Although they don’t necessarily admit to “cheating” the polls, they have been accused of clearing cookies or otherwise circumventing the systems set up to prevent one person from voting multiple times.

Conservatives aren’t the only ones “freeping,” though.  The term has migrated across the political spectrum, and readers of decidedly more liberal sites, like DailyKos, are regularly asked to freep a poll.  And right after a presidential debate is prime freeping time for everyone, as nearly every newspaper and cable news channel will set up online polls asking, “Who won?”

I think freeping is great.

Freeping makes obvious how ridiculously inaccurate online polls can be.  Der Spiegel, a German magazine, was shocked when 59% of the respondents to a 2004 online poll asking readers to rate President Bush’s performance in office chose “excellent.”  It turned out the poll had been freeped.  When freeping skews results to the point that no one can believe them, well, that’s a blow for truth, not ideology.

But being an ever-so-optimistic sort of person, I think freeping also shows the potential of online polls, and online measures of public opinion in general, to be more accurate than they are today.  Online polls are popular, despite being obviously inaccurate, because they’re cheap and fun (for those who just can’t get enough of sharing their opinions).  Most of all, at least in theory, they can reach a much larger group of people than professional pollsters.

The problem is that this larger group, even before freepers get involved, is shaped by the website and the audience it tends to draw.  (And of course, the world of people online is already smaller than the world as a whole.)  It wasn’t surprising, nor particularly revealing, that the people who went to the conservative Drudge Report and voted in its poll rating the Palin-Biden VP debate overwhelmingly found that Palin had won.  But if liberal online politicos had freeped the poll, they could have made the poll more representative of our country’s mix of conservatives and liberals.  And vice versa.

My point is that freeping, as creepy as it seems, is one of those strategies that’s open to everyone, left, right, liberal, conservative, polka-dotted or striped.  Some people will always just enjoy freeping for the sake of messing up the system, to enjoy their power to clear cookies and skew polls, though as I stated above, that can easily go so far that no one believes the results.  But if freeping pushes people to participate in polls in forums where they normally wouldn’t be heard, well, that sounds kind of democratic.  Sure, we still have that problem with ensuring one vote per person, but if we thought online polling could have more than entertainment value, maybe we would try harder to come up with better systems.  (I wonder if it would be possible to set up an online poll that actually let you vote as often as you wanted, but indicated you had done so.  Sometimes it’s entertaining to see who cares the most, or maybe more accurately, has the most time on his hands.)  As Mimi stated earlier, choosing to participate in polls, surveys, and studies that shape our world and our lives is increasingly becoming as democratic a duty as voting in the election booth.
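The parenthetical idea above, a poll that lets you vote as often as you want but shows that you did, can be sketched in a few lines. This is only an illustration under a big assumption: that each voter already has a reliable `voter_id`, which is exactly the one-vote-per-person problem that remains unsolved. The class and method names are hypothetical.

```python
from collections import Counter, defaultdict


class OpenPoll:
    """A poll that accepts repeat votes but records how many each voter cast."""

    def __init__(self, options):
        self.options = set(options)
        # voter_id -> Counter of how often that voter chose each option.
        # Assumes voter_id reliably identifies a person (not solved here).
        self.votes = defaultdict(Counter)

    def vote(self, voter_id, option):
        if option not in self.options:
            raise ValueError(f"unknown option: {option}")
        self.votes[voter_id][option] += 1

    def raw_totals(self):
        """Every click counts: the freepable number."""
        total = Counter()
        for per_voter in self.votes.values():
            total.update(per_voter)
        return total

    def one_per_voter_totals(self):
        """Each voter counts once, for the option they chose most often."""
        total = Counter()
        for per_voter in self.votes.values():
            favorite = per_voter.most_common(1)[0][0]
            total[favorite] += 1
        return total

    def most_active(self, n=3):
        """Voters ranked by how many votes they cast."""
        counts = Counter({v: sum(c.values()) for v, c in self.votes.items()})
        return counts.most_common(n)


poll = OpenPoll(["Palin", "Biden"])
for _ in range(5):
    poll.vote("voter_a", "Palin")  # one enthusiastic voter, five votes
poll.vote("voter_b", "Biden")
poll.vote("voter_c", "Biden")

print(poll.raw_totals())
print(poll.one_per_voter_totals())
print(poll.most_active(1))
```

Publishing both tallies side by side would let readers see the freeping rather than be fooled by it.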

Politics and Privacy, Part II

Thursday, October 2nd, 2008

Last week, I wrote about how political data collection has shown that data collection doesn’t have to be a completely one-way street, but rather, can involve individuals’ active and sometimes almost enthusiastic participation.  Part of the enthusiasm comes from a belief that this is what democracy is about—we have the right to try to persuade our fellow citizens, whether from a soap box in the town square or by calling a voter list through a phone bank.  But the data collection by political campaigns encompasses a lot more than name, occupation, and email address.  Karl Rove revolutionized it, with his famous use of consumer preferences to identify and target likely Republican voters, but the Democrats have worked hard to catch up, Catalist being one of the big players in this effort. It’s one thing to compile donor lists; another to cross-reference “beer versus wine” preferences to voter lists.  How is democracy affected by intense, data-based voter profiling?

As Solon Barocas pointed out during his talk on voter profiling at the recent DIMACS workshop, researchers have found that micro-targeting voters can increase polarization and divisiveness.  As candidates are able to air one radio ad for the Latino voters in one state and a different one for the white voters in another, they’re able to espouse more extreme positions than they would if forced to appeal to a more general audience.

If true, this is a serious problem.  But I like to believe that in the long run, and done right, political data collection and analysis could actually enable new kinds of consensus and coalition-building.  For one, in an era where blogs monitor political campaigns hour-by-hour, a local radio ad can be made available to a national audience no matter which micro-audience was originally targeted.  (Update: we can even find out about “telephone” calls to the deaf community!)

But more importantly, I can imagine that if voters and not just campaigns were able to see who else felt the way they did on major issues, many might be surprised.  Solon mentioned that despite the headlines, the algorithms by which likely Democratic or Republican voters are identified are not as simple as beer = conservative, wine = liberal.  Yes, campaigns believe they can figure out who in a community might lean in their direction, but it’s a much more complicated calculation.

So if people chose to share and know who else felt similarly, in ways that were more fine-grained than national polls, really interesting things could happen to our political discourse.  The Left Coast environmentalist might learn the hunter in South Dakota shares a commitment to conservation.  The pro-choice atheist and the pro-life Catholic might learn they both oppose the death penalty.  I’m not advocating that we throw open the curtains on the voting booth.  But knowing how our fellow citizens feel about the issues facing all of us—it almost sounds like that old-fashioned American democratic institution, the town hall meeting.

After all, democracy is the ultimate social activity.  We’re supposed to be making decisions together.

Politics and Privacy, Part I

Friday, September 26th, 2008

Rock the Vote Application

Catalist and Rock the Vote recently launched an effort to increase voter registration through a very exact tool, a Facebook application.   Using Catalist’s voter targeting databases, and knowing who has downloaded a voter registration form from Rock the Vote’s widget, they’re asking Facebook users to call the people who never actually sent in their forms and remind them to do so.

I’m curious to know how potential voters are responding to these phone calls.  Given Rock the Vote’s target demographic, and the age of most users on Facebook, they may not be as shocked to get a phone call as an older voter might. And in general, I think people are more aware that their personal information is being collected, analyzed, and shared in the political context than they are in other contexts.  I’ve had friends tell me they don’t make donations, even to candidates they support, for fear of getting on “some list.”  And anyone who has ever lived in a state or district involving a close race knows that it’s not uncommon to have a total stranger call you or even knock on your door and ask for you by name.

These kinds of intrusions can be annoying, and in some communities, being outed as a Democrat or a Republican can have more serious repercussions.  But in general, I don’t think the public is as uncomfortable with this kind of data collection by political campaigns and the Federal Election Commission as they are when it’s being done by search engines or ISPs.  (I’m not talking specifically about detailed voter profiling and data mining, which I think is slightly different and will blog about separately.)

I think there are a couple of reasons for this.  First, people believe there are a number of issues that have to be weighed.  It’s not just their privacy rights versus a company’s profits, but their privacy rights versus democratic principles, like government transparency in the case of FEC disclosure.  Second, the data collection is extremely obvious.  We all know campaigns are tracking who’s donated, so they can ask again and again and again, at least until that maximum contribution limit is reached.

Most importantly, though, people want their candidate to win.  If they are contributing more than $200 in an election cycle to a political candidate, they can live with being in the campaign’s database, as well as the FEC’s.  If they care enough to go to a rally and then are asked for their email address, they don’t mind being sent emails from the campaign day after day.  They know that if they are called during dinner and reminded to vote for their candidate, the other likely voters are being called, too.  Heck, the most enthusiastic supporters are using the data themselves, by volunteering for phonebanks and canvassing, as with the Catalist/Facebook application.

Political data collection has some lessons to teach data collection in other arenas.  Don’t try to hide what you’re doing—be obvious.  Even more importantly, give people an incentive to provide information.  Google and Yahoo can assure us that the log data, the IP addresses, the tracking they do when we’re logged into their email accounts, are all meant to provide us a better service, but we don’t really feel like we’re getting something out of it, especially compared to what they’re getting out of it.  These companies currently seem to be working on the model of “Don’t worry, whatever we’re doing won’t hurt you.”  The model should be, “Participate and get value out of the data yourselves.”

Google announces data will be “anonymized” after nine months–but then what?

Tuesday, September 9th, 2008

Everyone is in a tizzy with the news that Google is slashing its data-retention policy from 18 months to nine.  To be more specific, Google will “anonymize IP addresses on our server logs after 9 months.”  The announcement, though, only highlights for me the lack of clarity around the word “anonymize” and the general lack of information around what these data retention policies are actually doing for users’ privacy.

Data-retention is a big issue for some privacy advocates, on the theory that something like the AOL privacy scandal wouldn’t have happened if AOL hadn’t been storing the search queries to begin with.  But as we’ve stated before, we at CDP don’t think data deletion is the answer.  In fact, we’re concerned that announcements like the one today from Google can actually further confuse consumers about what’s at stake.

To begin with, Google isn’t promising to delete its data after nine months, just to “anonymize” it.  The company knows that the word “anonymize” can mean quite a lot of things, and even says so: “We haven’t sorted out all of the implementation details, and we may not be able to use precisely the same methods for anonymizing as we do after 18 months…”

Google is being prodded by the European Union’s stricter regulations around privacy, but even the EU directive on data retention only states, “Such data must be erased or made anonymous when no longer needed for the purpose of the transmission of a communication, except for the data necessary for billing or interconnection payments.”  No clear directive on what “made anonymous” means.

When AOL made its search query data public, the company thought it had “anonymized” it.  Same when Netflix released its data.  That didn’t stop people from individually identifying people in the “anonymized” data set.  I trust that Google’s engineers are not using AOL’s and Netflix’s “anonymization” techniques, but it’s clear that focusing so much on the length of time data is retained draws attention away from what happens after the nine months are up.

How should we define “personal information”?

Thursday, September 4th, 2008

We at CDP recently decided that in keeping with our work on developing new standards for online data collection, we should also create a survey of the privacy policies of the biggest online companies. We want not only to help users understand privacy policies more quickly and easily, but also to help them compare the practices of different companies.

As a result, I’ve been spending a lot of time reading privacy policies.  I knew it wouldn’t be a fun activity, but it’s also been challenging in ways I didn’t quite anticipate.  As I started to sit down and actually compare policies across a set of specific issues, it quickly became obvious that although they use many of the same words (private, personal, anonymous), they aren’t all using the same definitions.

For example, Yahoo defines “personal information” as “information about you that is personally identifiable like your name, address, email address, or phone number, and that is not otherwise publicly available.”  Although it discusses the collection of other information, like log data and IP addresses, it never calls this information “personal.” takes a similar tack, disclosing that it does collect such information, but calling it “anonymous information.”

AOL, in contrast, defines “AOL Network Information” as “personally identifiable information” that includes data like IP addresses, sites visited, and search history.  Of course, AOL can’t pretend that such data is actually “anonymous.”  After all, its proud release of “scrubbed” search query data two years ago was quickly shown to reveal the individual identities of thousands of users.

So what do you think?  When a privacy policy makes promises about your “personal information,” should that include your search query history, your IP address, and your log data?  If not, does that mean these companies are free to do what they will with this data?  Leave it unsecured? Hand it over to marketers, government, anyone?

And what does it mean to us, as a society, that companies are defining these words on their terms?

Yahoo: restoring your “sense” of privacy, not privacy itself

Friday, August 15th, 2008

Hot on the heels of the launch of Cuil and its no data collection policy, Yahoo announced recently that it would allow users to opt out of targeted advertising on its own websites.

The new policy was announced in response to a letter sent by four members of the House of Representatives to 33 Internet and telecommunications companies. The first question of the letter was, “Has your company at any time tailored, or facilitated the tailoring of, Internet advertising based on consumers’ Internet search, surfing, or other use?” Ha!

In all fairness, I’m glad our elected officials are asking even simple questions. I just hope that they won’t be satisfied with overly simple responses. As many of the commenters to the Bits blog post pointed out, the issue is not so much whether the user is forced to view targeted ads, but what kind of data collection is done in order to send these users targeted ads. Chris Hoofnagle notes,

The problem with opt-out rights in the online advertising context is that it results in a worst case scenario for consumers: the opt out typically only applies to receiving targeted advertising, so the company still tracks the consumer’s behavior, but the consumer doesn’t enjoy the benefit of targeted ads.

This form of opt-out reflects a 20th century conception of privacy–privacy means not being contacted. In the 21st Century, we need to understand more subtle problems, such as the privacy risks from online advertisers’ mere collection and use of data.

Exactly. This is not about being put on the Internet equivalent of the “Do Not Call” registry. Does Yahoo think I would be okay with having data collected about me, as long as I never see the evidence they’re doing it?

P.S. Then again, there are certainly users like Commenter #8, whose vanity is hurt that Yahoo is sending her ads about reducing wrinkles. But deep down, even she seems to realize only her “sense” of privacy is being restored, not her privacy itself.
