Archive for the ‘Interesting Uses of Data’ Category

Yea or Nay: Sympathetic Advertising

Wednesday, March 17th, 2010

Using facial recognition technology, an internal computer determines your gender and your age. The billboard then pulls up an ad based on your demographic, targeting your best possible interest. The billboard I tried out saw that I was indeed a woman in her thirties and… lo and behold, pulled up a very appealing lunch advertisement.

The author of this article compares this new technology to retina scanning technology in the movie “Minority Report” that allowed “billboards” to play ads that are tailored to YOU, personally, not you, as a member of a demographic group. Is that a fair comparison?

After all, the data behind the Japanese advertising technology probably looks more like this Wikipedia page on Japanese demographics than this IMDB page on Tom Cruise.

Still, it’s very easy to see the slippery slope between these two scenarios, in particular because they are collecting the faces they’re reading.

So the question remains, where’s the bright line between tracking people to gain a “general understanding” of what’s going and tracking individuals so they can’t get away with anything? Has this face-reading advertising technology already crossed that line?

What do you think?

Read faces to play demographically targeted ads?

View Results

Loading ... Loading ...

Yea or Nay: Track Taxis with GPS?

Wednesday, March 17th, 2010

We talk a lot on this blog about how tracking personal activities and collecting data can be extremely useful. We also talk about the need for better laws, regulations and shared social understanding of how such data should be collected, shared and used.

As part of our ongoing work to make sense of such a complicated and confusing set of issues, we’ll be collecting interesting “moral dilemmas” related to the issue of tracking human behaviors and posting them as a series of online polls. It’s an attempt to take a more “empirical,” case-by-case approach in an effort to keep high-level policy thinking rooted in reality.

If you come across something an interesting moral dilemma, please send them our way.

Without further ado, here’s the first poll:

Using G.P.S. technology installed in cabs, the (Taxi and Limousine) commission discovered more than 1.8 million trips where passengers were charged the higher rate.

Should we track taxis with GPS devices?

View Results

Loading ... Loading ...

Leaving bacterial “fingerprints” on digital devices.

Tuesday, March 16th, 2010

These knitted bacteria also happen to look like fingers.

We’re usually concerned with issues around leaving “digital” fingerprints (e.g. browsing behavior via cookies). But I couldn’t resist posting about new developments in using genetically specific bacterial traces to track your usage of digital devices (well really anything that retains bacteria.) Hmm, does this work on stainless steel?

Smart Grid Data: Unexpected and Amazing Reuses?

Tuesday, March 16th, 2010

As noted in “In the Mix,” the Center for Democracy and Technology and the Electronic Freedom Foundation recently issued joint comments to the California Public Utilities Commission regarding proposed policies around the use of smart grids and smart meters.

(via Flowing Data.)

And then a few days later, I saw this: EPCOR, a Canadian water utility company, issued a graph plotting water usage during the Olympic men’s hockey final.  Notice the spikes in water consumption (and toilet flushing) immediately after the first period, second period, third period, and finally when Canada wins the gold medal.

Is this our worst nightmare?  That someone will find out when we’re peeing?

That’s a bad joke. Plotting a large area’s water consumption in aggregate is not the same as what some of these smart meters are able to measure in terms of energy consumption.

But I do have a more serious point to make.  One of the points CDT and EFF make repeatedly in their comments is that we should avoid “unnecessary” data collection and destroy any “unnecessary” data.

What exactly does “unnecessary” mean?

Does it mean any purpose that is not related to the work of a utility company?  Who decides what’s unnecessary and should they decide what’s unnecessary and necessary now?

The beauty of data is that its potential value is unknown.  A single dataset, collected for one purpose, can be used for other purposes that are socially beneficial but rather unexpected.  For example, Google Trends was created for advertisers so that they can track what search terms are popular.  The CDC, however, has been using Google Trends to track flu outbreaks, by watching where people are Googling flu symptoms, data which is more quickly collected than reports from doctors.  The reason governments all over the world are pushing for open data is because we don’t know yet all that can be done.  By giving access to everyone, we expect interesting, useful, imaginative things to come out of the data we never might have imagined.

Data from the smart grids, in particular, will also require smart visualizations that are easy for individual consumers to understand and access.  Data alone isn’t going to change behavior.  You can imagine open data inviting developers to create easy to use apps that allow consumers to identify easily and painlessly ways to reduce energy consumption.  Some may even choose to share that information and compete with others, the way several universities have set up competitions between dorms.  As much as Al Gore was embarrassed by news revealing how much energy his mansion used, others may be eager to brag about how little energy they use.

Can we protect privacy while also creating room for imaginative and innovative reuse of data?

There are definitely privacy issues we have to consider.  I agree with a lot of the points made in CDT and EFF’s comments.  That “customer information” shouldn’t be limited to “personally identifying information.”  The misuse and misapplication of phrases like “personal information” is something we’ve been harping on for a while.  That customers should have access to the data collected from them and the power to correct mistakes.  That law enforcement shouldn’t be allowed to troll this information without a warrant, that civil litigants shouldn’t be allowed to access this information without a court order based on a showing of compelling interest and after notifying the customer to provide her with a chance to object.

But rather than talking about barring “unnecessary” data collection and data use, we should be thinking of ways to make the data safely available, regardless of whether someone has decided it’s necessary or not.  The data from smart grids is going to be both dangerous and valuable because it is so fine-grained; we clearly can’t just plop it online.  Anonymizing data is really hard.  So at CDP, we’re working hard at thinking about ways to come up with measurable privacy guarantees and testing technologies like PINQ that promise to provide access to raw data without indicating the existence of any particular individual in a dataset.  Other organizations may have different ideas.  I’m grateful for the existence of organizations that imagine the worst-case scenarios around data collection to protect our civil rights.  I also hope to see the growth of more organizations that try to imagine the best-case scenarios.

In the mix

Friday, March 12th, 2010

1) The CDC recently used shopper-card data to track a salmonella outbreak that sickened 245 in 44 states.  It turned out the pepper in salami made in Rhode Island was the culprit.  Although the CDC began to suspect through interviews and questionnaires that some sort of Italian meat product was the problem, the people they talked to couldn’t remember precisely what they had bought and the shopper-card records helped them identify the actual product.

Great story, right?  Unless you’re the director of Consumers Against Supermarket Privacy Invasion and Numbering, in which case, the story smacks of privacy invasion by the government.  The CDC got the records with the permission of the account holders, but to Katherine Albrecht and several of the commenters to the Yahoo News Story, that didn’t assuage their fears.

Here’s a choice quote: “I’d rather have a few die from poisoning and then they fix the problem then have the entire country enslaved, thank you very much.”

There was at least one person who pointed out commenting on a Yahoo news story wasn’t going to do much to preserve their privacy either.

2) MySpace is selling bulk user data! I’m with ReadWriteWeb:

I think the world is an awfully unfair mess and I’m hoping that data analysis will help illuminate some of the hows and the whys. Like the way that real-estate redlining was exposed back in the day by cross referencing census data around racial demographics and housing loan data. That illuminated systematic discrimination against black families in applying for home loans in certain parts of town. So too I think we’ll find a lot of undeniable proof of injustices and clues for how we might deal with them in big data today.

We don’t want another AOL debacle on our hands, but we also don’t want to give up on the possibilities of “big data” because we prematurely assume better privacy-creating techniques and standards aren’t available.

3) My, it’s a privacy-obsessed week!  Here’s one person’s argument “why no one cares about privacy.” It’s a good round-up of pithy quotes from people like Judge Posner, new “talk about me” sites like, and surveys demonstrating the change in the public’s attitude over time.  Wow, in 1998, 80% of people in a Harris poll said they were hesitant to shop online because of privacy worries.

Still, articles like this and the comments to the Yahoo CDC-shopper data article show how much our discussion of privacy involves people yelling at each other across a very big divide.  Is the choice really a binary one?  Privacy + a few deaths versus Big Brother + public health data?  I don’t care if the CDC has access to my grocery records; at the same time, I don’t plan to sign up for and broadcast my purchase of kale and four kinds of cheese this morning.  (Oops, I just did.)  Maybe we should stop talking about “privacy” and start talking about specific situations.

In the mix

Wednesday, March 10th, 2010

1) We’ve wondered in the past, why don’t targeted advertising companies just ask you to opt-in to be tracked?  When I first heard about it, I thought this newish website,, described on NPR, was doing something like that.  You actively register a credit card with the site and it shares ALL your transactions with your friends.  Except NPR reports the company was rather vague about how the information gets to marketing companies.  And what exactly are they offering anyway, other than the opportunity to broadcast, “I am what I buy”?  The only news being broadcast seem to be about people’s Netflix and iTunes buying tendencies.  Services like and and Patients Like Me are also using customers’ data to make money, but they’re offering a real, identifiable service in return.

2) Google explains why it needs your data to provide a better service.

Search data is mined to “learn from the good guys,” in Google’s parlance, by watching how users correct their own spelling mistakes, how they write in their native language, and what sites they visit after searches. That information has been crucial to Google’s famously algorithm-driven approach to problems like spell check, machine language translation, and improving its main search engine. Without the algorithms, Google Translate wouldn’t be able to support less-used languages like Catalan and Welsh.

Data is also mined to watch how the “bad guys” run link farms and other Web irritants so that Google can takecountermeasures.

This is an argument I’m really glad to hear.  It doesn’t make the issue of privacy go away, but I’d love to see privacy advocates and Google talk honestly and thoughtfully about what Google does with the data, how important that is to making Google’s services useful, and what trade-offs people are willing to make when they ask Google to destroy the data.

3) Nat Torkington describes how open source principles could be applied for open data. We heartily agree that these principles could be useful for making data public and useful, though Mimi, who’s worked on open source projects, points out that open source production, with its standard processes, is something  that’s been worked out over decades.  Data management is still relatively in its infancy, so open-sourcing data management will definitely take some work.  Onward ho!

4) The Center for Democracy and Technology and EFF are thinking about privacy and Smart Grids, which monitor energy consumption so that consumers can better control their energy use.  I’m more enthusiastic than EFF about the “potentially beneficial” aspects of smart meters, but in any case, it’s interesting to see these two blog posts within two days of each other.  Energy consumption data, as well as health data, are going to be two huge areas of debate, because the benefits of large-scale data collection and analysis are obvious, even though detailed personal information is involved.

5) The Onion reports Google is apologizing for its privacy problems, directed to very specific people. Ha ha.

“Americans have every right to be angry at us,” Google spokesperson Janet Kemper told reporters. “Though perhaps Dale Gilbert should just take a few deep breaths and go sit in his car and relax, like they tell him to do at the anger management classes he attends over at St. Francis Church every Tuesday night.”

In the mix

Thursday, February 25th, 2010

1) Interesting story on NPR last week about a new study using cellphone data to track people’s movements.  It turns out they were able to predict the nearest cellphone tower 93% of the time and their actual locations 80% of the time.  The potential value to public policy is significant.  It could affect how we put money into public transportation, for example.

Interestingly, though, no one mentioned any concerns about privacy, just a short statement that researchers don’t have names or numbers.  Seems like a perfect, obvious example of how that’s not sufficiently deidentifying, especially as the conclusion is that you can predict where people are.  Another researcher claims that he has data for half a million people and that “major carriers around the world are now starting to share data with scientists.”  What if we end up with another AOL scandal on our hands, and worse, the scandal keeps this kind of research from continuing?

2) The Open Knowledge Foundation has launched a set of principles for open data in science, in support of the idea that scientific data should be “freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. To this end data related to published science should be explicitly placed in the public domain.”

We certainly support more data being openly and freely available, but we’re curious.  How will we deal with the rights of people who are in scientific studies?  I’m not a scientist — do most agreements to participate in studies anticipate this level of public availability?  And how can we standardize data to be more easily comparable?

3) It’s not enough to have data. We also need tools to visualize, analyze, and understand data, and more and more tools are available for just that purpose.  Here’s a long list of mapping tools from the Sunlight Foundation, ClearMaps from Sunlight Labs, and Pivot, a new way to combine large groups of similar items on the internet, from Microsoft Live Labs.

In the mix

Wednesday, February 17th, 2010

1) A major study of children is having trouble finding volunteers.  A good exposition of how hard it is to set up a longitudinal study, which is why so many of our ideas about health are based on a very small number of studies.

2) The Sunlight Foundation has launched The Data Mine with the Center for Public Integrity, “to highlight inaccessible or poorly presented information from the federal government.”  On a related note, the Sunlight Foundation analyzed why the numbers of jobs reported by stimulus fund recipients differed from the number cited by President Obama in his State of the Union Speech.  A great reminder that the promise of data is not the same thing as access to good data.

3) Another person presenting his self-collected personal dataSome people love collecting and sharing information about themselves; others are terrified of anything leaking out about themselves.  How do we make personal data useful and relevant to the people in between?

Would PINQ solve the problems with the Census data?

Friday, February 5th, 2010

Frank McSherry, the researcher behind PINQ, has responded to our earlier blog post about the problems found in certain Census datasets and how PINQ might deal with those problems.

Would PINQ solve the problems with the Census data?

No.  But it might help in the future.

The immediate problem facing the Census Bureau is that they want to release a small sample of raw data, a Public Use Microdata Sample or PUMS, about 1/20 of the larger dataset they use for their own aggregates, that is supposed to be a statistical sample of the general population.  To release that data, the Bureau has to protect the confidentiality of people in the PUMS, and they do so, in part, by manipulating the data.  Some of their efforts, though, seem to have altered the data so seriously that it no longer accurately reflects the general population.

PINQ would not solve the immediate problem of allowing the Census Bureau to release a 1/20 sample of their data.  PINQ only allows researchers to query for aggregates.

However, if Census data were released behind PINQ, the Bureau would not have to swap or synthesize data to protect privacy; PINQ would do that.  Presumably, if the danger of violating confidentiality were removed, the Census could release more than 1/20 sample of the data. Furthermore, unlike the Bureau’s disclosure avoidance procedures, PINQ is transparent in describing the range of noise that is being added.  Currently, the Bureau can’t even tell you what it did to protect privacy without potentially violating it.

The mechanism for accessing data through PINQ, of course, would be very different than what researchers are used to today.  Now, with raw data, researchers like to “look at the data” and “fit a line to the data.”  A lot of these things can be approximated with PINQ, but most researchers reflexively pull back when asked to rethink how they approach data.  There are almost certainly research objectives that cannot be met with PINQ alone.  But the objectives that can be met should not be held back by the unavailability of high quality statistical information. Researchers able to express how and why their analyses respect privacy should be rewarded with good data, incentivizing creative rethinking of research processes.

With this research published, it may be easier to argue that the choice between PUMS (and other microdata) and PINQ is not between raw data/noisy aggregates, but rather bad data/noisy aggregates. If and when it becomes a choice between these two, any serious scientist would reject bad data and accept noisy aggregates.

Can we trust Census data?

Wednesday, February 3rd, 2010

Yesterday, the Freakanomics blog at the New York Times reported that a group of researchers had discovered serious errors in PUMS (public-use microdata samples) files released by the U.S. Census Bureau.  When compared to aggregate data released by the Census, the PUMS files revealed up to 15% discrepancies for the 65-and-older population.  As Justin Wolfers explains, PUMS files are small samples of the much larger, confidential data used by the Census for the general statistics it releases. These samples are crucial to researchers and policymakers looking to measure trends that the Census itself has not calculated.

When I read this, the first thought I had was, “Hallelujah!”  Not because I felt gleeful about the Census Bureau’s mistakes, but because this little post in the New York Times articulated something we’ve been trying to communicate for awhile: current methods of data collection (and especially data release) are not perfect.

People love throwing around statistics, and increasingly people love debunking statistics, but that kind of scrutiny is normally directed at surveys conducted by people who are not statisticians.  Most people generally hear words like “statistical sampling” and “disclosure avoidance procedure” and assume that those people surely know what they’re doing.

But you don’t have to have training in statistics to read this paper and understand what happened. The Census Bureau, unlike many organizations and businesses that claim to “anonymize” datasets, knows that individual identities cannot be kept confidential simply by removing “identifiers” like name and address, which is why they use techniques like “data swapping” and “synthetic data.” It doesn’t take a mathematician to understand that when you’re making up data, you might have trouble maintaining the accuracy of the overall microdata sample.

To the Bureau’s credit, it does acknowledge where inaccuracies exist.  But as the researchers found, the Bureau is unwilling to correct its mistakes because doing so could reveal how they altered the data in the first place and thus compromise someone’s identity.  Which gets to the heart of the problem:

Newer techniques, such as swapping or blanking, retain detail and provide better protection of respondents’ confidentiality. However, the effects of the new techniques are less transparent to data users and mistakes can easily be overlooked.

The problems with current methods of data collection aren’t limited to the Census PUMS files either.  The weaknesses outlined by this former employee could apply to so many organizations.

This is why we have to work on new ways to collect, analyze, and release sensitive data.

Get Adobe Flash player