Archive for the ‘Interesting Uses of Data’ Category

Cuil: Is zero data collection the answer?

Monday, August 11th, 2008

Cuil, the new search engine, launched with much fanfare this past week. It’s been blogged about all over the place already, so I’m not going to analyze how its results compare to Google’s. I’m more curious about its privacy policy, which trumpets that it collects NOTHING, nada, zip, zilch.

I found it sort of funny that the other big news in search engines recently was Google’s announcement that it was launching an updated version of Google Trends called Google Insights for Search. While one search engine bragged about its lack of data collection, the other was showing it off.

The two news items together highlight the problem at the heart of our ongoing search for more privacy online. Despite all the handwringing over online data collection, especially by big search engines, people love seeing the data that gets collected, even when they’re not advertisers. We want to see how often we’re mentioned in Twitter, or what parts of the world are searching for topics we blog about. It’s not hard to imagine more serious research and analysis being applied to this data and real social good coming out of it.

I’ve never found very compelling the National Rifle Association’s argument, “Guns don’t kill people; people kill people.” But I find myself wanting to say something similar about data collection: “Data collection doesn’t violate privacy; irresponsible people and laws violate privacy.” Shutting down data collection altogether can’t be the answer.

Let’s ask the government to give us information!

Monday, July 7th, 2008

My contracts professor from law school, Ian Ayres, suggests in his book Super Crunchers that the IRS become a source for useful information for ordinary people. The agency could tell taxpayers how much others in their income bracket, on average, are donating to charity or contributing to their IRAs, or tell small businesses whether they might be spending too much money on advertising.

The idea isn’t so far-fetched. About two months ago, the Italian government caused an uproar when it published online the tax details of every single Italian taxpayer. Allegedly meant to fight tax evasion, the move by the outgoing government sounded more like it was motivated by political spite. The most fascinating thing for me, though, was reading various comments in the blogosphere and finding out Norway, Sweden, and Finland do this every year! Apparently, the tax documents are considered official and therefore public records. According to the Swedish government, it’s in keeping with a general principle of government transparency: “To encourage the free exchange of opinion and availability of comprehensive information, every Swedish citizen shall be entitled to have free access to official documents.” And no one really minds.

Of course, this would be inconceivable in the U.S.—there’s a law against it. But as Ian Ayres suggests, the idea that the government should be giving information back to us, instead of just collecting it from us, isn’t totally crazy and Scandinavian. It could be released in anonymized aggregates or in other ways that wouldn’t reveal how much our neighbor makes. The information could be genuinely useful, not just titillating.

There could even be implications for public policy. So much of government policy is expressed in the Internal Revenue Code (such as favoring homeownership over renting), but our debates about tax cuts, mortgage deductions, and credits are based on fairly imprecise numbers. Even as we argue about what a tax cut will do to the “middle class,” we don’t even know what the “middle class” is. Where should government transparency start, if not at the point of revenue collection?

The difference between what you do and what you think you should do

Wednesday, June 25th, 2008

What could be more American than apple pie? Why, an orgy, of course!

Is anyone surprised that there are more Google searches for “orgy” than “apple pie”? Does this mean “smut peddlars” should be re-characterized as mainstays of mainstream culture? That’s the defense strategy in a trial of a “pornographic Web site operator.”

Now, the parallels the defense is trying to draw are ridiculous. There are 1001 reasons to explain the fact that “orgy” is more popular than “apple pie” and “watermelon” on the internet. For one, “apple pie” is way more specific than “orgy”. “Restaurants” versus “orgy” might be a more interesting comparison.

Still, that would be hard to do right because there are endless variations when it comes to searching for a place to eat. Realistically, how many ways are there to search for “orgy”?

Nevertheless, it is always interesting when raw data about what we do undermines what we think we should do. Does increasing access to such “behavioral” data mean an end to hypocrisy? Or the erosion of a basic human device that helps us all get out of bed and face the world each day?

Frequently Asked Question #1: Why is Google offering Google Health?

Wednesday, May 21st, 2008

Everyone must be wondering the same thing I am, as the number one question in the Google Health FAQ is: “Why is Google offering this product?” Related, of course, is Question #6: “If it’s free, how does Google make money off Google Health?”

Unfortunately, the answers aren’t very satisfying.

“It’s what we do. Our corporate mission is to organize the world’s information and make it universally accessible and useful. Health information is very fragmented today, and we think we can help. Google believes the Internet can help users get access to their health information and help people make more empowered and informed health decisions. People already come to Google to search for health information, so we are a natural starting point. In addition, we have a lot of experience storing and managing large amounts of data and developing consumer products that offer a positive and simple user experience.”

I thought their mission, as a corporation, was to maximize profits for their shareholders.

The answer to Question #6 is even worse:

“Much like other Google products we offer, Google Health is free to anyone who uses it. There are no ads in Google Health. Our primary focus is providing a good user experience and meeting our users’ needs.”

But we all know that “other Google products” that are free make money through advertising. And there are “no ads in Google Health”?

In launching Google Health, Google has clearly acknowledged that health information is even more sensitive than the personal information the company has been assiduously collecting up to this point. Although it glosses over the differences between its other applications and Google Health, promising to “conduct our health service with the same privacy, security, and integrity users have come to expect in all our services,” the mere fact that it doesn’t have advertising trumpets that Google is trying to differentiate Google Health from something like Gmail.

But the harder Google tries to assure me that there is no advertising and that the service is free, the harder it is for me to believe there are truly no costs to me. Clearly, there is a real value to providing secure online access to personal health records. Medical records, for the appropriate people, should be accessible, transferable, and plain legible, as anyone who has tried to read a doctor’s handwriting can attest. So why would someone give me something for nothing?

According to the Wall Street Journal, Google is not ruling out advertising in the future, and in the meantime, it hopes Google Health will simply drive more users to Google in general. Perhaps Google itself doesn’t quite know where Google Health will go. But given how easy it is to imagine nightmare scenarios of what can happen with this kind of information, I want the company that’s collecting and storing it to have a better story about why it’s doing this.

Follow-up photos from MoMA’s “Design and the Elastic Mind”

Tuesday, April 8th, 2008

I forgot my camera the first time I saw this exhibit on a Friday night, with free admission courtesy of Target, and so the photos below don’t capture the enthusiasm and almost sweaty energy of the intense crowd that filled every corner of the exhibition space that night. These photos are from early Wednesday morning last week, with a considerably thinner crowd, and although they’re not fantastic photos, I hope they show some of the curiosity and engagement I saw on people’s faces.

Looking at “Flight Patterns” by Aaron Koblin

An example of Mimi’s point: data of flight patterns imposed on a map, immediately conveying information as well as something nice to look at.

“I Want You to Want Me” by Jonathan Harris and Sep Kamvar

A sweet and funny work playing with data from online dating sites, certainly a database of societal concerns, if not as serious as the Architecture and Justice piece on prison populations.

“Shadow Monsters” by Philip Worthington, probably the most popular piece

And last, something completely unrelated to data, but probably best at conveying how fun this whole exhibition is.

“Data” as a mainstream consumer good? 2 approaches.

Wednesday, April 2nd, 2008

Two examples of “data” becoming a mainstream consumer good.

1. YouTube launches video stats

Google Analytics for YouTube

Ostensibly, the service is aimed at the “general public” uploading videos.

“Insight gives the creators an inside look into the viewing trends of their videos on YouTube, and helps them to increase views and become more popular,” said YouTube Product Manager Tracy Chan.

But of course, such a tool is useful to advertisers as well.

“Partners can evaluate metrics to better serve and understand their audiences, as well as increase ad revenue. And advertisers can study their metrics and successes to tailor their marketing — both on and off the site — and reach the right viewers.”

2. More exciting is PatientsLikeMe.com

PLM is a web service that provides treatment data for diseases like Parkinson’s disease, multiple sclerosis, and AIDS, collected from individuals.

ALSFRS-R Progression of Patients on Lithium

From the recent NYT Magazine profile:

…PatientsLikeMe seeks to go a mile deeper than health-information sites like WebMD or online support groups like Daily Strength. The members of PatientsLikeMe don’t just share their experiences anecdotally; they quantify them, breaking down their symptoms and treatments into hard data. They note what hurts, where and for how long. They list their drugs and dosages and score how well they alleviate their symptoms. All this gets compiled over time, aggregated and crunched into tidy bar graphs and progress curves by the software behind the site. And it’s all open for comparison and analysis. By telling so much, the members of PatientsLikeMe are creating a rich database of disease treatment and patient experience.

Why is this interesting? Well, instead of establishing a parasitic relationship in which the web service more or less “spies” on its users and then makes money off the data it collects by selling it to advertisers, PatientsLikeMe sets up individuals and web services in a symbiotic relationship in which the user has a stake in the data, because the user is the one who gets value out of the aggregates. This is more sustainable not only from a PR perspective but also from a data quality perspective. If you’re trying to understand how your personal treatment profile stacks up against others, the more detailed and accurate your information is, the more you get out of the service, and the more valuable the service is to you and to others.

With YouTube, Google is still playing cat and mouse with its users, hoping they won’t notice or care about the data that’s being collected and sold. PatientsLikeMe, on the other hand, is part of an emerging crop of web services (Freshbooks and Wesabe, to name two in the finance genre) that build a symbiotic relationship with their users. Of this new breed of data-driven services, PatientsLikeMe is perhaps the most ground-breaking because the user’s relationship to the service is (for a change) just so obvious:

The community as a whole succeeds or fails on the individual contributions of its members.

Data Visualizations at MoMA’s “Design and the Elastic Mind”: Beautiful, clever and heartwarming; but what is it trying to tell me??

Friday, March 28th, 2008

CDTF made a field trip to MoMA to see the Design and the Elastic Mind exhibit. I won’t try to summarize the exhibit here, but I think we each left more hopeful and inspired!

One lightbulb that went off (again) for me at the exhibit was just how hard it is to create truly “meaningful” visualizations of data.

I mean “meaningful” in the literal, not metaphorical sense of the word. Almost all of the exhibits were some combination of beautiful, clever, or heartwarming.

But the ones that most effectively communicated information, as opposed to just data, were visualizations on maps and timelines: San Francisco taxi traffic patterns, the flow of IP data across the globe, a timeline of Wikipedia edits.

Why? Maps have intrinsic meaning: they are a representation of the physical world we live in. (Calendars too.) As a semantic-rich canvas for data visualizations, maps become the lens through which we extract knowledge from the information presented to us.

Takeaway? When visualizing data, the backdrop is just as important as the actors in the show because, by providing context, it gives us a frame of reference to begin asking questions of the data: What is the significance of how the data points fall? Well, that depends on the semantic significance of the space they inhabit.

Now the question is, can we build a repertoire of semantic-rich canvases for visualizing data beyond maps and calendars?

Here are just a handful of the exhibits. They either fall into the category of: No explanation needed; or Cool, but what does it mean? (Pictures taken from the MoMA website.)

Which ones do you “get” right away?

Cabspotting in San Francisco (Amy Balkin)

Rewiring the Spy: “Mapping” terrorism in the news. Haunting. (Lisa Strausfeld and James Nick Sears)

Google Earth Mashup: Sea level rise flood maps of the New York area. (Alex Tingle)

Emergent Surface: Motorized sculpture responds to its environment. Gorgeous. (Hoberman Associates)

I Want You to Want Me! Snippets from dating sites. Cute! (Jonathan Harris and Sep Kamvar)

Text Arc: “Mapping” Alice in Wonderland. Egg-shaped whimsy. (W. Bradford Paley)

Sonumbra: A tree of light that responds to the people in the room. Eerily soothing. (Rachel Wingfield & Mathias Gmachl)

“Mapping” the Internet. Oddly 80s! (Bill Cheswick)

A nonprofit wants to share its mailing list with some economists–would that bother you?

Thursday, March 13th, 2008

There’s a fascinating article in the New York Times Sunday Magazine on a study of what makes people donate, by an interesting liberal-conservative pair of economists, Dean Karlan and John List. They wanted to do an empirical study of fundraising strategies, to find out what kind of solicitations are the most successful. As the article points out, lab experiments of economic choices aren’t particularly realistic: “If you put a college sophomore in a room, gave her $20 to spend and presented her with a series of pitches from hypothetical charities, she might behave very differently than when sitting on her sofa sorting through letters from actual organizations.”

So Karlan and List found an opportunity for a field experiment, a partnership with an actual, unnamed nonprofit that allowed them to try different solicitation strategies and map the outcomes. They wrote solicitation letters that were similar, except some didn’t mention a matching gift, some mentioned a 1-to-1 match, some a 2-to-1, and some a 3-to-1. In the end, mentioning a matching gift increased the likelihood of a donation, but the size of the match made no difference. As the author, David Leonhardt, notes, their findings and the findings of other economists in this area are significant to many people, from the nonprofits trying to be better fundraisers to economists studying human behavior, even to those who want to make tax policy more effective and efficient.

The article, however, didn’t mention whether the donors to the nonprofit had consented to their responses being shared with anyone other than the nonprofit. I’m not that concerned about whether donors’ privacy may have been egregiously violated. (I’m also not sure what’s required of nonprofits in this area.) I’m just curious to know, if they had been given the choice, would they have agreed to their information being shared with the economists? Obviously, the study wouldn’t have worked if potential donors had been told they would be sent different solicitation letters to measure their responses, but I think if most people on a nonprofit’s mailing list were asked if they would explicitly allow their information to be used in academic studies, they would consent. They might want assurances that their individual identities would be protected—that no one would know Mr. So-And-So had given zero dollars to a cause he publicly champions. But they might very well be willing to help the nonprofit figure out how to be more effective and be a part of an academic study that could shape public policy. They might even be curious to know how their giving compares to that of other donors in their income brackets or geographic areas.

Most people, myself included, have a knee-jerk antipathy to having their personal information shared with anybody other than the organization or company they give it to. But maybe we would feel differently if we were actually given some choices, if our personal identities could be protected, if sharing information could lead to more than just targeted advertising or more junk mail.

Property Shark and “Contextual Integrity”: Where real estate obsession and privacy academia intersect

Tuesday, March 4th, 2008

Recently, I was having dinner with some friends when the topic of Property Shark came up. My friends, being homeowners, were disturbed that someone could simply go online, type in their address, and find out who the owners were and precisely what they had paid for the property. One friend exclaimed, “I don’t want people to know how much money I have!” When I pointed out that the information was public record, and that before Property Shark, anyone could have gone down to City Hall and found the same information, he didn’t care. It still bothered him.

For all our talk of “privacy,” of how it’s being violated all over the place, of how it’s already lost, it’s not even clear what we mean when we say “privacy.” We, as a society, might have agreed that it is good public policy for real estate records to be public so that potential buyers can make sure sellers actually own the property they’re selling. Capitalism can’t thrive if you can’t be sure you own what you own. But when we theoretically made this agreement, we certainly didn’t imagine a world where “public” means available to anyone, anywhere, at any time. Professor Helen Nissenbaum, who recently presented at the DIMACS Data Privacy Workshop, has proposed that we think about “contextual integrity” rather than “privacy.” She argues that it’s more useful to consider what’s appropriate in each context rather than assuming there is a blanket “privacy” standard applicable to all situations.

That makes sense to me. My friend wasn’t arguing that the information shouldn’t be public record. Rather, he wasn’t comfortable with that information being accessed so easily online.

Personally, in the universe of privacy breaches, Property Shark doesn’t seem so problematic, but it’s certainly helpful as the Common Datatrust Foundation works on privacy problems to remember that “privacy” doesn’t have a singular meaning. One of CDTF’s goals for this year is to create some privacy standards for companies and other data collectors that acknowledge that information flow can’t just have an on/off, public/private spigot. It’s obvious that our world and our needs are more complex than that. After all, sometimes it’s hard to know even what we want when we clamor for more privacy. Even my friend, when pressed, admitted that the next time he was looking to buy a house, the first thing he would do is go to Property Shark.

Where that “study” you quoted came from: Remember that call you got during dinner?

Tuesday, May 29th, 2007

Over the last few months I’ve been to a number of interesting talks at the Stanford Methods of Analysis Program in the Social Sciences (MAPSS) colloquium. Two types of speakers have caught my attention: those who work closely with the logistics and mechanics of data collection, and those who try to use survey data to test their hypotheses.

Most recently I got to hear Linda Piekarski of Survey Sampling International on SSI’s efforts to address changes in the telephone system, as well as their recent forays into internet surveys. (I didn’t realize how perfect the original design for the U.S. phone system was for tele-survey companies.)

Also memorable was Yale Professor Don Green’s talk about measuring the effectiveness of political campaign advertising. One of my favorite lines (though I’m paraphrasing) was that “Any time you see a clean, clear graph of data, there’s something wrong. Data ‘noise’ is what reality looks like.”

What follows is a summary of the challenges facing the collection of data about individuals, derived in part from these talks.

Today, there are three main ways of collecting data from individuals, each of which contains flaws that seriously undermine the quality of the data collected.

  1. Pay them a tiny reward, lure them with a sweepstakes or nag them at dinner with a phone call from a stranger. For example, online stores may offer a coupon or rebate for your feedback on your buying experience.
  2. Make it easy for individuals to inadvertently or unthinkingly consent to data being collected about them, and/or subsequently change the substance of what is collected or the uses of that data. One prominent example is Amazon.com’s site registration process, which makes no attempt to highlight their third-party data-sharing practices.
  3. Leverage data collected for some other purpose – so-called “Secondary Use”. For example, addresses collected for fulfillment (shipping) may be used for geographically targeted marketing messages.

These mechanisms have a set of critical flaws:

  1. Tiny rewards and nagging phone calls are an insufficient value proposition for many individuals, so the pool of participants is unlikely to be well distributed across the target population. Instead it will favor those individuals for whom the reward remains attractive, however small; or those individuals for whom the cost of participation (time) is small enough to make the reward adequate. (Mechanism 1)
  2. Rewards or compensation that are distributed without regard to accuracy provide no incentive for careful or genuinely accurate self-reporting. (Mechanism 1)
  3. These practices cultivate a public perception of a mesh of “big brother” networks collecting an ever-expanding set of data, beyond the control of any one individual. Privacy outrage still surfaces in mainstream media occasionally, but the general public is increasingly numb to incremental discoveries of the erosion of personal privacy. While anesthesia may appear temporarily attractive to data collectors, it also disengages individuals from the data collection goals, which decreases participation and discourages accurate self-reporting. For example, when you are pressured to answer a survey at a department store or after check-out at a web retailer, do you react with an earnest attempt to supply them with the information they need? (All mechanisms)
  4. In an effort to fight back against ever more invasive data collection, privacy legislation and legal liability have forced data to be “silo-ed” and “anonymized” as much as possible. That means that unless you are a part of a larger survey panel, each subsequent survey you complete or data you consent to have collected will be stored separately from your other data. This eliminates the possibility of data-accuracy maintenance by individuals, and makes longitudinal analysis increasingly difficult. (All mechanisms)
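
The first flaw above, self-selection by reward size, is easy to see in a toy model. What follows is only a sketch under loud assumptions: the population, the dollar figures, the 15-minute survey length, and the rule that a person takes part only when the reward covers the value of their time are all invented for illustration. None of it comes from the talks.

```python
from statistics import mean

SURVEY_MINUTES = 15  # assumed survey length, chosen for illustration


def cost_of_time(person):
    """Dollar value of the time one person would spend on the survey."""
    return person["hourly_value"] * SURVEY_MINUTES / 60


def participants(population, reward):
    """Everyone for whom the reward at least covers the cost of their time."""
    return [p for p in population if reward >= cost_of_time(p)]


def sample_mean_income(population, reward):
    """Mean income among participants, or None if nobody takes part."""
    sample = participants(population, reward)
    return mean(p["income"] for p in sample) if sample else None


# Made-up population: hourly value of time (dollars), income (thousands).
POPULATION = [
    {"hourly_value": 8, "income": 18},
    {"hourly_value": 12, "income": 25},
    {"hourly_value": 20, "income": 42},
    {"hourly_value": 35, "income": 73},
    {"hourly_value": 60, "income": 125},
    {"hourly_value": 120, "income": 250},
]

TRUE_MEAN = mean(p["income"] for p in POPULATION)

# Compare what each reward level "measures" against the whole population.
for reward in (2, 5, 10, 30):
    n = len(participants(POPULATION, reward))
    estimate = sample_mean_income(POPULATION, reward)
    print(f"reward ${reward}: {n} participants, mean income {estimate}")
print(f"whole population: mean income {TRUE_MEAN}")
```

With these invented numbers, small rewards recruit only the people whose time is cheapest, so the survey's picture of income sits well below the population's until the reward is large enough to bring everyone in. Real participation decisions are messier, but the direction of the skew is the point.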