Posts Tagged ‘Data Collection’

Comments on Richard Thaler “Show Us the Data. (It’s Ours, After All.)” NYT 4/23/11

Tuesday, April 26th, 2011

Richard Thaler, a professor at the University of Chicago, wrote a piece in the New York Times this weekend with an idea that is dear to CDP’s mission: making data available to the individuals it was collected from.

Particularly because the title of the piece suggests that he is saying exactly what we are saying, I wanted to write a few quick comments to clarify how his position differs from ours.

1. It’s great that he’s saying loudly and clearly that the payback for data collection should be the data itself – that’s definitely a key point we’re trying to make with CDP, and not enough people realize how valuable that data is to individuals, and more generally, to the public.

2. However, what Professor Thaler is pushing for is more along the lines of “data portability,” an idea we agree with at an ethical and moral level but one that has some real practical limitations when we start talking about implementation. In my experience, data structures change so rapidly that companies are unable to keep up with how their data is evolving month-to-month. I find it hard to imagine that entire industries could coordinate a standard that could hold together for very long without undermining the very qualities that make data-driven services powerful and innovative.

3. I’m also not sure why Professor Thaler says that the Kerry-McCain Commercial Privacy Bill of Rights Act of 2011 doesn’t cover this issue. My reading of the bill is that it’s covered in the general sense of access to your information – Section 202(4) reads:

to provide any individual to whom the personally identifiable information that is covered information [covered information is essentially anything that is tied to your identity] pertains, and which the covered entity or its service provider stores, appropriate and reasonable-

(A) access to such information; and

(B) mechanisms to correct such information to improve the accuracy of such information;

Perhaps he is simply pointing out that the bill makes no mention of instituting data standards to enable portability, as opposed to standards around data transparency.

I have a long post about the bill that is not quite ready to publish. The bill does have a lot of issues, but I didn’t think this was one of them.


In the mix

Wednesday, July 15th, 2009

Hacker Exposes Private Twitter Documents (NYT Bits Blog)

Code Red: How software companies could screw up Obama’s healthcare reform (Washington Monthly)

Collect Data About Yourself with Twitter (Flowing Data)

The Nike Experiment: How the Shoe Giant Unleashed the Power of Personal Metrics (Wired)

In the mix

Wednesday, May 20th, 2009

Site Lets Writers Sell Digital Copies. (NY Times)

Linked Data is Blooming: Why You Should Care (ReadWriteWeb)

Mint Considers Selling Anonymized Data From Its Users (ReadWriteWeb)

The Growing Popularity of Popularity Lists (The Numbers Guy/Wall Street Journal)

Tuesday in the Mix

Tuesday, May 12th, 2009

Just Landed: Processing, Twitter, MetaCarta & Hidden Data (blprnt)

Greece Puts Brakes on Street View (BBC)

Developer of AdBlock Plus Proposes a Fairer Approach to Ad Blocking (ReadWriteWeb)

What Does Access to Real World Data Online Make Possible? (ReadWriteWeb)

Monday in the Mix

Monday, May 11th, 2009

Signs Your Wireless Carrier Loves You (NYT)

Calendar as filter (

New Search Service Aims at Answering Tough Queries, but Not Taking on Google (NYT)

Time for my second cup of coffee

Thursday, January 29th, 2009

Being a serious coffee drinker, I love studies that show drinking coffee is good for you.  This one is particularly gratifying: drinking three to five cups a day seems to be linked with a markedly decreased likelihood of developing dementia!

These are the kinds of studies that get a lot of press.  I guess all the other coffee-addicts out there also want to hear their habit is good for them (note its Number One status in most-emailed articles in the New York Times, Jan. 26, 2009).  It wasn’t too long ago, though, that caffeine was the devil-incarnate of hot beverages and anyone who cared about her health felt pressured to quit her coffee addiction.  So what’s going on?

The researchers in this study are careful to point out that their findings only hint at a link between coffee and decreased risk of dementia.  No conclusions can be drawn; no recommendations can be made.

But they did feel that the study was unusual in the kind and amount of data available to them.  Of the original 2000 subjects who were selected 21 years ago, 70% were still available for examination.  Because the subjects had reported their coffee consumption at the beginning of the study, there was less risk that people were inaccurately recalling their consumption.

It’s surely a rare thing, to have a good longitudinal group of subjects, but ultimately, it still means that this finding comes from a group of 2000 people x 70% = 1400 people.  And as the researchers pointed out, any self-reported data is subject to inaccuracies.  So multiple inaccuracies in a sample of 1400 people—hmm, maybe I can’t congratulate myself on my coffee consumption after all.

When I talk to people about the research potential of online data collection through the Common Data Project, they’ll almost always say to me, “But can any conclusions from online data collection be accurate?”  But for me, the question should be, “Could any conclusions from online data collection be more accurate than what’s available now?”  How sure are we that the conclusions we’re drawing now are accurate?  You can imagine that longitudinal studies in particular, which rely on self-reporting anyway, could greatly benefit from online data collection tools that would reduce the costs of collecting, monitoring, and updating information on thousands, maybe even tens of thousands of people.

I’m looking forward to seeing what future longitudinal studies will say about the health benefits of my coffee addiction.

Data’s endless possibilities

Friday, January 9th, 2009

The New York Times recently published a succinct but meaty article on New York City’s new electronic health record system.  Planned and promoted by the Bloomberg administration, the system includes about 1000 primary care physicians, focused primarily on three of the poorest neighborhoods, and the data they generate about their patients.  As I read it, I found myself counting all the different functions of the system.  I found at least ten:

•    Clean up outdated filing systems;
•    Enable a doctor to compare how one patient is doing with his or her other patients;
•    Enable a doctor to compare how one patient is doing with patients all over the city;
•    Enable the city’s public health department to monitor disease frequency and outbreaks, like the flu;
•    Enable the city to promote preventative measures, like cancer screening, in new ways;
•    Create new financial incentives for doctors to improve their patients’ health, on measures like controlling blood pressure or cholesterol;
•    Provide report cards to doctors comparing their results with other doctors’;
•    Improve care by less-experienced doctors with advice and information based on a patient’s age, sex, ethnic background, and medical history, including prompts to provide routine tests and vaccinations and warnings on how drugs can potentially interact;
•    Allow doctors to follow up more closely with patients, like reminding them of appointments through new calling and text-messaging systems and being notified if their patients do not fill prescriptions; and
•    Allow patients to access their own records, make appointments electronically, and monitor their own progress on health targets (should the doctor decide to do so).

Pretty amazing, isn’t it?

Data is like that.  Once you collect it, the possibilities are endless.  Reading about this one system for health records made me realize why it’s so hard for me to describe CDP’s goals in one sentence.  We’re not trying to do something singular, like “enable a doctor to compare patients’ data.”  We’re trying to create a place where this function and innumerable other possibilities can exist, while also being mindful that “endless possibilities” include some scary ones that we need to guard against.

Making personal data more personal

Monday, December 29th, 2008


The New York State Department of Health recently launched a new online tool for researching the prevalence of certain medical conditions by zip code.  It has a terribly boring name—Prevention Quality Indicators in New York State—but what they’re providing is very exciting.

Prevention Quality Indicators or PQIs are a set of measures developed by a federal health agency.  They count the number of people admitted to hospitals for a specific list of twelve conditions, some of which include various complications from diabetes, hypertension, asthma, and urinary tract infections.  All of these are conditions in which good preventative care can help avoid hospitalization or the development of more severe conditions.  As the Department explains, “The PQIs can be used as a starting point for evaluating the overall quality of primary and preventive care in an area. They are sometimes characterized as ‘avoidable hospitalizations,’ but this does not mean that the hospitalizations were unnecessary or inappropriate at the time they occurred.”

It’s not the kind of data that would normally get your average New York resident excited.  Even though it’s personal information—it doesn’t get more personal than health—it’s unlikely to feel very personal to anyone.

That’s what makes numbers and data off-putting for so many people.  Even when the numbers include people like us, we don’t see ourselves in them, so it’s hard to feel like those numbers have anything to say to us personally.  At the same time, so many decisions are being made based on data, huge decisions that affect all of us.  It’s important for democracy that ordinary citizens have a stake in the data, that they not only have access to the data but that they also have an interest in reviewing the data themselves.

What’s interesting to me about this website, then, is its potential for making this obscure piece of government health data much more immediate and personal for ordinary citizens, and not just public health data geeks.  As soon as I heard about this website, the first thing I did was look up my zip code, “11205,” in the county of Kings (Brooklyn).  I could then see racial disparities in the admission rate for these conditions in my neighborhood, and even see data on specific hospitals in my area.  Whenever there is a way to organize and access data in a way that is personal to the user, it’s immediately more compelling.

There’s no particular reason for me to wonder what asthma admission rates were in my zip code in 2006.  But I can imagine a mother of a child with asthma coming upon this site, wondering what asthma rates are in her zip code and the ones around it, and maybe seeing patterns that lead her to talk to other parents and elected officials.  And I can imagine other data sets of personal information being made truly relevant and personal in similar ways.

Woo-hoo, more data…from Amazon?

Tuesday, December 9th, 2008

Amazon announced recently that they would begin hosting huge databases of public information on their servers and charging users only for the cost of computing and storage for their own applications.  Although this information is already publicly available, Amazon’s service in hosting the data means scientists, other researchers, and businesses no longer have to create their own infrastructure to store and analyze this data.  It’s the data equivalent of a library—where people can do research without having to house and maintain their own collections.

This is an incredible service Amazon is providing, but it did make me wonder, do we need an Andrew Carnegie of public databases for our time?  Carnegie, of course, was not a saint, and he imposed terms on the towns that applied for his money, but ultimately, he created the public institution of the public library.  Although we now take the idea of a public library for granted, to the point that we’ve let many of them wither away without funding, we’ve come to believe wholeheartedly that public access to information is essential and right.  Even the great collections of private universities support this principle; as nonprofit institutions given tax-exempt status, they are governed by their missions to add knowledge to the world and have simple procedures to grant access to people who are not affiliated with the university.

Here, Amazon is providing public access, but as a private company rather than a public institution or nonprofit organization.  I’m not saying that nonprofits and government entities are morally superior to private companies, or that private companies are incapable of providing a public service.  I actually think that having private and public, for-profit and nonprofit approaches to different issues is crucial for creating a truly vibrant marketplace of ideas.  But given the central and increasingly commanding role of data in our lives, it’s essential that we at least ask ourselves the question, “Are there functions that nonprofits and public institutions could fill better than private companies with regard to public access to data?”

We at the Common Data Project obviously believe there are good reasons to found a nonprofit organization to make data more public and accessible.  The number one reason, for me, is that the goal of public access to information may not always jibe neatly with the simpler and more straightforward goal of profit for a private company.

But what do you think?

Using data to build a sense of community in grassroots organizing

Tuesday, November 25th, 2008


In the past few weeks, we’ve seen two different political campaigns use technology and the Internet in new, expansive ways. We saw the mother of all online campaigns with the election of Barack Obama, and it’ll be interesting to see how its massive database of donors and volunteers is mobilized in the coming months. We’re also seeing supporters of gay marriage using Join the Impact to organize people across the country and the world, both to stage the protests that occurred simultaneously on November 15 and to create momentum for further action.

It’s fascinating to compare these two campaigns with something like the ACLU Action Center. There are a lot of differences, but the one that really jumps out at me is that the two recent campaigns firmly place you, the supporter, within a larger context of what others are doing in support of the same cause. It’s like that old fundraising symbol, the thermometer with a target goal, except much more interesting.

For example, Join the Impact’s first organized action was the simultaneous demonstrations that were held on November 15. The website was used not only to announce and spread the word that these demonstrations were happening, but also to record where they occurred and how many people attended. It may be fun for San Diego attendees to see that they attended the largest demonstration, at 25,000 people, but it’s even more important for the 15 people who demonstrated in Sandpoint, Idaho, to know that they are part of something bigger. Traditionally, demonstrations have sought to be newsworthy by being enormous; hence the “Million Man March” on Washington. But these demonstrations were trying to show something slightly different: a sense of support for the cause from towns both big and small, from liberal bastions to more stereotypically conservative places. The fifteen people in Sandpoint were more significant demonstrating in Sandpoint than they would have been if they had traveled to Boise and added fifteen to that demonstration, and certainly much more than if they had traveled to San Diego. The website enabled the campaign to take a snapshot of that day that focused on the local while also providing a view of the national that would have been impossible otherwise.

Now look at the ACLU Action Center, which focuses primarily on organizing letter-writing campaigns. It’s very easy to use: they’ll help you figure out who your senator or representative is and his or her email address, and they’ll even provide a template for the letter you send. Once you send your letter, you’ll be exhorted to send the link to all your friends. Yet the website isn’t really connecting you to anyone else. Imagine if, once you sent your letter, you got to see how many letters had been sent from your state, and could compare that number to how many letters had been sent from a different state. Maybe you would see that very few letters were being sent to the representative from Utah who is actually chair of a key committee, and you would feel compelled to email your college roommate who now lives in Salt Lake City and ask her to send a letter as well.
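
To make the idea concrete, here is a minimal sketch in Python of the kind of per-state tally such a site could show. Everything in it is hypothetical: the sample data is made up, and the function names are illustrative rather than anything the ACLU Action Center actually offers. It assumes only that each letter is recorded with the sender's two-letter state code.

```python
from collections import Counter


def letters_by_state(sender_states):
    """Tally letters by the sender's state (two-letter codes)."""
    return Counter(sender_states)


def compare_states(counts, my_state, other_state):
    """Summarize the letter counts for two states side by side."""
    return f"{my_state}: {counts[my_state]}, {other_state}: {counts[other_state]}"


# Made-up sample data: one entry per letter sent.
sample_letters = ["NY", "NY", "CA", "UT", "NY", "CA", "ID"]
counts = letters_by_state(sample_letters)
print(compare_states(counts, "NY", "UT"))
```

A state with no letters simply shows up as zero, which is exactly the kind of gap that might prompt someone to email a friend who lives there.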

Nonprofit organizations are eager to use “social networking” to promote their work and further their mission. It makes sense to recognize that the best kind of organizing has always depended, and always will depend, on real connections between people. But the full potential of the Internet isn’t in a Facebook fan page. The best sites are going to take advantage of the Internet’s ability to collect and aggregate information in ways that reinforce a sense of community and shared purpose.
