Archive for the ‘Public Policy’ Category

In the mix

Wednesday, March 31st, 2010

1) Exciting news!  A diverse coalition of left-leaning and right-leaning organizations, as well as a bunch of big corporations, has formed around the goal of revising the Electronic Communications Privacy Act.  This law, from 1986, clearly didn’t anticipate the world we live in now, the extent to which we use emails, the “expectation of privacy” we have in email, and the extent to which we store our data and our documents in the cloud.  This law will greatly impact our work at the Common Data Project, but even without a professional stake in this, I’d be pretty excited.  After all, we all (except my mom who doesn’t use computers) have a personal stake in this.

2) The full text of danah boyd’s talk at SXSW is available on her blog.  This is my favorite line:

For the parents and educators in the room… Many of you are struggling to help young people navigate this new world of privacy and publicity, but many of you are confused yourself. The worst thing you can do is start a sentence with “back in my day.” Back in your day doesn’t matter.

It’s an obvious but useful point for privacy and information issues in general.  The ECPA, from back in the day of 1986, can’t deal with today.  It’s time to really think about which of our assumptions about privacy still hold true.

3) David Brooks’s column this week got me thinking.  If we agree with him, which I do, that a country’s success cannot be measured simply with things like GDP, what else should we measure and how? My friends who work in social sciences are initially skeptical when I talk about the data collection potential of something like the Common Data Project’s datatrust.  They’re distrustful of self-reported data, even as they acknowledge that their existing methodologies are imperfect.  But with things that are hard to measure, self-reporting is often the only way to go.  The datatrust, with the reach of the Internet and measurable guarantees of privacy, could dramatically change how self-reported data is collected, analyzed, and published.

4) Facebook data destroyed: Pete Warden, who had created a database from 210 million public Facebook profiles, was prepared to release the data to social scientists who were fascinated by the potential to research social connections, particularly as mashed up with census data on income, mobility and employment.  But then Facebook said he had violated its terms of use, and unable to defend a potential lawsuit, he destroyed the data.

Argh, isn’t there a better way?  Granted, the decision to make one’s profile public on Facebook may not equal consent to be included in such a database, and Warden’s planned “anonymization” was unlikely to be very robust.  But this situation is a perfect example of why the Common Data Project was founded: to create a new norm, with strong privacy and sharing standards, that makes such data truly, safely available.

In the mix

Monday, March 22nd, 2010

1) EFF is posting documents as it gets them indicating how the government uses social networks in law enforcement investigations. The Fourth Amendment is what requires the police to have a search warrant when they come to search your house.  The cases interpreting the Fourth Amendment that led to such requirements were based on expectations of privacy that are rooted in physical spaces.  But as we start to live more of our lives in an online space our founding fathers could never have imagined, how should we change the laws protecting our rights?

2) An overview of the history of people challenging the constitutionality of the U.S. Census. Personally, I love filling out the census form.  I wish I’d gotten the American Community Survey.

3) The Transaction Records Access Clearinghouse, a data research organization at Syracuse University studying federal spending, enforcement, and staffing, recently got a $100,000+ bill for a FOIA request. The bill was based on the calculation that 861 man-hours were required to create a description of what is in the U.S. Citizenship and Immigration Service’s database of claims for U.S. citizenship.  As an immigration lawyer, I used to deal with USCIS all the time, and even I am surprised that the agency would need that much time just to figure out what’s in the database.  You almost hope that the bill was calculated just to rebuff TRAC’s FOIA request, because the alternative, that the database is that screwed up, is pretty awful.

4) danah boyd of Microsoft Research gave the keynote at SXSW on “Privacy and Publicity” last week, challenging the idea that personal information falls into a simple binary of public and private.  It’s great to hear more and more people making this point, which is at the heart of CDP’s mission.

5) Google now has a service that lets you place your own ad on TV.  Really shockingly simple and easy, and fascinating in light of the growing fear that evil advertisers are taking over our lives.  Would it make a difference if we could all become advertisers, too?

In the mix

Wednesday, March 10th, 2010

1) We’ve wondered in the past, why don’t targeted advertising companies just ask you to opt in to be tracked?  When I first heard about it, I thought this newish website described on NPR was doing something like that.  You actively register a credit card with the site, and it shares ALL your transactions with your friends.  Except NPR reports the company was rather vague about how the information gets to marketing companies.  And what exactly are they offering anyway, other than the opportunity to broadcast, “I am what I buy”?  The only news being broadcast seems to be about people’s Netflix and iTunes buying tendencies.  Services like Patients Like Me are also using customers’ data to make money, but they’re offering a real, identifiable service in return.

2) Google explains why it needs your data to provide a better service.

Search data is mined to “learn from the good guys,” in Google’s parlance, by watching how users correct their own spelling mistakes, how they write in their native language, and what sites they visit after searches. That information has been crucial to Google’s famously algorithm-driven approach to problems like spell check, machine language translation, and improving its main search engine. Without the algorithms, Google Translate wouldn’t be able to support less-used languages like Catalan and Welsh.

Data is also mined to watch how the “bad guys” run link farms and other Web irritants so that Google can take countermeasures.
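The “learn from the good guys” idea can be sketched as a toy example: when users quickly retype a query, the reformulation often fixes their own typo, and counting those pairs yields a crude spelling suggester. This is purely illustrative, with made-up query pairs, and is not a description of Google’s actual pipeline.

```python
from collections import Counter, defaultdict

# Hypothetical session pairs: (first query, quick follow-up query).
# The assumption, as in the article, is that users often retype a
# query to correct their own spelling mistake.
session_pairs = [
    ("recieve", "receive"),
    ("recieve", "receive"),
    ("definately", "definitely"),
    ("recieve", "recipe"),  # noise: a reformulation that isn't a correction
]

# For each misspelling, count which reformulations follow it.
corrections = defaultdict(Counter)
for typed, retyped in session_pairs:
    if typed != retyped:
        corrections[typed][retyped] += 1

def suggest(query):
    """Return the most frequently observed reformulation, if any."""
    if query in corrections:
        best, _ = corrections[query].most_common(1)[0]
        return best
    return query

print(suggest("recieve"))  # the most common reformulation wins over the noise
```

With enough traffic, frequency alone drowns out the noise pair, which is why this kind of behavioral signal can beat a hand-built dictionary for rare words and smaller languages.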

This is an argument I’m really glad to hear.  It doesn’t make the issue of privacy go away, but I’d love to see privacy advocates and Google talk honestly and thoughtfully about what Google does with the data, how important that is to making Google’s services useful, and what trade-offs people are willing to make when they ask Google to destroy the data.

3) Nat Torkington describes how open source principles could be applied for open data. We heartily agree that these principles could be useful for making data public and useful, though Mimi, who’s worked on open source projects, points out that open source production, with its standard processes, is something  that’s been worked out over decades.  Data management is still relatively in its infancy, so open-sourcing data management will definitely take some work.  Onward ho!

4) The Center for Democracy and Technology and EFF are thinking about privacy and Smart Grids, which monitor energy consumption so that consumers can better control their energy use.  I’m more enthusiastic than EFF about the “potentially beneficial” aspects of smart meters, but in any case, it’s interesting to see these two blog posts within two days of each other.  Energy consumption data, as well as health data, are going to be two huge areas of debate, because the benefits of large-scale data collection and analysis are obvious, even though detailed personal information is involved.

5) The Onion reports Google is apologizing for its privacy problems, directed to very specific people. Ha ha.

“Americans have every right to be angry at us,” Google spokesperson Janet Kemper told reporters. “Though perhaps Dale Gilbert should just take a few deep breaths and go sit in his car and relax, like they tell him to do at the anger management classes he attends over at St. Francis Church every Tuesday night.”

In the mix

Tuesday, March 2nd, 2010

1) I’m looking forward to reading this series of blog posts from the Freedom to Tinker blog at Princeton’s Center for Information Technology Policy on what government datasets should look like to facilitate innovation, as the first one is incredibly clear and smart.

2) The NYTimes Bits blog recently interviewed Esther Dyson, “Health Tech Investor and Space Tourist” as the Times calls her, where she shares her thoughts on why ordinary people might want to track their own data and why we shouldn’t worry so much about privacy.

3) A commenter on the Bits interview with Esther Dyson referenced this new 501(c)(6) nonprofit, CLOUD: Consortium for Local Ownership and Use of Data.  Their site says, “CLOUD has been formed to create standards to give people property rights in their personal information on the Web and in the cloud, including the right to decide how and when others might use personal information and whether others might be allowed to connect personal information with identifying information.”

We’ve been thinking for a while now about whether personal information could or should be viewed as personal property, as understood by the American legal system.  I’m not quite sure that’s the best or most practical solution, but I’m curious to see where CLOUD goes.

4) The German Federal Constitutional Court has ruled that the law requiring data retention for six months is unconstitutional.  Previously, all phone and email records had to be kept for six months for law enforcement purposes.  The court criticized the lack of data security and the insufficient restrictions on access to the data.

Although Europe has more comprehensive and arguably “stricter” privacy laws, many countries also require data retention for law enforcement purposes.  We in the U.S. might think the Fourth Amendment is going to protect our phone and email records from being poked into unnecessarily by law enforcement, but existing law is even less clear than in Europe.  So much privacy law around telephone and email records is built around antiquated ideas of our “expectations,” with analogies to what’s “inside the envelope” and what’s “outside the envelope,” as if all our communications can be easily analogized to snail mail.  All these issues are clearly simmering to a boil.

5) Google’s introduced a new version of Chrome with more privacy controls that allow you to determine how browser cookies, plug-ins, pop-ups and more are handled on a site-by-site basis.  Of course, those controls won’t necessarily stop a publisher from selling your IP address to a third-party behavioral targeting company!

In the mix

Thursday, February 25th, 2010

1) Interesting story on NPR last week about a new study using cellphone data to track people’s movements.  It turns out they were able to predict the nearest cellphone tower 93% of the time and their actual locations 80% of the time.  The potential value to public policy is significant.  It could affect how we put money into public transportation, for example.

Interestingly, though, no one mentioned any concerns about privacy, just a short statement that researchers don’t have names or numbers.  Seems like a perfect, obvious example of how that’s not sufficiently deidentifying, especially as the conclusion is that you can predict where people are.  Another researcher claims that he has data for half a million people and that “major carriers around the world are now starting to share data with scientists.”  What if we end up with another AOL scandal on our hands, and worse, the scandal keeps this kind of research from continuing?
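Why isn’t stripping names and numbers sufficient here? A toy sketch, with entirely made-up users and towers, shows the problem: for location traces, the pattern itself is the identifier. Even a pair of most-visited towers (roughly home and work) is often unique within a dataset, so anyone who knows where you live and work can pick your “anonymous” record back out.

```python
from collections import Counter

# Hypothetical pseudonymized traces: names and phone numbers removed,
# but each record keeps the user's two most-visited cell towers
# (a rough proxy for home and work).
traces = {
    "u01": ("tower_A", "tower_B"),
    "u02": ("tower_A", "tower_C"),
    "u03": ("tower_D", "tower_B"),
    "u04": ("tower_A", "tower_B"),  # shares a home/work pair with u01
    "u05": ("tower_E", "tower_F"),
}

# How many users share each home/work pair?
pair_counts = Counter(traces.values())

# A record is re-identifiable when its pair is unique in the dataset:
# matching it to outside knowledge (an address, an employer) is then trivial.
unique_users = sorted(u for u, pair in traces.items() if pair_counts[pair] == 1)
print(unique_users)  # 3 of these 5 users stand out from the pair alone
```

In a real dataset with millions of tower pairs, the fraction of unique pairs is far higher than in this tiny example, which is exactly why “we removed names and numbers” is not a privacy guarantee.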

2) The Open Knowledge Foundation has launched a set of principles for open data in science, in support of the idea that scientific data should be “freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. To this end data related to published science should be explicitly placed in the public domain.”

We certainly support more data being openly and freely available, but we’re curious.  How will we deal with the rights of people who are in scientific studies?  I’m not a scientist — do most agreements to participate in studies anticipate this level of public availability?  And how can we standardize data to be more easily comparable?

3) It’s not enough to have data. We also need tools to visualize, analyze, and understand data, and more and more tools are available for just that purpose.  Here’s a long list of mapping tools from the Sunlight Foundation, ClearMaps from Sunlight Labs, and Pivot, a new way to combine large groups of similar items on the internet, from Microsoft Live Labs.

In the mix

Wednesday, February 17th, 2010

1) A major study of children is having trouble finding volunteers.  A good exposition of how hard it is to set up a longitudinal study, which is why so many of our ideas about health are based on a very small number of studies.

2) The Sunlight Foundation has launched The Data Mine with the Center for Public Integrity, “to highlight inaccessible or poorly presented information from the federal government.”  On a related note, the Sunlight Foundation analyzed why the numbers of jobs reported by stimulus fund recipients differed from the number cited by President Obama in his State of the Union Speech.  A great reminder that the promise of data is not the same thing as access to good data.

3) Another person presenting his self-collected personal data.  Some people love collecting and sharing information about themselves; others are terrified of anything leaking out about themselves.  How do we make personal data useful and relevant to the people in between?

Is Public the new Private?

Wednesday, February 3rd, 2010

Publicy (Publi[c] + [Priva]cy)
When the public, not the private, is the default.

In a world where so much more is out in the public, will people just stop worrying about privacy completely? Maybe in another five years, people simply won’t care if their names and addresses come up when someone searches for “people who have STDs” or “people who are 40 year-old virgins.”


For some of us, that’s hard to believe.

But I wonder if even for the people who are “most public” about their lives, the end of all privacy is equally scary. After all, aren’t Twitter, MySpace and Facebook simply opportunities for all of us to craft public personas we want others to see? Which implicitly includes controlling what people don’t see.

At the end of the day, the line between public and private has to do with control. Just because we’re now all sharing volumes more than we used to doesn’t mean that we’re any more willing to share the skeletons in our closets.

Can we trust Census data?

Wednesday, February 3rd, 2010

Yesterday, the Freakonomics blog at the New York Times reported that a group of researchers had discovered serious errors in PUMS (public-use microdata samples) files released by the U.S. Census Bureau.  When compared to aggregate data released by the Census, the PUMS files revealed discrepancies of up to 15% for the 65-and-older population.  As Justin Wolfers explains, PUMS files are small samples of the much larger, confidential data used by the Census for the general statistics it releases. These samples are crucial to researchers and policymakers looking to measure trends that the Census itself has not calculated.

When I read this, the first thought I had was, “Hallelujah!”  Not because I felt gleeful about the Census Bureau’s mistakes, but because this little post in the New York Times articulated something we’ve been trying to communicate for a while: current methods of data collection (and especially data release) are not perfect.

People love throwing around statistics, and increasingly people love debunking statistics, but that kind of scrutiny is normally directed at surveys conducted by people who are not statisticians.  Most people hear words like “statistical sampling” and “disclosure avoidance procedure” and assume that those people surely know what they’re doing.

But you don’t have to have training in statistics to read this paper and understand what happened. The Census Bureau, unlike many organizations and businesses that claim to “anonymize” datasets, knows that individual identities cannot be kept confidential simply by removing “identifiers” like name and address, which is why it uses techniques like “data swapping” and “synthetic data.” It doesn’t take a mathematician to understand that when you’re making up data, you might have trouble maintaining the accuracy of the overall microdata sample.
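A toy sketch, with made-up records and ZIP codes, makes the trade-off concrete. Data swapping exchanges a sensitive attribute between records before release: the overall distribution survives, but subgroup tabulations, like the 65-and-older count within one ZIP code, can drift, which is the kind of discrepancy the researchers found. This is a simplified illustration of the general technique, not the Bureau’s actual procedure.

```python
import random

# Hypothetical confidential microdata: eight records across two ZIP codes.
records = [
    {"zip": "11201", "age": 72}, {"zip": "11201", "age": 34},
    {"zip": "11201", "age": 67}, {"zip": "11201", "age": 29},
    {"zip": "11215", "age": 41}, {"zip": "11215", "age": 80},
    {"zip": "11215", "age": 55}, {"zip": "11215", "age": 66},
]

def swap_ages(rows, n_swaps, seed=7):
    """Return a copy of rows with ages exchanged between random pairs."""
    rng = random.Random(seed)
    rows = [dict(r) for r in rows]  # never mutate the confidential originals
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(rows)), 2)
        rows[i]["age"], rows[j]["age"] = rows[j]["age"], rows[i]["age"]
    return rows

released = swap_ages(records, n_swaps=3)

def seniors_in(rows, zip_code):
    """Count the 65-and-older population within one ZIP code."""
    return sum(r["age"] >= 65 for r in rows if r["zip"] == zip_code)

# The total 65+ count survives swapping, but the per-ZIP count may not.
print("true 65+ in 11201:", seniors_in(records, "11201"))
print("released 65+ in 11201:", seniors_in(released, "11201"))
```

Note the bind this creates: publishing the true per-ZIP counts alongside the swapped file would reveal which records were swapped, which is exactly why the Bureau can acknowledge inaccuracies but not correct them.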

To the Bureau’s credit, it does acknowledge where inaccuracies exist.  But as the researchers found, the Bureau is unwilling to correct its mistakes because doing so could reveal how they altered the data in the first place and thus compromise someone’s identity.  Which gets to the heart of the problem:

Newer techniques, such as swapping or blanking, retain detail and provide better protection of respondents’ confidentiality. However, the effects of the new techniques are less transparent to data users and mistakes can easily be overlooked.

The problems with current methods of data collection aren’t limited to the Census PUMS files either.  The weaknesses outlined by this former employee could apply to so many organizations.

This is why we have to work on new ways to collect, analyze, and release sensitive data.

What kind of institution do we want to be? Part II

Tuesday, December 15th, 2009

As described in the first post, banks and credit unions could be useful models for the datatrust because of their function of holding valuable assets for account holders.  Public libraries and museums are very different, but their function, of providing the public access to valuable social assets, is also relevant to the datatrust.

A. We want to be an online public library of useful, personal data, because no democracy can function properly without broad access to information.

Image by FreeFoto, available under a Creative Commons Non-Commercial No-Derivative Works License.


Although public libraries now enjoy warm-and-fuzzy status right up there with puppies and babies, the public library system was not established in the U.S. without controversy.  At the time, the only people who owned books were the rich, and many argued that the poor would not know how to take care of the books they borrowed.  The system was largely established through the efforts of Andrew Carnegie and others who believed in both public libraries and public schools, and that democracy could not function without public access to information.

Librarians are now champions for intellectual freedom.  As a profession, librarians have developed strong principles around the confidentiality of library users, and they were on the front lines in resisting the USA PATRIOT Act’s provisions around FBI access to library records. Although libraries are often underfunded and can seem out of date, the current recession has made obvious what has been going on for a while: people really do use the library. And when they do, they don’t abuse the privilege.  Many communities feel invested in their local branches, and the respect people have for libraries translates into a respect for their holdings.

We hope the establishment of our datatrust can follow a similar path.  Not everyone may agree now that this kind of access to information is necessary.  But we strongly believe that the status quo, where large corporations and government agencies have access but the public does not, stifles the free flow of information that is crucial to a functioning democracy.  We hope that the datatrust can grow to engender the same kind of respect and to be a valuable member of many communities.

Of course, the information in books is qualitatively different from personal data about an individual.  If a book gets lost, it’s not as great a loss as if personal data gets misused.  Which leads us to the next point.

B. We want to make data available to the public because it is too valuable to be kept in a locked safe, the way museums make great art available.


Museums are interesting institutions to us because they showcase extremely valuable pieces that would be safest from damage and theft if kept locked up in a vault, yet are put on public display because the value afforded to the public outweighs the risk of damage and theft.  Although they have a greater reputation for elitism than public libraries, museums also operate on the belief that certain assets, like great art or historical artifacts, should belong to society at large rather than to a private collector.  Thus, when a private collector does donate his or her collection to a museum, he or she gains the reputational benefit of having done something altruistic.  At the same time, access to the public comes with protective measures for security—guards, technology, velvet ropes, and more.

Personal data, to us at CDP, is also too valuable to keep locked up.  Arguably, personal data is currently held by many private collectors, i.e., corporations.  They gain value from that data, but that value is not shared with the public.  Unlike art, which is usually made by an individual, personal data is collected from large swaths of the general population, and yet we don’t have access to it.  Like museums, we will want to think about security measures to minimize risk, and we acknowledge that there will be some risk, known and unknown, in our project.  But that risk is so far outweighed by the potential benefits to society that we think it’s a worthwhile experiment.

Museums also add value to their holdings by curating them.  That’s an important challenge for us, as information is only valuable when it’s organized.

Watching the Electrons

Monday, November 23rd, 2009

A warning about electrical outlets that has nothing to do with bathtubs.

This from Ontario’s Information and Privacy Commissioner, who has been studying the implications of “smart grid” technology that will enable utilities to micro-monitor usage, with the aim of more efficient electric delivery:

Intimate details of hydro customers’ habits, from when they cook or take showers, to when they go to bed, plus such security issues as whether they have an alarm system engaged, could all be discerned by the data automatically fed by appliances and other devices to the companies providing electric power.

Why does it matter here in the USA? Well, the economic stimulus enacted earlier this year contains at least $4.5 billion specifically for smart grid projects, with the aim of kick-starting even greater private investments.

As in so many other areas (medicine, taste in movies), the tech that promises better, “smarter” outcomes also demands more detailed information.  So what’s our plan for that information?
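The commissioner’s concern can be made concrete with a toy sketch: even coarse interval meter readings let anyone holding the feed infer when a household cooks, showers, or goes quiet. The readings and the 2000 W threshold below are entirely hypothetical stand-ins for a high-draw appliance like an oven or electric shower.

```python
# Hypothetical smart-meter readings: watts per 15-minute interval
# over one evening for a single household.
readings = [
    ("17:00", 300), ("17:15", 280), ("17:30", 2400),
    ("17:45", 2500), ("18:00", 600), ("18:15", 350),
    ("18:30", 150), ("18:45", 90), ("19:00", 80),
]

HIGH_DRAW_WATTS = 2000  # arbitrary stand-in for an oven or electric shower
BASELINE_WATTS = 100    # arbitrary stand-in for an empty or sleeping house

# Flag intervals where a high-draw appliance is clearly running,
# and intervals where the house is near its idle baseline.
high_draw = [t for t, w in readings if w >= HIGH_DRAW_WATTS]
near_idle = [t for t, w in readings if w <= BASELINE_WATTS]

print("high-draw appliance running at:", high_draw)
print("house near idle (asleep or out?) at:", near_idle)
```

Two threshold comparisons are enough to recover a rough evening schedule from one household’s load curve; real non-intrusive load monitoring goes much further, disaggregating individual appliances from their power signatures.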
