Who has your data and how can the government get it?

Monday, June 28th, 2010

The questions are more complicated than they might seem.

In the last month, we’ve seen Facebook criticized and scrutinized at every turn for the way it collects and shares its users’ data.  Much of that criticism was deserved, but what was missing from that discussion was any mention of the companies that have your data without your knowledge, let alone your consent.

The relationship between a user and Facebook is at least relatively straightforward.  The user knows his or her data has been placed in Facebook, and legislation could be updated relatively easily to protect his or her expectation of privacy in that data.

But what about the data consumer service companies share with third parties?

Pharmacies sell prescription data that includes you; cellphone-related businesses sell data that includes you.

So much of the data economy involves companies and businesses that don’t necessarily have you as a customer, and thus have even less incentive to protect your interests.

What about data that’s supposedly de-identified or anonymized?  We know that such data can be combined with another dataset to re-identify people.  Could the government seek that kind of data and avoid getting even a subpoena?  Increasingly, the companies that have data about you aren’t even the companies you initially transacted with.  How will existing privacy laws, even proposed reforms by the Digital Due Process coalition, deal with this reality?
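To make the re-identification worry concrete, here is a minimal sketch of how a "de-identified" dataset might be joined to a second dataset that carries names. Everything in it is hypothetical: the records, the field names, and the choice of zip code, birth year, and sex as the shared fields are assumptions for illustration, not a description of any real dataset.

```python
# Hypothetical "de-identified" records: names removed, other fields kept.
deidentified = [
    {"zip": "11201", "birth_year": 1975, "sex": "F", "prescription": "drug A"},
    {"zip": "11201", "birth_year": 1982, "sex": "M", "prescription": "drug B"},
    {"zip": "11215", "birth_year": 1975, "sex": "F", "prescription": "drug C"},
]

# Hypothetical second dataset that includes names alongside the same fields.
public = [
    {"name": "Person One", "zip": "11201", "birth_year": 1975, "sex": "F"},
    {"name": "Person Two", "zip": "11201", "birth_year": 1982, "sex": "M"},
    {"name": "Person Three", "zip": "11215", "birth_year": 1975, "sex": "F"},
    {"name": "Person Four", "zip": "11215", "birth_year": 1975, "sex": "F"},
]

# Assumed fields that appear in both datasets.
SHARED_FIELDS = ("zip", "birth_year", "sex")


def key(record):
    """Build a lookup key from the fields the two datasets share."""
    return tuple(record[field] for field in SHARED_FIELDS)


def reidentify(deidentified_rows, public_rows):
    """Return (name, prescription) pairs where exactly one named person matches."""
    names_by_key = {}
    for row in public_rows:
        names_by_key.setdefault(key(row), []).append(row["name"])

    matches = []
    for row in deidentified_rows:
        candidates = names_by_key.get(key(row), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0], row["prescription"]))
    return matches


reidentified = reidentify(deidentified, public)
for name, prescription in reidentified:
    print(name, "->", prescription)
```

In this toy data, two of the three records match exactly one named person, while the third stays ambiguous because two people share the same zip code, birth year, and sex.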

These are all questions that consume us at the Common Data Project for good reason.  As an organization dedicated to enabling the safe disclosure of personal information, we are committed to talking about privacy and anonymity in measurable ways, rather than with vague promises.

If you read a typical privacy policy, you’ll see language that goes something like this,

Google only shares personal information with other companies or individuals outside of Google in the following limited circumstances:…

We have a good faith belief that access, use, preservation or disclosure of such information is reasonably necessary to (a) satisfy any applicable law, regulation, legal process or enforceable governmental request

We think the datatrust needs to do better than that. We want to know exactly what “enforceable governmental request” means.  We want to think creatively about what individual privacy rights mean when organizations are sharing information with each other. We’ve written up the aspects that seem most directly relevant to our project here, including 1) a quick overview of federal privacy law; 2) implications for data collectors today; and 3) implications for the datatrust.

We ultimately have more questions than answers.  But we definitely can’t assume we know everything there is to know.  Even at the Supreme Court, where the Justices seem to have some trouble understanding how pagers and text messages work, they understand that the world is changing quickly.  (See City of Ontario v. Quon.)  We all need to be asking questions together.

So take a look.  Let us know if there are issues we’re missing. What are some other questions we should be asking?

Why do we need a datatrust? Isn’t there so much data out there already?

Tuesday, January 5th, 2010

In the past couple of years, and even more in the past couple of months, there’s been an explosion of data being made available online.  The Obama administration has announced a commitment to transparency with the Open Government Initiative, including a central clearinghouse for raw data sets made available by federal agencies.  Local governments, like New York City and Washington, D.C., are also putting data online and holding contests for best applications of that data.  There are easier ways to access data that’s always been publicly available, like Property Shark for real estate records and for local information on everything from crime reports to restaurant health code violations.

So why do we need a “datatrust”?

Because the data isn’t actually so accessible.

Don’t get me wrong, there is certainly more data available than there ever has been before.  But if you actually sit down and look at some of the data sets online now, you’ll start to see that a great deal of work remains to be done.

Recently, I decided to do a survey of U.S. federal agency websites and the data they provide.  As an ordinary, interested citizen with reasonable research skills, this is what I found:

  • Often presented in a disorganized manner, so that it’s difficult to determine what’s available and where.
  • Largely available only as aggregates and statistics, which may or may not answer the questions we have.
  • When microdata/underlying data is available, only made available for researchers whose applications are approved, after registration, and/or after signing confidentiality agreements.
  • No easy query interface for non-researchers.

So let’s take a look at some specific sites.

1. Well-intentioned but incomplete

The federal clearinghouse is supposed to be a centralized place for “raw,” downloadable federal government data.  But the number of datasets is uneven, as is participation among agencies.  Over 50% of the 809 data sets are from the Environmental Protection Agency (EPA).  This may be because there is someone super-enthusiastic about this project at the EPA, or because EPA data on issues like air quality is less personal and arguably less sensitive, but for whatever reason, those looking for EPA data are likely to be much happier than those looking for something else.

The site does include some human-subject data, such as the American Time Use Survey (Labor), HHA Medicare Cost Report Data (Health and Human Services), Residential Energy Consumption (Energy), and Individuals Granted Asylum by Region and Country of Nationality (Homeland Security).  But it does not include such major microdata sets as the National Health and Nutrition Examination Survey (NHANES), the U.S. Census Public Use Microdata Samples (PUMS), and the Medical Expenditure Panel Survey (MEPS).  Another, older site is more comprehensive, but it isn’t focused on microdata and raw data sets.

Most of all, there is no easy way to query these datasets.  They’re intended for developers and others who know how to write programs that can query XML, CSV, and Shapefile data, which is all well and good, but they’re not actually providing information to less skilled but interested citizens like me.
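For a sense of what "writing a program to query a CSV" involves, here is a minimal sketch using only Python's standard library. The sample data, column names, and the rows_matching helper are all invented placeholders for illustration; a real downloaded file would have its own columns and would need its own documentation to interpret.

```python
import csv
import io

# Invented placeholder data standing in for a downloaded CSV file.
SAMPLE_CSV = """state,county,year,value
NY,Example County A,2008,10
NY,Example County B,2008,20
CA,Example County C,2008,30
"""


def rows_matching(csv_text, **filters):
    """Return the rows whose columns equal every given filter value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if all(row[column] == wanted for column, wanted in filters.items())
    ]


ny_rows = rows_matching(SAMPLE_CSV, state="NY")
for row in ny_rows:
    print(row["county"], row["value"])
```

Even this small example assumes you already know which file to download, what its columns mean, and how to run a script, which is exactly the barrier for non-developers.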

2.  U.S. Census: A LOT of Data, but Completely Disorganized

Let’s start with the home page, which looks like this.  A lot of words, and not much guidance to what means what.

Now, let’s say I’m curious about the demographics of my Brooklyn neighborhood.  I might decide to go to “People & Households,” which takes me here:

I’ll try “Data by Subject,” which takes me here:

It’s hard to know precisely which of these categories will take me to what I want, some basic demographic information on my neighborhood.  I tried clicking on Population Profile and Small Area Income and Poverty Estimates, which didn’t pan out.  “Community” sounds right, so I’ll click on “American Community Survey,” which takes me here.

If I click on Access Data, it gives me these choices:

And if I click on American FactFinder, I end up here:

Okay, I don’t really know what any of this means.  Thematic maps, reference maps, custom table???

But let’s say I’d started with “American FactFinder” on the home page, which is linked in the far left-hand column.  If I’d started there, I would have found this:

I can see there’s a little window at the top where I can get a Fact Sheet for my community.  Hmm, that seems easy!  Why didn’t I get here earlier? But let’s just click on “American Community Survey–Learn More” and see if that takes me back where I was before:

Ack, where am I?  Why is this different from the other ACS page?
If I go back and click on “Get Data” under American Community Survey, I would go back to the ACS page I first saw.

The organizing principle is not completely devoid of logic, but there are endless loops within loops of links on the Census site.  You can lose your way really quickly and find yourself unable to even retrace your steps.  The home page does have boxes on the right where you can enter a city/town, county or zip for “Population” and you can select a state for “QuickFacts,” but the box where you can enter a city/town, county or zip for community “Fact Sheets” is only found if you click on American FactFinder.  Why?

There was a part of me that hoped I was just stupid because I was inexperienced.  But my friends who use Census data regularly for work tell me they also have trouble finding what they need.  I’m sure there are reasons why you can’t just query all the data, but what are they?  And how should we deal with them?  Should we just put up with them or try to find a solution to make data more available?

In Part II of this post, I’ll analyze the data available from the IRS, the Agency for Healthcare Research & Quality, and the EPA.

Ack! Congress writing privacy policies?

Thursday, May 7th, 2009

It remains to be seen what actually gets proposed. But, at first blush, it doesn’t feel right for Congress to be writing privacy policies for all the interwebs. But that appears to be what the Democratic Congressman from Virginia (Rick Boucher) is trying to do:
‘If the site used its customer data for first-party purposes (i.e., the site itself advertising to its own customers), it would have to offer consumers an opt-out option. “The default position would be that the first-party marketing transaction could occur,” Boucher elaborated. “It would only be prevented if the affirmative step was taken to say, ‘no, you can’t do that.’”

‘But if the customer information is going to be sent to “some completely unrelated party,” Boucher added, “not associated with the first-party transaction, that would fall under opt-in, and that information could then be shared with the other party only if the customer affirmatively took the step of saying ‘yes you can share it.’”’

What would be the fallout of such legislation for you and me?

Every time I use Google without logging in (which is almost always), do I need to give permission for Google to collect data from me so they know what ads to serve up? What if I use the Google search bar in my browser? How would that work?

Since advertising is “core” to Google’s business, maybe collecting search query data would fall under “first-party purposes”, even though that data is shared with “third-party advertisers”.

It’s a sign of the times that even Congress is starting to worry about the fine print in privacy policies, and we certainly laud attempts to cut through the obfuscation of privacy legalese.

Still, this binary opt-in/opt-out approach feels like a hatchet job where a scalpel is needed.

Or better yet, Congress should first focus on legislation that will create standards around currently wishy-washy concepts of “anonymization” and “personal information” that allow companies to violate the spirit of their own policies, if not the letter.

Needles in a verbal haystack

Monday, February 16th, 2009

Sometime during the Bush administration, a 1970s-ish malaise settled on the country, and it hasn’t lifted yet. There’s the volatile price of oil…that cop drama about an officer who goes back in time to 1973…everyone in a bad mood about something (I was teething)…and oh yeah, the government is apparently spying on journalists!

This story has been reported widely, but has not gotten much traction. In January, former National Security Agency analyst Russell Tice told MSNBC’s Keith Olbermann that the NSA did indeed listen in on the communications of all kinds of ordinary Americans, with special attention paid to journalists (hey, who says 24-hour news networks never actually make news?).

TICE: Well, I don’t know what our former president knew or didn’t know. I’m sort of down in the weeds. But the National Security Agency had access to all Americans’ communications, faxes, phone calls, and their computer communications. And that doesn’t — it didn’t matter whether you were in Kansas, you know, in the middle of the country, and you never made a communication — foreign communications at all. They monitored all communications.

One American who is sure he was spied on is the New Yorker’s Lawrence Wright, who writes on terrorism. He told NPR’s On The Media (full disclosure: they are work colleagues) that two federal agents actually showed up at his front door to ask him directly about the contents of his conversations:

WRIGHT: And then they began asking if the person on our end of the call, my end, was named Caroline. And that’s my daughter’s name. And they asked, you know, is her name Caroline Brown? And I said, no, she’s, you know, a student at Brown. But I said, her name’s not on any of our phones. How do you know this information? Are you listening to my calls? And they just shut their briefcases and left.

There is so much here to unpack: the question of whether monitoring Wright’s communications was legal then, whether it would be legal today, the fact that federal agents still apparently make housecalls, and the specific way in which communications are sifted. Tice told Olbermann the NSA’s monitoring is a bit like googling for keywords:

TICE: what was done was a sort of an ability to look at the meta data, the signaling data for communications, and ferret that information to determine what communications would ultimately be collected. Basically, filtering out sort of like sweeping everything with that meta data, and then cutting down ultimately what you are going to look at and what is going to be collected, and in the long run have an analyst look at, you know, needles in a haystack for what might be of interest.

It would be interesting to know whether this sort of data sweep gets better results than the labor-intensive 24-hour East German surveillance operation depicted in The Lives of Others. Judging from “Caroline Brown”, I’d say maybe not – yet.

(Click here if you need a primer on presidential spying on journalists from Kennedy onwards.)

Let’s ask the government to give us information!

Monday, July 7th, 2008

My contracts professor from law school, Ian Ayres, suggests in his book Super Crunchers that the IRS become a source for useful information for ordinary people. The agency could tell taxpayers how much others in their income bracket, on average, are donating to charity or contributing to their IRAs, or tell small businesses whether they might be spending too much money on advertising.

The idea isn’t so far-fetched. About two months ago, the Italian government caused an uproar when it published online the tax details of every single Italian taxpayer. Allegedly meant to fight tax evasion, the move by the outgoing government sounded more like it was motivated by political spite. The most fascinating thing for me, though, was reading various comments in the blogosphere and finding out Norway, Sweden, and Finland do this every year! Apparently, the tax documents are considered official and therefore public records. According to the Swedish government, it’s in keeping with a general principle of government transparency: “To encourage the free exchange of opinion and availability of comprehensive information, every Swedish citizen shall be entitled to have free access to official documents.” And no one really minds.

Of course, this would be inconceivable in the U.S.; there’s a law against it. But as Ian Ayres suggests, the idea that the government should be giving information back to us, instead of just collecting it from us, isn’t totally crazy and Scandinavian. It could be released in anonymized aggregates or in other ways that wouldn’t reveal how much our neighbor makes. The information could be genuinely useful, not just titillating.
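As a rough illustration of what releasing "anonymized aggregates" could mean, here is a minimal sketch that reports average charitable giving per income bracket and withholds any bracket with too few people in it. The records, the bracket labels, and the minimum group size of 3 are all assumptions made up for this example; nothing here reflects how the IRS actually publishes statistics, and small-group suppression on its own is not a complete privacy guarantee.

```python
from collections import defaultdict

# Assumed threshold: brackets with fewer people than this are not reported.
MIN_GROUP_SIZE = 3

# Invented placeholder records, one per hypothetical taxpayer.
returns = [
    {"bracket": "A", "charity": 100},
    {"bracket": "A", "charity": 200},
    {"bracket": "A", "charity": 300},
    {"bracket": "B", "charity": 1000},
    {"bracket": "B", "charity": 3000},
]


def bracket_averages(rows, min_group_size=MIN_GROUP_SIZE):
    """Average giving per bracket; None where the group is too small to report."""
    amounts_by_bracket = defaultdict(list)
    for row in rows:
        amounts_by_bracket[row["bracket"]].append(row["charity"])

    averages = {}
    for bracket, amounts in amounts_by_bracket.items():
        if len(amounts) >= min_group_size:
            averages[bracket] = sum(amounts) / len(amounts)
        else:
            averages[bracket] = None  # suppressed: too few people in the group
    return averages


averages = bracket_averages(returns)
for bracket, average in averages.items():
    print(bracket, average if average is not None else "withheld")
```

A taxpayer in bracket A could compare their own giving to the bracket average, while bracket B is withheld because an average over two people would say too much about each of them.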

There could even be implications for public policy. So much of government policy is expressed in the Internal Revenue Code (such as favoring homeownership over renting), but our debates about tax cuts, mortgage deductions, and credits are based on fairly imprecise numbers. Even as we argue about what a tax cut will do to the “middle class,” we don’t even know what the “middle class” is. Where should government transparency start, if not at the point of revenue collection?

Scary pizza

Tuesday, June 17th, 2008

My friend sent this to me recently. Created by the ACLU for its campaign against the National ID program, it’s a mash-up of all our worst surveillance fears. It starts with a guy calling his local pizzeria for a couple of double meat pizzas, while you see the computer screen the girl at the pizza place is looking at as she rings up his order. She surprises him first by knowing his name, his home address, and his place of work from the moment his call comes in, but it gets rapidly worse, from a $20 health surcharge for meat pizza because of his high cholesterol and blood pressure to her snide comments about his waist size and his ability to pay for the pizzas, based on what she knows of his purchase history, including airplane tickets to Hawaii.

It’s entertaining, but also frustrating for a couple of reasons. First, there are very good reasons for me to be concerned about private companies’ data collection and their potential for collusion in U.S. government surveillance, but this video doesn’t explain how the National ID program would lead to the pizzeria having my health records. By focusing only on the sensational horror of the pizza girl knowing the customer bought a bunch of condoms, it forgets to tell us the pizzeria might literally be giving their customers’ names, phone numbers, and addresses to government officials. (The ACLU does have this report providing a more detailed argument about the dangers of private-public surveillance, but there was no direct link to it from the pizza video.)

Second, in terms of data collection and its dangers in general, the video ends up feeling sort of hysterical. It obscures, rather than clarifies, what’s really at stake.

We do live in a world where data collection is happening on an unprecedented level. But for me, what’s scary is not the mere possibility that all this data could get linked together. It’s about control. Do I get to decide who has my information? Do I get to control how it’s disseminated and analyzed?

Right now, we definitely don’t and that’s a problem. But the solution may not be to stop data collection altogether and segregate all the information out there so no linkage can happen ever.

I might not want the pizza girl at my local pizzeria to know about my health problems, but I might not mind if, as I ordered food online, the program allowed me to review my choices and build a more nutritious meal specific to my needs, without disclosing my specific preferences to each restaurant. I might not want the government to be able to access my purchase history, but I might want to be able to securely track and access my purchases and my financial accounts at the same time so I can better determine how well I’m meeting my budget. I might even want to share certain information, securely and anonymously, if I thought it would lead to beneficial research by scientists, economists, and policymakers.

Of course, I wouldn’t sign up for anything if I thought my personal information could get leaked to the government or anyone else without my consent. It would make for a somewhat less dramatic video, but this is what the Common Datatrust Foundation is interested in addressing—how can we turn our capacity for data collection and sharing into something that is a public good, rather than a scary fear?

