Archive for the ‘Protecting Privacy in Meaningful Ways’ Category

Smart Grid Data: Unexpected and Amazing Reuses?

Tuesday, March 16th, 2010

As noted in “In the Mix,” the Center for Democracy and Technology and the Electronic Frontier Foundation recently issued joint comments to the California Public Utilities Commission regarding proposed policies around the use of smart grids and smart meters.

(via Flowing Data.)

And then a few days later, I saw this: EPCOR, a Canadian water utility company, issued a graph plotting water usage during the Olympic men’s hockey final.  Notice the spikes in water consumption (and toilet flushing) immediately after the first period, second period, third period, and finally when Canada wins the gold medal.

Is this our worst nightmare?  That someone will find out when we’re peeing?

That’s a bad joke. Plotting a large area’s water consumption in aggregate is not the same as what some of these smart meters are able to measure in terms of energy consumption.

But I do have a more serious point to make.  One of the points CDT and EFF make repeatedly in their comments is that we should avoid “unnecessary” data collection and destroy any “unnecessary” data.

What exactly does “unnecessary” mean?

Does it mean any purpose that is not related to the work of a utility company?  Who decides what’s unnecessary, and should they be deciding now what is and isn’t necessary?

The beauty of data is that its potential value is unknown.  A single dataset, collected for one purpose, can be used for other purposes that are socially beneficial but rather unexpected.  For example, Google Trends was created for advertisers so that they can track what search terms are popular.  The CDC, however, has been using Google Trends to track flu outbreaks, by watching where people are Googling flu symptoms, data which is more quickly collected than reports from doctors.  Governments all over the world are pushing for open data because we don’t yet know all that can be done with it.  By giving access to everyone, we expect interesting, useful, imaginative things to come out of the data that we might never have imagined.

Data from the smart grids, in particular, will also require smart visualizations that are easy for individual consumers to understand and access.  Data alone isn’t going to change behavior.  You can imagine open data inviting developers to create easy-to-use apps that let consumers easily and painlessly identify ways to reduce energy consumption.  Some may even choose to share that information and compete with others, the way several universities have set up competitions between dorms.  As much as Al Gore was embarrassed by news revealing how much energy his mansion used, others may be eager to brag about how little energy they use.

Can we protect privacy while also creating room for imaginative and innovative reuse of data?

There are definitely privacy issues we have to consider.  I agree with a lot of the points made in CDT and EFF’s comments.  That “customer information” shouldn’t be limited to “personally identifying information.”  The misuse and misapplication of phrases like “personal information” is something we’ve been harping on for a while.  That customers should have access to the data collected from them and the power to correct mistakes.  That law enforcement shouldn’t be allowed to troll this information without a warrant, that civil litigants shouldn’t be allowed to access this information without a court order based on a showing of compelling interest and after notifying the customer to provide her with a chance to object.

But rather than talking about barring “unnecessary” data collection and data use, we should be thinking of ways to make the data safely available, regardless of whether someone has decided it’s necessary or not.  The data from smart grids is going to be both dangerous and valuable because it is so fine-grained; we clearly can’t just plop it online.  Anonymizing data is really hard.  So at CDP, we’re working hard at thinking about ways to come up with measurable privacy guarantees and testing technologies like PINQ that promise to provide access to raw data without indicating the existence of any particular individual in a dataset.  Other organizations may have different ideas.  I’m grateful for the existence of organizations that imagine the worst-case scenarios around data collection to protect our civil rights.  I also hope to see the growth of more organizations that try to imagine the best-case scenarios.

In the mix

Friday, March 12th, 2010

1) The CDC recently used shopper-card data to track a salmonella outbreak that sickened 245 in 44 states.  It turned out the pepper in salami made in Rhode Island was the culprit.  Although the CDC began to suspect through interviews and questionnaires that some sort of Italian meat product was the problem, the people they talked to couldn’t remember precisely what they had bought and the shopper-card records helped them identify the actual product.

Great story, right?  Unless you’re the director of Consumers Against Supermarket Privacy Invasion and Numbering, in which case, the story smacks of privacy invasion by the government.  The CDC got the records with the permission of the account holders, but that didn’t assuage the fears of Katherine Albrecht and several of the commenters on the Yahoo News story.

Here’s a choice quote: “I’d rather have a few die from poisoning and then they fix the problem then have the entire country enslaved, thank you very much.”

There was at least one person who pointed out that commenting on a Yahoo News story wasn’t going to do much to preserve their privacy either.

2) MySpace is selling bulk user data! I’m with ReadWriteWeb:

I think the world is an awfully unfair mess and I’m hoping that data analysis will help illuminate some of the hows and the whys. Like the way that real-estate redlining was exposed back in the day by cross referencing census data around racial demographics and housing loan data. That illuminated systematic discrimination against black families in applying for home loans in certain parts of town. So too I think we’ll find a lot of undeniable proof of injustices and clues for how we might deal with them in big data today.

We don’t want another AOL debacle on our hands, but we also don’t want to give up on the possibilities of “big data” because we prematurely assume better privacy-creating techniques and standards aren’t available.

3) My, it’s a privacy-obsessed week!  Here’s one person’s argument “why no one cares about privacy.” It’s a good round-up of pithy quotes from people like Judge Posner, new “talk about me” sites like Blippy.com, and surveys demonstrating the change in the public’s attitude over time.  Wow, in 1998, 80% of people in a Harris poll said they were hesitant to shop online because of privacy worries.

Still, articles like this and the comments to the Yahoo CDC-shopper data article show how much our discussion of privacy involves people yelling at each other across a very big divide.  Is the choice really a binary one?  Privacy + a few deaths versus Big Brother + public health data?  I don’t care if the CDC has access to my grocery records; at the same time, I don’t plan to sign up for Blippy.com and broadcast my purchase of kale and four kinds of cheese this morning.  (Oops, I just did.)  Maybe we should stop talking about “privacy” and start talking about specific situations.

Prostate Cancer and the Inexorable Pull To Act On Unlikely Events

Wednesday, March 10th, 2010

Here’s another example of how we seize on numbers we can see, no matter how uncertain and meaningless they might be, because there’s not yet a viable alternative source of information.

As a society, we will probably opt for prostate testing no matter how flawed it is until there’s a better, more accurate alternative. In other words, bad, misleading information is better than no information, especially in a culture that prizes initiative and can-do-ness over a more fatalistic view of life: Yes We Can!

This is a design challenge for anybody trying to help people make sense of data. It is also especially important for us right now as we try to figure out a meaningful privacy guarantee for the datatrust. It’s easy for us to guarantee that you’ll never know with 100% certainty the answer to any question. But in many situations, people won’t need anything close to 100% certainty to feel compelled to act.

Certainly in the case of screening for diseases, it’s incredibly hard to do nothing if there is even a hint of a chance that we might be fatally ill.

What are other examples of numbers we make too much of and can’t get enough of?

  • Poll numbers
  • Housing data
  • Almost any study that comes out about health and nutrition

In the mix

Wednesday, March 10th, 2010

1) We’ve wondered in the past, why don’t targeted advertising companies just ask you to opt-in to be tracked?  When I first heard about it, I thought this newish website, Blippy.com, described on NPR, was doing something like that.  You actively register a credit card with the site and it shares ALL your transactions with your friends.  Except NPR reports the company was rather vague about how the information gets to marketing companies.  And what exactly are they offering anyway, other than the opportunity to broadcast, “I am what I buy”?  The only news being broadcast seems to be about people’s Netflix and iTunes buying tendencies.  Services like Mint.com and Patients Like Me are also using customers’ data to make money, but they’re offering a real, identifiable service in return.

2) Google explains why it needs your data to provide a better service.

Search data is mined to “learn from the good guys,” in Google’s parlance, by watching how users correct their own spelling mistakes, how they write in their native language, and what sites they visit after searches. That information has been crucial to Google’s famously algorithm-driven approach to problems like spell check, machine language translation, and improving its main search engine. Without the algorithms, Google Translate wouldn’t be able to support less-used languages like Catalan and Welsh.

Data is also mined to watch how the “bad guys” run link farms and other Web irritants so that Google can take countermeasures.

This is an argument I’m really glad to hear.  It doesn’t make the issue of privacy go away, but I’d love to see privacy advocates and Google talk honestly and thoughtfully about what Google does with the data, how important that is to making Google’s services useful, and what trade-offs people are willing to make when they ask Google to destroy the data.

3) Nat Torkington describes how open source principles could be applied to open data. We heartily agree that these principles could be useful for making data public and useful, though Mimi, who’s worked on open source projects, points out that open source production, with its standard processes, is something that’s been worked out over decades.  Data management is still in its relative infancy, so open-sourcing data management will definitely take some work.  Onward ho!

4) The Center for Democracy and Technology and EFF are thinking about privacy and Smart Grids, which monitor energy consumption so that consumers can better control their energy use.  I’m more enthusiastic than EFF about the “potentially beneficial” aspects of smart meters, but in any case, it’s interesting to see these two blog posts within two days of each other.  Energy consumption data, as well as health data, are going to be two huge areas of debate, because the benefits of large-scale data collection and analysis are obvious, even though detailed personal information is involved.

5) The Onion reports Google is apologizing for its privacy problems, directed to very specific people. Ha ha.

“Americans have every right to be angry at us,” Google spokesperson Janet Kemper told reporters. “Though perhaps Dale Gilbert should just take a few deep breaths and go sit in his car and relax, like they tell him to do at the anger management classes he attends over at St. Francis Church every Tuesday night.”

In the mix

Tuesday, March 2nd, 2010

1) I’m looking forward to reading this series of blog posts from the Freedom to Tinker blog at Princeton’s Center for Information Technology Policy on what government datasets should look like to facilitate innovation, as the first one is incredibly clear and smart.

2) The NYTimes Bits blog recently interviewed Esther Dyson, “Health Tech Investor and Space Tourist” as the Times calls her, where she shares her thoughts on why ordinary people might want to track their own data and why we shouldn’t worry so much about privacy.

3) A commenter on the Bits interview with Esther Dyson referenced this new 501(c)(6) nonprofit, CLOUD: Consortium for Local Ownership and Use of Data.  Their site says, “CLOUD has been formed to create standards to give people property rights in their personal information on the Web and in the cloud, including the right to decide how and when others might use personal information and whether others might be allowed to connect personal information with identifying information.”

We’ve been thinking about whether personal information could or should be viewed as personal property, as understood by the American legal system, for a while now.  I’m not quite sure it’s the best or most practical solution, but I’m curious to see where CLOUD goes.

4) The German Federal Constitutional Court has ruled that the law requiring data retention for 6 months is unconstitutional.  Previously, all phone and email records had to be kept for 6 months for law enforcement purposes.  The court criticized the lack of data security and the insufficient restrictions on access to the data.

Although Europe has more comprehensive and arguably “stricter” privacy laws, many countries also require data retention for law enforcement purposes.  We in the U.S. might think the Fourth Amendment is going to protect our phone and email records from being poked into unnecessarily by law enforcement, but existing law is even less clear than in Europe.  So much privacy law around telephone and email records is built around antiquated ideas of our “expectations,” with analogies to what’s “inside the envelope” and what’s “outside the envelope,” as if all our communications can be easily analogized to snail mail.  All these issues are clearly simmering to a boil.

5) Google’s introduced a new version of Chrome with more privacy controls that allow you to determine how browser cookies, plug-ins, pop-ups and more are handled on a site-by-site basis.  Of course, those controls won’t necessarily stop a publisher from selling your IP address to a third-party behavioral targeting company!

IP addresses + zip codes = ?

Monday, March 1st, 2010

ClearSight Interactive, a new behavioral targeting company, has spent the past 18 months collecting more than 100 million IP addresses.  CEO Tom Alison says, in a comment to the article, “Our goal is to become the bridge between online and offline data.”

Whoa, baby.

Alison claims in his comment that Wendy Davis, the writer of the article, didn’t accurately describe what ClearSight Interactive is doing.  So let’s look at the claims he puts out in his comment.

We have a file of IP addresses with 9-digit zip code appended. Our data providers supply the zip code linked to IP without any personally identifiable information. We are able to predict a more likely neighborhood or work location than the zip code or longitude and latitude of the ISPs server readily available from many software or online providers…

In other words, they know where you live. Their press release says more: “ClearSight Interactive bridges IP addresses to verified postal addresses and email addresses.”

Alison claims they do not collect data on online behavior:

We offer geo-demographic marketplace data, not behavioral data. We collect no online behavior. Unlike those companies and websites that utilize individual household data and set cookies, we append census and de-identified marketing data at the neighborhood level.  We all know that people in the same household or neighborhood are not the same. But for many useful marketing attributes, bird of a feather do flock or even live together.

I guess that’s supposed to make me feel better, that the company knows where I live but it only guesses what I might be looking for in a car.  Actually, the company isn’t guessing.  It promises in its press release, “After a consumer views or clicks an ad, the company can then monitor the users future behavior using contact information databases to determine if they later made a purchase – e.g. did someone who viewed a car ad actually visit the dealership and purchase a vehicle?”

Almost more shocking is Alison’s attitude about the privacy implications.  He repeats over and over that they do not have “PII” or “personally identifying information.”  If nothing else, we’ve learned from the AOL debacle and numerous other supposedly anonymized databases that PII like name and address is not necessary to successfully reidentify large numbers of people in a dataset.
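To make the reidentification point concrete, here is a minimal sketch in Python.  The records are invented for illustration (they are not ClearSight’s data, and the field choices are my assumption); the point is only that rows with no name or street address can still be one of a kind.

```python
from collections import Counter

# Hypothetical toy records with no names or street addresses:
# (9-digit zip, birth year, sex). Invented for illustration only.
rows = [
    ("10027-1234", 1975, "F"),
    ("10027-1234", 1975, "F"),
    ("10027-5678", 1982, "M"),
    ("11201-0001", 1990, "F"),
    ("11201-0001", 1964, "M"),
]


def unique_fraction(records):
    """Fraction of records whose combination of attributes appears exactly once."""
    counts = Counter(records)
    singletons = sum(1 for r in records if counts[r] == 1)
    return singletons / len(records)


print(unique_fraction(rows))
```

Any record that is alone in its combination of attributes can be matched to a person by anyone who holds a second list with the same attributes plus a name, which is why dropping “PII” columns is not the same as anonymizing.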

So how did ClearSight Interactive even get this information?  It bought it from publishers, who normally ask their customers if they are okay with their information being shared with third-party marketers.  As the article points out, most people who click “yes” assume that means they’ll get emails from third-party marketers.  They don’t assume that the publishers will sell IP logs to a third-party targeting company.  ClearSight Interactive promises that if you choose to opt-out later, the company will update its records and remove you from its databases.  To which, all I can say is, if you’re so sure that people have actively chosen to allow you to have this information, why not build your business around asking them to opt-in?

On some level, Alison is clearly aware privacy could impact his company.  He writes, “At ClearSight we take privacy matters very seriously,” and the article quotes him as saying they are waiting to see if Congress passes privacy legislation.  But if it’s true that “[a]ll our IP and zip data fall within the appropriate privacy provisions of our partners” and everything they’ve done is legal, well, that’s some of the strongest evidence I’ve heard in support of better privacy legislation.

In the mix

Wednesday, February 17th, 2010

1) A major study of children is having trouble finding volunteers.  A good exposition of how hard it is to set up a longitudinal study, which is why so many of our ideas about health are based on a very small number of studies.

2) The Sunlight Foundation has launched The Data Mine with the Center for Public Integrity, “to highlight inaccessible or poorly presented information from the federal government.”  On a related note, the Sunlight Foundation analyzed why the numbers of jobs reported by stimulus fund recipients differed from the number cited by President Obama in his State of the Union Speech.  A great reminder that the promise of data is not the same thing as access to good data.

3) Another person presenting his self-collected personal data.  Some people love collecting and sharing information about themselves; others are terrified of anything leaking out about themselves.  How do we make personal data useful and relevant to the people in between?

Would PINQ solve the problems with the Census data?

Friday, February 5th, 2010

Frank McSherry, the researcher behind PINQ, has responded to our earlier blog post about the problems found in certain Census datasets and how PINQ might deal with those problems.

Would PINQ solve the problems with the Census data?

No.  But it might help in the future.

The immediate problem facing the Census Bureau is that they want to release a small sample of raw data, a Public Use Microdata Sample or PUMS, about 1/20 of the larger dataset they use for their own aggregates, that is supposed to be a statistical sample of the general population.  To release that data, the Bureau has to protect the confidentiality of people in the PUMS, and they do so, in part, by manipulating the data.  Some of their efforts, though, seem to have altered the data so seriously that it no longer accurately reflects the general population.

PINQ would not solve the immediate problem of allowing the Census Bureau to release a 1/20 sample of their data.  PINQ only allows researchers to query for aggregates.

However, if Census data were released behind PINQ, the Bureau would not have to swap or synthesize data to protect privacy; PINQ would do that.  Presumably, if the danger of violating confidentiality were removed, the Census could release more than a 1/20 sample of the data. Furthermore, unlike the Bureau’s disclosure avoidance procedures, PINQ is transparent in describing the range of noise that is being added.  Currently, the Bureau can’t even tell you what it did to protect privacy without potentially violating it.
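For readers who want to see what a “noisy aggregate” looks like, here is a minimal sketch in Python.  It is not PINQ’s actual interface (PINQ is a separate library with its own API); the function names, the toy records, and the choice of epsilon are all assumptions made for illustration.  It shows the textbook approach of adding Laplace noise to a count, with the amount of noise stated openly.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution by inverse-CDF sampling."""
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    # max() guards against log(0) in the vanishingly rare case u == -0.5
    return -scale * math.copysign(1.0, u) * math.log(max(1.0 - 2.0 * abs(u), 1e-300))


def noisy_count(records, predicate, epsilon, rng):
    """Return the number of matching records plus Laplace noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1, so noise of
    scale 1/epsilon is the standard calibration for a count under
    epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy records: (age, state). Not real Census data.
records = [(70, "NY"), (34, "NY"), (68, "CA"), (81, "NY"), (25, "CA")]
rng = random.Random(0)
answer = noisy_count(records, lambda r: r[0] >= 65, epsilon=0.5, rng=rng)
print(round(answer, 2))
```

The researcher never sees the rows, only the noisy answer, and the scale of the noise (here 1/epsilon) can be published without revealing anything about an individual.  That is the kind of transparency the swapping and synthesizing procedures can’t offer.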

The mechanism for accessing data through PINQ, of course, would be very different than what researchers are used to today.  Now, with raw data, researchers like to “look at the data” and “fit a line to the data.”  A lot of these things can be approximated with PINQ, but most researchers reflexively pull back when asked to rethink how they approach data.  There are almost certainly research objectives that cannot be met with PINQ alone.  But the objectives that can be met should not be held back by the unavailability of high quality statistical information. Researchers able to express how and why their analyses respect privacy should be rewarded with good data, incentivizing creative rethinking of research processes.

With this research published, it may be easier to argue that the choice between PUMS (and other microdata) and PINQ is not between raw data/noisy aggregates, but rather bad data/noisy aggregates. If and when it becomes a choice between these two, any serious scientist would reject bad data and accept noisy aggregates.

Can we trust Census data?

Wednesday, February 3rd, 2010

Yesterday, the Freakonomics blog at the New York Times reported that a group of researchers had discovered serious errors in PUMS (public-use microdata samples) files released by the U.S. Census Bureau.  When compared to aggregate data released by the Census, the PUMS files revealed up to 15% discrepancies for the 65-and-older population.  As Justin Wolfers explains, PUMS files are small samples of the much larger, confidential data used by the Census for the general statistics it releases. These samples are crucial to researchers and policymakers looking to measure trends that the Census itself has not calculated.

When I read this, the first thought I had was, “Hallelujah!”  Not because I felt gleeful about the Census Bureau’s mistakes, but because this little post in the New York Times articulated something we’ve been trying to communicate for a while: current methods of data collection (and especially data release) are not perfect.

People love throwing around statistics, and increasingly people love debunking statistics, but that kind of scrutiny is normally directed at surveys conducted by people who are not statisticians.  Most people generally hear words like “statistical sampling” and “disclosure avoidance procedure” and assume that those people surely know what they’re doing.

But you don’t have to have training in statistics to read this paper and understand what happened. The Census Bureau, unlike many organizations and businesses that claim to “anonymize” datasets, knows that individual identities cannot be kept confidential simply by removing “identifiers” like name and address, which is why they use techniques like “data swapping” and “synthetic data.” It doesn’t take a mathematician to understand that when you’re making up data, you might have trouble maintaining the accuracy of the overall microdata sample.

To the Bureau’s credit, it does acknowledge where inaccuracies exist.  But as the researchers found, the Bureau is unwilling to correct its mistakes because doing so could reveal how they altered the data in the first place and thus compromise someone’s identity.  Which gets to the heart of the problem:

Newer techniques, such as swapping or blanking, retain detail and provide better protection of respondents’ confidentiality. However, the effects of the new techniques are less transparent to data users and mistakes can easily be overlooked.

The problems with current methods of data collection aren’t limited to the Census PUMS files either.  The weaknesses outlined by this former employee could apply to so many organizations.

This is why we have to work on new ways to collect, analyze, and release sensitive data.

In the mix: Your unique(ish) browser fingerprint…and…No $$ for privacy.

Friday, January 29th, 2010

1) EFF’s Panopticlick project lets you see how much your browser reveals and whether that might potentially “identify” you, based on their calculation of how identifiable a set of bits might be.

Can someone with a better grasp of math than I have explain to me how their information theory works? Right now, they have, let’s say, 10,000 people who’ve contributed their browser info. Bruce Schneier found out he was unique in 120,000. But if millions of people tested their browsers, would his configuration really be that unique? (Lots of skepticism in the comments to Schneier’s post, too.)
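Here is one way to read the numbers, as a rough sketch in Python rather than a description of EFF’s actual method.  The standard information-theory measure of how “surprising” an observation is, in bits, is minus the base-2 log of how often it occurs, so “1 in 120,000” works out to roughly 16.9 bits.  The later-sample figures below are hypothetical, made up to show how the estimate could move.

```python
import math


def surprisal_bits(matches, sample_size):
    """Self-information, in bits, of a fingerprint seen `matches` times among `sample_size` browsers."""
    return -math.log2(matches / sample_size)


# "Unique in 120,000": one matching browser in a sample of 120,000.
bits_seen = surprisal_bits(1, 120_000)

# A sample of n browsers can never show more than log2(n) bits, so "unique"
# only means "at least this rare"; a bigger sample could reveal twins.
ceiling_10k = math.log2(10_000)

# Hypothetical: if 50 of 5,000,000 later visitors shared the configuration,
# it would be 1 in 100,000, slightly less identifying than first measured.
bits_later = surprisal_bits(50, 5_000_000)

print(f"{bits_seen:.2f} {ceiling_10k:.2f} {bits_later:.2f}")
```

So the skepticism is fair: being unique in a sample only puts a floor on how rare a configuration is, and the estimate gets better (and may go down) as more browsers are tested.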

2) New initiative by advertising groups to reveal that they are tracking information — a small “i” icon:

What a quote: “‘This is not the full solution, but this moves the ball forward,’ he said.”

Well, that’s the understatement of the century. Full solution to what? The advertising industry keeping regulators off their backs? Helping users understand how targeted advertising finds them? Really, neither is the real problem. Regulators should be focusing on establishing industry guidelines for how service providers and 3rd party advertising partners store and share data.

3) Should government data be in more user-friendly formats than XML?

Or should we leave usability to disinterested 3rd parties? If the government starts releasing user-friendly data, will that simply open the door for agencies to “spin” their data to make themselves look good? Actually, right now, how do we really know the data that’s being released hasn’t been “edited” in some way? Who’s vetting these releases and what’s the process?

4) Ten years and no one is really making any money off of “privacy”?

Perhaps no one has successfully “sold” privacy (as its own thing) because we haven’t yet agreed on what a “privacy product” would look like. As Mimi says, “If someone was selling something that would guarantee that I would never get any SPAM (mail or email) for the rest of my life, I would totally sign up for that.” But that might not equal “privacy” for someone else.
