Don’t take it personally: how “personal” information is defined in privacy policies

April 28th, 2009 by Grace Meng

Most privacy certification programs, like TRUSTe, require that the privacy policy identify what kinds of personally identifiable information (PII) are being collected.  It’s a requirement that’s meant to promote transparency: the user must be informed!

As a result, nearly every privacy policy we looked at included a long list of the types of information being collected.  But who can process a long catalog of items?  What popped out at me, after reading policy after policy, was the way so many of the companies we surveyed categorize the information they collect into 1) “personal information” that you provide, such as name and email address, often when you sign up for an account; and 2) cookie and log data, including IP address, browser type, browser language, web request, and page views.

When the first category is called “personal” information, the second category implicitly becomes “not-personal” information.  But the queries we put into search engines—what could be more personal?  How much could you learn about me, just looking at the history of things I’ve bought on Amazon, let alone the things I’ve Googled?  What is an IP address if not a marker linking my computer to the actions I (and others) take on that computer?

Yahoo and Amazon go a step further and label cookie and log data “automatic information,” giving it a ring of inevitability.  Ask Network calls this information “limited information that your browser makes available whenever you visit any website.”  Wikipedia similarly states, “When a visitor requests or reads a page, or sends email to a Wikimedia server, no more information is collected than is typically collected by web sites.”

There are companies that do define “personal information” much more broadly.  eBay’s definition includes “computer and connection information, statistics on page views, traffic to and from the sites, ad data, IP address and standard web log information” and “information from other companies, such as demographic and navigation information.”  AOL states that its AOL Network Information may include “personally identifiable information” that includes “information about your visits to AOL Network Web sites and pages, and your responses to the offerings and advertisements presented on these Web sites and pages” and “information about the searches you perform through the AOL Network and how you use the results of those searches.”

And there are websites that try not to collect this information at all: Ixquick and Cuil, the search engines that have been trying to build a brand around privacy.  These companies decided that privacy required that they not record IP addresses, and Ixquick deletes log data after 48 hours.

Personally, I don’t think the solution is in deleting IP addresses and log data willy-nilly.  But we as a society can’t have a thoughtful discussion on what it takes to balance privacy rights against the value of data if companies aren’t honest about how “personal” cookie and log data can be.

Some companies do acknowledge that information that they don’t consider “personal” could become personally identifying if it were combined with other data.  Microsoft therefore promises to “store page views, clicks and search terms…separately from your contact information or other data that directly identifies you (such as your name, email address, etc.).  Further we have built in technological and process safeguards to prevent the unauthorized correlation of this data.”  Similarly, WebMD makes this promise: “we do not link non-personal information from Cookies to personally identifiable information without your permission and do not use Cookies to collect or store Personal Health Information about you.”  WebMD further states that data warehouses it contracts with are required to agree that they “not attempt to make this information personally identifiable, such as by combining it with other databases.”

Otherwise, there’s very little discussion of what combining data means.  When data sets are combined, many that initially appear to be anonymous or “non-personally identifiable” can be de-anonymized.  Researchers at the University of Texas have demonstrated in recent years that it is possible to de-anonymize through combination, as when Netflix data is combined with IMDb ratings, or when Twitter is combined with Flickr.  So when companies offhandedly note that they are combining information they collect from different sources, they are learning a great deal more about individual people than the average user would imagine.  And as you might expect, large companies like Microsoft, Google, and Yahoo have a wealth of databases at their disposal.
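To make “de-anonymize through combination” a little more concrete, here is a toy sketch in Python.  Everything in it is invented for illustration: the names, the pseudonyms, the film titles, and the `link_records` helper are all hypothetical, and the matching rule (count shared title-and-rating pairs, accept only a unique match) is a deliberately naive stand-in, not the statistical method the researchers actually used.

```python
from collections import defaultdict

# Hypothetical "anonymized" release: names replaced by pseudonyms,
# but the ratings themselves are left intact.
anonymized_ratings = [
    ("user_17", "Film A", 5), ("user_17", "Film B", 2), ("user_17", "Film C", 4),
    ("user_42", "Film A", 5), ("user_42", "Film D", 3), ("user_42", "Film E", 1),
    ("user_99", "Film B", 4), ("user_99", "Film C", 4),
]

# Hypothetical public reviews posted elsewhere under real names.
public_reviews = [
    ("Alice Example", "Film A", 5), ("Alice Example", "Film B", 2),
    ("Bob Example", "Film A", 5),
]


def build_profiles(rows):
    """Group (who, title, rating) rows into {who: {(title, rating), ...}}."""
    profiles = defaultdict(set)
    for who, title, rating in rows:
        profiles[who].add((title, rating))
    return profiles


def link_records(anonymized, public, min_overlap=2):
    """Link a public name to a pseudonym when exactly one pseudonym
    shares at least `min_overlap` (title, rating) pairs with that name."""
    anon_profiles = build_profiles(anonymized)
    public_profiles = build_profiles(public)
    links = {}
    for name, reviews in public_profiles.items():
        candidates = [
            pseudonym
            for pseudonym, ratings in anon_profiles.items()
            if len(reviews & ratings) >= min_overlap
        ]
        if len(candidates) == 1:  # ambiguous or empty matches are not linked
            links[name] = candidates[0]
    return links


links = link_records(anonymized_ratings, public_reviews)
print(links)
```

With this made-up data, two public reviews are enough to tie “Alice Example” to one pseudonym, and with it to every other rating that pseudonym made.  “Bob Example,” with a single common rating, stays ambiguous.  Neither data set contains anything a privacy policy would list as “personal” on its own.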

So that’s “transparency”: a long list of the types of information collected, plus some artful categorization.  It’s amazing that some privacy policies can use so many words and yet say so little.


