Posts Tagged ‘Data Collection’

Where that “study” you quoted came from: Remember that call you got during dinner?

Tuesday, May 29th, 2007

Over the last few months I’ve been to a number of interesting talks at the Stanford Methods of Analysis Program in the Social Sciences (MAPSS) colloquium. Two types of speakers have caught my attention: those who work closely with the logistics and mechanics of data collection, and those who try to use survey data to test their hypotheses.

Most recently I got to hear Linda Piekarski of Survey Sampling International on SSI’s efforts to address changes in the telephone system, as well as their recent forays into internet surveys. (I didn’t realize how perfect the original design for the U.S. phone system was for tele-survey companies.)

Also memorable was Yale Professor Don Green’s talk about measuring the effectiveness of political campaign advertising. One of my favorite lines (though I’m paraphrasing) was that “Any time you see a clean, clear graph of data, there’s something wrong. Data ‘noise’ is what reality looks like.”

What follows is a summary of the challenges facing the collection of data about individuals derived in part from these talks.
Today, there are three main ways of collecting data from individuals, each of which contains flaws that seriously undermine the quality of the data collected.

  1. Pay them a tiny reward, lure them with a sweepstakes or nag them at dinner with a phone call from a stranger. For example, online stores may offer a coupon or rebate for your feedback on your buying experience.
  2. Make it easy for individuals to inadvertently or unthinkingly consent to data being collected about them, and/or subsequently change the substance of what is collected or the uses for that data. One prominent example is Amazon.com’s site registration process, which makes no attempt to highlight its third-party data-sharing practices.
  3. Leverage data collected for some other purpose – so-called “Secondary Use”. For example, addresses collected for fulfillment (shipping) are reused for geographically targeted marketing messages.

These mechanisms have a set of critical flaws:

  1. Tiny rewards and nagging phone calls are an insufficient value proposition for many individuals, so the pool of participants is unlikely to be well distributed across the target population. Instead it will favor those individuals for whom the reward remains attractive, however small; or those individuals for whom the cost of participation (time) is small enough to make the reward adequate. (Mechanism 1)
  2. Rewards or compensation that are distributed without regard to accuracy provide no incentive for careful or genuinely accurate self-reporting. (Mechanism 1)
  3. These practices cultivate a public perception of a mesh of “big brother” networks collecting an ever-expanding set of data, beyond the control of any one individual. Privacy outrage still surfaces in mainstream media occasionally, but the general public is increasingly numb to incremental discoveries of the erosion of personal privacy. While anesthesia may appear temporarily attractive to data collectors, it also disengages individuals from the data collection goals, which decreases participation and discourages accurate self-reporting. For example, when you are pressured to answer a survey at a department store or after check-out at a web retailer, do you react with an earnest attempt to supply them with the information they need? (All mechanisms)
  4. In an effort to fight back against ever more invasive data collection, privacy legislation and legal liability have forced data to be “silo-ed” and “anonymized” as much as possible. That means that unless you are a part of a larger survey panel, each subsequent survey you complete or data you consent to have collected will be stored separately from your other data. This eliminates the possibility of data-accuracy maintenance by individuals, and makes longitudinal analysis increasingly difficult. (All mechanisms)
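
To make the first flaw concrete, here is a rough toy simulation of self-selection. Everything in it is an assumption made up for illustration (the population, the 10-minute survey, the $5 reward, the rule that people respond only when the reward covers the value of their time); it is not drawn from any of the talks above.

```python
import random
import statistics


def simulate_survey(population_size=10_000, reward=5.0, minutes=10, seed=0):
    """Toy model: a person responds only if the reward covers the cost of their time."""
    rng = random.Random(seed)
    # Hypothetical population: each person values an hour of their time
    # somewhere between $5 and $100.
    time_value = [rng.uniform(5, 100) for _ in range(population_size)]
    # Assumed decision rule: respond when (hourly value * survey length) <= reward.
    respondents = [v for v in time_value if v * minutes / 60 <= reward]
    return {
        "population_mean": statistics.mean(time_value),
        "respondent_mean": statistics.mean(respondents),
        "respondents": len(respondents),
    }


result = simulate_survey()
print(result)
```

Under these assumptions the respondents are exactly the people whose time is cheapest, so any answer that correlates with the value of a person’s time will be skewed, no matter how many responses come in.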

Who cares about Privacy: Why search queries in America trump sexual history in Africa.

Wednesday, December 6th, 2006

A couple of weeks ago I reported on Sam Clark’s presentation about interesting social science data collection efforts, in particular, research being done in the area of AIDS/HIV in Sub-Saharan Africa…a data collection project that was startling both for how intimate the survey questions were and for the cursory attention paid to privacy matters.

A week later, I attempted to give various rationales for why privacy did not figure prominently in Clark’s presentation. The sheer urgency of the AIDS epidemic, the relative powerlessness of the survey subjects and the relative irrelevance of databases, the internet and modern digital life as we know it to much of Africa seemed to me to be the three most powerful reasons.

So now I’d like to contrast that with the media frenzy of late in the First World over AOL’s unfortunate bungling that led to an unqualified release of user query data to the public.

So in the context of this DSS work in Africa, do the AOL users have a right to be outraged? If there was a leak of INDEPTH user data, would the U.S. media be condemning INDEPTH? Or would they not care because the average African’s privacy is too far removed from our reality? Or maybe INDEPTH survey respondents are disenfranchised at this point?

What if there was an unfortunate bungling of personal information at INDEPTH? Who cares if we know AOL user 34653 is looking for a good cross-dressing cruise for couples and idolizes Cher, Castro and Trent Lott in a single breath? It’s trivial in comparison to Name, HIV status, # and type of sex partners in the last 6 months, # of times you’ve had unprotected sex in the last 6 months and with whom.

What if an embattled, desperate government with a touch of psychosis decided that this data was handy for carrying out a genocidal “solution” to the AIDS epidemic?

I can’t help feeling that the fact that “the [AOL] data was leaked” is beside the point.

Yes it was careless, wrong, and inconsiderate. But in the end, is that really why people are so unhappy?

I think people are unhappy about the AOL data release because it was a surprise. People simply didn’t realize how much of their life was being captured, recorded and analyzed by search engines. Even with our modern-day sophistication, we are just as naive about the digital fingerprints we leave everywhere as the respondents are about the surveys they answer. In some ways you could say the respondents in INDEPTH’s DSS were more aware. They were painfully aware that their lives were being examined, and not only that, they knew and understood the goals of the organization that was collecting that information.

For AOL users, it was only after the data release that people started to realize that as an individual, you are laying bare your psyche: contemplations of suicide, murder, sexual hang-ups, personal insecurities, etc…so that the folks at AOL (and other search engines) can sell you better targeted advertising and make more money. Contrast that with what social scientists are trying to accomplish in sub-Saharan Africa and you start to feel like a cheap date.

What’s unfortunate is that the AOL data became compromised because AOL was following others in the industry in trying to “do good” by making their data available to academic researchers who might try to do something more with the data than figure out advertising schemes.

The takeaway here is that there is no straightforward, one-size-fits-most policy when it comes to privacy. It’s not about how much privacy is enough privacy. It’s not about whether people should share data or not share data. It’s clear that there are myriad circumstances that call for different levels of care on the part of the people collecting data and provoke different responses on the part of the people sharing information. Like most things having to do with human beings and society, privacy is context-sensitive, and grand sweeping EULAs and privacy policies are insufficient, if not downright ridiculous, for capturing how we should approach the issue, as an industry and as a society.

Now that AOL knows beyond a shadow of a doubt that its users can’t extrapolate the natural consequences of using the AOL search service from its generic, vague privacy policy, it needs to find a way to make the search experience itself clearly communicate to the user the data collection that is happening behind the scenes. The INDEPTH survey respondents wouldn’t be surprised to see themselves in a report on the sexual history of people who are HIV+ or living with AIDS in Sub-Saharan Africa. They’re answering a survey, what else would they expect?

Similarly, AOL users shouldn’t be surprised that someone is keeping track of what they search for, what sites they visit, for how long and how often.* Attaining this kind of mutual understanding with your users is much trickier and has yet to be done successfully. After all, AOL’s users don’t think of themselves as answering a survey when they conduct a search or visit a website. But as far as the researchers at AOL are concerned, that’s exactly what they’re doing.

*That being said, everyone should be surprised and outraged if any of this data is released without being properly anonymized. Whether or not everyone has the wherewithal and press connections to express their indignation and anger is another issue for another blog entry.

How to evaluate a privacy statement when you’re dying of AIDS

Sunday, November 12th, 2006

Last week, I reported on Professor Sam Clark’s recent talk: “Relational Databases in the Social and Health Sciences: The View from Demography.” Clark covered a wide array of topics from the challenges of working with heterogeneous sets of social science field research to data-driven outcome-modeling that is used to drive policy decisions in the arena of AIDS/HIV prevention and treatment in Sub-Saharan Africa.

As I mentioned last week, surprisingly, privacy did not come up during Professor Clark’s talk…except in a brief aside, where Clark acknowledged that study subjects are at times uncomfortable disclosing extra-marital relationships. On the whole, privacy did not appear to be taking up too many cycles at either INDEPTH, a network of ‘Demographic Surveillance Systems’ (DSS is social science-speak for data collection sites) that is working to standardize field research, or SPEHR, Clark’s personal effort to design a standard database schema for social science research. At the risk of being presumptuous, the term ‘Demographic Surveillance System’ itself speaks volumes about how social science regards the issue of privacy.

At the same time, the frequent media alerts about privacy and data leaks (HP, AOL, Veterans) got me wondering: How would this data be handled in a US-based study? How readily would you respond to an online survey asking you how many times you’ve had unprotected sex?

Not very readily would be my guess. Forget allowing someone to compile a detailed log of your day-to-day sexual activity. People would never even get past the first 2 questions: Are you HIV positive? Are you living with AIDS? The ramifications of leaking such information are all too well-known in modern society.

Just to make sure that I hadn’t misread the lack of emphasis, I rooted around the INDEPTH website to see if I could find a meatier discussion about privacy.

I found a reference to “A Data Model for Demographic Surveillance Systems”, a 21-page paper which makes its first and last mention of privacy on p.18 in its ‘Conclusions and Future Work’ section:

“More work is needed for sites that require better data privacy than simply restricting access to the data set. Certainly, separating the name from the ID field is the first step in providing better data privacy.”

I also found “Data access, security and confidentiality”, a 174-word document in the INDEPTH DSS Resource Toolkit that recommends 3 things to researchers designing data collection systems:

1. Be clear about who has access to the data, what data they have access to, and what level of access they should have.
2. Back up the data. A RAID server is ideal.
3. Separate survey respondent ID numbers from their names.
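
For the curious, recommendations 1 and 3 might look something like the sketch below. The field names, roles and the `split_identity` / `can_read` helpers are all hypothetical, invented here for illustration; this is not INDEPTH’s schema or tooling, and as the paper quoted above says, separating names from IDs is only a first step.

```python
import secrets


def split_identity(records):
    """Store names apart from answers, linked only by a random respondent ID."""
    identities = {}  # respondent_id -> name (kept under stricter access)
    responses = {}   # respondent_id -> answers (what analysts work with)
    for record in records:
        respondent_id = secrets.token_hex(8)
        while respondent_id in identities:  # vanishingly unlikely collision
            respondent_id = secrets.token_hex(8)
        identities[respondent_id] = record["name"]
        responses[respondent_id] = {
            key: value for key, value in record.items() if key != "name"
        }
    return identities, responses


# Hypothetical access policy: which role may read which table.
ACCESS = {
    "analyst": {"responses"},
    "site_supervisor": {"responses", "identities"},
}


def can_read(role, table):
    """Return True if the role is allowed to read the given table."""
    return table in ACCESS.get(role, set())


identities, responses = split_identity([
    {"name": "Respondent A", "household_size": 4},
    {"name": "Respondent B", "household_size": 2},
])
print(can_read("analyst", "identities"))  # False
```

Note that the answers left in `responses` can still identify someone if they are specific enough, which is part of why three bullet points feel thin.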

These are all good recommendations that demonstrate a willingness to address the issue. But isn’t this oversimplification at best and gross negligence at worst? Granted, I may be unfair in singling out INDEPTH to play the role of spokesperson for the entire social science community on the topic of privacy. So maybe all I really can say is that, at the very least, the folks at INDEPTH are seriously underestimating the challenges of taking on guardianship of sensitive personal data. As with the researchers at AOL, we can only wait for the consequences of their mis-estimation to play out.

So again, the sense I get is that privacy isn’t a major issue. Why’s that?

INDEPTH’s users have more important things to worry about. They’re not scanning people’s email to sell mattress companies more targeted advertising. They’re trying to do things like save a continent from implosion.

According to UNAIDS, in 2005 alone an estimated 3.2 million people in Sub-Saharan Africa became newly infected, while 2.4 million adults and children died of AIDS. In the U.S., which has less than 40% of the population of Sub-Saharan Africa, if 1 million Americans were dying of AIDS every year, we wouldn’t be talking about privacy either.

A second, more insidious reason is that this flavor of information privacy is largely an information-age phenomenon, one that requires the individual to understand the implications and weigh the risks of disclosure.

Our ‘modern-day’ awareness, or wariness, of disclosure did not come for free. Even with all of the media frenzy, people regularly compromise their personal information in myriad ways every day: Chocolate bars for passwords.

Nevertheless, no matter how tenuous a grasp the public has on data and databases, the level of sophistication mainstream America has achieved in the realm of ‘things digital’ is not to be taken for granted.

It’s not a matter of intelligence or common sense. I’m guessing that the people who willingly participate in DSS such as INDEPTH don’t have a gut-level appreciation of what it means to be ‘in the system’ for the simple reason that they live in pre-digital or barely digital societies and aren’t kept track of in their daily existence the way we are.

They don’t log in, they don’t enter passwords, PIN numbers or secret codes. They don’t answer self-selected security questions, swipe key fobs, scan ID cards, metro cards, and medical insurance cards. They don’t accept certificates, add people to whitelists, report spam. They don’t make spreadsheets, tag pictures, maintain ‘address books’, query their email or for that matter, query the web. They don’t inspect the history in their web browser to delete all the URLs that might not be so great for other people to inadvertently stumble across. They’ve never had an application rejected because of ‘low’ test scores and ‘bad’ grades. They’ve never been denied insurance for having ‘above average’ blood pressure. They’ve never been denied a mortgage for having ‘below average’ credit. They’ve never been audited by the IRS or logged into Amazon to be confronted with “Here’s a recommendation just for you: Getting pregnant after Menopause!”.

In other words, the subjects in this study don’t necessarily have a clear conception of this thing called a database that is going to consume their personal life history, chop it up into discrete cells and array it in rows and columns, making it all the more digestible for aggregating, analyzing and comparing, and accessible to on-demand recall. The question is, when a respondent ‘consents’ to ‘participate in a survey’, do they understand what they’re consenting to? Do the field researchers themselves understand what respondents are consenting to?

Even if respondents did fully understand what ‘consent’ really meant (which is highly doubtful given that most First World internet users don’t fully digest what it means to ‘Accept’ a EULA), there still remains the unresolved issue of whether dire circumstances (e.g. lots of people dying with no end in sight) warrant slackened attention to privacy.

Up Next: Who cares about Privacy: Why search queries in America trump sexual history in Africa.

Privacy Paranoia Part II: What are they afraid of?

Tuesday, October 24th, 2006

In Privacy Paranoia Part I, I questioned the assumption that people are intrinsically suspicious of data collection efforts and generally unwilling to volunteer personal information, by walking through a few everyday examples of information sharing.

However, while there is an abundance of scenarios and circumstances under which you and I are happy to reveal personal data, that does not change the stubborn fact that users generally are suspicious of data collection efforts and in many cases would choose NOT to share personal information. (Except for a lack of patience for reading fine print and paying attention to default settings on the software they install.)

Privacy Paranoia Part II addresses this apparent inconsistency, clearing the path to Part III, which will address concrete ways to change user attitudes toward data collection.

The general public’s seemingly contradictory relationship with information-sharing can be explained away once we, as web service providers, accept responsibility for the reaction we provoke in our users.

In the real world, information-sharing works as a quid pro quo where both sides agree to terms they can live with and exchange information accordingly.

In the world of online services, we as service providers are attempting to engage our users in this exchange, but we present it as a one-sided deal. You give, we take. The terminology we use as an industry belies our inward focus. We don’t engage in information-sharing with our users. We collect data. We mine data. We warehouse data.

So, the million-dollar question is: What do we need to provide our users in order to engage them in an information-exchange with us?

1. Transparency of intent. As the user, if I know why you need the information you are requesting, I am more likely to give it to you, even if there are opportunities for you to re-purpose my information in ways I don’t intend.

2. Personal benefit (If I need to tell you.)

  • I give complete strangers on eBay my home address, in exchange for having my purchase arrive on my doorstep.
  • I tell my credit card company what I purchased, where I purchased and when I purchased it, in exchange for being free from the constraints of managing cash.

1 and 2 are as far as most people go. And many people have pretty low standards for 2.

3. Reputation. What is the reputation of the person/entity that is requesting this information? Are they going to maliciously misuse my information? Are they going to take care with my information? Are they even capable of understanding what “taking care with my information” means? (As in, are they clueless enough to transmit my credit card number in plain text?)

4. What else could the requester do with this information? How valuable, how sensitive is the information I’m giving out?

Today, few people weigh these factors systematically, not because they don’t want to, but because they can’t. The services, organizations and businesses asking for phone numbers, addresses, gender, income, credit card numbers and social security numbers aren’t holding up their end of the quid pro quo.

1. Transparency into the Hows, Whys, Whens and What-fors
2. Exchanging data rather than Collecting data

As a result, in place of rational evaluation, habit and confusing design rule. Some people run their own email servers and devise dozens of aliases to throw ‘Big Brother’ off the trail. Others happily hand over their data in exchange for the famous free bar of chocolate in the subway.

This makes it very hard to predict how the general public will deal with information-sharing services. The reaction could run the gamut from paranoid revulsion to earnest enthusiasm to blasé indifference. This in turn makes the data we hope to collect and build a service around unreliable and uneven in quality. We want everyone to be represented in the data pool, paranoiacs included.

Therefore, if we want to neutralize the randomizing influence of personality, we must find a way to walk people through evaluating questions 1-4 in a rational and considered way; and hopefully the answers they come up with convince them that participating in the information-sharing community is in their best interest.

How do we do that?

Privacy Paranoia Part I: What are we afraid of?

Wednesday, October 18th, 2006

If a stranger asked you on the street “What is your street address?” you would probably be pretty startled at his presumption and walk away. What part of town you’re from is friendly chit-chat, but street address is a tad too specific for comfort. After all, what business could he possibly have with your address? However, if that same stranger is standing behind a counter at a store, wearing a uniform and asking the same question, you still might not give him your address, but you’d have a better sense of why he was asking, what he’s likely to do with the information and how it will affect your life (more snail mail SPAM).

You may also wonder if the stranger will abuse his access privileges and re-purpose your personal information for his own interests, possibly at your expense (e.g. identity theft). How likely is this? That depends on a whole host of factors from the brand and reputation of the store, your past experiences with the store, the dress and mannerisms of the stranger, personal biases, etc.

When a security gate asks you to identify yourself with your swipe card, you volunteer personal information (who you are, where you are and when you were there) without even thinking about it. The social contract is clear: If I tell you who I am, you (the disembodied security system instituted by the disembodied corporation I work for) will let me in so I can go to work, make money and support myself and my expensive spending habits. Besides, who cares if everyone in the world knows that I was at work at 9:14 in the morning? How could that information possibly harm me in the future?

Finally, when your doctor wants to know if you’re sexually active or abusing drugs, depending on how ill you feel, how desperate you are to feel better and the political leanings of the hospital, you’ll spill your guts, because that’s what you’re supposed to do with doctors.

Once you get past these questions of Who, Why, For What and How, you might ask yourself if the person, business or organization who is asking for your information is even capable of taking responsibility for it.

Clearly, we wear our personal information on our sleeves in a variety of ways in a broad range of situations every day, multiple times a day. Yet, as an industry, we’ve pretty much given up on the idea that users will volunteer personal information to a web service. Instead, we resort to not-so-subtle tricks that we hope our users won’t notice. Clever default settings and EULAs we know our users don’t read. However, this is neither the right way to go about building a user base, nor is it sustainable. It is also, by no means, the only way.

Privacy Paranoia Part II: What are they afraid of?

FreshBooks Aligns Data Collection with its Customers’ Interests

Wednesday, October 11th, 2006

I think FreshBooks is attempting something very interesting.

[Freshbooks is geared toward small businesses and/or independent contractors. From their Manifesto: "Our mission is to deliver fast and simple invoicing and time tracking services that help you manage your business."]

They are asking their users to optionally classify their profession/industry. In return, participants gain access to business metrics for their industry, based on aggregations of data collected from the Freshbooks user population.

The examples they give are

  • “What is the average invoice size for [your profession]?”
  • “How long does the average [your profession] take to get paid?”
  • “What is the average monthly revenue of other [your profession]?”

I would imagine this will raise many a small business eyebrow. However, they still feel thin and generic to me. I want to know:

  • “How many years of experience do other professionals in my industry have?”
  • “What are their industry credentials? Education? Training? Skill set? Work experience?”
  • “What is the quality of their clientèle?”
  • “Where is their operation based?”
  • “What kind of capital investments have they made?”

Collecting data from users is not new. Collecting data from users to provide a service is not new (if you consider targeted advertising a user service). However, there is something unique about what Freshbooks is doing that differentiates it from the various other data collection efforts on the internet. They have figured out a way to provide data to their customers that provides tangible, monetary value to their users; value that their users would probably be willing to pay for, and value that is difficult (expensive!) if not impossible for them to get anywhere else.

Furthermore, Freshbooks’ model turns the tables on data collection and privacy. In place of a parasitic relationship where Internet Company as Big Brother spies on users in order to make big bucks selling Targeted Advertising, a symbiotic exchange is established where users happily provide personal data in exchange for a tangible good in return. Sounds too good to be true? It probably is in the immediate future.

It’s worth noting that

  1. Freshbooks is collecting data from a real service they provide (as opposed to polls and surveys). This minimizes the risk of collecting bogus data.
  2. Because FreshBooks implies they will only tell you about the industry you indicate (thereby encouraging you to provide an accurate categorization or be given useless data) data inaccuracies due to user information distortions should be minimal.
  3. Freshbooks is being at least semi-transparent about what they’re doing with the data they collect. As a result, Freshbooks is establishing a trust relationship with their users, which turns the data they collect from their users into a renewable resource, as opposed to one (advertising) that runs dry as soon as users find out they’re being spied on. I say semi-transparent because:
     3a. Freshbooks is not being completely forthright about who else they may or may not be selling this data to.
     3b. Implicit is the fact that Freshbooks can also use this data to optimize their own business and pricing strategies.
  4. Although they are not charging for this data yet, the information would probably be worth at least hundreds of dollars a year to any given customer. (How Freshbooks might choose to monetize that value is a different story.) By contrast, the dollars that Freshbooks might have been able to get from selling targeted advertising for that customer’s eyeballs are unlikely to approach $100/year.
  5. Freshbooks reassures its users that their data is only used in its “anonymous aggregate form”. However, the term ‘data aggregates’ is so vague as to be largely useless. Freshbooks still doesn’t have a complete story about how they will protect the individual identities of their users.
  6. I’m not clear on how this new program jibes with the FreshBooks privacy statement, which under the heading “Ownership of Data Submitted to Active FreshBooks Subscriptions” suggests that user data is owned by the user, not by FreshBooks. How then does Freshbooks have the right to aggregate and share your data with other users? Does Freshbooks only collect data from users who opt in to share/view data? If so, that severely limits their data pool. I wonder how many of their 90,000+ users are considered active and will opt in…?
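
To show why “anonymous aggregate form” needs more of a story (point 5), here is a minimal sketch of one common precaution: only publish an average when enough distinct users contribute to it. The function name, the tuple layout and the `min_users=5` threshold are my own assumptions for illustration; I have no idea how Freshbooks actually computes or protects its aggregates.

```python
from collections import defaultdict
from statistics import mean


def average_invoice_by_profession(invoices, min_users=5):
    """Average invoice size per profession, hiding professions with too few users.

    invoices: iterable of (user_id, profession, amount) tuples.
    """
    amounts = defaultdict(list)
    users = defaultdict(set)
    for user_id, profession, amount in invoices:
        amounts[profession].append(amount)
        users[profession].add(user_id)
    # Suppress any group so small that the "average" would point at individuals.
    return {
        profession: round(mean(values), 2)
        for profession, values in amounts.items()
        if len(users[profession]) >= min_users
    }


sample = [(f"u{i}", "designer", 100 * i) for i in range(1, 6)]
sample.append(("u9", "plumber", 750))
print(average_invoice_by_profession(sample))
```

In this made-up sample the lone plumber’s “average invoice” would simply be that plumber’s invoice, so the sketch leaves it out. A threshold like this is a starting point rather than a guarantee.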

I’m very interested to hear if this sticks, and if their users are able to jump over the hurdle of giving up a little bit of privacy for a little bit of information. The relevancy of the data will presumably be a factor in continued participation.

What they should be doing:

  • Provide context about what’s missing: It is as important to understand who isn’t participating in providing data as it is to know who is.
  • Provide context about their users: It is as important to understand the demographics, circumstances and nature of the other participants as it is to know what the raw accounting numbers are. After all, do I, as a small-town consultant, really care what the big boys are charging on Madison Avenue?
  • Take a lot of care with the aggregates so that some sort of data-release scandal doesn’t come and bite them.
  • Refrain from using their data for parasitic reasons which undermine the trust relationship they’re building with their users.
  • Provide a way for users to cleanly and completely end their participation in the data collection program.

While time will tell what happens with the execution of this effort, I am excited by the attempt: A business that collects data from their users and returns to them business intelligence, rather than handing over the customer relationships they built to the highest pay-per-click bidder.
