Posts Tagged ‘Information’

Numbers are only as useful as the questions you ask of them

Friday, May 9th, 2008

Errol Morris recently made a point about filmmaking that expresses precisely how I feel about interpreting data:

“There is no mode of expression, no technique of production that will instantly produce truth or falsehood.”

Data, like film or photography, is a representation of “the real world.” We study it in the hopes of finding enlightenment and understanding. However, there is no “technique of [data representation] that will instantly produce truth or falsehood.” If someone is disillusioned with data, it is often because they expect too much from it. Data can’t “tell us” anything. It can only take something that is hard to grasp and offer up otherwise submerged surface area for examination, inquiry and analysis.

An important part of analyzing data is doubt and questioning, yet most data reported by the mainstream media is doubt-resistant. Some magic number is reported to the public with no real data to pick apart and study. Where do these magic numbers come from?

I wrote a while back about Yale Professor Don Green saying casually that he never believes tidy data. Beware of dumbed-down data! When presented with data that conveniently boils down to “one number” that can explain it all, raise an eyebrow and dig deeper.

Responsible reporting of new data findings should probe and challenge the data. Where did the data come from? How reliable are these sources? What data is missing? How might what’s missing change the results? (This is the hardest to pull off because it requires us to imagine what we don’t know.) How is the way the data is presented inadvertently influencing how it will be interpreted? What assumptions will each person bring to their interpretation of this data? Are they valid? Who’s in a position to make that judgment? Without at least asking these questions, eye-catching “magic number” headlines are a disservice to the public, designed to catch eyeballs with false clarity rather than expose the confusing uncertainty of reality.

More often than not, analyzing one data set simply propels you to collect and question more data. Now that I have this data, what other data do I need? Now that I have answers to these questions, what other questions do I now know I need to ask?

This is not to say that data never yields answers. Generally speaking, however, every hard-won answer simply opens the door to five more questions you couldn’t have imagined at the outset.

Yet another data breach

Thursday, March 20th, 2008

A major grocery chain, Hannaford, recently announced that due to a security breach, up to four million credit cards may be vulnerable to access by criminals. So add another to the list of 2008 security breaches, and it’s only March.

As Flowing Data points out, when you look at a timeline of big data breaches from Attrition.org, data breaches have occurred with more frequency, not less, the closer we get to the present. Yet data breaches seem to be getting less coverage than they used to. When I looked at the full list of breaches catalogued by Attrition.org, I saw some that I’d heard of and many I hadn’t. And with this recent breach, I haven’t seen as much coverage as I would have expected. Plenty of specialized blog reactions and local news coverage, but not much national attention. Are people just getting used to this? Or is it that they think they have no alternatives?

CDTF’s Presentation at the Workshop on Data Privacy

Friday, February 22nd, 2008

The Common Datatrust Foundation recently attended and made a short presentation at the Workshop on Data Privacy, hosted by Rutgers University’s Center for Discrete Mathematics & Theoretical Computer Science (DIMACS).

There were spirited conversations across disciplines as statisticians, mathematicians, computer scientists, and media experts discussed how to balance the public’s interest in both privacy and information sharing. The presentations ranged from tutorials on new security and privacy technology to the management of existing databases of personal information, such as the U.S. Census, as well as thought-provoking presentations on more abstract but highly relevant questions, such as what we mean when we say we want to protect “privacy.” As Professor Helen Nissenbaum from NYU Law School pointed out, certain kinds of information flow are appropriate for certain situations; there is no uniform way to understand privacy protection.

We were excited to see how our presentation provoked questions and conversations as well. Alex Selkirk introduced the concept of a “datatrust,” a secure, structured data storage system where each record in each dataset has a set of rules defining who may use it, what it may be used for, and with what level of anonymity it may be disclosed. The presentation focused primarily on one example of the current limits of data disclosure: the subprime mortgage crisis. Although there is a great deal of data held by banks and mortgage companies on subprime loans, investigators and researchers are unable to analyze the data because the data holders are bound by confidentiality agreements to individual borrowers. CDTF proposed that a datatrust, as a third party, could use new technology to anonymize and aggregate the data in a way that would allow researchers to query the loan data without forcing the disclosure of identifying details about the borrowers. Such data-sharing would further CDTF’s mission to both protect individual privacy and encourage the sharing of information for the public good.
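To make the idea of per-record rules concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of how CDTF's datatrust is built: the class names, the three disclosure levels, and the sample loan record are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical disclosure levels, ordered from least to most revealing.
ANONYMITY_LEVELS = ("aggregate_only", "pseudonymous", "identified")


@dataclass(frozen=True)
class UsageRule:
    allowed_users: frozenset     # who may use the record
    allowed_purposes: frozenset  # what it may be used for
    max_disclosure: str          # most revealing form in which it may be released


@dataclass
class Record:
    data: dict
    rule: UsageRule


def may_access(record, user, purpose, requested_disclosure):
    """Return True only if all three conditions of the record's rule are met."""
    rule = record.rule
    if user not in rule.allowed_users:
        return False
    if purpose not in rule.allowed_purposes:
        return False
    # A request is acceptable if it is no more revealing than the rule permits.
    return (ANONYMITY_LEVELS.index(requested_disclosure)
            <= ANONYMITY_LEVELS.index(rule.max_disclosure))


# Made-up example record: a loan that researchers may study only in aggregate.
loan = Record(
    data={"loan_amount": 250_000, "rate_type": "adjustable"},
    rule=UsageRule(
        allowed_users=frozenset({"researcher"}),
        allowed_purposes=frozenset({"default_risk_study"}),
        max_disclosure="aggregate_only",
    ),
)

print(may_access(loan, "researcher", "default_risk_study", "aggregate_only"))
print(may_access(loan, "researcher", "default_risk_study", "identified"))
print(may_access(loan, "marketer", "ad_targeting", "aggregate_only"))
```

In this sketch the rule travels with the record, so the same check applies no matter which dataset the record ends up in. A real system would also need authentication, auditing, and an actual anonymization step, none of which is shown here.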

We hope that the conversation we began at DIMACS will continue to engage conference participants and others in the coming months.

What exactly is Google up to?

Wednesday, February 6th, 2008

Even as Google has become the most coveted place to work, to the extent that even their cafeteria gets media coverage, it’s also getting increasingly negative attention as a potentially sinister force. The New Yorker recently published an article with rather vague speculation about the ways Google might take over the world. Now, we hear that Microsoft is trying to buy Yahoo so that together they can fight Google. (Isn’t it funny that Microsoft is seeing another company as the big, bad world-dominator?) More and more, people are starting to wonder, “What exactly is Google up to?”

But given that we can’t read the minds of Sergey Brin and Larry Page, perhaps what we should be looking at is the conflict-of-interest inherent in Google’s business model. Google’s stated mission as a company is to organize the world’s information and make it universally accessible and useful. But are Google’s customers really the individuals searching for information, or are they the advertisers who actually increase Google’s revenues and stock value? To be fair, Google makes a respectable effort to separate advertising from “legitimate,” as in “non-jerry-rigged” search results. But after ten years, the Google search experience is pretty much the same as it’s always been. Has Google been working really hard on tools to help people find better information faster, or has it been working really hard on tools to help advertisers better target potential customers?

Google doesn’t have to be evil to be troubling. It may have started out with the purest of intentions, but it’s hampered itself with the conflict of interest at the heart of its operations. Law professor Tim Wu, as quoted in the New Yorker, said it straight: “I predict that Google will end up at war with itself.”

Where that “study” you quoted came from: Remember that call you got during dinner?

Tuesday, May 29th, 2007

Over the last few months I’ve been to a number of interesting talks at the Stanford Methods of Analysis Program in the Social Sciences (MAPSS) colloquium. Two types of speakers have caught my attention: those who work closely with the logistics and mechanics of data collection, and those who try to use survey data to test their hypotheses.

Most recently I got to hear Linda Piekarski of Survey Sampling International on SSI’s efforts to address changes in the telephone system, as well as their recent forays into internet surveys. (I didn’t realize how perfect the original design for the U.S. phone system was for tele-survey companies.)

Also memorable was Yale Professor Don Green’s talk about measuring the effectiveness of political campaign advertising. One of my favorite lines (though I’m paraphrasing) was that “Any time you see a clean, clear graph of data, there’s something wrong. Data ‘noise’ is what reality looks like.”

What follows is a summary of the challenges facing the collection of data about individuals, derived in part from these talks.

Today, there are three main ways of collecting data from individuals, each of which contains flaws that seriously undermine the quality of the data collected.

  1. Pay them a tiny reward, lure them with a sweepstakes or nag them at dinner with a phone call from a stranger. For example, online stores may offer a coupon or rebate for your feedback on your buying experience.
  2. Make it easy for individuals to inadvertently or unthinkingly consent to data being collected about them, and/or subsequently change the substance of what is collected, or the uses for that data. One prominent example is Amazon.com’s site registration process, which makes no attempt to highlight their third-party data-sharing practices.
  3. Leverage data collected for some other purpose – so-called “Secondary Use”. For example, addresses collected for fulfillment (shipping) may be used for geographically targeted marketing messages.

These mechanisms have a set of critical flaws:

  1. Tiny rewards and nagging phone calls are an insufficient value proposition for many individuals, so the pool of participants is unlikely to be well distributed across the target population. Instead it will favor those individuals for whom the reward remains attractive, however small; or those individuals for whom the cost of participation (time) is small enough to make the reward adequate. (Mechanism 1)
  2. Rewards or compensation that are distributed without regard to accuracy provide no incentive for careful or genuinely accurate self-reporting. (Mechanism 1)
  3. These practices cultivate a public perception of a mesh of “big brother” networks collecting an ever-expanding set of data, beyond the control of any one individual. Privacy outrage still surfaces in mainstream media occasionally, but the general public is increasingly numb to incremental discoveries of the erosion of personal privacy. While anesthesia may appear temporarily attractive to data collectors, it also disengages individuals from the data collection goals, which decreases participation and discourages accurate self-reporting. For example, when you are pressured to answer a survey at a department store or after check-out at a web retailer, do you react with an earnest attempt to supply them with the information they need? (All mechanisms)
  4. In an effort to fight back against ever more invasive data collection, privacy legislation and legal liability have forced data to be “silo-ed” and “anonymized” as much as possible. That means that unless you are a part of a larger survey panel, each subsequent survey you complete or data you consent to have collected will be stored separately from your other data. This eliminates the possibility of data-accuracy maintenance by individuals, and makes longitudinal analysis increasingly difficult. (All mechanisms)
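The first flaw, a participant pool skewed toward people who find a tiny reward worthwhile, can be illustrated with a little arithmetic. The sketch below uses entirely made-up numbers: two hypothetical groups, their share of the population, the average answer each would give, and the probability that a member agrees to participate. None of these figures come from the talks mentioned above.

```python
# Hypothetical population. Each entry is:
# (label, share of population, true average answer, probability of participating)
groups = [
    ("time-rich, reward-sensitive", 0.20, 30.0, 0.50),
    ("everyone else",               0.80, 60.0, 0.05),
]


def population_mean(groups):
    """Average answer if every individual responded."""
    return sum(share * value for _, share, value, _ in groups)


def respondent_mean(groups):
    """Average answer among those who actually respond."""
    # Weight each group by the fraction of the population that both
    # belongs to it and agrees to participate.
    responding = [(share * p, value) for _, share, value, p in groups]
    total = sum(weight for weight, _ in responding)
    return sum(weight * value for weight, value in responding) / total


print(round(population_mean(groups), 2))  # 54.0
print(round(respondent_mean(groups), 2))  # 38.57
```

Under these assumed numbers, the group that makes up a fifth of the population supplies most of the responses, and the survey average lands far from the true one. Collecting more responses the same way would not close the gap, because the skew comes from who answers, not from how many.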

Privacy Paranoia Part II: What are they afraid of?

Tuesday, October 24th, 2006

In Privacy Paranoia Part I, I questioned the assumption that people are intrinsically suspicious of data collection efforts and generally unwilling to volunteer personal information, by walking through a few everyday examples of information sharing.

However, while there is an abundance of scenarios and circumstances under which you and I are happy to reveal personal data, that does not change the stubborn fact that users generally are suspicious of data collection efforts and in many cases would choose NOT to share personal information. (Except for a lack of patience for reading fine print and paying attention to default settings on the software they install.)

Privacy Paranoia Part II addresses this apparent inconsistency, clearing the path to Part III, which will address concrete ways to change user attitudes toward data collection.

The general public’s seemingly contradictory relationship with information-sharing can be explained away once we, as web service providers, accept responsibility for the reaction we provoke in our users.

In the real world, information-sharing works as a quid pro quo where both sides agree to terms they can live with and exchange information accordingly.

In the world of online services, we as service providers are attempting to engage our users in this exchange, but we present it as a one-sided deal. You give, we take. The terminology we use as an industry betrays our inward focus. We don’t engage in information-sharing with our users. We collect data. We mine data. We warehouse data.

So, the million-dollar question is: What do we need to provide our users in order to engage them in an information-exchange with us?

1. Transparency of intent. As the user, if I know why you need the information you are requesting, I am more likely to give it to you, even if there are opportunities for you to re-purpose my information in ways I don’t intend.

2. Personal benefit (If I need to tell you.)

  • I give complete strangers on eBay my home address, in exchange for having my purchase arrive on my doorstep.
  • I tell my credit card company what I purchased, where I purchased it and when I purchased it, in exchange for being free from the constraints of managing cash.

1 and 2 are as far as most people go. And many people have pretty low standards for 2.

3. Reputation. What is the reputation of the person/entity that is requesting this information? Are they going to maliciously misuse my information? Are they going to take care with my information? Are they even capable of understanding what “taking care with my information” means? (As in, are they clueless enough to transmit my credit card number in plain text?)

4. What else could the requester do with this information? How valuable, how sensitive is the information I’m giving out?

Today, few people weigh these factors systematically, not because they don’t want to, but because they can’t. The services, organizations and businesses asking for phone numbers, addresses, gender, income, credit card numbers and social security numbers aren’t holding up their end of the quid pro quo, which would mean:

1. Transparency into the Hows, Whys, Whens and What-fors
2. Exchanging data rather than Collecting data

As a result, in place of rational evaluation, habit and confusing design rule. Some people run their own email servers and devise dozens of aliases to throw ‘Big Brother’ off the trail. Others happily hand over their data in exchange for the famous free bar of chocolate in the subway.

This makes it very hard to predict how the general public will deal with information-sharing services. The reaction could run the gamut from paranoid revulsion to earnest enthusiasm to blasé indifference. This in turn makes the quality of the data we hope to collect and build a service around unreliable and uneven. We want everyone to be represented in the data pool, paranoiacs included.

Therefore, if we want to neutralize the randomizing influence of personality, we must find a way to walk people through evaluating questions 1-4 in a rational and considered way; and hopefully the answers they come up with convince them that participating in the information-sharing community is in their best interest.

How do we do that?

Privacy Paranoia Part I: What are we afraid of?

Wednesday, October 18th, 2006

If a stranger asked you on the street “What is your street address?” you would probably be pretty startled at his presumption and walk away. What part of town you’re from is friendly chit-chat, but street address is a tad too specific for comfort. After all, what business could he possibly have with your address? However, if that same stranger is standing behind a counter at a store, wearing a uniform and asking the same question, you still might not give him your address, but you’d have a better sense of why he was asking, what he’s likely to do with the information and how it will affect your life (more snail mail spam).

You may also wonder if the stranger will abuse his access privileges and re-purpose your personal information for his own interests, possibly at your expense (e.g. identity theft). How likely is this? That depends on a whole host of factors from the brand and reputation of the store, your past experiences with the store, the dress and mannerisms of the stranger, personal biases, etc.

When a security gate asks you to identify yourself with your swipe card, you volunteer personal information (who you are, where you are and when you were there) without even thinking about it. The social contract is clear: If I tell you who I am, you (the disembodied security system instituted by the disembodied corporation I work for) will let me in so I can go to work, make money and support myself and my expensive spending habits. Besides, who cares if everyone in the world knows that I was at work at 9:14 AM? How could that information possibly harm me in the future?

Finally, when your doctor wants to know if you’re sexually active or abusing drugs, depending on how ill you feel, how desperate you are to feel better and the political leanings of the hospital, you’ll spill your guts, because that’s what you’re supposed to do with doctors.

Once you get past these questions of Who, Why, For What and How, you might ask yourself if the person, business or organization who is asking for your information is even capable of taking responsibility for it.

Clearly, we wear our personal information on our sleeves in a variety of ways in a broad range of situations every day, multiple times a day. Yet, as an industry, we’ve pretty much given up on the idea that users will volunteer personal information to a web service. Instead, we resort to not-so-subtle tricks that we hope our users won’t notice. Clever default settings and EULAs we know our users don’t read. However, this is neither the right way to go about building a user base, nor is it sustainable. It is also, by no means, the only way.

Privacy Paranoia Part II: What are they afraid of?

FreshBooks Aligns Data Collection with its Customers’ Interests

Wednesday, October 11th, 2006

I think FreshBooks is attempting something very interesting.

[Freshbooks is geared toward small businesses and/or independent contractors. From their Manifesto: “Our mission is to deliver fast and simple invoicing and time tracking services that help you manage your business.”]

They are asking their users to optionally classify their profession/industry. In return, participants gain access to business metrics for their industry, based on aggregations of data collected from the Freshbooks user population.

The examples they give are

  • “What is the average invoice size for [your profession]?”
  • “How long does the average [your profession] take to get paid?”
  • “What is the average monthly revenue of other [your profession]?”

I would imagine this will raise many a small business eyebrow. However, these examples still feel thin and generic to me. I want to know:

  • “How many years of experience do other professionals in my industry have?”
  • “What are their industry credentials? Education? Training? Skill set? Work experience?”
  • “What is the quality of their clientèle?”
  • “Where is their operation based?”
  • “What kind of capital investments have they made?”

Collecting data from users is not new. Collecting data from users to provide a service is not new (if you consider targeted advertising a user service). However, there is something unique about what Freshbooks is doing that differentiates it from the various other data collection efforts on the internet. They have figured out a way to provide data to their customers that provides tangible, monetary value to their users; value that their users would probably be willing to pay for, and value that is difficult (expensive!) if not impossible for them to get anywhere else.

Furthermore, Freshbooks’ model turns the tables on data collection and privacy. In place of a parasitic relationship where Internet Company as Big Brother spies on users in order to make big bucks selling Targeted Advertising, a symbiotic exchange is established where users happily provide personal data in exchange for a tangible good in return. Sounds too good to be true? It probably is in the immediate future.

It’s worth noting that

  1. Freshbooks is collecting data from a real service they provide (as opposed to polls and surveys). This minimizes the risk of collecting bogus data.
  2. Because FreshBooks implies they will only tell you about the industry you indicate (thereby encouraging you to provide an accurate categorization or be given useless data) data inaccuracies due to user information distortions should be minimal.
  3. FreshBooks is being at least semi-transparent about what they’re doing with the data they collect. As a result, FreshBooks is establishing a trust relationship with their users, which turns the data they collect from their users into a renewable resource, as opposed to one (advertising) that runs dry as soon as users find out they’re being spied on. I say semi-transparent because:
     3a. FreshBooks is not being completely forthright about who else they may or may not be selling this data to.
     3b. Implicit is the fact that FreshBooks can also use this data to optimize their own business and pricing strategies.
  4. Although they are not charging for this data yet, the information would probably be worth at least hundreds of dollars a year to any given customer. (How FreshBooks might choose to monetize that value is a different story.) By contrast, the dollars that FreshBooks might have been able to get from selling targeted advertising for that customer’s eyeballs are unlikely to approach $100/year.
  5. Freshbooks reassures its users that their data is only used in its “anonymous aggregate form”. However, the term ‘data aggregates’ is so vague as to be largely useless. Freshbooks still doesn’t have a complete story about how they will protect the individual identities of their users.
  6. I’m not clear on how this new program jibes with the FreshBooks privacy statement, which under the heading “Ownership of Data Submitted to Active FreshBooks Subscriptions” suggests that user data is owned by the user, not by FreshBooks. How then does FreshBooks have the right to aggregate and share your data with other users? Does FreshBooks only collect data from users who opt in to share/view data? If so, that severely limits their data pool. I wonder how many of their 90,000+ users are considered active and will opt in…?
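To show why “anonymous aggregate form” needs more definition, here is a minimal Python sketch of one common safeguard: computing an average per profession but withholding it when too few distinct users contribute. This is an assumption-laden illustration, not FreshBooks’ method. The function name, the threshold of five users, and the sample invoices are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical threshold: do not publish an aggregate drawn from fewer users.
MIN_GROUP_SIZE = 5


def average_invoice_by_profession(invoices, min_group_size=MIN_GROUP_SIZE):
    """Average invoice amount per profession, or None where the group is too small.

    invoices: iterable of (user_id, profession, amount) tuples.
    """
    amounts = defaultdict(list)
    users = defaultdict(set)
    for user_id, profession, amount in invoices:
        amounts[profession].append(amount)
        users[profession].add(user_id)

    report = {}
    for profession, values in amounts.items():
        if len(users[profession]) < min_group_size:
            # With only a handful of contributors, the "average" can reveal
            # an individual's numbers, so it is suppressed.
            report[profession] = None
        else:
            report[profession] = mean(values)
    return report


# Made-up data: five designers with one invoice each, and a single architect.
invoices = (
    [(f"designer{i}", "designer", 100 * (i + 1)) for i in range(5)]
    + [("architect0", "architect", 9000)]
)
report = average_invoice_by_profession(invoices)
print(report["designer"])   # 300
print(report["architect"])  # None
```

The lone architect’s “average” would simply be that person’s invoice, which is why it is withheld here. A minimum group size alone is not a complete privacy story: repeated or overlapping queries can still leak individual values, so it should be read as a starting point rather than a guarantee.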

I’m very interested to hear if this sticks, and if their users are able to jump over the hurdle of giving up a little bit of privacy for a little bit of information. The relevancy of the data will presumably be a factor in continued participation.

What they should be doing:

  • Providing context about what’s missing: It is as important to understand who isn’t participating in providing data as it is to know who is.
  • Providing context about their users: It is as important to understand the demographics, circumstances and nature of the other participants as it is to know what the raw accounting numbers are. After all, do I, as a small-town consultant, really care what the big boys are charging on Madison Avenue?
  • Taking a lot of care with the aggregates such that some sort of data-release scandal doesn’t come and bite them.
  • Refraining from using their data for parasitic reasons which undermine the trust relationship they’re building with their users.
  • Providing a way for users to cleanly and completely end their participation in the data collection program.

While time will tell what happens with the execution of this effort, I am excited by the attempt: A business that collects data from their users and returns to them business intelligence, rather than handing over the customer relationships they built to the highest pay-per-click bidder.