Archive for November, 2006

How to evaluate a privacy statement when you’re dying of AIDS

Sunday, November 12th, 2006

Last week, I reported on Professor Sam Clark’s recent talk: “Relational Databases in the Social and Health Sciences: The View from Demography.” Clark covered a wide array of topics from the challenges of working with heterogeneous sets of social science field research to data-driven outcome-modeling that is used to drive policy decisions in the arena of AIDS/HIV prevention and treatment in Sub-Saharan Africa.

As I mentioned last week, surprisingly, privacy did not come up during Professor Clark’s talk…except in a brief aside, where Clark acknowledged that study subjects are at times uncomfortable disclosing extra-marital relationships. On the whole, privacy did not appear to be a taking up too many cycles at either INDEPTH, a network of ‘Demographic Surveillance Systems’ (DSS is social science-speak for data collection sites) that is working to standardize field research, or SPEHR, Clark’s personal effort to design a standard database schema for social science research. At the risk of being presumptuous, ‘Demographic Surveillance System‘ itself speaks volumes about how social science regards the issue of privacy.

At the same time, the frequent media alerts about privacy and data leaks (HP, AOL, Veterans) got me wondering: How would this data be handled in a US-based study? How readily would you respond to an online survey asking you how many times you’ve had unprotected sex?

Not very well would be my guess. Forget allowing someone to compile a detailed log of your day-to-day sexual activity. People would never even get past the first 2 questions: Are you HIV positive? Are you living with AIDS? The ramifications of leaking such information are all too well-known in modern society.

Just to make sure that I hadn’t misread the lack of emphasis, I rooted around the INDEPTH website to see if I could find a meatier discussion about privacy.

I found a reference to “A Data Model for Demographic Surveillance Systems“, a 21 page paper which makes it’s first and last mention of privacy on p.18 in its ‘Conclusions and Future Work’ section:

“More work is needed for sites that require better data privacy than simply restricting access to the data set. Certainly, separating the name from the ID field is the first step in providing better data privacy.”

I also found “Data access, security and confidentiality“, a 174-word document in the INDEPTH DSS Resource Toolkit that recommends 3 things to researchers designing data collection systems:

1. Be clear about who has access to the data, what data do they have access to, and what level of access should they have.
2. Back up the data. A RAID server is ideal.
3. Separate survey respondent ID numbers from their names.

These are all good recommendations that demonstrate a willingness to address the issue. But isn’t this oversimplification at best and gross negligence at worst? Granted, I may be unfair in singling out INDEPTH to play the role of spokesperson for the entire social science community on the topic of privacy. So maybe all I really can say is that, at the very least, the folks at INDEPTH are seriously underestimating the challenges of taking on guardianship of sensitive personal data. Like the researchers at AOL, we can only wait for the consequences of their mis-estimation to play out.

So again, the sense I get is that privacy isn’t a major issue. Why’s that?

INDEPTH’s users have more important things to worry about. They’re not scanning people’s email to sell mattress companies more targeted advertising. They’re trying to do things like save a continent from implosion.

According to UNAIDS, in 2005 alone an estimated 3.2 million people in Sub-Saharan Africa became newly infected, while 2.4 million adults and children died of AIDS. In the U.S., which has less than 40% of the population of Sub-Saharan African, if 1 million Americans were dying AIDS every year, we wouldn’t be talking about privacy either.

A second, more insidious reason is that this flavor of information privacy is largely an information-age phenomena, one that requires the individual to understand the implications and weigh the risks of disclosure.

Our ‘modern-day’ awareness, or wariness of disclosure did not come for free. Even with all of the media frenzy, people regularly compromise their personal information in myriad ways everyday: Chocolate bars for passwords.

Nevertheless, no matter how tenuous a grasp the public has on data and databases, the level of sophistication mainstream America has achieved in the realm of ‘things digital’ is not to be taken for granted.

It’s not a matter of intelligence or common sense. I’m guessing that the people who willingly participate in DSS such as INDEPTH don’t have a gut-level appreciation of what it means to be ‘in the system’ for the simple reason that they live in pre-digital or barely digital societies and aren’t kept track of in their daily existence the way we are.

They don’t log in, they don’t enter passwords, PIN numbers or secret codes. They don’t answer self-selected security questions, swipe key fobs, scan ID cards, metro cards, and medical insurance cards. They don’t accept certificates, add people to whitelists, report spam. They don’t make spreadsheets, tag pictures, maintain ‘address books’, query their email or for that matter, query the web. They don’t inspect the history in their web browser to delete all the URLs that might not be so great for other people to inadvertently stumble across. They’ve never had an application rejected because of ‘low’ test scores and ‘bad’ grades. They’ve never been denied insurance for having ‘above average’ blood pressure. They’ve never been denied a mortgage for having ‘below average’ credit. They’ve never been audited by the IRS or logged into Amazon to be confronted with “Here’s a recommendation just for you: Getting pregnant after Menopause!”.

In other words, the subjects in this study don’t necessarily have a clear conception of this thing called a database that is going to consume their personal life history, chop it up into discrete cells, array it in rows and columns, making it all the more digestible for aggregating, analyzing, comparing and accessible to on-demand recall. The question is, when a respondent ‘consents’ to ‘participate in a survey’, do they understand what they’re consenting to? Do the field researchers themselves understand what respondents are consenting to?

Even if respondents did fully understand what ‘consent’ really meant (which is highly doubtful given that most First World internet users don’t fully digest what it means to ‘Accept’ a EULA), there still remains the unresolved issue of whether dire circumstances (e.g. lots of people dying with no end in sight) warrant slackened attention to privacy.

Up Next: Who cares about Privacy: Why search queries in America trump sexual history in Africa.

When Privacy Doesn’t Matter

Friday, November 3rd, 2006

Last Thursday MSR hosted Professor Sam Clark from the University of Washington for a talk entitled “Relational Databases in the Social and Health Sciences: The View from Demography.” For someone interested in using data for driving decision-making, it was interesting to hear about someone using empirical data to model the impact of different policies on a societal problem.

My main take-aways from the talk were as follows:

  • Social scientists today rarely use relational database (RDBMS) technology, or when they do, they use antique software. Apparently much analysis is done in statistical packages (I’m guessing SAS, and the like), which apparently lack much of the data management technology that is indispensable when working with larger datasets. For social scientists in general, the potential of current database technologies is only just becoming apparent.
    • As I am not familiar with many of the alternatives, had I been physically at the talk on campus rather than watching on-line, I would have liked to clarify what has changed to make RDBMS more attractive than it was before. I can only surmise from the talk that large data sets have only recently become available to social scientists, and that previous data sets were too small to warrant the RDBMS.
    • Even now, Clark said his colleagues would categorize a “large” dataset to be around 500 Megabytes.
  • Many early attempts at moving demographic data to relational data structures failed because the impact of the schema design on the demographical data uses was underestimated by those developing such systems.
  • Breadth of data has significant value to the longitudinal studies social scientists are conducting. Yet, lack of agreement on how to collect and store data is hampering their ability to interrelate data sets. Therefore, developing and agreeing on a standard is very desirable. (Incidentally, this problem is not exclusive to demographic datasets.)
  • Clark has done several iterations on a standard schema, particularly for capturing “Event-Influence-State” type datasets, commonly used in demography.
    • The Structured Population Event History Register (SPEHR).
    • One example he shared with us assessed “the impact of male circumcision as an HIV prevention strategy”. By using a longitudinal study (2 years, 3,000 people) to feed his simulation, he was able to demonstrate likely outcomes of the policy intervention in different phases of the epidemic; data to feed a real-world policy decision. :)

He also shared lots of anecdotal statistics about the AIDS epidemic in Africa, massive infection rates and death rates, which I continue to find mind-boggling: What would day-to-day living look like in the U.S. if 20% of Americans were infected with HIV? Or if we suddenly had millions of “dual-orphans” (normally a rare phenomenon) to raise? How will Africa recover?

All in all a very interesting talk with food for thought on many fronts, but one issue was conspicuously missing: Privacy.

Up Next: How to evaluate a privacy statement when you’re dying of AIDS.