Posts Tagged ‘Databases’

Property Shark and “Contextual Integrity”: Where real estate obsession and privacy academia intersect

Tuesday, March 4th, 2008

Recently, I was having dinner with some friends when the topic of Property Shark came up. My friends, being homeowners, were disturbed that someone could simply go online, type in their address, and find out who the owners were and precisely what they had paid for it. One friend exclaimed, “I don’t want people to know how much money I have!” When I pointed out that the information was public record, and that before Property Shark, anyone could have gone down to City Hall and found the same information, he didn’t care. It still bothered him.

For all our talk of “privacy,” of how it’s being violated all over the place, of how it’s already lost, it’s not even clear what we mean when we say “privacy.” We, as a society, might have agreed that it is good public policy for real estate records to be public so that potential buyers can make sure sellers actually own the property they’re selling. Capitalism can’t thrive if you can’t be sure you own what you own. But when we theoretically made this agreement, we certainly didn’t imagine a world where “public” means available to anyone, anywhere, at any time. Professor Helen Nissenbaum, who recently presented at the DIMACS Data Privacy Workshop, has proposed that we think about “contextual integrity” rather than “privacy.” She argues that it’s more useful to consider what’s appropriate in each context rather than assuming there is a blanket “privacy” standard applicable to all situations.

That makes sense to me. My friend wasn’t arguing that the information shouldn’t be public record. Rather, he wasn’t comfortable with that information being accessed so easily online.

Personally, in the universe of privacy breaches, Property Shark doesn’t seem so problematic, but as the Common Datatrust Foundation works on privacy problems, it’s certainly helpful to remember that “privacy” doesn’t have a singular meaning. One of CDTF’s goals for this year is to create privacy standards for companies and other data collectors that acknowledge that information flow can’t just have an on/off, public/private spigot. It’s obvious that our world and our needs are more complex than that. After all, sometimes it’s hard to know even what we want when we clamor for more privacy. Even my friend, when pressed, admitted that the next time he was looking to buy a house, the first thing he would do is go to Property Shark.

CDTF’s Presentation at the Workshop on Data Privacy

Friday, February 22nd, 2008

The Common Datatrust Foundation recently attended and made a short presentation at the Workshop on Data Privacy, hosted by Rutgers University’s Center for Discrete Mathematics & Theoretical Computer Science (DIMACS).

There were spirited conversations across disciplines as statisticians, mathematicians, computer scientists, and media experts discussed how to balance the public’s interest in both privacy and information sharing. The presentations ranged from tutorials on new security and privacy technology to the management of existing databases of personal information, such as the U.S. Census, as well as thought-provoking presentations on more abstract but highly relevant questions, such as what we mean when we say we want to protect “privacy.” As Professor Helen Nissenbaum from NYU Law School pointed out, certain kinds of information flow are appropriate for certain situations; there is no uniform way to understand privacy protection.

We were excited to see how our presentation provoked questions and conversations as well. Alex Selkirk introduced the concept of a “datatrust,” a secure, structured data storage system where each record in each dataset has a set of rules defining who may use it, what it may be used for, and with what level of anonymity it may be disclosed. The presentation focused primarily on one example of the current limits of data disclosure: the subprime mortgage crisis. Although there is a great deal of data held by banks and mortgage companies on subprime loans, investigators and researchers are unable to analyze the data because the data holders are bound by confidentiality agreements to individual borrowers. CDTF proposed that a datatrust, as a third party, could use new technology to anonymize and aggregate the data in a way that would allow researchers to query the loan data without forcing the disclosure of identifying details about the borrowers. Such data-sharing would further CDTF’s mission to both protect individual privacy and encourage the sharing of information for the public good.
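To make the idea concrete, here is a minimal Python sketch of records that carry their own disclosure rules, with a query function that answers only when every contributing record permits the use. All of the names, rule fields, and thresholds below are my own illustration, not CDTF’s actual design:

```python
from dataclasses import dataclass

@dataclass
class AccessRule:
    """Per-record disclosure policy: who may use it, for what, at what anonymity level."""
    allowed_roles: set      # e.g. {"researcher", "auditor"}
    allowed_purposes: set   # e.g. {"aggregate-analysis"}
    min_aggregation: int    # record may only appear in groups of at least this size

@dataclass
class Record:
    fields: dict
    rule: AccessRule

def query_average(records, field_name, role, purpose):
    """Return an aggregate only over records whose rules permit this role and
    purpose, and only if the group is large enough for every contributing record."""
    usable = [r for r in records
              if role in r.rule.allowed_roles and purpose in r.rule.allowed_purposes]
    if not usable:
        return None
    # Enforce the strictest aggregation threshold among the contributing records.
    if len(usable) < max(r.rule.min_aggregation for r in usable):
        return None
    return sum(r.fields[field_name] for r in usable) / len(usable)
```

In the subprime example, a researcher could ask for average loan terms across thousands of borrowers and get an answer, while a request that would expose any individual borrower simply returns nothing.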

We hope that the conversation we began at DIMACS will continue to engage conference participants and others in the coming months.

How to evaluate a privacy statement when you’re dying of AIDS

Sunday, November 12th, 2006

Last week, I reported on Professor Sam Clark’s recent talk: “Relational Databases in the Social and Health Sciences: The View from Demography.” Clark covered a wide array of topics from the challenges of working with heterogeneous sets of social science field research to data-driven outcome-modeling that is used to drive policy decisions in the arena of AIDS/HIV prevention and treatment in Sub-Saharan Africa.

As I mentioned last week, surprisingly, privacy did not come up during Professor Clark’s talk…except in a brief aside, where Clark acknowledged that study subjects are at times uncomfortable disclosing extra-marital relationships. On the whole, privacy did not appear to be taking up many cycles at either INDEPTH, a network of ‘Demographic Surveillance Systems’ (DSS is social science-speak for data collection sites) that is working to standardize field research, or SPEHR, Clark’s personal effort to design a standard database schema for social science research. At the risk of being presumptuous, the name ‘Demographic Surveillance System’ itself speaks volumes about how social science regards the issue of privacy.

At the same time, the frequent media alerts about privacy and data leaks (HP, AOL, Veterans) got me wondering: How would this data be handled in a US-based study? How readily would you respond to an online survey asking you how many times you’ve had unprotected sex?

Not very well would be my guess. Forget allowing someone to compile a detailed log of your day-to-day sexual activity. People would never even get past the first 2 questions: Are you HIV positive? Are you living with AIDS? The ramifications of leaking such information are all too well-known in modern society.

Just to make sure that I hadn’t misread the lack of emphasis, I rooted around the INDEPTH website to see if I could find a meatier discussion about privacy.

I found a reference to “A Data Model for Demographic Surveillance Systems”, a 21-page paper that makes its first and last mention of privacy on p. 18, in its ‘Conclusions and Future Work’ section:

“More work is needed for sites that require better data privacy than simply restricting access to the data set. Certainly, separating the name from the ID field is the first step in providing better data privacy.”

I also found “Data access, security and confidentiality”, a 174-word document in the INDEPTH DSS Resource Toolkit that recommends 3 things to researchers designing data collection systems:

1. Be clear about who has access to the data, what data they have access to, and what level of access they should have.
2. Back up the data. A RAID server is ideal.
3. Separate survey respondent ID numbers from their names.
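Recommendation 3 is the easiest to make concrete. Here is a minimal sketch in Python with SQLite of keeping names in one store and survey answers, keyed only by an opaque ID, in another; the table names and two-store layout are my own illustration, not INDEPTH’s actual design:

```python
import sqlite3

# Identity store and survey store are kept separate, so sensitive survey
# answers never sit in the same database as names.
identities = sqlite3.connect(":memory:")  # in practice: a separate, locked-down server
surveys = sqlite3.connect(":memory:")

identities.execute("CREATE TABLE respondent (id INTEGER PRIMARY KEY, name TEXT)")
surveys.execute("CREATE TABLE response (respondent_id INTEGER, hiv_status TEXT)")

# Enrollment writes the name to one store and only the opaque ID to the other.
identities.execute("INSERT INTO respondent VALUES (?, ?)", (1017, "A. Person"))
surveys.execute("INSERT INTO response VALUES (?, ?)", (1017, "positive"))

# Analysts query the survey store alone; no name is recoverable from it
# without also compromising the identity store.
rows = surveys.execute("SELECT respondent_id, hiv_status FROM response").fetchall()
```

Of course, this only raises the bar: anyone who obtains both stores can re-link them, which is exactly why “separating the name from the ID field” is a first step rather than a solution.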

These are all good recommendations that demonstrate a willingness to address the issue. But isn’t this oversimplification at best and gross negligence at worst? Granted, I may be unfair in singling out INDEPTH to play the role of spokesperson for the entire social science community on the topic of privacy. So maybe all I really can say is that, at the very least, the folks at INDEPTH are seriously underestimating the challenges of taking on guardianship of sensitive personal data. As with the researchers at AOL, we can only wait for the consequences of their misestimation to play out.

So again, the sense I get is that privacy isn’t a major issue. Why’s that?

INDEPTH’s users have more important things to worry about. They’re not scanning people’s email to sell mattress companies more targeted advertising. They’re trying to do things like save a continent from implosion.

According to UNAIDS, in 2005 alone an estimated 3.2 million people in Sub-Saharan Africa became newly infected, while 2.4 million adults and children died of AIDS. In the U.S., which has less than 40% of the population of Sub-Saharan Africa, if 1 million Americans were dying of AIDS every year, we wouldn’t be talking about privacy either.

A second, more insidious reason is that this flavor of information privacy is largely an information-age phenomenon, one that requires the individual to understand the implications and weigh the risks of disclosure.

Our ‘modern-day’ awareness of, and wariness about, disclosure did not come for free. Even with all of the media frenzy, people regularly compromise their personal information in myriad ways every day: chocolate bars for passwords.

Nevertheless, no matter how tenuous a grasp the public has on data and databases, the level of sophistication mainstream America has achieved in the realm of ‘things digital’ is not to be taken for granted.

It’s not a matter of intelligence or common sense. I’m guessing that the people who willingly participate in a DSS such as INDEPTH’s don’t have a gut-level appreciation of what it means to be ‘in the system’ for the simple reason that they live in pre-digital or barely digital societies and aren’t kept track of in their daily existence the way we are.

They don’t log in, they don’t enter passwords, PIN numbers or secret codes. They don’t answer self-selected security questions, swipe key fobs, scan ID cards, metro cards, and medical insurance cards. They don’t accept certificates, add people to whitelists, report spam. They don’t make spreadsheets, tag pictures, maintain ‘address books’, query their email or for that matter, query the web. They don’t inspect the history in their web browser to delete all the URLs that might not be so great for other people to inadvertently stumble across. They’ve never had an application rejected because of ‘low’ test scores and ‘bad’ grades. They’ve never been denied insurance for having ‘above average’ blood pressure. They’ve never been denied a mortgage for having ‘below average’ credit. They’ve never been audited by the IRS or logged into Amazon to be confronted with “Here’s a recommendation just for you: Getting pregnant after Menopause!”.

In other words, the subjects in these studies don’t necessarily have a clear conception of this thing called a database that is going to consume their personal life history, chop it up into discrete cells, and array it in rows and columns, making it all the more digestible for aggregation, analysis, comparison, and on-demand recall. The question is, when a respondent ‘consents’ to ‘participate in a survey’, do they understand what they’re consenting to? Do the field researchers themselves understand what respondents are consenting to?

Even if respondents did fully understand what ‘consent’ really meant (which is highly doubtful given that most First World internet users don’t fully digest what it means to ‘Accept’ a EULA), there still remains the unresolved issue of whether dire circumstances (e.g. lots of people dying with no end in sight) warrant slackened attention to privacy.

Up Next: Who cares about Privacy: Why search queries in America trump sexual history in Africa.

When Privacy Doesn’t Matter

Friday, November 3rd, 2006

Last Thursday MSR hosted Professor Sam Clark from the University of Washington for a talk entitled “Relational Databases in the Social and Health Sciences: The View from Demography.” For someone interested in using data for driving decision-making, it was interesting to hear about someone using empirical data to model the impact of different policies on a societal problem.

My main take-aways from the talk were as follows:

  • Social scientists today rarely use relational database (RDBMS) technology, or when they do, they use antique software. Much analysis is apparently done in statistical packages (I’m guessing SAS and the like), which lack much of the data-management functionality that is indispensable when working with larger datasets. For social scientists in general, the potential of current database technologies is only just becoming apparent.
    • As I am not familiar with many of the alternatives, had I been physically at the talk on campus rather than watching online, I would have liked to clarify what has changed to make RDBMSs more attractive than they were before. I can only surmise from the talk that large datasets have only recently become available to social scientists, and that previous datasets were too small to warrant an RDBMS.
    • Even now, Clark said his colleagues would categorize a “large” dataset to be around 500 Megabytes.
  • Many early attempts at moving demographic data to relational data structures failed because those developing the systems underestimated how much the schema design would constrain the uses of the demographic data.
  • Breadth of data has significant value to the longitudinal studies social scientists are conducting. Yet, lack of agreement on how to collect and store data is hampering their ability to interrelate data sets. Therefore, developing and agreeing on a standard is very desirable. (Incidentally, this problem is not exclusive to demographic datasets.)
  • Clark has done several iterations on a standard schema, particularly for capturing “Event-Influence-State” type datasets, commonly used in demography.
    • The Structured Population Event History Register (SPEHR).
    • One example he shared with us assessed “the impact of male circumcision as an HIV prevention strategy”. By using a longitudinal study (2 years, 3,000 people) to feed his simulation, he was able to demonstrate likely outcomes of the policy intervention in different phases of the epidemic; data to feed a real-world policy decision. 🙂
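For the curious, here is a rough idea of what an “Event-Influence-State” layout might look like in SQL (via Python’s sqlite3). SPEHR’s real schema is far richer; the tables and columns below are purely my own illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE individual (individual_id INTEGER PRIMARY KEY);

-- An event is something observed at a point in time (a birth, a migration, a test).
CREATE TABLE event (
    event_id      INTEGER PRIMARY KEY,
    individual_id INTEGER REFERENCES individual(individual_id),
    event_type    TEXT,
    event_date    TEXT
);

-- A state holds over an interval; it is opened by one event and closed by another.
CREATE TABLE state (
    state_id      INTEGER PRIMARY KEY,
    individual_id INTEGER REFERENCES individual(individual_id),
    state_type    TEXT,
    start_event   INTEGER REFERENCES event(event_id),
    end_event     INTEGER REFERENCES event(event_id)  -- NULL while the state is open
);
""")

con.execute("INSERT INTO individual VALUES (1)")
con.execute("INSERT INTO event VALUES (10, 1, 'in-migration', '2004-06-01')")
con.execute("INSERT INTO state VALUES (100, 1, 'resident', 10, NULL)")

# A typical longitudinal question: who is currently under observation,
# i.e. who has an open 'resident' state?
open_residents = con.execute(
    "SELECT individual_id FROM state WHERE state_type = 'resident' AND end_event IS NULL"
).fetchall()
```

The appeal of the design is that the raw observations (events) are kept separate from the interpretations built on top of them (states), so the same field data can feed many different analyses.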

He also shared lots of anecdotal statistics about the AIDS epidemic in Africa, massive infection rates and death rates, which I continue to find mind-boggling: What would day-to-day living look like in the U.S. if 20% of Americans were infected with HIV? Or if we suddenly had millions of “dual-orphans” (normally a rare phenomenon) to raise? How will Africa recover?

All in all a very interesting talk with food for thought on many fronts, but one issue was conspicuously missing: Privacy.

Up Next: How to evaluate a privacy statement when you’re dying of AIDS.
