Posts Tagged ‘Data Mining’

@IAPP Privacy Foo Camp 2010: What Is Anonymous Enough?

Tuesday, October 26th, 2010

Editor’s Note: Becky Pezely is an independent contractor for Shan Gao Ma, a consulting company started by Alex Selkirk, President of the Board of the Common Data Project.  Becky’s work, like Tony’s, touches on many of the privacy challenges that CDP hopes to address with the datatrust.  We’re happy to have her guest blogging about IAPP Academy 2010 here.

Several weeks ago we attended the 2010 Global Privacy Summit (IAPP 2010) in Baltimore, Maryland.   

In addition to some engaging high-profile keynotes – including FTC Bureau of Consumer Protection Director David Vladeck – we got to participate in the first ever IAPP Foo Camp.

The Foo Camp comprised four discussion topics aimed at covering the top technology concerns facing a wide range of privacy professionals.

The session we ran was titled “Low Impact Data Mining”.  The intention was to discuss, and better understand, the current challenges in managing data within an organization, with a lens on managing data in a way that is “low impact” on resources while returning “high (positive) impact” for the business.

The individuals in our group represented a vast array of industries, including financial services, insurance, pharmaceuticals, law enforcement, online marketing, health care, retail and telecommunications.  It was fascinating that, even across such a wide range of industries, there was such a pervasive set of common privacy challenges.

Starting with:

What is “anonymous enough”?

If all it takes is gender, zip code and birthdate to re-identify someone, then what data, when released, is truly “anonymous enough”?  Can a baseline be defined, and enforced, within an organization that ensures customer protection?
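That question has a well-documented premise: Latanya Sweeney famously estimated that 87% of Americans are uniquely identified by just gender, ZIP code and date of birth.  One rough way to put a number on “anonymous enough” is to count how many records share each combination of those quasi-identifiers; a data set whose smallest group has k members is called k-anonymous.  Here is a minimal sketch of that check (the file and column names are hypothetical):

    # Count how many records share each (gender, ZIP, birthdate) combination.
    # A group of size 1 is a uniquely identifiable person.
    import pandas as pd

    df = pd.read_csv("customers.csv")  # hypothetical data extract

    quasi_identifiers = ["gender", "zip_code", "birthdate"]
    group_sizes = df.groupby(quasi_identifiers).size()

    k = group_sizes.min()              # the data set is k-anonymous for this k
    unique_rows = (group_sizes == 1).sum()
    print(f"k = {k}; {unique_rows} combinations match exactly one person")

A gatekeeper could at least refuse to release anything until k clears an agreed-upon floor, though, as the rest of this post shows, even that baseline erodes once data sets start getting joined.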

It feels safe to say that this was the root challenge from which all the others stemmed.  Today the release of data is mostly controlled, and subsequently managed, by one or more trusted people.  These individuals are responsible for “sanitizing” the data that gets released inside or outside the organization, and they manage releases that serve everything from understanding business performance to fulfilling business obligations with partners.  Their primary concern is how well they are protecting their customers’ information, not only from the perspective of company policy, but also from the perspective of personal morals.  They are the gatekeepers who assess the level of protection provided, based on which data they release to whom, and they want some guarantee that what they release is “anonymous enough”: a definition, or measurement, that guarantees the right level of anonymity for their customers.

This challenge compounds for these individuals, and their organizations, when you add in other truths about the nature of data today:

The silos are getting joined.

The convention used to be that data within an organization sat in a silo – all on its own and protected – such that anyone looking at the data would only see that set of data.  Now these data sets are getting joined, and it’s not always known where, when, how, or with whom a join originated.  Nor is it known where the joined data set is currently stored, since it was modified from its original silo.  Soon that joined data set takes on a life of its own and makes its way around the institution.  Given the likelihood of this occurring, how can the gatekeepers responsible for assessing the level of protection provided to customers do so with any kind of reliable measurement that guarantees the right level of anonymity?
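Here is a minimal sketch of why joins are so corrosive to sanitization: two silos that are each harmless on their own can re-identify people once joined on shared columns (the tables and column names are hypothetical):

    # Silo A: "sanitized" purchase history: names removed, demographics kept.
    # Silo B: a loyalty-program roster: names kept, no purchase data.
    import pandas as pd

    purchases = pd.read_csv("purchases_sanitized.csv")  # zip_code, birthdate, gender, item
    loyalty = pd.read_csv("loyalty_members.csv")        # zip_code, birthdate, gender, name

    # The join key is exactly the quasi-identifier triple from above.
    joined = purchases.merge(loyalty, on=["zip_code", "birthdate", "gender"])
    print(joined[["name", "item"]].head())  # names re-attached to purchases

Neither input table contained both a name and a purchase, yet the output does, and nothing in either silo’s release process would have flagged it.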

And now there’s data in the public market.

Not only is the data joined with data from other silos within the organization, but also with data from outside the organization, sold on the public market.  This has increased organizations’ ability to produce data that is “high impact” for the business – because they now know WAY MORE about their customers.  But does the benefit outweigh the liability?  As the ability to know more about individual customers increases, so does the level of sensitivity and the concern for privacy.  How do organizations successfully navigate mounting privacy concerns as they move from silos, to joined silos, to joined silos combined with public data?

The line between “data analytics” and looking at “raw data” is blurring.

Because the data is richer and more plentiful, the act of data analysis isn’t as benign as it might once have been.  The definition of “data analytics” has evolved from something high-level (knowing, for example, how many new customers used the service this quarter) to something that looks a lot more like examining raw data to target specific parts of the business to specific customers (for example, selling <these products> to customers who make <this much money>, are female, ages 30–35, live in <this neighborhood> and typically spend <this much> on <these types of products>).
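The slide from one to the other is just a matter of stacking filters, as in this minimal sketch (the columns and thresholds are hypothetical):

    import pandas as pd

    df = pd.read_csv("customers.csv")  # hypothetical data extract

    # Old-style aggregate: one number with thousands of people behind it.
    new_this_quarter = (df["signup_quarter"] == "2010Q3").sum()

    # New-style "analytics": every added filter shrinks the group.
    segment = df[
        (df["gender"] == "F")
        & df["age"].between(30, 35)
        & (df["neighborhood"] == "Maple Hill")
        & (df["annual_spend"] > 5_000)
    ]
    print(len(segment))  # once this prints 1 or 2, the "analysis" is raw data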

And the data has different ways of exiting the system.

The truth is, as scary as this data can be, everyone wants to get their hands on it, because the data leads to awareness that is meaningful and valuable for the business.  Thus, the data is shared everywhere, inside and outside the organization.  With that comes a whole set of challenges around all the ways data might be exiting any given “silo”: Where is all the data going?  How is it getting modified (joined, sanitized, rejoined), and at which point is it no longer data that needs to be protected by the organization?  How much data needs to be released externally to fulfill partner and customer business obligations?  Once the data has exited, can the organization’s privacy practices still be enforced?

Brand affects privacy policy.  Privacy policy affects brand.

Privacy is a concern of the whole business, not just the people who manage the data, nor solely the people who manage liability.  In the event of a “big oopsie,” a data or privacy breach, it is the communication with customers before, during and after the incident that determines the internal and external impact on the brand and the perception of the organization.  And that communication is dictated both by what the privacy policy enforces and by what the brand “allows.”  In today’s age of data, how can an organization have an open dialog with customers about their data if the brand does not support having that kind of conversation?  No surprise that Facebook is the exemplary case: Facebook continues to pave a new path, drawing customers in to share and disclose more information about themselves, and it has experienced the backlash when it takes things too far.  The line of communication is very open: customers have a clear way to lash back when Facebook has gone too far, and Facebook has a way of visibly standing behind its decision or admitting its mistake.  Either way, it is now commonplace for Facebook’s customers to expect more incidents like this, and to expect that Facebook has a way (apparently suitable enough to keep most customers) of dealing with them.  Facebook’s “policy” allowed it to respond this way, that response has become part of who Facebook is, and the policy now evolves to support this behavior moving forward.

In the discussion of data and privacy, it seems inherently obvious that the mountain of challenges we face is large, complicated and impacts the core of all our businesses.  Nonetheless, it is still fascinating to have witnessed first-hand, and to now be able to specifically articulate, how similar the challenges are across a diverse group of businesses and how similar the concerns are across job functions.

We want to thank, again, everyone from IAPP who joined in on the discussions we had at Foo Camp and throughout the conference.  We look forward to an opportunity to dive deep into these kinds of problems.

Post Script: Meanwhile, the challenges, and related questions, around anonymizing data with some kind of measurable privacy guarantee that came up at Foo Camp are ones we have been discussing on our blog for quite some time.  These are precisely the sorts of challenges that motivated us to create a datatrust.  While we typically envision the datatrust being used in scenarios where there isn’t direct access to data, we walked away from our discussions at IAPP Foo Camp with specific examples where direct access to the data is required, particularly to fulfill business obligations, as a type of collateral (or currency).

The concept of data as the new currency of today’s economy has emerged.  Not only did it come up at the IAPP Foo Camp, it also came up back in August, when we heard Marc Davis talk about it at IPP 2010.  With all of this in mind, it is interesting to evaluate the possibility of the datatrust acting as a special type of data broker in these exchanges.  The idea is that the datatrust is a sanctioned data broker (sanctioned by the industry, or possibly by the government) that inherently meets federal, local and municipal regulations and protects the consumers of business partners who want to exchange data as “currency,” while relieving businesses and their partners of the headaches of managing data use and reuse.  The “tax” on using the service is that these aggregates are stored and made available for the public to query in the way we imagine (no direct access to the data) for policy-making and research.  This is something that feels compelling to us and will influence our thinking as we continue to move forward with our work.
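For readers wondering what a measurable privacy guarantee could look like in practice, one formal candidate is differential privacy: the datatrust answers only aggregate queries, with calibrated noise added so that no single person’s presence changes an answer much.  A minimal sketch, with an illustrative epsilon rather than a real datatrust interface:

    import numpy as np

    def noisy_count(true_count: int, epsilon: float = 0.1) -> float:
        """A count query has sensitivity 1, so Laplace noise with scale
        1/epsilon yields epsilon-differential privacy."""
        return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

    # A querier never sees rows, only noisy aggregates:
    print(noisy_count(4213))  # large counts stay useful despite the noise
    print(noisy_count(2))     # small, identifying counts are swamped by it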

In the mix…EU data retention laws, Wikipedia growing

Friday, June 11th, 2010

1) Australia thinking about requiring ISPs to record browsing histories (via Truste).

Electronic Frontier Australia (EFA) chair Colin Jacobs said the regime was “a step too far”.

“At some point data retention laws can be reasonable, but highly-personal information such as browsing history is a step too far,” Jacobs said. “You can’t treat everybody like a criminal. That would be like tapping people’s phones before they are suspected of doing any crime.”

Sounds shocking, but the EU already requires it.

2) European privacy officials are pointing out that Microsoft, Google and Yahoo’s methods of “anonymization” are not good enough to comply with EU requirements (via EFF).  As we’ve been saying for a while, “anonymization” is not a very precise claim.  (Even though they also want ISPs to retain browsing histories for law enforcement. Confused? I am.)

3) Wikipedia is adding two new executive roles.  In the process of researching our community study, it really struck me how small Wikipedia‘s staff was compared to the staff of more centralized, less community-run businesses like Yelp and Facebook.  Having two more staff members is not a huge increase, but it does make me wonder, is a larger staff inevitable when an organization tries to assert more editorial control over what the community produces?

In the mix

Monday, April 5th, 2010

1) Slate had an interesting take on the bullying story in Massachusetts and the prosecutor’s anger at Facebook for not providing information, i.e., evidence of the bullying.  Apparently, Facebook provided basic subscriber information but resisted providing more without a search warrant.  Emily Bazelon points out that this area of law is murky and references the coalition forming around reforming the Electronic Communications Privacy Act, but her larger point is an extra-legal one.  The evidence of bullying the DA was looking for was at one point public, even if eventually deleted.  Kids or parents could take screenshots and preserve that evidence themselves, though Bazelon acknowledges it may be hard for people who are upset to have the presence of mind to do so.

The case raises a lot of interesting questions about anonymity, privacy, and the values we have online.  Anonymity on the Internet has been a rallying cry for so many people, but I wonder, if something is illegal in the offline world, should it suddenly be legal online because you can be anonymous and avoid prosecution?  (Sexual harassment is a crime in the subway, too!)  We now live in a world where many of us occupy space both online and offline.  We used to think of them as completely separate spaces, and it’s true that the Internet gives us opportunities to do things, both good and bad, that we wouldn’t have offline.  But it’s increasingly obvious that we need to transfer some of the rules we have about the offline world into the online one.  For disability rights advocates, that includes pushing the definition of “public accommodation” to include online stores like Target, and suing them if their sites are not accessible to the blind using screen readers.  For privacy advocates, that includes acknowledging that people have an expectation of privacy in their emails as well as their snail mail.  Free speech in the offline world doesn’t mean you can say anything you want anywhere you want.  Maybe it’s time to be more nuanced about how we protect free speech online as well.

2) It turns out Twitter is pretty good at predicting box office returns — what else might it predict?

3) Cases like this amaze me, because the parties are litigating a question that seems like a no-brainer.  A New Jersey court recently held that an employee had an expectation of privacy in her personal Yahoo account, even if she accessed it on a company computer.  Would we ever litigate whether an employee had an expectation of privacy in a piece of personal mail she brought to the office and decided to read at her desk?

4) The New York Times is acknowledging its readers’ online comments in separate articles, namely this one describing readers’ reactions to federal mortgage aid.  It’s a smart way to give online readers a sense that their comments are being read, and I wonder if this is where the “Letters to the Editor” page is going.  I’ve been wondering who these readers are who are so happy to be the 136th comment on an article, but then, the people who write letters to the editor have always been people with extra time and energy.  In a way, online comments expand the world of people who are willing to write a letter to the editor.

5) Would we feel differently about government data mining if the government were better at it? Mimi and I went to a talk at the NYU Colloquium on Information Technology and Society where Joel Reidenberg, a law professor at Fordham, talked about how transparency of personal information online is eroding the rule of law.  One of the arguments he made against government data mining was that it doesn’t work, with the example of airport security, its inability to stop the underwear bomber, and its terribly inaccurate no-fly lists.  Well, the Obama administration just announced a new system of airport security checks that uses intelligence-based data mining that is meant to be more targeted.  It’s hard to know now whether the new system will be better and smarter, but it raises a point those opposed to data mining don’t seem to consider — what if the government were better at it?  Could data mining be so precise that it avoids racial profiling?  Are there other dangers to consider, and can they be warded off without shutting down data mining altogether?

Fighting fire with fire, data with data

Tuesday, October 28th, 2008

We all know that companies buy and sell databases of information about us.  They use it to make marketing decisions, including how to identify people who are looking to refinance their homes.  There are certainly interesting questions these practices raise about personal information and privacy, but more immediately, as this article makes clear, data mining can have real-life, adverse and sometimes devastating consequences for ordinary people.

For example, when Mercurion Suladdin, a county librarian in Utah, filled out an application to refinance her home at Ameriquest, “she quickly got a call from a salesman at Beneficial, a division of HSBC bank where she had taken out a previous loan.”

The salesman said he desperately wanted to keep her business. To get the deal, he drove to her house from nearby Salt Lake City and offered her a free Ford Taurus at signing.

What she thought was a fixed-interest rate mortgage soon adjusted upward, and Ms. Suladdin fell behind on her payments and came close to foreclosure before Utah’s attorney general and the activist group Acorn interceded on behalf of her and other homeowners in the state.

“I was being bombarded by so many offers that, after a while, it just got more and more confusing,” she says of her ill-fated decision not to carefully read the fine print on her loan documents.

HSBC knew she was a good target because of products offered by data vendors, like “mortgage triggers” and “TargetPoint Predictive Triggers” from Equifax, which “advertises ‘advanced profiling techniques’ to identify people who show a ‘statistical propensity to acquire new credit’ within 90 days.”

Data brokers and lenders defend themselves by saying they are offering a service similar to a second opinion from a doctor.  The analogy is more telling than they intend: a good doctor would never recommend an alternative treatment without going over all the risks and possible side effects.  The question isn’t whether it was right to give her options; the question is whether she had enough information to weigh those options, especially in comparison to all the information HSBC had about her.

The Internet Age allows all of us to have so much information all the time, but the kind of power that comes with data mining and data analysis tools belongs overwhelmingly to banks, search engine companies, and insurance companies, and not at all to individuals and consumer protection organizations.  Sure, Ms. Suladdin could have tried to Google “HSBC Beneficial risks,” but that pales in comparison to what HSBC Beneficial was able to do with her data.

Imagine if she had had data analytic tools similar to her bank’s.  I’m not talking about the ability to go to some online forum to ask questions about mortgages.  What if she had been able to access a database that would tell her what kind of mortgages others like her had received?  Or what was statistically likely to happen to interest rates?  Or a combination of these factors?
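As a minimal sketch of what such a tool might look like, imagine letting a borrower query a pool of loan records for what people like her actually received (the file and column names are hypothetical; public Home Mortgage Disclosure Act filings would be one real-world starting point):

    import pandas as pd

    loans = pd.read_csv("mortgage_records.csv")  # hypothetical pooled records

    like_me = loans[
        (loans["state"] == "UT")
        & (loans["income_band"] == "40k-60k")
        & (loans["loan_purpose"] == "refinance")
    ]

    print(like_me["rate_type"].value_counts(normalize=True))  # fixed vs. adjustable
    print(like_me["initial_rate"].median())                   # the typical rate offered

One query like this could have told Ms. Suladdin whether the rate she was offered was typical for a refinance or a red flag.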

Or think about all the nonprofit organizations working hard to deal with the subprime mortgage crisis and the rapidly increasing rate of foreclosures.  Like many nonprofits that offer a service, their first step is to identify the people who need their help and notify them that these services exist, which is not very different from what a business has to do in its marketing strategy.  Why shouldn’t nonprofits have access to the same data as the companies that offered those mortgages in the first place?

As anyone who has read this blog knows, we don’t believe in fighting bad uses of data by shutting down data collection altogether.  Let’s fight fire with fire, data with data.

Politics and Privacy, Part II

Thursday, October 2nd, 2008

Last week, I wrote about how political data collection has shown that gathering data doesn’t have to be a completely one-way street; it can involve individuals’ active, sometimes almost enthusiastic, participation.  Part of the enthusiasm comes from a belief that this is what democracy is about: we have the right to try to persuade our fellow citizens, whether from a soap box in the town square or by calling a voter list through a phone bank.  But the data collection done by political campaigns encompasses a lot more than name, occupation, and email address.  Karl Rove revolutionized it with his famous use of consumer preferences to identify and target likely Republican voters, and the Democrats have worked hard to catch up, Catalist being one of the big players in this effort.  It’s one thing to compile donor lists; it’s another to cross-reference “beer versus wine” preferences against voter lists.  How is democracy affected by intense, data-based voter profiling?

As Solon Barocas pointed out during his talk on voter profiling at the recent DIMACS workshop, researchers have found that micro-targeting voters can increase polarization and divisiveness.  As candidates are able to air one radio ad for the Latino voters in one state and a different one for the white voters in another, they’re able to espouse more extreme positions than they would if forced to appeal to a more general audience.

If true, this is a serious problem.  But I like to believe that in the long run, and done right, political data collection and analysis could actually enable new kinds of consensus and coalition-building.  For one, in an era where blogs monitor political campaigns hour-by-hour, a local radio ad can be made available to a national audience no matter which micro-audience was originally targeted.  (Update: we can even find out about “telephone” calls to the deaf community!)

But more importantly, I can imagine that if voters, and not just campaigns, were able to see who else felt the way they did on major issues, many might be surprised.  Solon mentioned that despite the headlines, the algorithms by which likely Democratic or Republican voters are identified are not as simple as beer = conservative, wine = liberal.  Yes, campaigns believe they can figure out who in a community might lean in their direction, but it’s a much more complicated calculation.
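To give a feel for what “more complicated” means, here is a toy scoring sketch in the spirit of those models; every feature, weight and cutoff below is invented for illustration:

    # A voter's partisan-lean score as a weighted sum of many weak signals.
    # No single signal (beer, wine, magazines) decides the label on its own.
    features = {
        "subscribes_outdoor_magazine": 1,
        "urban_zip": 0,
        "donated_to_any_campaign": 1,
        "household_has_union_member": 0,
        "buys_organic_groceries": 1,
    }

    weights = {
        "subscribes_outdoor_magazine": -0.4,
        "urban_zip": 0.9,
        "donated_to_any_campaign": 0.3,
        "household_has_union_member": 0.8,
        "buys_organic_groceries": 0.6,
    }

    score = sum(weights[f] * value for f, value in features.items())
    print(f"lean score: {score:+.1f}")  # only the combination is predictive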

So if people chose to share and know who else felt similarly, in ways that were more fine-grained than national polls, really interesting things could happen to our political discourse.  The Left Coast environmentalist might learn that the hunter in South Dakota shares a commitment to conservation.  The pro-choice atheist and the pro-life Catholic might learn they both oppose the death penalty.  I’m not advocating that we throw open the curtains on the voting booth.  But knowing how our fellow citizens feel about the issues facing all of us almost sounds like that old-fashioned American democratic institution, the town hall meeting.

After all, democracy is the ultimate social activity.  We’re supposed to be making decisions together.

