Archive for the ‘Protecting Privacy in Meaningful Ways’ Category

In the mix…data for coupons, information literacy, most-visited sites

Friday, June 4th, 2010

1) There’s obviously an increasing move to a model of data collection in which the company says, “give us your data and get something in return,” a quid pro quo.  But as Marc Rotenberg at EPIC points out,

The big problem is that these business models are not very stable. Companies set out privacy policies, consumers disclose data, and then the action begins…The business model changes. The companies simply want the data, and the consumer benefit disappears.

It’s not enough to start with compensating consumers for their data.  The persistent, shareable nature of data makes it very different from a transaction involving money, where someone can buy, walk away, and never interact with the company again.  These data-centered companies are creating a network of users whose data are continually used in the business.  Maybe it’s time for a new model of business, where governance plans incorporate ways for users to be involved in decisions about their data.

2) In a related vein, danah boyd argues that transparency should not be an end in itself, and that information literacy needs to developed in conjunction with information access.  A similar argument can be made about the concept of privacy.  In “real life” (i.e., offline life), no one aims for total privacy.  Everyday, we make decisions about what we want to share with whom.  Online, total privacy and “anonymization” are also impossible, no matter the company promises in its privacy policy.  For our datatrust, we’re going to use PINQ, a technology using differential privacy, that acknowledges privacy is not binary, but something one spends.  So perhaps we’ll need to work on privacy and data literacy as well?

3) Google recently released a list of the most visited sites on the Internet. Two questions jump out: a) Where is Google on this list? and b) Could the list be a proxy for the biggest data collectors online?

Mark Zuckerberg: It takes a village to build trust.

Friday, June 4th, 2010

This whole brouhaha over Facebook privacy appears to be stuck revolving around Mark Zuckerberg.

We seem to be stuck in a personal tug-of-war with the CEO of Facebook frustrated that a 26 year-old personally has so much power over so many.

Meanwhile, Mark Z. is personally reassuring us that we can trust Facebook which on some level implies we must trust him.

But should any single individual really be entrusted with so much? Especially “a 26 year-old nervous, sweaty guy who dodges the questions.” Harsh, but not a completely invalid point.

As users of Facebook, we all know that it is the content of all our lives and our relationships to each other that make Facebook special. As a result, we feel a sense of entitlement about Facebook policy-making that we don’t feel about services that are in many ways way more intrusive and/or less disciplined about protecting privacy (e.g. ISPs, cellphone providers, search).

Another way of putting it is, Facebook is not Apple! and as a result, needs a CEO who is a community leader, not a dictator of cool.

So we start asking questions like, why should Facebook make the big bucks at the expense of my privacy? Shouldn’t I get a piece of that?

(Google’s been doing this for over a decade now, but the privacy exposure over at Google is invisible to the end-user.)

At some point, will we decide we would rather pay for a service than feel like we’re being manipulated by companies who know more about us than we do and can decide whether to use that information to help us or hurt us depending on profit margin. Here’s another example.

Or are there other ways to counterbalance the corporate monopoly on personal information? We think so.

In order for us to trust Facebook, Facebook needs to stop feeling like a benevolent dictatorship, albeit one open to feedback, but also one with a dictator who looks like he’s in need of a regent.

Instead Facebook the company should consider adopting some significant community-driven governance reforms that will at least give it the patina of a democracy.

(Even if at the end of the day, it is beholden to its owners and investors.

For some context, this was the sum total of what Mark Z. had to say about how important decisions are made at Facebook:

We’re a company where there’s a lot of open dialogue. We have crazy dialogue and arguments. Every Friday, I have an open Q&A where people can come and ask me whatever questions they want. We try to do what we think is right, but we also listen to feedback and use it to improve. And we look at data about how people are using the site. In response to the most recent changes we made, we innovated, we did what we thought was right about the defaults, and then we listened to the feedback and then we holed up for two weeks to crank out a new privacy system.

Nothing outrageous. About par for your average web service. (But then again, Facebook isn’t your average web service.)

However, this is what should have been the meat of the discussion about how Facebook is going to address privacy concerns: community agency and decision-making, not Mark Z.’s personal vision of an interwebs brimming with serendipitous happenings.

Facebook the organization needs to be trusted. So it might be best if Mark Z. backed out of the limelight and stopped being the lone face of Facebook.

How might have that D8 interview have turned out if he had come on stage with a small group of Facebook users?

What governance changes would make you feel more empowered as a Facebook user?

Measuring the privacy cost of “free” services.

Wednesday, June 2nd, 2010

There was an interesting pair of pieces on this Sunday’s “On The Media.”

The first was “The Cost of Privacy,” a discussion of Facebook’s new privacy settings, which presumably makes it easier for users to clamp down on what’s shared.

A few points that resonated with us:

  1. Privacy is a commodity we all trade for things we want (e.g. celebrity, discounts, free online services).
  2. Going down the path of having us all set privacy controls everywhere we go on internet is impractical and unsustainable.
  3. If no one is willing to share their data, most of the services we love to get for free would disappear. Randall Rothenberg.
  4. The services collecting and using data don’t really care about you the individual, they only care about trends and aggregates. Dr. Paul H. Rubin.

We wish one of the interviewees had gone even farther to make the point that since we all make decisions every day to trade a little bit of privacy in exchange for services, privacy policies really need to be built around notions of buying and paying where what you “buy” are services and how you pay for them are with “units” of privacy risk (as in risk of exposure).

  1. Here’s what you get in exchange for letting us collect data about you.”
  2. Here’s the privacy cost of what you’re getting (in meaningful and quantifiable terms).

(And no, we don’t believe that deleting data after 6 months and/or listing out all the ways your data will be used is an acceptable proxy for calculating “privacy cost.” Besides, such policies inevitably severely limit the utility of data and stifle innovation to boot.)

Gaining clarity around privacy cost is exactly where we’re headed with the datatrust. What’s going to make our privacy policy stand out is not that our privacy “guarantee” will be 100% ironclad.

We can’t guarantee total anonymity. No one can. Instead, what we’re offering is an actual way to “quantify” privacy risk so that we can track and measure the cost of each use of your data and we can “guarantee” that we will never use more than the amount you agreed to.

This in turn is what will allow us to make some measurable guarantees around the “maximum amount of privacy risk” you will be exposed to by having your data in the datatrust.


The second segment on privacy rights and issues of due process vis-a-vis the government and data-mining.

Kevin Bankston from EFF gave a good run-down how ECPA is laughably ill-equipped to protect individuals using modern-day online services from unprincipled government intrusions.

One point that wasn’t made was that unlike search and seizure of physical property, the privacy impact of data-mining is easily several orders of magnitude greater. Like most things in the digital realm, it’s incredibly easy to sift through hundreds of thousands of user accounts whereas it would be impossibly onerous to search 100,000 homes or read 100,000 paper files.

This is why we disagree with the idea that we should apply old standards created for a physical world to the new realities of the digital one.

Instead, we need to look at actual harm and define new standards around limiting the privacy impact of investigative data-mining.

Again, this would require a quantitative approach to measuring privacy risk.

(Just to be clear, I’m not suggesting that we limit the size of the datasets being mined, that would defeat the purpose of data-mining. Rather, I’m talking about process guidelines for how to go about doing low-(privacy) impact data-mining. More to come on this topic.)

Recap and Proposal: 95/5, The Statistically Insignificant Privacy Guarantee

Wednesday, May 26th, 2010


Image from: From MATLAB fuzzy logic toolbox documentation.

In our search for a privacy guarantee that is both measurable and meaningful to the general public, we’ve traveled a long way in and out of the nuances of PINQ and differential privacy: A relatively new, quantitative approach to protecting privacy. Here’s a short summary of where we’ve been followed by a proposal built around the notion of statistical significance for where we might want to go.

The “Differential Privacy” Privacy Guarantee

Differential privacy guarantees that no matter what questions are asked and how answers to those questions are crossed with outside data, your individual record will remain “almost indiscernible” in a data set protected by differential privacy. (The corollary to that is that the impact of your individual record on the answers given out by differential privacy will be “negligeable.”)

For a “quantitative” approach to protecting privacy, the differential privacy guarantee is remarkably NOT quantitative.

So I began by proposing the idea that the probability of a single record being present in a data set should equal the probability of that single record not being present in that data set (50/50).

I introduced the idea of worst-case scenario where a nosy neighbor asks a pointed question that essentially reduces to a “Yes or no? Is my neighbor in this data set?” sort of question and I proposed that the nosy neighbor should get an equivocal (50/50) answer: “Maybe yes, but then again, (equally) maybe no.”

(In other words, “almost indiscernible” is hard to quantify. But completely indiscernible is easy to quantify.)

We took this 50/50 definition and tried to bring it to bear on the reality of how differential privacy applies noise to “real answers” to produce identity-obfuscating “noisy aswers.”

I quickly discovered that no matter what, differential privacy’s noisy answers always imply that one answer is more likely than another.

My latest post was a last gasp explaining why there really is no way to deliver on the completely invisible, completely non-discernible 50/50 privacy guarantee (even if we abandoned Laplace).

(But I haven’t given up on quantifying the privacy guarantee.)

Now we’re looking at statistical significance as a way to draw a quantitative boundary around a differential privacy guarantee.

Below is a proposal that we’re looking for feedback on. We’re also curious to know if anyone else tried to come up with a way to quantify the differential privacy guarantee?

What is Statistical Significance? Is it appropriate for our privacy guarantee?

In statistics, a result is called statistically significant if it is unlikely to have occurred by chance. Applied to our privacy guarantee, you might ask the question this way: When you get an answer about a protected data set, are the implications of that “differentially private” answer (as in implications about what the “real answer” might be) significant or are they simply the product of chance?

Is this an appropriate way to define a quantifiable privacy guarantee, we’re not sure.

Thought Experiment: Tossing a Weighted Coin

You have a coin. You know that one side is heavier than the other side. You have only 1 chance to spin the coin and draw a conclusion about which side is heavier.

At what weight distribution split does the result of that 1 coin spin start to be statistically significant?

Well, if you take the “conventional” definition of statistical significance where results start to be statistically significant when you have less than a 5% chance of being wrong, the boundary in our weighted coin example would be 95/5 where 95% of the weight is on one side of the coin and 5% is on the other.

What does this have to do with differential privacy?

Mapped onto differential privacy, the weight distribution split is the moral equivalent of the probability split between two possible “real answers.”

The 1 coin toss is the moral equivalent of being able to ask 1 question of the data set.

With a sample size of 1 question, the probability split between two possible, adjacent “real answers” would need to be at least 95/5 before the result of that 1 question was statistically significant.

That in turn means that at 95/5, the presence or absence of a single individual’s record in a data set won’t have a statistically significant impact on the noisy answer given out through differential privacy.

(Still 95% certainty doesn’t sound very good.)

Postscript Obviously, we don’t want to be a situation where asking just 1 question of a data set brings it to the brink of violating the privacy guarantee. However, thinking in terms of 1 question is helpful way to figure out the “total” amount of privacy risk the system can tolerate. And since the whole point of differential privacy is that it offers a quantitative way to track privacy risk, we can take that “total” amount and divide it by the number of questions we want to be able to dole out per data set and arrive at a per-question risk threshold.

Really? 50/50 privacy guarantee is truly impossible?

Monday, May 24th, 2010

At the end of my last post, we came to the rather sad conclusion that as far as differential privacy is concerned, it is not possible to offer a 50/50, “you might as well not be in the data set” privacy guarantee because, well, the Laplace distribution curves used to apply identity-obfuscating noise in differential privacy are too…curvy.

No matter how much noise you add, answers you get out of differential privacy will always imply that one number is more likely to be the “real answer” than another. (Which as we know from our “nosy-neighbor-worst-case-scenario,” can translate into revealing the presence of an individual in a data set: The very thing differential privacy is supposed to protect against.)

Still, “50/50 is impossible” is predicated on the nature of the Laplace curves. What would happen if we got rid of them? Are there any viable alternatives?

Apparently, no. 50/50 truly is impossible.

There are a few ways to understand why and how.

The first is a mental sleight of hand. A 50/50 guarantee is impossible because that would mean that the presence of an individual’s data literally has ZERO impact on the answers given out by PINQ, which would effectively cancel out differential privacy’s ability to provide more or less accurate answers.

Back to our worst-case scenario, in a 50/50 world, a PINQ answer of 3.7 would not only equally imply that the real answer was 0 as that it was 1, it would also equally imply that the real answer was 8, as that it was 18K or 18MM. Differential privacy answers would effectively be completely meaningless.

Graphically speaking, to get 50/50, the currently pointy noise distribution curves would have to be perfectly horizontal, stretching out to infinity in both directions on the number line.

What about a bounded flat curve?

(If pressed, this is probably the way most people would understand what is meant when someone says an answer has a noise level or margin of error of +/-50.)

Well, if you were to apply noise with a rectangular curve, in our worst-case scenario, with +/-50 noise, there would be a 1 in 100 chance that you get an answer that definitively tells you the real answer.

If the real answer is 0, with a rectangular noise level +/- 50 would yield answers from -50 to +50.

If the real answer is 1, a rectangular noise level +/-50 would yield answers from -49 to +51.

If you get a PINQ answer of 37, you’re set. It’s equally likely that the answer is 0 as that the answer is 1. 50/50 achieved.

If you get a PINQ answer of 51, well you’ll know for sure that the real answer is 1, not 0. And there’s a 1 in a 100 chance that you’ll get an answer of 51.

Meaning there’s a 1% chance that in the worst-case scenario you’ll get 100% “smoking gun” confirmation of that someone is definitely present in a data set.

As it turns out, rectangular curves are a lot dumber than those pointy Laplace things because they don’t have asymptotes to plant a nagging seed of doubt. In PINQ, all noise distribution curves have an asymptote of zero (as in zero likelihood of being chosen as a noisy answer).

In plain English, that means that every number on the real number line has a chance (no matter how tiny) of being chosen as a noisy answer, no matter what the “real answer” is. In other words, there are no “smoking guns.”

So now we’re back to where we left off in our last post, trying to pick an arbitrary arbitrary probability split for our privacy guarantee.

Or maybe not. Could statistical significance come and save the day?

Could we quantify our privacy guarantee by saying that the presence or absence of a single record will not affect the answers we give out to a statistically significant degree?

In the mix…DNA testing for college kids, Germany trying to get illegally gathered Google data, and the EFF’s privacy bill of rights for social networks

Friday, May 21st, 2010

1) UC Berkeley’s incoming class will all get DNA tests to identify genes that show how well you metabolize alcohol, lactose, and folates. “After the genetic testing, the university will offer a campuswide lecture by Mr. Rine about the three genetic markers, along with other lectures and panels with philosophers, ethicists, biologists and statisticians exploring the benefits and risks of personal genomics.”

Obviously, genetic testing is not something to take lightly, but the objections quoted sounded a little paternalistic. For example, “They may think these are noncontroversial genes, but there’s nothing noncontroversial about alcohol on campus,” said George Annas, a bioethicist at the Boston University School of Public Health. “What if someone tests negative, and they don’t have the marker, so they think that means they can drink more? Like all genetic information, it’s potentially harmful.”

Isn’t this the reasoning of people who preach abstinence-only sex education?

2) Google recently admitted they were collecting wifi information during their Streetview runs.  Germany’s reaction? To ask for the data so they can see if there’s reason to charge Google criminally.  I don’t understand this.  Private information is collected illegally so it should just be handed over to the government?  Are there useful ways to review this data and identify potential illegalities without handing the raw data over to the government?  Another example of why we can’t rest on our laurels — we need to find new ways to look at private data.

3) EFF issued a privacy bill of rights for social network users.  Short and simple.  It’s gotten me thinking, though, about what it means that we’re demanding rights from a private company. Not to get all Rand Paul on people (I really believe in the Civil Rights Act, all of it), but users’ frustrations with Facebook and their unwillingness to actually leave makes clear that the service Facebook is offering is not just a service provided to just a customer.  danah boyd has a suggestion — let’s think of Facebook as a utility and regulate it the way we regulate electric, water, and other similar utilities.

In the mix…Linkedin v. Facebook, online identities, and diversity in online communities

Friday, May 14th, 2010

1) Is Linkedin better than Facebook with privacy? I’m not sure this is the right question to ask. I’m also not sure the measures Cline uses to evaluate “better privacy” get to the heart of the problem.  The existence of a privacy seal of approval, the level of detail in the privacy policy, the employment of certified privacy professionals … none of these factors address what users are struggling to understand, that is, what’s happening to their information.  73% of adult Facebook users think they only share content with friends, but only 42% have customized their privacy settings.

Ultimately, Linkedin and Facebook are apples to oranges.  As Cline points out himself, people on Linkedin are in a purely professional setting.  People who share information on Linkedin do so for a specific, limited purpose — to promote themselves professionally.  In contrast, people on Facebook have to navigate being friends with parents, kids, co-workers, college buddies, and acquaintances.  Every decision to share information is much more complicated — who will see it, what will they think, how will it reflect on the user?  Facebook’s constant changes to how user information makes these decisions even more complicated — who can keep track?

In this sense, Linkedin is definitely easier to use.  If privacy is about control, then Linkedin is definitely easier to control.  But does this mean something like Facebook, where people share in a more generally social context, will always be impossible to navigate?

2) Mark Zuckerberg thinks everyone should have a single identity (via Michael Zimmer).  Well, that would certainly be one way to deal with it.

3) But most people, even the “tell-all” generation, don’t really want to go there.

4) In a not unrelated vein, Sunlight Labs has a new app that allows you to link data on campaign donations to people who email you through Gmail.  At least with regards to government transparency, Sunlight Labs seems to agree with Mark Zuckerberg.  I think information about who I’ve donated money to should be public (go ahead, look me up), but it does unnerve me a little to think that I could email someone on Craigslist about renting an apartment and have this information just pop up.  I don’t know, does the fact that it unnerves me mean that it’s wrong?  Maybe not.

5) Finally, a last bit on the diversity of online communitiesit may be more necessary than I claimed, though with a slightly different slant on diversity.  A new study found that the healthiest communities are “diverse” in that new members are constantly being added.  Although they were looking at chat rooms, which to me seems like the loosest form of community, the finding makes a lot of sense to me.  A breast cancer survivors’ forum may not care whether they have a lot of men, but they do need to attract new participants to stay vibrant.

In the mix…Everyone’s obsessed with Facebook

Friday, May 7th, 2010

UPDATE: One more Facebook-related bit, a great graphic illustrating how Facebook’s default sharing settings have changed over the past five years by Matt McKeon. Highly recommend that you click through and watch how the wheel changes.

1) I love when other people agree with me, especially on subjects like Facebook’s continuing clashes with privacy advocates. Says danah boyd,

Facebook started out with a strong promise of privacy…You had to be at a university or some network to sign up. That’s part of how it competed with other social networks, by being the anti-MySpace.

2) EFF has a striking post on the changes made to Facebook’s privacy policy over the last five years.

3) There’s a new app for people who are worried about Facebook having their data, but it means you have to hand it over to this company which also states, it “may use your info to serve up ads that target your interests.” Hmm.

4) Consumer Reports is worried that we’re oversharing, but if we followed all its tips on how to be safe, what would be the point of being on a social network? On its list of things we shouldn’t do:

  • Posting a child’s name in a caption
  • Mentioning being away from home
  • Letting yourself be found by a search engine

What’s the fun of Facebook if you can’t brag about the pina colada you’re drinking on the beach right at that moment? I’m joking, but this list just underscores that we can’t expect to control safety issues solely through consumer choices. Another thing we shouldn’t do is put our full birthdate on display, though given how many people put details about their education, it wouldn’t necessarily be hard to guess which year someone was born. Consumer Reports is clearly focusing on its job, warning consumers, but it’s increasingly obvious privacy is not just a matter of personal responsibility.

5) In a related vein, there’s an interesting Wall St. Journal article on whether the Internet is increasing public humiliation. One WSJ reader, Paul Cooper, had this to say:

The simple rule here is that one should always assume that everything one does will someday be made public. Behave accordingly. Don’t do or say things you don’t want reported or repeated. At least not where anyone can see or hear you doing it. Ask yourself whether you trust the person who wants to take nude pictures of you before you let them take the pictures. It is not society’s job to protect your reputation; it’s your job. If you choose to act like a buffoon, chances are someone is going to notice.

Like I said above, privacy in a world where the word “public” means really really public forever and ever, and “private” means whatever you manage to keep hidden from everyone you know, protecting “privacy” isn’t only a matter of personal responsibility. The Internet easily takes actions that are appropriate in certain contexts and republishes them in other contexts. People change, which is part of the fun of being human. Even if you’re not ashamed of your past, you may not want it following you around in persistent web form.

Perhaps on the bright side, we’ll get to a point where we can all agree everyone has done things that are embarrassing at some point and no one can walk around in self-righteous indignation. We’ve seen norms change elsewhere. When Bill Clinton was running for president, he felt compelled to say that he had smoked marijuana but had never inhaled. When Barack Obama ran for president 16 years later, he could say, “I inhaled–that was the point,” and no one blinked.

6) The draft of a federal online privacy bill has been released. In its comments, Truste notes, “The current draft language positions the traditional privacy policy as the go to standard for ‘notice’ — this is both a good and bad thing.” If nothing else, the “How to Read a Privacy Policy” report we published last year had a similar conclusion, that privacy policies are not going to save us.

Building a community: the implications of Facebook’s new features for privacy and community

Thursday, May 6th, 2010

As I described in my last post, the differences between MySpace and Facebook are so stark, they don’t feel like natural competitors to me.  One isn’t necessarily better than the other.  Rather, one is catering to people who are looking for more of a public, party atmosphere, and the other is catering to people who want to feel like they can go to parties that are more exclusive and/or more intimate, even when they have 1000 friends.

But this difference doesn’t mean that one’s personal information on Facebook is necessarily more “private” than on MySpace.  MySpace can feel more public.  There is no visible wall between the site and the rest of the Internet-browsing community.  But Facebook’s desire to make more of its users’ information public is no secret.  For Facebook to maintain its brand, though, it can’t just make all information public by default.  This is a company that grew by promising Harvard students a network just for them, then Ivy League students a network just for them, and even now, it promises a network just for you and the people you want to connect with.

Facebook needs to remain a space where people feel like they can define their connections, rather than be open to anyone and everyone, even as more information is being shared.

And just in time for this post, Facebook rolled out new features that demonstrate how it is trying to do just that.

Facebook’s new system of Connections, for example, links information from people’s personal profiles to community pages, so that everyone who went to Yale Law School, for example, can link to that page. Although you could see other “Fans” of the school on the school’s own page before, the Community page puts every status update that mentions the school in one place, so that you’re encouraged to interact with others who mention the school.  The Community Pages make your presence on Facebook visible in new ways, but primarily to people who went to the same school as you, who grew up in the same town, who have the same interests.

Thus, even as information is shared beyond current friends, Facebook is trying to reassure you that mini-communities still exist.  You are not being thrown into the open.

Social plug-ins similarly “personalize” a Facebook user’s experience by accessing the user’s friends.  If you go to CNN.com, you’ll see which stories your friends have recommended.  If you “Like” a story on that site, it will appear as an item in your Facebook newsfeed.  The information that is being shared thus maps onto your existing connections.

The “Personalization” feature is a little different in that it’s not so much about your interactions with other Facebook users, but about your interaction with other websites.  Facebook shares the public information on your profile with certain partners.  For example, if you are logged into Facebook and you go to the music site Pandora, Pandora will access public information on your profile and play music based on the your “Likes.”

This experience is significantly different from the way people explore music on MySpace.  MySpace has taken off as a place for bands to promote themselves because people’s musical preferences are public.  MySpace users actively request to be added to their favorite bands’ pages, they click on music their friends like, and thus browse through new music.  All of these actions are overt.

Pandora, on the other hand, recommends new music to you based on music you’ve already indicated you “Like” on your profile.   But it’s not through any obvious activity on your part.  You may have noted publicly that you “Like” Alicia Keys on your Facebook profile page, but you didn’t decide to actively plug that information into Pandora.  Facebook has done it for you.

Depending on how you feel about Facebook, you may think that’s wonderfully convenient or frighteningly intrusive.

And this is ultimately why Facebook’s changes feel so troubling for many people.

Although they aren’t ripping down the walls of its convention center and declaring an open party. As Farhad Manjoo at Slate says, Facebook is not tearing down its walls but “expanding them.”

Facebook is making peepholes in certain walls, or letting some people (though not everyone) into the parties users thought were private.

This reinforces the feeling that mini-communities continue to exist within Facebook, something the company should try to do as it’s a major draw for many of its users.

Yet the multiplication of controls on Facebook for adjusting your privacy settings makes clear how difficult it is to share information and maintain this sense of mini-communities.  There are some who suspect Facebook is purposefully making it difficult to opt-out.  But even if we give Facebook the benefit of the doubt, it’s undeniable that the controls as they were, plus the controls that now exist for all the new features, are bewildering.  Just because users have choices doesn’t mean they feel confident about exercising them.

On MySpace, the prevailing ethos of being more public has its own pitfalls.  A teenager posting suggestive photos of herself may not fully appreciate what she’s doing.  At the least, though, she knows her profile is public to the world.

On Facebook, users are increasingly unsure of what information is public and to whom.  That arguably is more unsettling than total disclosure.

In the mix — open data issues, bad econ stats, Facebook gaydar, and fraud detection in data

Friday, April 30th, 2010

1) It’s definitely become trendy for cities to open up their data, and I appreciated this article about Vancouver for its substantive points:

  • It’s important that data not only be open but be available in real time.  In all my conversations with people who work with data, though, whenever you have sensitive data, there’s going to be a significant time lag between when the data is collected and when it is “cleaned up” and made presentable for the public so as to avoid inadvertent disclosure.  This is why we think something like PINQ, a filter using differential privacy, could be revolutionary in making data available more quickly — it won’t need to be scrubbed for privacy reasons.
  • Licensing is an issue — although the city claims the data is public domain, there are terms of use that restrict use of the data by things like OpenStreetMaps.  It discusses the possibility of using the Public Domain Dedication and License, which is a project of Open Data Commons.  Alex heard some interesting discussion on this issue from Jordan Hatcher at the OkCon this past weekend.  This is a really fascinating issue, and I’m curious to see where else this gets picked up.

2) Existing economic statistics are riddled with problems.  I can’t say this enough — if existing ways of collecting and analyzing data are not quite good enough, we need to be open to new ones.

3) This is an old article, but highlights an issue Mimi and I have been thinking a lot about recently: How can data, even when shared according to your precise directions, reveal more than you intended? In this case, researchers found you could more or less determine the sexual orientation of people on Facebook based on their friends, even if they hadn’t indicated it themselves.  Privacy is definitely about control, yet how do you control something you don’t even know you’re revealing?

4) This past week, the Supreme Court heard a case involving the right to privacy of those who sign petitions to put initiatives on the ballot.  There is a lot of stuff going on in this case, gay rights, the experience of those in California who were targeted for supporting Prop 8, the difference between voting and legislating, etc., but overall, it’s a perfect illustration of how complicated our understanding of public and private has gotten.  We leave those lists open to scrutiny so we can prevent fraud — people signing “Mickey Mouse” — but public when you can go look at the list at the clerks’ office and public when you can post information online for millions to see are two different things.  There may be reasons we want to make these names public other than to prevent fraud (Justice Scalia thinks so), but are there other ways fraud could be detected among signatories that would not require an open examination of all petition signers’ names?  Could modern technology help us detect odd patterns, fake names and more without revealing individual identities?

Get Adobe Flash playerPlugin by wpburn.com wordpress themes