Social Network Analysis: Can Quantity Compensate for Quality?

February 15, 2009 at 7:40 am 1 comment

Science magazine has labeled Christakis and Fowler the “dynamic duo”

Nicholas Christakis of Harvard and James Fowler of UC San Diego have produced a series of ground-breaking papers analyzing the spread of various traits in social networks: obesity, smoking, happiness, and most recently, in collaboration with John Cacioppo, loneliness. The Christakis-Fowler collaboration has now become well-known, but from a technical perspective, what was special about their work?

It turns out that they found a way to distinguish between the three reasons why people who are related in a social network are similar to each other.

  1. Homophily is the tendency of people to seek others who are alike. For example, most of us restrict our dates to smokers or non-smokers, mirroring our own behavior.
  2. Confounding is the phenomenon of related individuals developing a trait because of a (shared) environmental circumstance. For example, people living right next to a McDonald’s might all gradually become obese.
  3. Induction is the process of one individual passing a trait or behavior on to their friends, whether by active encouragement or by setting an example.

Clearly, only induction can cause a trait to actually spread in a social network. To distinguish between the three effects and to prove causality, according to the authors, the key is longitudinal data–data from the same individuals collected over a period of years or decades. All of the works cited above are based on the Framingham Heart Study. This corpus of data is ideally suited in several ways:

  • It contains data from three generations of individuals.
  • Very few of the participants (10 out of over 5,000) dropped out:
  • “Even subjects who migrate out of the town of Framingham (to points throughout the U.S.) remain in the study and, remarkably, come back every few years to be examined and to complete survey forms.”
  • The original study sample comprised the majority of the population of Framingham, which is (presumably) a somewhat closed social network.

This illustrates the traditional way of doing things, using carefully selected high-quality data. With the growth of online social networking websites, however, a radically different approach is gaining prominence. A good example is this Slate article that analyzes the recent “25 random things” Facebook meme using well-known epidemiological models, and concludes that marketers should “introduce a wide variety of schemes into the wild and pray like hell that one of them evolves into a virulent meme.” For a more academic/rigorous example, see the paper “Characterizing Social Cascades in Flickr” (pdf), which looks at how information disseminates through social links.

Many analogies come to mind when comparing the Old School to the New School: the Cathedral vs. the Bazaar, or Britannica vs. Wikipedia. Information in social networking sites is collected through a chaotic, organic, unsupervised process. The set of participants is entirely self-selected. Against these objections stands the indisputable fact that the process produces several orders of magnitude more data at a fraction of the cost.

Despite being only a few years old, online social network analysis has already produced deep insights: the work of Jon Kleinberg springs to mind. But will it supplant the traditional approach? I think so. My hypothesis is that with sufficiently powerful analytical methods, quantity can compensate for noise in the data. Don’t take my word for it: Harvard professor Gary King considers the availability of data from online social networks to be the “most significant turning point in the history of sociology.”

The amount and variety of social network data available to researchers, marketers, etc. is rapidly increasing; there is a detailed survey in my forthcoming paper (at IEEE S&P) on de-anonymizing social networks. In spite of the rather serious privacy concerns that are identified in the paper, the balance of business incentives appears to be towards more openness, and my prediction is that social networks will continue to move in that direction. Facebook alone has an incredible wealth of as-yet untapped data on information flow–recent the feed-focused redesign instantly transformed posted items, group memberships and fan pages into meme propagation mechanisms.

The new approach to social network analysis has benefits other than the quantity of data available. Equally important is the fact that users of social networking sites are not participating in a study; we get to observe their lives directly. The data is thus closer to reality. Furthermore, there is the possibility of studying the population actively rather than passively. For instance, if the goal is to study meme propagation, why not introduce memes into the population? This gives the researcher much greater control over the timing, point of introduction, and content of the memes being studied. Of course, this raises ethical and methodological questions, but they will be worked out in due course.

A third benefit of the new approach is that social network users often express themselves using free form text; utilized properly, this could yield much deeper data than making study participants check boxes on a Likert scale in response to canned questions (such as the now famous “How does it feel to be poor and black?“). The Flickr paper cited above analyzes the tags people use to describe pictures. With more technical sophistication, it should be possible, for example, to apply automated sentiment analysis to blog posts, tweets, etc. to determine how your opinion of a movie or book is influenced by those of your friends.

True, we don’t yet have data spanning several decades, but then things happen on a far faster timescale in the online world. There will always be research questions that fundamentally depend on studying aspects of the real world that are not replicable virtually. By and large, however, I believe the new approach is about to supplant the old. There is still a ways to go in terms of developing the techniques we need for analyzing massive, noisy datasets, but we will get there in a few short years. The Christakis-Fowler papers may soon exemplify the exception, rather than the rule, for social network analysis.

Entry filed under: Uncategorized. Tags: , .

De-anonymizing the Internet Anonymous Data Collection: Lessons from the A-Rod Affair

1 Comment Add your own

  • 1. Nick Lim  |  August 18, 2009 at 3:32 pm

    Interesting topic. Arvind, have you found any studies on how real world phenomenon spread through the online social networks?

    It would be interesting to explore the scope of this type of social network analysis, both in terms of amount of data and the complexity of analysis. There are some indications from large network analysis that as size grows, the complexity of useful analysis actually drops…



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Trackback this post  |  Subscribe to the comments via RSS Feed


I’m an associate professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.

Me, elsewhere

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 265 other subscribers

%d bloggers like this: