Studying social networks at scale often runs into a problem: the available data is tiny, because most of it is locked away in silos. I’m not going to repeat the larger issues, but having to deal with lawyers for any research use makes approaching the “big sites” unrealistic.
But the open social web, built on protocols like OStatus and Salmon, exists and keeps growing. One particular platform, status.net, has a flagship central site at http://identi.ca with over 420 thousand registered local accounts and 71 million posted messages. Note the use of local above. identi.ca is part of a distributed, interlinked network of social network sites. Each site can be configured a little differently. By default, the software emulates a Twitter-like system. Change settings, and it works as a bookmark sharing service. With settings and some plumbing, users can agglomerate all of their on-line profiles (up to the “open-ness” of each data source). With styling changes, the status.net backbone can serve for photo-sharing, Facebook-like notice sharing (but without multiple levels of pseudo-privacy), or other uses.
Really, I’m more interested in status.net’s federated and well-licensed content aspects. Those open the door to plenty of social network analysis research approaching real-life scales. There are other platforms, but I’m not as familiar with those and don’t treat them here.
Social network analysis
Social network analysis is a rather vague term. I’m considering the computational and statistical side, not classic, sociological social network analysis, just as statistical mechanics is different from simmering a sauce on the stovetop. Both address similar issues from different directions and can learn much from each other. Perhaps one day our numerical, mathematical analysis will help us better understand social networks. We’re still trying different things to see what fits.
There’s reasonable hope that the current computational tools can help analysts and researchers. Work between our group and the group at Pacific Northwest National Labs analyzed some canned data from the T-world and managed to stumble on a somewhat forgotten outbreak of H1N1 as well as a few “private” T-world accounts that passed along important, timely information during the Atlanta 2009 floods. So there’s value in the data and analysis, but we cannot always predict which analysis will yield what value.
To see how complex the field is, consider Fortunato’s survey of different community detection methods. Fortunato gives a glimpse of how many different methods exist for a “simple” task: determining groups of similar items in a social network. We have no objective way to compare these methods on real-world data at large scales yet, so there is no known right answer. Simultaneously frustrating and interesting.
License and structure
Instead of posting into a locked-off silo, users of the flagship identi.ca site license their posted content with the Creative Commons Attribution (CC-BY) license. The site claims the data also is licensed under CC-BY, although that may not be a concept in many jurisdictions. Other federated sites encode their licenses, and the protocol includes a check for license compatibility. So long as you give credit in an appropriate form, you can use specific dents in examples without worry. I still twitch over including Twitter messages in our ICPP10 paper.
The open API emulates Twitter without the restrictions on use. Naturally, if you’re going to crawl the whole site, it’s wise to ask the developers and admins first. They do respond, and they’re quite willing to help with nifty uses. They’ve even given me an open invitation to use a data dump, which I have no doubt would be extended to other researchers.
And the data is rich with semantic annotations:
- Subscriptions are directed, and users subscribe either to other users or to groups.
- All subscribers see everything posted by a subscribed user/group, which makes data modeling a bit simpler.
- Messages have timestamps for some streaming work (I don’t think they keep deletions around).
- Conversations group the messages by Reply-To functionality. It doesn’t catch everything, of course, but inferring missing links would be an interesting project.
- Users can mark messages as ‘favorites’.
- There are ‘repeats’ similar to re-tweets.
- Messages can have attachments as well as extra-service links.
- Tags galore on everything, although many doubt their use.
- Many messages have an associated location, although the location rarely works for me.
If you’re interested in exploring just about any semantic aspect or just the user graph, identi.ca makes the data available. Sites cross-posting also make their data available under a compatible license, so identi.ca works well as a central hub.
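As a rough illustration of how the annotations listed above might sit in memory, here is a minimal Python sketch. It is not status.net's schema or API: the names `Notice`, `Network`, `reply_to`, `subscribe`, and `conversations` are placeholders I made up for the example. It keeps directed subscriptions and groups messages into conversations by following reply links back to a root.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Notice:
    """One posted message. Field names are illustrative, not status.net's schema."""
    notice_id: int
    author: str
    timestamp: float                 # seconds since the epoch
    reply_to: Optional[int] = None   # id of the notice being answered, if any
    tags: set = field(default_factory=set)


class Network:
    """Toy container for directed subscriptions and posted notices."""

    def __init__(self):
        self.subscriptions = defaultdict(set)  # subscriber -> users/groups followed
        self.notices = {}                      # notice_id -> Notice

    def subscribe(self, subscriber, target):
        # Directed edge: subscriber sees everything target posts.
        self.subscriptions[subscriber].add(target)

    def post(self, notice):
        self.notices[notice.notice_id] = notice

    def subscribers_of(self, target):
        return {s for s, targets in self.subscriptions.items() if target in targets}

    def conversation_root(self, notice_id):
        # Follow reply_to links until reaching a notice with no known parent.
        seen = set()
        current = notice_id
        while True:
            seen.add(current)
            parent = self.notices[current].reply_to
            if parent is None or parent not in self.notices or parent in seen:
                return current
            current = parent

    def conversations(self):
        # Map each root notice id to its member ids in timestamp order.
        groups = defaultdict(list)
        by_time = sorted(self.notices, key=lambda n: self.notices[n].timestamp)
        for nid in by_time:
            groups[self.conversation_root(nid)].append(nid)
        return dict(groups)
```

A reply whose parent is missing (deleted, or never linked) simply becomes its own root here, which is exactly where the missing-link inference mentioned above would come in.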
Active
Even the single, central site keeps growing. That is nice for expanding the data set, but status.net sites are also active in another sense: people are already investigating better methods for implementing the sites and analyzing their data. Luke Slater has identified a local core + fringe community model and is investigating its use as a spam detector.[1] A flour-covered, football-addled zombie is poking at recommendation heuristics.
Active developers means active code. There are deployed applications with API examples ripe for re-use. Luke Slater has prototype code for his analysis and crawling. So there’s no need to start from scratch. The developer community is quite active.
Distributed / federated
And using the federated sites provides data useful for emerging research areas. Different domains may provide different visibility, although most have the same license now. Treating the domains as opaque lets you compare methods on full data to methods on partially obscured or forcibly separated data. The federated social web is a great source of data.
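To make the full-versus-separated comparison concrete, here is a small Python sketch. It assumes accounts are written as `user@domain` and that subscriptions arrive as `(subscriber, target)` pairs. Both are my assumptions for illustration, and the toy edges are made up. It splits edges into per-site and cross-site sets, then compares a simple measure (subscriber counts) on the full graph against what a single site would see alone.

```python
from collections import Counter, defaultdict


def domain_of(account):
    """Return the site part of an account written as 'user@domain' (assumed format)."""
    return account.rsplit("@", 1)[1]


def split_by_domain(edges):
    """Separate subscription edges into per-site lists and a cross-site list."""
    local = defaultdict(list)
    crossing = []
    for src, dst in edges:
        if domain_of(src) == domain_of(dst):
            local[domain_of(src)].append((src, dst))
        else:
            crossing.append((src, dst))
    return dict(local), crossing


def in_degrees(edges):
    """Count subscribers per account for a list of (subscriber, target) edges."""
    return Counter(dst for _, dst in edges)


# Hypothetical toy data: two sites, four accounts.
edges = [
    ("a@x.example", "b@x.example"),
    ("c@y.example", "b@x.example"),
    ("c@y.example", "d@y.example"),
    ("a@x.example", "d@y.example"),
]
local, crossing = split_by_domain(edges)
full_view = in_degrees(edges)                # what a global observer sees
site_view = in_degrees(local["x.example"])   # what x.example sees on its own
```

Any method can be swapped in for `in_degrees`; the point is only that the same function runs on both views so the results can be compared.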
For a motivating example, consider analyzing medical data and research coming from different countries. Each has its own privacy laws. You cannot simply copy data to one system and analyze it. You need to pass the least specific aggregate data possible across the legal boundaries. There is very little relevant research in this direction, but it has immediate application both in research and in multinational medical companies like Merck.
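A tiny Python sketch of the "least specific aggregate" idea, under my own assumptions: each site reduces its records to three numbers (count, sum, sum of squares) and only those cross the boundary. The names `Summary`, `summarize`, and `merge` are hypothetical, and real privacy rules may well forbid even this much. It only shows that a pooled mean and variance can be recovered without moving any individual record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Aggregate that leaves a site; no individual values are included."""
    n: int
    total: float
    total_sq: float


def summarize(values):
    # Runs inside one jurisdiction, on data that never leaves it.
    return Summary(len(values), sum(values), sum(v * v for v in values))


def merge(summaries):
    # Runs wherever the aggregates are collected.
    return Summary(
        sum(s.n for s in summaries),
        sum(s.total for s in summaries),
        sum(s.total_sq for s in summaries),
    )


def mean_and_variance(s):
    """Pooled mean and population variance from a merged summary."""
    mean = s.total / s.n
    variance = s.total_sq / s.n - mean * mean
    return mean, max(variance, 0.0)  # clamp tiny negative rounding error
```

The sum-of-squares form is numerically naive for large or badly scaled data; it is used here only because it merges trivially.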
The twisted thing is that succeeding in this research direction could enable tighter private data silos. Ugh.
Drawbacks
As with any data source, there are drawbacks. identi.ca includes posts from just about every country, and the location data is not accurate enough for reliable filtering, so some projects may not be able to use the data. Some funding agencies are rightfully nervous about the public perception of monitoring citizens of their own or particular other countries. One useful project would be to fit the de facto standard R-MAT generator to aspects of identi.ca data.
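For readers who haven't met R-MAT: it builds a graph by recursively picking one quadrant of the adjacency matrix per bit of the vertex id, with probabilities a, b, c, and d = 1 - a - b - c. The Python sketch below only generates edges. The fitting project would be choosing a, b, and c so that the output resembles identi.ca. The function names are mine, and the default probabilities are the ones used by the Graph 500 benchmark, not values fitted to any real data.

```python
import random


def rmat_edge(scale, a, b, c, rng):
    """Pick one directed edge in a 2**scale by 2**scale adjacency matrix."""
    row = col = 0
    for _ in range(scale):
        r = rng.random()
        row <<= 1
        col <<= 1
        if r < a:
            pass                 # top-left quadrant: both bits stay 0
        elif r < a + b:
            col |= 1             # top-right quadrant
        elif r < a + b + c:
            row |= 1             # bottom-left quadrant
        else:
            row |= 1             # bottom-right quadrant, probability d
            col |= 1
    return row, col


def rmat_edges(scale, n_edges, a=0.57, b=0.19, c=0.19, seed=None):
    """Generate n_edges R-MAT edges; duplicates and self-loops are kept."""
    if min(a, b, c) < 0 or a + b + c > 1:
        raise ValueError("a, b, c must be non-negative and sum to at most 1")
    rng = random.Random(seed)
    return [rmat_edge(scale, a, b, c, rng) for _ in range(n_edges)]
```

For example, `rmat_edges(10, 5000, seed=1)` yields 5000 (source, target) pairs over 1024 vertex ids. Comparing their degree distribution against the real subscription graph would be a first step toward a fit.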
Also, identi.ca is a real, active, evolving social network. It’s complicated. There’s no demonstrable ground truth for any real question. However, many users are friendly and may respond to survey requests.
[1] Unfortunately, I feel the need to point out that Stanford’s SNAP graph library is different than our (Georgia Tech’s) earlier SNAP library. It’s a shame they didn’t check with colleagues or even a search engine before choosing their name.