Lately, I have been trying to see what we can learn about the structure of social interactions on a local scale from how people interact with each other on Facebook. Interactions on Facebook have been called “passive”, since people mostly interact with their friends by viewing (and occasionally liking or commenting on) the content those friends generate. However, not all friends are equal in terms of these passive interactions. Facebook’s EdgeRank algorithm, which decides the ordering of content in your News Feed, is far more likely to display content from users you interact with. For example, I see a lot of content from people I know at IIT Mandi and almost nothing from the people I am friends with only because we went to the 7th grade together.
This means that public interaction between people (wall posts, comments, likes) should reveal a fair amount of information about the structure of social interactions within a community. In this post, I share some interesting observations found by looking at interaction data from Facebook, largely restricted to the IIT Mandi community.
In addition to seeing what we can find out about this structure, I’ve also tried to answer some interesting questions:
- Do users with similar interests tend to interact with each other more frequently?
- What is the relation between geography and interactions? Do people interact more frequently with people coming from places close to their place of origin?
- How do interactions between different groups of people vary over time? For example, did shifting one batch to the Kamand campus have any effect on how the shifted batch interacts with the rest of the college?
- To what extent is homophily (the tendency of people to form ties with others who are similar to them) prevalent in the network with respect to various factors?
To keep things interesting, I have tried to illustrate things visually as opposed to just quoting numbers.
A Quick Sanity Check
As a quick check to make sure there is nothing odd about the data, the first thing I did was to draw a graph of different types of interactions vs time.
Do people with similar interests interact with each other more frequently?
The next thing I did was to investigate whether there was any correlation between a user’s likes and the people they interact with.
One challenging problem to solve was deciding how to numerically represent the similarity of two users’ tastes. After considering a few alternatives, I went with the Sørensen index of set similarity, where the two sets considered are the sets of pages liked by each user.
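For reference, the Sørensen index of two sets A and B is 2|A ∩ B| / (|A| + |B|), which ranges from 0 (no pages in common) to 1 (identical sets). Here is a minimal sketch of that calculation; the function name, the handling of two empty sets, and the sample page names are my own assumptions for illustration, not the code used for this analysis.

```python
def sorensen_index(likes_a, likes_b):
    """Sørensen index of two collections of liked pages: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(likes_a), set(likes_b)
    if not a and not b:
        # Assumption: two users with no liked pages count as "not similar".
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical page likes for two users.
alice = {"Pink Floyd", "xkcd", "IIT Mandi"}
bob = {"Pink Floyd", "IIT Mandi", "Cricket", "Linux"}

# Two shared pages out of 3 + 4 likes: 2 * 2 / 7
print(round(sorensen_index(alice, bob), 3))
```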
Now admittedly, this is a fairly crude metric of similarity that doesn’t capture the sort of similarity in interests I was looking for (something that could indicate, for example, that two users are similar because they share an interest in photography or have a similar taste in music). The pages liked by a user are often neither comprehensive nor up to date, and each “interest” often has several pages dedicated to it. Given all this, I wasn’t really expecting to get anything out of the exercise, and was surprised by what I saw.
The first visualization I looked at was a hex-binned plot of the number of interactions vs the Sørensen index.
It isn’t very easy to make out any trend in this plot. There are a lot of users who interact with each other very infrequently. If we try to bin all of the points according to the similarity index and plot a bar graph of the average number of interactions in each bin (shown below), we see a slight upward trend.
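The binning step can be sketched roughly like this. It assumes each pair of users has already been reduced to a (similarity, interaction count) tuple; the number of bins and the sample values are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def mean_interactions_by_bin(pairs, n_bins=10):
    """Average interaction count per similarity bin.

    pairs: iterable of (similarity in [0, 1], interaction count).
    Returns {bin index: mean interactions}, where bin i covers
    similarities from i / n_bins up to (i + 1) / n_bins.
    """
    totals = defaultdict(lambda: [0, 0])  # bin index -> [sum, count]
    for similarity, interactions in pairs:
        # A similarity of exactly 1.0 goes into the last bin.
        idx = min(int(similarity * n_bins), n_bins - 1)
        totals[idx][0] += interactions
        totals[idx][1] += 1
    return {idx: s / c for idx, (s, c) in sorted(totals.items())}


# Hypothetical (similarity, interactions) values for five user pairs.
sample_pairs = [(0.05, 1), (0.08, 3), (0.25, 4), (0.27, 6), (0.95, 10)]
binned = mean_interactions_by_bin(sample_pairs)
for idx, mean in binned.items():
    print(f"similarity >= {idx / 10:.1f}: {mean:.1f} interactions on average")
```

The resulting dictionary is what would feed the bar graph: one bar per bin, with the mean as its height.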
Take this with a grain of salt because the similarity metric is very crude, but this could indicate that people with similar interests do tend to interact with each other slightly more frequently. However, it does not tell us whether this is because people with similar interests tend to form ties or because people align their interests with those of the people they interact with frequently.
In fact, with more data and a better similarity metric, it might be possible to look at a friendship and determine whether it was formed as a result of common interests or due to other factors. That would be really interesting!
Interactions between different groups of people
To analyze the interactions between groups of people, I had to go through each of the more than 400 people in my friend list and manually assign tags to each person so that I could tell the different groups apart.
Once that was done, the next step was trying to answer the question of how shifting the second year students to Kamand affected their interaction on Facebook with others. Here’s a plot of the number of interactions within, across and outside of the group of second year students vs time.
This graph does not help us answer the question at hand, but it does have a few interesting features. The first thing to notice is the spike in the number of interactions in Aug 2012, which corresponds to the start of a semester. The number of interactions among second-year students and across groups then stays the same for the next two months, which corresponds to the stay in Hotel River Bank without an internet connection. In Oct 2012, the batch was shifted to Kamand and had access to the internet in the hostels; this explains the rise in interactions in Oct and Nov 2012. The interactions go back down after Nov 2012, which marks the end of the semester. Because of the lack of internet connectivity during the first two months, we cannot actually infer anything about whether shifting to Kamand affected social interactions between the second years and the rest of the student body. But what if we plot the interactions in each category as a percentage of the total number of interactions that month?
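That normalization could be sketched along these lines. The input format (one (month, category) tuple per interaction) and the sample records are assumptions made for illustration; only the three category names come from the plot described above.

```python
from collections import Counter, defaultdict


def monthly_shares(interactions):
    """Percentage of each month's interactions that fall in each category.

    interactions: iterable of (month, category) tuples, where category is
    "within", "across" or "outside" relative to the second-year group.
    Returns {month: {category: percent of that month's total}}.
    """
    counts = defaultdict(Counter)
    for month, category in interactions:
        counts[month][category] += 1

    shares = {}
    for month, counter in counts.items():
        total = sum(counter.values())
        shares[month] = {cat: 100 * n / total for cat, n in counter.items()}
    return shares


# Hypothetical interaction records.
records = [
    ("2012-10", "within"), ("2012-10", "within"),
    ("2012-10", "across"), ("2012-10", "outside"),
    ("2012-11", "across"), ("2012-11", "within"),
]
shares = monthly_shares(records)
print(shares["2012-10"])
```

Dividing by the monthly total removes the effect of overall activity (such as the months without internet access), so only the mix of categories is compared.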
In this image we can see that the share of interactions across groups is about the same in Nov/Dec 2012 as in Feb 2012, which was in the previous semester. I would have expected it to be higher, given that the number of people in the “other” group has gone up from 200 to 300. This hints that shifting to Kamand has, as expected, affected social interactions of the second-year students with the rest of the college. We’ll look at this again when looking at the interaction network. But before we go there, let’s see what effect the distance between hometowns has on the number of interactions between people.
How much of a role does geography play?
Given that several people at IIT Mandi tend to form groups based upon the region they come from, I expected to see a strong relationship between the distance between hometowns and the number of interactions between two people. The image below is a scatter plot of the distance between the hometowns reported by 344 people and the number of interactions between them. To put the distances in context, the distance between India’s north and south extremities is about 3000 km.
The scatterplot shows that people clearly interact more frequently with others whose hometowns are close to their own, but there are a fair number of interactions with people who are further out as well.
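One common way to get the distance between two hometowns from their coordinates is the great-circle (haversine) distance. The sketch below is a generic version of that formula rather than the exact code behind the plot; the mean Earth radius and the approximate coordinates for Mandi and New Delhi are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates (assumed for this example).
mandi = (31.71, 76.93)
new_delhi = (28.61, 77.21)
print(f"{haversine_km(*mandi, *new_delhi):.0f} km")
```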
Interaction Network Graph
Here’s what the directed interaction network graph looks like:
The green nodes in the top right are first years, the pink nodes on the left are primarily second years. The green nodes starting below the second year cluster are also mainly second years, but the ones closer to the brownish blob are mainly fourth years. The brownish colored nodes are a mixture of fourth years and third years.
The most interesting thing about this network is that it captures the underlying social structure in the network very well. Zooming in a bit makes different real-world social cliques clearly visible, which says a lot for the hypothesis that the people you interact with more frequently on social networks are also likely to be the people you interact with frequently offline.
One very interesting application of this would be finding the best path to reach someone: if you need a favor from someone who isn’t part of your social clique, it is sometimes better to find someone else to ask on your behalf than to approach them directly. This is what the startup Hachi does. They show you the best path through your social networks to a target person you want to be introduced to, though to the best of my knowledge they don’t consider how frequently people interact when deciding the optimal path.
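As a rough sketch of how interaction frequency could be factored in, one option is to run Dijkstra’s shortest-path algorithm over the directed interaction graph with each edge costing 1 / (number of interactions), so that frequently used links are cheap to traverse. The cost function, the graph format and the names below are all assumptions for illustration; this is not how Hachi works.

```python
import heapq


def best_introduction_path(graph, source, target):
    """Cheapest path from source to target, where frequent links cost less.

    graph: {person: {other person: interaction count}} (directed).
    Edge cost is 1 / count (an assumed weighting), minimized with Dijkstra.
    Returns the path as a list of people, or None if no path exists.
    """
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, count in graph.get(node, {}).items():
            if count > 0 and neighbour not in visited:
                heapq.heappush(
                    queue, (cost + 1.0 / count, neighbour, path + [neighbour])
                )
    return None


# Hypothetical interaction counts.
interaction_graph = {
    "me": {"target": 1, "asha": 20},
    "asha": {"target": 10},
}
print(best_introduction_path(interaction_graph, "me", "target"))
```

In this toy graph the direct link costs 1.0 while the route through the mutual friend costs 0.05 + 0.1 = 0.15, so the search prefers the introduction over the direct approach.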
Another interesting thing to note is that the density of edges between the first year cluster and the third/fourth year cluster is higher than that between the first year cluster and the second year cluster.
Automatically detecting interesting relations
One really interesting thing to do with this dataset is to take attributes of each user and try to correlate them with whether or not the users interact frequently… if we feed the data into something like the CN2 induction algorithm or generate a classification tree, the resulting classifiers can describe a lot of interesting splits in the dataset.
Here’s an example of the kind of output we get from CN2 and classification trees considering only the year of the two users.
As such, the images above don’t tell us anything that we couldn’t have discovered ourselves with ease. The real power of this method shows when we have more attributes in consideration. For example, this is what happens when we add in gender (the classification tree was too big to show here, so it has been restricted to the case where the first user is in the third year batch):
Adding more attributes into the mix is bound to help us discover interesting correlations. That would definitely be something really interesting to look into.