I have been growing somewhat obsessed with graph theory lately. When you let yourself go this deep into graphs, you start seeing them in day to day life. They consume you. When I look at someone, I just see another node on a graph. No more can I simply walk from my home to the shop. Now I have to traverse the path between the source and a target vertex. I'm growing somewhat concerned. Yet, unlike my sanity, there is no doubt in the prevalence of graph theory. It's applications range from linguistics to computer science, you would be hard-pressed to find a field in which graph theory couldn't be applied. Networks really are everywhere. I think its time you learnt a bit more about them. And what better place to start than one of graph theories most central concepts? Centrality! The identification of the most important nodes within a network. It's a big topic, so I have prepared a dataset that hopefully helps you intuitively understand these concepts.

This is my Facebook friends network (mouse over it if it seems a little small). Every node on this graph is friends with me on Facebook. The edges (lines) between the nodes connect two people that are also Facebook friends. If there isn't a link between two nodes, then they don't know each other. When enough people in a group aren't connected to the rest of the network, social cliques will form. These cliques can contain many people, some of which will be part of other cliques. Some people will have no clique at all...

Can you see how it becomes easy to start seeing this graph as an actual social network? Perhaps I just haven't been outside enough lately. No matter, if you are willing to indulge my obsession further, I'd like to begin discussing the titular question! Who is the most important person in this network?

Obviously, it's me. Look, I'm right there in the middle! The first step in properly answering this question will be removing myself from the network!:

We have managed to form 3 different sub-graphs. Nobody links 'Undergrad Friends II' or 'Postgrad Friends' to the rest of the network. Which means tough luck if I met you at university, I can already tell you that you aren't the most important person. From here on in, we will be only focusing on the largest of these 3 networks, which we will call the 'Main Network'.

Take a closer look at this 'Main Network'. Who is the most important person? There a lot of potential answers, my mother is in that graph. I doubt that she would be pleased with my questioning of her importance. So, to avoid offending anyone, we will have to let heartless maths and logic guide us to an answer. I posit that an important person is also a popular one. The most popular person in a network will be the one with the most friends. Most popular = most important. Simple.

Yes, this idea already has a name. Degree centrality is arguably the most simple of the centrality measures. A node's degree is given by the number of connections it has. You have 10 friends? Your degree is 10. We actually give degree as a fraction. If your degree is 0.5, you are friends with half the people in a social network!

You have actually seen degree centrality once, I used it in the first graph. But, now we will just apply degree centrality to the main network:

Okay, so have we done it? Is 'Dan.We' the most important person? No, I don't think so. To be more precise, I don't believe that degree is the best indicator of a person's importance. Some of the most popular nodes are important but, not all of them. The top 8 (on the bar chart) are all located in the same cluster. Many of the nodes in this cluster have a high degree, but some of them have no connection outside of this group. 'Man.Ha' has the 4th most friends in the network but, if he was removed, the strength of the network wouldn't be affected because all of their connections are all contained to a single dense group.

A social network isn't just one group of people, it is comprised of many smaller groups. People you work with; friends from school and blokes from the pub. Popularity is only a good measure of the importance of an individual within these groups. However, I would argue that, in a social network, it is the people who connect these groups together that are the most important.

We will deploy betweenness centrality to find the people who connect this network together the most! How does it work, you ask?

Let's consider a pair of people at the opposite ends of the graph. Because the network is fully connected, we can trace a line from node to node to connect these people. If you do this with enough pairs, you will find that you pass through some nodes quite often. Betweenness centrality finds these nodes with high traffic.

It does this by finding all of the shortest paths between every node pair in the network. Then a nodes betweenness centrality is given by the fraction of the total shortest paths that the node is on. In mathematical notation, this is given by:

$$g(v) = \sum _{s\neq v\neq t} \frac{\sigma _{st} (v)}{\sigma _{st}}$$

where $\sigma _{st}$ is the total number of shortest paths from node $s$ to node $t$ and $\sigma _{st}(v)$ is the number of those paths that pass through $v$.

Hopefully, you can see why betweenness centrality can help us in identifying important people. Nodes that connect the network together the most will have high betweenness centrality because lots of paths will pass through them to connect node pairs. For example, if you work with your mate Dave and Dave knows some of the lads from the pub. Dave is the link between two groups. All paths between the two groups have to pass through Dave. Dave has a high betweenness centrality.

This has done a far better job than degree centrality at highlighting the important people. Nodes like 'Man.Ha', which only have connection within a single cluster, are no longer present in the top 8. Now all of the top 8 nodes can be found on the edges of clusters or between them. These are the nodes that really bring the network together. But, there is still a problem. I wanted to find one most important person in this network. The betweenness value of 'Sam.Hi' is 0.1978 and 'Mat.Pr' is 0.1971. These values are far too close for me to comfortably call one of them more important. Further study is needed.

'Sam.Hi' is a lot like the node 'Man.Ha', they both have many connections within the main cluster, however, 'Sam.Hi' can generate a lot of betweenness from a few connections outside of the cluster. With links to 'Jac.St' and 'Dav.Pr' traffic heading into the main cluster will flow from these two nodes to 'Sam.Hi'. 'Sam.Hi' also has a monopoly on connecting 10 nodes above the main cluster to the whole network by being their only connection to it.

On the other hand, 'Mat.Pr' is physically in the centre of the graph. Being one of only 2 connections that links the left and right side of the graph, causing a lot of traffic will pass through the node. However, unlike 'Sam.Hi', if 'Mat.Pr' was removed, no nodes would be lost from the network.

This is a bit of a dilemma, both nodes have a strong case for being the most important. With the betweenness centrality being so close, we will have to turn to another metric to solve the deadlock.

Our solution can be found deep within the realms of game theory. In cooperative games, the Shapley value is used to fairly distribute payoff among a group based on the contributions of the individuals. In this paper , Shapley values are applied to betweenness centrality creating Shapley betweenness centrality. It adjusts the centrality value to take into account the importance of connecting groups of nodes. Or, as they put it in the paper "The more a vertex contributes to the performance of any possible group of vertices (that this vertex belongs to), the higher its betweenness centrality should be".

How Shapley betweenness centrality does this is rather clever. Feel free check out my implementation if you are curious:

But, to put it in words that we can all understand: Shapley betweenness centrality calculates the value of a node by examing repercussions of the node failing (being removed from the network) on different subsections of the network. Essentially, if the removal of a node were to be detrimental to any group of nodes, the node will have high Shapley betweenness centrality. Can you see how this will be useful for finding important people in a social network?

We have a clear winner! As well as strong candidates for all the top 8 nodes. By weighting the betweenness value toward nodes that connect smaller groups, the Shapley betweenness centrality has provided the best evaluation we have seen so far. The top nodes all have the greatest effect on the strength of the network. With the extreme of this is seen in 'Sam.Hi' which, if removed, would lead to the loss of nodes from the network.

Are finally ready to answer the titular question? Well yes, but, I still have one measure of centrality left to discuss. My own!

I like the idea of evaluating a node based on the amount of strength that the network would lose if it was removed. For example, if a group of nodes are held to a network by only 2 other nodes, a lot of the network's strength would be lost if one of these nodes was removed.

I couldn't find a centrality indicator that could intuitively do this, so, I made my own! (And given it a mathy name). It is similar to degree but, with one major twist. Instead of increasing degree by one for each connection a node a has, the degree is increased by 1 over the degree of the target node. To put this in maths terms:

$$c(v)= \sum _{s \in V(v)} \frac{1}{deg(s)}$$

where $V(v)$ are the set of nodes linked to node $v$ and $deg(s)$ is the degree of a node $s$.

This measure will give higher values to nodes that are connected to less connected nodes. My hypothesis is that a node which is connected to many low degree (low popularity) nodes will be an important node on a social network:

Looks pretty good, if I do say so myself. Many of the nodes that we have seen rank highly so far are again highlighted here. For such a computationally cheap indicator, I think it has done a good job. I am proud of my creation.

It also has, like the previous two indicators, highlighted 'Sam.Hi' and the most important node on the network. I will concede that the question is finally answered. Congrats to 'Sam. Hi' and well done to you for learning more about my most recent obsessions. I won't be sharing much of the code I have produced for this post as is mainly to handle the social network data but, I will share a working example of Shapley Betweenness as there aren't any python packages to do that for you.