Analyzing networks - Introduction to Social Media Investigation: A Hands-on Approach, 1st Edition (2015)

Chapter 21. Analyzing networks

Abstract

Social network analysis (or SNA) involves studying the structure of people's connections—especially things like who is most important or influential in the network and which groups of people are closely connected. This chapter introduces basic concepts in SNA and describes some available tools for conducting it.

Keywords

Social media

Social networks

Social network analysis

So far in this book, we've addressed how to collect information that people post on their social media profiles. After collection, there may be useful things to learn from that information as is. But if you want to go further, then the information must be analyzed.

There are many kinds of analysis, but a particularly useful one is social network analysis: the analysis of social connections a person has with others. Social network analysis (or SNA) involves studying the structure of people's connections—especially things like who is most important or influential in the network and which groups of people are closely connected.

This chapter will introduce basic concepts in SNA and describe some available tools for conducting it. These are more advanced techniques than those covered so far.

Introduction

Before learning any terminology or technical details, we can begin with an intuitive first example of what can be done with SNA.

First, recall that in almost all social media sites, people have friends or other social connections. In doing SNA, we look at the connections among the target's friends. For example, if Alice is friends with Bob and Chuck, we would consider whether or not Bob and Chuck are friends with one another.

Visualizations

Much of the analysis involves using pictures (called visualizations) of a social network. Each circle in the image represents a person. If two represented people are friends on the social media site, then they are connected by a line.

A Simple Visualization

Consider the simple example of Figure 21.1. The three circles represent three people, labeled Alice, Bob, and Chuck. Lines from Alice to Bob and from Alice to Chuck indicate that she has connections to both of them. Conversely, since no line exists between Bob and Chuck, we know they are not connected.

FIGURE 21.1 A visualization of a small social network. This shows that Alice has connections to Bob and Chuck, but that Bob and Chuck do not have a connection to one another.

Most social network visualizations are much larger than this. At those scales, patterns tend to emerge.

A Complex Visualization

Take a look at Figure 21.2. It will seem complex at first but should be clear with a bit of explanation.

FIGURE 21.2 A visualization of a more complex network. As before, each dot represents a person and each line connecting two dots indicates that those people have a social connection.

This picture shows all the friends of a specific target on Facebook. The lines that connect two dots indicate that the people have a social connection. In this case, since it is a Facebook network, a line between two dots means that the two people are friends. The target is not in the picture, since we know the target would have a line connecting him or her to every other dot. Remember that although the dots/circles are smaller than in Figure 21.1, each one still represents a person here.

Sometimes, when the dots are really close together, you can't see a line between them. On the other hand, the closer the dots are to one another, the more strongly they (and their friends) are connected. Thus, even if you can't see a line between very close dots, there's often a connection between those overlapping dots.

The color coding in the image is used to indicate which people are most tightly grouped together. You'll also notice that some dots are bigger than others. The size indicates the influence that person has in the network. Both the groups of nodes and the importance rankings are things that can be calculated by tools that produce these visualizations, and we will see how that works later in the chapter.

Learning from the Visualization

Proceeding only from the picture in Figure 21.2, we can draw some conclusions. (Recall that the dots are people and the lines are social connections.) First, this target has very distinct groups of friends. The red group toward the top of the image and the green one at the bottom are totally disconnected from everyone else. Only one person connects the blue and purple groups on the right with the larger yellow group on the left. If we put names on the dots, we would know which people are part of which group, who is most influential, etc.

SNA allows us to generate these pictures and make calculations about people's role in the social network. The rest of this chapter will explain the basic terminology and computations of SNA, along with the software available to help you collect and analyze this data.

Terminology

If you choose to apply any SNA to your targets, there are some terms that are important to know.

Nodes, Edges, and Graphs

People in your network are called nodes. Nodes are represented by the circles/dots in the image shown in Figure 21.1. The relationships between people, shown as lines connecting the nodes in Figure 21.1, are called links or edges.

Together, a group of nodes and edges makes up a social network. This is also called a graph or social graph.

Nodes and edges can have information attached to them. For example, nodes represent people, so they could be labeled with the names of the people they represent. Edges can also have labels. This could describe the type of relationship people have (e.g., family, friend, and coworker), or it could be a numeric value that describes how often people interact with one another or how strong their relationship is.

You can use any labels you want on nodes and edges if it makes the network more informative.
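
As a minimal sketch of how nodes, edges, and labels might be stored, the network from Figure 21.1 can be written as plain Python data. The "type" and "weight" labels below are hypothetical values chosen for illustration; they do not come from the figure.

```python
# Figure 21.1 as data: three people and two social connections.
nodes = ["Alice", "Bob", "Chuck"]

# Each undirected edge is stored once as a frozenset, so that
# (Alice, Bob) and (Bob, Alice) are treated as the same edge.
# The "type" and "weight" labels are hypothetical, for illustration only.
edges = {
    frozenset(["Alice", "Bob"]): {"type": "friend", "weight": 3},
    frozenset(["Alice", "Chuck"]): {"type": "coworker", "weight": 1},
}


def are_connected(a, b):
    """Return True if an edge joins a and b, in either order."""
    return frozenset([a, b]) in edges


print(are_connected("Alice", "Bob"))  # True
print(are_connected("Bob", "Chuck"))  # False
```

Storing the labels in a dictionary per edge keeps the structure open-ended: any extra label can be added without changing how connections are looked up.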

Edge Directionality

Edges can also have a direction.

An undirected edge generally means that the people connected by the edge know one another. For example, on Facebook, when Bob adds Chuck as a friend, Chuck has to approve it. The friendship relationship on Facebook implies that both people want the social connection. However, that is not always the case. For example, on Twitter, Bob may follow Chuck, but Chuck may not follow Bob back. In that case, we would want to represent that the following relationship only goes one way.

A directed edge is usually drawn with an arrowhead pointing in the direction of the relationship. If there are some directed edges in a network, all the edges should have a direction indicated, even if they are mutual. Figure 21.3 shows a couple of ways of drawing these edges. In this case, A has a directed edge to B, while A and C have a mutual relationship. The graph on the left shows a double-headed arrow between A and C indicating their relationship. This can also be drawn with two arrows, one for each direction, as in the graph on the right.

FIGURE 21.3 Two ways of drawing directed networks. The graph on the left has a one-way relationship from node A to node B and a bidirectional relationship between A and C indicated by a two-headed arrow. The same is shown on the right, but two edges connect A and C.
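
The two drawings in Figure 21.3 describe the same data. One way to sketch this in Python, assuming each directed edge is stored as a (from, to) pair, is to expand every mutual relationship into two one-way edges:

```python
# Figure 21.3: a one-way edge from A to B and a mutual relationship
# between A and C.
one_way = [("A", "B")]
mutual = [("A", "C")]

# Expand each mutual relationship into two directed edges, which matches
# the right-hand drawing in Figure 21.3.
directed_edges = set(one_way)
for u, v in mutual:
    directed_edges.add((u, v))
    directed_edges.add((v, u))


def has_edge(u, v):
    """Return True if there is a directed edge from u to v."""
    return (u, v) in directed_edges


print(has_edge("A", "B"))  # True
print(has_edge("B", "A"))  # False
print(has_edge("C", "A"))  # True
```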

With any network (directed or undirected), there are common features that are helpful to consider. Figure 21.4 shows a sample network that we will use to think about these concepts. The network in Figure 21.4 is undirected.

FIGURE 21.4 A sample network with 8 nodes and 8 edges.

Paths

A path is a series of edges connecting two nodes. For example, though Ed and Alice do not have an edge connecting them (there is not a direct relationship), there is a path from Ed, to Heidi, to Alice. Paths have lengths, which are measured in the number of edges you have to traverse to get from one person to the next. The shortest path from Ed to Alice has a length of 2 (one edge from Ed to Heidi and another from Heidi to Alice).

From Frank to Chuck, there are two shortest paths: the first step can go to either Ed or Gerda; then, the second step goes to Heidi, then to Alice, and then to Chuck. The path length is 4.

Shortest path length is a very important property in SNA. It helps you understand how closely two people are connected. If the shortest path connecting two people is very long, they are unlikely ever to interact with or influence one another. If two people have a path of length 2 between them, it is more likely that they will have some interaction.
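
Shortest path lengths are usually found with a breadth-first search, which explores the network one step at a time outward from a starting node. The sketch below is one simple way to do it; the edge list is inferred from this chapter's description of Figure 21.4, and the function names are illustrative only.

```python
from collections import deque

# Edge list inferred from the chapter's description of Figure 21.4.
EDGES = [
    ("Alice", "Bob"), ("Alice", "Chuck"), ("Alice", "Dan"), ("Alice", "Heidi"),
    ("Heidi", "Ed"), ("Heidi", "Gerda"), ("Frank", "Ed"), ("Frank", "Gerda"),
]


def build_adjacency(edges):
    """Map each node to the set of nodes it shares an edge with."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: edges needed to reach each node from source."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


adjacency = build_adjacency(EDGES)
print(shortest_path_lengths(adjacency, "Ed")["Alice"])     # 2
print(shortest_path_lengths(adjacency, "Frank")["Chuck"])  # 4
```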

Node Degree

People may have many friends or few. In a network, the number of connections a node has is called its degree. In Figure 21.4,

• Alice has the highest degree: 4 (connections to Heidi, Bob, Chuck, and Dan);

• Dan has a degree of 1 (only one connection to Alice); and

• Frank has a degree of 2 (connections to Gerda and Ed).

If you have a directed network, nodes also have an in degree and an out degree, which correspond to the number of edges coming in (with arrows pointing at the node) and going out (arrows pointing away), respectively.
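
Counting degree is a matter of tallying how many edges touch each node. The following sketch does this for the undirected network of Figure 21.4 (edge list inferred from the chapter's description) and computes in degree and out degree for the small directed network of Figure 21.3.

```python
from collections import Counter

# Undirected edges inferred from the chapter's description of Figure 21.4.
EDGES = [
    ("Alice", "Bob"), ("Alice", "Chuck"), ("Alice", "Dan"), ("Alice", "Heidi"),
    ("Heidi", "Ed"), ("Heidi", "Gerda"), ("Frank", "Ed"), ("Frank", "Gerda"),
]

# Each undirected edge adds one to the degree of both of its endpoints.
degree = Counter()
for u, v in EDGES:
    degree[u] += 1
    degree[v] += 1

print(degree["Alice"], degree["Dan"], degree["Frank"])  # 4 1 2

# Directed edges from Figure 21.3, stored as (from, to) pairs.
DIRECTED = [("A", "B"), ("A", "C"), ("C", "A")]

out_degree = Counter(u for u, _ in DIRECTED)  # arrows pointing away
in_degree = Counter(v for _, v in DIRECTED)   # arrows pointing at the node

print(out_degree["A"], in_degree["A"])  # 2 1
```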

Egocentric Networks

It is not possible to look at the full Facebook network of 1.4 billion people. It is too big, and looking at the whole thing is unlikely to yield many interesting insights. However, one way we often look at social network data is through an egocentric network. This is a social network focused around one individual. The egocentric network for Alice will consist of Alice's friends and any edges that exist between her friends.

In Figure 21.4, Alice has four friends: Heidi, Bob, Chuck, and Dan. There are no edges directly connecting any of them to one another, so Alice's egocentric network would have only 4 stand-alone nodes.
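
A short sketch of this extraction, using an edge list inferred from the chapter's description of Figure 21.4, keeps the target's friends and only those edges that run between two friends. The function name is illustrative.

```python
# Edge list inferred from the chapter's description of Figure 21.4.
EDGES = [
    ("Alice", "Bob"), ("Alice", "Chuck"), ("Alice", "Dan"), ("Alice", "Heidi"),
    ("Heidi", "Ed"), ("Heidi", "Gerda"), ("Frank", "Ed"), ("Frank", "Gerda"),
]


def ego_network(edges, ego):
    """Return the ego's friends and the edges among those friends.

    The ego itself is left out, as in the visualizations in this chapter.
    """
    friends = set()
    for u, v in edges:
        if u == ego:
            friends.add(v)
        elif v == ego:
            friends.add(u)
    friend_edges = [(u, v) for u, v in edges if u in friends and v in friends]
    return friends, friend_edges


friends, friend_edges = ego_network(EDGES, "Alice")
print(sorted(friends))  # ['Bob', 'Chuck', 'Dan', 'Heidi']
print(friend_edges)     # []
```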

In social media, egocentric networks are often much more complex and interesting. For example, Figure 21.2 is actually a visualization of the author's egocentric network on Facebook. The colored clusters represent different groups of friends. The yellow group in the center is work friends. The blue group toward the upper right is family and the purplish group next to that is high school friends. The large blue node connecting family and high school friends is her brother. The green group isolated at the bottom is her hockey team, and the red group at the top is a group of internet friends who met on a forum.

Clusters

This egocentric network also highlights another important concept for analyzing networks: clusters. Clusters are groups of nodes that have many connections between them and are more tightly grouped than others. There is no real technical definition of a cluster, but they are easy to see when looking at a picture of the network, like in Figure 21.2.
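
Although clusters have no strict definition, groups that are totally disconnected from everyone else (like the red and green groups in Figure 21.2) can be found mechanically. The sketch below finds such groups, known as connected components, in a small network. The node names are hypothetical, and this is only a crude stand-in for the community detection that analysis tools perform.

```python
def connected_components(nodes, edges):
    """Group nodes so that every node in a group can reach the others."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Walk outward from this unseen node, collecting all it can reach.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


# Hypothetical egocentric network: three work friends, two hockey
# teammates, and one friend who knows none of the others.
NODES = ["work1", "work2", "work3", "hockey1", "hockey2", "loner"]
EDGES = [("work1", "work2"), ("work2", "work3"), ("hockey1", "hockey2")]

groups = connected_components(NODES, EDGES)
print(len(groups))  # 3
```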

This chapter will primarily focus on egocentric networks—specifically, a target's egocentric network. The analysis will reveal information about the target's social circles, the people important in the target's life, and those who have the most influence among the target's friends.

Analysis

One of the most interesting things we can do when analyzing a social network is to determine which nodes are most important. There are a number of ways to do this, and each measure tells us about importance in a different way. In this section, we will look at the most popular measures of node importance, but we will not go over the details of how to calculate these values. If you use a network analysis program (discussed later in this chapter), it will calculate those values for you.¹ Thus, the important thing to take from this section is to recognize the names and meanings of each measure.

Centrality

Centrality is the term used to describe the collection of measures that indicate how important a node is. There are a number of ways to calculate centrality, but we will focus on four major methods: degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality.

Degree Centrality

Degree centrality is the simplest centrality measure to compute. Recall that a node's degree is simply a count of how many social connections (i.e., edges) it has. The degree centrality for a node is simply its degree. A node with 10 social connections would have a degree centrality of 10. A node with 1 edge would have a degree centrality of 1.

Sometimes, an SNA program will convert those numbers into a 0-1 scale. In such cases, the node with the highest degree in the network will have a degree centrality of 1, and every other node's centrality will be the fraction of its degree compared with that most popular node. For example, if the highest-degree node in a network has 20 edges, a node with 10 edges would have a degree centrality of 0.5 (10 ÷ 20). A node with a degree of 2 would have a degree centrality of 0.1 (2 ÷ 20).
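
The 0-1 rescaling described above can be sketched in a few lines. The node names are hypothetical; the degrees are the ones from the example. Some tools normalize differently, so treat this as an illustration of the scheme described here rather than of any particular program.

```python
def normalized_degree_centrality(degrees):
    """Divide every degree by the highest degree in the network."""
    highest = max(degrees.values())
    return {node: degree / highest for node, degree in degrees.items()}


# Hypothetical node names with the degrees used in the example above.
degrees = {"most_popular": 20, "ten_edges": 10, "two_edges": 2}

centrality = normalized_degree_centrality(degrees)
print(centrality["most_popular"])  # 1.0
print(centrality["ten_edges"])     # 0.5
print(centrality["two_edges"])     # 0.1
```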

For degree centrality, higher values mean that the node is more central. As mentioned above, each centrality measure indicates a different type of importance. Degree centrality shows how many connections a person has. They may be connected to lots of people at the heart of the network, but they might also be far off on the edge of the network. For example, in Figure 21.5, both nodes labeled “Bob” have the same high degree (i.e., lots of social connections, 9 in this case), but the two roles they play are very different. The one on the right is very central and the one on the left is peripheral. These show that while degree centrality accurately tells us who has a lot of social connections, it does not necessarily show who is in the “middle” of the network.

FIGURE 21.5 Two networks where “Bob” has a degree of 9. In (a), he is on the periphery of the main network. In (b), he is right in the middle.

Closeness Centrality

Closeness centrality looks for the node that is closest to all other nodes. Recall that a path is a series of steps that go from one node to another. Closeness centrality for a node is the average length of all the shortest paths from that one node to every other node in the network.

To see how it works, we can do a simple example with the network in Figure 21.6.

FIGURE 21.6 A sample network.

Let us determine the closeness centrality for node D and for node A.

Start by computing the average shortest path length of node D.

Next, we need the distance from D to every other node in the network. It has a distance of 1 to each of its direct friends: C, E, and H. The following table shows all of the shortest path lengths for D.

Node    Shortest path from D
A       3
B       2
C       1
E       1
F       2
G       2
H       1

The average of those shortest path lengths is

\[ \frac{3 + 2 + 1 + 1 + 2 + 2 + 1}{7} = \frac{12}{7} \approx 1.71 \]

We divide it by 7 because there are 7 other nodes. Thus, the closeness centrality for node D is 1.71.

We do the same process for node A. The table below has all the shortest path lengths.

Node    Shortest path from A
B       1
C       2
D       3
E       4
F       5
G       5
H       4

Here, the average shortest path length is

\[ \frac{1 + 2 + 3 + 4 + 5 + 5 + 4}{7} = \frac{24}{7} \approx 3.43 \]

Thus, node A's closeness centrality is 3.43.

In the case of closeness centrality—unlike with degree centrality—smaller values mean that the node is more central, because it means that it takes fewer steps to get to other nodes. So, since D's value of 1.71 is smaller than A's value of 3.43, D is more central.
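
The arithmetic for both nodes can be checked with a short sketch. The distances are copied from the two tables above; the function name is illustrative. Since the edges of Figure 21.6 are not listed in the text, the sketch starts from the tabulated distances rather than from the network itself.

```python
# Shortest path lengths taken from the two tables above.
distances_from_d = {"A": 3, "B": 2, "C": 1, "E": 1, "F": 2, "G": 2, "H": 1}
distances_from_a = {"B": 1, "C": 2, "D": 3, "E": 4, "F": 5, "G": 5, "H": 4}


def average_shortest_path(distances):
    """Closeness as defined in this chapter: the mean shortest path length."""
    return sum(distances.values()) / len(distances)


closeness_d = average_shortest_path(distances_from_d)
closeness_a = average_shortest_path(distances_from_a)

print(round(closeness_d, 2))      # 1.71
print(round(closeness_a, 2))      # 3.43
print(closeness_d < closeness_a)  # True: D is more central than A
```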

Closeness centrality corresponds most closely to what we see visually. Nodes that are very central by this measure tend to appear in the middle of a network. A node with strong closeness centrality also tends to be close to most people. In an investigation, that means the person will be in a good position to hear from most friends of friends. They will be a good source of secondhand information since it can reach them quite easily.

The final two centrality measures are more complicated in their calculation, but they also offer additional insights.

Betweenness Centrality

Betweenness centrality is a widely used measure that captures a person's role in allowing information to pass from one part of the network to the other.

For example, consider Bob in Figure 21.5 (a). He is the critical node that allows information to pass from the cluster on the right to all the individual people he knows, who are shown on the left. All information passing to and from those nodes on the left must go through Bob if it is going to reach anyone else.

Thus, Bob is very important to the flow of information through this network. This is what betweenness centrality captures. Technically, it measures the percentage of shortest paths that must go through the specific node. The computation of this is quite complex, but every network analysis software tool will compute it for you. The important thing to know is that betweenness is a measure of how important the node is to the flow of information through a network.
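
For readers curious about what the software is doing, here is a rough sketch of one straightforward way to compute betweenness on a small undirected network. It is not the optimized algorithm that analysis tools typically use. For every pair of other nodes, it adds the fraction of their shortest paths that pass through the node in question. The edge list is inferred from the chapter's description of Figure 21.4, and the scores are raw sums; tools often rescale them.

```python
from collections import deque
from itertools import combinations

# Edge list inferred from the chapter's description of Figure 21.4.
EDGES = [
    ("Alice", "Bob"), ("Alice", "Chuck"), ("Alice", "Dan"), ("Alice", "Heidi"),
    ("Heidi", "Ed"), ("Heidi", "Gerda"), ("Frank", "Ed"), ("Frank", "Gerda"),
]

adjacency = {}
for u, v in EDGES:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)


def bfs_counts(source):
    """Return distances and the number of shortest paths from source."""
    distance = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                paths[neighbor] = 0
                queue.append(neighbor)
            # Every shortest path to node extends to this neighbor.
            if distance[neighbor] == distance[node] + 1:
                paths[neighbor] += paths[node]
    return distance, paths


def betweenness():
    """Sum, over all pairs, the share of shortest paths through each node."""
    info = {node: bfs_counts(node) for node in adjacency}
    score = {node: 0.0 for node in adjacency}
    for s, t in combinations(adjacency, 2):
        dist_s, paths_s = info[s]
        dist_t, paths_t = info[t]
        if t not in dist_s:
            continue  # s and t are not connected at all
        for v in adjacency:
            if v in (s, t) or v not in dist_s:
                continue
            # v lies on a shortest s-t path only if the distances add up.
            if dist_s[v] + dist_t[v] == dist_s[t]:
                score[v] += paths_s[v] * paths_t[v] / paths_s[t]
    return score


scores = betweenness()
print(scores["Alice"], scores["Heidi"], scores["Frank"])  # 15.0 12.5 0.5
```

Alice and Heidi score highest because every route between the two halves of the network passes through them, while Bob, Chuck, and Dan score zero because no shortest path between other people runs through them.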

In an investigation, a node with high betweenness is likely to be aware of what is going on in multiple social circles. For example, in Figure 21.2, the large blue node in the upper right connects the blue group to the purple group. It is the only node that does so. Thus, talking to this large blue node with high betweenness is likely to yield insights about what both groups are doing and what is going on between those two groups.

Eigenvector Centrality

The final centrality measure is eigenvector centrality. It measures the influence that a node has in a network. Again, the computation is quite complex, but any software package you use will compute it for you. (Interestingly, this measure is very similar to what Google uses to rank web pages by importance.)

A node may have a low degree centrality (and maybe even weak closeness centrality and betweenness centrality) but still be influential. Although a node that is central by one measure is often central by several other measures, this is not always the case.
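
One common way to approximate eigenvector centrality is repeated averaging: each node's score is updated from its neighbors' scores until the values settle. The sketch below is a simplified version under two assumptions. It runs a fixed number of rounds instead of testing for convergence, and it adds each node's own score at every step so that the values settle down on networks like this one. The edge list is inferred from the chapter's description of Figure 21.4.

```python
# Edge list inferred from the chapter's description of Figure 21.4.
EDGES = [
    ("Alice", "Bob"), ("Alice", "Chuck"), ("Alice", "Dan"), ("Alice", "Heidi"),
    ("Heidi", "Ed"), ("Heidi", "Gerda"), ("Frank", "Ed"), ("Frank", "Gerda"),
]

adjacency = {}
for u, v in EDGES:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)


def eigenvector_centrality(adjacency, rounds=200):
    """Approximate eigenvector centrality by repeated neighbor summing."""
    score = {node: 1.0 for node in adjacency}
    for _ in range(rounds):
        # A node's new score is its own score plus its neighbors' scores.
        # Keeping the node's own score is an assumption made here so that
        # the values do not flip back and forth between rounds.
        updated = {
            node: score[node] + sum(score[n] for n in adjacency[node])
            for node in adjacency
        }
        # Rescale so that the most influential node scores exactly 1.
        largest = max(updated.values())
        score = {node: value / largest for node, value in updated.items()}
    return score


scores = eigenvector_centrality(adjacency)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

Run on this network, the sketch should place Heidi slightly ahead of Alice, even though Alice has the higher degree. Heidi's neighbors are themselves better connected than Alice's, which is the kind of disagreement between measures described above.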

Figure 21.7 shows centrality according to the four measures we have looked at. Red nodes are very central according to the given measure, and blue nodes are not central. Notice how there are large differences among the four pictures of the same network.

FIGURE 21.7 This is the same network shown four times. Color coding indicates centrality according to different measures. Red nodes are more central and blue nodes are less central. Version (a) is degree centrality, (b) uses closeness centrality, (c) shows betweenness centrality, and (d) is eigenvector centrality. This visualization is adapted from Claudio Rocchini.

In summary, in an investigation, it is worth taking a look at anyone who has high centrality according to any of these measures. It's important to remember what each measure of centrality means:

• Degree centrality shows people with many social connections.

• Closeness centrality indicates who is at the heart of a social network.

• Betweenness centrality describes people who connect social circles.

• Eigenvector centrality is high among influential people in the network.

Obtaining Social Network Data

For anyone—of any skill level—obtaining data about the social network is often the most challenging part of the process. Although social networks can be encoded by hand, social media networks often have hundreds or thousands of people in them, with thousands or tens of thousands of edges, making hand encoding impossible.

Fortunately, there are tools available online and built into network analysis tools that will grab the social network for a target user and allow you to save it in a format that an analysis tool can open. Those tools are constantly evolving and changing, and new ones are becoming available.
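
To give a sense of what a saved network can look like, the sketch below writes an edge list as comma-separated text, one edge per row. The two-column layout and the "Source" and "Target" column names are assumptions made for illustration; check which formats your analysis tool accepts before relying on any particular layout.

```python
import csv
import io


def edges_to_csv(edges):
    """Write one edge per row under an assumed Source,Target header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])  # assumed column names
    writer.writerows(edges)
    return buffer.getvalue()


# The small network from Figure 21.1.
text = edges_to_csv([("Alice", "Bob"), ("Alice", "Chuck")])
print(text)
```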

In the next chapter, we will look at popular and stable tools that are likely to be available to you. Those tools will also allow you to analyze social networks.

In addition to computing the centrality measures described above, along with other statistics, these tools create visualizations. The images in Figures 21.2 and 21.7 are network visualizations. A network visualization is essentially a picture of the network. Nodes are shown as dots and edges are lines connecting them. There are many automated techniques for arranging the network into patterns that highlight important features, like clusters and important nodes.

The next chapter will also illustrate how to get started with these tools, including how to calculate some of the statistics mentioned here and how to create a visualization. But before we get to the details of how to get the data and create a visualization, let's look at some examples of what you should look for in a visualization and what it means in the context of an investigation.

Example Analyses

To see how this analysis could work in an investigation, this section will present several example networks and walk through some of the insights that come from looking at them.

Example 1

Start by looking at the network in Figure 21.8. (This network is contrived for illustrative purposes, not created from real-world data.) We will treat it as the egocentric network of a target (i.e., each node is a person the target knows). The edges show which of those people know one another.

FIGURE 21.8 A sample egocentric network of a target. Color indicates betweenness centrality, with darker nodes being more central. Size indicates degree centrality, where larger nodes have higher degree.

The nodes in Figure 21.8 indicate additional information as follows:

• Color indicates betweenness centrality (higher betweenness nodes are darker).

• Size indicates degree centrality (higher-degree nodes are larger).

Each node is labeled with a letter to make discussion easier.

What are some things we can learn about the target from looking at this network?

First, remember that this is an egocentric network. It shows only people who are connected to the target. The edges indicate which of the target's connections know one another. For example, since node Z and node S are connected in the network, that means that the target knows both Z and S and they know one another.

With that in mind, the groups at the top and bottom become very interesting. At the bottom, node Z is connected to six people (T, U, V, W, X, and Y) who are not connected to anyone else. Since this is an egocentric network, that means that the target and node Z know all these people, but they do not know anyone else in the social circle. This probably implies that the target and Z have a special kind of relationship where they are together when they meet other people. There is no way to know what kind of relationship, but spouses would be one example where we might see this. If the target and node Z are married, it is likely that, as a couple, they would have met people together. The nodes connected to Z could be six people they had met as a couple.

Even more interesting is the fact that at the top of the network in Figure 21.8 is another node with which the target had a similar relationship. Node A is connected to 9 other nodes that are not connected to anyone else in the network. Could this be another romantic relationship for the target, where the couple is meeting different people? Could it be a business partner and could this part of the network reflect business contacts that the partners have met together? We can't tell that from the network, but the fact that the target has these types of relationships means that both node A and node Z would be interesting people to look to for more information.

There is also a tightly connected cluster of people in the middle of the network, and node A knows two of these people as well. While this group is very tightly connected, it is not uncommon to see groups like this in egocentric networks. It usually reflects a tight group of friends or coworkers who are all connected to one another. Since, basically, everyone in this group knows everyone else, it is interesting that only two of the nodes know node A. Does that mean these two nodes (M and L) are connected to another part of the target's life? Do they hold a position of special privilege? Why would they know node A when no one else in the tightly clustered group does? This makes nodes M and L interesting for a couple of reasons: first, because there are these open questions about their relationship with the target and, second, because they can likely provide more insights into the target's relationship with node A.

Node S is also connected to two people in this cluster: P and Q. But node S has only one other connection—Z.

This also raises a few questions. First, what does it mean that node Z does not know anyone in the central cluster? If node Z has a special relationship with the target, why is Z in the dark about people with whom the target obviously has a close relationship?

It also leads to questions about node S. Is S like the other singleton nodes known by the target and node Z, and it just happens that S knows other nodes in the main cluster? Or do the target and node S have a special relationship where node S is connected to both the target's main group and the group connected to node Z?
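
The pattern that made nodes A and Z stand out, a node with many contacts who know no one else, can also be searched for automatically. The sketch below rebuilds part of the Example 1 network from the edges named in the text. Two things are assumptions: the labels A1 through A9 stand in for node A's nine unnamed contacts, and the central cluster is reduced to just L, M, P, and Q, all connected to one another.

```python
from itertools import combinations

# Edges stated in the text for Figure 21.8.
edges = [("Z", contact) for contact in "TUVWXY"]
edges += [("Z", "S"), ("S", "P"), ("S", "Q"), ("A", "M"), ("A", "L")]
# Hypothetical labels: the text does not name A's nine exclusive contacts.
edges += [("A", "A%d" % i) for i in range(1, 10)]
# Simplifying assumption: the central cluster is only L, M, P, and Q,
# with every pair connected.
edges += list(combinations("LMPQ", 2))

adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)


def exclusive_contacts(adjacency):
    """For each node, list the neighbors whose only connection is that node."""
    return {
        node: sorted(n for n in neighbors if adjacency[n] == {node})
        for node, neighbors in adjacency.items()
    }


exclusive = exclusive_contacts(adjacency)
print(len(exclusive["A"]))  # 9
print(exclusive["Z"])       # ['T', 'U', 'V', 'W', 'X', 'Y']
print(exclusive["S"])       # []
```

Nodes with long lists of exclusive contacts are candidates for the kind of special relationship discussed above, though the network alone cannot say what that relationship is.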

Example 2

Now, let's consider a larger network that is more typical of what you might find when looking at a target's social media network. Figure 21.9 shows the egocentric network for a Twitter user.

FIGURE 21.9 An egocentric network for a Twitter user. Color indicates the community to which our analysis tool guesses each person belongs. Size indicates betweenness centrality.

There are two major features of this network that pop out in the visualization:

• First, there is a large, tightly grouped cluster shown in red that has some connections to the large central group but is mostly separated.

• Second, there is a very large blue node in the center cluster. It has a thick black border added to make it easier to see in this visualization. This node happens to connect the red cluster to the main part of the network.

Understanding who these groups and individuals are will provide insight into what the relationships are in this network.

We can begin by focusing on that large node. It has high betweenness (indicated by the large size) because many shortest paths pass through it. Essentially, this node connects the main part of the network with the red cluster. Thus, we know that both this node and the target, who is not shown in the network, know one another and that they both know two groups of people: the red group and the main group including blue, green, orange, and purple nodes.

The next step is to understand what makes the red cluster unique. This is not something we can deduce just from looking at the visualization. Instead, we need to look at who the people represented by these nodes are. On social media, that means finding the accounts of people in the cluster and looking for common patterns. When there are hundreds of nodes, as in this case, you can start by picking a handful to examine to see if patterns emerge. In this example, we would do that by displaying or finding the usernames of some nodes in the red cluster and going to their profiles at http://twitter.com/«username».
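
Picking a handful of accounts to review can be done at random so that the sample is not biased toward the most visible nodes. The sketch below assumes you already have a list of usernames for the cluster; the usernames shown are made up, and the function name and fixed seed are illustrative choices.

```python
import random


def sample_profile_urls(usernames, how_many=5, seed=0):
    """Pick a reproducible random handful of profile URLs to review."""
    rng = random.Random(seed)  # fixed seed so the same sample can be revisited
    count = min(how_many, len(usernames))
    chosen = rng.sample(sorted(usernames), count)
    return ["http://twitter.com/" + name for name in chosen]


# Made-up usernames standing in for the members of one cluster.
cluster_usernames = ["example_user_%d" % i for i in range(1, 301)]

urls = sample_profile_urls(cluster_usernames)
for url in urls:
    print(url)
```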

These users are not revealed here to protect their privacy, but an analysis would show that the users in the red cluster almost exclusively post in Japanese and are located in Japan. Users in the main cluster, on the other hand, are primarily English speakers. Thus, our target (and the large blue node in the center) appears to have connections to a community of Japanese users in addition to their main contacts who speak English.

In many people's social media networks, you are likely to find clusters like this, and they will often share a distinguishing trait. It may not be something as obvious as language. You may have to probe deeper into their profiles to see where they are from, what topics they discuss, or what other personal attributes they have in common.


1 If you're interested in learning how to do this analysis in a more mathematically technical way, the author recommends you read her other book, Analyzing the Social Web!