F# Deep Dives (2015)
Part 2. Developing analytical components
Chapter 5. Understanding social networks
Evelina Gabasova
Social networks play an important role in our society. When we look at the community of people who are using F#, Twitter plays an important role. People communicate and make interesting posts on Twitter. In this chapter, I’ll go through a basic exploratory analysis of a part of the F# community that is active on Twitter.
Social network analysis is helpful in analyzing how groups work, the relations within groups, and how information spreads within a community of people. It can also help you identify people who are important or influential within a community.
Throughout this chapter, you’ll download data from Twitter with help from the F# JSON type provider. Then we’ll look at the structure of the network and compute some basic network characteristics. You’ll also visualize the entire network with D3.js and use R provider for other descriptive plots. Finally, you’ll implement the PageRank algorithm, which is a good measure of importance or centrality of nodes in a network. This way, you’ll identify the most important people in the F# community on Twitter, according to their social connections.
Analyzing a social network can give you important information about online communities. For example, let’s say you want to advertise a new product to a specific group of people. After you perform some network analysis, you can identify which people are central and most influential in the community. You can then target these specific people, and the information should spread throughout the group. You can get a good idea about the importance of people in a community purely by looking at the structure and connectivity of a network. You can also identify subcommunities in a larger social group, which may have different characteristics that can be used in advertising and marketing.
F# provides a great tool for doing all the necessary steps of exploratory data analysis. You can download and preprocess the data, compute basic characteristics of a network, look at the data in a few plots, and implement more advanced algorithms using the data. Thanks to F# type providers, you can directly work with JSON data files and call statistical and plotting functions from R. This makes F# a good tool for fast exploratory datacentric analysis: you can quickly access different data sources and run analyses without having to change environment.
Social networks on Twitter
Twitter is a service that allows people to share tweets: short messages of up to 140 characters. To receive messages that a specific user is sharing, you can follow the user. The act of following is asymmetric; a user can follow other users without their mutual acceptance. Followers merely express shared interests; Twitter connections don’t necessarily represent realworld friendships.
These factors make Twitter a good place to start with social network analysis. The nature of Twitter is open, and most of the tweets and connections are public and accessible through the API. Because people group around their shared interests, you can easily extract social networks between people.
You’ll first create a network model for Twitter. A standard network is formed of nodes and links between them. The links can be either undirected or directed, with a specific orientation that’s usually shown with an arrow. You’re modeling a social network between users, so nodes in the network represent the users. Connections between them form the links. The direction of a link is important on Twitter because it represents the act of following. For this reason, you’ll use a directed network to represent the Twitter community. On the other hand, if you wanted to represent standard connections on Facebook, you’d choose an undirected network, because those friendships are symmetric and don’t have a specific orientation.
Throughout this chapter, you’ll look at part of the active F# community on Twitter. You’ll concentrate on the social network that exists around the account of the F# Software Foundation, @fsharporg (see figure 1). The foundation is an organization that promotes and advances the F# language and supports the F# community. This is a good starting point for an exploratory analysis of the F# community on Twitter. You can expect that people who follow this account are interested in F# and that most of them are actively involved in the F# online community. This network won’t include all people interested in F# on Twitter, but it probably includes a reasonable proportion of them.
Figure 1. You’ll analyze the social network around the F# Software Foundation’s Twitter account, @fsharporg. Nodes in the network are user accounts that either are followers or are being followed by the @fsharporg account. Links are the connections between these users.
The community around @fsharporg is formed by people/accounts who follow @fsharporg or whom @fsharporg follows. You’ll download a list of such accounts using the Twitter API and extract connections between them. This will form the F# social network that you’ll analyze in this chapter.
Connecting to Twitter
In this section, we’ll look at how to connect to Twitter from an F# application using the FSharp.Data.Toolbox Twitter provider. The first step in connecting to Twitter is to obtain an API key that will identify your application and authorize it to access tweets and user data. Then you’ll use the Twitter provider to send requests to the Twitter API to obtain information about users connected to the F# Software Foundation.
If you want to work through this section of the chapter, you’ll need to log in to your Twitter account (or open a new account). Twitter requires developers to register applications that want to connect to the Twitter API at Twitter Apps (https://apps.twitter.com/). During the registration process, you have to supply a name and purpose for the application. The application can then access your Twitter account and all other Twitter content. After registering, you receive two authorization details: an API key and an API secret. The key is used to connect to Twitter, and the secret is used to sign the request. The connection is done via the OAuth authorization framework.
Twitter offers two authentication types that give access to different sets of API requests they can send. There are also different limits on the number of possible requests per time window for each of the two methods.
The simplest method is applicationonly access. It requires only the application key and secret. This type of authentication can access friends and followers and search within tweets.
The second type of access requires full OAuth authentication with an explicit signin from a Twitter user. The application then gets full access on behalf of the user and can, for example, search for users and post tweets on behalf of the registered user.
For the purpose of this chapter, you’ll use requests that require applicationonly access, although full access gives a higher rate of possible requests. You can find more information about both types of access in the FSharp.Data.Toolbox documentation and the official Twitter API documentation.
Listing 1. Connecting to Twitter
After you connect to Twitter, you obtain a Twitter context in the variable twitter. You use the Twitter provider from FSharp.Data.Toolbox to send requests to Twitter and parse the response. This provider is a light wrapper around actual HTTP requests to the Twitter API. Twitter responds to requests with JSON documents containing the requested data. The F# Data Toolbox library then uses the JSON type provider to get statically typed access to the data and exposes this statically typed view of the response.
Downloading the social network around the F# Software Foundation
As I’ve described, you’ll download the social network around the F# Software Foundation’s account, @fsharporg. The Twitter documentation uses a specific terminology for social connections in its network. When an account follows @fsharporg, it’s called a follower. On the other hand, if an account is being followed by @fsharporg, it’s called a friend. Figure 2 illustrates these two relations.
Figure 2. Twitter terms for different types of connections. Followers of @fsharporg are accounts that follow @fsharporg. Friends are accounts that @fsharporg itself follows.
Now that you know the terminology, you can say that nodes in the network are the followers and friends of @fsharporg. Links are the relations between these friends and followers. This way, you get a closed group of users around @fsharporg. I don’t include the actual @fsharporg node in the network because it would be connected to all the nodes and wouldn’t give you any additional insight.
Using the Twitter API provider and the Twitter context you opened in listing 1, you can directly download friends and followers of a specific account:
The functions twitter.Connections.FriendsIds and twitter.Connections.FollowerIds take the screen name of the F# Software Foundation and return JSON documents that contain a list of Twitter ID numbers of friends and followers. Twitter IDs are unique numbers that are assigned to users when they register. Because users can change their screen names, Twitter uses the IDs as unique identifiers of accounts.
After downloading the ID numbers, you extract them from the JSON object and collect them into an immutable set, idsOfInterest. Because the set doesn’t allow duplicate entries, you automatically include accounts that are both friends and followers of @fsharporg only once:
> follows.Ids.Length;;
val it : int = 16
> followedBy.Ids.Length;;
val it : int = 1100
> idsOfInterest.Count;;
val it : int = 1109
At the time of writing, @fsharporg followed 16 accounts and had 1,100 followers. When combined, you have 1,109 total users in the network around @fsharporg.
Nodes in the Twitter network
The ID numbers in idsOfInterest now become the nodes in your social network. Although Twitter uses IDs to identify accounts, they aren’t very informative. You would like to also get the user names that belong to those ID numbers. Unfortunately, Twitter doesn’t give you this information directly—you have to ask for it in a separate request. The following listing shows how to download information about Twitter users using the Lookup function.
Listing 2. Twitter screen names from user ID numbers
Because you can send a lookup request for up to 100 accounts at a time, you first group the IDs of interest into groups. You have 1,109 IDs, so you need 12 groups . You then send a lookup request for each group separately . After Twitter returns the requested data, you extract and return the screen name for each ID .
There’s one thing you need to keep in mind when getting information through the Twitter API: Twitter limits the number of requests allowed by a single application per 15 minutes. When you’re getting screen names from user ID numbers, the number of requests is limited to 180 per time window with full authentication.^{[}^{1}^{]} The limit for applicationonly access is only 60 requests per 15 minutes. You’re sending just 12 requests, so you don’t have to keep an eye on these limits yet, but they can become a limitation for downloading more data.
^{1} Twitter, rate limits chart, https://dev.twitter.com/docs/ratelimiting/1.1/limits.
Links in the Twitter network
Now that you’ve downloaded the nodes for the social network, all you need to add to your network model are the links between them. You’ll represent the links as tuples of IDs, where the first element is the source of the link and the second element is the target. Because you have a closed list of nodes in the network, you’re interested only in connections among this set of nodes; you’ll ignore links that lead outside of the network. The next listing shows how to download a list of friends for a specific ID.
Listing 3. Twitter connections between users
Because you want to keep the network focused on the F# community, you first write a helper function that checks whether a given ID is a member of the network or not. You go through the set of nodes’ IDs and download a list of all their friends from Twitter . Again, Twitter returns ID numbers of accounts that you filter to get only those in the network. Some users on Twitter prefer to keep their connections private, in which case you can’t get their friends’ IDs . For these accounts, you return an empty array: they will appear as isolated nodes without any connections in the final network. When you get the list of friends, you return each link as a pair of source and target nodes .
Twitter again limits the number of requests your application can send per 15 minute time window. For friends and followers requests, the limits are more restrictive than for a user lookup: this time you’re restricted to only 15 requests per 15 minutes, regardless of your authentication type. Because of this, the function needs to wait 1 minute between requests . You have a lot of IDs of interest, so downloading lists of friends takes a long time. If you don’t want to wait, the downloaded data is included in the source code for this chapter.
Network representation in the JSON format
In this section, you’ll save the network downloaded from Twitter into a JSON document. I chose the JSON format because it’s the input format for the visualization library D3.js that you’ll use to plot the network. It’s also easy to import the data back into F# with the JSON type provider.
D3.js (Data Driven Documents) is a JavaScript library for data manipulation and visualization. You’ll use it to plot your F# network in SVG that you can easily incorporate into a website. You’ll store the downloaded data directly in a format that D3.js can import.
First let’s look at the format for storing nodes in the network. You’ll use the following syntax, which lists user ID numbers and screen names as JSON values:
{
"nodes":[
{"id":12345,"name":"Alice"},
{"id":67890,"name":"Bob"},
...
{"id":11235813,"name":"Zara"}
}
You use the F# Data JSON extension to save data in this format. You can create JSON objects by calling the appropriate method of JsonValue. For example,
JsonValue.String "xyz"
produces a JsonValue string object that holds “xyz”. The next listing shows how to export nodes into JSON objects.
Listing 4. Exporting the network’s nodes into JSON
The jsonNode function transforms each user’s ID number and screen name into a JSON object that holds this information . You collect the list of JSON objects into an array and wrap it in a toplevel nodes object. Finally, you export the entire JSON object into a file by calling itsToString() method.
Next let’s look at how to save connections between Twitter users that represent links in the social network. The JSON format used by D3.js is straightforward:
{
"links":[
{"source":0,"target":1},
{"source":2,"target":0},
...
{"source":1108,"target":267}
}
For each link in the network, it stores its origin (source) and its target. The nodes are represented by their indices in the list of nodes you created in the previous step.
In listing 3, you downloaded the links from Twitter; currently they’re stored as pairs of node IDs, one for the origin of a link and one for its target. To save them in the desired format, you only need to change the ID number to a corresponding index into the array of nodes and save it as a sequence of JSON values.
You can create a simple dictionary to directly translate Twitter ID numbers to zerobased indices. Items in idsOfInterest are ordered as increasing values in the set. Let’s translate it into an immutable Dictionary:
Now you can proceed to export the links, which is similar to converting nodes into JSON.
Listing 5. Exporting the network’s links into JSON
In this section, you looked at how to save downloaded Twitter names and connections into a JSON format because it’s easy to read back into F# and it’s also a format that you can directly give D3.js to visualize your network. Both JSON files are included in the source code for this chapter.
Visualization with D3.js
At this moment, you have the complete Twitter network around the F# Software Foundation. An important part of every exploratory data analysis is visualization. In the context of network analysis, it gives you a good idea about what kind of network you’re dealing with.
You’ll use the JavaScript library D3.js for Data Driven Documents. This library can make various different visualizations that can be easily incorporated into a website. For network analysis, the library contains a socalled force layout algorithm for visualization. This algorithm isn’t limited to D3.js; it’s the basic algorithm for plotting networks, and some variant of this algorithm is present in all network analysis libraries.
This is a chapter about using F# to analyze social networks, so we won’t dive too deeply into JavaScript. All you need to do to visualize the network is to download the basic scaffolding for D3.js. At http://bl.ocks.org/mbostock/4062045, there is a code sample for visualizing a small social network of characters in Les Miserables.
Forcedirected layout
Force layout is an algorithm for plotting networks. The aim of all networklayout algorithms is to get a twodimensional picture of a network where the links have approximately the same length and with as few link crosses as possible. This task is difficult for large, complex networks such as a Twitter network. Forcedirected algorithms solve this problem by creating a physical simulation.
You can imagine nodes in a network as objects that repel each other. Links in a network are forces that keep connected nodes together. Forcedirected layout algorithms simulate this complex physical system. In the beginning, nodes are scattered randomly. The algorithm applies the repulsive and attractive forces iteratively to update the network layout until the whole system reaches a stable state. This method usually converges to a nice visual representation of a complex network.
When you look in the HTML scaffold file, the sample code reads data from a single file (miserable.json) where nodes and edges are put together. The only modification you have to make is to tell D3.js to read data from your two JSON files, one for nodes and the other for links. You can do this by changing the callback function that reads the data into two nested callback functions. The original callback function is the following:
d3.json("miserables.json", function (error, graph) {
...
}
Change this single JavaScript function into two nested callback functions, where one reads the nodes in the graph and the other reads the links:
d3.json("fsharporgNodes.json", function (error, graphNodes) {
d3.json("fsharporgLinks.json", function (error, graphLinks) {
...
}
}
You also need to change the original single variable graph with elements graph.nodes and graph.links from the original source into two distinct variables graphNodes and graphLinks, with elements graphNodes.nodes and graphLinks.links.
Now you can finally start an HTTP server and look at your visualization. You should see a chaotic network that gradually takes a more compact and structured shape as the force layout optimizes its shape.
You’ll probably notice that the visualization is far from perfect—the network gradually drifts out of the canvas as the layout adjusts itself. You’ll have to make the canvas larger to fit the entire network, and you also have to tell the nodes to stay close to the center of the canvas. If you adjust a couple of parameters, the visualization should start looking better.
Visualization Parameters
You’ll change parameters for methods of the force class:
· Gravity —Calling this method changes the strength of the force that keeps nodes close to the center of the layout. Its default value is 0.1, which isn’t enough to keep the big F# network in the canvas. If you change it to a higher value, say 1.0, the force increases and the nodes should stay in the desired area.
· Size —You have to increase the size of your canvas so the entire network fits in it. Try changing the size from 960 × 500 to 1500 × 800.
The relevant part of the code should now look like this:
var width = 1500,
height = 800;
var force = d3.layout.force()
.charge(120)
.linkDistance(30)
.size([width, height])
.gravity(1);
These changes should be enough for you to see the network form a nice compact layout (see figure 3). If you hover your mouse pointer above any node, you’ll see that node’s Twitter name.
Figure 3. Visualization of the F# Software Foundation network on Twitter using D3.js
The majority of the network is tightly linked together. Some nodes are isolated and not connected to the rest of the network. These nodes are either users who are connected only to @fsharporg and not to any other node in the network, or users who keep their list of followers and friends private. The core of the network is highly connected with a surrounding circle of lessconnected nodes. There are no immediately visible communities in this network. If there were any communities, the graph would appear to have more than one tightly connected center. This suggests that the F# community on Twitter is homogeneous and there are no competing subcommunities that don’t talk to each other.
The visualization can be improved. Currently, many of the nodes in the middle overlap, and you can’t isolate them. Play with different parameters for the force class methods to adjust the visualization into a nicer form.
Parameters to adjust network visualization
Some parameters you can adjust to improve the network visualization include the folowing:
· Charge —Adjusts the strength with which nodes repel or attract each other. Positive charge values attract nodes to each other and are better suited for other applications. In a social network, you want the layout to spread to reveal structure: therefore the nodes should repel each other, and this parameter should have a negative value. The magnitude of the charge changes how much repulsive force is applied.
· Link distance —Sets the ideal distance of nodes in the network, or the desired edge length between them. The layout tries to converge to this distance. By increasing the distance, the nodes in the core get more space to move around and spread.
Play with these and other parameters to get a feel how for how the layout adjusts when they change. A comprehensive description of all the methods and parameters is available in the documentation for D3.js at https://github.com/mbostock/d3/wiki/ForceLayout.
In this section, you looked at how to use your JSON documents generated in F# to easily visualize the F# Software Foundation network on Twitter. The network seems to have a tightly connected core of enthusiastic F# users; around them is a group of more casual users who aren’t as deeply involved in the community. By observing a visualization of a social network, you can tell a lot about the community structure and what types of users appear there.
The next section looks in more detail at the structure of the community and individual characteristics of users. You’ll see how to measure individual involvement in the network and how to find people who are worth following.
Exploring the social network
In the last section, you saw how to visualize the Twitter community around the F# Software Foundation. You’ve seen that the network is tightly connected, especially in the center, and that there are no communities in the network.
Questions we’ll explore in this part of the chapter are about individual nodes in the network. Which users are the most connected? If you want to spread an idea through the community, whom should you contact? If a wellconnected person tweets something, the message reaches many people in the group.
In the rest of the chapter, we’ll look at some characteristics of the F# social network. We’ll explore how much the network is connected, who is most followed, and who the most central people in the community are.
Representing a network with an adjacency matrix
We’ll represent the network using a socalled adjacency matrix (see figure 4). This is a binary matrix that represents connections between nodes. It’s a square matrix with a number of rows (and columns) equal to the number of nodes in the network. For your Twitter network, the dimension of the matrix is the number of users. If there is a link from node i to node j, the adjacency matrix has the value 1 in position [i,j]; otherwise the value is 0. The matrix isn’t symmetric, because the network is directed. Note that rows of the matrix represent source nodes, and columns represent targets of each link.
Figure 4. Encoding links as an adjacency matrix. Every link from node i to node j is a 1 in the [i,j] position in the adjacency matrix.
Reading JSON files with type providers
In the previous part of the chapter, you downloaded users and connections from Twitter. You’ve saved the data in a JSON format that is easy to use as an input for forcelayout visualization in D3.js. Now you’ll read the network back into F# so you can analyze its properties.
In the following listing, you load both JSON files into F# with the JSON type provider. Note that you have to give a sample document to the JSON provider, and you load the actual data file afterward.
Listing 6. Loading JSON data with type providers
Now you can read the actual data from your JSON files. You first create a few helper functions that you’ll use during the network analysis. For example, you might want to easily translate between Twitter ID numbers and corresponding screen names:
let idToName =
dict [ for node in userNames.Nodes > node.Id, node.Name.String.Value ]
let nameToId =
dict [ for node in userNames.Nodes > node.Name.String.Value, node.Id ]
In the code, you use the type Users that you created from the JSON file in listing 6. You look at all nodes in the JSON element Nodes and extract their Id and Name values. You use those to generate a list of tuples. Finally, you create an immutable dictionary from the list that you can use to translate each ID to its corresponding name and vice versa. For example, let’s try @tomaspetricek, the editor of this book:
> nameToId.["tomaspetricek"]
val it : int64 = 18388966L
As the next step, you load the links from the JSON file. In listing 5, you encoded links in the network using zeroindexed source and target nodes. You’ll use the zerobased index in your adjacency matrix as well. The additional helper functions in the next listing will come in handy to translate Twitter ID numbers to network indices and back.
Listing 7. Helper functions for Twitter IDs
For convenience, you also add two functions for working with zerobased indexing. Twitter names are more userfriendly than int64 ID numbers, and the two functions can help make your code easier to interpret. The first function, idxToIdName, takes the zerobased index and translates it to its corresponding Twitter ID number and user name. You also want to be able to find out what Twitter name belongs to any entry in the adjacency matrix. For this, the function nameToIdx takes the screen name and returns the corresponding index.
You can easily find the position of any user in your network from the user’s Twitter name:
> nameToIdx "dsyme";;
val it : int = 313
It’s equally straightforward to find the Twitter ID and name from an index in your network:
> idxToIdName 313;;
val it : int64 * string = (25663453L, "dsyme")
Now you can finally create the adjacency matrix.
Listing 8. Sparse adjacency matrix
You’ve created a sparse matrix using data from the JSON file. Links in the JSON file are indexed from zero, so you don’t have to change the values. In the JSON file, the links are stored as tuples of a source node and a target node for each directed connection. All you need to do is use the data to fill the adjacency matrix.
Because there are fewer connections than elements in the adjacency matrix, you store the data in a sparse matrix. Then you specify only nonzero elements of the matrix, and it’s stored in a memoryefficient structure. (Storing the full matrix would quickly become infeasible for larger networks.) You use the sparse matrix implementation in Math.NET Numerics, a library that is part of the Fslab Nuget package.
Let’s look at some information you can get about the sparse matrix in F# interactive:
> links;;
val it : Matrix<float> =
SparseMatrix 1109x1109Double 1.17 % Filled
. . .
{ColumnCount = 1109;
IsSymmetric = false;
Item = ?;
NonZerosCount = 14412;
RowCount = 1109;
Storage = MathNet.Numerics.LinearAlgebra.Storage.SparseCompressedRowMatrixStorage`
1[System.Double];}
You can see that only 1.17% of the matrix is filled. If you need to go through the matrix and work with the data, only the nonzero elements will be processed, which greatly speeds up the computation. It’s also much more space efficient to store only the nonzero elements. The output also tells you that the matrix isn’t symmetric. This shouldn’t come as a surprise, because links in the network are directed. Undirected networks have symmetric adjacency matrices. Also note that the network has 1,109 nodes (number of rows and columns) and 14,412 edges (number of nonzero items in the matrix). It’s stored as a sparse structure by row.
In this section, you’ve read the data stored in a JSON format into F# with type providers, and you’ve created a sparse adjacency matrix to represent the network. In the next section, we’ll look at some simple characteristics of the network that reveal how well connected users on Twitter are.
Indegrees and outdegrees
In this section, we’ll look at the degrees of nodes in the Twitter network. In general, a degree is the number of connections a node has. For directed networks like your social network, you can also define an indegree and an outdegree.
The indegree is the number of connections that lead into a node. Similarly, the outdegree is the number of links leading from a node; see figure 5. For the Twitter network, the indegree is the number of followers each user has in the network and the outdegree is the number of friends that person follows. Nodes with a higher indegree should be more influential because they’re followed by more people. Later, we’ll look at the PageRank of nodes in the network, which is a better indicator of influence.
Figure 5. In Twitter, the indegree is the number of followers and the outdegree is the number of friends (number of followed accounts).
With the adjacency matrix you created in the previous section, calculating the inand outdegrees is straightforward. Remember that if the adjacency matrix contains 1 in position [i,j], there’s a link from i to j in the network. Source nodes are in the rows of the matrix, and target nodes are in the columns.
To find the outdegree of each node in a network, all you need to do is to sum the values in each row. For a given source node, the value is 1 for every link that originates in the source node in the adjacency matrix and 0 otherwise (see figure 6). This means if you sum the values in each row, you get an array of outdegrees of all the nodes in the network. You get the indegrees similarly by summing the values in each column, because each column has the value 1 for each link that ends in the corresponding node.
Figure 6. The outdegree of a node is the sum of the elements in a row in the adjacency matrix. The indegree is the sum of the column elements for a specific node.
Listing 9. Outdegree and indegree
Using the EnumerateRows() and EnumerateColumns() methods on the sparse matrix, you can efficiently go through the matrix and sum the row and column elements. There’s also a faster method to compute in and outdegrees that uses matrix algebra, as shown in listing 10.
Faster degree computation with matrix algebra
For larger matrices, it’s inefficient to explicitly go through each row/column and sum the elements. Math.NET Numerics contains an efficient implementation of sparse matrix multiplication that you can use to sum values in each row or column.
The sum of a row in a matrix is equivalent to multiplying the row by a unit column vector. A unit vector is a vector that contains only ones. If you multiply a matrix by a unit column vector, the result is a vector that contains a sum of every row in the matrix:
outdegree[s] = Σ^{N} Links[s,t] * I[t]
The equation says that the outdegree of a source node s is the sum of the sth row in the link matrix times a vector of ones of the same size. Because the row in the link matrix contains 0 when there’s no link and 1 when there’s a link, the resulting sum counts 1 for every link. The result is exactly the same as the explicit computation in listing 9. Due to efficient implementation of matrix multiplication in Math.NET, multiplying a matrix by a vector is faster than explicit row enumeration.
Similarly, you compute the indegree by leftmultiplying the link matrix by a unit vector, only this time the vector must be a row vector. This multiplication sums each column in the link matrix.
The following figure shows a visual representation of matrix multiplication to illustrate the process.
Computing the outdegree efficiently with matrix multiplication
Listing 10. Calculating the indegree and outdegree with matrix multiplication
You can compare how much faster the matrix algebra is versus explicit degree computation. Use the directive #time to turn on the timing function in F# interactive:
> #time;;
> Timing now on
> outdegree links;;
Real: 00:00:00.142, CPU: 00:00:00.140, GC gen0: 0, gen1: 0, gen2: 0
> outdegreeFaster links;;
Real: 00:00:00.001, CPU: 00:00:00.015, GC gen0: 0, gen1: 0, gen2: 0
The increase in speed is immediately visible. It’s usually significantly faster to use standard matrix algebra operations compared to elementbyelement computations.
Finding the mostconnected users
Now that you have indegrees and outdegrees for every node in the F# network, you can determine the mostconnected users. You can print the degree of any user using the helper functions defined in listing 7. For example, let’s see how many people in the network follow Don Syme, the designer and architect of the F# language:
> indegrees.[nameToIdx "dsyme"];;
val it : float = 644.0
This result means Don Syme has 644 followers in the network. That’s more than a half the nodes! Let’s see who the other major accounts in the F# Foundation network are.
Listing 11. Top users from a ranking
You define a function that takes a general ranking of users in the network and returns the specified number of users with the highest ranking. Now you can get a list of people with the most followers by applying it to the indegrees:
topUsers indegrees 5
> Seq.iteri (fun i (id, name, value) >
printfn "%d. %s has indegree %.0f" (i+1) name value)
This code prints the list of top five mostfollowed people in the network:
1. dsyme has indegree 644
2. tomaspetricek has indegree 556
3. migueldeicaza has indegree 545
4. VisualFSharp has indegree 483
5. c4fsharp has indegree 457
You can see that Don Syme is the user with the most followers. In second place is Tomas Petricek, the editor of this book. Miguel de Icaza, founder of the Mono project (and many other open source projects), is in the third position. He is also the only person in this list who isn’t directly connected to F#, which shows that people around the F# Software Foundation are interested in open source development.
How do these highranking nodes compare to the average number of followers in this network?
> Seq.average indegrees;;
val it : float = 12.99549143
On average, people have only 13 followers in this network. This is typical for certain types of social networks, where there are a few highly connected nodes and a large number of lessconnected nodes. Let’s visualize the entire degree distribution to get a bigger picture.
Using the R provider to visualize the degree distribution
Let’s look at a few simple visualizations you can use to examine the degree distribution in a social network. Because F# itself doesn’t have rich visualization libraries, you’ll use R to produce some plots. The R language is an open source statistical programming language with rich visualization packages. Using it from F# with the R provider allows you to use all the functions that exist in R directly in the F# environment. To be able to use the R provider, you’ll need to install R on your computer from www.rproject.org. After installation, you can use R directly from F# with a help from the R type provider. Like Math.NET Numerics, the R provider is part of the FsLab package. Calling an R function is straightforward:
You can pass your array of indegrees directly to the plot function in R, and it creates a figure with a point for every node in the network (see figure 7). The y axis in the plot represents the indegree values of the nodes. You can see that there are a few nodes with a high indegree and that most users are concentrated around the low indegree values. Because most of the points blend together in the lower section of the graph, you might get a better visualization with a different type of plot.
Figure 7. A simple visualization of indegrees in the F# network with the R type provider
Log degree distribution and scalefree networks
Figure 7 shows many nodes with small degrees and a couple of nodes with very high degrees. Such a highly nonsymmetric distribution is usually visualized using the socalled loglog plot. This type of plot has a logarithmic scale on both the x and y axes. This way, it’s able to show quantities with different scales in a single plot.
Let’s create the loglog plot using the R plot function. On the x axis, you’ll have unique values of indegrees in the network. On the y axis, you’ll have the number of nodes with that specific indegree. For example, let’s say you have 10 nodes with indegree 15. There will be a dot in the plot in location [15,10]. Both axes in the plot will have a logarithmic scale.
Listing 12. Degree distribution of nodes
To call a function in R with the R provider, you can pass in specific named parameters that Visual Studio helpfully displays in a tooltip. Alternatively, you can use the full functionality of R and call functions with a richer set of optional parameters. Here, you want to adjust a couple of more advanced parameters in the resulting chart, such as labels and axis types.
In the R provider, any R function can take an IDictionary<string,obj> as its input. In this dictionary structure, you can specify values to any R parameters that the function can take. The R provider also gives you a useful function namedParams that creates the dictionary structure from a sequence of tuples. For any function, you can look in the R documentation to access the full list of potential parameters.
In listing 12, you compute the degree distribution, which shows how many nodes of each degree are present in the network. You include several advanced parameters for the R plot. By specifying the log parameter, the resulting chart in figure 8 is on a loglog scale.
Figure 8. Loglog plot for indegrees in the F# network. Both axes have a logarithmic scale.
When you look at the loglog plot, you can again see that there are many nodes with small degrees. As the degree increases, the number of nodes with each degree gets smaller. This degree distribution is typical for a growing social network. If the indegree distribution looks like a straight line in the loglog plot, it tells you that the network follows a powerlaw type of distribution.
Scalefree networks and the power law distribution
Scalefree networks are networks with a degree distribution that follows the power law. It can be formulated mathematically by looking at the number of nodes in the network that have an indegree equal to d compared to the number of nodes in the entire network.
Let P(d) be the fraction of nodes of degree d in the network. In a scalefree network, this number is approximately
P(d) ~ d^{–γ}
where γ is a parameter that is specific to each network, typically around 2 or 3. The larger the degree, the smaller the number of nodes with that degree in the network. This equation is linear on the loglog scale.
Powerlaw networks can be modelled using the socalled preferential attachment model. When a network grows over time, new nodes are more likely to connect to nodes that are popular and already have many connections. This way, the older nodes have high indegrees that keep growing. Newer nodes have less time to accumulate connections, and their indegrees stay smaller and grow more slowly. The network has a richgetricher property, because the older nodes keep accumulating links due to preferential attachment. Another example of a network that follows the powerlaw degree distribution is the entire World Wide Web.
Some other types of social networks don’t follow the powerlaw indegree distribution. Networks with separate strong communities are a typical example of a different type of network.
The highdegree nodes in your network are sometimes called hubs. In the context of a Twitter network, hubs are users who efficiently spread information through the network. If a person with a high degree tweets something, it reaches more users in the network. Without such highdegree nodes, information spreads much more slowly.
Networks with a powerlaw degree distribution are also robust with respect to failure. If a node drops out of the network randomly, the hubs have a low probability of being affected. Also, if a hub disappears from the network, several other hubs still connect the network together. This is a nice property of a Twitter network: if a user closes their account, the entire community isn’t significantly affected, because the other nodes in the network keep the connections in the group alive. On the other hand, if too many hubs drop out, the network splits into a set of disconnected subnetworks that don’t communicate anymore.
In this section, you’ve learned that there are some highly connected users with many followers in the F# network. Does the high connectivity reflect the underlying importance of each node in the network? In the next section, we’ll look at a better measure of node centrality.
PageRank
Indegree and outdegree aren’t good measures of actual importance in the network. A node that is connected to every other node probably isn’t interesting. On the other hand, an account on Twitter that is being followed only by important and wellconnected nodes may be important but not widely known in the network.
In this section, we’ll look at the importance of individual nodes in the network with the PageRank algorithm. The overall importance of each node doesn’t depend only on its in or outdegree, but also on who the followers of that node are.
The PageRank algorithm was originally used by Google to rank web pages based on their relative importance. When searching for a term, it’s important to return not only the result that best fits a query, but also the most important page that is relevant. The importance of a web page can be estimated from the web graph using the PageRank algorithm. This algorithm has become very popular as a measure of the centrality of nodes in complex networks and has been used in many different areas, such as identifying cancer genes in biological networks.
PageRank and the random surfer model
PageRank is based on a probabilistic model called the random surfer model, developed by Google to rank web pages based on their importance. Imagine a web surfer who clicks links randomly. Each link takes the surfer to a different web page, where the surfer clicks a link again, and so on. If there’s no outgoing link, the surfer jumps randomly to any other page on the web. Sometimes the surfer gets bored and jumps to any other web page randomly, so the surfer doesn’t get stuck in one area of the network. This way, the surfer moves around the web indefinitely.
The PageRank value is the total proportion of time the surfer stays at each page during this infinite websurfing session. Mathematically, it’s a probability distribution over web pages; therefore the sum of all PageRank values is equal to 1.
We assume that important pages are those that have many links pointing toward them. The random surfer should visit such pages often. Also, if an important web page links to another web page, then that other web page is probably important as well. A random surfer will also visit such pages often, because important pages link to them.
Mathematical formulation of PageRank
The basic PageRank algorithm computes the importance of each node by looking at the number and importance of nodes that link to that node and at the quality of those links. The importance is given by each node’s PageRank. The quality of a link is measured by the outdegree of its source node. This definition of quality tells you that if an important node links to just a few other nodes, the links are significant. If an equally important node links to thousands of other nodes, the individual links don’t have much weight. The fewer links lead out of a node, the more significant they are.
Let’s look at a formal definition. The PageRank of a node n is computed by summing the PageRanks of all nodes that link to that node, divided by their outdegrees:
The sum in the equation goes over all links that point to a node n. For each of these links, you take the PageRank of its source node s and divide it by its outdegree. This process is illustrated in figure 9.
Figure 9. Basic PageRank computation. A node’s PageRank is the sum of the PageRanks of nodes that link to that node weighted by their inverse outdegree.
As you can see, this definition is recursive. Every node’s value depends on the value of the nodes that link to it and also on the significance of that link. You can compute the values by iteratively cycling through all the nodes in the network until you reach a stable result.
The inverse outdegree of each node in the previous equation is equal to the transition probability. If the random surfer is in a node s and selects any outgoing link at random, the probability of choosing a link is
You’ll precompute these quantities and store them in a matrix. This matrix is called a transition matrix because it represents the transition probabilities. The first part of the following listing shows how to compute this transition matrix using the outdegrees you computed earlier.
Listing 13. Transition matrix
With the basic formulation of the transition probabilities, you’ll run into problems with nodes that have zero outdegree. These nodes represent users who either don’t follow anyone in the community or keep their connections private. Are there any such nodes in your Twitter network?
outdegrees
> Seq.countBy ((=) 1.0)
val it : seq<bool * int> = seq [(false, 969); (true, 140)]
There are 140 users who don’t have any outgoing links in the network. Such nodes are usually called dangling nodes, and you have to correct for them.
In the random surfer model, if the surfer lands on a page with no outgoing links, the surfer jumps to any web page at random. You implement this by adding virtual links from dangling nodes to all nodes in the network with uniform probabilities. If the network has N nodes, the probability of randomly jumping to any of them is equal to 1/N. The full transition matrix is implemented in the second part of listing 13.
Adding virtual links fills the sparse transition matrix with many additional entries. The effect is relatively large:
> transitionBasic;;
val it : Matrix<float> =
SparseMatrix 1109x1109Double 1.17 % Filled
> transitionMatrix;;
val it : Matrix<float> =
SparseMatrix 1109x1109Double 9.38 % Filled
The full transition matrix now has eight times more elements than the original one without the correction. This approach would quickly fill the transition matrix for networks with more dangling nodes, and you’d have to use a different approach. For this network, the transition matrix is still sparse enough that it won’t affect efficiency of the code.
Now that you have all the transition probabilities, you could iterate the PageRank equation to get an estimate of importance for nodes in the network. But first you need to make an additional modification.
Calculating PageRank with a damping factor
If you implemented the basic PageRank equation and iterated over all nodes in the network, you’d run into problems with parts of the network that aren’t fully connected. Some nodes aren’t accessible from other parts of the network, and other nodes have no incoming links and therefore zero PageRank values. The computation might get stuck in an isolated part and not propagate PageRanks through the entire network.
The random surfer model includes random jumps between nodes when the surfer gets bored. These jumps help avoid getting trapped in a subpart of the network. You include this behavior through an additional term in the PageRank. In the extended version of the algorithm, the random surfer follows links between nodes with probability d. With probability 1d, the surfer jumps randomly to any node in the network:
Probability d is called the damping factor. Generally recommended values are around 0.85. The extended equation now counts the incoming PageRank values of other nodes weighted by d.
Let’s look at the new term in the equation. The probability of being in node n due to a random jump is the same for all nodes: it’s equal to 1/N, N being the number of nodes in the network. It’s weighted by the probability of the random jump, which is (1 – d).
This completes the PageRank algorithm. Let’s use an iterative method to compute it for all nodes in the Twitter network. The PageRank algorithm starts by initializing all PageRank values to uniform values 1/N. Then it iteratively computes the PageRank equation for all nodes in the network until convergence. The final PageRank value represents the overall probability of being in a node at any given time; you interpret it as the importance of the node. Figure 10 shows PageRank values for a small network of three nodes.
Figure 10. PageRank values for a sample network with three nodes
Here’s the corresponding adjacency matrix:
A 
B 
C 

A 
0 
0 
0 
B 
1 
0 
0 
C 
1 
1 
0 
And here’s the transition matrix:
A 
B 
C 

A 
1/3 
1/3 
1/3 
B 
1 
0 
0 
C 
1/2 
1/2 
0 
Using MapReduce to compute PageRank
A popular model for PageRank implementation is the MapReduce algorithm. It allows parallel distribution of computations across multiple machines and makes the PageRank algorithm applicable to largescale problems, such as the World Wide Web. You’ll use the same basic computational model on a smaller scale because it translates well into a functional paradigm.
The MapReduce algorithm has mapper and reducer stages and works with key/value pairs. For this PageRank computation, the keys will be the nodes in the network and the values will be their PageRanks.
In the first stage, the mapper function goes through all the key/value pairs and produces a set of key/intermediate value pairs. The reducer function then collects intermediate values for each key and combines them to form a new value for that key. The mapper and reducer function are implemented in the following listing.
Listing 14. Mapper and reducer functions
Let’s look at these functions in more detail. The mapper function goes through all links in the network. For every link, it outputs the target node of the link as the key and as its intermediate value. The value is the partial contribution to PageRank of the target node from that particular link . If you look back at the PageRank equation, these are the individual elements in the sum.
If a node has no incoming links (it’s not a target for any link), it’s skipped in the computation and doesn’t appear as a key in any of the key/intermediate value pairs. It effectively disappears from further computations. For this reason, you add a dummy key/value pair for each node in the network with value 0 . This doesn’t affect the PageRank values but keeps the nodes from disappearing. The output from mapper is then used as an input to reducer.
The reducer function takes the intermediate values computed by mapper and groups them by key . This way, it collects all the partial PageRank contributions that are coming to the target node through its incoming links. You sum these values for each target node, which gives you the result of the basic PageRank computation . Then you add the correction for random jumps in the network via the damping factor. The results are PageRank values for all the nodes in the network . The first iteration of the MapReduce algorithm for the simple network in figure 10 is shown in figure 11.
Figure 11. First iteration of the MapReduce algorithm for the PageRank computation for figure 10’s simple network with three nodes
The mapper function can be easily parallelized or distributed among different machines because all the computations are independent of each other. reducer then receives data from different machines and only combines the results; again, results for different nodes are independent.
Now you can put together the full PageRank algorithm. You initialize all PageRank values to equal starting values of 1/N. Then you run mapper and reducer repeatedly until the PageRank values don’t change anymore. The next listing shows a function that computes PageRank with MapReduce.
Listing 15. PageRank algorithm
In each iteration, the function runs mapper and reducer and computes the new PageRank values. The new values are compared with the previous ones. If the sum of all the differences is smaller than some predefined threshold, the algorithm converged and the final results are returned.
PageRank results
When you run the MapReduce algorithm on the Twitter network, it converges after only 17 iterations. To view the results, you can reuse the function to display the topranking accounts in the Twitter network from listing 11:
topUsers pr 5
> Seq.iteri (fun i (id, name, value) >
printfn "%d. %s has PageRank %f" (i+1) name value)
PageRank gives different result than the previous indegree metric:
1. migueldeicaza has PageRank 0.033130
2. dsyme has PageRank 0.032783
3. tomaspetricek has PageRank 0.027757
4. LincolnAtkinson has PageRank 0.021993
5. VisualFSharp has PageRank 0.020233
Don Syme is no longer at the top; he’s fallen to the second position, after Miguel de Icaza. There are differences between PageRank and indegree in other positions, as well. For example, the account @LincolnAtkinson has a different rank in the two metrics. Based on his indegree, he’s in position 46 with indegree 76. When you look at his PageRank, he jumps up to fourth place! This major difference is caused by the users who follow him. He’s followed by users with high PageRank who pass their high rank to him as well.
You can see part of the cause when you look at the differences in average indegree of followers for each user. You can write this function as a simple exercise. Followers of Miguel de Icaza in the Twitter network have an average of 14.6 followers. Lincoln Atkinson’s followers have much a higher average number of followers: 38.7. This is one reason some accounts that don’t have a high number of followers by themselves have a high PageRank.
This method gives you new insight into the Twitter network. You can even use PageRank values to suggest Twitter accounts that might be worth following. It’s a more reliable measure of significance and helps to uncover important nodes in the network that aren’t immediately obvious from just their number of followers. If a user with a high PageRank tweets something, it quickly reaches other important nodes in the network.
In this section, you’ve looked at how the PageRank algorithm works and how to implement it using the MapReduce algorithm. You’ve seen that this algorithm uncovers some important nodes in the network that were not apparent before. To conclude the chapter, you’ll visualize the information that you got from PageRank to see where the important nodes are located in the Twitter network.
Visualizing important nodes
To see the importance of nodes, you’ll use the PageRank values to modify the nodes’ diameter in the network visualization. You need to go back to listing 4, where you generated a JSON document with data about nodes in the network. A simple modification of the code adds another attribute r that holds each node’s PageRank value.
Listing 16. JSON file for nodes with PageRank information
To update the plot you created with D3.js, you need to go back to the HTML scaffold file and add the following line:
node.attr("r", function (d) { return Math.sqrt(d.r) * 100 + 3; })
Here you read the PageRank in the r attribute. You use it to modify the radius of each node in the visualization. Additionally, you scale the value so the area of each node is proportional to its PageRank. You also need to adjust some parameters of the visualization, such as the strength of repulsive forces, gravity, and link distance, to achieve a nice plot of the entire network as shown in figure 12.
Figure 12. Visualization of the F# Twitter network with nodes scaled proportionally to their PageRank values
Summary
This chapter looked at part of the F# community on Twitter and gave you some new insights into this community. You determined which users are important in the network and which users play the role of hubs. You also produced several different plots to analyze various aspects of the network’s structure.
If you want to spread information through the F# community, you know which users you should target on Twitter. The same ideas can be used in other areas: for example, in advertising and marketing, to create effective campaigns that use insights from social networks.
You downloaded data directly from Twitter, exported it for use with external webbased visualization tools, and used the data for an advanced mathematical analysis of the network structure. Using F# for the entire exploratory analysis allowed you to combine all the different steps in one framework and therefore shorten the time to market.
The code also shows other aspects of F# that were highlighted in chapter 1. Because the mathematical formulation of a problem is close to a functional programming style, you can easily translate PageRank equations into an efficient implementation. This property of functional code greatly simplifies the correct implementation of complex mathematical problems.
About the author
Evelina Gabasova is a machinelearning and datascience enthusiast. She studied computational statistics and machine learning at University College London, and she will be finishing her PhD at Cambridge University in bioinformatics and statistical genomics in 2015.
Evelina has used many different languages in her work, such as MATLAB, R, and Python. In the end, F# is her favorite, and she uses it frequently for data manipulation and exploratory analysis. She writes a blog on F# in data science at www.evelinag.com, and you can find her on Twitter as @evelgab.