Introduction to Social Media Investigation: A Hands-on Approach, 1st Edition (2015)
Chapter 22. How to use NodeXL
Derek Hansen, Brigham Young University, Provo, UT, USA
Marc Smith, Social Media Research Foundation, Belmont, CA, USA
Network data are inherently different than traditional datasets and require specialized software to analyze and visualize. New tools, such as NodeXL, are making network analysis increasingly accessible, particularly to nonprogrammers. NodeXL is a free add-in for Microsoft Excel, supported by the Social Media Research Foundation. In this chapter, we'll introduce the basics of using NodeXL.
Getting Started with NodeXL
For a more comprehensive treatment of NodeXL than this chapter provides, see Analyzing Social Media Networks with NodeXL: Insights from a Connected World.
Installing and Navigating
Users of recent versions of Windows and Office can run NodeXL. The application is downloaded from http://nodexl.codeplex.com. Once installed, type “NodeXL” into the start menu and choose “NodeXL Excel Template.” This will open a blank NodeXL workbook file, which includes a custom NodeXL menu ribbon as shown in Figure 22.1. The menu provides access to all NodeXL features, which are organized into meaningful groups such as Data, Graph, Visual Properties, and Analysis. Network data is stored in Excel worksheets, while the graph pane displays the network visually as shown in Figure 22.1.
FIGURE 22.1 NodeXL worksheets (left), graph pane (right), and custom menu (top) in Excel showing Twitter data.
Each NodeXL workbook file (which ends in .xlsx like all Excel files) includes a single network. Within each file, there are several specialized worksheets, each of which contains data associated with a different dimension of the network. For example, the Vertices worksheet includes a row for each person (i.e., node) in the network as shown in Figure 22.1. Additional information about each person is shown in the many columns to the right of the person. For example, the number of followers and tweets associated with each user in Figure 22.1 is shown. Other columns show the visual properties of a node (e.g., its color, size, and shape), labels, and centrality metrics that help identify how “important” a person is in the network.
The Edges worksheet includes a row for each link, tie, or connection between two entities along with related information. For example, the first row of Figure 22.2 shows that Twitter user bostontweetup mentioned ga_boston in a tweet posted 1 Dec, 2014 at 21:00, along with other information. Other columns show information such as the visual properties (e.g., thickness and style) and labels associated with each edge.
FIGURE 22.2 NodeXL Vertices worksheet and graph pane showing data from Twitter.
Additional worksheets contain additional information about the summary of overall network metrics, the composition of groups (i.e., clusters of related nodes), and the summary of textual analysis of text content. The following sections explain how to capture social media network data similar to this example and introduce the techniques needed to gain insights into network data by calculating appropriate network metrics and creating meaningful visualizations.
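To make the worksheet structure concrete, the following Python sketch models a tiny network the way the Edges and Vertices worksheets do: one row per connection and one row per distinct person. The two account names come from the Twitter example above; the column names and the second edge are illustrative assumptions, not NodeXL's actual column layout.

```python
# Hypothetical miniature of the two core worksheets; rows are illustrative only.
edges = [
    {"vertex1": "bostontweetup", "vertex2": "ga_boston", "relationship": "Mentions"},
    {"vertex1": "ga_boston", "vertex2": "bostontweetup", "relationship": "Replies to"},
]

def build_vertices(edge_rows):
    """Return one row per distinct endpoint, in order of first appearance."""
    seen = {}
    for row in edge_rows:
        for name in (row["vertex1"], row["vertex2"]):
            seen.setdefault(name, {"vertex": name})
    return list(seen.values())

vertices = build_vertices(edges)
```

Every name that appears in an edge row ends up with exactly one vertex row, which is where per-person attributes and metrics would then be stored.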
Collecting Network Data
Although network data is at the core of social media sites, these services do not always make it easy to extract. NodeXL helps collect social media network data by providing data importers that automatically grab data from popular social media sources such as Twitter, email, YouTube, and Flickr. Additionally, third-party data importers (which can be installed separately) allow users to import data from other sources such as Facebook and MediaWiki (see http://nodexl.codeplex.com/wikipage?title=Third-Party%20NodeXL%20Graph%20Data%20Importers). Network data can also be manually entered, copied in from another spreadsheet, imported from another network analysis tool (e.g., as a GraphML file), or downloaded from the NodeXL Graph Gallery (found at http://nodexlgraphgallery.org), which features collections of NodeXL network datasets.
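For readers curious about what a GraphML file contains, here is a minimal sketch that reads nodes and edges from a GraphML string using only the Python standard library. The sample document is invented for illustration, and the sketch ignores attribute data (the GraphML key and data elements) that real exports usually carry.

```python
import xml.etree.ElementTree as ET

# Standard GraphML namespace; the sample graph below is made up.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="a"/>
    <node id="b"/>
    <node id="c"/>
    <edge source="a" target="b"/>
    <edge source="b" target="c"/>
  </graph>
</graphml>"""

def read_graphml(text):
    """Return (node ids, edge pairs, directed flag) from a GraphML string."""
    root = ET.fromstring(text)
    graph = root.find("g:graph", GRAPHML_NS)
    directed = graph.get("edgedefault") == "directed"
    nodes = [n.get("id") for n in graph.findall("g:node", GRAPHML_NS)]
    edges = [(e.get("source"), e.get("target"))
             for e in graph.findall("g:edge", GRAPHML_NS)]
    return nodes, edges, directed

nodes, edges, directed = read_graphml(SAMPLE)
```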
Importing Twitter Search Data
To import tweets that contain a certain keyword or phrase, choose the NodeXL > Data > Import > From Twitter Search Network menu option from the NodeXL menu ribbon. This will open the import dialog shown in Figure 22.3. This feature can create a network of Twitter users who recently used the keyword(s) you specify (“social media” in this example). Twitter users will be connected to each other based on mention and reply-to relationships as shown in Figure 22.2.
FIGURE 22.3 Twitter Search Network importer with limit changed to 1000 tweets and the search phrase “social media” entered.
The search term(s) that you want to map and measure can be entered in this dialog along with other options as desired. By default, the importer only grabs the latest 100 tweets; this limit can be increased to as many as 18,000 for more popular topics. Twitter's data access rules and restrictions limit the amount of data you can download, particularly for follow relationships, as described in the “More about this option” section. Additionally, only recent data is collected, anywhere from a few hours to about a week depending on the popularity of the topic. The first time you use the NodeXL Twitter importer, you will need to authenticate with Twitter as described in the checkboxes in the lower left corner of Figure 22.3. Selecting the “Expand URLS in Tweets” option will convert shortened URLs into their underlying form.
After clicking OK and waiting for the network data to download from Twitter, NodeXL will populate the Edges and Vertices worksheets with the basic network data and the additional Twitter statistics shown in Figures 22.1 and 22.2.
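The idea of connecting users through mentions can be sketched in a few lines of Python. This is only an illustration of how mention edges could be derived from tweet text, not a description of the NodeXL importer; the tweet texts, the dictionary keys, and the simple regular expression are assumptions.

```python
import re

# Naive pattern for @mentions; it would also match the tail of an email address.
MENTION = re.compile(r"@(\w+)")

def mention_edges(tweets):
    """Return (author, mentioned_user) pairs, lowercased, skipping self-mentions."""
    edges = []
    for tweet in tweets:
        author = tweet["user"].lower()
        for name in MENTION.findall(tweet["text"]):
            target = name.lower()
            if target != author:
                edges.append((author, target))
    return edges

# Invented sample tweets for illustration.
tweets = [
    {"user": "bostontweetup", "text": "Great social media meetup with @ga_boston tonight"},
    {"user": "ga_boston", "text": "Thanks @BostonTweetUp!"},
    {"user": "someone", "text": "social media is everywhere"},
]
edges = mention_edges(tweets)
```

Each pair is directed: the first element is the person who wrote the tweet and the second is the person mentioned.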
Other NodeXL network importers work in a similar way, although the networks they create differ depending on the types of connections supported by the social media platform. For example, the YouTube Video Network imports a network where the nodes are YouTube videos and the edges connecting them are generated based on the number of people who comment on both videos (or the number of category tags they share).
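As a rough illustration of the co-commenter idea, the sketch below weights each pair of videos by the number of people who commented on both. The video identifiers and commenter names are hypothetical, and the function is an assumption about how such edges could be built rather than the importer's actual logic.

```python
from itertools import combinations

def co_commenter_edges(commenters_by_video):
    """Weight each video pair by the number of people who commented on both."""
    edges = {}
    for v1, v2 in combinations(sorted(commenters_by_video), 2):
        shared = commenters_by_video[v1] & commenters_by_video[v2]
        if shared:
            edges[(v1, v2)] = len(shared)
    return edges

# Hypothetical comment data: video id -> set of commenter names.
comments = {
    "video_a": {"ann", "bob", "cy"},
    "video_b": {"bob", "cy"},
    "video_c": {"dee"},
}
video_edges = co_commenter_edges(comments)
```

Here only video_a and video_b are linked, with a weight of 2, because bob and cy commented on both.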
Importing a Sample Facebook Ego Network
The remainder of this chapter will use data from one of the authors' anonymized Facebook ego network. In other words, it shows the relationships among all of that author's Facebook friends (though their names have been changed to the most popular baby names of 2014). You can find the Sample Facebook Ego network on the NodeXL Teaching Resources web page, which also links to other resources that may be of interest: https://nodexl.codeplex.com/wikipage?title=NodeXL%20Teaching%20Resources. Once downloaded to your desktop (or other “trusted” location on your computer), open a new NodeXL file and choose NodeXL > Data > Import > From NodeXL Workbook Created on Another Computer. This will create a local copy of the NodeXL network data file, which you can then navigate to and open.
Alternatively, you can import your own Facebook ego network after installing the Social Network Importer for NodeXL plug-in (see http://socialnetimporter.codeplex.com for download and installation instructions). There are three different Facebook importers that extract networks from Facebook Fan Pages, Groups, and personal Friend networks. To import your personal ego network, use the default settings in the From Facebook Personal and Timeline Network importer.
After importing the sample file or your own network, change the Graph Type to Undirected. To do this, find the Type: drop-down in the Graph portion of the NodeXL ribbon and select Undirected. Facebook Friendship relationships are always mutual, or “undirected.” When analyzing other networks, such as Twitter follow networks, the network Type should be set to Directed, which will ensure that arrows appear on network edges to indicate a relationship that starts with one person and ends with another.
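The difference between the two graph types can be illustrated with a short Python sketch. In a directed edge list, (ann, bob) and (bob, ann) are two different relationships; in an undirected one they are the same tie. The names are made up, and the function shows the concept only, not what NodeXL does internally when the Type setting changes.

```python
def to_undirected(directed_edges):
    """Collapse (source, target) pairs into unordered pairs, dropping duplicates."""
    seen = set()
    result = []
    for source, target in directed_edges:
        key = frozenset((source, target))
        if key not in seen:
            seen.add(key)
            result.append(tuple(sorted((source, target))))
    return result

# Hypothetical follow relationships (directed).
follows = [("ann", "bob"), ("bob", "ann"), ("ann", "cy")]
friends = to_undirected(follows)
```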
NodeXL includes several tools to help analyze network data, which are available via the Analysis section of the NodeXL menu ribbon. This section introduces the most commonly used features including a guide to creating groups of nodes based on network clustering algorithms and calculating common social network analysis metrics such as those described in the prior chapter.
Creating Groups of Nodes
It is often useful to identify network groups or collections of nodes that belong together. Groups can be created based on different techniques in NodeXL, which are available via the Groups drop-down menu in the Analysis section of the NodeXL menu ribbon. One technique is to create groups based on a shared attribute of the vertices (e.g., Facebook users that share the same gender or Twitter users that are in the same time zone). Another technique, illustrated in this section, is to create groups based on the network structure itself.
As illustrated in the prior chapter, there are often subsets of nodes that are highly connected to one another but only loosely connected to other nodes or subgroups in the network. This will become apparent as we calculate network-based groups, called clusters in NodeXL, in the Sample Facebook Ego network file. The sample Facebook file includes all of the friendship interconnections among one author's Facebook friends. The author is not shown in the graph itself, since he would be connected to all other nodes, making it unnecessarily cluttered. The focus instead is on the shared connections among his friends.
Once you have opened the file, choose the “Group by Cluster” option in the NodeXL > Analysis > Groups drop-down menu, which will open a window similar to that shown in Figure 22.4. There are several clustering algorithms (also called “community detection algorithms”) to choose from, each of which will give slightly different results. Trial and error can help identify the one that creates the most meaningful groupings. For now, use the Clauset-Newman-Moore algorithm and make sure to check the box that puts all nodes that are disconnected from all other nodes into a single group. This can help keep graphs with many isolated individuals less cluttered.
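Grouping by a shared attribute amounts to bucketing vertex rows by the value in one column. A minimal sketch, assuming hypothetical vertex rows with a time_zone column:

```python
from collections import defaultdict

def group_by_attribute(vertex_rows, attribute):
    """Map each distinct attribute value to the vertices that share it."""
    groups = defaultdict(list)
    for row in vertex_rows:
        groups[row[attribute]].append(row["vertex"])
    return dict(groups)

# Hypothetical vertex rows; names and time zones are invented.
vertex_rows = [
    {"vertex": "ann", "time_zone": "Eastern"},
    {"vertex": "bob", "time_zone": "Pacific"},
    {"vertex": "cy", "time_zone": "Eastern"},
]
groups = group_by_attribute(vertex_rows, "time_zone")
```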
FIGURE 22.4 Group by Cluster options.
Once the clustering algorithm calculation completes, you will be taken to the Groups worksheet shown in Figure 22.5. Each row on the worksheet shows a different group (named G1, G2, G3…) each of which is given a default color and shape, which becomes visible when the NodeXL > Graph > Refresh Graph menu option is selected. Clicking on one of the rows will highlight all of the nodes in that group. Labels can be entered in the label column (see Figure 22.5). Do not change the group names themselves (e.g., G1), since changing them will break the group functionality. Labels only show up if the network visualization is configured to show each group in a separate area on the graph pane, as described later in this chapter. NodeXL can automatically calculate network metrics for each group as described in the following section.
FIGURE 22.5 Groups worksheet showing eight different groups (G1, G2, … G8).
In the Facebook social network, the groups created and shown in Figure 22.5 correspond to different groups of the friends of one of the authors. For example, G1 consists of that author's family and friends, G2 includes friends from graduate housing, G3 includes work colleagues, and so forth. To be clear, the network clustering algorithm knows nothing about the individuals' attributes (e.g., where they went to graduate school). It creates groups based solely on which nodes are densely connected to one another in distinct clusters. Labels, such as “work colleagues,” can be applied only by someone who knows the network and can interpret it.
The list of which nodes are included in each group is stored in the NodeXL Group Vertices worksheet. For example, in the sample file, we see that Becket, Tucker, and many others are part of G1 since they are next to G1.
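The Clauset-Newman-Moore algorithm works by greedily merging groups so as to increase a score called modularity, which rewards groupings with many edges inside groups and few between them. The sketch below does not implement the algorithm itself; it only computes the modularity of a given assignment of nodes to groups, for an undirected, unweighted edge list. The six-node example graph is invented.

```python
def modularity(edges, group_of):
    """Modularity of a grouping: within-group edge share minus its expected share."""
    m = len(edges)
    internal = {}    # edges with both endpoints in the same group
    degree_sum = {}  # total degree of the nodes in each group
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        degree_sum[gu] = degree_sum.get(gu, 0) + 1
        degree_sum[gv] = degree_sum.get(gv, 0) + 1
        if gu == gv:
            internal[gu] = internal.get(gu, 0) + 1
    return sum(
        internal.get(g, 0) / m - (degree_sum[g] / (2 * m)) ** 2
        for g in degree_sum
    )

# Two triangles joined by a single bridge edge (c-d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
two_groups = {"a": "G1", "b": "G1", "c": "G1", "d": "G2", "e": "G2", "f": "G2"}
one_group = {node: "G1" for node in two_groups}

q_two = modularity(edges, two_groups)  # 5/14, about 0.357
q_one = modularity(edges, one_group)   # 0.0
```

Splitting the graph into its two triangles scores higher than lumping everything together, which is the kind of grouping a modularity-based clustering algorithm looks for.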
Calculating Network Metrics
Many commonly used network metrics can be calculated using NodeXL. Choose Graph Metrics from the Analysis portion of the NodeXL ribbon to open the dialog shown in Figure 22.6. This shows a list of all possible metrics that can be calculated including the centrality metrics described in the prior chapter (e.g., Degree, Closeness, Betweenness, and Eigenvector Centrality), Overall Metrics (described in the table at the bottom of the dialog), and Group Metrics. For moderately sized networks, you can Select All to calculate all of them and then decide which to use later.
FIGURE 22.6 Graph Metrics dialog after choosing Select All.
The various network metrics are displayed on the appropriate worksheets. For example, the centrality metrics, which are calculated for each individual node, are displayed on the Vertices worksheet in a set of columns labeled Graph Metrics. Likewise, the Group Metrics are displayed on the Groups worksheet. The Overall Metrics that describe the entire network graph are displayed on their own Overall Metrics worksheet. If Twitter search network top item metrics were calculated (on imported Twitter data), they will also show up on their own worksheet.
The meaning of network metrics differs depending on the specific dataset you are analyzing. For example, in the Sample Facebook Ego network dataset, the degree centrality metrics (found on the Vertices worksheet) can be interpreted as the number of shared friends that the person has with the author. Betweenness centrality helps identify bridge spanners, that is, individuals who connect otherwise disconnected groups. For example, Lena has a high betweenness centrality because she is the only friend connected to both work colleagues and graduate housing friends.
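To show why a bridge spanner scores highly, the following sketch computes degree and unnormalized betweenness centrality for a small undirected graph using Brandes' algorithm. The graph is a made-up miniature in which a node named lena is the only link between two work colleagues (w1, w2) and two graduate housing friends (h1, h2). The values NodeXL reports may be scaled differently.

```python
from collections import deque

def adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def betweenness(edges):
    """Unnormalized betweenness centrality (Brandes' algorithm), undirected graph."""
    adj = adjacency(edges)
    score = {n: 0.0 for n in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {n: [] for n in adj}
        sigma = {n: 0 for n in adj}
        dist = {n: -1 for n in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back toward s.
        delta = {n: 0.0 for n in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair of endpoints was counted from both ends.
    return {n: value / 2 for n, value in score.items()}

edges = [("w1", "w2"), ("w1", "lena"), ("w2", "lena"),
         ("lena", "h1"), ("lena", "h2"), ("h1", "h2")]
degree = {n: len(nbrs) for n, nbrs in adjacency(edges).items()}
between = betweenness(edges)
```

All four shortest paths between a colleague and a housing friend pass through lena, so her betweenness is 4 while everyone else's is 0.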
Sorting by Network Metrics
A useful strategy for identifying the most “important” individuals in the network is to use Excel's built-in sort feature. For example, navigate to the Betweenness Centrality column on the Vertices worksheet and choose “Sort Largest to Smallest” from the drop-down menu found within the title cell. This will sort the entire table so that the vertices with the highest betweenness centrality will show up at the top of the table and those with the lowest will be at the bottom as shown in Figure 22.7. Note that closeness centrality in NodeXL is calculated as 1/(average shortest distance) so that higher numbers indicate that the node is more central to the network (i.e., has a shorter average path distance to all other nodes).
FIGURE 22.7 Sorting on betweenness centrality from largest to smallest to identify bridge spanners.
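The 1/(average shortest distance) formula quoted above, followed by a largest-to-smallest sort, can be sketched as follows. The four-node path graph is invented, and the sketch only averages over nodes that can actually be reached; how NodeXL treats disconnected networks is not covered here.

```python
from collections import deque

def inverse_average_distance(edges):
    """Score each node as 1 / (mean shortest-path distance to reachable nodes)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    scores = {}
    for start in adj:
        # Breadth-first search gives shortest distances in an unweighted graph.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        others = [d for node, d in dist.items() if node != start]
        scores[start] = len(others) / sum(others) if others else 0.0
    return scores

# A simple chain: a - b - c - d.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
scores = inverse_average_distance(edges)

# Equivalent of "Sort Largest to Smallest" on the metric column.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The two middle nodes score 0.75 and the two end nodes score 0.5, so the middle nodes rise to the top of the sorted list.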
Networks are complex structures, and to fully understand them, it is often useful to visualize them. Network metrics give only a partial picture, just as summary statistics like the average and standard deviation only give a partial picture of a distribution. Sometimes, a picture really is worth a thousand words—or numbers. NodeXL provides a very sophisticated and highly customizable network visualization toolset that is briefly introduced in this section. Features associated with network visualization are found in the Graph and Visual Properties sections of the NodeXL menu ribbon and the top of the graph pane itself.
Displaying and Laying Out Networks
To visualize a network in NodeXL, simply click on the Show Graph/Refresh Graph button in the NodeXL graph pane or menu ribbon. This will place the vertices on the graph pane and show the edges between them. If it is a directed network, such as an email network, then it will state that the Type is Directed in the NodeXL ribbon and edges will include arrows. If it is an undirected network, such as a Facebook friend network, then no arrows will be shown. The network type can be changed at any time in the NodeXL ribbon menu.
Individual nodes can be moved by simply clicking and dragging them to a different location. Groups of nodes can be selected by drawing a rectangle around them or by right-clicking on a node and choosing Select Adjacent Vertices, which will select all nodes that are directly connected to the selected node. Groups of selected nodes can be moved all at once. When a node is selected in the graph pane, it is highlighted in red, and its corresponding data in the Vertices worksheet are also selected. It is often useful to fine-tune layouts in order to reduce unnecessary edge crossings or nodes that are obscured by other nodes or edges.
Choosing a Layout Algorithm
There are a number of different “layout algorithms” that determine the position of nodes on the graph pane. NodeXL allows you to choose which layout algorithm to use as shown in Figure 22.8. The first two options are typically best for large social media datasets (including the Sample Facebook Ego network), though the Circle layout can also be used effectively at times. The Fruchterman-Reingold layout pushes and pulls nodes as if they were connected via springs. It is meant to be run many times in a row, where each successive run moves the nodes to a better position. By default, when you Show, Refresh, or Lay Out Again the graph, it will run for 10 iterations. It is often helpful to click Lay Out Again several times to make sure the nodes have settled into a reasonable location. You can also change the default number of iterations and the “repulsive force” in the Layout Options dialog that's available at the bottom of the drop-down shown in Figure 22.8. The Harel-Koren Fast Multiscale algorithm is not iterative. Instead, each time it is run, it will create a unique layout, though the overall structure of the network is similar. Try these out on the same network and you'll get a feel for their effect. Once you have a layout that you're happy with, you can choose None from the layout algorithm drop-down menu. That way, if you refresh the graph later, it will not lay out all of the nodes again.
FIGURE 22.8 Graph pane with the layout algorithm drop-down expanded.
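To give a feel for what a force-directed layout does on each iteration, here is a simplified sketch in the spirit of Fruchterman-Reingold: every pair of nodes repels, every edge pulls its endpoints together, and the size of each move shrinks over time. The unit-square canvas, the starting temperature of 0.1, and the 0.9 cooling factor are assumptions chosen for illustration; they are not NodeXL's settings.

```python
import math
import random

def fruchterman_reingold(nodes, edges, iterations=10, positions=None, seed=0):
    """Simplified force-directed pass on the unit square (illustrative constants)."""
    rng = random.Random(seed)
    pos = dict(positions) if positions else {n: (rng.random(), rng.random()) for n in nodes}
    k = math.sqrt(1.0 / len(nodes))  # preferred distance between neighbors
    temperature = 0.1                # largest move allowed in one iteration
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion: every pair of nodes pushes apart.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                d = max(math.hypot(dx, dy), 1e-9)
                push = k * k / d
                disp[u][0] += dx / d * push
                disp[u][1] += dy / d * push
                disp[v][0] -= dx / d * push
                disp[v][1] -= dy / d * push
        # Attraction: each edge pulls its two endpoints together.
        for u, v in edges:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = max(math.hypot(dx, dy), 1e-9)
            pull = d * d / k
            disp[u][0] -= dx / d * pull
            disp[u][1] -= dy / d * pull
            disp[v][0] += dx / d * pull
            disp[v][1] += dy / d * pull
        # Move each node by at most `temperature`, staying on the canvas.
        for n in nodes:
            dx, dy = disp[n]
            d = max(math.hypot(dx, dy), 1e-9)
            step = min(d, temperature)
            pos[n] = (
                min(1.0, max(0.0, pos[n][0] + dx / d * step)),
                min(1.0, max(0.0, pos[n][1] + dy / d * step)),
            )
        temperature *= 0.9  # cool down so the layout settles
    return pos

nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
first = fruchterman_reingold(nodes, edges)                   # a first layout run
again = fruchterman_reingold(nodes, edges, positions=first)  # continue from there
```

Passing the previous positions back in mirrors the effect of clicking Lay Out Again: each run starts from where the last one stopped rather than from scratch.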
Using the Group in a Box Layout
Visualizing large or dense networks can be challenging since they often end up looking like a hairball. To help gain insights into complex networks, it is often useful to visualize each connected component (i.e., subnetwork that is not connected to other nodes in the network) or cluster of nodes in its own section of the graph pane. This is called the “Group in a Box” layout, since it assigns a different group into its own box as shown in Figure 22.9, which displays the Sample Facebook Ego network. To recreate this layout, choose Layout Options from the layout algorithm drop-down menu on the graph pane, which will open up the Layout Options dialog shown on the left-hand side of Figure 22.9. Choose “Lay out each of the graph's groups in its own box” and check the box that says “Use the Grid layout for groups that don't have many edges.” Click OK and refresh the graph to see the changes take effect. Labels that have been entered into the Groups worksheet, as shown in Figure 22.9, will now appear at the center of each group box. Play around with some of the other options, such as the box layout algorithm and intergroup edges in the Layout Options menu to see some alternatives.
FIGURE 22.9 Group in a Box layout with group labels and Layout Options dialog shown.
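Finding the connected components that a Group in a Box layout could place in separate boxes is a standard graph traversal. A minimal sketch, using an invented six-node network with one isolated node:

```python
def connected_components(nodes, edges):
    """Return each connected component as a sorted list of node names."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        stack = [start]
        seen.add(start)
        members = []
        while stack:
            v = stack.pop()
            members.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(members))
    return components

nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
components = connected_components(nodes, edges)
```

The isolated node f forms a component of its own, which is the kind of node the earlier clustering option gathers into a single group to reduce clutter.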
Each node and edge has a number of visual properties such as color that can be used to create more meaningful and readable network graph visualizations. The available visual properties for each edge are found on the Edges worksheet and include Color, Width, Style (e.g., solid, dashed, and dotted), Opacity, and Visibility (e.g., if the edge is shown at all). Nodes can be styled on the Vertices worksheet by changing their Color, Shape (including geometric shapes and images and labels, which pull image or label data from other columns), Size, Opacity, and Visibility. The Groups worksheet also has columns for Vertex Color, Vertex Shape, Visibility, and Collapse (which will combine all nodes in a group into a single large node).
Data can be entered into the visual properties cells in several ways. Some columns such as Shape have a list of preset options that become available in a drop-down menu once you click inside a specific cell. To choose a color, click in a cell and then click on the color picker icon in the Visual Properties section of the NodeXL ribbon. After choosing a color, the red, green, and blue values or the official name of the color will populate the cell. Notice that tooltips appear when the mouse hovers over these cells to explain what values are expected in these columns. In addition to manually selecting values, Excel formulas can also be used. For example, if Facebook data is provided on gender in a separate column, a simple IF formula could be written to color the nodes differently for males and females. However, the easiest way to change visual properties based on other data is to use the NodeXL Autofill Columns feature as described below.
Using Autofill Columns
The Autofill Columns feature in NodeXL automatically fills the Visual Properties data (or other columns such as Label or Tooltip) based on data stored in some other column on the same spreadsheet. To see how this works, click on the Autofill Columns button in the NodeXL menu ribbon to open the dialog (Figure 22.10). Next, click on the Vertices tab, choose Betweenness Centrality from the drop-down menu next to the Vertex Size: row, and click the Autofill button at the bottom. This will make nodes with a higher betweenness centrality larger, helping to draw visual attention to them.
FIGURE 22.10 Autofill Columns dialog showing the Vertices tab with Vertex Size based on Betweenness Centrality.
Notice that the Size column on the Vertices worksheet now has been populated with values ranging from 10 (for Becket) to 1.5 (for most of the nodes with a relatively small betweenness centrality). By default, a linear mapping is used. In other words, the largest betweenness centrality value (43,269) becomes 10, the smallest value (0) becomes 1.5, and something right in the middle would be mapped to a size right in the middle. As a result, since Becket has such a high betweenness centrality compared with the others, his circle becomes large (size 10) and all the others look incomparably small.
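The linear mapping described above can be written out as a short formula. The sketch below uses the sizes and betweenness values quoted in the text; the function name and signature are our own, and it is meant as an illustration of the arithmetic rather than NodeXL's implementation.

```python
def linear_size(value, value_min, value_max, size_min=1.5, size_max=10.0):
    """Map a metric value onto a vertex size by straight-line interpolation."""
    if value_max == value_min:
        return size_min
    fraction = (value - value_min) / (value_max - value_min)
    return size_min + fraction * (size_max - size_min)

largest = linear_size(43269, 0, 43269)   # Becket's betweenness -> 10.0
smallest = linear_size(0, 0, 43269)      # lowest betweenness -> 1.5
lena = linear_size(12522, 0, 43269)      # Lena's betweenness -> about 3.96
```

Even Lena, the second most central person, ends up with a size of only about 3.96 out of 10, which is why the other nodes look so small next to Becket.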
To create a more meaningful visualization, click on the arrow in the Options section (see right-hand side of Figure 22.10) next to the Vertex Size row and choose “Vertex Size Options….” This will open up the dialog shown in Figure 22.11, which allows you to change how size is impacted by betweenness centrality. Increase the maximum vertex size (To this vertex size) to 15 instead of 10. Next, check the “Ignore outliers” box. This will make it so that Becket's extremely high betweenness centrality (considered an outlier) will not be used to determine the size of all other nodes. Becket will still be given the maximum size (of 15), but the next nonoutlier score (Lena's betweenness centrality of 12,522) will become the new maximum when determining the size of all other nodes. Click OK and the Autofill button and look at the Size column on the Vertices worksheet. The resulting graph should look like the one in Figure 22.12.
FIGURE 22.11 Autofill Column Vertex Size Options dialog.
FIGURE 22.12 Network visualization with size based on betweenness centrality and the top six individuals with labels.
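The effect of ignoring outliers can be sketched as a linear mapping with a cap: values above the largest nonoutlier score are treated as if they equaled it. How NodeXL decides which values count as outliers is not described here, so the cap is passed in by hand; the function name is our own.

```python
def capped_linear_size(value, value_min, cap, size_min=1.5, size_max=15.0):
    """Linear size mapping in which values above `cap` all get the maximum size."""
    if cap == value_min:
        return size_min
    clipped = min(max(value, value_min), cap)
    fraction = (clipped - value_min) / (cap - value_min)
    return size_min + fraction * (size_max - size_min)

becket = capped_linear_size(43269, 0, 12522)  # outlier, clipped -> 15.0
lena = capped_linear_size(12522, 0, 12522)    # new maximum -> 15.0
halfway = capped_linear_size(6261, 0, 12522)  # half of Lena's score -> 8.25
```

With the cap in place, a node with half of Lena's betweenness is drawn at size 8.25 instead of being squeezed near the minimum.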
Choosing Between Group and Vertex Level Data Sources for Determining Vertex Color and Shape
A final warning is in order. As you may remember, both the Vertices worksheet and the Groups worksheet have columns that define the Color and Shape of vertices. For example, the color on the Vertices worksheet may be based on the gender of the person, while the color on the Groups worksheet indicates which group the person is a part of. So, which worksheet takes precedence if they indicate different colors? The answer is that either one can take precedence, though you'll need to specify which one you want using the Group Options, which are available under the Groups drop-down menu on the NodeXL ribbon (see Figure 22.13). Simply choose which worksheet you want to govern the color and shape of the nodes in the graph pane. In Figure 22.13, the default has been changed so that they are governed by the Vertices worksheet instead of the Groups worksheet.
FIGURE 22.13 Group Options window configured so that the colors and shapes on the graph will be from the Vertices worksheet.
Filtering Nodes Using the Visibility Column
To create more readable and insightful visualizations, it is important to determine which elements to display in a graph. One way to do this in NodeXL is to use the Visibility column, which is available on the Edges, Vertices, and Groups worksheets.
The Visibility options on the Vertices worksheet are shown in Figure 22.14. To reduce clutter, only nodes with at least one edge connected to them are shown by default (Show if in an Edge). This excludes any isolated nodes, such as Facebook Friends who do not have any connections to any other Friends in the sample ego network dataset. These nodes can be shown by typing or selecting “Show” into the Visibility column for each of the nodes that have no other connections (see Figure 22.14). Sometimes, there are too many nodes in the graph. To remove nodes from the graph pane, there are two options. The “Skip” option is typically what is desired. This will act as if the node did not exist in the network at all. In fact, if groups or metrics are calculated after Skip is selected for certain nodes, those nodes will not be included in the groups or metrics calculations. It is as if they were not in the spreadsheet at all. They also will not be shown on the graph. Alternatively, the “Hide” option simply makes the node (and its associated edges) 100% transparent in the graph pane. The layout algorithm, metrics, and groups all act as if it is still there, but it is simply not visible.
FIGURE 22.14 Visibility options on the Vertices worksheet.
Though beyond the scope of this brief introduction, readers are encouraged to also explore the Dynamic Filters feature, which provides interactive sliders to hide certain nodes that do not meet selected criteria. Additionally, note that the Autofill Columns feature can be used to automatically set the Visibility column data based on some other data (e.g., if degree is below 2, Skip; otherwise, Show).
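The difference between Skip and Hide, and the degree-based rule mentioned above, can be illustrated with a small sketch. The network is invented, the three functions are our own, and only the three visibility values discussed in the text are modeled.

```python
def autofill_visibility(degree_by_vertex, threshold=2):
    """Rule from the text: Skip vertices whose degree is below the threshold."""
    return {v: ("Skip" if d < threshold else "Show")
            for v, d in degree_by_vertex.items()}

def degrees(edges, visibility):
    """Degree counts that ignore skipped vertices but still count hidden ones."""
    skipped = {v for v, state in visibility.items() if state == "Skip"}
    counts = {}
    for u, v in edges:
        if u in skipped or v in skipped:
            continue  # a skipped vertex is treated as if it were not in the data
        counts[u] = counts.get(u, 0) + 1
        counts[v] = counts.get(v, 0) + 1
    return counts

def drawn_vertices(visibility):
    """Vertices that would appear on the graph pane."""
    return sorted(v for v, state in visibility.items()
                  if state not in ("Skip", "Hide"))

edges = [("ann", "bob"), ("ann", "cy"), ("bob", "cy"), ("cy", "dee")]
everything_shown = {name: "Show" for name in ("ann", "bob", "cy", "dee")}

original = degrees(edges, everything_shown)   # dee has degree 1
skip_rule = autofill_visibility(original)     # dee -> "Skip"
after_skip = degrees(edges, skip_rule)        # cy drops from 3 to 2
hide_rule = dict(everything_shown, dee="Hide")
after_hide = degrees(edges, hide_rule)        # unchanged; dee is just invisible
```

Skipping dee changes cy's degree because the edge to dee is no longer counted, while hiding dee leaves every metric as it was and only removes her from the picture.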
Exporting Network Visualizations
To save your network visualization as an image file, simply right-click on the image background and select Save Image to File > Save Image. There are Image Options available that can determine the size of the image and text for a header and footer. Images are best saved in a lossless format such as .png or .gif. The .xps vector format is also available, which will scale as large as you need it to. This is recommended for high-resolution printing (e.g., posters).
Simplifying NodeXL with Automate and “Recipes”
There are many settings, options, configurations, and adjustments possible in NodeXL. Expert users can make use of the many features in the application to achieve sophisticated results. But new users often struggle to get similar results. To save time for new users, experts can share their settings configurations with other users. All configuration changes in NodeXL can be saved and shared.
1. Start a new NodeXL session. For example, in Windows 7, click the Start button and then All Programs and then scroll down the list from the top until you see the NodeXL Excel Template item. Click it and wait until NodeXL is loaded; that is, you see the Document Action pane on the right. In Windows 8, click Start and type “NodeXL” to search for it.
2. Download some data. Select the NodeXL ribbon on the Excel menu's top line; then, in the Data section (the first one from the left), click the Import drop-down, and click on the From Twitter Search Network… item in the menu. In the box under “Search for tweets that match this query,” type (say) “your query” (without quote marks). If you have a Twitter account but haven't used it in NodeXL before, you'll have to select the first radio button under “Your Twitter Account” and follow the steps to authorize NodeXL to use your account to import Twitter networks. To keep it quick, just leave the other defaults the way they are; you can play with them later. Click OK.
3. Get a recipe from the NodeXL Graph Gallery. Once you have downloaded network data, you have to change various NodeXL settings to fine-tune the network graph to get the results you want. An easy alternative way to do this is to copy a recipe file from a network graph saved to the NodeXL Graph Gallery. Many of these networks have a link to the recipe used to create them (technically referred to as the “NodeXL Options file”). Scroll to the bottom of many NodeXL Graph Gallery pages, and click the link “Download the NodeXL Options Used to Create the Graph.” Save this to a convenient folder, for example, the place where you keep your NodeXL data; you'll now have a file there called WorkbookOptions-#####.NodeXLOptions.
4. Apply the NodeXL data recipe to your network data. Returning to the NodeXL ribbon in Excel, in the Options section, click on Import. This opens a file open dialog; navigate to and open the file WorkbookOptions-#####.NodeXLOptions that you just downloaded.
5. Automatically compute metrics and visualize the network graph. Click the Automate button in the Graph section on the NodeXL ribbon. (You can deselect tasks if you wish by unchecking boxes, and you can change Options for those tasks by selecting a task and clicking the Options… button.) Click the Run button. (You may get an error message that reads: “This workbook can't be saved”; you can avoid it by following the instructions in the dialog box. Just click OK for now.)
6. Explore the results. The network graph will now be in the Document Actions pane. If you don't see the pane, you can bring it back by selecting the View ribbon in Excel and clicking the Document Actions icon in the Show section. Click on the various worksheets to dig into the data that NodeXL has created about each edge, vertex, group, and the network as a whole.
7. Export to NodeXL Graph Gallery. NodeXL will summarize information about the network for you and can export the visualization and report to the NodeXL Graph Gallery. Click the Export drop-down in the Data section of the NodeXL ribbon and then click on the “To NodeXL Graph Gallery…” entry. Select the “I don't have an account” radio button, and provide a guest name; for now, just leave the defaults the way they are. (You can create an account on http://nodexlgraphgallery.org by clicking on the Create Account link in the top left-hand corner.) When Excel's finished churning (uploading may take a while; wait for the dialog box to disappear), go to the NodeXL Graph Gallery home page http://nodexlgraphgallery.org/Pages/Default.aspx; you should see your graph near the top of the page. Click on the thumbnail to go to the page.