Chapter 16. Mining the Gold in Data and Text

Part of the Big Data revolution is the rapid growth of text and unstructured data. Instead of retrieving information as we would from a structured dataset or a known set of search terms, text and data mining are concerned with extracting information, which is fundamentally different. Each of us is bombarded with information every day, much of it in numeric form, but the majority in text form. Mining the “gold” from data and text has become a very important task and a very big business. We can see an evolution from data to information to knowledge to intelligence. Organizations need ways to find and leverage the valuable intelligence in vast and unorganized collections of data and text, and that is what we will talk about in this chapter.

To solve this complex and challenging problem, we need a capable and scalable solution. R offers that and more. Before we get into specifics, let’s examine exactly what data mining is (and is not). We might define data mining as the extraction of predictive (read valuable) information from (relatively) large datasets. To the extent possible, the extraction should be automatic, and the information should be predictive of something valuable to us. Implicit in this definition is that there is some statistical methodology that allows us to extract the information. There are many different data mining algorithms and tools, and it is somewhat of an idiosyncrasy to become enamored of a specific algorithm at the expense of understanding the results and their implications. Data mining tools typically include decision trees, various approaches to classification, the use of neural networks and machine learning, inducing rules or associations, reducing the dimensionality of data, and various approaches to clustering. Although there may be some overlap in purpose, data mining is not typically associated with data visualization, various queries of structured data, or data warehousing.

If you use statistics on a regular basis, you are probably already (at least) a fledgling data miner, perhaps without realizing it. For example, my doctoral dissertation involved the creation of subgroups from a sample of several thousand insurance agents on the basis of their responses to an autobiographical questionnaire with several hundred questions. I used principal components analysis to reduce the dimensionality of the biodata questionnaire, and then used hierarchical clustering of individual profiles to identify subgroups of agents who had various prior experiences in common. In support of the psychological adage that the best predictor of future behavior is past behavior, I found indeed that certain groups of agents were more likely than others to have both interest in and potential for becoming agency managers. I also found that other subgroups of agents might be very successful as agents, but were not interested in, nor likely to be good at, a management position with the company. At the time (back in the late 1970s), I had never heard of data mining, but the same techniques I used then are used in data mining today.

I did the data analysis for my dissertation using SAS running in batch mode on a mainframe IBM computer, performing a principal components analysis with varimax rotation that took from midnight to 6 a.m. As in many other areas, such as modern robust statistics, the techniques for data mining have often been conceptualized and developed long before the computing power and speed were available to make them feasible for use with today’s incredibly large and loosely structured datasets.

Because data mining is a diverse field, there is not a single R package capable of all the data mining techniques that I mentioned. We will therefore have to pick and choose some illustrative applications. Progress, however, is being made toward the goal of a general-purpose R-based data mining package, and currently, the rattle package written by Graham Williams is the clear front-runner. The rattle package provides a graphical user interface for R and gives the user a point-and-click approach to data mining that works similarly to the way John Fox’s R Commander package is used for standard statistical analyses. The rattle package must be installed as any other R package is, but when launched, provides its own Gnome-based interface (see Figure 16-1).

Figure 16-1. The Rattle package running under R 3.1.2

Beyond requiring the user to know how to install and launch an R package, Rattle does not require the user to be expert in the use of R, per se, but does assume familiarity with various approaches to data mining. The interface for rattle is laid out in such a way that it shows the progression of a typical data mining project, from loading data to various explorations, tests, transformations, and other manipulations of data to examining clusters or associations, choosing models, and then evaluating the models. We will return to rattle later in this chapter, comparing some of its output to that of R functions for the same analyses.

For now, we will use some of the built-in R functions for various data mining applications. Again, we will not delve into the depths of data mining or cover the entire set of procedures, but you will get a brief exposure to some of the most commonly used tools.

Recipe 16-1. Reducing the Dimensionality of Data

Problem

Data mining, as the name implies, seeks to discover information from unstructured data, when such information is not obvious. If the information were obvious, it would exist in a structured format. Though there are still challenges in retrieving and analyzing data when the volume of structured data becomes very large, the challenge of extracting information from unstructured data is much greater. With large datasets consisting of multiple variables, it is often difficult to detect connections and patterns. It is also difficult to determine whether the interrelationships among the variables are such that we can reduce the dimensionality of the data to something more manageable and reasonable. In Recipe 16-1, you will learn how to use R for a principal components analysis (PCA). PCA and the closely related method of factor analysis have been used for many years and have also proved useful in data mining applications.

Solution

For those interested in the history of statistics, factor analysis was developed in the early 1900s by British psychologist Charles Spearman, though most people are more likely to associate Spearman with the rank correlation coefficient. Spearman used factor analysis to support the theory that there is a general intelligence factor, g, which explains the relationships among different cognitive tests.

PCA is related to factor analysis, but PCA is more useful as a data reduction and descriptive statistical technique, whereas factor analysis is used to identify (exploratory) or verify (confirmatory) the presence of underlying unobserved (latent) variables called factors (in a completely different sense from a factor in an R data frame or table). In PCA, the primary goal is to find a relatively small number of “components” that can be used to summarize a set of variables without the loss of too much information. In PCA, the correlation matrix used to determine the number of components has 1’s on the diagonal. This means that in PCA, we are seeking to explain (or account for) 100% of the variance (including the variance unique to each variable, the variance that is common among variables, and the error variance). In factor analysis, on the other hand, rather than unities (1’s), the diagonal of the correlation matrix has “communalities” (that is, the variances shared in common with other variables), excluding the variance unique to each variable and error. For our current purposes, PCA will work fine.
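
To make the distinction concrete, here is a minimal sketch contrasting the two analyses in the psych package. It assumes a correlation matrix of item responses, such as the corMat we compute later in this recipe, and is meant as an illustration rather than something we run at this point.

library(psych)
# PCA: analyzes the full correlation matrix (1's on the diagonal),
# so all of the variance is accounted for
pca <- principal(corMat, nfactors = 3, rotate = "varimax")
# Common factor analysis: models only the variance shared among the items
# (the communalities), setting aside unique and error variance
efa <- fa(corMat, nfactors = 3, rotate = "varimax", fm = "minres")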

Here are the 11 statements on a survey concerning the use of laptop computers in class (see Table 16-1). Statements marked with a superscripted a are negatively worded and reverse scored. My student Neal Herring and I developed the survey as part of a class project. The university where the survey was developed and administered requires that students have a laptop computer available for every class (a ubiquitous computing requirement). We were interested in knowing whether students were in favor of this or not, and used Likert scaling to develop the survey from a much larger list of candidate items. The survey was administered to 65 students, with 61 complete cases (no missing data). Respondents rated their agreement or disagreement with each statement on a five-point response scale with 1 = Strongly disagree, 2 = Disagree, 3 = Undecided, 4 = Agree, and 5 = Strongly agree.

Table 16-1. 11-Item Scale of Students’ Attitudes Toward Laptop Computers in Class

Item  Statement
1     I pay better attention in class when I use my laptop for note-taking.
2a    Using a laptop in class does not increase my learning.
3a    It is a hassle to bring my laptop to class.
4     Using a laptop in the classroom makes me more efficient by reducing my study time outside of class.
5a    The weight of my laptop makes it hard to carry to class.
6     Using a laptop in class will help prepare me for a job or for advanced studies in my field.
7a    I find it distracting when others use their laptops in class.
8     Using a laptop in the classroom will make my grades higher.
9     I am more likely to attend classes where laptops are required or permitted.
10    Using a laptop in class makes class more interesting.
11    I think students should be permitted to use their laptops in all their classes.

Here are the first few rows of the dataset. Note that in addition to the item responses, we collected data on the respondents’ sex, GPA, and class standing (1 = freshman to 4 = senior).

> head(laptops,3)
id Item1 Item2 Item3 Item4 Item5 Item6 Item7 Item8 Item9 Item10 Item11 Sex
1 1 4 3 3 4 2 2 5 2 4 4 4 1
2 2 2 5 2 4 1 5 5 5 3 5 5 1
3 3 3 2 1 3 2 2 2 2 3 4 3 0
Status GPA
1 4 2.60
2 4 3.70
3 3 3.64

We will use the R package psych to find the principal components from the correlation matrix, as discussed earlier. We will also rotate the solution using varimax (an orthogonal rotation). The princomp() function in base R produces an unrotated solution, while the psych package allows rotation to several different solutions. As mentioned, the correlation matrix with 1’s on the diagonal is used in PCA. We find that the 11 items are represented well by three principal components. The number of factors to extract is a somewhat controversial issue. A good rule of thumb is that any factor (or component) with an eigenvalue greater than 1 is a candidate for retention, but ultimately, the choice of the number of factors extracted often involves some subjective judgment based on the researcher’s interpretation of the factor loadings. PCA uses an eigendecomposition of the square correlation or covariance matrix to produce uncorrelated (orthogonal) components, represented by eigenvectors. Each eigenvalue represents the amount of the total variance explained by the corresponding component, and because there are 1’s on the diagonal of the correlation matrix, the eigenvalues must sum to the number of original items. Let us convert our laptop survey item responses to a matrix and use the eigen() function to study the eigenvalues and eigenvectors. We simply subset the laptop data by selecting the 11 columns of item responses, and then use the cor function to produce our correlation matrix:

> Items <- as.matrix(laptops[2:12])

> corMat <- cor(Items)
> corMat
Item1 Item2 Item3 Item4 Item5 Item6 Item7
Item1 1.0000000 0.4231445 0.0511384 0.4632248 0.16643381 0.2183705 0.2547533
Item2 0.4231445 1.0000000 0.3432559 0.5255746 0.29104875 0.5608472 0.3984790
Item3 0.0511384 0.3432559 1.0000000 0.3186916 0.73461102 0.3065492 0.4264412
Item4 0.4632248 0.5255746 0.3186916 1.0000000 0.27198596 0.4911948 0.2528662
Item5 0.1664338 0.2910488 0.7346110 0.2719860 1.00000000 0.2568671 0.3633775
Item6 0.2183705 0.5608472 0.3065492 0.4911948 0.25686714 1.0000000 0.3198655
Item7 0.2547533 0.3984790 0.4264412 0.2528662 0.36337750 0.3198655 1.0000000
Item8 0.5146774 0.6398297 0.3175904 0.5751981 0.28702317 0.6802546 0.4252158
Item9 0.2245341 0.3846825 0.2548962 0.4542136 0.31866815 0.3373920 0.2738553
Item10 0.1668403 0.3887571 0.1902351 0.3770664 0.11772733 0.4267274 0.2666067
Item11 0.3247979 0.4227840 0.1262275 0.3682178 0.04660024 0.3849067 0.2605502
Item8 Item9 Item10 Item11
Item1 0.5146774 0.2245341 0.1668403 0.32479793
Item2 0.6398297 0.3846825 0.3887571 0.42278399
Item3 0.3175904 0.2548962 0.1902351 0.12622747
Item4 0.5751981 0.4542136 0.3770664 0.36821778
Item5 0.2870232 0.3186681 0.1177273 0.04660024
Item6 0.6802546 0.3373920 0.4267274 0.38490670
Item7 0.4252158 0.2738553 0.2666067 0.26055018
Item8 1.0000000 0.3550858 0.4997211 0.45178437
Item9 0.3550858 1.0000000 0.5822105 0.30319201
Item10 0.4997211 0.5822105 1.0000000 0.44632809
Item11 0.4517844 0.3031920 0.4463281 1.00000000

Here are the eigenvalues and eigenvectors. We see that three components have eigenvalues greater than 1, as discussed earlier:

> eigen(corMat)
$values
[1] 4.6681074 1.5408803 1.0270909 0.7942875 0.7335682 0.5896213 0.4394996
[8] 0.4240592 0.3663119 0.2118143 0.2047596

$vectors
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.2462244 0.22107473 0.58114752 0.447515243 -0.25136101 -0.02506555
[2,] -0.3595942 0.06915554 0.16927109 -0.127637015 0.11454822 -0.07828474
[3,] -0.2506909 -0.58829325 -0.04217347 -0.058873014 0.06611702 0.28137063
[4,] -0.3390437 0.10370137 0.12366386 0.346201739 0.31534535 0.06486155
[5,] -0.2334164 -0.60038263 0.03159179 0.203807163 0.04681617 0.19067543
[6,] -0.3365345 0.08027461 -0.03125315 -0.419702484 0.47974475 -0.15637109
[7,] -0.2681116 -0.23657580 0.10141271 -0.331553795 -0.65831805 -0.41855783
[8,] -0.3854329 0.13565320 0.18437972 -0.175408233 0.14521421 -0.16420035
[9,] -0.2898889 0.02672144 -0.50324733 0.497063117 -0.11046030 -0.19075495
[10,] -0.2947311 0.21566774 -0.55816283 0.008082903 -0.11482431 -0.09832977
[11,] -0.2705290 0.31780613 -0.07405850 -0.233659252 -0.32814554 0.77514458
[,7] [,8] [,9] [,10] [,11]
[1,] -0.348837008 -0.006217720 -0.06016296 -0.32343802 -0.242340463
[2,] 0.288701437 -0.788357274 0.28749655 -0.06086925 0.102473468
[3,] 0.005269449 0.076511614 0.30203619 -0.02857311 -0.635889572
[4,] 0.552818181 0.484383586 0.21064937 -0.03194788 0.223311571
[5,] -0.294511458 -0.091790248 -0.21930085 -0.03416408 0.599625476
[6,] -0.087058677 0.109537971 -0.49618779 -0.40890910 -0.124819321
[7,] 0.281105085 0.218739682 -0.06533240 -0.05651648 0.095040577
[8,] -0.362175484 0.133992368 0.08834002 0.75050977 -0.005736071
[9,] 0.159102079 -0.200051674 -0.42913546 0.22679185 -0.255745814
[10,] -0.389241570 0.108029551 0.49528135 -0.31485066 0.156660772
[11,] 0.089435341 0.007522468 -0.20566674 0.06668854 0.049009955
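
Because the correlation matrix has 1’s on the diagonal, the eigenvalues must sum to the number of items, 11. A scree plot is a quick way to visualize the “eigenvalue greater than 1” rule of thumb discussed above. Here is a brief sketch (the plot itself is not shown):

ev <- eigen(corMat)$values
sum(ev)                        # equals 11, the number of items
plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue")
abline(h = 1, lty = 2)         # reference line for the eigenvalue > 1 rule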

Now, let us use the principal function in the psych package, extract three components, and rotate them to a varimax solution. The varimax solution rotates the axes of the factors while retaining their orthogonality. Rotations usually assist in the interpretation of factor or component loadings.

> pcaSolution <- principal(corMat, nfactors = 3, rotate = "varimax")
> pcaSolution$loadings

Loadings:
RC1 RC2 RC3
Item1 0.835
Item2 0.675 0.284 0.323
Item3 0.897 0.133
Item4 0.631 0.225 0.347
Item5 0.894
Item6 0.519 0.246 0.458
Item7 0.356 0.529 0.161
Item8 0.753 0.237 0.366
Item9 0.126 0.252 0.758
Item10 0.185 0.872
Item11 0.502 0.495

RC1 RC2 RC3
SS loadings 2.830 2.205 2.201
Proportion Var 0.257 0.200 0.200
Cumulative Var 0.257 0.458 0.658
> pcaSolution$values
[1] 4.6681074 1.5408803 1.0270909 0.7942875 0.7335682 0.5896213 0.4394996
[8] 0.4240592 0.3663119 0.2118143 0.2047596

Although some authorities warn that we should not attempt to interpret the factor loadings or the underlying variables in principal components analysis, many researchers, myself included, routinely do so. The factor loadings are the correlations of the individual items with the rotated components. Examination of the highest loadings shows that the first component, RC1, is related to learning and performance; the second, RC2, to convenience and concentration; and the third, RC3, to motivation and interest. Figure 16-2 is a structural diagram produced by the fa.diagram function in the psych package; it shows only the highest-loading items for each component, giving a picture of the latent structure and the loadings of the items on the components.

> fa.diagram(pcaSolution)

Figure 16-2. Structural diagram of the rotated three-component solution

Recipe 16-2. Finding Clusters of Individuals or Objects

Problem

We often have data that lend themselves to dimension-reduction techniques, as you saw in Recipe 16-1. The question then becomes, “So what? Now that I have a three-component solution explaining about 2/3 of the variance in students’ attitudes toward the use of laptop computers in class, how can that help me?” Often the answer is that the components can be used to create groupings of the individuals or objects we are measuring. This is just as true of people as it is of animals or manufactured goods. These groupings may be useful for both empirical and theoretical purposes.

Solution

There are many ways to cluster objects or individuals. We will continue with our example and develop subgroups of the individuals who completed the survey described in Recipe 16-1. As with factor analysis, deciding on the appropriate number of subgroups or clusters is part art and part science. To use clustering, we need some metric. One of the most commonly used is the distance (difference) between objects or individuals. For my dissertation, I clustered insurance agents on the basis of the distance or similarity in their profiles based on factor scores. The distance matrix was used to develop a hierarchical clustering solution. Such solutions are commonly shown in a dendrogram (tree diagram). The algorithm starts with each object or individual as a separate entity, and then combines the objects or individuals into clusters by combining the two most similar clusters with each successive iteration until there is one cluster that contains all the objects or individuals. At some point, the researcher decides that there is a balance between the number of clusters and the specificity with which the differences between the clusters are interpretable. This is a classic problem in statistics, that of classifying objects or individuals.
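
To make the distance idea concrete, here is a minimal sketch using three hypothetical profiles of component scores. By default, dist() computes the Euclidean distance between every pair of rows:

profiles <- rbind(person1 = c( 1.2, -0.5,  0.3),
                  person2 = c( 1.0, -0.4,  0.2),
                  person3 = c(-1.1,  0.8, -0.9))
dist(profiles)   # persons 1 and 2 are close together; person 3 is distant from both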

We will continue with our laptop survey data, and extract the scores for each of the 61 students for whom we have complete data. Each person will have a score for each component. The principal function uses the traditional regression method by default to produce factor scores, and these are saved as an object in the PCA. We will then compute a 61 × 61 distance matrix, which will give us the distances between each pair of individuals. Next, we will use hierarchical clustering to group the individuals and plot the dendrogram to see what a good solution might be regarding the number of clusters to retain. Finally, on the basis of some demographic data and GPA, which were collected at the time the survey was administered, we will determine if the clusters of individuals are different from each other in any significant and meaningful ways. We will have to use the raw data rather than a correlation matrix for the purpose of producing the component scores, so we must run the PCA again using raw scores:

> Items <- as.data.frame(Items)
> head(Items)
Item1 Item2 Item3 Item4 Item5 Item6 Item7 Item8 Item9 Item10 Item11
1 4 3 3 4 2 2 5 2 4 4 4
2 2 5 2 4 1 5 5 5 3 5 5
3 3 2 1 3 2 2 2 2 3 4 3
4 5 5 4 4 4 5 3 5 5 5 5
6 2 2 2 2 3 3 2 2 2 3 2
7 2 2 4 2 4 2 4 2 3 4 3
> pcaSolution <- principal(Items, nfactors = 3, rotate = "varimax")
> pcaScores <- pcaSolution$scores
> head(pcaScores)
RC1 RC2 RC3
1 0.6070331 -0.1135057 0.2163606
2 1.5849366 -0.8321524 1.4991802
3 -0.2064288 -1.1233926 0.1235260
4 2.1816945 0.3399202 1.2432899
6 -0.6549734 -0.1209022 -0.6975968
7 -1.1428872 1.0286981 0.1273299
> distMat <- dist(as.matrix(pcaScores))
> hc <- hclust(distMat)
> plot(hc)

The dendrogram is shown in Figure 16-3. The numbers shown are the id numbers of the subjects, so that we can identify the cluster membership.

Figure 16-3. Hierarchical clustering dendrogram for the laptop data

The examination of the clustering reveals that three clusters may be a good starting point. To add rectangles surrounding the proposed clusters, use the rect.hclust function and specify the model name and the number of clusters as follows. After playing around with various solutions, I stuck with three clusters. The updated dendrogram appears in Figure 16-4.

> rect.hclust(hc, 3)

Figure 16-4. Updated dendrogram with clusters identified

We can then save the cluster memberships using the cutree function, and append the cluster numbers to the laptop dataset as follows:

> cluster <- cutree(hc, 3)
> laptops <- cbind(laptops, cluster)
> head(laptops)
id Item1 Item2 Item3 Item4 Item5 Item6 Item7 Item8 Item9 Item10 Item11 Sex
1 1 4 3 3 4 2 2 5 2 4 4 4 1
2 2 2 5 2 4 1 5 5 5 3 5 5 1
3 3 3 2 1 3 2 2 2 2 3 4 3 0
4 4 5 5 4 4 4 5 3 5 5 5 5 1
6 6 2 2 2 2 3 3 2 2 2 3 2 1
7 7 2 2 4 2 4 2 4 2 3 4 3 1
Status GPA cluster
1 4 2.60 1
2 4 3.70 1
3 3 3.64 1
4 3 3.00 1
6 3 2.80 2
7 3 3.60 3

Let’s see if the clusters are useful for determining differences in GPA, sex, or status. We will use chi-square analyses for sex and status, and an analysis of variance for GPA. To make our analyses easier, we can attach the data frame so that we can call the variables directly. We determine from our chi-square tests that there are no significant associations between cluster membership and sex or status.

> attach(laptops)
> chisq.test(cluster, status)

Pearson's Chi-squared test

data: cluster and status
X-squared = 8.9984, df = 6, p-value = 0.1737

Warning message:
In chisq.test(cluster, status) : Chi-squared approximation may be incorrect
> chisq.test(cluster, sex)

Pearson's Chi-squared test

data: cluster and sex
X-squared = 3.0361, df = 2, p-value = 0.2191

Warning message:
In chisq.test(cluster, sex) : Chi-squared approximation may be incorrect

> oneway.test(GPA ~ cluster)

However, the three clusters of students have significantly different GPAs, and this would be worth examining to determine what other characteristics besides their attitudes toward laptops might differ across these groups of students. For those interested in knowing, my dissertation was based on a large-scale project for which I was a graduate research assistant at the University of Georgia. The principal investigator, the late Dr. William A. Owens, who was also my dissertation chair, was fond of saying, “You never really understand factor analysis until you do one by hand.” I am sure that is true, but by the time I got around to learning it, there were computer programs available, so I never personally did a factor analysis by hand, even though I am quite positive Doc Owens was sure that it would have benefitted me greatly. Here are the results of the one-way test of cluster differences in GPA:

One-way analysis of means (not assuming equal variances)

data: GPA and cluster
F = 7.2605, num df = 2.000, denom df = 34.025, p-value = 0.002366

> aggregate(GPA ~ cluster, data = laptops, mean)
cluster GPA
1 1 2.992593
2 2 3.466111
3 3 3.193125
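
To see which of the three clusters differ from one another on GPA, we could follow up with pairwise comparisons. The following is a sketch using the pairwise.t.test function in base R; because we did not assume equal variances in the one-way test, we do not pool the standard deviations (output omitted):

pairwise.t.test(laptops$GPA, laptops$cluster, pool.sd = FALSE,
                p.adjust.method = "holm")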

Recipe 16-3. Looking for Associations

Problem

Often, we use data mining to find previously undiscovered associations. In one common but most likely anecdotal example, a large retailer found that beer and diaper sales both increased on Friday. The story is told about other stores and other days, but the gist is always the same. By discovering a hitherto unknown association through data mining, the retailer was able to take advantage of the information and place displays of beer close to the disposable diapers. Presumably, because diapers come in large bundles, women asked their husbands to buy diapers, and with the weekend imminent, the husbands also decided to stock up on beer. Placing beer and diapers in close proximity caused beer sales to increase. Moreover, the retailer could also charge full price for the beer because of the higher demand. This often-repeated tale is one of those that, as my former college provost said, “should be true if it isn’t.”

Solution

In Recipe 16-3, we will examine association rule learning. In contrast to sequence mining, association rule mining is typically not concerned with the order of items within transactions or across transactions. One major consideration is that with large datasets, we often see associations that appear to be important, but are spurious and due to chance only. With large numbers of items, we must control the risk of interpreting such associations as meaningful or useful by establishing a higher standard for statistical soundness than we might employ in a controlled experiment.

Association analysis is especially useful for finding patterns and relationships to assist retailers, whether online or in stores. At a simplified level, imagine a shopping basket (or online shopping cart) that could have any of a collection of items in it. The aim of association rule analysis is to identify collections of items that appear together in multiple baskets. We can model a general association rule as

{LHS} => {RHS}

where LHS is the left-hand side and RHS is the right-hand side of a grouping of items. Often we think of LHS as an antecedent and RHS as a consequent, but we are not strict on that because we are looking at correlations or associations rather than causation. Either the LHS or the RHS could consist of multiple items. For a grocery, a common pattern might be as follows:

{bread, butter} => {milk}

We will explore association rule analysis using the arules package. This package uses the apriori algorithm, which is also implemented in IBM’s proprietary SPSS Modeler (formerly Clementine). Our example is what is known as basket analysis. Think of a basket as representing a collection of items, such as a basket of shopping items, a panel of blood test results, a portfolio of stocks, or an assortment of medications prescribed to a patient. Each “basket” has a unique identifier, and its contents are listed in a column next to that identifier. The unit of analysis could literally be a shopping basket, but it could also be a particular customer, a patient, an online shopping cart, or some other entity. The contents of the baskets are the target variable in the association rule analysis.

In association rule analysis, the search heuristic looks for patterns of repeated items, such as the examples given earlier. We want to find combinations of items that are “frequent enough” and “interesting enough” to keep the number of association rules manageable. The two primary measures used in association analysis are support and confidence. Support is expressed as some minimum percentage of the time that two or more items appear together. It must typically be set fairly low, because obvious combinations that appear together frequently (say, chips and dip) are of less interest than unusual ones (such as beer and diapers). Confidence is also expressed as a proportion, but more formally, it is a conditional probability; that is, the probability of RHS given LHS, which equals

confidence(LHS => RHS) = P(RHS | LHS) = support(LHS and RHS) / support(LHS)

Another measure of interest is lift, which we can define as the improvement in the occurrence of RHS given the LHS. Lift is the ratio of the conditional probability of RHS given LHS and the unconditional probability of RHS:

lift(LHS => RHS) = P(RHS | LHS) / P(RHS) = confidence(LHS => RHS) / support(RHS)

Let us use the arules package and perform an apriori rules analysis of the dataset called “marketBasket.csv.” The data consist of 16 different items that might be bought at a grocery store, and 8 separate transactions (baskets). The data are as follows:

> basket <- read.csv("marketBasket.csv")
> basket
basket item
1 1 bread
2 1 butter
3 1 eggs
4 1 milk
5 2 bologna
6 2 bread
7 2 cheese
8 2 chips
9 2 mayo
10 2 soda
11 3 bananas
12 3 bread
13 3 butter
14 3 cheese
15 3 oranges
16 4 buns
17 4 chips
18 4 hotdogs
19 4 mustard
20 4 soda
21 5 buns
22 5 chips
23 5 hotdogs
24 5 mustard
25 5 pickles
26 5 soda
27 6 bread
28 6 butter
29 6 cereal
30 6 eggs
31 6 milk
32 7 bananas
33 7 cereal
34 7 eggs
35 7 milk
36 7 oranges
37 8 bologna
38 8 bread
39 8 buns
40 8 cheese
41 8 chips
42 8 hotdogs
43 8 mayo
44 8 mustard
45 8 soda

To use the apriori function in the arules package, we must first convert the dataset into a transaction data structure:

> library(arules)
> newBasket <- new.env()
> newBasket <- as(split(basket$item, basket$basket), "transactions")
> newBasket
transactions in sparse format with
8 transactions (rows) and
16 items (columns)

Now, we can use the apriori function to find the associations among the items in the eight shopping baskets. We will save the rules as an object we can query for additional information. We specify the parameters for the association rule mining as a list. We will look for rules that have support of at least .20, confidence of .80, and a minimum length of 2, which means that there must be both an LHS and an RHS. The default is a length of 1, and because the RHS must always have one (and only one) item, the default would allow rules in which the LHS is empty, as in {} => {beer}.

> rules <- apriori(newBasket, parameter = list(supp = .2, conf = .8, minlen = 2, target = "rules"))

parameter specification:
confidence minval smax arem aval originalSupport support minlen maxlen target
0.8 0.1 1 none FALSE TRUE 0.2 1 10 rules
ext
FALSE

algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE

Warning in apriori(newBasket, parameter = list(supp = 0.2, conf = 0.8, target = "rules")) :
You chose a very low absolute support count of 1. You might run out of memory! Increase minimum support.

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[16 item(s), 8 transaction(s)] done [0.00s].
sorting and recoding items ... [15 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 done [0.00s].
writing ... [246 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].

Although we were warned about potential memory limitations, none materialized. The algorithm located 246 rules fitting our criteria. As a point of reference, if we had accepted the defaults, we would have had more than 2,000 rules to consider.

> rules
set of 246 rules
> summary(rules)
set of 246 rules

rule length distribution (lhs + rhs):sizes
2 3 4 5 6
32 89 84 35 6

Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 3.000 4.000 3.569 4.000 6.000

summary of quality measures:
support confidence lift
Min. :0.2500 Min. :1 Min. :1.600
1st Qu.:0.2500 1st Qu.:1 1st Qu.:2.000
Median :0.2500 Median :1 Median :2.667
Mean :0.2866 Mean :1 Mean :2.582
3rd Qu.:0.3750 3rd Qu.:1 3rd Qu.:2.667
Max. :0.5000 Max. :1 Max. :4.000

mining info:
data ntransactions support confidence
newBasket 8 0.2 0.8

We can use inspect() to find the rules with high confidence (or support or lift) as follows:

> inspect(head(sort(rules, by = "confidence")))
lhs rhs support confidence lift
1 {cereal} => {eggs} 0.250 1 2.666667
2 {cereal} => {milk} 0.250 1 2.666667
3 {bananas} => {oranges} 0.250 1 4.000000
4 {oranges} => {bananas} 0.250 1 4.000000
5 {butter} => {bread} 0.375 1 1.600000
6 {eggs} => {milk} 0.375 1 2.666667
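
As a sanity check on the support, confidence, and lift formulas given earlier, we can verify one of these rules by hand. The following sketch computes the three measures for {butter} => {bread} directly from the basket data frame we read in above; the results should match the inspect() output:

n    <- length(unique(basket$basket))                   # 8 transactions
lhs  <- unique(basket$basket[basket$item == "butter"])  # baskets containing butter
rhs  <- unique(basket$basket[basket$item == "bread"])   # baskets containing bread
both <- intersect(lhs, rhs)                             # baskets containing both
length(both) / n                                        # support    = 3/8 = 0.375
length(both) / length(lhs)                              # confidence = 3/3 = 1
(length(both) / length(lhs)) / (length(rhs) / n)        # lift       = 1 / (5/8) = 1.6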

Let us use the rattle package for this same application to illustrate how to do the association rule analysis. Install the rattle package, and then load and launch it using the following commands:

> install.packages("rattle")
> library("rattle")
Rattle: A free graphical interface for data mining with R.
Version 3.3.0 Copyright (c) 2006-2014 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
> rattle()

The Rattle GUI opens in a new window. Every operation in Rattle must be explicitly executed. For example, if you choose a dataset but do not click the Execute icon, the data will not be loaded into Rattle. We will load the marketBasket data and then use Rattle to “shake, rattle, and roll” the data, as the package advertises. Several example datasets come with Rattle, the most complex of which is the “weather” dataset. Rattle uses the arules package and the apriori function for the association rule analysis, just as we have done. Rattle will install any additional packages it requires on the fly, which is a nice feature, but which can be disconcerting when you first begin using Rattle.

We will load the CSV version of our basket dataset and compare the output of Rattle to that of the arules package. There are eight data types supported by Rattle (see Figure 16-1). We will use the “spreadsheet” format for our source, which is really CSV.

1. Click the little folder icon, navigate to the required directory, and then click the file name to open the data (see Figure 16-5).

2. Click Open and then click Execute in the Rattle icon bar (or use the shortcut F2). Failure to execute will mean the file will not be loaded.

3. In the Data tab, change the role of the basket variable to Ident and the role of the item variable to Target.

4. Rattle can be used for two kinds of association rule analysis. The simpler of these is market basket analysis, and it uses only two variables—an identifier and a list of items, as we have been doing. The second approach is to use all the input variables to find associations. To select basket analysis, check the box labeled Baskets on the Associate tab in Rattle. Remember that clicking Open will not actually load the data. You must also click Execute to bring the dataset into R.

Figure 16-5. Opening a CSV file in Rattle

Let us use the same settings in Rattle that we did in arules, and compare the results. Because Rattle uses the same R packages and functions we used, our results should be the same. Remember to click Execute after specifying a basket analysis and changing the settings in Rattle to match the ones we used in arules (see Figure 16-6).

Figure 16-6. The Rattle summary of our association rule analysis

We did get the same results with Rattle that we did using the R functions, but Rattle packaged the results very nicely and added features to view plots and sort the rules, which we had to do manually in R.

Recipe 16-4. Mining Text: A Brief Introduction

You learned about working with strings earlier in this book in Recipes 2-3 and 5-2, and you were briefly introduced to the concept of text mining. In Recipe 16-4, we will cover text mining in enough depth to get you started mining text on your own. One of the “coolest” things you can do with text mining, at least in my opinion, is to produce “word clouds,” so we will do that as well.

Problem

Written and spoken words are everywhere, and are being recorded, stored, and mined with greater frequency than ever before. Although the focus of data mining is on finding patterns, associations, trends, and differences among data values that are primarily numeric, the great majority of the information in the world is text rather than numbers, as we discussed previously.

Text mining is used in a number of different fields, just as data mining is, and there are many ways to go about mining text. Finding patterns and associations among words becomes very important when such patterns and associations help solve practical problems. I am a quantitative researcher by training and preference, but qualitative research, which uses narratives the way quantitative researchers use numbers, is growing both in importance and popularity. Interpreting the results of interviews, written narratives, and other text-based data is becoming pervasive, and qualitative research is seen by many, if not most, as being on a par with quantitative research. In fact, many research projects now combine both quantitative and qualitative methods in what is known as mixed-methods research.

As with all of the techniques discussed in this chapter, there are elements of judgment and subjectivity in text mining, just as in data mining. Those analysts with keener (and often more intuitive) judgment are obviously more effective than those who simply follow heuristic rules dogmatically. At present, computers are still relatively “dumb” when it comes to understanding what words spoken or written by humans really mean. For example, what would a computer program “think” of the sentence “Time flies like an arrow”? Most humans know intuitively that this is a simile, and that we are saying that time passes swiftly, just as an arrow flies (relatively) swiftly. But a computer could just as easily take this as an instruction to time the flight speed of houseflies in the same way that it would time the speed of an arrow’s flight.

As of today, computers don’t really have much of a sense of humor, or much intuition, though advances are being made every day, and the day may soon come when a computer program can pass the Turing test, in which a human judge cannot reliably determine whether he or she is talking to a computer or to another human. Alan Turing posed the question in 1950 as to whether machines can “think,” though his question was presaged by Descartes in his musings about automata centuries earlier.

The year 2012 marked the 100th anniversary of Turing’s birth, and as of today, the Turing test has not been passed with anything better than a 33% rate of a computer program convincing a human that it is another human in conversations facilitated by keyboards and computer monitors rather than face-to-face communication. Text mining combines analytics with natural language processing (NLP). Until the 1980s, NLP systems were based mostly on human-supplied rules, but the introduction of machine learning algorithms revolutionized the field. Although a full discussion of NLP and statistical machine learning is beyond our scope, text mining is a viable and valuable endeavor, assisted by computer programs that allow us to mine the depths of text in ways we could only imagine previously.

Solution

The tm package in R is very widely used for text mining. There are many other tools used by text miners as well, and we will experiment with several of these. In text mining, we are interested in mining one or more collections of words, or corpora. Inspection of Rattle’s interface (see Figure 16-1) reveals that two of the input types supported are corpus and script. A corpus is a collection of written works. In text mining, some of the more mundane tasks are removing “stop words” and stripping suffixes from words to reduce them to their stems so that we can mine the corpus more efficiently.

Text mining is also known as text data mining or text analytics. Just as in data mining in general, we are looking for information that is novel, interesting, or relevant. We might categorize or cluster text, analyze sentiment, summarize documents, or examine patterns or relationships between entities.

We will use the tm package to mine a web post about Clemson University and CUDA (Compute Unified Device Architecture), which is a parallel computing platform developed by NVIDIA and implemented using graphics processing units (GPUs) produced by NVIDIA. Clemson (where I teach part-time) was recently named a CUDA teaching institution and a CUDA research center. The announcement was published online, and I simply copied the text from the web page into a text editor and saved it. The article is used with the permission of its author, Paul Alongi.

To get the text into shape for mining, we must read the text into R, convert it to a character vector, and then convert the vector to a corpus. Next, we need to perform a little “surgery” on the corpus, converting everything to lowercase, removing punctuation and numbers, and removing the common English “stop words” and extra spaces. We can use the tm_map function to create and revise the corpus. First, we need the tm package and the SnowballC package:

library(tm)
library(SnowballC)
text.corpus <- readLines("CLEMSON.txt")
text.corpus <- Corpus(VectorSource(text.corpus))
text.corpus <- tm_map(text.corpus, tolower)
text.corpus <- tm_map(text.corpus, removePunctuation)
text.corpus <- tm_map(text.corpus, removeNumbers)
text.corpus <- tm_map(text.corpus, removeWords, stopwords("english"))
text.corpus <- tm_map(text.corpus, stripWhitespace)
text.corpus <- tm_map(text.corpus, PlainTextDocument)

Now, we can “mine” our corpus, which is seen by tm as 19 separate documents (these are the paragraphs in the article), by finding frequent terms and associations in a fashion similar to what we did in data mining. To do this, we need to convert the corpus to a term document matrix, and then we can find the frequent terms and associations as follows. We set the correlation limit, which is a minimum value, to 0.5. We specify the search term and the term document matrix as well.

> tdm <- TermDocumentMatrix(text.corpus)
> findAssocs(x = tdm, term = "clemson", corlimit = 0.5)
clemson
university 0.66
advancing 0.51
announced 0.51
commitment 0.51
david 0.51
distinguished 0.51
educate 0.51
education 0.51
forward 0.51
generations 0.51
inventor 0.51
leader 0.51
look 0.51
luebke 0.51
monday 0.51
senior 0.51
state 0.51
using 0.51
visual 0.51
working 0.51
world 0.51
> findFreqTerms(x = tdm, lowfreq = 10, highfreq = Inf)
[1] "clemson" "computing" "cuda" "nvidia"

If we wanted to continue mining, we might want to “stem” the corpus, which truncates words to a common stem, so that “compute,” “computer,” and “computing” would all become “comput.” This operation requires the SnowballC package we loaded earlier. We can use the tm_map function with stemDocument to stem the corpus, and then build a term document matrix to see where the stems occur. We examine only the first few of these terms, namely the first 20 rows and the first 5 columns (which correspond to the first 5 paragraphs of the article).

> myStems <- tm_map(text.corpus, stemDocument)
> myCorpus <- TermDocumentMatrix(myStems)
> inspect(myCorpus[1:20,1:5])
<<TermDocumentMatrix (terms: 20, documents: 5)>>
Non-/sparse entries: 3/97
Sparsity : 97%
Maximal term length: 7
Weighting : term frequency (tf)

Docs
Terms character(0) character(0) character(0) character(0) character(0)
acceler 0 0 0 1 0
access 0 0 0 0 0
accord 0 0 0 0 0
across 0 0 0 0 0
advanc 0 0 0 0 0
aim 0 0 0 0 0
air 0 0 0 0 0
alreadi 0 0 0 0 0
also 0 0 0 0 0
analyz 0 0 0 0 0
announc 1 0 0 0 0
applic 0 0 0 0 0
array 0 0 0 0 0
assist 0 0 0 0 0
automot 0 0 0 0 0
base 0 0 0 0 0
benefit 0 0 0 0 0
better 0 1 0 0 0
build 0 0 0 0 0
calcul 0 0 0 0 0
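
To get a quick overview of the stemmed corpus as a whole, we could also rank the stems by their total frequency across all 19 documents. This is a brief sketch using base R on the term document matrix we just created (output omitted):

stemFreq <- sort(rowSums(as.matrix(myCorpus)), decreasing = TRUE)
head(stemFreq, 10)   # the ten most frequent stems in the corpus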

I mentioned word clouds earlier. The wordcloud package will create a nice-looking word cloud from a corpus or a text document. Figure 16-7 shows a word cloud (also known as a tag cloud) from the text we have been mining.

> library(wordcloud)
> wordcloud(text.corpus)

Figure 16-7. A word cloud developed from our corpus
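
If we wanted more control over the appearance of the word cloud, we could pass additional arguments. The following is a sketch, assuming the RColorBrewer package (on which wordcloud depends) is available; it plots only words appearing at least three times, in descending order of frequency, using a color palette:

library(RColorBrewer)
wordcloud(text.corpus, min.freq = 3, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))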