
Machine Learning with R Cookbook (2015)

Chapter 5. Classification (I) – Tree, Lazy, and Probabilistic

In this chapter, we will cover the following recipes:

· Preparing the training and testing datasets

· Building a classification model with recursive partitioning trees

· Visualizing a recursive partitioning tree

· Measuring the prediction performance of a recursive partitioning tree

· Pruning a recursive partitioning tree

· Building a classification model with a conditional inference tree

· Visualizing a conditional inference tree

· Measuring the prediction performance of a conditional inference tree

· Classifying data with a k-nearest neighbor classifier

· Classifying data with logistic regression

· Classifying data with the Naïve Bayes classifier

Introduction

Classification is used to identify the category of new observations (a testing dataset) based on a classification model built from a training dataset, of which the categories are already known. Similar to regression, classification is categorized as a supervised learning method as it employs the known answers (labels) of a training dataset to predict the answers (labels) of the testing dataset. The main difference between regression and classification is that regression is used to predict continuous values.

In contrast to this, classification is used to identify the category of a given observation. For example, one may use regression to predict the future price of a given stock based on historical prices. However, one should use the classification method to predict whether the stock price will rise or fall.

In this chapter, we will illustrate how to use R to perform classification. We first build a training dataset and a testing dataset from the churn dataset, and then apply different classification methods to classify the churn dataset. In the following recipes, we will introduce tree-based classification methods using a traditional classification tree and a conditional inference tree, a lazy-based algorithm, and a probabilistic-based method, using the training dataset to build a classification model and then using the model to predict the category (class label) of the testing dataset. We will also use a confusion matrix to measure the performance.

Preparing the training and testing datasets

Building a classification model requires a training dataset to train the classification model, and testing data to then validate the prediction performance. In the following recipe, we will demonstrate how to split the telecom churn dataset into training and testing datasets.

Getting ready

In this recipe, we will use the telecom churn dataset as the input data source, and split the data into training and testing datasets.

How to do it...

Perform the following steps to split the churn dataset into training and testing datasets:

1. You can retrieve the churn dataset from the C50 package:

> install.packages("C50")
> library(C50)
> data(churn)

2. Use str to read the structure of the dataset:

> str(churnTrain)

3. We can remove the state, area_code, and account_length attributes, which are not appropriate classification features:

> churnTrain = churnTrain[,! names(churnTrain) %in% c("state", "area_code", "account_length") ]

4. Then, split 70 percent of the data into the training dataset and 30 percent of the data into the testing dataset:

> set.seed(2)
> ind = sample(2, nrow(churnTrain), replace = TRUE, prob=c(0.7, 0.3))
> trainset = churnTrain[ind == 1,]
> testset = churnTrain[ind == 2,]

5. Lastly, use dim to explore the dimensions of both the training and testing datasets:

> dim(trainset)
[1] 2315 17
> dim(testset)
[1] 1018 17

How it works...

In this recipe, we use the telecom churn dataset as our example data source. The dataset contains 20 variables with 3,333 observations. We would like to build a classification model to predict whether a customer will churn, which is very important to a telecom company as the cost of acquiring a new customer is significantly higher than that of retaining an existing one.

Before building the classification model, we need to preprocess the data first. Thus, we load the churn data from the C50 package into the R session, which makes the churnTrain dataset available. As we determined that attributes such as state, area_code, and account_length are not useful features for building the classification model, we remove these attributes.

After preprocessing the data, we split it into training and testing datasets. We then use the sample function to randomly generate a sequence of 1s and 2s (with probabilities of 0.7 and 0.3) whose length equals the number of observations. Then, we use the generated sequence to split the churn dataset into the training dataset, trainset, and the testing dataset, testset. Lastly, by using the dim function, we found that 2,315 of the 3,333 observations are categorized into the training dataset, trainset, while the other 1,018 are categorized into the testing dataset, testset.
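To see what the sampling step produces, here is a minimal sketch that draws ten labels with the same settings (the actual values drawn depend on the random seed):

> # Each observation is independently assigned label 1 (training, with
> # probability 0.7) or label 2 (testing, with probability 0.3)
> sample(2, 10, replace = TRUE, prob = c(0.7, 0.3))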

There's more...

You can combine the process of splitting the data into training and testing datasets into a single split.data function. You can then easily split any dataset into the two parts by calling this function and specifying the proportion and seed in its parameters:

> split.data = function(data, p = 0.7, s = 666){
+   # Shuffle the row indices; the first floor(n * p) shuffled rows become
+   # the training set and the remaining rows become the testing set
+   set.seed(s)
+   index = sample(1:dim(data)[1])
+   train = data[index[1:floor(dim(data)[1] * p)], ]
+   test = data[index[(floor(dim(data)[1] * p) + 1):dim(data)[1]], ]
+   return(list(train = train, test = test))
+ }
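For example, you can then produce the two datasets with a single call (shown here on the churn data; the allset variable name is just for illustration):

> allset = split.data(churnTrain, p = 0.7, s = 666)
> trainset = allset$train
> testset = allset$test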

Building a classification model with recursive partitioning trees

A classification tree uses a split condition to predict class labels based on one or multiple input variables. The classification process starts from the root node of the tree; at each node, the process checks whether the input value should recursively continue to the right or left sub-branch according to the split condition, and stops when it reaches any leaf (terminal) node of the decision tree. In this recipe, we will introduce how to apply a recursive partitioning tree to the customer churn dataset.

Getting ready

You need to have completed the previous recipe by splitting the churn dataset into the training dataset (trainset) and testing dataset (testset), and each dataset should contain exactly 17 variables.

How to do it...

Perform the following steps to build a classification model with a recursive partitioning tree:

1. Load the rpart package:

> library(rpart)

2. Use the rpart function to build a classification tree model:

> churn.rp = rpart(churn ~ ., data=trainset)

3. Type churn.rp to retrieve the node detail of the classification tree:

> churn.rp

4. Next, use the printcp function to examine the complexity parameter:

> printcp(churn.rp)

Classification tree:
rpart(formula = churn ~ ., data = trainset)

Variables actually used in tree construction:
[1] international_plan            number_customer_service_calls
[3] total_day_minutes             total_eve_minutes
[5] total_intl_calls              total_intl_minutes
[7] voice_mail_plan

Root node error: 342/2315 = 0.14773

n= 2315

        CP nsplit rel error  xerror     xstd
1 0.076023      0   1.00000 1.00000 0.049920
2 0.074561      2   0.84795 0.99708 0.049860
3 0.055556      4   0.69883 0.76023 0.044421
4 0.026316      7   0.49415 0.52632 0.037673
5 0.023392      8   0.46784 0.52047 0.037481
6 0.020468     10   0.42105 0.50877 0.037092
7 0.017544     11   0.40058 0.47076 0.035788
8 0.010000     12   0.38304 0.47661 0.035993

5. Next, use the plotcp function to plot the cost complexity parameters:

> plotcp(churn.rp)

Figure 1: The cost complexity parameter plot

6. Lastly, use the summary function to examine the built model:

> summary(churn.rp)

How it works...

In this recipe, we use a recursive partitioning tree from the rpart package to build a tree-based classification model. The recursive partitioning tree involves two processes: recursion and partitioning. During the process of decision induction, we have to consider a statistical evaluation question (or simply a yes/no question) to partition the data into different partitions in accordance with the assessment result. Then, once we have determined the child nodes, we can repeatedly perform the splitting until the stopping criteria are satisfied.

For example, the data (shown in the following figure) in the root node can be partitioned into two groups with regard to the question of whether f1 is smaller than X. If so, the data is divided into the left-hand side. Otherwise, it is split into the right-hand side. Then, we can continue to partition the left-hand side data with the question of whether f2 is smaller than Y:

Figure 2: Recursive partitioning tree
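The stopping criteria mentioned above can be tuned through rpart.control. As a minimal sketch (the values shown are rpart's defaults, so this fits the same tree as before):

> # minsplit is the minimum number of observations a node must contain before
> # a split is attempted; cp discards any split that does not decrease the
> # overall lack of fit by at least that factor
> churn.rp2 = rpart(churn ~ ., data = trainset,
+                   control = rpart.control(minsplit = 20, cp = 0.01))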

In the first step, we load the rpart package with the library function. Next, we build a classification model using the churn variable as a classification category (class label) and the remaining variables as input features.

After the model is built, you can type the variable name of the built model, churn.rp, to display the tree node details. In the printed node detail, n indicates the sample size, loss indicates the misclassification cost, yval stands for the classified membership (no or yes, in this case), and yprob stands for the probabilities of the two classes (the left value refers to the probability of reaching the label no, and the right value refers to the probability of reaching the label yes).

Then, we use the printcp function to print the complexity parameters of the built tree model. In the output of printcp, one finds the value of CP, a complexity parameter, which serves as a penalty to control the size of the tree. In short, the greater the CP value, the fewer splits there are (nsplit). The rel error value represents the average deviance of the current tree divided by the average deviance of the null tree. The xerror value represents the relative error estimated by 10-fold cross-validation. xstd stands for the standard error of the relative error.

To make the CP (cost complexity parameter) table more readable, we use plotcp to generate a graphic of the CP table. As per the screenshot (step 5), the x-axis at the bottom illustrates the cp value, the y-axis illustrates the relative error, and the upper x-axis displays the size of the tree. The dotted line indicates the upper limit of one standard deviation. From the screenshot, we can determine that the minimum cross-validation error occurs when the tree is at a size of 12.

We can also use the summary function to display the function call, the complexity parameter table for the fitted tree model, the variable importance (which sums to 100 and helps identify the most important variables for the classification), and detailed information on each node.

The advantage of using a decision tree is that it is very flexible and easy to interpret. It works on both classification and regression problems and, moreover, it is nonparametric. Therefore, one does not have to worry about whether the data is linearly separable. As for the disadvantages, a decision tree tends to be biased and over-fitted. However, you can overcome the bias problem through the use of a conditional inference tree, and solve the problem of over-fitting through a random forest method or tree pruning.

See also

· For more information about the rpart, printcp, and summary functions, please use the help function:

· > ?rpart

· > ?printcp

· > ?summary.rpart

· C50 is another package that provides a decision tree and a rule-based model. If you are interested in the package, you may refer to the document at http://cran.r-project.org/web/packages/C50/C50.pdf.

Visualizing a recursive partitioning tree

From the last recipe, we learned how to print the classification tree in a text format. To make the tree more readable, we can use the plot function to obtain the graphical display of a built classification tree.

Getting ready

One needs to have completed the previous recipe by generating a classification model, and to have assigned the model to the churn.rp variable.

How to do it...

Perform the following steps to visualize the classification tree:

1. Use the plot function and the text function to plot the classification tree:

> plot(churn.rp, margin= 0.1)
> text(churn.rp, all=TRUE, use.n = TRUE)

Figure 3: The graphical display of a classification tree

2. You can also specify the uniform, branch, and margin parameters to adjust the layout:

> plot(churn.rp, uniform=TRUE, branch=0.6, margin=0.1)
> text(churn.rp, all=TRUE, use.n = TRUE)

Figure 4: Adjusting the layout of the classification tree

How it works...

Here, we demonstrate how to use the plot function to graphically display a classification tree. The plot function can simply visualize the classification tree, and you can then use the text function to add text to the plot.

In Figure 3, we assign margin = 0.1 as a parameter to add extra white space around the border, to prevent the displayed text from being truncated by the margin. The length of the branches shows the relative magnitude of the drop in deviance. We then use the text function to add labels for the nodes and branches. By default, the text function adds a split condition at each split, and a category label to each terminal node. In order to add extra information to the tree plot, we set the parameter all to TRUE to add a label to all the nodes. In addition to this, we specify use.n = TRUE to add extra information, which shows the actual number of observations that fall into the two different categories (no and yes).

In Figure 4, we set the option branch to 0.6 to add a shoulder to each plotted branch. In addition to this, in order to display branches of an equal length rather than relative magnitude of the drop in deviance, we set the option uniform to TRUE. As a result, Figure 4 shows a classification tree with short shoulders and branches of equal length.

See also

· You may use ?plot.rpart to read more about the plotting of the classification tree. This document also includes information on how to specify the parameters, uniform, branch, compress, nspace, margin, and minbranch, to adjust the layout of the classification tree.

Measuring the prediction performance of a recursive partitioning tree

Since we have built a classification tree in the previous recipes, we can use it to predict the category (class label) of new observations. Before making a prediction, we first validate the prediction power of the classification tree, which can be done by generating a classification table on the testing dataset. In this recipe, we will introduce how to generate a predicted label versus a real label table with the predict function and the table function, and explain how to generate a confusion matrix to measure the performance.

Getting ready

You need to have the previous recipe completed by generating the classification model, churn.rp. In addition to this, you have to prepare the training dataset, trainset, and the testing dataset, testset, generated in the first recipe of this chapter.

How to do it...

Perform the following steps to validate the prediction performance of a classification tree:

1. You can use the predict function to generate predicted labels for the testing dataset:

> predictions = predict(churn.rp, testset, type="class")

2. Use the table function to generate a classification table for the testing dataset:

> table(testset$churn, predictions)
     predictions
      yes  no
  yes 100  41
  no   18 859

3. One can further generate a confusion matrix using the confusionMatrix function provided in the caret package:

> library(caret)
> confusionMatrix(table(predictions, testset$churn))
Confusion Matrix and Statistics

predictions yes  no
        yes 100  18
        no   41 859

               Accuracy : 0.942
                 95% CI : (0.9259, 0.9556)
    No Information Rate : 0.8615
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7393
 Mcnemar's Test P-Value : 0.004181

            Sensitivity : 0.70922
            Specificity : 0.97948
         Pos Pred Value : 0.84746
         Neg Pred Value : 0.95444
             Prevalence : 0.13851
         Detection Rate : 0.09823
   Detection Prevalence : 0.11591
      Balanced Accuracy : 0.84435

       'Positive' Class : yes

How it works...

In this recipe, we use the predict function and the built classification model, churn.rp, to predict the possible class labels of the testing dataset, testset. The predicted categories (class labels) are coded as either no or yes. Then, we use the table function to generate a classification table on the testing dataset. From the table, we discover that 859 observations are correctly predicted as no, while 18 no observations are misclassified as yes; 100 yes observations are correctly predicted, but 41 are misclassified as no. Further, we use the confusionMatrix function from the caret package to produce a measurement of the classification model.
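As a quick sanity check, the reported accuracy can be recomputed by hand from the classification table, that is, the correctly predicted observations divided by all observations:

> # Diagonal cells of the table are correct predictions: (100 + 859) / 1018
> tb = table(testset$churn, predictions)
> sum(diag(tb)) / sum(tb)
[1] 0.9420432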

See also

· You may use ?confusionMatrix to read more about the performance measurement using the confusion matrix

· For those who are interested in the definition output by the confusion matrix, please refer to the Wikipedia entry, Confusion_matrix (http://en.wikipedia.org/wiki/Confusion_matrix)

Pruning a recursive partitioning tree

In previous recipes, we have built a complex decision tree for the churn dataset. However, sometimes we have to remove sections that are not powerful in classifying instances to avoid over-fitting, and to improve the prediction accuracy. Therefore, in this recipe, we introduce the cost complexity pruning method to prune the classification tree.

Getting ready

You need to have the previous recipes completed by generating a classification model, and to have assigned the model to the churn.rp variable.

How to do it...

Perform the following steps to prune the classification tree:

1. Find the minimum cross-validation error of the classification tree model:

> min(churn.rp$cptable[,"xerror"])
[1] 0.4707602

2. Locate the record with the minimum cross-validation error:

> which.min(churn.rp$cptable[,"xerror"])
7

3. Get the cost complexity parameter of the record with the minimum cross-validation error:

> churn.cp = churn.rp$cptable[7,"CP"]
> churn.cp
[1] 0.01754386

4. Prune the tree by setting the cp parameter to the CP value of the record with the minimum cross-validation error:

> prune.tree = prune(churn.rp, cp= churn.cp)

5. Visualize the classification tree by using the plot and text functions:

> plot(prune.tree, margin= 0.1)
> text(prune.tree, all=TRUE , use.n=TRUE)

Figure 5: The pruned classification tree

6. Next, you can generate a classification table based on the pruned classification tree model:

> predictions = predict(prune.tree, testset, type="class")
> table(testset$churn, predictions)
     predictions
      yes  no
  yes  95  46
  no   14 863

7. Lastly, you can generate a confusion matrix based on the classification table:

> confusionMatrix(table(predictions, testset$churn))
Confusion Matrix and Statistics

predictions yes  no
        yes  95  14
        no   46 863

               Accuracy : 0.9411
                 95% CI : (0.9248, 0.9547)
    No Information Rate : 0.8615
    P-Value [Acc > NIR] : 2.786e-16

                  Kappa : 0.727
 Mcnemar's Test P-Value : 6.279e-05

            Sensitivity : 0.67376
            Specificity : 0.98404
         Pos Pred Value : 0.87156
         Neg Pred Value : 0.94939
             Prevalence : 0.13851
         Detection Rate : 0.09332
   Detection Prevalence : 0.10707
      Balanced Accuracy : 0.82890

       'Positive' Class : yes

How it works...

In this recipe, we discussed pruning a classification tree to avoid over-fitting and to produce a more robust classification model. We first located the record with the minimum cross-validation error within the cptable, extracted the CP of that record, and assigned the value to churn.cp. Next, we used the prune function to prune the classification tree with churn.cp as the parameter. Then, by using the plot function, we graphically displayed the pruned classification tree. From Figure 5, it is clear that the tree has fewer splits than the original classification tree (Figure 3). Lastly, we produced a classification table and used the confusion matrix to validate the performance of the pruned tree. The result shows that the accuracy (0.9411) is slightly lower than that of the original model (0.942), which suggests that the pruned tree may not perform better than the original classification tree, as we have removed some split conditions (still, one should also examine the change in sensitivity and specificity). However, the pruned tree model is more robust, as it removes split conditions that may lead to over-fitting.
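The first four steps of this recipe can also be collapsed into a short snippet that avoids hard-coding the row index; a minimal sketch, assuming churn.rp from the earlier recipe:

> opt = which.min(churn.rp$cptable[, "xerror"])   # row with the lowest xerror
> churn.cp = churn.rp$cptable[opt, "CP"]          # its complexity parameter
> prune.tree = prune(churn.rp, cp = churn.cp)     # prune with that CP value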

See also

· For those who would like to know more about cost complexity pruning, please refer to the Wikipedia article Pruning (decision trees): http://en.wikipedia.org/wiki/Pruning_(decision_trees)

Building a classification model with a conditional inference tree

In addition to traditional decision trees (rpart), conditional inference trees (ctree) are another popular tree-based classification method. Similar to traditional decision trees, conditional inference trees also recursively partition the data by performing univariate splits on the input variables. However, what makes conditional inference trees different from traditional decision trees is that conditional inference trees use significance test procedures to select variables rather than selecting variables by maximizing information measures (rpart employs a Gini coefficient). In this recipe, we will introduce how to use a conditional inference tree to build a classification model.

Getting ready

You need to have the first recipe completed by generating the training dataset, trainset, and the testing dataset, testset.

How to do it...

Perform the following steps to build the conditional inference tree:

1. First, we use ctree from the party package to build the classification model:

> library(party)
> ctree.model = ctree(churn ~ . , data = trainset)

2. Then, we examine the built tree model:

> ctree.model

How it works...

In this recipe, we used a conditional inference tree to build a classification tree. The use of ctree is similar to that of rpart. Therefore, you can easily test the classification power of either a traditional decision tree or a conditional inference tree when confronting classification problems. Next, we obtain the node details of the classification tree by examining the built model. Within the model, we discover that ctree provides information such as the split condition, the criterion (1 – p-value), the statistic (test statistic), and the weights (the case weights corresponding to the node). However, it does not offer as much information as rpart does through the use of the summary function.

See also

· You may use the help function to refer to the definition of Binary Tree Class and read more about the properties of binary trees:

· > help("BinaryTree-class")

Visualizing a conditional inference tree

Similar to rpart, the party package also provides a visualization method for users to plot conditional inference trees. In the following recipe, we will introduce how to use the plot function to visualize conditional inference trees.

Getting ready

You need to have the previous recipe completed by generating the conditional inference tree model, ctree.model. In addition to this, you need to have both trainset and testset loaded in an R session.

How to do it...

Perform the following steps to visualize the conditional inference tree:

1. Use the plot function to plot ctree.model built in the last recipe:

> plot(ctree.model)

Figure 6: A conditional inference tree of churn data

2. To obtain a simpler conditional inference tree, one can rebuild the model with fewer input features, and redraw the classification tree:

> daycharge.model = ctree(churn ~ total_day_charge, data = trainset)
> plot(daycharge.model)

Figure 7: A conditional inference tree using the total_day_charge variable as the only split condition

How it works...

To visualize the node details of the conditional inference tree, we can apply the plot function to a built classification model. The output figure reveals that every intermediate node shows the name of the split variable and the p-value. The split condition is displayed on the left and right branches. The terminal nodes show the number of categorized observations, n, and the probabilities of the class labels.

Taking Figure 7 as an example, we first build a classification model using total_day_charge as the only feature and churn as the class label. The built classification tree shows that when total_day_charge is above 48.18, the lighter gray area is greater than the darker gray in node 9, which indicates that the customer with a day charge of over 48.18 has a greater likelihood to churn (label = yes).

See also

· The visualization of the conditional inference tree comes from the plot.BinaryTree function. If you are interested in adjusting the layout of the classification tree, you may use the help function to read the following document:

· > ?plot.BinaryTree

Measuring the prediction performance of a conditional inference tree

After building a conditional inference tree as a classification model, we can use the treeresponse and predict functions to predict categories of the testing dataset, testset, and further validate the prediction power with a classification table and a confusion matrix.

Getting ready

You need to have the previous recipe completed by generating the conditional inference tree model, ctree.model. In addition to this, you need to have both trainset and testset loaded in an R session.

How to do it...

Perform the following steps to measure the prediction performance of a conditional inference tree:

1. You can use the predict function to predict the category of the testing dataset, testset:

> ctree.predict = predict(ctree.model ,testset)
> table(ctree.predict, testset$churn)

ctree.predict yes  no
          yes  99  15
          no   42 862

2. Furthermore, you can use confusionMatrix from the caret package to generate the performance measurements of the prediction result:

> confusionMatrix(table(ctree.predict, testset$churn))
Confusion Matrix and Statistics

ctree.predict yes  no
          yes  99  15
          no   42 862

               Accuracy : 0.944
                 95% CI : (0.9281, 0.9573)
    No Information Rate : 0.8615
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7449
 Mcnemar's Test P-Value : 0.0005736

            Sensitivity : 0.70213
            Specificity : 0.98290
         Pos Pred Value : 0.86842
         Neg Pred Value : 0.95354
             Prevalence : 0.13851
         Detection Rate : 0.09725
   Detection Prevalence : 0.11198
      Balanced Accuracy : 0.84251

       'Positive' Class : yes

3. You can also use the treeresponse function, which will tell you the list of class probabilities:

> tr = treeresponse(ctree.model, newdata = testset[1:5,])
> tr
[[1]]
[1] 0.03497409 0.96502591

[[2]]
[1] 0.02586207 0.97413793

[[3]]
[1] 0.02586207 0.97413793

[[4]]
[1] 0.02586207 0.97413793

[[5]]
[1] 0.03497409 0.96502591

How it works...

In this recipe, we first demonstrate that one can use the predict function to predict the category (class label) of the testing dataset, testset, and then employ the table function to generate a classification table. Next, you can use the confusionMatrix function from the caret package to determine the performance measurements.

In addition to the predict function, treeresponse is also capable of estimating the class probabilities; an observation is typically classified into the label with the higher probability. In this example, we demonstrated how to obtain the estimated class probabilities for the top five records of the testing dataset, testset. The treeresponse function returns a list of five probability vectors. You can use the list to determine the label of each instance.
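As a minimal sketch of that last step (the probs variable name is illustrative, not from the book), you can turn the probability list into labels, assuming each vector follows the order of the churn factor levels (yes, no):

> probs = do.call(rbind, tr)                 # one row of probabilities per case
> levels(testset$churn)[max.col(probs)]      # pick the level with the higher probability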

See also

· For the predict function, you can specify the type as response, prob, or node. If you specify the type as prob when using the predict function (for example, predict(… type="prob")), you will get exactly the same result as what treeresponse returns.

Classifying data with the k-nearest neighbor classifier

K-nearest neighbor (knn) is a nonparametric lazy learning method. From a nonparametric view, it does not make any assumptions about data distribution. In terms of lazy learning, it does not require an explicit learning phase for generalization. The following recipe will introduce how to apply the k-nearest neighbor algorithm on the churn dataset.

Getting ready

You need to have the first recipe completed by generating the training and testing datasets.

How to do it...

Perform the following steps to classify the churn data with the k-nearest neighbor algorithm:

1. First, one has to install the class package and have it loaded in an R session:

> install.packages("class")
> library(class)

2. Replace yes and no of the voice_mail_plan and international_plan attributes in both the training dataset and testing dataset with 1 and 0:

> levels(trainset$international_plan) = list("0"="no", "1"="yes")
> levels(trainset$voice_mail_plan) = list("0"="no", "1"="yes")
> levels(testset$international_plan) = list("0"="no", "1"="yes")
> levels(testset$voice_mail_plan) = list("0"="no", "1"="yes")

3. Use the knn classification method on the training dataset and the testing dataset:

> churn.knn = knn(trainset[,! names(trainset) %in% c("churn")], testset[,! names(testset) %in% c("churn")], trainset$churn, k=3)

4. Then, you can use the summary function to retrieve the number of predicted labels:

> summary(churn.knn)
yes  no
 77 941

5. Next, you can generate the classification table using the table function:

> table(testset$churn, churn.knn)
     churn.knn
      yes  no
  yes  44  97
  no   33 844

6. Lastly, you can generate a confusion matrix by using the confusionMatrix function:

> confusionMatrix(table(testset$churn, churn.knn))
Confusion Matrix and Statistics

     churn.knn
      yes  no
  yes  44  97
  no   33 844

               Accuracy : 0.8723
                 95% CI : (0.8502, 0.8922)
    No Information Rate : 0.9244
    P-Value [Acc > NIR] : 1

                  Kappa : 0.339
 Mcnemar's Test P-Value : 3.286e-08

            Sensitivity : 0.57143
            Specificity : 0.89692
         Pos Pred Value : 0.31206
         Neg Pred Value : 0.96237
             Prevalence : 0.07564
         Detection Rate : 0.04322
   Detection Prevalence : 0.13851
      Balanced Accuracy : 0.73417

       'Positive' Class : yes

How it works...

knn stores all the training samples and classifies new instances based on a similarity (distance) measure. For example, the similarity measure can be formulated as follows:

· Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

· Manhattan distance: $d(x, y) = \sum_{i=1}^{n} \left| x_i - y_i \right|$
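To make the two measures concrete, here is a quick illustration with base R's dist function on two small vectors (the expected values are worked out in the comments):

> m = rbind(x = c(1, 2, 3), y = c(4, 5, 6))
> dist(m, method = "euclidean")   # sqrt((1-4)^2 + (2-5)^2 + (3-6)^2) = sqrt(27), about 5.196
> dist(m, method = "manhattan")   # |1-4| + |2-5| + |3-6| = 9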

In knn, a new instance is classified into the label (class) that is most common among its k-nearest neighbors. If k = 1, then the new instance is assigned to the class of its single nearest neighbor. The only required input for the algorithm is k. If we choose a small k, it may lead to over-fitting. On the other hand, if we choose a large k, it may result in under-fitting. To choose a proper k value, one can rely on cross-validation.
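As a minimal sketch of such a cross-validation (mirroring the inputs used in step 3, with train.x and acc as illustrative names), the class package's knn.cv function performs leave-one-out cross-validation on the training data; the candidate range 1:10 is an arbitrary choice:

> train.x = trainset[, !names(trainset) %in% c("churn")]
> acc = sapply(1:10, function(k)
+   mean(knn.cv(train.x, trainset$churn, k = k) == trainset$churn))
> which.max(acc)   # the k with the highest cross-validated accuracy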

The advantages of knn are:

· The cost of the learning process is zero

· It is nonparametric, which means that you do not have to make the assumption of data distribution

· You can classify any data whenever you can find similarity measures of given instances

The main disadvantages of knn are:

· It is hard to interpret the classified result.

· It is an expensive computation for a large dataset.

· The performance relies on the number of dimensions. Therefore, for a high dimension problem, you should reduce the dimension first to increase the process performance.

The use of knn does not vary significantly from applying the tree-based algorithms mentioned in the previous recipes. However, while a tree-based algorithm may show you the decision tree model, the output produced by knn only reveals the predicted classification factors. Also, before building a classification model, one should replace attributes of a string type with numeric values, since the k-nearest neighbor algorithm needs to calculate the distances between observations. We then build a classification model by specifying k=3, which means it chooses the three nearest neighbors. After the classification model is built, we can generate a classification table using the predicted factors and the testing dataset labels as the input. Lastly, we can generate a confusion matrix from the classification table. The confusion matrix output reveals an accuracy of 0.8723, which suggests that both the tree-based methods mentioned in the previous recipes outperform the k-nearest neighbor classification method in this case. Still, we cannot determine which model is better based merely on accuracy; one should also examine the specificity and sensitivity from the output.

See also

· There is another package named kknn, which provides a weighted k-nearest neighbor classification, regression, and clustering. You can learn more about the package by reading this document: http://cran.r-project.org/web/packages/kknn/kknn.pdf.

Classifying data with logistic regression

Logistic regression is a form of probabilistic statistical classification model, which can be used to predict class labels based on one or more features. The classification is done by using the logit function to estimate the outcome probability. One can use logistic regression by specifying the family as a binomial while using the glm function. In this recipe, we will introduce how to classify data using logistic regression.

Getting ready

You need to have completed the first recipe by generating training and testing datasets.

How to do it...

Perform the following steps to classify the churn data with logistic regression:

1. With the specification of the family as binomial, we apply the glm function to the dataset, trainset, by using churn as the class label and the rest of the variables as input features:

> fit = glm(churn ~ ., data = trainset, family=binomial)

2. Use the summary function to obtain summary information of the built logistic regression model:

> summary(fit)

Call:
glm(formula = churn ~ ., family = binomial, data = trainset)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-3.1519  0.1983  0.3460  0.5186  2.1284

Coefficients:
                                Estimate  Std. Error z value Pr(>|z|)
(Intercept)                    8.3462866   0.8364914   9.978  < 2e-16 ***
international_planyes         -2.0534243   0.1726694 -11.892  < 2e-16 ***
voice_mail_planyes             1.3445887   0.6618905   2.031 0.042211 *
number_vmail_messages         -0.0155101   0.0209220  -0.741 0.458496
total_day_minutes              0.2398946   3.9168466   0.061 0.951163
total_day_calls               -0.0014003   0.0032769  -0.427 0.669141
total_day_charge              -1.4855284  23.0402950  -0.064 0.948592
total_eve_minutes              0.3600678   1.9349825   0.186 0.852379
total_eve_calls               -0.0028484   0.0033061  -0.862 0.388928
total_eve_charge              -4.3204432  22.7644698  -0.190 0.849475
total_night_minutes            0.4431210   1.0478105   0.423 0.672367
total_night_calls              0.0003978   0.0033188   0.120 0.904588
total_night_charge            -9.9162795  23.2836376  -0.426 0.670188
total_intl_minutes             0.4587114   6.3524560   0.072 0.942435
total_intl_calls               0.1065264   0.0304318   3.500 0.000464 ***
total_intl_charge             -2.0803428  23.5262100  -0.088 0.929538
number_customer_service_calls -0.5109077   0.0476289 -10.727  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1938.8 on 2314 degrees of freedom
Residual deviance: 1515.3 on 2298 degrees of freedom
AIC: 1549.3

Number of Fisher Scoring iterations: 6

3. Then, we find that the built model contains insignificant variables, which could lead to misclassification. Therefore, we use the significant variables only to train the classification model:

> fit = glm(churn ~ international_plan + voice_mail_plan+total_intl_calls+number_customer_service_calls, data = trainset, family=binomial)
> summary(fit)

Call:
glm(formula = churn ~ international_plan + voice_mail_plan +
    total_intl_calls + number_customer_service_calls, family = binomial,
    data = trainset)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.7308  0.3103  0.4196  0.5381  1.6716

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                    2.32304    0.16770  13.852  < 2e-16 ***
international_planyes         -2.00346    0.16096 -12.447  < 2e-16 ***
voice_mail_planyes             0.79228    0.16380   4.837 1.32e-06 ***
total_intl_calls               0.08414    0.02862   2.939  0.00329 **
number_customer_service_calls -0.44227    0.04451  -9.937  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1938.8 on 2314 degrees of freedom
Residual deviance: 1669.4 on 2310 degrees of freedom
AIC: 1679.4

Number of Fisher Scoring iterations: 5

4. Then, you can use the fitted model, fit, to predict the outcome of testset. You can also determine the class by judging whether the probability is above 0.5:

> pred = predict(fit, testset, type="response")
> Class = pred > .5

5. Next, the use of the summary function will show you the binary outcome count, revealing how many of the probabilities are above 0.5:

> summary(Class)
   Mode   FALSE    TRUE    NA's
logical      29     989       0

6. You can generate the counting statistics based on the testing dataset label and predicted result:

> tb = table(testset$churn, Class)
> tb
     Class
      FALSE TRUE
  yes    18  123
  no     11  866

7. You can turn the statistics of the previous step into a classification table, and then generate the confusion matrix:

> churn.mod = ifelse(testset$churn == "yes", 1, 0)
> pred_class = churn.mod
> pred_class[pred<=.5] = 1 - pred_class[pred<=.5]
> ctb = table(churn.mod, pred_class)
> ctb
         pred_class
churn.mod   0   1
        0 866  11
        1  18 123
> confusionMatrix(ctb)
Confusion Matrix and Statistics

         pred_class
churn.mod   0   1
        0 866  11
        1  18 123

               Accuracy : 0.9715
                 95% CI : (0.9593, 0.9808)
    No Information Rate : 0.8684
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.8781
 Mcnemar's Test P-Value : 0.2652

            Sensitivity : 0.9796
            Specificity : 0.9179
         Pos Pred Value : 0.9875
         Neg Pred Value : 0.8723
             Prevalence : 0.8684
         Detection Rate : 0.8507
   Detection Prevalence : 0.8615
      Balanced Accuracy : 0.9488

       'Positive' Class : 0
How it works...

Logistic regression is very similar to linear regression; the main difference is that the dependent variable in linear regression is continuous, while the dependent variable in logistic regression is dichotomous (or nominal). The primary goal of logistic regression is to use the logit to model how the probability of the nominal outcome relates to the measurement variables. We can formulate the logit with the following equation: ln(P/(1-P)), where P is the probability that a certain event occurs.
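A minimal numeric illustration of this relationship (not tied to the fitted model): the logistic function maps a log-odds value back to a probability, and the logit recovers it:

> eta = 1.5                  # an arbitrary log-odds (logit) value
> P = 1 / (1 + exp(-eta))    # inverse logit: the corresponding probability
> P
[1] 0.8175745
> log(P / (1 - P))           # applying ln(P/(1-P)) recovers the log-odds
[1] 1.5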

The advantages of logistic regression are that it is easy to interpret, it directly models the class probability, and it provides a confidence interval for the result. Unlike a decision tree, whose model is hard to update, you can quickly update the classification model to incorporate new data in logistic regression. The main drawback of the algorithm is that it suffers from multicollinearity and, therefore, the explanatory variables must be linearly independent. glm provides a generalized linear regression model, which enables you to specify the model in the family option. If the family is specified as binomial, you can use logistic regression to classify a categorical dependent variable.

The classification process begins by generating a logistic regression model from the training dataset, specifying churn as the class label, the other variables as training features, and the family as binomial. We then use the summary function to generate the model's summary information. From the summary information, we may find some insignificant variables (p-values > 0.05), which may lead to misclassification. Therefore, we should consider only the significant variables for the model.

Next, we use the fitted model, fit, with the predict function to predict the categorical dependent variable of the testing dataset, testset. The predict function (with type="response") outputs the probability of a class label; a result equal to 0.5 and below suggests that the predicted label does not match the label of the testing dataset, and a probability above 0.5 indicates that the predicted label matches the label of the testing dataset. Further, we can use the summary function to obtain the statistics of whether the predicted labels match the labels of the testing dataset. Lastly, in order to generate a confusion matrix, we first generate a classification table, and then use confusionMatrix to generate the performance measurement.

See also

· For more information on how to use the glm function, please refer to Chapter 4, Understanding Regression Analysis, which covers how to interpret the output of the glm function

Classifying data with the Naïve Bayes classifier

The Naïve Bayes classifier is also a probability-based classifier, which is based on applying Bayes' theorem with a strong (naïve) independence assumption. In this recipe, we will introduce how to classify data with the Naïve Bayes classifier.

Getting ready

You need to have the first recipe completed by generating training and testing datasets.

How to do it...

Perform the following steps to classify the churn data with the Naïve Bayes classifier:

1. Load the e1071 library and employ the naiveBayes function to build the classifier:

> library(e1071)
> classifier=naiveBayes(trainset[, !names(trainset) %in% c("churn")], trainset$churn)

2. Type classifier to examine the function call, a-priori probabilities, and conditional probabilities:

> classifier

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = trainset[, !names(trainset) %in% c("churn")],
    y = trainset$churn)

A-priori probabilities:
trainset$churn
      yes        no
0.1477322 0.8522678

Conditional probabilities:
              international_plan
trainset$churn         no        yes
           yes 0.70467836 0.29532164
           no  0.93512418 0.06487582

3. Next, you can generate a classification table for the testing dataset:

> bayes.table = table(predict(classifier, testset[, !names(testset) %in% c("churn")]), testset$churn)
> bayes.table

      yes  no
  yes  68  45
  no   73 832

4. Lastly, you can generate a confusion matrix from the classification table:

> confusionMatrix(bayes.table)
Confusion Matrix and Statistics

      yes  no
  yes  68  45
  no   73 832

               Accuracy : 0.8841
                 95% CI : (0.8628, 0.9031)
    No Information Rate : 0.8615
    P-Value [Acc > NIR] : 0.01880

                  Kappa : 0.4701
 Mcnemar's Test P-Value : 0.01294

            Sensitivity : 0.4823
            Specificity : 0.9487
         Pos Pred Value : 0.6018
         Neg Pred Value : 0.9193
             Prevalence : 0.1385
         Detection Rate : 0.0668
   Detection Prevalence : 0.1110
      Balanced Accuracy : 0.7155

       'Positive' Class : yes

How it works...

Naïve Bayes assumes that features are conditionally independent, that is, that the effect of a predictor (x) on the class (c) is independent of the effects of the other predictors. It computes the posterior probability, P(c|x), with the following formula:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$

Here, P(x|c) is called the likelihood, P(x) is called the marginal likelihood, and P(c) is called the prior probability. If there are many predictors, we can formulate the posterior probability as follows:

$$P(c \mid x_1, \dots, x_n) = \frac{P(x_1 \mid c)\,P(x_2 \mid c) \cdots P(x_n \mid c)\,P(c)}{P(x_1, \dots, x_n)}$$

The advantage of Naïve Bayes is that it is relatively simple and straightforward to use. It is suitable when the training set is relatively small and may contain some noisy and missing data. Moreover, you can easily obtain the probability for a prediction. The drawback of Naïve Bayes is that it assumes that all features are independent and equally important, which is very unlikely in real-world cases.

In this recipe, we use the Naïve Bayes classifier from the e1071 package to build a classification model. First, we specify all the variables (excluding the churn class label) as the first input parameters, and specify the churn class label as the second parameter in the naiveBayes function call. Next, we assign the classification model into the variable classifier. Then, we print the variable classifier to obtain information, such as function call, A-priori probabilities, and conditional probabilities. We can also use the predict function to obtain the predicted outcome and the table function to retrieve the classification table of the testing dataset. Finally, we use a confusion matrix to calculate the performance measurement of the classification model.
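Reflecting the advantage mentioned above, the posterior probabilities themselves are easy to obtain; a minimal sketch, assuming the classifier built in this recipe (output omitted here):

> # Setting type = "raw" makes predict return the posterior probability for
> # each class (one column per class) instead of the predicted labels; shown
> # here for the first three test observations
> predict(classifier, testset[1:3, !names(testset) %in% c("churn")], type = "raw")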

Finally, we list a comparison table of all the algorithms mentioned in this chapter:

Recursive partitioning tree

· Advantages: very flexible and easy to interpret; works on both classification and regression problems; nonparametric

· Disadvantages: prone to bias and over-fitting

Conditional inference tree

· Advantages: very flexible and easy to interpret; works on both classification and regression problems; nonparametric; less prone to bias than a recursive partitioning tree

· Disadvantages: prone to over-fitting

K-nearest neighbor classifier

· Advantages: the cost of the learning process is zero; nonparametric; you can classify any data whenever you can find similarity measures of given instances

· Disadvantages: hard to interpret the classified result; computation is expensive for a large dataset; the performance relies on the number of dimensions

Logistic regression

· Advantages: easy to interpret; directly models the class probability; provides a confidence interval; you can quickly update the classification model to incorporate new data

· Disadvantages: suffers from multicollinearity; does not handle missing values of continuous variables; sensitive to extreme values of continuous variables

Naïve Bayes

· Advantages: relatively simple and straightforward to use; suitable when the training set is relatively small; can deal with some noisy and missing data; can easily obtain the probability for a prediction

· Disadvantages: assumes all features are independent and equally important, which is very unlikely in real-world cases; prone to bias when the number of training sets increases

See also

· To learn more about the Bayes theorem, you can refer to the following Wikipedia article: http://en.wikipedia.org/wiki/Bayes'_theorem