# Practical Data Science with R (2014)

## Part 2. Modeling methods

### Chapter 6. Memorization methods

*This chapter covers*

· Building single-variable models

· Cross-validated variable selection

· Building basic multivariable models

· Starting with decision trees, nearest neighbor, and naive Bayes models

The simplest methods in data science are what we call *memorization methods*. These are methods that generate answers by returning a majority category (in the case of classification) or average value (in the case of scoring) of a subset of the original training data. These methods can vary from models depending on a single variable (similar to the analyst’s pivot table), to decision trees (similar to what are called *business rules*), to nearest neighbor and Naive Bayes methods.^{[1]} In this chapter, you’ll learn how to use these memorization methods to solve classification problems (though the same techniques also work for scoring problems).

^{1} Be aware: *memorization methods* are a nonstandard classification of techniques that we’re using to organize our discussion.

**6.1. KDD and KDD Cup 2009**

We’ll demonstrate all of the techniques in this chapter on the KDD Cup 2009 dataset as our example dataset. The Conference on Knowledge Discovery and Data Mining (KDD) is the premier conference on machine learning methods. Every year KDD hosts a data mining cup, where teams analyze a dataset and then are ranked against each other. The KDD Cup is a huge deal and the inspiration for the famous Netflix Prize and even Kaggle competitions.

The KDD Cup 2009 provided a dataset about customer relationship management. The contest supplied 230 facts about 50,000 credit card accounts. From these features, the goal was to predict account cancellation (called *churn*), the innate tendency to use new products and services (called *appetency*), and willingness to respond favorably to marketing pitches (called *upselling*).^{[2]} As with many score-based competitions, this contest concentrated on machine learning and deliberately abstracted or skipped over a number of important data science issues, such as cooperatively defining goals, requesting new measurements, collecting data, and quantifying classifier performance in terms of business goals. In fact, for this contest we don’t have names or definitions for any of the independent (or input) variables and no real definition of the dependent (or outcome) variables. We have the advantage that the data is already in a ready-to-model format (all input variables and the results arranged in single rows). But we don’t know the meaning of any variable (so we can’t merge in outside data sources), and we can’t use any method that treats time and repetition of events carefully (such as time series methods or survival analysis).

^{2} Data available from *http://mng.bz/RDJF*. We share the steps to prepare this data for modeling in R here: *https://github.com/WinVector/zmPDSwR/tree/master/KDD2009*.

To simulate the data science processes, we’ll assume that we can use any column we’re given to make predictions (that all of these columns are known prior to needing a prediction^{[3]}), the contest metric (AUC) is the correct one, and the AUC of the top contestant is a good Bayes rate estimate (telling us when to stop tuning).

^{3} Checking if a column is actually going to be available during prediction (and not some later function of the unknown output) is a critical step in data science projects.

**The worst possible modeling outcome**

The worst possible modeling outcome is not failing to find a good model. The worst possible modeling outcome is thinking you have a good model when you don’t. One of the easiest ways to accidentally build such a deficient model is to have an instrumental or independent variable that is in fact a subtle function of the outcome. Such variables can easily leak into your training data, especially when you have no knowledge of, or control over, how the variables were prepared. The point is this: such variables won’t actually be available in a real deployment, yet they often turn up in training data that was packaged by others.

**6.1.1. Getting started with KDD Cup 2009 data**

For our example, we’ll try to predict churn in the KDD dataset. The KDD contest was judged in terms of AUC (*area under the curve*, a measure of prediction quality discussed in *section 5.2.3*), so we’ll also use AUC as our measure of performance.^{[4]} The winning team achieved an AUC of 0.76 on churn, so we’ll treat that as our upper bound on possible performance. Our lower bound on performance is an AUC of 0.5, as this is the performance of a useless model.

^{4} Also, as is common for example problems, we have no project sponsor to discuss metrics with, so our choice of evaluation is a bit arbitrary.

This problem has a large number of variables, many of which have a large number of possible levels. We’re also using the AUC measure, which isn’t particularly resistant to overfitting (not having built-in model complexity or chance corrections). Because of this concern, we’ll split our data into three sets: training, calibration, and test. The intent of the three-way split is this: we’ll use the training set for most of our work, and we’ll never look at the test set (we’ll reserve it for our final report of model performance). The calibration set is used to simulate the unseen test set during modeling—we’ll look at performance on the calibration set to estimate if we’re overfitting. This three-way split procedure is recommended by many researchers. In this book, we emphasize a two-way training and test split and suggest that, generally, steps like calibration and cross-validation estimates be performed by repeatedly splitting the training portion of the data (allowing for more efficient estimation than a single split, and keeping the test data completely out of the modeling effort). For simplicity in this example, we’ll split the training portion of our data into training and calibration only a single time. Let’s start work as shown in the following listing, where we prepare the data for analysis and modeling.

**Listing 6.1. Preparing the KDD data for analysis**
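The preparation is along the following lines (a sketch, not the exact listing: the file names are assumptions based on the KDD2009 GitHub directory, and the seed, split fractions, and positive-class label `pos` are illustrative choices; the appetency and upselling labels load the same way as churn):

```r
# Sketch: load the contest data and churn labels, then make the
# train/calibration/test split described above.
d <- read.table('orange_small_train.data.gz',
                header=TRUE, sep='\t', na.strings=c('NA',''))
churn <- read.table('orange_small_train_churn.labels.txt',
                    header=FALSE, sep='\t')
d$churn <- churn$V1
set.seed(729375)                       # make the random split reproducible
d$rgroup <- runif(dim(d)[[1]])
dTrainAll <- subset(d, rgroup <= 0.9)  # ~90% for training/calibration
dTest <- subset(d, rgroup > 0.9)       # ~10% held out for the final test
outcome <- 'churn'
pos <- '1'                             # label of the positive (churn) class
vars <- setdiff(colnames(dTrainAll), c(outcome, 'rgroup'))
catVars <- vars[sapply(dTrainAll[, vars], class) %in% c('factor', 'character')]
numericVars <- vars[sapply(dTrainAll[, vars], class) %in% c('numeric', 'integer')]
# split the training portion a single time into training and calibration
useForCal <- rbinom(n=dim(dTrainAll)[[1]], size=1, prob=0.1) > 0
dCal <- subset(dTrainAll, useForCal)
dTrain <- subset(dTrainAll, !useForCal)
```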

We have also saved an R workspace with most of the data, functions, and results of this chapter in the GitHub repository; you can load it with the command load('KDD2009.Rdata'). We’re now ready to build some single-variable models. Business analysts almost always build single-variable models using categorical features, so we’ll start with these.

**Subsample to prototype quickly**

Often the data scientist will be so engrossed with the business problem, math, and data that they forget how much trial and error is needed. It’s often an excellent idea to first work on a small subset of your training data, so that it takes seconds to debug your code instead of minutes. Don’t work with expensive data sizes until you have to.

**6.2. Building single-variable models**

Single-variable models are simply models built using only one variable at a time. Single-variable models can be powerful tools, so it’s worth learning how to work well with them before jumping into general modeling (which almost always means multiple variable models). We’ll show how to build single-variable models from both categorical and numeric variables. By the end of this section, you should be able to build, evaluate, and cross-validate single-variable models with confidence.

**6.2.1. Using categorical features**

A single-variable model based on categorical features is easiest to describe as a table. For this task, business analysts use what’s called a *pivot table* (which promotes values or levels of a feature to be families of new columns) and statisticians use what’s called a *contingency table* (where each possibility is given a column name). In either case, the R command to produce a table is table(). To create a table comparing the levels of variable 218 against the labeled churn outcome, we run the table command shown in the following listing.

**Listing 6.2. Plotting churn grouped by variable 218 levels**
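The command is of the following form (a sketch; keeping NA as its own level via `useNA` matches the output discussed below):

```r
# Sketch: contingency table of variable 218 levels versus the churn outcome,
# with NA kept as its own column so missing values stay visible
table218 <- table(
   Var218 = dTrain[, 'Var218'],
   churn = dTrain[, outcome],
   useNA = 'ifany')
print(table218)
```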

From this, we see variable 218 takes on two values plus NA, and we see the joint distribution of these values against the churn outcome. At this point it’s easy to write down a single-variable model based on variable 218.

**Listing 6.3. Churn rates grouped by variable 218 codes**

```
> print(table218[,2]/(table218[,1]+table218[,2]))
      cJvF       UYBR       <NA>
0.05994389 0.08223821 0.26523297
```

This summary tells us that when variable 218 takes on a value of cJvF, around 6% of the customers churn; when it’s UYBR, 8% of the customers churn; and when it’s not recorded (NA), 27% of the customers churn. The utility of any variable level is a combination of how often the level occurs (rare levels aren’t very useful) and how extreme the distribution of the outcome is for records matching a given level. Variable 218 seems like a feature that’s easy to use and helpful with prediction. In real work, we’d want to research with our business partners why it has missing values and what’s the best thing to do when values are missing (this will depend on how the data was prepared). We also need to design a strategy for what to do if a new level not seen during training were to occur during model use. Since this is a contest problem with no available project partners, we’ll build a function that converts NA to a level (as it seems to be pretty informative) and also treats novel values as uninformative. Our function to convert a categorical variable into a single model prediction is shown in *listing 6.4*.

**Listing 6.4. Function to build single-variable models for categorical variables**
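A minimal sketch of such a function, consistent with how mkPredC() is called in the later listings (we assume a global pos holding the positive-class label; the smoothing constant is an illustrative choice):

```r
# Sketch: estimate P(positive | level) on training data, then look those
# rates up for application data.
mkPredC <- function(outCol, varCol, appCol) {
   # overall rate of the positive (churn) class
   pPos <- sum(outCol == pos) / length(outCol)
   # positive rate among training rows where the variable is NA
   naTab <- table(as.factor(outCol[is.na(varCol)]))
   pPosWna <- (naTab / sum(naTab))[pos]
   # smoothed positive rate for each observed level
   vTab <- table(as.factor(outCol), varCol)
   pPosWv <- (vTab[pos, ] + 1.0e-3 * pPos) / (colSums(vTab) + 1.0e-3)
   # look up the rate for each application datum
   pred <- pPosWv[appCol]
   pred[is.na(appCol)] <- pPosWna   # NA is treated as an informative level
   pred[is.na(pred)] <- pPos        # novel levels fall back to the base rate
   pred
}
```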

*Listing 6.4* may seem like a lot of work, but placing all of the steps in a function lets us apply the technique to many variables quickly. The dataset we’re working with has 38 categorical variables, many of which are almost always NA, and many of which have over 10,000 distinct levels. So we definitely want to automate working with these variables as we have. Our first automated step is to adjoin a prediction or forecast (in this case, the predicted probability of churning) for each categorical variable, as shown in the next listing.

**Listing 6.5. Applying single-categorical variable models to all of our datasets**

```
for(v in catVars) {
  pi <- paste('pred',v,sep='')
  dTrain[,pi] <- mkPredC(dTrain[,outcome],dTrain[,v],dTrain[,v])
  dCal[,pi] <- mkPredC(dTrain[,outcome],dTrain[,v],dCal[,v])
  dTest[,pi] <- mkPredC(dTrain[,outcome],dTrain[,v],dTest[,v])
}
```

Note that in all cases we train with the training data frame and then apply to all three data frames dTrain, dCal, and dTest. We’re using an extra calibration data frame (dCal) because we have so many categorical variables that have a very large number of levels and are subject to overfitting. We wish to have some chance of detecting this overfitting before moving on to the test data (which we’re using as our final check, so it’s data we mustn’t use during model construction and evaluation, or we may have an exaggerated estimate of our model quality). Once we have the predictions, we can find the categorical variables that have a good AUC both on the training data and on the calibration data not used during training. These are likely the more useful variables and are identified by the loop in the next listing.

**Listing 6.6. Scoring categorical variables by AUC**

```
library('ROCR')

> calcAUC <- function(predcol,outcol) {
  perf <- performance(prediction(predcol,outcol==pos),'auc')
  as.numeric(perf@y.values)
}

> for(v in catVars) {
  pi <- paste('pred',v,sep='')
  aucTrain <- calcAUC(dTrain[,pi],dTrain[,outcome])
  if(aucTrain>=0.8) {
     aucCal <- calcAUC(dCal[,pi],dCal[,outcome])
     print(sprintf("%s, trainAUC: %4.3f calibrationAUC: %4.3f",
        pi,aucTrain,aucCal))
  }
}
[1] "predVar200, trainAUC: 0.828 calibrationAUC: 0.527"
[1] "predVar202, trainAUC: 0.829 calibrationAUC: 0.522"
[1] "predVar214, trainAUC: 0.828 calibrationAUC: 0.527"
[1] "predVar217, trainAUC: 0.898 calibrationAUC: 0.553"
```

Note how, as expected, each variable’s training AUC is inflated compared to its calibration AUC. This is because many of these variables have thousands of levels. For example, length(unique(dTrain$Var217)) is 12,434, indicating that variable 217 has 12,434 levels. A good trick to work around this is to sort the variables by their AUC score on the calibration set (not seen during training), which is a better estimate of the variable’s true utility. In our case, the most promising variable is variable 206, which has both training and calibration AUCs of 0.59. The winning KDD entry, which was a model that combined evidence from multiple features, had a much larger AUC of 0.76.
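The sorting trick can be sketched as follows (a sketch reusing calcAUC() and the predVar columns created in listing 6.5; nothing here is new beyond the `sapply`/`sort` wrapper):

```r
# Sketch: score every single-variable categorical model on the calibration
# set and list the most promising variables first
predCols <- paste('pred', catVars, sep='')
calAUCs <- sapply(predCols,
                  function(pi) calcAUC(dCal[, pi], dCal[, outcome]))
print(head(sort(calAUCs, decreasing=TRUE)))
```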

**6.2.2. Using numeric features**

There are a number of ways to use a numeric feature to make predictions. A common method is to bin the numeric feature into a number of ranges and then use the range labels as a new categorical variable. R can do this quickly with its quantile() and cut() commands, as shown next.

**Listing 6.7. Scoring numeric variables by AUC**

```
> mkPredN <- function(outCol,varCol,appCol) {
  cuts <- unique(as.numeric(quantile(varCol,
     probs=seq(0, 1, 0.1),na.rm=T)))
  varC <- cut(varCol,cuts)
  appC <- cut(appCol,cuts)
  mkPredC(outCol,varC,appC)
}

> for(v in numericVars) {
  pi <- paste('pred',v,sep='')
  dTrain[,pi] <- mkPredN(dTrain[,outcome],dTrain[,v],dTrain[,v])
  dTest[,pi] <- mkPredN(dTrain[,outcome],dTrain[,v],dTest[,v])
  dCal[,pi] <- mkPredN(dTrain[,outcome],dTrain[,v],dCal[,v])
  aucTrain <- calcAUC(dTrain[,pi],dTrain[,outcome])
  if(aucTrain>=0.55) {
     aucCal <- calcAUC(dCal[,pi],dCal[,outcome])
     print(sprintf("%s, trainAUC: %4.3f calibrationAUC: %4.3f",
        pi,aucTrain,aucCal))
  }
}
[1] "predVar6, trainAUC: 0.557 calibrationAUC: 0.554"
[1] "predVar7, trainAUC: 0.555 calibrationAUC: 0.565"
[1] "predVar13, trainAUC: 0.568 calibrationAUC: 0.553"
[1] "predVar73, trainAUC: 0.608 calibrationAUC: 0.616"
[1] "predVar74, trainAUC: 0.574 calibrationAUC: 0.566"
[1] "predVar81, trainAUC: 0.558 calibrationAUC: 0.542"
[1] "predVar113, trainAUC: 0.557 calibrationAUC: 0.567"
[1] "predVar126, trainAUC: 0.635 calibrationAUC: 0.629"
[1] "predVar140, trainAUC: 0.561 calibrationAUC: 0.560"
[1] "predVar189, trainAUC: 0.574 calibrationAUC: 0.599"
```

Notice in this case the numeric variables behave similarly on the training and calibration data. This is because our prediction method converts numeric variables into categorical variables with around 10 well-distributed levels, so our training estimate tends to be good and not overfit. We could improve our numeric estimate by interpolating between quantiles. Other methods we could’ve used are kernel-based density estimation and parametric fitting. Both of these methods are usually available in the variable treatment steps of Naive Bayes classifiers.

A good way to visualize the predictive power of a numeric variable is the double density plot, where we plot on the same graph the variable score distribution for positive examples and variable score distribution of negative examples as two groups. *Figure 6.1* shows the performance of the single-variable model built from the numeric feature Var126.

**Figure 6.1. Performance of variable 126 on calibration data**

The code to produce *figure 6.1* is shown in the next listing.

**Listing 6.8. Plotting variable performance**

```
ggplot(data=dCal) +
   geom_density(aes(x=predVar126,color=as.factor(churn)))
```

What *figure 6.1* shows is the conditional distribution of predVar126 for churning accounts (the dashed-line density plot) and the distribution of predVar126 for non-churning accounts (the solid-line density plot). We can deduce that low values of predVar126 are rare for churning accounts and not as rare for non-churning accounts (the graph is read by comparing areas under the curves). This in turn (by Bayes’ law) lets us say that a low value of predVar126 is good evidence that an account will not churn.

**Dealing with missing values in numeric variables**

One of the best strategies we’ve seen for dealing with missing values in numeric variables is the following two-step process. First, for each numeric variable, introduce a new advisory variable that is 1 when the original variable had a missing value and 0 otherwise. Second, replace all missing values of the original variable with 0. You now have removed all of the missing values and have recorded enough details so that missing values aren’t confused with actual zero values.
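The two-step process can be sketched on a single numeric column as follows (a self-contained illustration; the column x and its values are made up):

```r
x <- c(1.5, NA, 3.0, NA, 0.0)          # a numeric variable with missing values
x_isNA  <- ifelse(is.na(x), 1, 0)      # step 1: advisory variable marking missingness
x_clean <- ifelse(is.na(x), 0, x)      # step 2: replace missing values with 0
# the pair (x_clean, x_isNA) now distinguishes true zeros from former NAs
```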

**6.2.3. Using cross-validation to estimate effects of overfitting**

We now have enough experience fitting the KDD dataset to try to estimate the degree of overfitting we’re seeing in our models. We can use a procedure called *cross-validation* to estimate the degree of overfit we have hidden in our models. Cross-validation applies in *all* modeling situations. This is the first opportunity we have to demonstrate it, so we’ll work through an example here.

In repeated cross-validation, a subset of the training data is used to build a model, and a complementary subset of the training data is used to score the model. We can implement a cross-validated estimate of the AUC of the single-variable model based on variable 217 with the code in the following listing.

**Listing 6.9. Running a repeated cross-validation experiment**
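The experiment is of roughly this form (a sketch; it reuses mkPredC() and calcAUC(), with dTrainAll being the training portion before the train/calibration split, and the holdout fraction an illustrative choice):

```r
var <- 'Var217'
aucs <- rep(0, 100)
for(rep in 1:length(aucs)) {
   # draw a random ~10% holdout from the training portion
   useForCalRep <- rbinom(n=dim(dTrainAll)[[1]], size=1, prob=0.1) > 0
   # fit on the remaining ~90% and score the held-out rows
   predRep <- mkPredC(dTrainAll[!useForCalRep, outcome],
                      dTrainAll[!useForCalRep, var],
                      dTrainAll[useForCalRep, var])
   aucs[rep] <- calcAUC(predRep, dTrainAll[useForCalRep, outcome])
}
mean(aucs)
sd(aucs)
```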

This shows that the 100-fold replicated estimate of the AUC has a mean of 0.556 and a standard deviation of 0.016. So our original *section 6.2* estimate of 0.553 as the AUC of this variable was very good. In some modeling circumstances, training set estimations are good enough (linear regression is often such an example). In many other circumstances, estimations from a single calibration set are good enough. And in extreme cases (such as fitting models with very many variables or level values), you’re well advised to use replicated cross-validation estimates of variable utilities and model fits. Automatic cross-validation is *extremely* useful in all modeling situations, so it’s critical you automate your modeling steps so you can perform cross-validation studies. We’re demonstrating cross-validation here, as single-variable models are among the simplest to work with.

**Aside: cross-validation in functional notation**

As a point of style, for(){} loops are considered an undesirable crutch in R. We used a for loop in our cross-validation example, as this is the style of programming that is likely to be most familiar to nonspecialists. The point is that for loops over-specify computation (they describe both what you want and the exact order of steps to achieve it). For loops tend to be less reusable and less composable than other computational methods. When you become proficient in R, you look to eliminate for loops from your code and use either vectorized or functional methods where appropriate. For example, the cross-validation we just demonstrated could be performed in a functional manner as shown in the following listing.

**Listing 6.10. Empirically cross-validating performance**

```
> fCross <- function() {
  useForCalRep <- rbinom(n=dim(dTrainAll)[[1]],size=1,prob=0.1)>0
  predRep <- mkPredC(dTrainAll[!useForCalRep,outcome],
     dTrainAll[!useForCalRep,var],
     dTrainAll[useForCalRep,var])
  calcAUC(predRep,dTrainAll[useForCalRep,outcome])
}
> aucs <- replicate(100,fCross())
```

What we’ve done is wrap our cross-validation work into a function instead of in a for-based code block. Advantages are that the function can be reused and run in parallel, and it’s shorter (as it avoids needless details about result storage and result indices). The function is then called 100 times using the replicate() method (replicate() is a convenience method from the powerful sapply() family).

Note that we must write replicate(100,fCross()), *not* the more natural replicate(100,fCross). This is because R is expecting an expression (a sequence that implies execution) as the second argument, and not the mere name of a function. The notation can be confusing, and the reason it works is that function arguments in R are *not* evaluated prior to being passed in to a function, but instead are evaluated inside the function.^{[5]} This is called *promise-based* argument evaluation and is powerful (it allows user-defined macros, lazy evaluation, placement of variable names on plots, user-defined control structures, and user-defined exceptions). It can also be complicated, so it’s best to think of R as having mostly *call-by-value semantics* (see *http://mng.bz/unf5*), where arguments are passed to functions as values evaluated prior to entering the function and alterations of these values aren’t seen outside of the function.
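A tiny self-contained demonstration of promise-based evaluation (our own example, not from the contest code):

```r
# The argument x is passed as an unevaluated promise; since the body never
# uses x, the stop() call inside the argument is never executed.
f <- function(x) { 10 }
res <- f(stop("this error is never raised"))
print(res)   # 10
```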

^{5} For just a taste of the complexity this introduces, try to read Thomas Lumley’s “Standard nonstandard evaluation rules”: *http://developer.r-project.org/nonstandard-eval.pdf*.

**6.3. Building models using many variables**

Models that combine the effects of many variables tend to be much more powerful than models that use only a single variable. In this section, you’ll learn how to build some of the most fundamental multiple-variable models: decision trees, nearest neighbor, and Naive Bayes.

**6.3.1. Variable selection**

A key part of building many-variable models is selecting what variables^{[6]} to use and how the variables are to be transformed or treated. We’ve already discussed variable treatment in *chapter 4*, so we’ll only discuss variable selection here (we’re assuming you’ve discussed with your project sponsors what variables are available for or even legal to use in your model).

^{6} We’ll call variables used to build the model variously *variables*, *independent variables*, *input variables*, *causal variables*, and so on to try and distinguish them from the item to be predicted (which we’ll call *outcome* or *dependent*).

*When* variables are available has a huge impact on model utility. For instance, a variable that’s coincident with (available near or even after) the time that the outcome occurs may make a very accurate model with little utility (as it can’t be used for long-range prediction). The analyst has to watch out for variables that are functions of or “contaminated by” the value to be predicted. Which variables will actually be available in production is something you’ll want to discuss with your project sponsor. And sometimes you may want to improve model utility (at a possible cost of accuracy) by removing variables from the project design. An acceptable prediction one day before an event can be much more useful than a more accurate prediction one hour before the event.

Each variable we use represents a chance of explaining more of the outcome variation (a chance of building a better model) but also represents a possible source of noise and overfitting. To control this effect, we often preselect which subset of variables we’ll use to fit. Variable selection can be an important defensive modeling step even for types of models that “don’t need it” (as seen with decision trees in *section 6.3.2*). *Listing 6.11* shows a hand-rolled variable selection loop where each variable is scored according to an AIC (Akaike information criterion) -inspired score, in which a variable is scored with a bonus proportional to the scaled log likelihood of the training data minus a penalty proportional to the complexity of the variable (which in this case is 2^entropy). The score is a bit ad hoc, but tends to work well in selecting variables. Notice we’re using performance on the calibration set (not the training set) to pick variables. Note that we don’t use the test set for calibration; to do so lessens the reliability of the test set for model quality confirmation.

**Listing 6.11. Basic variable selection**
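A sketch of such a selection loop follows (simplified: we show the deviance-improvement score against the base rate with a single threshold, and omit the entropy-based complexity penalty described above; logLikelihood and minStep are our illustrative names):

```r
# log likelihood of the calibration outcomes under a column of predictions
logLikelihood <- function(outCol, predCol) {
   sum(ifelse(outCol == pos, log(predCol), log(1 - predCol)))
}
# log likelihood of just predicting the overall churn rate
baseRateCheck <- logLikelihood(dCal[, outcome],
   sum(dCal[, outcome] == pos) / length(dCal[, outcome]))
minStep <- 5        # minimum deviance improvement to keep a variable
selVars <- c()
for(v in c(catVars, numericVars)) {
   pi <- paste('pred', v, sep='')
   # deviance improvement of the single-variable model over the base rate,
   # measured on the calibration set (not the training set)
   liCheck <- 2 * (logLikelihood(dCal[, outcome], dCal[, pi]) - baseRateCheck)
   if(liCheck > minStep) {
      print(sprintf("%s, calibrationScore: %g", pi, liCheck))
      selVars <- c(selVars, pi)
   }
}
```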

In our case, this picks 27 of the 212 possible variables. The categorical and numeric variables selected are shown in the following listing.

**Listing 6.12. Selected categorical and numeric variables**

```
## [1] "predVar194, calibrationScore: 5.25759"
## [1] "predVar201, calibrationScore: 5.25521"
## [1] "predVar204, calibrationScore: 5.37414"
## [1] "predVar205, calibrationScore: 24.2323"
## [1] "predVar206, calibrationScore: 34.4434"
## [1] "predVar210, calibrationScore: 10.6681"
## [1] "predVar212, calibrationScore: 6.23409"
## [1] "predVar218, calibrationScore: 13.2455"
## [1] "predVar221, calibrationScore: 12.4098"
## [1] "predVar225, calibrationScore: 22.9074"
## [1] "predVar226, calibrationScore: 6.68931"
## [1] "predVar228, calibrationScore: 15.9644"
## [1] "predVar229, calibrationScore: 24.4946"
## [1] "predVar6, calibrationScore: 11.2431"
## [1] "predVar7, calibrationScore: 16.685"
## [1] "predVar13, calibrationScore: 8.06318"
## [1] "predVar28, calibrationScore: 9.38643"
## [1] "predVar65, calibrationScore: 7.96938"
## [1] "predVar72, calibrationScore: 10.5353"
## [1] "predVar73, calibrationScore: 46.2524"
## [1] "predVar74, calibrationScore: 17.6324"
## [1] "predVar81, calibrationScore: 6.8741"
## [1] "predVar113, calibrationScore: 21.136"
## [1] "predVar126, calibrationScore: 72.9556"
## [1] "predVar140, calibrationScore: 14.1816"
## [1] "predVar144, calibrationScore: 13.9858"
## [1] "predVar189, calibrationScore: 40.3059"
```

We’ll show in *section 6.3.2* the performance of a multiple-variable model with and without using variable selection.

**6.3.2. Using decision trees**

Decision trees are a simple model type: they make a prediction that is piecewise constant. This is interesting because the null hypothesis that we’re trying to outperform is often a single constant for the whole dataset, so we can view a decision tree as a procedure to split the training data into pieces and use a simple memorized constant on each piece. Decision trees (especially a type called *classification and regression trees*, or *CART*) can be used to quickly predict either categorical or numeric outcomes. The best way to grasp the concept of decision trees is to think of them as machine-generated business rules.

**Fitting a decision tree model**

Building a decision tree involves proposing many possible *data cuts* and then choosing best cuts based on simultaneous competing criteria of predictive power, cross-validation strength, and interaction with other chosen cuts. One of the advantages of using a canned package for decision tree work is not having to worry about tree construction details. Let’s start by building a decision tree model for churn. The simplest way to call rpart() is to just give it a list of variables and see what happens (rpart(), unlike many R modeling techniques, has built-in code for dealing with missing values).

**Listing 6.13. Building a bad decision tree**

```
> library('rpart')
> fV <- paste(outcome,'>0 ~ ',
   paste(c(catVars,numericVars),collapse=' + '),sep='')
> tmodel <- rpart(fV,data=dTrain)
> print(calcAUC(predict(tmodel,newdata=dTrain),dTrain[,outcome]))
[1] 0.9241265
> print(calcAUC(predict(tmodel,newdata=dTest),dTest[,outcome]))
[1] 0.5266172
> print(calcAUC(predict(tmodel,newdata=dCal),dCal[,outcome]))
[1] 0.5126917
```

What we get is pretty much a disaster. The model looks way too good to believe on the training data (which it has merely memorized, negating its usefulness) and not as good as our best single-variable models on withheld calibration and test data. A couple of possible sources of the failure are that we have categorical variables with very many levels, and we have a lot more NAs/missing data than rpart()’s surrogate value strategy was designed for. What we can do to work around this is fit on our reprocessed variables, which hide the categorical levels (replacing them with numeric predictions), and remove NAs (treating them as just another level).

**Listing 6.14. Building another bad decision tree**

```
> tVars <- paste('pred',c(catVars,numericVars),sep='')
> fV2 <- paste(outcome,'>0 ~ ',paste(tVars,collapse=' + '),sep='')
> tmodel <- rpart(fV2,data=dTrain)
> print(calcAUC(predict(tmodel,newdata=dTrain),dTrain[,outcome]))
[1] 0.928669
> print(calcAUC(predict(tmodel,newdata=dTest),dTest[,outcome]))
[1] 0.5390648
> print(calcAUC(predict(tmodel,newdata=dCal),dCal[,outcome]))
[1] 0.5384152
```

This result is about the same (also bad). So our next suspicion is that the overfitting is because our model is too complicated. To control rpart() model complexity, we need to monkey a bit with the controls. We pass in an extra argument, rpart.control (use help('rpart') for some details on this control), that changes the decision tree selection strategy.

**Listing 6.15. Building yet another bad decision tree**

```
> tmodel <- rpart(fV2,data=dTrain,
   control=rpart.control(cp=0.001,minsplit=1000,
      minbucket=1000,maxdepth=5)
)
> print(calcAUC(predict(tmodel,newdata=dTrain),dTrain[,outcome]))
[1] 0.9421195
> print(calcAUC(predict(tmodel,newdata=dTest),dTest[,outcome]))
[1] 0.5794633
> print(calcAUC(predict(tmodel,newdata=dCal),dCal[,outcome]))
[1] 0.547967
```

This is a very small improvement. We can waste a lot of time trying variations of the rpart() controls. The best guess is that this dataset is unsuitable for decision trees and a method that deals better with overfitting issues is needed—such as random forests, which we’ll demonstrate in *chapter 9*. The best result we could get for this dataset using decision trees was from using our selected variables (instead of all transformed variables).

**Listing 6.16. Building a better decision tree**

```
f <- paste(outcome,'>0 ~ ',paste(selVars,collapse=' + '),sep='')
> tmodel <- rpart(f,data=dTrain,
   control=rpart.control(cp=0.001,minsplit=1000,
      minbucket=1000,maxdepth=5)
)
> print(calcAUC(predict(tmodel,newdata=dTrain),dTrain[,outcome]))
[1] 0.6906852
> print(calcAUC(predict(tmodel,newdata=dTest),dTest[,outcome]))
[1] 0.6843595
> print(calcAUC(predict(tmodel,newdata=dCal),dCal[,outcome]))
[1] 0.6669301
```

These AUCs aren’t great (they’re not near 1.0 or even particularly near the winning team’s 0.76), but they are significantly better than any of the AUCs we saw from single-variable models when checked on non-training data. So we’ve finally built a legitimate multiple-variable model.

To tune rpart we suggest, in addition to trying variable selection (which is an odd thing to combine with decision tree methods), following the rpart documentation in trying different settings of the method argument. But we quickly get better results with KNN and logistic regression, so it doesn’t make sense to spend too long trying to tune decision trees for this particular dataset.

**How decision tree models work**

At this point, we can look at the model and use it to explain how decision tree models work.

**Listing 6.17. Printing the decision tree**

```
> print(tmodel)
n= 40518

node), split, n, deviance, yval
      * denotes terminal node

 1) root 40518 2769.3550 0.07379436
   2) predVar126< 0.07366888 18188  726.4097 0.04167583
     4) predVar126< 0.04391312 8804  189.7251 0.02203544 *
     5) predVar126>=0.04391312 9384  530.1023 0.06010230
      10) predVar189< 0.08449448 8317  410.4571 0.05206204 *
      11) predVar189>=0.08449448 1067  114.9166 0.12277410 *
   3) predVar126>=0.07366888 22330 2008.9000 0.09995522
     6) predVar212< 0.07944508 8386  484.2499 0.06153112
      12) predVar73< 0.06813291 4084  167.5012 0.04285015 *
      13) predVar73>=0.06813291 4302  313.9705 0.07926546 *
     7) predVar212>=0.07944508 13944 1504.8230 0.12306370
      14) predVar218< 0.07134103 6728  580.7390 0.09542212
        28) predVar126< 0.1015407 3901  271.8426 0.07536529 *
        29) predVar126>=0.1015407 2827  305.1617 0.12309870
          58) predVar73< 0.07804522 1452  110.0826 0.08264463 *
          59) predVar73>=0.07804522 1375  190.1935 0.16581820 *
      15) predVar218>=0.07134103 7216  914.1502 0.14883590
        30) predVar74< 0.0797246 2579  239.3579 0.10352850 *
        31) predVar74>=0.0797246 4637  666.5538 0.17403490
          62) predVar189< 0.06775545 1031  102.9486 0.11251210 *
          63) predVar189>=0.06775545 3606  558.5871 0.19162510 *
```

Each row in *listing 6.17* that starts with #) is called a *node* of the decision tree. This decision tree has 21 nodes. Node 1 is always called the *root*. Each node other than the root node has a parent, and the parent of node *k* is node floor(k/2). The indentation also indicates how deep in the tree a node is. Each node other than the root is labeled with the condition that must be true to move from the parent to the node. For example, you move from node 1 to node 2 if predVar126 < 0.07366888 (and otherwise you move to node 3, which has the complementary condition). So to score a row of data, we navigate from the root of the decision tree by the node conditions until we reach a node with no children, which is called a *leaf node*. Leaf nodes are marked with stars. The remaining three numbers reported for each node are the number of training items that navigated to the node, the deviance of the set of training items that navigated to the node (a measure of how much uncertainty remains at a given decision tree node), and the fraction of items that were in the positive class at the node (which is the prediction for leaf nodes).
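The parent-numbering rule is easy to check against the listing (a tiny illustration; the helper name treeParent is ours):

```r
# Parent of node k under rpart's node numbering: node floor(k/2).
# Node 1 is the root; node k's children are nodes 2k and 2k+1.
treeParent <- function(k) { floor(k/2) }

treeParent(29)  # 14: node 29 split off from node 14 in listing 6.17
treeParent(63)  # 31
```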

We can get a graphical representation of much of this with the commands in the next listing that produce *figure 6.2*.

**Figure 6.2. Graphical representation of a decision tree**

**Listing 6.18. Plotting the decision tree**

    par(cex=0.7)
    plot(tmodel)
    text(tmodel)

**6.3.3. Using nearest neighbor methods**

A *k-nearest neighbor* (*KNN*) method scores an example by finding the *k* training examples nearest to the example and then taking the average of their outcomes as the score. The notion of nearness is basic Euclidean distance, so it can be useful to select nonduplicate variables, rescale variables, and orthogonalize variables.

One problem with KNN is the nature of its concept space. For example, if we were to run a 3-nearest neighbor analysis on our data, we have to understand that with three neighbors from the training data, we’ll always see either zero, one, two, or three examples of churn. So the estimated probability of churn is always going to be one of 0%, 33%, 66%, or 100%. This is not going to work on an event as rare as churn, which has a rate of around 7% in our training data. For events with unbalanced outcomes (that is, probabilities not near 50%), we suggest using a large *k* so KNN can express a useful range of probabilities. For a good *k*, we suggest trying something such that you have a good chance of seeing 10 positive examples in each neighborhood (allowing your model to express rates smaller than your baseline rate to some precision). In our case, that’s a *k* of around 10/0.07 ≈ 143. You’ll want to try a range of *k*, and we demonstrate a KNN run with k=200 in the following listing.

**Listing 6.19. Running k-nearest neighbors**
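The run can be sketched with the class package’s knn() (assuming dTrain, dCal, dTest, selVars, outcome, pos, and calcAUC as defined earlier in the chapter; knnPred is the scoring helper used in the listings that follow):

```r
library('class')

nK <- 200
knnTrain <- dTrain[,selVars]        # training features
knnCl <- dTrain[,outcome]==pos      # training outcomes as TRUE/FALSE

# Score a data frame of features: knn() reports the winning class and
# the proportion of the nK nearest neighbors that voted for it; we
# convert that into an estimated probability of the positive class.
knnPred <- function(df) {
  knnDecision <- knn(knnTrain, df, knnCl, k=nK, prob=TRUE)
  ifelse(knnDecision==TRUE,
     attributes(knnDecision)$prob,
     1-(attributes(knnDecision)$prob))
}

print(calcAUC(knnPred(dTrain[,selVars]),dTrain[,outcome]))
print(calcAUC(knnPred(dCal[,selVars]),dCal[,outcome]))
print(calcAUC(knnPred(dTest[,selVars]),dTest[,outcome]))
```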

This is our best result yet. What we’re looking for are the two distributions to be unimodal^{[}^{7}^{]} and, if not separated, at least not completely on top of each other. Notice how, under these criteria, the double density performance plot in *figure 6.3* is much better looking than *figure 6.1*.

^{7} Distributions that are multimodal are often evidence that there are significant effects we haven’t yet explained. Distributions that are unimodal or even look normal are consistent with the unexplained effects being simple noise.

**Figure 6.3. Performance of 200-nearest neighbors on calibration data**

The code to produce *figure 6.3* is shown in the next listing.

**Listing 6.20. Plotting 200-nearest neighbor performance**

    dCal$kpred <- knnPred(dCal[,selVars])
    ggplot(data=dCal) +
      geom_density(aes(x=kpred,
        color=as.factor(churn),linetype=as.factor(churn)))

This finally gives us a result good enough to bother plotting the ROC curve for. The code in the next listing produces *figure 6.4*.

**Figure 6.4. ROC of 200-nearest neighbors on calibration data**

**Listing 6.21. Plotting the receiver operating characteristic curve**

    plotROC <- function(predcol,outcol) {
      perf <- performance(prediction(predcol,outcol==pos),'tpr','fpr')
      pf <- data.frame(
        FalsePositiveRate=perf@x.values[[1]],
        TruePositiveRate=perf@y.values[[1]])
      ggplot() +
        geom_line(data=pf,aes(x=FalsePositiveRate,y=TruePositiveRate)) +
        geom_line(aes(x=c(0,1),y=c(0,1)))
    }
    print(plotROC(knnPred(dTest[,selVars]),dTest[,outcome]))

The ROC curve shows every possible classifier you can get by using different scoring thresholds on the same model. For example, you can achieve a high recall (high true positive rate, or TPR) at the expense of a high false positive rate (FPR) by selecting a threshold that moves you to the top right of the graph. Conversely, you can achieve high precision (high positive confirmation rate) at the expense of recall by selecting a threshold that moves you to the bottom left of the graph. Notice that score thresholds aren’t plotted, just the resulting FPRs and TPRs.
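To make the threshold picture concrete, here is a small self-contained helper (the name thresholdRates is ours) that computes the (FPR, TPR) point a single score threshold lands on:

```r
# For one score threshold, compute the point (FPR, TPR) that the
# resulting classifier occupies on the ROC curve.
thresholdRates <- function(pred, truth, threshold) {
  predPos <- pred >= threshold
  c(FPR = sum(predPos & !truth)/sum(!truth),  # false positive rate
    TPR = sum(predPos & truth)/sum(truth))    # true positive rate
}

# A permissive threshold moves toward the top right of the ROC plot.
thresholdRates(c(0.9,0.8,0.4,0.2), c(TRUE,FALSE,TRUE,FALSE), 0.3)
# FPR 0.5, TPR 1.0
```

Sweeping the threshold from 1 down to 0 traces out the entire ROC curve.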

KNN is expensive both in time and space. Sometimes we can get similar results with more efficient methods such as logistic regression (which we’ll explain in detail in *chapter 7*). To demonstrate that a fast method can be competitive with KNN, we’ll show the performance of logistic regression in the next listing.

**Listing 6.22. Plotting the performance of a logistic regression model**

    > gmodel <- glm(as.formula(f),data=dTrain,family=binomial(link='logit'))
    > print(calcAUC(predict(gmodel,newdata=dTrain),dTrain[,outcome]))
    [1] 0.7309537
    > print(calcAUC(predict(gmodel,newdata=dTest),dTest[,outcome]))
    [1] 0.7234645
    > print(calcAUC(predict(gmodel,newdata=dCal),dCal[,outcome]))
    [1] 0.7170824

**6.3.4. Using Naive Bayes**

Naive Bayes is an interesting method that memorizes how each training variable is related to outcome, and then makes predictions by multiplying together the effects of each variable. To demonstrate this, let’s use a scenario in which we’re trying to predict whether somebody is employed based on their level of education, their geographic region, and other variables. Naive Bayes begins by reversing that logic and asking this question: Given that you are employed, what is the probability that you have a high school education? From that data, we can then make our prediction regarding employment.

Let’s call a specific variable (x_1) taking on a specific value (X_1) a piece of *evidence:* ev_1. For example, suppose we define our evidence (ev_1) as the predicate education=="High School", which is true when the variable x_1 (education) takes on the value X_1 ("High School"). Let’s call the outcome y (taking on values T or True if the person is employed and F otherwise). Then the fraction of all positive examples where ev_1 is true is an approximation to the *conditional probability* of ev_1, given y==T. This is usually written as P(ev1|y==T). But what we want to estimate is the conditional probability of a subject being employed, given that they have a high school education: P(y==T|ev1). How do we get from P(ev1|y==T) (the quantities we know from our training data) to an estimate of P(y==T|ev1 ... evN) (what we want to predict)?

Bayes’ law tells us we can expand P(y==T|ev1) and P(y==F|ev1) like this:

    P(y==T|ev1) = P(ev1|y==T) × P(y==T) / P(ev1)
    P(y==F|ev1) = P(ev1|y==F) × P(y==F) / P(ev1)

The left-hand side is what we want to estimate; the right-hand side is built entirely from quantities that can be estimated from the statistics of the training data. For a single feature ev1, this buys us little, as we could estimate P(y==T|ev1) directly from the training data as easily as P(ev1|y==T). For multiple features (ev1 ... evN), this sort of expansion becomes useful. The *Naive Bayes assumption* lets us assume that all the pieces of evidence are conditionally independent of one another, given the outcome:

    P(ev1 & ... & evN | y==T) ≈ P(ev1|y==T) × P(ev2|y==T) × ... × P(evN|y==T)
    P(ev1 & ... & evN | y==F) ≈ P(ev1|y==F) × P(ev2|y==F) × ... × P(evN|y==F)

This gives us the following:

    P(y==T | ev1 & ... & evN) ≈ P(y==T) × P(ev1|y==T) × ... × P(evN|y==T) / P(ev1 & ... & evN)
    P(y==F | ev1 & ... & evN) ≈ P(y==F) × P(ev1|y==F) × ... × P(evN|y==F) / P(ev1 & ... & evN)

The numerator terms of the right sides of the final expressions can be calculated efficiently from the training data, while the left sides can’t. We don’t have a direct scheme for estimating the denominators in the Naive Bayes expression (these are called the *joint probability of the evidence*). However, we can still estimate P(y==T|evidence) and P(y==F|evidence), as we know by the law of total probability that we should have P(y==T|evidence) + P(y==F|evidence) = 1. So it’s enough to pick a denominator such that our estimates add up to 1.

For numerical reasons, it’s better to convert the products into sums, by taking the log of both sides. Since the denominator term is the same in both expressions, we can ignore it; we only want to determine which of the following expressions is greater:

    score(T | ev1 & ... & evN) = log(P(y==T)) + log(P(ev1|y==T)) + ... + log(P(evN|y==T))
    score(F | ev1 & ... & evN) = log(P(y==F)) + log(P(ev1|y==F)) + ... + log(P(evN|y==F))

It’s also a good idea to add a smoothing term so that you’re never taking the log of zero.

All of the single-variable models we’ve built up to now are estimates of the form model(e_i) ~ P(y==T|e_i), so by another appeal to Bayes’ law the proportions we need for the Naive Bayes calculation (the ratios of P(e_i|y==T) to P(e_i|y==F)) are identical to the ratios of model(e_i)/P(y==T) to (1-model(e_i))/P(y==F). So our single-variable models can be used directly to build an overall Naive Bayes model (without any need for additional record keeping). We show such an implementation in the following listing.

**Listing 6.23. Building, applying, and evaluating a Naive Bayes model**
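One way to implement this scheme is sketched below (the helper name nBayes and the smoothing constant are ours; pf stands for a matrix of single-variable model predictions, one column per selected variable, one row per example, as produced earlier in the chapter):

```r
# pf: matrix of single-variable predictions P(y==T|e_i);
# pPos: the overall positive (for example, churn) rate.
nBayes <- function(pPos, pf) {
  pNeg <- 1 - pPos
  smoothingEpsilon <- 1.0e-5               # keeps us off log(0)
  # log(prior) plus the sum of per-variable log-ratio adjustments
  scorePos <- log(pPos + smoothingEpsilon) +
     rowSums(log(pf/pPos + smoothingEpsilon))
  scoreNeg <- log(pNeg + smoothingEpsilon) +
     rowSums(log((1-pf)/pNeg + smoothingEpsilon))
  m <- pmax(scorePos, scoreNeg)            # stabilize before exp()
  expScorePos <- exp(scorePos - m)
  expScoreNeg <- exp(scoreNeg - m)
  # renormalize so P(y==T|evidence) + P(y==F|evidence) = 1
  expScorePos/(expScorePos + expScoreNeg)
}

# Toy check against a 7% base rate: a row whose two single-variable
# models both predict above 7% should score above 7%, and vice versa.
nBayes(0.07, matrix(c(0.20,0.03, 0.15,0.05), ncol=2))
```

The renormalization at the end is exactly the “pick a denominator so the estimates add up to 1” step described above.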

Intuitively, what we’ve done is built a new per-variable prediction column from each of our single-variable models. Each new column is the logarithm of the ratio of the single-variable model’s predicted churn rate over the overall churn rate. When the model predicts a rate near the overall churn rate, this ratio is near 1.0 and therefore the logarithm is near 0. Similarly, for high predicted churn rates, the prediction column is a positive number, and for low predicted churn rates the column prediction is negative.

Summing these signed columns is akin to taking a net-consensus vote across all of the columns’ variables. If all the evidence is conditionally independent given the outcome (this is the Naive Bayes assumption—and remember it’s only an assumption), then this is exactly the right thing to do. The amazing thing about the Naive Bayes classifier is that it can perform well even when the conditional independence assumption isn’t true.

**Smoothing**

The most important design parameter in Naive Bayes is how *smoothing* is handled. The idea of smoothing is an attempt to obey *Cromwell’s rule* that no probability estimate of 0 should ever be used in probabilistic reasoning. This is because if you’re combining probabilities by multiplication (the most common method of combining probability estimates), then once some term is 0, the entire estimate will be 0 *no matter what the values of the other terms are*. The most common form of smoothing is called *Laplace smoothing*, which counts k successes out of n trials as a success ratio of (k+1)/(n+1) and not as a ratio of k/n (defending against the k=0 case). Frequentist statisticians think of smoothing as a form of regularization and Bayesian statisticians think of smoothing in terms of priors.
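As a quick numeric illustration of the rule above (the helper name laplaceRate is ours):

```r
# Laplace-smoothed success rate: (k+1)/(n+1) instead of k/n,
# so a count of k=0 never yields an estimated rate of exactly 0.
laplaceRate <- function(k, n) { (k + 1) / (n + 1) }

laplaceRate(0, 10)   # 1/11 ≈ 0.091, not 0
laplaceRate(3, 10)   # 4/11 ≈ 0.364, close to the raw 3/10
```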

There are many discussions of Bayes’ law and Naive Bayes methods that cover the math in much more detail. One thing to remember is that Naive Bayes doesn’t perform any clever optimization, so it can be outperformed by methods like logistic regression and support vector machines (when you have enough training data). Also, variable selection is very important for Naive Bayes. Naive Bayes is particularly useful when you have a very large number of features that are rare and/or nearly independent.

**Document classification and Naive Bayes**

Naive Bayes is the workhorse method when classifying text documents (as done by email spam detectors). This is because the standard model for text documents (usually called *bag-of-words* or *bag-of-k-grams*) can have an extreme number of possible features. In the bag-of-k-grams model, we pick a small *k* (typically 2) and each possible consecutive sequence of *k* words is a possible feature. Each document is represented as a *bag*, which is a sparse vector indicating which k-grams are in the document. The number of possible features runs into the millions, but each document only has a non-zero value on a number of features proportional to *k* times the size of the document.
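A toy sketch of the bag-of-2-grams representation (the helper name bigrams is ours; real implementations also normalize punctuation and hash features):

```r
# Represent a document as its bag of 2-grams: each consecutive
# pair of words becomes one (sparse) feature.
bigrams <- function(doc) {
  ws <- strsplit(tolower(doc), '\\s+')[[1]]
  if (length(ws) < 2) return(character(0))
  paste(ws[-length(ws)], ws[-1])
}

bigrams("free money click here")
# "free money" "money click" "click here"
```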

Of course we can also call a prepackaged Naive Bayes implementation (that includes its own variable treatments), as shown in the following listing.

**Listing 6.24. Using a Naive Bayes package**

    library('e1071')
    lVars <- c(catVars,numericVars)
    ff <- paste('as.factor(',outcome,'>0) ~ ',
       paste(lVars,collapse=' + '),sep='')
    nbmodel <- naiveBayes(as.formula(ff),data=dTrain)
    dTrain$nbpred <- predict(nbmodel,newdata=dTrain,type='raw')[,'TRUE']
    dCal$nbpred <- predict(nbmodel,newdata=dCal,type='raw')[,'TRUE']
    dTest$nbpred <- predict(nbmodel,newdata=dTest,type='raw')[,'TRUE']
    calcAUC(dTrain$nbpred,dTrain[,outcome])
    ## [1] 0.4643591
    calcAUC(dCal$nbpred,dCal[,outcome])
    ## [1] 0.5544484
    calcAUC(dTest$nbpred,dTest[,outcome])
    ## [1] 0.5679519

The e1071 code is performing a bit below our expectations on raw data. We *do* see superior performance from e1071 if we call it again with our processed and selected variables. This emphasizes the advantage of combining by-hand variable processing with pre-made machine learning libraries.

**6.4. Summary**

The single-variable and multiple-variable memorization style models in this section are always worth trying first. This is especially true if most of your variables are categorical variables, as memorization is a good idea in this case. The techniques of this chapter are also a good repeat example of variable treatment and variable selection.

We have, at a bit of a stretch, called all of the modeling techniques of this chapter *memorization methods*. The reason is that, having worked an example with all of these models in one place, you now have enough experience to see the common memorization trait they share: their predictions are all sums of summaries of the original training data.

The models of this chapter are conceptualized as follows:

· Single-variable models can be thought of as being simple memorizations or summaries of the training data. This is especially true for categorical variables where the model is essentially a contingency table or pivot table, where for every level of the variable we record the distribution of training outcomes (see *section 6.2.1*). Some sophisticated ideas (like smoothing, regularization, or shrinkage) may be required to avoid overfitting and to build good single-variable models. But in the end, single-variable models essentially organize the training data into a number of subsets indexed by the predictive variable and then store a summary of the distribution of outcome as their future prediction. These models are atoms or sub-assemblies that we sum in different ways to get the rest of the models of this chapter.

· Decision tree model decisions are also sums of summaries over subsets of the training data. For each scoring example, the model makes a prediction by choosing the summary of all training data that was placed in the same leaf node of the decision tree as the current example to be scored. There’s some cleverness in the construction of the decision tree itself, but once we have the tree, it’s enough to store a single data summary per tree leaf.

· K-nearest neighbor predictions are based on summaries of the *k* pieces of training data that are closest to the example to be scored. KNN models usually store all of their original training data instead of an efficient summary, so they truly do memorize the training data.

· Naive Bayes models partially memorize training data through intermediate features. Roughly speaking, Naive Bayes models form their decision by building a large collection of independent single-variable models.^{[}^{8}^{]} The Naive Bayes prediction for a given example is just the product of all the applicable single-variable model adjustments (or, isomorphically, the sum of logarithms of the single-variable contributions). Note that Naive Bayes models are constructed without any truly clever functional forms or optimization steps. This is why we stretch terms a bit and call them memorization: their predictions are just sums of appropriate summaries of the original training data.

^{8} As you saw in *section 6.3.4*, these are slightly modified single-variable models, since they model feature-driven change in outcome distribution, or in Bayesian terms “have the priors pulled out.”

For all their fascinating features, at some point you’ll have needs that push you away from memorization methods. For some problems, you’ll want models that capture more of the functional or additive structure of relationships. In particular, you’ll want to try regression for value prediction and logistic regression for category prediction, as we demonstrate in *chapter 7*.

**Key takeaways**

· Always try single-variable models before trying more complicated techniques.

· Single-variable modeling techniques give you a useful start on variable selection.

· Always compare your model performance to the performance of your best single-variable model.

· Consider decision trees, nearest neighbor, and naive Bayes models as basic data memorization techniques and, if appropriate, try them early in your projects.