﻿ ﻿Linear and logistic regression - Modeling methods - Practical Data Science with R (2014)

# Practical Data Science with R (2014)

## Part 2. Modeling methods

### Chapter 7. Linear and logistic regression

This chapter covers

· Using linear regression to predict quantities

· Using logistic regression to predict probabilities or categories

· Extracting relations and advice from functional models

· Interpreting the diagnostics from R’s lm() call

· Interpreting the diagnostics from R’s glm() call

In the last chapter, we worked through using memorization methods for prediction. In this chapter, we’ll talk about a different class of methods for both scoring and classification: functional methods. These are methods that learn a model that is a continuous function of its inputs (versus being a mere lookup table). This class of methods is especially useful when you don’t just want to predict an outcome, but you also want to know the relationship between the input variables and the outcome. This knowledge can prove useful because this relationship can often be used as advice on how to get the outcome that you want.

In this chapter, we’ll show how to use linear regression to predict customer income and logistic regression to predict the probability that a newborn baby will need extra medical attention. These are two of the most common functional methods (there are many others, including generalized additive models, neural nets, and support vector machines). We’ll also walk through the diagnostics that R produces when you fit a linear or logistic model.

7.1. Using linear regression

Linear regression is the bread and butter prediction method for statisticians and data scientists. If you’re trying to predict a numerical quantity like profit, cost, or sales volume, you should always try linear regression first. If it works well, you’re done; if it fails, the detailed diagnostics produced give you a good clue as to what methods you should try next.

In this section, we’ll use a real-world example (predicting personal income) to work through all of the steps of producing and using a linear regression model.

Before we get to the main example, let’s take a quick overview of the method.

7.1.1. Understanding linear regression

Linear regression models the expected value of a numeric quantity (called the dependent or response variable) in terms of numeric and categorical inputs (called the independent or explanatory variables). For example, suppose we’re trying to predict how many pounds a person on a diet and exercise plan will lose in a month. We’ll base that prediction on other facts about that person, like their average daily caloric intake over that month and how many hours a day they exercised. In other words, for every person i, we want to predict pounds.lost[i] based ondaily.cals[i] and daily.exercise[i]. Linear regression assumes that pounds.lost[i] is a linear combination of daily.cals[i] and daily.exercise[i]:

pounds.lost[i] = b.cals * daily.cals[i] + b.exercise * daily.exercise[i]

The goal is to find the values of b.cals and b.exercise so that the linear combination of daily.cals[i] and daily.exercise[i] comes very close to pounds.lost[i] for all persons i in the training data.

Let’s put this in more general terms. Suppose that y[i] is the numeric quantity we want to predict, and x[i,] is a row of inputs that corresponds to output y[i]. Linear regression finds a fit function f(x) such that

y[i] ~ f(x[i,]) = b[1] x[i,1] + ... b[n] x[i,n]

We want numbers b[1],...,b[n] (called the coefficients or betas) such that f(x[i,]) is as near as possible to y[i] for all (x[i,],y[i]) pairs in our training data. R supplies a one-line command to find these coefficients: lm().

In the idealized theoretic situation, linear regression is used to fit y[i]when the y[i] are themselves given by

y[i] = b[1] x[i,1] + b[2] x[i,2] + ... b[n] x[i,n] + e[i]

In particular, this means that y is linear in the values of x: a change in the value of x[i,m] by one unit (while holding all the other x[i,k]s constant) will change the value of y[i] by the amount b[m] always, no matter what the starting value of x[i,m] was. This is easier to see in one dimension. If y = 3 + 2*x, and if we increase x by 1, then y will always increase by 2, no matter what the starting value of x is. This wouldn’t be true for, say, y = 3 + 2*(x^2).

The last term, e[i], represents what are called unsystematic errors, or noise. Unsystematic errors average to 0 (so they don’t represent a net upward or net downward bias) and are uncorrelated with x[i,] and y[i].

The technical meaning of regression

Regression means the unbiased prediction of the conditional expected value. A prediction is a function f() such that f(x[i,]) is near the unknown ideal quantity E[y[i]|x[i,]] (where E[|] is the conditional expected value operator averaging over all possible y[j] where x[j,] is sufficiently similar to x[i,]). Unbiased predictors have E[f(x[i,])-y[i]]=0 under an appropriate expectation operator E[] (note we’re talking about properties of the model now, not properties of errors in the data). Regression isn’t so much a prediction of an individual y[i] but a forecast of E[y[i]|x[i,]] (the conditional expectation). For a good regression model, we expect f(x[i,]) to be a good estimate of E[y[i]|x[i,]] and to have low conditional variance (y[i] tending to be near E[y[i]|x[i,]]). When both of these conditions are met, thenf(x[i,]) is in general a good estimate for y[i].

Under these assumptions, linear regression is absolutely relentless in finding the best coefficients. If there’s some advantageous combination or cancellation of features, it’ll find it. One thing that linear regression doesn’t do is reshape variables to be linear. Oddly enough, linear regression often does an excellent job, even when the actual relation is not in fact linear.

When working with linear regression, you’ll vacillate between “Adding is too simple to work,” and “How is it even possible to estimate the coefficients?” This is natural and comes from the fact that the method is both simple and powerful. Our friend Philip Apps sums it up: “You have to get up pretty early in the morning to beat linear regression.”

As a toy example, consider trying to fit the squares of the first 10 integers using only a linear function plus the constant 1. We’re asking for coefficients b[0] and b[1] such that

x[i]^2 nearly equals b[0] + b[1] x[i]

This is clearly not a fair thing to ask, but linear regression still does a great job. It picks the following fit:

x[i]^2 nearly equals -22 + 11 x[i]

As the figure 7.1 shows, this is a good fit in the region of values we trained on.

Figure 7.1. Fit versus actuals for y=x2

The example in figure 7.1 is typical of how linear regression is “used in the field”—we’re using a linear model to predict something that is itself not linear. Be aware that this is a minor sin: in particular, note that the errors between the model’s predictions and the true y are systematic. This is bad as the model underpredicts for specific ranges of x and overpredicts for others.

Next we’ll work through an example of how to apply linear regression on more interesting real data. Our example task will be to predict personal income from other demographic variables such as age and education from 2011 US Census PUMS data. In addition to predicting income, we also have a secondary goal: to determine the effect of a bachelor’s degree on income, relative to having no degree at all (the reference level “no high school diploma”). In section 2.2.3, we prepared a small sample of PUMS data which we’ll use here. As a reminder, the data preparation steps included

· Restricting the data to full-time employees between 20 and 50 years of age, with an income between \$1,000 and \$250,000.

· Dividing the data into a training set, dtrain, and a test set, dtest.

Representativeness of data

For our examples, we’re deliberately ignoring the 80 PWGTP* and WGTP* columns supplied with the Census data. The PUMS data is a sample of individual households that are representative of the US population if each person and household is counted proportionally to one of the PWGTP*and WGTP*columns. There are 80 columns so the researcher can evaluate the impact of sampling noise on their work (using lm()’s weights argument). To keep our examples simple, we’ll model the data as-is and not worry about how this sample data is related to the larger (unreported) US population. However, it’s critical when working with datasets like PUMS to become familiar with the steps taken to produce the data, and we would certainly want to use the PWGTP* weights before making any claims about statistics of the actual US population.

We can continue the example by loading psub.RData (which you can copy from https://github.com/WinVector/zmPDSwR/raw/master/PUMS/psub.RData) and performing the steps in the following listing (which we’ll explain shortly).

dtrain <- subset(psub,ORIGRANDGROUP >= 500)

dtest <- subset(psub,ORIGRANDGROUP < 500)

model <- lm(log(PINCP,base=10) ~ AGEP + SEX + COW + SCHL,data=dtrain)

dtest\$predLogPINCP <- predict(model,newdata=dtest)

dtrain\$predLogPINCP <- predict(model,newdata=dtrain)

Each row of PUMS data represents a single anonymized person or household. Personal data recorded includes occupation, level of education, personal income, and many other demographics variables.

For the analysis in this section, we’ll consider the input variables age (AGEP), sex (SEX), class of worker (COW), and level of education (SCHL). The output variable is personal income (PINCP). We’ll also set the reference level, or “default” sex to M (male); the reference level of class of worker to Employee of a private for-profit; and the reference level of education level to no high school diploma. We’ll discuss reference levels later in this chapter.

Now on to the model building.

7.1.2. Building a linear regression model

The first step in either prediction or finding relations (advice) is to build the linear regression model. The command to build the linear regression model in R is lm(). The most important argument to lm() is a formula with ~ used in place of an equals sign. The formula specifies what column of the data frame is the quantity to be predicted, and what columns are to be used to make the predictions. Statisticians call the quantity to be predicted the dependent variable and the variables/columns used to make the prediction the independent variables. We find it is easier to call the quantity to be predicted the y and the variables used to make the predictions the xs. Our formula is this: log(PINCP,base=10) ~ AGEP + SEX + COW + SCHL, which is read “Predict the log base 10 of income as a function of age, sex, employment class, and education.”[1] The overall command is demonstrated in figure 7.2.

1 Recall from the discussion of the lognormal distribution in section 4.1.2 that it’s often useful to log transform monetary quantities.

Figure 7.2. Building a linear model using the lm() command

The command in figure 7.2 builds the linear regression model and stores the results in the new object called model. This model is able both to make predictions and to extract important advice from the data.

R stores training data in the model

R holds a copy of the training data in its model to supply the residual information seen in summary(model). Holding a copy of the data this way is not strictly necessary and can needlessly run you out of memory. You can mitigate this problem somewhat by setting the parameters model = F, x = F, y = F, qr = F in the lm() call. If you’re running low on memory (or swapping), you can dispose of R objects like model using the rm() command. In this case, you’d dispose of the model by running rm("model").

7.1.3. Making predictions

Once you’ve called lm() to build the model, your first goal is to predict income. This is easy to do in R. To predict, you pass data into the predict() method. Figure 7.3 demonstrates this using both the test and training data frames dtest and dtrain.

Figure 7.3. Making predictions with a linear regression model

The data frame columns dtest\$predLogPINCP and dtrain\$predLogPINCP now store the predictions for the test and training sets, respectively. We have now both produced and applied a linear regression model.

Characterizing prediction quality

Before sharing predictions, you want to inspect both the predictions and model for quality. We recommend plotting the actual y you’re trying to predict as if it were a function of your prediction. In this case, plot log(PINCP,base=10) as if it were a function of predLogPINCP. If the predictions are very good, then the plot will be dots arranged near the line y=x, which we call the line of perfect prediction (the phrase is not standard terminology; we use it to make talking about the graph easier). The commands to produce this for figure 7.4 are shown in the next listing.

ggplot(data=dtest,aes(x=predLogPINCP,y=log(PINCP,base=10))) +

geom_point(alpha=0.2,color="black") +

geom_smooth(aes(x=predLogPINCP,

y=log(PINCP,base=10)),color="black") +

geom_line(aes(x=log(PINCP,base=10),

y=log(PINCP,base=10)),color="blue",linetype=2) +

scale_x_continuous(limits=c(4,5)) +

scale_y_continuous(limits=c(3.5,5.5))

Statisticians prefer the residual plot shown in figure 7.5, where the predictions errors predLogPINCP-log(PINCP,base=10) are plotted as a function of predLogPINCP. In this case, the line of perfect prediction is the line y=0. Notice the points are scattered widely from this line (a possible sign of low-quality fit). The residual plot in figure 7.5 is prepared with the R commands in the next listing.

Figure 7.5. Plot of residual error as a function of prediction

Listing 7.3. Plotting residuals income as a function of predicted log income

ggplot(data=dtest,aes(x=predLogPINCP,

y=predLogPINCP-log(PINCP,base=10))) +

geom_point(alpha=0.2,color="black") +

geom_smooth(aes(x=predLogPINCP,

y=predLogPINCP-log(PINCP,base=10)),

color="black")

When you look at the true-versus-fitted or residual graphs, you’re looking for some specific things that we’ll discuss next.

On average, are the predictions correct?

In other words, does the smoothing curve lie more or less along the line of perfect prediction? Ideally, the points will all lie very close to that line, but you may instead get a wider cloud of points (as we do in figures 7.4 and 7.5) if your input variables don’t explain the output too closely. But if the smoothing curve lies along the line of perfect prediction, then the model predicts correctly on average: it underpredicts about as much as it overpredicts.

Why are the predictions, not the true values, on the x-axis?

The two graphs (plotting residuals as a function of true values or as a function of predicted values) answer different questions. Statisticians tend to prefer the graph as shown in figure 7.5, with predictions on the x-axis.

A residual graph with predictions on the x-axis gives you a sense of when the model may be under- or overpredicting, based on the model’s output. A residual graph with the true outcome on the x-axis instead gives you a sense of where the model under-or overpredicts based on the actual outcome.

Are there systematic errors?

If the smoothing curve veers off the line of perfect prediction too much, this is a sign of systematic under- or overprediction in certain ranges: the error is correlated with y[i]. Many of the theoretical claims about linear regression depend on the observation error being uncorrelated withy[i]. Unstructured observation errors (the good case) are called homoscedastic, and structured observation errors are called heteroscedastic and introduce prediction bias. For example, the toy fit in section 7.1 is heteroscedastic and is unsafe to use for values outside of its training range.

In addition to inspecting graphs, you should produce quantitative summaries of the quality of the predictions and the residuals. One standard measure of quality of a prediction is called R-squared. You can compute the R-squared between the prediction and the actual y with the R commands in the following listing.

Listing 7.4. Computing R-squared

rsq <- function(y,f) { 1 - sum((y-f)^2)/sum((y-mean(y))^2) }

rsq(log(dtrain\$PINCP,base=10),predict(model,newdata=dtrain))

rsq(log(dtest\$PINCP,base=10),predict(model,newdata=dtest))

You want R-squared to be fairly large (1.0 is the largest you can achieve) and R-squareds that are similar on test and training. A significantly lower R-squared on test data is a symptom of an overfit model that looks good in training and won’t work in production. In our case, our R-squareds were 0.338 on training and 0.261 on test. We’d like to see R-squares higher than this (say, 0.7–1.0). So the model is of low quality, but not substantially overfit.

R-squared can be thought of as what fraction of the y variation is explained by the model. For well-fit models, R-squared is also equal to the square of the correlation between the predicted values and actual training values.

R-squared can be overoptimistic

In general, R-squared on training data will be higher for models with more input parameters, independently of whether the additional variables actually improve the model or not. That’s why many people prefer the adjusted R-squared (which we’ll discuss later in this chapter).

Also, R-squared is related to correlation, and the correlation can be artificially inflated if the model correctly predicts a few outliers (because the increased data range makes the overall data cloud appear “tighter” against the line of perfect prediction). Here’s a toy example. Let y <- c(1,2,3,4,5,9,10) and pred <- c(0.5,0.5,0.5, 0.5,0.5,9,10). This corresponds to a model that’s completely uncorrelated to the true outcome for the first five points, and perfectly predicts the last two points, which are somewhat far away from the first five. You can check for yourself that this obviously poor model has a correlation cor(y,pred) of about 0.926, with a corresponding R-squared of 0.858. So it’s an excellent idea to look at the true-versus-fitted graph in addition to checking R-squared.

Another good measure to consider is root mean square error (RMSE).

Listing 7.5. Calculating root mean square error

rmse <- function(y, f) { sqrt(mean( (y-f)^2 )) }

rmse(log(dtrain\$PINCP,base=10),predict(model,newdata=dtrain))

rmse(log(dtest\$PINCP,base=10),predict(model,newdata=dtest))

You can think of the RMSE as a measure of the width of the data cloud around the line of perfect prediction. We’d like RMSE to be small, and one way to achieve this is to introduce more useful explanatory variables.

7.1.4. Finding relations and extracting advice

Recall that our other goal, beyond predicting income, is to find the value of having a bachelor’s degree. We’ll show how this value, and other relations in the data, can be read directly off a linear regression model.

All of the information in a linear regression model is stored in a block of numbers called the coefficients. The coefficients are available through the coefficients(model) command. The coefficients of our income model are shown in figure 7.6.

Figure 7.6. The model coefficients

Reported coefficients

Our original modeling variables were only AGEP, SEX, COW, and SCHL; yet the model reports many more coefficients than these four. We’ll explain what all the reported coefficients are.

In figure 7.6, there are eight coefficients that start with SCHL. The original variable SCHL took on these eight string values plus one more not shown: no high school diploma. Each of these possible strings is called a level, and SCHL itself is called a categorical variable or a factor variable. The level that isn’t shown is called the reference level; the coefficients of the other levels are measured with respect to the reference level.

For example, in SCHLBachelor's degree we find the coefficient 0.39, which is read as “The model gives a 0.39 bonus to log income for having a bachelor’s degree, relative to not having a high school degree.” This means that the income ratio between someone with a bachelor’s degree and the equivalent person (same sex, age, and class of work) without a high school degree is about 10^0.39, or 2.45 times higher.

And under SCHLRegular high school diploma we find the coefficient 0.10. This is read as “The model believes that having a bachelor’s degree tends to add 0.39–0.10 units to the predicted log income (relative to a high school degree).” The modeled relation between the bachelor’s degree holder’s expected income and high school graduate’s (all other variables being equal) is 10^(0.39-0.10), or about 1.8 times greater. The advice: college is worth it if you can find a job (remember that we limited our analysis to the fully employed, so this is assuming you can find a job).

SEX and COW are also discrete variables, and the coefficients that correspond to the different levels of SEX and COW can be interpreted in a similar manner. AGEP is a continuous variable with coefficient 0.0117. You can interpret this as a one-year increase in age, adding a 0.0117 bonus to log income; in other words, an increase in age of one year corresponds to an increase of income of 10^0.0117, or a factor of 1.027—about a 2.7% increase in income (all other variables being equal).

The coefficient (Intercept) corresponds to a variable that always has a value of 1, which is implicitly added to linear regression models unless you use the special 0+ notation in the formula during the call to lm(). This coefficient is a rough center for the model predictions.

The preceding interpretations of the coefficients assume that the model has provided good estimates of the coefficients. We’ll see how to check that in the next section.

Indicator variables

Most modeling methods handle a string-valued (categorical) variable with n possible levels by converting it to n (or n-1) binary variables, or indicator variables. R has commands to explicitly control the conversion of string-valued variables into well-behaved indicators: as.factor()creates categorical variables from string variables; relevel() allows the user to specify the reference level.

But beware of variables with a very large number of levels, like ZIP codes. The runtime of linear (and logistic) regression increases as roughly the cube of the number of coefficients. Too many levels (or too many variables in general) will bog the algorithm down and require much more data for reliable inference.[2]

2 To see a trick for dealing with factors with very many levels, see http://mng.bz/ytFY.

7.1.5. Reading the model summary and characterizing coefficient quality

In section 7.1.3, we checked whether our income predictions were to be trusted. We’ll now show how to check whether model coefficients are reliable. This is especially urgent, as we’ve been discussing showing coefficients’ relations to others as advice.

Most of what we need to know is already in the model summary, which is produced using the summary() command: summary(model). This produces the output shown in figure 7.7, which looks intimidating, but contains a lot of useful information and diagnostics. You’re likely to be asked about elements of figure 7.7 when presenting results, so we’ll demonstrate how all of these fields are derived and what the fields mean.

Figure 7.7. Model summary

We’ll first break down the summary() into pieces.

The original model call

The first part of the summary() is how the lm() model was constructed:

Call:

lm(formula = log(PINCP, base = 10) ~ AGEP + SEX + COW + SCHL,

data = dtrain)

This is a good place to double-check whether we used the correct data frame, performed our intended transformations, and used the right variables. For example, we can double-check whether we used the data frame dtrain and not the data frame dtest.

The residuals summary

The next part of the summary() is the residuals summary:

Residuals:

Min 1Q Median 3Q Max

-1.29220 -0.14153 0.02458 0.17632 0.62532

In linear regression, the residuals are everything. Most of what we want to know about the quality of our model fit is in the residuals. The residuals are our errors in prediction: log(dtrain\$PINCP,base=10) - predict(model,newdata=dtrain). We can find useful summaries of the residuals for both the training and test sets, as shown in the following listing.

Listing 7.6. Summarizing residuals

> summary(log(dtrain\$PINCP,base=10) - predict(model,newdata=dtrain))

Min. 1st Qu. Median Mean 3rd Qu. Max.

-1.29200 -0.14150 0.02458 0.00000 0.17630 0.62530

> summary(log(dtest\$PINCP,base=10) - predict(model,newdata=dtest))

Min. 1st Qu. Median Mean 3rd Qu. Max.

-1.494000 -0.165300 0.018920 -0.004637 0.175500 0.868100

In linear regression, the coefficients are chosen to minimize the sum of squares of the residuals. This is the why the method is also often called the least squares method. So for good models, we expect the residuals to be small.

In the residual summary, you’re given the Min. and Max., which are the smallest and largest residuals seen. You’re also given three quantiles of the residuals: 1st. Qu., Median, and 3rd Qu. An r-quantile is a number r such that an r-fraction of the residuals is less than x and a (1-r)-fraction of residuals is greater than x. The 1st. Qu., Median, and 3rd Qu. quantiles’ values are the values of the 0.25, 0.5, and 0.75 quantiles.

What you hope to see in the residual summary is the median near 0 and symmetry in that 1st. Qu. is near -3rd Qu. (with neither too large). The 1st. Qu. and 3rd Qu. quantiles are interesting because exactly half of the training data has a residual in this range. If you drew a random training example, its residual would be in this range exactly half the time. So you really expect to commonly see prediction errors of these magnitudes. If these errors are too big for your application, you don’t have a usable model.

The coefficients table

The next part of the summary(model) is the coefficients table, as shown in figure 7.8. A matrix form of this table can be retrieved as summary(model)\$coefficients.

Figure 7.8. Model summary coefficient columns

Each model coefficient forms a row of the summary coefficients table. The columns report the estimated coefficient, the uncertainty of the estimate, how large the coefficient is relative to the uncertainty, and how likely such a ratio would be due to mere chance. Figure 7.8 gives the names and interpretations of the columns.

We set out to study income and the impact on income of getting a bachelor’s degree. But we must look at all of the coefficients to check for interfering effects.

For example, the coefficient of -0.093 for SEXF means that our model learned a penalty of -0.093 to log(PINCP,base=10) for being female. Females are modeled as earning 1-10^-0.093 relative to males, or 19% less, all other model parameters being equal. Note we said “all other model parameters being equal” not “all other things being equal.” That’s because we’re not modeling the number of years in the workforce (which age may not be a reliable proxy for) or occupation/industry type (which has a big impact on income). This model is not, with the features it was given, capable of testing if, on average, a female in the same job with the same number of years of experience is paid less.

Statistics as an attempt to correct bad experimental design

The absolute best experiment to test if there’s a sex-driven difference in income distribution would be to compare incomes of individuals who were identical in all possible variables (age, education, years in industry, performance reviews, race, region, and so on) but differ only in sex. We’re unlikely to have access to such data, so we’d settle for a good experimental design: a population where there’s no correlation between any other feature and sex. Random selection can help in experimental design, but it’s not a complete panacea. Barring a good experimental design, the usual pragmatic strategy is this: introduce extra variables to represent effects that may have been interfering with the effect we were trying to study. Thus a study of the effect of sex on income may include other variables like education and age to try to disentangle the competing effects.

The p-value (also called the significance) is one of the most important diagnostic columns in the coefficient summary. The p-value estimates the probability of seeing a coefficient with a magnitude as large as we observe if the true coefficient is really 0 (if the variable has no effect on the outcome). Don’t trust the estimate of any coefficient with a large p-value. The general rule of thumb, p>=0.05, is not to be trusted. The estimate of the coefficient may be good, but you want to use more data to build a model that reliably shows that the estimate is good. However, lower p-values aren’t always “better” once they’re good enough. There’s no reason to prefer a coefficient with a p-value of 1e-23 to one with a p-value of 1e-08; at this point you know both coefficients are likely good estimates and you should prefer the ones that explain the most variance.

Collinearity also lowers significance

Sometimes, a predictive variable won’t appear significant because it’s collinear (or correlated) with another predictive variable. For example, if we did try to use both age and number of years in the workforce to predict income, neither variable may appear significant. This is because age tends to be correlated with number of years in the workforce. If you remove one of the variables and the other one gains significance, this is a good indicator of correlation.

Another possible indication of collinearity in the inputs is seeing coefficients with an unexpected sign: for example, seeing that income is negatively correlated with years in the workforce.

The overall model can still predict income quite well, even when the inputs are correlated; it just can’t determine which variable deserves the credit for the prediction. Using regularization (especially ridge regression as found in lm.ridge() in the package MASS) is helpful in collinear situations (we prefer it to “x-alone” variable preprocessing, such as principal components analysis). If you want to use the coefficient values as advice as well as to make good predictions, try to avoid collinearity in the inputs as much as possible.

Overall model quality summaries

The last part of the summary(model) report is the overall model quality statistics. It’s a good idea to check the overall model quality before sharing any predictions or coefficients. The summaries are as follows:

Residual standard error: 0.2691 on 578 degrees of freedom

Multiple R-squared: 0.3383, Adjusted R-squared: 0.3199

F-statistic: 18.47 on 16 and 578 DF, p-value: < 2.2e-16

The degrees of freedom is the number of data rows minus the number of coefficients fit; in our case, this:

df <- dim(dtrain)[1] - dim(summary(model)\$coefficients)[1]

The degrees of freedom is thought of as the number of training data rows you have after correcting for the number of coefficients you tried to solve for. You want the degrees of freedom to be large compared to the number of coefficients fit to avoid overfitting. Overfitting is when you find chance relations in your training data that aren’t present in the general population. Overfitting is bad: you think you have a good model when you don’t.

The residual standard error is the sum of the square of the residuals (or the sum of squared error) divided by the degrees of freedom. So it’s similar to the RMSE (root mean squared error) that we discussed earlier, except with the number of data rows adjusted to be the degrees of freedom; in R, this:

modelResidualError <- sqrt(sum(residuals(model)^2)/df)

Multiple R-squared is just the R-squared (discussed in section 7.1.3).

The adjusted R-squared is the multiple R-squared penalized by the ratio of the degrees of freedom to the number of training examples. This attempts to correct the fact that more complex models tend to look better on training data due to overfitting. Usually it’s better to rely on the adjusted R-squared. Better still is to compute the R-squared between predictions and actuals on hold-out test data. In section 7.1.3, we showed the R-squared on test data was 0.26, which is significantly lower than the reported adjusted R-squared of 0.32. So the adjusted R-squared discounts for overfitting, but not always enough. This is one of the reasons we advise preparing both training and test datasets; the test dataset estimates can be more representative of production model performance than statistical formulas.

The F-statistic is similar to the p-values we saw with the model coefficients. It’s used to measure whether the linear regression model predicts outcome better than the constant mode (the mean value of y). The F-statistic gets its name from the F-test, which is the technique used to check if two variances—in this case, the variance of the residuals from the constant model and the variance of the residuals from the linear model—are significantly different. The corresponding p-value is the estimate of the probability that we would’ve observed an F-statistic this large or larger if the two variances in question are in reality the same. So you want the p-value to be small (rule of thumb: less than 0.05).

In our example, the model is doing better than just the constant model, and the improvement is incredibly unlikely to have arisen from sampling error.

Interpreting model significances

Most of the tests of linear regression, including the tests for coefficient and model significance, are based on the error terms, or residuals are normally distributed. It’s important to examine graphically or using quantile analysis to determine if the regression model is appropriate.

7.1.6. Linear regression takeaways

Here’s what you should remember about linear regression:

· Linear regression is the go-to statistical modeling method for quantities.

· You should always try linear regression first, and only use more complicated methods if they actually outperform a linear regression model.

· Linear regression will have trouble with problems that have a very large number of variables, or categorical variables with a very large number of levels.

· You can enhance linear regression by adding new variables or transforming variables (like we did with the log() transform of y, but always be wary when transforming y as it changes the error model).

· With linear regression, you think in terms of residuals. You look for variables that correlate with your errors and add them to try and eliminate systematic modeling errors.

· Linear regression can predict well even in the presence of correlated variables, but correlated variables lower the quality of the advice.

· Overly large coefficient magnitudes, overly large standard errors on the coefficient estimates, and the wrong sign on a coefficient could be indications of correlated inputs.

· Linear regression packages have some of the best built-in diagnostics available, but rechecking your model on test data is still your most effective safety check.

7.2. Using logistic regression

Logistic regression is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict values that are restricted to the (0,1) interval, such as probabilities. It’s the go-to method for predicting probabilities or rates, and like linear regression, the coefficients of a logistic regression model can be treated as advice. It’s also a good first choice for binary classification problems.

In this section, we’ll use a medical classification example (predicting whether a newborn will need extra medical attention) to work through all of the steps of producing and using a logistic regression model.[3]

3 Logistic regression is usually used to perform classification, but logistic regression and its close cousin beta regression are also useful in estimating rates. In fact, R’s standard glm() call will work with prediction numeric values between 0 and 1.0 in addition to predicting classifications.

7.2.1. Understanding logistic regression

Logistic regression predicts the probability y that an instance belongs to a specific category—for instance, the probability that a flight will be delayed. When x[i,] is a row of inputs (for example, a flight’s origin and destination, the time of year, the weather, the air carrier), logistic regression finds a fit function f(x) such that

P[y[i] in class] ~ f(x[i,]) = s(a+b[1] x[i,1] + ... b[n] x[i,n])

Here, s(z) is the so-called sigmoid function, defined as s(z) = 1/(1+exp(z)). If the y[i] are the probabilities that the x[i,] belong to the class of interest (in our example, that a flight with certain characteristics will be delayed), then the task of fitting is to find the b[1], ..., b[n]such that f(x[i,]) is the best possible estimate of y[i]. R supplies a one-line command to find these coefficients: glm().[4] Note that we don’t need to supply y[i] that are probability estimates to run glm(); the training method only requires y[i] that say whether a given training example is in the target class.

4 Logistic regression can be used for classifying into any number of categories (as long as the categories are disjoint and cover all possibilities: every x has to belong to one of the given categories). But glm() only handles the two-category case, so our discussion will focus on this case.

The sigmoid function maps real numbers to the interval (0,1)—that is, to probabilities. The inverse of the sigmoid is the logit, which is defined as log(p/(1-p)), where p is a probability. The ratio p/(1-p) is known as the odds, so in the flight example, the logit is the log of the odds (orlog-odds) that a flight will be delayed. In other words, you can think of logistic regression as a linear regression that finds the log-odds of the probability that you’re interested in.

In particular, logistic regression assumes that logit(y) is linear in the values of x. Like linear regression, logistic regression will find the best coefficients to predict y, including finding advantageous combinations and cancellations when the inputs are correlated.

For the example task, imagine that you’re working at a hospital. The goal is to design a plan that provisions neonatal emergency equipment to delivery rooms. Newborn babies are assessed at one and five minutes after birth using what’s called the Apgar test, which is designed to determine if a baby needs immediate emergency care or extra medical attention. A baby who scores below 7 (on a scale from 0 to 10) on the Apgar scale needs extra attention.

Such at-risk babies are rare, so the hospital doesn’t want to provision extra emergency equipment for every delivery. On the other hand, at-risk babies may need attention quickly, so provisioning resources proactively to appropriate deliveries can save lives. The goal of this project is to identify ahead of time situations with a higher probability of risk, so that resources can be allocated appropriately.

We’ll use a sample dataset from the CDC 2010 natality public-use data file (http://mng.bz/pnGy). This dataset records statistics for all births registered in the 50 US States and the District of Columbia, including facts about the mother and father, and about the delivery. We’ll use a sample of just over 26,000 births in a data frame called sdata.[5] The data is split into training and test sets, using the random grouping column that we added, as recommended in section 2.2.2.

5 Our pre-prepared file is at https://github.com/WinVector/zmPDSwR/tree/master/CDC/NatalRiskData.rData; we also provide a script file (https://github.com/WinVector/zmPDSwR/blob/master/CDC/PrepNatalRiskData.R), which prepares the data frame from an extract of the full natality data set. Details found at https://github.com/WinVector/zmPDSwR/blob/master/CDC/README.md.

train <- sdata[sdata\$ORIGRANDGROUP<=5,]

test <- sdata[sdata\$ORIGRANDGROUP>5,]

Table 7.1 lists the columns of the dataset that we’ll use. Because the goal is to anticipate at-risk infants ahead of time, we’ll restrict variables to those whose values are known before delivery or can be determined during labor. For example, facts about the mother’s weight or health history are valid inputs, but postbirth facts like infant birth weight are not. We can consider in-labor complications like breech birth by reasoning that the model can be updated in the delivery room (via a protocol or checklist) in time for emergency resources to be allocated before delivery.

Table 7.1. Some variables in natality dataset

 Variable Type Description atRisk Logical TRUE if 5-minute Apgar score < 7; FALSE otherwise PWGT Numeric Mother’s prepregnancy weight UPREVIS Numeric (integer) Number of prenatal medical visits CIG_REC Logical TRUE if smoker; FALSE otherwise GESTREC3 Categorical Two categories: <37 weeks (premature) and >=37 weeks DPLURAL Categorical Birth plurality, three categories: single/twin/triplet+ ULD_MECO Logical TRUE if moderate/heavy fecal staining of amniotic fluid ULD_PRECIP Logical TRUE for unusually short labor (< three hours) ULD_BREECH Logical TRUE for breech (pelvis first) birth position URF_DIAB Logical TRUE if mother is diabetic URF_CHYPER Logical TRUE if mother has chronic hypertension URF_PHYPER Logical TRUE if mother has pregnancy-related hypertension URF_ECLAM Logical TRUE if mother experienced eclampsia: pregnancy-related seizures

Now we’re ready to build the model.

7.2.2. Building a logistic regression model

The command to build a logistic regression model in R is glm(). In our case, the dependent variable y is the logical (or Boolean) atRisk; all the other variables in table 7.1 are the independent variables x. The formula for building a model to predict atRisk using these variables is rather long to type in by hand; you can generate the formula with the commands shown in the next listing.

Listing 7.8. Building the model formula

complications <- c("ULD_MECO","ULD_PRECIP","ULD_BREECH")

riskfactors <- c("URF_DIAB", "URF_CHYPER", "URF_PHYPER",

"URF_ECLAM")

y <- "atRisk"

x <- c("PWGT",

"UPREVIS",

"CIG_REC",

"GESTREC3",

"DPLURAL",

complications,

riskfactors)

fmla <- paste(y, paste(x, collapse="+"), sep="~")

Now we build the logistic regression model, using the training dataset.

Listing 7.9. Fitting the logistic regression model

print(fmla)

[1] "atRisk ~ PWGT+UPREVIS+CIG_REC+GESTREC3+DPLURAL+ULD_MECO+ULD_PRECIP+

ULD_BREECH+URF_DIAB+URF_CHYPER+URF_PHYPER+URF_ECLAM"

This is similar to the linear regression call to lm(), with one additional argument: family=binomial(link="logit"). The family function specifies the assumed distribution of the dependent variable y. In our case, we’re modeling y as a binomial distribution, or as a coin whose probability of heads depends on x. The link function “links” the output to a linear model—pass y through the link function, and then model the resulting value as a linear function of the x values. Different combinations of family functions and link functions lead to different kinds of generalized linear models (for example, Poisson, or probit). In this book, we’ll only discuss logistic models, so we’ll only need to use the binomial family with the logit link.

Don’t forget the family argument!

Without an explicit family argument, glm defaults to standard linear regression (like lm).

As before, we’ve stored the results in the object model.

7.2.3. Making predictions

Making predictions with a logistic model is similar to making predictions with a linear model—use the predict() function.

Listing 7.10. Applying the logistic regression model

train\$pred <- predict(model, newdata=train, type="response")

test\$pred <- predict(model, newdata=test, type="response")

We’ve again stored the predictions for the training and test sets as the column pred in the respective data frames. Note the additional parameter type="response". This tells the predict() function to return the predicted probabilities y. If you don’t specify type="response", then by default predict() will return the output of the link function, logit(y).

One strength of logistic regression is that it preserves the marginal probabilities of the training data. That means that if you sum up the predicted probability scores for the entire training set, that quantity will be equal to the number of positive outcomes (atRisk == T) in the training set. This is also true for subsets of the data determined by variables included in the model. For example, in the subset of the training data that has train\$GESTREC=="<37 weeks" (the baby was premature), the sum of the predicted probabilities equals the number of positive training examples (see, for example http://mng.bz/j338).

Characterizing prediction quality

If our goal is to use the model to classify new instances into one of two categories (in this case, at-risk or not-at-risk), then we want the model to give high scores to positive instances and low scores otherwise. We can check if this is so by plotting the distribution of scores for both the positive and negative instances. Let’s do this on the training set (we should also plot the test set, to make sure that the performance is of similar quality).

Listing 7.11. Plotting distribution of prediction score grouped by known outcome

library(ggplot2)

ggplot(train, aes(x=pred, color=atRisk, linetype=atRisk)) +

geom_density()

The result is shown in figure 7.9. Ideally, we’d like the distribution of scores to be separated, with the scores of the negative instances (FALSE) to be concentrated on the left, and the distribution for the positive instances to be concentrated on the right. Earlier we showed an example of a classifier that separates the positives and the negatives quite well in figure 5.8. In the current case, both distributions are concentrated on the left, meaning that both positive and negative instances score low. This isn’t surprising, since the positive instances (the ones with the baby at risk) are rare (about 1.8% of all births in the dataset). The distribution of scores for the negative instances dies off sooner than the distribution for positive instances. This means that the model did identify subpopulations in the data where the rate of at-risk newborns is higher than the average.

Figure 7.9. Distribution of score broken up by positive examples (TRUE) and negative examples (FALSE)

In order to use the model as a classifier, you must pick a threshold; scores above the threshold will be classified as positive, those below as negative. When you pick a threshold, you’re trying to balance the precision of the classifier (what fraction of the predicted positives are true positives) and its recall (how many of the true positives the classifier finds).

If the score distributions of the positive and negative instances are well separated, as in figure 5.8, we can pick an appropriate threshold in the “valley” between the two peaks. In the current case, the two distributions aren’t well separated, which indicates that the model can’t build a classifier that simultaneously achieves good recall and good precision. But we can build a classifier that identifies a subset of situations with a higher-than-average rate of at-risk births, so preprovisioning resources to those situations may be advised. We’ll call the ratio of the classifier precision to the average rate of positives the enrichment rate.

The higher we set the threshold, the more precise the classifier will be (we’ll identify a set of situations with a much higher-than-average rate of at-risk births); but we’ll also miss a higher percentage of at-risk situations, as well. When picking the threshold, we’ll use the training set, since picking the threshold is part of classifier-building. We can then use the test set to evaluate classifier performance.

To help pick the threshold, we can use a plot like figure 7.10, which shows both enrichment and recall as a functions of the threshold.

Figure 7.10. Enrichment (top) and recall (bottom) plotted as functions of threshold for the training set

Looking at figure 7.10, you see that higher thresholds result in more precise classifications, at the cost of missing more cases; a lower threshold will identify more cases, at the cost of many more false positives. The best trade-off between precision and recall is a function of how many resources the hospital has available to allocate, and how many they can keep in reserve (or redeploy) for situations that the classifier missed. A threshold of 0.02 (which incidentally is about the overall rate of at-risk births in the training data) might be a good trade-off. The resulting classifier will identify a set of potential at-risk situations that finds about half of all the true at-risk situations, with a true positive rate 2.5 times higher than the overall population.

We can produce figure 7.10 with the ROCR package, which we discussed in more detail in chapter 5.

Once we’ve picked an appropriate threshold, we can evaluate the resulting classifier by looking at the confusion matrix, as we discussed in section 5.2.1. Let’s use the test set to evaluate the classifier, with a threshold of 0.02.

Listing 7.13. Evaluating our chosen model

The resulting classifier is low-precision, but identifies a set of potential at-risk cases that contains 55.5% of the true positive cases in the test set, at a rate 2.66 times higher than the overall average. This is consistent with the results on the training set.

In addition to making predictions, a logistic regression model also helps you extract useful information and advice. We’ll show this in the next section.

7.2.4. Finding relations and extracting advice from logistic models

The coefficients of a logistic regression model encode the relationships between the input variables and the output in a way similar to how the coefficients of a linear regression model do. You can get the model’s coefficients with the call coefficients(model).

Listing 7.14. The model coefficients

> coefficients(model)

(Intercept) PWGT

-4.41218940 0.00376166

UPREVIS CIG_RECTRUE

-0.06328943 0.31316930

GESTREC3< 37 weeks DPLURALtriplet or higher

1.54518311 1.39419294

DPLURALtwin ULD_MECOTRUE

0.31231871 0.81842627

ULD_PRECIPTRUE ULD_BREECHTRUE

0.19172008 0.74923672

URF_DIABTRUE URF_CHYPERTRUE

-0.34646672 0.56002503

URF_PHYPERTRUE URF_ECLAMTRUE

0.16159872 0.49806435

Negative coefficients that are statistically significant[6] correspond to variables that are negatively correlated to the odds (and hence to the probability) of a positive outcome (the baby being at risk). Positive coefficients that are statistically significant are positively correlated to the odds of a positive outcome.

6 We’ll show how to check for statistical significance in the next section.

As with linear regression, every categorical variable is expanded to a set of indicator variables. If the original variable has n levels, there will be n-1 indicator variables; the remaining level is the reference level.

For example, the variable DPLURAL has three levels corresponding to single births, twins, and triplets or higher. The logistic regression model has two corresponding coefficients: DPLURALtwin and DPLURALtriplet or higher. The reference level is single births. Both of the DPLURALcoefficients are positive, indicating that multiple births have higher odds of being at risk than single births do, all other variables being equal.

Logistic regression also dislikes a very large variable count

And as with linear regression, you should avoid categorical variables with too many levels.

Interpreting the coefficients

Interpreting coefficient values is a little more complicated with logistic than with linear regression. If the coefficient for the variable x[,k] is b[k], then the odds of a positive outcome are multiplied by a factor of exp(b[k]) for every unit change in x[,k].

The coefficient for GESTREC3< 37 weeks (for a premature baby) is 1.545183. So for a premature baby, the odds of being at risk are exp(1.545183)=4.68883 times higher compared to a baby that’s born full-term, with all other input variables unchanged. As an example, suppose a full-term baby with certain characteristics has a 1% probability of being at risk (odds are p/(1-p), or 0.01/0.99 = 0.0101); then the odds for a premature baby with the same characteristics are 0.0101*4.68883 = 0.047. This corresponds to a probability of being at risk ofodds/(1+odds), or 0.047/1.047—about 4.5%.

Similarly, the coefficient for UPREVIS (number of prenatal medical visits) is about -0.06. This means every prenatal visit lowers the odds of an at-risk baby by a factor of exp(-0.06), or about 0.94. Suppose the mother of our premature baby had made no prenatal visits; a baby in the same situation whose mother had made three prenatal visits would have odds of being at risk of about 0.047 * 0.94 * 0.94 * 0.94 = 0.039. This corresponds to a probability of being at risk of 3.75%.

So the general advice in this case might be to keep a special eye on premature births (and multiple births), and encourage expectant mothers to make regular prenatal visits

7.2.5. Reading the model summary and characterizing coefficients

As we mentioned earlier, conclusions about the coefficient values are only to be trusted if the coefficient values are statistically significant. We also want to make sure that the model is actually explaining something. The diagnostics in the model summary will help us determine some facts about model quality. The call, as before, is summary(model).

Listing 7.15. The model summary

> summary(model)

Call:

glm(formula = fmla, family = binomial(link = "logit"), data = train)

Deviance Residuals:

Min 1Q Median 3Q Max

-0.9732 -0.1818 -0.1511 -0.1358 3.2641

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -4.412189 0.289352 -15.249 < 2e-16 ***

PWGT 0.003762 0.001487 2.530 0.011417 *

UPREVIS -0.063289 0.015252 -4.150 3.33e-05 ***

CIG_RECTRUE 0.313169 0.187230 1.673 0.094398 .

GESTREC3< 37 weeks 1.545183 0.140795 10.975 < 2e-16 ***

DPLURALtriplet or higher 1.394193 0.498866 2.795 0.005194 **

DPLURALtwin 0.312319 0.241088 1.295 0.195163

ULD_MECOTRUE 0.818426 0.235798 3.471 0.000519 ***

ULD_PRECIPTRUE 0.191720 0.357680 0.536 0.591951

ULD_BREECHTRUE 0.749237 0.178129 4.206 2.60e-05 ***

URF_DIABTRUE -0.346467 0.287514 -1.205 0.228187

URF_CHYPERTRUE 0.560025 0.389678 1.437 0.150676

URF_PHYPERTRUE 0.161599 0.250003 0.646 0.518029

URF_ECLAMTRUE 0.498064 0.776948 0.641 0.521489

---

Signif. codes: 0 ’ ***’ 0.001 ’ **’ 0.01 ’ *’ 0.05 ’ .’ 0.1 ’ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2698.7 on 14211 degrees of freedom

Residual deviance: 2463.0 on 14198 degrees of freedom

AIC: 2491

Number of Fisher Scoring iterations: 7

The original model call

The first line of the summary is the call to glm():

Call:

glm(formula = fmla, family = binomial(link = "logit"), data = train)

Here is where we check that we’ve used the correct training set and the correct formula (although in our case, the formula itself is in another variable). We can also verify that we used the correct family and link function to produce a logistic model.

The deviance residuals summary

The deviance residuals are the analog to the residuals of a linear regression model:

Deviance Residuals:

Min 1Q Median 3Q Max

-0.9732 -0.1818 -0.1511 -0.1358 3.2641

In linear regression, the residuals are the vector of differences between the true outcome values and the predicted output values (the errors). In logistic regression, the deviance residuals are related to the log likelihoods of having observed the true outcome, given the predicted probability of that outcome. The idea behind log likelihood is that positive instances y should have high probability py of occurring under the model; negative instances should have low probability of occurring (or putting it another way, (1-py) should be large). The log likelihood function rewards “matches” between the outcome y and the predicted probability py, and penalizes mismatches (high py for negative instances, and vice versa).

Listing 7.16. Calculating deviance residuals

Linear regression models are found by minimizing the sum of the squared residuals; logistic regression models are found by minimizing the sum of the squared residual deviances, which is equivalent to maximizing the log likelihood of the data, given the model.

Logistic models can also be used to explicitly compute rates: given several groups of identical data points (identical except the outcome), predict the rate of positive outcomes in each group. This kind of data is called grouped data. In the case of grouped data, the deviance residuals can be used as a diagnostic for model fit. This is why the deviance residuals are included in the summary. We’re using ungrouped data—every data point in the training set is potentially unique. In the case of ungrouped data, the model fit diagnostics that use the deviance residuals are no longer valid.[7]

7 See Daniel Powers and Yu Xie, Statistical Methods for Categorical Data Analysis, 2nd Ed., Emerald Group Publishing Ltd., 2008.

The summary coefficients table

The summary coefficients table for logistic regression has the same format as the coefficients table for linear regression:

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -4.412189 0.289352 -15.249 < 2e-16 ***

PWGT 0.003762 0.001487 2.530 0.011417 *

UPREVIS -0.063289 0.015252 -4.150 3.33e-05 ***

CIG_RECTRUE 0.313169 0.187230 1.673 0.094398 .

GESTREC3< 37 weeks 1.545183 0.140795 10.975 < 2e-16 ***

DPLURALtriplet or higher 1.394193 0.498866 2.795 0.005194 **

DPLURALtwin 0.312319 0.241088 1.295 0.195163

ULD_MECOTRUE 0.818426 0.235798 3.471 0.000519 ***

ULD_PRECIPTRUE 0.191720 0.357680 0.536 0.591951

ULD_BREECHTRUE 0.749237 0.178129 4.206 2.60e-05 ***

URF_DIABTRUE -0.346467 0.287514 -1.205 0.228187

URF_CHYPERTRUE 0.560025 0.389678 1.437 0.150676

URF_PHYPERTRUE 0.161599 0.250003 0.646 0.518029

URF_ECLAMTRUE 0.498064 0.776948 0.641 0.521489

---

Signif. codes: 0 ’ ***’ 0.001 ’ **’ 0.01 ’ *’ 0.05 ’ .’ 0.1 ’ ’ 1

The columns of the table represent

· A coefficient

· Its estimated value

· The error around that estimate

· The signed distance of the estimated coefficient value from 0 (using the standard error as the unit of distance)

· The probability of seeing a coefficient value at least as large as we observed, under the null hypothesis that the coefficient value is really 0

This last value, called the p-value or significance, tells us whether we should trust the estimated coefficient value. The standard rule of thumb is that coefficients with p-values less than 0.05 are reliable, although some researchers prefer stricter thresholds.

For the birth data, we can see from the coefficient summary that premature birth and triplet birth are strong predictors of the newborn needing extra medical attention: the coefficient magnitudes are non-negligible and the p-values indicate significance. Other variables that affect the outcome are the mother’s prepregnancy weight (heavier mothers indicate higher risk—slightly surprising); the number of prenatal medical visits (the more visits, the lower the risk); meconium staining in the amniotic fluid; and breech position at birth. There might be a positive correlation between mother’s smoking and an at-risk birth, but the data doesn’t indicate it definitively. None of the other variables show a strong relationship to an at-risk birth.

Lack of significance could mean collinear inputs

As with linear regression, logistic regression can predict well with collinear (or correlated) inputs, but the correlations can mask good advice.

To see this for yourself, we left data about the babies’ birth weight in grams in the dataset sdata. It’s present in both the test and training data as the column DBWT. Try adding DBWT to the logistic regression model in addition to all the other variables; you’ll see that the coefficient for baby’s birth weight will be non-negligible (has a substantial impact on prediction) and significant, and negatively correlated with risk. The coefficient for DPLURALtriplet or higher will appear insignificant, and the coefficient for GESTREC3< 37 weeks has a much smaller magnitude. This is because low birth weight is correlated to both prematurity and multiple birth. Of the three related variables, birth weight is the best single predictor of the outcome: knowing that the baby is a triplet adds no additional useful information, and knowing the baby is premature adds only a little information.

In the context of the modeling goal—to proactively allocate emergency resources where they’re more likely to be needed—birth weight isn’t as useful a variable, because we don’t know the baby’s weight until it’s born. We do know ahead of time if it’s being born prematurely, or if it’s one of multiple babies. So it’s better to use GESTREC3 and DPLURAL as input variables, instead of DBWT.

Other signs of possibly collinear inputs are coefficients with the wrong sign and unusually large coefficient magnitudes.

Overall model quality summaries

The next section of the summary contains the model quality statistics:

Null deviance: 2698.7 on 14211 degrees of freedom

Residual deviance: 2463.0 on 14198 degrees of freedom

AIC: 2491

Null and residual deviances

Deviance is again a measure of how well the model fits the data. It is 2 times the negative log likelihood of the dataset, given the model. If you think of deviance as analogous to variance, then the null deviance is similar to the variance of the data around the average rate of positive examples. The residual deviance is similar to the variance of the data around the model. We can calculate the deviances for both the training and test sets.

Listing 7.17. Computing deviance

The first thing we can do with the null and residual deviances is check whether the model’s probability predictions are better than just guessing the average rate of positives, statistically speaking. In other words, is the reduction in deviance from the model meaningful, or just something that was observed by chance? This is similar to calculating the F-test statistics that are reported for linear regression. In the case of logistic regression, the test you’ll run is the chi-squared test. To do that, you need to know the degrees of freedom for the null model and the actual model (which are reported in the summary). The degrees of freedom of the null model is the number of data points minus 1: df.null = dim(train)[[1]] - 1. The degrees of freedom of the model that you fit is the number of data points minus the number of coefficients in the model: df.model = dim(train)[[1]] - length(model\$coefficients).

If the number of data points in the training set is large, and df.null - df.model is small, then the probability of the difference in deviances null.dev - resid.dev being as large as we observed is approximately distributed as a chi-squared distribution with df.null - df.modeldegrees of freedom.

Listing 7.18. Calculating the significance of the observed fit

The p-value is very small; it’s extremely unlikely that we could’ve seen this much reduction in deviance by chance.

The pseudo R-squared

A useful goodness of fit measure based on the deviances is the pseudo R-squared: 1 - (dev.model/dev.null). The pseudo R-squared is the analog to the R-squared measure for linear regression. It’s a measure of how much of the deviance is “explained” by the model. Ideally, you want the pseudo R-squared to be close to 1. Let’s calculate the pseudo-R-squared for both the test and training data.

Listing 7.19. Calculating the pseudo R-squared

pr2 <- 1-(resid.dev/null.dev)

> print(pr2)

[1] 0.08734674

> pr2.test <- 1-(resid.dev.test/null.dev.test)

> print(pr2.test)

[1] 0.07760427

The model only explains about 7.7–8.7% of the deviance; it’s not a highly predictive model (you should have suspected that already, from figure 7.9). This tells us that we haven’t yet identified all the factors that actually predict at-risk births.

Goodness of fit versus significance

It’s worth noting that the model we found is a legitimate model, just not a complete one. The good p-value tells us that the model is legitimate: it gives us more information than the average rate of at-risk births does alone. The poor pseudo R-squared means that the model isn’t giving us enough information to predict at-risk births with high reliability.

It’s also possible to have good pseudo R-squared (on the training data) with a bad p-value. This is an indication of overfit. That’s why it’s a good idea to check both, or better yet, check the pseudo R-squared of the model on both training and test data.

The AIC

The last metric given in the section of the summary is the AIC, or the Akaike information criterion. The AIC is the log likelihood adjusted for the number of coefficients. Just as the R-squared of a linear regression is generally higher when the number of variables is higher, the log likelihood also increases with the number of variables.

Listing 7.20. Calculating the Akaike information criterion

aic <- 2*(length(model\$coefficients) -

loglikelihood(as.numeric(train\$atRisk), pred))

> aic

[1] 2490.992

The AIC is usually used to decide which and how many input variables to use in the model. If you train many different models with different sets of variables on the same training set, you can consider the model with the lowest AIC to be the best fit.

Fisher scoring iterations

The last line of the model summary is the number of Fisher scoring iterations:

Number of Fisher Scoring iterations: 7

The Fisher scoring method is an iterative optimization method similar to Newton’s method that glm() uses to find the best coefficients for the logistic regression model. You should expect it to converge in about six to eight iterations. If there are more iterations than that, then the algorithm may not have converged, and the model may not be valid.

Separation and quasi-separation

The probable reason for nonconvergence is separation or quasi-separation: one of the model variables or some combination of the model variables predicts the outcome perfectly for at least a subset of the training data. You’d think this would be a good thing, but ironically logistic regression fails when the variables are too powerful. Ideally, glm() will issue a warning when it detects separation or quasi-separation:

Warning message:

glm.fit: fitted probabilities numerically 0 or 1 occurred

Unfortunately, there are situations when it seems that no warning is issued, but there are other warning signs:

· An unusually high number of Fisher iterations

· Very large coefficients, usually with extremely large standard errors

· Residual deviances larger than the null deviances

If you see any of these signs, the model is suspect. To try to fix the problem, remove any variables with unusually large coefficients; they’re probably causing the separation. You can try using a decision tree on the variables that you remove to detect regions of perfect prediction. The data that the decision tree doesn’t predict perfectly on can still be used for building a logistic regression model. The overall model would then be a hybrid: the decision tree to predict the too-good data, and a logistic regression model for the rest.

The right way to deal with separation

We admit, it doesn’t feel right to remove variables or data that are “too good” from the modeling process. The correct way to handle separation is to regularize. Unfortunately, the default, glm() doesn’t regularize. The package glmnet can. But its calling interface isn’t the standard interface that lm(), glm(), and other modeling functions in this book use. It also doesn’t have the nice diagnostic output of the other packages. For these reasons, we consider a discussion of glmnet beyond the scope of this book.

To regularize glm() in an ad hoc way, you can use the weights argument. The weights argument lets you pass a vector of weights (one per datum) to the glm() call. Add another copy of the data, but with opposite outcomes, and use this faux data in glm() with a small weight. An example of this trick can be found in the blog article “A Pathological GLM Problem that Doesn’t Issue a Warning” (http://mng.bz/G8RS).

Another way to deal with separation is to build a two-stage model starting with a decision tree (section 6.3.2) and use a different glm() for a different set of variables on each induced partition of your data.

7.2.6. Logistic regression takeaways

What you should remember about logistic regression:

· Logistic regression is the go-to statistical modeling method for binary classification. Try logistic regression first, and then more complicated methods if logistic regression doesn’t perform well.

· Logistic regression will have trouble with problems with a very large number of variables, or categorical variables with a very large number of levels.

· Logistic regression is well calibrated: it reproduces the marginal probabilities of the data.

· Logistic regression can predict well even in the presence of correlated variables, but correlated variables lower the quality of the advice.

· Overly large coefficient magnitudes, overly large standard errors on the coefficient estimates, and the wrong sign on a coefficient could be indications of correlated inputs.

· Too many Fisher iterations, or overly large coefficients with very large standard errors, could be signs that an input or combination of inputs is perfectly correlated with a subset of your responses. You may have to segment the data to deal with this issue.

· glm() provides good diagnostics, but rechecking your model on test data is still your most effective diagnostic.

· Pseudo R-squared is a useful goodness-of-fit heuristic.

7.3. Summary

In this chapter, you’ve started building models that go beyond training data memorization and infer a functional form of model. You’ve learned how to predict numerical quantities with linear regression models and to predict probabilities or classify using logistic regression models. You’ve also learned how to interpret the models that you’ve produced.

Both linear and logistic regression assume that the outcome is a function of a linear combination of the inputs. This seems restrictive, but in practice linear and logistic regression models can perform well even when the theoretical assumptions aren’t exactly met. We’ll show how to further work around these limits in chapter 9.

Linear and logistic regression can also provide advice by quantifying the relationships between the outcomes and the model’s inputs. Since the models are expressed completely by their coefficients, they’re small, portable, and efficient—all valuable qualities when putting a model into production. If the model’s errors are homoscedastic (uncorrelated with y), the model might be trusted to extrapolate predictions outside the training range. Extrapolation is never completely safe, but it’s sometimes necessary.

The methods that we discussed in this chapter and in the previous chapter use data about known outcomes to build models that predict future outcomes. But what if you don’t yet know what to predict? The next chapter looks at unsupervised methods: algorithms that discover previously unknown relationships in data.

Key takeaways

· Functional models allow you to better explore how changes in inputs affect predictions.

· Linear regression is a good first technique for modeling quantities.

· Logistic regression is good first technique for modeling probabilities.

· Models with simple forms come with very powerful summaries and diagnostics.

﻿