Chapter 15. Multiple Regression
I don’t look at a problem and put variables in there that don’t affect it.
Bill Parcells
Although the VP is pretty impressed with your predictive model, she thinks you can do better. To that end, you’ve collected additional data: for each of your users, you know how many hours he works each day, and whether he has a PhD. You’d like to use this additional data to improve your model.
Accordingly, you hypothesize a linear model with more independent variables:

minutes = α + β_1 friends + β_2 work hours + β_3 phd + ε
Obviously, whether a user has a PhD is not a number, but — as we mentioned in Chapter 11 — we can introduce a dummy variable that equals 1 for users with PhDs and 0 for users without, after which it’s just as numeric as the other variables.
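As a tiny made-up illustration (the has_phd field name is invented here):

users = [{"friends": 49, "work_hours": 4, "has_phd": False},
         {"friends": 32, "work_hours": 8, "has_phd": True}]

# dummy variable: 1 for users with PhDs, 0 for users without
phd_dummies = [1 if user["has_phd"] else 0 for user in users]   # [0, 1]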
The Model
Recall that in Chapter 14 we fit a model of the form:

y_i = α + β x_i + ε_i
Now imagine that each input x_i is not a single number but rather a vector of k numbers. The multiple regression model assumes that:

y_i = α + β_1 x_i1 + ... + β_k x_ik + ε_i
In multiple regression the vector of parameters is usually called β. We’ll want this to include the constant term as well, which we can achieve by adding a column of ones to our data:
beta = [alpha, beta_1, ..., beta_k]
and:
x_i = [1, x_i1, ..., x_ik]
Then our model is just:
def predict(x_i, beta):
    """assumes that the first element of each x_i is 1"""
    return dot(x_i, beta)
In this particular case, our independent variable x will be a list of vectors, each of which looks like this:
[1,    # constant term
 49,   # number of friends
 4,    # work hours per day
 0]    # doesn't have PhD
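For example, using the coefficients we’ll end up estimating later in the chapter (previewed here), the model’s prediction for this user would be:

beta = [30.63, 0.972, -1.868, 0.911]   # estimated below via gradient descent
predict([1, 49, 4, 0], beta)           # 30.63 + 0.972*49 - 1.868*4 = about 70.8 minutes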
Further Assumptions of the Least Squares Model
There are a couple of further assumptions that are required for this model (and our solution) to make sense.
The first is that the columns of x are linearly independent — that there’s no way to write any one as a weighted sum of some of the others. If this assumption fails, it’s impossible to estimate beta. To see this in an extreme case, imagine we had an extra field num_acquaintances in our data that for every user was exactly equal to num_friends.
Then, starting with any beta, if we add any amount to the num_friends coefficient and subtract that same amount from the num_acquaintances coefficient, the model’s predictions will remain unchanged. Which means that there’s no way to find the coefficient for num_friends. (Usually violations of this assumption won’t be so obvious.)
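To see this concretely, here’s a tiny made-up example: with a duplicated column, shifting weight from one twin coefficient to the other leaves every prediction unchanged.

x_i = [1, 49, 49]            # [constant, num_friends, num_acquaintances], last two always equal

beta    = [30, 1.0,  0.0]
shifted = [30, 1.5, -0.5]    # moved 0.5 from one twin coefficient to the other

predict(x_i, beta) == predict(x_i, shifted)   # True, both equal 79.0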
The second important assumption is that the columns of x are all uncorrelated with the errors. If this fails to be the case, our estimates of beta will be systematically wrong.
For instance, in Chapter 14, we built a model that predicted that each additional friend was associated with an extra 0.90 daily minutes on the site.
Imagine that it’s also the case that:
§ People who work more hours spend less time on the site.
§ People with more friends tend to work more hours.
That is, imagine that the “actual” model is:

minutes = α + β_1 friends + β_2 work hours + ε
and that work hours and friends are positively correlated. In that case, when we minimize the errors of the single-variable model:

minutes = α + β_1 friends + ε
we will underestimate β_1.
Think about what would happen if we made predictions using the single-variable model with the “actual” value of β_1. (That is, the value that arises from minimizing the errors of what we called the “actual” model.) The predictions would tend to be too small for users who work many hours and too large for users who work few hours, because β_2 < 0 and we “forgot” to include work hours. Because work hours is positively correlated with number of friends, this means the predictions tend to be too small for users with many friends and too large for users with few friends.
The result of this is that we can reduce the errors (in the single-variable model) by decreasing our estimate of β_1, which means that the error-minimizing β_1 is smaller than the “actual” value. That is, in this case the single-variable least squares solution is biased to underestimate β_1. And, in general, whenever the independent variables are correlated with the errors like this, our least squares solution will give us a biased estimate of β_1.
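If you’d like to watch this bias appear, here’s a small simulation sketch. The data-generating numbers are invented for illustration, and it assumes the least_squares_fit function from Chapter 14:

import random
random.seed(0)

# invented "actual" model: minutes = 30 + 1.0 * friends - 2.0 * work_hours + noise,
# where work_hours is positively correlated with friends
friends    = [random.randrange(1, 100) for _ in range(1000)]
work_hours = [4 + f / 25.0 + random.random() for f in friends]
minutes    = [30 + 1.0 * f - 2.0 * w + random.random()
              for f, w in zip(friends, work_hours)]

# fit the single-variable model minutes = alpha + beta * friends
alpha, beta = least_squares_fit(friends, minutes)
print(beta)   # about 0.92: less than the "actual" 1.0, because work hours was omitted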
Fitting the Model
As we did in the simple linear model, we’ll choose beta to minimize the sum of squared errors. Finding an exact solution is not simple to do by hand, which means we’ll need to use gradient descent. We’ll start by creating an error function to minimize. For stochastic gradient descent, we’ll just want the squared error corresponding to a single prediction:
def error(x_i, y_i, beta):
    return y_i - predict(x_i, beta)

def squared_error(x_i, y_i, beta):
    return error(x_i, y_i, beta) ** 2
If you know calculus, you can compute:
def squared_error_gradient(x_i, y_i, beta):
    """the gradient (with respect to beta)
    corresponding to the ith squared error term"""
    return [-2 * x_ij * error(x_i, y_i, beta)
            for x_ij in x_i]
Otherwise, you’ll need to take my word for it.
At this point, we’re ready to find the optimal beta using stochastic gradient descent:
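A minimal sketch, assuming the minimize_stochastic function from Chapter 8 and the x and daily_minutes_good data from earlier chapters:

def estimate_beta(x, y):
    beta_initial = [random.random() for x_i in x[0]]
    return minimize_stochastic(squared_error,
                               squared_error_gradient,
                               x, y,
                               beta_initial,
                               0.001)              # step size

random.seed(0)
beta = estimate_beta(x, daily_minutes_good)        # [30.63, 0.972, -1.868, 0.911]

which means our model looks like:

minutes = 30.63 + 0.972 friends - 1.868 work hours + 0.911 phd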
You should think of the coefficients of the model as representing all-else-being-equal estimates of the impacts of each factor. All else being equal, each additional friend corresponds to an extra minute spent on the site each day. All else being equal, each additional hour in a user’s workday corresponds to about two fewer minutes spent on the site each day. All else being equal, having a PhD is associated with spending an extra minute on the site each day.
What this doesn’t (directly) tell us is anything about the interactions among the variables. It’s possible that the effect of work hours is different for people with many friends than it is for people with few friends. This model doesn’t capture that. One way to handle this case is to introduce a new variable that is the product of “friends” and “work hours.” This effectively allows the “work hours” coefficient to increase (or decrease) as the number of friends increases.
Or it’s possible that the more friends you have, the more time you spend on the site up to a point, after which further friends cause you to spend less time on the site. (Perhaps with too many friends the experience is just too overwhelming?) We could try to capture this in our model by adding another variable that’s the square of the number of friends.
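A sketch of how both features might be tacked onto our inputs (the helper name augment is made up for illustration):

def augment(x_i):
    """x_i is [1, friends, work_hours, phd]"""
    _, friends, work_hours, phd = x_i
    return x_i + [friends * work_hours,   # interaction term
                  friends ** 2]           # squared term

x_augmented = [augment(x_i) for x_i in x]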
Once we start adding variables, we need to worry about whether their coefficients “matter.” There are no limits to the numbers of products, logs, squares, and higher powers we could add.
Goodness of Fit
Again we can look at the R-squared, which has now increased to 0.68:
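A minimal version, assuming the total_sum_of_squares function from Chapter 14:

def multiple_r_squared(x, y, beta):
    sum_of_squared_errors = sum(error(x_i, y_i, beta) ** 2
                                for x_i, y_i in zip(x, y))
    return 1.0 - sum_of_squared_errors / total_sum_of_squares(y)

multiple_r_squared(x, daily_minutes_good, beta)   # 0.68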
Keep in mind, however, that adding new variables to a regression will necessarily increase the R-squared. After all, the simple regression model is just the special case of the multiple regression model where the coefficients on “work hours” and “PhD” both equal 0. The optimal multiple regression model will necessarily have an error at least as small as that one.
Because of this, in a multiple regression, we also need to look at the standard errors of the coefficients, which measure how certain we are about our estimates of each β_i. The regression as a whole may fit our data very well, but if some of the independent variables are correlated (or irrelevant), their coefficients might not mean much.
The typical approach to measuring these errors starts with another assumption — that the errors ε_i are independent normal random variables with mean 0 and some shared (unknown) standard deviation. In that case, we (or, more likely, our statistical software) can use some linear algebra to find the standard error of each coefficient. The larger it is, the less sure our model is about that coefficient. Unfortunately, we’re not set up to do that kind of linear algebra from scratch.
Digression: The Bootstrap
Imagine we have a sample of n data points, generated by some (unknown to us) distribution:
data = get_sample(num_points=n)
In Chapter 5, we wrote a function to compute the median of the observed data, which we can use as an estimate of the median of the distribution itself.
But how confident can we be about our estimate? If all the data in the sample are very close to 100, then it seems likely that the actual median is close to 100. If approximately half the data in the sample is close to 0 and the other half is close to 200, then we can’t be nearly as certain about the median.
If we could repeatedly get new samples, we could compute the median of each and look at the distribution of those medians. Usually we can’t. What we can do instead is bootstrap new data sets by choosing n data points with replacement from our data and then compute the medians of those synthetic data sets:
def bootstrap_sample(data):
    """randomly samples len(data) elements with replacement"""
    return [random.choice(data) for _ in data]

# 101 points all very close to 100
close_to_100 = [99.5 + random.random() for _ in range(101)]

# 101 points, 50 of them near 0, 50 of them near 200
far_from_100 = ([99.5 + random.random()] +
                [random.random() for _ in range(50)] +
                [200 + random.random() for _ in range(50)])
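The comparisons below also use a bootstrap_statistic helper; a minimal version consistent with that usage:

def bootstrap_statistic(data, stats_fn, num_samples):
    """evaluates stats_fn on num_samples bootstrap samples from data"""
    return [stats_fn(bootstrap_sample(data))
            for _ in range(num_samples)]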
If you compute the median of each, both will be very close to 100. However, if you look at:
bootstrap_statistic(close_to_100, median, 100)
you will mostly see numbers really close to 100. Whereas if you look at:
bootstrap_statistic(far_from_100, median, 100)
you will see a lot of numbers close to 0 and a lot of numbers close to 200.
The standard_deviation of the first set of medians is close to 0, while the standard_deviation of the second set of medians is close to 100. (This extreme a case would be pretty easy to figure out by manually inspecting the data, but in general that won’t be true.)
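Concretely, assuming the median and standard_deviation functions from Chapter 5:

standard_deviation(bootstrap_statistic(close_to_100, median, 100))   # close to 0
standard_deviation(bootstrap_statistic(far_from_100, median, 100))   # close to 100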
Standard Errors of Regression Coefficients
We can take the same approach to estimating the standard errors of our regression coefficients. We repeatedly take a bootstrap_sample of our data and estimate beta based on that sample. If the coefficient corresponding to one of the independent variables (say num_friends) doesn’t vary much across samples, then we can be confident that our estimate is relatively tight. If the coefficient varies greatly across samples, then we can’t be at all confident in our estimate.
The only subtlety is that, before sampling, we’ll need to zip our x data and y data to make sure that corresponding values of the independent and dependent variables are sampled together. This means that bootstrap_sample will return a list of pairs (x_i, y_i), which we’ll need to reassemble into an x_sample and a y_sample:
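A sketch of how that might look, reusing estimate_beta and bootstrap_statistic from above (100 bootstrap estimates, then the standard deviation of each coefficient across them):

def estimate_sample_beta(sample):
    """sample is a list of pairs (x_i, y_i)"""
    x_sample, y_sample = zip(*sample)   # "unzips" back into xs and ys
    return estimate_beta(x_sample, y_sample)

random.seed(0)
bootstrap_betas = bootstrap_statistic(list(zip(x, daily_minutes_good)),
                                      estimate_sample_beta,
                                      100)

# one standard error per coefficient: constant, num_friends, work_hours, phd
bootstrap_standard_errors = [
    standard_deviation([beta[i] for beta in bootstrap_betas])
    for i in range(4)]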
We can use these to test hypotheses such as “does β_j equal zero?” Under the null hypothesis β_j = 0 (and with our other assumptions about the distribution of ε_i) the statistic:

t_j = β̂_j / σ̂_j

which is our estimate of β_j divided by our estimate of its standard error, follows a Student’s t-distribution with n − k “degrees of freedom.”
If we had a students_t_cdf function, we could compute p-values for each least-squares coefficient to indicate how likely we would be to observe such a value if the actual coefficient were zero. Unfortunately, we don’t have such a function. (Although we would if we weren’t working from scratch.)
However, as the degrees of freedom get large, the t-distribution gets closer and closer to a standard normal. In a situation like this, where n is much larger than k, we can use normal_cdf and still feel good about ourselves:
def p_value(beta_hat_j, sigma_hat_j):
    if beta_hat_j > 0:
        # if the coefficient is positive, we need to compute twice the
        # probability of seeing an even *larger* value
        return 2 * (1 - normal_cdf(beta_hat_j / sigma_hat_j))
    else:
        # otherwise twice the probability of seeing a *smaller* value
        return 2 * normal_cdf(beta_hat_j / sigma_hat_j)
p_value(30.63, 1.174)    # ~0    (constant term)
p_value(0.972, 0.079)    # ~0    (num_friends)
p_value(-1.868, 0.131)   # ~0    (work_hours)
p_value(0.911, 0.990)    # 0.36  (phd)
(In a situation not like this, we would probably be using statistical software that knows how to compute the t-distribution, as well as how to compute the exact standard errors.)
While most of the coefficients have very small p-values (suggesting that they are indeed nonzero), the coefficient for “PhD” is not “significantly” different from zero, which suggests that the estimated coefficient for “PhD” is more likely random noise than a meaningful effect.
In more elaborate regression scenarios, you sometimes want to test more elaborate hypotheses about the data, such as “at least one of the β_j is non-zero” or “β_1 equals β_2 and β_3 equals β_4,” which you can do with an F-test, which, alas, falls outside the scope of this book.
Regularization
In practice, you’d often like to apply linear regression to data sets with large numbers of variables. This creates a couple of extra wrinkles. First, the more variables you use, the more likely you are to overfit your model to the training set. And second, the more nonzero coefficients you have, the harder it is to make sense of them. If the goal is to explain some phenomenon, a sparse model with three factors might be more useful than a slightly better model with hundreds.
Regularization is an approach in which we add to the error term a penalty that gets larger as beta gets larger. We then minimize the combined error and penalty. The more importance we place on the penalty term, the more we discourage large coefficients.
For example, in ridge regression, we add a penalty proportional to the sum of the squares of the beta_i. (Except that typically we don’t penalize beta_0, the constant term.)
# alpha is a *hyperparameter* controlling how harsh the penalty is
# sometimes it's called "lambda" but that already means something in Python
def ridge_penalty(beta, alpha):
    return alpha * sum(beta_i ** 2 for beta_i in beta[1:])
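One way the penalty might be wired into the functions we already minimize, assuming the vector_add helper from Chapter 4 (a sketch, not the only way to do it):

def squared_error_ridge(x_i, y_i, beta, alpha):
    """estimate error plus ridge penalty on beta"""
    return error(x_i, y_i, beta) ** 2 + ridge_penalty(beta, alpha)

def ridge_penalty_gradient(beta, alpha):
    """gradient of just the ridge penalty"""
    return [0] + [2 * alpha * beta_j for beta_j in beta[1:]]

def squared_error_ridge_gradient(x_i, y_i, beta, alpha):
    """the gradient corresponding to the ith squared error term,
    including the ridge penalty"""
    return vector_add(squared_error_gradient(x_i, y_i, beta),
                      ridge_penalty_gradient(beta, alpha))

Minimizing squared_error_ridge with larger and larger values of alpha shrinks the non-constant coefficients toward zero.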
In particular, the coefficient on “PhD” vanishes as we increase the penalty, which accords with our previous result that it wasn’t significantly different from zero.
NOTE
Usually you’d want to rescale your data before using this approach. After all, if you changed years of experience to centuries of experience, its least squares coefficient would increase by a factor of 100 and suddenly get penalized much more, even though it’s the same model.
Another approach is lasso regression, which uses the penalty:
def lasso_penalty(beta, alpha):
    return alpha * sum(abs(beta_i) for beta_i in beta[1:])
Whereas the ridge penalty shrank the coefficients overall, the lasso penalty tends to force coefficients to be zero, which makes it good for learning sparse models. Unfortunately, it’s not amenable to gradient descent, which means that we won’t be able to solve it from scratch.
For Further Exploration
§ Regression has a rich and expansive theory behind it. This is another place where you should consider reading a textbook or at least a lot of Wikipedia articles.
§ scikit-learn has a linear_model module that provides a LinearRegression model similar to ours, as well as Ridge regression, Lasso regression, and other types of regularization too.
§ Statsmodels is another Python module that contains (among other things) linear regression models.