Chapter 7. Hypothesis and Inference
It is the mark of a truly intelligent person to be moved by statistics.
George Bernard Shaw
What will we do with all this statistics and probability theory? The science part of data science frequently involves forming and testing hypotheses about our data and the processes that generate it.
Statistical Hypothesis Testing
Often, as data scientists, we’ll want to test whether a certain hypothesis is likely to be true. For our purposes, hypotheses are assertions like “this coin is fair” or “data scientists prefer Python to R” or “people are more likely to navigate away from the page without ever reading the content if we pop up an irritating interstitial advertisement with a tiny, hard-to-find close button” that can be translated into statistics about data. Under various assumptions, those statistics can be thought of as observations of random variables from known distributions, which allows us to make statements about how likely those assumptions are to hold.
In the classical setup, we have a null hypothesis H₀ that represents some default position, and some alternative hypothesis H₁ that we’d like to compare it with. We use statistics to decide whether we can reject H₀ as false or not. This will probably make more sense with an example.
Example: Flipping a Coin
Imagine we have a coin and we want to test whether it’s fair. We’ll make the assumption that the coin has some probability p of landing heads, and so our null hypothesis is that the coin is fair — that is, that p = 0.5. We’ll test this against the alternative hypothesis p ≠ 0.5.
In particular, our test will involve flipping the coin some number n times and counting the number of heads X. Each coin flip is a Bernoulli trial, which means that X is a Binomial(n, p) random variable, which (as we saw in Chapter 6) we can approximate using the normal distribution:
import math

def normal_approximation_to_binomial(n, p):
    """finds mu and sigma corresponding to a Binomial(n, p)"""
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma
Whenever a random variable follows a normal distribution, we can use normal_cdf to figure out the probability that its realized value lies within (or outside) a particular interval:
# the normal cdf _is_ the probability the variable is below a threshold
normal_probability_below = normal_cdf

# it's above the threshold if it's not below the threshold
def normal_probability_above(lo, mu=0, sigma=1):
    return 1 - normal_cdf(lo, mu, sigma)

# it's between if it's less than hi, but not less than lo
def normal_probability_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
We can also do the reverse — find either the nontail region or the (symmetric) interval around the mean that accounts for a certain level of likelihood. For example, if we want to find an interval centered at the mean and containing 60% probability, then we find the cutoffs where the upper and lower tails each contain 20% of the probability (leaving 60%):
def normal_upper_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z <= z) = probability"""
    return inverse_normal_cdf(probability, mu, sigma)

def normal_lower_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z >= z) = probability"""
    return inverse_normal_cdf(1 - probability, mu, sigma)
In particular, let’s say that we choose to flip the coin n = 1,000 times. If our hypothesis of fairness is true, X should be distributed approximately normally with mean 500 and standard deviation 15.8.
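In code, a sketch of that computation using normal_approximation_to_binomial from above (this defines the mu_0 and sigma_0 that the later snippets rely on):

mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)    # approximately (500.0, 15.8)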
We need to make a decision about significance — how willing we are to make a type 1 error (“false positive”), in which we reject H₀ even though it’s true. For reasons lost to the annals of history, this willingness is often set at 5% or 1%. Let’s choose 5%.
Consider the test that rejects H₀ if X falls outside the bounds given by the symmetric 95% interval around mu_0.
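Using the normal_two_sided_bounds helper sketched above, those cutoffs come out to roughly 469 and 531:

normal_two_sided_bounds(0.95, mu_0, sigma_0)    # approximately (469, 531)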
Assuming p really equals 0.5 (i.e., H₀ is true), there is just a 5% chance we observe an X that lies outside this interval, which is the exact significance we wanted. Said differently, if H₀ is true, then, approximately 19 times out of 20, this test will give the correct result.
We are also often interested in the power of a test, which is the probability of not making a type 2 error, in which we fail to reject H₀ even though it’s false. In order to measure this, we have to specify what exactly H₀ being false means. (Knowing merely that p is not 0.5 doesn’t give you a ton of information about the distribution of X.) In particular, let’s check what happens if p is really 0.55, so that the coin is slightly biased toward heads.
In that case, we can calculate the power of the test as follows.
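A sketch of that calculation, reusing the helpers above (lo, hi, mu_1, and sigma_1 are names introduced here; 0.887 is the approximate value you should get under these assumptions):

# 95% bounds based on the assumption that p is 0.5
lo, hi = normal_two_sided_bounds(0.95, mu_0, sigma_0)

# actual mu and sigma based on p = 0.55
mu_1, sigma_1 = normal_approximation_to_binomial(1000, 0.55)

# a type 2 error means we fail to reject the null hypothesis,
# which happens when X is still in our original interval
type_2_probability = normal_probability_between(lo, hi, mu_1, sigma_1)
power = 1 - type_2_probability    # about 0.887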
Imagine instead that our null hypothesis was that the coin is not biased toward heads, or that p ≤ 0.5. In that case we want a one-sided test that rejects the null hypothesis when X is much larger than 500 but not when X is smaller than 500. So a 5%-significance test involves using normal_probability_below to find the cutoff below which 95% of the probability lies:
hi = normal_upper_bound(0.95, mu_0, sigma_0)
# is 526 (< 531, since we need more probability in the upper tail)
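The corresponding power calculation is again a sketch, reusing mu_1 and sigma_1 from above; 0.936 is the approximate value under these assumptions:

type_2_probability = normal_probability_below(hi, mu_1, sigma_1)
power = 1 - type_2_probability    # about 0.936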
This is a more powerful test, since it no longer rejects H₀ when X is below 469 (which is very unlikely to happen if H₁ is true) and instead rejects H₀ when X is between 526 and 531 (which is somewhat likely to happen if H₁ is true).
p-values
An alternative way of thinking about the preceding test involves p-values. Instead of choosing bounds based on some probability cutoff, we compute the probability — assuming H₀ is true — that we would see a value at least as extreme as the one we actually observed.
For our two-sided test of whether the coin is fair, we compute:
def two_sided_p_value(x, mu=0, sigma=1):
    if x >= mu:
        # if x is greater than the mean, the tail is what's greater than x
        return 2 * normal_probability_above(x, mu, sigma)
    else:
        # if x is less than the mean, the tail is what's less than x
        return 2 * normal_probability_below(x, mu, sigma)
If we were to see 530 heads, we would compute:
two_sided_p_value(529.5, mu_0, sigma_0)    # 0.062
NOTE
Why did we use 529.5 instead of 530? This is what’s called a continuity correction. It reflects the fact that normal_probability_between(529.5, 530.5, mu_0, sigma_0) is a better estimate of the probability of seeing 530 heads than normal_probability_between(530, 531, mu_0, sigma_0) is.
Correspondingly, normal_probability_above(529.5, mu_0, sigma_0) is a better estimate of the probability of seeing at least 530 heads. You may have noticed that we also used this in the code that produced Figure 6-4.
One way to convince yourself that this is a sensible estimate is with a simulation:
import random

extreme_value_count = 0
for _ in range(100000):
    num_heads = sum(1 if random.random() < 0.5 else 0    # count # of heads
                    for _ in range(1000))                 # in 1000 flips
    if num_heads >= 530 or num_heads <= 470:              # and count how often
        extreme_value_count += 1                          # the # is 'extreme'

print(extreme_value_count / 100000)    # 0.062
Since the p-value is greater than our 5% significance, we don’t reject the null. If we instead saw 532 heads, the p-value would be:
two_sided_p_value(531.5, mu_0, sigma_0)    # 0.0463
which is smaller than the 5% significance, so we would reject the null. It’s the exact same test as before; it’s just a different way of approaching the statistics.
Similarly, we would have:
upper_p_value = normal_probability_above
lower_p_value = normal_probability_below
For our one-sided test, if we saw 525 heads we would compute:
upper_p_value(524.5, mu_0, sigma_0)    # 0.061
which means we wouldn’t reject the null. If we saw 527 heads, the computation would be:
upper_p_value(526.5, mu_0, sigma_0)    # 0.047
and we would reject the null.
WARNING
Make sure your data is roughly normally distributed before using normal_probability_above to compute p-values. The annals of bad data science are filled with examples of people opining that the chance of some observed event occurring at random is one in a million, when what they really mean is “the chance, assuming the data is distributed normally,” which is pretty meaningless if the data isn’t.
There are various statistical tests for normality, but even plotting the data is a good start.
Confidence Intervals
We’ve been testing hypotheses about the value of the heads probability p, which is a parameter of the unknown “heads” distribution. When this is the case, a third approach is to construct a confidence interval around the observed value of the parameter.
For example, we can estimate the probability of the unfair coin by looking at the average value of the Bernoulli variables corresponding to each flip — 1 if heads, 0 if tails. If we observe 525 heads out of 1,000 flips, then we estimate p equals 0.525.
How confident can we be about this estimate? Well, if we knew the exact value of p, the central limit theorem (recall “The Central Limit Theorem”) tells us that the average of those Bernoulli variables should be approximately normal, with mean p and standard deviation:
math.sqrt(p * (1 - p) / 1000)
Here we don’t know p, so instead we use our estimate:
p_hat = 525 / 1000                               # 0.525
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)    # 0.0158
This is not entirely justified, but people seem to do it anyway. Using the normal approximation, we conclude that we are “95% confident” that the following interval contains the true parameter p.
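Using the normal_two_sided_bounds helper sketched earlier, that interval works out to roughly [0.494, 0.556]:

normal_two_sided_bounds(0.95, mu, sigma)    # approximately [0.4940, 0.5560]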
This is a statement about the interval, not about p. You should understand it as the assertion that if you were to repeat the experiment many times, 95% of the time the “true” parameter (which is the same every time) would lie within the observed confidence interval (which might be different every time).
In particular, we do not conclude that the coin is unfair, since 0.5 falls within our confidence interval.
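For contrast, consider a more lopsided sample, say 540 heads out of 1,000 flips (an illustrative count; the computation is the same):

p_hat = 540 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)    # 0.0158
normal_two_sided_bounds(0.95, mu, sigma)         # approximately [0.5091, 0.5709]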
Here, “fair coin” doesn’t lie in the confidence interval. (The “fair coin” hypothesis doesn’t pass a test that you’d expect it to pass 95% of the time if it were true.)
P-hacking
A procedure that erroneously rejects the null hypothesis only 5% of the time will — by definition — 5% of the time erroneously reject the null hypothesis.
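A simulation sketch of that claim (run_experiment and reject_fairness are illustrative helpers introduced here, not defined elsewhere in the chapter):

import random

def run_experiment():
    """flip a fair coin 1000 times, True = heads, False = tails"""
    return [random.random() < 0.5 for _ in range(1000)]

def reject_fairness(experiment):
    """using the 5% significance levels"""
    num_heads = len([flip for flip in experiment if flip])
    return num_heads < 469 or num_heads > 531

experiments = [run_experiment() for _ in range(1000)]
num_rejections = len([experiment
                      for experiment in experiments
                      if reject_fairness(experiment)])

print(num_rejections)    # roughly 50, i.e., about 5% of 1,000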
What this means is that if you’re setting out to find “significant” results, you usually can. Test enough hypotheses against your data set, and one of them will almost certainly appear significant. Remove the right outliers, and you can probably get your p-value below 0.05. (We did something vaguely similar in “Correlation”; did you notice?)
This is sometimes called P-hacking and is in some ways a consequence of the “inference from p-values” framework. A good article criticizing this approach is “The Earth Is Round.”
If you want to do good science, you should determine your hypotheses before looking at the data, you should clean your data without the hypotheses in mind, and you should keep in mind that p-values are not substitutes for common sense. (An alternative approach is “Bayesian Inference”.)
Example: Running an A/B Test
One of your primary responsibilities at DataSciencester is experience optimization, which is a euphemism for trying to get people to click on advertisements. One of your advertisers has developed a new energy drink targeted at data scientists, and the VP of Advertisements wants your help choosing between advertisement A (“tastes great!”) and advertisement B (“less bias!”).
Being a scientist, you decide to run an experiment by randomly showing site visitors one of the two advertisements and tracking how many people click on each one.
If 990 out of 1,000 A-viewers click their ad while only 10 out of 1,000 B-viewers click their ad, you can be pretty confident that A is the better ad. But what if the differences are not so stark? Here’s where you’d use statistical inference.
Let’s say that N_A people see ad A, and that n_A of them click it. We can think of each ad view as a Bernoulli trial where p_A is the probability that someone clicks ad A. Then (if N_A is large, which it is here) we know that n_A / N_A is approximately a normal random variable with mean p_A and standard deviation sigma_A = sqrt(p_A * (1 - p_A) / N_A).
Similarly, n_B / N_B is approximately a normal random variable with mean p_B and standard deviation sigma_B = sqrt(p_B * (1 - p_B) / N_B):
def estimated_parameters(N, n):
    p = n / N
    sigma = math.sqrt(p * (1 - p) / N)
    return p, sigma
If we assume those two normals are independent (which seems reasonable, since the individual Bernoulli trials ought to be), then their difference should also be normal with mean p_B - p_A and standard deviation sqrt(sigma_A ** 2 + sigma_B ** 2).
NOTE
This is sort of cheating. The math only works out exactly like this if you know the standard deviations. Here we’re estimating them from the data, which means that we really should be using a t-distribution. But for large enough data sets, it’s close enough that it doesn’t make much of a difference.
This means we can test the null hypothesis that p_A and p_B are the same (that is, that p_B - p_A is zero), using the statistic:
def a_b_test_statistic(N_A, n_A, N_B, n_B):
    p_A, sigma_A = estimated_parameters(N_A, n_A)
    p_B, sigma_B = estimated_parameters(N_B, n_B)
    return (p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2)
which should approximately be a standard normal.
For example, if “tastes great” gets 200 clicks out of 1,000 views and “less bias” gets 180 clicks out of 1,000 views, the statistic equals:
z = a_b_test_statistic(1000, 200, 1000, 180)    # -1.14
The probability of seeing such a large difference if the means were actually equal would be:
two_sided_p_value(z)                            # 0.254
which is large enough that you can’t conclude there’s much of a difference. On the other hand, if “less bias” only got 150 clicks, we’d have:
z = a_b_test_statistic(1000, 200, 1000, 150)    # -2.94
two_sided_p_value(z)                            # 0.003
which means there’s only a 0.003 probability you’d see such a large difference if the ads were equally effective.
Bayesian Inference
The procedures we’ve looked at have involved making probability statements about our tests: “there’s only a 3% chance you’d observe such an extreme statistic if our null hypothesis were true.”
An alternative approach to inference involves treating the unknown parameters themselves as random variables. The analyst (that’s you) starts with a prior distribution for the parameters and then uses the observed data and Bayes’s Theorem to get an updated posterior distribution for the parameters. Rather than making probability judgments about the tests, you make probability judgments about the parameters themselves.
For example, when the unknown parameter is a probability (as in our coin-flipping example), we often use a prior from the Beta distribution, which puts all its probability between 0 and 1:
def B(alpha, beta):
    """a normalizing constant so that the total probability is 1"""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
Generally speaking, this distribution centers its weight at:
alpha / (alpha + beta)
and the larger alpha and beta are, the “tighter” the distribution is.
For example, if alpha and beta are both 1, it’s just the uniform distribution (centered at 0.5, very dispersed). If alpha is much larger than beta, most of the weight is near 1. And if alpha is much smaller than beta, most of the weight is near zero. Figure 7-1 shows several different Beta distributions.
So let’s say we assume a prior distribution on p. Maybe we don’t want to take a stand on whether the coin is fair, and we choose alpha and beta to both equal 1. Or maybe we have a strong belief that it lands heads 55% of the time, and we choose alpha equals 55, beta equals 45.
Then we flip our coin a bunch of times and see h heads and t tails. Bayes’s Theorem (and some mathematics that’s too tedious for us to go through here) tells us that the posterior distribution for p is again a Beta distribution, but with parameters alpha + h and beta + t.
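In code, the update is just parameter addition. A tiny sketch (bayesian_update is a name introduced here for illustration):

def bayesian_update(alpha, beta, heads, tails):
    """posterior Beta parameters after observing the given flips"""
    return alpha + heads, beta + tails

bayesian_update(1, 1, 3, 7)    # (4, 8), matching the uniform-prior example below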
NOTE
It is no coincidence that the posterior distribution was again a Beta distribution. The number of heads is given by a Binomial distribution, and the Beta is the conjugate prior to the Binomial distribution. This means that whenever you update a Beta prior using observations from the corresponding binomial, you will get back a Beta posterior.
Figure 7-1. Example Beta distributions
Let’s say you flip the coin 10 times and see only 3 heads.
If you started with the uniform prior (in some sense refusing to take a stand about the coin’s fairness), your posterior distribution would be a Beta(4, 8), centered around 0.33. Since you considered all probabilities equally likely, your best guess is something pretty close to the observed probability.
If you started with a Beta(20, 20) (expressing the belief that the coin was roughly fair), your posterior distribution would be a Beta(23, 27), centered around 0.46, indicating a revised belief that maybe the coin is slightly biased toward tails.
And if you started with a Beta(30, 10) (expressing a belief that the coin was biased to flip 75% heads), your posterior distribution would be a Beta(33, 17), centered around 0.66. In that case you’d still believe in a heads bias, but less strongly than you did initially. These three different posteriors are plotted in Figure 7-2.
Figure 7-2. Posteriors arising from different priors
If you flipped the coin more and more times, the prior would matter less and less until eventually you’d have (nearly) the same posterior distribution no matter which prior you started with.
For example, no matter how biased you initially thought the coin was, it would be hard to maintain that belief after seeing 1,000 heads out of 2,000 flips (unless you are a lunatic who picks something like a Beta(1000000,1) prior).
What’s interesting is that this allows us to make probability statements about hypotheses: “Based on the prior and the observed data, there is only a 5% likelihood the coin’s heads probability is between 49% and 51%.” This is philosophically very different from a statement like “if the coin were fair we would expect to observe data so extreme only 5% of the time.”
Using Bayesian inference to test hypotheses is considered somewhat controversial — in part because its mathematics can get somewhat complicated, and in part because of the subjective nature of choosing a prior. We won’t use it any further in this book, but it’s good to know about.
For Further Exploration
§ We’ve barely scratched the surface of what you should know about statistical inference. The books recommended at the end of Chapter 5 go into a lot more detail.
§ Coursera offers a Data Analysis and Statistical Inference course that covers many of these topics.