Chapter 6. Probability
The laws of probability, so true in general, so fallacious in particular.
Edward Gibbon
It is hard to do data science without some sort of understanding of probability and its mathematics. As with our treatment of statistics in Chapter 5, we’ll wave our hands a lot and elide many of the technicalities.
For our purposes you should think of probability as a way of quantifying the uncertainty associated with events chosen from some universe of events. Rather than getting technical about what these terms mean, think of rolling a die. The universe consists of all possible outcomes. And any subset of these outcomes is an event; for example, “the die rolls a one” or “the die rolls an even number.”
Notationally, we write P(E) to mean “the probability of the event E.”
We’ll use probability theory to build models. We’ll use probability theory to evaluate models. We’ll use probability theory all over the place.
One could, were one so inclined, get really deep into the philosophy of what probability theory means. (This is best done over beers.) We won’t be doing that.
Dependence and Independence
Roughly speaking, we say that two events E and F are dependent if knowing something about whether E happens gives us information about whether F happens (and vice versa). Otherwise they are independent.
For instance, if we flip a fair coin twice, knowing whether the first flip is Heads gives us no information about whether the second flip is Heads. These events are independent. On the other hand, knowing whether the first flip is Heads certainly gives us information about whether both flips are Tails. (If the first flip is Heads, then definitely it’s not the case that both flips are Tails.) These two events are dependent.
Mathematically, we say that two events E and F are independent if the probability that they both happen is the product of the probabilities that each one happens:

P(E,F) = P(E)P(F)
In the example above, the probability of “first flip Heads” is 1/2, and the probability of “both flips Tails” is 1/4, but the probability of “first flip Heads and both flips Tails” is 0.
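To make this concrete, here is a quick simulation (not code from the book; the setup and variable names are mine) that estimates these probabilities by flipping a pair of fair coins many times:

import random

random.seed(0)
flips = [(random.random() < 0.5, random.random() < 0.5)   # (first is Heads, second is Heads)
         for _ in range(100000)]
n = len(flips)

first_heads  = sum(1 for f, s in flips if f) / n                 # close to 1/2
second_heads = sum(1 for f, s in flips if s) / n                 # close to 1/2
both_heads   = sum(1 for f, s in flips if f and s) / n           # close to 1/4 = 1/2 * 1/2
both_tails   = sum(1 for f, s in flips if not f and not s) / n   # close to 1/4

Here both_heads comes out close to first_heads * second_heads, which is what independence looks like, while the joint event “first flip Heads and both flips Tails” never occurs at all, so its probability is 0 rather than first_heads * both_tails.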
Conditional Probability
When two events E and F are independent, then by definition we have:

P(E,F) = P(E)P(F)

If they are not necessarily independent (and if the probability of F is not zero), then we define the probability of E “conditional on F” as:

P(E|F) = P(E,F)/P(F)

You should think of this as the probability that E happens, given that we know that F happens.
We often rewrite this as:

P(E,F) = P(E|F)P(F)

When E and F are independent, you can check that this gives:

P(E|F) = P(E)

which is the mathematical way of expressing that knowing F occurred gives us no additional information about whether E occurred.
One common tricky example involves a family with two (unknown) children.
If we assume that:
1. Each child is equally likely to be a boy or a girl
2. The gender of the second child is independent of the gender of the first child
then the event “no girls” has probability 1/4, the event “one girl, one boy” has probability 1/2, and the event “two girls” has probability 1/4.
Now we can ask: what is the probability of the event “both children are girls” (B) conditional on the event “the older child is a girl” (G)? Using the definition of conditional probability:

P(B|G) = P(B,G)/P(G) = P(B)/P(G) = (1/4)/(1/2) = 1/2

since the event B and G (“both children are girls and the older child is a girl”) is just the event B. (Once you know that both children are girls, it’s necessarily true that the older child is a girl.)
Most likely this result accords with your intuition.
We could also ask about the probability of the event “both children are girls” conditional on the event “at least one of the children is a girl” (L). Surprisingly, the answer is different from before!
As before, the event B and L (“both children are girls and at least one of the children is a girl”) is just the event B. This means we have:

P(B|L) = P(B,L)/P(L) = P(B)/P(L) = (1/4)/(3/4) = 1/3
How can this be the case? Well, if all you know is that at least one of the children is a girl, then it is twice as likely that the family has one boy and one girl as that it has two girls.
We can check this by “generating” a lot of families:
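One way to do that looks like the following sketch (the helper name random_kid, the seed, and the counters are my choices):

import random

def random_kid():
    """randomly return either "boy" or "girl", equally likely"""
    return random.choice(["boy", "girl"])

both_girls = 0
older_girl = 0
either_girl = 0

random.seed(0)
for _ in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1

print("P(both | older):", both_girls / older_girl)      # should be close to 1/2
print("P(both | either):", both_girls / either_girl)    # should be close to 1/3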
Bayes’s Theorem
One of the data scientist’s best friends is Bayes’s Theorem, which is a way of “reversing” conditional probabilities. Let’s say we need to know the probability of some event E conditional on some other event F occurring. But we only have information about the probability of F conditional on E occurring. Using the definition of conditional probability twice tells us that:

P(E|F) = P(E,F)/P(F) = P(F|E)P(E)/P(F)
The event F can be split into the two mutually exclusive events “F and E” and “F and not E.” If we write ¬E for “not E” (i.e., “E doesn’t happen”), then:

P(F) = P(F,E) + P(F,¬E)

so that:

P(E|F) = P(F|E)P(E) / [P(F|E)P(E) + P(F|¬E)P(¬E)]
which is how Bayes’s Theorem is often stated.
This theorem often gets used to demonstrate why data scientists are smarter than doctors. Imagine a certain disease that affects 1 in every 10,000 people. And imagine that there is a test for this disease that gives the correct result (“diseased” if you have the disease, “nondiseased” if you don’t) 99% of the time.
What does a positive test mean? Let’s use T for the event “your test is positive” and D for the event “you have the disease.” Then Bayes’s Theorem says that the probability that you have the disease, conditional on testing positive, is:

P(D|T) = P(T|D)P(D) / [P(T|D)P(D) + P(T|¬D)P(¬D)]
Here we know that P(T|D), the probability that someone with the disease tests positive, is 0.99. P(D), the probability that any given person has the disease, is 1/10,000 = 0.0001. P(T|¬D), the probability that someone without the disease tests positive, is 0.01. And P(¬D), the probability that any given person doesn’t have the disease, is 0.9999. If you substitute these numbers into Bayes’s Theorem you find

P(D|T) = 0.98%
That is, less than 1% of the people who test positive actually have the disease.
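If you would rather let Python do the arithmetic, the same calculation looks like this (the variable names are mine):

p_t_given_d     = 0.99      # P(T|D), tests positive given the disease
p_d             = 0.0001    # P(D), the prevalence of the disease
p_t_given_not_d = 0.01      # P(T|¬D), the false positive rate
p_not_d         = 0.9999    # P(¬D)

p_d_given_t = (p_t_given_d * p_d) / (p_t_given_d * p_d + p_t_given_not_d * p_not_d)
print(p_d_given_t)          # about 0.0098, i.e. just under 1%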
NOTE
This assumes that people take the test more or less at random. If only people with certain symptoms take the test we would instead have to condition on the event “positive test and symptoms” and the number would likely be a lot higher.
While this is a simple calculation for a data scientist, most doctors will guess that P(D|T) is approximately 2.
A more intuitive way to see this is to imagine a population of 1 million people. You’d expect 100 of them to have the disease, and 99 of those 100 to test positive. On the other hand, you’d expect 999,900 of them not to have the disease, and 9,999 of those to test positive. Which means that you’d expect only 99 out of (99 + 9999) positive testers to actually have the disease.
Random Variables
A random variable is a variable whose possible values have an associated probability distribution. A very simple random variable equals 1 if a coin flip turns up heads and 0 if the flip turns up tails. A more complicated one might measure the number of heads observed when flipping a coin 10 times or a value picked from range(10) where each number is equally likely.
The associated distribution gives the probabilities that the variable realizes each of its possible values. The coin flip variable equals 0 with probability 0.5 and 1 with probability 0.5. The range(10) variable has a distribution that assigns probability 0.1 to each of the numbers from 0 to 9.
We will sometimes talk about the expected value of a random variable, which is the average of its values weighted by their probabilities. The coin flip variable has an expected value of 1/2 (= 0 * 1/2 + 1 * 1/2), and the range(10) variable has an expected value of 4.5.
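As a tiny illustration (the helper below is mine, not the book’s), an expected value is just a probability-weighted sum:

def expected_value(outcomes):
    """outcomes is a list of (value, probability) pairs"""
    return sum(value * probability for value, probability in outcomes)

print(expected_value([(0, 0.5), (1, 0.5)]))             # 0.5, the coin flip variable
print(expected_value([(i, 0.1) for i in range(10)]))    # 4.5, the range(10) variable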
Random variables can be conditioned on events just as other events can. Going back to the two-child example from “Conditional Probability”, if X is the random variable representing the number of girls, X equals 0 with probability 1/4, 1 with probability 1/2, and 2 with probability 1/4.
We can define a new random variable Y that gives the number of girls conditional on at least one of the children being a girl. Then Y equals 1 with probability 2/3 and 2 with probability 1/3. And a variable Z that’s the number of girls conditional on the older child being a girl equals 1 with probability 1/2 and 2 with probability 1/2.
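Reusing the random_kid helper from the family sketch above (again, my own example), you can watch these conditional distributions emerge from simulation:

random.seed(0)
families = [(random_kid(), random_kid()) for _ in range(10000)]   # (younger, older)

at_least_one_girl = [f for f in families if "girl" in f]
older_is_girl     = [f for f in families if f[1] == "girl"]

# average number of girls in each conditional world
print(sum(f.count("girl") for f in at_least_one_girl) / len(at_least_one_girl))  # close to 4/3, i.e. E[Y]
print(sum(f.count("girl") for f in older_is_girl) / len(older_is_girl))          # close to 3/2, i.e. E[Z]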
For the most part, we will be using random variables implicitly in what we do without calling special attention to them. But if you look deeply you’ll see them.
Continuous Distributions
A coin flip corresponds to a discrete distribution — one that associates positive probability with discrete outcomes. Often we’ll want to model distributions across a continuum of outcomes. (For our purposes, these outcomes will always be real numbers, although that’s not always the case in real life.) For example, the uniform distribution puts equal weight on all the numbers between 0 and 1.
Because there are infinitely many numbers between 0 and 1, this means that the weight it assigns to individual points must necessarily be zero. For this reason, we represent a continuous distribution with a probability density function (pdf) such that the probability of seeing a value in a certain interval equals the integral of the density function over the interval.
NOTE
If your integral calculus is rusty, a simpler way of understanding this is that if a distribution has density function f, then the probability of seeing a value between x and x + h is approximately h * f(x) if h is small.
The density function for the uniform distribution is just:
def uniform_pdf(x):
    return 1 if x >= 0 and x < 1 else 0
The probability that a random variable following that distribution is between 0.2 and 0.3 is 1/10, as you’d expect. Python’s random.random() is a [pseudo]random variable with a uniform density.
We will often be more interested in the cumulative distribution function (cdf), which gives the probability that a random variable is less than or equal to a certain value. It’s not hard to create the cumulative distribution function for the uniform distribution (Figure 6-1):
def uniform_cdf(x):
    "returns the probability that a uniform random variable is <= x"
    if x < 0:   return 0   # uniform random is never less than 0
    elif x < 1: return x   # e.g. P(X <= 0.4) = 0.4
    else:       return 1   # uniform random is always less than 1
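As a quick check of the earlier claim (my example, not the book’s), the interval probability falls right out of the cdf:

print(uniform_cdf(0.3) - uniform_cdf(0.2))                 # 0.1, give or take floating-point rounding
print(uniform_cdf(-1), uniform_cdf(0.4), uniform_cdf(2))   # 0 0.4 1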
The Normal Distribution
The normal distribution is the king of distributions. It is the classic bell curve-shaped distribution and is completely determined by two parameters: its mean μ (mu) and its standard deviation σ (sigma). The mean indicates where the bell is centered, and the standard deviation how “wide” it is.
When μ = 0 and σ = 1, it’s called the standard normal distribution. If Z is a standard normal random variable, then it turns out that:

X = σZ + μ

is also normal but with mean μ and standard deviation σ. Conversely, if X is a normal random variable with mean μ and standard deviation σ,

Z = (X - μ) / σ
is a standard normal variable.
The cumulative distribution function for the normal distribution cannot be written in an “elementary” manner, but we can write it using Python’s math.erf:
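A standard implementation looks like the following (a sketch based on the usual erf formula; the mu and sigma defaults are my assumption):

import math

def normal_cdf(x, mu=0, sigma=1):
    """cumulative probability that a Normal(mu, sigma) variable is <= x"""
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2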
Sometimes we’ll need to invert normal_cdf to find the value corresponding to a specified probability. There’s no simple way to compute its inverse, but normal_cdf is continuous and strictly increasing, so we can use a binary search:
def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001):
    """find approximate inverse using binary search"""

    # if not standard, compute standard and rescale
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)

    low_z, low_p = -10.0, 0          # normal_cdf(-10) is (very close to) 0
    hi_z, hi_p = 10.0, 1             # normal_cdf(10) is (very close to) 1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2   # consider the midpoint
        mid_p = normal_cdf(mid_z)    # and the cdf's value there
        if mid_p < p:
            # midpoint is still too low, search above it
            low_z, low_p = mid_z, mid_p
        elif mid_p > p:
            # midpoint is still too high, search below it
            hi_z, hi_p = mid_z, mid_p
        else:
            break

    return mid_z
The function repeatedly bisects intervals until it narrows in on a Z that’s close enough to the desired probability.
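As a quick sanity check (my example, using the functions defined above):

print(inverse_normal_cdf(0.975))               # roughly 1.96, the familiar 97.5th percentile
print(normal_cdf(inverse_normal_cdf(0.25)))    # roughly 0.25, since the two functions are inverses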
The Central Limit Theorem
One reason the normal distribution is so useful is the central limit theorem, which says (in essence) that a random variable defined as the average of a large number of independent and identically distributed random variables is itself approximately normally distributed.
In particular, if x_1, ..., x_n are random variables with mean μ and standard deviation σ, and if n is large, then:

(1/n)(x_1 + ... + x_n)

is approximately normally distributed with mean μ and standard deviation σ/√n. Equivalently (but often more usefully),

(x_1 + ... + x_n - μn) / (σ√n)
is approximately normally distributed with mean 0 and standard deviation 1.
An easy way to illustrate this is by looking at binomial random variables, which have two parameters n and p. A Binomial(n,p) random variable is simply the sum of n independent Bernoulli(p) random variables, each of which equals 1 with probability p and 0 with probability 1 - p:
def bernoulli_trial(p):
    return 1 if random.random() < p else 0

def binomial(n, p):
    return sum(bernoulli_trial(p) for _ in range(n))
The mean of a Bernoulli(p) variable is p, and its standard deviation is √(p(1 - p)). The central limit theorem says that as n gets large, a Binomial(n,p) variable is approximately a normal random variable with mean μ = np and standard deviation σ = √(np(1 - p)). If we plot both, you can easily see the resemblance:
def make_hist(p, n, num_points):
    # assumes Counter, math, and matplotlib.pyplot as plt are already imported

    data = [binomial(n, p) for _ in range(num_points)]

    # use a bar chart to show the actual binomial samples
    histogram = Counter(data)
    plt.bar([x - 0.4 for x in histogram.keys()],
            [v / num_points for v in histogram.values()],
            0.8,
            color='0.75')

    mu = p * n
    sigma = math.sqrt(n * p * (1 - p))

    # use a line chart to show the normal approximation
    xs = range(min(data), max(data) + 1)
    ys = [normal_cdf(i + 0.5, mu, sigma) - normal_cdf(i - 0.5, mu, sigma)
          for i in xs]
    plt.plot(xs, ys)
    plt.title("Binomial Distribution vs. Normal Approximation")
    plt.show()
For example, when you call make_hist(0.75, 100, 10000), you get the graph in Figure 6-4.
The moral of this approximation is that if you want to know the probability that (say) a fair coin turns up more than 60 heads in 100 flips, you can estimate it as the probability that a Normal(50,5) is greater than 60, which is easier than computing the Binomial(100,0.5) cdf. (Although in most applications you’d probably be using statistical software that would gladly compute whatever probabilities you want.)
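Concretely, using the normal_cdf defined earlier (my quick estimate):

# P(more than 60 heads in 100 flips of a fair coin), approximated by the
# upper tail of a Normal(mu=50, sigma=5)
print(1 - normal_cdf(60, 50, 5))      # roughly 0.023
print(1 - normal_cdf(60.5, 50, 5))    # roughly 0.018, with a continuity correction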
For Further Exploration
- scipy.stats contains pdf and cdf functions for most of the popular probability distributions.
- Remember how, at the end of Chapter 5, I said that it would be a good idea to study a statistics textbook? It would also be a good idea to study a probability textbook. The best one I know that’s available online is Introduction to Probability.