Predicting items you can count one by one - Business statistics with Excel and Tableau (2015)

Business statistics with Excel and Tableau (2015)

12. Predicting items you can count one by one

Why you need to know this. Many business decisions involve counts: either binary (yes/no) or within time or space. It would be good to know the probability of a certain number of customers com- ing up to a servicedesk in a certain length of time; or the probability of a certain number of car accidents at a given intersection. We are looking for a discrete probability distribution; discrete because the number of occurrences is aninteger.

Chapter 4 on regression showed how to predict the size of an outcome which was continuous (money or time perhaps). The dependent variable—what we were trying to predict—could take on almost any value.Now we want to predict probabilities for the occurrence of an independent variable which is an integer. Below we’ll work through some hands-on applications, with the theory available in the Glossary.

We’ll break this into two parts:

The probability of abinary outcome. The probabilities of two faulty items out of the next twenty on the production line. Or, the probability of at least five customers out of the next fifty actually buyingsomething. This is estimated with the binomial distribution.

The probability of a particular count of occurrences over time or area. The probability of three or fewer people arriving at your Customer Service Desk within the next half hour. Or more than two mistakes in thenext ten lines of code. This is estimated with the Poisson distribution.


12.1 Predicting with the binomial distribution

After the normal distribution (see the Glossary for a definition), the binomial is the most important in statistics. The math for the binomial distribution is also defined in the Glossary.

The binomial distribution provides the probability of a ‘success’ in a certain number of ‘trials’. For example, you can calculate the probability of more than seven out of the next twenty people through the dooractually buying something. Here a ‘success’ is somebody buying something; while the number of ‘trials’ is the number of people coming through the door (here it is twenty).

Some definitions:

What we are looking for is the probability of a pre-defined number of ‘successes’. Note that the definition of success is up to the analyst. It could be ‘spending more than $20’ or wearing a hat.

In the example we’ll work through below you run a store. You know that the long-run probability of somebody buying something is 0.6. You obtained this number by counting total numbers of customers and, out ofthose, the numbers who actually bought something (successes).

Worked example: You want to know the probability of exactly three people through the door out of the next five buying something. Note that ‘success’ just means that the defined event happens. Whether or not it is a‘good thing’ isn’t relevant.

For Excel, the arguments required are: the random variable whose probability we want to predict; number of trials, the long-run probability, and a true/false statement. In the example, the number of trials is five. Therandom variable (X) whose probability we want to predict is 3. We also need the long-run probability of a success. In the example, p = 0.6. We also want the probability of exactly 3, so use the false statement. I’ll go intothis in some detail shortly.

Open an Excel spreadsheet, and type in =BINOM.DIST(3,5,0.6,false) and you should get this: 0.3456. This is the probability of exactly three people out of the next five buying something (number of suc- cesses out ofthe next five trials). Notice the order of the arguments in the Excel function. And especially the last one, which in the example above is false. (The alternative is true). The difference is important because the results are quite different.

True and false argument. Defining the argument as false provides the probability of exactly the random variable. Defining the argu- ment as true provides a cumulative probability.

Excel adds up probabilities from the left, so changing the argu- ments to =BINOM.DIST(3,5,0.6,TRUE) = 0.66304 is the sum of the probabilities of X=0 + X=1 and so on up to an including X=3. The probabilityof 0.66304 is therefore the probability of three or fewer customers buying something. The table below shows the probabilities of various values of X, both ‘false’ and ‘cumulative’. As you can see, the cumulative isjust the continued addition of each successive probability. There is a further example here¹

Probabilities calculated both false and true

How about more than three customers buying something? In math notation, we’re looking for P(X >=3). We know that probabilities must sum up to 1. We know that three or fewer is 0.66304. So more than five has tobe: 1 - 0.66304 = 0.33696.

A slightly harder example: the probability that at least four cus- tomers out of the next ten will buy something? We’re looking for P(X >=4). Look carefully at the notation. X >= 4 implies that the distribution of the tencustomers is split into two halves: less than four and four or more. It is the latter half whose probability we’re looking for.

If we can find the probability of 0 + 1 + 2 + 3, that is the probability of less than four (we only want integers here). So find that probability and then subtract from 1, making use of the fact that probabilities must sum up to1.

Let’s do this step by step.

First, find P(X = < 3), that is the probability that X is three or less:

=BINOM.DIST (3,10,0.6,true) = 0.054762. Notice the ‘true’ which gives us the cumulative probability.

We want four or more, so subtract from 1 like this: 1 - 0.054762 = 0.945238.

Another example. It’s winter and you need to wear a sweater every day. You have two blue and three red sweaters. Calculate the probability that during the week you will wear a red sweater:

Exactlytwice in the week

Answer: the long-run probability of picking a red sweater is 3/5

= 0.6 because you have five sweaters, and three of them are red. The number of trials is 7 because there are 7 days in a week. The wording of the question contains the word ‘exactly’ which means that we don’t want acumulative answer, so we’ll include the FALSE argument. Therefore the answer is =binom.dist(2, 7, 0.6, FALSE) = 0.077414

More than three times. When you see words such as ‘more than’ that’s a clue that you’re looking for a cumulative probability. In math notation, we’re looking for P(X>3). So if we find the cumulative probability up to and including two, and then sub- tract from one, we’re done. The cumulative probability of two

or fewer is P(X=0) + P(X=1) + P(X=2). So the answer is = 1 - binom.dist(2,7,0.6,true) = 0.903744

The keys:

Write out what you are trying to predict in math notation. This forces you to be clear. Draw a little sketch (hopefully better than mine!) if you get confused.

If the wording of your problem contains ‘exactly’ or requires the probability of just one particular outcome (eg P(x=3)) then you want to use false.

12.2 Predicting with the Poisson distribution

The binomial distribution gave us the probability of a binary outcome (yes/no) out of a certain number of trials. The Poisson gives us the probability of a certain number within a specified time-frame or area.

Here’s the example. You run the Customer Service Desk. You want to know the probability of five or fewer customers arriving in the next half hour. That would be useful for staffing, wouldn’t it? You know that the onaverage, 20 customers arrive every half hour. You know that because have counted them.

We want: P(X>=5). Like the binomial above, the Poisson has a true/false argument for whether we want cumulative probabilities or ‘exact’. The notation include a > sign, which implies cumulative, so we use TRUE.In Excel: =POISSON.DIST(5,20,true). The answer is 7.19088E-05. This looks a bit weird, but it is just math notation. The E-05 means that you should move the decimal point 5 places to the left. So there are fourzeroes in front of the leading 7. In other words, an extremely small probability. Perhaps send some staff out for lunch? It is very small because 5 is a long way from your long run average of 20.