# Excel Data Analysis For Dummies, 2nd Edition (2014)

### Part III. Advanced Tools

### Chapter 9. Using the Statistics Functions

*In This Chapter*

Counting items in a data set

Using means, modes, and medians

Finding values, ranks, and percentiles

Calculating standard deviations and variances

Using normal distributions

Using t-distributions and f-distributions

Understanding binomial distributions

Using chi-square distributions

Excel supplies a bunch of statistical functions … more than 70, in fact. These functions help you dig more deeply into the characteristics of data that you’ve stored in an Excel worksheet, list, or pivot table. In this chapter, I discuss and illustrate each of the statistical functions that you’re likely to use. I also briefly describe some of the very esoteric statistical functions.

*Counting Items in a Data Set*

Excel provides four useful statistical functions for counting cells within a worksheet or list: COUNT, COUNTA, COUNTBLANK, and COUNTIF. Excel also provides two useful functions for counting permutations and combinations: PERMUT and COMBIN.

*COUNT: Counting cells with values*

The COUNT function counts the number of cells within a specified range that hold values. The function, however, doesn’t count cells containing the logical values TRUE or FALSE or cells that are empty. The function uses the syntax

=COUNT(*value1*,[*value2*])

If you want to use the COUNT function to count the number of values in the range B2:B10 in the worksheet shown in Figure *9-1*, you might enter the formula

=COUNT(B2:B10)

into cell G2, as shown in the figure. The function returns the value 9.

**Figure 9-1:** A worksheet fragment for illustrating the counting functions.

** Note:** You can include several arguments as part of the range argument in the COUNT function. For example, in Figure

*9-1*, you might also use the syntax =COUNT(B2,B3:B5,B6:B7,B8,B9), which would return the same result as the formula shown in the figure.

*COUNTA: Alternative counting cells with values*

The COUNTA function counts the number of cells within a specified range that aren’t empty. The function uses the syntax

=COUNTA(*value1*,[*value2*])

If you want to use the COUNTA function to count the number of non-empty cells in the range A1:B2 in the worksheet shown in Figure *9-1*, for example, enter the formula

=COUNTA(A1:B2)

into cell G4. The function returns the value 3.

*COUNTBLANK: Counting empty cells*

The COUNTBLANK function counts the number of cells within a specified range that are empty. The function uses the syntax

=COUNTBLANK(*value1*,[*value2*])

To use the COUNTBLANK function to count the number of empty cells in the range A1:B2 in the worksheet shown in Figure *9-1*, for example, you could enter the formula

=COUNTBLANK(A1:B2)

into cell G6. The function returns the value 1.

*COUNTIF: Counting cells that match criteria*

The COUNTIF function counts the number of cells within a specified range that match criteria that you specify. The function uses the syntax

=COUNTIF(*range*,*criteria*)

where *range* is the worksheet range in which you count cells and *criteria* is a Boolean expression, enclosed in quotation marks, that describes your criteria.

As an example of how this works, suppose you want to use the COUNTIF function to count the number of cells within the worksheet range C1:C10 that hold values greater than 4. To make this count, you use the following formula:

=COUNTIF(C1:C10,">4")

This formula appears in cell G8 of the worksheet shown in Figure *9-1*.

You can use other Boolean operators to construct other match criteria: Use the < operator for a less-than comparison, the <= operator for a less-than-or-equal-to comparison, the >= operator for a greater-than-or-equal-to comparison, the = operator for the equal-to comparison, and the <> operator for a not-equal-to comparison.

*PERMUT: Counting permutations*

The PERMUT function counts the number of permutations possible when selecting a sample from a population. Note that for a permutation, the order does matter in which items are selected. The function uses the syntax

=PERMUT(*number*,*number_chosen*)

where *number* is the number of items in the population and *number_chosen* is the number of items selected. Given a population of six items and three selections, for example, you calculate the number of permutations by using the formula

=PERMUT(6,3)

The function returns the value 120, indicating that 120 different ways exist in which three items can be selected from a set of six.

*COMBIN: Counting combinations*

If the order in which items are selected doesn’t matter, you use the combination function, COMBIN, which uses the syntax

=COMBIN(*number*,*number_chosen*)

The number of combinations possible when three items are selected from a set of six can be calculated using the formula

=COMBIN(6,3)

This function returns the value 20. The COMBIN function isn’t technically an Excel statistical function, by the way, but it seems so closely related to the PERMUT function that I include a description here.

*Means, Modes, and Medians*

Excel provides a handful of functions for calculating means, modes, and medians.

*AVEDEV: An average absolute deviation*

The AVEDEV function provides a measure of dispersion for a set of values. To do this, the function looks at a set of values and calculates the average absolute deviation from the mean of the values. The function uses the syntax

=AVEDEV(*number1*,[*number2*])

where *number1,[number2]* is a worksheet reference to the range that stores the values.

** Note:** As is the case with many other simple statistical functions, you can include several arguments as part of the range argument in the AVEDEV function. For example, the formulas =AVEDEV(B1,B2:B5,B6:B7,B8,B9) and =AVEDEV(B1:B9) are equivalent.

Suppose you have three values — 100, 200, and 300 — in the worksheet range that you supply to the AVEDEV function. The average of these three values is 200, calculated as (100+200+300)/3. The average of the deviations from the mean is 66.6667, calculated as:

(|100-200|+|200-200|+|300-200|)/3

** Note:** The AVEDEV function calculates the average of the absolute value of the deviation. For this reason, the function calculates absolute differences, or deviations, from the mean.

The AVEDEV function isn’t used in practice. Mostly a teaching tool, educators and trainers sometimes use the average deviation measure of dispersion to introduce the more useful but also more complicated measures of dispersion: the standard deviation and variance.

*AVERAGE: Average*

The AVERAGE function calculates the arithmetic mean for a set of values. The function uses the syntax

=AVERAGE(*number1*,[*number2*])

where *number1,[number2]* is a worksheet reference to the range that stores the values.

If your argument includes the three values — 100, 200, and 300— the function returns the value 200 because (100+200+300)/3 equals 200.

*AVERAGEA: An alternate average*

The AVERAGEA function, like the AVERAGE function, calculates the arithmetic mean for a set of values. The difference with the AVERAGEA function, however, is that AVERAGEA includes cells with text and the logical value for FALSE in its calculations as 0. The AVERAGEA function includes the logical value for TRUE in its calculations as 1. The function uses the syntax

=AVERAGEA(*number1*,[*number2*])

where *number1,[number2]* is a worksheet reference to the range that stores the values — and possibly text as well as logical values.

If your argument includes three values — 100, 200, and 300— and three text labels in the worksheet range that you supply to the AVERAGEA function, the function returns the value 100 because (100+200+300+0+0+0)/6 equals 100.

** Note:** As is the case with the AVERAGE function, you can supply up to 255 arguments to the AVERAGEA function.

*TRIMMEAN: Trimming to a mean*

The TRIMMEAN function calculates the arithmetic average of a set of values but only after discarding a specified percentage of the lowest and highest values from the set. The function uses the syntax

=TRIMMEAN(*array*,*percent*)

where *array* is the range holding the values and *percent* is the decimal value that gives the percentage of values that you want to discard. For example, to calculate the arithmetic mean of the values stored in the worksheet range C2:C10 in Figure *9-2* only after discarding 10 percent of the data — the top 5 percent and the bottom 5 percent — you use the following formula:

=TRIMMEAN(C2:C10,0.1)

**Figure 9-2:** A worksheet fragment that shows how TRIMMEAN works.

*MEDIAN: Median value*

The MEDIAN function finds the middle value in a set of values: Half the values fall below and half the values fall above the median. The function uses the syntax

=MEDIAN(*number1*,[*number2*])

If you use the MEDIAN function to find the median of a range holding the values 1, 2, 3, 4, and 5, for example, the function returns the value 3.

** Note:** You can supply up to 255 arguments to the MEDIAN function.

If you use the MEDIAN function to find the median of a range holding the values 1, 2, 3, and 4, the function returns the value 2.5. Why? Because if you have an even number of data entries, Excel calculates a median by averaging the two middle values.

*MODE: Mode value*

The MODE function finds the most common value in your data set, but the function ignores empty cells and cells that store text or return logical values. The function uses the syntax

=MODE(*number1*,[*number2*])

If you use the MODE function to find the most common value in a range holding the values 1, 2, 3, 4, 4, and 4, the function returns the value 4.

** Note:** You can supply up to 255 arguments to the MODE function.

*GEOMEAN: Geometric mean*

The GEOMEAN function calculates the geometric mean of a set of values. The *geometric mean* equals the *n*th root of the product of the numbers. The function uses the syntax

=GEOMEAN(*number1*,[*number2*]...)

where *number1* and, optionally, other similar arguments supply the values that you want to geometrically average.

*HARMEAN: Harmonic mean*

The HARMEAN function calculates the reciprocal of the arithmetic mean of the reciprocals of a data set. The function uses the syntax

=HARMEAN(*number1*,[*number2*]...)

where *number1* and, optionally, other similar arguments supply the values that you want to harmonically average.

*Finding Values, Ranks, and Percentiles*

Excel provides functions for finding the largest or smallest values in a data set and also for finding values with a particular rank and for ranking values within the data set. Excel also provides a couple of tangentially related functions for calculating frequency distributions and simple probabilities for data sets. I describe all these function tools in the next sections.

*MAX: Maximum value*

The MAX function finds the largest value in your data. The function ignores blank cells and cells containing text or logical values such as TRUE and FALSE and uses the syntax

=MAX(*number1*,[*number2*])

If the largest value in the range A1:G500 is 50, the function =MAX(A1:G500) returns the value 50.

** Note:** You can supply up to 255 arguments to the MAX function.

*MAXA: Alternate maximum value*

In a fashion similar to the MAX function, the MAXA function also finds the largest value in your data. However, unlike the MAX function, the MAXA function includes logical values and text. The logical value TRUE equals 1, the logical value FALSE equals 0, and text also equals 0. The MAXA function uses the syntax

=MAXA(*number1*,[*number2*])

*MIN: Minimum value*

The MIN function finds the smallest value in your data. The function ignores blank cells and cells containing text or logical values such as TRUE and FALSE and uses the syntax

=MIN(*number1*,[*number2*])

If the smallest value in the range A1:G500 is 1, the function =MIN(A1:G500) returns the value 1.

*MINA: Alternate minimum value*

The MINA function also finds the smallest value in your data, but the MINA function includes logical values and text. The logical value TRUE equals 1, the logical value FALSE equals 0, and text also equals 0. The MINA function uses the syntax

=MINA(*number1*,[*number2*])

If the smallest value in the range A1:G500 is 1 but this range also includes text values, the function =MINA(A1:G500) returns the value 0.

*LARGE: Finding the kth largest value*

You can use the LARGE function to find the *k*th largest value in an array. The function uses the syntax

=LARGE(*array*,*k*)

where *array* is the array of values and *k* identifies which value you want the function to return. For example, if you store the values 1, 3, 5, 8, and 9 in the worksheet range A1:A5 and you want the function to return the second largest value, use the following formula:

=LARGE(A1:A5,2)

The function returns the value 8 because that’s the second largest value in the array.

*SMALL: Finding the kth smallest value*

The SMALL function finds the *k*th smallest value in an array. The function uses the syntax

=SMALL(*array*,*k*)

where *array* is the array of values and *k* identifies which value you want to find and have the function return. For example, if you store the values 1, 3, 5, 8, and 9 in the worksheet range A1:A5 and you want the function to return the second smallest value, use the following formula:

=SMALL(A1:A5,2)

The function returns the value 3 because that’s the second smallest value in the array.

*RANK: Ranking an array value*

The RANK function determines the rank, or position, of a value in an array. The formula uses the syntax

=RANK(*number*,*ref*,[*order*])

where *number* is the value you want to rank, *ref* is the array of values, and optionally *order* indicates whether array values should be arranged in descending order (indicated with a 0 or logical FALSE value) or in ascending order (indicated with a 1 or logical TRUE value). By the way, Excel ranks duplicate values the same, but these duplicates do affect the rank of subsequent numbers. If you leave out the *order* argument, Excel ranks values in descending order.

To demonstrate how the RANK function works, suppose you want to rank the values shown in the worksheet range A1:A9 in Figure *9-3*.

**Figure 9-3:** A worksheet fragment with the array 1, 2, 3, 4, 4, 5, 6, 7, 8.

The formula in cell G2

=RANK(6,A1:A9)

returns the value 3, indicating that when a descending order is used, the value 6 is the third value in the array.

The formula in cell G4

=RANK(6,A1:A9,1)

returns the value 7, indicating that when an ascending order is used, the value 6 is the seventh value in the array.

*PERCENTRANK: Finding a percentile ranking*

The PERCENTRANK function determines the percentage rank, or percentile, of a value in an array. The formula uses the syntax

=PERCENTRANK(*array*,*x*,[*significance*])

where *array* gives the array of values, *x* identifies the value you want to rank, and *significance* identifies the number of decimal places that you want in the percentage. The *significance* argument is optional. If you omit the argument, Excel assumes that you want three significant digits.

To demonstrate how the PERCENTRANK function works, again suppose you want to rank the values shown in the worksheet range A1:A9 in Figure *9-3* — only this time, you rank the values using percentages.

The formula in cell G6

=PERCENTRANK(A1:A9,6,2)

returns the value 0.75, which is the same thing as 75 percent.

Excel calculates the percentage rank by looking at the number of array values greater than the x value and the number of array values smaller than the x value. The array shown earlier in Figure *9-3* includes the values 1, 2, 3, 4, 4, 5, 6, 7, 8. The percent rank of 6 in the array equals 0.75 because six array values are smaller than 6 and two array values are larger than 6. The actual formula that the function calculates is 6/(2+6), which equals 0.75.

*PERCENTILE: Finding a percentile ranking*

The PERCENTILE function determines the array value at a specified percentile in an array. The formula uses the syntax

=PERCENTILE(*array*,*k*)

where *array* gives the array of values and *k* gives the percentile of the value that you want to find.

To find the value at the 75-percent percentile in the array of values shown in the worksheet range A1:A9 in Figure *9-3*, use the formula

=PERCENTILE(A1:A9,.75)

The function returns the value 6 because the value 6 is at the 75th percentile in this array. This formula appears in cell G8 in the worksheet shown in Figure *9-3*.

To repeat something in the earlier discussion of the PERCENTRANK function, note that Excel calculates the percentage rank by looking at the number of array values greater than the x value and the number of array values smaller than the x value. For the array shown in Figure *9-3*, the array includes the values 1, 2, 3, 4, 4, 5, 6, 7, 8. The percent rank of 6 in the array equals 0.75 because six array values are smaller than 6 and two array values are larger than 6.

*FREQUENCY: Frequency of values in a range*

The FREQUENCY function counts the values in an array that fall within a range, or bin. The function uses the syntax

=FREQUENCY(*data_array*,*bins_array*)

where *data_array* is the worksheet range that holds the values that you want to count and *bins_array* is a worksheet range that identifies the ranges of values, or bins, that you want to use to create a frequency distribution. Take a look at Figure *9-4*, for example.

**Figure 9-4:** A worksheet that illustrates how the FREQUENCY function works.

To categorize the values in the worksheet range A2:A20 using the bins shown in B2:B6, select the worksheet range C2:C6 and enter the formula

=FREQUENCY(A2:A20,B2:B6)

Then press Ctrl+Shift+Enter to tell Excel that the function formula should be entered as an array. Excel enters your formula into each of the cells in the worksheet range C2:C6, with the result shown in Figure *9-4*.

In cell C2, the function uses the bin value in cell B2 to count up all the data values greater than 0 and less than or equal to 80. In cell C3, the function counts up all the data values greater than 80 but less than 90, and so on. Note that you need to arrange your bin range values in ascending order.

** Working with array formulas**

You can use array formulas to return arrays. For example, you can create an array formula that adds the array 1, 2, 3 to the array 4, 5, 6. This formula produces an array result: 5, 7, 9. The worksheet fragment below shows this. The array in range A1:C1 is added to the array in range A2:C2, and the resulting array is placed into the range A3:C3.

If you want to try entering this formula yourself, create a worksheet that holds the values shown in A1:C2. Then, select the range A3:C3, type the formula **=A1:C1+A2:C2**, and press Ctrl+Shift+Enter. Excel enters the same formula, {=A1:C1+A2:C2}, into each of the cells in the worksheet range A3:C3. You don’t enter the braces, by the way. Excel enters those for you when you press Ctrl+Shift+Enter. The array formula tells Excel to calculate different values for different cells. Excel calculates the value for cell A3 by adding the values in A1 and A2. Excel calculates the value in cell B3 by adding the values in B1 and B2, and so on.

*PROB: Probability of values*

The PROB function uses a set of values and associated probabilities to calculate the probability that a variable equals some specified value or that a variable falls with a range of specified values. The function uses the syntax

=PROB(*x_range*,*prob_range*,*lower_limit*,[*upper_limit*])

where *x_range* equals the worksheet range that holds your values and *prob_range* holds the worksheet range that specifies the probabilities for the values from *x_range*. To calculate the probability that a variable equals a specified value, enter that value using the *lower_limit*argument. To calculate the probability that a variable falls within a range, enter the bounds of that range using the *lower_limit* and *upper_limit* arguments.

Although the PROB function seems complicated at first blush, take a peek at the worksheet shown in Figure *9-5*. The worksheet range A1:A10 holds the values, and the worksheet range B1:B10 holds the probability of those values.

**Figure 9-5:** A worksheet fragment for demonstrating the PROB function.

To calculate the probability that a value equals 4, use the formula

=PROB(A1:A10,B1:B10,4)

In what shouldn’t be a surprise to you, given the value shown in cell B4, this function returns the value 15.00%, as shown in cell G3 in Figure *9-5*. To calculate the probability that a value falls from 4 to 7, use the formula

=PROB(A1:A10,B1:B10,4,7)

The function returns the value 68.00%, which is the sum of the values in the range B4:B7. Figure *9-5* also shows this formula result in cell G5.

*Standard Deviations and Variances*

I’m sure that this will be a big surprise to you. Excel provides almost a dozen functions for calculating standard deviations and variances. A *standard deviation,* by the way, describes dispersion (spread of data) about (around) the data set’s mean. You can kind of think of a standard deviation as an *average* deviation from the mean. A *variance* is just the squared standard deviation. You often use variances and standard deviations in other statistical calculations and as arguments to other statistical functions.

*STDEV: Standard deviation of a sample*

The STDEV function calculates the standard deviation of a sample, a measure of how widely values in a data set vary around the mean — and a common input to other statistical calculations. The function uses the syntax

=STDEV(*number1*,[*number2*])

To calculate the standard deviation of the worksheet range A1:A5 using the STDEV function, for example, use the formula

=STDEV(A1:A5)

If the worksheet range holds the values 1, 4, 8, 9, and 11, the function returns the standard deviation value 4.037326.

The STDEV function lets you include up to 255 arguments as inputs; those arguments can be values, cell references, formulas, and range references. The STDEV function ignores logical values, text, and empty cells.

*STDEVA: Alternate standard deviation of a sample*

The STDEVA function calculates the standard deviation of a sample, but unlike the STDEV function, STDEVA doesn’t ignore the logical values TRUE (which is 1) and FALSE (which is 0). The function uses the syntax

=STDEVA(*number1*,[*number2*])

STDEVA arguments, which can number up to 255, can be values, cell references, formulas, and range references.

**Population statistics compared with sample statistics**

How do you know whether you’re supposed to be using sample versions of statistical functions (such as STDEV and STDEVA) or population versions of statistical functions (such as STDEVP and STDEVPA)? If you’re looking at all the values — the key word is *all *— you’re working with the entire population. In this case, use one of the population standard deviation functions. If you’re working with samples — which are just portions of the population — use one of the sample standard deviation functions.

*STDEVP: Standard deviation of a population*

The STDEVP function calculates the standard deviation of a population to measure how widely values vary around the mean. The function uses the syntax

=STDEVP(*number1*,[*number2*])

To calculate the standard deviation of the worksheet range A1:A5 using the STDEVP function, for example, use the formula

=STDEVP(A1:A5)

If the worksheet range holds the values 1, 4, 8, 9, and 11, the function returns the standard deviation value 3.611094.

The STDEVP function lets you include up to 255 arguments as inputs; the arguments can be values, cell references, formulas, and range references. The STDEV function ignores logical values, text, and empty cells.

*STDEVPA: Alternate standard deviation of a population*

The STDEVPA function calculates the standard deviation of a population, but unlike the STDEVP function, STDEVPA doesn’t ignore the logical values TRUE (which is 1) and FALSE (which is 0). The function uses the syntax

=STDEVPA(*number1*,[*number2*])

STDEVPA arguments, which can number up to 255, can be values, cell references, formulas, and range references.

*VAR: Variance of a sample*

The VAR function calculates the variance of a sample, another measure of how widely values in a data set vary around the mean. The VAR function uses the syntax

=VAR(*number1*,[*number2*])

A standard deviation is calculated by finding the square root of the variance.

To calculate the variance of the worksheet range A1:A5 using the VAR function, for example, use the formula

=VAR(A1:A5)

If the worksheet range holds the values 1, 4, 8, 9, and 11, the function returns the standard deviation value 16.3.

The VAR function lets you include up to 255 arguments as inputs; the arguments can be values, cell references, formulas, and range references. The VAR function ignores logical values, text, and empty cells.

*VARA: Alternate variance of a sample*

The VARA function calculates the variance of a sample, but unlike the VAR function, VARA doesn’t ignore the logical values TRUE (which is 1) and FALSE (which is 0). The function uses the syntax

=VARA(*number1*,[*number2*])

VARA arguments, which can number up to 255, can be values, cell references, formulas, and range references.

*VARP: Variance of a population*

The VARP function calculates the variance of a population. The function uses the syntax

=VARP(*number1*,[*number2*])

To calculate the variance of the worksheet range A1:A5 using the VARP function, for example, use the formula

=VARP(A1:A5)

If the worksheet range holds the values 1, 4, 8, 9, and 11, the function returns the standard deviation value 13.04.

The VARP function lets you include up to 255 arguments as inputs; the arguments can be values, cell references, formulas, and range references. The VARP function ignores logical values, text, and empty cells.

*VARPA: Alternate variance of a population*

The VARPA function calculates the variance of a population, but unlike the VARP function, VARPA doesn’t ignore the logical values TRUE (which is 1) and FALSE (which is 0). The function uses the syntax

=VARPA(*number1*,[*number2*])

VARPA arguments, which can number up to 255, can be values, cell references, formulas, and range references.

*COVARIANCE.P and COVARIANCE.S: Covariances*

Excel supplies two covariance functions: COVARIANCE.S and COVARIANCE.P. The COVARIANCE.S function calculates the covariance of a sample and the COVARIANCE.P function calculates the covariance of a population. The covariance statistics, then, calculate the average of the products of the deviations between pairs of values and uses the syntax

=COVARIANCE.S(*array1*,*array2*)

Or

=COVARIANCE.P(*array1*,*array2*)

where *array1* is the worksheet range holding the first values in the pair and *array2* is the worksheet range holding the second values in the pair.

*DEVSQ: Sum of the squared deviations*

The DEVSQ function calculates the deviations of values from a mean, squares those deviations, and then adds them up. The function uses the syntax

=DEVSQ(*number1*,[*number2*]...)

where *number1* and, optionally, *number2* are worksheet ranges or arrays that hold your values.

*Normal Distributions*

Excel provides five useful functions for working with normal distributions. Normal distributions are also known as *bell curves* or *Gaussian distributions.*

Pssst. Hey buddy, wanna to see how a normal distribution changes when you noodle with its mean and standard deviation? Visit the Stanford University web page at

http://statweb.stanford.edu/~naras/jsm/NormalDensity/NormalDensity.html

*NORM.DIST: Probability X falls at or below a given value*

The NORM.DIST function calculates the probability that variable X falls below or at a specified value. The NORM.DIST function uses the syntax

=NORM.DIST(x,*mean*,*standard_dev*,*cumulative*)

where *x* is the variable that you want to compare, *mean* is the population mean, *standard_dev* is the population standard deviation, and *cumulative* is a logical value that tells Excel whether you want a cumulative probability or a discrete probability.

Here's an example of how you might use the NORM.DIST function: Suppose you want to calculate the probability that some goofball with whom you work actually does have an IQ above 135 like he’s always bragging. Further suppose that the population mean IQ equals 100 and that the population standard deviation for IQs is 15. (I don’t know whether these numbers are true. I do vaguely remember reading something about this in *Zen and the Art of Motorcycle Maintenance* when I was in high school.)

In this case, you use the following formula:

=NORM.DIST(135,100,15,1)

The function returns the value .990185, indicating that if the inputs are correct, roughly 99 percent of the population has an IQ at or below 135. Or, slightly restated, this means the chance that your co-worker has an IQ above 135 is less than 1 percent.

If you want to calculate the probability that your co-worker has an IQ equal to exactly 135, use the following formula:

=NORM.DIST(135,100,15,0)

This function returns the value .001748 indicating that .1748 percent, or roughly one-sixth of a percent, of the population has an IQ equal to 135.

To be very picky, statisticians might very well tell you that you can’t actually calculate the probability of a single value, such as the probability that somebody’s IQ equals 135. When you set the cumulative argument to 0, therefore, what actually happens is that Excel roughly estimates the probability by using a small range about the single value.

*NORM.INV: X that gives specified probability*

The NORM.INV function makes the inverse calculation of the NORM.DIST function. NORM.INV calculates the variable X that gives a specified probability. The NORM.INV function uses the syntax

=NORM.INV(*probability*,*mean*,*standard_dev*)

where *probability* is the percentage that you want the variable X value to fall at or below, *mean* is the population mean, and *standard_dev* is the population standard deviation.

Okay, here’s something that I do remember from *Zen and the Art of Motorcycle Maintenance*. In that book, the protagonist says that he has an IQ that occurs only once in every 50,000 people. You can turn this into a percentile using the formula 1–1/50000, which returns the value.99998.

To calculate the IQ level (which is the variable X) that occurs only every 50,000 people and again assuming that the IQ mean is 100 and that the standard deviation is 15 IQ points, you use the following formula:

=NORM.INV(.99998,100,15)

The formula returns the value 162, when rounded to the nearest whole number.

*NORM.S.DIST: Probability variable within z-standard deviations*

For normal distributions, the NORM.S.DIST function calculates the probability that a random variable is within z-standard deviations of the mean. The function uses the syntax

=NORM.S.DIST(*z*)

To find the probability that a randomly selected variable from a data set is within 2 standard deviations from the mean, use the following formula:

=NORM.S.DIST(2)

which returns the value 0.97725, indicating that there is a 97.725-percent chance that the variable falls within 2 standard deviations of the mean.

*NORM.S.INV: z-value equivalent to a probability*

The NORM.S.INV function is the inverse of the NORM.S.DIST function. If you know the probability that a randomly selected variable is within a certain distance of the mean, you can calculate the z-value by using the NORM.S.INV function to describe the distance in standard deviations. The function uses the syntax

=NORM.S.INV(*probability*)

where *probability* is a decimal value between 0 and 1. To find the z-value for 99 percent, for example, you use the following formula:

=NORM.S.INV(0.99)

The function returns the z-value 2.326348, indicating that there is a 99-percent chance that a randomly selected variable is within 2.326348 standard deviations of the mean.

*STANDARDIZE: z-value for a specified value*

The STANDARDIZE function returns the z-value for a specified variable. The z-value describes the distance between a value and the mean in terms of standard deviations. The function uses the syntax

=STANDARDIZE(*x*,*mean*,*standard_dev*)

where *x* is the variable for which you want to calculate a z-value, *mean* is the arithmetic mean, and *standard_dev* is the standard deviation.

For example, to calculate the z-value for the variable 6600 given a mean equal to 6000 and a standard deviation equal to 800, you use the following formula:

=STANDARDIZE(6600,6000,800)

The function returns the z-value 0.75.

With a z-value, you can use the NORM.S.DIST function to calculate the probability that a randomly selected variable falls within the area calculated as the mean plus or minus the z-value. The probability that a randomly selected variable falls within the area that equals the mean plus or minus the z-value 0.75 is calculated using the formula =NORM.S.DIST(0.75). This function returns the probability value 0.773373, indicating that there’s a 77.3373 chance that a variable will fall within 0.75 standard deviations of the mean.

*CONFIDENCE: Confidence interval for a population mean*

The CONFIDENCE.NORM and CONFIDENCE.T functions calculate a value that you can use to create a confidence intervals for population means based on the sample mean. These definitions amount to a mouthful, but in practice what these functions do is straightforward.

Suppose that, based on a sample, you calculate that the mean salary for a chief financial officer for a particular industry equals $100,000. You might wonder how close this sample mean is to the actual population mean. Specifically, you might want to know what range of salaries, working at a 95-percent confidence level, includes the population mean.

The CONFIDENCE.NORM function calculates the number that you use to create this interval using the syntax

=CONFIDENCE.NORM(*alpha*,*standard_dev*,*size*)

where *alpha* equals 1 minus the confidence level, *standard_dev* equals the standard deviation of the population, and *size* equals the number of values in your sample.

If the standard deviation for the population equals $20,000 and the sample size equals 100, use the formula

=CONFIDENCE.NORM(1-.95,20000,100)

The function returns the value $3920 (rounded to the nearest dollar). This interval suggests that if the average chief financial officer’s salary in your sample equals $100,000, there’s a 95-percent chance that the population mean of the chief financial officers’ salaries falls within the range $96,080 to $103,920.

The CONFIDENCE.T function works in roughly the same way arguments do, but uses a Student T-distribution rather than a normal distribution. The CONFIDENCE.T function uses the following syntax:

=CONFIDENCE.T(*alpha*,*standard_dev*,*size*)

where *alpha* equals 1 minus the confidence level, *standard_dev* equals the standard deviation of the population, and *size* equals the number of values in your sample.

If the standard deviation for the population equals $20,000 and the sample size equals 100, use the formula

=CONFIDENCE.T(1-.95,20000,100)

The function returns the value $3968 (rounded to the nearest dollar). This interval suggests that if the average chief financial officer’s salary in your sample equals $100,000, there’s a 95-percent chance that the population mean of the chief financial officers’ salaries falls within the range $96,032 to $103,968.

*KURT: Kurtosis*

The KURT function measures the tails in a distribution. The function uses the syntax

=KURT(*number1*,[*number2*]...)

where *number1* is a value, cell reference, or range reference. Optionally, you can include additional arguments that provide values, cell references, and ranges.

The kurtosis of a normal distribution equals 0. A kurtosis greater than 0 means the distribution’s tails are larger than for a normal distribution. A kurtosis less than 0 means the distribution’s tails are smaller than for a normal distribution.

*SKEW and SKEW.P: Skewness of a distribution*

The SKEW and SKEW.P functions measure the symmetry of a distribution of values. Both functions use the same syntax, so I’m just going to describe the SKEW.P function here.

The SKEW.P function uses the syntax

=SKEW.P(*number1*,[*number2*])

To illustrate this function, suppose that you want to measure the skewness of a perfectly symmetrical distribution, such as the uniformly distributed values 1, 2, 3, 4, 5, 6, 7 and 8. No skewness exists here, right? You can prove this lack of skewness using the formula

=SKEW.P(1,2,3,4,5,6,7,8)

Which returns the value 0.

If a distribution’s values *tail* (that is, stretch out) to the right, it means the distribution includes greater numbers of large values (or larger values) than a symmetrical distribution would. Thus the skewness is positive. For example, the formula

=SKEW.P(1,2,3,4,5,6,8,8)

returns the value 0.07925.

If the distribution’s values tail (stretch out) to the left, meaning that the distribution includes greater numbers of small values or smaller values than a symmetrical observation would, the skewness is negative. For example, the formula

=SKEW(1,1,3,4,5,6,7,8)

returns the value −0.07924.

*t-distributions*

When you’re working with small samples — less than 30 or 40 items — you can use what’s called a *student t-value* to calculate probabilities rather than the usual z-value, which is what you work with in the case of normal distributions. Excel provides six t-distribution functions, which I discuss in the following paragraphs.

*T.DIST: Left-tail Student t-distribution*

The T.DIST function returns the student’s left-tailed distribution and uses the syntax

=T.DIST(*x*,*deg_freedom*,*cumulative*)

where *x* equals the t-value, *deg_freedom* equals the degrees of freedom, and *cumulative* is a logical value that determines whether the function returns cumulative distribution value or a probability density. You set the cumulative argument to 0 to return a probability density and to1 to return a cumulative distribution. For example, to calculate the left-tailed probability density of the t-value 2.093025 given 19 degrees of freedom, you use the following formula:

=T.DIST(2.093025,19,0)

which returns the value 0.049455, or roughly 5-percent.

Student t-distribution measures let you estimate probabilities for normally distributed data when the sample size is small (say, 30 items or fewer). You can calculate the degrees of freedom argument by subtracting 1 from the sample size. For example, if the sample size is 20, the degrees of freedom equal 19.

*T.DIST.RT: Right-tail Student t-distribution*

The T.DIST.RT function returns the student’s right-tailed distribution and uses the syntax

=T.DIST.RT(*x*,*deg_freedom*)

where *x* equals the t-value and *deg_freedom* equals the degrees of freedom. For example, to calculate the right-tailed probability density of the t-value 2.093025 given 19 degrees of freedom, you use the following formula:

=T.DIST.RT(2.093025,19)

which returns the value 0.02500, or roughly 2.5-percent.

*T.DIST.2T: Two-tail Student t-distribution*

The T.DIST.2T function returns the two-tailed student t-distribution and uses the syntax

=T.DIST.2T(*x*,*deg_freedom*)

where *x* equals the t-value and *deg_freedom* equals the degrees of freedom. For example, to calculate the two-tailed probability density of the t-value 2.093025 given 19 degrees of freedom, you use the following formula:

=T.DIST.2T(2.093025,19)

which returns the value 0.049999, or roughly 5-percent.

*T.INV: Left-tailed Inverse of Student t-distribution*

The T.INV function calculates the left-tailed inverse of a student t-distribution. The function uses the syntax

=T.INV(*probability*,*deg_freedom*)

where *probability* is the probability percentage and *deg_freedom* equals the degrees of freedom. To calculate the t-value given a 5-percent probability and 19 degrees of freedom, for example, use the following formula:

=T.INV(0.05,19)

which returns the t-value -1.729132.

*T.INV.2T: Two-tailed Inverse of Student t-distribution*

The T.INV.2T function calculates the two-tailed inverse of a student t-distribution. The function uses the syntax

=T.INV.@t(*probability*,*deg_freedom*)

where *probability* is the probability percentage and *deg_freedom* equals the degrees of freedom. To calculate the two-tailed t-value given a 5-percent probability and 19 degrees of freedom, for example, use the following formula:

=T.INV.2T(0.05,19)

which returns the t-value -2.093024.

*T.TEST: Probability two samples from same population*

The T.TEST function returns the probability that two samples come from the same populations with the same mean. The function uses the syntax

=T.TEST(*array1*,*array2*,*tails*,*type*)

where *array1* is a range reference holding the first sample, *array2* is a range reference holding the second sample, *tails* is either the value 1 (representing a one-tailed probability) or 2 (representing a two-tailed probability), and *type* tells Excel which type of t-test calculation to make. You set *type* to 1 to perform a paired t-test, to 2 to perform a *homoscedastic* test (a test with two samples with equal variance), or to 3 to perform a *heteroscedastic* test (a test with two samples with unequal variance).

*f-distributions*

f-distributions are probability distributions that compare the ratio in variances of samples drawn from different populations. That comparison produces a conclusion regarding whether the variances in the underlying populations resemble each other.

*F.DIST: Left-tailed f-distribution probability*

The F.DIST function returns the left-tailed probability of observing a ratio of two samples’ variances as large as a specified f-value. The function uses the syntax

=F.DIST(*x*,*deg_freedom1*,*deg_freedom2,cumulative*)

where *x* is specified f-value that you want to test; *deg_freedom1* is the degrees of freedom in the first, or numerator, sample; *deg_freedom2* is the degrees of freedom in the second, or denominator, sample, and *cumulative* is a logical value (either 0 or 1) that tells Excel whether you want to calculate the cumulative distribution (indicated by setting *cumulative* to 0) or the probability density (indicated by setting *cumulative* to 1).

As an example of how the F.DIST function works, suppose you compare two sample’s variances, one equal to 2 and one equal to 4. This means the f-value equals 0.5. Further assume that both samples number 10 items, which means both samples have degrees of freedom equal to 9 and that you want to calculate a cumulative probability. The formula

=F.DIST(2/4,9,9,0)

returns the value 0.6851816.

*F.DIST.RT: Right-tailed f-distribution probability*

The F.DIST.RT function resembles the F.DIST function. F.DIST.RT returns the right-tailed probability of observing a ratio of two samples’ variances as large as a specified f-value. The function uses the syntax

=F.DIST.RT(*x*,*deg_freedom1*,*deg_freedom2,cumulative*)

where *x* is specified f-value that you want to test; *deg_freedom1* is the degrees of freedom in the first, or numerator, sample; *deg_freedom2* is the degrees of freedom in the second, or denominator, sample, and *cumulative* is a logical value (either 0 or 1) that tells Excel whether you want to calculate the cumulative distribution (indicated by setting *cumulative* to 0) or the probability density (indicated by setting *cumulative* to 1).

As an example of how the F.DIST.RT function works, suppose you compare two sample’s variances, one equal to 2 and one equal to 4. This means the f-value equals 0.5. Further assume that both samples number 10 items, which means both samples have degrees of freedom equal to 9and that you want to calculate a cumulative probability. The formula

=F.DIST.RT(2/4,9,9)

returns the value 0.841761 suggesting that there’s roughly an 84-percent probability that you might observe an f-value as large as 0.5 if the samples’ variances were equivalent.

*F.INV:Left-tailed f-value given f-distribution probability*

The F.INV function returns the left-tailed f-value equivalent to a given f-distribution probability. The function uses the syntax

=F.INV(*probability*,*deg_freedom1*,*deg_freedom2*)

where *probability* is probability of the f value that you want to find; *deg_freedom1* is the degrees of freedom in the first, or numerator, sample; and *deg_freedom2* is the degrees of freedom in the second, or denominator, sample.

*F.INV.RT:Right-tailed f-value given f-distribution probability*

The F.INV.RT function returns the right-sided f-value equivalent to a given f-distribution probability. The function uses the syntax

=F.INV.RT(*probability*,*deg_freedom1*,*deg_freedom2*)

where *probability* is probability of the f-value that you want to find; *deg_freedom1* is the degrees of freedom in the first, or numerator, sample; and *deg_freedom2* is the degrees of freedom in the second, or denominator, sample.

*F.TEST: Probability data set variances not different*

The F.TEST function compares the variances of two samples and returns the probability that variances aren’t significantly different. The function uses the syntax

=F.TEST(*array1*,*array2*)

where *array1* is a worksheet range holding the first sample and *array2* is a worksheet range holding the second sample.

*Binomial Distributions*

Binomial distributions let you calculate probabilities in two situations:

· When you have a limited number of independent trials, or tests, which can either succeed or fail

· When success or failure of any one trial is independent of other trials

I also discuss Excel’s sole hypergeometric distribution function here with the binomial functions because, as you’ll see if you slog through this discussion, hypergeometric distributions are related to binomial distributions.

*BINOM.DIST: Binomial probability distribution*

The BINOM.DIST function finds the binomial distribution probability. The function uses the syntax

=BINOM.DIST(*number_s*,*trials*,*probability_s*,*cumulative*)

where *number_s* is the specified number of successes that you want, *trials* equals the number of trials you’ll look at, *probability_s* equals the probability of success in a trial, and *cumulative* is a switch that’s set to either the logical value TRUE (if you want to calculate the cumulative probability) or the logical value FALSE (if you want to calculate the exact probability).

For example, if a publisher wants to know the probability of publishing three best-selling books out of a set of ten books when the probability of publishing a best-selling book is ten percent, the formula is

=BINOM.DIST(3,10,.1,FALSE)

which returns the value 0.0574. This indicates that there’s roughly a 6-percent chance that in a set of ten books, a publisher will publish exactly three best-selling books.

To calculate the probability that a publisher will publish either one, two, or three bestsellers in a set of ten books, the formula is

=BINOM.DIST(3,10,.1,TRUE)

which returns the value 0.9872, which indicates that there is roughly a 99-percent chance that a publisher will publish between one and three bestsellers in a set of ten books.

*BINOM.INV: Binomial probability distribution*

The BINOM.INV function find smallest value for which the cumulative binomial distribution equals or exceeds a specified criterion, or alpha, value. The function uses the syntax

=BINOM.INV(*trials*,*probability_s*,*alpha*)

where *trials* equals the number of Bernoulli trials you’ll look at, *probability_s* equals the probability of success in a trial, and *alpha* equals the criterion value you want to meet or beat.

If you set the trials to 10, the probability to .5 and the criterion value to .75, for example, the formula is

=BINOM.INV(10,0.5,0.75)

which returns the value 6.

*BINOM.DIST.RANGE: Binomial probability of Trial Result*

The BINOM.DIST.RANGE function finds the probability of a trial result or a range of trial results for a binomial distribution. The function uses the syntax

=BINOM.DIST.RANGE(*trials*,*probability_s*,*number_s,[number_s2]*)

where *trials* equals the number of trials you’ll look at, *probability_s* equals the probability of success in a trial, *number_s* sets the number of successful trials, and *number_s2* (which is an optional argument) sets the maximum number of successful trials. (If you do set the maximum number of successful trials using the *number_s2* argument, *number_s* sets the minimum number of trials.)

If you set the trials to 10, the probability to .5 and the number of successful trials to 3, for example, the formula is

=BINOM.DIST.RANGE(10,0.5,3)

which returns the value 0.11718, meaning the probability of having exactly three successful trials equals roughly 12%.

If you set the trials to 10, the probability to .5 and the number of successful trials to anything from 3 to 10, for example, the formula is

=BINOM.DIST.RANGE(10,0.5,3,10)

which returns the value 0.9453, meaning the probability of the number of successful trials range anywhere from 3 to 10 equals roughly 95%.

*NEGBINOM.DIST: Negative binominal distribution*

The NEGBINOM.DIST function finds the probability that a specified number of failures will occur before a specified number of successes based on a probability-of-success constant. The function uses the syntax

=NEGBINOM.DIST(*number_f*,*number_s*,*probability_s*)

where *number_f* is the specified number of failures, *number_s* is the specified number of successes, *probability_s* is the probability of success, and *cumulative* is a switch you set to 0 or FALSE if you want a cumulative distribution and to 1 or TRUE if you want a probability distribution.

For example, suppose you’re a wildcat oil operator and you want to know the chance of failing to find oil in exactly ten wells before you find oil in exactly one well. If the chance for success is 5 percent, you can find the chance that you’ll fail ten times before drilling and finding oil by using the formula

=NEGBINOM.DIST(10,2,.05,0)

which returns the value 0.016465266, indicating that there’s less than a 2-percent chance that you’ll fail ten times before hitting a gusher.

*CRITBINOM: Cumulative binomial distribution*

The CRITBINOM function, which is really an old Excel function and available in recent versions of Excel for reasons of backwards compatibility, finds the smallest value for which the cumulative binomial distribution equals or exceeds a criterion value. The function uses the syntax

=CRITBINOM(*trials*,*probability_s*,*alpha*)

where *trials* is the number of Bernoulli trials, *probability_s* is the probability of success for each trial, and *alpha* equals your criterion value. Both the *probability_s* and *alpha* arguments must fall between 0 and 1.

*HYPGEOM.DIST: Hypergeometric distribution*

The HYPERGEOMETRIC function returns the probability of a specified number of sample successes. A hypergeometric distribution resembles a binomial distribution except with a subtle difference. In a hypergeometric distribution, the success in one trial affects the success in another trial. Typically, you use a the HYPGEOM.DIST function when you take samples from a finite population and don’t replace the samples for subsequent trials. The function uses the syntax

=HYPGEOM.DIST(sample_s,number_sample,population_s,number_pop,cumulative)

where *sample_s* equals the specified number of sample successes, *number_sample* gives the size of the sample, *population_s* gives the number of successes in the population, *number_pop* gives the size of the population, and *cumulative* is a switch which tells Excel to return either a cumulative distribution (indicated with a 1 or TRUE argument value) or a probability density (indicated with a 0 or FALSE argument value).

As an example of a hypergeometric distribution, suppose you want to calculate the probability that in a sample of 30 items, 5 will be successful. Further suppose you know that within a 4,000-item population, 1,000 are successful. You use the following formula to make this calculation:

=HYPGEOM.DIST(5,30,1000,4000,0)

which returns the value 0.0104596, indicating that the chances that exactly 5 items will be successful in a set of 30 items given the characteristics of the population equals roughly 10 percent.

*Chi-Square Distributions*

I get very confused, personally, when I start working with statistical measures that are more complicated than those simple calculations that you learn in junior high. Yet the chi-square functions, which I discuss next, really are practical. I take this one slow and use an easy-to-understand example.

Even if you’re going to use only one of the chi-square functions, read through all three function descriptions. Viewed as a set of statistical tools, the functions make quite a bit more sense.

*CHISQ.DIST.RT: Chi-square distribution*

The CHISQ.DIST.RT function, which calculates the right-tailed probability of a chi-squared distribution, calculates a level of significance using the chi-square value and the degrees of freedom. The chi-square value equals the sum of the squared standardized scores. The function uses the syntax

=CHISQ.DIST.RT(*x*,*deg_freedom*)

where *x* equals the chi-square value and *deg_freedom* equals the degrees of freedom.

As an example of how all this works, suppose you’re more than a little suspicious of some slot machine that shows one of six pictures: diamonds, stars, cowboy boots, cherries, oranges, or pots of gold. With six possibilities, you might expect that in a large sample, each of the six possibilities would appear roughly one-sixth of the time. Say the sample size is 180, for example. In this case, you might expect that each slot machine possibility appears 30 times because 180/6 equals 30. If you built a worksheet fragment like the one shown in Figure *9-6*, you could analyze the one-armed bandit.

**Figure 9-6:** A worksheet fragment we’ll use to look at chi-square measures.

To calculate the level of significance using the data shown in Figure *9-6* and the chi-square distribution function, you could enter the following formula into D10:

=CHISQ.DIST.RT(D8,5)

The function returns the value 0.010362338, which is the level of significance that a chi-square value of 15 is due to sampling error.

Cell D8 holds the chi-square value, which is simply the sum of the squared differences between the observed and expected values. For example, the value in cell D2 is calculated using the formula =+(B2–C2)^2/C2 to return the value 3.333333333. Predictably, similar formulas in the range D3:D7 calculate the squared differences for the other slot machine symbols. And, oh, by the way, the formula in cell D8 is =SUM(D2:D7).

The bottom line: It doesn’t look good, does it? I mean, that there’s only a 1-percent chance that the slot machine that you’re worried about could actually produce the observed values due to chance. Very suspicious… .

*CHISQ.DIST: Chi-square distribution*

The CHISQ.DIST function resembles the CHISQ.DIST.RT function but calculates the *left-tailed* probability of a chi-squared distribution. The function uses the syntax

=CHISQ.DIST(*x*,*deg_freedom,cumulative*)

where *x* equals the chi-square value, *deg_freedom* equals the degrees of freedom, and *cumulative* is a switch you set to 0 or FALSE if you want to calculate a probability density and to 1 or TRUE if you want to calculate a cumulative probability.

*CHISQ.INV.RT: Right-tailed chi-square distribution probability*

The CHISQ.INV.RT function returns the inverse of the right-tailed probability of a chi-square distribution. The function uses the syntax

=CHISQ.INV.RT(*probability*,*deg_freedom*)

where *probability* equals the level of significance and *deg_freedom* equals the degrees of freedom.

To show you an example of the CHISQ.INV.RT function, refer to the worksheet fragment shown in Figure *9-6*. With six possible outcomes on the slot machine, you have five degrees of freedom. Therefore, if you want to calculate the chi-square that’s equivalent to a 0.010362338 level of significance, you could enter the following formula into cell D12:

=CHISQ.INV.RT(D10,5)

This function returns the value 14.99996888, which is pretty darn close to 15 … so I call it 15. Note that I use D10 as the first probability argument because that cell holds the level of significance calculated by the CHISQ.DIST function.

*CHISQ.INV: Left-tailed chi-square distribution probability*

The CHISQ.INV function returns left-tailed probability of a chi-square distribution. The function uses the syntax

=CHISQ.INV(*probability*,*deg_freedom*)

where *probability* equals the level of significance and *deg_freedom* equals the degrees of freedom.

To calculate the chi-square value that’s equivalent to a 0.010362338 level of significance with 5 degrees of freedom, you could enter the following formula into a cell in the worksheet shown in Figure *9-6*:

=CHISQ.INV(0.010362338,5)

This function returns the value .562927.

*CHISQ.TEST: Chi-square test*

The chi-square test function lets you assess whether differences between the observed and expected values represent chance, or *sampling error.* The function uses the syntax

=CHISQ.TEST(*actual_range*,*expected_range*)

Again referring to the example of the suspicious slot machine shown in Figure *9-6*, you could perform a chi-square test by entering the following formula into cell D14 and then comparing what you observe with what you expect:

=CHISQ.TEST(B2:B7,C2:C7)

The function returns the p-value, or probability, shown in Figure *9-6* in cell D14, indicating that only a 1.0362-percent chance exists that the differences between the observed and expected outcomes stem from sampling error.

A common feature of a chi-square test is comparison of the p-value — again the value that the CHISQ.TEST function returns — to a level of significance. For example, in the case of the suspicious slot machine, you might say, “Because it’s not possible to be 100-percent sure, we’ll say that we want a 95-percent probability, which corresponds to a 5-percent level of significance.” If the p-value is less than the level of significance, you assume that something is fishy. Statisticians, not wanting to sound so earthy, have another phrase for this something-is-fishy conclusion:*rejecting the null hypothesis.*

*Regression Analysis*

Excel’s regression functions let you perform regression analysis. In a nutshell, *regression analysis* involves plotting pairs of independent and dependent variables in an XY chart and then finding a linear or exponential equation that describes the plotted data.

*FORECAST: Forecast dependent variables using a best-fit line*

The FORECAST function finds the y-value of a point on a best-fit line produced by a set of x- and y-values given the x-value. The function uses the syntax

=FORECAST(*x*,*known_y's*,*known_x's*)

where *x* is the independent variable value, *known_y's* is the worksheet range holding the dependent variables, and *known_x's* is the worksheet range holding the independent variables.

The FORECAST function uses the *known_y's* and *known_x's* values that you supply as arguments to calculate the y=mx+b equation that describes the best-fit straight line for the data. The function then solves that equation using the *x* argument that you supply to the function.

To use the linear regression functions such as the FORECAST function, remember the equation for a line is y=mx+b*. y* is the dependent variable, *b* is the y-intercept or constant, *m* is the slope, and *x* gives the value of the independent variable.

*INTERCEPT: y-axis intercept of a line*

The INTERCEPT function finds the point where the best-fit line produced by a set of x- and y-values intersects the y-axis. The function uses the syntax

=INTERCEPT(*known_y's*,*known_x's*)

where *known_y's* is the worksheet range holding the dependent variables and *known_x's* is the worksheet range holding the independent variables.

If you’ve ever plotted pairs of data points on an XY graph, the way the INTERCEPT function works is pretty familiar. The INTERCEPT function uses the *known_y's* and *known_x's* values that you supply as arguments to calculate the best-fit straight line for the data — essentially figuring out the y=mx+b equation for the line. The function then returns the *b* value because that’s the value of the equation when the independent, or *x*, variable equals zero.

*LINEST*

The LINEST function finds the *m* and *b* values for a line based on sets of *known_y's* and *known_x's* variables. The function uses the syntax

=LINEST(*known_y's*,[*known_x's*],[*const*],[*stats*])

where *known_y's* equals the array of y-values that you already know, *known_x's* supplies the array of x-values that you may already know, *const* is a switch set to either TRUE (which means the constant *b* equals 0) or to TRUE (which means the constant *b* is calculated), and*stats* is another switch set to either TRUE (which means the function returns a bunch of other regression statistics) or FALSE (which means *enough already*).

*SLOPE: Slope of a regression line*

The SLOPE function calculates the slope of a regression line using the x- and y-values. The functions uses the syntax

=SLOPE(*known_y's*,*known_x's*)

An upward slope indicates that the independent, or *x*, variable positively affects the dependent, or *y*, variable. In other words, an increase in *x* produces an increase in *y.* A downward slope indicates that the independent, or *x*, variable negatively affects the dependent, or *y,* variable. The steeper the slope, the greater the effect of the independent variable on the dependent variable.

*STEYX: Standard error*

The STEYX function finds the standard error of the predicted y-value of each of the x-values in a regression. The function uses the syntax

=STEYX(*known_y's*,*known_x's*)

*TREND*

The TREND function finds values along a trend line, which the function constructs using the method of least squares. The syntax looks like this:

=TREND(*known_y's*,[*known_x's*],[*new_x's*],[*const*])

*LOGEST: Exponential regression*

The LOGEST function returns an array that describes an exponential curve that best fits your data. The function uses the syntax

=LOGEST(*known_y's*,[*known_x's*],[*const*],[*stats*])

where *known_y's* is the set of y-values, *known_x's* is the set of x-values, *const* is a switch set to either TRUE (which means that *b* is calculated normally) or FALSE (which means that *b* is forced to equal 1), and *stats* is a switch that’s set to either TRUE (in which case, the LOGEST function returns a bunch of additional regression statistics) or FALSE (which tells the function to skip returning all the extra information).

In an exponential regression, Excel returns an equation that takes the form y=abx that best fits your data set.

*GROWTH: Exponential growth*

The GROWTH function calculates exponential growth for a series of new x-values based on existing x-values and y-values. The function uses the syntax

=GROWTH(*known_y's*,[*known_x's*],[*new_x's*],[*const*])

where *known_y's* is the set of y-values, *known_x's* is the set of x-values, *new_x's* is the set of x-values for which you want to calculate new y-values, and *const* is a switch set to either TRUE (which means that *b* is calculated normally) or FALSE (which means that *b* is forced to equal 1).

*Correlation*

Excel’s correlation functions let you quantitatively explore the relationships between variables.

*CORREL: Correlation coefficient*

The CORREL function calculates a correlation coefficient for two data sets. The function uses the syntax

=CORREL(*array1*,*array2*)

where *array1* is a worksheet range that holds the first data set and *array2* is a worksheet range that holds the second data set. The function returns a value between −1 (which would indicate a perfect, negative linear relationship) and +1 (which would indicate a perfect, positive linear relationship).

*PEARSON: Pearson correlation coefficient*

The PEARSON calculates a correlation coefficient for two data sets by using a different formula than the CORREL function does but one that should return the same result. The function uses the syntax

=PEARSON(*array1*,*array2*)

where *array1* is a worksheet range that holds the first data set and *array2* is a worksheet range that holds the second data set. The function returns a value between −1 (which would indicate a perfect, negative linear relationship) and +1 (which would indicate a perfect, positive linear relationship).

*RSQ: r-squared value for a Pearson correlation coefficient*

The RSQ function calculates the r-squared square of the Pearson correlation coefficient. The function uses the syntax

=RSQ(*known_y's*,*known_x's*)

where *known_y's* is an array or worksheet range holding the first data set and *known_x's* is an array or worksheet range holding the second data set. The r-squared value describes the proportion of the variance in *y* stemming from the variance in *x*.

*FISHER*

The FISHER function converts Pearson’s r-squared value to the normally distributed variable *z* so you can calculate a confidence interval. The function uses the syntax

=FISHER(*r*)

*FISHERINV*

The FISHERINV function, the inverse of the FISHER function, converts *z* to Pearson’s r-squared value. The function uses the syntax

=FISHERINV(*y*)

*Some Really Esoteric Probability Distributions*

Excel supplies several other statistical functions for working with probability distributions. It’s very unlikely, it seems to me, that you’ll ever work with any of these functions except in an upper-level college statistics course, thus I go over these tools quickly. A couple of them, though — the ZTEST and the POISSON functions, in particular — are actually pretty useful.

*BETA.DIST: Cumulative beta probability density*

The BETA.DIST function finds the cumulative beta probability density — something that you might do to look at variation in the percentage of some value in your sample data. The Excel online Help file, for example, talks about using the function to look at the fraction of the day that people spend watching television. And I recently read a fisheries management study that uses beta probability distributions to report on the effects of setting aside a percentage of marine habitat for reserves. The function uses the syntax

=BETA.DIST(*x*,*alpha*,*beta*,*cumulative*,[*A*],[*B*])

where *x* is a value between the optional bounds *A* and *B*, *alpha* and *beta* are the two positive parameters, and *cumulative* is a switch you set to 0 or FALSE if you want to calculate a cumulative distribution and to 1 or TRUE if you want to calculate a probability density. If *x* equals.5, *alpha* equals 75, *beta* equals 85, you set the *cumulative* switch to 1 to calculate a probability density, and *A* equals 0, and *B* equals 1, use following formula:

=BETA.DIST(.5,75,85,1,0,1)

This function returns the value 0.786080098.

If you leave out the optional bounds arguments, Excel assumes that *A* equals 0 and that *B* equals 1. The function =BETADIST(.5,75,85), for example, is equivalent to =BETADIST(.5,75,85,0,1).

*BETA.INV: Inverse cumulative beta probability density*

The BETA.INV function returns the inverse of the cumulative beta probability density function. That is, you use the BETA.DIST function if you know *x* and want to find the probability; and you use the BETA.INV function if you know the probability and want to find *x*. The BETA.INV function uses the syntax

=BETA.INV(*probability*,*alpha*,*beta*,[*A*],[*B*])

*EXPON.DIST: Exponential probability distribution*

The EXPON.DIST function calculates an exponential distribution, which can be used to describe the probability that an event takes a specified amount of time. The function uses the syntax

=EXPON.DIST(*x*,*lambda*,*cumulative*)

where *x* is the value you want to evaluate, *lambda* is the inverse of the mean, and *cumulative* is a switch set to either TRUE (if you want the function to return the probability up to and including the *x* value) or FALSE (if you want the function to return the exact probability of the *x*value).

For example, suppose that at a certain poorly run restaurant, you usually have to wait 10 minutes for your waitperson to bring a glass of water. That’s the *average wait time,* in other words. To determine the probability that you’ll get your water in 5 minutes or less, use the formula

=EXPON.DIST(5,1/10,TRUE)

which returns the value 0.393469, indicating you have (roughly) a 39-percent chance of getting something to drink in 5 minutes or less.

To determine the probability that you’ll get your water in exactly 5 minutes, you use the formula

=EXPON.DIST(5,1/10,FALSE)

which returns the value 0.060653, indicating there’s roughly a 6-percent chance that you’ll get something to drink in exactly 5 minutes.

*GAMMA.DIST: Gamma distribution probability*

The GAMMA.DIST function finds the gamma distribution probability of the random variable *x*. The function uses the syntax

=GAMMA.DIST(*x*,*alpha*,*beta*,*cumulative*)

where *x* equals the random variable, *alpha* and *beta* describe the constant rate, and *cumulative* is a switch set to TRUE if you want a cumulative probability and FALSE if you want an exact probability.

If *x* equals 20, *alpha* equals 5, *beta* equals 2, and *cumulative* is set to TRUE, you use the formula

=GAMMA.DIST(20,5,2,TRUE)

which returns the value 0.97075, indicating the probability equals roughly 97 percent.

If *x* equals 20, *alpha* equals 5, *beta* equals 2, and *cumulative* is set to FALSE, you use the formula

=GAMMA.DIST(20,5,2,FALSE)

which returns the value 0.00946, indicating the probability is less than 1 percent.

*GAMMAINV: X for a given gamma distribution probability*

The GAMMAINV function finds the *x* value associated with a given gamma distribution probability. The function uses the syntax

=GAMMAINV(*probability*,*alpha*,*beta*)

where *probability* equals the probability for the *x* value you want to find and *alpha* and *beta* are the parameters to the distribution.

*GAMMALN: Natural logarithm of a gamma distribution*

The GAMMALN function finds the natural logarithm of the gamma function. The GAMMALN function uses the syntax

=GAMMALN(*x*)

*LOGNORMDIST: Probability of lognormal distribution*

The LOGNORMDIST function calculates the probability associated with a lognormal distribution. The function uses the syntax

=LOGNORMDIST(*x*,*mean*,*standard_dev*)

where *x* is the value for which you want to find the probability, *mean* is the arithmetic mean, and *standard_dev*, of course, equals the standard deviation.

*LOGINV: Value associated with lognormal distribution probability*

The LOGINV function calculates the value associated with a lognormal distribution probability. The function uses the syntax

=LOGINV(*probability*,*mean*,*standard_dev*)

where *probability* is the probability of a lognormal distribution, *mean* is the arithmetic mean, and *standard_dev* is the standard deviation.

*POISSON: Poisson distribution probabilities*

The POISSON function calculates probabilities for Poisson distributions. The function uses the syntax

=POISSON(*x*,*mean*,*cumulative*)

where *x* is the number of events, *mean* is the arithmetic mean, and *cumulative* is a switch. If set to TRUE, this switch tells Excel to calculate the Poisson probability of a variable being less than or equal to *x*; if set to FALSE, it tells Excel to calculate the Poisson probability of a variable being exactly equal to *x*.

To illustrate how the Poisson function works, suppose you want to look at some probabilities associated with cars arriving as a drive-through car wash. (This type of analysis of events occurring over a specified time interval is a common application of Poisson distributions.) If on average, 20 cars drive up an hour, you can calculate the probability that exactly 15 cars will drive up using the formula

=POISSON(15,20,FALSE)

This function returns the value 0.051648854, indicating that there’s roughly a 5-percent chance that exactly 15 cars will drive up in an hour.

To calculate the probability that 15 cars or fewer will drive up in an hour, use the following formula:

=POISSON(15,20,TRUE)

This function returns the value 0.156513135, indicating that there’s roughly a 16-percent chance that 15 or fewer cars will drive up in an hour.

*WEIBULL: Weibull distribution*

The WEIBULL function returns either the cumulative distribution or the probability mass for a Weibull distribution. The function uses the syntax

=WEIBULL(*x*,*alpha*,*beta*,*cumulative*)

where *x* is the value for which you want to calculate the distribution; *alpha* and *beta* are, respectively, the alpha and beta parameters to the Weibull equation, and *cumulative* is a switch. That switch, if set to TRUE, tells the function to return the cumulative distribution function; if set to FALSE, it tells the function to return the probability mass function.

Visit the web page at

http://keisan.casio.com/has10/SpecExec.cgi?id=system/2006/1180573173

to find a calculator that lets you play around with making different graphs plotting Weibull distributions.

*ZTEST: Probability of a z-test*

The ZTEST function calculates the probability that a value comes from the same population as a sample. The function uses the syntax

=ZTEST(*array*,*x*,[*sigma*])

where *array* is the worksheet range holding your sample, *x* is the value you want to test, and (optionally) *sigma* is the standard deviation of the population. If you omit *sigma*, Excel uses the sample standard deviation.

For example, to find the probability that the value 75 comes from the population as the sample stored in the worksheet range A1:A10, use the following formula:

=ZTEST(A1:A10,75)