Analyzing Data with the Analysis ToolPak - Analyzing Data with Excel - Microsoft Excel 2016 BIBLE (2016)

Microsoft Excel 2016 BIBLE (2016)

Part V
Analyzing Data with Excel

Chapter 37
Analyzing Data with the Analysis ToolPak

IN THIS CHAPTER

1. Getting an overview of the Analysis ToolPak

2. Using the Analysis ToolPak

3. Meeting the Analysis ToolPak tools

Although Excel was designed primarily for business users, people in other disciplines, including education, research, statistics, and engineering, also use the software. One way Excel addresses these nonbusiness users is with its Analysis ToolPak add-in. However, many features in the Analysis ToolPak are valuable for business applications as well.

The Analysis ToolPak: An Overview

The Analysis ToolPak is an add-in that provides analytical capability that normally isn't available in Excel.

These analysis tools offer many features that may be useful to those in the scientific, engineering, and educational communities — not to mention business users whose needs extend beyond the normal spreadsheet fare.

This section provides a quick overview of the types of analyses that you can perform with the Analysis ToolPak. This chapter covers each of the following tools:

· Analysis of variance (three types)

· Correlation

· Covariance

· Descriptive statistics

· Exponential smoothing

· F-Test

· Fourier analysis

· Histogram

· Moving average

· Random number generation

· Rank and percentile

· Regression

· Sampling

· T-Test (three types)

· Z-Test

As you can see, the Analysis ToolPak add-in brings a great deal of functionality to Excel. These procedures have limitations, however. In some cases, you may prefer to create your own formulas to do some calculations.

Installing the Analysis ToolPak Add-In

The Analysis ToolPak is implemented as an add-in. Before you can use it, though, you need to make sure that the add-in is installed. Select the Data tab. If you see an Analyze group, showing Data Analysis, the Analysis ToolPak is installed. If you can't access Data imageAnalyze image Data Analysis, install the add-in by following these steps:

1. Choose File image Options. The Excel Options dialog box appears.

2. Select the Add-Ins tab.

3. At the bottom of the dialog box, select Excel Add-Ins from the Manage drop-down list and then click Go. The Add-Ins dialog box appears.

4. Place a check mark next to Analysis ToolPak. There is no need to check the add-in named Analysis ToolPak – VBA.

5. Click OK to close the Add-Ins dialog box.

Using the Analysis Tools

Using the procedures in the Analysis ToolPak add-in is relatively straightforward as long as you're familiar with the particular analysis type. To use any of these tools, choose Data image Analyze image Data Analysis, and the Data Analysis dialog box, shown in Figure 37.1, appears. Scroll through the list until you find the analysis tool that you want to use, and then click OK. A dialog box specific to the procedure that you select appears.

Data Analysis dialog box featuring a list of analysis tools, among them Description Statistics has been highlighted.

Figure 37.1 Select your tool from the Data Analysis dialog box.

Usually, you need to specify one or more Input ranges, plus an Output range. (One cell is sufficient.) Alternatively, you can choose to place the results on a new worksheet or in a new workbook. The procedures vary in the amount of additional information required. In many dialog boxes, you may be able to indicate whether your Data range includes labels. If so, you can specify the entire range, including the labels, and indicate to Excel that the first column (or row) contains labels. Excel then uses these labels in the tables that it produces. Most tools also provide different output options that you can select, based on your needs.

Caution

The Analysis ToolPak isn't consistent in the way it generates its output. In some cases, the procedures use formulas, so you can change your data, and the results update automatically. For other procedures, Excel stores the results as values, so if you change your data, the results don't reflect your changes.

Introducing the Analysis ToolPak Tools

This section describes each tool in the Analysis ToolPak and provides an example. I don't describe every available option in these procedures. If you need to use the advanced analysis tools, you probably already know how to use most of the options not covered here.

Before you use any of these tools, I suggest that you read the appropriate section in Excel's Help system.

imageThis book's website at www.wiley.com/go/excel2016bible contains a workbook that shows output from all the tools discussed in this section. The file is named atp examples.xlsx. This workbook also contains some alternative formula-based solutions that are sometimes better than using the Analysis ToolPak.

Analysis of Variance

Analysis of Variance (sometimes abbreviated as Anova) is a statistical test that determines whether two or more samples were drawn from the same population. Using tools in the Analysis ToolPak, you can perform three types of analysis of variance:

· Single-factor: A one-way analysis of variance, with only one sample for each group of data

· Two-factor with replication: A two-way analysis of variance, with multiple samples (or replications) for each group of data

· Two-factor without replication: A two-way analysis of variance, with a single sample (or replication) for each group of data

Figure 37.2 shows the dialog box for a single-factor analysis of variance. Alpha represents the statistical confidence level for the test.

Image described by caption and surrounding text.

Figure 37.2 Specifying parameters for a single-factor analysis of variance.

The output for this test consists of the means and variances for each of the samples, the value of F, the critical value of F, and the significance of F (P-value).

Correlation

Correlation is a widely used statistic that measures the degree to which two sets of values vary together. For example, if higher values in one data set are typically associated with higher values in the second data set, the two data sets have a positive correlation. The degree of correlation is expressed as a coefficient that ranges from –1.0 (a perfect negative correlation) to +1.0 (a perfect positive correlation). A correlation coefficient of 0 indicates that the two variables aren't correlated.

Figure 37.3 shows the Correlation dialog box. Specify the input range, which can include any number of variables, arranged in rows or columns.

Image described by caption and surrounding text.

Figure 37.3 The Correlation dialog box.

The output consists of a correlation matrix that shows the correlation coefficient for each variable paired with every other variable.

Note

The resulting correlation matrix doesn't use formulas to calculate the results. Therefore, if any data changes, the correlation matrix isn't valid. You can use the CORREL function to create a correlation matrix that changes automatically when you change data.

Covariance

The Covariance tool produces a matrix that is similar to the one generated by the Correlation tool. Covariance, like correlation, measures the degree to which two variables vary together. Specifically, covariance is the average of the product of the deviations of each data point pair from their respective means.

Because the Covariance tool does not generate formulas, you may prefer to calculate a covariance matrix using the COVAR function.

Descriptive Statistics

The Descriptive Statistics tool produces a table that describes your data with some standard statistics. Figure 37.4 shows some sample output.

Worksheet of the descriptive statistics output pivot table. Columns A to C are the samples of W. Coast, Midwest, and E. Coast. Columns E to J list the details of the samples.

Figure 37.4 Descriptive Statistics output.

Because the output for this procedure consists of values (not formulas), you should use this procedure only when you're certain that your data isn't going to change; otherwise, you'll need to re-execute the procedure. You can generate all these statistics by using formulas.

Exponential Smoothing

Exponential Smoothing is a technique for predicting data that is based on the previous data point and the previously predicted data point. You can specify the damping factor (also known as a smoothing constant), which can range from 0 to 1. This factor determines the relative weighting of the previous data point and the previously predicted data point. You also can request standard errors and a chart.

The exponential smoothing procedure generates formulas that use the damping factor you specify. Therefore, if the data changes, Excel updates the formulas.

F-test (two-sample test for variance)

An F-test is a commonly used statistical test that enables you to compare two population variances. Figure 37.5 shows a small data set and F-test output.

Worksheet of the F-Test tool output pivot table. Columns A and B are Groups 1 and 2, respectively. The details for the F-test two-sample for variances occupy columns D to F.

Figure 37.5 Output from the F-Test tool.

The output for this test consists of the means and variances for each of the two samples, the value of F, the critical value of F, and the significance of F.

Fourier Analysis

The Fourier Analysis tool performs a “fast Fourier” transformation of a range of data. Using the Fourier Analysis tool, you can transform a range limited to the following sizes: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, or 1,024 data points. This procedure accepts and generates complex numbers, which are represented as text string (not numerical values).

Histogram

The Histogram tool is useful for producing data distributions and histogram charts. It accepts an Input range and a Bin range. A Bin range is a range of values that specifies the limits for each column of the histogram. If you omit the Bin range, Excel creates ten equal-interval bins for you. The size of each bin is determined by the following formula:

=(MAX(input_range)– MIN(input_range))/10

Output from the Histogram tool is shown in Figure 37.6. As an option, you can specify that the resulting histogram be sorted by frequency of occurrence in each bin.

Worksheet e f a pivot table. Column A is a data list. Columns C to H list the values for Bin, Frequency, and Cumulative %. A Histogram is located under the data in columns C to H.

Figure 37.6 Use the Histogram tool to generate distributions and graphical output.

If you specify the Pareto (Sorted Histogram) option, the Bin range must contain values and can't contain formulas. If formulas appear in the Bin range, Excel doesn't sort properly, and your worksheet displays error values. The Histogram tool doesn't use formulas, so if you change any of the input data, you need to repeat the histogram procedure to update the results.

imageFor other ways of generating frequency distributions, see Chapter 13, “Creating Formulas That Count and Sum,” and Chapter 34, “Analyzing Data with Pivot Tables.” Excel 2016 also supports two new chart types: Histogram and Pareto. These work with a range of data and do not require a separate bin range. For an example, see Chapter 19, “Getting Started Making Charts.”

Moving Average

The Moving Average tool helps you smooth out a data series that has a lot of variability. This procedure is often used in conjunction with a chart. Excel does the smoothing by computing a moving average of a specified number of values. In many cases, a moving average enables you to spot trends that otherwise would be obscured by noise in the data.

Figure 37.7 shows a chart generated by the Moving Average tool. You can, of course, specify the number of values that you want Excel to use for each average. If you select the Standard Errors check box in the Moving Average dialog box, Excel calculates standard errors and places formulas for these calculations next to the moving average formulas. The standard error values indicate the degree of variability between the actual values and the calculated moving averages.

Worksheet of a pivot table. Columns A and B are feature lists for Measurements and Moving Average, respectively. A Moving Average line graphis adjacent to them.

Figure 37.7 A chart produced from data generated by the Moving Average tool.

The first few cells in the output are #N/A because not enough data points exist to calculate the average for these initial values.

Random Number Generation

Although Excel contains built-in functions to calculate random numbers, the Random Number Generation tool is much more flexible because you can specify what type of distribution you want the random numbers to have. Figure 37.8 shows the Random Number Generation dialog box. The Parameters section varies, depending on the type of distribution that you select.

Image described by surrounding text.

Figure 37.8 This dialog box enables you to generate a wide variety of random numbers.

Number of Variables refers to the number of columns that you want, and Number of Random Numbers refers to the number of rows that you want. For example, if you want 200 random numbers arranged in 10 columns of 20 rows, you specify 10 and 20, respectively, in these fields.

In the Random Seed field, you can specify a starting value that Excel uses in its ­random-number-generating algorithm. Usually, you leave this field blank. If you want to generate the same random number sequence, however, you can specify a seed between 1 and 32,767 (integer values only). You can create the following types of distributions via the Distribution drop-down list in the Random Number Generation ­dialog box:

· Uniform: Every random number has an equal chance of being selected. You specify the upper and lower limits.

· Normal: The random numbers correspond to a normal distribution. You specify the mean and standard deviation of the distribution.

· Bernoulli: The random numbers are either 0 or 1, determined by the probability of success that you specify.

· Binomial: This option returns random numbers based on a Bernoulli distribution over a specific number of trials, given a probability of success that you specify.

· Poisson: This option generates values in a Poisson distribution. A Poisson distribution is characterized by discrete events that occur in an interval, where the probability of a single occurrence is proportional to the size of the interval. The lambda parameter is the expected number of occurrences in an interval. In a Poisson distribution, lambda is equal to the mean, which also is equal to the variance.

· Patterned: This option doesn't generate random numbers. Rather, it repeats a series of numbers in steps that you specify.

· Discrete: This option enables you to specify the probability that specific values are chosen. It requires a two-column input range; the first column holds the values, and the second column holds the probability of each value being chosen. The sum of the probabilities in the second column must equal 100 percent.

Rank and Percentile

The Rank and Percentile tool creates a table that shows the ordinal and percentile ranking for each value in a range. You can also generate ranks and percentiles by using Excel functions (those that begin with RANK and PERCENTILE).

Regression

Use the Regression tool (see Figure 37.9) to calculate a regression analysis from worksheet data. You can use regression to analyze trends, forecast the future, build predictive models, and, often, to make sense out of a series of seemingly unrelated numbers.

Regression dialog box. It has section for Input, which feature Y and X range, and a section for Output options, which feature output range, residuals, and normal probability.

Figure 37.9 The Regression dialog box.

Regression analysis enables you to determine the extent to which one range of data (the dependent variable) varies as a function of the values of one or more other ranges of data (the independent variables). This relationship is expressed mathematically, using values that Excel calculates. You can use these calculations to create a mathematical model of the data and predict the dependent variable by using different values of one or more independent variables. This tool can perform simple and multiple linear regressions and calculate and standardize residuals automatically.

As you can see, the Regression dialog box offers many options:

· Input Y Range: The range that contains the dependent variable.

· InputX Range: One or more ranges that contain independent variables.

· Confidence Level: The confidence level for the regression.

· Constant Is Zero: If selected, forces the regression to have a constant of 0 (which means that the regression line passes through the origin; when the X values are 0, the predicted Y value is 0).

· Residuals: The four options in this section of the dialog box enable you to specify whether to include residuals in the output. Residuals are the differences between observed and predicted values.

· Normal Probability: Generates a chart for normal probability plots.

Sampling

The Sampling tool generates a random sample from a range of input values. The Sampling tool can help you work with a large database by creating a subset of it.

This procedure has two options: periodic and random. If you choose a periodic sample, Excel selects every nth value from the Input range, where n equals the period that you specify. With a random sample, you simply specify the size of the sample you want Excel to select; every value has an equal probability of being chosen.

T-Test

Use the T-Test tool to determine whether a statistically significant difference exists between two small samples. The Analysis ToolPak can perform three types of t-tests:

· Paired two-sample for means: For paired samples in which you have two observations on each subject (such as a pretest and a posttest). The samples must be the same size.

· Two-sample assuming equal variances: For independent, rather than paired, samples. Excel assumes equal variances for the two samples.

· Two-sample assuming unequal variances: For independent, rather than paired, samples. Excel assumes unequal variances for the two samples.

Figure 37.10 shows output for the Paired Two Sample for Means t-test. You specify the significance level (alpha) and the hypothesized difference between the two means (that is, the null hypothesis).

Worksheet of a pivot table. Columns A, B, and C, are Student#, Pretest, and Posttest, respectively. Data for the t-Test: Paired Two Samples for Means are in columns E, F, and G.

Figure 37.10 Output from the paired t-test dialog box.

Z-Test (two-sample test for means)

The T-Test tool is used for small samples; the Z-Test tool is used for larger samples or populations. You must know the variances for both input ranges.