Appendix B. Understanding Results

In the process of analyzing data, you will encounter terms in the results that convey vital information. Throughout Chapters 5 and 6, we discussed many of the relevant terms and results in the context of the application and analysis we were performing. This appendix provides a basic description of the common terms used in each of those platforms to help you understand your own results. We also point out where you can find complete descriptions in JMP’s documentation. In most cases, you can simply use the “?” help tool and select the term in question to obtain more information. In other cases, by holding your mouse cursor over a statistic, you can see Hover Help, which provides context-specific assistance in interpreting that statistic.

While a graph can be worth a thousand words and is more easily interpreted, the statistical results generated from the Analyze menu contain numerical output and terms that might be less clear. It is important that you understand the basic ideas behind these results. This section offers a basic, general reference rather than a comprehensive one. To increase your confidence in interpreting statistical results, we encourage you to seek advice from an experienced data analyst or to consult the books that ship with JMP (see the Help menu) or those in the bibliography.

This appendix covers key terms and related concepts in the book, in alphabetical order. In these definitions, the terms column and variable are interchangeable.

Bivariate Plot: Also known as a scatter diagram or scatterplot, a plot in which each point expresses both an X and a Y value. These two-dimensional plots are used when comparing one continuous column with another continuous column. See Help ▶ Books ▶ Basic Analysis ▶ Chapter 5: Bivariate Analysis.

Box and Whisker Plot: Also called a box plot or an outlier box plot. A graphical presentation of the important characteristics of a continuous variable. Box plots display the interquartile range of the data (the “box”), the spread (the whiskers), and potential outliers (disconnected points). Box plots are useful for describing variables that have a skewed distribution and for comparing two or more distributions.

Confidence Interval: An interval, computed from sample data, that is expected to contain the true value of a quantity (such as a population mean) with a given degree of certainty. For a 95% confidence interval, we are 95% certain, or confident, that the true value falls within this interval.
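
JMP computes confidence intervals for you, but the arithmetic is easy to see. The following minimal Python sketch, using entirely hypothetical data, builds a 95% confidence interval for a mean as the mean plus or minus a t-multiplier times the standard error:

    import statistics
    from scipy import stats

    data = [4.9, 5.1, 5.0, 5.3, 4.8, 5.2]     # hypothetical sample values
    n = len(data)
    mean = statistics.mean(data)
    sem = statistics.stdev(data) / n ** 0.5   # standard error of the mean
    t_star = stats.t.ppf(0.975, df=n - 1)     # t-multiplier for 95% confidence
    print(mean - t_star * sem, mean + t_star * sem)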

Contingency Table: A table showing the observed frequencies of two nominal variables, with the rows indicating one variable and the columns indicating the other. The analysis reports a chi-square statistic, which tests whether the two variables are independent. See Help ▶ Books ▶ JMP Stat and Graph Guide ▶ Contingency Table Analysis.
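
Although JMP produces this test for you, a minimal sketch of the same chi-square test of independence, using SciPy and made-up counts, looks like this:

    from scipy.stats import chi2_contingency

    observed = [[20, 30],        # hypothetical counts: row variable, level 1
                [25, 25]]        # hypothetical counts: row variable, level 2
    chi2, p, df, expected = chi2_contingency(observed)
    print(chi2, p, df)           # chi-square statistic, p-value, degrees of freedom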

Count: The total number of members in a group.

Correlation: A measure of the relationship between two or more variables, in which changes in the value of one variable are accompanied by changes in another variable or variables. For example, a correlation of 1 indicates a perfect positive linear relationship: as one variable increases, the other increases in exact proportion. A correlation of -1 indicates a perfect negative linear relationship: as one variable increases, the other decreases in exact proportion. See Help ▶ Books ▶ Basic Analysis ▶ Chapter 5: Bivariate Analysis.
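
For readers who want to see a correlation computed outside of JMP, here is a minimal Python sketch with hypothetical values; NumPy's corrcoef returns the Pearson correlation coefficient:

    import numpy as np

    x = [1, 2, 3, 4, 5]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]     # hypothetical measurements
    r = np.corrcoef(x, y)[0, 1]        # Pearson correlation coefficient
    print(r)                           # close to 1: strong positive relationship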

Degrees of Freedom: Also abbreviated DF, degrees of freedom are associated with many statistical estimates. Intuitively, degrees of freedom are the number of freely varying observations in a data-based calculation. The larger the sample size, the larger the degrees of freedom and the stronger the inferences we can draw about population parameters. See Help ▶ Books ▶ Basic Analysis ▶ Oneway Analysis.

Distribution: The values of a single column or variable in terms of frequency of occurrence, spread, and shape. A distribution can be observed (based on data) or theoretical. Some examples of theoretical distributions are normal (bell-shaped), binomial, and Poisson. Distribution is also the name of JMP’s univariate platform.

F Ratio: In an ANOVA, the ratio of between-group variability to within-group variability (see One-Way Analysis of Variance). The F ratio is used, in conjunction with Prob > F, to test the null hypothesis that the group means are equal (that there is no real difference between them). In general, larger F ratios provide stronger evidence of a difference between at least two means. See Help ▶ Books ▶ Fitting Linear Models ▶ Standard Least Squares Report and Options.
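
As an illustration (not JMP output), the following Python sketch computes an F ratio and its Prob > F for three hypothetical groups using SciPy:

    from scipy.stats import f_oneway

    g1 = [5, 6, 7]                 # hypothetical group samples
    g2 = [8, 9, 10]
    g3 = [5, 7, 6]
    f, p = f_oneway(g1, g2, g3)
    print(f, p)                    # F ratio and its Prob > F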

Frequency: The number of times a categorical value or type of event is observed, or the number of elements of a sample that belong to a specified group. It is also called count.

Interquartile Range: A measure of variability or dispersion for a continuous column, calculated as the difference or distance between the 25th and 75th percentiles (the first and third quartiles, respectively). Fifty percent of the values fall within the interquartile range, which is represented by the box in a box and whisker plot.
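
A minimal Python sketch of the calculation, with hypothetical values:

    import numpy as np

    data = [2, 4, 4, 5, 6, 7, 9, 12, 30]      # hypothetical values
    q1, q3 = np.percentile(data, [25, 75])    # first and third quartiles
    print(q3 - q1)                            # interquartile range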

Logistic Regression: A type of regression technique in which the Y (dependent) variable is nominal or ordinal and at least one X (independent) variable is continuous. Logistic models (sometimes called logit models) are used to predict the probability of occurrence of an event based on the values of one or more predictor variables. See Help ▶ Books ▶ Fitting Linear Models ▶ Logistic Regression with Nominal or Ordinal Responses.
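
The sketch below, using scikit-learn rather than JMP and entirely made-up data, fits a logistic model that predicts the probability of passing from hours studied:

    from sklearn.linear_model import LogisticRegression

    X = [[1], [2], [3], [4], [5], [6]]   # hypothetical hours studied
    y = [0, 0, 0, 1, 1, 1]               # hypothetical outcome: fail (0) or pass (1)
    model = LogisticRegression().fit(X, y)
    print(model.predict_proba([[3.5]]))  # predicted probabilities of fail and pass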

Maximum: The largest value observed in the sample.

Mean: A measure of location or central tendency of a column of continuous data. It is the arithmetic average computed by summing all the values in a column and dividing by the number of non-missing rows.

Median: A measure of location or central tendency of a continuous column of data. It is the middle value in an ordered column, which divides a distribution exactly in half. Fifty percent of the values are higher than the median and 50% are lower.
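
The difference between the mean and the median is easy to demonstrate. In this minimal Python sketch with hypothetical values, one extreme value pulls the mean upward but leaves the median unchanged:

    import statistics

    data = [3, 5, 7, 9, 100]           # hypothetical values with one extreme point
    print(statistics.mean(data))       # 24.8: pulled upward by the extreme value
    print(statistics.median(data))     # 7: the middle value, unaffected by the extreme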

Minimum: The smallest value observed in the sample.

Multiple Regression: An analysis involving two or more independent X variables as predictors to estimate the value of a single dependent variable. The dependent Y variable is usually continuous, but the independent X variables can be continuous or categorical (provided that at least one of them is continuous). The model is usually estimated by the method of standard least squares. See Help ▶ Books ▶ Fitting Linear Models ▶ Stepwise Regression Models.
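
A minimal sketch of a multiple regression fit by least squares, with two hypothetical continuous predictors:

    import numpy as np

    X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)  # hypothetical X1, X2
    y = np.array([5.1, 4.9, 11.2, 10.8, 15.1])                           # hypothetical response
    A = np.column_stack([np.ones(len(X)), X])      # add a column of ones for the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least squares solution
    print(coef)                                    # intercept, slope for X1, slope for X2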

One-Way Analysis of Variance (or One-Way ANOVA): A procedure involving a categorical X (independent) variable and a continuous Y (dependent) variable. One-way ANOVA is used to test for differences in means among two or more independent groups (though it is typically used to test for differences among at least three groups because the two-group case can be analyzed with a t-test). When there are only two means to compare, the t-test and the F-test (see F Ratio) are equivalent. See Help ▶ Books ▶ Fitting Linear Models ▶ Examples of Model Specifications and Their Model Fits.
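
This two-group equivalence is easy to verify. In the following sketch with hypothetical data, the squared t statistic equals the F ratio, and the two tests report the same p-value:

    from scipy.stats import ttest_ind, f_oneway

    a = [5.0, 5.5, 6.1, 5.8]     # hypothetical group 1
    b = [6.5, 7.0, 6.8, 7.2]     # hypothetical group 2
    t, p_t = ttest_ind(a, b)     # equal-variance two-sample t-test
    f, p_f = f_oneway(a, b)      # one-way ANOVA on the same two groups
    print(t ** 2, f)             # t squared equals the F ratio
    print(p_t, p_f)              # identical p-values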

Outlier: An observation that is so extreme that it stands apart from the rest of the observations; that is, it differs so greatly from the rest of the data that it raises the question of whether it comes from the same population or involves measurement error. One common rule of thumb flags as a potential outlier any value that falls more than 1.5 times the interquartile range below the first quartile or above the third quartile.
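
A minimal sketch of this rule of thumb in Python, with hypothetical values:

    import numpy as np

    data = [2, 4, 4, 5, 6, 7, 9, 12, 30]     # hypothetical values
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr                   # lower fence
    upper = q3 + 1.5 * iqr                   # upper fence
    print([x for x in data if x < lower or x > upper])   # flags 30 as a potential outlier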

Partition: The Partition platform iteratively separates data according to a predictive relationship between a Y and multiple X values, from strongest to weakest, forming a tree structure. Partition searches through the data table to find values within X columns that best predict the outcome of Y, your column of interest. Partition is a data mining or predictive modeling technique. See Help ▶ Books ▶ Specialized Models ▶ Partition Models.
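
Partition trees are broadly similar to the decision trees found in other software. As a rough analogue (not JMP's algorithm), the following scikit-learn sketch with made-up data splits on two X columns to predict a categorical Y:

    from sklearn.tree import DecisionTreeClassifier

    X = [[25, 40], [30, 60], [45, 80], [50, 30], [35, 70], [60, 90]]  # hypothetical age, income
    y = [0, 0, 1, 0, 1, 1]                                            # hypothetical outcome
    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(tree.predict([[40, 75]]))    # predicted class for a new observation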

Prob > F: In ANOVA, the probability of obtaining (by chance alone) an F-value greater than the one calculated if, in reality, there is no difference in the population group means. Prob (or “p”) values of 0.05 or less are often taken as evidence that at least one group mean differs, that is, that the model explains a significant amount of the variation. See Help ▶ Books ▶ Basic Analysis ▶ Bivariate Analysis ▶ Fit Line and Fit Polynomial ▶ Linear Fit and Polynomial Fit Reports ▶ Analysis of Variance Report.

Prob > t: A p-value or measure of significance for a t-test. For a one-sample test or for a test of differences between two means, it is the probability of obtaining a t-statistic more extreme than the one observed, if the null hypothesis were true. Prob > t values of 0.05 or less are usually considered significant. See Help ▶ Books ▶ Basic Analysis ▶ Bivariate Analysis ▶ Fit Line and Fit Polynomial ▶ Linear Fit and Polynomial Fit Reports ▶ Analysis of Variance Report.

Quantiles: Values that divide an ordered set of continuous data (from smallest to largest) into equal proportions. Related terms are deciles (dividing data into 10 parts) and quartiles (dividing data into four parts, or quarters). A value at the 97th percentile is equal to or larger than 97% of all values in the distribution.

Quartiles: Values in a continuous column of data that are first ordered (from smallest to largest) and then divided into four quarters, each of which contains 25% of the observed values. The 25th, 50th, and 75th percentiles are the same as the first, second, and third quartiles, respectively. See also Quantiles.
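
A minimal Python sketch with hypothetical values (the statistics.quantiles function requires Python 3.8 or later):

    import statistics

    data = [1, 3, 5, 7, 9, 11, 13, 15]       # hypothetical ordered values
    print(statistics.quantiles(data, n=4))   # first, second (median), and third quartiles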

Regression: A statistical procedure that shows how two or more variables are related, represented in the simple case by a fitted line in a bivariate scatterplot. The fitted line, along with its regression equation, allows one to predict values of Y based on observations of X. The simple form of this equation is y = mx + b, where the slope m expresses how much Y changes for each unit change in X and b is the intercept.
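
A minimal sketch of a simple linear regression with hypothetical data; NumPy's polyfit returns the slope m and intercept b of the fitted line:

    import numpy as np

    x = [1, 2, 3, 4, 5]
    y = [2.2, 3.9, 6.1, 8.0, 9.8]    # hypothetical observations
    m, b = np.polyfit(x, y, 1)       # slope and intercept of the fitted line
    print(m, b)
    print(m * 6 + b)                 # predicted Y for a new X of 6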

RSquare: A measure of the adequacy of a model, defined as the proportion of variability in the data that is accounted for by the statistical model. RSquare provides a measure of how well future outcomes are likely to be predicted by the model. See Help ▶ Books ▶ Basic Analysis ▶ Bivariate Analysis ▶ Fit Line and Fit Polynomial ▶ Linear Fit and Polynomial Fit Reports ▶ Summary of Fit Report.
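
Continuing the hypothetical example from the Regression entry, RSquare can be computed directly from two sums of squares:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5])
    y = np.array([2.2, 3.9, 6.1, 8.0, 9.8])    # hypothetical observations
    m, b = np.polyfit(x, y, 1)
    ss_error = np.sum((y - (m * x + b)) ** 2)  # variation left over after the fit
    ss_total = np.sum((y - y.mean()) ** 2)     # total variation in y
    print(1 - ss_error / ss_total)             # RSquare: proportion of variation explained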

Standard Deviation: A measure of variability or dispersion of a data set calculated by taking the square root of the variance. It can be interpreted as the average distance of the individual observations from the mean. The standard deviation is expressed in the same units as the measurement in question. It is usually employed in conjunction with the mean to summarize a continuous column.
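
A minimal sketch with hypothetical values, showing the relationship between the variance and the standard deviation:

    import statistics

    data = [4, 8, 6, 5, 7]               # hypothetical values
    print(statistics.variance(data))     # sample variance (2.5 here)
    print(statistics.stdev(data))        # standard deviation: square root of the variance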

Standard Least Squares: A method of fitting a line to data in a bivariate plot or multiple regression model. Least squares is used where there is one continuous Y and at least one continuous X column in your model. Least squares fits a model that minimizes the sum of the squared errors (the residual sum of squares), hence the name “least squares.” The least squares method is used extensively for prediction and for calculating the relationship between two or more variables. See Help ▶ Books ▶ Fitting Linear Models ▶ Standard Least Squares Report and Options.

Sum of Squares: A measure of the variation of the observed data around the model fit. It is calculated by squaring the errors (the vertical distance between each data point and the model fit) and summing those values. We square the values so that the errors contribute a positive amount to the sum regardless of whether the observed values are above or below the fit. The best model fit is the one for which this sum of squared errors is minimized.
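
A minimal sketch of the idea with hypothetical points: of two candidate lines, the better fit has the smaller sum of squared errors:

    # hypothetical (x, y) points
    data = [(1, 2.2), (2, 3.9), (3, 6.1)]

    def sum_of_squared_errors(m, b):
        # square each vertical distance from a point to the line y = m*x + b, then sum
        return sum((y - (m * x + b)) ** 2 for x, y in data)

    print(sum_of_squared_errors(2.0, 0.0))   # a close fit: small sum of squares (0.06)
    print(sum_of_squared_errors(1.0, 1.0))   # a worse fit: larger sum of squares (5.26)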