Problem-Solving with Multiple Columns - JMP Essentials: An Illustrated Guide for New Users, Second Edition (2014)

JMP Essentials: An Illustrated Guide for New Users, Second Edition (2014)

Chapter 6. Problem-Solving with Multiple Columns

The previous chapter focused on problem-solving when you have one or two columns of interest. This chapter builds on the previous chapter’s framework for problem solving and introduces a method to understand multiple column and group relationships.

We introduce advanced analytics including Partition, Fit Model, and Prediction Profiler to help us understand the multi-column/variable relationships in the data. Most real-world problems involve more than one- or two-column relationships so this approach is warranted.

We also introduce the Data Filter for real-time slicing of the data. The Data Filter enables easy inference testing and hypothesis development by excluding, hiding, and marking selected observations across any number of column ranges or groups.

The Fit Model platform includes several modeling personalities, including Multiple Regression. Unlike the two- column analysis (one Y and one X) we introduced in the last chapter, Multiple Regression allows you to employ additional X columns. We briefly introduce this topic at the end of the chapter, along with a very powerful model visualization tool, the Prediction Profiler.

6.1 Introduction

As we explained in the previous chapter, the transformation of data into information is a process that involves a few basic JMP platforms including Distribution and Fit Y by X, which are found in the Analyze menu. Table 6.1 provides a review of the tool name, how many columns it supports, and its common statistical identity. The process of discovery starts with univariate/one-column analysis using the Distribution platform to explore the single-column/variable properties and proceeds to bivariate/two-column relationships using the Fit Y by X platform. As in life itself, these analyses can often lead to more complex questions involving more than two columns, which is the subject of this chapter.

In Table 6.1, we add Fit Model and Modeling to the items on the Analyze menu column. Partition and Multiple Regression are associated with these menus and are added to the table framework. These methods support multivariable or multi-column relationships.

Table 6.1 Tool Name, Columns Supported, and Statistical Terms

Analyze Menu

How Many Columns

Statistical Terminology

Distribution

Single column

Univariate methods, histograms, box plots, quantiles, and descriptive statistics, and more

Fit Y by X

Two columns

Bivariate, contingency, logistic, oneway ANOVA, nonparametrics, linear regression, and more

Fit Model

Multi-Column

Partition, Data Mining, Multiple Regression, Logistic regression, Stepwise, and more

Modeling

Multi-Column

Partition, Data Mining, Time Series, and more

Remember, JMP’s goal in using this menu framework is to introduce you to methods in a logical order, allowing you to learn about your data progressively without forcing you to first understand statistical jargon or the conditions required to implement them. The process of digging deeper into your data is built into the menu structure.

6.2 Comparing Multiple Columns

What remains to be answered is whether there are important multiple-column relationships (looking at more than two columns) that might be present in the data. Complex products or systems often have these relationships. While these multiple-column relationships are not always present in the questions you are trying to answer, knowing that you have access to methods to identify them should help. In fact, sometimes to see the forest for the trees, you need these methods so that you can focus on only those columns that drive your learning from the data.

We do not need to change the profit column to a nominal type to use Partition. Output columns like profit can be either continuous, nominal, or ordinal in the Partition platform as well as any input columns. We made this change only to simplify the interpretation of our analysis for illustration purposes.

Example 6.2: Financial

We will be using the Financial.jmp data file to illustrate the steps in this chapter. The data is from companies in the Fortune 500, selected from the April 23, 1990 Fortune magazine issue. This data includes columns for:

Type type of company

Sales($M) yearly sales in millions of dollars

Profit($M) yearly profits in millions of dollars

#emp number of employees at time of measurement

Profits/emp profits per employee in thousands of dollars

Assets($Mil.) assets in millions of dollars

Sales/emp sales per employee in thousands of dollars

Stockholder’s Eq($Mil.) stockholder’s equity in millions of dollars

You can access this data at Help ▶ Sample Data ▶ Business & Demographic ▶ Financial.

Note: Statistical modeling in practice is both art and science and simply adding more columns to your model is not always better. While real world models often include more than two columns, adding too many columns will undermine the usefulness of the model on new data. There are tools built into JMP to help you decide the right number of columns to use in a model. JMP Pro includes many state of the art platforms and tools to validate your models.

You should have the Financial.jmp data table open from the previous chapter. If you don’t, you can open it at: Help ▶ Sample Data ▶ Business & Demographic ▶ Financial (see Figure 6.1).

Figure 6.1 Financial Data Table

image

Preparing Data Using Recode

Before we perform an analysis, let’s create a new column that simply identifies each company as being profitable or not (we will, in effect, be making a new profit column that indicates whether it meets a certain profit threshold and will be treated as nominal). Our intention is to identify profitable companies based on all the other columns. We want to pick companies whose profits will be sufficient to provide us with solid returns, so we want to identify companies whose profits are above $10 million and store them in a new column. The $10 million mark is an arbitrary break point (it could be any value), but it is based on the idea that companies with higher profits in general will probably be larger and less volatile. For purposes of illustration, we will use $10 million as a reasonable and conservative strategy at this point.

Note: We do not need to change the profit column to a nominal type to use Partition. Output columns like profit can be either continuous, nominal, or ordinal in the Partition platform as well as any input columns. We made this change only to simplify the interpretation of our analysis for illustration purposes.

Let’s make our new Profit column by deriving what we need from the existing Profits($M) column:

1. Click the Rows menu.

2. Select Row Selection.

3. Click Select Where… (see Figure 6.2). This generates the Select rows window (see Figure 6.3).

Figure 6.2 Select Where to Generate the Select Rows Window

image

Figure 6.3 Select Where Profit is Greater Than $10M

image

4. Select Profits($M).

5. Select is greater than.

6. Enter 10. (The data is expressed in millions.)

7. Click Add Condition.

8. Click OK.

Rows that meet the condition of Profits being greater than $10 million are now highlighted in the data (see Figure 6.4).

Figure 6.4 Selected Rows are Highlighted

image

9. Now select Rows ▶ Row Selection ▶ Name Selection in Column… (see Figure 6.5).

Figure 6.5 Select Name Selection in Column

image

10. Type the new column name and labels for the Selected and the Unselected rows exactly as shown (see Figure 6.6) and click OK.

Figure 6.6 Name Selection in Column

image

Note that we will hereafter refer to this new column as “Profits recode”.

A new column is created that identifies profitable and not profitable companies for each row in the table, with a dividing point of $10 million and higher for those we deem to be sufficiently profitable (see Figure 6.7). This is called recoding the data. Not having to write any code to do this is very handy. To learn about another method to accomplish data recoding using the Recode command on the Cols menu, see section 2.4.

Figure 6.7 New Recoded Profits Column

image

Let’s move on with the analysis.

Consider the data table in Figure 6.8. More than one column might be related to the Profits recode column: for example, Sales($M), #emp, and Assets($M) may explain or influence a company’s profitability. With more than two columns to investigate we need to move further down the Analyze menu. There are many methods to investigate this kind of relationship. Let’s look at one of these with a tool called Partition.

Figure 6.8 Recoded Profits

image

Note: You might find these multi-way relationships by simply (but laboriously) producing distributions or examining output from Fit Y by X. We use Partition because it is a powerful and quick method for finding these relationships when you have more than two potential relationships in your data. Partition results are usually easier to interpret as well.

Mining Data Using Partition

The Partition platform allows you to determine which columns in a data table most influence or predict the outcome of another column (profits, in our example). It achieves this by searching all of your selected X columns and finding a set of splits or subgroups within them whose values best predict your Y value. These splits (or partitions) of the data are done recursively (starting with the best predictor), forming a tree of decision rules until the desired fit is reached.

A decision tree is essentially a ranked set of predictors (such as Type or Assets($Mil.), in our example) placed in a hierarchical tree structure according to the strength of their relationship to the column of interest (Profits or Y, in our example). Items higher in the tree have more influence than items lower in the tree.

In our example, Profit recode is our column of interest (Y Column) and we want to determine which of the remaining columns (X Column) in our data table most influence or predict Profit. See Figure 6.9.

The Partition method introduced here is one of a set of methods that is sometimes termed data mining because you are mining relationships in your data to find which columns most relate to or predict profits.

Figure 6.9 Determine Which Columns Affect Profit Recode (Outputs versus Inputs)

image

Note: The Partition technique is introduced here to enable quick exploration of your data. As an advanced analytical technique it is recommended that you familiarize yourself with the underlying concepts by using the question mark tool to call up the relevant chapters in the documentation. Because these tools often require greater sophistication in their interpretation, we recommend that you seek out experienced data analysts for assistance when necessary.

To use Partition we will move further down the Analyze menu to the Modeling menu (see Figure 6.10).

Figure 6.10 Analyze Modeling Menu

image

Because we have already explored profits compared with the type of company, we might now ask other questions.

These are just a few of the questions that can be answered using the Modeling platforms.

Framing Our Analysis

Questions That Involve Multiple Columns

What are the best predictors of profit?

Are there relationships between profits and other columns?

What or where are the biggest differences found between company profits?

Where are there no differences?

Is there some way to use multiple columns to pick the most profitable companies?

1. Select Analyze ▶ Modeling ▶ Partition (see Figure 6.11).

Figure 6.11 Analyze Modeling Partition

image

2. Select Profits recode for Y, Response (see Figure 6.12).

Figure 6.12 Partition Window Selections

image

3. And for X, Factor, select Type, Sales, #emp, Assets($Mil.), Sales/emp, and Stockholder’s Eq (see Figure 6.12).

4. Click OK.

You now see profits for all the companies represented by a random pattern of dots from left to right in the partition graph (see Figure 6.13).

Figure 6.13 Profits Before Split

image

The horizontal black line that is perpendicular to the Y axis is the dividing point we set for companies that have at least $10 million in profit.

The companies that are above the line are those that we have deemed profitable ones and those below are not profitable enough and thus deemed the unprofitable ones. So, looking at the line on the scatterplot, about 13% of the companies are not profitable and about 87% are profitable. Your graph may look slightly different in that the dots representing the data points are randomly arranged or in statistical terms, jittered.

You can see those very high values (outliers) to which we assigned a red X marker from the previous chapter. We saved these in the table as row properties previously. If you don’t see the red X’s, it’s ok. We can go forward without them.

Now let’s try something.

5. Click Color Points. It’s the button next to Split and Prune (see Figure 6.13).

Remember that the point of this analysis is to understand the multiple-column relationships between profits and the other columns in the data table.

6. Select Split and new output appears (see Figure 6.14).

Figure 6.14 First Split of Profits

image

What happened? Partition went through all the columns you selected as X’s and found that the best predictor of Profits Recode was number of employees (#emp). Remember, you also included a half-dozen other columns in your model, but the first to be selected was the #emp column. Also note that your high performers from the previous chapter’s work show up as X’s (that is, if you saved them in the previous chapter). These should show up as profitable as you discovered them using the Distribution platform. It’s a good confirmation that Partition found them also!

This predictor is important and is expressed by finding the column whose differences in means (or categories, in the case of a nominal column) are most different as they relate to our Y, thus allowing us to predict our Y from our split X values. Want to know more about the analytical method used by Partition? Go to Help ▶ Books ▶ Specialized Models ▶ Partition Models.

Let’s express the splits as probabilities, so go to the red triangle:

7. Select Display Options.

8. Select Show Split Prob (see Figure 6.15).

Figure 6.15 Select Show Split Prob

image

9. Select Split again. Another branch appears on the report (Figure 6.16).

Notes on Partition Trees: Partition trees are useful when you have large amounts of unexplored data. Partition trees are also flexible to your column modeling types. The output column (Profits recode in our example) can be either continuous, nominal, or ordinal AND the input columns (in our example, #emp and Type among others) can be either continuous, nominal, or ordinal. However, you should be cautious on drawing conclusions from partition trees when the data sets are small, sparse, or messy, especially as you continue to split. Starting in JMP 11, enhancements were made to many modeling methods, including Partition, to utilize the information value of missing data in improving model accuracy. Methods to measure the usefulness of a partition model are also supported in JMP. To learn more about the Partition platform, go to HelpBooksSpecialized ModelsPartition Models.

Figure 6.16 Probabilities for Each Split

image

The partition tree output now also shows the probabilities for each split (see Figure 6.16). Out of the total group of companies we found about 13% of them are not profitable and about 86% of them are profitable.

Now where is most of the blue? (Reminder: Blue is profitable!) It’s mostly on the left-hand branch of the tree. These blue bars have been circled for you. It shows that 90% of the companies are profitable; about 10% are not profitable and come from companies with employees greater or equal to 1816 when the data is not missing. On the right-hand side, the method picked companies where the number of employees was less than 1816. You see more red on the right side of the tree. (Reminder: Red is unprofitable by our definition.) So, we conclude that the most significant predictor of profit is the #emp column where about 90% of the companies that have more than 1816 employees meet our criteria for profitable companies. In comparison, on the right side of the tree, we found about 68% of the companies are not profitable and about 31% are profitable where the number of employees is less than 1816 or the data are missing.

Note that in the output JMP has taken Missing or not Missing into account in the model. The information value of missing data can result in improved models. In JMP 11, a new feature called Informative Missing was added to improve modeling results and is reflected here. Go to Help ▶ Books ▶ Specialized Models ▶ Partition Models and the ‘Informative Missing’ section to learn more.

Based on what you just learned, what type of companies would you choose?

Where are the most profitable groupings after your last split (see Figure 6.17)? Hint: Follow the blue. On what side of the tree did the split happen? Can you find the combination where 99% of the companies are profitable? We find these where the Type of companies are Beverages, Drugs, Oil, and Soap, and the number of employees are greater than 1816. The best combinations in the tree are circled for you.

Figure 6.17 Most Profitable Groups after Last Split

image

You have now identified the most likely combination of predictors for profitable companies so far. Let’s make that easier to see in the Leaf Report.

10. From the red triangle for partition, select Leaf Report (see Figure 6.18).

Figure 6.18 Select Partition Leaf Report

image

The Leaf report assembles your discoveries in a table format that mirrors the tree format we just covered (see Figure 6.19). Looking at the Leaf report and reading the Leaf Label from left to right, we can see which split combinations are most or least desirable. What combination looks the most risky so far? It appears to be where the number of employees is less than 1816. Some 68% of companies in that group do not achieve our definition of profitable.

Figure 6.19 Leaf Report

image

What group looks most profitable?

It appears to be where the number of employees is greater than or equal to 1816 and the type of companies are Beverages, Drugs, Oil, and Soap. In our example, 99% of companies in this group are profitable. Remember this group, as we’ll be using it in the Using Data Filter section a bit later.

11. Back to our tree, click Split again.

Where did the next split happen? (See Figure 6.20). We have found another possible combination where profits are predicted on the left-hand side of the tree.

Figure 6.20 Results Showing Third Split

image

The combination of Aerospace and Computer and Number of Employees greater or equal to 95,000 has identified nine companies that are profitable. All of them are profitable in that combination. This profitable set has been circled for you (see Figure 6.20).

With each split, the columns in the model are sorted and ranked by the next most significant predictor. You can now see this last split in the updated Leaf report that appears in Figure 6.21.

Figure 6.21 Third Split Leaf Report

image

6.3 Filtering Data for Insight

Drilling down or filtering your data is an important objective in data analysis and can be accomplished in several ways with JMP. In this section, we will show you three distinct ways this can be done with our example including commands from the Tables menu, the Data Filter platform, and Lasso tool. You will find that each filtering approach provides unique benefits and is suited to the context of the analysis you are performing.

Using a Table Command to Extract a Subset

Now let’s identify those nine companies:

12. Select the red triangle within the tree node where #emp>=95000 not Missing

13. Select Select Rows (see Figure 6.22).

Figure 6.22 Select Rows on the Partition Node

image

14. Select the Tables menu.

15. Select Subset (see Figure 6.23)

Figure 6.23 Tables Subset

image

16. In the window, select Selected Rows and All Columns as shown, and click OK (see Figure 6.24).

Figure 6.24 Tables Subset Selected

image

A new data table appears (see Figure 6.25). You have identified a subset of desirable and profitable companies that are from the computer and aerospace categories and that have sales above $10 billion. These companies can now be submitted for further study and consideration for the investment portfolio. This subset step will save you time from finding them in the data table.

Figure 6.25 New Data Table Showing Profitable Subset

image

Using Data Filter

The Data Filter command from the Rows menu is a powerful means to visualize a subset of your data interactively. The Data Filter is especially useful when you are visualizing large data tables where graphs are packed with data points and it is difficult to see meaning in them. The Data Filter allows you to easily select rows (which can be ranges or categories) within any column of interest and, in effect, hide or exclude all other rows in your data table (see also section 2.5).

We’ll use the Data Filter to identify the best of the best from among the groups you already identified using Partition.

Early in the last section, recall that a promising group of companies were from the Type column (Soap, Oil, Beverages, and Drugs) where the number of employees is greater than or equal to 1816 where the data are not missing. We need to restrict the next analysis only to this promising subset. Here’s how:

1. We need to bring the Financial data table to the front again. Select WindowFinancial.

2. From the Rows menu, select Data Filter (see Figure 6.26).

Figure 6.26 Select Rows Data Filter

image

The Data Filter window appears (see Figure 6.27).

Figure 6.27 Data Filter Window

image

3. Select Type and click Add. A change to the Data Filter appears showing the different company types (see Figure 6.28).

Figure 6.28 Data Filter Type

image

4. Now, press and hold the CTRL key and select Beverages, Drugs, Oil, and Soap. Click the AND button. (The results are shown in Figure 6.29).

Figure 6.29 Data Filter Beverages, Drugs, Oil, and Soap

image

5. Select the #emp column and click Add. (The results with #emp slider control are shown circled in Figure 6.30).

Figure 6.30 Select Number of Employees

image

A slider appears under the range, labeled 560 <= #emp <= 383220. Recall that the dividing point discovered in the Partition platform earlier favored companies with a number of employees greater than or equal to 1,816 where the data was not missing.

6. Click and enter 1816 where you see 560 on the left-hand side of the slider. On your keyboard, select the Tab key (see Figure 6.31)

Figure 6.31 Data Filter

image

7. Now select the Show and Include check boxes (see the circled area in Figure 6.32). This will restrict the analysis to those same profitable companies identified in the partition and leaf report earlier in this chapter.

Figure 6.32 Select Show and Include

image

8. Let’s take a look at the result of this exercise by returning to the data table, Select Window ▶ Financial (see Figure 6.33).

Figure 6.33 Bring Financial Window to Front

image

Take a look at the data table and notice what has happened. It now shows groups of rows that are selected and included for profitable rows, and hidden and excluded for every non-profitable company group (see Figure 6.34).

Figure 6.34 Data Table Showing Rows Hidden and Excluded

image

As we covered in section 2.5, the rows that are hidden (and that will not appear in graphs) feature a mask icon in the row, while rows that are excluded (not used in any analysis) feature a circle icon with a strike-through symbol.

Rows that are still included and selected (highlighted) are those that meet the criteria for selection we established from our Partition example. They are now selected with the Data Filter.

The Data Filter is a terrific tool for exploring your data visually by allowing you to see subsets of your data in any JMP graph. These subsets can be derived from a prior analysis as demonstrated here or can simply be used to restrict your analysis to any subset you wish. As you will see in the next few pages, the Data Filter will be employed to further subset the most desirable companies in our example.

Now let’s save the Data Filter for the future:

9. From the red triangle on the Data Filter, select Script ▶ Save Script to Data Table (see Figure 6.35).

Figure 6.35 Save Script to Data Table

image

In the upper left-hand panel of the data table, a new item appears named Data Filter. It’s circled for you (see Figure 6.36). The Data Filter item has created a JMP script that stores the steps needed to reproduce the subset you selected.

Figure 6.36 Data Filter Script in Data Table

image

Let’s rename it something more descriptive so we can remember what it does.

10. Double-click on Data Filter and name it BevDrugOilSoap, EMP >= 1816 in the window that appears (see Figure 6.37).

Figure 6.37 Data Filter Script Renamed

image

11. Click OK.

The Data Filter script is now renamed BevDrugOilSoap, EMP > = 1816 in the upper left-hand panel of the data table (see Figure 6.38).

Figure 6.38 Data Filter Renamed in Table

image

Now we can reuse the filter if we want to apply this criterion to the data. It’s easy with the new script that applies the filter.

Note: The script also will act upon any new data you add or merge into the data table. If you add or change data in the data table, you may want to create a new subset. If that is true, use Partition again to analyze it and then use Data Filter again to apply it.

You have now filtered your data to only the best performers, so our task is to select only the very best from this group for our portfolio. There are many methods to finding high performers from among the groups we selected. We will go back and use the Fit Y by X platform (that we introduced in the previous chapter) along with the Lasso tool to visually select individual points. We find that using Lasso in this context easily enables visual selection of points that meet our performance requirements. The goal is to select the most profitable companies by industry type. In this example we’ll use the continuous representation of profits, Profits($M), as it allows us to distinguish with more precision among this group of high performers.

Recall that we also identified profitable companies from the Aerospace and Computer types in the partition Analysis and we sued the subset command to extract those. Because these companies also had the requirement of having sales over $10B, we could have used a conditioning statement (such as an “and” or “or” statement) in the Data Filter or merged that group with the one illustrated above to create one data table. To simplify this illustration, we’ve omitted these steps. To learn about how to merge data tables, see HelpBooksUsing JMP ▶ section“Reshape Data” and the topic Concatenate.

12. Select Analyze ▶ Fit Y by X (see Figure 6.39).

Figure 6.39 Analyze Fit Y by X

image

13. In the launch window, select Profits($M) as the Y, Response and select Type for the X, Factor (see Figure 6.40).

Figure 6.40 Fit Y by X Profit by Type

image

14. Click OK.

The Oneway graph (see Figure 6.41) shows profits from one of the best performing groups you identified in your partition analysis and selected using the Data Filter. All of the other companies have been excluded and hidden from this analysis. It is from these included companies that we want to identify the very best of these performers.

Figure 6.41 Oneway ANOVA Report

image

15. From the red triangle, select Means and Std Dev (see Figure 6.42).

Figure 6.42 Select Means and Std Dev from Red Triangle

image

The result appears in Figure 6.43a.

Figure 6.43a Oneway Means Std Dev Lines

image

We want to select companies that are exceptional performers and these are represented by those that are at least one standard deviation above the mean. Notice that the blue horizontal lines that appear in each type demark the standard deviations in the close-up in Figure 6.43b.

Figure 6.43b Mean Error Bars

image

The smaller blue vertical lines with horizontal whiskers denote mean error bars (see Figure 6.43b). Use the question mark tool from the Tools menu and click on the Oneway plot to review standard deviations and other options.

Individual data points above or below horizontal error bars denote extremes. We are interested in those above the whiskers and above the blue horizontal lines because they are exceptional performers in terms of profit. Let’s try a new method to select them.

Using Lasso to Select Individual Points

16. From the toolbar, select the Lasso tool (see Figure 6.44).

Figure 6.44 Select Lasso Tool

image

17. Now move the cursor (which should look like a little lasso) to the Oneway result. Left-click and draw around the points above the top whiskers of the error bars (see Figure 6.45), making sure that you close the circle around the points before releasing the button.

Figure 6.45 Lasso around Points

image

There should be about thirteen selected rows in your table. It’s okay if the count is not exact. Now we just need to extract these very best historical performers from the rest.

Note: To help make these easier to see, the size of the graph has been increased by clicking and dragging the lower right corner to enlarge the graph.

18. From the Tables menu, select Subset (see Figure 6.46).

Figure 6.46 Tables Subset

image

19. In the launch window, select Selected Rows and All Columns (see Figure 6.47).

Figure 6.47 Select Selected Rows and All Columns

image

20. Click OK.

The resulting data table (see Figure 6.48) contains thirteen companies that have been sifted down from a group of 97 that represent the most profitable. The most profitable companies are mostly oil companies, though soap, drugs, and beverage companies each made a showing among the most profitable. We now have some potential portfolio selections for further study or investment.

Note: You may not get exactly 13. This is OK as some of the best performing companies might be challenging to grab using the lasso tool.

Figure 6.48 Data Table Showing Most Profitable Companies

image

6.3 Model Fitting, Visualization, and What-If Analysis

In this section, we will extend many of the JMP skills and concepts you’ve learned thus far into the insight you’ll need to articulate your data effectively. We will now combine our models with the Data Filter in order to better understand the dynamics of our data. We will also introduce the Prediction Profiler for real time visual what-if analysis.

Consider for a moment a thought experiment. Some old radios have an analog tuner dial on them. Imagine you turn on the radio and hear noisy static. You slowly turn the tuner dial through the frequencies until you hear a transmission. Sometimes you might tune past the station, but you slowly tune back to obtain the optimal signal. Your ear heard the difference between the signal and the background noise. The Profiler and Data Filter (as described in the previous section) can be used as just such a tuner, but instead of using your ears, you use your eyes. The next section shows you how.

Let’s test the idea that a positive relationship exists between profits and the number of employees in all of our companies. To do this, we need to use two things you already learned: Fit Y by X and the Data Filter.

1. Select Rows ▶ Clear Row States (see Figure 6.49)

Figure 6.49 Clear Row States

image

This clears the row selections, the markers, and the hidden and excluded rows from the last section.

Now let’s build a simple model of the relationship between profits and number of employees.

2. Select Analyze ▶ Fit Y by X. Select Profit($M) as the Y, Response role and #emp as the X, Factor role (see Figure 6.50).

Figure 6.50 Fix Y by X Profit Employees

image

3. Click OK. A bivariate fit appears (see Figure 6.51).

Figure 6.51 Bivariate Plot

image

4. From the red triangle for the bivariate fit, select Fit Line and Fit Mean (see Figure 6.52).

Figure 6.52 Select Fit Line

image

A red line appears in the data that extends generally from the lower left of the graph to the upper right of the graph. A green line appears that represents the mean of the response or average profit (see Figure 6.53).

Figure 6.53 Bivariate Profits by Employee with Fit Line

image

This red line shows that, in general, as the number of employees increases, so do the profits. Notice how sparse the data is for companies above 150,000 employees. Because there are few companies or rows of data above 150,000 employees, it might suggest that the relationship is weaker for very large companies. An unusually large and profitable company in the upper right is influencing the slope of the line as well.

Note: This is a simple or linear least squares regression fit, which fits a straight line through the points in a manner that balances the differences between those points above and below the line.

Let’s see if graphics can help us interpret this relationship:

5. Select the red triangle item next to the line labeled Linear Fit and select Confid Shaded Fit (see Figure 6.54). A red shaded boundary around the line appears (see Figure 6.55).

Figure 6.54 Conf Shaded Fit

image

Figure 6.55 Conf Shaded Fit Displayed

image

The curves appearing on the graph are 95% confidence curves. The shaded area around the line gets wider where data is sparser and the farther away from the mean of X. This is a visual indication that where data is sparse and far away from the mean of X, the estimates are less precise. The converse is also true: where data is denser and closer to the mean of X, the confidence curves are narrower and estimates are more precise. Thus, we can be more confident that the line better represents the relationship where the confidence curves are narrower than where they are wider. This might indicate that the relationship between profits and number of employees is strongest up to about 150,000 employees because the confidence curves start to widen there.

Notice, too, that a lot of the data points are pretty far away from the line. That’s curious. It appears especially true for companies with more than 50,000 employees. Maybe even more than one pattern is present. How might we explore this?

The goal now is to see if the positive relationship between profits and number of employees holds across all company types. The Data Filter acts like a tuner as it toggles through the company types. Here’s how.

6. Select Rows ▶ Data Filter (see Figure 6.56).

Figure 6.56 Select Rows Data Filter

image

7. In the Data Filter window, select the Type column (see Figure 6.57) and click Add.

Figure 6.57 Select Data Filter Type

image

8. Put a check mark next to the Select, Show and Include check boxes (see Figure 6.58).

Figure 6.58 Data Filter with Select, Show, and Include Selected

image

9. Now return to your Fit Y by X results. From the red triangle, select Script ▶ Automatic Recalc (see Figure 6.59).

Figure 6.59 Automatic Recalc

image

Automatic Recalc updates the scatterplot and fit based on the new row states controlled by the Data Filter. Alternatively, you could leave Automatic Recalc off and re-fit the trend line interatively for multiple views but this takes more time to do.

Now, you should see two floating windows (see Figure 6.60).

Figure 6.60 Bivariate Fit Y by X and Data Filter

image

10. From the red triangle of the Data Filter, select Animation (see Figure 6.61). Animation controls will now appear in the Data Filter window that look like the controls on a DVD player.

Figure 6.61 Data Filter Animation

image

11. Click on image (the step forward control) (see Figure 6.62).

Figure 6.62 The Step Forward Control

image

Each time you click the step forward control, you filter on just one of company groups within the Type column and you will see these highlighted in the Data Filter window.

Now click sequentially through the company types and notice what happens to the graph each time you click. Toggle through several times. Can you see the one type that is different from the rest? Figure 6.63 includes the series of graphs you should be seeing as you toggle through each company type.

Figure 6.63 Results of Clicking Through Company Type

image

Which graph is most different from the rest? Which graph has a fit line that is almost flat? You might have noticed that Beverages shows a line that is much flatter than the others, which indicates that the number of employees or size of the company has less influence on the profitability. Notice that, in general, the other company types show a reasonably strong correlation between profits and number of employees.

Notice also that Beverages is the sparsest fit, with only seven data points widely dispersed so the confidence bands also flare out widely. Also notice that the green mean line for Beverages is completely inside the confidence curves. This is another indication that the correlation for Beverages is not strong. At least for the beverage company type, we can conclude that the greater number of employees does not predict greater profits in this sparse sample.

12. Select Rows ▶ Clear Row States for next section.

Conducting What If Analysis

The Prediction Profiler is a different type of graphical tuner for your data and provides a clear picture of your model. The Prediction Profiler is an interactive graph that produces estimates of your Y column of interest (profits, in our example) subject to your predictors, or X, columns, (such as number of employees from the last example). The interactive feature allows you to drag and change the settings of any column to see the estimated effect on the other columns.

The advantage of the Prediction Profiler is that it lets you try what-if scenarios dynamically and get immediate estimates on any column of interest. Very cool!

1. To create a profiler, you first need to create a model to describe the relationships in your data. Select Analyze ▶ Fit Model (see Figure 6.64).

Figure 6.64 Analyze Fit Model

image

2. In the Fit Model window, select Profit($M) as the Y column and then select Sales($M), #emp, Assets($Mil.), and Stockholders Eq($Mil.). Click Add to place them in the Construct Model Effects window (see Figure 6.65).

Figure 6.65 Fit Model Window

image

Caution: The Prediction Profiler appears in the context of a multiple regression model here. The technique is introduced here to enable quick exploration of your model. It is an advanced analytical method. You should familiarize yourself with the underlying analysis concepts by using the question mark tool or searching the relevant chapters in the documentation (Help ▶ Search).

3. Select the pop-down menu and change it from Effect Leverage to Effect Screening as shown. This just changes the output to include additional tools to interact with the output (see Figure 6.66).

Figure 6.66 Fit Model with Emphasis Effect Screening

image

Now we are ready to run the model to test the relationships between profits and the other columns we’ve selected. This approach using the Fit Model platform employs a method called multiple regression. For now, we’ll exclude the type of company from the model because we will be using it later to investigate differences.

4. Click Run Model. A report window appears (see Figure 6.67).

Figure 6.67 Fit Model Screening

image

The Actual by Predicted Plot indicates there is a strong, positive relationship between profits and the columns you picked because of angle of the red fit line relative to the blue mean line. Just as in the earlier example, the blue mean line is not within the confidence bands, indicating a strong relationship. There are additional technical indicators of the quality of the relationships in the other report items. Use the question mark tool (?) to investigate the report items and learn more about the results.

5. Scroll down the window until you see the Prediction Profiler (see Figure 6.68).

The Prediction Profiler enables you to see visual relationships between Profits(#M) (Y column) and sales, number of employees, assets, and stockholder equity (X columns) as depicted by the fitted solid black lines (see Figure 6.68).

Figure 6.68 Prediction Profiler

image

Note: Each graph is bisected by a vertical red dotted line. These red dotted lines are interactive and provide us with a kind of X-ray vision into the relationships between the columns.

6. Drag the vertical dotted red line for sales from its current set point near 0 to a new set point near 40000 (see Figure 6.69).

Figure 6.69 Prediction Profiler Sliders

image

As you drag the red line for Sales($M), notice that the estimate for profits increases to around 1625. Dragging the line you may find it difficult to get it exactly. This is okay. You can also click on the red value and type in the new desired value, 40,000 in this case.

Each of the vertical and horizontal red dotted lines in the Prediction Profiler are interactive and work in this manner.

Look at the angle of the fitted line in the plots within the Prediction Profiler (see Figure 6.70). The angle of each of the fitted lines give us clues about their influence on Profit. A steeper line suggests that small changes in X (sales for example) will have significant changes in Y (profit).

Figure 6.70 Prediction Profiler

image

With sales, for example, this is an indication that our column of interest (profits) changes a lot when we move the red dotted vertical line associated with Sales($M). Thus, the steeper the line, the greater effect a change has on our Y column of interest. Lines that are flat, or nearly so (such as number of employees), mean that changes in these columns have little impact on our Y column (profits). With the Profiler, you can conduct dynamic, real-time what-if analyses.

We previously learned that profits varied among some company types. However, the model we just created did not include the Type column. So let’s explore how each type of company might impact profits in this model. We again use the Data Filter to accomplish this:

7. Select Rows ▶ Data Filter (see Figure 6.71).

Figure 6.71 Rows Data Filter

image

8. In the Data Filter window, select Type and click Add. Select Show and Include (see Figure 6.72).

Figure 6.72 Data Filter Type, and Show and Include

image

We need the report window to respond to the Data Filter dynamically using automatic recalculation. To do this, follow these steps:

9. From the red triangle (within the report window that contains the Prediction Profiler), select Script ▶ Automatic Recalc. The Prediction Profiler now responds to changes you make in the Data Filter.

10. In the Data Filter window, click on the red triangle and select Animation (see Figure 6.73).

Figure 6.73 Data Filter Animation

image

11. Click on image (the step forward control) (see Figure 6.74).

Figure 6.74 Data Filter Animation Step Control

image

Each time you click the step forward control, you are filtering the Fit Y by X analysis to just one of the company groups within the Type column and you will see these highlighted in the Data Filter window.

12. Now click the step forward control again (which toggles through the company types). Watch what happens to the Prediction Profiler with each click as it steps through the company types.

Have you noticed that some of the fitted lines in the profiler flip directions as the company type changes? How would you interpret these changes? We have summarized the first few of the profilers you see as you click through the company types. We will also provide a brief interpretation of each profile. Within each type, we can still drag the red dotted lines to investigate the sensitivity analysis or relationships within the model for that company type.

Aerospace

As sales increase, profits decrease (see Figure 6.75). As the number of employees increases, profits increase. As assets increase, profits decrease. As stockholder’s equity increases, profits increase.

Figure 6.75 Prediction Profiler for Aerospace

image

Beverages

The confidence bands are so far away from the fit that we must be very cautious about trusting the profiler for Beverages (see Figure 6.76).

Figure 6.76 Prediction Profiler for Beverages

image

Computer

As sales increase, profits increase (see Figure 6.77). As the number of employees increases, profits decrease slightly. As assets increase, profits decrease steeply. As stockholder’s equity increases, profits increase.

Figure 6.77 Prediction Profiler for Computer

image

Soap

As sales increase, profits increase (see Figure 6.78). As the number of employees increases, profits stay the same. As assets increase, profits decrease. As stockholder’s equity increases, profits increase.

Figure 6.78 Prediction Profiler for Soap

image

Now it’s your turn. On your computer screen, try reading the Prediction Profiler for Oil using what you have learned.

6.5 Summary

Most real-world problems are complex and involve multiple columns. The Partition platform is a flexible tool for solving many types of problems that involve multiple columns and rapidly identifies the key relationships in the data.

Tools like Data Filter and Lasso enable quick identification and extraction of interesting subsets of data. The Data Filter also provides an exploration method using animation that lets you tune in to just the slice of the data that best supports an inference or hunch.

The Profiler using the Fit Model platform offers a powerful way to understand the relationships among columns in your models. The ability to manipulate values (using the red dotted lines) within a column and immediately see their effect on other columns provides the means to conduct visual what-if analyses. With the Data Filter, you can drill down to subcategories or ranges of columns to discover those nuggets of insight that are often hidden at first glance.

This chapter and the last have presented an approach to problem-solving that is unique to JMP. The approach underscores the progressive nature of discovery that tends to build from simple descriptions of one column of data to complex relationships among many columns. This problem- solving process leads to a better understanding of your data and, in turn, to the insights and answers you seek. This process might not only go in one direction from simple to complex. As discoveries are made with the more advanced multi-column tools, confirmation and further analysis can be made with the simpler ones, Distribution and Fit Y by X, from where our journey began.