# Customer Analytics For Dummies (2015)

*Appendix*

### Predicting with Customer Analytics

*In This Appendix*

Recognizing relationships

Predicting performance

Predictive analytics comprises several methods to analyze what happened in the past to predict what will most likely happen in the future. You use your historical and transactional customer data to identify risks and opportunities.

You’ve almost surely encountered the results of predictive analytics as a customer yourself. Some common examples include

· **Amazon’s recommendations:** Probably the most famous example of predictive analytics that touches the customer is Amazon’s recommendation engine, which generates suggestions such as “customers who purchased this book also purchased” lists.

· **Facebook and LinkedIn:** Social media websites like Facebook and LinkedIn use algorithms to determine both whom you might want to connect with and which stories and updates appear in your timeline, based on patterns in your viewing behavior and the behavior of people similar to you.

· **Netflix:** Netflix recommends which movie or TV show you’ll like based on your past views and matching that to customers with similar behavior.

· **Return rates:** I worked with a mobile carrier to predict which phones customers would return most often, based on the opinions of customers who evaluated each phone’s usability.

· **Credit cards:** Your credit score and credit report are the results of the banking and credit industry wanting to predict which customers are more likely to pay on time and which are more likely to default.

· **Insurance:** Life insurance, car insurance, and health insurance providers notoriously collect a number of data points about customers to predict which customers will more likely get sick and need care, have a higher chance of dying prematurely, or are more likely to get into a car accident.

In all these examples, past customer data is being used to predict future events. The same principle applies to customer analytics: using past customer behavior to predict future behavior. Throughout this book, I’ve covered both which customer analytics to collect and how to collect them. With these analytics collected to describe customers’ current and past experiences with products and services, you can also predict the future. This appendix is a primer to help you get started with the skills needed to predict with customer analytics.

Three essential techniques to make predictions with customer analytics include

· **Finding similarities:** Identify how customers are similar, either based on behavior like purchase history or attitudes like customer satisfaction.

· **Identifying trends and patterns:** Predict when customers will purchase, future revenue, website page views, subscription rates, or same-store sales.

· **Detecting differences:** Understand how customers differ or respond differently to product features and designs, which allows for customizing products, experiences, and pricing.

*Finding Similarities and Associations*

Finding similarities and associations with customer analytics data is the most common analysis technique to predict future customer behavior. Some examples of the types of questions based on making associations with customer data include:

· For customers who purchase product A, what other products do they purchase?

· Will coupons increase same-store sales?

· Does a longer time on a website result in more purchases?

· Will a reduced price mean higher sales?

· Is customer loyalty tied to future company growth?

· Does a change in home page design cause higher conversions?

Understanding the relationship between variables, how strong that association is, and ultimately what drives outcome variables is a fundamental and useful skill for predicting with customer analytics.

*Visualizing associations*

You can visualize the relationship between two variables by graphing them in a scatterplot. Scatterplots are a useful tool to identify associations and examine the strength of the relationship.

Figure *A-1* shows the relationship between the time it takes customers to make a purchase on a website using their mobile phone and how many finger taps the purchase took. This data came from a usability study (see Chapter *14*) with 181 participants on an e-commerce mobile site.

**Figure A-1:** A scatterplot between mobile phone “taps” and the amount of time it took to make a purchase.

In Figure *A-1*, each dot represents 1 of the 181 customers’ time and how many taps it took them to make the purchase:

· The horizontal axis (called the *x*-axis) shows the number of taps.

· The vertical axis (called the *y*-axis) shows the number of seconds it took for each participant to check out.

Figure *A-2* shows the same scatterplot with an arrow pointing to one customer who took 50 seconds to make the purchase with 16 taps.

As the number of taps increases, so too does the time it takes the customer to check out. You can infer a positive association between the two pieces of data.

**Figure A-2:** This customer took 50 seconds and 16 taps to make an online purchase from a mobile phone.

*Quantifying the strength of a relationship*

You can numerically quantify the strength of an association by using the Pearson Product Moment Correlation. It’s often just called the correlation coefficient and is represented by the symbol *r.* The correlation is used to quantify the association between two continuous variables (such as revenue, time, or rating scales). (See Chapter *2* for a reminder on the difference between continuous and discrete variables.) I cover associations between binary variables later in this appendix.

The correlation coefficient varies from an *r* of -1, which indicates a perfect negative correlation, to 1, which means a perfect positive correlation. Figure *A-3* shows three example scatterplots: a perfect negative correlation (*r* = -1), no relationship (*r* = 0), and a perfect positive relationship (*r* = 1).

**Figure A-3:** Scatterplots of relationships between variables.

Using two perfectly correlated variables isn’t helpful. They’re redundant; if you have the value for one variable, you can perfectly predict the other.

In practice, correlations are weak to strong. Some examples of correlations of different strengths include:

· **Height and weight:** *r* = .8

· **Scholastic Aptitude Test (SAT) and first-year college grades:** *r* = .5

· **Usability and customer loyalty:** *r* = .7

A correlation between variables means that you can use one variable to predict the value of the other:

· If you know a customer’s height, you can estimate his weight.

· If you know a customer’s weight, you can estimate his height.

But these aren’t perfect correlations: the further a correlation is from 1 or -1, the more error you have in predicting one variable based on the other.

*Computing a correlation*

You can compute the correlation coefficient by hand, or use software like Excel to compute it for you.

To compute a correlation on a set of data using the Pearson Correlation formula, follow these steps. (Refer to Figure *A-1* for the data I’m using.)

1. **Set up the data in rows and columns in Excel.**

Have one column for each variable and the customers’ IDs. Each row should represent the same customer’s data on two variables. Figure *A-4* shows 17 customers’ time to make the purchase and the number of taps needed for the purchase.

2. **In any cell, type**

=PEARSON(

3. **Select all the values for the first variable.**

My data for time appears in column B and the data goes from cell B2 to cell B182.

4. **Type a comma (**,**) and select all the values for the second variable.**

My data appears in column C and the data goes from cell C2 to cell C182.

Be sure to select the same number of values for both variables.

5. **Close the parenthesis and then press Enter to get the correlation.**

=PEARSON(B2:B182,C2:C182)

The correlation for this data, between taps and time, is .560666. As the scatterplot in Figure *A-1* shows, there’s a positive correlation between time and taps.

**Figure A-4:** Setting up the data in Excel to compute a correlation.
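The same calculation that Excel’s =PEARSON() function performs can be sketched in a few lines of Python. The taps and time values below are hypothetical stand-ins, not the 181-participant data set from the study:

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation, the same calculation as Excel's =PEARSON()."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical taps/time (seconds) pairs; substitute your own two columns of data
taps = [10, 12, 16, 20, 25, 30]
times = [95, 130, 150, 180, 190, 240]

r = pearson(taps, times)  # a positive value: more taps, more time
```

Squaring the result (r ** 2) gives the coefficient of determination covered later in this appendix.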

*Interpreting the strength of a correlation*

Once you compute a correlation, you need to interpret the strength of the relationship. The correlation between taps and time is *r* = .56. Is that a strong correlation? It depends.

The strength of a correlation is context dependent. A “strong” correlation in one context may be weak in another. It depends on how much error you can tolerate and the consequences of being wrong in your predictions. Predicting time from taps probably won’t involve a loss of life or money, so the correlation is strong enough to be useful. In fact, it’s about the same strength as the association between the SAT and first-year college grades, where there’s a lot at stake!

While correlations are context dependent, it helps to have some guidance on what you’ll likely see with customer analytics data. A famous researcher by the name of Jacob Cohen examined correlations in the behavioral sciences, a field similar to customer behavior measurement, and provided the following rules based on how commonly correlations of each size were reported in the peer-reviewed literature:

· **Small:** *r* = .10

· **Medium:** *r* = .30

· **Large:** *r* = .50

Therefore, one simple interpretation of the correlation of *r* = .56 between taps and time is that it’s large. But there is another way of interpreting the correlation coefficient, which I cover next.

*Coefficient of determination r^{2}*

Multiplying the correlation coefficient by itself (squaring it) produces a metric known as the *coefficient of determination.* It’s represented as *r*^{2} (pronounced *r-squared*) and provides a better way of interpreting the strength of a relationship.

For example, a correlation of *r* = .5 squared becomes .25. Note that *r*^{2} is often expressed as a percentage, 25%. For the correlation between taps and time, the *r*^{2} is 31%. That means taps can explain 31% of the variation in time. And conversely, time explains 31% of the variation in taps. As you can see, even a strong correlation above *r* = .5 still explains a minority of the differences between variables.

Height, for example, explains around 64% of the variation in weight. That means that knowing people’s heights will explain most — but not all — of why they are a certain weight. Other factors explain 36% of the variation. That would include things like exercise, eating habits, or genetic factors that make some people weigh more at a certain height than others of the same height.

Use this same approach when correlating customer analytics. Find the correlation, square it, and then interpret the *r*-squared value. When stakes are high, you want to have high correlations and explain most of the variation between variables. With customer analytics, there are usually multiple variables that predict another variable. I get to multiple regression later in this appendix.

*Correlation is not causation*

One of the most important concepts about correlation, which you will hear repeated because it’s worth repeating, is that correlation is not causation. Just because one variable is correlated with another doesn’t mean one causes the other. Time doesn’t cause taps. SAT scores don’t cause higher grades. Net Promoter Scores don’t cause higher revenue (see Chapter *12*).

You can say there is an association, but that association doesn’t imply causation. See the later section on ways to determine causation.

It could be that a new design causes higher website conversion rates or it could be that a coupon increases same-store sales. However, there could be other variables that are actually affecting the outcome variable. For example, it could be that same-store sales were already increasing because of an increase in customers. Or it could be that more customers are converting on a website (making a purchase) because the competitor website sold out of the same product — not because of your website design change. Always consider what other variables might be affecting the relationship when making statements about causation.

*Associations between binary variables*

Very often in customer analytics, you encounter binary data that takes the form of yes/no, purchase/didn’t purchase, agree/disagree, and so forth (see Chapter *2*). You need to understand the association between binary variables just as you need to understand the association between continuous variables described in the preceding section. While the principle of correlation is the same with binary data, the computations are different.

One of the most famous and visible examples of predictive analytics with binary data is the Amazon recommendation engine, as shown in Figure *A-5*.

**Figure A-5:** Amazon’s recommendation engine.

While the exact algorithm Amazon uses is proprietary, it’s known that much of it is based on an association that indicates that a person who purchases one book also purchases another book. The recommendations are based on binary variables. To generate a recommendation, Amazon computes the proportion of customers who purchase one book and the proportion of the same customers who purchase any number of other books. (See Chapter *2* for a reminder on binary data and proportions.) Books with the highest association are recommended first, the next-highest associations next, and so forth. Figure *A-6* shows transactions from 15 customers across four books. These could just as likely be software, groceries, songs in a playlist, TV shows, or any products or services customers can select from.

If the customer purchased the book, there’s a 1 in the row; if she didn’t, there’s a 0. For example, Customer 1 purchased Book A and Book B, but not C or D. Customer 2 purchased only Book B.

**Figure A-6:** Purchases per customer represented as a yes/no choice.

To compute the association between any two book purchases, follow these steps:

1. **Count the number of customers who purchased each of these combinations of books:**

· Neither book

· Both books

· Only Book A

· Only Book B

2. **Put the totals in a table, like this:**

For example, six customers bought both Books A and B.

3. **Label the table cells A to D, like this:**

4. **Use the formula for the correlation between binary variables:**

5. **Fill in the values for the books to find the correlation between binary variables, like this:**

In this case, the correlation between customers who purchase Book A and Book B is .327.

A correlation between binary variables is called *phi* and is represented with the Greek symbol φ.

You can interpret the association between binary numbers the same way as the Pearson Correlation *r.* In fact, phi is a shortcut method for computing *r.* You get the same results by using the Excel Pearson formula and computing the correlation for all sets of data.

Figure *A-7* shows the data setup in Excel. I computed the correlation between all pairs of books using the =PEARSON() Excel function.

I then created a matrix of correlations for each pair of books, as shown in Figure *A-8*.

Confirming the earlier result, the correlation between Book A and B is .33. The second-highest correlation is between Book A and Book D at .25.

The correlation between Book B and Book C is -.48. This negative correlation means that customers who purchase Book B are less likely to purchase Book C.

So if a customer is viewing and considering purchasing Book A, it would make sense to recommend (and possibly offer that customer an incentive) to also purchase Books B and D, but not Book C.

You may hear the terms *Basket Analysis* or *Affinity Analysis.* Both of these are just other names for finding associations and correlations between variables. It’s like examining customers’ shopping baskets in a grocery store to see what items are purchased together.

**Figure A-7:** The correlation between books using the Pearson Excel function.

**Figure A-8:** Correlations between book purchases.
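The phi computation from the steps above can be sketched in Python, along with a check that it matches the Pearson formula on the raw 1/0 data. The purchase vectors here are hypothetical, not the 15 customers of Figure A-6, and the cell names (both, only A, only B, neither) follow the counts listed in Step 1 rather than the book’s unreproduced table:

```python
import math

def phi(both, only_a, only_b, neither):
    """Phi correlation from the four cells of a 2x2 purchase table."""
    return (both * neither - only_a * only_b) / math.sqrt(
        (both + only_a) * (only_b + neither) *   # row totals (Book A yes/no)
        (both + only_b) * (only_a + neither))    # column totals (Book B yes/no)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

# Hypothetical 1/0 purchase vectors for two books (not the Figure A-6 data)
book_a = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
book_b = [1, 0, 1, 0, 0, 1, 1, 1, 0, 0]

both    = sum(1 for a, b in zip(book_a, book_b) if a == 1 and b == 1)
only_a  = sum(1 for a, b in zip(book_a, book_b) if a == 1 and b == 0)
only_b  = sum(1 for a, b in zip(book_a, book_b) if a == 0 and b == 1)
neither = sum(1 for a, b in zip(book_a, book_b) if a == 0 and b == 0)

r = phi(both, only_a, only_b, neither)  # equals pearson(book_a, book_b)
```

Because phi is a shortcut for Pearson on binary data, the two functions return the same value here, which is why the Excel =PEARSON() approach in Figure A-7 works.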

*Determining Causation*

While correlation alone is not causation, there are ways to determine and show causation between customer variables. The amount of faith you can have in claims of causation depends on the method used to collect the data. While you may think that a new web page design resulted in more page views, it could be that page views were already increasing.

You can use any of five methods to make claims about causation; the following sections cover them from strongest to weakest.

*Randomized experimental study*

Randomly assigning participants to different design treatments and/or a control in a research study is an experimental design. For example, if you wanted to know which of several check-out page designs customers understand best, you can create three different designs:

· The dependent variable could be something like

· Accuracy in answering questions

· Difficulty in checking out

· Confidence in checking out

· Time to check out

· The independent variable is the design — with three variations.

The hallmark of experimental research is randomly assigning participants to different treatments. You then identify the design with which users made their selections most accurately and most confidently.

There are all sorts of variables you can’t control for, or are unaware of, that could impact results. But by randomly assigning participants to different designs or treatment conditions, you spread those nuisance variables evenly across designs. This increases the internal validity of the findings.

As another example, researchers in Europe conducted an experiment in which they manipulated both the usability and visual appeal of an online e-commerce website. They essentially took one website, made the navigation intuitive or not intuitive, and then changed the colors and contrast to be appealing or unattractive. They found that customers find more usable websites more attractive. The researchers concluded that better usability increases opinions about attractiveness. Their conclusion is well-substantiated because they used a randomized experimental design.

Experiments (with random assignment) provide the strongest controls against extraneous variables and provide the highest levels of internal validity. These generate the strongest types of research results. But what happens if you cannot randomly assign participants?

*Quasi-experimental design*

If you want to test different conditions but you cannot randomly assign participants to those conditions, then the study is quasi-experimental. For example, you might want to know whether customers find the beta version of a software product more usable than an existing version. Customers of beta software usually volunteer to use the software during the beta-test period. This self-selected (non-random) assignment introduces a potential source of bias into the results. The design has higher external validity because these groups are naturally segmented, but it has lower internal validity.

When you compare attitudes of usability (say from the SUS or SUPR-Q, as discussed in Chapter *9*) from the beta software customers to the existing version customers and find a difference, the difference could be due to differences in the type of people using the software and not actual differences in attitude. This type of problem is called *confounding,* and it makes the quasi-experimental design less internally valid than the experimental design.

As another example, I worked with a national retailer a few years ago that wanted to know the effects of direct mail coupons on in-store purchases. I used two markets: One received a new coupon (treatment) and the other received the standard coupon (control) from newspaper inserts mailed to homes. I compared the sales of stores prior to the coupon and after the coupon in both markets.

I couldn’t randomly assign people to live in different cities, so I used two similarly sized midwestern advertising markets and looked to see what the new coupon did to sales. While I was able to show more sales with the new coupon, there was still some uncertainty about whether that difference was just due to other differences in the markets.

The weakness with quasi-experimental studies is that you can never be as sure as you can with random assignment that any increase in sales is attributable to the treatment variable (in this case, the coupon) rather than to other nuisance variables (in this case, differences between the markets).

*Correlational study*

A *correlational study,* as the name suggests, is when you look at the relationship between two variables and report the correlation. For example, the relationship between product usability and likelihood to recommend is a strong positive correlation (meaning ease is strongly associated with, and likely predicts, much of why users do and don’t recommend products).

While correlational studies provide valuable results, they don’t have random assignment and the independent variables aren’t manipulated, which lessens the internal validity of the findings and weakens the case for causation.

The next time you hear that one customer metric causes another metric, look to identify how that was determined. Chances are it was done with either a correlational study or a quasi-experimental design. That doesn’t mean one variable doesn’t cause another; it just means you can’t be as confident.

*Single-subjects study*

It’s often the case that getting access to customers is extremely difficult. For example, you might be interested in whether a new interface to a PET scanner reduces the time it takes attending radiologists to adjust a setting on the scanner.

If you had access to one of these customers, you could ask her to perform a task on the existing software version three times, record how long it took to complete, have her attempt the same task three times on the new software, and finally, have her attempt it again three times on the old version. Figure *A-9* shows how this data looks on a scatterplot.

This type of single-subject study uses what’s called an ABA condition (where A is the existing software and B is the new software). The repeated trials help establish stability in the measures and increase the internal validity of the finding (as much as you can from a single subject).

The obvious limitation with the single-subject design is generalizability. All you know is that when you manipulate an independent variable (the software), task time goes down for one user. There could be a number of variables you’re not accounting for. For this reason, single-subject designs aren’t used very often in customer research.

**Figure A-9:** A single-subject customer study.

You can actually use more than one participant in a single-subject design (for example, two or three radiologists) and use the same technique to establish the pattern. To be more sophisticated in your analysis, you can also use time series analysis to examine trends over time and by condition for each user, or for the data in aggregate (I discuss time series analysis later in this appendix).

*Anecdotes*

Unfortunately, many business decisions are made based on opinion or hearing from a vocal customer or sales rep. While a good story of a successful product strategy can be convincing emotionally, it carries little weight when establishing causation.

*Predicting with Regression*

A correlation speaks to the strength of a relationship between two variables, and the r^{2} helps interpret that strength. But to predict one variable from another, you use an extension of correlation called regression analysis. *Regression analysis* is known as a “workhorse” in predictive analytics. The math isn’t too complicated, and most software packages support regression analysis.

Regression analysis extends the idea of the scatterplot used in correlation and adds a line that best “fits” the data.

One of the requirements of using correlations and regression analysis is that the data is linear. *Linear* means a line can reasonably describe the relationship between variables and then be used to predict values that don’t appear in your data (future customer data points). If the scatterplot of your data forms a curve, or any shape that a line doesn’t fit well, you may get misleading results.

While there are many ways to draw a line through the data, least squares analysis is a mathematical method that minimizes the total squared distance between the line and each dot in the scatterplot. This analysis can be done by hand or by using software such as Minitab, SPSS, SAS, R, or Excel.

Figure *A-10* shows the least squares regression line from the scatterplot of taps and time (refer to Figure *A-1*).

**Figure A-10:** A least squares regression line.

The software gives you the equation to the regression line above the graph:

Time = 86.57 + 4.486 Taps

The regression equation takes the general form of Ŷ = *b*_{0} + *b*_{1}X + *e*. Here’s an explanation of each part of the equation:

· **Ŷ (pronounced y-hat):** This is the predicted value of the dependent variable: predicted time.

· ***b*_{0}:** Called the *y*-intercept, this is where the line would cross (or intercept) the *y*-axis.

· ***b*_{1}:** This is the slope of the predicted line (how steep it is).

· **X:** This represents a particular value of the independent variable: taps.

· ***e*:** This represents the inevitable error the prediction will contain.

So in this example, the regression equation indicates that the predicted amount of time it takes a customer to make a purchase is equal to 86.57 (the *y*-intercept) plus 4.486 (the slope) multiplied by the number of taps (X).

*Predicting with the regression line*

It’s the regression formula that allows you to predict customer values that don’t exist in your data. It allows you to perform “what-if” analyses on future customer values. This is the “predictive” part of predictive customer analytics.

For example, using the regression equation from the preceding example, you can predict how long a customer takes to make a purchase with 38 taps. You just plug 38 into the regression equation.

Time = 86.57 + 4.486(38)

Time = 86.57 + 170.47 = 257.04

A customer needs 257 seconds, or a bit longer than four minutes, to make a purchase that requires 38 taps.

The dependent variable is denoted “Y” and is displayed on the y (vertical) axis. The independent variable is called X and is displayed on the horizontal (x) axis. (See Chapter *2* for discussion about independent and dependent variables.)

Instead of predicting a customer’s task time from taps, this same approach can be used to predict other customer analytics, including:

· Customer revenue from advertising revenue

· Likelihood to recommend from usability data

· Number of conversions from website page views
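The fit-then-predict workflow above can be sketched in Python. The fit_line function computes the least squares intercept and slope from raw data, and the last line reproduces the book’s worked example by plugging 38 taps into the fitted equation from Figure A-10:

```python
def fit_line(xs, ys):
    """Least squares intercept (b0) and slope (b1) for y-hat = b0 + b1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
          sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

def predict(b0, b1, x):
    """Plug a new x value into the regression equation."""
    return b0 + b1 * x

# Reproduce the worked example using the coefficients from Figure A-10
time_38 = predict(86.57, 4.486, 38)  # about 257 seconds for 38 taps
```

The same two functions work for any pair of variables, such as page views predicting conversions, once you pass in your own data.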

*Creating a regression equation in Excel*

To create a regression equation using Excel, follow these steps:

1. **Insert a scatterplot graph into a blank space or sheet in an Excel file with your data.**

You can find the scatterplot graph on the Insert ribbon in Excel 2007 and later.

2. **Select the x-axis (horizontal) and y-axis data and click OK.**

Put what you want to predict in the *y*-axis (so my time data is in column B). The taps are in column C.

You now have a scatterplot.

3. **Right-click on any of the dots and select “Add Trendline” from the menu.**

The Format Trendline dialog box opens, as shown in Figure *A-11*.

**Figure A-11:** Add the regression equation and *r*-squared value to the trendline.

4. **Select Trendline Options on the left, if necessary, then select the Display Equation on Chart and Display R-Squared Value on Chart boxes.**

You now have a scatterplot with trendline, equation, and r-squared value, as shown in Figure *A-12*. The regression equation is *Y* = 4.486x + 86.57. This is the same regression equation from the earlier “*Predicting with the regression line*” section, except that the *y*-intercept (86.57) is after the slope.

The r^{2} value of .3143 tells you that taps can explain around 31% of the variation in time. It tells you how well the best-fitting line actually fits the data.

**Figure A-12:** Scatterplot, regression line, and equation computed in Excel.

Going beyond the ends of the observed values is risky when using a regression equation. There’s no guarantee that the relationship will remain linear beyond the range of the data points.

Watch out for the following three things when correlating customer analytics data and using regression analysis:

· **Range restriction:** Two variables might have a low correlation because you’re only measuring in a narrow range. For example, height and weight have a strong positive correlation, but if you measure only National Basketball Association (NBA) players, the correlation would mostly go away. This can happen, for example, if you are looking at a narrow range of customers — say, the ones with the highest incomes or most transactions.

· **Third variables:** It’s often the case that another variable you aren’t measuring is actually the cause of the relationship. For example, high school grades are correlated with college grades. It may seem like better studying in high school leads to better grades in college. However, it’s often the case that a third variable, socioeconomic status (SES), is a better explanation of both high school and college grades. Students from families with higher SES tend to have higher grades in high school and college than students from families with lower SES. In customer analytics, an improving economy or a growing company may be the reason for increases in sales, and not your marketing campaign or feature changes.

· **Nonlinearity:** The relationship between variables needs to be linear — that is, follow a line somewhat. If the relationship curves downward or upward, a correlation and regression equation will not properly describe the relationship.
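The nonlinearity warning is easy to demonstrate with a small hypothetical sketch: here y is perfectly determined by x, yet the Pearson correlation comes out at zero because the relationship is U-shaped rather than linear:

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]  # y = x^2: perfectly predictable, but not linear

r = pearson(xs, ys)  # 0.0 -- a straight-line fit finds no association at all
```

This is why plotting the data first matters: the scatterplot reveals the curve that the correlation coefficient hides.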

*Multiple regression analysis*

When you have one independent variable predicting one dependent variable, it’s called *bivariate regression.* You can extend the idea of regression to include more than one independent variable, which then becomes multiple regression. With the taps and time data, taps explain only 31% of the variation in time. Other variables also predict customer time. Multiple regression analysis includes additional variables to see how much, if at all, they contribute to explaining variation in the dependent variable beyond the variables already included.

In Chapter *12*, I discuss the value of a key driver analysis for predicting which variables best predict customer loyalty. A key driver analysis typically uses multiple regression analysis (another technique is Shapley Value Analysis, which uses a different algorithm but provides similar output).

With multiple regression, you build the regression equation with two or more variables. You can then see how important each independent variable is, relative to the others included.

With another independent variable, the regression formula becomes Ŷ = *b*_{0} + *b*_{1}*X*_{1} + *b*_{2}*X*_{2} + *e*:

· **Ŷ (pronounced y-hat):** Represents the predicted value of the dependent variable: predicted time.

· ***b*_{0}:** This is the *y*-intercept.

· ***b*_{1}:** The regression coefficient “weight” for variable 1.

· ***b*_{2}:** The regression coefficient “weight” for variable 2.

· ***X*_{1}:** Represents a particular value of independent variable 1.

· ***X*_{2}:** Represents a particular value of independent variable 2.

· ***e*:** Represents the inevitable error a prediction will contain.

For example, 2,584 customers rated their likelihood to recommend a learning management system (LMS) used at a university:

· They rated their likelihood to recommend on a 0 to 10 scale (0 = not at all likely to recommend and 10 = extremely likely).

· These customers rated their satisfaction on several other variables using a five-point scale from not at all satisfied (1) to very satisfied (5). They rated attributes including

· Stability of the system (whether it crashed)

· Satisfaction with customer support

· Efficiency (how quickly customers get their tasks done)

See Chapter *9* for ideas on measuring customer attitudes.

· Usability of the product across four 5-point questions for a total of 20 points.

Here’s how I used a bivariate regression equation to see how much customers’ satisfaction with the product’s stability predicts their likelihood to recommend it.

1. I computed the correlation and found that satisfaction with stability has a reasonably strong correlation *r* = .49.

2. I performed a regression analysis using the statistical software SPSS.

You can also use Minitab, SAS, R, or online calculators to perform a multiple regression analysis.

In SPSS, I chose Analyze⇒Regression⇒Linear Regression. In the Linear Regression dialog box, I used the satisfaction with the product stability (labeled Stability) of the software to predict the likelihood to recommend (labeled NPS), as shown in Figure *A-13*.

I got the following table of results shown in Figure *A-14*. While there are a lot of numbers, what you’re interested in is the regression equation in the B column. This column contains the B’s in the regression equation described earlier (the *y*-intercept and slope). The estimate of likelihood to recommend = 1.950 + 1.423 (Stability). The “Sig” column values show that the predictor of Stability Satisfaction is statistically significant (not just a chance association) because the values of .000 are less than .05. See the sidebar discussion on statistical significance. The r^{2} value is .24, meaning stability explains around 24% of customers’ likelihood to recommend (this is found from squaring the correlation of .49).

**Figure A-13:** Predicting the likelihood to recommend.

**Figure A-14:** Regression output from SPSS.

Next I wanted to see if usability adds any predictive power to customers’ likelihood to recommend, beyond what is being explained by their satisfaction toward the stability.

3. I ran the correlation between usability and likelihood and got a strong correlation of *r* = .733.

The correlation between usability and stability is also medium-strong, *r* = .477. It’s often the case that independent variables will correlate with each other and the dependent variable. But you don’t want to include variables that provide completely redundant information. It could be that when customers rate the usability of the software, they are already considering its stability. Multiple regression can tell you if the stability satisfaction is adding anything that usability isn’t already adding (and vice versa).

4. I added usability to the regression equation and repeated the procedure.

5. I examined the updated table of results, shown in Figure *A-15*.

**Figure A-15:** Multiple regression output from SPSS.

Look at the B column to see the regression weights and to build the equation for likelihood to recommend:

Likelihood to Recommend = -.937 + .541 (Stability) + .419 (Usability)

Again, both predictors are statistically significant because the values in the “Sig” column are under .05. The r^{2} for this equation is .561, meaning these two variables together predict around 56% of likelihood to recommend.

If the correlation between independent variables is too high, an undesirable condition called *multicollinearity* occurs, which can provide a misleading regression equation. I get concerned when the correlations exceed *r*>.9 and I use additional tests to be sure multicollinearity isn’t occurring.
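One quick screen for multicollinearity is to compute the pairwise correlations between your independent variables. Here's a minimal sketch of the Pearson correlation; the ratings below are made up for illustration:

```python
def pearson_r(x, y):
    # Pearson product-moment correlation between two equal-length lists
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# hypothetical stability (1-5) and usability (out of 20) ratings
stability = [3, 4, 2, 5, 4]
usability = [12, 15, 9, 18, 16]
print(round(pearson_r(stability, usability), 2))  # 0.99
```

A correlation this close to 1 between two predictors would be a red flag; correlations in the .4 to .7 range, like those in the LMS example, are common and usually fine.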

This regression equation output reveals a few things.

· Adding usability as a variable substantially increased the explanatory power compared to just stability alone:

· With stability alone, you could predict 24% of likelihood to recommend.

· When you add in usability along with stability satisfaction, the explanatory power more than doubles to 56%, which is the majority of variation.

· You can predict a customer’s likelihood of recommending the software product by inserting values for stability and usability ratings.

For example, a stability satisfaction rating of 4 (out of 5) and a usability score of 15 (out of 20) would result in a likelihood to recommend score of

Likelihood to Recommend = -.937 + .541 (4) + .419 (15) = 7.5

That would place this customer in the Passive category, not terribly likely to recommend, but probably someone who also won’t detract (see Chapter *12*).

· The relative importance of each of these two variables in predicting likelihood to recommend.

The weight of each variable is its coefficient in the regression equation (.541 for stability and .419 for usability). While stability appears to have the higher weight, and therefore greater importance, this is misleading because the items used different scales (a 5-point versus a 20-point scale).

To compare the predictors fairly, look at the standardized coefficients. The standardization process converts each raw score into a standard score, making the values comparable across different scales. You see this value in the Standardized Coefficients column (refer to Figure *A-15*). Stability has a standardized weight of .188, meaning a 1-point increase in the standardized stability score increases the likelihood to recommend by .188 points. In comparison, the standardized weight of usability is .641: a 1-point increase in the standardized usability score increases the likelihood to recommend by .641 points. In other words, usability is more than three times as important as stability in predicting customer loyalty (.641 / .188 = 3.4).
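To make the prediction step concrete, here is the fitted equation as a small function, using the coefficients reported in Figure *A-15*:

```python
def likelihood_to_recommend(stability, usability):
    # coefficients from the multiple regression output (Figure A-15)
    return -0.937 + 0.541 * stability + 0.419 * usability

# stability rating of 4 (out of 5), usability score of 15 (out of 20)
print(round(likelihood_to_recommend(4, 15), 1))  # 7.5
```

Plugging in different ratings lets you score any customer on the 0-to-10 likelihood-to-recommend scale.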

*Predicting with binary data*

The regression examples used in the preceding section use both continuous dependent and continuous independent variables. You can also use categorical predictor variables with two, three, or more specific categories — for example, task-completion, agree-disagree, or a customer segment (high income, medium income, or low income) — as long as they are dummy coded with 1's and 0's, where the 1 represents the category presence and 0 the absence. See Chapter *2* for a reminder on dummy coding. For regression analysis, the dependent variable needs to be a continuous variable. A special form of regression analysis called *logistic regression* can handle categorical data as the dependent variable.
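Logistic regression replaces the straight-line equation with an S-shaped curve that keeps predictions between 0 and 1, so the output can be read as a probability of the category (for example, task completed). Here is a minimal sketch of the prediction side only; the coefficients are made up for illustration:

```python
from math import exp

def logistic_prediction(b0, b1, x):
    # probability that the binary outcome equals 1 (e.g., task completed),
    # computed with the logistic function 1 / (1 + e^-(b0 + b1*x))
    return 1 / (1 + exp(-(b0 + b1 * x)))

# hypothetical coefficients: intercept b0 = -3.0, weight b1 = 0.8
p = logistic_prediction(-3.0, 0.8, 5)  # predictor value of 5
print(round(p, 2))  # 0.73
```

Fitting the coefficients themselves requires an iterative procedure (maximum likelihood), which statistical packages like SPSS, R, and Minitab handle for you.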

**Statistical significance and p-values**

When dealing with customer analytics in general, you’ll encounter the phrase *statistically significant.* You’ll also run into something called a *p-value.* There’s a lot packed in that little p and there are books written on the subject. Here’s what you need to know.

In principle, a statistically significant result (usually a difference) is a result that’s not attributed to chance. More technically, it means that if the Null Hypothesis is true (which means there really is no difference), there’s a low probability of getting a result that large or larger.

Consider these two important factors.

· Sampling error. There's always a chance that the differences we observe when measuring a sample of customers are just the result of random noise: chance fluctuations, happenstance.

· Probability, never certainty. Statistics is about probability; you cannot buy 100% certainty. Statistics is about managing risk. Can we live with a 10-percent likelihood that our decision is wrong? A 5-percent likelihood? 33 percent? The answer depends on context: What does it cost to increase the probability of making the right choice, and what is the consequence (or potential consequence) of making the wrong choice? Most publications suggest a cutoff of 5% — it's okay to be fooled by randomness 1 time out of 20. That's a reasonably high standard, and it may match your circumstances. It could just as easily be overkill, or it could expose you to far more risk than you can afford.

The p-value is one of the outcomes of a statistical test when making a comparison, say, between the conversion rates of two marketing campaigns. The p-value stands for *probability value.* It's the probability of obtaining a difference as large as the one you see in a sample (or larger) if there really isn't a difference for all customers.

Some examples of p-values are .012, .21, or .0001; a p-value of .012 indicates that the difference observed would only be seen about 1.2% of the time, if there really is no difference in the entire customer population.

Given that this is a pretty low percentage, in most cases, researchers conclude that the difference observed is not due to chance and call it statistically significant. By convention, journals and statisticians say something is statistically significant if the p-value is less than .05. There’s nothing sacred about .05, though; in applied research, the difference between .04 and .06 is usually negligible.

Statistical significance doesn’t mean practical significance. Only by considering context can you determine whether a difference is practically significant (that is, whether it requires action).

*Predicting Trends with Time Series Analysis*

A natural extension of regression analysis is time series analysis, which uses past customer data collected over regular intervals to predict future customer data on the same intervals. Time series analysis can be used to predict things like

· Subscription rates

· Train ridership

· Product sales

· Web page views

For example, requiring customers to register for updates with a website is a way to nurture lead generation. When customers provide their email addresses, they also give permission for an organization to communicate with them directly, market to them, and (attempt to) convert them into paying customers.

Figure *A-16* shows the total number of subscribers from January 2012 through February 2014 from a B2B services company website. With this data, you can use the past pattern of subscribers to predict what the future number of subscribers will be.

**Figure A-16:** Number of website subscribers for the specified time frame.

To estimate the cumulative number of subscribers in the future, follow these steps to use time series analysis in Excel:

1. **Create a line graph from the data by month and year in Excel. Insert a line graph into an Excel sheet with the data.**

2. **Add the cumulative column as the series values in the graph in the Edit Series dialog box.**

3. **To create x-axis date labels, select both the month and year columns in the Axis Labels dialog box.**

Figure *A-17* shows the cumulative number of subscribers by month and year.

**Figure A-17:** The cumulative number of subscribers by month and year.

You can see that the pattern of cumulative subscribers is generally linear (forming a line going up). By adding a regression equation, you can predict the future number of subscribers (assuming subscriber growth continues to exhibit this linear pattern).

4. **Add a regression equation:**

· Click on the data line and right-click “Add Trendline.”

· In the Format Trendline dialog box, select the Display Equation on Chart and Display R-Squared Value on Chart boxes.

Figure *A-18* shows a linear regression equation. The best fitting line does a good job of describing the relationship. This r^{2} value is .988, meaning this line explains 98.8% of the variation in subscriber rates, which is excellent.

**Figure A-18:** The linear regression line added to the graph.

The only independent variable used here is the sequence of time over 26 months (from 1 to 26). The regression equation for subscribers for the 26 months is:

Subscribers = 81.109(x) + 1896.8

You can now predict the number of subscribers for a specific month — say, May 2014, which would be the 29th data point (3 months into the future).

The estimated total number of subscribers for May is:

May Subscribers = 81.109(29) + 1896.8 = 4249
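The arithmetic above is easy to script. Here's a small sketch using the trendline coefficients from Figure *A-18*:

```python
def linear_subscribers(month):
    # linear trendline coefficients from Figure A-18:
    # subscribers = 81.109 * month + 1896.8
    return 81.109 * month + 1896.8

# May 2014 is the 29th data point in the series
print(round(linear_subscribers(29)))  # 4249
```

Any month number can be plugged in, though the further past month 26 you go, the less you should trust the extrapolation.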

Any judgment about the future is susceptible to errors. It’s important to understand the limitations of using past data to predict the future.

*Exponential (non-linear) growth*

One of the benefits of first graphing data is that you can examine the relationship to be sure a line does a good job of fitting it. Customer growth is a key metric for social media companies like Facebook, LinkedIn, and Twitter. Over short intervals (weeks and months), growth looks linear, but over longer periods of time, the growth is exponential. It will often be the case that an exponential (non-linear) equation fits your data better and will provide a better prediction.

You can see if an exponential trendline better describes the subscriber growth than a linear one:

1. **Right-click the data and choose Format Trendline.**

2. **Choose Exponential in the Format Trendline dialog box.**

An updated trendline with an exponential regression equation is shown in Figure *A-19*.

**Figure A-19:** The exponential regression equation added to the graph.

Here is the new regression equation.

Number of Subscribers = 2027.6e^{0.0273(x)}

The “e” represents a constant, approximately 2.71828, which is raised to the power of .0273 times the number of months and then multiplied by 2027.6. You can see why it's called an exponential equation, as the month is now part of an exponent. The r^{²} value is 0.9988, which is higher than the linear equation's 0.988, meaning this equation fits the data better.

The prediction for May, the 29th data point, is:

May Subscribers = 2027.6e^{0.0273 (29) }= 4475

In Excel, use the function =EXP(0.0273*29)*2027.6 to get 4475.
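The same exponential prediction can be scripted outside Excel. A sketch using the trendline coefficients from Figure *A-19*:

```python
from math import exp

def exponential_subscribers(month):
    # exponential trendline coefficients from Figure A-19:
    # subscribers = 2027.6 * e^(0.0273 * month)
    return 2027.6 * exp(0.0273 * month)

# May 2014 is the 29th data point
print(round(exponential_subscribers(29)))  # 4475
```

Note how the exponential estimate (4475) exceeds the linear one (4249): compounding growth pulls further ahead of a straight line with every additional month.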

*Training and validation periods*

A more sophisticated and often essential approach to time series analysis involves partitioning your customer data into training and validation periods. In the training period, you build a regression equation on the earliest section of data (approximately two-thirds to three-fourths of your data). You then apply the regression equation to the later part of your data in the validation period to see how well the earlier data actually predicts the later data.

With the subscriber data, you could use the first 20 months (January 2012 through August 2013) as the training period and September 2013 to February 2014 as the validation period. This approach is testing the equation using data you already have, which is as close as you can get to testing how well a prediction might perform when new data comes in.

The regression equation for the first 20 months is:

Subscribers = 2033.9e^{0.0269x}

The r^{²} = 0.9979, which shows a good fit for the exponential line. You can then use this regression equation to see how well it predicts the final six months of the dataset. The last six months are months 21 through 26. Figure *A-20* shows the predicted and actual values for September 2013 through February 2014, labeled Validation in the Period column.

To assess how accurate this prediction is, I created two additional columns. The first is the raw error: the difference between the actual number and the prediction. For example, in September 2013, the prediction was short by 5 subscribers; in February 2014, it was short by 28. Raw error like this can be meaningful on its own if you're familiar with the customer data you're working with. When communicating how much error your predicted values have, though, it's often easier to speak in terms of percentage error.

The Mean Absolute Percentage Error (MAPE) can be a bit more understandable to stakeholders. It's computed by finding the absolute value of the difference between each actual and predicted value, dividing that difference by the actual value to get the absolute percentage error, and then averaging those percentages across all values.

The APE column shows the absolute percentage error. For example, for January 2013, the regression equation predicted 2,885 subscribers; the actual number of subscribers was 2,844, meaning the equation overpredicted by 41 subscribers.

**Figure A-20:** The predicted and actual values for certain months.

Applying the Excel formula for the absolute percentage error (APE) generates an error of 1.4%:

=ABS(2885-2844)/2844 = .014 or 1.4%

The MAPE for the training period is .589%. The MAPE for the validation period is .870%, which is a bit higher, but both are still under 1%.
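The MAPE calculation can be sketched in a few lines. The first actual/predicted pair below echoes the January 2013 example above; the other two pairs are made up for illustration:

```python
def mape(actuals, predictions):
    # mean of |actual - predicted| / actual, expressed as a percentage
    errors = [abs(a - p) / a for a, p in zip(actuals, predictions)]
    return 100 * sum(errors) / len(errors)

# January 2013 pair from the text, plus two hypothetical months
actuals = [2844, 2900, 2960]
predictions = [2885, 2890, 2950]
print(round(mape(actuals, predictions), 2))  # 0.71
```

Run separately on the training and validation rows, this is exactly how the .589% and .870% figures above would be produced.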

Finally, the predictions for March, April, and May 2014 are 4,205, 4,320, and 4,437.

=EXP(0.0269*27)*2033.9= 4205

=EXP(0.0269*28)*2033.9= 4320

=EXP(0.0269*29)*2033.9= 4437

A number of more sophisticated techniques can produce more accurate models by taking into account seasonality and autocorrelation, and by smoothing the data to better reveal patterns. Software packages such as JMP and Minitab have these features built in.

Predicting the future is always risky because you’re assuming the future will have similar patterns as the past. In most cases it does and can be an excellent predictor of customer behavior. However, unusual events (outraged customers on social media, a terrorist attack, or recession) that are unpredictable can substantially affect the accuracy of your predictions. Treat predictions as a guide, not an absolute.

*Detecting Differences*

Determining if differences are statistically significant is another valuable tool for predicting with customer analytics. In fact, I cover three different applications in other parts of this book: A/B tests (see Chapter *10*), usability tests (see Chapter *14*), and findability tests (see Chapter *15*). All three of these tests use sample customer data to make inferences about a larger customer population.

The particular statistical test you use depends on the type of data you have. Because binary data is so common in customer analytics, I cover comparing two proportions (recall that a percentage is just a proportion times 100) in this section.

Common questions include

· Do higher-income customers purchase at a higher rate than lower-income customers?

· Is there a difference in buying habits between men and women?

· Does one online advertisement result in more click-through than another?

· Will a green button lead to more purchases than a blue button?

To determine if there are statistical differences between proportions, you use a contingency table, similar to the one you use when computing correlations between binary variables.

A key difference here is that you’re working with between-subjects data:

· Of the customers who considered Book A, which proportion ultimately purchased Book A?

· Of the customers who considered Book B, what proportion of customers ultimately purchased Book B?

· Is the difference in proportions from a sample of customers greater than what you’d expect from chance alone?

Even if you don’t sample your customer data, but instead compute analytics on all transactions in a given period (for example, on a website for a month), you’re interested in what will happen in the future. So you are using the current customer data as a sample of all future customer data. For that reason, you need to take into account chance variation, which is what computing statistical significance is all about.

Say you want to know if more customers are likely to click on an advertisement using one marketing message (Ad A) or are more likely to click through using an alternate message (Ad B). Imagine 435 customers were randomly served either Advertisement A or Advertisement B on a web page during a week. If 18 out of 220 clicked through on Ad A (8%) and 6 out of 215 clicked through on Ad B (3%), is there enough evidence that more future customers will click on Ad A over Ad B? There is a 5 percentage point difference in click-through rates, but is this difference just random variation or does it represent a real difference in the effectiveness of ads?

The type of statistical test used for detecting statistical differences between binary variables is a Chi-Square test. It is represented with the Greek letter chi, squared: *X*^{2}. Follow these steps to conduct a Chi-Square test:

1. **Set up the data in a 2 x 2 contingency table, like this:**

Rows for each ad, columns for clicked versus didn't click: Ad A had 18 clicks and 202 non-clicks (220 total); Ad B had 6 clicks and 209 non-clicks (215 total).

2. **Apply the** *X*^{2} **formula (which works for small and large sample sizes):**

*X*^{2} = N(ad - bc)^{2} / [(a + b)(c + d)(a + c)(b + d)]

Here, a, b, c, and d are the four cell counts (a = 18, b = 202, c = 6, d = 209) and N = 435 is the total sample size.

3. **Fill in the values in the formula to get the Chi-Square statistic.**

4. **Apply the Excel formula and obtain a p-value.**

In this case, you use the Chi-Square statistic of 6.05:

=CHIDIST(6.05, 1) = .014

A p-value this low indicates that a difference of 5 percentage points or greater would only happen about 1.4 times in 100, if there really was no difference. A difference this large is probably not just random noise (see the earlier sidebar on statistical significance and p-values). So Advertisement A would likely get more customers to click through and would be a more effective ad to implement going forward.
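The whole calculation can also be scripted. Here is a minimal sketch using only the Python standard library; for 1 degree of freedom, the Chi-Square p-value has a closed form via the complementary error function:

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    # a, b = clicks / non-clicks for Ad A; c, d = clicks / non-clicks for Ad B
    n = a + b + c + d
    x2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # for 1 degree of freedom, the upper-tail p-value is erfc(sqrt(x2 / 2))
    p = erfc(sqrt(x2 / 2))
    return x2, p

x2, p = chi_square_2x2(18, 202, 6, 209)
print(round(x2, 2), round(p, 3))  # roughly 6.06 and 0.014
```

The statistic and p-value agree with the Excel result above (small differences in the second decimal of the statistic come from rounding along the way).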