
Freemium Economics: Leveraging Analytics and User Segmentation to Drive Revenue (2014)

Chapter 3. Quantitative Methods for Product Management

This chapter, “Quantitative Methods for Product Management,” presents a framework and toolset for quantitatively evaluating product performance. The success of freemium products is highly dependent on scale, and with increases in scale come increases in data volume. Analyzing and parsing that data requires competence in the statistical and quantitative methods presented in this chapter. The chapter begins with an introduction to descriptive statistics, which is a basic set of standard statistical metrics that depict the character of a data set. The chapter proceeds into an overview of A/B testing, which is the process whereby two alternatives are considered simultaneously to determine which is the best for a particular use. After this, regression methods, which use historical data to attempt to extrapolate new values from the data, are discussed at length within the context of developing freemium products: using regression to predict future behavior, user base size, revenue, etc. The chapter concludes with a discussion of user segmentation, which involves categorizing users based on various sets of characteristics in order to analyze the product at a granular level.

Keywords

regression; linear regression; descriptive statistics; logistic regression; predictive models; product management; A/B testing; user segmentation; regression in product management; exploratory data analysis

Data analysis

The large volumes of data generated by freemium products can’t be used in a product development feedback loop without being analyzed; analysis is the medium through which data informs iterations. And while many products are attended to by well-trained and highly skilled analysts whose sole job it is to parse insight out of data sets, product teams should understand the basic concepts of data analysis in order to effectively communicate in the vernacular of data.

Data analysis is usually discussed in terms of variables, or dimensions by which data is collected and calculated. Variables fall into two broad categories: independent variables, which are observed as model inputs and, for the sake of analysis, are considered to be unaffected by other variables in a model, and dependent variables, which are calculated as model outputs and derived with respect to a potential relationship with independent variables.

A variable can exist as one of a number of different data types, depending on the range of values the variable encompasses. Binary variables take the form of one of two possible values: most commonly, true (1) and false (0). Categorical variables take the form of one of a limited number of specific values; country of origin, day of week, and month of year are examples. Count variables are non-negative numbers that span the scale from 0 to infinity and represent the frequency with which a specific value exists over an arbitrarily defined interval. Finally, ordinal variables are numeric variables used to provide rank to a set of values; the value of the variable itself is generally irrelevant and is used solely for the sake of sorting objects.

In order to glean meaningful insight from a data set, a basic set of characteristics that describe that data set must be known. These characteristics are known as descriptive statistics—summary attributes about a sample of data that don’t rely on any probabilistic methods to draw conclusions about the broader population from which the sample came. Descriptive statistics of user behavior help a product team understand, at the highest conceptual level, fundamental truths about the way users interact with a product.

Descriptive statistics

A small sample data set is shown in Figure 3.1. The data set is depicted with a horizontal bar graph; below the graph is a table containing the set’s data points (with x values on the top and y values on the bottom). The data set’s basic descriptive statistics are below the data table.


FIGURE 3.1 A five-point sample data set with descriptive statistics.

The word “sample” is used deliberately to describe a subset of a larger data set because the data was sampled, or selected, from a larger source of data. When a data set represents a sample of the broader scope of data (the population), the size of the set is designated by n; when a data set represents the entire population, the size of the set is represented by N.

Given the degree to which freemium data is collected, sampling is often necessary; conducting basic analyses on data sets consisting of millions or tens of millions of data points requires significant infrastructure overhead and reduces the speed with which conclusions can be drawn. The most common method of selecting a sample from a larger population is known as simple random sampling, and it is undertaken by defining a size for the sample and then selecting data points from the population at random, with no given data point in the population having a greater probability of being selected than any other.
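
As a brief illustration of simple random sampling in practice, the following sketch (written in Python, with a synthetically generated population of first-session lengths standing in for real product data, and a sample size chosen arbitrarily) draws an equal-probability sample without replacement:

```python
import random

random.seed(42)

# Hypothetical population: first-session lengths (in seconds) for 1,000,000 users.
population = [random.expovariate(1 / 90) for _ in range(1_000_000)]

# Simple random sampling: every data point has the same probability of being
# selected; random.sample draws without replacement.
n = 2_500  # desired sample size
sample = random.sample(population, n)

print(len(sample), sum(sample) / len(sample))
```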

The most common descriptive statistic is the mean, or average (usually represented as µ when describing a population and x̄ when describing a sample). Note that, while in common usage the terms “mean” and “average” almost always refer to the arithmetic mean, other flavors of the mean statistic, such as the geometric and harmonic means, do exist. For the sake of simplicity, the term “mean” in this text always refers to the arithmetic mean, which describes the central tendency of a group of numbers.

The mean of a data set is calculated by summing the values of the data points and then dividing that sum by the number of data points in the set. The calculation of the mean from the data set in Figure 3.1 is defined in Figure 3.2. The mean is a useful statistic when comparing trends between groups such as mean revenue or mean session length. But the mean is susceptible to influence from anomalous data points; in statistical parlance, these are called outliers. Outliers are data points that represent extremes on a distribution of expected values.


FIGURE 3.2 The calculation of the mean of the sample data set.

The median can be used to contextualize the mean and act as a safeguard against drawing spurious conclusions against a data set populated with extreme outliers. The median of a data set is the value that separates the bottom half of the set from the top half when the set is sorted by value. In other words, the median is the middle value. When a data set contains an even number of data points, the median is the mean of the middlemost two values.

The median provides depth to the mean statistic by highlighting influence from outliers; the larger the difference between the mean and the median, the greater the influence on the mean from outlier values. Figure 3.3 illustrates the process of identifying the median value from the example data set defined in Figure 3.1.


FIGURE 3.3 Identifying the median value from the sample data set in Figure 3.1.

One example of the median’s usefulness in making product decisions is the evaluation of user session lengths: a very large difference between the median session length and the mean session length, where the mean is greater than the median, might point to the presence of a few very long sessions relative to a majority of much shorter sessions. In most cases the median and mean should be presented together to add clarity to the data set.

The mode and range of a data set describe the set’s numerical properties. Mode is the value occurring most often in the data set, and range is the difference between the highest and lowest values in a data set. While these descriptive statistics aren’t often used in product management analysis, they can be helpful in gaining a deeper understanding of the structure of a data set.

A data set’s variance (usually represented as σ² when describing a population and s² when describing a sample) is a measure of its “spread,” or the degree to which its data points differ in value from the mean. The variance of a sample is measured by summing the squared distances of the data points from the sample mean and dividing that sum by one less than the number of values in the sample (producing what is called the unbiased sample variance). A high level of variance within a data set means that values don’t generally cluster around the mean but rather fall across a wide range of values. In such a case, the mean value doesn’t provide much help in terms of predicting future or unknown values based on the properties of the data set. The equation used to calculate variance from the data set in Figure 3.1 is defined in Figure 3.4.


FIGURE 3.4 The calculation of variance from the sample data set in Figure 3.1.

A data set’s standard deviation is the square root of its variance (and is thus represented as σ when describing a population and s when describing a sample). Standard deviation is a helpful descriptive statistic because it communicates the same general concept as variance (average dispersion from the mean value, or “spread”) but is expressed in the same unit of measurement as the data set itself (variance is expressed in squared units). Standard deviation can therefore be read directly as a typical distance from the mean.
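
The descriptive statistics discussed above can be computed directly with standard tooling. The sketch below uses Python’s statistics module on a hypothetical five-point sample; the values shown are illustrative and are not the values from Figure 3.1:

```python
import statistics

# Hypothetical five-point sample (the actual values in Figure 3.1 are not reproduced here).
sample = [4, 8, 8, 15, 23]

mean = statistics.mean(sample)          # arithmetic mean
median = statistics.median(sample)      # middle value of the sorted sample
mode = statistics.mode(sample)          # most frequently occurring value
rng = max(sample) - min(sample)         # range: highest value minus lowest value
variance = statistics.variance(sample)  # unbiased sample variance (divides by n - 1)
stdev = statistics.stdev(sample)        # standard deviation: square root of the variance

print(mean, median, mode, rng, variance, stdev)
```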

A common way to distribute descriptive statistics is with a key performance indicator (KPI) report, which may take the form of a dashboard or a regularly distributed spreadsheet file. The KPI report contains a set of descriptive statistics that are usually disaggregated by filters such as demographic or behavioral characteristics and presented as a time series.

Insight from descriptive statistics is usually delivered as change over time; for instance, it is hard to take action on knowledge of a specific value for the running total average session length for a specific product, but it is instructive to know that the average session length dropped by a specific value over the past week. For that reason, descriptive statistics in KPI reports usually take the form of bar charts or line charts, with the x-axis representing dates and the y-axis representing values on those dates.

Descriptive statistics can be calculated on any sample size at any level of granularity. Generally speaking, descriptive statistics benefit from specificity: broad-stroke characteristics of an entire population don’t provide much depth to product design decisions at the feature level. Some filtering is usually necessary before a given descriptive statistic can be used in the design process; this is especially true for descriptive statistics describing revenue metrics for freemium products. Given the 5% rule, most descriptive statistics about revenue—especially averages over the entire population—are useless unless segmented by behavior.

Removing outliers from a data set before calculating descriptive statistics is a controversial practice in product management for a number of reasons, the most important of which is that outliers represent the behaviors of real people and aren’t necessarily statistical noise. Extreme values, especially in revenue and engagement metrics, should be taken into account when evaluating the performance of a product feature. In freemium analyses, outliers generally represent the behaviors that a product team endeavors to optimize for: outliers in terms of spending, engagement, and virality are often the lifeblood of a freemium product. Removing those values to deliver a clearer mean value for a metric might remove a valuable data point that could better inform a decision.

There is another reason the practice of removing outliers from data is controversial—it generally leads to misrepresentation of a data set. The manipulation of a data set prior to calculating metrics from it generally comes as a surprise, even when the manipulation is clearly stated in a report. Providing broad context through a collection of descriptive statistics is more sensible than changing a data set. The insight gleaned from differences in descriptive statistics—such as the spread between median and mean—is a form of metadata that can be as helpful in facilitating decision-making as the descriptive statistics themselves.

Exploratory data analysis

Exploratory data analysis refers to an ex ante investigation of a data set that is conducted to build informal assumptions and hypotheses about the data set before undertaking more rigorous analysis. Exploratory data analysis is an important component of the develop-release-measure-iterate feedback loop; it allows for drawing general conclusions without committing to exhaustive analysis. It also helps to provide structure and focus to subsequent analyses so that effort is expended only on the most fruitful analytical pursuits. Exploratory data analysis serves as an introduction to a data set and is an important part of the resource allocation strategy in freemium product development.

When exploratory data analysis is informing the product development cycle, it generally takes one of two forms: as an initial interpretation of data generated by a newly launched feature to gauge the feature’s effectiveness, or as the precursor to a larger analysis when considering the development of a new feature. Both of these points in the product’s life cycle are inflection points at which performance dictates the path forward. If a new product feature isn’t meeting expectations, it should be either iterated upon or shut down. If a new product feature concept is being vetted, the decision must be made as to whether it should be built. In either case, exploratory data analysis is not meant to fully explore an issue but rather to provide some context so that further analysis is focused specifically on the trends that will facilitate a decision.

Exploratory data analysis is not exhaustive; it is usually conducted with desktop spreadsheet software and consists primarily of high-level descriptive statistics and visuals. Since the point of exploratory data analysis is to provide guidelines for future analysis, it should paint in broad strokes a picture of the product being analyzed and focus on the key metrics (such as revenue and engagement) that evaluate the feature’s success.

In exploratory data analysis, descriptive statistics should inspect the data set from as many perspectives as possible. Aggregation by relevant demographic and behavioral attributes, such as user location or total revenue generated, can provide insight into how successful a product feature is among broadly defined user segments, while descriptive statistics help set the tone for future analysis.

Mean and median metric values may not provide adequate context for making optimization decisions, but at the product cycle inflection points—build or don’t build, iterate or shut down—these statistics provide a starting point for framing a discussion. Exploratory data analysis doesn’t necessarily need to address specific questions; that can be done in further stages of analysis. Exploratory analysis is meant for gaining familiarity with a data set so that the most productive analysis themes can be pursued first.

Probability distributions

A probability distribution ascribes a probability to each value in a range of potential outcomes from a random experiment. It is usually represented as a graph, with the potential outcome values of the experiment depicted on the x-axis and the probabilities of those specific values occurring depicted on the y-axis.

Probability distributions take two forms, depending on whether the random variable under consideration is continuous or discrete. A continuous random variable is a variable for which values are not countable, or for which an infinite number of partial values can exist (e.g., the length of a user’s first session, given that time can be measured to a theoretically infinite number of decimal places). A discrete random variable is a variable for which values are countable, or for which values must be whole numbers (e.g., the number of friends a user invites to the product in the user’s first session; friends can be invited in whole numbers only).

A function describing the probabilities at various values of a continuous random variable is called a probability density function, although the probability for a specific point on the density function is never referenced. This is because the probability of an experiment resulting in any one specific value, given an infinitely large number of potential result values, is 0. To illustrate this paradox, consider the probability that the length of a user’s first session will be exactly 60.12345 seconds; given an infinite number of decimal places by which to measure seconds, it must be 0.

Instead, probabilities for continuous random variables are always discussed relative to ranges of values, for example, “The probability that the length of a user’s first session falls between 60 seconds and 120 seconds is .80.” The area under the curve formed by a probability density function always sums to one, representing a 100 percent probability that the result of the experiment will be of some value within that range.

For discrete random variables, a probability mass function is used in lieu of a probability density function. The probability mass function represents the probability that a given value included in the discrete set of possible values is the result of an experiment. Probabilities that correspond to precise points on a probability mass function curve can be determined because a discrete random variable can take only a countable number of values, for example, “The probability that a user will invite five friends into the product in the user’s first session is .10.” Similar to a probability density function, the values of all discrete probabilities on a probability mass function sum to 1.

The concept of the probability distribution is important to understand when undertaking any form of analysis; any sample data set’s underlying probability distribution affects the applicability of assumptions drawn from it. The most widely recognized probability distribution is the Gaussian, or normal, distribution. This is the standard “bell curve,” which is symmetrical and densest around the mean, as shown in Figure 3.5. One useful property of the bell curve is that it abides by what is known as the 68-95-99.7 (or three-sigma) rule: 68 percent of the distribution lies within one standard deviation of the mean, 95 percent lies within two standard deviations, and 99.7 percent (or nearly all of it) lies within three standard deviations of the mean.


FIGURE 3.5 A bell curve with µ = 1 and σ² = 1.
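
The 68-95-99.7 rule can be confirmed numerically by evaluating the normal cumulative distribution function at one, two, and three standard deviations from the mean; the sketch below assumes the SciPy library is available:

```python
from scipy.stats import norm

# Probability mass of a normal distribution lying within k standard deviations
# of the mean: CDF(k) - CDF(-k) for the standard normal.
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} standard deviation(s): {coverage:.4f}")
# Prints approximately 0.6827, 0.9545, and 0.9973.
```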

One common concern when interpreting a data set is skew, the effect of a cluster of values on the symmetry of the probability distribution relative to the mean. In a normal distribution, the probability distribution function is symmetrical around the mean; in a skewed probability distribution function, a mass of values on one side of the mean shifts the distribution in that direction, leaving a long, thin tail on the other end. Negative skew exists when this mass sits on the right side of the mean, shifting the bulk of the probability distribution curve to the right (i.e., to higher values) and creating a thin tail on the left (i.e., toward lower values); positive skew exists when the opposite is true. (See Figure 3.6.) Positive skew generally shifts the mean above the median, and negative skew generally shifts the mean below the median, although this isn’t always true.


FIGURE 3.6 Two skewed probability distributions.

While the normal distribution describes many natural phenomena in fields such as biology and demography, it isn’t broadly applicable in product management. A far more common distribution for freemium data is the Pareto distribution, a power law probability distribution often used in the social sciences and finance. The Pareto distribution describes a system in which the majority of values cluster at the lowest end of the distribution, with a long tail accommodating extreme values at the highest end of the scale, as shown in Figure 3.7. The Pareto distribution is often used in the insurance industry to calculate premiums for incidents in which damages would be catastrophic but the probability of occurrence, while low, is non-zero.


FIGURE 3.7 The probability density function for a Pareto distribution with shape parameter a=5 and scale parameter b=1.

The Pareto distribution describes many aspects of the freemium model well because the model is viable only under the premise of the existence of extreme values (in terms of revenue, user base size, user lifetime, etc.). The Pareto distribution provides a more appropriate framework for making freemium product decisions than the normal distribution does; behavioral data rarely cluster around mean values, and outliers generally represent desirable behavior.

Freemium products should be built to entice behavior that falls on a Pareto distribution and to exaggerate to the greatest degree possible the low-probability, long-tail events. When a massive user base is driven through a product producing Pareto-distributed behavioral data, the long-tail events can occur with enough frequency to meaningfully influence the metrics they are designed to impact.

The utility of knowing a data set’s probability distribution within the context of freemium product management is that user behavior won’t always fit the normal distribution—in fact, it very rarely does—and thus assumptions about a data set’s distribution can lead to reports containing inaccurate descriptive statistics for a product feature. For instance, when the vast majority of users spend nothing, the mean for revenue data can often sit above the median in a Pareto model with a long right tail. While it almost always makes sense to report a mean value, that value must be contextualized with other descriptive statistics relevant to the distribution, such as standard deviation and variance. For data to drive beneficial product decisions, that data must be capably interpreted; in the freemium model, evaluating a data set within the context of its probability distribution is a cornerstone of prudent data analysis.
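
To illustrate the gap between mean and median under a Pareto distribution, the sketch below draws a synthetic sample using the same shape and scale parameters as Figure 3.7 (a = 5, b = 1); the sample size is arbitrary, and NumPy is assumed to be available:

```python
import numpy as np

rng = np.random.default_rng(7)

# Draw from a Pareto distribution with shape a=5 and scale b=1 (the parameters
# used in Figure 3.7). numpy's pareto() returns a Lomax sample, so adding 1 and
# multiplying by the scale yields the classical Pareto distribution.
a, b = 5.0, 1.0
sample = (rng.pareto(a, size=100_000) + 1) * b

print("mean:  ", sample.mean())      # roughly a*b/(a-1) = 1.25
print("median:", np.median(sample))  # roughly b * 2**(1/a), about 1.15
# The mean sits above the median: the long right tail pulls the mean upward.
```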

Basic data visuals

When exploring a data set for the first time, creating visual depictions of various aspects of the data may provide a greater depth of understanding of the underlying dynamics of the system than merely computing descriptive statistics. Most statistical packages have simple commands that allow for quickly rendered visuals from a data set; creating visuals of a data set’s elements is therefore a common component of early exploratory analysis.

Charting data is an easy way to spot blatant trends or relationships. The type of chart used to depict the data depends on the types of dependent and independent variables. When the independent variable is categorical—that is, it is represented by a limited range of discrete values—then bar charts are often the best way of describing the data set. An example of a data set conducive to a bar chart, depicted in Figure 3.8, is a count of individuals by geography: the x-axis represents a selection of countries, and the y-axis represents counts of users in those countries. The height of each bar is determined by the proportion of the group’s value to the scale of the y-axis.


FIGURE 3.8 A bar chart depicting counts of users by country.

When a data set’s independent variable is represented by progression values—values that increase or decrease by a regular interval, such as time—then it is best illustrated with a line chart, with the dependent variable plotted on the y-axis. An example is a time series, or a set of data points measured at regular intervals across a period of time. A common time series used in freemium analytics is the count of users measured by day, where the x-axis tracks days and the y-axis tracks user counts. This example is depicted in Figure 3.9.


FIGURE 3.9 A time series of user counts measured by day.

When both the independent and dependent variables of a data set are continuous—that is, when the values of both variables are numeric and fall across a continuous range—then the data can be instructively depicted with a scatter plot. A scatter plot places a point on the graph for each point in the data set with the x-axis labeled as the independent variable and the y-axis labeled as the dependent variable.

Scatter plots are useful in identifying clusters of commonality and trends in relationships. A common example of two freemium variables well-suited to depiction in a scatter plot is the total amount of revenue a user spends (the dependent variable) and the length of the user’s first session (the independent variable). This example is depicted in Figure 3.10.


FIGURE 3.10 A scatter plot capturing total lifetime revenue spent by length of the first session.

A histogram is a type of bar chart that plots the frequency of values for a continuous variable and can be rendered to visually approximate its probability density function. Generally, when a histogram is invoked by a statistical package, the program determines the appropriate size of each range of values for the independent variable over which frequency counts will be calculated; these ranges of values are known as bins. Put another way, a bin establishes a range of values for the independent variable, and the frequencies (or counts) of values within each bin are calculated as values for a dependent variable. Bins are generally grouped in equivalent ranges; that is, the range of each bin is the same size. An example of a histogram for a freemium product is the count of users by the range of the first session length, as illustrated in Figure 3.11.


FIGURE 3.11 A histogram of user counts by first session length.

The shape of a histogram provides basic insight into a data set’s probability distribution. A cursory glance at Figure 3.11 reveals that the data is not normally distributed; rather, it follows a negative exponential distribution, with the value at each bin decreasing non-linearly from the bin before it.
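
As an illustration of binning, the sketch below builds histogram counts for a synthetic set of first-session lengths. The exponential shape of the synthetic data and the choice of ten equally sized bins are assumptions made for the example, not values taken from Figure 3.11:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical first-session lengths (seconds); an exponential shape is assumed
# here only to mirror the pattern described for Figure 3.11.
session_lengths = rng.exponential(scale=90.0, size=10_000)

# Ten equally sized bins spanning the observed range; the count per bin is the
# dependent variable plotted as bar heights in a histogram.
counts, bin_edges = np.histogram(session_lengths, bins=10)
for count, lo, hi in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{lo:7.1f} - {hi:7.1f} sec: {count}")
```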

A box plot (sometimes called a box-and-whisker plot) is a graph used to illustrate the dispersion, or “spread,” of a one-variable data set. A box plot draws a box bounded by the data set’s lower and upper quartiles, with a line inside the box marking the median; additionally, the plot graphs the minimum and maximum values for the entire data set. In order to graph a box plot, five metrics must be calculated: the lower and upper quartiles (the medians of the lower and upper halves of the data set), the median of the entire data set, and the minimum and maximum values of the data set. These are calculated for a sample data set of users’ first session lengths and shown in Figure 3.12.


FIGURE 3.12 The calculated components for a box plot from a sample data set of user first session lengths.

In order to calculate the median and quartiles, the data set must first be arranged in numerical order. Once sorted, the upper quartile is calculated as the median of the higher half of the range, and the lower quartile is calculated as the median of the lower half. The median of the data set in Figure 3.12 is 66, or the midpoint of the third (60) and fourth (72) sequential values. Once these values are calculated, they can be plotted as a box spanning the lower and upper quartiles, with a line marking the overall median and “whiskers” extending to the minimum and maximum values. A box plot for the data in Figure 3.12 is depicted in Figure 3.13.


FIGURE 3.13 A box plot for a sample data set of first session lengths.
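
The five box plot components can be computed as in the sketch below. The six session lengths used here are hypothetical, chosen only so that the middle two values (60 and 72) match the median of 66 described above; the full data set behind Figure 3.12 is not reproduced:

```python
import statistics

# Hypothetical first-session lengths (seconds); only the middle two values (60
# and 72) come from the text's description of Figure 3.12.
lengths = sorted([31, 48, 60, 72, 95, 140])

n = len(lengths)
lower_half = lengths[: n // 2]
upper_half = lengths[(n + 1) // 2:]

box_plot_components = {
    "minimum": min(lengths),
    "lower quartile": statistics.median(lower_half),  # median of the lower half
    "median": statistics.median(lengths),              # 66 for this data set
    "upper quartile": statistics.median(upper_half),   # median of the higher half
    "maximum": max(lengths),
}
print(box_plot_components)
```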

Data visuals are not themselves analyses; rather, they should be used to develop a basic understanding of a data set’s structure in order to inform the analysis that follows. Visuals may provide better guidance than numerical descriptive statistics do for the transformations that must be executed on a data set before proper analysis can be undertaken, especially as related to a data set’s distribution. Likewise, visuals can highlight relationships between two variables that would otherwise require iterative guess-and-check testing to spot numerically. For these reasons, the construction of visuals is often the starting point of an analysis.

Confidence intervals

The descriptive statistics introduced thus far are used to describe a sample, or a subset of a larger data set collected from the entire population. This is often done to avoid engaging with the processes needed to parse very large freemium data sets (which often add a significant layer of complexity to the analysis process, delaying the delivery of results). But the values of descriptive statistics derived from sample data sets must be tempered with some measurement of how well they reflect the properties of the entire population; without knowing how representative of the population a sample is, conclusions drawn from analyses conducted on sample data sets can’t be applied to the product without imposing significant risk.

In order to contextualize a descriptive statistic (or any other statistic, usually referred to in this context as a parameter) derived from a sample data set, a range of values can be provided that is likely to include the population parameter. This range is called the confidence interval, denoted by CI; a confidence interval is always given with the probability that the interval contains the true value of the parameter from the population. This probability is called the confidence level and is denoted by C. For instance, the mean value of a sample parameter, such as mean first session length, can be more instructive when communicated alongside a range of values where the broader population’s mean first session length is likely to sit. This is the value of the confidence interval: it describes how well the mean first session length per user, as measured from the sample, matches the actual mean first session length per user for the broader population of users.

The use of the word “confidence” in naming this convention is a bit peculiar; the word is not used in other areas of the statistical discipline. This is because the confidence interval concept comes from the frequentist school of statistics, which stipulates that probability be employed to describe the methods used to build samples from populations and not to describe the experimenter’s sense of certainty for a result, which would be subjective. The confidence interval, then, is a means of reconciling the need to attach a level of probability to a statistic derived from a sample data set and an adherence to reporting only objective metrics.

A confidence interval contains four elements: the two values that define the range within which the population parameter is likely to be found (L and H), the population parameter (θ), and the attendant level of confidence for the interval (C). These elements are combined to form the expression shown in Figure 3.14.


FIGURE 3.14 The general expression for a confidence interval.

For frequentist statisticians, this expression is valid in the abstract. The expression becomes invalid, however, when real values are inserted into the expression because, in frequentist statistics, probabilities are objective, and thus there is no need to attach a probability to a value existing between a range of two other values; the statement is either objectively true (the population parameter falls between the two endpoints) or not true (the population parameter does not fall between the two endpoints).

Thus, the interpretation of the confidence interval expression has nothing to do with the population parameter; rather, the probabilistic sampling method (SM) by which the sample is drawn is what produces the interval bounded by L and H. Thus, when a confidence interval is derived and values are placed into the expression from Figure 3.14, the correct interpretation of the expression is, “The true value of the population parameter θ will sit between L and H for C percent of the time that a sample is taken using sampling method SM.” As stated earlier, a common sampling method in freemium data analysis is the simple random sampling approach.

Confidence intervals are generally constructed with standard levels of confidence of 90 percent, 95 percent, and 99 percent, with higher levels of confidence corresponding to wider intervals. The level of confidence of the interval relates to a range across the normal distribution within which one can expect to find the true value of the population parameter. Because the normal distribution is symmetric, the area of the curve considered is centered around the mean.

Following from the earlier discussion of normal distribution, estimating a range of probabilities for a given distribution requires knowing the distribution’s mean and its standard deviation. But the calculation of a confidence interval assumes only knowledge of the sample, not the population. However, these components of the population’s distribution can be estimated using what is known as the sampling distribution of sample means.

The sampling distribution of sample means is a theoretical distribution of the means of all possible samples that could have been taken from a population. The mean of the sampling distribution of sample means can be substituted for the unknown mean of the population as an application of the central limit theorem, which states that the means of sufficiently large samples taken from a population will be approximately normally distributed around the population mean.

The standard deviation of the population can be approximated by the standard deviation of the sampling distribution of sample means—known as the standard error (SE)—which is calculated by dividing the sample’s standard deviation by the square root of the sample size, as defined in Figure 3.15.


FIGURE 3.15 Standard error calculation.

A z-score (represented by Z) corresponding to the required level of confidence must also be known; z-scores represent the number of signed (positive or negative) standard deviations at which a value sits from the mean on a normal distribution. The z-scores for levels of confidence of 99 percent, 95 percent, and 90 percent are listed in Figure 3.16. (These z-scores follow from the same property of the normal distribution that underlies the 68-95-99.7 rule described earlier.)


FIGURE 3.16 A table of selected z-scores.

Given that a z-score describes the distance from a specific value to the mean, the z-score for a given level of confidence can be multiplied by the standard error of the sampling distribution to determine the endpoints of the interval. This logical reversal holds because of the properties of the normal distribution, which the sampling distribution of sample means approximately follows: if 95 percent of sample means lie within 1.96 standard errors of the true population mean, then it follows that the true population mean has a 95 percent chance of being within 1.96 standard errors of any given sample mean. This is illustrated in Figure 3.17.


FIGURE 3.17 The sample distribution of sample means, with 95 percent (1.96 standard errors) of the distribution shaded in around the mean.

Calculating a confidence interval for a given level of confidence simply involves multiplying the z-score by the standard error of the sample distribution. Take, for example, an analysis of first session lengths (assuming first session lengths are normally distributed) on a sample of 2,500 users. The average first session length of the sample is 100 seconds with a standard deviation of 25 seconds. The endpoints of the confidence interval for the population mean can be calculated as [99.02, 100.98] with a 95 percent confidence level (corresponding to a z-score of 1.96), as shown in Figure 3.18.


FIGURE 3.18 The calculation of a confidence interval at a 95 percent level of confidence.

Note that decreasing the confidence level decreases the range of values produced for the confidence interval. Using the same example, calculating a confidence interval for a confidence level of 90 percent produces endpoints of [99.17, 100.82], as illustrated in Figure 3.19.


FIGURE 3.19 The calculation of a confidence interval at a 90 percent level of confidence.

The results of the second confidence interval could be interpreted as “The true population mean of first session lengths will sit between 99.17 seconds and 100.82 seconds 90 percent of the time that a sample is taken when using the same sampling method as the one used in this experiment.”
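
The worked example above can be reproduced in a few lines; the sample statistics are those given in the text, and the z-scores are the values commonly used for the standard confidence levels:

```python
import math

# Worked example from the text: a sample of 2,500 first-session lengths with a
# mean of 100 seconds and a standard deviation of 25 seconds.
n, sample_mean, sample_sd = 2500, 100.0, 25.0

standard_error = sample_sd / math.sqrt(n)  # 25 / 50 = 0.5 seconds

# z-scores commonly used for the standard confidence levels.
z_scores = {"99%": 2.576, "95%": 1.96, "90%": 1.645}

for level, z in z_scores.items():
    margin = z * standard_error
    print(f"{level} CI: [{sample_mean - margin:.2f}, {sample_mean + margin:.2f}]")
# 95% -> [99.02, 100.98]; 90% -> roughly [99.18, 100.82], matching Figures 3.18
# and 3.19 up to rounding.
```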

A/B testing

The term behavioral data is used specifically when describing data-driven design because it adds nuance to the concept of user data, which can be gleaned qualitatively. Behavioral data is objective: it represents users’ reactions to various feature implementations, ranging from core mechanics to design. While qualitative user data is valuable, behavioral data is better suited to the MVP development mindset; quantitative, objective data points allow a product team to easily make a binary decision about retaining a specific feature.

What is an A/B test?

Behavioral data is collected and most effectively interpreted via tests—limited feature implementations that produce data points about how users interact with those features. For the freemium model, and for most digital consumer products acquired via the Internet, the most popular form of testing is called the A/B test. The term “A/B test” is so named because it represents making a decision between two variants, A and B. The process involves selecting a group of users, based on a set of criteria, splitting them into sub-groups, and then simultaneously exposing each group to a different variant of a feature.

The number of groups in an A/B test isn’t necessarily restricted to two; any number of variants can be tested. An example of an A/B/C test would be a test where data is collected from three variants, although tests composed of more than two variants are more popularly called multivariate tests. For the sake of simplicity, A/B testing as discussed in this book refers to a test comparing two groups.

The key to performing an A/B test and generating an actionable result is gathering enough data to be confident that the differences in the performances of the groups are due to the differences in the variants and not due to chance. This confidence is known as statistical significance. The larger the number of people in each group, the easier statistical significance is to determine, which is why testing more than two variants is reasonable only when the group sizes are very large.

An alternative to simultaneous A/B testing is A/B testing over time—that is, comparing the effects of different features in successive implementations and not concurrently. In general, simultaneous A/B testing is preferable to A/B testing conducted over time, because it avoids cyclicality and exogenous factors that can’t be controlled. For instance, measuring one group in December and another in January for a feature that impacts revenue might skew the data in favor of December, owing almost completely to holiday shopping and not to a difference in the feature variants.

Time-based testing may be the only option for feature testing when just a small number of users can be reached at once. Since A/B tests become more reliable at larger sample sizes, simultaneous A/B testing results can be less actionable than those from time-based tests when sample sizes are restricted. The type of test chosen depends on data availability and the impact of time-based effects on the behavior of users, but generally speaking, A/B testing over time should be undertaken only when simultaneous A/B testing is not viable.

To produce actionable, interpretable results, an A/B test should be isolated; that is, it shouldn’t be run in conjunction with another A/B test. Mixing A/B tests can muddy the effects of each test, rendering them difficult to act on. The benefit of running different A/B tests sequentially and not in unison is that each test receives a maximum proportion of the total available user base; the downside of testing sequentially is that it requires an extended period of time over which to execute all tests. But this isn’t necessarily a bad thing; as a product evolves based on testing, certain product features in the development pipeline may become irrelevant. Sequential testing provides for a more flexible and reactive—as opposed to deterministic—product backlog.

Ultimately, A/B testing should provide a clear result of the supremacy of one feature implementation over another. The granularity of feature differences isn’t prescribed as a rule, but when feature variants differ significantly from one another, it becomes more difficult to ascribe the difference in user behavior to a specific aspect of the better-performing feature. Ideally, the feature variants compared in an A/B test would be identical save for one key difference.

Designing an A/B test

Designing an A/B test and an A/B testing schedule requires forethought and consideration. As discussed earlier, A/B tests shouldn’t conflict; users should see only one A/B test in a given session and, preferably, with enough infrequency so as to produce credibly independent results.

A filter should determine which users are exposed to an A/B test; for instance, a change to user registration flow necessarily involves only new users, whereas a change to a feature that only long-time users find relevant should be restricted to that user group. A/B tests must also be designed with consideration given to sample size. But demographics and past behavior may influence which users a given feature variant should involve; for instance, testing the effects of advertising positioning on users’ propensities to click on those ads might be appropriate only for users who have not previously made direct purchases, limiting the total number of users available for the test. In such a case, consideration must be made for further restrictions on sample size; excluding users from a certain geography or age bracket might reduce the sample size to an unacceptable level. As the sample size of a test decreases, its results become less clear.

The logic that selects which users are exposed to a given test must be implementable in the product; this means that a mechanism for A/B testing needs to be developed and integrated into the product. Depending on how robust the product’s framework for grouping users is, the logic might be limited to specific predefined attributes, which could ultimately determine which A/B tests are feasible. In general, A/B testing infrastructure is complicated to implement and requires extensive upkeep as products develop; more often than not, the limitations of the A/B testing mechanism define the parameters of an A/B test, rather than the other way around.
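
One way the selection-and-splitting logic might be implemented is to hash a stable user identifier into a variant bucket so that a given user always sees the same variant across sessions. The sketch below is illustrative only; the test name, user IDs, and eligibility filter are hypothetical:

```python
import hashlib

def assign_variant(user_id: str, test_name: str, variants=("A", "B")) -> str:
    """Deterministically map a user to a variant so repeat sessions see the same one."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# Hypothetical filter: only users who have never made a purchase enter the test.
eligible_users = ["user-1001", "user-1002", "user-1003"]
for uid in eligible_users:
    print(uid, assign_variant(uid, "registration_background_color"))
```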

The length of time over which an A/B test runs will contribute to the overall sample size of the test, given that a longer test will apply to more users. The sample size needed to produce confidence in the test results isn’t set in stone; sample size depends on the difference between the results of each variant. A large difference in variant results requires a smaller sample size to produce statistical significance; thus, establishing a time limit for the test is less effective than establishing a minimum user threshold. Unless user exposure can be predicted with reasonable certainty, the test should run until it has reached some minimum level of exposure, at which point it can be tested for significance.

As with product development, testing should be managed through iterations; the best-performing variant of one test is not necessarily the best functional variant possible. Tests can be continually executed on a product feature to ensure optimal performance, but testing should be regarded as a means to an end and not an end of itself. In other words, testing for the sake of testing is a distraction that can potentially divert resources from their best possible applications. Multiple iterations of the same thematic test (e.g., the background color of a landing page or the text on a registration form’s submit button) will, at some point, experience a diminishing rate of return; when that rate of return drops below the 1 percent threshold, testing resources may be better allocated to new thematic tests (or moved away from testing altogether).

A/B testing is a product development tool, and like all such tools, it can help improve existing product features but it can’t manifest new features out of the ether. Testing accommodates the iterative, data-driven design process by providing an avenue through which design decisions can be quickly audited for effectiveness. Testing provides data; the burden of utilizing that data to the greatest possible effect falls upon the product team.

Interpreting A/B test results

The purpose of an A/B test is to determine the superiority of one feature variant over another. If each variant is exposed to the same number of people, and the behavior of one user group is clearly preferable to the behavior of the other, then determining the better variant is straightforward: choose the variant that produced the best results. But this approach is overly simplistic, as the differences in those results could potentially be due entirely to chance. If the number of users exposed to each variant was low—say, less than 100—the results from either group could be skewed by outliers or coincidence and not by genuine, fundamental differences between the variants.

To safeguard against skewed results polluting an experiment’s data sample, the product team must validate the results. This can be done in a number of ways: the first is to continue to run the test, exposing more users to it and collecting additional data points. With enough data, the product team can be confident that the sample that the test produces approximates the distribution of expected values—that is, that the sample of users being tested resembles the greater population of users and thus the results from the test represent the anticipated behavior from the entire user base. This approach necessarily requires a longer timeline than simply acting on an initial result set; in entrepreneurial environments, this extended time horizon may not be acceptable.

A second method of ensuring that test results are actionable and representative of the broader behavioral predilections of the user base is to run the test again. If the results of subsequent tests fall within an acceptable margin of the original results, then those results can be considered reliable. The problem posed by this approach is the same as that presented when staggering tests: exogenous, time-related factors (such as intra-week cyclicality or seasonal effects on user behavior) might significantly influence the results of any one test, making it incomparable to tests conducted at other times.

A third approach to validating test result reliability is a statistical method called the t-test, which gauges whether the difference in the average values of two groups (specifically two groups; measuring the differences across more than two groups is done with a different statistical method) is due to chance or to reliable differences in the behavior of those two groups. The t-test produces what is known as an inferential statistic, a generalization for the population from which a specific sample was taken.

For a t-test to produce valid results, three assumptions about the data sample must be true. The first assumption is that the groups being tested contain roughly the same number of data points. The second is that the underlying distributions of the two groups being tested exhibit the same degree of variance; this simply means that the values in each group are spread out around the average value to the same degree. The third assumption is that the samples taken from the population are normally distributed for small sample sizes, where “small” is defined somewhat loosely as fewer than 30. (Note that, when using the t-test, one does not assume that the population is normally distributed but assumes that the samples are.) When the experiment samples are not normally distributed, a rank-based alternative such as the Mann-Whitney U test (also known as the Wilcoxon rank-sum test) can be used in place of the t-test. If any of these assumptions are violated, the t-test will require some adjustment before producing trustworthy results. If all three of these assumptions hold, however, then the t-test, known in this case as the independent, two-sample t-test, can be executed on the groups.

To undertake a t-test, a null hypothesis must first be formulated. A null hypothesis is any statement that corresponds to a default position regarding the two test groups: the null hypothesis should represent a situation in which the test results indicate that no concrete action should be taken. For instance, when a test is run on the effect of a change to the background color of a registration process screen across two variants, a possible null hypothesis would be that a blue background, versus a red background, produces no meaningful increase in the number of successful registrations. If the null hypothesis is accepted, neither variant is preferred over the other. If it is rejected—that is, if a blue background does result in a meaningful increase in the number of successful registrations versus a red background—then the blue background would be adopted. The t-test produces a t-statistic, as expressed in Figure 3.20.


FIGURE 3.20 The t-statistic calculation.

In the numerator, x̄1 represents the mean value of group one and x̄2 represents the mean value of group two. In the denominator, SE represents standard error, described earlier as a measure of the level of variance within the sampling distribution of sample means. Standard error in the context of the t-statistic is calculated differently than within the context of confidence intervals because the t-statistic encompasses two samples, A and B. When calculated across more than one sample, standard error must be adjusted by the pooled standard deviation (SP), as expressed in Figure 3.21.


FIGURE 3.21 Standard error calculation for the t-statistic.

Pooled standard deviation serves as a blended measure of the variability of both sample groups. To produce the standard error, the pooled standard deviation is multiplied by the square root of the sum of the reciprocals of the group sizes (that is, the square root of 1/n1 + 1/n2). The symbols n1 and n2 represent the sizes of group one and group two, respectively, and, in a two-sample t-test, they should be roughly equal. The equation used to calculate pooled standard deviation is defined in Figure 3.22.


FIGURE 3.22 Pooled standard deviation equation.

In the equation for pooled standard deviation, s1² represents the variance of group one and s2² represents the variance of group two. Likewise, n1 and n2 represent the sample sizes of groups one and two, respectively.

When combined, the factors defined in Figure 3.20, Figure 3.21, and Figure 3.22 form the t-statistic, which represents the ratio of the difference between the groups to the variability within the groups. The t-statistic is paired with a corresponding p-value, which can be looked up in a standard t-distribution table given the degrees of freedom. A p-value represents the probability of observing a result at least as extreme as the one actually measured, given that the null hypothesis is true; in other words, the p-value doesn’t produce a binary decision about the result but rather quantifies how plausibly the observed difference could have arisen by chance.

As with confidence intervals, standard p-values used in accepting and rejecting hypotheses, and that correspond to ranges on the normal distribution, are 10 percent (or 100 percent minus 90 percent), 5 percent (or 100 percent minus 95 percent), and 1 percent (or 100 percent minus 99 percent); in academia, a p-value less than 0.05 (5 percent) is generally required to reject the null hypothesis. The product team must determine an acceptable p-value for a given test; a lower p-value corresponds to a stricter interpretation of the test results in rejecting the null hypothesis.
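
A sketch of the full calculation is shown below on two synthetic groups of first-session lengths, one per variant. The standard textbook forms of the pooled standard deviation and standard error are assumed here, and SciPy’s built-in two-sample t-test is included as a cross-check:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical first-session lengths (seconds) for two equally sized test groups.
group_a = rng.normal(loc=100.0, scale=25.0, size=500)
group_b = rng.normal(loc=104.0, scale=25.0, size=500)

n1, n2 = len(group_a), len(group_b)
s1_sq, s2_sq = group_a.var(ddof=1), group_b.var(ddof=1)

# Pooled standard deviation and standard error (standard textbook forms,
# assumed to correspond to Figures 3.21 and 3.22).
pooled_sd = np.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
standard_error = pooled_sd * np.sqrt(1 / n1 + 1 / n2)

# t-statistic and two-sided p-value with n1 + n2 - 2 degrees of freedom.
t_stat = (group_a.mean() - group_b.mean()) / standard_error
p_value = 2 * stats.t.sf(abs(t_stat), df=n1 + n2 - 2)

print(t_stat, p_value)
print(stats.ttest_ind(group_a, group_b))  # should agree with the manual calculation
```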

Whatever the method used to validate the results of an A/B test, the general principle underlying the testing methodology should be to acquire as much data as possible. No statistical acrobatics can substitute for a robust data set; data affords confidence, and product decisions made confidently are more authoritative and, ultimately, more gratifying for the product’s users.

Regression analysis

Building products that cater to (and are optimized for) observable patterns of user behavior is a necessity in the freemium model because of the revenue dynamics presented by the 5% rule. But perhaps more important than accommodating existing user patterns is predicting future behavior; since product development is iterative and data-driven in the freemium model, the trajectory of various user profiles is constantly changing. In order to gauge the effect of product changes or new features on the overall performance of the product, some predictive statistical methods must be utilized at the product management level.

One such method, and perhaps the most popular, is the regression method. It quantifies the relationship between a set of variables; a regression model produces a framework for predicting an outcome based on observable patterns in a given set of data.

What is regression?

Regression models are built using a data set of historical values. They are used to evaluate the relationship between independent and dependent variables in an existing data set and produce a mathematical framework that can be extrapolated to values of the independent variables not present in the data set. A diverse range of regression models exists, and the appropriate model to employ for a given task depends on the nature of the dependent variable being predicted. In some cases, an explicit value must be predicted—say, the total amount of revenue a new user will spend over the user’s lifetime.

In other cases, the value predicted by the regression model is not numeric but categorical; following from the example above, if, instead of the total revenue a new user will spend over the user’s lifetime, a model was constructed to predict whether or not the user would ever contribute revenue, the model would be predicting for a categorical (in this case, binary) variable: revenue or no revenue.
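
As a sketch of this distinction, the example below fits a linear regression to a continuous dependent variable (lifetime revenue) and a logistic regression to a binary one (whether the user ever contributes revenue). The features, coefficients, and data are entirely synthetic, and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(5)

# Hypothetical training data: one row per user, with first-session length (seconds)
# and first-day session count as the independent variables.
X = np.column_stack([
    rng.exponential(90.0, size=1_000),  # first-session length
    rng.poisson(2.0, size=1_000),       # first-day session count
])

# Continuous dependent variable: lifetime revenue (numeric prediction).
lifetime_revenue = 0.02 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 2, size=1_000)
revenue_model = LinearRegression().fit(X, lifetime_revenue)

# Categorical (binary) dependent variable: did the user ever spend? (1 or 0).
ever_spent = (lifetime_revenue > 5).astype(int)
conversion_model = LogisticRegression(max_iter=1_000).fit(X, ever_spent)

new_user = [[120.0, 3]]  # a hypothetical new user's first-day behavior
print(revenue_model.predict(new_user))           # predicted lifetime revenue
print(conversion_model.predict_proba(new_user))  # probability of [no revenue, revenue]
```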

Regression models for prediction face two basic limitations. The first is that, in complex systems, the number of independent variables that affect the value of the dependent variables may be too numerous to accommodate. Following the revenue example, consider a model designed to predict the exact numerical revenue contributed by a user on a given day. Such a model might take as inputs (independent variables) various aspects of the user’s behavior on the day prior to the day being predicted for: amount of contributed revenue, total sessions, total session length, etc. But a given user’s predilection to contribute revenue on a specific day can be affected by events and conditions simply beyond the scope of what the product can observe: job stability, major life events, weather, etc. Any such model of per-day revenue contributions is therefore fundamentally inadequate; its output data can never be trusted as accurate, given an incomplete set of inputs.

The second limitation of regression models is that predicting future results is dependent on historical data. This problem isn’t unique to regression, but regression models in product management can sometimes be interpreted as unimpeachably accurate—analytical solutions to creative problems—which, by definition, they are not. User behavior can change over time for any number of reasons, in aggregate and at the individual level.

As discussed earlier, any competent product released into a market changes the dynamics of that market and the demands that users make of products in that vertical. Since these changes take place over time, regression models must adapt to new behaviors, and behavioral transition phases—where new behavioral data conflicts with old—wreak havoc on regression results. For these reasons, regression models must be tempered with seasoned product management intuition and market insight. No market is static, and neither is user behavior.

With these limitations considered, regression models can provide valuable guidance in predicting user behavior. Regression models are commonly used in freemium product management to put users into segments that predict their “end states”—descriptions of how the users interacted with the product over their lifetimes of use—starting as early as possible. This predictive segmentation allows broad product decisions to be made from a limited set of data on a restricted timeline.

In order to make decisions from actual end state data, the product team must collect data points over many users’ entire lifetimes; the behavior most important to analyze with a regression model is that of the most highly engaged users, who are in an extreme minority and can use a product for months or even years. Collecting enough real data points on highly engaged users’ behavior isn’t practical; instead, users must be predictively segmented from an early stage, using their behaviors throughout their lifetimes to improve the segmentation routine. The downside of this approach is lack of accuracy, but the benefits are an increased speed at which decisions can be made and an increased volume of decision-making data (since every product user, by definition, creates behavioral data artifacts in their first product session, but fewer users create data artifacts weeks later).

The regression model in product development

The process of influencing product development decisions using regression models fits nicely into the develop-release-measure-iterate feedback loop. Regression models are always built on a set of historical data in a procedure known as training. Training a regression model involves iteratively adding and modifying parameters to the base model until the model reaches an acceptable level of error. Once the model has been trained, it is run on a second set of data to ensure that it is predictive; in this stage, a different set of historical data is fed into the model and the model's output is compared to the actual values of the dependent variable.
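As a minimal sketch of this train-and-validate procedure, the snippet below (assuming Python with numpy and statsmodels, purely synthetic data, and hypothetical feature names) fits a model on one slice of historical observations and then checks its predictions against a held-out slice:

```python
# Minimal sketch of training a regression model on one slice of historical
# data and validating it against a second, held-out slice. The feature names
# and the choice of numpy/statsmodels are illustrative assumptions; the data
# is synthetic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical historical data: day-1 session counts (independent variable)
# and eventual lifetime revenue (dependent variable) for 1,000 users.
day1_sessions = rng.poisson(3, size=1000).astype(float)
lifetime_revenue = 2.5 * day1_sessions + rng.normal(0, 1.5, size=1000)

# Split the history into a training set and a validation set.
train_x, test_x = day1_sessions[:800], day1_sessions[800:]
train_y, test_y = lifetime_revenue[:800], lifetime_revenue[800:]

# Train: fit the model on the training slice only.
model = sm.OLS(train_y, sm.add_constant(train_x)).fit()

# Validate: predict the held-out dependent values and compare to the actuals.
predictions = model.predict(sm.add_constant(test_x))
mean_abs_error = np.abs(predictions - test_y).mean()
print(model.params, mean_abs_error)
```

Whether the resulting error is acceptable depends on the tolerances discussed below.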

The definition of an acceptable level of error in a regression model’s output depends on a product team’s tolerance for inaccuracy, how important the metric being predicted is to the organization, and the significance of the product decision being made. If the model’s output can be useful and can provide value in being merely directionally accurate or accurate in magnitude, then the precise value of the output isn’t necessarily important as long as the model produces a reasonable result. If, instead, absolute precision is required, then regression modeling should be allocated an appropriate level of resources to fulfill that requirement (or a more intensive machine-learning technique should be employed).

In order to build a training set, a feature space (a set of independent variables that could potentially affect the value of the dependent variable or variables) must be selected. If this feature space isn't available in its entirety in the existing events library, the missing features should be added and tracked to accumulate a historical data set. Often in regression modeling, categorical (usually binary, taking a value of 1 or 0) variables that indicate the absence or presence of a user's behavior or characteristic (called dummy variables) are included in the feature space.

An example of a dummy variable is a 1 or 0 value corresponding to whether or not a user provided a valid email address upon registration or has ever contacted customer support. These behaviors, while not measured as continuous variables, could potentially have an effect on the dependent variable, especially when the behaviors are directly related to some consequence of the dependent variable.
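A small sketch of how such dummy variables might be added to a feature space, assuming pandas and hypothetical column names:

```python
# Sketch of encoding dummy (0/1) variables in a feature space. The column
# names and values are hypothetical.
import pandas as pd

users = pd.DataFrame({
    "sessions_day1": [3, 1, 7, 2],
    "provided_email": [True, False, True, True],        # behavioral flag
    "contacted_support": [False, False, True, False],   # behavioral flag
    "platform": ["ios", "android", "ios", "web"],       # categorical
})

# Boolean flags become 1/0 dummies directly; a multi-level categorical
# variable expands into one dummy column per level.
features = pd.get_dummies(users, columns=["platform"]).astype(int)
print(features)
```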

The analyst should take care not to include too many parameters, dummy or otherwise, in a regression model in an attempt to perfectly match the training data. In any model (and indeed, any system), randomness contributes to whatever outcome or pattern is being observed. No model can be totally, unfailingly accurate; the more parameters that are added to a model in an attempt to perfectly match the training data, the more susceptible the model is to overfitting. This term describes a model that puts undue emphasis on randomness, thereby sacrificing an accurate measurement of the underlying relationship between the variables. There may not even be a relationship between two variables; common sense necessarily plays a role in discerning viable candidates for regression models, lest spurious relationships be overfitted and used to make important product decisions.
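The effect is easy to demonstrate on synthetic data. In the sketch below (an illustration only, not a recipe), a twelfth-degree polynomial hugs the noisy training points more tightly than a straight line does, but it typically predicts the held-out points worse:

```python
# Illustration of overfitting on purely synthetic data: a high-degree
# polynomial matches the training points more closely than a straight line
# but typically predicts the held-out points worse.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = 2.0 * x + rng.normal(0, 0.3, size=30)   # truly linear relationship plus noise

train_x, test_x = x[::2], x[1::2]           # alternate points: train vs. holdout
train_y, test_y = y[::2], y[1::2]

for degree in (1, 12):
    coefficients = np.polyfit(train_x, train_y, degree)
    train_error = np.abs(np.polyval(coefficients, train_x) - train_y).mean()
    holdout_error = np.abs(np.polyval(coefficients, test_x) - test_y).mean()
    print(degree, round(train_error, 3), round(holdout_error, 3))
```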

Meaningfully extracting value from a regression model can take a number of forms. If the purpose of the regression model is to predict long-term behavior given some nascent product change (perhaps a change implemented through an A/B test), then the regression output itself is the value; the output dictates whether or not the change should be retained (given a minimum success threshold for the change, e.g., a 50 percent increase in revenue). Often, however, a regression model may test the long-term impact of assumed changes to various metrics in an attempt to prioritize feature development. In these cases, regressions may serve no other purpose than to add analytical depth by providing several "what if" scenarios to what will essentially be an exercise in intuition.

Regression can be programmatically implemented into products to automate certain features, especially classification; for instance, users put into certain revenue groups via real-time regression modeling may dynamically experience a change to the product interface. This happens often with advertising: as users are classified as likely to make purchases in the future, advertising features are dynamically turned off to avoid alienating them. These regression classifiers tend not to be accurate unless they are trained on a voluminous set of data, however, and manual testing should establish an acceptable level of accuracy before they are deployed in an automated fashion.
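One way such automation might look in practice is sketched below; the scoring function, coefficients, feature names, and threshold are entirely hypothetical placeholders standing in for a model that has already been trained and manually validated.

```python
# Hypothetical sketch: use a pre-trained classifier's score at runtime to
# toggle a product feature. The coefficients, feature names, and threshold
# are placeholders, not values from the text.
import math

INTERCEPT = -2.0
WEIGHTS = {"sessions_day1": 0.4, "provided_email": 1.1}

def likely_payer_probability(user_features: dict) -> float:
    """Logistic-style score: estimated probability that the user will pay."""
    z = INTERCEPT + sum(w * user_features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

def show_advertising(user_features: dict, threshold: float = 0.5) -> bool:
    # Suppress ads for users classified as likely future purchasers.
    return likely_payer_probability(user_features) < threshold

print(show_advertising({"sessions_day1": 6, "provided_email": 1}))
```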

Linear regression

Linear regression is a type of regression model where a set of data points is entered into a scatter plot and a straight line is fitted to the data as well as possible. The line indicates a directional trend resulting from the relationship between the independent and dependent variables. A linear regression model works on the assumption that the relationship between the variables is linear; that is, the relationship’s intensity and direction are constant over the range being examined. The purpose of linear regression is to quantify how the dependent variable changes, given a change in the independent variables.

Before a linear regression model can be constructed, some assumptions about the data set must be validated. The first assumption is that the variables indeed share a linear relationship; that is, on a per-unit basis, a change to the independent variable will always result in the same change in the value of the dependent variable. This doesn’t hold true in a quadratic relationship, which is illustrated by a curvature in the data points. See Figure 3.23.

FIGURE 3.23 Graphs depicting linear (left) and quadratic (right) relationships.

The second assumption is what is known as homoskedasticity, which relates to the vertical distance between the straight line drawn through the data points on the scatter plot and the data points in the sample. The vertical distance between a given data point and the line drawn through the plot is called a residual, and homoskedasticity means that the residuals over the range of the line are consistently spread out. This is easy to see with the scatter plot: if the data around the line grows more or less scattered at any segment, the data is not homoskedastic but heteroskedastic, and thus a simple linear regression model cannot be built on the data set. See Figure 3.24.

FIGURE 3.24 Regression graphs exhibiting homoskedasticity (left) and heteroskedasticity (right).

The third assumption is of residual independence, meaning the residuals at any point on the graph are not influenced by the residuals at the preceding point on the graph. This assumption is more difficult to rigorously test for; it requires either plotting the residual values on a separate graph and looking for a pattern (which indicates that the residuals are not independent of each other) or calculating what is known as the Durbin-Watson statistic. A less scientific test for this assumption is to look for randomness in the residuals on the original graph; in other words, if a pattern is apparent within the residuals over the course of the graph, it probably indicates dependence and thus means a linear model is not appropriate for this data. If this assumption doesn’t hold true, then it means the residuals are autocorrelated.

The fourth assumption is not a strict requirement but makes interpreting regression results easier: the residuals should be normally distributed around the line. Essentially, this means that a normal probability distribution function should apply vertically at each point along the line, with the bulk of the residuals falling within one standard deviation of the mean (which is a residual of 0, or the line itself) and very few falling outside of three standard deviations. This is difficult to test for and, in most cases, is not a significant concern.
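For readers who prefer to check these assumptions numerically rather than by eye, the sketch below (assuming statsmodels and synthetic data, with a line already fitted by ordinary least squares, described next) computes a rough homoskedasticity comparison, the Durbin-Watson statistic, and a normality test on the residuals:

```python
# Sketch of numeric residual diagnostics for a fitted line, on synthetic
# data: a rough homoskedasticity check, the Durbin-Watson statistic (values
# near 2 suggest the residuals are not autocorrelated), and a Jarque-Bera
# test of residual normality.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson, jarque_bera

rng = np.random.default_rng(2)
x = np.linspace(0, 100, 200)
y = 0.8 * x + 5 + rng.normal(0, 4, size=200)   # roughly linear synthetic data

fit = sm.OLS(y, sm.add_constant(x)).fit()
residuals = fit.resid

# Rough homoskedasticity check: residual spread over the lower half of the
# x range should be of similar size to the spread over the upper half.
print(residuals[:100].std(), residuals[100:].std())

# Residual independence: a Durbin-Watson value well below 2 suggests
# positive autocorrelation; well above 2 suggests negative autocorrelation.
print(durbin_watson(residuals))

# Residual normality: a large Jarque-Bera p-value is consistent with
# normally distributed residuals.
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(residuals)
print(jb_pvalue)
```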

The most practical approach to building a linear regression model is known as the ordinary least squares (OLS) method, which is used to minimize the sum of the squared vertical distances between the data points in the sample and the line drawn through them. The OLS method minimizes the residuals in order to build a model of the relationship between the independent and dependent variables. Thus, the model output produces a set of equation variables that can be applied to independent variables not present in the data set to produce predicted values of the dependent variable (given that the assumptions detailed above can be expected to hold true for data outside the original set).
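Written out in the notation used below, the quantity OLS minimizes is the sum of the squared residuals:

```latex
% The ordinary least squares objective: choose the slope m and intercept b
% that minimize the sum of squared vertical distances (residuals) over the
% n sample points.
\min_{m,\,b} \; \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2
```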

FIGURE 3.25 An ordinary least squares regression line.

The OLS method produces an equation in the form of y=mx+b, where y is the value of the dependent variable, m is the slope of the linear regression line (in other words, the quantified relationship between the variables), x is the value of the independent variable, and b is the value at which the line intercepts the y-axis.

Running an OLS regression by hand is impractical and unnecessary; most desktop spreadsheet software can automatically execute an OLS regression even on a fairly large data set, and statistical software packages can handle OLS regression with data points numbering in the millions. The inputs for any OLS calculation are the independent variables (x values) and the dependent variables (y values).

The important output values from the OLS function are the values for the slope of the line, the y-intercept, and the coefficient of determination, which is almost always represented as R2 in statistical packages. The coefficient of determination measures the strength of the linear relationship between the variables on a scale from 0 to 1; 0 indicates no explanatory relationship between the variables and 1 indicates a perfect explanatory fit between the variables. The more robust a relationship is between the variables, the closer the R2 statistic will be to 1.

Once the output from the OLS regression is available, values of the independent variable outside of the range provided in the data set can be substituted into the equation y=mx+b to predict values of the dependent variable. These substitutions, especially when being made to predict user behavior such as revenue generated or user engagement, are usually graphed out, either in spreadsheet software or in a report.
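A minimal sketch of this workflow in Python, assuming numpy and statsmodels and synthetic data (the variable names are illustrative): fit the line, read off the slope, intercept, and R2, then substitute a new independent-variable value into y=mx+b.

```python
# Sketch of the OLS workflow described above: fit a line, read off the
# slope, intercept, and coefficient of determination, then substitute a new
# independent-variable value into y = m*x + b. Data and names are synthetic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
day1_sessions = rng.uniform(0, 10, size=500)                          # independent (x)
lifetime_revenue = 1.8 * day1_sessions + rng.normal(0, 2, size=500)   # dependent (y)

fit = sm.OLS(lifetime_revenue, sm.add_constant(day1_sessions)).fit()
b, m = fit.params          # intercept and slope
r_squared = fit.rsquared   # coefficient of determination

# Predict the dependent variable for an x value outside the observed range.
new_x = 14
predicted_revenue = m * new_x + b
print(round(m, 2), round(b, 2), round(r_squared, 2), round(predicted_revenue, 2))
```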

When using only one independent variable in constructing a linear regression model, the approach is known as simple linear regression. When using more than one independent variable, the approach is known as multiple linear regression. Other models exist, too, many of which address cases wherein the assumptions outlined above are invalidated, such as those of heteroskedasticity or of non-normality of residuals.

As new test data is available, the regression model should be updated, especially if the model is not being used to make a singular decision about a product feature but rather for forecasting a metric, such as revenue. When implemented programmatically, regression models can be automated to accommodate a steady stream of new user data.

Logistic regression

Logistic regression is an extremely robust and flexible method for dichotomous classification prediction; that is, it is used to predict for a binary outcome or state, such as yes/no, success/failure, and will occur/won’t occur. Logistic regression solves many problems faced in freemium product development that linear regression can’t, because rather than predicting a numerical value (e.g., a user’s total lifetime revenue), it predicts a discrete, dichotomous value (e.g., the user will spend money or not spend money on the product). For this reason, logistic regression might more accurately be called logistic classification.

Problems around health issues are often given as examples for which logistic regression is appropriate, such as whether or not a person has a specific disease or ailment, given a set of symptoms. But the examples of logistic regression’s applicability for freemium product development are abundant and obvious because user segmentation is such an important part of the successful implementation of the freemium model. In order to optimize the user experience within the context of the freemium model, a user’s tastes and behavioral peculiarities must be accommodated, and doing so early in the product’s use allows for the greatest degree of optimization. Logistic regression is perhaps one of the best ways of undertaking such classification.

Similar to linear regression, logistic regression produces a model of the relationship between multiple variables. Logistic regression is suitable when the value being predicted is the probability of a binary outcome, bounded by 0 and 1.

Linear regression wouldn't be appropriate in such cases because the dependent variable is constrained between 0 and 1; extrapolating a straight line beyond the values present in the sample data set could produce impossible results (below 0 or above 1). A probability curve on a binary scale must therefore be sigmoid shaped (s-shaped) and mathematically constrained between 0 and 1, which the logistic regression model provides. See Figure 3.26.

FIGURE 3.26 A linear regression equation on a linear scale (left) and a logistic regression equation on a probability scale (right).

A perfectly shaped S on the probability curve in a logistic regression corresponds to a perfectly straight line in linear regression; in order to assess the fit of the logistic model, the data must be transformed. This is done by converting the probabilities to odds, taking the logarithm of those odds (the log odds, or logit) and, instead of fitting the curve by minimizing residuals, iteratively testing different parameters until a best fit for the log odds is found (called the maximum-likelihood method).
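The transformation just described can be written compactly: the odds of a probability p, the log odds (logit) that the model fits linearly, and the resulting S-shaped curve that maps the linear predictor back onto the 0-to-1 scale.

```latex
% Odds and log odds (logit) of a probability p, and the logistic curve that
% maps the linear predictor mx + b back onto the 0-to-1 probability scale.
\mathrm{odds}(p) = \frac{p}{1-p}, \qquad
\ln\!\left(\frac{p}{1-p}\right) = m x + b, \qquad
p = \frac{1}{1 + e^{-(m x + b)}}
```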

The maximum-likelihood method is computationally intensive and, although it can be performed in desktop spreadsheet software, it is best suited for statistical software packages. The output of logistic regression is reported in terms of odds ratios: the factor (bounded by 0 and infinity) by which the odds of the binary dependent variable being true change, given a one-unit increase in the independent variable.

Compared to the results of a linear regression, which might read, “A one-unit increase in day 1 user retention correlates with a 10-unit increase in lifetime user revenue,” the results of a logistic regression would read, “A one-unit increase in day 1 user retention correlates with an increase by a factor of 10 in the odds that the user will eventually spend money in the product (versus not spend money).”

Because logistic models are inherently heteroskedastic, and thus the maximum-likelihood method does not seek to minimize variance in the model, there exists no measure of fit in logistic regression analogous to the R2 statistic in linear regression. There do exist, however, several pseudo-R2 statistics that convey the same basic information—goodness of fit—as does R2 and are formulated along the same scale of 0 to 1 (although in some cases, exact values of 0 or 1 may not be possible). Some common pseudo-R2 statistics reported by statistical packages include McFadden’s R2, McFadden’s Adjusted R2, Efron’s R2, and the Cox-Snell R2. As with the OLS R2 statistic, the closer a pseudo-R2 value is to 1, the better the model fits the data.
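A minimal sketch of a logistic regression in Python, assuming statsmodels and synthetic data: the odds ratios come from exponentiating the fitted coefficients, and the result object reports McFadden's pseudo-R2 as prsquared.

```python
# Sketch of a logistic regression on synthetic data: fit the model, convert
# the coefficients to odds ratios, and read off McFadden's pseudo R-squared.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
day1_retained = rng.integers(0, 2, size=2000).astype(float)   # 1/0 dummy variable

# Synthetic binary outcome: retained users are more likely to ever pay.
pay_probability = 1.0 / (1.0 + np.exp(-(-2.0 + 1.5 * day1_retained)))
ever_paid = (rng.uniform(size=2000) < pay_probability).astype(float)

fit = sm.Logit(ever_paid, sm.add_constant(day1_retained)).fit(disp=False)

odds_ratios = np.exp(fit.params)   # factor change in odds per one-unit increase
print(odds_ratios)
print(fit.prsquared)               # McFadden's pseudo R-squared
```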

User segmentation

User segmentation is a technique used to personalize and optimize the product experience for different use cases and tastes; it involves separating users into groups based on predefined characteristics and exposing each group to the product experience that most strongly resonates with the group. User segmentation is one of the primary means by which freemium products optimize the user experience at the level of the individual user, and it is an important strategy for effectively monetizing a product within the constraints of the 5% rule.

User segmentation can be undertaken in myriad ways. One approach is to apply a semi-permanent “tag” to a user, usually within the analytics infrastructure, which adjusts the user’s experience according to an optimized rule set associated with that tag. The most common such tags are related to the user’s geography (such as country) and device; an obvious and straightforward example of user segmentation to optimize the user experience is localization, whereby all text in a product is automatically displayed in the official language of the user’s country.
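A sketch of the tag-and-rule-set approach, with entirely hypothetical tags and rules:

```python
# Sketch of tag-based segmentation: semi-permanent tags (here, country and
# device) map to a rule set that adjusts the user's experience. The tags,
# device names, and rules are hypothetical.
LOCALIZATION = {"DE": "de", "FR": "fr", "JP": "ja"}    # country -> language
LOW_END_DEVICES = {"phone_model_x", "phone_model_y"}   # hypothetical models

def experience_rules(country: str, device: str) -> dict:
    return {
        "language": LOCALIZATION.get(country, "en"),   # default to English
        "reduced_graphics": device in LOW_END_DEVICES,
    }

print(experience_rules("DE", "phone_model_x"))
```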

Another approach is to segment users based on past interactions with the product in order to conform the product’s content and design to the user’s tastes. For instance, a news reader might arrange stories based on a user’s specific browsing habits. Behavioral segmentation is the most sophisticated and data-intensive implementation of the method, but it also has the potential to optimize the user experience to the greatest degree by utilizing data at the granularity of the individual user.

Data plays a central role in user segmentation: segments are built using models that gauge how different groups can best be catered to and what metric thresholds should define those groups. As a product evolves, so should user segments; product use cases can change over time, and static user segments are generally useful only for reporting or when those user segments represent demographic characteristics that are unlikely to ever change.

As new data becomes available, the models that define user segments must reassess their prior determinations to ensure that relevant optimization takes place. The need for timely data must be balanced against resource limitations and the pace of the product development schedule, but models that determine user segments—especially if those user segments have a meaningful impact on the user experience—should be revisited as often as is practical to ensure that they haven't fallen out of date.

Behavioral data

Behavioral data is the richest form of insight a product team can utilize to optimize the experience at the individual level. Qualitative, explicit feedback from users is helpful in the product development process, and users can often articulate the shortcomings of a product very eloquently, but behavioral data communicates a more sincere truth about the user needs that aren't being fully satisfied.

Behavioral data fits nicely into user segmentation models because user segments exist to describe and influence behavior. A common user segment class is engagement. Users may be divided into groups ranging from those representing very highly engaged users to those representing users considered highly likely to abandon the product (churn) in the near term. With such a segment class, user behavior is the only valid means of defining the group a user belongs to, usually through the number and length of product sessions. A product team might use such a segment class to dynamically reintroduce various product features to lightly engaged users or to prompt highly engaged users to invite their friends into the service. Whatever the decision, a user segment defined by engagement—a fundamental behavioral measure of delight—can only be constructed using behavioral data.

A user segmentation model built around behavioral data likely uses a predetermined set of requirements to establish the different groups into which users are funneled. Such a requirements set should be the product of an analysis on product use and an acknowledgement of the organic way that use cases have emerged. An exploratory data analysis of, say, a new product feature might involve looking for breaking points between patterns of use and then using those breaking points to define a set of user groups. Once those user groups are defined, the breaking points are set as thresholds for grouping users and users are segmented accordingly.

The purpose of behavioral user segmentation is to influence future behavior, which can best be done with a granular, individual approach (rather than a broad, one-size-fits-all approach). Continuing from the engagement example, a product team might look at how patterns of use have naturally evolved in a product and create a set of groups (based on a combination of engagement, revenue, and usage metrics) that acknowledges those patterns, such as highly engaged, mildly engaged, and likely to churn. Users would then be organized into those groups based on the defined usage thresholds, and the product team’s agenda would include initiatives to influence the behaviors of each group into a desired result.
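A sketch of such threshold-based grouping, with hypothetical thresholds standing in for the breaking points found through exploratory analysis:

```python
# Sketch of threshold-based behavioral segmentation into the three groups
# named above. The thresholds are hypothetical stand-ins for breaking points
# found through exploratory analysis of actual usage data.
def engagement_segment(sessions_last_7_days: int, minutes_last_7_days: float) -> str:
    if sessions_last_7_days >= 10 and minutes_last_7_days >= 60:
        return "highly engaged"
    if sessions_last_7_days >= 3:
        return "mildly engaged"
    return "likely to churn"

print(engagement_segment(12, 90))   # -> "highly engaged"
print(engagement_segment(1, 5))     # -> "likely to churn"
```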

Inflection metrics are very helpful in setting goals for behavioral influence and defining user segments. An inflection metric is a measure of the point at which a user is considered likely to achieve a desired later result—convert, retain, invite a friend into the product, etc. An example of an inflection metric in a freemium news reader might be number of feeds subscribed to, where the product team has observed that users who subscribe to at least 10 feeds become far more likely to purchase a premium subscription to the service than do users at engagement levels below 10 feeds. In other words, the tenth news feed represents a threshold at which conversion becomes probable. This threshold represents an inflection point on the user’s trajectory toward conversion.

Inflection metrics are not absolutes. In the case above, the inflection metric doesn’t mean that a user will convert on the tenth RSS news feed subscription, or that all users with 10 RSS news feed subscriptions ultimately convert. The metric simply represents a strong indication that a user will convert; the purpose of such an indication is to provide a concrete, easily articulated goal for the product team, in terms of influencing product behavior.
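A sketch of how a candidate inflection metric like the tenth subscription might be sanity-checked against historical data: compare conversion rates for users below and at or above the threshold. The observations here are synthetic, and a jump in the rate is an indication, not a guarantee.

```python
# Sketch of checking a candidate inflection metric: compare conversion rates
# for users below versus at-or-above the candidate threshold. The
# observations are synthetic and for illustration only.
observations = [(2, 0), (3, 0), (5, 0), (8, 0), (9, 1), (10, 1),
                (11, 1), (12, 0), (14, 1), (15, 1), (6, 0), (10, 0)]  # (feeds, converted)

def conversion_rate(pairs):
    return sum(converted for _, converted in pairs) / len(pairs) if pairs else 0.0

threshold = 10   # candidate inflection point: ten feed subscriptions
below = [pair for pair in observations if pair[0] < threshold]
at_or_above = [pair for pair in observations if pair[0] >= threshold]
print(f"below: {conversion_rate(below):.0%}, at/above: {conversion_rate(at_or_above):.0%}")
```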

If the tenth RSS news feed subscription is seen as a goal, the product team can endeavor to influence every user to subscribe to at least 10 news feeds. This creates an obvious line about which users can be organized into groups: users who have subscribed to at least 10 news feeds and users who have not. The users who have not yet subscribed to 10 news feeds will likely be exposed to a different product experience than those who have, with the experience featuring vigorous inducements to subscribe to additional news feeds until they reach their tenth subscription.

Inflection metrics provide clear guidelines for testing and updating behavioral models. An inflection metric might represent the basis over which a product feature is prioritized, given that feature’s impact on progress toward the inflection metric. When inflection metrics can be established as critical goals, prioritizing a product feature backlog becomes a quantitative and objective, rather than an intuitive, process: whichever features best position the product to achieve the inflection metric are of the highest priority.

Behavioral objectives, especially inflection metrics, must be revisited as the product evolves and its feature set grows. New users across long time horizons bear little relation to each other in freemium products; an inflection metric for early adopters doesn’t necessarily hold true for users who adopt the product in a more mature state. The models that define inflection points and user segments don’t need to be dynamic and automated, but they should be timely enough for product teams to feel comfortable that they’re striving to achieve ambitious yet achievable goals.

Demographic data

Demographic data plays an important role in user segmentation strategy, but it is not nearly as valuable as behavioral data. Behavioral data provides richer, more individualized insight into how a specific user interacts with a product, from a perspective more conducive to increased monetization. Demographic data points are broad and difficult to draw specific conclusions from, and demographic data tends not to shift over time with the evolution of the product. Whereas behavioral data adapts as new product features are introduced, demographic data (say, a user's country of origin) remains constant.

That said, demographic data used in conjunction with behavioral data can contribute new dimensions to analysis initiatives and therefore provide value. What’s more, demographic data is available as soon as a user interacts with a product for the first time, whereas meaningful behavioral data must accumulate through product use. As a set of first-pass regression points, demographic data allows for determining an initial segmentation of a user, which can be revisited once behavioral data related specifically to that user’s tastes is collected. The timeliness and immediacy of demographic data in itself is of value; the sooner a user can be profiled, the sooner the user’s experience can be optimized.

Additionally, while behavioral data provides context around intent (especially with regard to monetization), demographic data provides context around ability. And though optimizing a product for user tastes has more potential in terms of personalizing the product experience and optimizing for revenue, understanding a user group’s ability to make purchases can help in making prioritization decisions.

Certain geographies don't monetize well in freemium products, especially when those freemium products haven't been localized for those regions. Likewise, monetization patterns differ between various devices and product platforms. A user's demographic profile can contribute useful insight into the user's propensity to monetize. Location is the most basic and generic user demographic information available and should be accessible on almost any platform. Location can be used to estimate disposable income as well as linguistic and cultural norms that can affect how the product is used. Localization is a particular concern for freemium products, given their dependence on scale: the amount of text present in a product's user interface can significantly impact how well it is received in markets for which it has not been localized.

Localization is a time-consuming and, in some cases, expensive endeavor that must be undertaken with a full understanding of the potential return on investment. Such an expense should be assumed only when the size of the applicable segment is known, which is possible only through tracking geographic user data and researching the product’s potential addressable market in a given region.

Device data can be very instructive when segmenting users, especially on stratified platforms like mobile, where the quality of devices spans a broad range. Device data should inform product feature launch decisions—specifically, which user segments to launch for—because compatibility and general performance problems can have a negative impact on the reception and growth of a product. This impact is most notable in highly social products with active communities and in products launched on platforms for which user reviews affect discoverability. The quality of a user’s device can also serve as an indicator for disposable income and thus the propensity (and ability) to spend money in a freemium product, although it is a weaker signal than some behavioral indicators.

Demographic data can become more useful when it is available at a deeper level, such as when a user has connected to a product through a social network's API and the user's behavioral data from that network is accessible. This level of granularity of demographic data—age, gender, education level, employment, general interests, etc.—can be used to emulate behavioral data, which combines the benefits of user tastes with the immediacy of demographic data.

Better still, social networking connectivity also provides information about a user’s friends and whether or not they are also using the product. Friends’ use of a freemium product can be a potent leading indicator of a propensity to make purchases in the product, especially if those friends have made purchases in the past. Social networking data can represent a goldmine of demographic information (albeit unverifiable) about a user that can add nuance to segmentation filters.

All told, while it is valuable in its own right, demographic user data is put to best use when combined with behavioral data. Broad-stroke generalizations about users from a specific geography or who use a specific device aren’t independently actionable; they merely describe the state of the user base. Information about how a user interacts with a product is far more valuable than information about the user in a real-world context, especially in freemium products where monetization and engagement are driven by personal delight.

Predicting user segments

User segmentation based on the current state of users is valuable for the purposes of reporting, where it can be used to gauge, over time, the success of new product features in engaging and monetizing users. But current state user segmentation can’t help in the feature pipeline prioritization process, and it can’t help determine how features should be implemented. To assist in these initiatives, segments must be based on predictions for a future state, most often the end state.

An example of an end state predictive user segment is the total revenue segment class, where users are grouped based on how much total revenue they are expected to spend in the product over their lifetimes. Such a prediction, if accurate, could help in optimizing the experiences of the users predicted to contribute the most revenue, as well as helping the product team provide a better product experience to all users by using monetization as a proxy for delight. In short, current state user segmentation, while immediately accurate (because it merely articulates the narrative captured by the product's data infrastructure), can do little to direct the future of the product toward an optimized user experience.

Predicting a user’s future state requires knowledge about their current state. Demographic and behavioral data can be used to build a profile of the user in the user’s current state, and certain aspects of the user’s behavioral history might be able to be extrapolated into the future. But the best way to predict a user’s future behavior (and thus the user’s future state) is to understand how similar users behaved. This is how regression models are implemented: historical data is used to estimate the relationship between variables, and that relationship is applied to the user’s current state to project a future state.

Freemium products are designed for longevity; a long lifespan goes hand-in-hand with long-tail monetization. Current state user segments are most useful for the purpose of reporting and as inputs to models that facilitate estimating end state segmentation; in turn, end state segments are only useful if user behavior can be influenced. Every user interacts with a freemium product for the first time as a non-engaged, non-paying user, someone touring the product but not yet committed to using it with any enthusiasm. Some product teams neglect to address the malleability of that user’s trajectory through the product; they consider a first-time user’s relationship with the product as predestined and immutable. But users do respond to personalized, optimized experiences, and a user’s future state in a product can be molded in a way that benefits both parties. Users provide a vast amount of information to product teams about what they want, but they will only do so if they believe that information is being acknowledged. Churn is the result of product teams ignoring what users tell them, through the users’ own behaviors, about how they want the product to be developed.

User segmentation prediction models need not be tightly integrated into the analytics system; if enough demographic and behavioral data about current user state is available, then models can be built independently and the results can exist separately from the ever-growing product data set. More important than current data in user segmentation prediction models are relationships that are abstract enough to be generally true, short of drastic changes in the product or the user base. If, for instance, monetization patterns in the earliest sessions of product use are strong determinants of end revenue state, then constantly measuring the degree of that relationship is a less impactful use of resources than is using that data to influence early monetization behavior.

Flexibility is an important characteristic of user segmentation models. If models cannot be easily updated with new assumptions or historical data, they aren’t likely to be revisited once new information becomes available. The best models are built as frameworks that can adapt to new parameters without being programmatically reconstructed. And as end state data accumulates—that is, as users churn out—the model should be tested against not only current state inputs but also end state outputs. Users abandoning the product leave a trail of data artifacts as they do so; these artifacts can be used to help hone the product team’s understanding of the relationship between current state and end state. Such artifacts are perhaps more valuable than timely current state data, if only for being more scarce.

Actual end state data is the only standard against which the accuracy of predictive user segmentation models can be judged. But numeric accuracy isn’t necessarily the highest concern with predicting a user end state; if models are directionally accurate and accurately assess the magnitude of a relationship between variables, they can be used to prioritize feature development.

Only a wizard could predict a user’s end state in terms of revenue to the penny, and the pursuit of wizardry through programmatic statistical prediction faces rapidly diminishing returns. An understanding of the dynamics of a product is enough to make decisions not for the sake of matching reality to predictions but for enhancing the user’s experience to the utmost degree.