
Designing for Behavior Change: Applying Psychology and Behavioral Economics (2013)

Part V. Refining the Product

You’ve built one heck of a product. It looks good, it’s engaging, and it incorporates the best of behavioral science. After testing and gathering feedback on the wireframes and prototypes, and building the first versions of the actual product, you’re ready to deploy it. The following chapters cover what comes next: figuring out how much of an impact the product actually has, and generating ideas on how to improve it (see Figure 27).

In terms of the iterative product development process first presented in the Preface, these chapters are about impact, insight, and measuring changes to the product. Chapter 12 gathers data on the product’s current impact and sets a benchmark for future changes to the product. Chapter 13 gathers additional data on user behavior to find the places in the application where users get stuck, and develops potential solutions. Chapter 14 further develops those potential solutions and integrates them into the next round of the product development cycle. The changes to be measured could be fundamental changes to the outcome, actor, or action, or additions to the behavioral plan or user stories added to the product’s backlog of tasks.


Figure 27. Part V covers the impact assessment, insight and ideas, and moving deeper into the spiral with changes to the product and measurements of each change in future iterations

Chapter 12. Measuring Impact

Opower, an energy-efficiency company based in Arlington, Virginia, runs some of the largest experiments in the world about how products can change behavior. Millions of people have participated in their studies—simply by opening a letter from their utility company or fiddling with their thermostat.

Opower is best known for delivering monthly reports to utility customers, showing them how their energy usage stacks up against their (anonymous) neighbors. It’s a well-studied technique in social psychology called peer comparisons, which we discussed a bit in Chapter 10. Figure 12-1 shows an example of one of its comparisons.


Figure 12-1. Opower energy report, comparing the reader’s home heating usage to that of his neighbors

A host of government, private, and academic publications have shown that Opower’s simple comparisons help consumers cut their energy bills by roughly 2% on average ([ref4]). That may seem small, but it adds up to over 2.6 terawatt-hours of electricity—enough to power 300,000 homes for a year, or roughly $300 million in consumer savings on energy bills ([ref145]).

Opower constantly runs experiments to measure their impact and to test ways to improve it further. Rigorous measurement and testing have been key factors in their success.

This chapter and the next two provide the tools you need to measure your product’s current impact and improve that impact in the future.

ANYONE CAN MEASURE IMPACT

When you read the title of this chapter, do you imagine arcane symbols and inscrutable formulas? That’s not what you’ll find. Instead, you’ll find common-sense explanations of how you can measure your product’s impact.

For software products, there are numerous powerful and user-friendly tools that handle the underlying math and statistics for you. For most impact experiments, that’s all you really need. You usually don’t need an econometrician to understand if your product is working or not and how to improve it.

Some techniques are more advanced, though; I’ll flag those up front and explain what is going on in nontechnical terms. If you don’t have a statistical background and decide you need those techniques, that’s the point at which to get expert assistance. For readers with a statistical background, those sections will quickly tell you which tool to pull from your toolbox so you can get to work.

Why Measure Impact?

The ultimate goal of this book is to help you design products that support behavior change. This chapter is about measuring how effective the product really is, right now. Naturally, both you and I know that the product is impeccable. But there are a few good reasons to carefully and precisely measure its impact anyway:

To share with the finance department

Precise measurements of product impact allow a company to determine its return on investment. That’s important for the company’s internal finance department—to justify future design and engineering work—and for any external funders (from grant makers to venture capitalists).

To share with the team

One of the most rewarding experiences I have in my daily work is when I send out messages to the staff about positive research results—showing how we’re quantifiably helping people. The thank you notes are deep and heartfelt. Seeing the real impact the product has on users helps the folks in the trenches, who develop the product day in and day out, to see the value of their work.

To share with the world

When you have something that works, show it off. You know you want to anyway. Just saying that your product works isn’t news; no one has a reason to trust you. If you can demonstrate your impact, especially with an independent, third-party experiment, that’s news.

To improve the impact

With a benchmark of the product’s current impact, the team can experiment (formally or informally) with changes. The team can then tell whether the changes helped or hurt, and come up with even more ideas to move the product forward.

To learn what’s hindering impact

Properly deployed, measurements of impact can help the product team understand where, exactly, users are facing behavioral challenges and where the product is failing to deliver on its promise.

To settle arguments

Rather than relying on the loudest member of the team, or the Highest Paid Person’s Opinion (HiPPO), impact measurements turn arguments about what the product should do and what it should look like into a dispassionate look at the data. Instead of arguing about what might work, see what does work with quick tests. Computer scientist and Admiral Grace Hopper said it well: “One accurate measurement is worth a thousand expert opinions.”

Now that you’re suitably convinced, let’s talk about how to actually do it.

Where to Start: Outcomes and Metrics

Make Sure the Target Outcome Is Clear

The first step in measuring the impact of your product is to be absolutely clear on the impact you care about (i.e., the intended outcome of the product). In Chapter 4 and Chapter 5, we identified the target outcome, actor, and action that would determine whether the product was a success or not. Here’s a quick refresher on defining the outcome:

§ The outcome is what will be different in the real world when the product is successful.

§ The outcome should be something tangible, and not something in the user’s head. Often the user’s head is just a proxy for what the company really cares about—a tangible outcome that occurs because the user’s knowledge or emotions have changed. For example: lower BMI, or time spent exercising each week, instead of knowledge about the importance of exercise and maintaining one’s weight.

§ The outcome should be something unambiguously measurable. For example, your product is supposed to decrease government corruption. But what is corruption exactly?

§ The outcome should be able to signal success. If you can say, “Well, if X didn’t happen, the product would still be a success,” then X is not the outcome we’re looking for.

§ The outcome should be able to signal failure. You must be able to think of a plausible case in which the outcome would indicate that the product is failing.

Vanity metrics—metrics that make a company feel good but don’t give an accurate sense of whether the product and company are on the right track—fail on these criteria. For example, consider page views—the stereotypical vanity metric. Let’s say a company had zero revenue from its flagship consumer product but had lots of page views on the product’s website. That usually wouldn’t be considered a success. (And vice versa: if the product brought in lots of revenue but for some reason had few page views, it would still be considered a success.)

Define Metrics for the Outcome and Action

With a clear outcome in mind, you should define two metrics: one for the outcome and one for the action. You need to know, unambiguously, whether the target outcome has occurred (and at what level) and whether the user has taken the action that is supposed to be causing the outcome.

The outcome metric should flow directly from the target outcome itself. It’s how you determine if the outcome is there or not. You should define and write down a formula, even a dirt simple one, which says how the outcome is measured. Here are some easy ones:

§ Company income = money received from clients over the course of a month

§ User weight = body weight without shoes, measured in the morning after breakfast

And here’s one that is a bit more complex:[131]

§ Neighborhood connectivity = number of times users attend social gatherings with their neighbors over the course of a month

The metric should make clear what is measured, how it is measured, and for how long it is measured. For example, one way to define company revenue for a particular product is: money received (not money booked) because of product sales (not investment earnings or sales of fixed company assets), over a 30-day period.

Why so specific? Because if the value of the metric changes over time, we need to be sure that it changed because of the product, and not because the metric’s definition was ambiguous and two different ways of measuring it produced a spurious change in the data.
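One way to keep a metric’s definition unambiguous is to encode it once, in code, so everyone measures it the same way. Here is a minimal sketch of the revenue metric above; the function name and record fields are hypothetical, not from any particular product:

```python
from datetime import date, timedelta

def product_revenue(payments, as_of):
    """Money received (not booked) from product sales, over the
    30-day period ending on `as_of`. Each payment is a dict with
    hypothetical fields: 'amount', 'received_on', 'source'."""
    window_start = as_of - timedelta(days=30)
    return sum(
        p["amount"]
        for p in payments
        if p["source"] == "product_sale"  # not investment earnings or asset sales
        and window_start <= p["received_on"] <= as_of  # money actually received in the window
    )

payments = [
    {"amount": 100, "received_on": date(2013, 1, 20), "source": "product_sale"},
    {"amount": 50, "received_on": date(2013, 1, 25), "source": "asset_sale"},
    {"amount": 30, "received_on": date(2012, 11, 1), "source": "product_sale"},
]
revenue = product_revenue(payments, as_of=date(2013, 1, 31))  # only the first payment counts
```

Because the 30-day window, the payment source, and the “received, not booked” rule all live in one place, a change in the number can’t be blamed on two teams measuring the metric differently.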

Ideally, the metric should be:[132]

Accurate

It actually measures the outcome you want to measure.

Reliable

If you measure the exact same thing more than once, you’ll get the exact same result.

Rapid

You can quickly tell what the value of the metric is. Rapidness encourages repeated measurement, and makes it easier to see if a change in the product was effective.

Responsive

The metric should quickly reflect changes in user behavior. If you have to wait a month until you can measure the impact of a change (even if it only takes a minute to measure it; i.e., even if it is rapid), that’s 29 days wasted that you could have been learning and improving the product.

Sensitive

You can tell when small changes in the outcome and behavior have occurred. For the developers among you: floating-point values are great; Booleans are not.

Cheap

Measuring the outcome multiple times shouldn’t be costly for the organization or it will shy away from measuring the impact of individual changes to the product and have difficulty improving it.[133]

Quite a lot there, right? Yes, but that doesn’t mean you need to obsess over the perfect metric. What we’re really looking for is a quick check of sufficiency. Treat this list like a checklist—for a given outcome metric, ask the following: is it specific enough that there won’t be too much disagreement when it’s measured? Is it reliable enough that it won’t fool the team into thinking the product isn’t working when it actually is?

Defining a metric for the action is similar. The action metric tells you whether (and to what degree) the user is taking the target action, the action that is supposed to drive the desired outcome. If the desired outcome is a specific BMI level, and the action is exercise, a sample metric would be: how much is the user exercising, and how often? A good action metric must pass the same tests as the outcome metric: accurate, reliable, rapid, responsive, etc. Here are some good and bad action metrics:

Action: User Exercises

Bad metric

User exercises = how much the user reports walking each day. This is a bad metric because (a) the user may not know how far he or she is walking without the help of a pedometer or other tracker and (b) the user may stretch the truth.

Good metric

User exercises = how much a pedometer automatically tracks the user walking each day

Action: User Studies New Language

Bad metric

User studies = an expert evaluation of language proficiency on a lengthy written exam. This metric is problematic because it focuses on the intended outcome rather than on the activity that we assume, rightly or wrongly, leads to that outcome. It also takes a long time to measure and can’t be measured frequently (without annoying the users!).

Good metric

User studies = time spent within the application, or number of exercises completed with a minimal level of accuracy

Clearly, there are trade-offs when creating outcome and action metrics. The most accurate metric may take too long to gather, or the cheapest metric may not be reliable. Again, there’s no need to obsess over it—we’re looking for an action metric that is sensitive enough to tell you when there is a problem, and accurate enough to not mislead the team.

Set the Thresholds for Success and Failure

There are always opportunities for new products, new problems to solve, and new markets to tap. At some point, the company needs to determine whether the product is “good enough” to move on to something else, or that it is failing so badly that it should be discontinued or major corrective action should be taken. To support that assessment, the company should determine the specific threshold for “success” and “failure,” up front.

For a new company, new product, or new market, it can be difficult to know what should be labeled a success or a failure. The company can turn either inward (this is what we need in order to sustain the business and not pursue the next most promising product), or outward (this is what other similar products are achieving; if we aren’t achieving that, we won’t gain traction in the market). I haven’t seen any hard-and-fast rules here—it’s up to the company to decide what it needs to accomplish with the product.

The thresholds of success and failure should be set before they are measured (and ideally, before the product is even built, as we talked about briefly in Chapter 5). Many of us have been in companies where we’ve spent months (or years!) developing a product, deployed it to the market, and gotten a lackluster response. Then, the spin starts: looking for things that are positive, looking for things that indicate it might possibly get better if the team put in another few months of work. That’s a moment of truth, when a dispassionate, spin-free analysis is needed. Maybe it’s just not the right product, maybe it’s actually good enough and the team is being too hard on itself. Setting up clear definitions of success and failure beforehand helps combat the temptation to spin the numbers after the fact.

Defining success and failure ahead of time doesn’t mean that we can’t change the goal posts now and then. As we learn more about the market, product, and the company’s other opportunities, our understanding of what’s “good enough” will change. But make sure the team knows about the change, and understands the rationale. No one likes a moving target—especially when that means making the job harder for the team midstream and without explanation (raising the standard) or it seems like accepting failure (lowering the standard).

How to Measure Those Metrics

You know what you want to measure. Now, you need to measure it: by instrumenting the product or otherwise gathering data about your users’ behavior and the target outcome. How you gather this data depends on the type of product and behavior change you’re working with: whether the target behavior is within, or outside, of the product.

Measuring Behaviors Within the Product

If the behavior that the product is trying to change is part of the product itself, you’re in luck. There are tools to help you gather the data. For example, let’s say your application aggregates user contacts and helps users keep in touch with them regularly, like Contactually.[134] The behavior change problem entails helping your users figure out how to best organize their contacts within the app.

You can code your product to automatically record the actions users take (organizing contacts) and whether they succeed. You can store those action and outcome events yourself, or push them out to a third-party platform like KISS Metrics or Mixpanel[135] (that’s what Contactually uses). That’s the ideal. When your product is online, you can even gather the data in real time and see what’s going on immediately.
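If you’d rather keep the data in-house than push it to a third party, the event stream can be as simple as one append-only table. This is a minimal sketch using SQLite; the table layout and event names are illustrative assumptions, not any vendor’s actual schema:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # in production, a real analytics database
conn.execute("CREATE TABLE events (user_id TEXT, event TEXT, value REAL, ts REAL)")

def record_event(conn, user_id, event, value=1.0):
    """Append one action or outcome event for later analysis."""
    conn.execute(
        "INSERT INTO events (user_id, event, value, ts) VALUES (?, ?, ?, ?)",
        (user_id, event, value, time.time()),
    )

record_event(conn, "u42", "contact_organized")
record_event(conn, "u42", "contact_organized")
record_event(conn, "u7", "contact_organized")

# How many users took the target action at least once?
actors = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM events WHERE event = 'contact_organized'"
).fetchone()[0]
```

Raw per-person rows like these are exactly what the statistical tests later in this chapter need; aggregated page-view counts are not enough.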

Measuring Behaviors Outside of the Product

It can be much more challenging if the behavior change problem is outside of the product. First, look for ways to pull in existing real-world data. At HelloWallet, one of our primary goals is to help people save money for the future. But they can’t actually do that within our application—they move money into their savings accounts through their bank. Early on in our product development, we realized that we needed to ask our users for read-only access to their bank account information. With the bank account information, we could provide them with better guidance, and, very importantly, we could tell if our guidance was actually working or not.

You’ll need to be creative, and search for datasets you can pull in. Opower’s main product, described at the beginning of this chapter, is a piece of paper—a physical mailer sent to utility customers. There’s no way to reliably measure people’s real-world behavior with that mailer. But they have built relationships with the utility companies to access utility records on how much energy people actually use. And, with that data, they can reliably tell what impact their mailers are having on behavior.

TOOLS FOR GATHERING DATA WITHIN THE PRODUCT

To measure your product’s impact, you’ll need to look beyond basic tools that track page views and conversions. Often the impact you’re looking for isn’t just a simple event in the application, like a page view. For example, if your product helps users form the habit of updating their budget each month, measuring the habit means more than the pages they’ve seen. Second, you’ll need access to raw, per-person data for statistical modeling. Third, in order to assess changes in impact, you’ll probably need to run A/B tests.

If you’re not familiar with them, A/B tests take a randomly selected group of users and show them one version of the product (version “A”), and show another randomly selected group another version (version “B”). Tools that support A/B testing or its cousin, multivariate testing, will advertise that fact; the mechanics of A/B testing and multivariate testing are discussed in the section Determining Impact: Running Experiments.

There are a variety of tools that can handle the A/B testing for you. For example, Google has taken the old Google Website Optimizer tool (with A/B testing support) and integrated it into Google Analytics as Content Experiments.

Getting individual-level data—what each person is doing in the system—requires more horsepower; it’s something that Google Analytics doesn’t provide. There’s an open source version of Google Analytics that does what Google Analytics does (though a few versions behind), and provides that individual-level data: Piwik.[136] It can be a bit clunky, but it gets the job done if you know how to analyze the raw database records.

Other tools, such as KISS Metrics, provide per-person tracking and also provide a nice GUI for doing some of the analyses you need. You’ll need to access the raw data (available via Amazon’s S3 service) to really dig into the details. Companies can also readily implement their own per-person tracking by pushing the events that occur within the system out to a database for later analysis.

Your company may need to consider adding functionality to the app to make real-world measurement possible. Let’s say you have an app that helps people eat healthier. It provides meal plans for easy and healthy home cooking, so users don’t need to eat out as much. That’s great, but how do you know if the product is successful? Creating a meal plan isn’t enough. You need to know if people are actually acting on that advice. One way to measure behavior outside of the product (actual use of the meal plan) would be to add a feature to link to the person’s grocery store loyalty card. The grocery store knows what the user is buying, and has a financial incentive to have people buy more there, instead of eating out. Users can be rewarded for following the meal plan, and get greater insight into what they are eating.[137] There’s a benefit for the user, the grocery store, and for you—since you’ll be able to measure impact.

Sometimes however, there simply isn’t a dataset you can draw upon, or the dataset is too imperfect or infrequent to use. For example, let’s say your application encourages people to vote. The act of voting is outside of the product, and it takes months to get official data on whether someone voted or not.

In such cases, where you really don’t have a way to regularly gather real-world data, there’s a three-part strategy for benchmarking your product’s impact:

§ Benchmark the impact your product has on an intermediate user behavior that you can measure regularly, even though it isn’t the final real-world outcome you really care about.

§ Determine how to accurately measure the real-world outcome at least once.

§ Build a bridge between the intermediate user behavior you measure regularly and the real-world outcome you care about.

The data bridge is basically a second benchmark—connecting the regularly measured behavior (usually in the app) and the irregularly measured real-world outcome. To make the explanation clearer, let’s start with the first benchmark: the link between your product and regularly measured in-product behaviors. We’ll return to the data bridge after we’ve learned the basic benchmarking procedure.

Determining Impact: Running Experiments

You know what you want to measure and can measure it. Let’s determine the impact your product is currently having.

The gold standard for measuring the impact of a product is to employ randomized controlled experiments, otherwise known as A/B tests or split tests. You select a random sample of potential users then divide the sample into a control group and a treatment group at random. The control group doesn’t receive the product (or, if the team is developing a new feature, the control group doesn’t receive that feature). The treatment group receives the new application (or feature). You would then measure the outcome for each group. The impact of the application is simple to calculate, once you know you have a strong signal from your data:

Average outcome with treatment group – average outcome with control group

And yes, that’s it. Basic experiments are very easy and straightforward to analyze if they are designed correctly. The beauty of experiments is their simplicity and power. It allows you to focus on what you care about (impact) and not worry about all of the other things that can give misleading results or lead to endless arguments about interpreting the data. For example:

§ “Aren’t the people who appear to benefit from your product just the people who would have done well on their own?” Impact experiments show what happens above and beyond what people would normally do; you measure “normal” behavior with the control group, and subtract it out.

§ “Couldn’t the good results you’re seeing be caused by something else?” Impact experiments show the unique impact of the product, above and beyond external impacts (i.e., anything else that might be happening at the same time to cause good outcomes). The difference between the treatment group (external impacts and product impact) and the control group (external impacts only) is the unique product impact.

While this type of experiment is excellent for telling you what impact you’re having, it can’t tell you, at least not directly, why the product has that impact. We’ll talk about that in Chapter 13.
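The subtraction at the heart of the experiment is short enough to write out. A minimal sketch, with made-up outcome numbers:

```python
def estimated_impact(treatment_outcomes, control_outcomes):
    """Average outcome with treatment group minus average outcome with
    control group. Valid only with true random assignment, everyone in
    each group counted, and a statistically significant difference."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(treatment_outcomes) - mean(control_outcomes)

# Hypothetical monthly savings, in dollars, for four users per group:
treatment = [120, 80, 150, 90]  # offered the product
control = [100, 70, 110, 80]    # not offered the product
impact = estimated_impact(treatment, control)  # 110 - 90 = 20
```

The control group’s average captures “normal” behavior plus any external influences; subtracting it out is what isolates the product’s unique impact.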

In order for your experiment to be successful, there are a few things you need to be careful about:

Make sure you have enough people

How do you know how many people are enough? There are books on the topic, but for most people a simple online calculator will do. There are two versions of the calculation: one to estimate how many people you need before you run the experiment (a “sample size” or “power” calculation), and one to determine whether you really can tell the two groups apart after the experiment (a “statistical significance” test). I’ll explain how these tests work in the most common scenarios.

Ensure a real random assignment

Make sure that the treatment and control groups are actually randomly assigned. Use a random number generator, and generate a new number for each person. If you have an existing list of people, and it “looks” random, it almost never actually is—there’s some ordering there, but you can’t know how it influences the results. This is the most common way I’ve seen product-based experiments go awry, and it completely destroys the experiment. As a sanity check, you can verify that the random assignment process was done correctly by checking whether the two groups have similar averages on things that the product couldn’t impact—like age or gender. If so, great. If not, it wasn’t a true random assignment.

Make sure you’re only varying one thing

Only the product should vary across the two groups. Don’t treat the two groups differently or measure their outcomes differently in any other way. You can test multiple versions of the product at once (the only thing you’re varying is the product) with an “A/B/C test” or a multivariate test; they are discussed in “Running Multiple Versions at Once.”

Compare results for everyone

Make sure that when you compare the two groups, you compare all of the people in each group. In the treatment group, for example, there will be some people who are offered the product, but don’t actually use it. Count the nonusers, too; otherwise the results mix up the effect of the product, with the effect of who chooses to use it or not.

Know who you’re working with

Check whether users have previously had experience with the product. For example, let’s say you’re trying out a product or feature on a population that already has experience with a previous version of the product. The test is still valid, but that limits how much you can generalize the results of the test to people who have never had experience with the product.

Measure the same thing for both groups

If you are testing the impact of the entire product (instead of just a new feature), and the product tracks the outcome, how do you know the outcome for the control group? They don’t have the product! That’s a challenge. There are two main options: find a way to measure the outcome outside of the product (and use the same method for both the treatment and control), or use a crippled product with the control group that only tracks the outcome, and nothing else. Or, if that fails, you have to use statistical models to estimate causal impact (that’s covered later on in this chapter).
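Of these cautions, random assignment is the most mechanical to get right. Here’s a sketch of per-person assignment plus the balance-check sanity test described above; the fixed seed and the use of age as the balance variable are illustrative choices, with made-up data:

```python
import random

def assign_groups(user_ids, seed=12345):
    """Draw a fresh random number for each person; never reuse an
    existing list that merely 'looks' random."""
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    return {uid: ("treatment" if rng.random() < 0.5 else "control")
            for uid in user_ids}

def balance_check(groups, ages):
    """Average age should be similar across groups, since the
    product can't change anyone's age."""
    means = {}
    for name in ("treatment", "control"):
        vals = [ages[u] for u, g in groups.items() if g == name]
        means[name] = sum(vals) / len(vals)
    return means

users = range(1000)
groups = assign_groups(users)
ages = {u: 25 + (u % 30) for u in users}  # made-up ages from 25 to 54
means = balance_check(groups, ages)  # the two averages should be close
```

If the two averages were far apart on something the product can’t affect, that would be a sign the assignment wasn’t truly random.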

How Many People Do You Need?

When you’re benchmarking your product, you’re trying to figure out how much impact it normally has on your target outcome. The quality of your benchmark depends on how much information you have: you need to pass a certain threshold in terms of the number of people you examine for the results to be meaningful.

Here’s a simple example of why the number of people in the experiment is important. Let’s say your product helps people eat less ice cream. You try it with one person. Poof! He doesn’t eat any more ice cream! The second person, well, he doesn’t do so well. He actually eats more ice cream and hates you for it. The third person cuts back a lot. So does the fourth, and so on. That’s actually a pretty normal variation across people.

Overall, the product is a success. It helps people cut their (unwanted) ice cream habit by 50%. If you’d only looked at the first person, though, you’d have thought it was magic. If you only looked at the first two people, you’d be really confused. If you looked at the first four, then the picture would be clearer—overall, it seems to help, but there are exceptions. Adding additional people makes the picture clearer and clearer. But by the time you add the thousandth person, that person’s data isn’t going to help much. You’ll already have a very good idea of what the impact of the product is. You don’t need more people.

In many cases, it’s straightforward and easy to estimate how many people you need in an experiment, or the desired “sample size.” Intuitively, your desired sample size depends on how big of an impact you’re looking for, relative to the baseline outcome, and how much noise there is clouding that impact.

Here are the specific numbers you’ll need:

§ The average outcome for people who don’t have the product (or who don’t have the new feature you want to test). That’s the baseline.

§ The variance (difference from the average) in outcomes among people without the product (or new feature). That’s the noise.

§ The smallest meaningful impact on the outcome. Or, how much of a change in outcomes do you need to justify rolling out the new product or feature to more people? Be conservative: hopefully the product will have a much larger impact, but here you want the smallest change that would still tell you it’s a success.

You’ll plug these values into an online calculator, like the ones on the DSS Research site.[138] If the product’s outcome can have many evenly spaced values, like weight, height, or number of cigarettes smoked in the month, then you’ll use a calculator that can handle the average value for the population. If the product’s outcome can only be one of two things, like either the patient is alive or dead, then you’ll use a calculator that handles percentages.

You’ll be asked for two parameters that indicate how sensitive the test should be:

§ Confidence level or, equivalently, an alpha error level (alpha error level = 1 – the confidence level). Usually the default confidence level is 95% (alpha of 5%). That roughly means you can expect to incorrectly say there is an impact when really there isn’t one, 5% of the time. That’s a false positive.

§ Statistical power, or, equivalently, a beta error level (beta error level = 1 – statistical power). Usually the default statistical power is 80% (beta of 20%); that roughly means you can expect to incorrectly say there isn’t an impact when really there is one 20% of the time. That’s a false negative.

These default parameter values are built around the assumption that you want to be very careful not to claim there’s an impact when there isn’t. Avoiding the opposite mistake, missing an impact that really is there, matters too, but less. Each study is different, but those are pretty good assumptions when you’re testing whether your product works. It will be very embarrassing (and costly for the engineering team) to claim you’ve found an answer, pursue it, and then find out it was a mirage. So, generally, keep these defaults.

You may also be confronted with a question of whether you want a “one-sided” or “two-sided” test. In a two-sided test, you’re looking to see whether the product causes any change, positive or negative, in outcomes. In a one-sided test, you’re assuming that if the product has an effect, it will be positive. If the effect is actually negative, the test won’t work correctly. There’s always debate around this issue, but I prefer the more open-minded route: a two-sided test that could show me the product actually makes things worse.

Once you’ve entered these values into the calculator (and checked that the confidence level, statistical power, and type of test are correct), it will tell you how many people you need in the treatment and control groups. If you have more, that’s great. If you have fewer, look for more users!

Instead of an online calculator, you can also use any statistical package for this, like R, Stata, or SPSS.[139]
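The arithmetic behind those calculators is simple enough to sketch. Below is a minimal Python version of the standard sample-size approximation for a percentage-type outcome; the function name and example rates are mine, and real calculators may round slightly differently:

```python
from math import ceil

def sample_size_two_proportions(p1, p2, z_alpha=1.96, z_beta=0.84):
    """People needed *per group* to detect a change from rate p1 to rate p2.

    z_alpha=1.96 corresponds to 95% confidence (two-sided test);
    z_beta=0.84 corresponds to 80% statistical power.
    """
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2) * variance / (p1 - p2) ** 2
    return ceil(n)

# e.g., baseline: 10% of users log in weekly; we hope the product lifts it to 15%
per_group = sample_size_two_proportions(0.10, 0.15)
```

Notice how the formula encodes the intuition from the text: a smaller expected impact (p2 closer to p1) or noisier data (higher variance) drives the required sample size up quickly.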

Is There Really an Impact?

If you have a strong signal in the data, then it’s dead simple to determine the impact: subtract the average outcome of the control group from the average outcome of the treatment group.

What does it mean to “have a strong signal,” though? Researchers refer to this as statistical significance, but I prefer the more obvious description: there’s a strong indication that the two groups you’re looking at really do have different outcomes (i.e., that the results can be reasonably trusted).

The trustworthiness of your results depends on the same things that we looked at when calculating the desired sample size—the impact of the product relative to the baseline, and the amount of noise in the data—plus the number of people who were actually in the test. When we calculated the desired sample size, we were effectively trying to figure out, beforehand, how many people we would need in order to trust the results. Now that we have the results, we do a quick test to double check that we really did get what we expected.

For most experiments, there are two tests of statistical significance, depending on how you measure the outcome (just like when we were determining the sample size). If the outcome is something binary (people log into the application or not), then you’ll run a test on the percentage of people in each group that have the outcome. If the outcome is a floating-point number or integer, then you’ll determine the strength of the signal based on average outcomes.[140]

Unfortunately, I’ve too often seen people check a raw impact number without looking at whether the data can be trusted. They can get really excited about unreliable results, and, most importantly, they take the wrong lesson from the test and waste time building the wrong new features. So, save time and resources. Check that the signal is strong. It’s not hard, and it needs to be integrated into the daily routine of the company.
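For a binary outcome, the usual way to check that the signal is strong is a two-proportion z-test. Here’s a minimal sketch in Python; the function name and example numbers are mine, and a statistical package will give you the same answer with less code:

```python
from math import erf, sqrt

def two_proportion_z_test(success_t, n_t, success_c, n_c):
    """Two-sided z-test: is the treatment rate really different from control?

    Returns (difference, p_value). A small p-value (e.g., below 0.05) is
    the "strong signal" that the difference can reasonably be trusted.
    """
    p_t, p_c = success_t / n_t, success_c / n_c
    pooled = (success_t + success_c) / (n_t + n_c)
    se = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (p_t - p_c) / se
    # Two-sided p-value from the normal CDF, via the error function
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_t - p_c, p_value

# e.g., 90 of 500 treated users logged in, versus 60 of 500 controls
diff, p = two_proportion_z_test(90, 500, 60, 500)
```

The raw difference (`diff`) is the impact you’d report; the p-value is the check that keeps you from getting excited about noise.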

Running Multiple Versions at Once

Earlier I spoke about a single treatment and a control group. However, you can have many different treatment groups running at the same time—if you want to test the impact of multiple versions of the product (or product feature). When running multiple tests (aka an A/B/C test):

§ Calculate the number of people you need for each treatment. Add them all up, plus the control group.

§ You don’t necessarily need to keep the treatment and control groups the same size. (Equal sizes make things simpler but can also mean you unnecessarily include extra people in the test.) As a general rule, you need more people where you think the difference from the control group is small; otherwise, you’ll end up with some tests that give you a solid result, and others that are inconclusive. Run the power calculation I described to determine how many people you need in each group.

§ Always assign people into one and only one group.

This type of experiment allows only one thing to change at a time—having version A of the product versus version B (versus C, versus D, etc.) versus not having the product at all. If you want to test multiple interacting variables at once, you’ll need a multivariate testing tool. Multivariate tools allow you, for example, to test multiple buttons with a call to action, and multiple blocks of text explaining the action, all at the same time.

Underneath the hood, multivariate testing is just a different experimental design—it’s a type of randomized control trial experiment, like A/B testing. However, analyzing the results of those studies is more difficult.[141] So, use an off-the-shelf tool instead, like Optimizely or Maxymiser (they also do the simpler version, A/B testing). These tools take care of the math for you.

How Do You Do Random Assignment in Practice?

Previously I mentioned that you should do a random assignment to determine what group each person goes into. How do you actually do that in practice?

The simplest answer is that you can use a tool to take care of it for you. In various languages, there are packages that make testing easier, such as Ruby on Rails’s “Vanity” package[142] or the JavaScript library Genetify.[143] Additional tools that don’t rely on (much) coding and that you can hook up to an existing website include Optimizely and Google Analytics’s Content Experiments.[144]

But it’s not difficult to do yourself. It just depends on whether you already know who your users are. Start with the number of people you need in each test. Let’s say you have one control and three treatments, each of which needs 250 people, for a total of 1,000. Divide the number of people in each group by the total. Here: 0.25, 0.25, 0.25, 0.25. The resulting proportions must add up to 1. Take your proportions and label them on a 0–1 scale, creating a “roulette wheel” of probabilities like this:

0.00–0.25: Control

0.25–0.50: Treatment 1

0.50–0.75: Treatment 2

0.75–1.00: Treatment 3

For each user, you’re going to spin this virtual roulette wheel (by generating a random number) and see which section it lands on.

You Already Know Your Users

If you already have a list of users you want to assign, then it’s easy. Generate a random number for each person on the list, 0–1 (exclusive of 1). If the number is in the first section of the roulette wheel, assign the person to the control group. If it’s in the second section, assign the person to the first treatment group, etc. The person falls into a particular section when this is true: lower bound of section ≤ random number < upper bound of section.
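The roulette wheel can be sketched in a few lines of Python. This is an illustration, not the book’s code: the group labels and proportions come from the example above, and seeding the random generator on the user’s ID is one simple way to keep repeat visitors in their original group (an alternative to setting a browser cookie):

```python
import random

# Cumulative "roulette wheel" from the example: four equal 25% slices.
WHEEL = [
    (0.25, "control"),
    (0.50, "treatment 1"),
    (0.75, "treatment 2"),
    (1.00, "treatment 3"),
]

def assign(user_id):
    """Assign a user to a group by spinning the virtual roulette wheel.

    Seeding on the user's ID makes assignment deterministic, so a
    repeat visitor always lands in the same group.
    """
    draw = random.Random(f"user-{user_id}").random()  # in [0, 1)
    for upper_bound, group in WHEEL:
        if draw < upper_bound:  # lower bound <= draw < upper bound
            return group

groups = [assign(uid) for uid in range(1000)]
```

With 1,000 users, each group should end up with roughly 250 people, and calling `assign` twice for the same user always returns the same group.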

You Don’t Know the Users Yet

If you don’t already have a list of users in hand, then there’s a slightly different approach. You’ll need this if you are testing new users in real time (i.e., people are constantly signing up for your system, and you assign them to the experiment once they join).

You should either pick a fixed number of people you want to include in the experiment, and wait until the target is reached, or pick a specific time at which you’ll look for the results. In either case, you assign people in the exact same way as before: generate a random number for each person. Then, assign them to a group according to the roulette wheel. Make sure that repeat visitors keep their original assignment; don’t reassign them to a new group each time they come back (for a web app, set a cookie on their browser that will tell you they’ve been there before).

If you don’t pick a specific stopping point, then there’s trouble. If, instead, you constantly check the results of the experiment, waiting for something that looks promising, you’ll be misled. You’ll increase the likelihood that you’ll say there’s a difference between the two groups when there actually isn’t one ([ref118]).[145] The problem is that results tend to jump around for a while until they settle down and give you a solid answer. Even if you pass a “solid signal” test (statistical significance), they can still jump around. If you accept the first promising result that comes along, you’ll often get the wrong answer.

So, pick a point at which you need the answer: either a specific number of people or a specific date. Make that the final stopping point for the test, instead of getting too excited along the way and killing the experiment early.

More Advanced Experimental Designs

Staggered Rollout

Let’s say you have a product that your users want, and you can’t withhold it from them. This happens in many international development projects, where the funder strongly believes in the success of the project before it is tested and feels it would be morally wrong to withhold the product from potential recipients ([ref104]). You can still do an experiment and measure the true impact of the product—with a staggered rollout.

In a staggered rollout, everyone gets the product (or new product feature); they just get it at different times. Take the full set of people who want it, then randomly assign them to an “early” or “late” group. Track the outcomes for everyone from the moment the early group gets the product. The treatment group is the “early” people; the control group is the “late” people, from the time the early group receives the product until the late group does. Staggered rollouts are thus limited in how long they can run—the experiment ends when the late group also gets the product. But they have the benefit of not leaving anyone out. Everyone gets the product, eventually.

A clever way to do a staggered rollout is to ask for people to pre-commit to buy or receive the product when it’s released. Then, use a rolling schedule for the release—only make enough units of the product, or give out enough access credentials, to supply to a randomly selected subset of the enrollees. Then, later, supply it to the rest of the people who signed up.

Matching and Quasi-Experiments

Where random assignment is not possible, special techniques can match individuals from a treatment group to similar people in another group. The matching procedure can be used to approximate the experimental environment (i.e., a “quasi-experiment”). These are second-best options to running an experiment but are better than nothing—and statistically can yield very similar, solid results if done correctly. You’ll need an econometrician, though.

Build It In, Hook It Up

Ideally, companies should build the ability to run experiments into their application, or hook up an off-the-shelf product to help them do so. This is the best way to learn, and avoiding friction in the deployment of an experiment will speed up the process of refining and improving the effectiveness of the application. Even if the final impact your company wants to measure is outside of the product, like exercise, enabling experiments within the application will be immensely helpful for quickly iterating and testing the steps that are within the app. The assignment process itself is generally very easy—as the description of the roulette wheel indicates—and is something that most development shops can integrate into their code (that’s what we did at HelloWallet).

Determining Impact: Unique Actions and Outcomes

Experiments are the most accurate general-purpose way to measure the impact of your product. But there’s an important case in which they aren’t needed to accurately gauge impact: when there is no conceivable way that the outcome would occur without the product being there.

For example, imagine a new and highly effective cancer treatment. A team is developing a product to make people aware of it: the target outcome is for people to use the new cancer treatment. Without the product, no one would know the treatment exists. No comparison group is needed—any impact that occurs is because of the product.

Similarly, it’s easy to measure the baseline impact of the product when the action only exists in the product itself—which often occurs where the behavior change process entails the user learning to use a new product. For example, remember Speek from Chapter 8? People don’t make Speek conference calls (and provide revenue to the company for premium features) without the application itself. If you want to know the impact of the product as a whole on Speek calls and on company revenue, just measure them. That’s the benchmark you can use to compare against future changes. After that baseline has been established, you’ll still need to run experiments (or use other means) to gauge the impact of new features and other changes to the application, to distinguish the impact of the new feature from that of the existing functionality.

Other Ways to Determine Impact

Experiments take care of all of the nasty details of figuring out whether the application, or something else, changed the user’s behavior and outcomes. The random assignment process, properly done, ensures that nothing is systematically different between the two groups except the one thing you care about: the product itself. So, any difference in outcomes is caused by the application.

As an academic, I could make the case that experiments are really the only way to measure causal impact, because of these benefits. But in real-world products, that’s unrealistic and too restrictive. If you aren’t using experiments, then you have to face the nasty details of estimating the causal impact of the application head on. It can certainly be done, but it should be done with open eyes.

The easiest and most common way to look at impact is a pre-post analysis.

A Pre-Post Look at Impact

In a pre-post analysis, you look at user behavior and outcomes before and after a significant change. For example, if users on average walked 500 steps a day before using your product and 1,500 steps a day after using it for one month, then the product may have increased their walking by 1,000 steps a day.

In a pre-post analysis, you take the difference you see, and then you try to adjust it for all of the other things that could have caused the change that weren’t part of your product. This can be done informally, or formally. The formal version requires running a multivariate statistical analysis like estimating a regression model. The informal version means carefully thinking through what else could have impacted the users and their behavior.

Personally, while I was trained in the formal, econometric approach, I find that starting with an informal analysis is immensely valuable (even if you later do the econometric analysis as well). Also, you’ll probably need a stats person to handle the econometric analysis, but anyone can do an informal analysis to help gauge how important further analysis is, and as a reality check on the stats. So, here’s how to run an informal analysis of a pre-post study.

You have measurements of user behavior and real-world outcomes before and after a change: either when you gave the users the product for the first time, or added a new feature, etc. Subtract the pre from the post: that’s your working impact number. You also should have a sense of how big of a change you need for you to care. If you get people to walk two more steps a day, is that relevant? No. Maybe you only care if the product can get people to walk at least 100 more steps a day—it’s not much, but at least it’s something to build upon. That “when do I care?” number is your threshold.

Now, look for non–product-related things that would have caused the impact you’re seeing. With pre-post studies, there are a few very common factors. I’ll use the example of an exercise tracker to make things concrete:

Time

Would the time of year, day of month, day of week, time of day, etc. matter for this outcome? For example, if you saw that users were walking more in the spring than in the dead of winter, would that surprise you? No. So, it’s unlikely that your product would have caused a change you see in walking between winter and spring.

Experience

Let’s say you launched your product last month. You’ve just added a new feature that puts a smiley face on the tracker when users do well. In a pre-post study, it will be difficult to know if the smiley face caused increased exercise, or if users just gained experience with the product from its initial launch and slowly exercised more because of that experience (rather than the smiley face). Gradual changes over time are often caused by experience; sharp changes in behavior are more likely to be caused by a change in the product or another external “event” you can search for.

Data availability or quality

Let’s say that in the new release of the tracker, you added a smiley and someone in the engineering department fixed some bugs in the analysis of accelerometer data. Walking is up! Hmm. That could be because of the smiley, or because you’re simply getting better data about the users. I’ve found that data quality issues in particular are often invisible and therefore often misleading—someone changed something, and didn’t think it was important or didn’t want to admit the previous problem. Like product changes, data quality and data availability changes are sharp, sudden changes, so they are very hard to distinguish in a pre-post study.

Composition of the population

Let’s say that with the new release of the tracker, you’ve added lots of new features and made a big announcement. You see that average walking is up! Excellent. But that may be because the product caused people to walk more, or it may be that the announcement of the new features caused new users to join who were already walking more—and the new users brought up the average. This occurs in sudden ways (like product announcements) or through the slow addition and attrition of users over time. You should counteract this by looking at a specific group of people before and after the change.

In each case, you’re looking for a gut check—is this a big deal? Measuring walking behavior in the dead of winter versus in spring is a big deal. Measuring it from one Tuesday to the next usually isn’t (barring holidays). A big deal is anything that looks like it’s going to have a large impact on behavior relative to what you’re seeing pre-post and relative to the threshold at which you care. If the combination of many small things push the likely impact of your product below the threshold at which you care, then you can usually stop—and move on to something more promising.

If the pre-post impact is so large that nothing else seems to explain it other than the product, excellent. You should check your work with a statistical model and prepare to be surprised; but if you can’t, at least you’ve gotten an initial estimate of impact.

This informal analysis feeds into formal statistical modeling. Each of the factors you identify that might be important become variables in the model—things that you are trying to control for, in order to isolate the unique impact of your product. You’ll need to identify data that measures them, and run the model itself. That’s beyond the scope of this work—but a good stats person can help.

Seem complex? It can be. That’s why experiments are wonderful, because they remove these complexities. But we can’t always run them or get enough users into the system to get a solid result. And so we sometimes must use pre-post analyses. When the product has a big effect, and there aren’t many other things going on that confuse the result, that’s enough to get a signal for further product development. Another option is cross-sectional multivariate analysis, which is up next.

A Cross-Sectional or Panel Data Analysis of Impact

In a cross-sectional analysis, you look for differences among groups of users at a given point. You want to see how their usage of the product impacts their behavior and outcomes, after taking into account all of the other things that might be different about the users. For example, you might look at the impact among frequent users of the application versus infrequent users. As with pre-post analyses, I usually start with an informal, logical analysis, then feed that understanding into a formal statistical or machine learning model if it looks like there’s enough of an impact from the product to care.[146] Cross-sectional analyses usually pull together diverse groups of people; in order for the analysis to be valid, you’ll need to control for all of the factors that make those groups different other than the product.

As before, there are some common differences you need to take into account. Most important is this: why are some people more frequent users than others? Age, income, prior experience with the behavior, prior experience with the product’s medium (mobile versus web), self-confidence, sufficient free time, etc. All of these are factors that affect users’ behavior above and beyond the product.

If there aren’t obvious candidates that explain the difference in behavior across users, then take the list of factors you generate, and plug them into a statistical model. Again, it’s beyond the scope of this work, but a good stats person can help.

In addition to cross-sectional and pre-post analyses, one can (and should) also look at models that examine changes in behavior and outcomes among many users, over time. These models, using “panel” datasets (or time-series cross-sectional datasets with many people, but shorter time frames), provide a much more fine-grained look at behavior. They can pull out impacts of the product that pre-post and cross-sectional models can’t, because they can control for other differences across the individuals. However, they require much more data and statistical knowledge.

What Happens If the Outcome Isn’t Measurable Within the Product?

You can safely skip this section if users take action directly in your product and you can easily measure the outcome there.

As we already discussed, sometimes the target outcome, and even the target action, may not be directly measurable in the product. For example, think about a website that helps users set up an urban vegetable garden with video tutorials. The target outcome is more vegetable gardens; the target action is that users set them up (rather than, for example, contractors being paid to set them up).

Each person who uses the urban garden site is tracked with a cookie or authenticated login. Each step in the “how to set up an urban garden” tutorial is tracked. When users complete the tutorial, are they “done”? Did they complete the action? No. The action that the company wants to drive is setting up vegetable gardens, not completing a tutorial about setting up vegetable gardens. The difference between those two could be slight, or it could be massive if no one actually follows through. Without further information, there’s no way the company can know if the product is successful at driving behavior change. Similarly, it has no way of knowing whether the product has caused more vegetable gardens to be set up than there otherwise would have been.

So, what can a company do? If the action or outcome is not directly measurable with the product, then a data bridge is needed. A data bridge is something that convincingly connects the real-world outcome with behavior within the product. There are two basic strategies for building a data bridge:

Build it yourself

Find a reliable way to measure the target action and outcome. Then, build a model of what behavior in the product relates to that action and outcome.

Cheat

Find an academic researcher who has already established the link between something you can reliably measure in the application and the real-world outcome. For example, there are numerous studies that document “overreporting” (lying) about voting when people actually don’t vote ([ref169]). If there isn’t an existing research paper on the topic, work with researchers to generate one (ideas on how to partner with researchers are discussed in Chapter 15).

For the rest of this discussion, I’ll assume that you haven’t been lucky enough to find an existing research paper or interested researcher to do the work for you—and so you have to build the data bridge yourself.


Figure Out How to Measure the Outcome and Action, by Hook or by Crook (but Not by Survey)

Our sample company, which encourages users to set up urban vegetable gardens, will need to measure the number of vegetable gardens, plain and simple. An obvious route would be to ask the participants, with a survey. Not ideal. Surveys are good for gathering facts when people have an incentive to actually answer the survey but don’t have an incentive to lie. Imagine that the vegetable garden company asked users of its website, after a week, whether they set up a garden. Most people wouldn’t answer—especially those who didn’t set one up. Some would answer truthfully. And some would answer with “what will be truthful when they get around to it” (i.e., they’ll tell a white lie). The company can’t really know which is which; at least, not without doing additional field research to verify whether people are telling the truth.

If the company asked users about their intention to set up a garden, that would be even worse. Users would be sorely tempted just to give the answer that’s expected of them (“Yes, of course!”); that’s called the social desirability bias in surveys ([ref65]). Or, people might honestly believe that they will set up a garden, but never get around to it. One can try to reduce this bias in surveys with carefully worded questions, but it’s difficult to know the success of that effort without verification.

Direct observation is often the best option: either observing the number of vegetable gardens themselves, or other things that uniquely indicate that the action was taken, like the number of people buying special vegetable garden supplies within a city. The company doesn’t need to measure every single action or every time the outcome changes—it just needs to be able to measure a few times so it can understand the relationship between the product, the action, and the outcome. So, a small pilot study where an intern goes out and manually counts the number of vegetable gardens in an area is fine.[147]

Building a data bridge follows the same rules as creating a benchmark for the product, described earlier in this chapter. In this case, you’re looking for the causal relationship between something easily measurable in the application, and a real-world outcome that’s hard to measure, but is what you really care about. If the real-world outcome is unique to the product (i.e., if no one normally creates vegetable gardens in the area you care about), then you can do a simple observation of the real-world outcome, after people use your product, as your metric. If the real-world outcome has multiple possible causes, then you’ll need to use an experiment, statistical model, or pre-post analysis.[148]

In either case, there are three factors to keep in mind when measuring the real-world outcome itself. These determine how sturdy the resulting data bridge will be:

Representativeness

You want to observe cases that are representative of what “normally” happens; if you decide to count vegetable gardens in Portland (very rainy), and most of your app’s users are in Phoenix (rather dry), that won’t help you generalize much about vegetable garden creation. The most solid results come from taking your user base and randomly selecting some of them to directly observe.

Getting enough data points

You need to ensure you have enough information to get a solid signal about what the real-world outcome really is. For example, if you make only one observation of whether people make a vegetable garden after saying they will in the app, that’s not going to tell you much about whether other people will. You need the general percentage of people who create gardens when they say they will. So, how many observations are enough? There’s no hard-and-fast rule—it depends on how important the accuracy of the estimate is to the company. For experiments, we discussed in detail how you compute sample sizes. If you’re not using experiments, you can use online tools for computing “confidence intervals,”[149] which tell you how confident you can be in your estimate; if you build a statistical model of the relationship, that will also provide you with confidence intervals.
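As an illustration of what those online confidence-interval tools compute, here is a sketch of the standard normal-approximation (“Wald”) interval for a percentage outcome. The function name and example counts are mine; other tools may use slightly different formulas, such as the Wilson interval:

```python
from math import sqrt

def proportion_ci(successes, n, z=1.96):
    """95% normal-approximation confidence interval for a proportion.

    A rough guide to "how many observations are enough": the wider
    the interval, the less you can trust the estimate.
    """
    p = successes / n
    margin = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - margin), min(1.0, p + margin)

# e.g., the intern finds 12 gardens among 40 randomly selected users
low, high = proportion_ci(12, 40)
```

Here the estimate is 30%, but with only 40 observations the interval runs from roughly 16% to 44%; quadrupling the sample would halve the margin of error.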

Getting a baseline

Sometimes things happen in the real world that have nothing to do with your product. I know, it’s hard to believe. Some people will create vegetable gardens on their own, even without the vegetable garden app. So, when you’re observing your real-world outcome, include some cases in which people don’t use the app. This is important if you’re doing a simple model in Microsoft Excel, or if you’re running a full experiment to build your data bridge.

If these options fail, and there’s really no way to measure the product’s real-world outcome, then the rest of this discussion about impact can’t help. That signal—what’s actually happening in the world—is essential for keeping the whole process honest.

Find Cases Where You Can Connect Product Behavior to Real-World Outcomes

Now you have measurements of actions taken within the product and of real-world outcomes (though perhaps imperfect measurements). How can you connect the two? You can connect them at the individual level, or at an aggregated level. At the individual level, for example, the urban gardening app could ask for users’ names and addresses to connect their behavior in the product to whether or not they actually have a vegetable garden (send the intern to their home, and mark it down in the record). Getting data about individual users is the ideal—as long as the data meets the standards (representative, sufficient in size, and with a clear baseline).

Alternatively, the action and outcome can be measured as an aggregate: a known geographic area or a known group of people. If you know that a certain set of users in the product corresponds to the known area or group (even if you don’t know who is who), and you can measure the actions and outcomes reliably in that area or group, you’re in business. As we’ll see, it’ll be more challenging to figure out exactly what is going on, but you can do it.

Build the Data Bridge

A data bridge brings together something you know and can measure frequently—user behavior within the application—with something you have only measured a few times—the impact of the product on the real-world target outcome. It allows you to estimate how much the target outcome has probably changed based on behavior within the product. You’ll estimate that relationship by running a pilot project that gathers both datasets:

1. Take a circumstance in which you can reliably connect user behavior in the product with the real-world outcome or action, as we’ve just described.

2. Measure the causal impact of the product on the real-world outcome or action—using an experiment (ideal), statistical model, pre-post analysis, etc.

3. Analyze the various user behaviors that occur within the application, and identify one or more that is strongly related (correlated) to the application’s causal impact. If a statistician is available, use a mediation analysis.

4. When the indicative user behavior occurs within the product, build a model (in Excel, or in a statistical package) of how much that changes the target outcome. That’s the data bridge.[150]

5. In the future, whenever you see the behavior in the product, use your model to estimate the likely impact on the target outcome.

For example: the urban gardening site runs a pilot study, where it takes two sets of randomly selected people and offers its program to one group and not the other. Some of the people in the first group completed the training program; some did not. An intern visits the homes of everyone in the study and measures the truth. The company finds that 65% of people who were offered the program created a garden, and 90% of those who were offered the program and completed their training within the application created gardens. Meanwhile, 15% of those who weren’t offered it nevertheless created a garden. Those three stats provide a basic understanding of how to interpret user behavior on the website in the future.

The company would improve the chance that a person will set up a garden by 50 percentage points (from 15% to 65%) if it offers the person its training program. It will improve the chance that the person will set up a garden even more if it can convince him to complete the training program.[151] The company can get a precise estimate of that impact using what’s known as a mediation analysis on the experiment.
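Those three pilot numbers can be turned into a simple, Excel-style data bridge. The sketch below is hypothetical: it uses the rates from the example, and it treats the 65% figure as the rate for users who didn’t complete training, a simplification (the pilot’s 65% averages over completers and non-completers) that a mediation analysis would refine:

```python
# Causal benchmarks from the pilot study in the example:
P_BASELINE = 0.15   # created a garden with no offer at all
P_OFFERED = 0.65    # created a garden after being offered the program
P_COMPLETED = 0.90  # created a garden after completing the in-app training

def expected_gardens(n_offered, n_completed):
    """Estimate gardens created this period from in-app behavior, and how
    many the product likely caused relative to the 15% baseline.

    n_completed is a subset of n_offered: everyone who completed the
    training was first offered the program.
    """
    expected = (n_completed * P_COMPLETED
                + (n_offered - n_completed) * P_OFFERED)
    baseline = n_offered * P_BASELINE
    return expected, expected - baseline

# e.g., this month 300 users were offered the program; 100 finished training
total, caused_by_product = expected_gardens(300, 100)
```

In the future, whenever the company sees these behaviors in the product, it can plug the counts into the bridge to estimate its real-world impact without sending the intern back out.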

A quick review: if your target outcome is something outside of the product, and not directly measurable, then you’ll need to build a data bridge. The easiest way to do that is to find an existing research study that documents the relationship you’re looking for—like between the intention to plant a garden and the actual act of doing so. If you can’t find one, look for a case in which your team can directly observe the users’ behavior, and compare the things they do or say in the product to what they actually do in the real world. That’s your data bridge. In the future, you can use that relationship to estimate how much of an impact you’re having based on what you see in the application, and iteratively improve your product for greater impact.

On a Napkin

Here’s what you’ll need to do:

§ Define two metrics that say how you’ll measure the product’s target outcome and target action.

§ If the outcome and action are within the application, great. Measure them directly. If not, try to find creative ways to get that data from someone else or build it into your application.

§ If both of those fail, you’ll need to build a model of how behavior with the product affects the outcome and action: a data bridge.

§ Measure the impact itself. The gold standard is A/B testing; you can use off-the-shelf tools that don’t require a stats person.
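If you do want to check an A/B result by hand rather than rely on an off-the-shelf tool, a standard approach is the two-proportion z-test. This sketch is a plain-Python stand-in with made-up signup counts; statistical packages (R’s prop.test(), mentioned in the endnotes) do the same job with more rigor.

```python
from math import sqrt, erf

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)   # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - standard normal CDF of |z|)
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical A/B result: variant A converts 120/1000 users, B converts 90/1000.
z, p_value = two_proportion_ztest(120, 1000, 90, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```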

How you’ll know there’s trouble

§ The team can’t decide on a reliable, accurate metric of the product’s outcome, or doesn’t define success and failure by that metric.

Deliverables

§ A clear measurement of the product’s impact!


[131] Inspired by Neighborsations, http://www.neighborsations.com/.

[132] There are a variety of perspectives on what makes a good metric, but no generally accepted and applied definition. These are characteristics that I’ve found to be important.

[133] It may take an up-front investment (that’s not cheap) to make recurring measurement cheap. We want to set up data collection that will be cheap and easy to check whenever there is a change to the application. Survey data, for example, is often “cheap” to measure the first time, but the cost usually remains the same with each iteration (and survey data is plagued with biases; discussed under the section “figure out how to measure”). Ideally, we want automatically gathered administrative data—data that is collected from the original source without the need for human intervention or extra costs. Asking people what they spend money on is a survey. Their actual credit card transactions are administrative data.

[134] http://www.contactually.com

[135] https://www.kissmetrics.com/; https://mixpanel.com/

[136] http://piwik.org/

[137] Most of us forget or don’t even think about what we’re eating. See [ref181] and http://www.mindlesseating.org/ for humorous and disturbing examples.

[138] https://www.dssresearch.com/KnowledgeCenter/toolkitcalculators/samplesizecalculators.aspx

[139] I like to use R, which is open source and extremely powerful. In R, you can use the functions power.t.test(), when working with average values, and power.prop.test(), when dealing with percentages.

[140] In R, that’s the prop.test() function for proportions and the t.test(), or a regression function for numerical values. If the outcome is ordinal (the possible values are in order, but the spacing between them may be irregular and they aren’t directly comparable) things are a bit trickier. Get a good stats book, find a statistics person, or tweak the measurement so that the result is binary, floating point, or integer.

[141] I can’t just point you to the right function in R, sorry.

[142] http://vanity.labnotes.org/. Vanity allows the sample size to grow until you observe a statistically significant difference, which can undermine the test and give false positives. A better way to handle an unknown number of users is described in “You don’t know the users yet”. My thanks to Katya Vasilaky for mentioning this problem.

[143] https://github.com/gregdingle/genetify/wiki.

[144] https://www.optimizely.com/; https://www.google.com/analytics/

[145] My thanks again to Katya Vasilaky for the reference and description of the problem.

[146] There are often many possible changes to the product you want to analyze—so focusing too long on features that don’t appear to change behavior in practically significant ways means you’re wasting time that could be used more valuably elsewhere. This differs from academic social science work, in which researchers usually devote a significant amount of time to a single question; because of a lack of data, they usually don’t have a long list of alternative questions that can be explored immediately.

[147] By the way, if the area is large, I imagine that the best way to do this would be to access government or commercial satellite imagery. Professional geographers have worked out amazing algorithms to automatically detect vegetation cover, and even the type of vegetation. The GeoEye satellite that is used by Google Earth, for example, measures down to increments of 16 inches.

[148] To clarify—at this point we’re just talking about how to measure the real-world outcome. That forms half of the data you need in order to run an experiment, do a pre-post analysis, or build a statistical model of the relationship between the real-world outcome and user actions in the application. That process is what actually creates the data bridge, and is covered later on. But it helps to plan ahead for the type of analysis you will be running, to ensure you’re gathering the data you need when measuring the real-world outcome.

[149] For example, you can use http://easycalculation.com/statistics/population-confidence-interval.php for calculating confidence intervals of proportions (percent of people creating vegetable gardens) and http://easycalculation.com/statistics/confidence-limits-mean.php for calculating confidence intervals of quantities (number of pounds lost after an exercise program). Penn State has a nice summary of the underlying math here: https://onlinecourses.science.psu.edu/stat200/node/46.

[150] In the simplest case, you might look at the simple linear relationship between the real-world impact and user behavior in the product. But there’s no reason to limit the analysis to a linear relationship. You want to build a model that most accurately describes the relationship between behavior in the product and outcomes in the real world.

[151] Exactly how much additional improvement would occur requires additional analysis, to separate out the self-selection into the program from the program’s causal impact.