
Designing for Behavior Change: Applying Psychology and Behavioral Economics (2013)

Part V. Refining the Product

Chapter 14. Learning and Refining the Product

Determine What Changes to Implement

At the end of each cycle of product release and measurement, the team will have gathered a lot of data about what users are doing in the product and potential improvements to it. Obstacles to behavior change, the subject of the last chapter, are only one source of those product improvements. Business considerations and engineering considerations must also be reviewed. It’s time to collect the potential changes from these diverse sources and see what can be applied to the next iteration of the product. I think of it as a three-step process:

1. Gather lessons learned and potential improvements to the product.

2. Prioritize the potential improvements based on business considerations and behavioral impact.

3. Integrate potential improvements into the appropriate part of the product development process.

Gather

First, look at what you learned in the last two chapters about the current impact of the product and obstacles to behavior change. What did users struggle with? Where was there a significant drop-off among users? Are users returning to the application, or only trying it once or twice? Why does that appear to be happening?

§ Start by picking the low-hanging fruit. List the clear problems with a crisp follow-up action; for example, no one knows how to use page Y.

§ Then, write down the lessons that are more amorphous; for example, users don’t trust the product to help them to change behavior. Maybe the team has started thinking about potential solutions, but there’s more work to be done. The next step is to further investigate what’s going on and settle on a specific solution to resolve the problem.

Next, gather lessons about the core assumptions of the product:

§ Does the target action actually drive the real-world outcomes that the company seeks? For example, maybe walking a bit more each day isn’t enough to reduce heart disease among the target population, and a stronger intervention is needed.

§ Are there other actions that appear to be more effective? These lessons can come from the causal map, if one was developed in the last chapter. Could the product pivot to a different action that is more effective?

§ Are there major obstacles in the user’s life, outside of the product, that need to be addressed? Looking again at the causal map, what major factors that are currently outside of the product’s domain are counteracting the influence of the product? If exercising more leads the person to also drink more alcoholic beverages (as a “reward”), is that defeating the product’s goals? To design for behavior change, we care about the net impact of the product, not just the intended consequences. Is there anything the product can do about that countervailing force, or is it just a fact of life?

Finally, look beyond the specific behavioral obstacles and impact studied in the last two chapters. The team has probably generated numerous ideas for new product features or even new products. Collect them. Other parts of the company will suggest changes to the product as well: changes designed to increase sales, improve product branding, resolve engineering challenges, and so on. Behavioral considerations are just one (vital!) element in the larger review process.

Lessons and proposed improvements can come at different times during the product development cycle—from early user research to usage analysis after the product is released. Some lessons will only come at the end, during a formal sprint review or a product post mortem. I suggest creating a common repository for them, so that ideas don’t get lost. That can be someone’s email box, a wiki, or a formal document of lessons. In an agile development environment, they should be placed in a project backlog.

Prioritize

In any product-development process, there is a point at which the team needs to decide what to work on in the future. In an agile environment, that process occurs frequently, at the start of each sprint. In a sequential development world, it often occurs separately from the release schedule, as managers plan for future iterations of the product. In either case, the team needs to prioritize the long list of specific changes to the product and problems to investigate further.

The prioritization process should estimate the behavioral impact of major changes to the product: how will the change affect user behavior, and how will that affect the product’s outcomes in the real world? Since the product is designed for behavior change, these behavioral impacts will likely have knock-on effects on sales or the quality of the company brand. Naturally, the prioritization will also incorporate business considerations (will the change directly drive sales or company value?), usability considerations (will it make the users happy and reduce frustrations, hopefully driving future engagement and sales?), and engineering considerations (how hard will it be to implement the change?).

The team should ground its assessment of behavioral impact in real data: the drop-off numbers at each step of the user’s progression and the causal map from Chapter 13 allow the team to make a quick estimate of how large an impact a change in the product should have. That helps the team answer: how big a problem does this change really address? What very rough change in the target behavior and outcomes do we expect from it? Even if the proposed change to the application wasn’t driven by a behavioral concern—for example, if it came from a client request during a sales conversation—it should be evaluated for its possible behavioral impact. It may have the added benefit of helping improve user success at the target behavior, or it may distract the user and undermine the product’s effectiveness.
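For illustration, here is a back-of-the-envelope sketch of that kind of estimate in Python; the funnel steps, conversion rates, and assumed lift are hypothetical, not drawn from any particular product. The point is simply to tie a proposed change to a rough expected change in the target outcome.

```python
# Back-of-the-envelope impact estimate from funnel drop-off numbers.
# All step names and rates below are illustrative placeholders.

funnel = {                      # fraction of users who complete each step
    "sign_up": 0.60,
    "set_goal": 0.40,
    "log_first_walk": 0.25,
    "weekly_habit": 0.10,
}

def expected_outcome(rates):
    """Multiply step conversion rates to get the share of new users
    who reach the final target behavior."""
    result = 1.0
    for rate in rates.values():
        result *= rate
    return result

baseline = expected_outcome(funnel)

# Suppose a proposed change is expected to lift the "set_goal" step by 20%.
improved = dict(funnel, set_goal=funnel["set_goal"] * 1.20)
uplift = expected_outcome(improved) - baseline

print(f"Baseline: {baseline:.2%} of users reach the target behavior")
print(f"Estimated uplift from the change: {uplift:.2%} (absolute)")
```

Even a crude calculation like this makes it easier to compare proposed changes against one another and against their engineering cost.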

The weight of each of these considerations—business, behavioral, engineering, etc.—in the company’s prioritization will vary, and there’s no hard-and-fast rule.

Integrate

Your company has a prioritized list of changes to the product (including open questions that need to be answered) and a sense of how difficult each piece is to develop. Now, separate out changes that require adjusting core assumptions about the product and its direction from less fundamental changes that keep the same direction. If the change entails targeting a different set of users (actors), a different target action, or, especially, a different real-world outcome, it goes into the first bucket. If the change entails a new product or a new product feature with major unknowns, it also goes into the first bucket. Everything else can go into the second bucket.

Here’s one of the few places I take a strong stand on the product-development process—items in the first group, with changes to core assumptions or major new features, need to be separately planned out by the product folks before they are given to the rest of the team. Even in an agile development process, core product planning shouldn’t be done in parallel with the rest of the process; that’s the same dictum that Marty Cagan, in Inspired (SVPG Press, 2008), gives in his analysis of product management. It’s just too much to determine what to build and how to build it at the same time.

When designing for behavior change, core changes to the product require updating the behavioral plan. They may also require updating the product’s outcomes, actions, and actors. In other words, they require another cycle of the full discovery or design process, starting with Chapter 4 or Chapter 6 of this book. Everything else can go directly into a user story (agile) or product spec outline, starting with Chapter 9.

Each time the core assumptions—actor, action, and outcome—are changed, they should be clearly documented, as described in Chapter 4 and Chapter 5. Then the behavioral plan should be updated. This formalism helps the team pull problems into the present—rather than let them lurk somewhere in the future, only to be found after significant resources have been expended. Making the assumptions and plan clear up front is intended to trigger disagreements and discussion (if there are any underneath). It’s better to find those disagreements sooner rather than later.

When a core problem arises—like a particular step in the sequence of user actions confusing users—there’s a natural tendency to settle on a proposed solution and just get it done (i.e., a “fix” is often implemented before the problem has been vetted and tested). But human psychology is tremendously complex, and trying to build a product around it is inherently error prone. There’s no reason to think that the proposed solution is going to be any more free of unexpected problems than the previous solution. The discovery process—documenting the outcome, action, and actor, and then developing the behavioral plan—is one way to draw out the unexpected and provide opportunities to test assumptions early. It’ll never be perfect, but it’s a whole lot better than just shooting from the hip.

Measure the Impact of Each Major Change

Each major change to the product should be tested for its impact on user behavior; measuring changes in impact should become a reflex for the team. It’s not always easy to stomach, but it’s necessary. That way, the team is constantly learning and checking its assumptions about the users and the product’s direction. As we’ve seen, small changes in wording and the presentation of concepts can have major impacts on behavior; if we’re not testing for them, we can easily and unintentionally undermine the effectiveness of our product. But without a reflex to always test, testing the marginal impact of changes can raise all sorts of hackles and resistance. Let’s look at some of the issues that can arise and how to handle them.

Most tests will (and should) come back showing no impact

Many people get frustrated at test results that come back with no clear difference between the versions they’re testing and call that “failing.” Assuming you designed and ran the test correctly, a “no difference” result should be celebrated. It tells you that you weren’t changing something important enough for the test to have found a difference. Either be happy with what you have, or try something more radical. It saves you from spending further time on your current approach to improving the product.

What’s a well-designed test? It’s one where you’ve defined success and failure beforehand. It’s not one where you go searching for statistical significance (or a “strong” qualitative signal). For example, let’s say you have a potential new feature/button color/cat video. How much of an impact does it need to have before you care? If you improve impact by 20%, is that your threshold for success? Is it worthwhile to work on this further if you’re only getting a 2% boost? That definition of success and failure, along with the amount of noise in the system, determines how many people you need in the test. If you get a result of “no difference” from the test, that doesn’t necessarily mean “there’s no effect”; it means there’s no effect that you should care about. You can move on.
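To make that concrete, here is a minimal sample-size sketch, assuming a yes/no target behavior and the statsmodels library; the baseline rate and the smallest lift worth acting on are hypothetical placeholders for your own numbers.

```python
# Turn "how big a change do I care about?" plus the noise in the system
# into the number of users needed per test group.

from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.10      # current rate of the target action (hypothetical)
meaningful_lift = 0.02    # smallest absolute lift you'd act on (2 points)

# Convert the two proportions into a standardized effect size.
effect_size = proportion_effectsize(baseline_rate + meaningful_lift, baseline_rate)

# Users needed per group for an 80% chance of detecting that lift at alpha = 0.05.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0
)
print(f"About {int(round(n_per_group))} users per group")
```

If the required sample is larger than your user base, that is a signal to either test a bolder change or accept that you cannot detect effects that small.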

A/B tests in particular seem to mean you’re showing a “bad” version of the app to some people

If you have a good UX team, then most of the time, no one really knows if a change in the app will improve it. You can’t accurately predict whether the new version will be better or worse. You usually are showing a “bad” version; the problem is that you don’t know which one it is! Our seemingly solid hunches are usually random guesses, especially when we have a good design team. There are two reasons why.

First, a good UX team will deliver an initial product that is well designed, and will deliver product improvements that are also well designed. We all make mistakes, but a good design team will get you in the right ballpark with the first try. By definition, further iterations are going to have a small impact relative to the initial version of the product. Don’t be surprised that new versions have similar results (impact, etc.) to earlier versions—celebrate the fact that the earlier version was a good first shot.

Second, human behavior is just really confusing. As we’ve seen repeatedly throughout this book, we just can’t forecast exactly how people will react to the product. In familiar situations, we can and should use our intuition about a set of changes to say which one is likely to be better—like when we’re applying common lessons we’ve learned in the past. But when you have a good design team, the common lessons have already been applied. You’re at the cutting edge, and so your intuition can’t help anymore. That’s why you need to test things, and not rely (solely) on your intuition.

Does planning for tests imply you’re not confident in the changes you’re proposing?

This is another issue I’ve heard, and it’s a really tricky one. You naturally expect that any changes that you’re planning to make to the product will improve it. But that’s often not the case (since it’s hard to make a good product better, and human behavior is inherently complex).

That sets up a problem of cognitive dissonance, though. It’s very uncomfortable to think that some of the changes you’ve carefully planned out, thought about, and decided will help are actually going to do nothing—and you don’t know which ones those are! It would be like admitting a lack of confidence in the changes that you’ve already advocated. So, a natural (but dangerous) response is to plough ahead and decide that testing is not needed.

There’s no simple solution to address this situation—the need to confidently build something you shouldn’t actually be confident in. The best approach that I’ve come across is to move the testing process out of the reach of that cognitive dissonance. Make testing part of the culture of the organization; make it a habit that’s followed as standard procedure and not something that the organization agonizes over and debates each time a new feature is added.

Alrighty. Those are three of the major issues I’ve confronted as teams explore testing incremental changes to their product. Thankfully, it’s not hard to actually measure incremental impact. If you created a benchmark of the product’s impact in Chapter 12, then all you need to do is to reapply the same tools here: experiments, pre-post analyses, and statistical models. Shifting from epistemology to the practicalities of testing, the next sections describe how each method can be used to measure incremental impact.

How to Run Incremental A/B Tests and Multivariate Tests

If you’re changing the content of a screen or adding/removing user interactions, then it’s straightforward to run an experimental test to see the impact of the change:

1. Create two or more parallel versions of that part of the product: the existing version (control) and new version(s) (treatment).

2. Estimate how many people you need in each group using online power calculation tools, as described in Chapter 12.

3. Randomly assign users, also as described in Chapter 12.

4. Run the test and measure the target behavior and outcome.

5. Check that the signal is strong enough to make a solid comparison, using one of the online tools listed in Chapter 12, or using a statistical package like R. If so, compare the averages for each group.

And that’s it. That tells you the impact of the change. An experiment like this is the best, most reliable way to measure the effect of a feature. If you find that the new feature or change hurt the impact of the product, then the team will need to determine if it makes sense to cancel the change. If it helped, celebrate!
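For the comparison in step 5, here is a minimal sketch, assuming a binary target behavior (the user did or didn’t take the action) and the statsmodels library; the counts are made up for illustration.

```python
# Compare the target-behavior rates of the control and treatment groups.

from statsmodels.stats.proportion import proportions_ztest

successes = [480, 540]    # users who took the target action: control, treatment
totals = [5000, 5000]     # users assigned to each group

z_stat, p_value = proportions_ztest(successes, totals)
control_rate = successes[0] / totals[0]
treatment_rate = successes[1] / totals[1]

print(f"Control: {control_rate:.1%}  Treatment: {treatment_rate:.1%}")
print(f"p-value: {p_value:.3f}")
# Judge the observed lift against the success threshold you defined before
# the test, not just against statistical significance.
```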

However, it can be very costly or difficult to run A/B tests when you’re making a small change to the system: you may not have enough users to get a strong signal, or it may be costly for engineering reasons to keep two versions of the feature (old and new) at the same time. In that case, there are less expensive (but less precise) tools one can use.

How to Compare Incremental Pre-Post Results

Whenever the engineering team makes changes to the application, the company can look at user behavior before and after. As discussed in Chapter 12, numerous other factors could cause changes in behavior, so the results of a pre-post test need to be interpreted carefully. But if there is a major difference between the two versions, and nothing else that appears to have changed for the users at the same time, then a simple pre-post comparison is good enough. It can provide a reliable and easy way to gauge if the change helped or not.
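For illustration, here is a minimal pre-post sketch, assuming you track a daily rate of the target behavior before and after the release; the numbers are made up.

```python
# Compare the target-behavior rate in the week before and the week after a release.

from statistics import mean
from scipy import stats

pre_release = [0.11, 0.10, 0.12, 0.11, 0.10, 0.12, 0.11]   # daily rates before
post_release = [0.13, 0.14, 0.12, 0.13, 0.14, 0.13, 0.15]  # daily rates after

t_stat, p_value = stats.ttest_ind(post_release, pre_release)
print(f"Before: {mean(pre_release):.1%}  After: {mean(post_release):.1%}")
print(f"p-value: {p_value:.3f}")
# This only suggests an effect; anything else that changed at the same time
# (marketing pushes, seasonality) could also explain the difference.
```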

How to Find Incremental Effects in Statistical Models

Another way of looking at the effect of a change in the application is to rerun the statistical models used to establish the product benchmark in Chapter 12. You do this by:

1. Comparing people who used the new changed feature versus those who didn’t

2. Comparing the impact of the application before and after the change

In the first case, look for people who simply didn’t see the change in the application, because they didn’t log in, for example. If the reason people didn’t see the change in the application is completely random, then you have a natural experiment and can treat the change like an A/B test. Check that the signal is strong, and then just compare the average behavior of each of the two groups.

If you don’t have a natural experiment, then you look for statistical controls that account for other reasons that those people didn’t see the change in the application (like their being overall less interested in the application, less likely to take the action at all, etc.). The challenge is that it is very difficult to control for all of the possible reasons that someone wouldn’t interact with the changed part of the application. So, just like with pre-post analysis, there’s a risk that you’ll get misleading results. But if there is a major change in the outcome, and nothing else appears to explain it, the statistical model can point you in the right direction even if it isn’t exactly perfect.

In the second case, when you compare the impact before and after the change, you’re effectively running a pre-post analysis (but with a statistical model to control for additional factors). You’ll need to think through the other things in the application and in the users’ daily lives that could have been different before and after the feature was changed. It’s certainly risky, but it may be the only practical option. The same rule applies: if it’s a big change (one that you actually care about), and nothing else seems to explain it after a careful analysis, then the model can point you in the right direction.
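Here is a minimal sketch of that kind of model, assuming per-user data in a pandas DataFrame and the statsmodels formula API; the column names are hypothetical, and the synthetic data exists only so the example runs.

```python
# Logistic regression: the coefficient on saw_change estimates the effect of the
# change after controlling for the listed confounders. Column names are illustrative.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "saw_change": rng.integers(0, 2, n),     # 1 if the user saw the new feature
    "prior_activity": rng.poisson(3, n),     # logins in the month before
    "tenure_days": rng.integers(1, 365, n),
})
# Synthetic outcome so the example runs; in practice this is observed behavior.
score = -2 + 0.5 * df["saw_change"] + 0.2 * df["prior_activity"]
df["did_action"] = rng.binomial(1, 1 / (1 + np.exp(-score)))

model = smf.logit("did_action ~ saw_change + prior_activity + tenure_days", data=df).fit()
print(model.summary())
# For the pre-post variant, swap saw_change for a before/after indicator and
# control for time-varying factors (day of week, marketing pushes, etc.).
```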

Running Qualitative Tests of Incremental Changes

I didn’t mention qualitative research in Chapter 12, when we were establishing a benchmark of the impact of the product on user behavior and real-world outcomes. That’s because it’s difficult to generate a repeatable, reliable ROI metric of real-world impacts using most qualitative methods. But qualitative research can be quite valuable when you want to quickly judge how users are responding to a change in the application.

Put the revised application in front of users during user interviews, user testing (think-aloud methods), or even focus groups. If you get a clear signal about whether the change has caused problems, you’ve just saved a lot of time. You can get feedback and insight in a fraction of the time it would take to test the product change with an experiment or pre-post analysis. While I am a big proponent of experiments (and statistical modeling), the benefit in terms of speed and depth of understanding from qualitative testing is too much to ignore. Of course, the team should have already performed a round of qualitative testing on the prototypes before the change was made to the product itself, too.

Deploying Multiarmed Bandit Techniques

There’s been a lot of interest and attention given recently to multiarmed bandit techniques. These procedures dynamically adjust what content is shown to users based on the content’s past performance. The process starts with two (or more) alternative versions of a page or product and an initial estimate of how effective each version of the page is for driving a target outcome, like conversions. As users reach the page, most of them (say 90%) receive the version that appears to be most effective. The rest of the users receive the other version(s), just in case the initial estimate of its relative effectiveness was too low. The system recalculates the effectiveness of the various versions on the fly. As more users come into the system, most are directed to whichever version is currently seen as most effective. It exploits the best-performing version by giving it lots of users, explores other versions just in case it’s wrong, and learns over time. The technique is dirt simple to code and can be found in this provocative 2012 blog post by Steve Hanov, “20 lines of code that will beat A/B testing every time” (http://stevehanov.ca/blog/index.php?id=132).
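The following is a minimal epsilon-greedy sketch of that idea (not Hanov’s exact code); the 10% exploration rate mirrors the “say 90%” example above, and the version names are hypothetical.

```python
# Epsilon-greedy multiarmed bandit: exploit the best-performing version most
# of the time, explore the others occasionally, and update estimates as users arrive.

import random

class EpsilonGreedyBandit:
    def __init__(self, versions, epsilon=0.10):
        self.epsilon = epsilon
        self.shows = {v: 0 for v in versions}       # times each version was shown
        self.successes = {v: 0 for v in versions}   # conversions per version

    def choose(self):
        """Show the best-performing version ~90% of the time; explore otherwise."""
        if random.random() < self.epsilon:
            return random.choice(list(self.shows))
        return max(self.shows,
                   key=lambda v: self.successes[v] / max(self.shows[v], 1))

    def record(self, version, converted):
        """Update the running estimate after observing the user's behavior."""
        self.shows[version] += 1
        if converted:
            self.successes[version] += 1

# Usage: pick a version for each arriving user, then report what they did.
bandit = EpsilonGreedyBandit(["control", "new_signup_page"])
version = bandit.choose()
bandit.record(version, converted=True)
```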

Multiarmed bandits are generally good at driving users toward the version of the page that is most effective. That’s excellent. But that benefit comes at a cost. The first cost is time. Because the algorithm basically assumes it knows what’s right based on early indication (and, to be fair, it often is), it takes a long time for the exploration part to test that assumption (i.e., it takes a much longer time to determine if one version of the page is actually better, or it just looked like it was better initially). The second cost is related: risk. If the early data gives a false signal, or the type of users in the system changes midstream, it takes a much longer time to discover that error and show the right version of the product to users than an A/B test would.

Both A/B tests and multiarmed bandits are forms of experiments, and both are scientifically valid. Both can tell you the impact of a change in your product. I’ll leave the choice of which to use up to you, depending on how confident you are that one version of the page or product clearly is better than another, and that the multiarmed bandit algorithm will be able to pick up that difference early on. There’s a nice summary of the pros and cons of multiarmed bandit techniques versus A/B testing on the Visual Website Optimizer Blog ([ref31]; http://bit.ly/19RJU2r).

When Is It “Good Enough”?

Ideally, the outcome of any product development process, especially one that aims to change behavior, is that the product is doing its job and nothing more is needed. When the product successfully automates the behavior, builds a habit, or reliably helps the user make the conscious choice to act, then the team can move on. There are always other products to build. And, for commercial companies, there are always other markets to tap. So, how can the team tell when it is good enough?

Return to the product’s target outcome, and try to stop thinking about the product itself. The target outcome should be measurable, by definition! What’s the target level (or change in the target) that the company decided would count as success? If the product currently reaches that threshold, wonderful. Forget the product’s bugs. Forget the warts in the design. Move on. If the current product doesn’t yet meet that threshold, what is the best alternative use of the team’s time? If the alternative is more beneficial to the target outcome and can be achieved with similar resources, the team should switch its focus.

Expending effort on building a product or anything else warps our judgment of value.[153] Designing for behavior change is about final, objective outcomes—so that means taking a dispassionate look at what’s really in the best interest of the company and users. Ideally, the person taking that dispassionate look wasn’t involved in building the current product at all. Sometimes it means letting a somewhat broken product remain somewhat broken so the team can work on something else, in the name of helping your users.

How to (Re-)Design for Behavior Change with an Existing Product

Thus far, I’ve presented the process of designing for behavior change in terms of what it takes to build a new product or product feature. What should you do if you already have a product and are just starting to formally target and assess its impact on user behavior? The process isn’t so different than if you were starting with a new product, but you have much more information to start with!

Here’s how to do it:

1. Document the target outcome, actor, and action. (See Chapter 4 and Chapter 5 on how to record existing targets or develop new ones.)

2. Develop a rough behavioral plan using the product’s current sequence of steps—what does the product actually encourage people to do?

3. Instrument the product to measure user behavior at each step of the way, if it hasn’t already been instrumented (see the instrumentation sketch after this list).

4. Dig into the data (Chapter 12 and Chapter 13), to see what your users are doing, what impact the product is currently having, and what obstacles users are facing.

5. Generate ideas for product changes, large and small. Prioritize them as necessary.

6. In parallel, sketch out a blue-sky version of the product that wasn’t constrained by the current implementation. Knowing what you know, how would you design the product for maximum impact? You don’t need to do this too formally—just run through the “On a Napkin” exercises given in Chapters 4 through 8.

7. Check if that blue-sky idea is either:

a. So promising that it might warrant a new development effort. If so, test the idea first; don’t build it yet. After the initial testing, if everything pans out, consider switching products. I know that’s not what product teams want to hear (see “the Ikea effect”), but it might be the best thing for users and company.

b. Promising and similar enough to the current product such that incremental changes can be made to the current product to try out the ideas and improve the impact over time. Add the new ideas to the list of proposed changes, and prioritize them (along with all of the other proposed changes) as necessary.

8. Take the list of prioritized changes, integrate them into the product development process, and get crackin’.
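As referenced in step 3, instrumentation simply means recording an event each time a user reaches a step in the behavioral plan. Here is a minimal sketch; the step names, the log_event helper, and the local file storage are hypothetical (most teams would send these events to an analytics service instead).

```python
# Record one behavioral event per step so that drop-off at each step can be
# counted later. The event names and storage backend are illustrative.

import json
import time

def log_event(user_id, step, properties=None):
    """Append one event per line to a local log for later analysis."""
    event = {
        "user_id": user_id,
        "step": step,               # e.g., "opened_app", "set_goal", "logged_walk"
        "timestamp": time.time(),
        "properties": properties or {},
    }
    with open("behavior_events.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")

# Usage: call it at each step of the user's progression through the product.
log_event(user_id=42, step="set_goal", properties={"goal": "walk_20_min"})
```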

Alternatively, you can take a more focused, but less open-minded, “fix what’s broken” approach. In that case, you’d start with the discovery process (Chapter 4 and Chapter 5) to make sure everyone is on the same page about what the product is supposed to do. Then, you’d go straight into a diagnosis of known problems—using the Create Action Funnel to identify the underlying psychology that’s driving the problems. That’s described in Chapter 13. And, with the problem diagnosed, you’d jump to Table 10-1 to identify behavioral tactics that can resolve the problem.[154]

If you have a product with a narrow problem, then the “fix what’s broken” approach is great. Otherwise, I suggest going with the more detailed approach; that helps you test your assumptions and potentially reenvision the product to make it far more effective.

On a Napkin

Here’s what you’ll need to do

§ Gather together all of the proposed product changes—changes to improve the behavioral impact of the product and other changes suggested by sales, marketing, or other parts of the company.

§ Prioritize the changes based on the company’s and users’ needs and their likely impact on user behavior.

§ Measure the impact of each major change to the product, using the same tools outlined in Chapter 12. Make incremental measurement part of the culture of the company.

§ For existing products, start with discovery, then skip the design stage and dive into the data to see where refinements should be made.

How you’ll know there’s trouble

§ Major changes are planned for the product without assessing their likely impact on user behavior.

§ The team is afraid to test the new feature, because the tests usually come back negative or testing would imply a lack of confidence.

Deliverables

§ A new and (hopefully) improved product!


[153] One example: the so-called Ikea effect ([ref142]). If you put together a lopsided, ugly bookshelf from Ikea, you’ll think it’s much more valuable than anybody else’s identical, lopsided, ugly bookshelf from Ikea.

[154] This approach is similar to the (much more detailed and thorough) diagnosis phase that ideas42, the leading behavioral economics consultancy in the United States, uses to start its design process.