Lean Enterprise: How High Performance Organizations Innovate at Scale (2015)

Part III. Exploit

Chapter 9. Take an Experimental Approach to Product Development

The difficulty in defining quality is to translate future needs of the user into measurable characteristics, so that a product can be designed and turned out to give satisfaction at a price the user will pay.

Walter Shewhart

Up to now, we have spent the whole of Part III showing how to improve the speed at which we can deliver value to customers. In this chapter, we switch focus to discuss alignment—how to use the capability we have developed to make sure we are building the right things for customers, users, and our organization.

In Chapter 7, we showed how to use the Cost of Delay to prioritize work. In an organization where IT is essentially a service provider, this is an effective way to avoid working on low-value tasks that consume precious time and resources. However, in high-performance organizations, projects and requirements are not tossed over the wall to IT to build. Rather, engineers, designers, testers, operations staff, and product managers work in partnership on creating high-value outcomes for customers, users, and the organization as a whole. Furthermore, these decisions—made locally by teams—take into account the wider strategic goals of the organization.

In Chapter 6 we described the Improvement Kata, an iterative approach to process improvement in which we set target conditions for the next iteration and then let teams decide what work to do in order to achieve those target conditions. The key innovation we present in this chapter is to use the same process to manage product development. Instead of coming up with requirements or use cases and putting them into a backlog so that teams build them in priority order, we describe, in measurable terms, the business outcomes we want to achieve in the next iteration. It is then up to the teams to discover ideas for features which will achieve these business outcomes, test them, and build those that achieve the desired outcomes. In this way, we harness the skill and ingenuity of the entire organization to come up with ideas for achieving business goals with minimal waste and at maximum pace.

As an approach to running agile software development at scale, this is different from most frameworks. There’s no program-level backlog; instead, teams create and manage their own backlogs and are responsible for collaborating to achieve business goals. These goals are defined in terms of target conditions at the program level and regularly updated as part of the Improvement Kata process (see Chapter 6). Thus the responsibility for achieving business goals is pushed down to teams, and teams focus on business outcomes rather than measures such as the number of stories completed (team velocity), lines of code written, or hours worked. Indeed, the goal is to minimize output while maximizing outcomes: the fewer lines of code we write and hours we work to achieve our desired business goals, the better. Enormous, overly complex systems and burned-out staff are symptoms of focusing on output rather than outcomes.

One thing we don’t do in this chapter (or indeed this book) is prescribe what processes teams should use to manage their work. Teams can—and should—be free to choose whatever methods and processes work best for them. Indeed, in the HP FutureSmart program, different teams successfully used different methodologies and there was no attempt to impose a “standard” process or methodology across the teams. What is important is that the teams are able to work together effectively to achieve the target conditions.

Therefore, we don’t present standard agile methods such as XP, Scrum, or alternatives like Kanban. There are several excellent books that cover these methods in great detail, such as David Anderson’s Kanban: Successful Evolutionary Change for Your Technology Business,150 Kenneth S. Rubin’s Essential Scrum: A Practical Guide to the Most Popular Agile Process (Addison-Wesley), and Mitch Lacey’s The Scrum Field Guide: Practical Advice for Your First Year (Addison-Wesley). Instead, we discuss how teams can collaborate to define approaches to achieve target conditions, then design experiments to test their assumptions.

The techniques described in this chapter require a high level of trust between different parts of the organization involved in the product development value stream, as well as between leaders, managers, and those who report to them. They also require high-performance teams and short lead times. Thus, unless these foundations (described in previous chapters in this part) are in place, implementing these techniques will not produce the value they are capable of.

Using Impact Mapping to Create Hypotheses for the Next Iteration

The outcome of the Improvement Kata’s iteration planning process (described in Chapter 6) is a list of measurable target conditions we wish to achieve over the next iteration, describing the intent of what we are trying to achieve and following the Principle of Mission (see Chapter 1). In this chapter, we describe how to use the same process to drive product development. We achieve this by creating target conditions based on customer and organizational outcomes as part of our iteration planning process, in addition to process improvement target conditions. This enables us to use program-level continuous improvement for product development too, by adopting a goal-oriented approach to requirements engineering.

Our product development target conditions describe customer or business goals we wish to achieve, which are driven by our product strategy. Examples include increasing revenue per user, targeting a new market segment, solving a given problem experienced by a particular persona, increasing the performance of our system, or reducing transaction cost. However, we do not propose solutions to achieve these goals or write stories or features (especially not “epics”) at the program level. Rather, it is up to the teams within the program to decide how they will achieve these goals. This is critical to achieving high performance at scale, for two reasons:

§ The initial solutions we come up with are unlikely to be the best. Better solutions are discovered by creating, testing, and refining multiple options to discover what best solves the problem at hand.

§ Organizations can only move fast at scale when the people building the solutions have a deep understanding of both user needs and business strategy and come up with their own ideas.

A program-level backlog is not an effective way to drive these behaviors—it just reflects the almost irresistible human tendency to specify “the means of doing something, rather than the result we want.”151

Getting to Target Conditions

Goal-oriented requirements engineering has been in use for decades,152 but most people are still used to defining work in terms of features and benefits rather than measurable business and customer outcomes. The features-and-benefits approach plays to our natural bias towards coming up with solutions, and we have to think harder to specify the attributes that an acceptable solution will have instead.

If you have features and benefits and want to get to target conditions, one simple approach is to ask why customers care about a particular benefit. You may need to ask “why” several times to get to something that looks like a real target condition.153 It’s also essential to ensure that target conditions have measurable acceptance criteria, as shown in Figure 9-1.

Gojko Adzic presents a technique called impact mapping to break down high-level business goals at the program level into testable hypotheses. Adzic describes an impact map as “a visualization of scope and underlying assumptions, created collaboratively by a cross-functional group of stakeholders. It is a mind-map grown during a discussion facilitated by answering the following questions: 1. Why? 2. Who? 3. How? 4. What?”154 An example of an impact map is shown in Figure 9-1.

Figure 9-1. An example of an impact map

We begin an impact map with a program-level target condition. By stating a target condition, including the intent of the condition (why we care about it from a business perspective), we make sure everyone working towards the goal understands the purpose of what they are doing, following the Principle of Mission. We also provide clear acceptance criteria so we can determine when we have reached the target condition.

The first level of an impact map enumerates all the stakeholders with an interest in that target condition. This includes not only the end users who will be affected by the work, but also people within the organization who will be involved or impacted, or can influence the progress of the work—either positively or negatively.

The second level of an impact map describes possible ways the stakeholders can help—or hinder—achieving the target condition. These changes of behavior are the impacts we aim to create.

So far, we should have said nothing about possible solutions to move us towards our target condition. It is only at the third level of the impact map that we propose options to achieve the target condition. At first, we should propose solutions that don’t involve writing code—such as marketing activities or simplifying business processes. Software development should always be a last resort, because of the cost and complexity of building and maintaining software.

The possible solutions proposed in the impact map are not the key deliverable. Coming up with possible solutions simply helps us refine our thinking about the goal and stakeholders. The solutions we come up with at this stage are unlikely to be the best—we expect, rather, that the people working to deliver the outcomes will come up with better options and evaluate them to determine which ones will best achieve our target condition. The impact map can be considered a set of assumptions—for example, in Figure 9-1, we assume that standardizing exception codes will reduce nonstandard orders, which will reduce the cost of processing nonstandard transactions.
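The four questions of an impact map (Why? Who? How? What?) translate naturally into a nested structure. The sketch below is a hypothetical map loosely modeled on the nonstandard-orders example; every actor, impact, and deliverable in it is invented for illustration and is not taken from the book's figure.

```python
# A minimal impact map: goal -> actors -> impacts -> deliverables.
# All entries are hypothetical, loosely based on the nonstandard-orders
# example discussed in the text.
impact_map = {
    "goal": {                                     # Why?
        "target_condition": "Halve the cost of processing nonstandard orders",
        "acceptance_criteria": "Processing cost per nonstandard order measurably reduced",
    },
    "actors": [                                   # Who?
        {
            "name": "Order-entry staff",
            "impacts": [                          # How?
                {
                    "behavior_change": "Submit fewer nonstandard orders",
                    "deliverables": [             # What?
                        "Standardize exception codes",
                        "Training on standard order types",
                    ],
                }
            ],
        }
    ],
}

def deliverables(imap):
    """Flatten the map to the candidate solutions at its third level."""
    return [d
            for actor in imap["actors"]
            for impact in actor["impacts"]
            for d in impact["deliverables"]]

print(deliverables(impact_map))
```

Note that the deliverables sit at the deepest level of the structure: they are the least important part of the map, and the easiest to replace, which mirrors the point that possible solutions are assumptions rather than commitments.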

For this tool to work effectively, it’s critical to have the right people involved in the impact-mapping exercise. It might be a small, cross-functional team including business stakeholders, technical staff, designers, QA (where applicable), IT operations, and support. If the exercise is conducted purely by business stakeholders, they will miss the opportunity to examine the assumptions behind the target conditions and to get ideas from the designers and engineers who are closest to the problem. One of the most important goals of impact mapping is to create a shared understanding between stakeholders, so leaving some of them out dooms the exercise to irrelevance.

Once we have a prioritized list of target conditions and impact maps created collaboratively by technical and business people, it is up to the teams to determine the shortest possible path to the target condition.

This tool differs in important ways from many standard approaches to thinking about requirements. Here are some of the important differences and the motivations behind them:

There are no lists of features at the program level

Features are simply a mechanism for achieving the goal. To paraphrase Adzic, if achieving the target condition with a completely different set of features than we envisaged won’t count as success, we have chosen the wrong target condition. Specifying target conditions rather than features allows us to rapidly respond to changes in our environment and to the information we gather from stakeholders as we work towards the target condition. It prevents “feature churn” during the iteration. Most importantly, it is the most effective way to make use of the talents of those who work for us; this motivates them by giving them an opportunity to pursue mastery, autonomy, and purpose.

There is no detailed estimation

We aim for a list of target conditions that is a stretch goal—in other words, if all our assumptions are good and all our bets pay off, we think it would be possible to achieve them. However, this rarely happens, which means we may not achieve some of the lower-priority target conditions. If we are regularly achieving much less, we need to rebalance our target conditions in favor of process improvement goals. Keeping the iterations short—2–4 weeks initially—enables us to adjust the target conditions in response to what we discover during the iteration. This allows us to quickly detect if we are on a wrong path and try a different approach before we overinvest in the wrong things.

There are no “architectural epics”

The people doing the work should have complete freedom to do whatever improvement work they like (including architectural changes, automation, and refactoring) to best achieve the target conditions. If we want to drive out particular goals which will require architectural work, such as compliance or improved performance, we specify these in our target conditions.

Performing User Research

Impact mapping provides us with a number of possible solutions and a set of assumptions for each candidate solution. Our task is to find the shortest path to the target condition. We select the one that seems shortest, and validate the solution—along with the assumptions it makes—to see if it really is capable of delivering the expected value (as we have seen, features often fail to deliver the expected value). There are multiple ways to validate our assumptions.

First, we create a hypothesis based on our assumption. In Lean UX, Josh Seiden and Jeff Gothelf suggest the template shown in Figure 9-2 to use as a starting point for capturing hypotheses.155

Figure 9-2. Jeff Gothelf’s template for hypothesis-driven development

In this format, we describe the parameters of the experiment we will perform to test the value of the proposed feature. The outcome describes the target condition we aim to achieve.

As with the agile story format, we summarize the work (for example, the feature we want to build or the business process change we want to make) in a few words to allow us to recall the conversation we had about it as a team. We also specify the persona whose behavior we will measure when running the experiment. Finally, we specify the signal we will measure in the experiment. In online controlled experiments, discussed in the next section, this is known as the overall evaluation criterion for the experiment.
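The four elements just described (the work, the persona, the signal, and the outcome) can be captured in a small structure. The field names and the sentence it generates are our paraphrase of the Lean UX template, not Gothelf and Seiden's exact wording.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """One testable product hypothesis (field names are illustrative)."""
    work: str      # the feature or process change, in a few words
    persona: str   # whose behavior we will measure
    signal: str    # the metric we will measure (the OEC of the experiment)
    outcome: str   # the target condition we aim to achieve

    def statement(self) -> str:
        return (f"We believe that {self.work}, for {self.persona}, "
                f"will achieve {self.outcome}. We will know this is true "
                f"when we see {self.signal}.")

h = Hypothesis(
    work="showing similar items when a storefront search returns no results",
    persona="buyers browsing a seller's storefront",
    signal="a statistically significant lift in visits to the payment page",
    outcome="increased revenue per visit",
)
print(h.statement())
```

Forcing the hypothesis into this shape is the useful part: if you cannot name a persona or a measurable signal, you do not yet have an experiment.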

Once we have a hypothesis, we can start to design an experiment. This is a cross-functional activity that requires collaboration between design, development, testing, techops, and analysis specialists, supported by subject matter experts where applicable. Our goal is to minimize the amount of work we must perform to gather a sufficient amount of data to validate or falsify the assumptions of our hypothesis. There are multiple types of user research we can perform to test our hypothesis, as shown in Figure 9-3.156 For more on different types of user research, read UX for Lean Startups (O’Reilly) by Laura Klein.

Figure 9-3. Different types of user research, courtesy of Janice Fraser

The key outcome of an experiment is information: we aim to reduce the uncertainty as to whether the proposed work will achieve the target condition. There are many different ways we can run experiments to gather information. Bear in mind that experiments will often have a negative or inconclusive result, especially in conditions of uncertainty; this means we’ll often need to tune, refine, and evolve our hypotheses or come up with a new experiment to test them.

The key to the experimental approach to product development is that we do no major new development work without first creating a hypothesis so we can determine if our work will deliver the expected value.157

Online Controlled Experiments

In the case of an internet-based service, we can use a powerful method called an online controlled experiment, or A/B test, to test a hypothesis. An A/B test is a randomized, controlled experiment to discover which of two possible versions of a web page produces a better outcome. When running an A/B test, we prepare two versions of a page: a control (typically the existing version of the page) and a new treatment we want to test. When a user first visits our website, the system decides which experiments that user will be a subject for, and for each experiment chooses at random whether they will view the control (A) or the treatment (B). We instrument as much of the user’s interaction with the system as possible to detect any differences in behavior between the control and the treatment.

MOST GOOD IDEAS ACTUALLY DELIVER ZERO OR NEGATIVE VALUE

Perhaps the most eye-opening result of A/B testing is how many apparently great ideas do not improve value, and how utterly impossible it is to distinguish the lemons in advance. As discussed in Chapter 2, data gathered from A/B tests by Ronny Kohavi, who directed Amazon’s Data Mining and Personalization group before joining Microsoft as General Manager of its Experimentation Platform, reveal that 60%–90% of ideas do not improve the metric they were intended to improve.

Thus if we’re not running experiments to test the value of new ideas before completely developing them, the chances are that about 2/3 of the work we are doing is of either zero or negative value to our customers—and certainly of negative value to our organization, since this work costs us in three ways. In addition to the cost of developing the features, there is an opportunity cost associated with more valuable work we could have done instead, and the cost of the new complexity they add to our systems (which manifests itself as the cost of maintaining the code, a drag on the rate at which we can develop new functionality, and often, reduced operational stability and performance).

Despite these terrible odds, many organizations have found it hard to embrace running experiments to measure the value of new features or products. Some designers and editors feel that it challenges their expertise. Executives worry that it threatens their job as decision makers and that they may lose control over the decisions.

Kohavi, who coined the term “HiPPO,” says his job is “to tell clients that their new baby is ugly,” and carries around toy rubber hippos to give to these people to help lighten the mood and remind them that most “good” ideas aren’t, and that it’s impossible to tell in the absence of data which ones will be lemons.

By running the experiment with a large enough number of users, we aim to gather enough data to demonstrate a statistically significant difference between A and B for the business metric we care about, known as the overall evaluation criterion, or OEC (compare the One Metric That Matters from Chapter 4). Kohavi suggests optimizing for and measuring customer lifetime value rather than short-term revenue. For a site such as Bing, he recommends using a weighted sum of factors such as time on site per month and visit frequency per user, with the aim being to improve the overall customer experience and get them to return.
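A weighted-sum OEC of the kind Kohavi describes might be sketched as follows; the factor names and weights below are invented for illustration and are not Bing's actual criterion.

```python
# Hypothetical OEC: a weighted sum of long-term engagement factors.
# Each factor is assumed to be normalized so 1.0 means "same as baseline".
WEIGHTS = {
    "sessions_per_user_per_month": 0.5,
    "minutes_on_site_per_session": 0.3,
    "queries_with_quick_success": 0.2,
}

def oec(metrics: dict) -> float:
    """Overall evaluation criterion: weighted sum of normalized factors."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

control   = {"sessions_per_user_per_month": 1.00,
             "minutes_on_site_per_session": 1.00,
             "queries_with_quick_success": 1.00}
treatment = {"sessions_per_user_per_month": 1.04,
             "minutes_on_site_per_session": 0.99,
             "queries_with_quick_success": 1.02}

# A positive difference means the treatment wins on the OEC, even though
# one individual factor (minutes on site) went down.
print(oec(treatment) - oec(control))
```

The design choice worth noticing is that the OEC trades factors off explicitly: a treatment that increases visit frequency slightly while reducing time per session can still win, which is exactly the long-term-value behavior a single short-term revenue metric would miss.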

Unlike data mining, which can only discover correlations, A/B testing has the power to show a causal relationship between a change on a web page and a corresponding change in the metric we care about. Companies such as Amazon and Microsoft typically run hundreds of experiments in production at any one time and test every new feature using this method before rolling it out. Every visitor to Bing, Microsoft’s web search service, will be participating in about 15 experiments at a time.158

USING A/B TESTING TO CALCULATE THE COST OF DELAY FOR PERFORMANCE IMPROVEMENTS

At Microsoft, Ronny Kohavi’s team wanted to calculate the impact of improving the performance of Bing searches. They did it by running an A/B test in which they introduced an artificial server delay for users who saw the “B” version. They were able to calculate a dollar amount for the revenue impact of performance improvements, discovering that “an engineer that improves server performance by 10 msec more than pays for his fully-loaded annual costs.” This calculation can be used to determine the cost of delay for performance improvements.
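The shape of that calculation can be reproduced with invented numbers; none of the figures below are Bing's actual data, they only show how an A/B-measured revenue sensitivity turns into a dollar value for a latency improvement.

```python
# Hypothetical inputs, all assumed for illustration:
annual_revenue         = 2_000_000_000  # $/year for the service
revenue_loss_per_100ms = 0.005          # 0.5% revenue drop per 100 ms of added
                                        # delay, as measured by the A/B test

# Value of shaving one millisecond off server response time:
value_per_msec = annual_revenue * revenue_loss_per_100ms / 100

improvement_ms    = 10        # the engineer's performance improvement
fully_loaded_cost = 400_000   # $/year per engineer (assumed)

value = improvement_ms * value_per_msec
print(f"10 ms improvement worth ${value:,.0f}/year "
      f"vs ${fully_loaded_cost:,} fully loaded cost")
```

With these assumed numbers, a 10 ms improvement is worth about a million dollars a year, comfortably more than the engineer's cost; the same per-millisecond figure can then be plugged into a Cost of Delay calculation for scheduling the performance work.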

When we create an experiment to use as part of A/B testing, we aim to do much less work than it would take to fully implement the feature under consideration. We can calculate the maximum amount we should spend on an experiment by determining the expected value of the information we will gain from running it, as discussed in Chapter 3 (although we will typically spend much less than this).

In the context of a website, here are some ways to reduce the cost of an experiment:

Use the 80/20 rule and don’t worry about corner cases

Build the 20% of functionality that will deliver 80% of the expected benefit.

Don’t build for scale

Experiments on a busy website are usually only seen by a tiny percentage of users.

Don’t bother with cross-browser compatibility

With some simple filtering code, you can ensure that only users with the correct browser get to see the experiment.

Don’t bother with significant test coverage

You can add test coverage later if the feature is validated. Good monitoring is much more important when developing an experimentation platform.
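The browser-filtering point above can be as crude as matching on the user-agent string; a hypothetical sketch:

```python
def eligible_for_experiment(user_agent: str) -> bool:
    """Only show the experiment to browsers we actually built it for.

    Crude substring matching is fine here: ineligible users simply see
    the control, so a misclassification costs us nothing.
    """
    supported = ("Chrome", "Firefox")
    return any(name in user_agent for name in supported)

assert eligible_for_experiment("Mozilla/5.0 ... Chrome/120.0")
assert not eligible_for_experiment("Mozilla/5.0 ... MSIE 8.0")
```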

An A/B Test Example

Etsy is a website where people can sell handcrafted goods. Etsy uses A/B testing to validate all major new product ideas. In one example, a product owner noticed that searching for a particular type of item on somebody’s storefront came up with zero results, and wanted to find out if a feature that showed similar items from somebody else’s storefront would increase revenue. To test the hypothesis, the team created a very simple implementation of the feature. They used a configuration file to determine what percentage of users would see the experiment.

Users hitting the page on which the experiment is running will be randomly allocated either to a control group or to the group that sees the experiment, based on the weighting in the configuration file. Risky experiments will only be seen by a very small percentage of users. Once a user is allocated to a bucket, they stay there across visits so the site has a consistent appearance to them.
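Sticky, weighted assignment of this kind is commonly implemented by hashing the user ID together with the experiment name, so the same user lands in the same bucket on every visit without storing any state. The sketch below uses an invented config format and illustrates the general pattern, not necessarily Etsy's implementation.

```python
import hashlib

# Hypothetical config: fraction of users who see each experiment's treatment.
EXPERIMENT_WEIGHTS = {
    "similar_items_on_empty_search": 0.01,   # risky: only 1% of users
    "new_checkout_button": 0.50,
}

def bucket(user_id: str, experiment: str) -> str:
    """Deterministically assign a user to 'treatment' or 'control'.

    Hashing (experiment, user_id) gives a stable pseudorandom draw, so a
    user sees the same variant across visits, and different experiments
    assign the same user independently of one another.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    draw = int(digest[:8], 16) / 0xFFFFFFFF   # uniform in [0, 1]
    return "treatment" if draw < EXPERIMENT_WEIGHTS[experiment] else "control"

# The same user always lands in the same bucket:
assert bucket("user42", "new_checkout_button") == bucket("user42", "new_checkout_button")
```

Raising the weight in the config file gradually widens exposure, which is what makes the same mechanism serve both risky early experiments and the eventual full rollout.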

MAKING IT SAFE TO FAIL

A/B testing allows teams to define the constraints, limits, or thresholds to create a safe-to-fail experiment. The team can define the control limit of a key metric before testing so they can roll back or abort the test if this limit is reached (e.g., conversion drops below a set figure). Determining, sharing, and agreeing upon these limits with all stakeholders before conducting the experiment will establish the boundaries within which the team can experiment safely.

Users’ subsequent behavior is then tracked and measured as a cohort—for example, we might want to see how many then make it to the payment page. Etsy has a tool, shown in Figure 9-4, which measures the difference in behavior for various endpoints and indicates when it has reached statistical significance at a 95% confidence level. For example, for “site—page count,” the bolded “+0.26%” indicates the experiment produces a statistically significant 0.26% improvement over the control. Experiments typically have to run for a few days to produce statistically significant data.
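Checking whether a difference in conversion rates is significant at the 95% level can be done with a two-proportion z-test. The sketch below uses only the standard library and invented counts; a real experimentation platform would do considerably more (power analysis, corrections for running many tests at once).

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (lift, z) for conversion counts conv_* out of n_* users."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return p_b - p_a, z

# Hypothetical counts: control vs. treatment users reaching the payment page.
lift, z = two_proportion_z(conv_a=4_000, n_a=100_000,
                           conv_b=4_300, n_b=100_000)
significant = abs(z) > 1.96   # two-sided test at the 95% level
print(f"lift={lift:.4%} z={z:.2f} significant={significant}")
```

This also shows why experiments must run for days: with small lifts and modest conversion rates, the standard error only shrinks as the square root of the sample size.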

Generating a change of more than a few percent in a business metric is rare, and can usually be ascribed to Twyman’s Law: “If a statistic looks interesting or unusual it is probably wrong.”

Figure 9-4. Measuring changes in user behavior using A/B testing

If the hypothesis is validated, more work can be done to build out the feature and make it scale, until ultimately the feature is made available to all users of the site. Turning the visibility to 100% of users is equivalent to publicly releasing the feature—an important illustration of the difference between deployment and release which we discussed in Chapter 8. Etsy always has a number of experiments running in production at any time. From a dashboard, you can see which experiments are planned, which are running, and which are completed, which allows people to dive into the current metrics for each experiment, as shown in Figure 9-5.

Figure 9-5. Experiments currently running at Etsy

ALTERNATIVES TO A/B TESTING

Although we spend a lot of time on A/B testing in this chapter, it is just one of a wide range of experimental techniques for gathering data. User experience designers have a variety of tools to get feedback from users, from lo-fi prototypes to ethnographic research methods such as contextual enquiry, as shown in Figure 9-3. Lean UX: Applying Lean Principles to Improve User Experience discusses a number of these tools and how to apply them in the context of hypothesis-driven development.159

Prerequisites for an Experimental Approach to Product Development

Convincing people to gather—and then pay attention to—real data from experimentation, such as A/B testing, is hard enough. But an experimental, scientific approach to creating customer value has implications for the way we do work, as well as for the way we think about value. As Dan McKinley of Etsy points out,160 experimentation can’t be bolted on to a waterfall product development process. If we get to the end of several weeks (or months) of work and attempt an experiment, there’s a very good chance we’ll find the huge batch of work we did either has zero effect or makes things worse. At that point we’ll have to throw it all away because there is no way to accurately identify the effect of each specific change introduced.

This is an extremely painful decision, and in practice many teams succumb to the sunk cost fallacy by giving undue weight to the investment made to date when taking this decision. They ignore the data and deploy the product as is because shelving the work is considered a total failure, whereas successful deployment of anything into production is perceived as success—so long as it is on time and on budget.

If we’re going to adopt a thorough experimental approach, we need to change what we consider to be the outcome of our work: not just validated ideas but the information we gain in the course of running the experiments. We also need to change the way we think about developing new ideas; in particular, it’s essential to work in small batches and test every assumption behind the idea we are validating. This, in turn, requires that we implement continuous delivery, as described in Chapter 8.

Working in small batches creates flow—a key element of Lean Thinking. But small batches are hard to achieve, for both philosophical and technical reasons. Some people have a problem with taking an incremental approach to creating products. A common objection to an experimental approach is that it leads to locally optimal but globally suboptimal decisions, and that it compromises the overall integrity of the product, murdering a beautiful holistic vision by a thousand A/B tests.

While it is certainly possible to end up with an ugly, overcomplex product when teams fail to take a holistic approach to user experience, this is not an inevitable outcome of A/B testing. Experimentation isn’t supposed to replace having a vision for your product. Rather, it enables you to evolve your strategy and vision rapidly in response to real data from customers using your products in their environment. A/B testing will not be effective in the absence of a vision and strategy. Product managers, designers, and engineers need to collaborate and apply the lessons of design thinking in order to take a long-term view of the needs of users and establish a direction for the product.

What Is Design Thinking?

Tim Brown, CEO and President of IDEO and one of the key figures in design thinking, says, “As a style of thinking, it is generally considered the ability to combine empathy for the context of a problem, creativity in the generation of insights and solutions, and rationality to analyze and fit solutions to the context.” We discuss design thinking and Lean UX further in Chapter 4.

There are two further obstacles to taking an experimental approach to product development. First, designing experiments is tricky: we have to prevent them from interfering with each other, apply alerts to detect anomalies, and design them to produce valid results. At the same time, we want to minimize the amount of work we must do to gather statistically significant data.

Finally, taking a scientific approach to customer and product development requires intensive collaboration between product, design, and technical people throughout the lifecycle of every product. This is a big cultural change for many enterprises where technical staff do not generally contribute to the overall design process.

These obstacles are the reason why we strongly discourage people from adopting the tools discussed in this chapter without first putting in place the foundations described in the earlier chapters in Part III.

INNOVATION REQUIRES A CULTURE OF EXPERIMENTATION

Greg Linden, who developed Amazon’s first recommendations engine, came up with a hypothesis that showing personalized recommendations at checkout time might convince people to make impulse buys—similar to the rack at the checkout lane in a grocery store but compiled personally for each customer by an algorithm. However, a senior vice-president who saw Greg’s demo was convinced it would distract people from checking out. Greg was forbidden to do any further work on the feature.161

Linden disobeyed the SVP and put an A/B test into production. The A/B test demonstrated such a clear increase in revenue when people received personalized recommendations at check-out that the feature was built out and launched with some urgency.

Is it even conceivable that an engineer at your company could push an A/B test into production in the face of censure by a senior executive? If the experiment’s data proved the executive wrong, how likely is it that the feature would be picked up rather than buried? As Linden writes, “Creativity must flow from everywhere. Whether you are a summer intern or the CTO, any good idea must be able to seek an objective test, preferably a test that exposes the idea to real customers. Everyone must be able to experiment, learn, and iterate. Position, obedience, and tradition should hold no power. For innovation to flourish, measurement must rule.”

A culture based on measurement and experimentation is not antithetical to crazy ideas, divergent thinking, and abductive reasoning. Rather, it gives people license to pursue their crazy ideas—by making it easy to gather real data to back up the good crazy and reject the bad crazy. Without the ability to run cheap, safe-to-fail experiments, such ideas are typically trampled by a passing HiPPO or by the mediocrity of decision-by-committee.

One of the most common challenges encountered in software development is the focus of teams, product managers, and organizations on managing cost rather than value. This typically manifests itself in undue effort spent on zero-value-add activities such as detailed upfront analysis, estimation, scope management, and backlog grooming. These symptoms are the result of focusing on maximizing utilization (keeping our expensive people busy) and output (measuring their work product)—instead of focusing on outcomes, minimizing the output required to achieve them, and reducing lead times to get fast feedback on our decisions.

Conclusion

Most ideas—even apparently good ones—deliver zero or negative value to users. By focusing on the outcomes we wish to achieve, rather than solutions and features, we can separate what we are trying to do from the possible ways to do it. Then, following the Principle of Mission, teams can perform user research (including low-risk, safe-to-fail online experiments) to determine what will actually provide value to customers—and to our organization.

By combining impact mapping and user research with the Improvement Kata framework presented in Chapter 6, we can scale agile software delivery and combine it with design thinking and an experimental approach to product development. This allows us to rapidly discover, develop, and deliver high-value, high-quality solutions to users at scale, harnessing the skill and ingenuity of everybody in the organization.

Questions for readers:

§ What happens at your organization when a substantial amount of effort has been invested in an idea that turns out to provide little value to users or the organization, or even to make things worse?

§ Have the expected customer outcomes for the features you are working on been quantified? Do you have a way to measure the actual outcomes?

§ What kind of user research do you perform on prototypes before releasing them more widely? How might you get that feedback more quickly and cheaply?

§ When was the last time you personally observed your product used or discussed in real life?

§ Can you think of a cheap way to test the value of the next piece of work in your backlog?

150 [anderson]

151 [gilb-88], p. 23.

152 See [yu], [lapouchnian], and [gilb-05] for more on goal-oriented requirements engineering.

153 This is an old trick used by Taiichi Ohno, called “the five whys.”

154 [adzic], l. 146.

155 [gothelf], p. 23.

156 This diagram was developed by Janice Fraser; see http://slidesha.re/1v715bL.

157 In many ways, this approach is just an extension of test-driven development. Chris Matts came up with a similar idea he calls feature injection.

158 http://www.infoq.com/presentations/controlled-experiments

159 [gothelf]

160 http://slidesha.re/1v71gUs

161 http://bit.ly/1v71kmW