Interview with Stefan H. Thomke, Author of "Experimentation Works: The Surprising Power of Business Experiments"

By: Braden Kelley

I recently had the opportunity to interview fellow author Stefan H. Thomke, the William Barclay Harding Professor of Business Administration at Harvard Business School, about his new book, Experimentation Works: The Surprising Power of Business Experiments, and to explore the important role that experimentation plays in business and innovation.

Why is there a business experimentation imperative?

My book Experimentation Works is about how to continuously innovate through business experiments. Innovation is important because it drives profitable growth and creates shareholder value. But here is the dilemma: despite being awash in information coming from every direction, today’s managers operate in an uncertain world where they lack the right data to inform strategic and tactical decisions. Consequently, for better or worse, our actions tend to rely on experience, intuition, and beliefs. But this all too often doesn’t work. And all too often, we discover that ideas that are truly innovative go against our experience and assumptions, or the conventional wisdom. Whether it’s improving customer experiences, trying out new business models, or developing new products and services, even the most experienced managers are often wrong, whether they like it or not. The book introduces you to many of those people and their situations—and how business experiments raised their innovation game dramatically.

What makes a good business experiment, and what are some of the keys to successful experiment design?

In an ideal experiment, testers separate an independent variable (the presumed cause) from a dependent variable (the observed effect) while holding all other potential causes constant. They then manipulate the former to study changes in the latter. The manipulation, followed by careful observation and analysis, yields insight into the relationships between cause and effect, which ideally can be applied and tested in other settings. To obtain that kind of learning—and ensure that each experiment contains the right elements and yields better decisions—companies should ask themselves seven important questions:

Does the experiment have a testable hypothesis?
Have stakeholders made a commitment to abide by the results?
Is the experiment doable?
How can we ensure reliable results?
Do we understand cause and effect?
Have we gotten the most value out of the experiment?
And finally, are experiments really driving our decisions?

Although some of the questions seem obvious, many companies conduct tests without fully addressing them.

Here is a complete list of elements that you may find useful:

Hypothesis

Is the hypothesis rooted in observations, insights, or data?
Does the experiment focus on a testable management action under consideration?
Does it have measurable variables, and can it be shown to be false?
What do people hope to learn from the experiments?

Buy-in

What specific changes would be made on the basis of the results?
How will the organization ensure that the results aren’t ignored?
How does the experiment fit into the organization’s overall learning agenda and strategic priorities?

Feasibility

Does the experiment have a testable prediction?
What is the required sample size? Note: The sample size will depend on the expected effect (for example, a 5 percent increase in sales).
Can the organization feasibly conduct the experiment at the test locations for the required duration?
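
To make the sample-size note concrete, here is a minimal sketch of a common textbook approximation for comparing two conversion rates. It is an illustration rather than a method from the book: the function name, the 10 percent baseline rate, the 5 percent significance level, and the 80 percent power are all assumptions chosen for the example, and the "5 percent increase" is treated as a relative lift in a conversion rate.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed in EACH group to detect a relative lift.

    Hypothetical helper using the normal approximation for a two-sided
    comparison of two proportions. All defaults are illustrative assumptions.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Assumed scenario: 10% baseline conversion, comparing a 5% and a 20% relative lift.
n_small_effect = sample_size_per_group(0.10, 0.05)
n_large_effect = sample_size_per_group(0.10, 0.20)
print(n_small_effect, n_large_effect)
```

Under these assumptions, the smaller expected effect requires a far larger sample than the bigger one, which is why the expected effect has to be stated before judging whether an experiment is doable.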

Reliability

What measures will be used to account for systemic bias, whether it’s conscious or unconscious?
Do the characteristics of the control group match those of the test group?
Can the experiment be conducted in either “blind” or “double-blind” fashion?
Have any remaining biases been eliminated through statistical analyses or other techniques?
Would others conducting the same test obtain similar results?

Causality

Did we capture all variables that might influence our metrics?
Can we link specific interventions to the observed effect?
What is the strength of the evidence? Correlations are merely suggestive of causality.
Are we comfortable taking action without evidence of causality?

Value

Has the organization considered a targeted rollout (that is, one that takes into account a proposed initiative’s effect on different customers, markets, and segments) to concentrate investments in areas where the potential payback is the highest?
Has the organization implemented only the components of an initiative with the highest return on investment?
Does the organization have a better understanding of what variables are causing what effects?

Decisions

Do we acknowledge that not every business decision can or should be resolved by experiments? But everything that can be tested should be tested.
Are we using experimental evidence to add transparency to our decision-making process?

Is there anything special about running online experiments?

In an A/B test, the experimenter sets up two experiences: the control (“A”) is usually the current system—considered the champion—and the treatment (“B”) is some modification that attempts to improve something—the challenger. Users are randomly assigned to the experiences, and key metrics are computed and compared. (A/B/C or A/B/n tests and multivariate tests, in contrast, assess more than one treatment or modifications of different variables at the same time.) Online, the modification could be a new feature, a change to the user interface (such as a new layout), a back-end change (such as an improvement to an algorithm that, say, recommends books at Amazon), or a different business model (such as an offer of free shipping). Whatever aspect of customer experiences companies care most about—be it sales, repeat usage, click-through rates, or time users spend on a site—they can use online A/B tests to learn how to optimize it. Any company that has at least a few thousand daily active users can conduct these tests. The ability to access large customer samples, to automatically collect huge amounts of data about user interactions on websites and apps, and to run concurrent experiments gives companies an unprecedented opportunity to evaluate many ideas quickly, with great precision, and at a negligible cost per additional experiment. Organizations can iterate rapidly, win fast, or fail fast and pivot. Indeed, product development itself is being transformed: all aspects of software—including user interfaces, security applications, and back-end changes—can now be subjected to A/B tests (technically, this is referred to as full stack experimentation).
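
The mechanics of an A/B test described above can be sketched in a few lines. This is a simplified illustration, not how any particular company or tool implements it: the function names, the experiment label, and the conversion counts are hypothetical, hash-based bucketing stands in for random assignment, and a two-proportion z-test is only one of several ways to compare the key metric.

```python
import hashlib
import math
from statistics import NormalDist


def assign_variant(user_id, experiment="demo-experiment"):
    """Assign a user to control ("A") or treatment ("B").

    Assumption: hashing the user ID gives a stable, effectively random
    50/50 split, so the same user always sees the same experience.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "B" if int(digest, 16) % 2 else "A"


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of A and B; returns (z, two-sided p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up counts for illustration: 500 of 10,000 users convert on the
# champion (A) and 600 of 10,000 convert on the challenger (B).
z, p_value = two_proportion_z_test(500, 10_000, 600, 10_000)
print(assign_variant("user-123"), round(z, 2), round(p_value, 4))
```

A positive z with a small p-value suggests the challenger outperformed the champion on this metric; in practice, teams would also fix the sample size in advance and check that the two groups are comparable.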

What are some of the keys to building a culture of large-scale experimentation?

Shared behaviors, beliefs, and values (aka culture) are often an obstacle to running more experiments in companies. For every online experiment that succeeds, nearly 10 don’t—and in the eyes of many organizations that emphasize efficiency, predictability, and “winning,” those failures are wasteful. To successfully innovate, companies need to make experimentation an integral part of everyday life—even when budgets are tight. That means creating an environment in which employees’ curiosity is nurtured, data trumps opinion, anyone (not just people in R&D) can conduct or commission a test, all experiments are done ethically, and managers embrace a new model of leadership. More specifically, companies have addressed some of these obstacles in the following ways:

They Cultivate Curiosity

Everyone in the organization, from the leadership on down, needs to value surprises, despite the difficulty of assigning a dollar figure to them and the impossibility of predicting when and how often they’ll occur. When firms adopt this mindset, curiosity will prevail and people will see failures not as costly mistakes but as opportunities for learning. Many organizations are also too conservative about the nature and amount of experimentation. Overemphasizing the importance of successful experiments may inadvertently encourage employees to focus on familiar solutions or those that they already know will work and avoid testing ideas that they fear might fail.

They Insist That Data Trump Opinions

The empirical results of experiments must prevail when they clash with strong opinions, no matter whose opinions they are. But this is rare among most firms for an understandable reason: human nature. We tend to happily accept “good” results that confirm our biases but challenge and thoroughly investigate “bad” results that go against our assumptions. The remedy is to implement the changes experiments validate with few exceptions. Getting executives in the top ranks to abide by this rule is especially difficult. But it’s vital that they do: Nothing stalls innovation faster than a so-called HiPPO—highest-paid person’s opinion. Note that I’m not saying that all management decisions can or should be based on experiments. Some things are very difficult, if not impossible, to conduct tests on—for example, strategic calls on whether to acquire a company. But if everything that can be tested online is tested, experiments can become instrumental to management decisions and fuel healthy debates.

They Embrace a Different Leadership Model

If most decisions are made through experiments, what’s left for managers to do, beyond developing the company’s strategic direction and tackling big decisions such as which acquisitions to make? There are at least three things:

Set a grand challenge that can be broken into testable hypotheses and key performance metrics. Employees need to see how their experiments support an overall strategic goal.

Put in place systems, resources, and organizational designs that allow for large-scale experimentation. Scientifically testing nearly every idea requires infrastructure: instrumentation, data pipelines, and data scientists. Several third-party tools and services make it easy to try experiments, but to scale things up, senior leaders must tightly integrate the testing capability into company processes.

Be a role model. Leaders have to live by the same rules as everyone else and subject their own ideas to tests. Bosses ought to display intellectual humility and be unafraid to admit, “I don’t know…” They should heed the advice of Francis Bacon, the forefather of the scientific method: “If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts, he shall end in certainties.”

What do you see as the difference between failures and mistakes?

Experimenting with many diverse—and sometimes absurd-seeming—ideas is crucial to innovation. But when people experiment more rapidly and more frequently, inevitably most tests fail. Only 10 to 20 percent of Bing’s and Google’s online experiments generate positive results: the challenger (B, the treatment) performs better than the champion (A, the status quo). Such early failures are not only desirable, but also necessary, since experimenters can quickly eliminate unfavorable options and refocus their efforts on more promising alternatives (often building on ideas that were initially unsuccessful). By confirming that the champion still rules, management can also shift its attention to other novel business tactics instead of engaging in fruitless debates over a challenger. In other words, “failing early and often” is not only desirable but also a natural byproduct of an experimentation program. Failures should not be confused with mistakes. Unlike failures, mistakes produce little new or useful information and are therefore without value. A poorly planned or badly conducted experiment, for instance, might force researchers to repeat the experiment. Another common mistake is not learning from past experience. When Amazon designed and built its first fulfillment centers, the company learned many lessons about the optimal location, layout, and material flows. Today, it has more than seventy-five centers in North America, and if it got major aspects of a new center wrong, we’d consider the project an example of poor execution. That is, we’d regard the project’s shortcomings as a repetition of past mistakes, not failures that have explicit learning objectives.

Why is it important to have an experiment pipeline?

To fuel a test pipeline, people have to constantly come up with new ideas, user problems, and need areas. Ideas can come from talking to users, from using a product, or from past experiments. Teams can also ask for surveys, lab tests, or other qualitative research and receive input from customer service on pain points and user preferences. Each team should find its own way to manage the hypothesis-generating process and to fill the test pipeline.

What does it take to become a more mature experimentation organization?

Most organizations that I’ve seen, non-digital or digital, B2C or B2B, have gone through a maturation journey that consists of five stages:

Awareness: Management knows that business experiments matter to innovation; “I experiment” generally means “I am trying something new,” but there is no process, rigorous framework, or testing tool for carrying it out; people have difficulty linking cause and effect; what they know has come primarily from experience, intuition, and informal observation.
Belief: Management accepts that a more disciplined approach to establishing cause and effect is needed; a rigorous framework and testing tools are adopted in small and specialized groups; measurement begins, but the impact on managerial decision making is small. Company leadership still views experimentation as peripheral to the organization.
Commitment: Management pledges to make experimentation core to learning and decision making; more resources are set aside; some innovation decisions and product roadmaps now require input from disciplined experiments (“Show me the experiment”); the positive impact on business outcomes can be measured.
Diffusion: Management realizes that large-scale testing is key to having a business impact; they roll out formal companywide programs and standards; the scientific method via experimentation is spread throughout the organization; people have broad access to training, checklists and case studies; managers consider experiments instrumental to achieving their business goals and require them before making decisions.
Embeddedness: Disciplined business experimentation becomes deeply rooted and democratized; teams (and people) are empowered and capable of designing and running their own experiments; the tools are accessed by a large percentage of an organization’s employees; experimentation becomes like running the numbers, and capabilities are continuously improved.

Maturity stages can overlap, but each is characterized by a different set of actions. The number of experiments can increase significantly as an organization matures.

What are some myths about experimentation that you’d like to dispel?

Becoming an experimentation organization will undoubtedly cause frictions, as for every action there will be an opposing reaction. The causes that I’ve come across cover a broad spectrum: inertia, anxiety, incentives, hubris, perceived costs and risks, and so on. But I have also found that managers aren’t always aware of the power of business experiments described in my book. This failure to understand and appreciate their true benefits has given rise to seven myths that undermine innovation.

Here is one example: A few years ago, I gave a presentation on business experimentation to a large audience of executives and entrepreneurs. The audience was intrigued until one participant, the founder and CEO of a national restaurant chain, energetically voiced his opposition to subjecting his employees’ ideas to rigorous tests. He strongly believed that innovation is about creativity, confidence, and vision and, in a loud voice, proclaimed: “Steve Jobs didn’t test any of his ideas.” His message was unambiguous: a greater focus on experiments will backfire, put great ideas at risk of being prematurely dismissed, and ultimately kill intuition and judgment.

But, I countered, it’s not about intuition versus experiments; in fact, they need each other. Intuition, customer insights, and qualitative research are valuable sources for new hypotheses, which may or may not be refuted, but hypotheses can often be improved through rigorous testing. The empirical evidence shows that even experts are poor at predicting customer behavior; in fact, they get it wrong most of the time. Wouldn’t it be preferable to know what does and does not work early and focus resources on the most promising ideas? After some participants sided with this reasoning, he gradually relented. (Curiously, I later found out that his company had been a user of a popular tool for running rigorous in-restaurant experiments, yet he was unaware of it.) With respect to his comment about Steve Jobs, it’s remarkable how many people believe that their intuition and creativity can match Jobs’ track record, until they don’t. Incidentally, let me dispel another myth: Apple does run experiments.

Thanks for all that, Stefan! I hope everyone has enjoyed this peek into the mind behind the interesting book Experimentation Works!

About the Author

Braden Kelley is a Design Thinking, Innovation and Transformation Consultant and a popular innovation speaker and workshop leader who helps companies plan organizational changes that are more human and less overwhelming. He is the author of Charting Change from Palgrave Macmillan and Stoking Your Innovation Bonfire from John Wiley & Sons. Braden has been advising companies since 1996, while living and working in England, Germany, and the United States. Braden earned his MBA from top-rated London Business School. Follow him on Twitter and LinkedIn.