Avoiding the Perils of A/B Split Testing

A/B testing is widely used in product development, popularized as a fundamental component of the Lean Startup  framework, and providing a scientific way of validating product and business improvements. The concept is simple… put some customers in the new experience, compare the results against customers that didn’t get the new experience, and better metrics validates the improvement. In reality, this process of validation is very complicated and there is no shortage of hazards leading you to poor outcomes.

Creating Information out of Data is Hard

IMVU had a culture of data-validated decisions from almost day one, and as a result we made it easy for anybody to create their own split test and validate the business results of their efforts. It took minutes to implement the split test and compare oh so many metrics between the cohorts. All employees had access to this system and we tested everything, all the time. A paper released in 2009,  Controlled experiments on the web: survey and practical guide, reinforced that split testing was the undisputed arbiter or truth. We were clearly on the right path. 

While the ability to self-assess progress created a very empowering culture, we were largely ill-equipped to understand the nuances of what the data actually meant. Years later we would start to better understand, we don’t know how much we don’t know.

First Know Why

The first opportunity to make a mistake with split testing is deciding to test in the first place. When creating a split test has a very low barrier, it is easy to err on the side of just testing everything so that you can have the data if you need it. But every test has a lot of hidden costs than come from false-positives, clarification of data, shiny-object distractions, inconsistent customer experiences, and additional opportunities for introducing bugs.

Recognizing that being a split test packrat has a real cost, there should be some requirement for incurring this cost. Are very least, answering the question, “What are the significant changes that will be made as a result of this test?” Additional pre-test work to specify what will be measured, and what results will determine success or failure can also go a long way towards ensuring time spent testing is valuable.

Test Implementation is a Project

IMVU had a great framework to make test implementation a seemingly simple task, with a few lines of code of creating a branch for the test experience, and leaving the current experience as the control. Again, this made creating tests seem deceptively easy, and left openings for measuring the wrong thing.

Often a split test is a cross-functional effort, with an engineer handling the implementation and the customer being any combination of a product manager, acquisition team, marketing representative, revenue officer, or generally interested party. In some cases, the interpretation of test data is done by another person altogether. Correctly understanding what the internal customer wants to know, capturing the right data, and converting that data into information ends up with many points of communication that must be accurate to deliver a valid test.

For example, the acquisition team wants to test a new landing page, simply reordering the registration fields because they think it will improve the registration completion rate. The engineer realizing this is a no-brainer takes the 15 minutes before lunch to create the quick test, two paths and the test is running. However, the registration page has both manual registration and sign in with a social network account, so the test is including a lot of users that are social logins, irrelevant to the registration fields. This subtle nuance means that the impact of the registration field changes will likely be lost as the irrelevant data acts as a damper. What the customer wanted to know isn’t what the test is answering, and it’s likely that nobody on the project knows there is an error.

The ease of creating a split test should not be conflated with delivering quality results from a test. Doing it right is a project and requires investment of resources consistent with any other project.

WTF Do These Results Actually Mean?

Assuming you were diligent in your experiment design, you captured all of the relevant data, and you avoided some of the common errors of A/B testing, you now need to make sense of the data. In the best cases, you’re looking at something like “the registration landing page increased conversions from 1.83% to 2.01%”, in the worst cases you find something like “customers are engaging with messaging feature 17% longer… but their lifetime value has dropped by 4%”, and now there is work to put together a narrative that explains the perplexing results.

In 2012 I read a paper, Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained, and I had what I like to call an, “oh shit” moment. Highly controlled experiments, run by companies with world-class, dedicated analytics teams were getting perplexing results that required substantial research to understand what was actually happening. What chance did we have of getting this right when we are running 15+ experiments a week with training consisting of a one page internal wiki version of, “A/B Testing for Dummies”?

The tl;dr summary of the paper, without deep consideration for the “why” behind the change in metrics, positive results may be antithetical to what you are actually trying to achieve.

The up-front work to limit the scope of the experiment and how it will be measured / interpreted can help, assuming you have the self control to ignore the data outside of scope. Often these perplexing results require follow-up experiments to better isolate cause and effect. I also highly recommend talking to customers – often qualitative insights from hearing their experiences can often help make sense of what the quantitative results were hiding.

You’re Biased. No, Really, You Are

I’m sure there are a lot of great reasons we humans are wired to think the way we do, and this wiring probably served us very well in many situations. However, humans also come standard with cognitive biases, built-in tendencies to make irrational decisions. Unfortunately, putting a bunch of effort into building something and then getting a giant pile of metrics is a perfect enabler for a cognitive biases and craptastic decisions.

While numerous biases are working against you, with a buffet of metrics one of the most common is the Texas sharpshooter fallacy, in which the all of the test metrics that are improvements over the control metrics are used to demonstrate the success of the test. With a 95% confidence rate, 1 out of 20 metrics tracked are expected to show a false positive improvement, so even an A/A test (two separate cohorts with identical experiences) would likely show “improvements”. Before we eliminated the practice of metric-sniping at IMVU, it wasn’t uncommon to hear somebody say something like, “my pet project to streamline registration didn’t change registration, but it does deliver a 5% improvement in [the completely unrelated] customer lifetime value, so we should keep it.”

There are process controls that can help reduce the potential impact of various biases, in particular around defining and constraining each test. However, being aware of these biases and encouraging a culture consistent with the dialectical method can help make better product decisions, even beyond interpreting test results.

Talk to Your Customers!

One of the biggest risks that come from over-reliance on split testing is seeing it as a more convenient method of getting customer feedback. Why spend 30 minutes on the phone with one customer when you can simply measure the actual actions of thousands of customers?

Looking at data and sending surveys may seem like an efficient use of time, but that highly structured approach is unlikely to surface critical customer insights. Metrics and surveys will often answer the “what”, but almost always miss the “why”, the most critical driver of valuable insights. There is no substitute for talking to your customers.

In the words of Steve Blank, “Get Out of the Building.”


I’m interested in hearing other stories where split testing has made an impact, either positive or negative. Please share a comment if you have one!

2 Replies to “Avoiding the Perils of A/B Split Testing”

  1. I am in the enterprise search business. We are able to test different search settings with A/B testing.

    One example is partial match settings. When we test we might test if the partial match entry criteria is 3 keywords or 5.

    We can then look at the metrics and determine which produced more clicks or which produced a great query clickthru.

    I believe that in this example the metrics are king!

  2. A tip of the hat to one of the biggest ever: New Coke. Bill Cosby later complained that Coke made him look like an idiot. But that’s a whole other story. Coke had rock-solid data from A/B tests against Pepsi, on which they spent staggering amounts of money. But they didn’t see that “Which tastes better?” wasn’t the same question as “Should we change Coke to taste like this?” They got their heads handed to them.

Leave a Reply

Your email address will not be published. Required fields are marked *