Controlled Experiments aka A/B Testing


Controlled experiments are the beating heart of scientific inquiry, and good science is just as crucial for understanding your business.

Countless amazing discoveries have been unearthed through the scientific method, and in the modern era, the practice is well suited to help develop business intelligence. Most notably, it is applied through A/B testing.

Let’s start with the fundamentals of experimental and research design.

Lingo

The terminology varies by discipline, but the terms all refer to the same concepts.

We seek to find the effects of treatment variables on some quantifiable measurement.

In biology, the factor we deliberately change is called the independent variable, and the subjects that receive it make up the treatment group (or experimental group). Altering a single feature at a time lets us assay the resulting dependent variable, also called the response.

BIG N

We want to analyze the largest sample size (N) we practically can, because a larger sample gives a better approximation of the true population values. With a large N, the distribution of the sample mean also comes to closely resemble a normal distribution (the central limit theorem), which is a fundamental assumption of many common statistical tests.
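To make the effect of N concrete, here is a minimal simulation sketch using only the Python standard library. The function name `sample_mean_spread`, the exponential population, and the trial count are my own assumptions for illustration. It repeatedly draws samples of size n from a skewed population and measures how much the sample mean varies from draw to draw.

```python
import random
import statistics

def sample_mean_spread(n, trials=1000, seed=0):
    """Return the standard deviation of the sample mean across many samples of size n.

    The population here is exponential (skewed, not normal) purely as an
    illustrative assumption; any population with finite variance would do.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

# The spread of the sample mean shrinks as the sample size grows.
for n in (5, 50, 500):
    print(n, round(sample_mean_spread(n), 3))
```

The spread should fall roughly in proportion to one over the square root of n, which is why each additional order of magnitude in N buys a tighter estimate.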

Gold Standard

Bias needs to be limited. There are countless entry points for a researcher’s bias to seep into the analysis, from the initial design to the interpretation of the statistical results. Currently, the best way to reduce bias is to follow the ‘Gold Standard’ of clinical research: a double-blinded randomized controlled trial, where participants are randomly assigned to a treatment group, and neither the researchers nor the participants know who is in which group. Blinding the researchers is a good idea, because it reduces their unconscious desire to find the results they ‘want’ to find.
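The randomization step can be sketched in a few lines. This is one simple way to do it, not a clinical-grade procedure: the function name `randomize`, the group names, and the participant ids are all hypothetical.

```python
import random

def randomize(participants, groups=("control", "treatment"), seed=None):
    """Shuffle participants, then deal them into groups in round-robin order.

    Shuffling first means group membership is decided by chance, and dealing
    round-robin keeps the group sizes within one of each other.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {p: groups[i % len(groups)] for i, p in enumerate(shuffled)}

# Hypothetical participant ids, for illustration only.
assignment = randomize([f"user_{i}" for i in range(10)], seed=42)
print(assignment)
```

Passing a seed makes the assignment reproducible for auditing, while leaving it as None gives a fresh random assignment each run.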

There is another option, a TRIPLE-blinded study, where even the analysts are blinded while conducting the statistical analysis. Modern programming languages make this easier, since the group labels can be masked in the data before the analysts ever see it.
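As a rough sketch of what masking could look like, the snippet below swaps real group names for opaque codes and returns the unmasking key separately. The function name `mask_labels`, the code format, and the sample data are assumptions made for illustration.

```python
import random

def mask_labels(assignment, seed=None):
    """Replace real group names with opaque codes.

    Returns (masked_assignment, key). The key maps each code back to the real
    group name and should be held by someone outside the analysis until the
    analysis is finished.
    """
    rng = random.Random(seed)
    groups = sorted(set(assignment.values()))
    codes = [f"group_{i + 1}" for i in range(len(groups))]
    rng.shuffle(codes)  # so the code number does not hint at the real group
    to_code = dict(zip(groups, codes))
    masked = {unit: to_code[group] for unit, group in assignment.items()}
    key = {code: group for group, code in to_code.items()}
    return masked, key

# Hypothetical data: the analyst receives only `masked`, never `key`.
masked, key = mask_labels({"u1": "control", "u2": "treatment", "u3": "control"})
print(masked)
```

The analyst can still compare group_1 against group_2, but cannot tell which one received the treatment until the key is revealed.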

We also want to REPEAT the experiment, to catch mistakes, confirm that the result is repeatable, and reduce the chance that it was a Type I error (a false positive) or a Type II error (a false negative) with respect to the null hypothesis.

Randomized Block Design

This is essentially an extension of the classical gold standard. Rather than just randomizing everyone into a simple control/placebo/experimental split, we first subdivide our participants into (hopefully) logical and statistically relevant blocks, then randomize within each block. The easiest example for people to imagine is blocking by sex. Given the physiological differences between the groups, variability within the male block or within the female block would be smaller than the variability between them. Randomized Block Design (RBD) is useful for testing a broad range of experimental treatment conditions across different groups, and is particularly useful for something like market research, economics, or, in the case of traditional data science, A/B testing.
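Here is a minimal sketch of randomizing within blocks. The function name `block_randomize`, the `block_of` callback, and the sample users are hypothetical, and the block labels are assumed to be sortable values such as strings.

```python
import random
from collections import defaultdict

def block_randomize(units, block_of, groups=("A", "B"), seed=None):
    """Randomize within each block so every block is split evenly across groups.

    units: iterable of unit ids.
    block_of: function mapping a unit id to its block label.
    """
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of(unit)].append(unit)

    assignment = {}
    for label in sorted(blocks):  # fixed block order keeps seeded runs reproducible
        members = blocks[label]
        rng.shuffle(members)
        for i, unit in enumerate(members):
            assignment[unit] = groups[i % len(groups)]
    return assignment

# Hypothetical users tagged with a blocking attribute.
users = {"u1": "female", "u2": "female", "u3": "male", "u4": "male"}
print(block_randomize(users, block_of=users.get, seed=7))
```

Because the shuffle happens inside each block, every block contributes equally to A and B, so the blocking attribute cannot end up lopsided between the groups by chance.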

A/B Testing

In A/B testing, a common example of applying experimental design and terminology would be a novel feature added to a page in treatment group B that is not present in control group A. The presence of that feature is our independent variable. We would assay a response variable, such as sales, and express it as a ‘Conversion Rate’, the metric to be compared between the groups.
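One common way to compare two conversion rates is a two-proportion z-test. The sketch below is a standard-library version of that test; the function name `two_proportion_z_test` and the visitor counts are made up for illustration, and a z-test is my choice here rather than anything the design above prescribes.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B.

    Returns (rate_a, rate_b, z, p_value). Uses the pooled-proportion standard
    error, the usual choice under the null hypothesis of equal rates.
    Assumes both groups are large and the pooled rate is strictly between 0 and 1.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value

# Illustrative counts only: 200 of 5000 convert in A, 250 of 5000 in B.
rate_a, rate_b, z, p = two_proportion_z_test(200, 5000, 250, 5000)
print(f"A={rate_a:.3f} B={rate_b:.3f} z={z:.2f} p={p:.4f}")
```

A small p-value suggests the difference in conversion rate is unlikely to be chance alone, though it says nothing about whether the difference is large enough to matter to the business.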

We aren’t limited to a single experimental group, as long as we keep the constants (base conditions) the same across all of our groups. This allows us to test many new features in a single trial, which matters when time and resources are so valuable.

By toggling a single feature type across many groups that share the same base conditions, we can make a stronger argument for causality: that our feature is the driving force behind the change in response.

The base condition I’m referring to is the entire construct of our control group, the thing we compare the results of our treatments against. Ideally, this is a baseline that represents our current business model or website structure.
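To show what comparing against the baseline might look like with several treatment groups, here is a small sketch that reports each variant’s relative lift over the control. The function name `lift_over_control`, the group names, and the counts are hypothetical.

```python
def lift_over_control(results, control="control"):
    """Return each variant's relative lift in conversion rate over the control.

    results maps group name -> (conversions, visitors). A lift of 0.2 means
    the variant converted 20% better than the baseline; negative means worse.
    Assumes the control has at least one conversion.
    """
    conversions, visitors = results[control]
    base_rate = conversions / visitors
    lifts = {}
    for name, (conv, n) in results.items():
        if name == control:
            continue
        lifts[name] = (conv / n - base_rate) / base_rate
    return lifts

# Illustrative counts only.
results = {
    "control": (100, 1000),
    "new_button": (120, 1000),
    "new_banner": (90, 1000),
}
print(lift_over_control(results))
```

Lift on its own is only a summary; each variant would still need a significance test against the control before drawing conclusions.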

Behavioural Considerations / Risks

Self-reporting is dangerous. Self-reported data has historically been shown to be wildly inaccurate, particularly if you’re asking anything remotely personal or sensitive about the individual.

The best defense against this phenomenon is to ensure not only that privacy and anonymity are actually in place, but that the user or participant is -confident- in their anonymity.

Let’s say that you have a hypothesis that users of a certain age group or weight class will respond to a new feature differently, so you ask them to self-report that value. Chances are, many users will self-report a false value, and no matter how well intentioned or well designed your experiment is, you’re going to get bad data.

Written on December 2, 2017