Simulated Bootstrapped A/A tests

As Data Scientists at Twitch, we design, run, and analyze A/B tests on a daily basis. Many of our metrics, like hours watched or number of chat messages per channel, have very long-right-tailed skewed distributions. We needed a way to deal with designing A/B tests on skewed metrics, so we adapted the simulated bootstrapped A/A test methodology from Trusted Online Controlled Experiments by Kohavi et al. In this blog post, we will discuss what a simulated bootstrapped A/A test is, why it helps when there are skewed metrics, and most importantly, we have included the code that will allow you to run your own simulated bootstrapped A/A tests on your data.

What do you need to know before you read this article?

You can open the links below in a separate tab to follow along as you read.

What’s an A/B test?
What’s a t-test?
What’s a p-value?
What’s a skewed distribution?

Why simulated bootstrapped A/A tests?

Skewed metrics present challenges to analyzing experiments with a classic t-test. There may be problems with:

Variance - highly skewed metrics have a high variance, which may be prohibitive to get enough power for a test.
Outliers - data may exhibit outliers that need to be controlled before calculating test results.
Splits - while many of our experiments run at a 95/5 split, a more even split (e.g. 50/50 split) may be necessary when metrics are skewed.

To better understand how skew may affect experiment results and whether the above issues need to be mitigated, we implemented the methodology adapted from Chapter 19 in the book Trusted Online Controlled Experiments by Kohavi et al., where we run bootstrapped A/A tests on the experiment metric with historical data *before *performing an A/B test.

The simulated A/A test on historical data can uncover potential problems with using the intended metric before it is time to analyze an A/B test, and has zero engineering cost (as long as tracking is already implemented). In this post, we will share the Python code so that you can also run this methodology and assess the validity of your metrics.

What is a simulated bootstrapped A/A test?

Before running an A/B test on a metric, especially a newly designed metric with a lot of skew, it can be a good idea to run a series of simulated A/A tests on historical data to make sure that the t-test calculation is valid.

To start, create a data set of historical data that simulates the enrollment condition of the test for some period of time, along with the experiment metric. For example, to simulate a 2-week experiment on channels that broadcast Fortnite with the metric hours watched, we would pull a data set for a past 2-week period containing channel_id and hours watched in that period for those channels that broadcast Fortnite.

Once the data set is created, the second step is to generate a p-value distribution from bootstrapped A/A tests. This is where we split the data randomly into variant / control in many different ways. Then, we run the t-test on each split with the following hypothesis:

H0: average metric in group A = average metric in group B
H1: average metric in group A ≠ average metric in group B

Each of these t-tests on a randomly generated split generates a p-value, and we plot all the p-values in a histogram, which should exhibit a uniform distribution.

In an A/A t-test that satisfies the underlying statistical assumptions for a t-test, the resulting p-value distribution should be uniform (so, a p-value < 0.05 should occur 5% of the time). If the p-value distribution is not uniform, it shows that the testing methodology is flawed, and points to a violation of assumptions.

A uniform p-value distribution looks like:

A non-uniform p-value distribution looks like:

The reason that the distribution should be uniform comes from the definition of p-value, which states the probability of finding a difference in the samples given the null hypothesis is true. Since this is an A/A test, the null hypothesis is true. If there was no problem with the t-test, we would expect to see a uniform distribution (all pillars would be of similar height in the above chart) of the p-values. In other words, since the chart is divided into 20 bins, we would expect to see count = ~100 (2000 A/A tests/ 20 bins = 100) in each of the bins, namely the first bin (representing p<0.05) should have a frequency of 5% = 100. Therefore, if the p-value distribution for the A/A test is not uniform, the t-test will not properly measure the probability of whether the alternative hypothesis is, in fact, true.

While sometimes p-value distributions are obviously not uniform, other times you may need to statistically confirm whether a distribution is uniform. You can perform statistical tests like the Kolmogorov-Smirnov (KS) test or the Chi-squared test to check. We will be using the KS test in our python code below to check whether a p-value distribution is uniform.

How to run a simulated bootstrapped A/A test

1. Create a data set containing unit of analysis in experiment and its associated metric.

You can use your own data set. If you don’t have one handy, we have provided the code to generate a dummy data set so that you can practice. This code generates three different metrics - one that is neither skewed nor has outliers (hours_watched), one that contains outliers (hours_watched_outlier), and one that is skewed (hours_watched_skewed).

Here is the code to generate dummy data set:

import pandas as pd
import seaborn as sns
from scipy import stats
import numpy as np
import statistics as s
import random as r
from scipy.stats import skewnorm
from scipy.stats.mstats import winsorize

np.random.seed(42)

df = pd.DataFrame()
df['user_id'] = np.random.choice(40000, 40000, replace=False)
df['hours_watched'] = np.random.random(40000)
df['hours_watched_outlier'] = np.random.random(40000)
df['hours_watched_outlier'].iloc[0] = 1000

df['hours_watched_skewed'] = np.random.random(40000)
# winsorizing at 99th %ile to get skewed data
df['hours_watched_skewed'] = winsorize(
                                      df['hours_watched_skewed'],
                                      limits=[0.001, 0.0])
df['hours_watched_skewed'] = 1/df['hours_watched_skewed']

df.head()

Distribution of hours_watched

sns.histplot(df['hours_watched'], bins=20, kde=False)

Distribution of hours_watched_outlier

sns.histplot(df['hours_watched_outlier'], bins=20, kde=False)

Distribution of hours_watched_skewed

sns.histplot( df['hours_watched_skewed'], bins=20, kde=False)

2. Create a p-value distribution from A/A tests.

Follow these steps to generate the p-value distribution:

Split the rows in the data set randomly into 2 cohorts (95/5 or 50/50 or whatever split you would like - for the purposes of this example, we will assume a 95/5 split)
Conduct a t-test to compare average metric for the 2 cohorts, and obtain the corresponding p-value
Obtain 2000 p-values by repeating steps 1 and 2 2000 times
Create a histogram consisting of 20 bins for these 2000 p-values (so, the first bin represents p-values of <0.05)

Here is the code to generate the p-value distribution on your data set:

# returns p-value

def AA_test(df, i):
    m1 = df['mean1'][i]
    m2 = df['mean2'][i]
    s1 = df['std1'][i]
    s2 = df['std2'][i]
    o1 = df['obs1'][i]
    o2 = df['obs2'][i]
    return(stats.ttest_ind_from_stats(mean1=m1, std1=s1, nobs1=o1,
                                      mean2=m2, std2=s2, nobs2=o2,
                                      equal_var=False)[1])

# returns dataframe with p-values

def p_values(dataset, metric, sample_size, control_size, bootstraps):
  # create a dataframe of unique channel ids (or whatever is the grain of your data)
unique_channels = pd.DataFrame(dataset['channel_id'].unique(), columns=['channel_id'])

# defining variables
c_mean = [] # List to store mean of each control bootstrap sample
c_std = [] # List to store standard deviation of each control bootstrap sample
c_obs = [] # List to store number of observations in each control bootstrap sample

t_mean = [] # List to store mean of each treatment bootstrap sample
t_std = [] # List to store standard deviation of each treatment bootstrap sample
t_obs = [] # List to store number of observations in each treatment bootstrap sample

# loop for each bootstrap sample
for _ in range(bootstraps):

  # % of sampling complete
  if (_+1) % 1000 == 0:
    print("% sampling complete.".format(100.0*(_+1)/bootstraps))

  np.random.seed(_)

  # get control dataset
  # control channel count = sample_size * control_size
  control_data = dataset.sample(int(sample_size * control_size))

  # get statistics for the control data
  c_mean.append(control_data[metric].mean())
  c_std.append(control_data[metric].std())
  c_obs.append(len(control_data))

  # get treatment dataset
  # treatment channel count = sample_size - (sample_size * control_size)
  treatment_data = dataset[~dataset.isin(control_data)].dropna().sample(sample_size - int(sample_size * control_size))

  # get statistics for treatment data
  t_mean.append(treatment_data[metric].mean())
  t_std.append(treatment_data[metric].std())
  t_obs.append(len(treatment_data))

df_means = pd.DataFrame({'mean1':c_mean,
                        'mean2':t_mean,
                        'std1': c_std,
                        'std2': t_std,
                        'obs1': c_obs,
                        'obs2': t_obs})

print("Calculating p-values...")
  pvalues = {metric + '_p_values':[]} # change the name as needed

for i in range(bootstraps):
    for k,v in pvalues.items():
        v.append(AA_test(df_means, i))
  df_results = pd.DataFrame(pvalues)
  print("Dataframe with p-values is ready.")
  return df_results

How to interpret a simulated bootstrapped A/A test

Metric - Hours Watched

Let’s use the data set above for our dummy set with the metric hours_watched, and execute the code below to generate the p-value distribution:

# run code blocks 1,2 before running this block

df_results = p_values(df, 'hours_watched', sample_size=df.shape[0], control_size=0.95, bootstraps=2000)

ax = sns.histplot(df_results['hours_watched_p_values'], bins=20, kde=False)

We get the following p-value distribution:

Using the code below, we can perform the goodness of fit test for uniformity and check if the above distribution is uniform or not:

stats.kstest(df_results['hours_watched_p_values'], "uniform")

You will see that the outcome of the above is that the p-value > 0.05. Hence, it is safe to say that the distribution is uniform and that the metric is OK to use in an experiment.

Metric - Hours Watched with outlier

Now, let’s repeat this process with metric hours_watched_outlier, which contains several outliers.

df_results = p_values(df, 'hours_watched_outlier', sample_size=df.shape[0], control_size=0.95, bootstraps=2000)

ax = sns.histplot(df_results['hours_watched_outlier_p_values'], bins=20, kde=False)

We get the following p-value distribution:

Clearly, the p-value distribution is not uniform. Also, notice that in this distribution, there is a spike at 0.32. This spike indicates that there is an outlier problem.

Metric - Hours Watched with outlier, winsorized

One way to mitigate this problem is to cap values of the metric (also known as Winsorization), by replacing extreme values with a threshold value, or cap. This is different from discarding data; instead, Winsorization caps outliers so they fall at the edge of the main distribution. Besides getting rid of outliers, it has the added benefit of decreasing variance, which allows the test to detect smaller differences in means.

We can Winsorize the hours_watched_outlier metric, using the code below:

df['hours_watched_outlier_winsorized'] = winsorize(df['hours_watched_outlier'], limits=[0.0, 0.01]) # winsorizing at 99th %ile

df_results = p_values(df, 'hours_watched_outlier_winsorized', sample_size=df.shape[0], control_size=0.95, bootstraps=2000)

ax = sns.histplot(df_results['hours_watched_outlier_winsorized_p_values'], bins=20, kde=False)

When we plot the p-value histogram of hours_watched_outlier_winsorized, we get a uniform distribution. So, this suggests that the winsorized version of the metric is suitable for use in an experiment.

Metric - Hours Watched with skew

Besides the presence of outliers, a non-uniform p-value distribution can also be a result of skewed metrics. Below is the p-value distribution we get when we run the simulated bootstrap A/A test code for the metric hours_watched_skewed.

df_results = p_values(df, 'hours_watched_skewed', sample_size=df.shape[0], control_size=0.95, bootstraps=2000)

ax = sns.histplot(df_results['hours_watched_skewed_p_values'], bins=20, kde=False)

Notice that the distribution is not uniform, with a mass on the left. This means that if we use this metric with a t-test, we would be overestimating significance (since p-value < 0.05 occurs more than 5% of the time). Usually, this happens when the sample size is too small for the magnitude of the skew of the metric. In this case, capping metrics would also help.

Metric - Hours Watched with skew, winsorized

When we cap the metric hours_watched_skewed at the 99th percentile, to hours_watched_skewed_capped, we get the following p-value distribution.

df['hours_watched_skewed_winsorized'] = winsorize(df['hours_watched_skewed'], limits=[0.0, 0.01]) # winsorizing at 99th %ile

df_results = p_values(df, 'hours_watched_skewed_winsorized', sample_size=df.shape[0], control_size=0.95, bootstraps=2000)

ax = sns.histplot(df_results['hours_watched_skewed_winsorized_p_values'], bins=20, kde=False)

Notice that this p-value distribution is now uniform (with KS test p-value of 0.71, which is > 0.05), and hence it’s OK to use it in an experiment.

Metric - Hours Watched with skew, control / variant evenly split

Another thing to consider in a skewed metric scenario where the p-value distribution has a mass on the left is to make a more even split. If we repeat the p-value distribution code for the metric hours_watched_skewed with a 50/50 split, we get the following p-value distribution.

# control split changed to 50%
df_results = p_values(df, 'hours_watched_skewed', sample_size=df.shape[0], control_size=0.5, bootstraps=2000)

ax = sns.histplot(df_results['hours_watched_skewed_p_values'], bins=20, kde=False)

So, we can see that this mitigates the problem, and we should suggest running this test at 50/50 to increase the sample size needed to balance the skew of the metric.

Appendix

Here’s the link to the notebook, with step-wise code for this blog. If you enjoyed this, feel free to connect with Jane and Jaspreet on LinkedIn!

Want to Join Our Quest to empower live communities on the internet? Find out more about what it is like to work at Twitch on our Career Site, LinkedIn, and Instagram, or check out our Job Openings and apply.

Language