AB Testing: the impact of early peeking¶
This notebook simulates the impact of early peeking on the results of a conversion rate AB test. Early peeking is loosely defined as the practice of checking and acting on the results of an AB test (i.e. its p value, statistical significance, secondary metrics, etc.) before the target sample size and power are reached.
Peeking is discussed extensively in blog posts and some academic research. The purpose of this notebook is to provide snippets of code that allow data analysts to reproduce their AB tests via simulations and measure the effect of peeking on data similar to their own.
The notebook is structured as follows:
- Experiment setup via simulations: true power, sample size and type I error
- The effect of early peeking: impact of frequency and time of peeking
- Visual interpretation of the effect of peeking
- Peeking threshold boundaries: can we make early decisions when the p-values exceed a certain threshold?
A non-exhaustive list of references¶
- Merritt Aho: AB testing peeking deep dive
- Deliveroo
- Lucid Chart: the fatal flaw of ab test peeking
- How Etsy handles peeking
- Evan Miller: how not to run an AB test
- Bayesian AB testing is not immune to peeking
- KDD Paper: peeking
- Skyscanner Engineering
Summary of results¶
These simulations confirm that early peeking leads to a significant inflation of the true type I error when the null hypothesis is true. Waiting longer before the first peek reduces this effect but can still have negative consequences.
One trade-off is to decide on decision boundaries when peeking, based on the cost/benefit of stopping futile or promising experiments early. These simulations show that in cases where agility and speed of experimentation are important, such trade-offs can be highly beneficial. They come at the expense of type I error inflation, which can be compensated by increasing the overall sample size slightly.
Experiment 1) Two groups, same average, 50/50%, early peeking and stop¶
Here we will use three normal distributions with the same variance: a control, a variant with a small true lift, and a variant with no true difference.
- $X_0 \sim N(\mu = 0.8, \sigma = 0.3)$
- $X_1 \sim N(\mu = 0.81, \sigma = 0.3)$
- $X_2 \sim N(\mu = 0.8, \sigma = 0.3)$
Recall that
- $H_0$: $\mu_1 = \mu_2$ and $H_1$: $\mu_1 \neq \mu_2$
- Type 1 error is calculated as: $P(\text{reject} | H_0)$
- Power is calculated as: $P(\text{reject} | H_1)$
- In this setup, a significant result at an early peek ends the experiment (peek and stop).
Since we simulate experiments using the true distributions under $H_0$ and $H_1$, we are able to calculate the true values of the quantities above. This is only possible via simulations, but it can be useful to estimate the type I error and power of real AB tests if we assume that the population data is similar to the simulated data.
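Concretely, for $N$ simulated experiments the true error rates are estimated by the fraction of simulations whose p value falls below $\alpha$, which is exactly what the simulation cells below compute:
- $\hat{\alpha} = \frac{\#\{\text{simulations under } H_0 \text{ with } p < \alpha\}}{N}$
- $\widehat{\text{power}} = \frac{\#\{\text{simulations under } H_1 \text{ with } p < \alpha\}}{N}$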
Imports¶
import os
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.stats.power
%matplotlib inline
sns.set() #Making seaborn the default styling
pd.set_option('display.float_format', lambda x: '%.5f' % x)
Population distributions¶
#The population distributions, scale is std
X_1 = stats.norm.rvs(loc = 0.8, scale = 0.3, size = 14000, random_state = None)
X_2 = stats.norm.rvs(loc = 0.81, scale = 0.3, size = 14000, random_state = None)
#X_1 = X_1[(X_1 <= 1) & (X_1 >= 0)]
#X_2 = X_2[(X_2 <= 1) & (X_2 >= 0)]
plt.figure(figsize = (10,5))
sns.distplot(X_1)
sns.distplot(X_2)
plt.xlim(0,2)
plt.title('Distribution of X1 and X2')
plt.show()
Pre-experiment calculations¶
Given we want to measure a difference of 100 basis points, i.e. from 80% to 81%, how many users do we need to achieve 80% power? Note this is a two-sample, two-sided z test:
- For a 50/50 split we need 14k users in each group to get 80% power
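The next cell implements the standard two-sample formulas from powerandsamplesize.com (linked in the code), with $\kappa = n_0/n_1$ and pooled standard deviation $\sigma$:
- Power: $1 - \beta = \Phi\left(z - z_{1-\alpha/2}\right) + \Phi\left(-z - z_{1-\alpha/2}\right)$, where $z = \frac{\mu_1 - \mu_0}{\sigma \sqrt{1/n_0 + 1/n_1}}$
- Sample size: $n_1 = \left(1 + \frac{1}{\kappa}\right)\left(\sigma \, \frac{z_{1-\alpha/2} + z_{1-\beta}}{\mu_1 - \mu_0}\right)^2$ and $n_0 = \kappa \, n_1$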
Calculating power and sample size¶
#Z_crit
alpha = .05
beta = 0.2 #Power = 1-beta
#Parameters
mu0,mu1 = 0.80,0.81
sigma0, sigma1 = 0.3, 0.3
n0, n1 = 10000,10000
#Intermediate calculations
z_alpha2 = stats.norm.ppf(1-alpha/2, loc=0, scale=1)
z_beta = stats.norm.ppf(1-beta, loc=0, scale=1)
kappa = n0/n1
sigma_pooled = ( np.sqrt(( (n0-1)* sigma0**2 + (n1-1) * sigma1**2 ) / (n0+n1-2)) )
#Manual power
#from http://powerandsamplesize.com/Calculators/Compare-2-Means/2-Sample-Equality
z = (mu1-mu0) / ( sigma_pooled * np.sqrt( (1/n0) + (1/n1) ) )
power_manual = stats.norm.cdf(z - z_alpha2, loc=0, scale=1) + stats.norm.cdf(-z - z_alpha2, loc=0, scale=1)
#Manual sample size given power
sample_size_n1 = (1 + 1/kappa) * ( sigma_pooled * (z_alpha2 + z_beta) / (mu1-mu0) )**2
sample_size_n0 = kappa * sample_size_n1
#Using statsmodel and standardized effect size, difference between the two means divided by the standard deviation. effect_size has to be positive.
power_statsmodel = statsmodels.stats.power.tt_ind_solve_power(effect_size = (mu1-mu0)/sigma_pooled, nobs1 = n1, ratio =n0/n1, alpha = 0.05, power = None, alternative = 'two-sided' )
#print('z_alpha/2 = ',z_alpha2, ' z = ', z)
print('Sample size for 80% power: n0 =', round(sample_size_n0), 'and n1 =', round(sample_size_n1))
print('Power for given sample size = ', round(power_manual,4) )
def ttest_two_side_power(mu0, mu1, sigma0, sigma1, n0, n1, alpha = 0.05, beta = 0.2):
    '''Manual power calculation for a two-sample, two-sided test
    from http://powerandsamplesize.com/Calculators/Compare-2-Means/2-Sample-Equality
    beta is kept in the signature for convenience but is not needed to compute power
    '''
    #Intermediate calculations
    z_alpha2 = stats.norm.ppf(1-alpha/2, loc=0, scale=1)
    sigma_pooled = np.sqrt(( (n0-1) * sigma0**2 + (n1-1) * sigma1**2 ) / (n0+n1-2))
    #Power
    z = (mu1-mu0) / ( sigma_pooled * np.sqrt( (1/n0) + (1/n1) ) )
    power = stats.norm.cdf(z - z_alpha2, loc=0, scale=1) + stats.norm.cdf(-z - z_alpha2, loc=0, scale=1)
    return power
ttest_two_side_power(.8, .81, .3, .3, 14000, 14000, alpha = 0.05, beta = 0.2)
Confirming by simulations¶
%%time
#Power from simulations
means0, means1, means2 = [], [], []
t_test_p_H0, t_test_p_H1 = [], []
for i in range(5000):
sample0 = stats.norm.rvs(loc = 0.8, scale = .3, size = n0)
sample1 = stats.norm.rvs(loc = 0.81, scale = .3, size = n1)
sample2 = stats.norm.rvs(loc = 0.8, scale = .3, size = n1)
means0.append(sample0.mean().copy())
means1.append(sample1.mean().copy())
means2.append(sample2.mean().copy())
#Computations for true alpha and power
t_test_p_H0.append(stats.ttest_ind(sample0,sample2)[1].copy())
t_test_p_H1.append(stats.ttest_ind(sample0,sample1)[1].copy())
#Computing true alpha and power from simulations
simulated_alpha = (np.asarray(t_test_p_H0) < alpha).sum() / len(t_test_p_H0)
simulated_power = (np.asarray(t_test_p_H1) < alpha).sum() / len(t_test_p_H1)
print('Manual Power = ', power_manual )
print('Simulated power = ', simulated_power , 'and alpha =', simulated_alpha)
#Not sure why stats model gives different results ?
#Since manual, bootstrap and website all give the same value, we will ignore statsmodel
#print('Stats model power = ', power_statsmodel)
Simulating early peeking¶
Assume that each day, we increase the number of users in each sample by 1,000. What would happen if we peeked only once, on day 5, or if we peeked every day?
Note that:
- x0 corresponds to the control with true mean = 0.8
- x1 corresponds to a variant with true mean = 0.81 (i.e. a true alternative)
- x2 corresponds to a variant with true mean = 0.8 (i.e. a true null)
- x3 corresponds to a variant with true mean = 0.77 (i.e. a clearly worse variant, simulated here for the peeking threshold analysis later)
%%time
days = 14
sims = 500
H0_frame, H1_frame, H2_frame = np.empty((0,days)), np.empty((0,days)), np.empty((0,days))
#Looping through simulations
for i in range(sims):
x0, x1, x2, x3 = np.empty(0), np.empty(0), np.empty(0), np.empty(0)
t_test_p_H0, t_test_p_H1, t_test_p_H2 = np.empty(0), np.empty(0), np.empty(0)
#Looping through days
for d in range(days):
#print(d)
daily_x0 = stats.norm.rvs(loc = mu0, scale =sigma0, size = 1000)
daily_x1 = stats.norm.rvs(loc = mu1, scale = sigma1, size = 1000)
daily_x2 = stats.norm.rvs(loc = mu0, scale = sigma0, size = 1000)
daily_x3 = stats.norm.rvs(loc = mu0-0.03, scale = sigma0, size = 1000)
x0 = np.append(x0, daily_x0.copy())
x1 = np.append(x1, daily_x1.copy())
x2 = np.append(x2, daily_x2.copy())
x3 = np.append(x3, daily_x3.copy())
#print(x0.shape)
t_test_p_H0 = np.append(t_test_p_H0, stats.ttest_ind(x0,x2)[1].copy())
t_test_p_H1 = np.append(t_test_p_H1, stats.ttest_ind(x0,x1)[1].copy())
t_test_p_H2 = np.append(t_test_p_H2, stats.ttest_ind(x0,x3)[1].copy())
#Stacking sim loop results
H0_frame = np.vstack( (H0_frame,t_test_p_H0))
H1_frame = np.vstack( (H1_frame,t_test_p_H1 ))
H2_frame = np.vstack( (H2_frame,t_test_p_H2 ))
#Saving as DataFrame outside of the sim loop
H0_frame = pd.DataFrame(H0_frame)
H1_frame = pd.DataFrame(H1_frame)
H2_frame = pd.DataFrame(H2_frame)
number_sims = H0_frame.shape[0]
number_days = H0_frame.shape[1]
H0_positive_sims = ((H0_frame<alpha).sum(axis=1)>0).sum()
H1_positive_sims = ((H1_frame<alpha).sum(axis=1)>0).sum()
expected_power = ttest_two_side_power(mu0, mu1, sigma0 , sigma1, x0.shape[0], x0.shape[0], alpha = alpha, beta = beta)
# Under a true H0, how many simulations have 1 or more day with pvalue < alpha
#print('H0 sims with 1 or more days positive =', H0_positive_sims)
# i.e. True type 1 error in this scenario compared to expected
print('True type 1 error when peeking each day:')
print('True =', 100 *H0_positive_sims / number_sims, '%, Expected =', 100 * alpha, '%')
# Under a true H1, how many simulations have 1 or more day with pvalue < alpha
print('')
print('Power when peeking each day:')
print('True = ', 100 * H1_positive_sims / number_sims, '%, Expected =', 100* expected_power, '%')
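The setup above also asked what happens if we peek only once, before the full sample is collected. Below is a minimal sketch reading the simulated p-value frames on a single day (the helper single_peek_rate is an illustration added here, not part of the original analysis):
#Rate of significant results if we peek only once, on a given (0-indexed) day
def single_peek_rate(frame, day, alpha = 0.05):
    return (frame.iloc[:, day] < alpha).mean()
print('Type 1 error, single peek on day 5 =', 100 * single_peek_rate(H0_frame, 4), '%')
print('Power, single peek on day 5 =', 100 * single_peek_rate(H1_frame, 4), '%')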
Visual interpretation: H0 scenario¶
Why does peeking inflate the type I error? Given two identical distributions, i.e. a true null hypothesis, we plot:
- Top chart: the number of positive experiments and proportion of total, each day of peeking
- Bottom two charts:
- In grey, experiments that are negative
- In red, experiments that have a p value < 0.05, i.e. false positives
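One way to phrase the effect: with a single read on day 14, $P(p_{14} < \alpha \mid H_0) = \alpha$ by construction, whereas peeking every day rejects as soon as any daily p value dips below the threshold, so the realised error rate is $P(\min_{d} p_d < \alpha \mid H_0) \geq \alpha$. Because the daily p values are not perfectly correlated, the inequality is strict and the inflation grows with every additional peek.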
#Calculations for the bar plot
number_positive_tests, number_remaining_tests = [], []
H0_frame_copy = H0_frame.copy()
#Loop through columns
for c in H0_frame_copy.columns:
#Calculate temp results
positive_tests = H0_frame_copy[H0_frame_copy.iloc[:,c]<0.05].index
number_positive_tests.append(positive_tests.shape[0])
number_remaining_tests.append(H0_frame_copy.shape[0])
#Drop the positive tests from frame
H0_frame_copy = H0_frame_copy.drop(index = positive_tests, axis = 0)
#Save in frame and calculate proportions
df_bar_res = pd.DataFrame({'Day':[c for c in H0_frame.columns], 'positive_tests':number_positive_tests, \
                           'total_tests':H0_frame.shape[0], 'remaining_tests':number_remaining_tests})
df_bar_res['positive/total'] = df_bar_res['positive_tests'] / H0_frame.shape[0]
#Figure setup
fig = plt.figure(figsize=(17,14))
layout = (2, 2)
ax1 = plt.subplot2grid(layout, (0, 0), colspan=2)
ax2 = plt.subplot2grid(layout, (1, 0))
ax3 = plt.subplot2grid(layout, (1, 1))
#Bar plot
df_bar_res.plot(kind = 'bar', x = 'Day', y = 'positive_tests', ax= ax1, ylabel = 'Positive tests', title = 'H0 scenario - Peek every day')
tmp = df_bar_res.plot(kind = 'line', x = 'Day', y = 'positive/total', color = 'red', linestyle = ':',secondary_y = True,ax = ax1)
tmp.set_xlabel('Day')
tmp.set_ylabel('% of total')
# Line plots
alpha_plot = [0.05 for i in range(14)]
for row in H0_frame.head(100).iterrows():
#Non peeking plot
if row[1][13]<0.05:
ax2.plot(row[1], linestyle = '-', color = 'red', alpha = .4)
else:
ax2.plot(row[1], linestyle = '-', color = 'lightgrey')
#Peeking plot
if (row[1]<0.05).sum()>0:
ax3.plot(row[1], linestyle = '-', color = 'red', alpha = .4)
else:
ax3.plot(row[1], linestyle = '-', color = 'lightgrey')
ax2.plot(alpha_plot, linestyle = ':', linewidth = 3, color = 'darkred', label = '0.05')
ax2.legend(loc = 'upper right')
ax2.set_title('Type I errors: Read on day 14')
ax2.set_ylabel('p value')
ax2.set_xlabel('Day')
ax3.plot(alpha_plot, linestyle = ':', linewidth = 3, color = 'darkred', label = '0.05')
ax3.legend(loc = 'upper right')
ax3.set_title('Type I errors: Peek every day')
ax3.set_ylabel('p value')
ax3.set_xlabel('Day')
plt.show()
number_sims = H0_frame.shape[0]
number_days = H0_frame.shape[1]
H0_positive_sims = ((H0_frame<alpha).sum(axis=1)>0).sum()
expected_power = ttest_two_side_power(mu0, mu1, sigma0 , sigma1, x0.shape[0], x0.shape[0], alpha = alpha, beta = beta)
# Under a true H0, how many simulations have 1 or more day with pvalue < alpha
#print('H0 sims with 1 or more days positive =', H0_positive_sims)
# i.e. True type 1 error in this scenario compared to expected
print('True type 1 error when peeking each day:')
print('True =', 100 *H0_positive_sims / number_sims, '%, Expected =', 100 * alpha, '%')
print('Sum of % of total =',round(df_bar_res['positive/total'].sum(),3))
Visual interpretation: H1 scenario¶
Why does peeking inflate power? Given two truly different distributions, i.e. a true alternative hypothesis, we plot:
- Top chart: the number of positive experiments and proportion of total, each day of peeking
- Bottom two charts:
- In grey, experiments that are negative
- In red, experiments that have a p value < 0.05, i.e. true positives
#Calculations for the bar plot
number_positive_tests, number_remaining_tests = [], []
H1_frame_copy = H1_frame.copy()
#Loop through columns
for c in H1_frame_copy.columns:
#Calculate temp results
positive_tests = H1_frame_copy[H1_frame_copy.iloc[:,c]<0.05].index
number_positive_tests.append(positive_tests.shape[0])
number_remaining_tests.append(H1_frame_copy.shape[0])
#Drop the positive tests from frame
H1_frame_copy = H1_frame_copy.drop(index = positive_tests, axis = 0)
#Save in frame and calculate proportions
df_bar_res = pd.DataFrame({'Day':[c for c in H1_frame.columns], 'positive_tests':number_positive_tests, \
                           'total_tests':H1_frame.shape[0], 'remaining_tests':number_remaining_tests})
df_bar_res['positive/total'] = df_bar_res['positive_tests'] / H1_frame.shape[0]
#Figure setup
fig = plt.figure(figsize=(14,14))
layout = (2, 2)
ax1 = plt.subplot2grid(layout, (0, 0), colspan=2)
ax2 = plt.subplot2grid(layout, (1, 0))
ax3 = plt.subplot2grid(layout, (1, 1))
#Bar plot
df_bar_res.plot(kind = 'bar', x = 'Day', y = 'positive_tests', ax= ax1, xlabel = 'Day of peek', ylabel = 'Positive tests', title = 'H1 scenario - Peek every day')
tmp = df_bar_res.plot(kind = 'line', x = 'Day', y = 'positive/total', color = 'red', linestyle = ':',secondary_y = True,ax = ax1)
tmp.set_ylabel('% of total')
# Line plots
alpha_plot = [0.05 for i in range(14)]
for row in H1_frame.head(100).iterrows():
#Non peeking plot
if row[1][13]<0.05:
ax2.plot(row[1], linestyle = '-', color = 'red', alpha = .4)
else:
ax2.plot(row[1], linestyle = '-', color = 'lightgrey')
#Peeking plot
if (row[1]<0.05).sum()>0:
ax3.plot(row[1], linestyle = '-', color = 'red', alpha = .4)
else:
ax3.plot(row[1], linestyle = '-', color = 'lightgrey')
ax2.plot(alpha_plot, linestyle = ':', linewidth = 3, color = 'darkred', label = '0.05')
ax2.legend(loc = 'upper right')
ax2.set_title('Power: Read on day 14')
ax2.set_ylabel('p value')
ax2.set_xlabel('Day')
ax3.plot(alpha_plot, linestyle = ':', linewidth = 3, color = 'darkred', label = '0.05')
ax3.legend(loc = 'upper right')
ax3.set_title('Power: Peek every day')
ax3.set_ylabel('p value')
ax3.set_xlabel('Day')
plt.show()
#Printing sums and overall stats
number_sims = H0_frame.shape[0]
number_days = H0_frame.shape[1]
H1_positive_sims = ((H1_frame<alpha).sum(axis=1)>0).sum()
expected_power = ttest_two_side_power(mu0, mu1, sigma0 , sigma1, x0.shape[0], x0.shape[0], alpha = alpha, beta = beta)
# Under a true H1, how many simulations have 1 or more day with pvalue < alpha
print('')
print('Power when peeking each day:')
print('True = ', 100 * H1_positive_sims / number_sims, '%, Expected =', 100* expected_power, '%')
print('Sum of % of total =',round(df_bar_res['positive/total'].sum(),3))
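For reference, the no-peeking baseline in the "Read on day 14" panels can be read directly from the same simulated frames by looking only at the last day. This small cell is an addition, reusing the H0_frame and H1_frame defined above:
#Read the p value only on day 14 (column 13), i.e. no peeking
print('Type 1 error, read on day 14 only =', 100 * (H0_frame.iloc[:, 13] < alpha).mean(), '%')
print('Power, read on day 14 only =', 100 * (H1_frame.iloc[:, 13] < alpha).mean(), '%')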