AB Testing: the impact of early peeking

This notebook simulates the impact of early peeking on the results of a conversion rate AB test. Early peeking is loosely defined as the practice of checking the results of an AB test (e.g. its p value, statistical significance, secondary metrics, etc.) and acting on them before the target sample size and power are reached.

Peeking is discussed extensively in blog posts and in some academic research. The purpose of this notebook is to provide snippets of code that allow data analysts to reproduce their AB tests via simulations and measure the effect of peeking on data similar to their own.

The notebook is structured as follows:

  • Experiment setup via simulations: true power, sample size and type I error
  • The effect of early peeking: impact of the frequency and timing of peeking
  • Visual interpretation of the effect of peeking
  • Peeking threshold boundaries: can we make early decisions when the p-values exceed a certain threshold?

A non-exhaustive list of references

Summary of results

These simulations confirm that early peeking leads to a significant inflation of the true type I error when the null hypothesis is true. Waiting longer before peeking reduces this effect but can still have negative consequences.

One trade-off is to set decision boundaries for peeking, based on the cost/benefit of stopping futile or promising experiments early. These simulations show that where agility and speed of experimentation matter, such trade-offs can be highly beneficial. They come at the expense of some type I error inflation, which can be compensated for by slightly increasing the overall sample size.

Experiment 1) Two groups, same average, 50/50 split, early peeking and stop

Here we will use three normal distributions: a control, a variant with a small true lift, and a variant identical to the control.

  • $X_0 \sim N(\mu = 0.8, \sigma = 0.3)$
  • $X_1 \sim N(\mu = 0.81, \sigma = 0.3)$
  • $X_2 \sim N(\mu = 0.8, \sigma = 0.3)$

Recall that

  • $H_0$: $\mu_1 = \mu_2$ and $H_1$: $\mu_1 \neq \mu_2$
  • Type 1 error is calculated as: $P(\text{reject} | H_0)$
  • Power is calculated as: $P(\text{reject} | H_1)$
  • In this setup, a significant result at a peek ends the experiment (peek and stop).

Since we simulate experiments using the true distributions under $H_0$ and $H_1$, we are able to calculate the true values of the above. This is only possible via simulations, but it can be used to estimate the type I error and power of real AB tests if we assume that the population data is similar to the simulated data.
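
As a minimal sketch of how these true rates can be estimated (the function name rejection_rate and its defaults are illustrative, not part of the notebook's code): simulate many experiments under a given pair of true means, run the test on each, and count how often the p-value falls below alpha.

import numpy as np
import scipy.stats as stats

def rejection_rate(mu_a, mu_b, sigma=0.3, n=14000, alpha=0.05, n_sims=1000, seed=0):
    """Share of simulated AB tests where the two-sample t-test rejects H0.
    With mu_a == mu_b this estimates the true type I error;
    with mu_a != mu_b it estimates the true power."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(mu_a, sigma, n)   # simulated control sample
        b = rng.normal(mu_b, sigma, n)   # simulated variant sample
        rejections += stats.ttest_ind(a, b)[1] < alpha
    return rejections / n_sims

# Illustrative usage (exact values vary with the random seed):
# rejection_rate(0.80, 0.80)   # ~0.05 -> true type I error at the nominal level
# rejection_rate(0.80, 0.81)   # ~0.80 -> true power at the target sample size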

Imports

In [1]:
import os
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.stats.power

%matplotlib inline
sns.set() #Making seaborn the default styling 

pd.set_option('display.float_format', lambda x: '%.5f' % x)

Population distributions

In [2]:
#The population distributions, scale is std
X_1 = stats.norm.rvs(loc = 0.8, scale = 0.3, size = 14000, random_state = None)
X_2 = stats.norm.rvs(loc = 0.81, scale = 0.3, size = 14000, random_state = None)
#X_1 = X_1[(X_1 <= 1) & (X_1 >= 0)]
#X_2 = X_2[(X_2 <= 1) & (X_2 >= 0)]

plt.figure(figsize = (10,5))
sns.distplot(X_1)
sns.distplot(X_2)
plt.xlim(0,2)
plt.title('Distribution of X1 and X2')
plt.show()

Pre-experiment calculations

Given we want to measure a difference of 100 basis points, i.e. from 80% to 81%, how many users do we need to achieve 80% power? Note this is a two-sample, two-sided z test; the closed-form sample size formula used is given after the list below:

  • For a 50/50 split we need 14k users in each group to get 80% power
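
For reference, the closed-form sample size used in the next cell (two-sample, two-sided z test with allocation ratio $\kappa = n_0/n_1$, following the powerandsamplesize.com formulation referenced in the code) is:

$$n_1 = \left(1 + \frac{1}{\kappa}\right)\left(\frac{\sigma_{pooled}\,(z_{1-\alpha/2} + z_{1-\beta})}{\mu_1 - \mu_0}\right)^2, \qquad n_0 = \kappa\, n_1$$

With $\sigma_{pooled} = 0.3$, $\mu_1 - \mu_0 = 0.01$, $\alpha = 0.05$, $\beta = 0.2$ and $\kappa = 1$, this gives $n_0 = n_1 \approx 14{,}128$, matching the output below.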

Calculating power and sample size

In [3]:
#Z_crit
alpha = .05
beta = 0.2 #Power = 1-beta

#Paramaters
mu0,mu1 = 0.80,0.81
sigma0, sigma1 = 0.3, 0.3
n0, n1 = 10000,10000

#Intermediate calculations
z_alpha2 = stats.norm.ppf(1-alpha/2, loc=0, scale=1)
z_beta = stats.norm.ppf(1-beta, loc=0, scale=1)

kappa = n0/n1
sigma_pooled = ( np.sqrt(( (n0-1)* sigma0**2 + (n1-1) * sigma1**2 ) / (n0+n1-2)) )


#Manual power
#from http://powerandsamplesize.com/Calculators/Compare-2-Means/2-Sample-Equality
z = (mu1-mu0) / ( sigma_pooled * np.sqrt( (1/n0) + (1/n1) ) )
power_manual = stats.norm.cdf(z - z_alpha2, loc=0, scale=1) + stats.norm.cdf(-z - z_alpha2, loc=0, scale=1)

#Manual sample size given power
sample_size_n1 = (1 + 1/kappa)  * ( sigma_pooled * (z_alpha2 + z_beta) / (mu1-mu0) )**2
sample_size_n0 = kappa * sample_size_n1

#Using statsmodel and standardized effect size, difference between the two means divided by the standard deviation. effect_size has to be positive.
power_statsmodel = statsmodels.stats.power.tt_ind_solve_power(effect_size = (mu1-mu0)/sigma_pooled, nobs1 = n1, ratio =n0/n1, alpha = 0.05, power = None, alternative = 'two-sided' )

#print('z_alpha/2 = ',z_alpha2, '  z = ', z)
print('Sample size for 80% power: n0=', round(sample_size_n0), 'and n1 =',round(sample_size_n1))
print('Power for given sample size = ', round(power_manual,4) ) 
Sample size for 80% power: n0= 14128 and n1 = 14128
Power for given sample size =  0.6543
In [11]:
def ttest_two_side_power(mu0, mu1, sigma0, sigma1, n0, n1, alpha = 0.05, beta = 0.2):
    '''Manual power calculation for a two-sample, two-sided test (normal approximation)
    from http://powerandsamplesize.com/Calculators/Compare-2-Means/2-Sample-Equality
    '''
    
    #Intermediate calculations
    z_alpha2 = stats.norm.ppf(1-alpha/2, loc=0, scale=1)
    z_beta = stats.norm.ppf(1-beta, loc=0, scale=1)
    sigma_pooled = np.sqrt( ( (n0-1)*sigma0**2 + (n1-1)*sigma1**2 ) / (n0+n1-2) )

    #Power
    z = (mu1-mu0) / ( sigma_pooled * np.sqrt( (1/n0) + (1/n1) ) )
    power = stats.norm.cdf(z - z_alpha2, loc=0, scale=1) + stats.norm.cdf(-z - z_alpha2, loc=0, scale=1)
    
    return power


ttest_two_side_power(.8, .81, .3, .3, 14000, 14000, alpha = 0.05, beta = 0.2)    
Out[11]:
0.7964213092264733

Confirming by simulations

In [12]:
%%time

#Power from simulations
means0, means1, means2 = [], [], []
t_test_p_H0, t_test_p_H1 = [], []

for i in range(5000):
    sample0 = stats.norm.rvs(loc = 0.8, scale = .3, size = n0)
    sample1 = stats.norm.rvs(loc = 0.81, scale = .3, size = n1)
    sample2 = stats.norm.rvs(loc = 0.8, scale = .3, size = n1)
    means0.append(sample0.mean().copy())
    means1.append(sample1.mean().copy())
    means2.append(sample2.mean().copy())

    #Computations for true alpha and power
    t_test_p_H0.append(stats.ttest_ind(sample0,sample2)[1].copy())
    t_test_p_H1.append(stats.ttest_ind(sample0,sample1)[1].copy())

#Computing true alpha and power from simulations
simulated_alpha = (np.asarray(t_test_p_H0) < alpha).sum() / len(t_test_p_H0)
simulated_power = (np.asarray(t_test_p_H1) < alpha).sum() / len(t_test_p_H1)
CPU times: user 21.3 s, sys: 107 ms, total: 21.4 s
Wall time: 24 s
In [13]:
print('Manual Power = ', power_manual ) 
print('Simulated power = ', simulated_power , 'and alpha =', simulated_alpha)

#Not sure why statsmodels gives a slightly different result
#Since the manual formula, the simulation and the website calculator all agree, we will ignore statsmodels
#print('Stats model power = ', power_statsmodel)
Manual Power =  0.6543457917372257
Simulated power =  0.6408 and alpha = 0.0544

Simulating early peeking

Assume that each day, we add 1,000 users to each group. What would happen if we peeked only once (say on day 5), or if we peeked every day? A small helper illustrating the peek-and-stop rule is sketched after the list below.

Note that:

  • x0 corresponds to the control with true mean = 0.8
  • x1 corresponds to a variant with true mean = 0.81 (i.e. a true alternative)
  • x2 corresponds to a variant with true mean = 0.8 (i.e. a true null)
  • x3 corresponds to a variant with true mean = 0.77 (i.e. a truly worse variant)
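
For intuition, here is a small hypothetical helper (the name peek_and_stop and its signature are ours, not part of the notebook's code) that applies the peek-and-stop rule to one experiment's daily p-values: the test is declared significant as soon as any peeked day falls below alpha, which is exactly what inflates the type I error.

def peek_and_stop(daily_p_values, alpha=0.05, peek_days=None):
    """Apply the peek-and-stop rule to one experiment's daily p-values.
    Returns (rejected, day_of_decision); peek_days defaults to every day."""
    p = list(daily_p_values)
    peek_days = range(len(p)) if peek_days is None else peek_days
    for d in peek_days:
        if p[d] < alpha:           # we peek, see a significant result and stop
            return True, d
    return False, len(p) - 1       # never significant at any peek: read out at the end

# Illustrative usage on one simulated experiment (H0_frame is built in the next cell):
# peek_and_stop(H0_frame.iloc[0])                  # peeking every day
# peek_and_stop(H0_frame.iloc[0], peek_days=[13])  # single read on the last day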
In [14]:
%%time
days = 14
sims = 500
H0_frame, H1_frame, H2_frame = np.empty((0,days)), np.empty((0,days)), np.empty((0,days))

#Looping through simulations
for i in range(sims):
    
    x0, x1, x2, x3 = np.empty(0), np.empty(0), np.empty(0), np.empty(0)
    t_test_p_H0, t_test_p_H1, t_test_p_H2 = np.empty(0), np.empty(0), np.empty(0)
    
    #Looping through days 
    for d in range(days):
        #print(d)
        daily_x0 = stats.norm.rvs(loc = mu0, scale =sigma0, size = 1000)
        daily_x1 = stats.norm.rvs(loc = mu1, scale = sigma1, size = 1000)
        daily_x2 = stats.norm.rvs(loc = mu0, scale = sigma0, size = 1000)
        daily_x3 = stats.norm.rvs(loc = mu0-0.03, scale = sigma0, size = 1000)

        x0 = np.append(x0, daily_x0.copy())
        x1 = np.append(x1, daily_x1.copy())
        x2 = np.append(x2, daily_x2.copy())
        x3 = np.append(x3, daily_x3.copy())
        #print(x0.shape)

        t_test_p_H0 = np.append(t_test_p_H0, stats.ttest_ind(x0,x2)[1].copy())
        t_test_p_H1 = np.append(t_test_p_H1, stats.ttest_ind(x0,x1)[1].copy())   
        t_test_p_H2 = np.append(t_test_p_H2, stats.ttest_ind(x0,x3)[1].copy())   
    
    #Stacking sim loop results
    H0_frame = np.vstack( (H0_frame,t_test_p_H0))
    H1_frame = np.vstack( (H1_frame,t_test_p_H1 ))
    H2_frame = np.vstack( (H2_frame,t_test_p_H2 ))

#Saving as DataFrame outside of the sim loop    
H0_frame = pd.DataFrame(H0_frame)
H1_frame = pd.DataFrame(H1_frame)
H2_frame = pd.DataFrame(H2_frame)
CPU times: user 24.4 s, sys: 122 ms, total: 24.6 s
Wall time: 27.5 s

Calculations and insights

Peeking every day

  • Inflates the true type 1 error from 5% to roughly 23% (estimated as shown below), i.e. more than a 4x increase
  • Inflates the true power from roughly 80% to roughly 89%, about an 11% relative increase
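
Concretely, writing $p_{i,d}$ for the p-value of simulation $i$ after $d$ days of data, the "true" rates printed below count the simulations where at least one peeked day is significant:

$$\widehat{\alpha}_{peek} = \frac{1}{N_{sims}} \sum_{i=1}^{N_{sims}} \mathbb{1}\left\{\min_{1 \le d \le 14} p_{i,d} < 0.05\right\}$$

evaluated on the $H_0$ simulations; the same quantity computed on the $H_1$ simulations estimates the power under daily peeking.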
In [15]:
number_sims = H0_frame.shape[0]
number_days = H0_frame.shape[1]

H0_positive_sims = ((H0_frame<alpha).sum(axis=1)>0).sum()
H1_positive_sims = ((H1_frame<alpha).sum(axis=1)>0).sum()
expected_power = ttest_two_side_power(mu0, mu1, sigma0 , sigma1, x0.shape[0], x0.shape[0], alpha = alpha, beta = beta)    

# Under a true H0, how many simulations have 1 or more day with pvalue < alpha
#print('H0 sims with 1 or more days positive =', H0_positive_sims) 

# i.e. True type 1 error in this scenario compared to expected
print('True type 1 error when peeking each day:')
print('True =', 100 *H0_positive_sims / number_sims, '%, Expected =', 100 * alpha, '%')

# Under a true H1, how many simulations have 1 or more day with pvalue < alpha
print('')
print('Power when peeking each day:')
print('True = ', 100 * H1_positive_sims / number_sims, '%, Expected =', 100* expected_power, '%')
True type 1 error when peeking each day:
True = 23.4 %, Expected = 5.0 %

Power when peeking each day:
True =  88.6 %, Expected = 79.64213092264733 %

Visual interpretation : H0 scenario

Why does peeking inflate the type I error? Given two identical distributions, i.e. a true null hypothesis, we plot:

  • Top chart: the number of positive experiments and proportion of total, each day of peeking
  • Bottom two charts:
    • In grey, experiments that are negative
    • In red, experiments that have a p value < 0.05, i.e. false positives
In [16]:
#Calculations for the bar plot        
number_positive_tests, number_remaining_tests = [], []
H0_frame_copy = H0_frame.copy()

#Loop through columns
for c in H0_frame_copy.columns:
    
    #Calculate temp results
    positive_tests = H0_frame_copy[H0_frame_copy.iloc[:,c]<0.05].index
    number_positive_tests.append(positive_tests.shape[0])
    number_remaining_tests.append(H0_frame_copy.shape[0])
    
    #Drop the positive tests from frame
    H0_frame_copy = H0_frame_copy.drop(index = positive_tests, axis = 0)
    
#Save in frame and calculate proportions    
df_bar_res = pd.DataFrame({'Day':[c for c in H0_frame_copy.columns], 'positive_tests':number_positive_tests, \
                       'total_tests':H0_frame.shape[0],'remaining_tests':number_remaining_tests})
df_bar_res['positive/total'] = df_bar_res['positive_tests'] / H0_frame.shape[0]
 

#Figure setup
fig = plt.figure(figsize=(17,14))
layout = (2, 2)
ax1 = plt.subplot2grid(layout, (0, 0), colspan=2)
ax2 = plt.subplot2grid(layout, (1, 0))
ax3 = plt.subplot2grid(layout, (1, 1))
  
    
#Bar plot
df_bar_res.plot(kind = 'bar', x = 'Day', y = 'positive_tests', ax= ax1, ylabel = 'Positive tests', title = 'H0 scenario - Peek every day')
tmp = df_bar_res.plot(kind = 'line', x = 'Day', y = 'positive/total', color = 'red', linestyle = ':',secondary_y = True,ax = ax1)
tmp.set_xlabel('Day')
tmp.set_ylabel('% of total')

    
# Line plots
alpha_plot = [0.05 for i in range(14)]

for row in H0_frame.head(100).iterrows():
    #Non peeking plot
    if row[1][13]<0.05:
        ax2.plot(row[1], linestyle = '-', color = 'red', alpha = .4)
    else:
        ax2.plot(row[1], linestyle = '-', color = 'lightgrey')
        
    #Peeking plot
    if (row[1]<0.05).sum()>0:
        ax3.plot(row[1], linestyle = '-', color = 'red', alpha = .4)
    else:
        ax3.plot(row[1], linestyle = '-', color = 'lightgrey')
    
        
ax2.plot(alpha_plot, linestyle = ':', linewidth = 3, color = 'darkred', label = '0.05')
ax2.legend(loc = 'upper right')
ax2.set_title('Type I errors: Read on day 14')
ax2.set_ylabel('p value')
ax2.set_xlabel('Day')        
        
ax3.plot(alpha_plot, linestyle = ':', linewidth = 3, color = 'darkred', label = '0.05')
ax3.legend(loc = 'upper right')
ax3.set_title('Type I errors: Peek every day')
ax3.set_ylabel('p value')
ax3.set_xlabel('Day')

plt.show()

number_sims = H0_frame.shape[0]
number_days = H0_frame.shape[1]

H0_positive_sims = ((H0_frame<alpha).sum(axis=1)>0).sum()
expected_power = ttest_two_side_power(mu0, mu1, sigma0 , sigma1, x0.shape[0], x0.shape[0], alpha = alpha, beta = beta)    

# Under a true H0, how many simulations have 1 or more day with pvalue < alpha
#print('H0 sims with 1 or more days positive =', H0_positive_sims) 

# i.e. True type 1 error in this scenario compared to expected
print('True type 1 error when peeking each day:')
print('True =', 100 *H0_positive_sims / number_sims, '%, Expected =', 100 * alpha, '%')
print('Sum of % of total =',round(df_bar_res['positive/total'].sum(),3))
True type 1 error when peeking each day:
True = 23.4 %, Expected = 5.0 %
Sum of % of total = 0.234

Visual interpretation : H1 scenario

Why does peeking inflate power? Given two truly different distributions, i.e. a true alternative hypothesis, we plot:

  • Top chart: the number of positive experiments and proportion of total, each day of peeking
  • Bottom two charts:
    • In grey, experiments that are negative
    • In red, experiments that have a p value < 0.05, i.e. true positives
In [17]:
#Calculations for the bar plot        
number_positive_tests, number_remaining_tests = [], []
H1_frame_copy = H1_frame.copy()

#Loop through columns
for c in H1_frame_copy.columns:
    
    #Calculate temp results
    positive_tests = H1_frame_copy[H1_frame_copy.iloc[:,c]<0.05].index
    number_positive_tests.append(positive_tests.shape[0])
    number_remaining_tests.append(H1_frame_copy.shape[0])
    
    #Drop the positive tests from frame
    H1_frame_copy = H1_frame_copy.drop(index = positive_tests, axis = 0)
    
#Save in frame and calculate proportions    
df_bar_res = pd.DataFrame({'Day':[c for c in H1_frame_copy.columns], 'positive_tests':number_positive_tests, \
                       'total_tests':H1_frame.shape[0],'remaining_tests':number_remaining_tests})
df_bar_res['positive/total'] = df_bar_res['positive_tests'] / H1_frame.shape[0]
 

#Figure setup
fig = plt.figure(figsize=(14,14))
layout = (2, 2)
ax1 = plt.subplot2grid(layout, (0, 0), colspan=2)
ax2 = plt.subplot2grid(layout, (1, 0))
ax3 = plt.subplot2grid(layout, (1, 1))
  
    
#Bar plot
df_bar_res.plot(kind = 'bar', x = 'Day', y = 'positive_tests', ax= ax1, xlabel = 'Day of peek', ylabel = 'Positive tests', title = 'H1 scenario - Peek every day')
tmp = df_bar_res.plot(kind = 'line', x = 'Day', y = 'positive/total', color = 'red', linestyle = ':',secondary_y = True,ax = ax1)
tmp.set_ylabel('% of total')
    
# Line plots
alpha_plot = [0.05 for i in range(14)]

for row in H1_frame.head(100).iterrows():
    #Non peeking plot
    if row[1][13]<0.05:
        ax2.plot(row[1], linestyle = '-', color = 'red', alpha = .4)
    else:
        ax2.plot(row[1], linestyle = '-', color = 'lightgrey')
        
    #Peeking plot
    if (row[1]<0.05).sum()>0:
        ax3.plot(row[1], linestyle = '-', color = 'red', alpha = .4)
    else:
        ax3.plot(row[1], linestyle = '-', color = 'lightgrey')
    
        
ax2.plot(alpha_plot, linestyle = ':', linewidth = 3, color = 'darkred', label = '0.05')
ax2.legend(loc = 'upper right')
ax2.set_title('Power: Read on day 14')
ax2.set_ylabel('p value')
ax2.set_xlabel('Day')        
        
ax3.plot(alpha_plot, linestyle = ':', linewidth = 3, color = 'darkred', label = '0.05')
ax3.legend(loc = 'upper right')
ax3.set_title('Power: Peek every day')
ax3.set_ylabel('p value')
ax3.set_xlabel('Day')

plt.show()


#Printing sums and overall stats
number_sims = H0_frame.shape[0]
number_days = H0_frame.shape[1]

H1_positive_sims = ((H1_frame<alpha).sum(axis=1)>0).sum()
expected_power = ttest_two_side_power(mu0, mu1, sigma0 , sigma1, x0.shape[0], x0.shape[0], alpha = alpha, beta = beta)    

# Under a true H1, how many simulations have 1 or more day with pvalue < alpha
print('')
print('Power when peeking each day:')
print('True = ', 100 * H1_positive_sims / number_sims, '%, Expected =', 100* expected_power, '%')
print('Sum of % of total =',round(df_bar_res['positive/total'].sum(),3))