Testing non-normal distributions

This notebook runs simulations that test for a difference in means between two independent gamma distributions, by drawing samples from each and comparing the sample means.

We will compare two test methods: the two-sample t-test and the Mann-Whitney U test.

Summary of results

These simulations show that

  • When the variances of the two populations are the same, the Mann-Whitney test has greater true power but also greater true Type I error than the t-test.
  • Even at the largest sample size (N = 1000), the true Type I error of the Mann-Whitney test is still about 9%, whereas the t-test achieves the 5% required by the experiment setup (reject $H_0$ for p values below 5%)
  • When the variances of the two populations differ, the Mann-Whitney test produces a large Type I error even when the means are the same. This is expected, since the Mann-Whitney test detects differences in distributions, not in means.
  • The t-test is robust to differences in variance when the means are identical

Experiment 1: Different means, same variance

Consider two gamma distributions parametrized by shape $k$ and scale $\theta$, with parameters

  • X1: gamma with $k = 0.5$ and $\theta = 1$, hence mean $E[X_1] = k\theta = 0.5$ and variance $Var[X_1] = k\theta^2 = 0.5$
  • X2: gamma with $k = 1.445$ and $\theta = 0.588235$, hence mean $E[X_2] = k\theta = 0.85$ and variance $Var[X_2] = k\theta^2 = 0.5$
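These parameters come from moment matching: since $E[X] = k\theta$ and $Var[X] = k\theta^2$, the shape and scale for a target mean and variance are $k = E[X]^2 / Var[X]$ and $\theta = Var[X] / E[X]$. A small helper (the name `gamma_params` is ours) makes this concrete:

```python
def gamma_params(mean, var):
    """Solve for gamma shape k and scale theta given a target mean and variance."""
    k = mean ** 2 / var   #since E[X] = k*theta and Var[X] = k*theta^2
    theta = var / mean
    return k, theta

print(gamma_params(0.5, 0.5))   #X1
print(gamma_params(0.85, 0.5))  #X2: k = 1.445, theta ~ 0.588235
```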

We are interested in testing for a difference in means of samples from X1 and X2. The setup is chosen so that X1 and X2 have the same variance, hence the true Cohen's d effect size is approximately 0.5:

$$ d = \frac{0.85 - 0.5}{\sqrt{0.5}} \approx 0.5 $$
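Spelled out in code (plain arithmetic, no assumptions beyond the parameters above):

```python
import math

#Cohen's d for Experiment 1: with equal variances the pooled
#standard deviation reduces to sqrt(0.5)
mu1, mu2, var = 0.5, 0.85, 0.5
d = (mu2 - mu1) / math.sqrt(var)
print(round(d, 3))  #0.495, i.e. approximately 0.5
```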

We will compare two testing methods, the two-sample t-test and the Mann-Whitney non-parametric test, and simulate the true Type I error and power of these tests for different sample sizes (rejecting the null hypothesis for $p$ value < 0.05)

  • $H_0: \mu_{X_1} = \mu_{X_2} = 0.5$
  • $H_1: \mu_{X_1} \neq \mu_{X_2}$

The true Type I error is calculated as $P(\text{reject} \mid H_0)$

The true power is calculated as $P(\text{reject} \mid H_1)$

where we simulate thousands of experiments using the true distributions under $H_0$ and $H_1$.
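The estimation loop used throughout can be sketched in a few lines (the helper name `rejection_rate` and the smaller repeat count are ours; the full experiments below use 5000 repeats):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

def rejection_rate(dist_a, dist_b, test, n=50, repeats=2000, alpha=0.05):
    """Fraction of simulated experiments where `test` rejects at level `alpha`."""
    rejections = 0
    for _ in range(repeats):
        a = dist_a.rvs(size=n, random_state=rng)
        b = dist_b.rvs(size=n, random_state=rng)
        if test(a, b)[1] < alpha:
            rejections += 1
    return rejections / repeats

g = stats.gamma(a=0.5, scale=1.0)
#Both samples drawn from the same population, so this estimates the
#true Type I error; a well-calibrated test should land near 5%
print(rejection_rate(g, g, stats.ttest_ind))
```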

Population distributions

In [24]:
import os
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set() #Making seaborn the default styling 
In [25]:
#The population distributions
X_1 = stats.gamma.rvs(a = .5, loc = 0, scale = 1, size = 100000, random_state = 0)
X_2 = stats.gamma.rvs(a = 1.445, loc = 0, scale = 0.588235, size = 100000, random_state = 0)

plt.figure(figsize = (7,7))
sns.distplot(X_1)
sns.distplot(X_2)
plt.xlim(0,5)
plt.title('Distribution of X1 and X2')
plt.show()
/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py:1706: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
In [22]:
N_range = [10, 50, 200, 1000]

fig, axes = plt.subplots(nrows = 2, ncols = 2, figsize = (15,15))
axes = axes.flatten()
ax = 0

for N in N_range:

    means1, means2 = [], []
    t_test_p_H0, mannwhitney_p_H0 = [], []
    t_test_p_H1, mannwhitney_p_H1 = [], []

    for i in range(5000):
        #sample0 and sample1 come from the same population (H0 true),
        #while sample2 comes from the shifted population (H1 true)
        sample0 = stats.gamma.rvs(a = .5, loc = 0, scale = 1, size = N)
        sample1 = stats.gamma.rvs(a = .5, loc = 0, scale = 1, size = N)
        sample2 = stats.gamma.rvs(a = 1.445, loc = 0, scale = 0.588235, size = N)
        means1.append(sample1.mean())
        means2.append(sample2.mean())

        #p values under H0 (for true alpha) and under H1 (for true power)
        t_test_p_H0.append(stats.ttest_ind(sample0, sample1)[1])
        mannwhitney_p_H0.append(stats.mannwhitneyu(sample0, sample1)[1])
        t_test_p_H1.append(stats.ttest_ind(sample1, sample2)[1])
        mannwhitney_p_H1.append(stats.mannwhitneyu(sample1, sample2)[1])
        
    means1 = np.asarray(means1)
    means2 = np.asarray(means2)
    skew, kurtosis = stats.skew(means1), stats.kurtosis(means1) 
    
    #Computing true alpha and power
    t_test_alpha = (np.asarray(t_test_p_H0) < 0.05).sum() / len(t_test_p_H0)
    mannwhitney_alpha = (np.asarray(mannwhitney_p_H0) < 0.05).sum() / len(mannwhitney_p_H0)
    t_test_power = (np.asarray(t_test_p_H1) < 0.05).sum() / len(t_test_p_H1)
    mannwhitney_power = (np.asarray(mannwhitney_p_H1) < 0.05).sum() / len(mannwhitney_p_H1)

    sns.distplot(means1, ax = axes[ax], label = 'mean X1')
    sns.distplot(means2, ax = axes[ax], label = 'mean X2')
    axes[ax].set_title('Distribution of sample means: N={}, Skew={}, Kurtosis={}'.format(N, round(skew,2), round(kurtosis,2)))
    axes[ax].legend(['Alpha: t={}, Mann={} | Power: t={}, Mann={}'.format(round(t_test_alpha,3), round(mannwhitney_alpha,3), round(t_test_power,3), round(mannwhitney_power,3))], loc = 1)
    ax += 1

plt.suptitle('Results from 5000 repeats')
plt.show()

Discussion

We have performed 5000 simulations using the true populations to compare the true Type I error ($\alpha$) and true power ($1 - \beta$) of the t-test and the Mann-Whitney test, with sample sizes increasing from 10 to 1000. A few points are worth noting:

  • As expected, the sample mean is not normally distributed for small sample sizes (N = 10), as the skew and kurtosis of its distribution show. For larger sample sizes, the distribution is approximately normal
  • For all sample sizes, the Mann-Whitney test has 2 to 3 times the power of the t-test
  • For all sample sizes, the Mann-Whitney test also has 2 to 3 times the Type I error of the t-test
  • The t-test has low power at small sample sizes

Conclusion: when the variances of the two populations are indeed the same, the Mann-Whitney test greatly outperforms the t-test in terms of power, but at the cost of a higher Type I error rate.

Experiment 2: Different variances, same mean

In this simulation we test for a difference in means when the two gamma populations have the same mean but different variances.

  • X1: gamma with $k = 0.5$ and $\theta = 1$, hence mean $E[X_1] = k\theta = 0.5$ and variance $Var[X_1] = k\theta^2 = 0.5$
  • X2: gamma with $k = 0.25$ and $\theta = 2$, hence mean $E[X_2] = k\theta = 0.5$ and variance $Var[X_2] = k\theta^2 = 1$
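As a quick sanity check on these parameterizations, scipy can report the theoretical moments directly:

```python
import scipy.stats as stats

#Theoretical mean and variance of the two Experiment 2 populations
m1, v1 = stats.gamma.stats(a = 0.5, scale = 1.0, moments = 'mv')
m2, v2 = stats.gamma.stats(a = 0.25, scale = 2.0, moments = 'mv')
print(m1, v1)  #mean 0.5, variance 0.5
print(m2, v2)  #mean 0.5, variance 1.0
```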

  • $H_0: \mu_{X_1} = \mu_{X_2}$

  • $H_1: \mu_{X_1} \neq \mu_{X_2}$

Here we won't be able to compute the power, because the simulation does not contain a true $H_1$ scenario. However, we can compute the Type I error both when $Var[X_1] = Var[X_2]$ and when $Var[X_1] \neq Var[X_2]$.

Results from the simulation show that the t-test is very robust to unequal variances: its Type I error stays close to 5% for all sample sizes. As expected, the Mann-Whitney test performs poorly in this case, since it tests for a difference in distributions rather than a difference in means.
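A related point worth flagging: `scipy.stats.ttest_ind` defaults to the pooled-variance Student's t-test (`equal_var = True`); passing `equal_var = False` gives Welch's t-test, which drops the equal-variance assumption and is a common choice when variances may differ. A minimal comparison on one draw from the Experiment 2 populations (sample size here is illustrative):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
a = stats.gamma.rvs(a = .5, loc = 0, scale = 1, size = 200, random_state = rng)
b = stats.gamma.rvs(a = .25, loc = 0, scale = 2, size = 200, random_state = rng)

#Same samples, two variants of the two-sample t-test
p_student = stats.ttest_ind(a, b)[1]                   #pooled variance
p_welch = stats.ttest_ind(a, b, equal_var = False)[1]  #Welch's correction
print(p_student, p_welch)
```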

In [20]:
#The population distributions
X_1 = stats.gamma.rvs(a = .5, loc = 0, scale = 1, size = 100000, random_state = 0)
X_2 = stats.gamma.rvs(a = .25, loc = 0, scale = 2, size = 100000, random_state = 0)

plt.figure(figsize = (7,7))
sns.distplot(X_1)
sns.distplot(X_2)
plt.xlim(0,5)
plt.show()
In [21]:
N_range = [10, 50, 200, 1000]

fig, axes = plt.subplots(nrows = 2, ncols = 2, figsize = (15,15))
axes = axes.flatten()
ax = 0

for N in N_range:

    means1, means2 = [], []
    t_test_p_H0, mannwhitney_p_H0 = [], []
    t_test_p_H1, mannwhitney_p_H1 = [], []

    for i in range(5000):
        #sample0 and sample1 come from the same population (equal variances),
        #while sample2 has the same mean but double the variance
        sample0 = stats.gamma.rvs(a = .5, loc = 0, scale = 1, size = N)
        sample1 = stats.gamma.rvs(a = .5, loc = 0, scale = 1, size = N)
        sample2 = stats.gamma.rvs(a = .25, loc = 0, scale = 2, size = N)
        means1.append(sample1.mean())
        means2.append(sample2.mean())

        #p values for the equal-variance pair and the unequal-variance pair
        t_test_p_H0.append(stats.ttest_ind(sample0, sample1)[1])
        mannwhitney_p_H0.append(stats.mannwhitneyu(sample0, sample1)[1])
        t_test_p_H1.append(stats.ttest_ind(sample1, sample2)[1])
        mannwhitney_p_H1.append(stats.mannwhitneyu(sample1, sample2)[1])
        
    means1 = np.asarray(means1)
    means2 = np.asarray(means2)
    skew, kurtosis = stats.skew(means1), stats.kurtosis(means1) 
    
    #Computing true alpha and power
    t_test_alpha0 = (np.asarray(t_test_p_H0) < 0.05).sum() / len(t_test_p_H0)
    mannwhitney_alpha0 = (np.asarray(mannwhitney_p_H0) < 0.05).sum() / len(mannwhitney_p_H0)
    t_test_alpha1 = (np.asarray(t_test_p_H1) < 0.05).sum() / len(t_test_p_H1)
    mannwhitney_alpha1 = (np.asarray(mannwhitney_p_H1) < 0.05).sum() / len(mannwhitney_p_H1)

    sns.distplot(means1, ax = axes[ax], label = 'mean X1')
    sns.distplot(means2, ax = axes[ax], label = 'mean X2')
    axes[ax].set_title('Distribution of sample means: N={}, Skew={}, Kurtosis={}'.format(N, round(skew,2), round(kurtosis,2)))
    axes[ax].legend(['Alpha same var: t={}, Mann={} | Alpha diff var: t={}, Mann={}'.format(round(t_test_alpha0,3), round(mannwhitney_alpha0,3), round(t_test_alpha1,3), round(mannwhitney_alpha1,3))], loc = 1)
    ax += 1

plt.suptitle('Results from 5000 repeats')
plt.show()