Subset selection in Python

This notebook explores common methods for performing subset selection on a regression model, namely

  • Best subset selection
  • Forward stepwise selection
  • Criteria for choosing the optimal model
    • $C_p$, AIC, BIC, $R^2_{adj}$

The figures, formulas, and explanations are taken from Chapter 6 of the book "Introduction to Statistical Learning" (ISLR) and have been adapted to Python

Libraries

In [1]:
import itertools
import time
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

%matplotlib inline
plt.style.use('ggplot')

The dataset

The Credit dataset is a use case for linear regression in which some of the predictors are qualitative.

Note: all datasets from the book are available on the ISLR book website

In [2]:
credit = pd.read_csv('/Users/User/Desktop/Data/Datasets/ISLR_data/Credit.csv', usecols=list(range(1,12)))
credit.head(3)
Out[2]:
Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
0 14.891 3606 283 2 34 11 Male No Yes Caucasian 333
1 106.025 6645 483 3 82 15 Female Yes Yes Asian 903
2 104.593 7075 514 4 71 11 Male No No Asian 580

Encoding qualitative data

In [3]:
credit = pd.get_dummies(credit, columns = ['Gender', 'Student','Married','Ethnicity'],drop_first = True)
credit.head(3)
Out[3]:
Income Limit Rating Cards Age Education Balance Gender_Female Student_Yes Married_Yes Ethnicity_Asian Ethnicity_Caucasian
0 14.891 3606 283 2 34 11 333 0 0 1 0 1
1 106.025 6645 483 3 82 15 903 1 1 1 1 0
2 104.593 7075 514 4 71 11 580 0 0 0 1 0

Best subset selection

To perform best subset selection, we fit a separate model for each possible combination of the $n$ predictors and then select the best subset. That is, we fit:

  • All $\binom{n}{1} = n$ models that contain exactly one predictor
  • All $\binom{n}{2}$ models that contain exactly two predictors, at the second step
  • And so on, until the final step where all $n$ predictors are included in the model

This results in $2^n$ possibilities, since we are enumerating the power set of the predictors. In our case there are $2^{11} = 2048$ possible models, including the null model (see the quick count sketched below).
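As a quick sanity check of this count, reusing the itertools import from above (an illustrative snippet, not one of the numbered cells):

#Counting the candidate models directly
n = 11
n_models = sum(len(list(itertools.combinations(range(n), k))) for k in range(1, n + 1))
print(n_models)      #2047 non-empty subsets
print(n_models + 1)  #2048 = 2**11 once the null model is counted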

Algorithm

  • Let $\mathcal{M}_0$ denote the null model, which contains no predictors; this model simply predicts the sample mean for each observation
  • For $k = 1,2,...,n$:
    • Fit all $\binom{n}{k}$ models that contain exactly $k$ predictors
    • Pick the best among these $\binom{n}{k}$ models, and call it $\mathcal{M}_k$. Here the best is defined as having the smallest RSS, or an equivalent measure
  • Select the single best model among $\mathcal{M}_0, \mathcal{M}_1,...,\mathcal{M}_n$ using cross-validated prediction error, $C_p$, BIC, adjusted $R^2$ or any other method.

Helper function for fitting linear regression (Sklearn)

In [4]:
def fit_linear_reg(X,Y):
    #Fit linear regression model and return RSS and R squared values
    model_k = linear_model.LinearRegression(fit_intercept = True)
    model_k.fit(X,Y)
    RSS = mean_squared_error(Y,model_k.predict(X)) * len(Y)
    R_squared = model_k.score(X,Y)
    return RSS, R_squared
    
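As an illustrative usage sketch, calling the helper on a single candidate subset shows what it returns; the same numbers reappear in the best-subset tables further down (RSS ≈ 1.05e7 and R squared ≈ 0.875 for Income and Rating):

#Example: RSS and R squared for the model Balance ~ Income + Rating
rss, r2 = fit_linear_reg(credit[['Income', 'Rating']], credit.Balance)
print(round(rss), round(r2, 3))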

Implementing Best subset selection (using itertools.combinations)

In [5]:
#Importing tqdm for the progress bar
from tqdm import tnrange, tqdm_notebook

#Initialization variables
Y = credit.Balance
X = credit.drop(columns = 'Balance', axis = 1)
k = 11
RSS_list, R_squared_list, feature_list = [],[], []
numb_features = []

#Looping over k = 1 to k = 11 features in X
for k in tnrange(1,len(X.columns) + 1, desc = 'Loop...'):

    #Looping over all possible combinations: from 11 choose k
    for combo in itertools.combinations(X.columns,k):
        tmp_result = fit_linear_reg(X[list(combo)],Y)   #Store temp result 
        RSS_list.append(tmp_result[0])                  #Append lists
        R_squared_list.append(tmp_result[1])
        feature_list.append(combo)
        numb_features.append(len(combo))   

#Store in DataFrame
df = pd.DataFrame({'numb_features': numb_features,'RSS': RSS_list, 'R_squared':R_squared_list,'features':feature_list})

Finding the best subsets for each number of features

Using the smallest RSS value or, equivalently, the largest R_squared value

In [6]:
df_min = df[df.groupby('numb_features')['RSS'].transform(min) == df['RSS']]
df_max = df[df.groupby('numb_features')['R_squared'].transform(max) == df['R_squared']]
display(df_min.head(3))
display(df_max.head(3))
RSS R_squared features numb_features
2 2.143512e+07 0.745848 (Rating,) 1
12 1.053254e+07 0.875118 (Income, Rating) 2
79 4.227219e+06 0.949879 (Income, Rating, Student_Yes) 3
RSS R_squared features numb_features
2 2.143512e+07 0.745848 (Rating,) 1
12 1.053254e+07 0.875118 (Income, Rating) 2
79 4.227219e+06 0.949879 (Income, Rating, Student_Yes) 3

Adding columns to the dataframe with RSS and R squared values of the best subset

In [7]:
df['min_RSS'] = df.groupby('numb_features')['RSS'].transform(min)
df['max_R_squared'] = df.groupby('numb_features')['R_squared'].transform(max)
df.head()
Out[7]:
RSS R_squared features numb_features min_RSS max_R_squared
0 6.620874e+07 0.214977 (Income,) 1 2.143512e+07 0.745848
1 2.171566e+07 0.742522 (Limit,) 1 2.143512e+07 0.745848
2 2.143512e+07 0.745848 (Rating,) 1 2.143512e+07 0.745848
3 8.370950e+07 0.007475 (Cards,) 1 2.143512e+07 0.745848
4 8.433963e+07 0.000003 (Age,) 1 2.143512e+07 0.745848

Plotting the best subset selection process

In [8]:
fig = plt.figure(figsize = (16,6))
ax = fig.add_subplot(1, 2, 1)

ax.scatter(df.numb_features,df.RSS, alpha = .2, color = 'darkblue' )
ax.set_xlabel('# Features')
ax.set_ylabel('RSS')
ax.set_title('RSS - Best subset selection')
ax.plot(df.numb_features,df.min_RSS,color = 'r', label = 'Best subset')
ax.legend()

ax = fig.add_subplot(1, 2, 2)
ax.scatter(df.numb_features,df.R_squared, alpha = .2, color = 'darkblue' )
ax.plot(df.numb_features,df.max_R_squared,color = 'r', label = 'Best subset')
ax.set_xlabel('# Features')
ax.set_ylabel('R squared')
ax.set_title('R_squared - Best subset selection')
ax.legend()

plt.show()

Forward stepwise selection

For computational reasons, best subset selection cannot be applied for large $n$ because of its $2^n$ complexity. Forward stepwise selection begins with a model containing no predictors and then adds predictors to the model one at a time. At each step, the variable that gives the greatest additional improvement to the fit is added to the model (a quick count of how many fits this saves is sketched below).
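An illustrative comparison of the number of least-squares fits the two approaches require for $p = 11$ predictors: best subset fits $2^p$ models, whereas forward stepwise fits only $1 + \sum_{k=0}^{p-1}(p-k) = 1 + p(p+1)/2$ models.

p = 11
best_subset_fits = 2**p                        #2048 models, including the null model
forward_stepwise_fits = 1 + p * (p + 1) // 2   #1 null model + 66 augmented fits = 67
print(best_subset_fits, forward_stepwise_fits)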

Algorithm

Let $\mathcal{M}_0$ denote the null model, which contains no predictors

  • For $k = 0,1,...,n-1$
    • Consider all $n - k$ models that augment the predictors in $\mathcal{M}_k$ with one additional predictor
    • Choose the best among these $n - k$ models, and call it $\mathcal{M}_{k+1}$. Here the best is defined as having the smallest RSS or the largest $R^2$
  • Select the single best model among $\mathcal{M}_0, \mathcal{M}_1,...,\mathcal{M}_n$ using cross-validated prediction error, $C_p$, BIC, adjusted $R^2$ or any other method.
In [9]:
#Initialization variables
Y = credit.Balance
X = credit.drop(columns = 'Balance', axis = 1)
k = 11

remaining_features = list(X.columns.values)
features = []
RSS_list, R_squared_list = [np.inf], [np.inf] #Placeholders at index 0 so that position i corresponds to i features
features_list = dict()

for i in range(1,k+1):
    best_RSS = np.inf
    
    for combo in itertools.combinations(remaining_features,1):

        RSS = fit_linear_reg(X[list(combo) + features],Y)   #Store temp result

        if RSS[0] < best_RSS:
            best_RSS = RSS[0]
            best_R_squared = RSS[1]
            best_feature = combo[0]

    #Updating variables for next loop
    features.append(best_feature)
    remaining_features.remove(best_feature)
    
    #Saving values for plotting
    RSS_list.append(best_RSS)
    R_squared_list.append(best_R_squared)
    features_list[i] = features.copy()

Displaying results of the first 4 steps

In [10]:
print('Forward stepwise subset selection')
print('Number of features |', 'Features |', 'RSS')
display([(i,features_list[i], round(RSS_list[i])) for i in range(1,5)])
Forward stepwise subset selection
Number of features | Features | RSS
[(1, ['Rating'], 21435122.0),
 (2, ['Rating', 'Income'], 10532541.0),
 (3, ['Rating', 'Income', 'Student_Yes'], 4227219.0),
 (4, ['Rating', 'Income', 'Student_Yes', 'Limit'], 4032502.0)]

Comparing models: AIC, BIC, Mallows' $C_p$

The training-set Mean Squared Error (MSE) is generally an underestimate of the test MSE. This is because when we fit a model to the training data using least squares, we specifically estimate the regression coefficients so that the training RSS is minimized. In particular, the training RSS decreases as we add more features to the model, but the test error may not. Therefore the training RSS and $R^2$ should not be used to select among models with different numbers of predictors unless we adjust for this underestimation.

Mallows' $C_p$

Mallows' $C_p$ is named after Colin Lingwood Mallows and is defined as:

$$ C_p = \frac{1}{m} (RSS + 2d\hat\sigma^2)$$

where $\hat\sigma^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement. Typically $\hat\sigma^2$ is estimated using the full model containing all predictors.

Essentially, the $C_p$ statistic adds a penalty of $2d\hat\sigma^2$ to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error. Clearly, the penalty increases as the number of predictors in the model increases; this is intended to adjust for the corresponding decrease in training RSS. It can be shown that if $\hat\sigma^2$ is an unbiased estimate of $\sigma^2$, then $C_p$ is an unbiased estimate of the test MSE, so we choose the model with the smallest $C_p$.

Akaike's Information Criteria (AIC)

The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of a linear model with Gaussian errors, maximum likelihood and least squares coincide, and the AIC is given (up to irrelevant constants) by

$$ AIC = \frac{1}{m\hat \sigma^2} (RSS + 2d\hat\sigma^2)$$

Bayesian Information Criteria (BIC)

BIC is derived from a Bayesian point of view, and looks similar to the $C_p$ and $AIC$ - it is defined (up to irrelevant constants) as:

$$ BIC = \frac{1}{m\hat \sigma^2} (RSS + \log(m) d \hat\sigma^2)$$

Like $C_p$ and AIC, the BIC will tend to take small values for a model with low test error, so we select the model with the lowest BIC. Since $\log(m) > 2$ for any $m > 7$, the BIC places a heavier penalty than $C_p$ on models with many variables.

Adjusted $R^2$

Since the $R^2$ always increases as more variables are added, the adjusted $R^2$ accounts for that fact by introducing a penalty. The intuition is that once all the correct variables have been included in the model, additional noise variables will lead to only a very small decrease in RSS, but an increase in $k$, and hence will decrease the adjusted $R^2$. In effect, we pay a price for the inclusion of unnecessary variables in the model.

$$ R_{a}^2 = 1 - \frac{RSS / (m - k -1)}{TSS / (m - 1)} = 1 - (1 - R^2) \frac{m - 1}{m - k - 1} $$

Theoretical justification

$C_p$, AIC and BIC all have rigorous theoretical justifications that rely on asymptotic arguments, i.e. arguments that hold as the sample size $m$ grows very large, whereas the adjusted $R^2$, although quite intuitive, is not as well motivated in statistical theory.

Estimation of $\hat\sigma^2$

Using the RSS of the full model with all $p$ features (i.e. the smallest RSS), we estimate $\hat\sigma^2$ as:

$$ \hat\sigma^2 = \frac{RSS}{m - p -1} $$
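The next cells apply these formulas to the forward stepwise results with pandas. As a standalone reference, a small helper along these lines (the name model_criteria and its signature are purely illustrative) computes all four criteria for a single model:

def model_criteria(RSS, R_squared, d, m, hat_sigma_squared):
    #C_p, AIC, BIC and adjusted R^2 for a model with d predictors,
    #fitted on m observations, given an estimate of the error variance
    C_p = (RSS + 2 * d * hat_sigma_squared) / m
    AIC = (RSS + 2 * d * hat_sigma_squared) / (m * hat_sigma_squared)
    BIC = (RSS + np.log(m) * d * hat_sigma_squared) / (m * hat_sigma_squared)
    R_squared_adj = 1 - (1 - R_squared) * (m - 1) / (m - d - 1)
    return C_p, AIC, BIC, R_squared_adj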

Combining forward stepwise results into a new DataFrame

In [11]:
df1 = pd.concat([pd.DataFrame({'features':features_list}),pd.DataFrame({'RSS':RSS_list, 'R_squared': R_squared_list})], axis=1, join='inner')
df1['numb_features'] = df1.index

Computing $C_p$, AIC, BIC and the adjusted R-squared

In [12]:
#Initializing useful variables
m = len(Y)
p = 11
hat_sigma_squared = (1/(m - p -1)) * min(df1['RSS'])

#Computing
df1['C_p'] = (1/m) * (df1['RSS'] + 2 * df1['numb_features'] * hat_sigma_squared )
df1['AIC'] = (1/(m*hat_sigma_squared)) * (df1['RSS'] + 2 * df1['numb_features'] * hat_sigma_squared )
df1['BIC'] = (1/(m*hat_sigma_squared)) * (df1['RSS'] +  np.log(m) * df1['numb_features'] * hat_sigma_squared )
df1['R_squared_adj'] = 1 - ( (1 - df1['R_squared'])*(m-1)/(m-df1['numb_features'] -1))
df1
Out[12]:
features RSS R_squared numb_features C_p AIC BIC R_squared_adj
1 [Rating] 2.143512e+07 0.745848 1 53636.603151 5.495771 5.505750 0.745210
2 [Rating, Income] 1.053254e+07 0.875118 2 26428.949364 2.707991 2.727949 0.874489
3 [Rating, Income, Student_Yes] 4.227219e+06 0.949879 3 10714.442485 1.097835 1.127771 0.949499
4 [Rating, Income, Student_Yes, Limit] 4.032502e+06 0.952188 4 10276.446437 1.052956 1.092871 0.951703
5 [Rating, Income, Student_Yes, Limit, Cards] 3.866091e+06 0.954161 5 9909.218362 1.015329 1.065222 0.953579
6 [Rating, Income, Student_Yes, Limit, Cards, Age] 3.821620e+06 0.954688 6 9846.837591 1.008937 1.068809 0.953996
7 [Rating, Income, Student_Yes, Limit, Cards, Ag... 3.810759e+06 0.954817 7 9868.483418 1.011155 1.081006 0.954010
8 [Rating, Income, Student_Yes, Limit, Cards, Ag... 3.804746e+06 0.954888 8 9902.248962 1.014615 1.094444 0.953965
9 [Rating, Income, Student_Yes, Limit, Cards, Ag... 3.798367e+06 0.954964 9 9935.100415 1.017981 1.107789 0.953924
10 [Rating, Income, Student_Yes, Limit, Cards, Ag... 3.791345e+06 0.955047 10 9966.344067 1.021182 1.120969 0.953891
11 [Rating, Income, Student_Yes, Limit, Cards, Ag... 3.786730e+06 0.955102 11 10003.604241 1.025000 1.134765 0.953829
In [13]:
df1['R_squared_adj'].idxmax()
df1['R_squared_adj'].max()
Out[13]:
0.9540098163629882
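From the Out[12] table above, $C_p$ and AIC are minimized at 6 features, BIC at 5 features, and the adjusted $R^2$ is maximized at 7 features. Since df1 is indexed by the number of features, the selected model size under each criterion can be read off directly, for example with a short illustrative sketch:

#Number of features selected by each criterion (df1 is indexed by numb_features)
for criterion in ['C_p', 'AIC', 'BIC']:
    print(criterion, df1[criterion].idxmin())
print('R_squared_adj', df1['R_squared_adj'].idxmax())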

Plotting the computed values as a function of number of features

In [14]:
variables = ['C_p', 'AIC','BIC','R_squared_adj']
fig = plt.figure(figsize = (18,6))

for i,v in enumerate(variables):
    ax = fig.add_subplot(1, 4, i+1)
    ax.plot(df1['numb_features'],df1[v], color = 'lightblue')
    ax.scatter(df1['numb_features'],df1[v], color = 'darkblue')
    if v == 'R_squared_adj':
        ax.plot(df1[v].idxmax(),df1[v].max(), marker = 'x', markersize = 20)
    else:
        ax.plot(df1[v].idxmin(),df1[v].min(), marker = 'x', markersize = 20)
    ax.set_xlabel('Number of predictors')
    ax.set_ylabel(v)

fig.suptitle('Subset selection using C_p, AIC, BIC, Adjusted R2', fontsize = 16)
plt.show()

Miscellaneous

Pairplot

In [15]:
sns.pairplot(credit[['Balance','Age','Cards','Education','Income','Limit','Rating']]);

Statsmodel linear regression

Least squares coefficient estimates associated with the regression of balance onto ethnicity in the Credit dataset. The linear model is given in ISLR equation (3.30); that is, ethnicity is encoded via two dummy variables.

In [16]:
#Setting up the X and Y variables, adding constant term for intercept
Y = credit.Balance
X = credit[['Ethnicity_Asian','Ethnicity_Caucasian']]
X = sm.add_constant(X)
X.head()
Out[16]:
const Ethnicity_Asian Ethnicity_Caucasian
0 1.0 0 1
1 1.0 1 0
2 1.0 1 0
3 1.0 1 0
4 1.0 0 1

Table 3.8 - page 86

In [17]:
model_1 = sm.OLS(Y, X)
result_1 = model_1.fit()
result_1.summary()
Out[17]:
OLS Regression Results
Dep. Variable: Balance R-squared: 0.000
Model: OLS Adj. R-squared: -0.005
Method: Least Squares F-statistic: 0.04344
Date: Mon, 11 Jun 2018 Prob (F-statistic): 0.957
Time: 11:35:51 Log-Likelihood: -3019.3
No. Observations: 400 AIC: 6045.
Df Residuals: 397 BIC: 6057.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 531.0000 46.319 11.464 0.000 439.939 622.061
Ethnicity_Asian -18.6863 65.021 -0.287 0.774 -146.515 109.142
Ethnicity_Caucasian -12.5025 56.681 -0.221 0.826 -123.935 98.930
Omnibus: 28.829 Durbin-Watson: 1.946
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.395
Skew: 0.581 Prob(JB): 1.13e-06
Kurtosis: 2.460 Cond. No. 4.39