Subset selection in Python¶
This notebook explores common methods for performing subset selection on a regression model, namely:
- Best subset selection
- Forward stepwise selection
- Criteria for choosing the optimal model
- $C_p$, AIC, BIC, $R^2_{adj}$
The figures, formulas and explanations are taken from Chapter 6 of the book "Introduction to Statistical Learning (ISLR)" and have been adapted to Python
Sources:¶
- Introduction to Statistical Learning (ISLR) - Chapter 6
- https://github.com/JWarmenhoven/ISLR-python
- http://www.science.smith.edu/~jcrouser/SDS293/labs/lab8-py.html
Libraries¶
import itertools
import time
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
%matplotlib inline
plt.style.use('ggplot')
credit = pd.read_csv('/Users/User/Desktop/Data/Datasets/ISLR_data/Credit.csv', usecols=list(range(1,12)))
credit.head(3)
Encoding qualitative data¶
credit = pd.get_dummies(credit, columns = ['Gender', 'Student','Married','Ethnicity'],drop_first = True)
credit.head(3)
Best subset selection¶
To perform best subset selection, we fit a separate model for each possible combination of the $n$ predictors and then select the best subset. That is, we fit:
- All $\binom{n}{1} = n$ models that contain exactly one predictor
- All $\binom{n}{2}$ models that contain exactly two predictors
- And so on, until reaching the end point where all $n$ predictors are included in the model
This results in $2^n$ possibilities, since we are enumerating the power set of the predictors. In our case there are $2^{11} = 2048$ possible combinations.
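A quick sanity check on this count, using only the standard library: summing the binomial coefficients over all subset sizes recovers the power-set size.
from math import comb
n = 11
#Sum of "n choose k" over k = 0,...,n equals 2^n, the size of the power set
print(sum(comb(n, k) for k in range(n + 1)), 2**n)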
Algorithm¶
Let $\mathcal{M}_0$ denote the null model, which contains no predictors; this model simply predicts the sample mean for each observation
For $k = 1,2,...,n$
- Fit all $\binom{n}{k}$ models that contain exactly $k$ predictors
- Pick the best among these $\binom{n}{k}$ models, and call it $\mathcal{M}_k$. Here the best is defined as having the smallest RSS, or an equivalent measure
Select the single best model among $\mathcal{M}_0, \mathcal{M}_1,...,\mathcal{M}_n$ using cross-validated prediction error, $C_p$, BIC, adjusted $R^2$ or any other method.
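As a point of reference, the RSS of the null model $\mathcal{M}_0$ equals the total sum of squares, since it predicts the sample mean of Balance for every observation. A minimal check using the credit DataFrame loaded above:
#RSS of the null model M_0: predict the mean of Balance for every observation
null_RSS = ((credit.Balance - credit.Balance.mean()) ** 2).sum()
print(null_RSS)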
Helper function for fitting linear regression (Sklearn)¶
def fit_linear_reg(X,Y):
    #Fit linear regression model and return RSS and R squared values
    model_k = linear_model.LinearRegression(fit_intercept = True)
    model_k.fit(X,Y)
    RSS = mean_squared_error(Y,model_k.predict(X)) * len(Y)
    R_squared = model_k.score(X,Y)
    return RSS, R_squared
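A quick usage check of the helper on a single-predictor model; the choice of Income as the predictor here is just for illustration:
#Example: RSS and R squared for a model using only Income
RSS_income, R2_income = fit_linear_reg(credit[['Income']], credit.Balance)
print(round(RSS_income), round(R2_income, 3))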
Implementing Best subset selection (using itertools.combinations)¶
#Importing tqdm for the progress bar
from tqdm import tnrange, tqdm_notebook
#Initialization variables
Y = credit.Balance
X = credit.drop(columns = 'Balance', axis = 1)
k = 11
RSS_list, R_squared_list, feature_list = [],[], []
numb_features = []
#Looping over k = 1 to k = 11 features in X
for k in tnrange(1,len(X.columns) + 1, desc = 'Loop...'):
    #Looping over all possible combinations: from 11 choose k
    for combo in itertools.combinations(X.columns,k):
        tmp_result = fit_linear_reg(X[list(combo)],Y)   #Store temp result
        RSS_list.append(tmp_result[0])                  #Append lists
        R_squared_list.append(tmp_result[1])
        feature_list.append(combo)
        numb_features.append(len(combo))

#Store in DataFrame
df = pd.DataFrame({'numb_features': numb_features,'RSS': RSS_list, 'R_squared':R_squared_list,'features':feature_list})
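A quick shape check: the loop above fits every non-empty subset of the 11 predictors, so the DataFrame should contain $2^{11} - 1 = 2047$ rows (the null model is not fit here).
print(df.shape)   #Expected: (2047, 4)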
Finding the best subsets for each number of features¶
Using the smallest RSS value, or the largest R_squared value
df_min = df[df.groupby('numb_features')['RSS'].transform(min) == df['RSS']]
df_max = df[df.groupby('numb_features')['R_squared'].transform(max) == df['R_squared']]
display(df_min.head(3))
display(df_max.head(3))
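An equivalent way to pick out the best subset of each size, shown here only as an alternative sketch, is to index the DataFrame with the row labels returned by idxmin / idxmax, which yields exactly one row per group:
#Same selection via the index of the minimal RSS (or maximal R squared) per group
df_min_alt = df.loc[df.groupby('numb_features')['RSS'].idxmin()]
df_max_alt = df.loc[df.groupby('numb_features')['R_squared'].idxmax()]
display(df_min_alt.head(3))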
Adding columns to the dataframe with RSS and R squared values of the best subset¶
df['min_RSS'] = df.groupby('numb_features')['RSS'].transform(min)
df['max_R_squared'] = df.groupby('numb_features')['R_squared'].transform(max)
df.head()
Plotting the best subset selection process¶
fig = plt.figure(figsize = (16,6))
ax = fig.add_subplot(1, 2, 1)
ax.scatter(df.numb_features,df.RSS, alpha = .2, color = 'darkblue' )
ax.set_xlabel('# Features')
ax.set_ylabel('RSS')
ax.set_title('RSS - Best subset selection')
ax.plot(df.numb_features,df.min_RSS,color = 'r', label = 'Best subset')
ax.legend()
ax = fig.add_subplot(1, 2, 2)
ax.scatter(df.numb_features,df.R_squared, alpha = .2, color = 'darkblue' )
ax.plot(df.numb_features,df.max_R_squared,color = 'r', label = 'Best subset')
ax.set_xlabel('# Features')
ax.set_ylabel('R squared')
ax.set_title('R_squared - Best subset selection')
ax.legend()
plt.show()
Forward stepwise selection¶
For computational reasons, best subset selection cannot be applied for large $n$ because of its $2^n$ complexity. Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one at a time. At each step, the variable that gives the greatest additional improvement to the fit is added to the model.
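To see the computational savings: forward stepwise fits the null model plus $n - k$ models at each of the $n$ steps, i.e. $1 + n(n+1)/2$ models in total, compared with $2^n$ for best subset selection. A quick check for $n = 11$:
n = 11
print(2**n, 1 + n*(n + 1)//2)   #2048 models for best subset vs 67 for forward stepwise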
Algorithm¶
Let $\mathcal{M}_0$ denote the null model, which contains no predictors
- For $k = 0,1,...,n-1$:
    - Consider all $n - k$ models that augment the predictors in $\mathcal{M}_k$ with one additional predictor
    - Choose the *best* among these $n - k$ models, and call it $\mathcal{M}_{k+1}$
- Select the single best model among $\mathcal{M}_0, \mathcal{M}_1,...,\mathcal{M}_n$ using cross-validated prediction error, $C_p$, BIC, adjusted $R^2$ or any other method.
#Initialization variables
Y = credit.Balance
X = credit.drop(columns = 'Balance', axis = 1)
k = 11
remaining_features = list(X.columns.values)
features = []
RSS_list, R_squared_list = [np.inf], [np.inf] #Due to 1 indexing of the loop...
features_list = dict()
for i in range(1,k+1):
    best_RSS = np.inf

    for combo in itertools.combinations(remaining_features,1):
        RSS = fit_linear_reg(X[list(combo) + features],Y)   #Store temp result
        if RSS[0] < best_RSS:
            best_RSS = RSS[0]
            best_R_squared = RSS[1]
            best_feature = combo[0]

    #Updating variables for next loop
    features.append(best_feature)
    remaining_features.remove(best_feature)

    #Saving values for plotting
    RSS_list.append(best_RSS)
    R_squared_list.append(best_R_squared)
    features_list[i] = features.copy()
Displaying results of the first 4 steps¶
print('Forward stepwise subset selection')
print('Number of features |', 'Features |', 'RSS')
display([(i,features_list[i], round(RSS_list[i])) for i in range(1,5)])
Comparing models: Mallows' $C_p$, AIC, BIC, adjusted $R^2$¶
The training set Mean Squared Error (MSE) is generally an underestimate of the test MSE. This is because when we fit a model to the training data using least squares, we estimate the regression coefficients specifically so that the training RSS is minimized. In particular, the training RSS decreases as we add more features to the model, but the test error may not. Therefore the training RSS and $R^2$ cannot be used to select the best model unless we adjust for this underestimation.
Mallows' $C_p$¶
Mallows' $C_p$ is named after Colin Lingwood Mallows and is defined as:
$$ C_p = \frac{1}{m} (RSS + 2d\hat\sigma^2)$$
where $m$ is the number of observations, $d$ is the number of predictors in the model, and $\hat\sigma^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement. Typically $\hat\sigma^2$ is estimated using the full model containing all predictors.
Essentially, the $C_p$ statistic adds a penalty of $2d\hat\sigma^2$ to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error. Clearly, the penalty increases as the number of predictors in the model increases, and this is intended to adjust for the corresponding decrease in training RSS. It can be shown that if $\hat\sigma^2$ is an unbiased estimate of $\sigma^2$, then $C_p$ is an unbiased estimate of the test MSE, so we choose the model with the smallest $C_p$.
Akaike's Information Criteria (AIC)¶
The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of a linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and the AIC is given by
$$ AIC = \frac{1}{m\hat \sigma^2} (RSS + 2d\hat\sigma^2)$$
Bayesian Information Criteria (BIC)¶
BIC is derived from a Bayesian point of view, and looks similar to the $C_p$ and $AIC$ - it is defined (up to irrelevant constants) as:
$$ BIC = \frac{1}{m\hat \sigma^2} (RSS + \log(m) d \hat\sigma^2)$$
Like $C_p$ and AIC, the BIC will tend to take small values for a model with low test error. Since $\log(m) > 2$ for any $m > 7$, the BIC statistic generally places a heavier penalty than $C_p$ and AIC on models with many variables.
Adjusted $R^2$¶
Since $R^2$ always increases as more variables are added, the adjusted $R^2$ accounts for that fact by introducing a penalty. The intuition is that once all the correct variables have been included in the model, additional noise variables will lead to only a very small decrease in RSS but an increase in $k$, and hence will decrease the adjusted $R^2$. In effect, we pay a price for the inclusion of unnecessary variables in the model.
$$ R_{a}^2 = 1 - \frac{RSS / (m - k -1)}{TSS / (m - 1)} = 1 - (1 - R^2) \frac{m - 1}{m - k - 1} $$
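The two expressions above are algebraically identical, since $R^2 = 1 - RSS/TSS$. A quick numerical check with arbitrary illustrative values (the numbers below are made up, purely to verify the identity):
#Arbitrary illustrative values: m_ex observations, k_ex predictors
rss_ex, tss_ex, m_ex, k_ex = 400.0, 1000.0, 50, 4
r2_ex = 1 - rss_ex/tss_ex
adj_1 = 1 - (rss_ex/(m_ex - k_ex - 1)) / (tss_ex/(m_ex - 1))
adj_2 = 1 - (1 - r2_ex) * (m_ex - 1)/(m_ex - k_ex - 1)
print(round(adj_1, 6), round(adj_2, 6))   #Both forms agree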
Theoretical justification¶
$C_p$, AIC and BIC all have rigorous theoretical justifications that rely on asymptotic arguments, i.e. on the sample size $m$ growing very large, whereas the adjusted $R^2$, although quite intuitive, is not as well motivated in statistical theory.
Estimation of $\hat\sigma^2$¶
Using the RSS of the full model with $p$ features (i.e. the smallest RSS), we estimate $\hat\sigma^2$ as:
$$ \hat\sigma^2 = \frac{RSS}{m - p -1} $$
Combining forward stepwise results into a new DataFrame¶
df1 = pd.concat([pd.DataFrame({'features':features_list}),pd.DataFrame({'RSS':RSS_list, 'R_squared': R_squared_list})], axis=1, join='inner')
df1['numb_features'] = df1.index
Computing $C_p$, AIC, BIC and adjusted $R^2$¶
#Initializing useful variables
m = len(Y)
p = 11
hat_sigma_squared = (1/(m - p -1)) * min(df1['RSS'])
#Computing
df1['C_p'] = (1/m) * (df1['RSS'] + 2 * df1['numb_features'] * hat_sigma_squared )
df1['AIC'] = (1/(m*hat_sigma_squared)) * (df1['RSS'] + 2 * df1['numb_features'] * hat_sigma_squared )
df1['BIC'] = (1/(m*hat_sigma_squared)) * (df1['RSS'] + np.log(m) * df1['numb_features'] * hat_sigma_squared )
df1['R_squared_adj'] = 1 - ( (1 - df1['R_squared'])*(m-1)/(m-df1['numb_features'] -1))
df1
df1['R_squared_adj'].idxmax()
df1['R_squared_adj'].max()
Plotting the computed values as a function of number of features¶
variables = ['C_p', 'AIC','BIC','R_squared_adj']
fig = plt.figure(figsize = (18,6))
for i,v in enumerate(variables):
ax = fig.add_subplot(1, 4, i+1)
ax.plot(df1['numb_features'],df1[v], color = 'lightblue')
ax.scatter(df1['numb_features'],df1[v], color = 'darkblue')
if v == 'R_squared_adj':
ax.plot(df1[v].idxmax(),df1[v].max(), marker = 'x', markersize = 20)
else:
ax.plot(df1[v].idxmin(),df1[v].min(), marker = 'x', markersize = 20)
ax.set_xlabel('Number of predictors')
ax.set_ylabel(v)
fig.suptitle('Subset selection using C_p, AIC, BIC, Adjusted R2', fontsize = 16)
plt.show()
sns.pairplot(credit[['Balance','Age','Cards','Education','Income','Limit','Rating']]);
Statsmodels linear regression¶
Least squares coefficient estimates associated with the regression of Balance onto Ethnicity in the Credit data set, where Ethnicity is encoded via two dummy variables. The linear model is given in (3.30) of ISLR.
#Setting up the X and Y variables, adding constant term for intercept
Y = credit.Balance
X = credit[['Ethnicity_Asian','Ethnicity_Caucasian']]
X = sm.add_constant(X)
X.head()
Table 3.8 - page 86¶
model_1 = sm.OLS(Y, X)
result_1 = model_1.fit()
result_1.summary()