Subset selection in Python¶
This notebook explores common methods for performing subset selection on a regression model, namely
- Best subset selection
- Forward stepwise selection
- Criteria for choosing the optimal model
  - $C_p$, AIC, BIC, adjusted $R^2$
The figures, formulas and explanations are taken from the book "Introduction to Statistical Learning" (ISLR), Chapter 6, and have been adapted to Python.
Sources:¶
- Introduction to Statistical Learning (ISLR) - Chapter 6
- https://github.com/JWarmenhoven/ISLR-python
- http://www.science.smith.edu/~jcrouser/SDS293/labs/lab8-py.html
Libraries¶
import itertools
import time
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
%matplotlib inline
plt.style.use('ggplot')
credit = pd.read_csv('/Users/User/Desktop/Data/Datasets/ISLR_data/Credit.csv', usecols=list(range(1,12)))
credit.head(3)
Encoding qualitative data¶
credit = pd.get_dummies(credit, columns = ['Gender', 'Student','Married','Ethnicity'],drop_first = True)
credit.head(3)
Best subset selection¶
To perform best subset selection, we fit a separate model for each possible combination of the $n$ predictors and then select the best subset. That is, we fit:
- All models that contain exactly one predictor
- All $\binom{n}{2}$ models that contain exactly two predictors at the second step
- And so on, until reaching the end point where all $n$ predictors are included in the model
This results in $2^n$ possibilities, since we are enumerating the power set of the predictors. In our case there are $2^{11} = 2048$ possible combinations.
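As a quick sanity check on this count (a minimal sketch using only the standard library; the variable names are illustrative), summing the binomial coefficients $\binom{11}{k}$ over $k = 0, 1, \ldots, 11$, i.e. including the empty null model, indeed gives $2^{11}$:
#Quick check: the number of possible subsets of 11 predictors is 2**11
from math import comb
n = 11
total = sum(comb(n, k) for k in range(n + 1)) #Sum of binomial coefficients, empty set included
print(total, 2 ** n) #Both print 2048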
Algorithm¶
Let $M_0$ denote the null model, which contains no predictors; this model simply predicts the sample mean for each observation.
For $k = 1, 2, \ldots, n$:
- Fit all $\binom{n}{k}$ models that contain exactly $k$ predictors
- Pick the best among these $\binom{n}{k}$ models and call it $M_k$. Here the best is defined as having the smallest RSS, or an equivalent measure
Select the single best model among $M_0, M_1, \ldots, M_n$ using cross-validated prediction error, $C_p$, BIC, adjusted $R^2$ or any other method.
Helper function for fitting linear regression (Sklearn)¶
def fit_linear_reg(X,Y):
    #Fit linear regression model and return RSS and R squared values
    model_k = linear_model.LinearRegression(fit_intercept = True)
    model_k.fit(X,Y)
    RSS = mean_squared_error(Y,model_k.predict(X)) * len(Y)
    R_squared = model_k.score(X,Y)
    return RSS, R_squared
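As a quick usage check of the helper (a minimal sketch; it assumes the credit DataFrame loaded above and uses two of its columns, Income and Limit, purely for illustration):
#Example: RSS and R squared for a model using only Income and Limit
rss_ex, r2_ex = fit_linear_reg(credit[['Income','Limit']], credit.Balance)
print('RSS:', round(rss_ex), 'R_squared:', round(r2_ex, 3))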
Implementing Best subset selection (using itertools.combinations)¶
#Importing tqdm for the progress bar
from tqdm import tnrange, tqdm_notebook
#Initialization variables
Y = credit.Balance
X = credit.drop(columns = 'Balance', axis = 1)
k = 11
RSS_list, R_squared_list, feature_list = [],[], []
numb_features = []
#Looping over k = 1 to k = 11 features in X
for k in tnrange(1,len(X.columns) + 1, desc = 'Loop...'):

    #Looping over all possible combinations: from 11 choose k
    for combo in itertools.combinations(X.columns,k):
        tmp_result = fit_linear_reg(X[list(combo)],Y)   #Store temp result
        RSS_list.append(tmp_result[0])                  #Append lists
        R_squared_list.append(tmp_result[1])
        feature_list.append(combo)
        numb_features.append(len(combo))
#Store in DataFrame
df = pd.DataFrame({'numb_features': numb_features,'RSS': RSS_list, 'R_squared':R_squared_list,'features':feature_list})
Finding the best subsets for each number of features¶
Using the smallest RSS value, or the largest R_squared value
df_min = df[df.groupby('numb_features')['RSS'].transform(min) == df['RSS']]
df_max = df[df.groupby('numb_features')['R_squared'].transform(max) == df['R_squared']]
display(df_min.head(3))
display(df_max.head(3))
Adding columns to the dataframe with RSS and R squared values of the best subset¶
df['min_RSS'] = df.groupby('numb_features')['RSS'].transform(min)
df['max_R_squared'] = df.groupby('numb_features')['R_squared'].transform(max)
df.head()
Plotting the best subset selection process¶
fig = plt.figure(figsize = (16,6))
ax = fig.add_subplot(1, 2, 1)
ax.scatter(df.numb_features,df.RSS, alpha = .2, color = 'darkblue' )
ax.set_xlabel('# Features')
ax.set_ylabel('RSS')
ax.set_title('RSS - Best subset selection')
ax.plot(df.numb_features,df.min_RSS,color = 'r', label = 'Best subset')
ax.legend()
ax = fig.add_subplot(1, 2, 2)
ax.scatter(df.numb_features,df.R_squared, alpha = .2, color = 'darkblue' )
ax.plot(df.numb_features,df.max_R_squared,color = 'r', label = 'Best subset')
ax.set_xlabel('# Features')
ax.set_ylabel('R squared')
ax.set_title('R_squared - Best subset selection')
ax.legend()
plt.show()
Forward stepwise selection¶
For computational reasons, best subset selection cannot be applied for large $n$ due to its $2^n$ complexity. Forward stepwise selection begins with a model containing no predictors and then adds predictors to the model, one at a time. At each step, the variable that gives the greatest additional improvement to the fit is added to the model.
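The difference in cost is easy to quantify: forward stepwise fits the null model plus $n - k$ candidate models at each step, i.e. $1 + \sum_{k=0}^{n-1}(n-k) = 1 + n(n+1)/2$ models in total, instead of $2^n$. A small sketch comparing the two counts for $n = 11$ (variable names are illustrative):
#Number of models fitted: best subset vs forward stepwise selection
n = 11
best_subset_models = 2 ** n #All possible subsets
forward_stepwise_models = 1 + n * (n + 1) // 2 #Null model + n + (n-1) + ... + 1 candidate fits
print(best_subset_models, forward_stepwise_models) #2048 vs 67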
Algorithm¶
Let $M_0$ denote the null model, which contains no predictors
- For $k = 0, 1, \ldots, n-1$:
  - Consider all $n-k$ models that augment the predictors in $M_k$ with one additional predictor
  - Choose the *best* among these $n-k$ models, and call it $M_{k+1}$
- Select the single best model among $M_0, M_1, \ldots, M_n$ using cross-validated prediction error, $C_p$, BIC, adjusted $R^2$ or any other method.
#Initialization variables
Y = credit.Balance
X = credit.drop(columns = 'Balance', axis = 1)
k = 11
remaining_features = list(X.columns.values)
features = []
RSS_list, R_squared_list = [np.inf], [np.inf] #Due to 1 indexing of the loop...
features_list = dict()
for i in range(1,k+1):
    best_RSS = np.inf

    for combo in itertools.combinations(remaining_features,1):
        RSS = fit_linear_reg(X[list(combo) + features],Y)   #Store temp result
        if RSS[0] < best_RSS:
            best_RSS = RSS[0]
            best_R_squared = RSS[1]
            best_feature = combo[0]

    #Updating variables for next loop
    features.append(best_feature)
    remaining_features.remove(best_feature)

    #Saving values for plotting
    RSS_list.append(best_RSS)
    R_squared_list.append(best_R_squared)
    features_list[i] = features.copy()
Displaying results of the first 4 steps¶
print('Forward stepwise subset selection')
print('Number of features |', 'Features |', 'RSS')
display([(i,features_list[i], round(RSS_list[i])) for i in range(1,5)])
Comparing models: AIC, BIC, Mallows' $C_p$¶
The training set Mean Squared Error (MSE) is generally an underestimate of the test MSE. This is because when we fit a model to the training data using least squares, we estimate the regression coefficients specifically so that the training RSS is minimized. In particular, the training RSS decreases as we add more features to the model, but the test error may not. Therefore the training RSS and $R^2$ should not be used for selecting the best model unless we adjust for this underestimation.
Mallows' $C_p$¶
Mallows' $C_p$ is named after Colin Lingwood Mallows and is defined as:
$$C_p = \frac{1}{m}\left(RSS + 2d\hat{\sigma}^2\right)$$
where $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement. Typically $\hat{\sigma}^2$ is estimated using the full model containing all predictors.
Essentially, the $C_p$ statistic adds a penalty of $2d\hat{\sigma}^2$ to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error. Clearly, the penalty increases as the number of predictors $d$ in the model increases; this is intended to adjust for the corresponding decrease in training RSS. It can be shown that if $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$, then $C_p$ is an unbiased estimate of the test MSE, so we choose the model with the smallest $C_p$.
Akaike's Information Criterion (AIC)¶
The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of a linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and the AIC is given by
$$AIC = \frac{1}{m\hat{\sigma}^2}\left(RSS + 2d\hat{\sigma}^2\right)$$
Bayesian Information Criterion (BIC)¶
BIC is derived from a Bayesian point of view and looks similar to $C_p$ and AIC; it is defined (up to irrelevant constants) as:
$$BIC = \frac{1}{m\hat{\sigma}^2}\left(RSS + \log(m)\,d\hat{\sigma}^2\right)$$
Like $C_p$ and AIC, the BIC will tend to take small values for a model with low test error. Note that since $\log(m) > 2$ for any $m > 7$, the BIC places a heavier penalty than $C_p$ on models with many variables.
Adjusted $R^2$¶
Since the $R^2$ always increases as more variables are added, the adjusted $R^2$ accounts for that fact by introducing a penalty. The intuition is that once all the correct variables have been included in the model, additional noise variables will lead to only a very small decrease in RSS but an increase in $k$, and hence will decrease the adjusted $R^2$. In effect, we pay a price for the inclusion of unnecessary variables in the model.
$$R^2_{adj} = 1 - \frac{RSS/(m-k-1)}{TSS/(m-1)} = 1 - (1 - R^2)\,\frac{m-1}{m-k-1}$$
Theoretical justification¶
$C_p$, AIC and BIC all have rigorous theoretical justifications that rely on asymptotic arguments, i.e. on the sample size $m$ growing very large, whereas the adjusted $R^2$, although quite intuitive, is not as well motivated in statistical theory.
Estimation of $\hat{\sigma}^2$¶
Using the RSS of the full model with all $p$ features (i.e. the smallest RSS), we estimate $\hat{\sigma}^2$ as:
$$\hat{\sigma}^2 = \frac{RSS}{m - p - 1}$$
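Before applying these criteria to the forward stepwise results below, here is a minimal sketch of the four formulas as plain Python functions (the names and arguments d for the number of predictors, m for the sample size and sigma2 for $\hat{\sigma}^2$ are illustrative, not part of the original notebook); the next cells compute the same quantities in vectorized form directly on the DataFrame:
#Model selection criteria as defined above (d = number of predictors, m = sample size)
def mallows_cp(rss, d, m, sigma2):
    return (rss + 2 * d * sigma2) / m

def aic(rss, d, m, sigma2):
    return (rss + 2 * d * sigma2) / (m * sigma2)

def bic(rss, d, m, sigma2):
    return (rss + np.log(m) * d * sigma2) / (m * sigma2)

def r_squared_adj(r_squared, d, m):
    return 1 - (1 - r_squared) * (m - 1) / (m - d - 1)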
Combining forward stepwise results into a new DataFrame¶
df1 = pd.concat([pd.DataFrame({'features':features_list}),pd.DataFrame({'RSS':RSS_list, 'R_squared': R_squared_list})], axis=1, join='inner')
df1['numb_features'] = df1.index
Computing $C_p$, AIC, BIC and adjusted $R^2$¶
#Initializing useful variables
m = len(Y)
p = 11
hat_sigma_squared = (1/(m - p -1)) * min(df1['RSS'])
#Computing
df1['C_p'] = (1/m) * (df1['RSS'] + 2 * df1['numb_features'] * hat_sigma_squared )
df1['AIC'] = (1/(m*hat_sigma_squared)) * (df1['RSS'] + 2 * df1['numb_features'] * hat_sigma_squared )
df1['BIC'] = (1/(m*hat_sigma_squared)) * (df1['RSS'] + np.log(m) * df1['numb_features'] * hat_sigma_squared )
df1['R_squared_adj'] = 1 - ( (1 - df1['R_squared'])*(m-1)/(m-df1['numb_features'] -1))
df1
df1['R_squared_adj'].idxmax()
df1['R_squared_adj'].max()
Plotting the computed values as a function of number of features¶
variables = ['C_p', 'AIC','BIC','R_squared_adj']
fig = plt.figure(figsize = (18,6))
for i,v in enumerate(variables):
    ax = fig.add_subplot(1, 4, i+1)
    ax.plot(df1['numb_features'],df1[v], color = 'lightblue')
    ax.scatter(df1['numb_features'],df1[v], color = 'darkblue')
    if v == 'R_squared_adj':
        ax.plot(df1[v].idxmax(),df1[v].max(), marker = 'x', markersize = 20)
    else:
        ax.plot(df1[v].idxmin(),df1[v].min(), marker = 'x', markersize = 20)
    ax.set_xlabel('Number of predictors')
    ax.set_ylabel(v)
fig.suptitle('Subset selection using C_p, AIC, BIC, Adjusted R2', fontsize = 16)
plt.show()
sns.pairplot(credit[['Balance','Age','Cards','Education','Income','Limit','Rating']]);
Statsmodels linear regression¶
Least squares coefficient estimates associated with the regression of Balance onto Ethnicity in the Credit data set. The linear model is given in equation (3.30) of ISLR; that is, Ethnicity is encoded via two dummy variables.
#Setting up the X and Y variables, adding constant term for intercept
Y = credit.Balance
X = credit[['Ethnicity_Asian','Ethnicity_Caucasian']]
X = sm.add_constant(X)
X.head()
Table 3.8 - page 86¶
model_1 = sm.OLS(Y, X)
result_1 = model_1.fit()
result_1.summary()