ISL Chapter 8 - Tree-based models¶
This notebook explores chapter 8 of the book "Introduction to Statistical Learning" and aims to reproduce several of the key figures and discussion topics.
Of particular interest are the graphviz library, used to visualize the resulting trees, and GridSearchCV from scikit-learn, used to plot the validation curves.
Libraries¶
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('white')
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import tree
import graphviz
Basics of decision trees - Hitters data¶
We use the Hitters data set to predict a baseball player’s Salary based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year).
We first remove observations that are missing Salary values, and log-transform Salary so that its distribution has more of a typical bell-shape. (Recall that Salary is measured in thousands of dollars.)
hitters = pd.read_csv('/Users/User/Desktop/Data/Datasets/ISLR_data/Hitters.csv')
hitters.dropna(axis=0, how='any', subset=['Salary'], inplace=True)
hitters.head()
Preamble: Exploring and transforming the Hitters data set¶
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(func = np.log, inverse_func= np.exp)
y = transformer.transform(hitters.Salary.values.reshape(-1,1))
X = hitters[['Years','Hits']]
fig = plt.figure(figsize = (14,6))
ax = plt.subplot(121)
ax = sns.distplot(y)
ax.set_title('Histogram (dist) plot of log salaries')
ax = plt.subplot(122)
ax.set_xlabel('Years')
ax.set_ylabel('Hits')
ax.set_title('Scatter plot - colors are log salary')
ax2 = ax.scatter(hitters.Years, hitters.Hits, alpha = .4, c = y.flatten(), cmap = 'seismic')
plt.colorbar(ax2)
plt.show()
Train simple regression tree¶
from sklearn.tree import DecisionTreeRegressor
simple_tree = DecisionTreeRegressor(random_state = 0, max_depth = 2, max_leaf_nodes=3)
simple_tree.fit(X, y)
Figure 8.1 A simple regression tree for the Hitters data¶
from sklearn import tree
import graphviz
dot_data = tree.export_graphviz(simple_tree, out_file=None, feature_names = ['Years','Hits'], class_names = ['Salary'])
graph = graphviz.Source(dot_data)
graph
Figure 8.2 The three-region partition boundaries for the Hitters data set¶
#Scatter plot
log_preds = np.round(simple_tree.predict(X),2).flatten()
color_dict = {'R1': '#440154FF', 'R2': '#29AF7FFF', 'R3': '#FDE725FF'}
plt.figure(figsize = (8,8))
plt.scatter(X.Years, X.Hits, c = log_preds, cmap = 'viridis')
#Plot decision boundary
N = 100
X1 = np.linspace(0, 25, N)
X2 = np.linspace(0, 250, N)
X1, X2 = np.meshgrid(X1, X2)
grid_points = np.c_[np.ravel(X1), np.ravel(X2)]   #all (Years, Hits) pairs on the grid
Z = simple_tree.predict(grid_points).reshape(X1.shape)
plt.contourf( X1, X2, Z, 3, alpha = .1, cmap = 'viridis')
plt.contour( X1, X2, Z, 3, alpha = 1, cmap = 'viridis')
# Legend
markers = [plt.Line2D([0,0],[0,0],color=color, marker='o', linestyle='') for color in color_dict.values()]
plt.legend(markers, color_dict.keys(), numpoints=1)
plt.xlabel('Years')
plt.ylabel('Hits')
plt.title('Three region partition - simple regression tree')
plt.show()
Predictions in the original space¶
preds = transformer.inverse_transform(log_preds.reshape(1,-1)) * 1000   #back from log scale, then from thousands of dollars to dollars
An overfit tree¶
#Fitting the tree
max_d = 15
overfit_tree = DecisionTreeRegressor(random_state = 0, max_depth = max_d)
overfit_tree.fit(X, y)
#Scatter plot
log_preds = np.round(overfit_tree.predict(X),2).flatten()
plt.figure(figsize = (8,8))
plt.scatter(X.Years, X.Hits, c = log_preds, cmap = 'viridis')
#Plot decision boundary
N = 100
X1 = np.linspace(0, 25, N)
X2 = np.linspace(0, 250, N)
X1, X2 = np.meshgrid(X1, X2)
grid_points = np.c_[np.ravel(X1), np.ravel(X2)]   #all (Years, Hits) pairs on the grid
Z = overfit_tree.predict(grid_points).reshape(X1.shape)
plt.contourf( X1, X2, Z, max_d, alpha = .1, cmap = 'viridis')
plt.contour( X1, X2, Z, max_d, alpha = 1, cmap = 'viridis')
# Legend
plt.xlabel('Years')
plt.ylabel('Hits')
plt.title('Partitions - over fit regression tree')
plt.show()
Reducing overfitting¶
Decision trees tend to overfit as their depth increases: every additional split partitions the data further, leaving fewer and fewer observations per region, a manifestation of the curse of dimensionality. There are two main approaches to avoid such overfitting:
1) early stopping 2) pruning
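Pruning is not explored further in this notebook, but scikit-learn (0.22+) exposes cost-complexity pruning through the ccp_alpha parameter of DecisionTreeRegressor. The sketch below is only illustrative and reuses the X and y defined above: it recovers the candidate alphas with cost_complexity_pruning_path and shows how the tree shrinks as alpha grows.
#Cost-complexity pruning: grow a full tree, then prune it back by penalising the number of leaves
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y.ravel())
#One tree per candidate alpha; larger alphas give smaller trees
pruned_trees = [DecisionTreeRegressor(random_state=0, ccp_alpha=a).fit(X, y.ravel())
                for a in path.ccp_alphas]
plt.plot(path.ccp_alphas, [t.get_n_leaves() for t in pruned_trees])
plt.xlabel('ccp_alpha'), plt.ylabel('Number of leaves')
plt.title('Tree size vs pruning parameter')
plt.show()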
Early stopping¶
The idea is to stop growing the tree before it becomes too complex. There are three main criteria:
- Limit the depth of the tree, for example by choosing the depth with the lowest validation error.
- Stop splitting when the improvement in an error metric (e.g. classification error or MSE) falls below a tolerance level.
- Stop splitting when the number of data points in a node becomes too small. In practice this constraint should always be combined with the other criteria (see the sketch after this list).
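The third criterion is not tuned elsewhere in this notebook; as a rough sketch (the grid of leaf sizes below is arbitrary), the cross-validation tools imported in the Libraries section can be used to tune min_samples_leaf, the minimum number of observations a leaf is allowed to contain, on the same X and y.
#Sketch: validation curve for the minimum number of samples per leaf
min_leaf_sizes = np.arange(1, 21)
cv_mse = [-cross_val_score(DecisionTreeRegressor(random_state=0, min_samples_leaf=m),
                           X, y.ravel(), cv=5, scoring='neg_mean_squared_error').mean()
          for m in min_leaf_sizes]
plt.plot(min_leaf_sizes, cv_mse)
plt.xlabel('min_samples_leaf'), plt.ylabel('Avg Cross Val MSE')
plt.title('Validation curve - minimum samples per leaf')
plt.show()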
Limiting the depth of the tree by looking at the validation curve¶
i.e. finding the max depth for which the cross-validated MSE is smallest
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
#Split train + validation and test data
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
#Creating pipeline
pipe = Pipeline([('functiontransformer',
                  FunctionTransformer(func=np.log, inverse_func=np.exp)),   #log-transforms the features
                 ('decisiontreeregressor',
                  DecisionTreeRegressor())])
#Setup of parameters and gridsearch
depths = np.arange(1,11)
param = {'decisiontreeregressor__max_depth': depths}
grid_search = GridSearchCV(estimator = pipe, param_grid = param ,cv = 3, scoring = 'neg_mean_squared_error', return_train_score=True )
#Grid search
grid_search.fit(X_train,y_train)
#Plot
plt.figure(figsize = (7,5))
plt.errorbar(depths,- grid_search.cv_results_['mean_train_score'], yerr=grid_search.cv_results_['std_train_score'], capsize=10,label = 'Train MSE')
plt.errorbar(depths,- grid_search.cv_results_['mean_test_score'], yerr=grid_search.cv_results_['std_test_score'], capsize=3, label = 'Avg Cross Val MSE')
plt.xlabel('Max depth'), plt.ylabel('MSE'), plt.title('Validation curve')
plt.legend(), plt.show()
Retrain the model on all train data and check MSE on test data¶
pipe.set_params(decisiontreeregressor__max_depth = 4).fit(X_train, y_train)
mean_squared_error(y_test,pipe.predict(X_test))
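The max_depth of 4 is read off the validation curve by eye. Since GridSearchCV refits the best estimator on the whole training set by default (refit=True), the same check can be done without hard-coding the depth:
#Best depth found by the grid search, and test MSE of the refit pipeline
print(grid_search.best_params_)
mean_squared_error(y_test, grid_search.best_estimator_.predict(X_test))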
Limiting the complexity via the number of leaves¶
Depth is not the ideal measure of complexity, since trees of the same depth can contain very different numbers of splits; the number of leaves measures model complexity more directly.
#Setup of parameters and gridsearch
leaves = np.arange(2,20)
param = {'decisiontreeregressor__max_leaf_nodes': leaves}
grid_search = GridSearchCV(estimator = pipe, param_grid = param ,cv = 3, scoring = 'neg_mean_squared_error', return_train_score=True )
#Grid search
grid_search.fit(X_train,y_train)
#Plot
plt.figure(figsize = (7,5))
plt.errorbar(leaves,- grid_search.cv_results_['mean_train_score'],yerr =grid_search.cv_results_['std_train_score'], label = 'Train MSE', errorevery = 3)
plt.errorbar(leaves,- grid_search.cv_results_['mean_test_score'], yerr = grid_search.cv_results_['std_test_score'],label = 'Avg Cross Val MSE', errorevery = 3)
plt.xlabel('Max leaves'), plt.ylabel('MSE'), plt.title('Validation curve')
plt.legend(), plt.show()