Implementing AdaBoost using decision tree stump weak classifiers¶
This notebook explores the well-known AdaBoost M1 algorithm, which combines several weak classifiers to create a better overall classifier. The notebook consists of three main sections:
- A review of the AdaBoost M1 algorithm and an intuitive visualization of its inner workings
- An implementation from scratch in Python, using a scikit-learn decision tree stump as the weak classifier
- A discussion of the trade-off between the learning rate and the number of weak classifiers
This notebook was used as a basis for the following answers on stats.stackexchange.com:
- https://stats.stackexchange.com/questions/82323/shrinkage-parameter-in-adaboost/355632#355632
- https://stats.stackexchange.com/questions/164233/intuitive-explanations-of-differences-between-gradient-boosting-trees-gbm-ad/355618#355618
- https://stats.stackexchange.com/questions/274563/adaboost-best-weak-learner-with-0-5-error/355604#355604
This implementation relies on a simple decision tree stump with maximum depth = 1 and 2 leaf nodes. Of particular interest is the plot of the decision boundaries of the different weak learners inside the AdaBoost combination, together with their respective sample weights.
An intuitive explanation of the AdaBoost algorithm¶
AdaBoost recap¶
Let $G_m(x), \ m = 1,2,...,M$ be the sequence of weak classifiers. Our objective is to build the following:
$$G(x) = \text{sign} \left( \alpha_1 G_1(x) + \alpha_2 G_2(x) + ... \alpha_M G_M(x)\right) = \text{sign} \left( \sum_{m = 1}^M \alpha_m G_m(x)\right)$$
The final prediction is a combination of the predictions from all classifiers through a weighted majority vote.
The coefficients $\alpha_m$ are computed by the boosting algorithm, and weight the contribution of each respective $G_m(x)$. The effect is to give higher influence to the more accurate classifiers in the sequence.
- At each boosting step, the data is modified by applying weights $w_1, w_2,...,w_N$ to each training observation. At step $m$ the observations that were misclassified previously have their weights increased
- Note that at the first step $m=1$ the weights are initialized uniformly $w_i = 1 / N$
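To make the weighted majority vote concrete, here is a tiny numeric illustration (the coefficients and votes below are made-up values for a single point, not taken from the fit later in the notebook):
import numpy as np

# Hypothetical coefficients alpha_m and weak learner predictions G_m(x) for one point
alphas = np.array([1.0, 0.6, 0.4])
votes = np.array([+1, -1, +1])

# Weighted majority vote: sign(1.0*(+1) + 0.6*(-1) + 0.4*(+1)) = sign(0.8) = +1
print(np.sign(np.sum(alphas * votes)))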
The algorithm, as summarized on page 339 of Elements of Statistical Learning¶
- Initialize the observation weights $w_i = 1/N$
For $m = 1,2,...,M$
- Compute the weighted error $\text{err}_m = \frac{\sum_{i=1}^N w_i \mathcal{I}(y^{(i)} \neq G_m(x^{(i)}))}{\sum_{i=1}^N w_i}$
- Compute the coefficient $\alpha_m = \log \left( \frac{1 - \text{err}_m}{\text{err}_m}\right)$
- Set data weights $w_i \leftarrow w_i \exp[ \alpha_m \ \mathcal{I}(y^{(i)} \neq G_m(x^{(i)}))]$
Output $G(x) = \text{sign} \left[ \sum_{m=1}^M \alpha_m G_m(x)\right]$
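The loop above maps almost line for line to code. Here is a minimal NumPy sketch of a single boosting step, assuming the labels are $\pm 1$ and that G_m is some already fitted weak learner exposed as a callable returning $\pm 1$ predictions (the full from-scratch implementation, using scikit-learn stumps, follows later in the notebook):
import numpy as np

def boosting_step(G_m, X, y, w):
    '''One AdaBoost.M1 step: weighted error, coefficient alpha_m and updated sample weights'''
    miss = (G_m(X) != y)                   # indicator I(y_i != G_m(x_i)) as a boolean array
    err_m = np.sum(w * miss) / np.sum(w)   # weighted error
    alpha_m = np.log((1 - err_m) / err_m)  # estimator coefficient
    w = w * np.exp(alpha_m * miss)         # increase the weights of the misclassified points
    return alpha_m, w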
AdaBoost on a toy example¶
Consider the toy data set below, on which I have applied AdaBoost with the following settings: number of iterations $M = 10$, weak classifier = decision tree of depth 1 with 2 leaf nodes. The boundary between red and blue data points is clearly non-linear, yet the algorithm does pretty well.
Visualizing the sequence of weak learners and the sample weights¶
The first 6 weak learners $m = 1,2,...,6$ are shown below. The scatter points are scaled according to their respective sample weights at each iteration.
First iteration:
- The decision boundary is very simple (linear) since these are weak learners
- All points are of the same size, as expected
- 6 blue points are in the red region and are misclassified
Second iteration:
- The linear decision boundary has changed
- The previously misclassified blue points are now larger (greater sample_weight) and have influenced the decision boundary
- 9 blue points are now misclassified
Final result after 10 iterations:
All classifiers have a linear decision boundary, at different positions. The resulting $\alpha_m$ coefficients of the first 6 iterations are:
([1.041, 0.875, 0.837, 0.781, 1.04 , 0.938...
As expected, the first iteration has the largest coefficient, as it is the one with the fewest misclassifications.
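As a quick sanity check (my own back-of-the-envelope calculation, assuming uniform initial weights over the $N = 23$ toy points): with 6 points misclassified at the first iteration, the formula for $\alpha_m$ gives
import numpy as np

err_1 = 6 / 23                          # 6 of the 23 points are misclassified at m = 1
alpha_1 = np.log((1 - err_1) / err_1)   # = log(17/6) ≈ 1.041, matching the first printed coefficient
print(round(alpha_1, 3))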
Next steps¶
An intuitive explanation of gradient boosting - to be completed
Implementing AdaBoost M1 from scratch¶
Libraries¶
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np
import seaborn as sns
sns.set_style('white')
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
Dataset and Utility plotting function¶
#Toy Dataset
x1 = np.array([.1,.2,.4,.8, .8, .05,.08,.12,.33,.55,.66,.77,.88,.2,.3,.4,.5,.6,.25,.3,.5,.7,.6])
x2 = np.array([.2,.65,.7,.6, .3,.1,.4,.66,.77,.65,.68,.55,.44,.1,.3,.4,.3,.15,.15,.5,.55,.2,.4])
y = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1])
X = np.vstack((x1,x2)).T
def plot_decision_boundary(classifier, X, y, N=10, scatter_weights=np.ones(len(y)), ax=None):
    '''Utility function to plot decision boundary and scatter plot of data'''
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, N), np.linspace(y_min, y_max, N))

    # Check which prediction methods are available
    if hasattr(classifier, "decision_function"):
        zz = np.array([classifier.decision_function(np.array([xi, yi]).reshape(1, -1))
                       for xi, yi in zip(np.ravel(xx), np.ravel(yy))])
    elif hasattr(classifier, "predict_proba"):
        zz = np.array([classifier.predict_proba(np.array([xi, yi]).reshape(1, -1))[:, 1]
                       for xi, yi in zip(np.ravel(xx), np.ravel(yy))])
    else:
        zz = np.array([classifier(np.array([xi, yi]).reshape(1, -1))
                       for xi, yi in zip(np.ravel(xx), np.ravel(yy))])

    # Reshape result and plot
    Z = zz.reshape(xx.shape)
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])

    # Get current axis and plot
    if ax is None:
        ax = plt.gca()
    ax.contourf(xx, yy, Z, 2, cmap='RdBu', alpha=.5)
    ax.contour(xx, yy, Z, 2, cmap='RdBu')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm_bright, s=scatter_weights * 40)
    ax.set_xlabel('$X_1$')
    ax.set_ylabel('$X_2$')
Sklearn AdaBoost and decision boundary¶
boost = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2),
                           algorithm='SAMME', n_estimators=10, learning_rate=1.0)
boost.fit(X, y)

plot_decision_boundary(boost, X, y, N=50)
plt.show()
boost.score(X,y)
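For later comparison with the from-scratch version, it is useful to inspect what scikit-learn stores on the fitted ensemble. A small sketch using the boost object fitted above: estimator_weights_ holds the per-stage coefficients $\alpha_m$, and staged_score yields the training accuracy of the partial ensemble after each boosting iteration.
# Per-stage coefficients alpha_m learned by scikit-learn
print(boost.estimator_weights_)

# Training accuracy of the partial ensemble after each boosting iteration
for m, score in enumerate(boost.staged_score(X, y), start=1):
    print('m = {:2d}, training accuracy = {:.3f}'.format(m, score))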
Python implementation¶
def AdaBoost_scratch(X, y, M=10, learning_rate=1):
    # Initialization of utility variables
    N = len(y)
    estimator_list, y_predict_list, estimator_error_list, estimator_weight_list, sample_weight_list = [], [], [], [], []

    # Initialize the sample weights
    sample_weight = np.ones(N) / N
    sample_weight_list.append(sample_weight.copy())

    # For m = 1 to M
    for m in range(M):

        # Fit a classifier
        estimator = DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2)
        estimator.fit(X, y, sample_weight=sample_weight)
        y_predict = estimator.predict(X)

        # Misclassifications
        incorrect = (y_predict != y)

        # Estimator error
        estimator_error = np.mean(np.average(incorrect, weights=sample_weight, axis=0))

        # Boost estimator weights
        estimator_weight = learning_rate * np.log((1. - estimator_error) / estimator_error)

        # Boost sample weights
        sample_weight *= np.exp(estimator_weight * incorrect * ((sample_weight > 0) | (estimator_weight < 0)))

        # Save iteration values
        estimator_list.append(estimator)
        y_predict_list.append(y_predict.copy())
        estimator_error_list.append(estimator_error.copy())
        estimator_weight_list.append(estimator_weight.copy())
        sample_weight_list.append(sample_weight.copy())

    # Convert to np array for convenience
    estimator_list = np.asarray(estimator_list)
    y_predict_list = np.asarray(y_predict_list)
    estimator_error_list = np.asarray(estimator_error_list)
    estimator_weight_list = np.asarray(estimator_weight_list)
    sample_weight_list = np.asarray(sample_weight_list)

    # Predictions
    preds = np.array([np.sign((y_predict_list[:, point] * estimator_weight_list).sum()) for point in range(N)])
    print('Accuracy = ', (preds == y).sum() / N)

    return estimator_list, estimator_weight_list, sample_weight_list
Running scratch AdaBoost¶
Accuracy = 100%, the same as for the Sklearn implementation above.
estimator_list, estimator_weight_list, sample_weight_list = AdaBoost_scratch(X,y, M=10, learning_rate = 1)
Adaboost_scratch Plotting utility function¶
def plot_AdaBoost_scratch_boundary(estimators, estimator_weights, X, y, N=10, ax=None):
    '''Utility function to plot decision boundary and scatter plot of data'''

    def AdaBoost_scratch_classify(x_temp, est, est_weights):
        '''Return classification prediction for a given point X and a previously fitted AdaBoost'''
        temp_pred = np.asarray([(e.predict(x_temp)).T * w for e, w in zip(est, est_weights)]) / est_weights.sum()
        return np.sign(temp_pred.sum(axis=0))

    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, N), np.linspace(y_min, y_max, N))
    zz = np.array([AdaBoost_scratch_classify(np.array([xi, yi]).reshape(1, -1), estimators, estimator_weights)
                   for xi, yi in zip(np.ravel(xx), np.ravel(yy))])

    # Reshape result and plot
    Z = zz.reshape(xx.shape)
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])

    if ax is None:
        ax = plt.gca()
    ax.contourf(xx, yy, Z, 2, cmap='RdBu', alpha=.5)
    ax.contour(xx, yy, Z, 2, cmap='RdBu')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm_bright)
    ax.set_xlabel('$X_1$')
    ax.set_ylabel('$X_2$')
Plotting resulting decision boundary¶
plot_AdaBoost_scratch_boundary(estimator_list, estimator_weight_list, X, y, N = 50 )
Viewing decision boundaries of the weak classifiers inside AdaBoost¶
- $M = 10$ iterations
- Size of scatter points is proportional to scale of sample_weights
For example, in the top left plot ($m = 0$), 6 blue points are misclassified. In the next iteration ($m = 1$), their sample weight is increased and the scatter plot shows them as bigger.
estimator_list, estimator_weight_list, sample_weight_list = AdaBoost_scratch(X,y, M=10, learning_rate = 1)
fig = plt.figure(figsize=(14, 14))
for m in range(0, 9):
    fig.add_subplot(3, 3, m + 1)
    s_weights = (sample_weight_list[m, :] / sample_weight_list[m, :].sum()) * 40
    plot_decision_boundary(estimator_list[m], X, y, N=50, scatter_weights=s_weights)
    plt.title('Estimator decision boundary, m = {}'.format(m))
A visual explanation of the trade-off between learning rate and iterations¶
This section is based on the assumption that the AdaBoost algorithm is similar to the M1 or SAMME implementations, which can be summarized as follows:
Let $G_m(x), \ m = 1,2,...,M$ be the sequence of weak classifiers. Our objective is:
$$G(x) = \text{sign} \left( \alpha_1 G_1(x) + \alpha_2 G_2(x) + ... \alpha_M G_M(x)\right) = \text{sign} \left( \sum_{m = 1}^M \alpha_m G_m(x)\right)$$
AdaBoost.M1
- Initialize the observation weights $w_i = 1/N$
For $m = 1,2,...,M$
- Compute the weighted error $\text{err}_m = \frac{\sum_{i=1}^N w_i \mathcal{I}(y^{(i)} \neq G_m(x^{(i)}))}{\sum_{i=1}^N w_i}$
- Compute the estimator coefficient $\alpha_m = L \log \left( \frac{1 - \text{err}_m}{\text{err}_m}\right)$, where $L \leq 1$ is the learning rate
- Set data weights $w_i \leftarrow w_i \exp[ \alpha_m \ \mathcal{I}(y^{(i)} \neq G_m(x^{(i)}))]$
- To avoid numerical instability, normalize the weights at each step: $w_i \leftarrow \frac{w_i}{\sum_{i=1}^N w_i}$
Output $G(x) = \text{sign} \left[ \sum_{m=1}^M \alpha_m G_m(x)\right]$
The impact of the learning rate $L$ and the number of weak classifiers $M$¶
From the above algorithm, we can understand intuitively that:
- Decreasing the learning rate $L$ makes the coefficients $\alpha_m$ smaller, which reduces the amplitude of the sample weights at each step (since $w_i \leftarrow w_i e^{ \alpha_m \mathcal{I}(\cdot)}$). This translates into smaller variations of the weighted data points, and therefore smaller differences between the weak classifiers' decision boundaries
- Increasing the number of weak classifiers $M$ increases the number of iterations and allows the sample weights to reach greater amplitude. This translates into 1) more weak classifiers to combine at the end and 2) more variation in the decision boundaries of these classifiers. Together, these effects tend to produce more complex overall decision boundaries
From this intuition, it would make sense to see a trade-off between the parameters $L$ and $M$: increasing one while decreasing the other tends to cancel the effect.
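As a quick numerical check of this intuition (a sketch that reuses the AdaBoost_scratch function defined earlier; the exact numbers depend on the toy data): with a smaller learning rate the coefficients $\alpha_m$ shrink and the final sample weights stay closer to uniform. The plots in the next section show the same effect on the decision boundaries.
# Compare the coefficients and the spread of the final sample weights for two learning rates (M fixed)
for lr in [0.1, 1.0]:
    est_list, est_weights, sw_list = AdaBoost_scratch(X, y, M=10, learning_rate=lr)
    print('L = {:.1f}, alpha_m = {}'.format(lr, np.round(est_weights, 3)))
    print('         final sample weight spread = {:.4f}'.format(sw_list[-1].max() - sw_list[-1].min()))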
An example on a toy dataset¶
Here is a toy dataset used in a different question. Plotting the final decision boundary for different values of $L$ and $M$ shows that there is some intuitive relation between the two:
- Too small $L$ or $M$ leads to an overly simplistic boundary (left-hand side plots)
- Making one large and the other small tends to cancel the effect (middle plots)
- Making both large gives a good result, but may overfit (right-hand side plot)
Sources:¶
- Sklearn AdaBoost implementation - line 479
- Elements of Statistical Learning - page 339
fig = plt.figure(figsize=(15, 10))

for k, l in enumerate([0.1, 0.5, 1]):
    fig.add_subplot(2, 3, k + 1)
    fig.subplots_adjust(hspace=0.4, wspace=0.4)
    estimator_list, estimator_weight_list, sample_weight_list = AdaBoost_scratch(X, y, M=10, learning_rate=l)
    plot_AdaBoost_scratch_boundary(estimator_list, estimator_weight_list, X, y, N=50, ax=None)
    plt.title('Adaboost boundary: M = 10, L = {}'.format(l))
    print(estimator_weight_list)

for k, m in enumerate([1, 3, 10]):
    fig.add_subplot(2, 3, k + 4)
    fig.subplots_adjust(hspace=0.4, wspace=0.4)
    estimator_list, estimator_weight_list, sample_weight_list = AdaBoost_scratch(X, y, M=m, learning_rate=1)
    plot_AdaBoost_scratch_boundary(estimator_list, estimator_weight_list, X, y, N=50, ax=None)
    plt.title('Adaboost boundary: M = {}, L = 1'.format(m))
    print(estimator_weight_list)