Ridge and Lasso regression - intuition behind the optimal solution

This short notebook offers a visual intuition for the similarities and differences between Ridge and Lasso regression. In particular, we will plot the contours of the Ordinary Least Squares (OLS) cost function together with those of the $L_2$ and $L_1$ cost functions.

Ridge regression

Recall that the Ridge minimization problem can be expressed equivalently as:

$$ \hat \theta_{ridge} = argmin_{\theta \in \mathbb{R}^n} \sum_{i=1}^m (y_i - \mathbf{x_i}^T \theta)^2 + \lambda \sum_{j=1}^n \theta_j^2 $$

This can be viewed as the minimization of two terms: $OLS + L_2$. To gain an intuition, consider each term separately:

  • The first OLS term can be written as $(y - X \theta)^T(y - X \theta)$, which gives rise to an ellipse contour plot centered around the Maximum Likelihood Estimator (MLE).
  • The second $L_2$ term can be written as $\theta^T \theta = \theta_1^2 + \theta_2^2 = c$, which is the equation of a circle of radius $\sqrt{c}$ centered at the origin.
  • The solution to the constrained optimization lies at the intersection between the contours of the two functions, and this intersection varies as a function of $\lambda$. For $\lambda = 0$ the solution is the MLE (as usual) and for $\lambda \to \infty$ the solution is at $[0,0]$. The closed-form expression sketched after this list makes these two limits explicit.
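
For reference, writing the Ridge objective in matrix form as $(y - X\theta)^T(y - X\theta) + \lambda \theta^T \theta$ and setting its gradient to zero gives the closed-form minimizer implemented later in `closed_form_reg_solution`:

$$ \hat \theta_{ridge} = (X^T X + \lambda I)^{-1} X^T y $$

For $\lambda = 0$ this reduces to the OLS/MLE solution $(X^T X)^{-1} X^T y$, and as $\lambda \to \infty$ the estimate shrinks to the zero vector.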

Lasso regression

Recall that the Lasso minimization problem can be expressed as:

$$ \hat \theta_{lasso} = argmin_{\theta \in \mathbb{R}^n} \sum_{i=1}^m (y_i - \mathbf{x_i}^T \theta)^2 + \lambda \sum_{j=1}^n |\theta_j| $$

This can be viewed as the minimization of two terms: $OLS + L_1$.

  • The first OLS term can be written as $(y - X \theta)^T(y - X \theta)$, which gives rise to an ellipse contour plot centered around the Maximum Likelihood Estimator (MLE).
  • The second $L_1$ term, $|\theta_1| + |\theta_2| = c$, is the equation of a diamond centered at the origin.
  • As previously, the solution to the constrained optimization lies at the intersection between the contours of the two functions, and this intersection varies as a function of $\lambda$. For $\lambda = 0$ the solution is the MLE (as usual) and for $\lambda \to \infty$ the solution is at $[0,0]$.
  • Note that at the vertices of the diamond one of the coordinates is exactly 0, so the intersection can occur at a point where one of the coefficients is exactly 0, meaning a feature can be dropped from the model entirely (see the orthonormal-design sketch after this list).
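
To make this concrete, in the special case of an orthonormal design ($X^T X = I$, an assumption made here purely for illustration) both problems decouple coordinate-wise and admit explicit solutions in terms of the OLS estimate $\hat\theta^{OLS}$:

$$ \hat\theta_j^{ridge} = \frac{\hat\theta_j^{OLS}}{1 + \lambda}, \qquad \hat\theta_j^{lasso} = \operatorname{sign}(\hat\theta_j^{OLS}) \, \max\left(|\hat\theta_j^{OLS}| - \frac{\lambda}{2},\ 0\right) $$

Ridge rescales every coefficient by the same factor and never produces an exact zero, while the Lasso soft-thresholding sets a coefficient to exactly zero once $\lambda$ is large enough.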

Intuitive comparison between Ridge and Lasso

The Lasso $L_1$ term provides a much more aggressive regularization because the intersection between the constraint region and the cost function contours tends to happen at the “vertices” of the diamond, where one of the variables is exactly zero. The effect of such regularization is to “cancel out” variables, and by varying $\lambda$ we can test the impact of each variable on the model. In effect this is automatic variable selection.

On the other hand, Ridge regression provides a less aggressive form of regularization: the coefficients shrink towards zero as $\lambda$ grows but reach zero only in the limit. The short sketch below illustrates this contrast numerically.
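
The following is a minimal sketch (not part of the original analysis) contrasting the two behaviours with scikit-learn on a toy dataset in which only the first of two features is informative; the data and the `alpha` values are illustrative assumptions.

In [ ]:
import numpy as np
from sklearn.linear_model import Ridge, Lasso

#Toy data: only the first feature actually drives y (illustrative assumption)
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] + 0.1 * rng.normal(size=100)

for alpha in [0.01, 0.1, 1.0]:
    ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X_toy, y_toy)
    lasso = Lasso(alpha=alpha, fit_intercept=False).fit(X_toy, y_toy)
    print(f'alpha={alpha}: ridge={ridge.coef_.round(3)}, lasso={lasso.coef_.round(3)}')

#Ridge shrinks both coefficients smoothly towards zero, whereas the Lasso
#drives the uninformative coefficient to exactly 0 once alpha is large enough.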

Visualizing the OLS cost function, the $L_1$ and $L_2$ constraints, and the solution path

Libraries

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import linear_model

%matplotlib inline
plt.style.use('seaborn-white')

Helper functions

In [12]:
def costfunction(X, y, theta):
    '''OLS cost function'''
    #Initialisation of useful values
    m = np.size(y)

    #Cost function in vectorized form
    h = X @ theta
    J = float((1. / (2 * m)) * (h - y).T @ (h - y))
    return J


def closed_form_solution(X,y):
    '''Linear regression closed form solution'''
    return np.linalg.inv(X.T @ X) @ X.T @ y
    
def closed_form_reg_solution(X, y, lamda=10):
    '''Ridge regression closed form solution'''
    m, n = X.shape
    I = np.eye(n)
    return (np.linalg.inv(X.T @ X + lamda * I) @ X.T @ y)[:, 0]

def cost_l2(x, y):
    '''L2 cost function'''
    return x**2 + y**2

def cost_l1(x,y):
    '''L1 cost function'''
    return np.abs(x) + np.abs(y)
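
A minimal sanity check of these helpers on random data (not part of the original notebook): the OLS cost evaluated at the closed-form solution should not exceed the cost at a nearby perturbed point.

In [ ]:
#Quick sanity check of the helpers on random data (illustrative only)
rng = np.random.RandomState(0)
X_chk = rng.normal(size=(50, 2))
y_chk = rng.normal(size=(50, 1))

theta_star = closed_form_solution(X_chk, y_chk)
assert costfunction(X_chk, y_chk, theta_star) <= costfunction(X_chk, y_chk, theta_star + 0.1)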

Simulated dataset

In [13]:
#Creating the dataset (as previously)
x = np.linspace(0,1,40)
noise = 1*np.random.uniform(  size = 40)
y = np.sin(x * 1.5 * np.pi ) 
y_noise = (y + noise).reshape(-1,1)

#Subtracting the mean so that the y's are centered
y_noise = y_noise - y_noise.mean()
X = np.vstack((2*x,x**2)).T

#Normalizing the design matrix to facilitate visualization
X = X / np.linalg.norm(X,axis = 0)
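
A quick check (illustrative, not in the original notebook) that the preprocessing did what we expect: each column of `X` now has unit norm and `y_noise` has zero mean.

In [ ]:
#Verify the preprocessing (illustrative check)
print(np.linalg.norm(X, axis=0))   #should be [1. 1.]
print(y_noise.mean())              #should be approximately 0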

Computing the Ridge regularization solutions as a function of $\lambda$

Note that we are using the closed-form solution given above.

In [9]:
lambda_range = np.logspace(0,4,num = 100)/1000
theta_0_list_reg_l2 = []
theta_1_list_reg_l2 = []

for l in lambda_range:
    t0, t1 = closed_form_reg_solution(X,y_noise,l)
    theta_0_list_reg_l2.append(t0)
    theta_1_list_reg_l2.append(t1)
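
As expected from the discussion above, the small-$\lambda$ end of this path should be close to the OLS solution, while the large-$\lambda$ end should be shrunk towards the origin; a quick check (not in the original notebook):

In [ ]:
#Endpoints of the Ridge path (illustrative check)
print('smallest lambda:', theta_0_list_reg_l2[0], theta_1_list_reg_l2[0])
print('largest lambda: ', theta_0_list_reg_l2[-1], theta_1_list_reg_l2[-1])
print('OLS solution:   ', closed_form_solution(X, y_noise).ravel())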

Computing the Lasso regularization solutions as a function of $\lambda$

Note that we are using scikit-learn, as the Lasso has no closed-form solution.

In [10]:
lambda_range = np.logspace(0,2,num = 100)/1000
theta_0_list_reg_l1 = []
theta_1_list_reg_l1 = []

for l in lambda_range:
    model_sk_reg = linear_model.Lasso(alpha=l, fit_intercept=False)
    model_sk_reg.fit(X,y_noise)
    t0, t1 = model_sk_reg.coef_
    theta_0_list_reg_l1.append(t0)
    theta_1_list_reg_l1.append(t1)
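
The Lasso path should contain solutions where a coefficient is exactly zero once $\lambda$ is large enough; a quick check (not in the original notebook) of how many of the 100 solutions along the path have each coefficient exactly at zero:

In [ ]:
#Count Lasso solutions along the path with a coefficient exactly at zero (illustrative check)
print('theta_0 exactly zero:', np.sum(np.array(theta_0_list_reg_l1) == 0))
print('theta_1 exactly zero:', np.sum(np.array(theta_1_list_reg_l1) == 0))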

Plotting the results

In both diagrams, the contour plots show the two limiting cases $\lambda = 0$ and $\lambda \to \infty$ of the Ridge and Lasso cost functions; in effect, they are the contour plots of the OLS, $L_2$ and $L_1$ cost functions.

The red dots in between are the optimal solutions as a function of $\lambda$.

In [15]:
#Setup of meshgrid of theta values
xx, yy = np.meshgrid(np.linspace(-2,17,100),np.linspace(-17,3,100))

#Computing the cost function for each theta combination
zz_l2 = np.array(  [cost_l2(xi, yi )for xi, yi in zip(np.ravel(xx), np.ravel(yy)) ] ) #L2 function

zz_l1 = np.array(  [cost_l1(xi, yi )for xi, yi in zip(np.ravel(xx), np.ravel(yy)) ] ) #L1 function

zz_ls = np.array(  [costfunction(X, y_noise.reshape(-1,1),np.array([t0,t1]).reshape(-1,1)) 
                     for t0, t1 in zip(np.ravel(xx), np.ravel(yy)) ] ) #least square cost function

#Reshaping the cost values    
Z_l2 = zz_l2.reshape(xx.shape)
Z_ls = zz_ls.reshape(xx.shape)
Z_l1 = zz_l1.reshape(xx.shape)

#Defining the global min of each function
min_ls = closed_form_solution(X, y_noise)
min_l2 = np.array([0.,0.])
min_l1 = np.array([0.,0.])

#Plotting the contours - L2 
fig = plt.figure(figsize = (16,7))
ax = fig.add_subplot(1, 2, 1)
ax.contour(xx, yy, Z_l2, levels = [.5,1.5,3,6,9,15,30,60,100,150,250],  cmap = 'gist_gray')
ax.contour(xx, yy, Z_ls, levels = [.01,.06,.09,.11,.15], cmap = 'coolwarm')
ax.set_xlabel('theta 1')
ax.set_ylabel('theta 2')
ax.set_title('Ridge solution as a function of $\\lambda$ - OLS and $L_2$ ')

#Plotting the minimum - L2 
ax.plot(min_ls[0],min_ls[1], marker = 'x', color = 'red', markersize = 10)
ax.plot(0,0, marker = 'o', color = 'black', markersize = 10)

#Plotting the path of L2 regularized minimum
ax.plot(theta_0_list_reg_l2,theta_1_list_reg_l2, linestyle = 'none', marker = 'o', color = 'red', alpha = .2)


#Plotting the contours - L1 
ax = fig.add_subplot(1, 2, 2)
ax.contour(xx, yy, Z_l1, levels = [.5,1,2,3,4,5,6,8,10,12,14],  cmap = 'gist_gray')
ax.contour(xx, yy, Z_ls, levels = [.01,.06,.09,.11,.15], cmap = 'coolwarm')
ax.set_xlabel('theta 1')
ax.set_ylabel('theta 2')
ax.set_title('Lasso solution as a function of $\\lambda$ - OLS and $L_1$ ')

#Plotting the minimum - L1
ax.plot(min_ls[0],min_ls[1], marker = 'x', color = 'red', markersize = 10)
ax.plot(0,0, marker = 'o', color = 'black', markersize = 10)

#Plotting the path of L1 regularized minimum
ax.plot(theta_0_list_reg_l1,theta_1_list_reg_l1, linestyle = 'none', marker = 'o', color = 'red', alpha = .2)

plt.show()