Implementing coordinate descent for lasso regression in Python¶
Following the previous blog post, where we derived the closed-form solution for lasso coordinate descent, we will now implement it in Python with NumPy and visualize the path taken by the coefficients as a function of $\lambda$.
Our results are also compared to the Sklearn implementation as a sanity check.
Algorithm¶
In the simple case of cyclic coordinate descent, we cycle through all features one at a time, minimizing the cost function with respect to each coordinate.
- Start with $\mathbf{x}^{(0)} = (x_1^0,...,x_n^0)$
- Round $k+1$ defines $\mathbf{x}^{(k+1)}$ from $\mathbf{x}^{(k)}$ by iteratively solving the single-variable optimization problems
- $x_i^{(k+1)} = \underset{\omega}{\operatorname{argmin}} \ f(x_1^{(k+1)}, ..., x_{i-1}^{(k+1)}, \omega, x_{i+1}^{(k)}, ..., x_n^{(k)})$
- If this one-dimensional problem has no convenient exact solution, a single gradient step along the coordinate can be used instead: $x_i := x_i - \alpha \frac{\partial f}{\partial x_i}(\mathbf{x})$
- Repeat for each variable $x_i$ in $\mathbf{x}$, for $i = 1,...,n$ (see the sketch below).
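To make the generic recipe concrete, below is a minimal sketch of cyclic coordinate descent on a simple smooth objective $f(\mathbf{x}) = \frac{1}{2}\|A\mathbf{x} - b\|^2$, using one gradient step per coordinate; the matrix $A$, vector $b$, step size and iteration counts are arbitrary illustration choices, not part of the lasso derivation that follows.
import numpy as np
#Minimal sketch: cyclic coordinate descent on f(x) = 0.5*||Ax - b||^2
#(A, b, alpha and the iteration counts are arbitrary illustration values)
rng = np.random.default_rng(42)
A = rng.normal(size=(20, 4))
b = rng.normal(size=20)
x = np.zeros(4) #starting point x^(0)
alpha = 0.01    #step size for the per-coordinate gradient step
for k in range(200):                   #outer rounds
    for i in range(4):                 #cycle through the coordinates
        grad_i = A[:, i] @ (A @ x - b) #partial derivative df/dx_i
        x[i] = x[i] - alpha*grad_i     #update coordinate i, keeping the others fixed
print(x)                                    #should be close to the least squares solution
print(np.linalg.lstsq(A, b, rcond=None)[0])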
If a closed form solution to the single variable optimization exists then we can update the parameter directly at each step:
- $\theta_j =$ the closed-form minimizer of the single-variable problem (for the lasso, the soft-thresholding rule derived in the next section)
Note that the term $\rho_j$ can be rewritten in terms of the full prediction $\hat y^{(i)}_{pred}$, which simplifies the implementation:
$$\rho_j = \sum_{i=1}^m x_j^{(i)} (y^{(i)} - \sum_{k \neq j}^n \theta_k x_k^{(i)} ) = \sum_{i=1}^m x_j^{(i)} (y^{(i)} - \hat y^{(i)}_{pred} + \theta_j x_j^{(i)} ) $$
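The identity simply adds the contribution of feature $j$ back to the full residual instead of recomputing the sum over $k \neq j$ at every step. A quick numeric check of the identity on small made-up data (sizes and values are arbitrary, for illustration only):
import numpy as np
#Verify the two expressions for rho_j on small random data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))   #m = 5 samples, n = 3 features
y_demo = rng.normal(size=(5, 1))
theta_demo = rng.normal(size=(3, 1))
j = 1
#Left-hand side: residual computed with feature j left out of the prediction
partial_pred = np.delete(X_demo, j, axis=1) @ np.delete(theta_demo, j, axis=0)
rho_left = (X_demo[:, [j]].T @ (y_demo - partial_pred)).item()
#Right-hand side: full residual, with the contribution of feature j added back
full_pred = X_demo @ theta_demo
rho_right = (X_demo[:, [j]].T @ (y_demo - full_pred + theta_demo[j]*X_demo[:, [j]])).item()
print(np.isclose(rho_left, rho_right)) #True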
Lasso coordinate descent - closed form solution¶
As shown previously, the Lasso cost function does admit a closed-form solution in the special case of coordinate descent, since each update reduces to a single-variable optimization problem. For normalized data the solution is given by the soft-thresholding function $S(\rho_j, \lambda)$:
$$\theta_j = S(\rho_j, \lambda) = \begin{cases} \rho_j + \lambda & \text{for} \ \rho_j < - \lambda \\ 0 & \text{for} \ - \lambda \leq \rho_j \leq \lambda \\ \rho_j - \lambda & \text{for} \ \rho_j > \lambda \end{cases}$$
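For example, with $\lambda = 1$: $S(2.5, 1) = 1.5$, $S(0.4, 1) = 0$ and $S(-3, 1) = -2$. Values of $\rho_j$ that fall inside $[-\lambda, \lambda]$ are snapped to exactly zero, which is what produces the sparsity of the lasso solution.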
Coordinate descent update rule:¶
Repeat until convergence or max number of iterations:
- For $j = 0,1,...,n$
- Compute $\rho_j = \sum_{i=1}^m x_j^{(i)} (y^{(i)} - \sum_{k \neq j}^n \theta_k x_k^{(i)} ) = \sum_{i=1}^m x_j^{(i)} (y^{(i)} - \hat y^{(i)}_{pred} + \theta_j x_j^{(i)} )$
- Set $\theta_j = S(\rho_j, \lambda)$
Note that if there is a constant term, it is not regularized, so $\theta_0 = \rho_0$.
Libraries¶
import numpy as np
from matplotlib import pyplot as plt
from sklearn import linear_model
from sklearn import datasets
%matplotlib inline
plt.style.use('seaborn-white') #renamed to 'seaborn-v0_8-white' in matplotlib >= 3.6
The dataset¶
We will be using the well-known diabetes dataset from Sklearn's built-in datasets.
#Load the diabetes dataset. In this case we will not be using a constant intercept feature
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target.reshape(-1,1)
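For reference, the Sklearn version of this dataset has 442 samples and 10 features that come already mean-centered and scaled:
#Quick check of the data dimensions and feature names
print(X.shape, y.shape)       #(442, 10) (442, 1)
print(diabetes.feature_names) #['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']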
Implementation using Numpy¶
def soft_threshold(rho, lamda):
    '''Soft-thresholding operator S(rho, lamda) used in the lasso coordinate descent update (normalized data)'''
    if rho < -lamda:
        return rho + lamda
    elif rho > lamda:
        return rho - lamda
    else:
        return 0
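A few quick sanity checks of the function, matching the worked values given earlier:
print(soft_threshold(2.5, 1.0))  #1.5
print(soft_threshold(0.4, 1.0))  #0
print(soft_threshold(-3.0, 1.0)) #-2.0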
def coordinate_descent_lasso(theta, X, y, lamda=.01, num_iters=100, intercept=False):
    '''Cyclic coordinate descent for lasso regression, for normalized data.
    If intercept is True, the first column of X is treated as an unregularized
    intercept term (theta_0 = rho_0). Note that theta is updated in place, so
    reusing the same array across calls warm-starts from the previous solution.'''
    #Initialisation of useful values
    m, n = X.shape
    X = X / np.linalg.norm(X, axis=0) #normalizing X in case it was not done before
    #Looping until max number of iterations
    for i in range(num_iters):
        #Looping through each coordinate
        for j in range(n):
            #Vectorized computation of rho_j from the full prediction (see the formula above)
            X_j = X[:, j].reshape(-1, 1)
            y_pred = X @ theta
            rho = X_j.T @ (y - y_pred + theta[j]*X_j)
            #The intercept (if any) is not regularized, every other coordinate is soft-thresholded
            if intercept and j == 0:
                theta[j] = rho
            else:
                theta[j] = soft_threshold(rho, lamda)
    return theta.flatten()
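Before tracing the whole path, the solver can be checked on a single, arbitrarily chosen value of $\lambda$ (here $\lambda = 100$; any value in the range used below would do); as $\lambda$ increases, more coefficients are driven exactly to zero.
#Single run of the solver for one (arbitrary) value of lambda
theta_single = coordinate_descent_lasso(np.zeros((X.shape[1], 1)), X, y, lamda=100., num_iters=100)
print(dict(zip(diabetes.feature_names, np.round(theta_single, 1))))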
Lasso coefficient path using Numpy implementation¶
# Initialize variables
m,n = X.shape
initial_theta = np.ones((n,1))
theta_list = list()
lamda = np.logspace(0,4,300)/10 #Range of lambda values, from 0.1 to 1000
#Run lasso regression for each lambda
#(initial_theta is updated in place, so each lambda warm-starts from the previous solution)
for l in lamda:
theta = coordinate_descent_lasso(initial_theta,X,y,lamda = l, num_iters=100)
theta_list.append(theta)
#Stack into numpy array
theta_lasso = np.stack(theta_list).T
#Plot results
n,_ = theta_lasso.shape
plt.figure(figsize = (12,8))
for i in range(n):
plt.plot(lamda, theta_lasso[i], label = diabetes.feature_names[i])
plt.xscale('log')
plt.xlabel('$\\lambda$ (log scale)')
plt.ylabel('Coefficients')
plt.title('Lasso Paths - Numpy implementation')
plt.legend()
plt.axis('tight')
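One way to read the plot: as $\lambda$ grows, more and more coefficients are set exactly to zero by the soft threshold. A quick check of the sparsity along the computed path:
#Count how many coefficients are (numerically) zero at each lambda
n_zero = (np.abs(theta_lasso) < 1e-10).sum(axis=0)
print(n_zero[::50]) #sample a few points along the path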
Lasso coefficient path using Sklearn¶
#Load the diabetes dataset
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
X = X / np.linalg.norm(X, axis=0) #Normalize the columns of X, as in the Numpy implementation
m,n = X.shape
eps = 5e-5 # the smaller it is, the longer the path
alphas_lasso, coefs_lasso, _ = linear_model.lasso_path(X, y, eps=eps) #Compute the lasso path (no intercept is fitted by lasso_path)
# Display results
plt.figure(figsize = (12,8))
for i in range(n):
plt.plot(alphas_lasso, coefs_lasso[i], label = diabetes.feature_names[i])
plt.xscale('log')
plt.xlabel('$\\lambda$ (log scale)')
plt.ylabel('Coefficients')
plt.title('Lasso paths - Sklearn')
plt.legend()
plt.axis('tight')