Implementing coordinate descent for lasso regression in Python

Following the previous blog post, where we derived the closed form solution for lasso coordinate descent, we will now implement it in Python using NumPy and visualize the paths taken by the coefficients as a function of $\lambda$.

Our results are also compared to the Sklearn implementation as a sanity check.

Algorithm

In the simple case of cyclic coordinate descent, we cycle through all features one at a time, minimizing the cost function with respect to each coordinate.

  • Start with an initial guess $\mathbf{x}^{(0)} = (x_1^{(0)},...,x_n^{(0)})$
  • Round $k+1$ defines $\mathbf{x}^{(k+1)}$ from $\mathbf{x}^{(k)}$ by iteratively solving the single variable optimization problem
    • $x_i^{(k+1)} = \underset{\omega}{\operatorname{argmin}} \ f(x_1^{(k+1)}, ..., x_{i-1}^{(k+1)}, \omega, x_{i+1}^{(k)}, ..., x_n^{(k)})$
    • If this single variable problem is not solved exactly, a gradient step along the coordinate can be used instead: $x_i := x_i - \alpha \frac{\partial f}{\partial x_i}(\mathbf{x})$
    • Repeat for each variable $x_i$ in $\mathbf{x}$ for $i = 1,...,n$; a minimal sketch of this cyclic scheme is given below.
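
To make the scheme concrete, here is a minimal sketch of cyclic coordinate descent using the gradient-step variant, applied to a simple quadratic $f(\mathbf{x}) = \sum_i a_i x_i^2$ (this toy function and the helper names are illustrative only, not part of the lasso derivation):

In [ ]:
import numpy as np

def cyclic_coordinate_descent(grad_i, x0, alpha=0.1, num_rounds=100):
    '''Cycle through the coordinates, taking a gradient step along one coordinate at a time.
    grad_i(x, i) returns the partial derivative of f with respect to x_i.'''
    x = x0.astype(float).copy()
    for _ in range(num_rounds):      #round k -> k+1
        for i in range(len(x)):      #update each coordinate in turn
            x[i] = x[i] - alpha * grad_i(x, i)
    return x

#Example: f(x) = sum(a_i * x_i^2), minimized at the origin
a = np.array([1.0, 2.0, 3.0])
cyclic_coordinate_descent(lambda x, i: 2*a[i]*x[i], np.array([5.0, -3.0, 2.0]))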

If a closed form solution to the single variable optimization exists, then we can skip the gradient step and update the parameter directly at each iteration:

  • $\theta_j = $ the closed form minimizer for coordinate $j$ (given below for the lasso)

Note that the term $\rho_j$ can be rewritten in terms of the full prediction $\hat y^{(i)}_{pred}$, which simplifies the implementation: instead of summing over all features $k \neq j$, we compute the full prediction and add the contribution of feature $j$ back in.

$$\rho_j = \sum_{i=1}^m x_j^{(i)} (y^{(i)} - \sum_{k \neq j}^n \theta_k x_k^{(i)} ) = \sum_{i=1}^m x_j^{(i)} (y^{(i)} - \hat y^{(i)}_{pred} + \theta_j x_j^{(i)} ) $$
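
In NumPy this rewrite becomes a couple of lines. Here is a hedged sketch, assuming X is the design matrix with normalized columns, y is an (m, 1) column vector and theta the current (n, 1) coefficient vector (the function name rho_j is just for illustration):

In [ ]:
import numpy as np

def rho_j(X, y, theta, j):
    '''Compute rho_j from the full prediction, adding the contribution of feature j back in.'''
    X_j = X[:, j].reshape(-1, 1)    #j-th column as an (m,1) vector
    y_pred = X @ theta              #full prediction using all coordinates
    return (X_j.T @ (y - y_pred + theta[j] * X_j)).item()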

Lasso coordinate descent - closed form solution

As shown previously, the Lasso cost function does have a closed form solution in the special case of coordinate descent, since the problem reduces to a single variable optimization. For normalized data, the solution is given by the soft threshold function $S(\rho_j, \lambda)$:

$$\theta_j = S(\rho_j, \lambda) = \begin{cases} \rho_j + \lambda & \text{for} \ \rho_j < - \lambda \\ 0 & \text{for} \ - \lambda \leq \rho_j \leq \lambda \\ \rho_j - \lambda & \text{for} \ \rho_j > \lambda \end{cases}$$
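
For example, with $\lambda = 1$ we get $S(2.5, 1) = 1.5$, $S(0.4, 1) = 0$ and $S(-3, 1) = -2$: any coefficient whose $\rho_j$ falls inside $[-\lambda, \lambda]$ is set exactly to zero, which is what makes the lasso produce sparse solutions.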

Coordinate descent update rule:

Repeat until convergence or max number of iterations:

  • For $j = 0,1,...,n$
    • Compute $\rho_j = \sum_{i=1}^m x_j^{(i)} (y^{(i)} - \sum_{k \neq j}^n \theta_k x_k^{(i)} ) = \sum_{i=1}^m x_j^{(i)} (y^{(i)} - \hat y^{(i)}_{pred} + \theta_j x_j^{(i)} )$
    • Set $\theta_j = S(\rho_j, \lambda)$
  • Note that if there is a constant term, it is not regularized, so $\theta_0 = \rho_0$

Libraries

In [ ]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import linear_model
from sklearn import datasets

%matplotlib inline
plt.style.use('seaborn-white')  #use 'seaborn-v0_8-white' on newer matplotlib versions

The dataset

We will be using the well-known diabetes dataset from Sklearn's built-in datasets.

In [195]:
#Load the diabetes dataset. In this case we will not be using a constant intercept feature
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target.reshape(-1,1)
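
Note that, as documented in Sklearn, the diabetes features are already mean centered and scaled so that each column has unit L2 norm, which is exactly the normalization assumed by the closed form update (the implementation below still re-normalizes X defensively).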

Implementation using Numpy

In [196]:
def soft_threshold(rho,lamda):
    '''Soft threshold function used for normalized data and lasso regression'''
    if rho < - lamda:
        return (rho + lamda)
    elif rho >  lamda:
        return (rho - lamda)
    else: 
        return 0
    

def coordinate_descent_lasso(theta, X, y, lamda=0.01, num_iters=100, intercept=False):
    '''Coordinate descent for lasso regression - for normalized data.
    The intercept parameter specifies whether theta_0 is an unregularized intercept term.'''
    
    #Initialisation of useful values 
    m,n = X.shape
    X = X / (np.linalg.norm(X,axis = 0)) #normalizing X in case it was not done before
    
    #Looping until max number of iterations
    for i in range(num_iters): 
        
        #Looping through each coordinate
        for j in range(n):
            
            #Vectorized implementation
            X_j = X[:,j].reshape(-1,1)
            y_pred = X @ theta
            rho = X_j.T @ (y - y_pred  + theta[j]*X_j)
        
            #The intercept term (if present) is not regularized
            if intercept and j == 0:
                theta[j] = rho
            else:
                theta[j] = soft_threshold(rho, lamda)
            
    return theta.flatten()
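
Before computing the full path, here is a quick usage example on the diabetes data (the choice $\lambda = 10$ is arbitrary, just to show the call):

In [ ]:
m, n = X.shape
theta_example = coordinate_descent_lasso(np.ones((n, 1)), X, y, lamda=10, num_iters=100)
print(dict(zip(diabetes.feature_names, np.round(theta_example, 2))))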

Lasso coefficient path using Numpy implementation

In [197]:
# Initialize variables
m,n = X.shape
initial_theta = np.ones((n,1))
theta_list = list()
lamda = np.logspace(0,4,300)/10 #Range of lambda values

#Run lasso regression for each lambda (theta is updated in place, so each lambda warm-starts from the previous solution)
for l in lamda:
    theta = coordinate_descent_lasso(initial_theta, X, y, lamda=l, num_iters=100)
    theta_list.append(theta)

#Stack into numpy array
theta_lasso = np.stack(theta_list).T

#Plot results
n,_ = theta_lasso.shape
plt.figure(figsize = (12,8))

for i in range(n):
    plt.plot(lamda, theta_lasso[i], label = diabetes.feature_names[i])

plt.xscale('log')
plt.xlabel('Log($\\lambda$)')
plt.ylabel('Coefficients')
plt.title('Lasso Paths - Numpy implementation')
plt.legend()
plt.axis('tight')
Out[197]:
(0.06309573444801936, 1584.8931924611109, -849.8147108555936, 820.610451673353)

Lasso coefficient path using Sklearn

In [198]:
#Load the diabetes dataset
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X = X / np.linalg.norm(X, axis=0)  # Normalize X columns to unit norm, as in our own implementation
m,n = X.shape

eps = 5e-5  # the smaller it is the longer is the path

alphas_lasso, coefs_lasso, _ = linear_model.lasso_path(X, y, eps=eps)  # lasso_path does not fit an intercept; older Sklearn versions also accepted a fit_intercept argument here

# Display results
plt.figure(figsize = (12,8))

for i in range(n):
    plt.plot(alphas_lasso, coefs_lasso[i], label = diabetes.feature_names[i])

plt.xscale('log')
plt.xlabel('Log($\\lambda$)')
plt.ylabel('coefficients')
plt.title('Lasso paths - Sklearn')
plt.legend()
plt.axis('tight')
Out[198]:
(6.545782818637954e-05,
 3.5244762392941755,
 -861.120524805701,
 825.1469480793486)
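
As a sanity check, the two plots should show the same qualitative coefficient paths. Note, however, that the horizontal axes are not directly comparable: Sklearn parameterizes the penalty with $\alpha$ and scales the squared loss by $1/(2m)$, whereas our $\lambda$ is applied to the un-averaged cost, so the same shrinkage behaviour occurs over different ranges of the regularization parameter.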