Ridge regression - introduction

This notebook is the first of a series exploring regularization for linear regression, and in particular ridge and lasso regression.

We will focus here on ridge regression, with some notes on the background theory and the mathematical derivations that are useful for understanding the concepts.

Then, the algorithm is implemented in Python with NumPy.

Finally, we will provide visualizations of the cost function with and without regularization, to help build an intuition as to why ridge regression is a solution to poor conditioning and numerical instability.


Ridge Regression - Theory

Ridge regression and the Lasso are two forms of regularized regression. These methods seek to alleviate the consequences of multicollinearity, poorly conditioned equations, and overfitting.

  • When variables are highly correlated, a large coefficient on one variable may be offset by a large coefficient of opposite sign on a correlated variable, so the individual coefficients are poorly determined and have high variance
  • Regularization imposes an upper threshold on the magnitude of the coefficients, thereby producing a more parsimonious solution and a set of coefficients with smaller variance

Ridge regression as an L2 constrained optimization problem

Ridge regression is motivated by a constrained minimization problem, which can be formulated as follows:

$$ \hat \theta_{ridge} = argmin_{\theta \in \mathbb{R}^n} \sum_{i=1}^m (y_i - \mathbf{x_i}^T \theta)^2$$

$$ \text{subject to } \sum_{j=1}^n \theta_j^2 \leq t$$

for some $t \geq 0$. The feasible set for this minimization problem is therefore

$$ S(t) := \{ \theta \in \mathbb{R}^n : \| \theta \|^2_2 \leq t \}$$

Where $\theta$ does not contain the intercept $\theta_0$. Because of the $L_2$ penalty, the ridge estimator is not equivariant under a rescaling of the inputs; this difficulty is circumvented by centering (and typically standardizing) the predictors. Here we will assume that the design matrix $X$ is in fact the centered matrix, so that we can exclude the intercept $\theta_0$ from the penalty term.

The use of an $L_2$ penalty in a least squares problem is sometimes referred to as Tikhonov regularization. Using a Lagrange multiplier, we can rewrite the problem as:

$$ \hat \theta_{ridge} = argmin_{\theta \in \mathbb{R}^n} \sum_{i=1}^m (y_i - \mathbf{x_i}^T \theta)^2 + \lambda \sum_{j=1}^n \theta_j^2 $$

Where $\lambda \geq 0$ and there is a one-to-one correspondence between $t$ and $\lambda$.

A constrained optimization problem is said to be a convex optimization problem if both the objective function and the constraint are convex. In our case, RSS($\theta$) is convex since it is a least squares objective, and to show that the regularization term is also convex, we can compute its Hessian matrix and show that it is positive definite:

\begin{aligned} f(\theta) & = \sum_{j=1}^n \theta_j^2 = \theta^T \theta \\ \nabla f(\theta) &= 2 \theta \\ \nabla^2 f(\theta) &= 2 I_n \succ 0 \end{aligned}

Which proves that $f(\theta)$ is strictly convex.

Ridge regression as a solution to poor conditioning

An alternative way of seeing the problem is to consider the matrix solution to the linear regression problem:

$$ \hat \theta = \mathbf{(X^TX)^{-1}X^Ty}$$

Which requires the inversion of $\mathbf{X^TX}$. If this matrix is not of full rank (i.e. there are linearly dependent columns) or if it is poorly conditioned (i.e. it has eigenvalues very close to zero), then the inversion is either impossible or numerically unstable.

A solution is to add a small positive value to the diagonal, so that the eigenvalues are shifted away from zero and the matrix is better conditioned:

$$ \hat \theta = \mathbf{(X^TX + \lambda^2 I_n)^{-1}X^Ty}$$

This is actually the solution to the following quadratic cost function:

$$ J(\theta) = \mathbf{(y - X \theta)^T(y - X \theta) + \lambda^2 \theta^T \theta}$$

This is proven as follows:

\begin{aligned} J(\theta) &= \mathbf{(y - X \theta)^T(y - X \theta) + \lambda^2 \theta^T \theta} \\ & = \mathbf{(y^T - \theta^T X^T) (y - X \theta) + \lambda^2 \theta^T \theta} \\ & = \mathbf{ y^Ty - y^TX\theta - \theta^TX^Ty + \theta^TX^TX\theta + \lambda^2 \theta^T \theta } \\ & = \mathbf{ y^Ty - 2 \theta^TX^Ty + \theta^TX^TX\theta + \lambda^2 \theta^T \theta } \\ \frac{ \partial}{\partial \theta} J(\theta) &= \mathbf{ - 2 X^Ty + 2 X^TX \theta + 2 \lambda^2 \theta } = 0 \\ \theta &= \mathbf{ (X^TX + \lambda^2I)^{-1} X^Ty} \end{aligned}
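A short numerical sketch of these two points (using made-up data and illustrative variable names, not the notebook's dataset): adding $\lambda^2 I$ to a nearly singular $\mathbf{X^TX}$ brings its condition number down, and the regularized normal equation can then be solved stably.

In [ ]:
import numpy as np

#Two nearly collinear columns make X^T X poorly conditioned
rng = np.random.RandomState(42)
x1 = rng.randn(100)
x2 = x1 + 1e-4 * rng.randn(100)              #almost an exact copy of x1
X_demo = np.column_stack((x1, x2))
y_demo = 3*x1 - 2*x2 + rng.randn(100)

lam2 = 1.0                                   #lambda^2 in the formulas above
A = X_demo.T @ X_demo
print('cond(X^T X)           :', np.linalg.cond(A))
print('cond(X^T X + lam^2 I) :', np.linalg.cond(A + lam2 * np.eye(2)))

#Ridge closed form solution, solved without explicitly inverting the matrix
theta_ridge = np.linalg.solve(A + lam2 * np.eye(2), X_demo.T @ y_demo)
print('theta_ridge:', theta_ridge)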

Intuition

As this is a convex optimization problem, it can be shown using the Lagrangian that the following two optimization problems are equivalent:

$$ argmin_{\theta} \ \mathbf{(y - X \theta)^T(y - X \theta) + \lambda^2 \theta^T \theta}\ \ \equiv \ \ argmin_{\theta^T\theta \leq t(\lambda)} \mathbf{(y - X \theta)^T(y - X \theta) }$$

Which can be used to form an intuitive view of the problem.

  • Consider $\theta^T \theta = \theta_1^2 + \theta^2_2 = c$, which is the equation of a circle of radius $\sqrt{c}$ centered at the origin
  • Similarly, the contours of the OLS (or Maximum Likelihood) objective $(y - X \theta)^T(y - X \theta)$ are ellipses centered at the Maximum Likelihood Estimator
  • The solution to the constrained optimization problem lies where a contour of the RSS touches the constraint region, and this point of contact varies with $\lambda$. For $\lambda = 0$ the solution is the MLE (as usual), and as $\lambda \to \infty$ the solution shrinks towards $[0,0]$ (a minimal local sketch of this picture follows below)
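The external figure loaded in the next cell illustrates this geometry; as a complement, here is a minimal, self-contained sketch (with hard-coded, purely illustrative contour values) showing the elliptical RSS contours around the OLS estimate together with a circular L2 constraint region.

In [ ]:
import numpy as np
from matplotlib import pyplot as plt

#Illustrative OLS estimate and a toy quadratic RSS surface (values are made up)
theta_ols = np.array([2.0, 1.5])
T1g, T2g = np.meshgrid(np.linspace(-3, 4, 200), np.linspace(-3, 4, 200))
rss = 2*(T1g - theta_ols[0])**2 + (T2g - theta_ols[1])**2    #elliptical contours

fig, ax = plt.subplots(figsize = (6,6))
ax.contour(T1g, T2g, rss, levels = 10, cmap = 'jet')
ax.add_patch(plt.Circle((0, 0), radius = 1.5, color = 'grey', alpha = .4,
                        label = 'L2 constraint region'))
ax.scatter(*theta_ols, color = 'r', label = 'OLS estimate')
ax.set_xlabel('theta 1')
ax.set_ylabel('theta 2')
ax.set_aspect('equal')
ax.legend()
plt.show()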
In [6]:
from IPython.display import Image
Image(url='https://newonlinecourses.science.psu.edu/stat857/sites/onlinecourses.science.psu.edu.stat857/files/lesson04/ridge_regression_geomteric/index.png')
Out[6]:

Ridge regression - Implementation with Python - Numpy

Regularized Cost Function

$ J(\theta) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}$

$ J(\theta) = \frac{1}{2m}(X\theta - y)^T(X\theta - y) + \frac{\lambda}{2m} \theta^T \theta$ (vectorized version)

Where $h_\theta(x^{(i)}) = \theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + ... + \theta_n x_n^{(i)}$. Note that if the data is not centered, we add a column of $1$'s to the design matrix $X$ as usual.

Gradient

$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m}X^T(X\theta - y) + \frac{\lambda}{m} \theta $

Gradient descent (vectorized)

$\theta^{(t+1)} : = \theta^{(t)} - \alpha \frac{\partial}{\partial \theta} J(\theta^{(t)}) $

Closed form solution

$\theta = (X^TX + \lambda I)^{-1} X^T y$
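Before implementing these functions, here is a quick self-contained sketch (not part of the original notebook, with illustrative names such as X_chk) that checks the analytic gradient above against a central finite-difference approximation on small random data.

In [ ]:
import numpy as np

#Small random problem for checking the ridge gradient formula
rng = np.random.RandomState(0)
m, n, lamda = 50, 3, 2.0
X_chk = rng.randn(m, n)
y_chk = rng.randn(m, 1)
theta_chk = rng.randn(n, 1)

def J(theta):
    '''Regularized cost J(theta) as defined above'''
    r = X_chk @ theta - y_chk
    return ((r.T @ r) / (2*m) + (lamda/(2*m)) * (theta.T @ theta)).item()

#Analytic gradient: (1/m) X^T (X theta - y) + (lamda/m) theta
grad_analytic = (X_chk.T @ (X_chk @ theta_chk - y_chk)) / m + (lamda/m) * theta_chk

#Central finite-difference approximation of the gradient
eps = 1e-6
grad_numeric = np.zeros_like(theta_chk)
for j in range(n):
    e = np.zeros_like(theta_chk)
    e[j] = eps
    grad_numeric[j] = (J(theta_chk + e) - J(theta_chk - e)) / (2*eps)

print(np.allclose(grad_analytic, grad_numeric, atol = 1e-6))   #expect True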

Libraries

In [7]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import linear_model

%matplotlib inline
plt.style.use('seaborn-white')

Functions

In [8]:
def costfunction(X,y,theta):
    '''Cost function for linear regression'''
    #Initialization of useful values
    m = np.size(y)
    
    #Vectorized implementation
    h = X @ theta
    J = float((1./(2*m)) * (h - y).T @ (h - y))
    return J


def costFunctionReg(X,y,theta,lamda = 10):
    '''Cost function for ridge regression (regularized L2)'''
    #Initialization of useful values
    m = len(y)
    
    #Vectorized implementation: squared error term plus L2 penalty on theta
    h = X @ theta
    J_reg = (lamda / (2*m)) * np.sum(np.square(theta))
    J = float((1./(2*m)) * (h - y).T @ (h - y)) + J_reg
    return J


def gradient_descent(X,y,theta,alpha = 0.0005,num_iters=1000):
    '''Gradient descent for linear regression'''
    #Initialisation of useful values 
    m = np.size(y)
    J_history = np.zeros(num_iters)
    theta_0_hist, theta_1_hist = [], [] #For plotting afterwards
    
    for i in range(num_iters):
        #Cost and intermediate values for each iteration
        J_history[i] = costfunction(X,y,theta)
        theta_0_hist.append(theta[0,0])
        theta_1_hist.append(theta[1,0])
        
        #Grad function in vectorized form
        h = X @ theta
        gradient = (1/m)*(X.T @ (h-y))
        theta = theta - alpha * gradient       
    return theta,J_history, theta_0_hist, theta_1_hist

def gradient_descent_reg(X,y,theta,alpha = 0.0005,lamda = 10,num_iters=1000):
    '''Gradient descent for ridge regression'''
    #Initialisation of useful values 
    m = np.size(y)
    J_history = np.zeros(num_iters)
    theta_0_hist, theta_1_hist = [], [] #Used for three D plot

    for i in range(num_iters):
        #Hypothesis function
        h = np.dot(X,theta)
        
        #Grad function in vectorized form
        theta = theta - alpha * (1/m)* (  (X.T @ (h-y)) + lamda * theta )
           
        #Cost function in vectorized form       
        J_history[i] = costFunctionReg(X,y,theta,lamda)
           
        #Calculate the cost for each iteration(used to plot convergence)
        theta_0_hist.append(theta[0,0])
        theta_1_hist.append(theta[1,0])   
    return theta ,J_history, theta_0_hist, theta_1_hist

def closed_form_solution(X,y):
    '''Closed form solution for linear regression'''
    return np.linalg.inv(X.T @ X) @ X.T @ y
    
def closed_form_reg_solution(X,y,lamda = 10): 
    '''Closed form solution for ridge regression'''
    m,n = X.shape
    I = np.eye((n))
    return (np.linalg.inv(X.T @ X + lamda * I) @ X.T @ y)[:,0]

def cost_l2(x,y):
    return x**2 + y**2

def cost_l1(x,y):
    return np.abs(x) + np.abs(y)

The dataset

As in previous notebooks, we will use a simulated dataset consisting of a sine wave with added uniform noise.

For simplicity and interest:

  • We center the data (so that a y-intercept parameter is not needed): $y := y - mean(y)$
  • Add a second explanatory variable, which is the first variable $x$ squared
  • Normalize the explanatory variables so that each feature has unit length
  • The design matrix is therefore given by $X = [x, x^2]$, with normalized columns
In [16]:
#Generating sine curve and uniform noise
x = np.linspace(0,1,40)
noise = np.random.uniform(size = 40)
y = np.sin(x * 1.5 * np.pi )
y_noise = (y + noise).reshape(-1,1)

#Centering the y data
y_noise = y_noise - y_noise.mean()

#Design matrix with the features x and x^2 (the constant factor is removed by the normalization below)
X = np.vstack((2*x,x**2)).T

#Normalizing the design matrix so each feature has unit length, to facilitate visualization
X = X / np.linalg.norm(X,axis = 0)

#Plotting the result
plt.scatter(x,y_noise, label = 'Dataset')
plt.plot(x,y - y.mean(),label = 'Sine')
plt.title('Noisy sine curve')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
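As a quick sanity check (a sketch added here, not in the original notebook), the closed-form ridge solution on this centered dataset can be compared with scikit-learn's Ridge estimator, which minimizes the same objective up to the constant $\frac{1}{2m}$ factor; fit_intercept=False is used since the data is centered.

In [ ]:
#Sanity check: closed form ridge solution vs sklearn Ridge (no intercept, centered data)
lamda = 10
theta_closed = closed_form_reg_solution(X, y_noise, lamda)

ridge = linear_model.Ridge(alpha = lamda, fit_intercept = False)
ridge.fit(X, y_noise)

print('Closed form ridge solution :', theta_closed)
print('Sklearn Ridge coefficients :', ridge.coef_.ravel())
#The two should agree up to numerical precision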

Visualizing Ridge regression and its impact on the cost function

In the presence of multicollinearity between the explanatory variables, the least squares cost function will be very "flat" near the global minimum, and small changes in the data will lead to large changes in the coefficient values.

From a linear algebra perspective, the design matrix will be poorly conditioned, and ridge (L2) regularization solves this problem by adding a positive value to the diagonal of $X^TX$.

From a cost function perspective (i.e. cost as a function of the parameter space) this has the effect of changing the shape of the "valley" and making it less "flat".

In fact, if one reverses the plot and thinks of the valley as a "ridge" (i.e. the top of the mountain) then Ridge regression has the effect of "removing the ridge"...

See below for the two surface and contour plots, first without and then with regularization.

Plotting the cost function without regularization

In [25]:
#Setup of meshgrid of theta values
T0, T1 = np.meshgrid(np.linspace(7,18,100),np.linspace(-18,-9,100))

#Computing the cost function for each theta combination
zs = np.array(  [costfunction(X, y_noise.reshape(-1,1),np.array([t0,t1]).reshape(-1,1)) 
                     for t0, t1 in zip(np.ravel(T0), np.ravel(T1)) ] )
#Reshaping the cost values    
Z = zs.reshape(T0.shape)


#Computing the gradient descent
theta_result,J_history, theta_0, theta_1 = gradient_descent(X,y_noise,np.array([7,-10]).reshape(-1,1),alpha = 1,num_iters=5000)

#Angles needed for quiver plot
anglesx = np.array(theta_0)[1:] - np.array(theta_0)[:-1]
anglesy = np.array(theta_1)[1:] - np.array(theta_1)[:-1]

%matplotlib inline
fig = plt.figure(figsize = (16,8))

#Surface plot
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot_surface(T0, T1, Z, rstride = 5, cstride = 5, cmap = 'jet', alpha=0.5)
ax.plot(theta_0,theta_1,J_history, marker = '*', color = 'r', alpha = .4, label = 'Gradient descent')

ax.set_xlabel('theta 1')
ax.set_ylabel('theta 2')
ax.set_zlabel('Cost function')
ax.view_init(45, -45)

#Contour plot
ax = fig.add_subplot(1, 2, 2)
ax.contour(T0, T1, Z, 70, cmap = 'jet')
ax.quiver(theta_0[:-1], theta_1[:-1], anglesx, anglesy, scale_units = 'xy', angles = 'xy', scale = 1, color = 'r', alpha = .9)
ax.set_xlabel('theta 1')
ax.set_ylabel('theta 2')
ax.set_title('Gradient descent: Minimum at {}'.format(theta_result.ravel()))

plt.suptitle('Cost function and gradient descent: no regularization')
plt.show()

Plotting the cost function with large regularization (ridge regression)

In [24]:
l = 10 #Regularization parameter

#Setup of meshgrid of theta values
T1, T2 = np.meshgrid(np.linspace(-10,10,100),np.linspace(-10,10,100))

#Computing the cost function for each theta combination
zs = np.array(  [costFunctionReg(X, y_noise.reshape(-1,1),np.array([t1,t2]).reshape(-1,1),l) 
                     for t1, t2 in zip(np.ravel(T1), np.ravel(T2)) ] )
#Reshaping the cost values    
Z = zs.reshape(T1.shape)


#Computing the gradient descent
theta_result_reg,J_history_reg, theta_0, theta_1 = gradient_descent_reg(X,y_noise,np.array([7.,10.]).reshape(-1,1), 0.8,l,num_iters=5000)


#Angles needed for quiver plot
anglesx = np.array(theta_0)[1:] - np.array(theta_0)[:-1]
anglesy = np.array(theta_1)[1:] - np.array(theta_1)[:-1]

%matplotlib inline
fig = plt.figure(figsize = (16,8))

#Surface plot
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot_surface(T1, T2, Z, rstride = 5, cstride = 5, cmap = 'jet', alpha=0.5)
ax.plot(theta_0,theta_1,J_history_reg, marker = '*', color = 'r', alpha = .4, label = 'Gradient descent')

ax.set_xlabel('theta 1')
ax.set_ylabel('theta 2')
ax.set_zlabel('error')
ax.set_title('Regularized gradient descent: Minimum at {}'.format(theta_result_reg.ravel()))
ax.view_init(45, -45)


#Contour plot
ax = fig.add_subplot(1, 2, 2)
ax.contour(T1, T2, Z, 100, cmap = 'jet')
ax.quiver(theta_0[:-1], theta_1[:-1], anglesx, anglesy, scale_units = 'xy', angles = 'xy', scale = 1, color = 'r', alpha = .9)
ax.set_xlabel('theta 1')
ax.set_ylabel('theta 2')

plt.suptitle('Cost function and gradient descent: Ridge regularization')
plt.show()