Table of Contents
- 1 Introduction: Optimization and non-linear methods
- 2 Finding roots of non-linear functions (1 dimension)
- 3 Finding extrema of a function
- 3.1 Introducing the Rosenbrock function
- 3.2 Gradient descent method
- 3.3 Gradient descent on a simpler function (quadratic)
- 3.4 Improving the Gradient descent with line search (to be completed)
- 3.5 Newton's method
- 3.6 Comparing Newton and Gradient Descent in presence of a single saddle point
- 3.7 Comparing Newton and Gradient Descent in presence of multiple saddle points
- 3.8 Drawbacks of Newton's method
- 3.9 A comparison between gradient descent and Newton's method
- 3.10 Appendix - code snippets and to-dos
Introduction: Optimization and non-linear methods¶
This notebook explores introductory-level algorithms and tools that can be used for non-linear optimization. It starts with root-finding algorithms in one dimension using a simple example and then moves on to optimization methods (minimum finding) and multidimensional cases. For each example, a graph of the function and a visualization of the steps taken by the optimization algorithm are shown.
Of interest are the worked examples done with "pen and paper" as well as the direct implementations using Python and NumPy only. A study of the optimization of the well-known Rosenbrock function is done using both the Gradient Descent and Newton-Raphson approaches.
Sources¶
The following is a collection of sources of interest which were used to build this notebook:
- Duke university: Optimization and Non-linear Methods
- Sebastien Ruder's blog
- Nonlinear Optimization course at the University of Jyväskylä
- Numerical python tours
- Louis Tiao's blog: visualizing-and-animating-optimization-algorithms-with-matplotlib
Libraries¶
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
import numpy as np
from mpl_toolkits import mplot3d
Finding roots of non-linear functions (1 dimension)¶
Example used throughout this section:¶
Consider the following polynomial function, which is continuous, differentiable and well behaved. It is an interesting example for testing root-finding algorithms because it exhibits several basins of attraction in 1D. A basin of attraction is the set of initial values leading to the same solution (i.e. the same root in this case). In this example, different starting points will lead to different solutions, or in some cases to no solution at all when the algorithm "blows up".
The next section will investigate different methods of finding the roots of:
$f(x) = x^3 - 2x^2 - 11x + 12$
$f'(x) = 3x^2 - 4x - 11$
#Defining the function and its first derivative
def f(x):
    return x**3 - 2*x**2 - 11*x + 12

def dfdx(x):
    return 3*x**2 - 4*x - 11
#Initializing data
x = np.linspace(-4,6,100)
y = f(x)
y_dash = dfdx(x)
#Plotting the function and its derivative
fig = plt.figure()
plt.xlabel('x')
plt.axhline(0)
plt.plot(x,y, label = 'f(x)')
plt.plot(x,y_dash, '--r', label = 'df(x)/dx')
plt.legend()
plt.show()
Bisection Method¶
The bisection method is one of the simplest methods for finding zeroes of a non-linear function. It is guaranteed to find a root, but it can be slow. The main idea comes from the intermediate value theorem: if $f(a)$ and $f(b)$ have different signs and $f$ is continuous, then $f$ must have a zero between $a$ and $b$. We evaluate the function at the midpoint, $c=\frac{1}{2}(a+b)$. $f(c)$ is either zero, has the same sign as $f(a)$, or has the same sign as $f(b)$. Suppose $f(c)$ has the same sign as $f(a)$; we then repeat the process on the interval $[c,b]$.
def Bisection_Search(func, a, b, epsilon, nMax = 1000):
    #Initializing variables; iter_x, iter_y are used to plot results
    i = 0
    iter_x, iter_y, iter_count = np.empty(0), np.empty(0), np.empty(0)

    #Looping condition to ensure that the loop is not infinite
    while i < nMax:
        i += 1
        c = .5 * (a + b)
        iter_x = np.append(iter_x, c)
        iter_y = np.append(iter_y, func(c))
        iter_count = np.append(iter_count, i)

        if np.abs(func(c)) < epsilon:
            return c, iter_x, iter_y, iter_count
        elif np.sign(func(c)) == np.sign(func(a)):
            a = c
        elif np.sign(func(c)) == np.sign(func(b)):
            b = c
root,iter_x,iter_y, iter_count = Bisection_Search(f,-5,-2,0.01)
#Plotting the iterations and intermediate values
fig, ax = plt.subplots()
ax.set_xlabel('x')
ax.axhline(0)
ax.plot(x,y)
ax.scatter(x = iter_x,y = iter_y)
ax.set_title('Bisection Method: {} iterations'.format(len(iter_count)))
#Loop to add text annotations to the iteration points
for i, txt in enumerate(iter_count):
    ax.annotate(txt, (iter_x[i], iter_y[i]))
plt.show()
Newton-Raphson method¶
Newton's method, also known as the Newton-Raphson method, is a very famous and widely used method for solving nonlinear algebraic equations. Compared to the other methods we will consider, it is generally the fastest one (usually by far). It does not guarantee that an existing solution will be found, however.
Newton's method is normally formulated with an iteration index n:
$x_{n+1} = x_n - \large \frac{f(x_n)}{f'(x_n)}$
To avoid storing intermediate values in memory, one can overwrite the previous value and work with a single variable:
$x := x - \frac{f(x)}{f'(x)}$
In this example:
$f(x) = x^3 - 2x^2 - 11x + 12$
$f'(x) = 3x^2 - 4x - 11$
def Newton_Raphson(func, deriv, x, epsilon, nMax = 100):
    #Initializing variables; iter_x, iter_y are used to plot results
    i = 0
    iter_x, iter_y, iter_count = np.empty(0), np.empty(0), np.empty(0)
    error = x - (x - func(x) / deriv(x))

    #Looping as long as the error is greater than epsilon
    while np.abs(error) > epsilon and i < nMax:
        i += 1
        iter_x = np.append(iter_x, x)
        iter_y = np.append(iter_y, func(x))
        iter_count = np.append(iter_count, i)

        error = x - (x - func(x) / deriv(x))
        x = x - func(x) / deriv(x)
    return x, iter_x, iter_y, iter_count
root,iter_x,iter_y, iter_count = Newton_Raphson(f,dfdx,-2,0.01)
#Plotting the iterations and intermediate values
fig, ax = plt.subplots()
ax.set_xlabel('x')
ax.axhline(0)
ax.plot(x,y)
ax.scatter(x = iter_x,y = iter_y)
ax.set_title('Newton Method: {} iterations'.format(len(iter_count)))
for i, txt in enumerate(iter_count):
    ax.annotate(txt, (iter_x[i], iter_y[i]))
plt.show()
Newton-Raphson using SciPy¶
Similar results are obtained using SciPy's built-in optimizer:
import scipy.optimize
scipy.optimize.newton(f,6,dfdx)
Secant method¶
When finding the derivative $f'(x)$ in Newton's method is problematic, or when function evaluations take too long, we may adjust the method slightly. Instead of using tangent lines to the graph we may use secants; the approach is then referred to as the secant method.
The idea of the secant method is to proceed as in Newton's method, but instead of using $f'(x_n)$, we approximate this derivative by a finite difference, namely the slope of the secant line that goes through the two most recent approximations $x_n$ and $x_{n-1}$. This slope reads
$f'(x_n) \approx \large \frac{f(x_n) - f(x_{n-1})}{x_n - x_{n-1}}$
Inserting this expression for $f'(x_n)$ into Newton's method simply gives us the secant method:
$x_{n+1} = x_n - f(x_n) \times \large \frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}$
As with the bisection method, the secant method requires two starting points.
def Secant_method(func, x0, x1, epsilon, nMax = 100):
    #Initialization
    i = 0
    iter_x, iter_y, iter_count = np.empty(0), np.empty(0), np.empty(0)
    error = 100
    x2 = x1 - func(x1) * ((x1 - x0) / (func(x1) - func(x0)))

    #Looping as long as the error is greater than epsilon
    while np.abs(error) > epsilon and i < nMax:
        #Variables for plotting
        i += 1
        iter_x = np.append(iter_x, x2)
        iter_y = np.append(iter_y, func(x2))
        iter_count = np.append(iter_count, i)

        #Calculate the new value
        x2 = x1 - func(x1) * ((x1 - x0) / (func(x1) - func(x0)))
        error = x2 - x1
        x0 = x1
        x1 = x2
    return x2, iter_x, iter_y, iter_count
root,iter_x,iter_y, iter_count = Secant_method(f,-5,-2,0.01)
#Plotting the iterations and intermediate values
fig, ax = plt.subplots()
ax.set_xlabel('x')
ax.axhline(0)
ax.plot(x,y)
ax.scatter(x = iter_x,y = iter_y)
ax.set_title('Secant Method: {} iterations'.format(len(iter_count)))
for i, txt in enumerate(iter_count):
    ax.annotate(txt, (iter_x[i], iter_y[i]))
plt.show()
Finding extrema of a function¶
In this section, we are concerned with optimizing a function rather than finding a root. As a result the algorithms will try to find local or global minima and maxima. A well-known function used to compare optimization algorithms is the Rosenbrock function, described next.
Introducing the Rosenbrock function¶
In mathematical optimization, the Rosenbrock function is a non-convex function, introduced by Howard H. Rosenbrock in 1960, which is used as a performance test problem for optimization algorithms. It is also known as Rosenbrock's valley or Rosenbrock's banana function.
The global minimum is inside a long, narrow, parabolic shaped flat valley. To find the valley is trivial. To converge to the global minimum, however, is difficult.
The function is defined by:
$f(x,y) = (a - x)^2 + b(y - x^2)^2$ and has a global minimum at $(x,y) = (a,a^2)$. Usually we choose parameters $(a,b) = (1, 100)$ giving:
$f(x,y) = (1 - x)^2 + 100(y - x^2)^2$ with minimum at $(1,1)$
Gradient
$\nabla f = [-400xy + 400x^3 + 2x -2, 200y - 200x^2]$
Hessian
$H = \begin{bmatrix}-400y + 1200x^2 + 2 & -400x \\ -400x & 200 \end{bmatrix} $
def Rosenbrock(x,y):
    return (1 - x)**2 + 100*(y - x**2)**2

def Grad_Rosenbrock(x,y):
    g1 = -400*x*y + 400*x**3 + 2*x - 2
    g2 = 200*y - 200*x**2
    return np.array([g1,g2])

def Hessian_Rosenbrock(x,y):
    h11 = -400*y + 1200*x**2 + 2
    h12 = -400 * x
    h21 = -400 * x
    h22 = 200
    return np.array([[h11,h12],[h21,h22]])
Gradient descent method¶
Gradient descent (or steepest descent) is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.
The approach is based on the observation that if a multivariable function is defined and differentiable in a neighborhood of a point, then the function decreases fastest in the direction of the negative gradient. It follows that if:
$\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma \nabla F(\mathbf{x}_n)$
for $\gamma$ small enough, then $F(\mathbf{x}_{n+1}) \leq F(\mathbf{x}_n)$.
The value of $\gamma$ is called the step size and is allowed to change at every iteration. More advanced implementations find a good value of $\gamma$ at each step by using line search methods; see the line search section below.
Comments
- The gradient is everywhere perpendicular to the contour lines.
- After each line minimization the new gradient is always orthogonal to the previous step direction (true of any line minimization).
- Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.
Gradient Descent implementation¶
def Gradient_Descent(Grad, x, y, gamma = 0.00125, epsilon = 0.0001, nMax = 10000):
    #Initialization
    i = 0
    iter_x, iter_y, iter_count = np.empty(0), np.empty(0), np.empty(0)
    error = 10
    X = np.array([x,y])

    #Looping as long as the error is greater than epsilon
    while np.linalg.norm(error) > epsilon and i < nMax:
        i += 1
        iter_x = np.append(iter_x, x)
        iter_y = np.append(iter_y, y)
        iter_count = np.append(iter_count, i)
        #print(X)

        X_prev = X
        X = X - gamma * Grad(x,y)
        error = X - X_prev
        x, y = X[0], X[1]

    print(X)
    return X, iter_x, iter_y, iter_count
root,iter_x,iter_y, iter_count = Gradient_Descent(Grad_Rosenbrock,-2,2)
Plotting the Rosenbrock function and the Gradient Descent iteration steps¶
x = np.linspace(-2,2,250)
y = np.linspace(-1,3,250)
X, Y = np.meshgrid(x, y)
Z = Rosenbrock(X, Y)
#Angles needed for quiver plot
anglesx = iter_x[1:] - iter_x[:-1]
anglesy = iter_y[1:] - iter_y[:-1]
%matplotlib inline
fig = plt.figure(figsize = (16,8))
#Surface plot
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot_surface(X,Y,Z,rstride = 5, cstride = 5, cmap = 'jet', alpha = .4, edgecolor = 'none' )
ax.plot(iter_x,iter_y, Rosenbrock(iter_x,iter_y),color = 'r', marker = '*', alpha = .4)
ax.view_init(45, 280)
ax.set_xlabel('x')
ax.set_ylabel('y')
#Contour plot
ax = fig.add_subplot(1, 2, 2)
ax.contour(X,Y,Z, 50, cmap = 'jet')
#Plotting the iterations and intermediate values
ax.scatter(iter_x,iter_y,color = 'r', marker = '*')
ax.quiver(iter_x[:-1], iter_y[:-1], anglesx, anglesy, scale_units = 'xy', angles = 'xy', scale = 1, color = 'r', alpha = .3)
ax.set_title('Gradient Descent with {} iterations'.format(len(iter_count)))
plt.show()
Comments¶
As expected, the algorithm finds the "valley", but it struggles to reach the global minimum and gets stuck in a zig-zag pattern. Even after several thousand iterations the minimum is not found. Trying different initial values leads to vastly different results, many of which are not close to the global minimum, and in some cases the iterates blow up to infinity (i.e. the algorithm does not converge).
The behaviour also depends strongly on the initial choice of the $\gamma$ value.
Gradient descent on a simpler function (quadratic)¶
Consider a simpler function to optimize.
$f(x,y) = .01x^2 + .1y^2$
As this function is convex and significantly easier to optimize, we would expect the Gradient Descent algorithm to give better results than for the Rosenbrock function.
def f(x,y):
    return .01*x**2 + .1*y**2

def Grad_f(x,y):
    g1 = 2*.01*x
    g2 = 2*.1*y
    return np.array([g1,g2])
root,iter_x,iter_y, iter_count = Gradient_Descent(Grad_f,-2,-2,1)
x = np.linspace(-3,3,250)
y = np.linspace(-3,3,250)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
#Angles needed for quiver plot
anglesx = iter_x[1:] - iter_x[:-1]
anglesy = iter_y[1:] - iter_y[:-1]
%matplotlib inline
fig = plt.figure(figsize = (16,8))
#Surface plot
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot_surface(X,Y,Z,rstride = 5, cstride = 5, cmap = 'jet', alpha = .4, edgecolor = 'none' )
ax.plot(iter_x,iter_y, f(iter_x,iter_y),color = 'r', marker = '*', alpha = .4)
ax.set_title('f(x,y) = .01x^2 + .1y^2')
ax.view_init(65, 340)
ax.set_xlabel('x')
ax.set_ylabel('y')
#Contour plot
ax = fig.add_subplot(1, 2, 2)
ax.contour(X,Y,Z, 50, cmap = 'jet')
#Plotting the iterations and intermediate values
ax.scatter(iter_x,iter_y,color = 'r', marker = '*')
ax.quiver(iter_x[:-1], iter_y[:-1], anglesx, anglesy, scale_units = 'xy', angles = 'xy', scale = 1, color = 'r', alpha = .3)
ax.set_title('Gradient Descent with {} iterations'.format(len(iter_count)))
plt.show()
Improving the Gradient descent with line search (to be completed)¶
We can improve the algorithm by replacing the constant $\gamma$ with a step size chosen anew at each iteration, typically via a line search.
To be completed - a placeholder sketch is given below.
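As a placeholder, here is a minimal sketch of gradient descent with a backtracking (Armijo) line search. The function name Gradient_Descent_Backtracking and the parameters beta and c are illustrative choices introduced here (not part of the notebook above); the sketch reuses the Rosenbrock and Grad_Rosenbrock helpers defined earlier.
def Gradient_Descent_Backtracking(func, Grad, x, y, epsilon = 0.0001, nMax = 10000, beta = 0.5, c = 1e-4):
    #Backtracking (Armijo) line search: start from a unit step and shrink it by beta
    #until a sufficient decrease of the objective is observed
    X = np.array([x, y], dtype = float)
    for i in range(nMax):
        g = Grad(X[0], X[1])
        if np.linalg.norm(g) < epsilon:
            break
        gamma = 1.0
        #Armijo condition: f(X - gamma*g) <= f(X) - c*gamma*||g||^2
        while func(*(X - gamma * g)) > func(*X) - c * gamma * np.dot(g, g):
            gamma *= beta
        X = X - gamma * g
    return X, i

#Hypothetical usage on the Rosenbrock function defined earlier
X_opt, n_iter = Gradient_Descent_Backtracking(Rosenbrock, Grad_Rosenbrock, -2, 2)
print(X_opt, n_iter)
Each iteration now does a little extra work to pick $\gamma$, but the step size adapts automatically to the local landscape instead of being fixed in advance.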
Newton's method¶
Similarly to the root finding algorithm, Newton's method for finding extrema takes the form:
$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$
Extending to the multivariate case, we replace the derivative with the gradient $\nabla f(x_n)$ and the reciprocal of the second derivative with the inverse of the Hessian matrix:
$x_{n+1} = x_n - [\mathbf{H} \ f(x_n)]^{-1} \nabla f(x_n)$
Finding the inverse of the Hessian in high dimensions can be computationally expensive. In such cases, instead of inverting the Hessian it is better to compute the step $\delta x = x_{n+1} - x_n$ as the solution of the linear system of equations:
$[\mathbf{H} \ f(x_n)] \ \delta x = - \nabla f(x_n)$
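As an illustration of this point (not part of the original notebook), a single Newton step on the Rosenbrock function can be computed with np.linalg.solve instead of an explicit inverse, reusing the Grad_Rosenbrock and Hessian_Rosenbrock helpers defined earlier:
#Illustrative only: one Newton step obtained from a linear solve rather than an explicit inverse
x_n = np.array([-2.0, 2.0])
delta_x = np.linalg.solve(Hessian_Rosenbrock(x_n[0], x_n[1]), -Grad_Rosenbrock(x_n[0], x_n[1]))
x_next = x_n + delta_x
print(x_next)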
Note that since Newton's method is built on a quadratic (second-order) model of the function, any smooth quadratic function is optimized in a single step, as shown in the trivial example below.
Example 1) Simple case
$f(x_1, x_2) = 0.5 x_1^2 + 2.5x_2^2$ where $\mathbf{x} = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix}$
$\nabla f(x_1,x_2) = \begin{bmatrix}x_1 \\ 5x_2 \end{bmatrix} $ and $H = \begin{bmatrix} 1 & 0 \\ 0 & 5 \end{bmatrix} $
$H^{-1} = \begin{bmatrix}1 & 0 \\ 0 & 1/5 \end{bmatrix}$
$\mathbf{x_{n+1} = x_n - H^{-1}_n} \nabla f(\mathbf{x_n})$
$\begin{bmatrix}x_1\\ x_2 \end{bmatrix}_{n+1} := \begin{bmatrix}x_1\\ x_2 \end{bmatrix}_{n} - \begin{bmatrix}1 & 0 \\ 0 & 1/5 \end{bmatrix} \begin{bmatrix}x_1 \\ 5x_2 \end{bmatrix}_n = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix}_n - \begin{bmatrix}x_1 \\ x_2 \end{bmatrix}_n = \begin{bmatrix}0 \\ 0 \end{bmatrix}$
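A quick numerical check of the single-step convergence above (illustrative only, with the gradient and inverse Hessian written out by hand; the starting point is arbitrary):
#One Newton step on f(x1,x2) = 0.5*x1**2 + 2.5*x2**2
x_n = np.array([3.0, -1.0])
grad = np.array([x_n[0], 5 * x_n[1]])
H_inv = np.array([[1.0, 0.0], [0.0, 0.2]])
x_next = x_n - H_inv @ grad
print(x_next)   #[0. 0.] - the minimum is reached in one step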
def Newton_Raphson_Optimize(Grad, Hess, x, y, epsilon = 0.000001, nMax = 200):
    #Initialization
    i = 0
    iter_x, iter_y, iter_count = np.empty(0), np.empty(0), np.empty(0)
    error = 10
    X = np.array([x,y])

    #Looping as long as the error is greater than epsilon
    while np.linalg.norm(error) > epsilon and i < nMax:
        i += 1
        iter_x = np.append(iter_x, x)
        iter_y = np.append(iter_y, y)
        iter_count = np.append(iter_count, i)
        print(X)

        X_prev = X
        X = X - np.linalg.inv(Hess(x,y)) @ Grad(x,y)
        error = X - X_prev
        x, y = X[0], X[1]
    return X, iter_x, iter_y, iter_count
root,iter_x,iter_y, iter_count = Newton_Raphson_Optimize(Grad_Rosenbrock,Hessian_Rosenbrock,-2,2)
Plotting the iterations on top of the surface and contour plot¶
x = np.linspace(-3,3,250)
y = np.linspace(-9,8,350)
X, Y = np.meshgrid(x, y)
Z = Rosenbrock(X, Y)
#Angles needed for quiver plot
anglesx = iter_x[1:] - iter_x[:-1]
anglesy = iter_y[1:] - iter_y[:-1]
%matplotlib inline
fig = plt.figure(figsize = (16,8))
#Surface plot
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot_surface(X,Y,Z,rstride = 5, cstride = 5, cmap = 'jet', alpha = .4, edgecolor = 'none' )
ax.plot(iter_x,iter_y, Rosenbrock(iter_x,iter_y),color = 'r', marker = '*', alpha = .4)
#Rotate the initialization to help viewing the graph
ax.view_init(45, 280)
ax.set_xlabel('x')
ax.set_ylabel('y')
#Contour plot
ax = fig.add_subplot(1, 2, 2)
ax.contour(X,Y,Z, 60, cmap = 'jet')
#Plotting the iterations and intermediate values
ax.scatter(iter_x,iter_y,color = 'r', marker = '*')
ax.quiver(iter_x[:-1], iter_y[:-1], anglesx, anglesy, scale_units = 'xy', angles = 'xy', scale = 1, color = 'r', alpha = .3)
ax.set_title('Newton method with {} iterations'.format(len(iter_count)))
plt.show()
Comments¶
The Newton-Raphson method converges extremely rapidly to the global minimum, despite the significant challenge posed by the Rosenbrock function. Irrespective of the starting point (within the scope of the graph above), the minimum is found within 4-6 iterations. Compare this to the several thousand iterations required by the gradient descent approach.
Newton-Raphson optimization using SciPy¶
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess
x0 = [-3,10]
res = minimize(rosen, x0, method='trust-exact', jac=rosen_der, hess = rosen_hess, tol = 1e-4, callback = print)
Comparing Newton and Gradient Descent in presence of a single saddle point¶
$f_2(x,y) = .01x^2 - .1y^2$
$\nabla f_2(x,y) = [.02x, \ -.2y]$
def f_2(x,y):
    return .01*x**2 - .1*y**2

def Grad_f_2(x,y):
    g1 = 2*.01*x
    g2 = -2*.1*y
    return np.array([g1,g2])

def Hessian_f_2(x,y):
    return np.array([[.02,0],[0,-.2]])
root_gd,iter_x_gd,iter_y_gd, iter_count_gd = Gradient_Descent(Grad_f_2,-2.5,-0.010,1, nMax = 25)
root_nr,iter_x_nr,iter_y_nr, iter_count_nr = Newton_Raphson_Optimize(Grad_f_2,Hessian_f_2,-2,-.01, nMax = 25)
x = np.linspace(-3,3,100)
y = np.linspace(-1,1,100)
X, Y = np.meshgrid(x, y)
Z = f_2(X, Y)
#Angles needed for quiver plot
anglesx = iter_x_gd[1:] - iter_x_gd[:-1]
anglesy = iter_y_gd[1:] - iter_y_gd[:-1]
anglesx_nr = iter_x_nr[1:] - iter_x_nr[:-1]
anglesy_nr = iter_y_nr[1:] - iter_y_nr[:-1]
%matplotlib inline
fig = plt.figure(figsize = (16,8))
#Surface plot
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot_surface(X,Y,Z,rstride = 5, cstride = 5, cmap = 'jet', alpha = .4, edgecolor = 'none' )
ax.plot(iter_x_gd,iter_y_gd, f_2(iter_x_gd,iter_y_gd),color = 'r', marker = '*', alpha = .4, label = 'Gradient descent')
ax.plot(iter_x_nr,iter_y_nr, f_2(iter_x_nr,iter_y_nr),color = 'darkblue', marker = 'o', alpha = .4, label = 'Newton')
ax.legend()
#Rotate the initialization to help viewing the graph
#ax.view_init(45, 280)
ax.set_xlabel('x')
ax.set_ylabel('y')
#Contour plot
ax = fig.add_subplot(1, 2, 2)
ax.contour(X,Y,Z, 60, cmap = 'jet')
#Plotting the iterations and intermediate values
ax.scatter(iter_x_gd,iter_y_gd,color = 'r', marker = '*', label = 'Gradient descent')
ax.quiver(iter_x_gd[:-1], iter_y_gd[:-1], anglesx, anglesy, scale_units = 'xy', angles = 'xy', scale = 1, color = 'r', alpha = .3)
ax.scatter(iter_x_nr,iter_y_nr,color = 'darkblue', marker = 'o', label = 'Newton')
ax.quiver(iter_x_nr[:-1], iter_y_nr[:-1], anglesx_nr, anglesy_nr, scale_units = 'xy', angles = 'xy', scale = 1, color = 'darkblue', alpha = .3)
ax.legend()
ax.set_title('Comparing Newton and Gradient descent')
plt.show()
Comparing Newton and Gradient Descent in presence of multiple saddle points¶
Using the Himmelblau function¶
$f(x,y)=(x^{2}+y-11)^{2}+(x+y^{2}-7)^{2}.\quad $
$\nabla f(x,y) = [2 (-7 + x + y^2 + 2 x (-11 + x^2 + y)), 2 (-11 + x^2 + y + 2 y (-7 + x + y^2))] $
$H f(x,y) = \begin{bmatrix} 4 (x^2 + y - 11) + 8 x^2 + 2 & 4 x + 4 y \\ 4 x + 4 y & 4 (x + y^2 - 7) + 8 y^2 + 2 \end{bmatrix}$
def Himmer(x,y):
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2

def Grad_Himmer(x,y):
    return np.array([2 * (-7 + x + y**2 + 2 * x * (-11 + x**2 + y)), 2 * (-11 + x**2 + y + 2 * y * (-7 + x + y**2))])

def Hessian_Himmer(x,y):
    h11 = 4 * (x**2 + y - 11) + 8 * x**2 + 2
    h12 = 4 * x + 4 * y
    h21 = 4 * x + 4 * y
    h22 = 4 * (x + y**2 - 7) + 8 * y**2 + 2
    return np.array([[h11,h12],[h21,h22]])
root_gd,iter_x_gd,iter_y_gd, iter_count_gd = Gradient_Descent(Grad_Himmer,0.5,-2,gamma = 0.001, epsilon=0.01, nMax = 1000)
#(Grad,x,y, gamma = 0.00125, epsilon=0.0001, nMax = 10000 )
root_nr,iter_x_nr,iter_y_nr, iter_count_nr = Newton_Raphson_Optimize(Grad_Himmer,Hessian_Himmer,0.5,-2, nMax = 50)
x = np.linspace(-5,5,100)
y = np.linspace(-5,5,100)
X, Y = np.meshgrid(x, y)
Z = Himmer(X, Y)
#Angles needed for quiver plot
anglesx = iter_x_gd[1:] - iter_x_gd[:-1]
anglesy = iter_y_gd[1:] - iter_y_gd[:-1]
anglesx_nr = iter_x_nr[1:] - iter_x_nr[:-1]
anglesy_nr = iter_y_nr[1:] - iter_y_nr[:-1]
%matplotlib inline
fig = plt.figure(figsize = (16,8))
#Surface plot
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot_surface(X,Y,Z,rstride = 5, cstride = 5, cmap = 'jet', alpha = .4, edgecolor = 'none' )
ax.plot(iter_x_gd,iter_y_gd, Himmer(iter_x_gd,iter_y_gd),color = 'orange', marker = '*', alpha = .4, label = 'Gradient descent')
ax.plot(iter_x_nr,iter_y_nr, Himmer(iter_x_nr,iter_y_nr),color = 'darkblue', marker = 'o', alpha = .4, label = 'Newton')
ax.legend()
#Rotate the initialization to help viewing the graph
ax.view_init(45, 60)
ax.set_xlabel('x')
ax.set_ylabel('y')
#Contour plot
ax = fig.add_subplot(1, 2, 2)
ax.contour(X,Y,Z, 60, cmap = 'jet')
#Plotting the iterations and intermediate values
ax.scatter(iter_x_gd,iter_y_gd,color = 'orange', marker = '*', label = 'Gradient descent')
ax.quiver(iter_x_gd[:-1], iter_y_gd[:-1], anglesx, anglesy, scale_units = 'xy', angles = 'xy', scale = 1, color = 'orange', alpha = .3)
ax.scatter(iter_x_nr,iter_y_nr,color = 'darkblue', marker = 'o', label = 'Newton')
ax.quiver(iter_x_nr[:-1], iter_y_nr[:-1], anglesx_nr, anglesy_nr, scale_units = 'xy', angles = 'xy', scale = 1, color = 'darkblue', alpha = .3)
ax.legend()
ax.set_title('Comparing Newton and Gradient descent')
plt.show()
Drawbacks of Newton's method¶
The main drawback comes from a combination of two facts:
- Newton's method is attracted to saddle points;
- saddle points are common in machine learning, and in fact in any multivariable optimization.
Consider the function $$f=x^2-y^2$$
If you apply the multivariate Newton method, you get the following: $$\mathbf{x}_{n+1} = \mathbf{x}_n - [\mathbf{H}f(\mathbf{x}_n)]^{-1} \nabla f(\mathbf{x}_n)$$
Let's get the Hessian: $$\mathbf{H}= \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1\,\partial x_n} \\[2.2ex] \dfrac{\partial^2 f}{\partial x_2\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2\,\partial x_n} \\[2.2ex] \vdots & \vdots & \ddots & \vdots \\[2.2ex] \dfrac{\partial^2 f}{\partial x_n\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_n\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}.$$
$$\mathbf{H}= \begin{bmatrix} 2 & 0 \\[2.2ex] 0 & -2 \end{bmatrix}$$
Invert it: $$[\mathbf{H} f]^{-1}= \begin{bmatrix} 1/2 & 0 \\[2.2ex] 0 & -1/2 \end{bmatrix}$$
Get the gradient: $$\nabla f=\begin{bmatrix} 2x \\[2.2ex] -2y \end{bmatrix}$$
Get the final equation: $$\begin{bmatrix} x \\[2.2ex] y \end{bmatrix}_{n+1} = \begin{bmatrix} x \\[2.2ex] y \end{bmatrix}_n - \begin{bmatrix} 1/2 & 0 \\[2.2ex] 0 & -1/2 \end{bmatrix} \begin{bmatrix} 2x_n \\[2.2ex] -2y_n \end{bmatrix} = \begin{bmatrix} x \\[2.2ex] y \end{bmatrix}_n - \begin{bmatrix} x \\[2.2ex] y \end{bmatrix}_n = \begin{bmatrix} 0 \\[2.2ex] 0 \end{bmatrix}$$
So Newton's method leads straight to the saddle point at $x=0, y=0$.
In contrast, gradient descent does not converge to the saddle point. The gradient is zero exactly at the saddle point, but a tiny step away from it pulls the optimization further away: as the gradient above shows, the descent direction $-\nabla f = [-2x, \ 2y]$ pushes the $y$ coordinate away from zero.
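A quick numerical illustration of this behaviour, reusing the Newton_Raphson_Optimize and Gradient_Descent routines implemented earlier (the Grad_saddle and Hessian_saddle helpers are introduced here only for this example):
#f(x,y) = x**2 - y**2 has a saddle point at the origin
def Grad_saddle(x,y):
    return np.array([2*x, -2*y])

def Hessian_saddle(x,y):
    return np.array([[2.0, 0.0], [0.0, -2.0]])

#Newton jumps straight to the saddle point (0,0)...
root_newton, *_ = Newton_Raphson_Optimize(Grad_saddle, Hessian_saddle, 1.0, 0.5, nMax = 10)
#...while gradient descent is pushed away from it along the y direction
root_gd, *_ = Gradient_Descent(Grad_saddle, 1.0, 0.5, gamma = 0.1, nMax = 50)
print(root_newton, root_gd)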
A comparison between gradient descent and Newton's method¶
Two natural questions arise: why don't more people use Newton's method, and why do so many people use stochastic gradient descent? These questions have different answers, because there are many algorithms that lessen the computational burden of Newton's method but often work better than SGD.
First: Newton's method takes a long time per iteration and is memory-intensive. It requires computing the second derivative, $H$, which is $O(N^2)$, where $N$ is the number of features, while computing the gradient, $g$, is only $O(N)$. The next step is computing $H^{-1} g$, which is $O(N^3)$. So while computing the Hessian is expensive, inverting it or solving the corresponding least-squares problem is often even worse. (If you have sparse features, the asymptotics look better, but other methods also perform better, so sparsity doesn't make Newton relatively more appealing.)
Second, many methods, not just gradient descent, are used more often than Newton; they are often knockoffs of Newton's method, in the sense that they approximate a Newton step at a lower computational cost per step but take more iterations to converge. Some examples:
Because of the expense of inverting the Hessian, "quasi-Newton" methods like BFGS approximate the inverse Hessian, $H^{-1}$, by looking at how the gradient has changed over the last few steps.
BFGS is still very memory-intensive in high-dimensional settings because it requires storing the entire $O(N^2)$ approximate inverse Hessian. Limited memory BFGS (L-BFGS) calculates the next step direction as the approximate inverse Hessian times the gradient, but it only requires storing the last several gradient updates; it doesn't explicitly store the approximate inverse Hessian.
When you don't want to deal with approximating second derivatives at all, gradient descent is appealing because it uses only first-order information. Gradient descent implicitly approximates the inverse Hessian as the learning rate times the identity matrix. Personally, I rarely use gradient descent: L-BFGS is just as easy to implement, since it only requires specifying the objective function and gradient; its inverse Hessian approximation is better than that of gradient descent; and it does not require tuning a learning rate.
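For instance, with SciPy the Rosenbrock function can be minimized with L-BFGS-B by supplying only the objective and its gradient (an illustrative usage sketch, not part of the discussion above):
from scipy.optimize import minimize, rosen, rosen_der

x0 = [-3, 10]
#Only the objective and the gradient are needed; the inverse Hessian is approximated internally
res = minimize(rosen, x0, method = 'L-BFGS-B', jac = rosen_der, tol = 1e-8)
print(res.x, res.nit)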
Sometimes you have a very large number of observations (data points), but you could learn almost as well from a smaller number of observations. When that is the case, you can use methods like stochastic gradient descent that cycle through subsets of the observations.
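A minimal sketch of this idea on a synthetic least-squares problem (all names and data below are hypothetical and purely illustrative):
#Minibatch stochastic gradient descent for linear least squares: minimize ||A w - b||^2 over w
np.random.seed(0)
A = np.random.randn(1000, 5)
w_true = np.arange(1.0, 6.0)
b = A @ w_true + 0.01 * np.random.randn(1000)

w = np.zeros(5)
lr, batch = 0.01, 32
for epoch in range(200):
    idx = np.random.permutation(len(b))
    for start in range(0, len(b), batch):
        rows = idx[start:start + batch]
        #Gradient of the mean squared error on the current minibatch only
        grad = 2 * A[rows].T @ (A[rows] @ w - b[rows]) / len(rows)
        w = w - lr * grad
print(w)   #Ends up close to w_true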
More advanced discussion¶
Arguments against Newton's method (pro gradient descent)
Indeed the ratio of the number of saddle points to local minima increases exponentially with the dimensionality N.
While gradient descent dynamics are repelled away from a saddle point to lower error by following directions of negative curvature, ...the Newton method does not treat saddle points appropriately; as argued below, saddle-points instead become attractive under the Newton dynamics.
https://arxiv.org/pdf/1406.2572.pdf
Arguments in favour of Newton's method
More people should be using Newton's method in machine learning*. I say this as someone with a background in numerical optimization, who has dabbled in machine learning over the past couple of years.
The drawbacks in answers here (and even in the literature) are not an issue if you use Newton's method correctly. Moreover, the drawbacks that do matter also slow down gradient descent the same amount or more, but through less obvious mechanisms.
Using line search with the Wolfe conditions or using trust regions prevents convergence to saddle points. A proper gradient descent implementation should be doing this too. The paper linked above points out problems with "Newton's method" in the presence of saddle points, but the fix it advocates is also a Newton-type method.
Using Newton's method does not require constructing the whole (dense) Hessian; you can apply the inverse of the Hessian to a vector with iterative methods that only use matrix-vector products (e.g., Krylov methods like conjugate gradient). See, for example, the CG-Steihaug trust region method.
You can compute Hessian matrix-vector products efficiently by solving two higher order adjoint equations of the same form as the adjoint equation that is already used to compute the gradient (e.g., the work of two backpropagation steps in neural network training).
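As an illustration of both points, SciPy's Newton-CG solver only needs a Hessian-vector product (hessp) and never forms or inverts the dense Hessian (usage sketch, not part of the original text):
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess_prod

x0 = [-3, 10]
#hessp(x, p) returns H(x) @ p, so the full Hessian is never built or inverted
res = minimize(rosen, x0, method = 'Newton-CG', jac = rosen_der, hessp = rosen_hess_prod, tol = 1e-8)
print(res.x)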
Ill-conditioning slows the convergence of iterative linear solvers, but it also slows gradient descent equally or worse. Using Newton's method instead of gradient descent shifts the difficulty from the nonlinear optimization stage (where not much can be done to improve the situation) to the linear algebra stage (where we can attack it with the entire arsenal of numerical linear algebra preconditioning techniques).
Also, the computation shifts from "many many cheap steps" to "a few costly steps", opening up more opportunities for parallelism at the sub-step (linear algebra) level.
For background information about these concepts, I recommend the book "Numerical Optimization" by Nocedal and Wright.
*Of course, Newton's method will not help you with L1 or other similar compressed sensing/sparsity promoting penalty functions, since they lack the required smoothness.
Appendix - code snippets and to-dos¶
Rosenbrock function returning vector result¶
def rosen(x):
    """Generalized n-dimensional version of the Rosenbrock function"""
    return sum(100*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = rosen(np.vstack([X.ravel(), Y.ravel()])).reshape((100,100))
# Note: the global minimum is at (1,1) in a tiny contour island
plt.contour(X, Y, Z, 1000, cmap = 'jet')
plt.text(1, 1, 'x', va='center', ha='center', color='red', fontsize=20);
def f2(x,y):
    return x**4 + 2*x**2 * y**2 + y**4

def f3(x,y):
    return x**2 + y**2
x = np.linspace(-10,10,50)
y = np.linspace(-10,10,50)
X, Y = np.meshgrid(x, y)
Z2 = f2(X, Y)
Z3 = f3(X, Y)
%matplotlib inline
fig = plt.figure(figsize = (12,6))
#First plot: function 2
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot_surface(X,Y,Z2,rstride = 3, cstride = 3, cmap = 'coolwarm', linewidth = 0, antialiased = False )
ax.set_title("Function2")
#Second plot: function 3
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.plot_surface(X,Y,Z3,rstride = 3, cstride = 3, cmap = 'coolwarm', linewidth = 0, antialiased = False )
ax.set_title("Function3")
Newton method - Multivariate¶
Generalizing the 1D approach by:
$x_{n+1} = x_n - J^{-1} \ f(x_n)$
- $x$ and $f(x)$ are now vectors
- $J^{-1}$ is the inverse Jacobian matrix.
- In general the Jacobian is not square and cannot be inverted, so we use the generalized inverse: $(J^TJ)^{-1}J^T$
$x_{n+1} = x_n - (J^TJ)^{-1}J^T \ f(x_n)$
Example for f3
$f_3(x,y) = x^2 + y^2$ where $x,y \in \mathbb{R}$
$ \frac{\partial f_3}{\partial x} = 2x $ and $ \frac{\partial f_3}{\partial y} = 2y $
$J = \left( 2x , 2y \right)$
def J3(x,y):
    return np.array([2*x, 2*y])

def df3dx(x,y):
    return 2*x + 2*y
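A minimal sketch of the generalized-inverse update above, applied to f3 with the Jacobian J3 just defined (illustrative only; np.linalg.pinv is used for the generalized inverse, and the names Xn and Fx are introduced here):
#Newton step with the pseudo-inverse: x_{n+1} = x_n - pinv(J) f(x_n)
Xn = np.array([4.0, -3.0])
for i in range(25):
    Fx = np.array([f3(Xn[0], Xn[1])])      #f3 treated as a 1-component vector function
    if np.abs(Fx[0]) < 1e-10:
        break
    J = J3(Xn[0], Xn[1]).reshape(1, 2)     #1 x 2 Jacobian, not square
    Xn = Xn - np.linalg.pinv(J) @ Fx       #Moore-Penrose pseudo-inverse handles the non-square Jacobian
print(Xn, i)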