Multivariate Gaussian and Maximum Likelihood Estimators

Context

The multivariate Gaussian distribution appears frequently in machine learning, and the following results are used in many ML books and courses without their derivations.

Given data in the form of a matrix $\mathbf{X}$ of dimensions $m \times p$, if we assume that the data follows a $p$-variate Gaussian distribution with mean $\mu$ ($p \times 1$ vector) and covariance matrix $\Sigma$ ($p \times p$ matrix), the Maximum Likelihood Estimators are given by:

  • $\hat \mu = \frac{1}{m} \sum_{i=1}^m \mathbf{ x^{(i)} } = \mathbf{\bar{x}}$
  • $\hat \Sigma = \frac{1}{m} \sum_{i=1}^m \mathbf{(x^{(i)} - \hat \mu) (x^{(i)} -\hat \mu)}^T $

Question

What is the full derivation of the Maximum Likelihood Estimators for the multivariate Gaussian?


Deriving the Maximum Likelihood Estimators

Assume that we have $m$ random vectors, each of size $p$: $\mathbf{X^{(1)}, X^{(2)},...,X^{(m)}}$, where each random vector can be interpreted as an observation (data point) across $p$ variables. If the $\mathbf{X}^{(i)}$ are i.i.d. multivariate Gaussian vectors:

$$ \mathbf{X^{(i)}} \sim \mathcal{N}_p(\mu, \Sigma) $$

where the parameters $\mu, \Sigma$ are unknown. To obtain their estimates we can use the method of maximum likelihood and maximize the log-likelihood function.

Note that by the independence of the random vectors, the joint density of the data $\{\mathbf{X}^{(i)}, i = 1,2,...,m\}$ is the product of the individual densities, that is, $\prod_{i=1}^m f_{\mathbf{X^{(i)}}}(\mathbf{x^{(i)} ; \mu , \Sigma })$. Taking the logarithm gives the log-likelihood function

\begin{aligned} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = \log \prod_{i=1}^m f_{\mathbf{X^{(i)}}}(\mathbf{x^{(i)} | \mu , \Sigma }) \\ & = \log \ \prod_{i=1}^m \frac{1}{(2 \pi)^{p/2} |\Sigma|^{1/2}} \exp \left( - \frac{1}{2} \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \right) \\ & = \sum_{i=1}^m \left( - \frac{p}{2} \log (2 \pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \right) \end{aligned}

\begin{aligned} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = - \frac{mp}{2} \log (2 \pi) - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \end{aligned}
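
As a quick sanity check, this final expression can be evaluated directly with numpy. Below is a minimal sketch (the function name log_likelihood and the use of np.linalg.slogdet are implementation choices, not part of the derivation), assuming the data is stored row-wise in an $m \times p$ array:

import numpy as np

def log_likelihood(X, mu, Sigma):
    '''Log-likelihood of an (m, p) data matrix X under N_p(mu, Sigma).'''
    m, p = X.shape
    diff = X - mu                                    # row i holds x^(i) - mu
    _, logdet = np.linalg.slogdet(Sigma)             # log|Sigma|, numerically stable
    quad = np.einsum('ij,jk,ik->', diff, np.linalg.inv(Sigma), diff)  # sum of quadratic forms
    return -0.5 * (m * p * np.log(2 * np.pi) + m * logdet + quad)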

Deriving $\hat \mu$

To take the derivative with respect to $\mu$ and equate to zero we will make use of the following matrix calculus identity:

$\mathbf{ \frac{\partial w^T A w}{\partial w} = 2Aw}$ if $\mathbf{w}$ does not depend on $\mathbf{A}$ and $\mathbf{A}$ is symmetric.

\begin{aligned} \frac{\partial }{\partial \mu} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = \sum_{i=1}^m \mathbf{ \Sigma^{-1} ( x^{(i)} - \mu ) } = 0 \\ & \text{Since $\Sigma^{-1}$ is positive definite (hence invertible), multiply both sides by $\Sigma$:} \\ 0 & = \sum_{i=1}^m \mathbf{ x^{(i)} } - m \mu \\ \hat \mu &= \frac{1}{m} \sum_{i=1}^m \mathbf{ x^{(i)} } = \mathbf{\bar{x}} \end{aligned}

This is often called the sample mean vector.
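
As a quick numerical illustration (a sketch with simulated data; the helper neg_quad and the chosen parameter values are arbitrary), the $\mu$-dependent part of the log-likelihood is indeed largest at $\mu = \mathbf{\bar{x}}$:

import numpy as np

rng = np.random.default_rng(0)
Sigma_true = np.array([[1., 0.8], [0.8, 1.]])        # arbitrary test parameters
X = rng.multivariate_normal([0., 1.], Sigma_true, size=500)
Sigma_inv = np.linalg.inv(Sigma_true)

def neg_quad(mu):
    # -1/2 * sum_i (x_i - mu)^T Sigma^{-1} (x_i - mu): the only mu-dependent term
    d = X - mu
    return -0.5 * np.einsum('ij,jk,ik->', d, Sigma_inv, d)

x_bar = X.mean(axis=0)
print(neg_quad(x_bar))                  # largest value
print(neg_quad(x_bar + [0.1, 0.0]))     # smaller
print(neg_quad(x_bar + [0.0, -0.1]))    # smaller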

Deriving $\hat \Sigma$

Deriving the MLE for the covariance matrix requires more work and the use of the following linear algebra and calculus properties:

  • The trace is invariant under cyclic permutations of matrix products: $tr[ABC] = tr[BCA] = tr[CAB]$
  • Since $\mathbf{x}^T A \mathbf{x}$ is a scalar, it equals its own trace: $\mathbf{x}^T A \mathbf{x} = tr[\mathbf{x}^T A \mathbf{x}] = tr[\mathbf{x} \mathbf{x}^T A]$
  • $\frac{\partial}{\partial A} tr[AB] = B^T$
  • $\frac{\partial}{\partial A} \log |A| = A^{-T}$

Combining these properties allows us to calculate

$$ \frac{\partial}{\partial A} \mathbf{x}^T A \mathbf{x} = \frac{\partial}{\partial A} tr[\mathbf{x} \mathbf{x}^T A] = [\mathbf{x} \mathbf{x}^T]^T = \mathbf{x}^{TT} \mathbf{x}^T = \mathbf{x} \mathbf{x}^T $$

This is the outer product of the vector $\mathbf{x}$ with itself.
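
These matrix-calculus identities are easy to verify numerically with finite differences. The sketch below (the random test matrices, seed, and tolerances are arbitrary choices) checks both the gradient of $\mathbf{x}^T A \mathbf{x}$ and the gradient of $\log |A|$:

import numpy as np

rng = np.random.default_rng(1)
p, eps = 3, 1e-6
x = rng.standard_normal(p)
B = rng.standard_normal((p, p))
A = B @ B.T + np.eye(p)            # symmetric positive definite, so log|A| is well-defined

grad_quad = np.zeros((p, p))       # finite-difference gradient of f(A) = x^T A x
grad_logdet = np.zeros((p, p))     # finite-difference gradient of g(A) = log|A|
for i in range(p):
    for j in range(p):
        E = np.zeros((p, p)); E[i, j] = eps
        grad_quad[i, j] = (x @ (A + E) @ x - x @ A @ x) / eps
        grad_logdet[i, j] = (np.log(np.linalg.det(A + E)) - np.log(np.linalg.det(A))) / eps

print(np.allclose(grad_quad, np.outer(x, x), atol=1e-4))         # d(x^T A x)/dA = x x^T
print(np.allclose(grad_logdet, np.linalg.inv(A).T, atol=1e-4))   # d(log|A|)/dA = A^{-T}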

We can now rewrite the log-likelihood function and compute the derivative with respect to $\Sigma^{-1}$ (where $C$ is a constant that does not depend on $\Sigma$):

\begin{aligned} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = \text{C} - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \\ & = \text{C} + \frac{m}{2} \log |\Sigma^{-1}| - \frac{1}{2} \sum_{i=1}^m tr[ \mathbf{(x^{(i)} - \mu) (x^{(i)} - \mu)^T \Sigma^{-1} } ] \\ \frac{\partial }{\partial \Sigma^{-1}} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = \frac{m}{2} \Sigma - \frac{1}{2} \sum_{i=1}^m \mathbf{(x^{(i)} - \mu) (x^{(i)} - \mu)}^T \ \ \text{Since $\Sigma^T = \Sigma$} \end{aligned}

Equating to zero and solving for $\Sigma$ (with $\hat \mu$ substituted for $\mu$):

\begin{aligned} 0 &= m \Sigma - \sum_{i=1}^m \mathbf{(x^{(i)} - \hat \mu) (x^{(i)} - \hat \mu)}^T \\ \hat \Sigma & = \frac{1}{m} \sum_{i=1}^m \mathbf{(x^{(i)} - \hat \mu) (x^{(i)} -\hat \mu)}^T \end{aligned}
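
Both estimators are easy to check numerically on simulated data. Below is a minimal sketch (the true parameters and sample size are arbitrary; note that np.cov with bias=True uses the same $1/m$ normalization as the MLE):

import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([0., 1.])                          # arbitrary true parameters
Sigma_true = np.array([[1., 0.8], [0.8, 1.]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=100_000)   # (m, p) data matrix

mu_hat = X.mean(axis=0)                               # sample mean vector
diff = X - mu_hat
Sigma_hat = diff.T @ diff / X.shape[0]                # MLE: divides by m, not m - 1

print(mu_hat)                                         # close to [0, 1]
print(Sigma_hat)                                      # close to [[1, 0.8], [0.8, 1]]
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))   # True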

Implementing a multivariate Gaussian in Python

In [2]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib import cm                    # colormaps used by the surface plots below
from mpl_toolkits.mplot3d import Axes3D      # registers the '3d' projection
from sklearn import linear_model

%matplotlib inline
plt.style.use('seaborn-white')
In [141]:
def multivariate_gaussian_pdf(X,MU,SIGMA):
    '''Returns the pdf of a multivariate Gaussian distribution
     - X, MU are p x 1 column vectors
     - SIGMA is a p x p covariance matrix'''
    #Initialize and reshape
    X = X.reshape(-1,1)
    MU = MU.reshape(-1,1)
    p,_ = SIGMA.shape

    #Compute values
    SIGMA_inv = np.linalg.inv(SIGMA)
    denominator = np.sqrt((2 * np.pi)**p * np.linalg.det(SIGMA))
    exponent = -(1/2) * ((X - MU).T @ SIGMA_inv @ (X - MU))
    
    #Return result
    return float((1. / denominator) * np.exp(exponent) )    
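
A quick way to check this implementation is to compare it with scipy.stats.multivariate_normal, assuming scipy is available (the test point and parameters below are arbitrary):

In [ ]:
from scipy.stats import multivariate_normal

x_test = np.array([0.5, -0.2])                       # arbitrary test point and parameters
mu_test = np.array([0., 1.])
Sigma_test = np.array([[1., 0.8], [0.8, 1.]])

ours = multivariate_gaussian_pdf(x_test, mu_test, Sigma_test)
ref = multivariate_normal(mean=mu_test, cov=Sigma_test).pdf(x_test)
print(np.isclose(ours, ref))                         # True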

Plot the Gaussian in 3D

In [179]:
# Our 2-dimensional distribution will be over variables X and Y
N = 50
X = np.linspace(-3, 3, N)
Y = np.linspace(-3, 4, N)
X, Y = np.meshgrid(X, Y)

# Mean vector and covariance matrix
mu = np.array([0., 1.])
Sigma = np.array([[ 1. , 0.8], [0.8,  1]])

#Compute the pdf value for each (x, y) grid point
zz = np.array(  [multivariate_gaussian_pdf( np.array([xx,yy]).reshape(-1,1), mu, Sigma) 
                     for xx, yy in zip(np.ravel(X), np.ravel(Y)) ] )
#Reshape the pdf values back onto the grid
Z = zz.reshape(X.shape)

#Plot the result in 3D
fig = plt.figure(figsize = (10,10))
ax = fig.add_subplot(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, linewidth=1, antialiased=True,
                cmap=cm.viridis)

cset = ax.contourf(X, Y, Z, zdir='z', offset=-0.15, cmap=cm.viridis)

# Adjust the limits, ticks and view angle
ax.set_zlim(-0.15,0.2)
ax.set_zticks(np.linspace(0,0.2,5))
ax.view_init(20, 25)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_title('Multivariate Gaussian Sigma = {}'.format(Sigma))

plt.show()
In [176]:
# Same 2-dimensional distribution, but with zero covariance between X and Y
N = 50
X = np.linspace(-3, 3, N)
Y = np.linspace(-3, 4, N)
X, Y = np.meshgrid(X, Y)

# Mean vector and covariance matrix
mu = np.array([0., 1.])
Sigma = np.array([[ 1. , 0.0], [0,  1]])

#Compute the pdf value for each (x, y) grid point
zz = np.array(  [multivariate_gaussian_pdf( np.array([xx,yy]).reshape(-1,1), mu, Sigma) 
                     for xx, yy in zip(np.ravel(X), np.ravel(Y)) ] )
#Reshape the pdf values back onto the grid
Z = zz.reshape(X.shape)

#Plot the result in 3D
fig = plt.figure(figsize = (10,10))
ax = fig.add_subplot(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, linewidth=1, antialiased=True,
                cmap=cm.viridis)

cset = ax.contourf(X, Y, Z, zdir='z', offset=-0.15, cmap=cm.viridis)

# Adjust the limits, ticks and view angle
ax.set_zlim(-0.15,0.2)
ax.set_zticks(np.linspace(0,0.2,5))
ax.view_init(20, 25)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_title('Multivariate Gaussian Sigma = {}'.format(Sigma))

plt.show()