Inferential Statistics - Simple Linear Regression¶
This notebook explores various implementations of statistical inference for a simple linear regression model.
As the derivations of the model and the associated statistics have been covered in numerous books, lectures and notebooks (see sources), we will focus instead on a summary of the main formulas, a case study, and its Python implementation using three main libraries.
This notebook is divided into four main sections:
- Introduction: notation, model description, formula, hypothesis
- Implementation using NumPy and Pandas
- Implementation using Statsmodels
- Implementation using Scikit-Learn
Introduction¶
Libraries¶
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
%matplotlib inline
from IPython.core.display import HTML
sns.set() #Making seaborn the default styling
Dataset¶
The dataset illustrates the well-known problem of trying to predict apartment prices given data on their surface area.
- The explanatory (or independent, exogenous) variable x is the apartment surface area in m²
- The dependent (or endogenous) variable y is the apartment price in French Francs
x = np.array([28,50,106,196,55,190,110,60,48,35,86,65,32,52,40,260,70,117,90,30,105,52,80,60,140,20,100,28])
y = np.array([650,1400,3250,4000,1340,3950,2500,1600,1250,1250,1750,1500,775,1225,1000,7500,1625,4750,
1890,390,1875,1000,1350,1475,4950,425,2475,425])
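As a quick, purely illustrative sanity check, the two arrays can be bundled into a Pandas DataFrame to confirm the sample size and obtain summary statistics (the df name is arbitrary):
#Quick inspection of the dataset (illustrative only)
df = pd.DataFrame({'surface_m2': x, 'price_francs': y})
print(df.shape)         #(28, 2): n = 28 observations of 2 variables
display(df.describe())  #Means, quartiles, min / max of both variables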
Notation¶
The following notation convention will be used throughout:
- $x_i$ is an observation of the explanatory variable
- $y_i$ is an observation of the dependent variable
- $\bar{y}$ and $\bar{x}$ are the means of a set of observations
- $\hat{y}_i$ is the value predicted by the model, given the observation $x_i$
Image("/Users/User/Desktop/Data/Learning_Notebooks/images/Linear_Reg_Diagram.png")
Decomposition of the sum of squares and notation¶
$SS_{Total} = SS_{Residual} + SS_{Explained}$
- $SS_{Total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$, also noted $S_{yy}$
- $SS_{Residual} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, also noted $\sum e_i^2$
- $SS_{Explained} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$, also noted $S_{\hat{y}\hat{y}}$
Simple linear regression¶
The model based on the true (unobservable) parameters is:
$y_i = \alpha + \beta x_i + \epsilon_i$
Given the observations, we estimate the model and the coefficients of regression with $a$ and $b$ such that:
$y_i = a + b x_i + \hat{\epsilon}_i$
$\hat{y}_i = a + b x_i$
$\hat{\epsilon}_i = y_i - \hat{y}_i$
Note that $\epsilon_i$ is the unobservable statistical error between the value of $y_i$ and the true population mean. The value $\hat{\epsilon}_i$ is the observable residual and is an estimate of the statistical error.
Coefficient of determination R2¶
$R^2 = \dfrac{SS_{exp}}{SS_{tot}} = \dfrac{S_{\hat{y}\hat{y}}}{S_{yy}} = \dfrac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
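As a minimal NumPy sketch (using the x and y arrays defined above and the least squares coefficients derived further below), the sum-of-squares decomposition and R² can be checked directly; the full implementation is given in the class later in the notebook:
#Minimal sketch: fit the regression line and verify SS_tot = SS_res + SS_exp (illustrative)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
y_hat = a + b * x
SS_tot = np.sum((y - y.mean())**2)
SS_res = np.sum((y - y_hat)**2)
SS_exp = np.sum((y_hat - y.mean())**2)
print(np.isclose(SS_tot, SS_res + SS_exp))  #True: the decomposition holds
print('R squared:', SS_exp / SS_tot)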
Leverage¶
The leverage of an observation $x_i$ can be viewed intuitively as a measure of the "influence" of the point on the regression model. It is defined as:
$h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2}$
Leverage values are bounded between $1/n$ and $1$. The average is $2/n$, and the leverage is commonly considered significant if it is greater than $4/n$.
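A short NumPy sketch of the leverage formula above, reusing the x array; the 4/n threshold is the rule of thumb just mentioned:
#Leverage of each observation and rule-of-thumb threshold (illustrative)
h = 1 / len(x) + (x - x.mean())**2 / np.sum((x - x.mean())**2)
print(h.round(3))
print('High leverage surfaces:', x[h > 4 / len(x)])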
Maximum Likelihood Model - Normally distributed errors¶
Model description¶
Given a sample of size $n$: $\{(Y_i, x_i)\}$ for $i = 1, \dots, n$, taken from a population described by the following model:
$Y_i = Y \mid x_i = \alpha + \beta x_i + \epsilon_i$
Where:
- $\alpha$ is an unknown constant describing the y-intercept of the regression line (called the constant term)
- $\beta$ is an unknown constant describing the slope of the regression line (called the regression coefficient)
- $x_i$ is the value of the explanatory variable
- $\epsilon_i$ is a random variable which follows a Normal distribution, $\epsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2)$. It is the statistical error, i.e. the difference between the random variable $Y_i$ and the true population mean
- $Y_i$ is a random variable which also follows a Normal distribution, $Y_i \sim N(\alpha + \beta x_i, \sigma^2)$
Assumptions¶
- $E(\epsilon_i) = 0 \;\; \forall i$ : the expectation of the random variable $\epsilon_i$ is zero
- $var(\epsilon_i) = \sigma^2 \;\; \forall i$ : the variance of the random variable $\epsilon_i$ is finite and constant
- $\epsilon_i$ are i.i.d. (independent, identically distributed) $\Rightarrow \forall i \neq j, \; cov(\epsilon_i, \epsilon_j) = 0$
- $\epsilon_i$ are normally distributed: $\epsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2)$
- $x_i$ are observed with no error
Estimators¶
Since $\alpha$, $\beta$ and $\sigma^2$ are unknown population parameters, we will estimate them using:
- $B$, the estimator of $\beta$, with realization $b$ in a given sample
- $A$, the estimator of $\alpha$, with realization $a$ in a given sample
- $S^2_{n-2}$, the estimator of $\sigma^2$, with realization $\hat{\sigma}^2$ in a given sample
- $\hat{\epsilon}_i$ or $e_i$, the residual, an estimator of the statistical error
Hence we have $Y_i = A + B x_i + e_i$
Expectation and variance of the model¶
- $E(Y_i) = E(Y \mid x_i) = \alpha + \beta x_i$
- $Var(Y_i) = Var(Y \mid x_i) = \sigma^2$
Image("/Users/User/Desktop/Data/Learning_Notebooks/images/Linear_Reg_Diagram2.png")
Maximum Likelihood Estimators¶
Maximizing the likelihood function and solving the equations gives:
- $A = \bar{y} - B\bar{x}$
- $B = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{\sum y_i x_i - n\bar{y}\bar{x}}{\sum x_i^2 - n\bar{x}^2} = \dfrac{Cov(x,y)}{Var(x)} = \dfrac{S_{xy}}{S_{xx}}$
Properties of the estimators:¶
- Expectation: $E(A) = \alpha$ and $E(B) = \beta$
- Variance: $Var(A) = \sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right)$ and $Var(B) = \dfrac{\sigma^2}{\sum (x_i - \bar{x})^2}$
Distribution of the estimators¶
Since $A$ and $B$ are linear combinations of i.i.d. normal random variables $Y_1, Y_2, \dots, Y_n$, they also follow a normal distribution:
- $A \sim N\left(\alpha, \; \sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right)\right)$ and $B \sim N\left(\beta, \; \dfrac{\sigma^2}{\sum (x_i - \bar{x})^2}\right)$
It can be shown that the following estimator of the error variance is unbiased; it is computed from the observed residuals $e_i$:
- $S^2_{n-2} = \dfrac{1}{n-2}\sum (Y_i - A - Bx_i)^2 = \dfrac{1}{n-2}\sum e_i^2$
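A hedged NumPy sketch of these estimators, reusing a, b and y_hat from the earlier sketch:
#Unbiased estimate of the error variance and standard errors of A and B (illustrative)
n = len(x)
e = y - y_hat                       #Residuals
S2 = np.sum(e**2) / (n - 2)         #S^2_{n-2}
SE_A = np.sqrt(S2 * (1 / n + x.mean()**2 / np.sum((x - x.mean())**2)))
SE_B = np.sqrt(S2 / np.sum((x - x.mean())**2))
print(S2, SE_A, SE_B)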
Implications of the normal distribution hypothesis¶
Estimators can be Z-standardized:
$\dfrac{A - E(A)}{\sqrt{Var(A)}} = \dfrac{A - \alpha}{\sqrt{\sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right)}} \sim N(0,1)$
$\dfrac{B - E(B)}{\sqrt{Var(B)}} = \dfrac{B - \beta}{\sqrt{\dfrac{\sigma^2}{\sum (x_i - \bar{x})^2}}} \sim N(0,1)$
Since $A$ and $B$ are independent of $S^2_{n-2}$, replacing the true variance $\sigma^2$ with its estimator $S^2_{n-2}$ leads to a Student T distribution with $n-2$ degrees of freedom:
$\dfrac{A - \alpha}{S_A} \sim T_{n-2}$
$\dfrac{B - \beta}{S_B} \sim T_{n-2}$
Under the assumption that $\beta = 0$, the following statistics follow a Chi-squared distribution:
$\dfrac{\sum (\hat{Y}_i - \bar{Y})^2}{\sigma^2} = \dfrac{SS_{exp}}{\sigma^2} \sim \chi^2_1$ and $\dfrac{\sum (Y_i - \bar{Y})^2}{\sigma^2} = \dfrac{SS_{tot}}{\sigma^2} \sim \chi^2_{n-1}$
Also, the following follows a Chi-squared distribution:
$\dfrac{\sum (Y_i - \hat{Y}_i)^2}{\sigma^2} = \dfrac{SS_{res}}{\sigma^2} \sim \chi^2_{n-2}$
And hence the ratio of the two follows a Fisher distribution:
$F = \dfrac{SS_{exp}}{SS_{res}/(n-2)} \sim F_{1, n-2}$
Hypothesis testing for the parameters α and β¶
- $H_0: \alpha = 0$
- $H_1: \alpha \neq 0$
Test statistic: $T_A = \dfrac{A}{S_A} \sim T_{n-2}$
- $H_0: \beta = 0$
- $H_1: \beta \neq 0$
Test statistic: $T_B = \dfrac{B}{S_B} \sim T_{n-2}$
Hypothesis testing for the statistical significance of the model as a whole¶
- $H_0: Y_i = \alpha + \epsilon_i$
- $H_1: Y_i = \alpha + \beta x_i + \epsilon_i$
Test statistic: $F = \dfrac{SS_{exp}}{SS_{res}/(n-2)} \sim F_{1, n-2}$
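These test statistics and their p-values can be sketched directly with scipy.stats, reusing a, b, SE_A, SE_B and the sums of squares computed in the sketches above (the same quantities are computed again in the class below):
#t and F statistics with their p-values (illustrative)
T_A, T_B = a / SE_A, b / SE_B
F = SS_exp / (SS_res / (n - 2))
p_A = 2 * (1 - stats.t.cdf(abs(T_A), n - 2))   #Two-sided p-value for the intercept
p_B = 2 * (1 - stats.t.cdf(abs(T_B), n - 2))   #Two-sided p-value for the slope
p_F = 1 - stats.f.cdf(F, 1, n - 2)             #Upper-tail p-value for the F test
print(T_A, T_B, F)
print(p_A, p_B, p_F)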
Confidence intervals¶
Confidence interval around the mean of the regression line¶
For a given observation $x_0$ we want to construct a confidence interval for $E(Y_0) = E(Y \mid x = x_0) = \alpha + \beta x_0$
- Expectation: $E(A + Bx_0) = E(A) + E(B)x_0 = \alpha + \beta x_0$
- Variance: $Var(A + Bx_0) = \sigma^2\left(\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right) = \sigma^2 h_0$, where $h_0$ is the leverage of point $x_0$.
Since $A + Bx_0$ is a linear combination of Normal random variables, it also has a Normal distribution:
$A + Bx_0 \sim N(\alpha + \beta x_0, \; \sigma^2 h_0)$ and so $\dfrac{A + Bx_0 - (\alpha + \beta x_0)}{\sigma\sqrt{h_0}} \sim N(0,1)$
Since $\sigma^2$ is unknown, we replace it by its estimator $S^2_{n-2}$ to obtain a pivotal quantity:
$\dfrac{A + Bx_0 - (\alpha + \beta x_0)}{S_{n-2}\sqrt{h_0}} \sim T_{n-2}$
Re-arranging to build a confidence interval of level $1 - \gamma$:
$IC_{1-\gamma}[E(Y \mid x_0)] = \left[A + Bx_0 \pm t^{(n-2)}_{(1-\gamma/2)} \, S_{n-2}\sqrt{h_0}\right]$
Prediction Confidence Interval around a new observation xp¶
Here we are interested in the expectation and variance of the prediction error for a new observation (i.e. one which was not included in the model previously)
- Expectation: $E\left(Y_p - (A + Bx_p)\right) = E(Y_p) - E(A + Bx_p) = 0$
- Variance: $Var\left(Y_p - (A + Bx_p)\right) = Var(Y_p) + Var(A + Bx_p) = \sigma^2(1 + h_p)$. Note that this variance is greater than the variance in the previous C.I.
As previously, replacing the variance with $S^2_{n-2}$ and constructing the pivotal quantity:
$\dfrac{Y_p - (A + Bx_p)}{S_{n-2}\sqrt{1 + h_p}} \sim T_{n-2}$
And the prediction confidence interval:
$IC_{1-\gamma}[Y_p \mid x_p] = \left[A + Bx_p \pm t^{(n-2)}_{(1-\gamma/2)} \, S_{n-2}\sqrt{1 + h_p}\right]$
The uncertainty around the prediction interval is greater.
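A minimal sketch of both intervals at a hypothetical new surface x0 = 100 m² (the value is only an example), reusing a, b, S2 and n from the sketches above:
#95% confidence interval for the mean response and prediction interval at x0 (illustrative)
x0 = 100                                                     #Hypothetical new surface in m2
h0 = 1 / n + (x0 - x.mean())**2 / np.sum((x - x.mean())**2)  #Leverage of x0
t_crit = stats.t.ppf(0.975, n - 2)
y0 = a + b * x0
ci_mean = (y0 - t_crit * np.sqrt(S2 * h0), y0 + t_crit * np.sqrt(S2 * h0))
ci_pred = (y0 - t_crit * np.sqrt(S2 * (1 + h0)), y0 + t_crit * np.sqrt(S2 * (1 + h0)))
print(ci_mean)
print(ci_pred)   #The prediction interval is wider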
Some comments on the residuals¶
If the hypotheses of the model are not violated (i.e. the errors are normally distributed, etc.) then the following properties hold. Recall that the residuals $e_i$ are the estimators of the true statistical errors $\epsilon_i$:
- Residuals are not independent: their sum is zero
- The variance of the residuals is $Var\left(Y_i - (A + Bx_i)\right) = Var(Y_i) + Var(A + Bx_i) - 2Cov\left(Y_i, A + Bx_i\right) = \sigma^2 + \sigma^2 h_i - 2\sigma^2 h_i = \sigma^2(1 - h_i)$
- This variance can be estimated as $S^2_{n-2}(1 - h_i)$
Note: do not confuse the variance of the residuals of the sample, $\sigma^2(1 - h_i)$, with the variance of the prediction error for a new observation $x_p$: $\sigma^2(1 + h_p)$
Cook's distance: a measure of influence¶
Defined as $D_i = \dfrac{e_i^2}{2 \, \hat{\sigma}^2}\left(\dfrac{h_i}{(1 - h_i)^2}\right)$
Values larger than 1, $4/n$, or $4/(n - p - 1)$ for multiple linear regression, are considered to be large and point towards an influential observation.
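A short sketch of Cook's distance, reusing the residuals e, leverage h and variance estimate S2 from the sketches above:
#Cook's distance for each observation and the 4/n rule of thumb (illustrative)
D = (e**2 / (2 * S2)) * (h / (1 - h)**2)
print(D.round(3))
print('Influential surfaces:', x[D > 4 / n])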
class simple_linear_reg_model(object):
'''Class used to bundle together the data, coefficients, parameters, and statistics
of the simple linear regression model '''
def __init__(self,x,y):
'''Initializing the dataset and the basic computed values:
x,y : datasets
n: number of observations
x_bar, y_bar: means of variables
s_xx, s_yy, s_xy: short hand notation, e.g. s_xx = sum(x - x_bar)^2
'''
self.x = x
self.y = y
self.n = len(x)
self.x_bar = np.mean(self.x)
self.y_bar = np.mean(self.y)
self.s_xx = np.sum((self.x - self.x_bar)**2)
self.s_yy = np.sum((self.y - self.y_bar)**2)
self.s_xy = np.sum((self.x - self.x_bar) * (self.y - self.y_bar))
#Other attributes are created here but initialized in the corresponding methods
self.a, self.b = None, None
self.y_hat, self.epsilon, self.h = None, None, None
self.SS_tot, self.SS_res, self.SS_exp = None, None, None
self.Sn_2, self.S_A, self.S_B = None, None, None
self.R_squared = None
self.F, self.T_A, self.T_B = None, None, None
self.F_pvalue, self.T_A_pvalue, self.T_B_pvalue = None, None, None
self.CI_reg_upp,self.CI_reg_low, self.CI_pred_upp,self.CI_pred_low = None,None,None,None
self.S_epsilon, self.Cook_Distance = None, None
def fitModel(self):
#Fit the linear regression model to the data and compute the coefficients
self.b = self.s_xy / self.s_xx
self.a = self.y_bar - self.b * self.x_bar
#Predicted values y_hat, residuals and leverage
self.y_hat = self.a + self.b * self.x
self.epsilon = self.y - self.y_hat
        self.h = (1 / self.n) + (self.x - self.x_bar)**2 / (self.s_xx)
#Sum of squared
self.SS_tot = np.sum( (self.y - self.y_bar)**2 )
self.SS_res = np.sum( (self.y - self.y_hat)**2 )
self.SS_exp = np.sum( (self.y_hat - self.y_bar)**2 )
#Estimators of the variances
self.Sn_2 = (1/(self.n - 2)) * self.SS_res
self.S_A = self.Sn_2 * ( (1 / self.n) + self.x_bar**2 / self.s_xx )
self.S_B = self.Sn_2 / self.s_xx
#Coefficient of determination: R squared
self.R_squared = self.SS_exp / self.SS_tot
#Test statistics
self.T_A = self.a / np.sqrt(self.S_A)
self.T_B = self.b / np.sqrt(self.S_B)
self.F = self.SS_exp / (self.SS_res / (self.n -2))
#Confidence intervals around the given observations
t = stats.t.ppf(.975,self.n - 2)
self.CI_reg_upp = self.a + self.b * self.x + t * np.sqrt(self.Sn_2 * self.h)
self.CI_reg_low = self.a + self.b * self.x - t * np.sqrt(self.Sn_2 * self.h)
self.CI_pred_upp = self.a + self.b * self.x + t * np.sqrt(self.Sn_2 * (1 + self.h))
self.CI_pred_low = self.a + self.b * self.x - t * np.sqrt(self.Sn_2 * (1 + self.h))
#Variance of the residuals and Cook's distance
self.S_epsilon = self.Sn_2 * (1 - self.h)
self.Cook_Distance = (self.epsilon**2 / (2*self.Sn_2) ) * (self.h / (1 - self.h)**2 )
def pValues(self):
#Compute the p-values associated with the test statistics, given that the null hypothesis
#H_0: no linear relationship between x and y, i.e. coefficients are = 0
        #Note that we use the scipy.stats distributions here (the statsmodels library is used later in the notebook)
#Note - the p_value for the t statistic is for a two sided test: Pr > |t| hence the factor of 2
self.T_A_pvalue = (1 - stats.t.cdf(abs(self.T_A), self.n -2) ) * 2
self.T_B_pvalue = (1 - stats.t.cdf(abs(self.T_B), self.n -2) ) * 2
self.F_pvalue = (1 - stats.f.cdf(self.F, 1, self.n -2 ) )
def printResults(self):
#Print results in a similar format to SAS output
print('Parameter Estimates')
df1 = pd.DataFrame(data = {'1. DF' : [1,1 ], '2. Parameter Estimate':[self.a,self.b],
'3. Standard Error':[np.sqrt(self.S_A),np.sqrt(self.S_B)],
'4. t Value': [self.T_A,self.T_B], '5. Pr > |t|': [self.T_A_pvalue, self.T_B_pvalue]},
index = ['Intercept (a)','SURFACE (b)'])
display(df1.round(3))
df2 = pd.DataFrame(data = {'1. DF' : [1,self.n -2 ,self.n -1],
'2. Sum of Squares':[self.SS_exp,self.SS_res,self.SS_tot], '3. F_value':[self.F,'',''],
'4. Pr > F': [self.F_pvalue,'','']}, index = ['Model (Explained)',
'Error (Residuals)','Total'])
print('Analysis of Variance')
display(df2.round(3))
df3 = pd.DataFrame(data = {'Values': [self.y_bar,self.R_squared]},
index = ['Dependent Mean (y_bar)','R-squared'])
print('Mean and R-squared')
display(df3.round(3))
df4 = pd.DataFrame(data = {'1. Dependent Variable' : self.y, '2. Predicted Value': self.y_hat,
'3. Residual': self.epsilon , '3. Std Error Residuals': np.sqrt(self.S_epsilon) ,
'4. 95% CI Mean -': self.CI_reg_low, '5. 95% CI Mean +': self.CI_reg_upp,
'6. 95% CI Pred -': self.CI_pred_low, '7. 95% CI Pred +':self.CI_pred_upp,
'8. Leverage (h)': self.h, '9. Cook Distance': self.Cook_Distance})
print('Output statistics')
display(df4.round(3))
Initialize the model, perform the fitting and probability calculations and print the results in SAS format¶
Given the very small p-values for both the slope parameter B and the model as a whole, we can confidently reject the null hypothesis and state that there is a statistically significant linear relationship between the explanatory variable x and the dependent variable y
lm = simple_linear_reg_model(x,y)
lm.fitModel()
lm.pValues()
lm.printResults()
#Limiting max width of iPython column display
HTML("<style>.rendered_html th {max-width: 90px;}</style>")
Print the resulting scatter plot and regression line with confidence intervals¶
#Plot the x and y values in a scatter
plt.figure(figsize = (12,8))
plt.plot(lm.x, lm.y, marker = '.', linestyle = 'none', label = 'data')
plt.xlabel('Surface / m2')
plt.ylabel('Price / Francs ')
plt.title('Simple Linear Regression: Apartment prices & Surface')
#Make regression line to plot
x_vals = np.array([0,max(lm.x)])
y_vals = lm.a + x_vals * lm.b
plt.plot(x_vals,y_vals, label = 'R2: %0.2f '%(lm.R_squared))
#Plot the Confidence Intervals (sorting by x so that upper and lower bounds stay paired with their observation)
order = np.argsort(lm.x)
plt.fill_between(lm.x[order], lm.CI_reg_upp[order], lm.CI_reg_low[order], alpha = .2, facecolor='grey', linestyle='dashdot', edgecolor = 'red', label = '95% CI Mean' )
plt.plot(lm.x[order], lm.CI_pred_upp[order], linewidth=1, linestyle='dashdot', color = 'navy')
plt.plot(lm.x[order], lm.CI_pred_low[order], linewidth=1, linestyle='dashdot', color = 'navy', label = '95% CI Pred')
plt.legend()
plt.xlim(min(lm.x) -2,max(lm.x) +2)
plt.show()
Study of the residuals and influential points¶
Studying the residuals is fundamental as it allows us to highlight observations which:
- Are potential outliers
- Play an important role (are influential) in the regression analysis
Moreover, it allows us to confirm the hypotheses of the model, in particular:
- Linearity
- Homoscedasticity
- Normal distribution of the errors ϵi
Plotting the residual scatter plot and QQ plot¶
Comments: The residuals appear to be approximately symmetrically distributed, with mean close to zero. The distribution is not perfectly normal; in particular the right tail is heavier than normal.
This points towards a violation of the homoscedasticity assumption, i.e. the spread of the residuals increases as a function of x. A potential solution would be to perform the linear regression on log(y) instead, as sketched after the plots below.
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize = (16,6))
#Figure 1: residual plot
ax1.plot(lm.x, lm.epsilon, marker = '.', linestyle = 'none')
ax1.set_xlabel('Surface /m2 (x)')
ax1.set_ylabel('Regression residual /Francs')
ax1.set_title('Residuals plot')
#Figure 2: Plot of the distribution of the residuals
ax2 = sns.distplot(lm.epsilon,ax = ax2, bins = 15)
ax2.set_ylabel('Probability')
ax2.set_xlabel('Residual value')
ax2.set_title('Distribution of residuals')
#Figure 3: QQ plot comparing residual quantiles to theoretical normal quantiles
ax3 = stats.probplot(lm.epsilon, dist = "norm", plot = plt)
plt.show()
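As suggested above, a hedged sketch of the log transform: refit the same class on log(y) and re-plot the residuals (a quick check, not a full analysis):
#Refit on log(y) to mitigate the increasing spread of the residuals (illustrative)
lm_log = simple_linear_reg_model(x, np.log(y))
lm_log.fitModel()
plt.plot(lm_log.x, lm_log.epsilon, marker = '.', linestyle = 'none')
plt.xlabel('Surface /m2 (x)')
plt.ylabel('Residual of log(price)')
plt.title('Residuals after log transform')
plt.show()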
Implementation using the Statsmodels library¶
The Python Statsmodels library offers extensive functionality and reporting for the linear regression model. We will explore the main features in this section.
- As recommended by Statsmodel, import the statsmodel api to access the functions and models.
- Importing statsmodels.api will load most of the public parts of statsmodels. This makes most functions and classes conveniently available within one or two levels, without making the “sm” namespace too crowded.
- Use the add_constant utility function to add a column of 1s to the matrix of observations. This will allow the model to fit an intercept parameter
- Print results summary
import statsmodels.api as sm
#Adding the column of 1s
X = sm.add_constant(x)
#Fit the model and print output
lm2 = sm.OLS(y,X)
results = lm2.fit()
print(results.summary())
Comments¶
As expected, the values for the parameters, test statistics, probabilities, R-squared values, etc. are the same as those calculated manually using NumPy.
- The formatted table offers a more concise summary of the information
- Statsmodels also performs a number of tests by default: for example Durbin-Watson, Jarque-Bera, etc.
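Beyond the printed summary, individual results can be accessed directly as attributes of the fitted results object, for example:
#Accessing individual results from the fitted Statsmodels model
print(results.params)       #Intercept and slope
print(results.bse)          #Standard errors of the parameters
print(results.pvalues)      #Two-sided p-values of the t tests
print(results.rsquared)     #Coefficient of determination
print(results.conf_int())   #95% confidence intervals for the parameters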
Study of influence¶
The Statsmodels results object has a built-in method which returns an Influence object with influence and outlier measures. This "influence" object contains many results and methods which can be called as required. See below for a few interesting ones:
- .cooks_distance - Cook's distance
- .hat_matrix_diag - The leverage for observation points
- .summary_table() - Prints the available influence results
- .summary_frame() - Creates a DataFrame with all available influence results.
Documentation here:
#Creating the influence object from the OLS results
influence = results.get_influence()
#Cooks distance for the x observations
display(influence.cooks_distance[0].round(3))
#Leverage values for the x observations
display(influence.hat_matrix_diag.round(3))
influence.summary_frame()
Influence plots¶
Influence plots show the externally studentized residuals vs. the leverage of each observation, as measured by the diagonal of the hat matrix (in simple linear regression this is simply the leverage $h_i$ defined earlier).
Externally studentized residuals are residuals scaled by their standard deviation, where:
$var(\epsilon_i) = \hat{\sigma}^2_i (1 - h_{ii})$
The influence of each point can be visualized using the criterion keyword argument. Options are Cook's distance and DFFITS, two measures of influence.
Source: http://www.statsmodels.org/dev/examples/notebooks/generated/regression_plots.html#Influence-plots
Leverage-Resid2 Plot¶
Closely related to the influence_plot is the leverage-resid2 plot.
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (16,6))
ax1 = sm.graphics.influence_plot(results, ax=ax1, criterion="cooks")
ax2 = sm.graphics.plot_leverage_resid2(results, ax=ax2)
Implementation using Sklearn¶
There is no R-style regression summary report in sklearn. The main reason is that sklearn is used for predictive modelling / machine learning, and its evaluation criteria are based on performance on previously unseen data (such as predictive R² for regression).
Nevertheless we can extract some key values and parameters using the built-in methods and attributes, in particular the regression metrics:
Documentation: http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
#Create linear regression object
regr = linear_model.LinearRegression()
#Reshaping the matrix of X
X = x.reshape(-1,1)
# Train the model using the data sets
regr.fit(X, y)
# Make predictions using the same data set
y_pred = regr.predict(X)
#The coefficients
print('Intercept: \n', regr.intercept_)
print('Slope: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"% mean_squared_error(y, y_pred))
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y, y_pred))
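To predict the price of a hypothetical new apartment (e.g. 100 m², an arbitrary example value), the fitted sklearn model expects a 2-D array; note that sklearn only returns point predictions, not the confidence or prediction intervals derived earlier:
#Point prediction for a new observation (illustrative)
x_new = np.array([[100]])
print('Predicted price:', regr.predict(x_new))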
Sources¶
Notebook Author: Xavier Bourret Sicotte
Date: May 2018
The dataset, formulas and insights were taken from the (excellent) course at CNAM in Paris:
Other useful resources: Lecture 14 at Duke for detailed derivations of most formulas