Inferential Statistics - Simple Linear Regression¶
This notebook explores various implementations of statistical inferences of a simple linear regression model.
As the derivations of the model and the associated statistics have been covered in numerous books, lectures and notebooks (see sources), we focus instead on a summary of the main formulas, followed by a case study and its Python implementation using three main libraries.
This notebook is divided into four main sections:
- Introduction: notation, model description, formula, hypothesis
- Implementation using Numpy and Pandas
- Implementation using Statsmodels
- Implementation using Scikit-Learn
Introduction¶
Libraries¶
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image, display
%matplotlib inline
from IPython.core.display import HTML
sns.set() #Making seaborn the default styling
Dataset¶
The dataset illustrates the well-known problem of trying to predict apartment prices given data on their surface area.
- The explanatory (or independent, exogenous) variable $x$ is the apartment surface area in $m^2$
- The dependent (or endogenous) variable $y$ is the apartment price in French Francs
x = np.array([28,50,106,196,55,190,110,60,48,35,86,65,32,52,40,260,70,117,90,30,105,52,80,60,140,20,100,28])
y = np.array([650,1400,3250,4000,1340,3950,2500,1600,1250,1250,1750,1500,775,1225,1000,7500,1625,4750,
1890,390,1875,1000,1350,1475,4950,425,2475,425])
Notation¶
The following notation convention will be used throughout:
- $x_i$ is an observation of the explanatory variable
- $y_i$ is an observation of the dependent variable
- $\bar y$ and $\bar x$ are the means of the observations
- $\hat y_i$ is the value predicted by the model, given the observation $x_i$
Image("/Users/User/Desktop/Data/Learning_Notebooks/images/Linear_Reg_Diagram.png")
Decomposition of the sum of squares and notation¶
$SS_{Total} = SS_{Residual} + SS_{Explained}$
- $SS_{Total} = \large \sum_{i=1}^n(y_i - \bar y)^2$ also noted $\large S_{yy}$
- $SS_{Residual} = \large \sum_{i=1}^n(y_i - \hat y_i)^2$ also noted $\large \sum e_i^2$
- $SS_{Expl} = \large \sum_{i=1}^n(\hat y_i - \bar y)^2$ also noted $\large S_{\hat y \hat y}$
Simple linear regression¶
The model based on the true (unobservable) parameters is:
$y_i = \alpha + \beta x_i + \epsilon_i$
Given the observations, we estimate the model and the coefficients of regression with $a$ and $b$ such that:
$y_i = a + b x_i + \hat \epsilon_i$
$\hat y_i = a + bx_i$
$\hat \epsilon_i = y_i - \hat y_i$
Note that $\epsilon_i$ is the unobservable statistical error between the value of $y_i$ and the true population mean. The value $\hat \epsilon_i$ is the observable residual and is an estimate of the statistical error.
Coefficient of determination $R^2$¶
$R^2 = \large \frac{SS_{exp}}{SS_{tot}} = \frac{S_{\hat y \hat y}}{S_{yy}} = \Large \frac{\sum_{i=1}^n(\hat y_i - \bar y)^2}{\sum_{i=1}^n(y_i - \bar y)^2}$
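As a quick numerical check on the dataset above (a minimal sketch; np.polyfit is only used as a shortcut here, the closed-form estimators are derived in the following sections):
#Verify the sum-of-squares decomposition and compute R^2 on the dataset
b_hat, a_hat = np.polyfit(x, y, deg=1)      #slope, intercept
y_hat = a_hat + b_hat * x
SS_tot = np.sum((y - y.mean())**2)
SS_res = np.sum((y - y_hat)**2)
SS_exp = np.sum((y_hat - y.mean())**2)
print(np.isclose(SS_tot, SS_res + SS_exp))  #True: SS_tot = SS_res + SS_exp
print('R squared:', SS_exp / SS_tot)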
Leverage¶
The leverage of an observation $x_i$ can be viewed intuitively as a measure of the "influence" of the point on the regression model. It is defined as:
$h_i = \Large \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum(x_i - \bar x)^2}$
Leverage values are bounded between $1/n$ and $1$. The average leverage is $2/n$, and a leverage greater than $4/n$ can be considered significant (see the sketch below).
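A minimal sketch of the leverage computation on the dataset above; the leverages sum to 2 (the number of model parameters), hence the average of $2/n$:
#Leverage of each observation: h_i = 1/n + (x_i - x_bar)^2 / S_xx
n = len(x)
h = 1/n + (x - x.mean())**2 / np.sum((x - x.mean())**2)
print(h.min() >= 1/n, h.max() <= 1)  #leverage is bounded between 1/n and 1
print(h.sum())                       #sums to 2, so the average is 2/n
print(np.where(h > 4/n)[0])          #observations with significant leverage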
Maximum Likelihood Model - Normally distributed errors¶
Model description¶
Given a sample of size $n$: $\{(Y_i, x_i)\}$ for $i = 1..n$ taken from a population described by the following model:
$Y_i = Y \ | \ x_i = \alpha + \beta x_i + \epsilon_i$
Where:
- $\alpha$ is an unknown constant describing the y intercept of the regression line (called constant term)
- $\beta$ is an unknown constant describing the slope of the regression line (called coefficient of regression)
- $x_i$ is the value of the explanatory variable
- $\epsilon_i$ is a random variable which follows a Normal distribution $\epsilon_i \ i.i.d. \ \sim N(0, \sigma^2)$. It is the statistical error i.e. the difference between the random variable $Y_i$ and the true population mean
- $Y_i$ is a random variable which also follows a Normal distribution, $Y_i \sim N(\alpha + \beta x_i, \sigma^2)$, with the $Y_i$ independent of each other
Assumptions¶
- $E(\epsilon_i) = 0 \ \forall \ i$ Expectation of the random variable $\epsilon_i$ is zero
- $var(\epsilon_i) = \sigma^2 \ \forall \ i$ Variance of the random variable $\epsilon_i$ is finite and constant (homoscedasticity)
- $\epsilon_i$ are i.i.d. (independent, identically distributed) $\Rightarrow \forall \ i \neq j \ cov(\epsilon_i,\epsilon_j) = 0$
- $\epsilon_i$ are normally distributed: $\epsilon_i \ i.i.d. \ \sim N(0, \sigma^2)$
- $x_i$ are observed with no error
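To make these assumptions concrete, here is a minimal simulation sketch of the model; the parameter values and sample size are arbitrary illustrative choices, not estimates from the dataset.
#Simulate observations from Y_i = alpha + beta * x_i + eps_i with i.i.d. Normal errors
rng = np.random.RandomState(0)
alpha_true, beta_true, sigma = 200, 25, 300   #arbitrary illustrative values
x_sim = rng.uniform(20, 260, size=50)         #x_i observed without error
eps = rng.normal(0, sigma, size=50)           #eps_i ~ N(0, sigma^2), independent
y_sim = alpha_true + beta_true * x_sim + eps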
Estimators¶
Since $\alpha$, $\beta$ and $\sigma^2$ are unknown population parameters, we will estimate them using:
- $B$ the estimator of $\beta$ with realization $b$ in a given sample
- $A$ the estimator of $\alpha$ with realization $a$ in a given sample
- $S^2_{n-2}$ the estimator of $\sigma^2$ with realization $\hat \sigma^2$ in a given sample
- $\hat \epsilon_i$ or $e_i$ the residual, an estimator of the statistical error
Hence we have $Y_i = A + Bx_i + e_i$
Expectation and variance of the model¶
- $E(Y_i) = E(Y \ | \ x_i) = \alpha + \beta x_i$
- $Var(Y_i) = Var(Y \ | \ x_i) = \sigma^2$
Image("/Users/User/Desktop/Data/Learning_Notebooks/images/Linear_Reg_Diagram2.png")
Maximum Likelihood Estimators¶
Maximizing the likelihood function and solving the equations gives:
- $\large A = \bar y - B \bar x$
- $B = \large \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}=\frac{\sum y_i x_i - n \bar y \bar x}{\sum x_i^2 - n \bar x^2} = \frac{\text{Cov}(x,y)}{\text{Var}(x)}=\frac{S_{xy}}{S_{xx}}$
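These estimators are straightforward to verify numerically (a minimal sketch; np.polyfit is used only as an independent cross-check):
#Closed-form estimates of the intercept a and slope b
x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar)**2)
a = y_bar - b * x_bar
print('a =', a, ' b =', b)
print(np.polyfit(x, y, deg=1))   #returns [slope, intercept], should match (b, a)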
Properties of the estimators:¶
- Expectation: $E(A) = \alpha$ and $E(B) = \beta$
- Variance: $ Var(A) = \sigma^2 \large \left(\frac{1}{n} + \frac{\bar x^2}{\sum (x_i -\bar x)^2} \right)$ and $Var(B) = \large \frac{\sigma^2}{\sum (x_i - \bar x)^2}$
Distribution of the estimators¶
Since A and B are linear combinations of the independent normal random variables $Y_1, Y_2,...,Y_n$, they also follow a normal distribution:
- $A \sim N\left( \alpha, \sigma^2 \large \left(\frac{1}{n} + \frac{\bar x^2}{\sum (x_i -\bar x)^2} \right) \right)$ and $B \sim N \left(\beta, \large \frac{\sigma^2}{\sum (x_i - \bar x)^2} \right)$
It can be shown that the following estimator for the variance of the errors is unbiased and is calculated based on the observed residuals $e_i$:
- $ S^2_{n-2} = \large \frac{1}{n-2} \sum(Y_i - A -B x_i)^2 = \frac{1}{n-2} \sum e_i^2$
Implications of the normal distribution hypothesis¶
Estimators can be Z standardized:
$\large \frac{A - E(A)}{\sqrt{Var(A)}} = \frac{A - \alpha}{\sqrt{\sigma^2 \large \left(\frac{1}{n} + \frac{\bar x^2}{\sum (x_i -\bar x)^2} \right)}} \small \sim N(0,1)$
$\large \frac{B - E(B)}{\sqrt{Var(B)}} = \frac{B - \beta}{\sqrt{\frac{\sigma^2}{\sum (x_i - \bar x)^2} }} \small \sim N(0,1)$
Replacing the true variance $\sigma^2$ with its estimator $S^2_{n-2}$ leads to a Student $t$ distribution with $n-2$ degrees of freedom:
$\large \frac{A - \alpha}{S_A} \sim T_{n-2}$
$\large \frac{B - \beta}{S_B} \sim T_{n-2}$
These results hold because $A$ and $B$ are independent of $S^2_{n-2}$.
Under the assumption that $\beta = 0$, the following statistics follow Chi-squared distributions:
$\large \frac{\sum (\hat Y - \bar Y)^2}{\sigma^2} = \frac{SS_{exp}}{\sigma^2} \sim \chi_1^2$ and $\large \frac{\sum (Y_i - \bar Y)^2}{\sigma^2} =\frac{SS_{tot}}{\sigma^2} \sim \chi_{n-1}^2$
The following also follows a Chi-squared distribution:
$\large \frac{\sum (Y_i - \hat Y_i)^2}{\sigma^2} = \frac{SS_{res}}{\sigma^2} \sim \chi^2_{n-2}$
Hence the ratio of the explained and residual sums of squares, each divided by its degrees of freedom, follows a Fisher distribution:
$\large F = \frac{SS_{exp}}{SS_{res} / (n-2)} \sim F_{1,n-2}$
Hypothesis testing for the parameters $\alpha$ and $\beta$¶
- $H_0 :\alpha = 0 $
- $H_1: \alpha \neq 0$
Test statistic: $T_A = \frac{A}{S_A} \sim T_{n-2}$
- $H_0 :\beta = 0 $
- $H_1: \beta \neq 0$
Test statistic: $T_B = \frac{B}{S_B} \sim T_{n-2}$
Hypothesis testing for the statistical significance of the model as a whole¶
- $H_0 : Y_i = \alpha + \epsilon_i $
- $H_1 : Y_i = \alpha + \beta x_i + \epsilon_i $
Test statistic: $\large F = \frac{SS_{exp}}{SS_{res} / (n-2)} \sim F_{1,n-2}$
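A minimal sketch of both tests with scipy.stats, reusing a and b from the sketch above; note that in simple linear regression the F statistic equals the square of the slope t statistic:
#t test for the slope and F test for the whole model
n = len(x)
y_hat = a + b * x
SS_res = np.sum((y - y_hat)**2)
SS_exp = np.sum((y_hat - y.mean())**2)
S2 = SS_res / (n - 2)                            #unbiased estimate of sigma^2
se_B = np.sqrt(S2 / np.sum((x - x.mean())**2))   #standard error of the slope
T_B = b / se_B
F = SS_exp / (SS_res / (n - 2))
print('T_B =', T_B, ' p =', 2 * (1 - stats.t.cdf(abs(T_B), n - 2)))
print('F =', F, ' p =', 1 - stats.f.cdf(F, 1, n - 2))
print(np.isclose(F, T_B**2))                     #True: F = T_B^2 here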
Confidence intervals¶
Confidence interval around the mean of the regression line¶
For a given observation $x_0$ we want to construct a C.I. for $E(Y_0) = E(Y | x = x_0) = \alpha + \beta x_0$
- Expectation: $E[Y \ | \ x = x_0] = E(A) + E(B) x_0 = \alpha + \beta x_0$
- Variance: $\text{Var}[Y \ | \ x = x_0] = \sigma^2 \large \left( \frac{1}{n} + \frac{(x_0 - \bar x)^2 }{\sum_{i=1}^n (x_i - \bar x)^2} \right) = \small \sigma^2 h_0$ where $h_0$ is the leverage of point $x_0$.
Since $A + B x_0$ is a linear combination of Normal r.v. it also has a Normal distribution:
$A + B x_0 \sim N \left( \alpha + \beta x_0, \sigma^2 h_0 \right)$ and so $\large \frac{A + Bx_0 - (\alpha + \beta x_0)}{ \sigma \sqrt{h_0} } \small \sim N(0,1)$
Since $\sigma^2$ is unknown, we replace it by its estimator $S^2_{n-2}$ to obtain a pivotal quantity:
$\large \frac{A + Bx_0 - (\alpha + \beta x_0)}{ S_{n-2} \sqrt{h_0} } \small \sim T_{n-2}$
Re-arranging to build a confidence interval of level $1 - \gamma$ :
$CI_{1 - \gamma} [E(Y \ | \ x_0)] = [ A + B x_0 \pm t_{(n-2)(1 - \gamma /2)} S_{n-2} \sqrt{h_0} ]$
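A minimal sketch of this interval for an arbitrary illustrative surface $x_0 = 100 \ m^2$, reusing $a$, $b$ and $S^2_{n-2}$ (S2) from the sketches above:
#95% confidence interval for the mean E(Y | x0)
x0 = 100                                                    #arbitrary illustrative value
h0 = 1/n + (x0 - x.mean())**2 / np.sum((x - x.mean())**2)   #leverage of x0
t_crit = stats.t.ppf(0.975, n - 2)
center = a + b * x0
half_ci = t_crit * np.sqrt(S2 * h0)
print(center - half_ci, center + half_ci)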
Prediction Confidence Interval around a new observation $x_p$¶
Here we are interested in the expectation and variance of the prediction error for a new observation (i.e. one which was not used to fit the model).
- Expectation: $E(Y_p - (A + B x_p)) = E(Y_p) - E(A + B x_p) = 0$
- Variance: $\text{Var}(Y_p - (A + B x_p)) = \text{Var}(Y_p) + \text{Var}(A + B x_p) = \sigma^2 \left(1 + h_p \right)$. Note that this variance is greater than the variance in the previous C.I.
As previously, replacing the variance with $S_{n-2}^2$ and constructing the pivotal quantity:
$\large \frac{Y_p - (A + B x_p)}{S_{n-2} \sqrt{1 + h_p}} \sim T_{n-2}$
And the prediction confidence interval:
$CI_{1 - \gamma} [Y_p] = [ A + B x_p \pm t_{(n-2)(1 - \gamma /2)} S_{n-2} \sqrt{1 + h_p} ]$
The uncertainty of the prediction interval is therefore greater than that of the confidence interval around the mean of the regression line.
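Continuing the previous sketch, the prediction interval only differs by the extra $1$ added to the leverage term:
#95% prediction interval for a new observation at x_p = x0
half_pred = t_crit * np.sqrt(S2 * (1 + h0))
print(center - half_pred, center + half_pred)   #wider than the mean CI above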
Some comments on the residuals¶
If the hypotheses of the model are not violated (i.e. the errors are normally distributed, etc.), then the following properties hold. Recall that the residuals $e_i$ are estimators of the true statistical errors $\epsilon_i$.
- Residuals are not independent: their sum is zero
- The variance of the residuals is $\text{Var}(Y_i - (A + B x_i)) = \text{Var}(Y_i) + \text{Var}(A + Bx_i) - 2\text{Cov}(Y_i, A + B x_i) = \sigma^2 + \sigma^2 h_i - 2\sigma^2 h_i = \sigma^2(1 - h_i)$
- The variance can be estimated as $S^2_{n-2}(1 - h_i)$
Note: do not confuse the variance of the residuals of the sample $\sigma^2(1 - h_i)$ with the variance of the residuals of a new observation $x_p$ : $\sigma^2(1 + h_p)$
Cook's distance: a measure of influence¶
Defined as $D_i = \Large \frac{e_i^2}{2 S^2_{n-2}} \left( \frac{h_i}{(1- h_i)^2} \right)$, where $\sigma^2$ has been replaced by its estimator $S^2_{n-2}$ and $2$ is the number of model parameters.
Values larger than $1$, or than $4/n$ (or $4/(n - p - 1)$ for multiple linear regression), are considered large and point towards an influential observation.
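A minimal sketch of Cook's distance on this dataset, reusing the residuals and leverages from the sketches above and estimating $\sigma^2$ with $S^2_{n-2}$:
#Cook's distance for each observation (2 parameters in simple linear regression)
e = y - (a + b * x)
D = (e**2 / (2 * S2)) * (h / (1 - h)**2)
print(D.round(3))
print(np.where(D > 4/n)[0])   #observations flagged as influential by the 4/n rule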
class simple_linear_reg_model(object):
'''Class used to bundle together the data, coefficients, parameters, and statistics
of the simple linear regression model '''
def __init__(self,x,y):
'''Initializing the dataset and the basic computed values:
x,y : datasets
n: number of observations
x_bar, y_bar: means of variables
s_xx, s_yy, s_xy: short hand notation, e.g. s_xx = sum(x - x_bar)^2
'''
self.x = x
self.y = y
self.n = len(x)
self.x_bar = np.mean(self.x)
self.y_bar = np.mean(self.y)
self.s_xx = np.sum((self.x - self.x_bar)**2)
self.s_yy = np.sum((self.y - self.y_bar)**2)
self.s_xy = np.sum((self.x - self.x_bar) * (self.y - self.y_bar))
#Other attributes are created here but initialized in the corresponding methods
self.a, self.b = None, None
self.y_hat, self.epsilon, self.h = None, None, None
self.SS_tot, self.SS_res, self.SS_exp = None, None, None
self.Sn_2, self.S_A, self.S_B = None, None, None
self.R_squared = None
self.F, self.T_A, self.T_B = None, None, None
self.F_pvalue, self.T_A_pvalue, self.T_B_pvalue = None, None, None
self.CI_reg_upp,self.CI_reg_low, self.CI_pred_upp,self.CI_pred_low = None,None,None,None
self.S_epsilon, self.Cook_Distance = None, None
def fitModel(self):
#Fit the linear regression model to the data and compute the coefficients
self.b = self.s_xy / self.s_xx
self.a = self.y_bar - self.b * self.x_bar
#Predicted values y_hat, residuals and leverage
self.y_hat = self.a + self.b * self.x
self.epsilon = self.y - self.y_hat
self.h = (1 / self.n) + (self.x - self.x_bar)**2 / (self.s_xx)
#Sum of squared
self.SS_tot = np.sum( (self.y - self.y_bar)**2 )
self.SS_res = np.sum( (self.y - self.y_hat)**2 )
self.SS_exp = np.sum( (self.y_hat - self.y_bar)**2 )
#Estimators of the variances
self.Sn_2 = (1/(self.n - 2)) * self.SS_res
self.S_A = self.Sn_2 * ( (1 / self.n) + self.x_bar**2 / self.s_xx )
self.S_B = self.Sn_2 / self.s_xx
#Coefficient of determination: R squared
self.R_squared = self.SS_exp / self.SS_tot
#Test statistics
self.T_A = self.a / np.sqrt(self.S_A)
self.T_B = self.b / np.sqrt(self.S_B)
self.F = self.SS_exp / (self.SS_res / (self.n -2))
#Confidence intervals around the given observations
t = stats.t.ppf(.975,self.n - 2)
self.CI_reg_upp = self.a + self.b * self.x + t * np.sqrt(self.Sn_2 * self.h)
self.CI_reg_low = self.a + self.b * self.x - t * np.sqrt(self.Sn_2 * self.h)
self.CI_pred_upp = self.a + self.b * self.x + t * np.sqrt(self.Sn_2 * (1 + self.h))
self.CI_pred_low = self.a + self.b * self.x - t * np.sqrt(self.Sn_2 * (1 + self.h))
#Variance of the residuals and Cook's distance
self.S_epsilon = self.Sn_2 * (1 - self.h)
self.Cook_Distance = (self.epsilon**2 / (2*self.Sn_2) ) * (self.h / (1 - self.h)**2 )
def pValues(self):
#Compute the p-values associated with the test statistics, given that the null hypothesis
#H_0: no linear relationship between x and y, i.e. coefficients are = 0
#Note that we use the scipy.stats library here
#Note - the p_value for the t statistic is for a two sided test: Pr > |t| hence the factor of 2
self.T_A_pvalue = (1 - stats.t.cdf(abs(self.T_A), self.n -2) ) * 2
self.T_B_pvalue = (1 - stats.t.cdf(abs(self.T_B), self.n -2) ) * 2
self.F_pvalue = (1 - stats.f.cdf(self.F, 1, self.n -2 ) )
def printResults(self):
#Print results in a similar format to SAS output
print('Parameter Estimates')
df1 = pd.DataFrame(data = {'1. DF' : [1,1 ], '2. Parameter Estimate':[self.a,self.b],
'3. Standard Error':[np.sqrt(self.S_A),np.sqrt(self.S_B)],
'4. t Value': [self.T_A,self.T_B], '5. Pr > |t|': [self.T_A_pvalue, self.T_B_pvalue]},
index = ['Intercept (a)','SURFACE (b)'])
display(df1.round(3))
df2 = pd.DataFrame(data = {'1. DF' : [1,self.n -2 ,self.n -1],
'2. Sum of Squares':[self.SS_exp,self.SS_res,self.SS_tot], '3. F_value':[self.F,'',''],
'4. Pr > F': [self.F_pvalue,'','']}, index = ['Model (Explained)',
'Error (Residuals)','Total'])
print('Analysis of Variance')
display(df2.round(3))
df3 = pd.DataFrame(data = {'Values': [self.y_bar,self.R_squared]},
index = ['Dependent Mean (y_bar)','R-squared'])
print('Mean and R-squared')
display(df3.round(3))
df4 = pd.DataFrame(data = {'1. Dependent Variable' : self.y, '2. Predicted Value': self.y_hat,
'3. Residual': self.epsilon , '3. Std Error Residuals': np.sqrt(self.S_epsilon) ,
'4. 95% CI Mean -': self.CI_reg_low, '5. 95% CI Mean +': self.CI_reg_upp,
'6. 95% CI Pred -': self.CI_pred_low, '7. 95% CI Pred +':self.CI_pred_upp,
'8. Leverage (h)': self.h, '9. Cook Distance': self.Cook_Distance})
print('Output statistics')
display(df4.round(3))
Initialize the model, perform the fitting and probability calculations, and print the results in a SAS-like format¶
Given the very small p-values for both the slope parameter $B$ and the model as a whole, we can confidently reject the null hypothesis and state that there is a statistically significant linear relationship between the explanatory variable $x$ and the dependent variable $y$.
lm = simple_linear_reg_model(x,y)
lm.fitModel()
lm.pValues()
lm.printResults()
#Limiting max width of iPython column display
HTML("<style>.rendered_html th {max-width: 90px;}</style>")
Print the resulting scatter plot and regression line with confidence intervals¶
#Plot the x and y values in a scatter
plt.figure(figsize = (12,8))
plt.plot(lm.x, lm.y, marker = '.', linestyle = 'none', label = 'data')
plt.xlabel('Surface / m2')
plt.ylabel('Price / Francs ')
plt.title('Simple Linear Regression: Apartment prices & Surface')
#Make regression line to plot
x_vals = np.array([0,max(lm.x)])
y_vals = lm.a + x_vals * lm.b
plt.plot(x_vals,y_vals, label = 'R2: %0.2f '%(lm.R_squared))
#Plot the Confidence Intervals (sort by x so that the (x, CI) pairs stay aligned)
order = np.argsort(lm.x)
plt.fill_between(lm.x[order], lm.CI_reg_upp[order], lm.CI_reg_low[order], alpha = .2, facecolor='grey', linestyle='dashdot', edgecolor = 'red', label = '95% CI Mean' )
plt.plot(lm.x[order], lm.CI_pred_upp[order], linewidth=1, linestyle='dashdot', color = 'navy')
plt.plot(lm.x[order], lm.CI_pred_low[order], linewidth=1, linestyle='dashdot', color = 'navy', label = '95% CI Pred')
plt.legend()
plt.xlim(min(lm.x) -2,max(lm.x) +2)
plt.show()
Study of the residuals and influential points¶
Studying the residuals is fundamental as it allows us to highlight observations which:
- are potential outliers
- play an important role in the regression analysis
Moreover, it allows us to check the hypotheses of the model, in particular:
- Linearity
- Homoscedasticity
- Normal distribution of the errors $\epsilon_i$
Plotting the residual scatter plot and QQ plot¶
Comments: The residuals appear to be approximately symmetrically distributed, with mean close to zero. The distribution is not perfectly normal; in particular the right tail is heavier than normal.
The residual variance also appears to increase with $x$, which points towards a violation of the homoscedasticity hypothesis. A potential solution would be to perform the linear regression on $\log(y)$ instead, as shown below.
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize = (16,6))
#Figure 1: residual plot
ax1.plot(lm.x, lm.epsilon, marker = '.', linestyle = 'none')
ax1.set_xlabel('Surface /m2 (x)')
ax1.set_ylabel('Regression residual /Francs')
ax1.set_title('Residuals plot')
#Figure 2: Plot of the distribution of the residuals
ax2 = sns.distplot(lm.epsilon,ax = ax2, bins = 15)
ax2.set_ylabel('Probability')
ax2.set_xlabel('Residual value')
ax2.set_title('Distribution of residuals')
#Figure 3: QQ plot comparing residual quantiles to theoretical normal quantiles
stats.probplot(lm.epsilon, dist = "norm", plot = ax3)
plt.show()
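As a quick illustration of the log-transform remedy mentioned above (a minimal sketch, not part of the original analysis), we can refit the model on $\log(y)$ with the class defined earlier and inspect the residuals again:
#Refit on log(y) as a possible remedy for the increasing residual variance
log_lm = simple_linear_reg_model(x, np.log(y))
log_lm.fitModel()
plt.plot(log_lm.x, log_lm.epsilon, marker = '.', linestyle = 'none')
plt.xlabel('Surface /m2 (x)')
plt.ylabel('Residual of log(price)')
plt.title('Residuals after log transform')
plt.show()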
Implementation using the Statsmodels library¶
The Python statsmodels library offers extensive functionality and reporting for the linear regression model. We will explore the main features in this section.
- As recommended by statsmodels, import the statsmodels API to access the functions and models.
- Importing statsmodels.api will load most of the public parts of statsmodels. This makes most functions and classes conveniently available within one or two levels, without making the “sm” namespace too crowded.
- Use the add_constant utility function to add a column of 1s to the matrix of observations. This will allow the model to fit an intercept parameter
- Print results summary
import statsmodels.api as sm
#Adding the column of 1s
X = sm.add_constant(x)
#Fit the model and print output
lm2 = sm.OLS(y,X)
results = lm2.fit()
print(results.summary())
Comments¶
As expected, the values of the parameters, test statistics, probabilities, R-squared value, etc. are the same as those calculated manually using numpy.
- The formatted table offers a more concise summary of the information
- Statsmodels also performs a number of diagnostic tests by default, for example Durbin-Watson and Jarque-Bera (recomputed in the sketch below)
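These diagnostics can also be recomputed directly from the fitted residuals (a minimal sketch; both functions live in statsmodels.stats.stattools):
#Recompute two of the default diagnostics from the fitted residuals
from statsmodels.stats.stattools import durbin_watson, jarque_bera
print('Durbin-Watson:', durbin_watson(results.resid))
print('Jarque-Bera (stat, p-value, skew, kurtosis):', jarque_bera(results.resid))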
Study of influence¶
The statsmodels results object has a built-in method which returns an Influence object with influence and outlier measures. This "influence" object contains many results and methods which can be called as required. See below for a few interesting ones:
- .cooks_distance - Cooks distance
- .hat_matrix_diag - The leverage for observation points
- .summary_table() - Prints the available influence results
- .summary_frame() - Creates a DataFrame with all available influence results.
Documentation here:
#Creating the influence object from the OLS results
influence = results.get_influence()
#Cooks distance for the x observations
display(influence.cooks_distance[0].round(3))
#Leverage values for the x observations
display(influence.hat_matrix_diag.round(3))
influence.summary_frame()
Influence plots¶
Influence plots show the externally studentized residuals vs. the leverage of each observation, as measured by the diagonal of the hat matrix (in simple linear regression this diagonal is just the leverage $h_i$ defined earlier).
Externally studentized residuals are residuals scaled by an estimate of their standard deviation, where:
$\text{var}(\hat \epsilon_i) = \hat \sigma^2_{(i)} (1 - h_i)$
and $\hat \sigma^2_{(i)}$ is the error variance estimated with observation $i$ left out.
The influence of each point can be visualized by the criterion keyword argument. Options are Cook's distance and DFFITS, two measures of influence
Source: http://www.statsmodels.org/dev/examples/notebooks/generated/regression_plots.html#Influence-plots
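A minimal sketch of retrieving these externally studentized residuals from the influence object created above (attribute name from the statsmodels OLSInfluence API):
#Externally studentized residuals: each residual is scaled by a variance
#estimate computed with the corresponding observation left out
student_resid = influence.resid_studentized_external
print(student_resid.round(3))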
Leverage-Resid2 Plot¶
Closely related to the influence_plot is the leverage-resid2 plot.
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (16,6))
ax1 = sm.graphics.influence_plot(results, ax=ax1, criterion="cooks")
ax2 = sm.graphics.plot_leverage_resid2(results, ax=ax2)
Implementation using Sklearn¶
There is no R-style regression summary report in sklearn. The main reason is that sklearn is used for predictive modelling / machine learning, where models are evaluated on their performance on previously unseen data (such as a predictive $R^2$ for regression).
Nevertheless, we can extract some key values and parameters using the built-in methods and attributes, in particular the regression metrics:
Documentation: http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
#Create linear regression object
regr = linear_model.LinearRegression()
#Reshaping the matrix of X
X = x.reshape(-1,1)
# Train the model using the data sets
regr.fit(X, y)
# Make predictions using the same data set
y_pred = regr.predict(X)
#The coefficients
print('Intercept: \n', regr.intercept_)
print('Slope: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"% mean_squared_error(y, y_pred))
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y, y_pred))
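Since sklearn's focus is predictive performance, a natural complement is a cross-validated $R^2$ (a minimal sketch, not part of the original analysis; the fold count of 5 is an arbitrary choice):
#Cross-validated R^2 as a rough estimate of out-of-sample performance
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(linear_model.LinearRegression(), X, y, cv=5, scoring='r2')
print('CV R2 scores:', cv_scores.round(3))
print('Mean CV R2:', cv_scores.mean().round(3))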
Sources¶
Notebook Author: Xavier Bourret Sicotte
Date: May 2018
The dataset, formulas and insights were taken from the (excellent) course at CNAM in Paris:
Other useful resources: Lecture 14 at Duke for detailed derivations of most formulas