Linear Regression Assumptions

Setumo Raphela
Apr 17, 2023

Linear regression is a statistical tool for modeling the relationship between a dependent variable and one or more independent variables. It is important to understand and check the assumptions of linear regression to ensure that the results are valid and reliable.

Below are the key assumptions of linear regression and how to check whether they hold:

a) Linearity: The relationship between the dependent variable and the independent variable(s) is linear.

To test for linearity, there are several methods that can be used:

Scatter plot: One of the simplest methods is to create a scatter plot of the dependent variable against each independent variable. If the points on the plot form a roughly straight line, then the relationship between the variables is likely linear.

Residual plot: Another way to test for linearity is to plot the residuals (the difference between the observed values and the predicted values) against the fitted values (the predicted values). If the points on the plot are scattered randomly around a horizontal line, then the linearity assumption is likely met.

Cook’s distance: Cook’s distance is a measure of the influence of each observation on the regression coefficients. If there are outliers or influential observations that violate the linearity assumption, they will have a high Cook’s distance.

Durbin-Watson test: The Durbin-Watson test is a statistical test for autocorrelation (the correlation between residuals at different points in time). Autocorrelated residuals often indicate a mis-specified model (for example, a non-linear trend fitted with a straight line), and they also violate the independence assumption described in point b) below.
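Below is a minimal sketch of the scatter-plot and residual-plot checks described above, using simulated data (the variables x and y are hypothetical placeholders for your own columns):

```python
# Minimal sketch: scatter plot and residual-vs-fitted plot for a linearity check.
# The data here is simulated; substitute your own dependent/independent variables.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.5 * x + rng.normal(0, 1, 200)          # roughly linear relationship

model = sm.OLS(y, sm.add_constant(x)).fit()  # simple OLS fit

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y, alpha=0.5)                 # should look roughly like a straight line
ax1.set_title("y vs x")
ax2.scatter(model.fittedvalues, model.resid, alpha=0.5)
ax2.axhline(0, color="red")                  # residuals should scatter randomly around 0
ax2.set_title("Residuals vs fitted values")
plt.show()
```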

b) Independence: The observations are independent of each other. This means that there is no relationship between the residuals of the observations.
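The Durbin-Watson test mentioned above is a common way to check this. A minimal sketch, reusing the fitted model from the linearity example:

```python
# Minimal sketch: Durbin-Watson statistic on the residuals of the model fitted above.
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(model.resid)
# Values near 2 suggest no autocorrelation; values toward 0 or 4 suggest
# positive or negative autocorrelation between successive residuals.
print(f"Durbin-Watson statistic: {dw:.2f}")
```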

c) Homoscedasticity: The variance of the residuals is constant across all levels of the independent variable(s).
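Besides inspecting the residual-vs-fitted plot (a funnel shape suggests non-constant variance), the Breusch-Pagan test mentioned at the end of this article can be used. A minimal sketch, again reusing the fitted model:

```python
# Minimal sketch: Breusch-Pagan test for heteroscedasticity on the model fitted above.
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
# A small p-value (e.g. < 0.05) is evidence that the residual variance is not constant.
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.3f}")
```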

d) Normality: The residuals are normally distributed.

Histogram: One of the simplest methods is to create a histogram of the residuals (the difference between the observed values and the predicted values). If the histogram is roughly bell-shaped, then the residuals are likely normally distributed.

QQ plot: A QQ (quantile-quantile) plot is a graphical method to compare the distribution of the residuals to a normal distribution. If the residuals follow a straight line, then they are likely normally distributed.

Shapiro-Wilk test: The Shapiro-Wilk test is a statistical test for normality. It tests the null hypothesis that the data is normally distributed. If the p-value of the test is greater than the significance level (usually 0.05), we fail to reject the null hypothesis, and there is no evidence that the residuals deviate from normality.

Kolmogorov-Smirnov test: The Kolmogorov-Smirnov test is another statistical test that can be used to test for normality. It tests the null hypothesis that the data follows a normal distribution. If the p-value of the test is greater than the significance level (usually 0.05), we fail to reject the null hypothesis, and the normality assumption appears reasonable.
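A minimal sketch of these four normality checks, applied to the residuals of the model fitted earlier:

```python
# Minimal sketch: histogram, QQ plot, Shapiro-Wilk and Kolmogorov-Smirnov tests
# applied to the residuals of the model fitted above.
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm

resid = model.resid

plt.hist(resid, bins=30)                       # histogram: roughly bell-shaped?
plt.title("Histogram of residuals")
plt.show()

sm.qqplot(resid, line="45", fit=True)          # QQ plot against a normal distribution
plt.show()

shapiro_stat, shapiro_p = stats.shapiro(resid)              # Shapiro-Wilk
ks_stat, ks_p = stats.kstest(stats.zscore(resid), "norm")   # Kolmogorov-Smirnov
print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Kolmogorov-Smirnov p-value: {ks_p:.3f}")
```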

e) No multicollinearity: The independent variables are not highly correlated with each other.

Multicollinearity can be tested using the following methods:

Correlation matrix: One of the easiest ways to detect multicollinearity is to calculate the correlation matrix between the independent variables. If the correlation between two independent variables is high (i.e., greater than 0.8), then those variables are likely to be multicollinear. A correlation matrix can be created using the pandas corr() method.
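A minimal sketch of the correlation-matrix check (the DataFrame and column names here are made up for illustration):

```python
# Minimal sketch: correlation matrix between independent variables with pandas.
# The DataFrame and column names are hypothetical placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
})
X["x3"] = 0.95 * X["x1"] + rng.normal(scale=0.1, size=200)  # nearly collinear with x1

print(X.corr())
# Off-diagonal values above roughly 0.8 flag pairs of variables that may be collinear.
```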

Variance Inflation Factor (VIF): The VIF measures the degree to which the variance of the estimated regression coefficients is increased due to multicollinearity in the data. A VIF value greater than 5 or 10 indicates that multicollinearity may be a problem. The VIF can be calculated for each independent variable using the statsmodels package in Python.
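A minimal sketch of the VIF calculation with statsmodels, reusing the predictor DataFrame X from the previous sketch:

```python
# Minimal sketch: variance inflation factor for each predictor in X.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)  # include an intercept, as in the regression itself
for i, name in enumerate(X_const.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X_const.values, i)
    # Values above roughly 5-10 suggest problematic multicollinearity.
    print(f"{name}: VIF = {vif:.2f}")
```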

Eigenvalues: Another way to detect multicollinearity is to calculate the eigenvalues of the correlation matrix. If any of the eigenvalues are close to 0, then there is likely to be multicollinearity in the data.

Condition Number: The condition number is a measure of how sensitive the regression coefficients are to small changes in the data. A condition number greater than 30 indicates that multicollinearity may be a problem.
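A minimal sketch of the eigenvalue and condition-number checks on the same predictor DataFrame X:

```python
# Minimal sketch: eigenvalues and condition number of the predictors' correlation matrix.
import numpy as np

eigenvalues = np.linalg.eigvalsh(X.corr().values)  # eigenvalues of a symmetric matrix
print("Eigenvalues:", eigenvalues)
# Eigenvalues close to 0 indicate near-linear dependence among the predictors.

condition_number = np.sqrt(eigenvalues.max() / eigenvalues.min())
# A condition number above roughly 30 suggests multicollinearity may be a problem.
print(f"Condition number: {condition_number:.1f}")
```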

f) No influential outliers: There are no extreme values that have a disproportionate effect on the regression coefficients.
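Cook's distance, introduced under the linearity checks, is a convenient way to screen for such points. A minimal sketch, reusing the fitted model from the first example:

```python
# Minimal sketch: flag observations with a large Cook's distance for the model fitted above.
import numpy as np

influence = model.get_influence()
cooks_d, _ = influence.cooks_distance
threshold = 4 / len(cooks_d)          # a common rule of thumb: flag points above 4/n
flagged = np.where(cooks_d > threshold)[0]
print(f"Potentially influential observations: {flagged}")
```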

Checking these assumptions can be done through various methods, such as residual plots, QQ plots, and statistical tests like the Shapiro-Wilk test for normality and the Breusch-Pagan test for homoscedasticity. If the assumptions are not met, it may be necessary to transform the data or use a different modeling technique.
