Chapter 19. Multicollinearity

Understanding highly correlated regressors, inflated standard errors, and variance inflation factors.

Chapter purpose

Multiple regression allows economists to study the relationship between a dependent variable and several explanatory variables simultaneously. However, problems arise when explanatory variables contain very similar information.

When two or more explanatory variables move together closely, it becomes difficult for the regression model to determine their separate effects. This problem is known as multicollinearity.

In this chapter, we learn what multicollinearity is, how to identify it, and how it affects regression results.

Applied question

Which farm inputs increase crop yield?

Suppose we collect data from 250 farms. Variables include:

  • crop yield
  • fertilizer use
  • irrigation water
  • farm expenditure

Economic theory suggests that all three inputs contribute to crop productivity. However, farms that spend more money often purchase more fertilizer and use more irrigation. As a result, the explanatory variables may become highly correlated.

Economic background

In real-world datasets, explanatory variables often move together.

Examples include:

  • education and work experience
  • income and wealth
  • farm size and machinery ownership
  • fertilizer use and irrigation
  • GDP and energy consumption

When explanatory variables contain overlapping information, the regression model struggles to distinguish their individual contributions.

Multicollinearity is therefore a data problem rather than a modeling mistake.

Key idea

Multicollinearity occurs when explanatory variables are highly correlated.

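For example, the correlation between two regressors may be close to one:

\[ Corr(X_1, X_2) \approx 1 \]

or close to minus one:

\[ Corr(X_1, X_2) \approx -1 \]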

In such situations, the regression model receives nearly the same information from multiple variables.

The model can still predict the dependent variable accurately, but coefficient estimates become unstable and difficult to interpret.

Simulating farm data

To illustrate multicollinearity, we create a dataset in which fertilizer use and irrigation are strongly related.

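import numpy as np
import pandas as pd

# Fix the seed so the simulated data are reproducible
np.random.seed(4107)

n = 250  # number of farms

# Fertilizer use is drawn independently
fertilizer = np.random.normal(200, 40, n)

# Irrigation is almost a linear function of fertilizer (small added noise),
# which builds strong collinearity into the data
irrigation = fertilizer * 4 + np.random.normal(0, 20, n)

# True model: both inputs raise yield
yield_data = (
    2
    + 0.03 * fertilizer
    + 0.01 * irrigation
    + np.random.normal(0, 2, n)
)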

farm_data = pd.DataFrame({
    "Yield": yield_data,
    "Fertilizer": fertilizer,
    "Irrigation": irrigation
})

farm_data.head()
       Yield  Fertilizer  Irrigation
0  17.186780  233.640675  916.976019
1   9.735865  147.492317  605.077263
2  13.934696  171.744352  707.723802
3  15.615381  171.641342  709.244515
4  15.613278  198.740988  828.210820

Exploring relationships

Before estimating a regression model, we should examine relationships among explanatory variables.

farm_data[[
    "Fertilizer",
    "Irrigation"
]].corr()
            Fertilizer  Irrigation
Fertilizer    1.000000    0.993065
Irrigation    0.993065    1.000000

Interpretation

The correlation coefficient is about 0.993, which is very close to one. This suggests that fertilizer use and irrigation provide nearly the same information.

High correlation does not automatically create a problem, but it raises concerns about multicollinearity.

Visualizing multicollinearity

import matplotlib.pyplot as plt

plt.figure(figsize=(7, 5))

plt.scatter(
    farm_data["Fertilizer"],
    farm_data["Irrigation"],
    alpha=0.7
)

plt.xlabel("Fertilizer")
plt.ylabel("Irrigation")
plt.title("Relationship Between Explanatory Variables")

plt.show()

Interpretation

The points form a tight upward pattern. This indicates that farms using more fertilizer also tend to use more irrigation.

The stronger this relationship becomes, the harder it is to separate the individual effects of the two variables.

Estimating the regression

We estimate the following model:

\[ Yield_i = \beta_0 + \beta_1 Fertilizer_i + \beta_2 Irrigation_i + u_i \]

import statsmodels.api as sm

X = farm_data[[
    "Fertilizer",
    "Irrigation"
]]

X = sm.add_constant(X)

y = farm_data["Yield"]

model = sm.OLS(y, X).fit()

print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Yield   R-squared:                       0.710
Model:                            OLS   Adj. R-squared:                  0.707
Method:                 Least Squares   F-statistic:                     301.8
Date:                Thu, 11 Jun 2026   Prob (F-statistic):           4.76e-67
Time:                        06:54:34   Log-Likelihood:                -518.46
No. Observations:                 250   AIC:                             1043.
Df Residuals:                     247   BIC:                             1053.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.3409      0.619      2.166      0.031       0.122       2.560
Fertilizer     0.0442      0.025      1.753      0.081      -0.005       0.094
Irrigation     0.0072      0.006      1.140      0.255      -0.005       0.020
==============================================================================
Omnibus:                        1.609   Durbin-Watson:                   2.094
Prob(Omnibus):                  0.447   Jarque-Bera (JB):                1.674
Skew:                           0.153   Prob(JB):                        0.433
Kurtosis:                       2.742   Cond. No.                     4.35e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.35e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Interpreting the results

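The regression output above shows three features together:

  • a fairly high (R^2) of 0.710 and a large F-statistic (301.8)
  • standard errors that are large relative to the coefficients
  • slope coefficients that are individually insignificant at the 5% level (p = 0.081 for fertilizer, p = 0.255 for irrigation)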

At first glance, this may seem contradictory.

Students often ask:

If the model fits well, why are the coefficients insignificant?

The answer is multicollinearity.

The model explains yield effectively, but it struggles to determine which explanatory variable deserves credit for the explanation.
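The contrast between joint and individual significance can be checked directly. The following is a minimal sketch using only NumPy. It re-creates the chapter's design with a different random generator and seed (both are assumptions made for this illustration), so the numbers will not match the table above.

```python
import numpy as np

# Hypothetical re-creation of the chapter's design; the generator and seed
# differ from the text, so results will not match the printed summary.
rng = np.random.default_rng(0)
n = 250
fertilizer = rng.normal(200, 40, n)
irrigation = 4 * fertilizer + rng.normal(0, 20, n)
y = 2 + 0.03 * fertilizer + 0.01 * irrigation + rng.normal(0, 2, n)

X = np.column_stack([np.ones(n), fertilizer, irrigation])
k = X.shape[1]  # number of estimated coefficients, including the constant

# OLS estimates and residual variance
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
ssr = resid @ resid
sigma2 = ssr / (n - k)

# Individual t statistics: coefficient divided by its standard error
cov = sigma2 * np.linalg.inv(X.T @ X)
t_stats = beta / np.sqrt(np.diag(cov))

# Joint F statistic for the null that both slopes are zero
sst = np.sum((y - y.mean()) ** 2)
r2 = 1 - ssr / sst
f_stat = (r2 / (k - 1)) / ((1 - r2) / (n - k))

print("t statistics (const, Fertilizer, Irrigation):", np.round(t_stats, 2))
print("R-squared:", round(r2, 3), " F statistic:", round(f_stat, 1))
```

The F statistic asks whether the two inputs matter together, while each t statistic asks whether one input matters given the other. With nearly collinear regressors, the first question is easy to answer and the second is hard.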

Why multicollinearity matters

Multicollinearity increases uncertainty surrounding coefficient estimates.

As correlation among explanatory variables rises:

  • standard errors increase
  • confidence intervals become wider
  • coefficients become unstable
  • signs may change unexpectedly
  • statistical significance may disappear

The regression model becomes less precise.
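A short numerical sketch can make the first item in this list concrete. It keeps the chapter's design but varies the noise added to irrigation, so that the correlation with fertilizer rises step by step. The noise levels, the seed, and the use of the known error standard deviation (2, as in the simulation) are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 250
sigma = 2.0  # error standard deviation, as in the chapter's simulation
fertilizer = rng.normal(200, 40, n)

results = []
# Smaller noise in irrigation means higher correlation with fertilizer
for noise_sd in [160, 80, 40, 20, 10]:
    irrigation = 4 * fertilizer + rng.normal(0, noise_sd, n)
    X = np.column_stack([np.ones(n), fertilizer, irrigation])
    corr = np.corrcoef(fertilizer, irrigation)[0, 1]
    # Theoretical OLS standard errors: sqrt of the diagonal of sigma^2 (X'X)^{-1}
    se = sigma * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))
    results.append((noise_sd, corr, se[1]))

for noise_sd, corr, se_fert in results:
    print(f"noise sd={noise_sd:>4}  corr={corr:.4f}  SE(Fertilizer)={se_fert:.4f}")
```

The sample size and the error variance are the same in every row, so the growth in the standard error comes entirely from the rising correlation between the two regressors.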

Variance Inflation Factors

A common diagnostic tool is the Variance Inflation Factor.

The VIF measures how strongly an explanatory variable can be predicted by the remaining explanatory variables.

VIF         Interpretation
Below 5     Usually acceptable
5 to 10     Potential concern
Above 10    Serious concern

These thresholds are guidelines rather than strict rules.
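The definition can be made concrete with an auxiliary regression. For variable j, regress it on the remaining explanatory variables (plus a constant), record the resulting R-squared, and compute 1 / (1 - R-squared). The sketch below does this with NumPy on data shaped like the chapter's simulation. The helper name `vif` and the seed are hypothetical choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 250
fertilizer = rng.normal(200, 40, n)
irrigation = 4 * fertilizer + rng.normal(0, 20, n)

def vif(target, others):
    """VIF of `target` from an auxiliary regression on `others` plus a constant."""
    Z = np.column_stack([np.ones(len(target))] + list(others))
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    resid = target - Z @ coef
    # R-squared of the auxiliary regression
    r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

vif_fertilizer = vif(fertilizer, [irrigation])
vif_irrigation = vif(irrigation, [fertilizer])

print("VIF Fertilizer:", round(vif_fertilizer, 2))
print("VIF Irrigation:", round(vif_irrigation, 2))
```

With only two explanatory variables, the two auxiliary regressions share the same R-squared (the squared correlation), so the two VIFs are equal.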

Calculating VIF

from statsmodels.stats.outliers_influence import variance_inflation_factor

# One VIF per column of X; column 0 is the constant added by sm.add_constant
vif_data = pd.DataFrame({
    "Variable": X.columns,
    "VIF": [
        variance_inflation_factor(X.values, i)
        for i in range(X.shape[1])
    ],
})

vif_data
     Variable        VIF
0       const  25.538392
1  Fertilizer  72.351089
2  Irrigation  72.351089

Interpretation

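Large VIF values indicate substantial overlap among explanatory variables. Here, Fertilizer and Irrigation each have a VIF of about 72, far above the threshold of 10. The two values are identical because, with only two explanatory variables, each VIF depends on the same pairwise correlation. The value reported for the constant is usually ignored, since it does not describe overlap among the explanatory variables.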

The larger the VIF, the greater the inflation of coefficient uncertainty.
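The link between the VIF and coefficient uncertainty can be written out explicitly. The expression below is the standard variance formula for an OLS slope under homoskedastic errors. The symbols \(\sigma^2\) (error variance), \(SST_j\) (total sample variation in regressor \(j\)), and \(R_j^2\) (R-squared from regressing regressor \(j\) on the other regressors) are notation introduced here for illustration and do not appear elsewhere in the chapter.

```latex
\[
\operatorname{Var}(\hat{\beta}_j)
  = \frac{\sigma^2}{SST_j \,(1 - R_j^2)}
  = \frac{\sigma^2}{SST_j} \cdot VIF_j,
\qquad
VIF_j = \frac{1}{1 - R_j^2}
\]
```

Because the standard error is the square root of the variance, it scales with the square root of the VIF. With the VIF of about 72.35 reported above, the standard errors on Fertilizer and Irrigation are roughly \(\sqrt{72.35} \approx 8.5\) times larger than they would be if the two variables were uncorrelated, holding \(\sigma^2\) and \(SST_j\) fixed.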

Correlation heatmap

When working with many explanatory variables, a heatmap can provide a useful overview.

import seaborn as sns

plt.figure(figsize=(6, 5))

sns.heatmap(
    farm_data.corr(),
    annot=True,
    cmap="coolwarm"
)

plt.title("Correlation Matrix")

plt.show()

Interpretation

Heatmaps quickly identify highly related variables. However, correlation alone does not fully diagnose multicollinearity because several variables may jointly create the problem.

VIF measures provide a more complete assessment.
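A small constructed example shows how pairwise correlations can understate the problem. The variables `x1`, `x2`, and `x3` are hypothetical and are not part of the farm dataset: `x3` is almost exactly the sum of two independent variables, so no single pair is very highly correlated, yet the three together are nearly linearly dependent. The sketch uses the fact that the VIFs equal the diagonal elements of the inverse of the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
x3 = x1 + x2 + rng.normal(0, 0.1, n)  # nearly the sum of x1 and x2
X = np.column_stack([x1, x2, x3])

# Pairwise correlations: the largest off-diagonal entry in absolute value
corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.max(np.abs(corr[np.triu_indices(3, k=1)]))

# VIFs are the diagonal of the inverse correlation matrix
vifs = np.diag(np.linalg.inv(corr))

print("Largest pairwise correlation:", round(max_pairwise, 3))
print("VIFs (x1, x2, x3):", np.round(vifs, 1))
```

A heatmap of this correlation matrix would show nothing alarming, while the VIFs flag all three variables.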

Why multicollinearity does not create bias

One of the most common misconceptions is that multicollinearity biases regression coefficients.

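This is generally not true. As long as the explanatory variables are not perfectly collinear and the model is otherwise correctly specified, OLS remains unbiased.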

Multicollinearity primarily affects precision rather than accuracy. The model still estimates the correct relationship on average, but uncertainty increases substantially.

Consequently, coefficient estimates become less reliable.
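A Monte Carlo sketch can illustrate the difference between bias and imprecision. It holds the regressors fixed, redraws the error term many times, and records the estimated fertilizer coefficient in each replication. The comparison design with independently drawn irrigation, the number of replications, and the seed are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 250, 500
true_beta = np.array([2.0, 0.03, 0.01])  # constant, fertilizer, irrigation

fertilizer = rng.normal(200, 40, n)
irr_collinear = 4 * fertilizer + rng.normal(0, 20, n)  # chapter's design
irr_independent = rng.normal(800, 160, n)              # hypothetical comparison

def simulate(irrigation):
    """Mean and standard deviation of the fertilizer estimate across replications."""
    X = np.column_stack([np.ones(n), fertilizer, irrigation])
    estimates = np.empty(reps)
    for r in range(reps):
        y = X @ true_beta + rng.normal(0, 2, n)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates[r] = beta[1]
    return estimates.mean(), estimates.std(ddof=1)

mean_col, sd_col = simulate(irr_collinear)
mean_ind, sd_ind = simulate(irr_independent)

print(f"Collinear design:   mean={mean_col:.4f}  sd={sd_col:.4f}")
print(f"Independent design: mean={mean_ind:.4f}  sd={sd_ind:.4f}")
```

In both designs the average estimate should sit close to the true value of 0.03. What changes is the spread: under the collinear design, individual estimates scatter much more widely around that value.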

What can economists do?

Possible responses include:

Collect more data

Additional observations often improve precision.
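As a rough illustration, the sketch below keeps the same strongly collinear design and only changes the number of farms. The sample sizes, the seed, and the helper name `se_fertilizer` are hypothetical, and the standard error is computed from the known error standard deviation of 2 used in the chapter's simulation.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 2.0  # error standard deviation, as in the chapter's simulation

def se_fertilizer(n):
    """Theoretical standard error of the fertilizer coefficient for n farms."""
    fertilizer = rng.normal(200, 40, n)
    irrigation = 4 * fertilizer + rng.normal(0, 20, n)
    X = np.column_stack([np.ones(n), fertilizer, irrigation])
    return sigma * np.sqrt(np.linalg.inv(X.T @ X)[1, 1])

se_by_n = {n: se_fertilizer(n) for n in [250, 1000, 4000]}

for n, se in se_by_n.items():
    print(f"n={n:>5}  SE(Fertilizer)={se:.4f}")
```

The correlation between the inputs is unchanged, yet the standard error falls roughly in proportion to one over the square root of the sample size. More data does not remove the collinearity, but it can offset its effect on precision.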

Remove redundant variables

If two variables measure nearly the same concept, one may be omitted.

Combine variables

Closely related variables can sometimes be merged into an index.

Use economic theory

Theory should guide variable selection. Researchers should not remove variables solely because VIF values appear large.

Warning: Common mistake

Do not drop variables automatically because the VIF is high. Variable selection should be guided by economics, not diagnostics alone.

Common mistakes

Mistake 1: Looking only at (R^2)

A high (R^2) does not guarantee reliable coefficient estimates.

Mistake 2: Treating correlation as proof

High correlation suggests multicollinearity but does not prove it.

Mistake 3: Dropping variables automatically

Variables should not be removed solely because VIF exceeds an arbitrary threshold.

Mistake 4: Confusing bias with imprecision

Multicollinearity usually increases variance rather than creating bias.

Mistake 5: Ignoring economic theory

Variable selection should be guided by economics, not diagnostics alone.

Key takeaways

  • Multicollinearity occurs when explanatory variables contain overlapping information.
  • Highly correlated explanatory variables are difficult to separate statistically.
  • Multicollinearity increases standard errors.
  • Coefficients may become unstable and insignificant.
  • Prediction may remain strong even when interpretation becomes difficult.
  • Variance Inflation Factors provide a useful diagnostic tool.
  • Multicollinearity generally affects precision rather than bias.
  • Economic theory remains essential when choosing explanatory variables.

Looking ahead

Multicollinearity makes it difficult to isolate the separate effects of explanatory variables. In the next chapter, we examine endogeneity, which can produce biased coefficient estimates and threaten causal interpretation.