Chapter 20. Endogeneity and Instrumental Variables
Understanding omitted variable bias, reverse causality, measurement error, and the logic of instrumental variables.
Chapter purpose
Regression analysis is often used to estimate the effect of one variable on another. However, obtaining a statistically significant coefficient does not necessarily mean that the estimated relationship is causal.
One of the most serious challenges in applied econometrics is endogeneity. Endogeneity occurs when an explanatory variable is correlated with the error term. When this happens, the estimated coefficients may be biased and misleading.
In this chapter, we learn what endogeneity is, why it occurs, and how economists use instrumental variables to address the problem.
Applied question
Does education increase earnings?
Suppose we estimate the following relationship:
\[ \text{Income}_i = \beta_0 + \beta_1 \text{Education}_i + u_i \]
where income represents annual earnings and education represents years of schooling.
Most people expect education to increase earnings. However, individuals differ in many ways that are difficult to observe.
Some individuals may have:
greater motivation
better problem-solving skills
higher natural ability
stronger family support
These factors may influence both education and earnings. As a result, the estimated relationship between education and income may not reflect the true causal effect of education.
Economic background
Economists are often interested in causal questions.
Examples include:
Does education increase income?
Does fertilizer increase crop yield?
Does advertising increase sales?
Does foreign aid promote economic growth?
Does trade liberalization increase exports?
Simple correlations rarely provide convincing answers. The challenge is that many economic variables influence one another simultaneously.
As a result, causal interpretation requires caution.
Key idea
The classical regression model assumes:
\[ \operatorname{Cov}(X, u) = 0 \]
This means that the explanatory variable is unrelated to the error term.
Endogeneity occurs when:
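\[ \operatorname{Cov}(X, u) \neq 0 \]
When the assumption of zero covariance fails, OLS estimates are biased and inconsistent: the bias does not shrink as the sample grows.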
Unlike heteroskedasticity or multicollinearity, endogeneity threatens the validity of the coefficient estimate itself.
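To see why the coefficient itself is affected, it helps to write out what the OLS slope converges to in a simple regression. The following is a sketch under the assumption of a single-regressor model \( Y = \beta_0 + \beta_1 X + u \), where \( Y \) is a generic dependent variable introduced here for illustration:

\[ \hat{\beta}_1 = \frac{\widehat{\operatorname{Cov}}(X, Y)}{\widehat{\operatorname{Var}}(X)} \;\xrightarrow{p}\; \beta_1 + \frac{\operatorname{Cov}(X, u)}{\operatorname{Var}(X)} \]

If \( \operatorname{Cov}(X, u) = 0 \), the second term vanishes and the slope converges to \( \beta_1 \). If \( \operatorname{Cov}(X, u) \neq 0 \), the estimate settles on the wrong value no matter how large the sample is.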
A simple example
Suppose we estimate:
\[ \text{Income}_i = \beta_0 + \beta_1 \text{Education}_i + u_i \]
The error term contains many omitted factors:
ability
motivation
family background
social networks
Suppose more able individuals obtain more education.
Ability therefore affects education and income. Ability enters the error term because it is unobserved.
Consequently:
\[ \operatorname{Cov}(\text{Education}, u) \neq 0 \]
The OLS estimate is biased.
Understanding omitted variable bias
Omitted variable bias occurs when three conditions hold:
A relevant variable is omitted.
The omitted variable affects the dependent variable.
The omitted variable is correlated with an explanatory variable.
In our example:
Variable    Affects education?    Affects income?
Ability     Yes                   Yes
Because ability satisfies both conditions, it creates bias.
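The size and direction of the bias can be written down explicitly. As a sketch, assume the true model includes ability with coefficient \( \beta_2 \) and a well-behaved error \( \varepsilon_i \) (both symbols are introduced here for illustration):

\[ \text{Income}_i = \beta_0 + \beta_1 \text{Education}_i + \beta_2 \text{Ability}_i + \varepsilon_i \]

If ability is left out, the OLS slope on education converges to

\[ \operatorname{plim} \hat{\beta}_1 = \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(\text{Education}, \text{Ability})}{\operatorname{Var}(\text{Education})} \]

The bias is the product of two pieces: the effect of ability on income (\( \beta_2 \)) and the association between ability and education. If both are positive, as the example suggests, OLS overstates the return to education.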
Visualizing the problem
A simple causal diagram helps clarify the issue.
Ability → Education
Ability → Income
Education → Income
Family Background → Education
Family Background → Income
Interpretation
Education influences income. However, ability and family background influence both education and income.
If these variables are omitted, the estimated effect of education captures more than education alone.
Simulating endogeneity
We create a dataset where ability affects both education and income.
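One possible version of such a simulation is sketched below in Python, using only the standard library. All numerical values (a true return of 2.0 per year of schooling, an ability effect of 3.0 on income, and so on) are assumptions chosen for illustration, not estimates from real data. Because ability is left out of the regression, the naive OLS slope should settle well above the true return.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

TRUE_RETURN = 2.0     # assumed causal effect of one year of schooling
ABILITY_EFFECT = 3.0  # assumed direct effect of ability on income
N = 20_000

ability, education, income = [], [], []
for _ in range(N):
    a = random.gauss(0, 1)                      # unobserved by the researcher
    educ = 12 + 2.0 * a + random.gauss(0, 1)    # more able people study longer
    inc = 10 + TRUE_RETURN * educ + ABILITY_EFFECT * a + random.gauss(0, 2)
    ability.append(a)
    education.append(educ)
    income.append(inc)


def covariance(x, y):
    """Sample covariance of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


# OLS slope from regressing income on education alone (ability omitted)
naive_slope = covariance(education, income) / covariance(education, education)

# Slope implied by the omitted variable bias formula
implied_slope = TRUE_RETURN + ABILITY_EFFECT * (
    covariance(education, ability) / covariance(education, education)
)

print(f"True return to education: {TRUE_RETURN:.2f}")
print(f"Naive OLS slope:          {naive_slope:.2f}")
print(f"Slope implied by formula: {implied_slope:.2f}")
```

Under these assumed parameters the population value of the naive slope is 2 + 3 × (2 / 5) = 3.2, so the estimate overstates the true return by roughly 60 percent even with a very large sample.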
Instrumental variables provide one strategy for addressing endogeneity.
Valid instruments must be relevant and exogenous.
Establishing causality is often the most challenging task in applied economics.
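The two instrument conditions can be illustrated with a small simulation. The sketch below assumes a hypothetical instrument Z that shifts education (relevance: Cov(Z, Education) ≠ 0) but is unrelated to ability and to the other determinants of income (exogeneity: Cov(Z, u) = 0). With one instrument and one endogenous regressor, the IV estimator is the ratio Cov(Z, Income) / Cov(Z, Education). All parameter values are assumptions chosen for illustration.

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

TRUE_RETURN = 2.0  # assumed causal effect of one year of schooling
N = 20_000

z, education, income = [], [], []
for _ in range(N):
    ability = random.gauss(0, 1)     # unobserved confounder
    instrument = random.gauss(0, 1)  # hypothetical instrument, independent of ability
    educ = 12 + 1.0 * instrument + 2.0 * ability + random.gauss(0, 1)
    inc = 10 + TRUE_RETURN * educ + 3.0 * ability + random.gauss(0, 2)
    z.append(instrument)
    education.append(educ)
    income.append(inc)


def covariance(x, y):
    """Sample covariance of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


# OLS ignores the confounder and is biased upward in this setup
ols_estimate = covariance(education, income) / covariance(education, education)

# Relevance check: the first-stage slope of education on the instrument
first_stage_slope = covariance(z, education) / covariance(z, z)

# IV estimator: uses only the variation in education driven by the instrument
iv_estimate = covariance(z, income) / covariance(z, education)

print(f"True return:       {TRUE_RETURN:.2f}")
print(f"OLS estimate:      {ols_estimate:.2f}")
print(f"First-stage slope: {first_stage_slope:.2f}")
print(f"IV estimate:       {iv_estimate:.2f}")
```

In this setup the OLS estimate drifts toward 3.0 while the IV estimate stays close to the true return of 2.0. The result depends entirely on the instrument being exogenous by construction. With real data, exogeneity cannot be verified directly and must be argued from the economic context.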
Looking ahead
Throughout this chapter, we examined individual econometric problems that can weaken empirical conclusions. In the next chapter, we bring everything together and learn how economists evaluate model credibility.