Chapter 17. Heteroskedasticity and Robust Standard Errors

Detecting non-constant error variance and correcting inference using robust standard errors.

Chapter purpose

Regression analysis is one of the most widely used tools in applied economics. However, obtaining coefficient estimates is only the first step. Economists must also determine whether the reported standard errors, confidence intervals, and hypothesis tests are reliable.

One of the most common violations of the classical regression assumptions is heteroskedasticity. Heteroskedasticity occurs when the variability of the regression errors changes across observations. In practice, larger farms, firms, households, or countries often exhibit greater variability than smaller ones.

Ignoring heteroskedasticity can lead to misleading conclusions about statistical significance. In this chapter, we learn how to identify heteroskedasticity, formally test for it, and correct inference using robust standard errors.

Applied question

Do larger farms earn higher income?

Suppose we collect information from 200 farms and estimate the relationship between annual farm income and farm size.

Economic theory suggests that larger farms generally earn higher income. However, larger farms may also face greater uncertainty because they operate on a larger scale and are more exposed to weather shocks, market fluctuations, and production risks.

As a result, income variability may increase with farm size.

Economic background

Many econometric applications involve observations with very different scales.

Examples include:

small and large farms
small and large firms
low-income and high-income households
small and large countries

When the variability of outcomes increases with size, the assumption of constant error variance may be violated.

For example, a small farm may earn between 4,000 and 6,000 OMR annually, while a large farm may earn between 20,000 and 60,000 OMR. The average income increases with farm size, but so does the uncertainty surrounding income.

Key idea

The classical linear regression model assumes that the variance of the error term is constant across all observations:

\[ Var(u_i) = \sigma^2 \]

This assumption is known as homoskedasticity.

Heteroskedasticity occurs when the error variance changes across observations:

\[ Var(u_i) \neq \sigma^2 \]

In practical terms, the outcomes of some observations are subject to more unexplained variation than others.

Heteroskedasticity does not by itself bias the estimated regression coefficients: as long as the errors have zero mean conditional on the regressors, OLS remains unbiased. However, the conventional standard error formula is no longer valid, which leads to incorrect confidence intervals and hypothesis tests.

Simulating farm data

To illustrate heteroskedasticity, we create a synthetic dataset in which larger farms have both higher income and greater income variability.

import numpy as np
import pandas as pd

np.random.seed(4107)

n = 200
land = np.random.uniform(5, 500, n)

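# The error standard deviation is proportional to farm size
# (20 OMR per hectare), so larger farms receive noisier income draws.
error = np.random.normal(
    0,
    land * 20,
    n
)

# Income rises by 120 OMR per hectare on average, from a base of 5,000 OMR.
income = 5000 + 120 * land + error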

farm_data = pd.DataFrame({
    "Income": income,
    "Land": land
})

farm_data.head()

	Income	Land
0	24672.435314	138.996820
1	45479.198779	325.219332
2	21225.941326	116.010498
3	16007.881758	116.506283
4	11574.007515	58.509175
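
Read directly off the simulation code, the data-generating process can be written as follows. The notation is introduced here for illustration; \(u_i\) denotes the simulated error term.

\[ Income_i = 5000 + 120 \, Land_i + u_i, \qquad u_i \mid Land_i \sim N\!\left(0, \; (20 \, Land_i)^2\right) \]

The error standard deviation therefore rises from 100 OMR for a 5-hectare farm to 10,000 OMR for a 500-hectare farm, which is exactly the kind of size-dependent variability described above.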

Visualizing the relationship

Before estimating a regression model, it is always useful to inspect the data visually.

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))

plt.scatter(
    farm_data["Land"],
    farm_data["Income"],
    alpha=0.7
)

plt.xlabel("Farm Size (hectares)")
plt.ylabel("Annual Income (OMR)")
plt.title("Farm Income and Farm Size")

plt.show()

Interpretation

The graph shows a clear positive relationship between farm size and income.

Larger farms tend to earn higher income. However, the spread of observations becomes wider as farm size increases. Small farms cluster relatively closely together, whereas large farms exhibit much greater variation.

This visual pattern suggests heteroskedasticity.

Estimating the regression model

We estimate a simple linear regression model:

\[ Income_i = \beta_0 + \beta_1 Land_i + u_i \]

where \(Income_i\) is annual farm income, \(Land_i\) is farm size, and \(u_i\) is the random error term.

import statsmodels.api as sm

X = sm.add_constant(farm_data["Land"])
y = farm_data["Income"]

model = sm.OLS(y, X).fit()

print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 Income   R-squared:                       0.902
Model:                            OLS   Adj. R-squared:                  0.901
Method:                 Least Squares   F-statistic:                     1815.
Date:                Thu, 11 Jun 2026   Prob (F-statistic):          1.14e-101
Time:                        06:54:14   Log-Likelihood:                -2010.0
No. Observations:                 200   AIC:                             4024.
Df Residuals:                     198   BIC:                             4031.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       5031.2341    838.222      6.002      0.000    3378.246    6684.222
Land         120.8008      2.835     42.606      0.000     115.210     126.392
==============================================================================
Omnibus:                       15.725   Durbin-Watson:                   1.895
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               39.693
Skew:                          -0.248   Prob(JB):                     2.40e-09
Kurtosis:                       5.125   Cond. No.                         622.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interpreting the coefficient

The estimated slope coefficient is approximately 120.8, close to the value of 120 used in the simulation.

The interpretation is:

On average, each additional hectare of land is associated with approximately 121 OMR higher annual farm income.

This interpretation focuses on the average relationship. The regression coefficient does not describe the variability of income around that relationship.
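
As a small worked example of this interpretation, the sketch below plugs the coefficients reported in the regression table (about 5031.23 for the constant and 120.80 for Land) into the fitted line. The helper name predicted_income and the two farm sizes are chosen only for illustration.

```python
# Coefficients copied from the regression table in this chapter.
intercept = 5031.2341
slope = 120.8008


def predicted_income(hectares):
    """Fitted annual income (OMR) for a farm of the given size."""
    return intercept + slope * hectares


# Compare two hypothetical farms that differ by 10 hectares.
difference = predicted_income(110) - predicted_income(100)
print(f"Predicted income difference: {difference:.2f} OMR")
```

A 10-hectare difference corresponds to ten times the slope, or roughly 1,208 OMR in predicted income. The calculation says nothing about how far an individual farm may fall from that prediction.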

Residual analysis

A residual plot is often the simplest way to detect heteroskedasticity.

residuals = model.resid
fitted = model.fittedvalues

plt.figure(figsize=(8, 5))

plt.scatter(
    fitted,
    residuals,
    alpha=0.7
)

plt.axhline(
    y=0,
    linestyle="--"
)

plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Residuals versus Fitted Values")

plt.show()

Interpretation

If the homoskedasticity assumption holds, the residuals should display a roughly constant vertical spread across all fitted values.

Instead, the residual plot resembles a funnel. Residuals become increasingly dispersed as fitted values increase. This pattern is one of the most common indicators of heteroskedasticity.

The Breusch-Pagan test

Visual inspection is useful, but economists often supplement graphs with formal statistical tests.

The Breusch-Pagan test evaluates whether the error variance is systematically related to the explanatory variables. It does so by checking whether the squared residuals can be explained by the regressors.

Hypotheses

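Null hypothesis:

\[ H_0: Var(u_i) = \sigma^2 \text{ for all } i \quad \text{(homoskedasticity)} \]

Alternative hypothesis:

\[ H_1: Var(u_i) \text{ depends on the explanatory variables} \quad \text{(heteroskedasticity)} \]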

from statsmodels.stats.diagnostic import het_breuschpagan

bp_test = het_breuschpagan(
    model.resid,
    model.model.exog
)

labels = [
    "LM Statistic",
    "LM p-value",
    "F Statistic",
    "F p-value"
]

for name, value in zip(labels, bp_test):
    print(name, value)

LM Statistic 24.852888240654703
LM p-value 6.187626749033513e-07
F Statistic 28.095649549796637
F p-value 3.0665931032837663e-07

Interpretation

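A small p-value indicates evidence against the null hypothesis of constant variance.

Here the LM statistic is about 24.85 with a p-value of roughly 6.2e-07, far below 0.05. We therefore reject the null hypothesis of homoskedasticity at the 5 percent level and conclude that the error variance depends on farm size.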

Why heteroskedasticity matters

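Many students believe that heteroskedasticity biases coefficient estimates. It does not: OLS remains unbiased as long as the other classical assumptions hold, although it is no longer the most efficient linear estimator.

The primary problem is that conventional standard errors are computed under the assumption of constant variance and are therefore unreliable.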

As a result:

confidence intervals may be inaccurate
hypothesis tests may be misleading
statistically significant results may become insignificant
insignificant results may appear significant

Consequently, economists focus on correcting inference rather than changing coefficient estimates.

Robust standard errors

One of the most common solutions is to estimate robust standard errors.

Robust standard errors adjust the estimated variability of the coefficients without changing the coefficient estimates themselves.

# HC1 is the default covariance type here; naming it makes the choice explicit.
robust_model = model.get_robustcov_results(cov_type="HC1")

print(robust_model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 Income   R-squared:                       0.902
Model:                            OLS   Adj. R-squared:                  0.901
Method:                 Least Squares   F-statistic:                     1964.
Date:                Thu, 11 Jun 2026   Prob (F-statistic):          1.00e-104
Time:                        06:54:14   Log-Likelihood:                -2010.0
No. Observations:                 200   AIC:                             4024.
Df Residuals:                     198   BIC:                             4031.
Df Model:                           1                                         
Covariance Type:                  HC1                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       5031.2341    490.888     10.249      0.000    4063.194    5999.274
Land         120.8008      2.726     44.312      0.000     115.425     126.177
==============================================================================
Omnibus:                       15.725   Durbin-Watson:                   1.895
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               39.693
Skew:                          -0.248   Prob(JB):                     2.40e-09
Kurtosis:                       5.125   Cond. No.                         622.
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)

Comparing standard errors

comparison = pd.DataFrame({
    "OLS Standard Error": model.bse,
    "Robust Standard Error": robust_model.bse
})

comparison

	OLS Standard Error	Robust Standard Error
const	838.221999	490.888159
Land	2.835285	2.726142

Interpretation

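The coefficient estimates remain unchanged because robust standard errors do not alter the fitted regression line; only the estimated uncertainty surrounding the coefficients changes. Robust standard errors can be larger or smaller than conventional ones. In this sample they are smaller: 490.9 versus 838.2 for the constant and 2.726 versus 2.835 for Land.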

The purpose of robust standard errors is to provide more reliable inference when heteroskedasticity is present.

Why robust standard errors are common

Many modern empirical studies routinely report robust standard errors.

Reasons include:

heteroskedasticity is common in economic data
robust standard errors are easy to calculate
they improve the reliability of hypothesis testing
reporting robust results increases empirical credibility

For cross-sectional data, robust standard errors are often considered good practice.

Common mistakes

Mistake 1: Ignoring residual plots

Students often focus only on coefficient estimates and p-values. Residual plots frequently reveal important problems that summary tables cannot.

Mistake 2: Believing heteroskedasticity biases coefficients

The main concern is usually unreliable standard errors rather than biased coefficients.

Mistake 3: Assuming large samples eliminate heteroskedasticity

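Heteroskedasticity is a property of the errors, not of the sample size, so collecting more observations does not remove it. Larger samples do, however, make robust standard errors more reliable, because their justification is asymptotic.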

Mistake 4: Applying log transformations automatically

Logarithmic transformations sometimes reduce heteroskedasticity but should not be applied without economic justification.

Mistake 5: Reporting only conventional standard errors

When heteroskedasticity is suspected, robust standard errors are often preferable.

Key takeaways

Heteroskedasticity occurs when error variance is not constant across observations.
Larger economic units often exhibit greater variability than smaller units.
Residual plots provide a useful visual diagnostic.
The Breusch-Pagan test provides a formal statistical test.
Heteroskedasticity mainly affects standard errors and inference.
OLS coefficient estimates remain unbiased, and robust standard errors leave them unchanged.
Robust standard errors provide more reliable hypothesis tests and confidence intervals.
Careful diagnostics improve the credibility of empirical research.

Looking ahead

Heteroskedasticity involves changing error variance across observations. In the next chapter, we examine a different problem that arises in time-series data: autocorrelation.