Chapter 12. Prediction and Goodness of Fit

Chapter Purpose

Regression models can be used for prediction, but prediction is useful only when model performance is assessed carefully. This chapter uses the Milk Data model to explain fitted values, residuals, R², MAE, and RMSE.

Applied Question

In Chapter 10 we estimated:

\[ \widehat{Price}=516.6+417.0\times Volume1000 \]

Can this equation predict milk prices well?

Fitted Values

A fitted value is the price that the regression equation predicts for an observation, given its package volume:

\[ \widehat{Price}_i = 516.6 + 417.0 \times Volume1000_i \]

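import statsmodels.api as sm

# milk_data: the Milk Data DataFrame (assumed to be already loaded).
# add_constant adds the intercept column to the regressor.
X = sm.add_constant(milk_data["Volume1000"])
y = milk_data["Price"]
model = sm.OLS(y, X).fit()

# Store the fitted values next to the actual prices.
milk_data["Predicted"] = model.predict(X)
milk_data[["Price", "Predicted"]].head()

   Price    Predicted
0    490   884.955783
1   1960  2136.034010
2   1980  2044.288273
3    540   780.699264
4    490   884.955783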
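
To see where a fitted value comes from, the estimated equation can be evaluated by hand. The sketch below is only an illustration: the function name `fitted_price` and the volume of 1.0 are assumptions, not part of the Milk Data code, and because the coefficients are rounded the result will differ slightly from what `model.predict` returns.

```python
# Rounded coefficients from the estimated equation in this chapter.
INTERCEPT = 516.6
SLOPE = 417.0

def fitted_price(volume1000):
    """Return the fitted price for a package volume in Volume1000 units."""
    return INTERCEPT + SLOPE * volume1000

# Hypothetical package with Volume1000 = 1.0 (not a row from the Milk Data).
example_prediction = fitted_price(1.0)
print(round(example_prediction, 1))  # 933.6
```

The intercept alone, 516.6, is the fitted price when Volume1000 equals zero, and each additional unit of Volume1000 adds 417.0 to the prediction.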

Prediction Errors

The residual is:

\[ Residual_i = Actual_i - Predicted_i \]

milk_data["Residual"] = milk_data["Price"] - milk_data["Predicted"]

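Residuals measure prediction error for each observation. A positive residual means the model predicted a price below the actual price; a negative residual means it predicted a price above the actual price.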
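
As a small check of the definition, the residuals can be computed for the five rows displayed in the Fitted Values section. This is a sketch in plain Python; the variable names are illustrative, and the numbers are copied from that output rather than recomputed from the model.

```python
# Actual and predicted prices for the five rows shown in the Fitted Values section.
actual = [490, 1960, 1980, 540, 490]
predicted = [884.955783, 2136.034010, 2044.288273, 780.699264, 884.955783]

# Residual = Actual - Predicted, one value per row.
residuals = [a - p for a, p in zip(actual, predicted)]

for a, p, r in zip(actual, predicted, residuals):
    print(f"actual={a:5d}  predicted={p:10.2f}  residual={r:9.2f}")
```

All five residuals are negative, so for these rows the fitted line lies above the observed price.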

Goodness of Fit

The model produces:

\[ R^2 = 0.274 \]

Package volume explains approximately 27.4% of the variation in milk prices. The remaining 72.6% is left unexplained by this simple model; it reflects other factors that the model does not include, together with random variation.

model.rsquared
np.float64(0.27387839998291263)
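
The value reported by `model.rsquared` can be understood as one minus the ratio of unexplained to total variation. The sketch below computes that ratio by hand; the function name `r_squared` and the toy numbers are assumptions made for illustration and are not taken from the Milk Data.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Unexplained variation: squared residuals.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation: squared deviations from the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical toy numbers, not Milk Data values.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]
print(r_squared(actual, predicted))

# Predicting the mean for every observation explains nothing.
print(r_squared(actual, [2.5, 2.5, 2.5, 2.5]))
```

For an ordinary least squares model with an intercept, applying this function to the actual and fitted prices should reproduce `model.rsquared` up to floating-point error.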

Prediction Accuracy

Using the actual Milk Data regression:

\[ MAE = 437.6 \]

\[ RMSE = 772.0 \]

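The MAE says that, on average, a prediction misses the actual price by about 438 price units. The RMSE is larger because it squares the errors before averaging, which gives large errors more weight. The wide gap between 772 and 438 suggests that some prices are predicted much worse than the typical one.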

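# root_mean_squared_error requires scikit-learn 1.4 or later.
from sklearn.metrics import mean_absolute_error, root_mean_squared_error

mae = mean_absolute_error(milk_data["Price"], milk_data["Predicted"])
rmse = root_mean_squared_error(milk_data["Price"], milk_data["Predicted"])
mae, rmse

(437.5678085533861, 772.0000524580844)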
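
To illustrate why RMSE reacts more strongly to large errors than MAE, the sketch below computes both measures by hand for two hypothetical sets of prediction errors. The function names and the error values are assumptions chosen for the example; they are not Milk Data residuals.

```python
import math

def mae(errors):
    """Mean absolute error: average size of the errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error: square, average, then take the square root."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Hypothetical errors: same total absolute error, spread differently.
even_errors = [1, -1, 1, -1]     # four moderate misses
one_big_error = [0, 0, 0, 4]     # three perfect predictions, one large miss

print(mae(even_errors), rmse(even_errors))        # 1.0 1.0
print(mae(one_big_error), rmse(one_big_error))    # 1.0 2.0
```

Both sets have the same MAE, but concentrating the error in a single observation doubles the RMSE. RMSE is never smaller than MAE, and the two are equal only when all errors have the same absolute size.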

Actual Versus Predicted Plot

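import matplotlib.pyplot as plt

# Actual prices on the horizontal axis, fitted values on the vertical axis.
plt.scatter(milk_data["Price"], milk_data["Predicted"])
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual versus Predicted Prices")
plt.show()

If predictions were perfect, every point would lie on the 45-degree line where the predicted price equals the actual price.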

Residual Plot

plt.scatter(milk_data["Predicted"], milk_data["Residual"])
plt.axhline(y=0, linestyle="--")
plt.xlabel("Predicted Price")
plt.ylabel("Residual")
plt.title("Residual Plot")
plt.show()

A good residual plot shows points scattered randomly around the zero line, with no curve, funnel shape, or other systematic pattern.
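
A systematic pattern can also be checked numerically. One simple approach, sketched below under stated assumptions, is to sort the observations by predicted value, split them into a few groups, and compare the average residual in each group. The helper names `fit_line` and `mean_residual_by_group` and the curved toy data are hypothetical; this is not a procedure used elsewhere in the chapter.

```python
def fit_line(x, y):
    """Ordinary least squares with one regressor; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope

def mean_residual_by_group(predicted, residuals, groups=3):
    """Sort by predicted value, split into equal-sized groups, average residuals.

    Observations left over when the count is not divisible by `groups` are dropped.
    """
    ordered = [r for _, r in sorted(zip(predicted, residuals))]
    size = len(ordered) // groups
    return [sum(ordered[i * size:(i + 1) * size]) / size for i in range(groups)]

# Hypothetical curved data (y = x^2) fitted with a straight line.
x = [i / 10 for i in range(21)]
y = [xi ** 2 for xi in x]

intercept, slope = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

group_means = mean_residual_by_group(predicted, residuals)
print([round(m, 3) for m in group_means])
```

Here the group averages are positive, negative, then positive, which is the numerical signature of a curve that a straight line misses. When the residuals are free of such a pattern, the group averages should all be close to zero.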

Prediction Versus Explanation

Prediction and explanation are related but not identical. Prediction focuses on accuracy. Explanation focuses on economic interpretation. A model can predict well without being causal, and a model can be useful for explanation even with modest prediction power.

Warning

A high R² does not prove causality and does not guarantee that the model is correctly specified.

What We Learned From the Milk Data

  • R² = 0.274
  • MAE = 437.6
  • RMSE = 772.0

Volume is useful but not sufficient. On its own it leaves nearly three quarters of the variation in milk prices unexplained, so additional variables are needed to improve both prediction and explanation.

Key Takeaways

  • Fitted values are predicted values.
  • Residuals measure prediction errors.
  • Volume explains approximately 27.4% of milk price variation.
  • MAE is approximately 437.6 price units.
  • RMSE is approximately 772.0 price units.
  • Prediction and explanation are distinct objectives.