Chapter 25. XGBoost and Model Comparison

A practical introduction to boosting, XGBoost, RMSE, MAE, R-squared, and test-sample model comparison.

Chapter purpose

Random Forests improve prediction by combining many independent trees. Boosting follows a different logic. It builds trees sequentially, allowing each new tree to learn from previous prediction errors.

One of the most widely used boosting algorithms is XGBoost, short for Extreme Gradient Boosting.

This chapter introduces XGBoost and shows how to compare predictive models using test-sample performance.

Applied question

Which model predicts milk prices most accurately: linear regression, decision trees, Random Forests, or XGBoost?

Key idea

Different models can produce different predictions from the same data. Rather than choosing a model because it is popular, we compare models using out-of-sample prediction performance.
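To see why out-of-sample performance matters, the sketch below compares training and test error for a single fully grown decision tree. The data are synthetic and only illustrative (they are not the chapter's milk price data), and the seed and sample sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: an assumption for illustration, not the chapter's dataset
rng = np.random.default_rng(4107)
X = rng.uniform(0, 10, size=(400, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 2.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4107
)

# A fully grown tree can memorise the training sample, noise included
tree = DecisionTreeRegressor(random_state=4107).fit(X_train, y_train)

rmse_train = mean_squared_error(y_train, tree.predict(X_train)) ** 0.5
rmse_test = mean_squared_error(y_test, tree.predict(X_test)) ** 0.5

print(f"Training RMSE: {rmse_train:.3f}")
print(f"Test RMSE:     {rmse_test:.3f}")
```

The training error flatters the model because the tree has already seen those observations. The test error is the more honest guide to how the model would perform on new data.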

[Figure: Boosting and model comparison diagram.]

Minimal concept

Random Forest:

Many trees → independent learning → average predictions

XGBoost:

Tree 1 → learn from mistakes → Tree 2 → learn from mistakes → Tree 3 → final prediction

25.1 Why boosting was developed

Decision Trees are simple but unstable. Random Forests improve stability by averaging many trees. However, Random Forests do not explicitly focus on correcting earlier mistakes.
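As a rough check on the averaging idea, the sketch below fits a small Random Forest with scikit-learn and compares the forest's prediction with the simple average of its individual trees. The data are synthetic and the names (`forest`, `tree_preds`) are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: an assumption for illustration only
rng = np.random.default_rng(4107)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.3, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=4107).fit(X, y)

# Predictions from each individual tree for the first five observations,
# stacked into an array of shape (n_trees, 5)
tree_preds = np.array([t.predict(X[:5]) for t in forest.estimators_])

print("Spread across trees (std):", tree_preds.std(axis=0).round(3))
print("Average of the trees:     ", tree_preds.mean(axis=0).round(3))
print("Forest prediction:        ", forest.predict(X[:5]).round(3))
```

Individual trees disagree with one another, but their average is what the forest reports. None of the trees is told anything about the mistakes made by the others.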

Boosting takes a different approach. Each new tree is built to correct the errors made by the trees before it. Instead of working independently, the trees cooperate.

This sequential learning process often produces highly accurate predictions.

25.2 What is XGBoost?

XGBoost is an efficient implementation of gradient boosting. It is designed to improve prediction accuracy, limit overfitting through regularization, handle large datasets efficiently, and capture nonlinear relationships.

A simplified learning process is:

  1. Build a small tree.
  2. Calculate prediction errors.
  3. Build another tree focused on those errors.
  4. Repeat many times.
  5. Combine all trees into a final prediction.
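The steps above can be sketched by hand with scikit-learn trees. This is a minimal illustration of boosting on residuals with a squared-error loss, not XGBoost's actual algorithm, which adds regularization and further optimizations. The synthetic data, the learning rate, and the number of rounds are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: an assumption for illustration only
rng = np.random.default_rng(4107)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=300)

learning_rate = 0.1   # how much of each correction is applied (assumed value)
n_rounds = 50         # number of small trees (assumed value)

# Start from a constant prediction: the sample mean
prediction = np.full_like(y, y.mean())
trees = []
mse_path = []

for _ in range(n_rounds):
    residuals = y - prediction                       # step 2: current errors
    tree = DecisionTreeRegressor(max_depth=2, random_state=4107)
    tree.fit(X, residuals)                           # step 3: tree fitted to the errors
    prediction += learning_rate * tree.predict(X)    # step 5: add a damped correction
    trees.append(tree)
    mse_path.append(np.mean((y - prediction) ** 2))  # training MSE after this round

print(f"Training MSE after 1 round:   {mse_path[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_path[-1]:.4f}")
```

Each tree is deliberately small (`max_depth=2`). The accuracy comes from adding many small corrections, not from any single tree.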

Unlike linear regression models, XGBoost does not produce easily interpretable coefficients. Its primary objective is prediction.

25.3 Optional XGBoost code for students

The website version does not execute XGBoost during rendering to keep deployment lightweight. Students may run the optional XGBoost code in Google Colab if the package is installed.

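from xgboost import XGBRegressor

# X_train, y_train, and X_test are assumed to come from the same
# train/test split used for the other models.
xgb_model = XGBRegressor(
    n_estimators=200,     # number of trees built in sequence
    learning_rate=0.05,   # how strongly each new tree adjusts the running prediction
    max_depth=4,          # keeps each individual tree small
    random_state=4107     # fixes the random seed for reproducibility
)

# Fit on the training sample, then predict for the held-out test sample
xgb_model.fit(X_train, y_train)
xgb_predictions = xgb_model.predict(X_test)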

Interpretation

The model creates many small trees. Each tree attempts to improve on previous predictions. The final prediction combines information from all trees. Because the trees are combined into an ensemble, XGBoost is usually less transparent than a regression model with a small number of interpretable coefficients.
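If the xgboost package is not available, scikit-learn's `GradientBoostingRegressor` can serve as a stand-in for exploring the same idea. It is a different implementation of gradient boosting, not XGBoost itself, so its results will not match exactly. The sketch below reuses the hyperparameter values from the XGBoost code on synthetic data (an assumption for illustration) and uses `staged_predict` to track test error as trees are added.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: an assumption for illustration only
rng = np.random.default_rng(4107)
X = rng.uniform(0, 10, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4107
)

# Stand-in for XGBRegressor with the same headline settings
gb_model = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=4, random_state=4107
)
gb_model.fit(X_train, y_train)

# staged_predict yields the test predictions after 1, 2, ..., 200 trees
staged_rmse = [
    mean_squared_error(y_test, pred) ** 0.5
    for pred in gb_model.staged_predict(X_test)
]

for n_trees in (1, 50, 200):
    print(f"Test RMSE with {n_trees:>3} trees: {staged_rmse[n_trees - 1]:.3f}")
```

With a small learning rate, the first tree barely moves the prediction away from the sample mean. Most of the improvement comes from accumulating many small steps.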

25.4 Measuring prediction performance

The most common performance measures are root mean squared error (RMSE), mean absolute error (MAE), and test-sample \(R^2\).

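\[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Actual_i - Predicted_i)^2} \]

\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |Actual_i - Predicted_i| \]

Here \(n\) is the number of observations in the test sample.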

For predictive analysis, RMSE and MAE are often more informative than \(R^2\) because they measure prediction error directly, in the same units as the outcome variable.
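A small worked example may help make the measures concrete. The four actual and predicted values below are invented for illustration. The calculation is done by hand with NumPy and then checked against scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented values for illustration only
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 9.0, 9.0])

errors = actual - predicted                  # [1, 0, -2, 0]
rmse = np.sqrt(np.mean(errors ** 2))         # sqrt((1 + 0 + 4 + 0) / 4)
mae = np.mean(np.abs(errors))                # (1 + 0 + 2 + 0) / 4

# R-squared: 1 minus (sum of squared errors / total variation around the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}, R2 = {r2:.3f}")
# RMSE = 1.118, MAE = 0.750, R2 = 0.750

# The hand calculations agree with scikit-learn
assert np.isclose(rmse, mean_squared_error(actual, predicted) ** 0.5)
assert np.isclose(mae, mean_absolute_error(actual, predicted))
assert np.isclose(r2, r2_score(actual, predicted))
```

RMSE is larger than MAE here because squaring gives the single large error (2) more weight than the small one (1).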

25.5 Comparing multiple models

The executable website example compares three lightweight models already supported by the site-build environment:

  1. Linear Regression
  2. Decision Tree
  3. Random Forest

XGBoost remains part of the conceptual comparison, but it is not executed during website rendering.

import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

models = {
    "Linear Regression": ols_predictions,
    "Decision Tree": tree_predictions,
    "Random Forest": forest_predictions
}

results = []

for name, pred in models.items():
    rmse = mean_squared_error(y_test, pred) ** 0.5
    mae = mean_absolute_error(y_test, pred)
    r2 = r2_score(y_test, pred)
    results.append([name, rmse, mae, r2])

results_df = pd.DataFrame(
    results,
    columns=["Model", "RMSE", "MAE", "R2"]
)

print(results_df)
               Model        RMSE         MAE        R2
0  Linear Regression  874.205098  487.462162  0.259411
1      Decision Tree  449.050208  371.252731  0.804593
2      Random Forest  412.600757  302.753016  0.835028

Static teaching results table:

Model              Example RMSE  Example MAE  Example R²  Website rendering status
Linear Regression  0.31          0.23         0.71        Executed with scikit-learn.
Decision Tree      0.28          0.20         0.75        Executed with scikit-learn.
Random Forest      0.22          0.16         0.84        Executed with scikit-learn.
XGBoost            0.19          0.14         0.88        Shown conceptually, not executed on this website.

Interpretation

The executable comparison shows how RMSE, MAE, and test-sample \(R^2\) can be calculated for several models using the same test sample. XGBoost can be added by students in a notebook environment when the package is available. A lower prediction error does not necessarily mean the model is the most useful model for economic explanation.
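One way students might add XGBoost to the comparison in a notebook is sketched below. It is self-contained and uses synthetic data (an assumption, not the chapter's dataset), and it skips XGBoost gracefully if the package is not installed. The dictionary name `candidates` is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: an assumption for illustration only
rng = np.random.default_rng(4107)
X = rng.uniform(0, 10, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.3, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4107
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=4107),
}

# Add XGBoost only when the package is available
try:
    from xgboost import XGBRegressor
    candidates["XGBoost"] = XGBRegressor(
        n_estimators=200, learning_rate=0.05, max_depth=4, random_state=4107
    )
except ImportError:
    print("xgboost is not installed; comparing the remaining models only.")

rows = []
for name, model in candidates.items():
    # Every model is fitted on the same training sample
    # and scored on the same test sample
    pred = model.fit(X_train, y_train).predict(X_test)
    rows.append({
        "Model": name,
        "RMSE": mean_squared_error(y_test, pred) ** 0.5,
        "MAE": mean_absolute_error(y_test, pred),
        "R2": r2_score(y_test, pred),
    })

# Sort so the model with the smallest test RMSE appears first
results_df = pd.DataFrame(rows).sort_values("RMSE").reset_index(drop=True)
print(results_df)
```

Scoring every model on the same held-out sample is what makes the rows of the table comparable.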

25.6 Visual comparison of models

import matplotlib.pyplot as plt

plt.figure(figsize=(7, 5))
plt.bar(results_df["Model"], results_df["RMSE"])
plt.title("Prediction Error by Model")
plt.ylabel("RMSE")
plt.show()

The shortest bar represents the smallest prediction error.

25.7 Why simpler models still matter

It is tempting to always choose the model with the lowest error. But simpler models remain valuable.

If a policymaker asks how much package volume affects price, a regression model may be more appropriate because coefficients can be interpreted. If a supermarket asks what price to expect next week, prediction accuracy may be more important.
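The sketch below illustrates what "coefficients can be interpreted" means in practice. The data are synthetic: `volume` and `price` are hypothetical variables generated with a known slope of 0.8, so the numbers say nothing about real milk prices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship (assumption for illustration)
rng = np.random.default_rng(4107)
volume = rng.uniform(0.5, 3.0, size=200)                    # hypothetical package volume
price = 1.0 + 0.8 * volume + rng.normal(0, 0.1, size=200)   # true slope is 0.8

ols = LinearRegression().fit(volume.reshape(-1, 1), price)

# One number summarises the estimated relationship
print(f"Estimated change in price per unit of volume: {ols.coef_[0]:.2f}")
```

A tree ensemble fitted to the same data could predict price well, but it would not return a single slope that answers the policymaker's question this directly.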

The best predictive model is not always the best explanatory model.

25.8 Beyond accuracy

Prediction accuracy is important, but it is not the only consideration. Researchers should also consider interpretability, computational cost, transparency, reproducibility, and economic relevance.

A small improvement in RMSE may not justify a substantial increase in complexity.

Warning: Common mistake

Do not declare the model with the highest \(R^2\) as the best model without considering RMSE, MAE, interpretability, and the purpose of the analysis.

Key takeaway

  • XGBoost builds trees sequentially.
  • Each new tree learns from previous prediction errors.
  • XGBoost often produces accurate predictions.
  • RMSE and MAE are useful measures of predictive performance.
  • Model comparison should use test-sample results.
  • Better prediction does not imply better economic explanation.

Looking ahead

In the next chapter, we examine feature importance and discuss what machine learning models can and cannot tell us about economic relationships.