Chapter 15. Dummy Variables and Categorical Factors

Chapter Purpose

Many product characteristics are qualitative rather than numerical. Brand, fat content, package type, location, and flavor cannot enter regression directly unless they are converted into numerical form. Dummy variables solve this problem.

This chapter uses the actual Milk Data dataset to estimate how brand and fat content are associated with milk prices after controlling for package volume.

Applied Question

Do brand and fat content help explain milk prices after controlling for package volume?

What Is a Dummy Variable?

A dummy variable takes only two values:

\[ 0 \quad \text{or} \quad 1 \]

It indicates whether a characteristic is absent or present.
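
As a minimal sketch of this encoding, the snippet below builds a 0/1 indicator for low-fat products with pandas. The four-row frame `toy` and the column name `LowFat` are hypothetical illustrations, not rows or columns of the Milk Data dataset.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Milk Data dataset.
toy = pd.DataFrame({"Fat": ["Full", "Low", "Full", "Other"]})

# 1 if the product is low fat, 0 otherwise.
toy["LowFat"] = (toy["Fat"] == "Low").astype(int)

print(toy)
```

In the regression later in this chapter the dummies are not built by hand; the formula term `C(...)` creates them automatically.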

Reference Categories

In this chapter, the reference categories are:

Variable    Reference category
--------    ------------------
Brand       Almarai
Fat         Full

All categorical coefficients are interpreted relative to these reference groups.

Note

Dummy coefficients are not absolute prices. They are differences relative to the reference category.

Regression Model

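\[ Price_i = \beta_0 + \beta_1 Volume1000_i + \sum_{j} \gamma_j \, Brand_{ij} + \sum_{k} \delta_k \, Fat_{ik} + u_i \]

Here \(Brand_{ij}\) equals 1 if product \(i\) belongs to brand \(j\) and 0 otherwise, with Almarai omitted as the reference. \(Fat_{ik}\) is defined the same way, with Full omitted.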

Python Implementation

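import statsmodels.formula.api as smf

# milk_data is assumed to be a pandas DataFrame holding the Milk Data dataset.
# Treatment(reference=...) sets the omitted (reference) category of each factor.
model_dummy = smf.ols(
    'Price ~ Volume1000 + C(Brand, Treatment(reference="Almarai")) + C(Fat, Treatment(reference="Full"))',
    data=milk_data
).fit()
print(model_dummy.summary())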
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Price   R-squared:                       0.490
Model:                            OLS   Adj. R-squared:                  0.445
Method:                 Least Squares   F-statistic:                     10.81
Date:                Thu, 11 Jun 2026   Prob (F-statistic):           2.28e-24
Time:                        06:53:53   Log-Likelihood:                -2035.8
No. Observations:                 258   AIC:                             4116.
Df Residuals:                     236   BIC:                             4194.
Df Model:                          21                                         
Covariance Type:            nonrobust                                         
========================================================================================================================
                                                           coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------------
Intercept                                              317.1692    115.421      2.748      0.006      89.782     544.556
C(Brand, Treatment(reference="Almarai"))[T.Activia]   -139.4504    684.833     -0.204      0.839   -1488.618    1209.717
C(Brand, Treatment(reference="Almarai"))[T.AlAin]      279.3965    180.433      1.548      0.123     -76.068     634.861
C(Brand, Treatment(reference="Almarai"))[T.AlMudish]  -505.3474    277.583     -1.821      0.070   -1052.205      41.511
C(Brand, Treatment(reference="Almarai"))[T.AlRawabi]  -126.8266    351.853     -0.360      0.719    -820.001     566.348
C(Brand, Treatment(reference="Almarai"))[T.AlSafi]     231.8314    690.133      0.336      0.737   -1127.777    1591.440
C(Brand, Treatment(reference="Almarai"))[T.Alpro]      686.6883    685.323      1.002      0.317    -663.444    2036.820
C(Brand, Treatment(reference="Almarai"))[T.Arla]       474.0133    491.535      0.964      0.336    -494.343    1442.370
C(Brand, Treatment(reference="Almarai"))[T.Balade]     145.7496    351.657      0.414      0.679    -547.037     838.537
C(Brand, Treatment(reference="Almarai"))[T.Hayatna]     21.0844    402.556      0.052      0.958    -771.977     814.146
C(Brand, Treatment(reference="Almarai"))[T.KDD]        278.0683    685.246      0.406      0.685   -1071.913    1628.049
C(Brand, Treatment(reference="Almarai"))[T.Koita]      603.5810    488.556      1.235      0.218    -358.908    1566.070
C(Brand, Treatment(reference="Almarai"))[T.Lacnor]    -276.2208    146.455     -1.886      0.061    -564.747      12.305
C(Brand, Treatment(reference="Almarai"))[T.Lulu]       -96.7056    161.223     -0.600      0.549    -414.326     220.914
C(Brand, Treatment(reference="Almarai"))[T.Marmum]    -242.4741    350.856     -0.691      0.490    -933.685     448.736
C(Brand, Treatment(reference="Almarai"))[T.Mazoon]    -203.8579    145.869     -1.398      0.164    -491.229      83.514
C(Brand, Treatment(reference="Almarai"))[T.Nada]      -159.2864    262.207     -0.607      0.544    -675.851     357.278
C(Brand, Treatment(reference="Almarai"))[T.Other]      815.9841    144.863      5.633      0.000     530.594    1101.374
C(Brand, Treatment(reference="Almarai"))[T.Watani]     266.6087    256.795      1.038      0.300    -239.294     772.511
C(Fat, Treatment(reference="Full"))[T.Low]              38.3926    103.947      0.369      0.712    -166.390     243.175
C(Fat, Treatment(reference="Full"))[T.Other]           589.0777    152.796      3.855      0.000     288.059     890.096
Volume1000                                             432.7499     39.547     10.943      0.000     354.840     510.660
==============================================================================
Omnibus:                      149.606   Durbin-Watson:                   1.773
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1037.496
Skew:                           2.286   Prob(JB):                    5.14e-226
Kurtosis:                      11.695   Cond. No.                         33.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Actual Regression Results

The regression uses 258 milk products.

\[ R^2 = 0.490 \]

\[ Adjusted\;R^2 = 0.445 \]

The model explains about 49.0% of the variation in milk prices.
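
The adjusted value can be reproduced from the summary output. The sketch below applies the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1), with n = 258 observations and k = 21 slope coefficients read from the output. Because it starts from the rounded R², it should be expected to match the reported figure only to about three decimals.

```python
# Values read from the regression summary.
r_squared = 0.490
n_obs = 258
n_slopes = 21  # "Df Model" in the summary

adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_slopes - 1)
print(round(adj_r_squared, 3))
```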

Main Coefficients

Variable      Coefficient    Std. error    p-value
----------    -----------    ----------    -------
Intercept          317.17        115.42      0.006
Volume1000         432.75         39.55     <0.001
Fat: Low            38.39        103.95      0.712
Fat: Other         589.08        152.80     <0.001

Interpretation of Volume

Holding brand and fat content constant, a 1,000 ml increase in package volume is associated with an average price increase of about 433 price units.
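
To see what the volume coefficient implies, the sketch below computes fitted prices for the reference product (Almarai, full fat) at two package sizes, using the rounded intercept and slope from the results. It assumes Volume1000 measures volume in thousands of millilitres, which is how the interpretation above reads it. The function name `fitted_price` is illustrative.

```python
INTERCEPT = 317.17      # fitted price at zero volume for Almarai, full fat
SLOPE_VOLUME = 432.75   # price change per additional 1,000 ml

def fitted_price(volume_ml: float) -> float:
    """Fitted price for an Almarai full-fat product of the given volume."""
    return INTERCEPT + SLOPE_VOLUME * (volume_ml / 1000)

one_litre = fitted_price(1000)
two_litres = fitted_price(2000)
print(round(one_litre, 2), round(two_litres, 2))
print(round(two_litres - one_litre, 2))  # equals the volume coefficient
```

The gap between the two fitted prices is the slope itself, which is what "holding brand and fat content constant" means in practice.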

Interpreting Fat Coefficients

The reference group is Full fat.

Holding volume and brand constant, low-fat products are estimated to be about 38 price units more expensive than full-fat products, but this difference is not statistically significant (p = 0.712).

Products in the Other fat category are associated with prices about 589 price units higher than full-fat products, holding volume and brand constant. This result is statistically significant.
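
The significance statements above follow from the ratio of each coefficient to its standard error. As a quick check, the sketch below recomputes the two t-statistics from the values in the regression output. With 236 residual degrees of freedom, an absolute value above roughly 1.97 corresponds to significance at the 5% level.

```python
# (coefficient, standard error) pairs copied from the summary output.
fat_estimates = {
    "Low": (38.3926, 103.947),
    "Other": (589.0777, 152.796),
}

t_stats = {level: coef / se for level, (coef, se) in fat_estimates.items()}
for level, t in t_stats.items():
    print(f"Fat {level}: t = {t:.3f}")
```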

Brand Coefficients

The reference brand is Almarai.

Brand       Coefficient    Std. error    p-value
--------    -----------    ----------    -------
Activia         -139.45        684.83      0.839
AlAin            279.40        180.43      0.123
AlMudish        -505.35        277.58      0.070
AlRawabi        -126.83        351.85      0.719
AlSafi           231.83        690.13      0.737
Alpro            686.69        685.32      0.317
Arla             474.01        491.53      0.336
Balade           145.75        351.66      0.679
Hayatna           21.08        402.56      0.958
KDD              278.07        685.25      0.685
Koita            603.58        488.56      0.218
Lacnor          -276.22        146.45      0.061
Lulu             -96.71        161.22      0.549
Marmum          -242.47        350.86      0.490
Mazoon          -203.86        145.87      0.164
Nada            -159.29        262.21      0.544
Other            815.98        144.86     <0.001
Watani           266.61        256.79      0.300

Interpreting Brand Effects

Every brand coefficient is interpreted relative to Almarai.

The Other brand group is associated with prices about 816 price units higher than Almarai products, holding volume and fat content constant. This result is statistically significant.

Lacnor products are estimated to be about 276 price units cheaper than Almarai products, holding volume and fat content constant. This result is significant at the 10% level but not at the 5% level (p = 0.061).

AlMudish products are estimated to be about 505 price units cheaper than Almarai products. This result is likewise significant only at the 10% level (p = 0.070).
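
Because each coefficient is a difference from Almarai, the estimated gap between any two non-reference brands is the difference of their coefficients. The sketch below uses rounded estimates from the brand table to compare Other with Lacnor. It recovers point estimates only: the standard error of such a difference depends on the covariance between the two estimates, which the summary table does not report. The helper name `brand_gap` is illustrative.

```python
# Brand coefficients relative to Almarai, copied from the brand table.
brand_coef = {"Almarai": 0.0, "Lacnor": -276.22, "AlMudish": -505.35, "Other": 815.98}

def brand_gap(brand_a: str, brand_b: str) -> float:
    """Estimated price difference (brand_a minus brand_b), same volume and fat."""
    return brand_coef[brand_a] - brand_coef[brand_b]

print(round(brand_gap("Other", "Lacnor"), 2))
print(round(brand_gap("Almarai", "Lacnor"), 2))
```

Choosing a different reference brand would shift every brand coefficient by the same constant and leave these pairwise gaps unchanged.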

Important Warning About Small Brand Groups

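Some brands appear only a few times in the dataset, so their coefficients are estimated imprecisely. Activia, AlSafi, Alpro, and KDD, for example, each have a standard error of roughly 685 to 690 price units, more than four times that of Lacnor or Mazoon (about 146), and their 95% confidence intervals span more than 2,500 price units. Standard errors of this size are consistent with very small groups, and such coefficients should be interpreted cautiously.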

Warning

Do not overinterpret dummy coefficients for categories with very few observations.
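
One way to see which categories are thin is to count observations per brand before reading the coefficients. The sketch below does this on a small hypothetical frame; with the chapter's data, the same call would presumably be applied to `milk_data["Brand"]`. The cutoff of 5 is an arbitrary illustration, not a rule from this chapter.

```python
import pandas as pd

# Hypothetical brand column; counts do not reflect the Milk Data dataset.
toy = pd.DataFrame({"Brand": ["Almarai"] * 8 + ["Lacnor"] * 6 + ["KDD"] * 1})

counts = toy["Brand"].value_counts()
min_count = 5  # illustrative cutoff
small_brands = counts[counts < min_count].index.tolist()

print(counts)
print("Brands with few observations:", small_brands)
```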

Why R² Increased

The volume-only model had:

\[ R^2 = 0.274 \]

After adding brand and fat:

\[ R^2 = 0.490 \]

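R² never decreases when regressors are added, so the increase alone does not show that brand and fat matter. The adjusted R², which penalizes the 20 additional coefficients, is 0.445 and still well above the volume-only R² of 0.274. This indicates that brand and fat content contain useful economic information.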
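
A formal way to compare the volume-only model with the full model is a joint F-test of the added dummies. The sketch below shows the mechanics on simulated data; the frame `toy`, its coefficients, and the noise level are invented for illustration and are not the Milk Data results. It uses the `compare_f_test` method of a fitted statsmodels OLS result to test the full model against the restricted one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration only; not the Milk Data dataset.
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "Volume1000": rng.uniform(0.2, 2.0, n),
    "Brand": np.tile(["Almarai", "Lacnor", "Other"], n // 3),
    "Fat": np.tile(["Full", "Low"], n // 2),
})
brand_shift = toy["Brand"].map({"Almarai": 0.0, "Lacnor": -250.0, "Other": 800.0})
toy["Price"] = 300 + 430 * toy["Volume1000"] + brand_shift + rng.normal(0, 200, n)

restricted = smf.ols("Price ~ Volume1000", data=toy).fit()
full = smf.ols(
    'Price ~ Volume1000 + C(Brand, Treatment(reference="Almarai"))'
    ' + C(Fat, Treatment(reference="Full"))',
    data=toy,
).fit()

# H0: all added brand and fat coefficients are zero.
f_value, p_value, df_diff = full.compare_f_test(restricted)
print(f"R2 restricted = {restricted.rsquared:.3f}, R2 full = {full.rsquared:.3f}")
print(f"F = {f_value:.2f}, p = {p_value:.3g}, added parameters = {int(df_diff)}")
```

A small p-value here would indicate that the added dummies jointly improve the fit by more than chance alone would suggest.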

The Dummy Variable Trap

If a categorical variable has \(K\) categories, include only \(K-1\) dummy variables. The omitted category becomes the reference group. Including all \(K\) dummies together with an intercept creates perfect multicollinearity, because the dummies sum to the intercept column.
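
The collinearity can be checked numerically. In the sketch below, a design matrix with an intercept and a dummy for each of three hypothetical fat categories has four columns but rank three; dropping one dummy restores full column rank. The labels are made up for illustration.

```python
import numpy as np

# Hypothetical category labels for six products.
fat = np.array(["Full", "Low", "Other", "Full", "Low", "Other"])
levels = ["Full", "Low", "Other"]

intercept = np.ones((len(fat), 1))
all_dummies = np.column_stack([(fat == lv).astype(float) for lv in levels])

trap = np.hstack([intercept, all_dummies])       # intercept + K dummies
ok = np.hstack([intercept, all_dummies[:, 1:]])  # intercept + K-1 dummies ("Full" omitted)

print(trap.shape[1], np.linalg.matrix_rank(trap))  # number of columns vs rank
print(ok.shape[1], np.linalg.matrix_rank(ok))
```

When the rank is below the number of columns, the OLS coefficients are not uniquely determined, which is why one category must be omitted.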

Key Takeaways

  • Dummy variables allow qualitative characteristics to enter regression models.
  • The reference brand is Almarai.
  • The reference fat category is Full.
  • Holding brand and fat constant, a 1,000 ml increase in volume is associated with a price increase of about 433 units.
  • Adding brand and fat increases R² from 0.274 to 0.490.
  • The coefficients on the Other fat category and the Other brand group are statistically significant (p < 0.001).
  • Some brand coefficients are imprecise because of small category counts.