Chapter 5. Descriptive Statistics for Food and Agricultural Data
NREC4107 - Applied Econometrics
Opening purpose
This chapter updates the descriptive statistics using the actual course milk dataset:
Milk_Data_S2025n.csv
The purpose is to summarize the dataset before drawing graphs or estimating models. Descriptive statistics help us understand what is in the data, how variables are measured, and whether there are values that require attention.
This chapter does not assume anything beyond what is observed in the dataset.
Applied question
What does the milk dataset contain, and how can we summarize it before visualization and regression?
Key idea
Descriptive statistics are the first step in applied econometrics. They help us answer basic questions such as:
- How many observations are in the dataset?
- Which variables are numeric?
- Which variables are categorical?
- Are there missing values?
- Are there repeated rows?
- How are price, size, pieces, and volume distributed?
- Are there unusually high or low values that should be inspected?
Descriptive statistics describe the data. They do not explain causality.
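The questions above map onto a handful of pandas calls. The sketch below is only an illustration: the `toy` table and its values are invented here and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only (not the course dataset).
toy = pd.DataFrame({
    "Brand": ["A", "B", "A", "A"],
    "Price": [500.0, 850.0, 500.0, float("nan")],
    "Size": [200, 1000, 200, 1000],
})

n_rows, n_cols = toy.shape                                   # observations and variables
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
n_missing = int(toy.isnull().sum().sum())                    # total number of missing cells
n_duplicates = int(toy.duplicated().sum())                   # rows identical to an earlier row

print(n_rows, n_cols, numeric_cols, text_cols, n_missing, n_duplicates)
```

The same calls can be applied to `milk_data` once it has been loaded.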
Loading the dataset in Google Colab
The examples below assume that the dataset is stored in Google Drive as:
../data/Milk_Data_S2025n.csv
Students should change the file path if they saved the dataset somewhere else.
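# Mount Google Drive so the notebook can read files stored there
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
import numpy as np
# Path as given above; change it to match where the file is saved.
# Files on a mounted Drive appear under /content/drive/MyDrive/.
data_path = "../data/Milk_Data_S2025n.csv"
milk_data = pd.read_csv(data_path)
# Show the first five rows
milk_data.head()
Dataset structure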
The attached dataset contains 258 rows and 11 columns.
The column names can be listed with:
milk_data.columns.tolist()
The 11 columns in the dataset are:
['Location', 'Type', 'Brand', 'Fat', 'Fresh', 'Price', 'Package', 'Size', 'Pieces', 'Flavor', 'Volume']
A twelfth column, Price_per_1000_volume, is constructed later in this chapter. It appears in this list only if the cell is re-run after that step.
The numeric variables are Price, Size, Pieces, and Volume. The other variables are categorical text variables.
milk_data.info()
Missing values and repeated rows
The dataset contains 0 missing values across all columns.
milk_data.isnull().sum()
The dataset also contains 48 fully duplicated rows.
milk_data.duplicated().sum()
A duplicated row should not be deleted automatically. It may be a repeated product entry, a repeated observation, or a data entry duplicate. The correct decision depends on how the data were collected. For now, we report the duplication count and keep the data unchanged.
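One way to inspect duplicated rows without deleting anything is to list each repeated row together with the number of copies. This is a minimal sketch on invented data. The `toy` table and the column name `copies` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "Brand": ["A", "B", "A", "C", "B"],
    "Price": [500, 850, 500, 300, 850],
    "Size": [200, 1000, 200, 500, 1000],
})

# keep=False marks every member of a duplicated group, not only the later copies.
dup_mask = toy.duplicated(keep=False)
dup_groups = (
    toy[dup_mask]
    .groupby(list(toy.columns), dropna=False)
    .size()
    .reset_index(name="copies")
)

# The default keep="first" counts only the second and later copies of each row.
n_later_copies = int(toy.duplicated().sum())

print(dup_groups)
print(n_later_copies)
```

The original table is left unchanged, so the decision about what to do with the repeated rows can wait until the data collection process is known.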
Checking the constructed volume variable
The dataset already contains a Volume column. We can verify whether it equals:
\[ \text{Volume} = \text{Size} \times \text{Pieces} \]
volume_check = milk_data["Volume"] == milk_data["Size"] * milk_data["Pieces"]
volume_check.value_counts()
In this dataset, the number of rows where Volume does not match Size × Pieces is 0.
(milk_data["Volume"] != milk_data["Size"] * milk_data["Pieces"]).sum()
This means the recorded Volume column is consistent with Size × Pieces for all rows in the attached dataset.
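Exact equality works when the columns hold whole numbers. If Size were stored with decimals, a tolerance-based comparison may be safer. The sketch below uses `numpy.isclose` on invented data in which the third row is deliberately inconsistent. It is an alternative check, not the one used in this chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the third row is deliberately inconsistent.
toy = pd.DataFrame({
    "Size": [200.0, 1000.0, 180.0],
    "Pieces": [6, 1, 4],
    "Volume": [1200.0, 1000.0, 700.0],
})

expected = toy["Size"] * toy["Pieces"]
matches = np.isclose(toy["Volume"], expected)   # tolerant of tiny floating-point differences
mismatches = toy[~matches]                      # rows to inspect by hand

print(int((~matches).sum()))
print(mismatches)
```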
Descriptive statistics for numeric variables
The table below summarizes the numeric variables in the dataset.
| Variable | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Price | 258 | 1,028.08 | 907.73 | 100.00 | 450.00 | 850.00 | 1,250.00 | 4,990.00 |
| Size | 258 | 857.29 | 766.05 | 120.00 | 200.00 | 1,000.00 | 1,000.00 | 3,800.00 |
| Pieces | 258 | 2.50 | 3.18 | 1.00 | 1.00 | 1.00 | 4.00 | 24.00 |
| Volume | 258 | 1,343.21 | 1,139.13 | 120.00 | 500.00 | 1,000.00 | 2,000.00 | 4,000.00 |
The median price is 850.00, while the mean price is 1,028.08. Since the mean is higher than the median, the price distribution is likely right-skewed: higher-priced observations pull the mean upward.
The median volume is 1,000.00, while the mean volume is 1,343.21. The standard deviation of 1,139.13 and the range from 120 to 4,000 show that product volumes vary substantially across observations.
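A summary table of this kind can be produced with `describe()`. The sketch below uses invented prices with one deliberately high value, to show how a few high observations pull the mean above the median. The column `mean_minus_median` is a helper added for this example.

```python
import pandas as pd

# Hypothetical toy prices, invented for illustration; one value is much higher than the rest.
toy = pd.DataFrame({"Price": [400, 450, 500, 850, 900, 4990]})

summary = toy.describe().T                       # one row per variable
summary["mean_minus_median"] = summary["mean"] - summary["50%"]

print(summary.round(2))
```

For the course data, the same call on `milk_data[["Price", "Size", "Pieces", "Volume"]]` should give one row per numeric variable.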
Categorical variable summary
The table below summarizes the main categorical variables.
| Variable | Unique categories | Most common category | Most common count |
|---|---|---|---|
| Location | 2 | Oman | 130 |
| Type | 3 | Milk | 233 |
| Brand | 19 | Almarai | 58 |
| Fat | 3 | Full | 161 |
| Fresh | 2 | Yes | 142 |
| Package | 2 | Carton | 131 |
| Flavor | 2 | No | 165 |
The most common product type is Milk, with 233 observations. The most common brand is Almarai, with 58 observations. The dataset has 19 brand categories.
These are observed frequencies in the dataset. They should not be interpreted as market shares unless the data collection process was designed to represent market shares.
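A categorical summary like the one above can be assembled from `nunique()` and `value_counts()`. This is one possible sketch on invented data. The category labels in `toy` are made up and do not describe the course dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "Type": ["Milk", "Milk", "Other", "Milk"],
    "Package": ["Carton", "Bottle", "Bottle", "Bottle"],
})

rows = []
for col in ["Type", "Package"]:
    counts = toy[col].value_counts()          # sorted from most to least frequent
    rows.append({
        "Variable": col,
        "Unique categories": toy[col].nunique(),
        "Most common category": counts.index[0],
        "Most common count": int(counts.iloc[0]),
    })

cat_summary = pd.DataFrame(rows)
print(cat_summary)
```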
Frequency tables
Frequency tables help us inspect categorical variables.
milk_data["Location"].value_counts()
milk_data["Type"].value_counts()
milk_data["Brand"].value_counts()
milk_data["Fat"].value_counts()
milk_data["Package"].value_counts()
These tables are useful because categorical variables often become grouping variables in graphs and dummy variables in regression models.
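Counts are often easier to read next to sample shares. The sketch below combines both using `value_counts(normalize=True)`. The labels in `toy` are invented, and the shares describe only the sample, not a market.

```python
import pandas as pd

# Hypothetical toy column, invented for illustration only.
toy = pd.Series(["Full", "Low", "Full", "Skim", "Full"], name="Fat")

freq = pd.DataFrame({
    "count": toy.value_counts(),
    "share": toy.value_counts(normalize=True).round(3),   # share of rows in the sample
})

print(freq)
```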
Price per 1000 units of volume
Total price can be misleading when package volumes differ. A larger package may have a higher total price but a lower price per unit.
To avoid assuming the physical unit beyond the dataset, we define:
\[ \text{Price per 1000 volume} = \frac{\text{Price}}{\text{Volume}} \times 1000 \]
milk_data["Price_per_1000_volume"] = (milk_data["Price"] / milk_data["Volume"]) * 1000
milk_data[["Price", "Volume", "Price_per_1000_volume"]].head()
If the class confirms that Volume is measured in milliliters, this variable can be interpreted as price per liter. Without that confirmation, it should be described as price per 1000 units of recorded volume.
The descriptive statistics are:
| Variable | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Price | 258 | 1,028.08 | 907.73 | 100.00 | 450.00 | 850.00 | 1,250.00 | 4,990.00 |
| Volume | 258 | 1,343.21 | 1,139.13 | 120.00 | 500.00 | 1,000.00 | 2,000.00 | 4,000.00 |
| Price_per_1000_volume | 258 | 977.94 | 933.50 | 347.50 | 548.12 | 666.67 | 1,050.69 | 7,600.00 |
The median value of Price_per_1000_volume is 666.67.
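A small worked example shows why the per-volume measure matters. The two products below are made up for illustration. The larger pack has the higher total price but the lower price per 1000 units of volume.

```python
import pandas as pd

# Two made-up products (not from the course dataset).
toy = pd.DataFrame({
    "Product": ["small pack", "large pack"],
    "Price": [250, 1000],
    "Volume": [250, 2000],
})

# Same definition as in the text: price divided by volume, scaled by 1000.
toy["Price_per_1000_volume"] = toy["Price"] / toy["Volume"] * 1000

print(toy)
```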
IQR rule for unusual values
The interquartile range rule is one simple way to flag unusual values. It does not prove that a value is wrong.
The rule is:
\[ \text{IQR} = Q_3 - Q_1 \]
\[ \text{Lower} = Q_1 - 1.5 \times \text{IQR} \]
\[ \text{Upper} = Q_3 + 1.5 \times \text{IQR} \]
Values outside this range are flagged for inspection.
| Variable | Q1 | Median | Q3 | IQR | Lower rule | Upper rule | Flagged values |
|---|---|---|---|---|---|---|---|
| Price | 450 | 850.00 | 1,250.00 | 800.00 | -750.00 | 2,450.00 | 13 |
| Size | 200 | 1,000.00 | 1,000.00 | 800.00 | -1,000.00 | 2,200.00 | 8 |
| Pieces | 1 | 1.00 | 4.00 | 3.00 | -3.50 | 8.50 | 6 |
| Volume | 500 | 1,000.00 | 2,000.00 | 1,500.00 | -1,750.00 | 4,250.00 | 0 |
| Price_per_1000_volume | 548.12 | 666.67 | 1,050.69 | 502.57 | -205.73 | 1,804.55 | 21 |
The variable Volume has 0 values flagged by this rule. The variable Price_per_1000_volume has 21 flagged values. These observations should be inspected before any decision is made.
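The IQR rule can be written as a short helper. The function name `iqr_flags` and the toy prices are assumptions made for this sketch. Quartiles use the pandas default (linear interpolation), which is also what `describe()` reports.

```python
import pandas as pd

def iqr_flags(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical toy prices, invented for illustration only.
toy_price = pd.Series([400, 450, 500, 850, 900, 1250, 4990], name="Price")
flags = iqr_flags(toy_price)

# Flagged values are listed for inspection, not removed.
print(toy_price[flags])
```

Applied to the course data, `iqr_flags(milk_data["Price"]).sum()` would count the flagged prices.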
Correlation among numeric variables
Correlation provides a first look at linear association among numeric variables.
numeric_data = milk_data[["Price", "Size", "Pieces", "Volume", "Price_per_1000_volume"]]
numeric_data.corr()
For the attached dataset, the correlation matrix is:
| | Price | Size | Pieces | Volume | Price_per_1000_volume |
|---|---|---|---|---|---|
| Price | 1.000 | 0.339 | 0.244 | 0.523 | 0.530 |
| Size | 0.339 | 1.000 | -0.331 | 0.557 | -0.150 |
| Pieces | 0.244 | -0.331 | 1.000 | 0.316 | -0.115 |
| Volume | 0.523 | 0.557 | 0.316 | 1.000 | -0.270 |
| Price_per_1000_volume | 0.530 | -0.150 | -0.115 | -0.270 | 1.000 |
The correlation between Price and Volume is 0.523. This means total price and volume are positively associated in the dataset.
The correlation between Volume and Price_per_1000_volume is -0.270. This is a fairly weak negative association: products with larger recorded volume tend to have a somewhat lower price per 1000 units of volume in this dataset.
These are associations only. They do not prove causality.
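To see what `corr()` computes, the Pearson correlation can be reproduced by hand from deviations around the means. The numbers below are invented for illustration, and the variable names `r_manual` and `r_pandas` are chosen for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "Price": [250, 450, 850, 1000, 1900],
    "Volume": [200, 500, 1000, 1000, 2000],
})

# Deviations from the mean for each variable
x = toy["Price"] - toy["Price"].mean()
y = toy["Volume"] - toy["Volume"].mean()

# Pearson r: sum of cross-products divided by the root of the product of sums of squares
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = toy["Price"].corr(toy["Volume"])

print(round(r_manual, 3), round(r_pandas, 3))
```

The two values agree up to floating-point error. The formula measures linear association only, which is why it cannot be read as a causal effect.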
Interpretation
The dataset is suitable for introductory exploratory data analysis because it contains both numeric and categorical variables.
Important observed facts are:
- the dataset has 258 rows and 11 columns
- there are 0 missing values
- there are 48 fully duplicated rows
- Volume matches Size × Pieces in all rows
- Price, Size, Pieces, and Volume are numeric
- the dataset has 19 brand categories
- Price_per_1000_volume is useful for comparing products with different volumes
The descriptive statistics also show why graphs are needed. Means, medians, quartiles, and correlations give a compact summary, but they do not show the full shape of the distributions.
Common mistakes
- Assuming the price unit without checking dataset documentation.
- Treating duplicated rows as errors without knowing how the data were collected.
- Comparing total price across products with very different volumes.
- Reporting only the mean and ignoring the median and quartiles.
- Treating correlation as causation.
- Deleting unusual values only because they were flagged by the IQR rule.
Key takeaway
- Descriptive statistics provide the first empirical summary of the dataset.
- The attached milk dataset has 258 observations, 11 variables, and no missing values.
- The Volume variable is consistent with Size × Pieces.
- Price_per_1000_volume helps compare products with different recorded volumes.
- Descriptive statistics prepare the data for graphs and regression, but they do not establish causality.