Chapter 28. Writing the Data Section
Applied Econometrics for Natural Resource Economics
Chapter Purpose
The data section tells the reader what data were used, where the data came from, how the variables were defined, and whether the sample has limitations. A good data section gives enough information for another researcher to understand and evaluate the analysis.
Applied Question
How do I write a clear and professional data section for an empirical project?
Key Idea
The data section should answer four basic questions:
- Where did the data come from?
- What is the unit of observation?
- What variables are used?
- What are the main limitations of the data?
A strong data section is factual, transparent, and precise. It should not exaggerate what the data can show.
What Belongs in a Data Section?
A standard data section usually includes the data source, sample period, unit of observation, number of observations, variable definitions, descriptive statistics, cleaning decisions, and limitations.
Data Source
Start by identifying the source clearly.
Good examples:
- The dataset was collected from supermarket milk product prices in Oman.
- Trade data were obtained from UN Comtrade.
- Macroeconomic indicators were downloaded from the World Bank.
- Food price data were obtained from FAOSTAT.
Avoid vague wording such as:
Data were taken from the internet.
Unit of Observation
The unit of observation tells the reader what each row represents.
| Dataset Type | Unit of Observation |
|---|---|
| Milk product data | One milk product |
| Country trade data | One country pair in one year |
| Time-series data | One month |
| Household survey | One household |
| Farm survey | One farm |
Example: Milk Product Dataset
In the milk product dataset, each row represents one observed milk product.
| Variable | Description |
|---|---|
| Price | Product price |
| Size | Package size |
| Pieces | Number of pieces in the package |
| Volume | Size multiplied by pieces |
| Brand | Product brand |
| Fat | Fat category |
| Fresh | Freshness category |
| Package | Package type |
| Flavor | Flavor category |
| Location | Store or market location |
The constructed variable is:
[ Volume_i = Size_i imes Pieces_i ]
Python Example: Inspecting Data
import pandas as pd
milk_data = pd.read_csv("../data/Milk_Data_S2025n.csv")
milk_data.head()milk_data.info()milk_data.describe(include="all")Checking Missing Values
milk_data.isnull().sum()Example wording:
The dataset was checked for missing observations. Observations with missing values in key regression variables were excluded from the final estimation sample.
Do not hide missing values. A short honest explanation is better than pretending the dataset is perfect.
Creating Constructed Variables
milk_data["Volume"] = milk_data["Size"] * milk_data["Pieces"]
milk_data["Unit_Price"] = milk_data["Price"] / milk_data["Volume"]Example wording:
Total volume was constructed by multiplying package size by the number of pieces. Unit price was calculated as price divided by total volume.
Descriptive Statistics Table
summary_table = milk_data[["Price", "Size", "Pieces", "Volume"]].agg(
["count", "mean", "std", "min", "max"]
).T
summary_table.round(3)Do not only paste the table. Explain what it shows.
Data Limitations
Every dataset has limitations. Possible limitations include small sample size, limited time period, missing characteristics, measurement error, non-random sampling, and lack of causal identification.
Example:
The dataset is useful for studying price differences across observed milk products, but it does not contain information on promotions, wholesale costs, or consumer demand. Therefore, the results should be interpreted as associations rather than causal effects.
Limitations do not destroy a project. They clarify what the project can and cannot claim.
Sample Data Section
This study uses a cross-sectional dataset of milk products observed in retail markets. Each observation represents one milk product. The main outcome variable is product price. The key explanatory variable is total package volume, constructed as package size multiplied by the number of pieces. Additional product characteristics include brand, fat content, package type, flavor, freshness, and location.
The dataset was checked for missing values and inconsistent entries before analysis. Total volume and unit price were constructed from the original variables. Observations with missing values in key regression variables were excluded from the estimation sample. The dataset allows an applied analysis of price variation across milk products, but the results should be interpreted as conditional associations rather than causal effects.
Checklist
- What is the data source?
- What is the sample period?
- What is the unit of observation?
- How many observations are used?
- What is the dependent variable?
- What are the explanatory variables?
- Were any variables constructed?
- Were missing values handled?
- What are the main limitations?
Chapter Summary
The data section gives credibility to the empirical project. It should be accurate, transparent, and directly connected to the regression model.
- The data section explains the source, structure, and variables.
- The unit of observation must be clearly stated.
- Constructed variables should be defined carefully.
- Descriptive statistics should be reported and briefly interpreted.
- Data limitations should be acknowledged honestly.