Chapter 2. Economic Questions and Data Types
NREC4107 - Applied Econometrics
Opening purpose
This chapter explains how empirical work begins with a clear question and a careful understanding of data types.
We use the actual course milk dataset:
Milk_Data_S2025n.csv
The dataset contains 258 observations and 11 variables. Each row is an observed milk product entry. Each column describes one feature of that product entry.
Applied question
What kinds of economic questions can we ask using the milk dataset, and what types of variables do we have?
Key idea
A weak empirical project begins with data only.
A stronger empirical project begins with a question.
The dataset can contain many variables, but the researcher must decide:
- what the outcome variable is
- which explanatory variables are relevant
- what type of data structure is being used
- whether the question is descriptive, associational, predictive, or causal
Loading the dataset
from google.colab import drive
drive.mount('/content/drive')  # makes Google Drive files visible under /content/drive
import pandas as pd
data_path = "../data/Milk_Data_S2025n.csv"  # relative to the notebook's working directory
milk_data = pd.read_csv(data_path)
milk_data.head()
Students should change data_path if they saved the dataset somewhere else, for example in a different Google Drive folder.
Observed dataset structure
milk_data.shape
For the attached dataset, the shape is:
(258, 11)
This means the dataset has 258 rows and 11 columns.
milk_data.columns.tolist()
The observed columns are:
['Location', 'Type', 'Brand', 'Fat', 'Fresh', 'Price', 'Package', 'Size', 'Pieces', 'Flavor', 'Volume']
Variable types
The table below describes the observed variables in the dataset.
| Variable | Observed type | Role in this course | Unique values |
|---|---|---|---|
| Location | categorical | Categorical characteristic | 2 |
| Type | categorical | Categorical characteristic | 3 |
| Brand | categorical | Categorical characteristic | 19 |
| Fat | categorical | Categorical characteristic | 3 |
| Fresh | categorical | Categorical characteristic | 2 |
| Price | numeric | Numeric measurement | 106 |
| Package | categorical | Categorical characteristic | 2 |
| Size | numeric | Numeric measurement | 21 |
| Pieces | numeric | Numeric measurement | 8 |
| Flavor | categorical | Categorical characteristic | 2 |
| Volume | numeric | Numeric measurement | 33 |
The numeric variables are:
['Price', 'Size', 'Pieces', 'Volume']
The categorical variables are:
['Location', 'Type', 'Brand', 'Fat', 'Fresh', 'Package', 'Flavor']
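A split like the one above can be sketched in pandas as follows. The rows in `toy` are made up for illustration and are not taken from the course file; the `Package` labels in particular are hypothetical. The sketch assumes that numeric variables are stored as numbers and everything else as text, which should be checked against the economic meaning of each column.

```python
import pandas as pd

# Made-up rows for illustration only; NOT taken from Milk_Data_S2025n.csv.
toy = pd.DataFrame({
    "Type": ["Milk", "Laban", "Milk", "Kefir"],
    "Package": ["Bottle", "Carton", "Carton", "Bottle"],  # hypothetical labels
    "Price": [5.5, 3.0, 11.0, 7.25],
    "Volume": [1000, 500, 2000, 1000],
})

# Columns stored as numbers versus all remaining columns.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values in each column, as in the "Unique values" column.
unique_counts = toy.nunique()

print(numeric_cols)
print(categorical_cols)
print(unique_counts)
```

Running the same three steps on `milk_data` instead of `toy` should give lists that can be compared with the table above.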
Inspecting variable types in Python
milk_data.dtypes
Python data types are not always the same as economic variable types. For example, a categorical variable may be stored as object, while a numeric variable may be stored as int64 or float64.
A researcher should inspect both the software type and the economic meaning of the variable.
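The gap between software type and economic meaning can be illustrated with a small sketch. It assumes a hypothetical coding in which `Fresh` is stored as 1 and 0; the course file may store this variable differently. The point is only that an integer column can still be a category.

```python
import pandas as pd

# Hypothetical coding for illustration: 1 = fresh, 0 = not fresh.
# The course file may record Fresh in another way.
toy = pd.DataFrame({
    "Fresh": [1, 0, 1, 1],
    "Price": [5.5, 3.0, 11.0, 7.25],
})

# The software sees an integer column, although the values are only labels.
stored_as_integer = pd.api.types.is_integer_dtype(toy["Fresh"])
print(toy.dtypes)

# State the economic meaning explicitly by converting the codes to a category.
toy["Fresh"] = toy["Fresh"].map({1: "Fresh", 0: "Not fresh"}).astype("category")

# Counting observations per category is meaningful; averaging the codes is not.
fresh_counts = toy["Fresh"].value_counts()
print(fresh_counts)
```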
Unit of observation
The unit of observation is what each row represents.
In this dataset, each row is an observed product entry with product characteristics and price. This matters because interpretation depends on the row unit.
For example, a row is not a household, not a farm, not a monthly time period, and not a country. It is a product observation in the dataset.
Cross-sectional data
The milk dataset is best treated as cross-sectional for the purposes of these introductory chapters. It records product observations rather than following the same product repeatedly through time.
This means we can compare products with different characteristics, such as:
- different brands
- different package types
- different volumes
- different product types
- different fat categories
The dataset does not allow us to study price changes over time, because it has no time variable and does not record the same product at more than one date.
Possible empirical questions
The observed dataset suggests several applied questions:
| Observed dataset feature | Possible applied question | Question type |
|---|---|---|
| Products have different recorded volumes | Are larger-volume products associated with higher total prices? | Association |
| Products have different total prices | Do high-price products also have high unit-price style values? | Association |
| Brands are observed across multiple products | Are brand differences visible after accounting for volume? | Association with controls |
| Product type includes Milk, Laban, and Kefir | Do product types differ in price or unit-price style values? | Association with categories |
| Package type is recorded as a categorical variable | Is package type associated with price differences? | Association with categories |
| Fat, Fresh, and Flavor are recorded as categorical variables | Do product characteristics help explain observed price variation? | Association with controls |
These questions are mostly descriptive or associational. A causal study would require a stronger design.
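The difference between a descriptive summary and an association with categories can be sketched with a few made-up rows. The prices below are invented for illustration; only the product type labels come from the dataset description.

```python
import pandas as pd

# Made-up prices for illustration only; NOT taken from the course file.
toy = pd.DataFrame({
    "Type": ["Milk", "Milk", "Laban", "Laban", "Kefir"],
    "Price": [6.0, 10.0, 3.0, 5.0, 8.0],
})

# Descriptive question: what is the typical price in the sample?
overall_mean = toy["Price"].mean()

# Association with categories: does the average price differ by product type?
mean_by_type = toy.groupby("Type")["Price"].mean()

print(overall_mean)
print(mean_by_type)
```

A difference in group means describes the sample. It does not show that changing the product type would change the price.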
Choosing a dependent variable
A dependent variable is the outcome we want to explain or predict.
Possible dependent variables in this dataset include:
- Price
- Price_per_1000_volume, if constructed from Price and Volume
Total price and unit-price style values answer different questions.
For example:
milk_data["Price_per_1000_volume"] = (milk_data["Price"] / milk_data["Volume"]) * 1000
If the class confirms that Volume is measured in milliliters, this variable can be interpreted as price per liter. Without that confirmation, it should be described as price per 1000 units of recorded volume.
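A slightly more defensive version of this construction is sketched below on made-up numbers. It assumes that a zero or negative volume is a data-entry problem and should give a missing value and not an infinite ratio; whether the course dataset contains such entries is not known here.

```python
import pandas as pd

# Made-up rows for illustration only; the 0 is a deliberately bad entry.
toy = pd.DataFrame({
    "Price": [6.0, 10.0, 3.0, 4.0],
    "Volume": [1000, 2000, 500, 0],
})

# Treat non-positive volumes as missing so the ratio is not infinite.
valid_volume = toy["Volume"].where(toy["Volume"] > 0)

# Price per 1000 units of recorded volume.
toy["Price_per_1000_volume"] = toy["Price"] / valid_volume * 1000

print(toy)
```

In these made-up rows, the product with the highest total price has the lowest price per 1000 units of volume, which is one way to see that the two outcome variables answer different questions.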
Choosing explanatory variables
Explanatory variables are variables that may help explain or predict the dependent variable.
In this dataset, possible explanatory variables include:
- Size
- Pieces
- Volume
- Location
- Type
- Brand
- Fat
- Fresh
- Package
- Flavor
The choice should be based on economic reasoning, not only on availability.
Interpretation
A clear question makes the analysis easier to understand.
For example:
Are larger-volume products associated with higher total prices?
This question identifies:
- dependent variable: Price
- explanatory variable: Volume
- question type: association
- data structure: cross-sectional product observations
A vague question such as “analyze the milk data” is weaker because it does not define the outcome or the comparison.
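Once the question names an outcome and an explanatory variable, a first look at the association can be sketched as below. The numbers are made up for illustration, and the correlation and fitted line are just one simple way to summarize an association, not a method prescribed by the course.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; NOT taken from the course file.
toy = pd.DataFrame({
    "Volume": [250, 500, 1000, 2000],
    "Price": [2.0, 3.5, 6.0, 11.0],
})

# Correlation between the outcome (Price) and the explanatory variable (Volume).
corr = toy["Price"].corr(toy["Volume"])

# Slope and intercept of a straight line fitted by least squares.
slope, intercept = np.polyfit(toy["Volume"].to_numpy(), toy["Price"].to_numpy(), deg=1)

print(corr)
print(slope, intercept)
```

A positive correlation or slope says that larger-volume products tend to have higher total prices in the sample. It is an associational statement, not a causal one.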
Common mistakes
- Starting with a dataset but no question.
- Confusing the software data type with the economic meaning.
- Forgetting the unit of observation.
- Treating cross-sectional data as time-series data.
- Making causal claims from associational data.
- Including all variables without explaining why they matter.
Key takeaway
- Empirical work should begin with a clear question.
- The attached milk dataset has 258 product observations and 11 variables.
- Price, Size, Pieces, and Volume are numeric variables.
- Location, Type, Brand, Fat, Fresh, Package, and Flavor are categorical variables.
- Most early questions from this dataset should be interpreted as descriptive or associational.