Chapter 2. Economic Questions and Data Types

NREC4107 - Applied Econometrics

Opening purpose

This chapter explains how empirical work begins with a clear question and a careful understanding of data types.

We use the actual course milk dataset:

Milk_Data_S2025n.csv

The dataset contains 258 observations and 11 variables. Each row is an observed milk product entry. Each column describes one feature of that product entry.

Applied question

What kinds of economic questions can we ask using the milk dataset, and what types of variables do we have?

Key idea

A weak empirical project begins with data only.

A stronger empirical project begins with a question.

The dataset can contain many variables, but the researcher must decide:

  • what the outcome variable is
  • which explanatory variables are relevant
  • what type of data structure is being used
  • whether the question is descriptive, associational, predictive, or causal

Loading the dataset

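# Mount Google Drive so the notebook can read files stored there (Google Colab only).
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

# Path to the course dataset; adjust it to wherever the file is saved.
data_path = "../data/Milk_Data_S2025n.csv"
milk_data = pd.read_csv(data_path)

# Display the first five rows as a quick visual check.
milk_data.head()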

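The path above is relative to the notebook's working directory. Students who saved the dataset in Google Drive should change data_path to the file's location under the mounted drive, which Colab exposes under /content/drive/MyDrive/.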
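A wrong file path is the most common reason the loading step fails. The sketch below wraps the read in a small helper that checks the path first. The function name load_dataset is a hypothetical helper introduced here for illustration; it is not part of pandas or of the course materials.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Read a CSV file, failing early with a clear message if the path is wrong.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    if not path.is_file():
        # Show the resolved location so the student can see where Python looked.
        raise FileNotFoundError(
            f"No file found at {path.resolve()}; update data_path."
        )
    return pd.read_csv(path)
```

With the path from this chapter, the call would be milk_data = load_dataset(data_path).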

Observed dataset structure

milk_data.shape

For the attached dataset, the shape is:

(258, 11)

This means the dataset has 258 rows and 11 columns.

milk_data.columns.tolist()

The observed columns are:

['Location', 'Type', 'Brand', 'Fat', 'Fresh', 'Price', 'Package', 'Size', 'Pieces', 'Flavor', 'Volume']
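The shape and column checks can be combined into one small structural check that reports missing or unexpected columns. This is a minimal sketch: check_structure is a hypothetical helper, and the two-row frame below is made-up data standing in for milk_data (only the column names and the product types Milk and Laban come from this chapter).

```python
import pandas as pd

# Column names as listed in this chapter.
EXPECTED_COLUMNS = ['Location', 'Type', 'Brand', 'Fat', 'Fresh', 'Price',
                    'Package', 'Size', 'Pieces', 'Flavor', 'Volume']


def check_structure(df, expected_columns):
    """Return (missing, unexpected) column names relative to expected_columns."""
    missing = [c for c in expected_columns if c not in df.columns]
    unexpected = [c for c in df.columns if c not in expected_columns]
    return missing, unexpected


# Tiny made-up frame; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "Location": ["A", "B"], "Type": ["Milk", "Laban"], "Brand": ["X", "Y"],
    "Fat": ["Full", "Low"], "Fresh": ["Yes", "No"], "Price": [5.0, 7.5],
    "Package": ["Bottle", "Carton"], "Size": [1.0, 2.0], "Pieces": [1, 1],
    "Flavor": ["No", "No"], "Volume": [1000, 2000],
})

missing, unexpected = check_structure(toy, EXPECTED_COLUMNS)
print(toy.shape, missing, unexpected)
```

On the real data, check_structure(milk_data, EXPECTED_COLUMNS) should return two empty lists if the file matches the description above.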

Variable types

The table below describes the observed variables in the dataset.

Variable | Observed type | Role in this course        | Unique values
---------|---------------|----------------------------|--------------
Location | categorical   | Categorical characteristic | 2
Type     | categorical   | Categorical characteristic | 3
Brand    | categorical   | Categorical characteristic | 19
Fat      | categorical   | Categorical characteristic | 3
Fresh    | categorical   | Categorical characteristic | 2
Price    | numeric       | Numeric measurement        | 106
Package  | categorical   | Categorical characteristic | 2
Size     | numeric       | Numeric measurement        | 21
Pieces   | numeric       | Numeric measurement        | 8
Flavor   | categorical   | Categorical characteristic | 2
Volume   | numeric       | Numeric measurement        | 33

The numeric variables are:

['Price', 'Size', 'Pieces', 'Volume']

The categorical variables are:

['Location', 'Type', 'Brand', 'Fat', 'Fresh', 'Package', 'Flavor']
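A table like the one above can be produced directly in pandas. The sketch below splits columns by storage type and counts unique values. It uses a small made-up frame with a subset of the chapter's column names, so the printed counts are illustrative and do not reproduce the dataset's actual figures.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the chapter.
toy = pd.DataFrame({
    "Brand": ["X", "Y", "X", "Z"],
    "Type": ["Milk", "Laban", "Milk", "Kefir"],
    "Price": [5.0, 7.5, 9.0, 6.0],
    "Volume": [1000, 2000, 2000, 500],
})

# Columns stored as numbers versus everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One row per variable: storage type and number of distinct values.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "unique_values": toy.nunique(),
})

print(numeric_cols)
print(categorical_cols)
print(summary)
```

Replacing toy with milk_data applies the same steps to the course dataset. The split relies on how pandas stored each column, so it should still be checked against the economic meaning of each variable.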

Inspecting variable types in Python

milk_data.dtypes

Python data types are not always the same as economic variable types. For example, a categorical variable may be stored as object, while a numeric variable may be stored as int64 or float64.

A researcher should inspect both the software type and the economic meaning of the variable.
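The gap between software type and economic meaning is easiest to see with a category that happens to be coded as numbers. In the sketch below, Fresh_code is a hypothetical 0/1 coding invented for illustration; the chapter does not say that any column in the milk dataset is stored this way.

```python
import pandas as pd

# Made-up data: Fresh_code is a hypothetical numeric coding of a category.
toy = pd.DataFrame({
    "Fresh_code": [1, 0, 1, 1],
    "Price": [5.0, 7.5, 9.0, 6.0],
})

# pandas reports an integer type, although the variable is categorical in meaning.
print(toy.dtypes)

# Make the economic meaning explicit by mapping codes to labels
# and storing the result as a categorical column.
toy["Fresh_cat"] = (
    toy["Fresh_code"].map({1: "Fresh", 0: "Not fresh"}).astype("category")
)
print(toy["Fresh_cat"].value_counts())
```

Taking the mean of Fresh_code would run without error, which is exactly why the researcher, not the software, has to decide what kind of variable it is.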

Unit of observation

The unit of observation is what each row represents.

In this dataset, each row is an observed product entry with product characteristics and price. This matters because interpretation depends on the row unit.

For example, a row is not a household, not a farm, not a monthly time period, and not a country. It is a product observation in the dataset.

Cross-sectional data

The milk dataset is best treated as cross-sectional for the purposes of these introductory chapters. It records product observations rather than following the same product repeatedly through time.

This means we can compare products with different characteristics, such as:

  • different brands
  • different package types
  • different volumes
  • different product types
  • different fat categories

It does not allow us to study price changes over time unless a time variable is added.
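A cross-sectional comparison of this kind can be sketched with a group summary. The rows below are made up for illustration; only the column names and the three product types come from the chapter.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Type": ["Milk", "Laban", "Milk", "Kefir", "Laban"],
    "Price": [5.0, 7.0, 9.0, 6.0, 3.0],
})

# Compare products across categories at a single point in time:
# number of observations and average price for each product type.
price_by_type = toy.groupby("Type")["Price"].agg(["count", "mean"])
print(price_by_type)
```

The same call on milk_data would describe differences between product types. It says nothing about how any product's price moves over time.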

Possible empirical questions

The observed dataset suggests several applied questions:

Observed dataset feature | Possible applied question | Question type
-------------------------|---------------------------|--------------
Products have different recorded volumes | Are larger-volume products associated with higher total prices? | Association
Products have different total prices | Do high-price products also have high unit-price style values? | Association
Brands are observed across multiple products | Are brand differences visible after accounting for volume? | Association with controls
Product type includes Milk, Laban, and Kefir | Do product types differ in price or unit-price style values? | Association with categories
Package type is recorded as a categorical variable | Is package type associated with price differences? | Association with categories
Fat, Fresh, and Flavor are recorded as categorical variables | Do product characteristics help explain observed price variation? | Association with controls

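These questions are mostly descriptive or associational. A causal study would require a stronger design, because products in this dataset were not assigned their characteristics at random and observed price differences may reflect factors that are not recorded.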

Choosing a dependent variable

A dependent variable is the outcome we want to explain or predict.

Possible dependent variables in this dataset include:

  • Price
  • Price_per_1000_volume, if constructed from Price and Volume

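Total price and unit price answer different questions. Total price describes what a package costs, while a unit price such as Price_per_1000_volume adjusts for how much product the package contains.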

For example:

milk_data["Price_per_1000_volume"] = (milk_data["Price"] / milk_data["Volume"]) * 1000

If the class confirms that Volume is measured in milliliters, this variable can be interpreted as price per liter. Without that confirmation, it should be described as price per 1000 units of recorded volume.
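The construction above divides by Volume, so a zero or missing volume would produce an infinite or undefined unit price. The sketch below guards against that case. The rows are made up, and the zero-volume row is included only to show the guard; the chapter does not say that such rows exist in the dataset.

```python
import pandas as pd

# Made-up rows; the last one has a zero volume to exercise the guard.
toy = pd.DataFrame({
    "Price": [5.0, 7.5, 3.0, 4.0],
    "Volume": [1000, 2000, 500, 0],
})

# Keep only positive volumes; other entries become NaN instead of dividing by zero.
valid_volume = toy["Volume"].where(toy["Volume"] > 0)

# Price per 1000 units of recorded volume.
toy["Price_per_1000_volume"] = toy["Price"] / valid_volume * 1000
print(toy)
```

In this toy example the product with the lowest total price (3.0 for 500 units) has the highest unit price, which illustrates why the two outcomes answer different questions.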

Choosing explanatory variables

Explanatory variables are variables that may help explain or predict the dependent variable.

In this dataset, possible explanatory variables include:

  • Size
  • Pieces
  • Volume
  • Location
  • Type
  • Brand
  • Fat
  • Fresh
  • Package
  • Flavor

The choice should be based on economic reasoning, not only on availability.
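One way to make the choice explicit is to name the outcome and the explanatory variables in code before any analysis. The sketch below does this for a made-up frame and converts the categorical variable into indicator columns with pd.get_dummies. The selection of Volume and Type is an assumption made for illustration, not a recommendation from the course.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Price": [5.0, 7.5, 9.0, 6.0],
    "Volume": [1000, 2000, 2000, 500],
    "Type": ["Milk", "Laban", "Milk", "Kefir"],
})

# State the choice explicitly so it can be justified and reviewed.
outcome = "Price"
explanatory = ["Volume", "Type"]

y = toy[outcome]

# Categorical variables become indicator columns; drop_first=True leaves out
# one category (here Kefir, the first alphabetically) as the reference group.
X = pd.get_dummies(toy[explanatory], columns=["Type"], drop_first=True)
print(X.columns.tolist())
```

Writing the lists down first makes it harder to include every available column without a reason.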

Interpretation

A clear question makes the analysis easier to understand.

For example:

Are larger-volume products associated with higher total prices?

This question identifies:

  • dependent variable: Price
  • explanatory variable: Volume
  • question type: association
  • data structure: cross-sectional product observations

A vague question such as “analyze the milk data” is weaker because it does not define the outcome or the comparison.
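Once the question names an outcome and an explanatory variable, a first associational answer takes only a few lines. The sketch below computes a correlation and a fitted straight line for Price against Volume on made-up numbers. The results describe an association in the toy data only and should not be read as findings about the milk dataset or as causal effects.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Volume": [500, 1000, 1500, 2000],
    "Price": [3.0, 5.5, 6.5, 9.0],
})

# Pearson correlation between the outcome and the explanatory variable.
corr = toy["Price"].corr(toy["Volume"])

# Least-squares line: Price is approximately intercept + slope * Volume.
slope, intercept = np.polyfit(toy["Volume"], toy["Price"], deg=1)

print(f"correlation: {corr:.3f}")
print(f"slope per unit of volume: {slope:.5f}")
```

A positive slope here means that larger-volume products tend to have higher total prices in the sample, which is the associational statement the question asks for.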

Common mistakes

  • Starting with a dataset but no question.
  • Confusing the software data type with the economic meaning.
  • Forgetting the unit of observation.
  • Treating cross-sectional data as time-series data.
  • Making causal claims from associational data.
  • Including all variables without explaining why they matter.

Key takeaway

  • Empirical work should begin with a clear question.
  • The attached milk dataset has 258 product observations and 11 variables.
  • Price, Size, Pieces, and Volume are numeric variables.
  • Location, Type, Brand, Fat, Fresh, Package, and Flavor are categorical variables.
  • Most early questions from this dataset should be interpreted as descriptive or associational.