

Chapter 2: Types of Data and the Language of Statistics

"Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom." — Clifford Stoll, astronomer and author

Chapter Overview

Here's something that trips up a lot of students: they jump into formulas and graphs without first understanding the type of data they're working with. Then they calculate an average of zip codes, try to build a bar chart of heights, or feed categorical labels into a regression model — and everything goes sideways.

It's like trying to cook without knowing the difference between tablespoons and cups. You might get lucky, but eventually you'll put three cups of salt in a recipe that called for three tablespoons. The dish is ruined, and you don't even know why.

This chapter teaches you the vocabulary that prevents those mistakes. By the end, you'll be able to look at any dataset — a spreadsheet, a research paper, a government database — and immediately identify what kind of data you're working with, what operations make sense, and what tools to reach for. This is the language statisticians speak every day, and once you've got it, everything else in this course builds on top of it.

In this chapter, you will learn to:

  • Classify variables as categorical (nominal, ordinal) or numerical (discrete, continuous)
  • Distinguish between populations and samples in real-world contexts
  • Identify observational units and variables in a dataset
  • Read and interpret data tables and data dictionaries
  • Recognize different levels of measurement and why they matter

Fast Track: If you've taken AP Statistics or a prior course, skim Sections 2.1-2.3 and jump to Section 2.5 ("Data Dictionaries"). Complete quiz questions 1, 5, 10, and 15 to verify your foundation.

Deep Dive: After this chapter, read Case Study 1 (electronic health records) to see how data classification decisions have life-or-death consequences in healthcare.


2.1 Observational Units and Variables: The Building Blocks

Before we can classify data types, we need two foundational ideas. Every dataset in the world — whether it's a spreadsheet of patient records, a database of basketball stats, or a table of streaming metrics — is organized around two things: observational units and variables.

What's an Observational Unit?

An observational unit (sometimes called a "case" or "unit of observation") is the individual entity that you're collecting data about. It's the "who" or "what" of your dataset.

In a medical study, each patient is an observational unit. In a survey, each respondent is an observational unit. In Sam Okafor's basketball data, each game (or each shot attempt) could be an observational unit — and that choice matters, as we'll see.

Here's a simple test: look at one row of a dataset. What does that row represent? That's your observational unit.

What's a Variable?

A variable is a characteristic or property that can take different values across your observational units. It's the "what" you're measuring or recording about each unit.

In Dr. Maya Chen's disease surveillance data, variables might include patient age, zip code, diagnosis, date of symptom onset, and whether the patient was hospitalized. Each of these characteristics varies across patients — that's why we call them variables.

Let's look at a concrete example. Here's a small dataset that Dr. Maya Chen might work with during flu season:

| Patient ID | Age | Zip Code | Diagnosis | Hospitalized | Days to Recovery |
|---|---|---|---|---|---|
| P-1001 | 34 | 90210 | Influenza A | No | 7 |
| P-1002 | 67 | 90045 | Influenza B | Yes | 14 |
| P-1003 | 8 | 90210 | Influenza A | No | 5 |
| P-1004 | 45 | 90301 | Influenza A | No | 9 |
| P-1005 | 72 | 90045 | Influenza B | Yes | 21 |
| P-1006 | 29 | 90301 | Influenza A | No | 6 |

In this dataset:

  • Observational unit: each patient (one per row)
  • Variables: Patient ID, Age, Zip Code, Diagnosis, Hospitalized, Days to Recovery (one per column)

Notice that every row has the same set of variables, but the values differ from row to row. Patient P-1001 is 34 years old; Patient P-1005 is 72. That variation across rows is exactly what makes these variables — and exactly what makes statistics interesting.
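If you like to see this in code, here is the same table as a pandas DataFrame (just a preview; pandas is covered properly in Chapter 3). The column names below are our own shorthand for the headers above:

```python
import pandas as pd

# Dr. Chen's flu table, values copied from the text
flu = pd.DataFrame({
    "patient_id": ["P-1001", "P-1002", "P-1003", "P-1004", "P-1005", "P-1006"],
    "age": [34, 67, 8, 45, 72, 29],
    "zip_code": ["90210", "90045", "90210", "90301", "90045", "90301"],
    "diagnosis": ["Influenza A", "Influenza B", "Influenza A",
                  "Influenza A", "Influenza B", "Influenza A"],
    "hospitalized": ["No", "Yes", "No", "No", "Yes", "No"],
    "days_to_recovery": [7, 14, 5, 9, 21, 6],
})

# One row per observational unit (patient), one column per variable
print(len(flu))           # 6 observational units
print(list(flu.columns))  # the variables
```

The row/column structure is the whole point: `len(flu)` counts observational units, and `flu.columns` lists the variables.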

Intuition: Think of a dataset like a seating chart for a dinner party. Each person (observational unit) sits in a row. The columns are the questions you ask everyone: "What's your name? How old are you? What's your favorite food?" The answers (values) differ from person to person — that's why we call them variables.

Why This Matters

Here's the practical payoff: once you identify the observational unit, you can figure out what level of analysis makes sense. If your observational unit is "each student in a class," then calculating the average grade tells you something about that specific class. If your observational unit is "each school in a district," then averaging tells you something about the district's schools.

Getting the observational unit wrong is one of the most common — and most consequential — mistakes in data analysis. We'll see examples of this throughout the book, especially when we discuss ecological fallacy in Chapter 27.


2.2 Categorical vs. Numerical: The Big Split

Now for the concept that will follow you through every chapter of this course. Every variable you encounter falls into one of two fundamental categories:

Categorical variables place each observational unit into a group or category. The values are labels or names, not quantities.

Numerical variables assign each observational unit a number that represents a quantity — something you can meaningfully add, subtract, or average.

This sounds simple, and the basic distinction usually is. But the edges can be tricky, and getting it wrong leads to real problems. Let's build your intuition.

Categorical Variables: Names and Groups

A categorical variable (also called a qualitative variable) records a quality or category. The values answer the question "what type?" or "which group?"

From Dr. Maya Chen's flu data:

  • Diagnosis (Influenza A, Influenza B) — categories, not quantities
  • Hospitalized (Yes, No) — two groups
  • Zip Code (90210, 90045, 90301) — yes, even though these are numbers!

Wait — zip codes are numbers, so aren't they numerical? This is the trap that catches students every semester. Here's the key question: does it make sense to do arithmetic with these values?

What's the "average zip code" of these patients? You could calculate 90210 + 90045 + 90210 + ... and divide by 6. You'd get a number — about 90185. But that number would be completely meaningless. You can't walk to "average zip code 90185." Zip codes are labels — they identify locations, not quantities. The fact that they happen to be written with digits doesn't make them numerical.

Common Pitfall: Numbers are not always numerical variables. Jersey numbers, Social Security numbers, phone numbers, zip codes, and ID numbers are all categorical. The test: does arithmetic (addition, subtraction, averaging) produce a meaningful result? If not, it's categorical.

Numerical Variables: Quantities You Can Calculate With

A numerical variable (also called a quantitative variable) records a measurable quantity. The values answer the question "how much?" or "how many?"

From Dr. Chen's data:

  • Age (34, 67, 8, 45, 72, 29) — meaningful to average, compare, subtract
  • Days to Recovery (7, 14, 5, 9, 21, 6) — meaningful to calculate "Patient P-1005 took 16 more days than Patient P-1003"

These pass the arithmetic test. The average age of these patients is (34 + 67 + 8 + 45 + 72 + 29) / 6 = 42.5 years, and that number actually means something.
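You can watch the arithmetic test succeed and fail in a few lines of Python (values copied from the table above):

```python
import pandas as pd

# Values from Dr. Chen's flu table
ages = pd.Series([34, 67, 8, 45, 72, 29])
zips = pd.Series([90210, 90045, 90210, 90301, 90045, 90301])

print(ages.mean())  # 42.5 — a meaningful average age
print(zips.mean())  # ~90185.3 — a number, but a meaningless one
```

Both calls run without error; the software has no idea that one result is useful and the other is nonsense. That judgment is yours.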

The Quick Test

When you're unsure whether a variable is categorical or numerical, ask yourself these two questions:

  1. Does it make sense to calculate an average? If yes → numerical. If the average is meaningless → categorical.
  2. Do the values represent quantities, or labels? Quantities → numerical. Labels → categorical.

Here's a cheat sheet using examples from our four anchor characters:

| Variable | Character | Categorical or Numerical? | Why? |
|---|---|---|---|
| Disease type (flu strain) | Dr. Chen | Categorical | Labels for different strains; you can't average "Influenza A" and "Influenza B" |
| Patient age | Dr. Chen | Numerical | A quantity; average age is meaningful |
| User subscription tier (Free, Basic, Premium) | Alex Rivera | Categorical | Labels for groups; averaging doesn't make sense |
| Daily watch time (minutes) | Alex Rivera | Numerical | A measured quantity; average watch time is meaningful |
| Race/ethnicity of defendant | Prof. Washington | Categorical | Labels for demographic groups |
| Risk score (1-10) assigned by algorithm | Prof. Washington | It depends! | See "The Gray Areas" below |
| Player position (guard, forward, center) | Sam Okafor | Categorical | Labels for roles |
| Points scored per game | Sam Okafor | Numerical | A counted quantity; average points is meaningful |

Check Your Understanding (try to answer without scrolling up)

  1. In your own words, what's the difference between a categorical and a numerical variable?
  2. A dataset contains a column called "Customer Rating" with values 1 through 5 (where 1 = "Very Unsatisfied" and 5 = "Very Satisfied"). Is this categorical or numerical? Defend your answer.
  3. Is "Phone Number" a categorical or numerical variable? Why?

Verify

  1. A categorical variable places observations into groups or categories (labels). A numerical variable records a quantity that you can meaningfully do arithmetic with.
  2. This is a genuine gray area! The ratings are ordinal categories (we'll define this in the next section) — they have a meaningful order (5 > 4 > 3 > 2 > 1) but the "distances" between values aren't necessarily equal. In practice, many analysts treat 1-5 ratings as numerical and calculate averages (e.g., "average rating: 4.2 stars"), but this is technically a simplification. We'll discuss this nuance in Section 2.3.
  3. Categorical. Even though phone numbers contain digits, they're labels — identifiers for specific phone lines. Averaging two phone numbers produces nonsense.

2.3 Going Deeper: Nominal, Ordinal, Discrete, and Continuous

The categorical/numerical split is the big divide. But each side has subtypes that matter for choosing the right analysis and the right graph. Let's break each one down.

Categorical Subtypes: Nominal vs. Ordinal

Not all categories are created equal. Some have a natural order; others don't.

Nominal variables are categorical variables where the categories have no inherent order. "Nominal" comes from the Latin word for "name" — these are just names.

Examples:

  • Blood type (A, B, AB, O) — there's no sense in which type A is "more" than type B
  • Eye color (brown, blue, green, hazel) — no natural ranking
  • Diagnosis (Influenza A, Influenza B, COVID-19) — different categories, not ranked
  • Alex Rivera's user device type (mobile, tablet, desktop, smart TV) — no inherent order

Ordinal variables are categorical variables where the categories have a meaningful order, but the distances between categories aren't necessarily equal.

Examples:

  • Education level (high school, bachelor's, master's, doctorate) — there's a clear ordering
  • Pain scale (none, mild, moderate, severe) — "severe" is worse than "mild," but is the gap between "mild" and "moderate" the same as between "moderate" and "severe"? We don't know.
  • Military rank (private, corporal, sergeant, lieutenant) — ordered but not equally spaced
  • Likert scale responses (strongly disagree, disagree, neutral, agree, strongly agree) — ordered categories

Here's the critical distinction: with ordinal variables, you can say "A is more/higher/better than B," but you can't say "the difference between A and B equals the difference between B and C." The ordering is meaningful, but the spacing is not.

Real Talk About Ordinal Data: Here's a controversy that statisticians actually argue about: should you calculate the average of ordinal data? If a survey uses a 1-5 scale (strongly disagree to strongly agree), is it okay to report "the average response was 3.7"?

Strictly speaking, no — because the distances between 1 and 2, 2 and 3, etc. aren't guaranteed to be equal. But in practice, researchers do it constantly because it's useful and the results are usually reasonable. You'll see "average Likert score" in thousands of published papers.

Our advice: know the rule, understand why purists object, and recognize that treating ordinal data as numerical is a simplification that sometimes works and sometimes doesn't. When in doubt, use methods designed for ordinal data (we'll cover some in Chapter 21).
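One such method is built into pandas: storing ordinal data as an ordered categorical, which permits ranking and the mode while refusing a naive average. A minimal sketch, with made-up survey responses:

```python
import pandas as pd

# Hypothetical Likert responses stored as an ORDERED categorical
levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
responses = pd.Series(pd.Categorical(
    ["agree", "neutral", "agree", "strongly agree", "disagree", "agree"],
    categories=levels,
    ordered=True,
))

# Ranking and counting work on ordinal data:
print(responses.min(), "to", responses.max())
print(responses.mode().iloc[0])  # most common response
# But responses.mean() raises an error: pandas won't average labels.
```

To compute a mean you would have to map the labels to 1-5 yourself, which makes the equal-spacing assumption explicit rather than accidental.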

Numerical Subtypes: Discrete vs. Continuous

Numerical variables also come in two flavors.

Discrete variables take on countable values — typically whole numbers, with gaps between possible values. You can count them.

Examples:

  • Number of emergency room visits (0, 1, 2, 3, ...) — you can't have 2.7 visits
  • Number of three-pointers made in a game (0, 1, 2, ..., 15) — whole numbers only
  • Number of children in a household (0, 1, 2, 3, ...) — no fractional children
  • Number of episodes watched on StreamVibe (0, 1, 2, ...) — you either watched an episode or you didn't

Continuous variables can take on any value within a range, including fractions and decimals. You measure them.

Examples:

  • Height (5.583 feet, 172.4 cm) — any value on a continuous scale
  • Temperature (98.6 degrees, 37.0 degrees) — can be measured to arbitrary precision
  • Time spent watching (47.3 minutes, 2.15 hours) — measured, not counted
  • Blood pressure (120/80 mmHg) — measured on a continuous scale

Intuition: Here's a quick rule of thumb. Ask: "Do I count it, or do I measure it?"

  • Count → discrete
  • Measure → continuous

This works for most cases. You count the number of patients. You measure their temperature.

The Gray Areas

Let me be honest: the boundaries between these categories aren't always crystal clear. Real data is messy, and there are genuine ambiguities.

Age: Is it discrete or continuous? Technically, age is continuous — time passes continuously. But we usually report age in whole years (34, not 34.267), making it look discrete. In practice, most analysts treat age as continuous.

Money: Your bank balance might be $1,247.83 — continuous? Or is it discrete because it's counted in cents? In practice, dollar amounts are treated as continuous when the values span a wide range.

Risk scores (1-10): Professor Washington examines risk scores assigned by a predictive policing algorithm. Is a 1-10 score ordinal (ordered categories) or numerical (discrete)? It depends on how it was constructed. If the numbers come from a mathematical model and a score of 8 really is "twice as risky" as a score of 4, it's numerical. If the numbers are arbitrary labels where 8 just means "more risky" than 4 with no precise meaning for the gap, it's ordinal.

Don't let these gray areas paralyze you. In practice, the classification decision usually becomes clear when you ask: "What analysis am I planning to do, and does the data type support it?"

The Complete Classification Tree

Here's the full picture, all in one place:

                        Variable
                       /         \
              Categorical       Numerical
              /        \        /        \
         Nominal    Ordinal  Discrete  Continuous
| Type | Has meaningful order? | Has meaningful distances? | Arithmetic makes sense? | Examples |
|---|---|---|---|---|
| Nominal | No | No | No | Blood type, zip code, eye color |
| Ordinal | Yes | No | Limited | Pain scale, education level, Likert ratings |
| Discrete | Yes | Yes | Yes | Number of siblings, goals scored, defects counted |
| Continuous | Yes | Yes | Yes | Height, weight, temperature, time |

2.4 Populations, Samples, Parameters, and Statistics (Revisited)

You met the terms population and sample in Chapter 1. Now let's deepen your understanding and add two crucial companion terms.

Population vs. Sample: The Full Picture

Recall from Chapter 1 that a population is the entire group you want to study, and a sample is the subset you actually observe. But here's what we didn't emphasize enough last time: the same group of people can be either a population or a sample, depending on your question.

Let's use Alex Rivera's StreamVibe data to make this concrete.

Suppose StreamVibe has 8.2 million subscribers. Alex randomly selects 5,000 users to test the new recommendation algorithm. In this scenario:

  • Population: All 8.2 million StreamVibe subscribers
  • Sample: The 5,000 users selected for the test

But now suppose Alex's boss asks a different question: "Among just the 5,000 test users, what was the average watch time?" Now those 5,000 users are the population — because they're the entire group the boss cares about. No inference needed.

Same people, different roles. Whether a group is a population or a sample depends on the question you're asking.

Parameters vs. Statistics: The Vocabulary of Inference

This is where we add two new terms that will become essential starting in Chapter 11.

A parameter is a number that describes a population. It's the truth — the actual value you'd get if you could measure every single member of the population. Parameters are usually unknown because we rarely have access to the entire population.

A statistic is a number that describes a sample. It's what you actually calculate from the data you have. Statistics are known — you calculated them — but they're estimates of the unknown parameters.

| | Population | Sample |
|---|---|---|
| Who? | Everyone you want to study | The subset you actually observe |
| Number that describes it | Parameter | Statistic |
| Known or unknown? | Usually unknown | Known (you calculated it) |
| Goal | What you want to learn | What you use to estimate |

Here's a concrete example. Sam Okafor wants to know Daria Kowalczyk's "true" three-point shooting ability — the percentage she would shoot if she took an infinite number of shots under identical conditions. That true percentage is a parameter. It's fixed but unknown.

What Sam actually has is her shooting percentage this season: 38% on 65 attempts. That 38% is a statistic — a number calculated from a sample of shots (the 65 she's taken so far). It's his best estimate of the parameter, but it's not exactly right. If Daria took another 65 shots, she might shoot 35% or 41%. The statistic varies from sample to sample; the parameter does not.
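A quick simulation makes the distinction vivid. Here we pretend we know the parameter (a hypothetical true rate of 38%) and watch the statistic bounce around from sample to sample:

```python
import numpy as np

rng = np.random.default_rng(42)
true_p = 0.38  # the parameter: one fixed number (unknown in real life)

# Five independent "seasons" of 65 attempts each
for season in range(1, 6):
    makes = rng.binomial(n=65, p=true_p)  # count of made shots this season
    print(f"season {season}: {makes / 65:.1%}")  # the statistic varies
```

Run it and you'll see the sample percentage wander around 38% without ever being pinned to it: the statistic moves, the parameter never does.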

Intuition: A parameter is the bullseye on a dartboard. A statistic is where your dart actually lands. You're aiming for the parameter, and with good technique (proper sampling) you'll land close. But you'll almost never hit the exact center.

Math Anxiety Note: Don't worry — we won't do formal parameter estimation until Chapters 11-12. For now, just internalize the vocabulary: parameters describe populations, statistics describe samples. If you can remember that, you're golden.

Check Your Understanding (try to answer without scrolling up)

  1. Dr. Maya Chen surveys 2,000 residents of a county about their flu vaccination status. 64% of respondents say they've been vaccinated. Is 64% a parameter or a statistic?
  2. What would the corresponding parameter be?
  3. In Alex Rivera's A/B test, if the average watch time for the 5,000 test users is 47 minutes, is 47 minutes a parameter or a statistic?

Verify

  1. It's a statistic — it's calculated from a sample (2,000 residents), not the entire county population.
  2. The corresponding parameter would be the true vaccination rate of all residents in the county. This is unknown — the survey estimates it.
  3. It depends on the question! If Alex wants to know about ALL StreamVibe subscribers, then 47 minutes is a statistic (sample estimate). If the question is specifically about these 5,000 test users, then 47 minutes is a parameter (it describes the entire group of interest).

2.5 Data Dictionaries: The Rosetta Stone of Datasets

Imagine you're handed a spreadsheet with 50 columns and 10,000 rows. The column headers say things like BP_SYS, DX_CODE, LOS, and ADM_TYPE. What do these mean? What values are valid? How was each variable measured?

Without a data dictionary, you're lost.

A data dictionary (also called a codebook or metadata file) is a document that describes every variable in a dataset: its name, its type, what it measures, what values it can take, and how it was collected.

Here's what a data dictionary looks like for Dr. Maya Chen's flu surveillance data:

| Variable Name | Description | Type | Valid Values | Notes |
|---|---|---|---|---|
| patient_id | Unique patient identifier | Nominal (categorical) | P-1001 through P-9999 | Not used in analysis; for tracking only |
| age | Patient age at time of diagnosis | Continuous (numerical) | 0-120 | Recorded in whole years; ages < 1 recorded as 0 |
| zip_code | Patient's residential zip code | Nominal (categorical) | 5-digit U.S. zip codes | Used for geographic analysis, not arithmetic |
| diagnosis | Flu strain identified by lab test | Nominal (categorical) | Influenza A, Influenza B, Unspecified | "Unspecified" if lab test not performed |
| hospitalized | Whether patient was hospitalized | Nominal (categorical) | Yes, No | Binary variable |
| days_to_recovery | Days from symptom onset to symptom resolution | Discrete (numerical) | 1-90 | Self-reported; some patients lost to follow-up (recorded as NA) |
| collection_date | Date the data was recorded | Continuous (numerical) | Dates in MM/DD/YYYY format | Used for temporal analysis |

Why Data Dictionaries Matter

Data dictionaries aren't just documentation busywork. They prevent real mistakes:

  1. They prevent misclassification. Without the data dictionary, someone might try to calculate the average zip code — a meaningless number. The dictionary makes clear that zip_code is categorical.

  2. They explain missing values. The note about days_to_recovery tells you that NA means "lost to follow-up," not "zero days" or "the patient died." This matters enormously for analysis.

  3. They ensure reproducibility. If another researcher wants to replicate Dr. Chen's analysis, the data dictionary tells them exactly how each variable was defined and measured.

  4. They're required in professional settings. In healthcare, government, and most research contexts, a dataset without a data dictionary is considered incomplete. Many journals won't publish research unless the data dictionary is available.

Reading Data Dictionaries in the Wild

Real data dictionaries can be much more complex than our example. When you encounter one, look for these key elements:

  • Variable name: The column header in the actual data file
  • Description: What the variable represents in plain language
  • Type: Categorical (nominal/ordinal) or numerical (discrete/continuous)
  • Valid values: What values are allowed — either a list (for categorical) or a range (for numerical)
  • Missing value codes: How missing data is represented (NA, -99, blank, etc.)
  • Units: For numerical variables — is it inches or centimeters? Days or hours?

Building a Data Dictionary with Python

Here's a quick preview of how to explore a dataset's structure in Python using pandas (you'll learn pandas properly in Chapter 3):

```python
import pandas as pd

# Load a dataset (we'll learn this properly in Chapter 3)
df = pd.read_csv("flu_surveillance.csv")

# See the first few rows
print(df.head())

# Check what Python thinks each column's data type is
print(df.dtypes)
```

Output:

```
patient_id           object    # "object" usually means text/categorical
age                   int64    # integer — Python sees this as numerical
zip_code              int64    # Python thinks this is numerical (but WE know it's categorical!)
diagnosis            object    # text — categorical
hospitalized         object    # text — categorical
days_to_recovery    float64    # float — numerical (float because some values are NA)
collection_date      object    # text — needs to be converted to a date type
```

Notice something important: Python got the zip code wrong. It sees digits and assumes they're numerical. This is exactly why data dictionaries matter — the software can't always tell the difference. You need to know your data well enough to correct these misclassifications.

In a spreadsheet (Excel or Google Sheets), you'd check the data type by selecting a column and looking at the cell formatting. Numbers formatted as "General" or "Number" are treated as numerical; those formatted as "Text" are treated as categorical. But again, the software might guess wrong — zip codes stored as numbers will lose their leading zeros (02138 becomes 2138), which can cause problems.
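In pandas, the zip-code problem has a two-line fix: read the column as text up front, or repair it afterward. A sketch, using an in-memory stand-in for the hypothetical flu_surveillance.csv file:

```python
import pandas as pd
from io import StringIO

# Stand-in for flu_surveillance.csv (hypothetical file from the text)
csv = StringIO("patient_id,zip_code\nP-1001,90210\nP-2001,02138\n")

# Option 1: tell read_csv the column is text, so leading zeros survive
df = pd.read_csv(csv, dtype={"zip_code": str})
print(df["zip_code"].tolist())  # ['90210', '02138'] — zeros intact

# Option 2: repair a column that was already read as numbers
fixed = pd.Series([90210, 2138]).astype(str).str.zfill(5)
print(fixed.tolist())  # ['90210', '02138']
```

Either way, you are overriding the software's guess with what the data dictionary told you.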

Productive Struggle

Look at the dataset you chose for your Data Detective Portfolio (from Chapter 1). Without looking up the official data dictionary, try to create your own:

  1. List every variable (column) in your dataset
  2. For each variable, classify it as nominal, ordinal, discrete, or continuous
  3. Note any variables that seem ambiguous — where you're not sure of the classification

After you've tried, find the official documentation for your dataset and compare. Where did you agree? Where did you disagree? What did you learn from the discrepancy?

This is a genuine challenge — even experienced data analysts disagree on some classifications. The goal isn't perfection; it's building the habit of thinking carefully about data types before diving into analysis.


2.6 Levels of Measurement: Why the Hierarchy Matters

You've now learned four types of variables: nominal, ordinal, discrete, and continuous. These aren't just labels — they form a hierarchy that determines what you can and can't do with your data. This hierarchy is called the levels of measurement, and it was first formalized by psychologist Stanley Stevens in 1946.

The Four Levels

| Level | What You Can Do | Example | Operations Allowed |
|---|---|---|---|
| Nominal | Classify, count frequencies, find the mode | Blood type (A, B, AB, O) | =, ≠ |
| Ordinal | All of nominal + rank, compare (greater/less) | Pain level (none, mild, moderate, severe) | =, ≠, <, > |
| Interval | All of ordinal + measure exact differences | Temperature in Fahrenheit (32°F, 72°F, 100°F) | =, ≠, <, >, +, − |
| Ratio | All of interval + compute meaningful ratios | Height in inches (60", 72") | =, ≠, <, >, +, −, ×, ÷ |

Wait — interval and ratio? Those are new. Let me explain.

Interval level variables have meaningful, equal distances between values, but no true zero point. Temperature in Fahrenheit is the classic example: the difference between 40°F and 50°F is the same as between 80°F and 90°F (both 10°F). But 0°F doesn't mean "no temperature" — it's just an arbitrary point on the scale. And you can't say "80°F is twice as hot as 40°F" because the zero isn't meaningful.

Ratio level variables have everything interval has, plus a true zero point. Height, weight, income, and time are ratio variables: 0 inches means no height, 0 dollars means no income, and it does make sense to say "6 feet is twice as tall as 3 feet."
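The "twice as hot" fallacy is easy to verify with the standard temperature conversions: a Fahrenheit ratio changes when you merely change units, which a real ratio never does:

```python
# "80°F is twice as hot as 40°F"? Watch the ratio fall apart under a unit
# change — the signature of an interval (no-true-zero) scale.
def f_to_c(f):
    return (f - 32) * 5 / 9

def f_to_k(f):  # Kelvin has a true zero, so ratios ARE meaningful there
    return f_to_c(f) + 273.15

print(80 / 40)                  # 2.0 in Fahrenheit...
print(f_to_c(80) / f_to_c(40))  # 6.0 in Celsius — the "ratio" wasn't real
print(f_to_k(80) / f_to_k(40))  # ~1.08 in Kelvin — the physical ratio
```

The heat content of the air didn't change between lines; only the labels did. A genuine ratio would survive any change of units.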

How This Matters in Practice

"So what?" you might be thinking. "Why do I care whether temperature is interval or ratio?"

Because the level of measurement determines which statistical operations are valid:

  • Nominal data: You can count frequencies and find the mode (most common category). That's about it. You can calculate percentages ("40% of patients had Influenza A"). You cannot meaningfully rank, average, or compute differences.

  • Ordinal data: You can do everything you can with nominal data, plus you can rank and compare. "More patients rated their pain as severe than mild." But you cannot assume the intervals are equal, so averaging is questionable.

  • Interval data: You can add and subtract meaningfully. "Today is 15 degrees warmer than yesterday." But ratios are problematic — "twice as hot" doesn't work with Fahrenheit.

  • Ratio data: Everything is fair game. Averages, differences, ratios — all meaningful. "The average height is 67 inches. The tallest student is 1.2 times taller than the shortest."

Here's how this connects to our anchor examples:

| Variable | Level | What Sam Okafor Can Do With It |
|---|---|---|
| Player position | Nominal | Count how many guards vs. forwards; find the most common position |
| Draft round (1st, 2nd, undrafted) | Ordinal | Rank players by draft round; compare who was drafted higher |
| Points scored per game | Ratio | Calculate averages, compare differences, say "Player A scores twice as much as Player B" |
| Plus/minus rating (+5, -3, 0) | Interval | Calculate differences; the zero means "even," not "nothing" |

Intuition: Think of the levels as a ladder. Each rung up adds more things you can do:

  • Nominal: name it
  • Ordinal: name it + rank it
  • Interval: name it + rank it + measure exact differences
  • Ratio: name it + rank it + measure exact differences + compute ratios

You can always "go down" the ladder (treat ratio data as ordinal) but you can't "go up" (treat nominal data as ratio).
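Here's "going down" the ladder in code: pandas' cut function turns ratio-level measurements into ordinal bands. The heights and cut points below are made up for illustration:

```python
import pandas as pd

# Ratio-level measurements (hypothetical heights in inches)
heights = pd.Series([61.5, 66.0, 69.2, 72.8, 75.1])

# Going DOWN the ladder: bin into ordered (ordinal) categories
bands = pd.cut(heights,
               bins=[0, 64, 70, 100],
               labels=["short", "medium", "tall"])
print(bands.tolist())  # ['short', 'medium', 'medium', 'tall', 'tall']
```

The reverse trip is impossible: from "tall" alone you can't recover 72.8 vs. 75.1. Information is lost going down, not hidden.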

Math Anxiety Note: The interval vs. ratio distinction matters most in advanced applications. For most of this course, the critical distinction is categorical vs. numerical. If you remember that split and can also distinguish nominal from ordinal, you're ahead of the game.

Check Your Understanding (try to answer without scrolling up)

  1. What's the difference between ordinal and interval data?
  2. Is "year of birth" (e.g., 1998, 2003) interval or ratio? Can you say someone born in 2000 was "born twice as late" as someone born in 1000?
  3. Sam wants to calculate the average number of rebounds per game for each player. What level of measurement must "rebounds per game" be at minimum for this to make sense?

Verify

  1. Ordinal data has a meaningful order but unequal (or unknown) spacing between categories. Interval data has both a meaningful order AND equal spacing between values — but no true zero.
  2. Year of birth is interval, not ratio. There's no "year zero" in many calendar systems, and even where there is, it's arbitrary. Saying "2000 is twice as late as 1000" doesn't make meaningful sense. But you CAN say "the difference between 1998 and 2003 is 5 years."
  3. Interval level at minimum (for the average to be meaningful, differences must be equal). In practice, rebounds per game is ratio level (0 rebounds = truly none), so the average is perfectly valid.

2.7 Cross-Sectional vs. Longitudinal Data

There's one more distinction we need before you're fluent in the language of data. It has to do with when the data was collected.

Cross-sectional data is collected at one point in time (or during one short period). It's a snapshot. Think of it like a photograph — it captures everyone at the same moment.

Examples:

  • A survey of 1,000 adults conducted in March 2026 about their exercise habits
  • Dr. Chen's flu surveillance data from one flu season
  • A Census conducted in a particular year

Longitudinal data is collected from the same observational units at multiple points in time. It's a movie, not a photograph — you see how things change.

Examples:

  • A study that measures the same patients' blood pressure every 6 months for 10 years
  • Alex Rivera tracking the same users' watch time every week for a year before and after the algorithm change
  • The Framingham Heart Study, which has followed residents of Framingham, Massachusetts since 1948
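In a data table, the difference shows up in the shape of the data. A sketch with hypothetical StreamVibe-style values:

```python
import pandas as pd

# Cross-sectional: each user appears once (a snapshot). Hypothetical values.
snapshot = pd.DataFrame({
    "user_id": ["U-1", "U-2", "U-3"],
    "watch_min": [34.7, 68.2, 42.1],
})

# Longitudinal: the SAME users measured at several points in time
panel = pd.DataFrame({
    "user_id":   ["U-1", "U-1", "U-2", "U-2", "U-3", "U-3"],
    "week":      [1, 2, 1, 2, 1, 2],
    "watch_min": [34.7, 36.0, 68.2, 65.5, 42.1, 44.9],
})

# The giveaway: repeated observational units across time
print(snapshot["user_id"].is_unique)  # True  — one row per user
print(panel["user_id"].is_unique)     # False — same users, multiple weeks
```

If a user ID (or patient ID) repeats across rows with different dates, you're probably looking at longitudinal data.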

Why This Distinction Matters

Cross-sectional and longitudinal data answer different questions:

  • Cross-sectional: "How are things right now?" or "How do different groups compare at this moment?"
  • Longitudinal: "How do things change over time?" or "What happens to the same individuals as time passes?"

This connects directly to the correlation vs. causation theme from Chapter 1. Suppose Dr. Chen finds, in cross-sectional data, that people who exercise regularly have lower rates of heart disease. Does exercise cause lower heart disease? Not necessarily — maybe healthier people are more able to exercise in the first place. Cross-sectional data captures a snapshot, not a story of change.

But if she follows the same people for 20 years (longitudinal data) and finds that those who started exercising developed less heart disease than those who didn't, the causal argument gets stronger (though still not airtight — we'll formalize this in Chapter 4).

Intuition: Cross-sectional data is like looking at a photo album page — everyone posed together at one moment. Longitudinal data is like a time-lapse video — you see the same people changing over months or years.
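
The two shapes are easy to see in code. Here is a minimal sketch using pandas (the library you'll meet in Chapter 3); the user IDs and watch-time values are invented for illustration:

```python
import pandas as pd

# Cross-sectional: each user appears once, all measured in the same week
cross_sectional = pd.DataFrame({
    "user_id": ["U-1", "U-2", "U-3"],
    "week": ["2026-W10"] * 3,
    "watch_min": [120, 45, 300],
})

# Longitudinal: the SAME users measured at several points in time
longitudinal = pd.DataFrame({
    "user_id": ["U-1", "U-1", "U-1", "U-2", "U-2", "U-2"],
    "week": ["2026-W10", "2026-W11", "2026-W12"] * 2,
    "watch_min": [120, 130, 90, 45, 50, 60],
})

# In cross-sectional data each unit has one row; in longitudinal data, many
print(cross_sectional["user_id"].value_counts().max())  # 1
print(longitudinal["user_id"].value_counts().max())     # 3
```

The tell-tale difference is the row count per observational unit: a snapshot has one row per unit, a movie has several.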


2.8 Putting It All Together: A Real Dataset Walkthrough

Let's apply everything we've learned to a dataset that Alex Rivera might work with at StreamVibe. Here's a sample of user viewing data:

| User ID | Age | Plan | Genre Preference | Episodes This Week | Avg Session (min) | Joined Date | Satisfaction (1-5) |
|---------|-----|------|------------------|--------------------|-------------------|-------------|--------------------|
| U-4401 | 23 | Free | Comedy | 12 | 34.7 | 2024-08-15 | 4 |
| U-4402 | 45 | Premium | Drama | 5 | 68.2 | 2022-01-03 | 5 |
| U-4403 | 31 | Basic | Sci-Fi | 8 | 42.1 | 2023-06-20 | 3 |
| U-4404 | 19 | Free | Comedy | 22 | 25.6 | 2025-01-10 | 2 |
| U-4405 | 56 | Premium | Documentary | 3 | 91.3 | 2021-03-18 | 5 |
| U-4406 | 28 | Basic | Drama | 9 | 38.4 | 2024-11-02 | 4 |

Step 1: Identify the observational unit.

Each row represents one user. The observational unit is an individual StreamVibe subscriber.

Step 2: Classify every variable.

| Variable | Type | Subtype | Level of Measurement | Reasoning |
|----------|------|---------|----------------------|-----------|
| User ID | Categorical | Nominal | Nominal | Labels for identification; no meaningful order or arithmetic |
| Age | Numerical | Continuous* | Ratio | Measured quantity with a true zero; reported in whole years |
| Plan | Categorical | Ordinal | Ordinal | Free < Basic < Premium has a meaningful order (by price and features) |
| Genre Preference | Categorical | Nominal | Nominal | Labels for categories; no inherent ranking |
| Episodes This Week | Numerical | Discrete | Ratio | Counted (whole numbers); 0 episodes = none |
| Avg Session (min) | Numerical | Continuous | Ratio | Measured duration; 0 minutes = no watching |
| Joined Date | Numerical | Continuous | Interval | Points on a time scale; "twice as late" doesn't make sense |
| Satisfaction (1-5) | Categorical | Ordinal | Ordinal | Ordered categories; distances between 1-2, 2-3, etc. may not be equal |

*Age is technically continuous but often recorded as discrete whole numbers. We treat it as continuous in most analyses.
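
In Chapter 3 you'll verify classifications like these programmatically. As a preview, here is a hedged sketch of how the table above maps onto pandas dtypes, using three of the sample rows (the snake_case column names are our shorthand, and these dtype choices are one reasonable convention, not the only one):

```python
import pandas as pd

users = pd.DataFrame({
    "user_id": ["U-4401", "U-4402", "U-4403"],
    "age": [23, 45, 31],
    "plan": ["Free", "Premium", "Basic"],
    "genre": ["Comedy", "Drama", "Sci-Fi"],
    "episodes_week": [12, 5, 8],
    "avg_session_min": [34.7, 68.2, 42.1],
    "joined": ["2024-08-15", "2022-01-03", "2023-06-20"],
    "satisfaction": [4, 5, 3],
})

# Nominal categorical: category dtype with no order
users["genre"] = users["genre"].astype("category")

# Ordinal categorical: category dtype WITH an explicit order
users["plan"] = pd.Categorical(users["plan"],
                               categories=["Free", "Basic", "Premium"],
                               ordered=True)

# Interval-level dates: datetime, so differences between dates are meaningful
users["joined"] = pd.to_datetime(users["joined"])

print(users.dtypes)
print(users["plan"].min())  # 'Free': min/max are meaningful only because we declared an order
```

Notice that the ordered categorical makes comparisons like `min()` legal, exactly the operation the level-of-measurement hierarchy says ordinal data supports.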

Step 3: Build the data dictionary.

A professional data dictionary for this dataset would document each variable's valid range, how it was collected (self-reported? system-recorded?), what missing values look like, and any caveats. For example, "Avg Session (min)" might have a note: "Calculated by StreamVibe's system as total minutes watched divided by number of sessions; sessions shorter than 30 seconds are excluded."
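
A data dictionary can itself live in a small table, so it travels alongside the data. A minimal sketch, with entries paraphrased from the example above (the structure and field names are ours, for illustration):

```python
import pandas as pd

# A data dictionary stored as its own table, so it can be saved with the
# data (e.g. as a CSV). Entries paraphrase the chapter's example dataset.
data_dictionary = pd.DataFrame([
    {"variable": "user_id", "type": "categorical (nominal)",
     "valid_values": "U-#### identifiers",
     "notes": "System-assigned; used for identification only"},
    {"variable": "plan", "type": "categorical (ordinal)",
     "valid_values": "Free < Basic < Premium",
     "notes": "Ordered by price and features"},
    {"variable": "avg_session_min", "type": "numerical (continuous, ratio)",
     "valid_values": ">= 0",
     "notes": "System-calculated; sessions shorter than 30 seconds excluded"},
])

print(data_dictionary.to_string(index=False))
```

Keeping the dictionary in this machine-readable form makes it easy to check, version, and share, which is exactly the reproducibility benefit discussed earlier.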

Step 4: Ask what analyses make sense.

Now that you know the variable types, you can start thinking about appropriate analyses:

  • Categorical variables: Bar charts, frequency tables, percentages (Chapter 5)
  • Numerical variables: Histograms, averages, standard deviations (Chapters 5-6)
  • Relationships: Is satisfaction related to plan type? (Both categorical → chi-square test, Chapter 19.) Is age related to watch time? (Both numerical → correlation, Chapter 22.)
  • Comparisons: Do Premium users watch more than Free users? (Categorical grouping + numerical outcome → t-test, Chapter 16.)

We're getting ahead of ourselves — you'll learn all these techniques in later chapters. The point for now is: knowing your variable types tells you which tools to reach for. Every statistical technique has requirements about what kinds of variables it works with. Get the classification right, and the rest follows naturally.
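
Even before you learn the formal tests, variable types already dictate the mechanics. Here is a sketch of the Premium-vs-Free comparison, using only the six sample rows from the table above (not real StreamVibe data):

```python
import pandas as pd

# The six sample users: plan (ordinal categorical) and session length (ratio)
sample = pd.DataFrame({
    "plan": ["Free", "Premium", "Basic", "Free", "Premium", "Basic"],
    "avg_session_min": [34.7, 68.2, 42.1, 25.6, 91.3, 38.4],
})

# Categorical grouping variable + numerical outcome -> compare group means
means = sample.groupby("plan")["avg_session_min"].mean()
print(means)  # Basic 40.25, Free 30.15, Premium 79.75
```

The group means suggest Premium users watch longer sessions, but whether that difference is more than chance variation is the question the t-test in Chapter 16 answers.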


2.9 The Human Stories Behind the Categories

Before we wrap up, I want to surface a theme we introduced in Chapter 1: the human stories behind the data. This matters more than you might think when it comes to data types and classification.

Every time we assign someone to a category — a diagnosis code, a risk score, a racial classification — we're making a decision about how to represent a complex human being with a simple label. Those decisions have consequences.

Professor James Washington sees this every day in his research on predictive policing algorithms. When a risk assessment tool classifies a defendant as "high risk" (ordinal category) based on variables like prior convictions (discrete numerical), neighborhood (nominal categorical), and age at first arrest (continuous numerical), those data types aren't just abstract labels. They represent real choices about what gets measured, how it gets measured, and what gets left out.

Consider: "neighborhood" is a nominal categorical variable. But which neighborhoods get labeled "high crime"? Often, they're neighborhoods that have been heavily policed — which means more arrests, which means more data points, which means the algorithm thinks they're more dangerous. The data type (categorical: high crime / low crime) looks objective. But the values that end up in that category are shaped by decades of policing decisions.

Or consider Dr. Maya Chen's flu data. She records "race/ethnicity" as a categorical variable. But racial categories are socially constructed, vary across cultures and time periods, and may not capture the lived experiences of people who identify with multiple groups. The categories we choose shape the stories the data can tell — and the stories it can't.

This doesn't mean we should avoid categorizing data. We need categories to do analysis. But we should always remember: behind every data point is a person, and behind every category is a choice.


2.10 Project Checkpoint: Building Your Data Dictionary

Project Checkpoint

Your task for Chapter 2:

Open the dataset you chose in Chapter 1 for your Data Detective Portfolio. Complete the following:

  1. Identify the observational unit. What does each row represent?
  2. List all variables (columns) in your dataset.
  3. Classify each variable:
     • Categorical (nominal or ordinal) or numerical (discrete or continuous)
     • Level of measurement (nominal, ordinal, interval, or ratio)
  4. Build a data dictionary in a table format, like the one in Section 2.5. Include: variable name, description, type, valid values, and any notes about how the variable was measured.
  5. Flag any ambiguous variables — ones where the classification isn't clear-cut. Write a sentence explaining why you chose the classification you did.
  6. Identify the data structure: Is your dataset cross-sectional or longitudinal? How do you know?

Example: If you chose the World Happiness Report:

  • Observational unit: One country in one year
  • Variables: Country name (nominal), Year (interval), Happiness score (continuous/ratio), GDP per capita (continuous/ratio), Social support (continuous/ratio), Healthy life expectancy (continuous/ratio), Freedom to make life choices (continuous/ratio), Generosity (continuous/ratio), Perceptions of corruption (continuous/ratio)
  • Data structure: Panel data (multiple countries measured across multiple years — a form of longitudinal data)

What this connects to: In Chapter 3, you'll use Python to programmatically inspect your data types and verify your manual classification. In Chapter 5, you'll use your variable classifications to choose the right graph for each variable.


2.11 Chapter Summary

Let's recap what you've learned — the vocabulary you'll use every day for the rest of this course.

The Classification System

| Category | Subcategory | Description | Example |
|----------|-------------|-------------|---------|
| Categorical | Nominal | Categories without order | Blood type, eye color, diagnosis |
| Categorical | Ordinal | Categories with meaningful order | Education level, pain scale, Likert ratings |
| Numerical | Discrete | Counted quantities (whole numbers) | Number of siblings, goals scored |
| Numerical | Continuous | Measured quantities (any value in a range) | Height, temperature, time |

Key Vocabulary

| Term | Definition |
|------|------------|
| Observational unit | The individual entity each row of data describes |
| Variable | A characteristic that varies across observational units |
| Categorical variable | A variable whose values are categories or labels |
| Numerical variable | A variable whose values are meaningful quantities |
| Nominal | Categorical without order |
| Ordinal | Categorical with order |
| Discrete | Numerical, countable values |
| Continuous | Numerical, measurable on a continuous scale |
| Level of measurement | The hierarchy (nominal → ordinal → interval → ratio) that determines valid operations |
| Data dictionary | A document describing every variable in a dataset |
| Parameter | A number describing a population (usually unknown) |
| Statistic | A number describing a sample (calculated from data) |
| Cross-sectional | Data collected at one point in time |
| Longitudinal | Data collected from the same units over multiple time points |

Decision Flowchart: What Type of Variable Is This?

Does the variable record a category/label, or a quantity?
│
├── Category/label → CATEGORICAL
│   │
│   ├── Do the categories have a natural order?
│   │   ├── No  → NOMINAL (blood type, diagnosis)
│   │   └── Yes → ORDINAL (education level, pain scale)
│   │
│
└── Quantity → NUMERICAL
    │
    ├── Is the variable counted (whole numbers only)?
    │   ├── Yes → DISCRETE (number of children, episodes watched)
    │   └── No  → CONTINUOUS (height, temperature, time)
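
The flowchart translates directly into code. A minimal sketch (the function name and boolean flags are ours, for illustration):

```python
def classify_variable(is_quantity: bool,
                      has_order: bool = False,
                      is_counted: bool = False) -> str:
    """Apply the decision flowchart to one variable.

    is_quantity: does the variable record a quantity (vs. a category/label)?
    has_order:   for categories, is there a natural order?
    is_counted:  for quantities, are the values counts (whole numbers only)?
    """
    if not is_quantity:
        return "ORDINAL" if has_order else "NOMINAL"
    return "DISCRETE" if is_counted else "CONTINUOUS"

print(classify_variable(is_quantity=False))                  # blood type -> NOMINAL
print(classify_variable(is_quantity=False, has_order=True))  # pain scale -> ORDINAL
print(classify_variable(is_quantity=True, is_counted=True))  # siblings -> DISCRETE
print(classify_variable(is_quantity=True))                   # height -> CONTINUOUS
```

Writing the flowchart as a function highlights that classification is a sequence of yes/no questions, answered in a fixed order.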

Key Takeaways

  1. Every variable is either categorical or numerical — and getting this right determines which tools and analyses are appropriate.
  2. Numbers aren't always numerical variables. Zip codes, ID numbers, and phone numbers are categorical despite being made of digits.
  3. Parameters describe populations; statistics describe samples. Most real-world analysis uses sample statistics to estimate population parameters.
  4. Data dictionaries are essential. They prevent misclassification, ensure reproducibility, and document assumptions.
  5. Classification decisions have real consequences. How we categorize variables — especially when they describe people — shapes what stories the data can and cannot tell.

Spaced Review

These questions revisit concepts from Chapter 1 to strengthen your long-term retention.

SR.1. Without looking back at Chapter 1, explain the difference between descriptive and inferential statistics. Give a new example of each (one you haven't used before).

Answer: **Descriptive statistics** summarizes and presents data you already have — no generalizing beyond the data. **Inferential statistics** uses sample data to draw conclusions about a larger population. Example answers will vary. A good descriptive example: "The average temperature in my city last July was 87°F." A good inferential example: "Based on a survey of 500 customers, we estimate that 72% of all customers prefer the new design."

SR.2. What are the four pillars of a statistical investigation? (Try to recall them before checking.)

Answer:

  1. Ask a good question
  2. Collect (or find) the data
  3. Analyze the data
  4. Interpret and communicate results

Reference: Chapter 1, Section 1.3

SR.3. In Chapter 1, you learned that statistical thinking is a "threshold concept." Explain how the variable classification system you learned in this chapter (Chapter 2) is an example of statistical thinking in action.

Answer: Statistical thinking involves seeing data through a lens of variation and uncertainty. The variable classification system applies this by forcing you to think carefully about *what kind* of variation a variable captures before jumping to analysis. A statistically thoughtful person doesn't just see numbers — they ask "what kind of numbers?" and "what operations make sense?" This is the habit of mind that Chapter 1 introduced as the foundation of statistical thinking.

What's Next

In Chapter 3: Your Data Toolkit: Python, Excel, and Jupyter Notebooks, you'll set up the tools you'll use throughout this course. You'll load your dataset into Python, use pandas to inspect data types programmatically, and start exploring your data. The variable classification skills you just learned will immediately come into play — you'll see how Python represents categorical and numerical data, and you'll learn to correct it when Python guesses wrong.

Before moving on, complete the exercises and quiz to solidify your understanding. Pay special attention to the exercises about classifying real-world variables — this is a skill you'll use in every remaining chapter.


Chapter 2 Exercises → exercises.md

Chapter 2 Quiz → quiz.md

Case Study: Data Types in Electronic Health Records → case-study-01.md

Case Study: Classifying Data at Scale — Social Media Challenges → case-study-02.md