> "Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom."
Learning Objectives
- Classify variables as categorical (nominal, ordinal) or numerical (discrete, continuous)
- Distinguish between populations and samples in real-world contexts
- Identify observational units and variables in a dataset
- Read and interpret data tables and data dictionaries
- Recognize different levels of measurement and why they matter
In This Chapter
- Chapter Overview
- 2.1 Observational Units and Variables: The Building Blocks
- 2.2 Categorical vs. Numerical: The Big Split
- 2.3 Going Deeper: Nominal, Ordinal, Discrete, and Continuous
- 2.4 Populations, Samples, Parameters, and Statistics (Revisited)
- 2.5 Data Dictionaries: The Rosetta Stone of Datasets
- 2.6 Levels of Measurement: Why the Hierarchy Matters
- 2.7 Cross-Sectional vs. Longitudinal Data
- 2.8 Putting It All Together: A Real Dataset Walkthrough
- 2.9 The Human Stories Behind the Categories
- 2.10 Project Checkpoint: Building Your Data Dictionary
- 2.11 Chapter Summary
- Spaced Review
- What's Next
- Chapter 2 Exercises → exercises.md
- Chapter 2 Quiz → quiz.md
- Case Study: Data Types in Electronic Health Records → case-study-01.md
- Case Study: Classifying Data at Scale — Social Media Challenges → case-study-02.md
Chapter 2: Types of Data and the Language of Statistics
"Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom." — Clifford Stoll, astronomer and author
Chapter Overview
Here's something that trips up a lot of students: they jump into formulas and graphs without first understanding the type of data they're working with. Then they calculate an average of zip codes, try to build a bar chart of heights, or feed categorical labels into a regression model — and everything goes sideways.
It's like trying to cook without knowing the difference between tablespoons and cups. You might get lucky, but eventually you'll put three cups of salt in a recipe that called for three tablespoons. The dish is ruined, and you don't even know why.
This chapter teaches you the vocabulary that prevents those mistakes. By the end, you'll be able to look at any dataset — a spreadsheet, a research paper, a government database — and immediately identify what kind of data you're working with, what operations make sense, and what tools to reach for. This is the language statisticians speak every day, and once you've got it, everything else in this course builds on top of it.
In this chapter, you will learn to:
- Classify variables as categorical (nominal, ordinal) or numerical (discrete, continuous)
- Distinguish between populations and samples in real-world contexts
- Identify observational units and variables in a dataset
- Read and interpret data tables and data dictionaries
- Recognize different levels of measurement and why they matter
Fast Track: If you've taken AP Statistics or a prior course, skim Sections 2.1-2.3 and jump to Section 2.5 ("Data Dictionaries"). Complete quiz questions 1, 5, 10, and 15 to verify your foundation.
Deep Dive: After this chapter, read Case Study 1 (electronic health records) to see how data classification decisions have life-or-death consequences in healthcare.
2.1 Observational Units and Variables: The Building Blocks
Before we can classify data types, we need two foundational ideas. Every dataset in the world — whether it's a spreadsheet of patient records, a database of basketball stats, or a table of streaming metrics — is organized around two things: observational units and variables.
What's an Observational Unit?
An observational unit (sometimes called a "case" or "unit of observation") is the individual entity that you're collecting data about. It's the "who" or "what" of your dataset.
In a medical study, each patient is an observational unit. In a survey, each respondent is an observational unit. In Sam Okafor's basketball data, each game (or each shot attempt) could be an observational unit — and that choice matters, as we'll see.
Here's a simple test: look at one row of a dataset. What does that row represent? That's your observational unit.
What's a Variable?
A variable is a characteristic or property that can take different values across your observational units. It's the "what" you're measuring or recording about each unit.
In Dr. Maya Chen's disease surveillance data, variables might include patient age, zip code, diagnosis, date of symptom onset, and whether the patient was hospitalized. Each of these characteristics varies across patients — that's why we call them variables.
Let's look at a concrete example. Here's a small dataset that Dr. Maya Chen might work with during flu season:
| Patient ID | Age | Zip Code | Diagnosis | Hospitalized | Days to Recovery |
|---|---|---|---|---|---|
| P-1001 | 34 | 90210 | Influenza A | No | 7 |
| P-1002 | 67 | 90045 | Influenza B | Yes | 14 |
| P-1003 | 8 | 90210 | Influenza A | No | 5 |
| P-1004 | 45 | 90301 | Influenza A | No | 9 |
| P-1005 | 72 | 90045 | Influenza B | Yes | 21 |
| P-1006 | 29 | 90301 | Influenza A | No | 6 |
In this dataset:
- Observational unit: Each patient (one per row)
- Variables: Patient ID, Age, Zip Code, Diagnosis, Hospitalized, Days to Recovery (one per column)
Notice that every row has the same set of variables, but the values differ from row to row. Patient P-1001 is 34 years old; Patient P-1005 is 72. That variation across rows is exactly what makes these variables — and exactly what makes statistics interesting.
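If it helps to see the rows-and-columns idea in code, here is a minimal sketch that types the flu table in by hand as a list of Python dictionaries. The lowercase key names (`patient_id`, `zip_code`, and so on) are our own choice for illustration, not an official format.

```python
# The flu table from above, one dictionary per row.
# Key names are an illustrative choice, not an official schema.
patients = [
    {"patient_id": "P-1001", "age": 34, "zip_code": "90210",
     "diagnosis": "Influenza A", "hospitalized": "No", "days_to_recovery": 7},
    {"patient_id": "P-1002", "age": 67, "zip_code": "90045",
     "diagnosis": "Influenza B", "hospitalized": "Yes", "days_to_recovery": 14},
    {"patient_id": "P-1003", "age": 8, "zip_code": "90210",
     "diagnosis": "Influenza A", "hospitalized": "No", "days_to_recovery": 5},
    {"patient_id": "P-1004", "age": 45, "zip_code": "90301",
     "diagnosis": "Influenza A", "hospitalized": "No", "days_to_recovery": 9},
    {"patient_id": "P-1005", "age": 72, "zip_code": "90045",
     "diagnosis": "Influenza B", "hospitalized": "Yes", "days_to_recovery": 21},
    {"patient_id": "P-1006", "age": 29, "zip_code": "90301",
     "diagnosis": "Influenza A", "hospitalized": "No", "days_to_recovery": 6},
]

# Each element of the list is one observational unit (one patient).
n_units = len(patients)

# The keys shared by every row are the variables.
variables = list(patients[0].keys())

# Every row records the same variables; only the values differ.
same_variables = all(list(row.keys()) == variables for row in patients)

print(n_units)         # 6
print(variables)
print(same_variables)  # True
```

The "one row, one unit" test from earlier shows up directly here: asking what a single dictionary represents gives you the observational unit.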
Intuition: Think of a dataset like a seating chart for a dinner party. Each person (observational unit) sits in a row. The columns are the questions you ask everyone: "What's your name? How old are you? What's your favorite food?" The answers (values) differ from person to person — that's why we call them variables.
Why This Matters
Here's the practical payoff: once you identify the observational unit, you can figure out what level of analysis makes sense. If your observational unit is "each student in a class," then calculating the average grade tells you something about that specific class. If your observational unit is "each school in a district," then averaging tells you something about the district's schools.
Getting the observational unit wrong is one of the most common, and most consequential, mistakes in data analysis. We'll see examples of this throughout the book, especially when we discuss the ecological fallacy in Chapter 27.
2.2 Categorical vs. Numerical: The Big Split
Now for the concept that will follow you through every chapter of this course. Every variable you encounter falls into one of two fundamental categories:
Categorical variables place each observational unit into a group or category. The values are labels or names, not quantities.
Numerical variables assign each observational unit a number that represents a quantity — something you can meaningfully add, subtract, or average.
This sounds simple, and the basic distinction usually is. But the edges can be tricky, and getting it wrong leads to real problems. Let's build your intuition.
Categorical Variables: Names and Groups
A categorical variable (also called a qualitative variable) records a quality or category. The values answer the question "what type?" or "which group?"
From Dr. Maya Chen's flu data:
- Diagnosis (Influenza A, Influenza B): categories, not quantities
- Hospitalized (Yes, No): two groups
- Zip Code (90210, 90045, 90301): yes, even though these are numbers!
Wait — zip codes are numbers, so aren't they numerical? This is the trap that catches students every semester. Here's the key question: does it make sense to do arithmetic with these values?
What's the "average zip code" of these patients? You could calculate (90210 + 90045 + 90210 + 90301 + 90045 + 90301) / 6. You'd get a number, roughly 90185. But that number would be completely meaningless. You can't walk to "average zip code 90185." Zip codes are labels: they identify locations, not quantities. The fact that they happen to be written with digits doesn't make them numerical.
Common Pitfall: Numbers are not always numerical variables. Jersey numbers, Social Security numbers, phone numbers, zip codes, and ID numbers are all categorical. The test: does arithmetic (addition, subtraction, averaging) produce a meaningful result? If not, it's categorical.
Numerical Variables: Quantities You Can Calculate With
A numerical variable (also called a quantitative variable) records a measurable quantity. The values answer the question "how much?" or "how many?"
From Dr. Chen's data:
- Age (34, 67, 8, 45, 72, 29): meaningful to average, compare, subtract
- Days to Recovery (7, 14, 5, 9, 21, 6): meaningful to calculate "Patient P-1005 took 16 more days than Patient P-1003"
These pass the arithmetic test. The average age of these patients is (34 + 67 + 8 + 45 + 72 + 29) / 6 = 42.5 years, and that number actually means something.
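Here is a small sketch of the arithmetic test using the six patients from the flu table. It relies only on Python's standard library. Storing the zip codes as text is our own choice, made to signal that they are labels.

```python
from collections import Counter
from statistics import mean

ages = [34, 67, 8, 45, 72, 29]
zip_codes = ["90210", "90045", "90210", "90301", "90045", "90301"]  # stored as text on purpose

# Numerical variable: the average is a meaningful summary.
mean_age = mean(ages)
print(mean_age)  # 42.5

# Categorical variable: counting how often each label appears is meaningful.
zip_counts = Counter(zip_codes)
print(zip_counts)

# You CAN force arithmetic on zip codes, but the result describes nothing real.
meaningless_mean = mean([int(z) for z in zip_codes])
print(round(meaningless_mean))  # 90185, which is not a place anyone lives
```

Notice that Python raises no complaint about the last calculation. The software will happily average labels, so the judgment about whether the result means anything has to come from you.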
The Quick Test
When you're unsure whether a variable is categorical or numerical, ask yourself these two questions:
- Does it make sense to calculate an average? If yes → numerical. If the average is meaningless → categorical.
- Do the values represent quantities, or labels? Quantities → numerical. Labels → categorical.
Here's a cheat sheet using examples from our four anchor characters:
| Variable | Character | Categorical or Numerical? | Why? |
|---|---|---|---|
| Disease type (flu strain) | Dr. Chen | Categorical | Labels for different strains; you can't average "Influenza A" and "Influenza B" |
| Patient age | Dr. Chen | Numerical | A quantity; average age is meaningful |
| User subscription tier (Free, Basic, Premium) | Alex Rivera | Categorical | Labels for groups; averaging doesn't make sense |
| Daily watch time (minutes) | Alex Rivera | Numerical | A measured quantity; average watch time is meaningful |
| Race/ethnicity of defendant | Prof. Washington | Categorical | Labels for demographic groups |
| Risk score (1-10) assigned by algorithm | Prof. Washington | It depends! | See "The Gray Areas" below |
| Player position (guard, forward, center) | Sam Okafor | Categorical | Labels for roles |
| Points scored per game | Sam Okafor | Numerical | A counted quantity; average points is meaningful |
Check Your Understanding (try to answer without scrolling up)
- In your own words, what's the difference between a categorical and a numerical variable?
- A dataset contains a column called "Customer Rating" with values 1 through 5 (where 1 = "Very Unsatisfied" and 5 = "Very Satisfied"). Is this categorical or numerical? Defend your answer.
- Is "Phone Number" a categorical or numerical variable? Why?
Verify
- A categorical variable places observations into groups or categories (labels). A numerical variable records a quantity that you can meaningfully do arithmetic with.
- This is a genuine gray area! The ratings are ordinal categories (we'll define this in the next section) — they have a meaningful order (5 > 4 > 3 > 2 > 1) but the "distances" between values aren't necessarily equal. In practice, many analysts treat 1-5 ratings as numerical and calculate averages (e.g., "average rating: 4.2 stars"), but this is technically a simplification. We'll discuss this nuance in Section 2.3.
- Categorical. Even though phone numbers contain digits, they're labels — identifiers for specific phone lines. Averaging two phone numbers produces nonsense.
2.3 Going Deeper: Nominal, Ordinal, Discrete, and Continuous
The categorical/numerical split is the big divide. But each side has subtypes that matter for choosing the right analysis and the right graph. Let's break each one down.
Categorical Subtypes: Nominal vs. Ordinal
Not all categories are created equal. Some have a natural order; others don't.
Nominal variables are categorical variables where the categories have no inherent order. "Nominal" comes from the Latin word for "name" — these are just names.
Examples:
- Blood type (A, B, AB, O): there's no sense in which type A is "more" than type B
- Eye color (brown, blue, green, hazel): no natural ranking
- Diagnosis (Influenza A, Influenza B, COVID-19): different categories, not ranked
- Alex Rivera's user device type (mobile, tablet, desktop, smart TV): no inherent order
Ordinal variables are categorical variables where the categories have a meaningful order, but the distances between categories aren't necessarily equal.
Examples:
- Education level (high school, bachelor's, master's, doctorate): there's a clear ordering
- Pain scale (none, mild, moderate, severe): "severe" is worse than "mild," but is the gap between "mild" and "moderate" the same as between "moderate" and "severe"? We don't know.
- Military rank (private, corporal, sergeant, lieutenant): ordered but not equally spaced
- Likert scale responses (strongly disagree, disagree, neutral, agree, strongly agree): ordered categories
Here's the critical distinction: with ordinal variables, you can say "A is more/higher/better than B," but you can't say "the difference between A and B equals the difference between B and C." The ordering is meaningful, but the spacing is not.
Real Talk About Ordinal Data: Here's a controversy that statisticians actually argue about: should you calculate the average of ordinal data? If a survey uses a 1-5 scale (strongly disagree to strongly agree), is it okay to report "the average response was 3.7"?
Strictly speaking, no — because the distances between 1 and 2, 2 and 3, etc. aren't guaranteed to be equal. But in practice, researchers do it constantly because it's useful and the results are usually reasonable. You'll see "average Likert score" in thousands of published papers.
Our advice: know the rule, understand why purists object, and recognize that treating ordinal data as numerical is a simplification that sometimes works and sometimes doesn't. When in doubt, use methods designed for ordinal data (we'll cover some in Chapter 21).
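To make the debate concrete, here is a rough sketch with seven made-up survey responses (not real data). It computes an order-based summary, the median category, which needs only the ranking. It also computes the "average Likert score," which silently assumes the steps between categories are equal.

```python
from statistics import mean, median_low

# The five Likert categories, in order, mapped to positions 1-5.
SCALE = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
rank = {label: position for position, label in enumerate(SCALE, start=1)}

# Hypothetical responses, invented for this illustration.
responses = ["agree", "neutral", "strongly agree", "agree",
             "disagree", "agree", "neutral"]

codes = [rank[r] for r in responses]  # [4, 3, 5, 4, 2, 4, 3]

# Safe for ordinal data: the median only uses the ordering.
median_label = SCALE[median_low(codes) - 1]
print(median_label)  # agree

# The simplification: averaging the codes treats every step as the same size.
mean_code = mean(codes)
print(round(mean_code, 2))  # 3.57
```

Both summaries are easy to compute. The difference is in what you have to believe for each one to be meaningful.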
Numerical Subtypes: Discrete vs. Continuous
Numerical variables also come in two flavors.
Discrete variables take on countable values — typically whole numbers, with gaps between possible values. You can count them.
Examples:
- Number of emergency room visits (0, 1, 2, 3, ...): you can't have 2.7 visits
- Number of three-pointers made in a game (0, 1, 2, ..., 15): whole numbers only
- Number of children in a household (0, 1, 2, 3, ...): no fractional children
- Number of episodes watched on StreamVibe (0, 1, 2, ...): you either watched an episode or you didn't
Continuous variables can take on any value within a range, including fractions and decimals. You measure them.
Examples:
- Height (5.583 feet, 172.4 cm): any value on a continuous scale
- Temperature (98.6 degrees, 37.0 degrees): can be measured to arbitrary precision
- Time spent watching (47.3 minutes, 2.15 hours): measured, not counted
- Blood pressure (120/80 mmHg): measured on a continuous scale
Intuition: Here's a quick rule of thumb. Ask: "Do I count it, or do I measure it?"
- Count → discrete
- Measure → continuous
This works for most cases. You count the number of patients. You measure their temperature.
The Gray Areas
Let me be honest: the boundaries between these categories aren't always crystal clear. Real data is messy, and there are genuine ambiguities.
Age: Is it discrete or continuous? Technically, age is continuous — time passes continuously. But we usually report age in whole years (34, not 34.267), making it look discrete. In practice, most analysts treat age as continuous.
Money: Your bank balance might be $1,247.83 — continuous? Or is it discrete because it's counted in cents? In practice, dollar amounts are treated as continuous when the values span a wide range.
Risk scores (1-10): Professor Washington examines risk scores assigned by a predictive policing algorithm. Is a 1-10 score ordinal (ordered categories) or numerical (discrete)? It depends on how it was constructed. If the numbers come from a mathematical model and a score of 8 really is "twice as risky" as a score of 4, it's numerical. If the numbers are arbitrary labels where 8 just means "more risky" than 4 with no precise meaning for the gap, it's ordinal.
Don't let these gray areas paralyze you. In practice, the classification decision usually becomes clear when you ask: "What analysis am I planning to do, and does the data type support it?"
The Complete Classification Tree
Here's the full picture, all in one place:
                    Variable
                   /        \
          Categorical        Numerical
           /      \           /      \
      Nominal   Ordinal  Discrete  Continuous
| Type | Has meaningful order? | Has meaningful distances? | Arithmetic makes sense? | Examples |
|---|---|---|---|---|
| Nominal | No | No | No | Blood type, zip code, eye color |
| Ordinal | Yes | No | Limited | Pain scale, education level, Likert ratings |
| Discrete | Yes | Yes | Yes | Number of siblings, goals scored, defects counted |
| Continuous | Yes | Yes | Yes | Height, weight, temperature, time |
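The tree can also be read as a short series of yes/no questions. The sketch below is one possible way to write that down. The function name `classify_variable` and its three questions are our own illustration, and the gray areas discussed above still need human judgment before you can answer them.

```python
def classify_variable(is_quantity, has_order=False, is_counted=False):
    """Walk the classification tree using three yes/no answers.

    is_quantity: does arithmetic on the values produce something meaningful?
    has_order:   (categorical only) do the categories have a natural order?
    is_counted:  (numerical only) do you count it rather than measure it?
    """
    if not is_quantity:
        return "ordinal" if has_order else "nominal"
    return "discrete" if is_counted else "continuous"


print(classify_variable(is_quantity=False))                  # zip code -> nominal
print(classify_variable(is_quantity=False, has_order=True))  # pain scale -> ordinal
print(classify_variable(is_quantity=True, is_counted=True))  # number of siblings -> discrete
print(classify_variable(is_quantity=True))                   # height -> continuous
```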
2.4 Populations, Samples, Parameters, and Statistics (Revisited)
You met the terms population and sample in Chapter 1. Now let's deepen your understanding and add two crucial companion terms.
Population vs. Sample: The Full Picture
Recall from Chapter 1 that a population is the entire group you want to study, and a sample is the subset you actually observe. But here's what we didn't emphasize enough last time: the same group of people can be either a population or a sample, depending on your question.
Let's use Alex Rivera's StreamVibe data to make this concrete.
Suppose StreamVibe has 8.2 million subscribers. Alex randomly selects 5,000 users to test the new recommendation algorithm. In this scenario:
- Population: All 8.2 million StreamVibe subscribers
- Sample: The 5,000 users selected for the test
But now suppose Alex's boss asks a different question: "Among just the 5,000 test users, what was the average watch time?" Now those 5,000 users are the population — because they're the entire group the boss cares about. No inference needed.
Same people, different roles. Whether a group is a population or a sample depends on the question you're asking.
Parameters vs. Statistics: The Vocabulary of Inference
This is where we add two new terms that will become essential starting in Chapter 11.
A parameter is a number that describes a population. It's the truth — the actual value you'd get if you could measure every single member of the population. Parameters are usually unknown because we rarely have access to the entire population.
A statistic is a number that describes a sample. It's what you actually calculate from the data you have. Statistics are known — you calculated them — but they're estimates of the unknown parameters.
| | Population | Sample |
|---|---|---|
| Who? | Everyone you want to study | The subset you actually observe |
| Number that describes it | Parameter | Statistic |
| Known or unknown? | Usually unknown | Known (you calculated it) |
| Goal | What you want to learn | What you use to estimate |
Here's a concrete example. Sam Okafor wants to know Daria Kowalczyk's "true" three-point shooting ability — the percentage she would shoot if she took an infinite number of shots under identical conditions. That true percentage is a parameter. It's fixed but unknown.
What Sam actually has is her shooting percentage this season: 38% on 65 attempts. That 38% is a statistic — a number calculated from a sample of shots (the 65 she's taken so far). It's his best estimate of the parameter, but it's not exactly right. If Daria took another 65 shots, she might shoot 35% or 41%. The statistic varies from sample to sample; the parameter does not.
Intuition: A parameter is the bullseye on a dartboard. A statistic is where your dart actually lands. You're aiming for the parameter, and with good technique (proper sampling) you'll land close. But you'll almost never hit the exact center.
Math Anxiety Note: Don't worry — we won't do formal parameter estimation until Chapters 11-12. For now, just internalize the vocabulary: parameters describe populations, statistics describe samples. If you can remember that, you're golden.
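If you'd like to watch a statistic vary while the parameter stays put, here is a small simulation sketch. It makes one big assumption that real life never grants us: it pretends we know the true shooting rate and sets it to 0.38. The constant names and the seed value are arbitrary choices for illustration.

```python
import random

rng = random.Random(11)       # fixed seed so the run is repeatable
ASSUMED_TRUE_RATE = 0.38      # the "parameter"; in real life this is unknown
SHOTS_PER_SAMPLE = 65


def shooting_percentage(n_shots):
    """Simulate n_shots attempts and return the fraction made (a statistic)."""
    makes = sum(1 for _ in range(n_shots) if rng.random() < ASSUMED_TRUE_RATE)
    return makes / n_shots


# Ten different samples of 65 shots, all from the same unchanging parameter.
sample_stats = [shooting_percentage(SHOTS_PER_SAMPLE) for _ in range(10)]

for i, pct in enumerate(sample_stats, start=1):
    print(f"Sample {i}: {pct:.1%}")
```

When you run it, the ten percentages scatter around 38% without all matching it. Those are the darts landing near the bullseye.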
Check Your Understanding (try to answer without scrolling up)
- Dr. Maya Chen surveys 2,000 residents of a county about their flu vaccination status. 64% of respondents say they've been vaccinated. Is 64% a parameter or a statistic?
- What would the corresponding parameter be?
- In Alex Rivera's A/B test, if the average watch time for the 5,000 test users is 47 minutes, is 47 minutes a parameter or a statistic?
Verify
- It's a statistic — it's calculated from a sample (2,000 residents), not the entire county population.
- The corresponding parameter would be the true vaccination rate of all residents in the county. This is unknown — the survey estimates it.
- It depends on the question! If Alex wants to know about ALL StreamVibe subscribers, then 47 minutes is a statistic (sample estimate). If the question is specifically about these 5,000 test users, then 47 minutes is a parameter (it describes the entire group of interest).
2.5 Data Dictionaries: The Rosetta Stone of Datasets
Imagine you're handed a spreadsheet with 50 columns and 10,000 rows. The column headers say things like BP_SYS, DX_CODE, LOS, and ADM_TYPE. What do these mean? What values are valid? How was each variable measured?
Without a data dictionary, you're lost.
A data dictionary (also called a codebook or metadata file) is a document that describes every variable in a dataset: its name, its type, what it measures, what values it can take, and how it was collected.
Here's what a data dictionary looks like for Dr. Maya Chen's flu surveillance data:
| Variable Name | Description | Type | Valid Values | Notes |
|---|---|---|---|---|
| `patient_id` | Unique patient identifier | Nominal (categorical) | P-1001 through P-9999 | Not used in analysis; for tracking only |
| `age` | Patient age at time of diagnosis | Continuous (numerical) | 0-120 | Recorded in whole years; ages < 1 recorded as 0 |
| `zip_code` | Patient's residential zip code | Nominal (categorical) | 5-digit U.S. zip codes | Used for geographic analysis, not arithmetic |
| `diagnosis` | Flu strain identified by lab test | Nominal (categorical) | Influenza A, Influenza B, Unspecified | "Unspecified" if lab test not performed |
| `hospitalized` | Whether patient was hospitalized | Nominal (categorical) | Yes, No | Binary variable |
| `days_to_recovery` | Days from symptom onset to symptom resolution | Discrete (numerical) | 1-90 | Self-reported; some patients lost to follow-up (recorded as NA) |
| `collection_date` | Date the data was recorded | Continuous (numerical) | Dates in MM/DD/YYYY format | Used for temporal analysis |
Why Data Dictionaries Matter
Data dictionaries aren't just documentation busywork. They prevent real mistakes:
- They prevent misclassification. Without the data dictionary, someone might try to calculate the average zip code, a meaningless number. The dictionary makes clear that `zip_code` is categorical.
- They explain missing values. The note about `days_to_recovery` tells you that NA means "lost to follow-up," not "zero days" or "the patient died." This matters enormously for analysis.
- They ensure reproducibility. If another researcher wants to replicate Dr. Chen's analysis, the data dictionary tells them exactly how each variable was defined and measured.
- They're required in professional settings. In healthcare, government, and most research contexts, a dataset without a data dictionary is considered incomplete. Many journals won't publish research unless the data dictionary is available.
Reading Data Dictionaries in the Wild
Real data dictionaries can be much more complex than our example. When you encounter one, look for these key elements:
- Variable name: The column header in the actual data file
- Description: What the variable represents in plain language
- Type: Categorical (nominal/ordinal) or numerical (discrete/continuous)
- Valid values: What values are allowed — either a list (for categorical) or a range (for numerical)
- Missing value codes: How missing data is represented (NA, -99, blank, etc.)
- Units: For numerical variables — is it inches or centimeters? Days or hours?
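A data dictionary doesn't have to live only in a document. As a rough sketch, here is one way a few entries from the flu dictionary could be written as a Python structure and used to flag values that break the rules. The layout, the `find_problems` helper, and the two test rows are all hypothetical, made up for this illustration.

```python
# Hypothetical machine-readable version of a few data dictionary entries.
data_dictionary = {
    "age": {"type": "continuous", "min": 0, "max": 120},
    "diagnosis": {"type": "nominal",
                  "allowed": {"Influenza A", "Influenza B", "Unspecified"}},
    "hospitalized": {"type": "nominal", "allowed": {"Yes", "No"}},
    "days_to_recovery": {"type": "discrete", "min": 1, "max": 90},
}


def find_problems(row, dictionary, missing_code="NA"):
    """Return (variable, value) pairs in `row` that violate the dictionary."""
    problems = []
    for name, rule in dictionary.items():
        value = row.get(name, missing_code)
        if value == missing_code:
            continue  # missing is allowed; the dictionary explains what it means
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append((name, value))
        elif "min" in rule and not (rule["min"] <= value <= rule["max"]):
            problems.append((name, value))
    return problems


# Two invented rows: one clean, one with a typo in age and an invalid diagnosis.
good_row = {"age": 34, "diagnosis": "Influenza A",
            "hospitalized": "No", "days_to_recovery": 7}
bad_row = {"age": 340, "diagnosis": "Influenza C",
           "hospitalized": "No", "days_to_recovery": "NA"}

print(find_problems(good_row, data_dictionary))  # []
print(find_problems(bad_row, data_dictionary))   # [('age', 340), ('diagnosis', 'Influenza C')]
```

The valid-values and missing-value-code elements from the list above are what make this kind of automatic check possible.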
Building a Data Dictionary with Python
Here's a quick preview of how to explore a dataset's structure in Python using pandas (you'll learn pandas properly in Chapter 3):
import pandas as pd
# Load a dataset (we'll learn this properly in Chapter 3)
df = pd.read_csv("flu_surveillance.csv")
# See the first few rows
print(df.head())
# Check what Python thinks each column's data type is
print(df.dtypes)
Output:
patient_id object # "object" usually means text/categorical
age int64 # integer — Python sees this as numerical
zip_code int64 # Python thinks this is numerical (but WE know it's categorical!)
diagnosis object # text — categorical
hospitalized object # text — categorical
days_to_recovery float64 # float — numerical (float because some values are NA)
collection_date object # text — needs to be converted to a date type
Notice something important: pandas got the zip code wrong. It sees a column of digits and infers an integer type. This is exactly why data dictionaries matter: the software can't always tell the difference. You need to know your data well enough to correct these misclassifications.
In a spreadsheet (Excel or Google Sheets), you'd check the data type by selecting a column and looking at the cell formatting. Numbers formatted as "General" or "Number" are treated as numerical; those formatted as "Text" are treated as categorical. But again, the software might guess wrong — zip codes stored as numbers will lose their leading zeros (02138 becomes 2138), which can cause problems.
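Here is a hedged sketch of how you might correct those guesses in pandas once you know what the data dictionary says. To keep it self-contained, it reads two made-up rows from a string instead of the real `flu_surveillance.csv`. The rows, the dates, and the 02138 zip code are invented to show the leading-zero problem.

```python
import io

import pandas as pd

# Two invented rows standing in for the real file (note the zip code starting with 0).
csv_text = """patient_id,age,zip_code,diagnosis,hospitalized,days_to_recovery,collection_date
P-1001,34,90210,Influenza A,No,7,01/15/2026
P-1002,67,02138,Influenza B,Yes,,01/16/2026
"""

# Default read: pandas infers integers, so the leading zero disappears.
naive = pd.read_csv(io.StringIO(csv_text))
print(naive["zip_code"].tolist())  # [90210, 2138]

# Informed read: tell pandas up front that zip_code is a label, not a quantity.
df = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})
print(df["zip_code"].tolist())     # ['90210', '02138']

# Mark the other categorical columns and parse the date column.
df["diagnosis"] = df["diagnosis"].astype("category")
df["hospitalized"] = df["hospitalized"].astype("category")
df["collection_date"] = pd.to_datetime(df["collection_date"], format="%m/%d/%Y")

print(df.dtypes)
```

The key design choice is fixing the type at load time with `dtype=`. Once a leading zero has been dropped, converting the column back to text won't restore it unless you pad it yourself.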
Productive Struggle
Look at the dataset you chose for your Data Detective Portfolio (from Chapter 1). Without looking up the official data dictionary, try to create your own:
1. List every variable (column) in your dataset
2. For each variable, classify it as nominal, ordinal, discrete, or continuous
3. Note any variables that seem ambiguous, where you're not sure of the classification
After you've tried, find the official documentation for your dataset and compare. Where did you agree? Where did you disagree? What did you learn from the discrepancy?
This is a genuine challenge — even experienced data analysts disagree on some classifications. The goal isn't perfection; it's building the habit of thinking carefully about data types before diving into analysis.
2.6 Levels of Measurement: Why the Hierarchy Matters
You've now learned four types of variables: nominal, ordinal, discrete, and continuous. These aren't just labels. They connect to a hierarchy that determines what you can and can't do with your data. This hierarchy is called the levels of measurement, and it was first formalized by psychologist Stanley Smith Stevens in 1946.
The Four Levels
| Level | What You Can Do | Example | Operations Allowed |
|---|---|---|---|
| Nominal | Classify, count frequencies, find the mode | Blood type (A, B, AB, O) | =, ≠ |
| Ordinal | All of nominal + rank, compare (greater/less) | Pain level (none, mild, moderate, severe) | =, ≠, <, > |
| Interval | All of ordinal + measure exact differences | Temperature in Fahrenheit (32°F, 72°F, 100°F) | =, ≠, <, >, +, − |
| Ratio | All of interval + compute meaningful ratios | Height in inches (60", 72") | =, ≠, <, >, +, −, ×, ÷ |
Wait — interval and ratio? Those are new. Let me explain.
Interval level variables have meaningful, equal distances between values, but no true zero point. Temperature in Fahrenheit is the classic example: the difference between 40°F and 50°F is the same as between 80°F and 90°F (both 10°F). But 0°F doesn't mean "no temperature" — it's just an arbitrary point on the scale. And you can't say "80°F is twice as hot as 40°F" because the zero isn't meaningful.
Ratio level variables have everything interval has, plus a true zero point. Height, weight, income, and time are ratio variables: 0 inches means no height, 0 dollars means no income, and it does make sense to say "6 feet is twice as tall as 3 feet."
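A quick worked check shows why "twice as hot" fails for Fahrenheit. The sketch below converts 40°F and 80°F to Kelvin, a temperature scale that does have a true zero, and compares the ratios. The variable names are our own.

```python
def fahrenheit_to_kelvin(f):
    """Convert degrees Fahrenheit to Kelvin (a scale with a true zero)."""
    return (f - 32) * 5 / 9 + 273.15


cool_f, warm_f = 40.0, 80.0

# On the Fahrenheit scale the numbers suggest "twice as hot"...
naive_ratio = warm_f / cool_f
print(naive_ratio)  # 2.0

# ...but on a scale with a true zero, 80°F is only about 8% hotter than 40°F.
true_ratio = fahrenheit_to_kelvin(warm_f) / fahrenheit_to_kelvin(cool_f)
print(round(true_ratio, 2))  # 1.08

# Differences, by contrast, behave consistently: equal gaps in Fahrenheit
# stay equal after conversion.
gap_low = fahrenheit_to_kelvin(50) - fahrenheit_to_kelvin(40)
gap_high = fahrenheit_to_kelvin(90) - fahrenheit_to_kelvin(80)
print(round(gap_low, 3), round(gap_high, 3))  # 5.556 5.556
```

The ratio changes when the zero point moves, but the differences don't. That's the whole distinction between interval and ratio in three lines of arithmetic.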
How This Matters in Practice
"So what?" you might be thinking. "Why do I care whether temperature is interval or ratio?"
Because the level of measurement determines which statistical operations are valid:
- Nominal data: You can count frequencies, find the mode (most common category), and calculate percentages ("40% of patients had Influenza A"). That's about it. You cannot meaningfully rank, average, or compute differences.
- Ordinal data: You can do everything you can with nominal data, plus you can rank and compare: a patient who reports "severe" pain is worse off than one who reports "mild." But you cannot assume the intervals are equal, so averaging is questionable.
- Interval data: You can add and subtract meaningfully. "Today is 15 degrees warmer than yesterday." But ratios are problematic: "twice as hot" doesn't work with Fahrenheit.
- Ratio data: Everything is fair game. Averages, differences, and ratios are all meaningful. "The average height is 67 inches. The tallest student is 1.2 times as tall as the shortest."
Here's how this connects to our anchor examples:
| Variable | Level | What Sam Okafor Can Do With It |
|---|---|---|
| Player position | Nominal | Count how many guards vs. forwards; find the most common position |
| Draft round (1st, 2nd, undrafted) | Ordinal | Rank players by draft round; compare who was drafted higher |
| Points scored per game | Ratio | Calculate averages, compare differences, say "Player A scores twice as much as Player B" |
| Plus/minus rating (+5, -3, 0) | Interval | Calculate differences; the zero means "even," not "nothing" |
Intuition: Think of the levels as a ladder. Each rung up adds more things you can do:
- Nominal: name it
- Ordinal: name it + rank it
- Interval: name it + rank it + measure exact differences
- Ratio: name it + rank it + measure exact differences + compute ratios
You can always "go down" the ladder (treat ratio data as ordinal) but you can't "go up" (treat nominal data as ratio).
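The ladder idea can be sketched as a small lookup, where each rung inherits everything from the rungs below it. The operation names and the `allowed_operations` helper are our own shorthand for the table in this section, not a standard library.

```python
# The four levels, lowest rung first.
LADDER = ["nominal", "ordinal", "interval", "ratio"]

# What each rung ADDS on top of the rungs below it.
NEW_AT_RUNG = {
    "nominal": {"count", "mode"},
    "ordinal": {"rank", "compare"},
    "interval": {"difference", "mean"},
    "ratio": {"ratio"},
}


def allowed_operations(level):
    """Everything available at `level`: its own rung plus every rung below."""
    rung = LADDER.index(level)
    allowed = set()
    for lower in LADDER[: rung + 1]:
        allowed |= NEW_AT_RUNG[lower]
    return allowed


print(sorted(allowed_operations("ordinal")))   # ['compare', 'count', 'mode', 'rank']
print("mean" in allowed_operations("ordinal")) # False: you can't go up the ladder
print("rank" in allowed_operations("ratio"))   # True: you can always go down
```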
Math Anxiety Note: The interval vs. ratio distinction matters most in advanced applications. For most of this course, the critical distinction is categorical vs. numerical. If you remember that split and can also distinguish nominal from ordinal, you're ahead of the game.
Check Your Understanding (try to answer without scrolling up)
- What's the difference between ordinal and interval data?
- Is "year of birth" (e.g., 1998, 2003) interval or ratio? Can you say someone born in 2000 was "born twice as late" as someone born in 1000?
- Sam wants to calculate the average number of rebounds per game for each player. What level of measurement must "rebounds per game" be at minimum for this to make sense?
Verify
- Ordinal data has a meaningful order but unequal (or unknown) spacing between categories. Interval data has both a meaningful order AND equal spacing between values — but no true zero.
- Year of birth is interval, not ratio. There's no "year zero" in many calendar systems, and even where there is, it's arbitrary. Saying "2000 is twice as late as 1000" doesn't make meaningful sense. But you CAN say "the difference between 1998 and 2003 is 5 years."
- Interval level at minimum (for the average to be meaningful, differences must be equal). In practice, rebounds per game is ratio level (0 rebounds = truly none), so the average is perfectly valid.
2.7 Cross-Sectional vs. Longitudinal Data
There's one more distinction we need before you're fluent in the language of data. It has to do with when the data was collected.
Cross-sectional data is collected at one point in time (or during one short period). It's a snapshot. Think of it like a photograph — it captures everyone at the same moment.
Examples:
- A survey of 1,000 adults conducted in March 2026 about their exercise habits
- Dr. Chen's flu surveillance data from one flu season
- A Census conducted in a particular year
Longitudinal data is collected from the same observational units at multiple points in time. It's a movie, not a photograph — you see how things change.
Examples:
- A study that measures the same patients' blood pressure every 6 months for 10 years
- Alex Rivera tracking the same users' watch time every week for a year before and after the algorithm change
- The Framingham Heart Study, which has followed residents of Framingham, Massachusetts since 1948
Why This Distinction Matters
Cross-sectional and longitudinal data answer different questions:
- Cross-sectional: "How are things right now?" or "How do different groups compare at this moment?"
- Longitudinal: "How do things change over time?" or "What happens to the same individuals as time passes?"
This connects directly to the correlation vs. causation theme from Chapter 1. Suppose Dr. Chen finds, in cross-sectional data, that people who exercise regularly have lower rates of heart disease. Does exercise cause lower heart disease? Not necessarily — maybe healthier people are more able to exercise in the first place. Cross-sectional data captures a snapshot, not a story of change.
But if she follows the same people for 20 years (longitudinal data) and finds that those who started exercising developed less heart disease than those who didn't, the causal argument gets stronger (though still not airtight — we'll formalize this in Chapter 4).
Intuition: Cross-sectional data is like looking at a photo album page — everyone posed together at one moment. Longitudinal data is like a time-lapse video — you see the same people changing over months or years.
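One practical way to tell the two structures apart is to check whether the same observational unit shows up at more than one time point. The sketch below does this in plain Python. The function name, field names, and records are all hypothetical, invented for illustration.

```python
from collections import defaultdict

def data_structure(records, unit_key, time_key):
    """Return 'longitudinal' if any unit is observed at more than one
    time point, otherwise 'cross-sectional'."""
    times_seen = defaultdict(set)
    for row in records:
        times_seen[row[unit_key]].add(row[time_key])
    repeated = any(len(times) > 1 for times in times_seen.values())
    return "longitudinal" if repeated else "cross-sectional"

# Hypothetical survey: every person appears once, all in the same month.
survey = [
    {"person_id": 1, "wave": "2026-03", "exercise_hours": 2.5},
    {"person_id": 2, "wave": "2026-03", "exercise_hours": 0.0},
    {"person_id": 3, "wave": "2026-03", "exercise_hours": 4.0},
]

# Hypothetical study: the same two patients measured at three visits.
blood_pressure = [
    {"patient_id": 1, "visit_month": 0, "systolic": 128},
    {"patient_id": 1, "visit_month": 6, "systolic": 125},
    {"patient_id": 1, "visit_month": 12, "systolic": 121},
    {"patient_id": 2, "visit_month": 0, "systolic": 140},
    {"patient_id": 2, "visit_month": 6, "systolic": 138},
    {"patient_id": 2, "visit_month": 12, "systolic": 135},
]

print(data_structure(survey, "person_id", "wave"))
print(data_structure(blood_pressure, "patient_id", "visit_month"))
```

Notice that the check looks at units, not just dates. A survey repeated every year with different respondents each time has many time points but no repeated units, so this sketch would still call it cross-sectional.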
2.8 Putting It All Together: A Real Dataset Walkthrough
Let's apply everything we've learned to a dataset that Alex Rivera might work with at StreamVibe. Here's a sample of user viewing data:
| User ID | Age | Plan | Genre Preference | Episodes This Week | Avg Session (min) | Joined Date | Satisfaction (1-5) |
|---|---|---|---|---|---|---|---|
| U-4401 | 23 | Free | Comedy | 12 | 34.7 | 2024-08-15 | 4 |
| U-4402 | 45 | Premium | Drama | 5 | 68.2 | 2022-01-03 | 5 |
| U-4403 | 31 | Basic | Sci-Fi | 8 | 42.1 | 2023-06-20 | 3 |
| U-4404 | 19 | Free | Comedy | 22 | 25.6 | 2025-01-10 | 2 |
| U-4405 | 56 | Premium | Documentary | 3 | 91.3 | 2021-03-18 | 5 |
| U-4406 | 28 | Basic | Drama | 9 | 38.4 | 2024-11-02 | 4 |
Step 1: Identify the observational unit.
Each row represents one user. The observational unit is an individual StreamVibe subscriber.
Step 2: Classify every variable.
| Variable | Type | Subtype | Level of Measurement | Reasoning |
|---|---|---|---|---|
| User ID | Categorical | Nominal | Nominal | Labels for identification; no meaningful order or arithmetic |
| Age | Numerical | Continuous* | Ratio | Measured quantity with a true zero; reported in whole years |
| Plan | Categorical | Ordinal | Ordinal | Free < Basic < Premium has a meaningful order (by price and features) |
| Genre Preference | Categorical | Nominal | Nominal | Labels for categories; no inherent ranking |
| Episodes This Week | Numerical | Discrete | Ratio | Counted (whole numbers); 0 episodes = none |
| Avg Session (min) | Numerical | Continuous | Ratio | Measured duration; 0 minutes = no watching |
| Joined Date | Numerical | Continuous | Interval | Points on a time scale; "twice as late" doesn't make sense |
| Satisfaction (1-5) | Categorical | Ordinal | Ordinal | Ordered categories; distances between 1-2, 2-3, etc. may not be equal |
*Age is technically continuous but often recorded as discrete whole numbers. We treat it as continuous in most analyses.
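If you'd like a preview of how these classifications look in code, here is a minimal sketch using pandas, the Python library introduced in Chapter 3. The column names are our own shorthand (an assumption, not StreamVibe's actual schema). The key idea is that you can tell the software which categories are ordered and which are not.

```python
import pandas as pd

# The six sample rows from the table above, with shortened column names.
users = pd.DataFrame({
    "user_id": ["U-4401", "U-4402", "U-4403", "U-4404", "U-4405", "U-4406"],
    "age": [23, 45, 31, 19, 56, 28],
    "plan": ["Free", "Premium", "Basic", "Free", "Premium", "Basic"],
    "genre": ["Comedy", "Drama", "Sci-Fi", "Comedy", "Documentary", "Drama"],
    "episodes_this_week": [12, 5, 8, 22, 3, 9],
    "avg_session_min": [34.7, 68.2, 42.1, 25.6, 91.3, 38.4],
    "joined_date": ["2024-08-15", "2022-01-03", "2023-06-20",
                    "2025-01-10", "2021-03-18", "2024-11-02"],
    "satisfaction": [4, 5, 3, 2, 5, 4],
})

# Nominal: categories with no order.
users["genre"] = users["genre"].astype("category")

# Ordinal: categories with an explicit order, so comparisons are allowed.
users["plan"] = pd.Categorical(
    users["plan"], categories=["Free", "Basic", "Premium"], ordered=True)
users["satisfaction"] = pd.Categorical(
    users["satisfaction"], categories=[1, 2, 3, 4, 5], ordered=True)

# Interval: dates support differences (time elapsed), not ratios.
users["joined_date"] = pd.to_datetime(users["joined_date"])

print(users.dtypes)
print(users["plan"].max())                # highest plan tier in the sample
print((users["plan"] >= "Basic").sum())   # number of users on Basic or above
print(users["joined_date"].max() - users["joined_date"].min())
```

User ID is left as plain text here because it is only a label. Storing satisfaction as an ordered category rather than a number is a deliberate choice in this sketch: it keeps the software from silently averaging ratings whose spacing may not be equal.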
Step 3: Build the data dictionary.
A professional data dictionary for this dataset would document each variable's valid range, how it was collected (self-reported? system-recorded?), what missing values look like, and any caveats. For example, "Avg Session (min)" might have a note: "Calculated by StreamVibe's system as total minutes watched divided by number of sessions; sessions shorter than 30 seconds are excluded."
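A data dictionary doesn't have to live in a document. It can also be stored in a form that a program can check values against. The sketch below shows one possible layout as a Python dictionary. The field names, valid ranges, and collection methods are assumptions made for illustration; only the "Avg Session (min)" note comes from the example above.

```python
# A hypothetical, machine-readable data dictionary for three of the variables.
data_dictionary = {
    "satisfaction": {
        "description": "Satisfaction rating on a 1-5 scale",
        "type": "categorical (ordinal)",
        "valid_values": {1, 2, 3, 4, 5},
        "collected": "self-reported",        # assumed for illustration
    },
    "episodes_this_week": {
        "description": "Episodes watched this week",
        "type": "numerical (discrete)",
        "valid_range": (0, None),            # no upper limit assumed
        "collected": "system-recorded",      # assumed for illustration
    },
    "avg_session_min": {
        "description": "Average session length in minutes",
        "type": "numerical (continuous)",
        "valid_range": (0, None),
        "collected": "system-recorded",      # assumed for illustration
        "notes": "Total minutes watched divided by number of sessions; "
                 "sessions shorter than 30 seconds are excluded.",
    },
}

def check_value(variable, value, dictionary=data_dictionary):
    """Return 'missing', 'ok', or 'invalid' for one value of one variable."""
    entry = dictionary[variable]
    if value is None:                        # missing values are None in this sketch
        return "missing"
    if "valid_values" in entry:
        return "ok" if value in entry["valid_values"] else "invalid"
    low, high = entry["valid_range"]
    if low is not None and value < low:
        return "invalid"
    if high is not None and value > high:
        return "invalid"
    return "ok"

print(check_value("satisfaction", 4))
print(check_value("satisfaction", 7))
print(check_value("episodes_this_week", -1))
print(check_value("avg_session_min", None))
```

Real datasets often mark missing values with codes such as -99 or blank strings instead, which is exactly the kind of detail a data dictionary should record.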
Step 4: Ask what analyses make sense.
Now that you know the variable types, you can start thinking about appropriate analyses:
- Categorical variables: Bar charts, frequency tables, percentages (Chapter 5)
- Numerical variables: Histograms, averages, standard deviations (Chapters 5-6)
- Relationships: Is satisfaction related to plan type? (Both categorical → chi-square test, Chapter 19.) Is age related to watch time? (Both numerical → correlation, Chapter 22.)
- Comparisons: Do Premium users watch more than Free users? (Categorical grouping + numerical outcome → t-test, Chapter 16.)
We're getting ahead of ourselves — you'll learn all these techniques in later chapters. The point for now is: knowing your variable types tells you which tools to reach for. Every statistical technique has requirements about what kinds of variables it works with. Get the classification right, and the rest follows naturally.
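The pairing of variable types with techniques can be written down as a simple lookup table. The sketch below encodes only the suggestions listed in Step 4. The function name and structure are our own invention, and a real analysis would weigh more than variable type alone.

```python
# Lookup from variable type(s) to the technique suggested in Step 4.
# Keys are stored in alphabetical order so the argument order doesn't matter.
SUGGESTED_ANALYSES = {
    ("categorical",): "bar chart, frequency table, percentages (Chapter 5)",
    ("numerical",): "histogram, average, standard deviation (Chapters 5-6)",
    ("categorical", "categorical"): "chi-square test (Chapter 19)",
    ("numerical", "numerical"): "correlation (Chapter 22)",
    ("categorical", "numerical"): "t-test comparing groups (Chapter 16)",
}

def suggest_analysis(*variable_types):
    """Suggest a technique for one variable or for a pair of variables."""
    key = tuple(sorted(variable_types))
    return SUGGESTED_ANALYSES.get(key, "no suggestion in this sketch")

# Is satisfaction related to plan type? Both are categorical.
print(suggest_analysis("categorical", "categorical"))

# Do Premium users watch more than Free users? Plan is categorical,
# episodes watched is numerical.
print(suggest_analysis("numerical", "categorical"))
```

The lookup only works if the classification going in is right, which is why Step 2 comes before Step 4.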
2.9 The Human Stories Behind the Categories
Before we wrap up, I want to surface a theme we introduced in Chapter 1: the human stories behind the data. This matters more than you might think when it comes to data types and classification.
Every time we assign someone to a category — a diagnosis code, a risk score, a racial classification — we're making a decision about how to represent a complex human being with a simple label. Those decisions have consequences.
Professor James Washington sees this every day in his research on predictive policing algorithms. When a risk assessment tool classifies a defendant as "high risk" (ordinal category) based on variables like prior convictions (discrete numerical), neighborhood (nominal categorical), and age at first arrest (continuous numerical), those data types aren't just abstract labels. They represent real choices about what gets measured, how it gets measured, and what gets left out.
Consider: "neighborhood" is a nominal categorical variable. But which neighborhoods get labeled "high crime"? Often, they're neighborhoods that have been heavily policed — which means more arrests, which means more data points, which means the algorithm thinks they're more dangerous. The data type (categorical: high crime / low crime) looks objective. But the values that end up in that category are shaped by decades of policing decisions.
Or consider Dr. Maya Chen's flu data. She records "race/ethnicity" as a categorical variable. But racial categories are socially constructed, vary across cultures and time periods, and may not capture the lived experiences of people who identify with multiple groups. The categories we choose shape the stories the data can tell — and the stories it can't.
This doesn't mean we should avoid categorizing data. We need categories to do analysis. But we should always remember: behind every data point is a person, and behind every category is a choice.
2.10 Project Checkpoint: Building Your Data Dictionary
Project Checkpoint
Your task for Chapter 2:
Open the dataset you chose in Chapter 1 for your Data Detective Portfolio. Complete the following:
- Identify the observational unit. What does each row represent?
- List all variables (columns) in your dataset.
- Classify each variable:
  - By type: categorical (nominal or ordinal) or numerical (discrete or continuous)
  - By level of measurement: nominal, ordinal, interval, or ratio
- Build a data dictionary in a table format, like the one in Section 2.5. Include: variable name, description, type, valid values, and any notes about how the variable was measured.
- Flag any ambiguous variables — ones where the classification isn't clear-cut. Write a sentence explaining why you chose the classification you did.
- Identify the data structure: Is your dataset cross-sectional or longitudinal? How do you know?
Example: If you chose the World Happiness Report:
- Observational unit: One country in one year
- Variables: Country name (nominal), Year (interval), Happiness score (continuous/ratio), GDP per capita (continuous/ratio), Social support (continuous/ratio), Healthy life expectancy (continuous/ratio), Freedom to make life choices (continuous/ratio), Generosity (continuous/ratio), Perceptions of corruption (continuous/ratio)
- Data structure: Panel data (multiple countries measured across multiple years, a form of longitudinal data)
What this connects to: In Chapter 3, you'll use Python to programmatically inspect your data types and verify your manual classification. In Chapter 5, you'll use your variable classifications to choose the right graph for each variable.
2.11 Chapter Summary
Let's recap what you've learned — the vocabulary you'll use every day for the rest of this course.
The Classification System
| Category | Subcategory | Description | Example |
|---|---|---|---|
| Categorical | Nominal | Categories without order | Blood type, eye color, diagnosis |
| Categorical | Ordinal | Categories with meaningful order | Education level, pain scale, Likert ratings |
| Numerical | Discrete | Counted quantities (whole numbers) | Number of siblings, goals scored |
| Numerical | Continuous | Measured quantities (any value in a range) | Height, temperature, time |
Key Vocabulary
| Term | Definition |
|---|---|
| Observational unit | The individual entity each row of data describes |
| Variable | A characteristic that varies across observational units |
| Categorical variable | A variable whose values are categories or labels |
| Numerical variable | A variable whose values are meaningful quantities |
| Nominal | Categorical without order |
| Ordinal | Categorical with order |
| Discrete | Numerical, countable values |
| Continuous | Numerical, measurable on a continuous scale |
| Level of measurement | The hierarchy (nominal → ordinal → interval → ratio) that determines valid operations |
| Data dictionary | A document describing every variable in a dataset |
| Parameter | A number describing a population (usually unknown) |
| Statistic | A number describing a sample (calculated from data) |
| Cross-sectional | Data collected at one point in time |
| Longitudinal | Data collected from the same units over multiple time points |
Decision Flowchart: What Type of Variable Is This?
Does the variable record a category/label, or a quantity?
│
├── Category/label → CATEGORICAL
│   │
│   └── Do the categories have a natural order?
│       ├── No → NOMINAL (blood type, diagnosis)
│       └── Yes → ORDINAL (education level, pain scale)
│
└── Quantity → NUMERICAL
    │
    └── Is the variable counted (whole numbers only)?
        ├── Yes → DISCRETE (number of children, episodes watched)
        └── No → CONTINUOUS (height, temperature, time)
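The flowchart translates almost line for line into a short Python function. This is a sketch, not a substitute for judgment: you still have to answer the three questions yourself, and the function and parameter names are our own.

```python
def classify_variable(records_quantity, has_natural_order=False, is_counted=False):
    """Follow the flowchart and return (type, subtype).

    records_quantity:   True if the values are quantities, False if they are labels.
    has_natural_order:  only used for labels; True if the categories can be ranked.
    is_counted:         only used for quantities; True if values are whole-number counts.
    """
    if not records_quantity:
        subtype = "ordinal" if has_natural_order else "nominal"
        return ("categorical", subtype)
    subtype = "discrete" if is_counted else "continuous"
    return ("numerical", subtype)

# Blood type: a label with no order.
print(classify_variable(records_quantity=False))

# Education level: a label with a natural order.
print(classify_variable(records_quantity=False, has_natural_order=True))

# Episodes watched: a counted quantity.
print(classify_variable(records_quantity=True, is_counted=True))

# Height: a measured quantity.
print(classify_variable(records_quantity=True))
```

The hard part is the first question. A zip code is made of digits, but it is a label, so you would answer `records_quantity=False`.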
Key Takeaways
- Every variable is either categorical or numerical — and getting this right determines which tools and analyses are appropriate.
- Numbers aren't always numerical variables. Zip codes, ID numbers, and phone numbers are categorical despite being made of digits.
- Parameters describe populations; statistics describe samples. Most real-world analysis uses sample statistics to estimate population parameters.
- Data dictionaries are essential. They prevent misclassification, ensure reproducibility, and document assumptions.
- Classification decisions have real consequences. How we categorize variables — especially when they describe people — shapes what stories the data can and cannot tell.
Spaced Review
These questions revisit concepts from Chapter 1 to strengthen your long-term retention.
SR.1. Without looking back at Chapter 1, explain the difference between descriptive and inferential statistics. Give a new example of each (one you haven't used before).
Verify
**Descriptive statistics** summarizes and presents data you already have, without generalizing beyond the data. **Inferential statistics** uses sample data to draw conclusions about a larger population. Example answers will vary. A good descriptive example: "The average temperature in my city last July was 87°F." A good inferential example: "Based on a survey of 500 customers, we estimate that 72% of all customers prefer the new design."
SR.2. What are the four pillars of a statistical investigation? (Try to recall them before checking.)
Verify
1. Ask a good question
2. Collect (or find) the data
3. Analyze the data
4. Interpret and communicate results

Reference: Chapter 1, Section 1.3
SR.3. In Chapter 1, you learned that statistical thinking is a "threshold concept." Explain how the variable classification system you learned in this chapter (Chapter 2) is an example of statistical thinking in action.
Verify
Statistical thinking involves seeing data through a lens of variation and uncertainty. The variable classification system applies this by forcing you to think carefully about *what kind* of variation a variable captures before jumping to analysis. A statistically thoughtful person doesn't just see numbers; they ask "what kind of numbers?" and "what operations make sense?" This is the habit of mind that Chapter 1 introduced as the foundation of statistical thinking.
What's Next
In Chapter 3: Your Data Toolkit: Python, Excel, and Jupyter Notebooks, you'll set up the tools you'll use throughout this course. You'll load your dataset into Python, use pandas to inspect data types programmatically, and start exploring your data. The variable classification skills you just learned will immediately come into play — you'll see how Python represents categorical and numerical data, and you'll learn to correct it when Python guesses wrong.
Before moving on, complete the exercises and quiz to solidify your understanding. Pay special attention to the exercises about classifying real-world variables — this is a skill you'll use in every remaining chapter.