Key Takeaways: Types of Data and the Language of Statistics

One-Sentence Summary

Every variable is either categorical (labels and groups) or numerical (measurable quantities), and correctly classifying your variables before analysis is the single most important step you can take to avoid meaningless results.

Core Concepts at a Glance

| Concept | Definition | Why It Matters |
| --- | --- | --- |
| Observational unit | The individual entity each row of data describes | Defines what your dataset is "about"; must be identified first |
| Variable | A characteristic that varies across observational units | Each column captures one measurable or classifiable property |
| Categorical variable | Values are labels or group names | Use frequency counts, bar charts, mode; do NOT use averages |
| Numerical variable | Values are measurable quantities | Use averages, histograms, standard deviation; full arithmetic applies |
| Data dictionary | Documentation describing every variable in a dataset | Prevents misclassification, ensures reproducibility |
| Parameter | A number describing a population (usually unknown) | The "truth" you're trying to estimate |
| Statistic | A number describing a sample (calculated from data) | Your best estimate of the unknown parameter |

The Classification System

                        Variable
                       /         \
              Categorical       Numerical
              /        \        /        \
         Nominal    Ordinal  Discrete  Continuous
         (labels)  (ranked)  (counted)  (measured)

| Type | Order? | Equal spacing? | Arithmetic? | Quick examples |
| --- | --- | --- | --- | --- |
| Nominal | No | No | No | Blood type, zip code, diagnosis |
| Ordinal | Yes | No | Limited | Pain scale, letter grades, Likert ratings |
| Discrete | Yes | Yes | Yes | Number of siblings, goals scored |
| Continuous | Yes | Yes | Yes | Height, weight, temperature, time |

Quick Decision Flowchart

Step 1: Does the variable record a category/label, or a quantity?
- Category → Categorical
- Quantity → Numerical

Step 2 (Categorical): Do the categories have a natural order?
- No → Nominal
- Yes → Ordinal

Step 2 (Numerical): Is the variable counted (whole numbers only) or measured (any value)?
- Counted → Discrete
- Measured → Continuous
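The flowchart above can be sketched as a small Python function. The name classify_variable and its three yes/no parameters are hypothetical, introduced only for illustration. The function does not inspect data; the answers still come from your own reading of what the variable means.

```python
def classify_variable(is_quantity, has_natural_order=False, is_counted=False):
    """Walk the two-step flowchart and return the variable type.

    All three arguments are answers you supply yourself (hypothetical
    parameters for this sketch); no data is examined.
    """
    # Step 1: category/label, or quantity?
    if not is_quantity:
        # Step 2 (categorical): do the categories have a natural order?
        return "ordinal" if has_natural_order else "nominal"
    # Step 2 (numerical): counted in whole numbers, or measured?
    return "discrete" if is_counted else "continuous"


# Examples taken from the classification table
print(classify_variable(is_quantity=False))                          # blood type
print(classify_variable(is_quantity=False, has_natural_order=True))  # pain scale
print(classify_variable(is_quantity=True, is_counted=True))          # number of siblings
print(classify_variable(is_quantity=True))                           # height
```

The design point is that the inputs are judgments, not values: a column of digits can still lead to the "nominal" branch.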

The Numbers Trap

Not all numbers are numerical variables. If arithmetic (averaging, subtracting) produces a meaningless result, the variable is categorical — regardless of whether the values are digits.

| Looks numerical, but ISN'T | Why |
| --- | --- |
| Zip codes (90210) | "Average zip code" is meaningless |
| Phone numbers | Can't add two phone numbers |
| Social Security numbers | Labels, not quantities |
| Jersey numbers | Player #24 isn't "twice" player #12 |
| Coded responses (1 = Yes, 2 = No) | Codes are labels in disguise |
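A minimal sketch of the trap, using only Python's standard library. The five zip codes are made-up illustrative values. The point is that the average computes without any error, so nothing warns you that the result is meaningless.

```python
from collections import Counter
from statistics import mean

# Hypothetical zip codes for five survey respondents
zip_codes = [90210, 10001, 60614, 90210, 73301]

# The arithmetic runs without complaint, but the result describes nothing real
average_zip = mean(zip_codes)
print(average_zip)

# Treat the values as labels instead: count them and report the mode
zip_labels = [str(z) for z in zip_codes]
counts = Counter(zip_labels)
most_common_zip, frequency = counts.most_common(1)[0]
print(most_common_zip, frequency)
```

Storing the codes as text also protects zip codes that begin with zero, which lose the leading digit when stored as integers.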

Levels of Measurement Hierarchy

Ratio     → ratios meaningful (height, income)
Interval  → differences meaningful (temperature °F, calendar year)
Ordinal   → order meaningful (pain scale, rankings)
Nominal   → equality only (blood type, zip code)

Each level up unlocks more valid operations. You can always treat higher-level data as lower-level (ratio as ordinal), but never lower as higher (nominal as ratio).
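The gap between interval and ratio can be checked with a little arithmetic. The sketch below uses made-up temperatures and heights. It shows that a ratio of Fahrenheit temperatures changes when the same temperatures are expressed in Celsius, while a ratio of heights survives a change of units.

```python
def fahrenheit_to_celsius(deg_f):
    """Standard conversion: subtract 32, then scale by 5/9."""
    return (deg_f - 32) * 5 / 9


# Interval level: temperature in degrees F (illustrative values)
warm_f, cool_f = 80.0, 40.0
warm_c = fahrenheit_to_celsius(warm_f)
cool_c = fahrenheit_to_celsius(cool_f)

# The ratio depends on the scale, so "twice as hot" is not a valid claim
ratio_f = warm_f / cool_f
ratio_c = warm_c / cool_c
print(ratio_f, ratio_c)

# Differences stay consistent: the Celsius gap is the Fahrenheit gap times 5/9
diff_f = warm_f - cool_f
diff_c = warm_c - cool_c
print(diff_f, diff_c)

# Ratio level: height has a true zero, so the ratio survives a unit change
tall_in, short_in = 72.0, 36.0
ratio_in = tall_in / short_in
ratio_cm = (tall_in * 2.54) / (short_in * 2.54)
print(ratio_in, ratio_cm)
```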

Population vs. Sample, Parameter vs. Statistic

| | Population | Sample |
| --- | --- | --- |
| What is it? | Everyone (or everything) you want to study | The subset you actually observe |
| Descriptive number | Parameter (unknown) | Statistic (known) |
| Analogy | The bullseye | Where your dart lands |
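The distinction can be simulated. In the sketch below the population is invented (10,000 heights drawn from a normal distribution), so its parameter can be computed directly. In real studies the population is rarely available and only the statistic is known.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights in cm of 10,000 people
population = [random.gauss(170, 10) for _ in range(10_000)]
parameter = mean(population)  # usually unknowable in practice

# The sample is the subset we actually observe
sample = random.sample(population, 100)
statistic = mean(sample)  # computed from data; estimates the parameter

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

Changing the seed changes the statistic but leaves the parameter nearly unchanged: each new sample is another dart thrown at the same bullseye.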

Cross-Sectional vs. Longitudinal

| | Cross-Sectional | Longitudinal |
| --- | --- | --- |
| Analogy | Photograph | Time-lapse video |
| Time points | One | Multiple |
| Best for | Comparing groups at one moment | Tracking change over time |
| Causal claims | Weak | Stronger (but still not guaranteed) |
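As a rough illustration of how the two designs differ in the data itself, the sketch below uses made-up records of the form (person_id, year, weight_kg). The helper names time_points and change_per_person are hypothetical.

```python
# Hypothetical records: (person_id, year, weight_kg)
cross_sectional = [
    ("A", 2023, 70.0),
    ("B", 2023, 82.5),
    ("C", 2023, 64.0),
]

longitudinal = [
    ("A", 2021, 72.0), ("A", 2022, 71.0), ("A", 2023, 70.0),
    ("B", 2021, 80.0), ("B", 2022, 81.5), ("B", 2023, 82.5),
]


def time_points(records):
    """Return the sorted distinct time points present in the records."""
    return sorted({year for _, year, _ in records})


def change_per_person(records):
    """Last measurement minus first measurement for each person."""
    by_person = {}
    for person, year, value in sorted(records, key=lambda r: (r[0], r[1])):
        by_person.setdefault(person, []).append(value)
    return {person: values[-1] - values[0] for person, values in by_person.items()}


print(time_points(cross_sectional))     # one time point: a photograph
print(time_points(longitudinal))        # several time points: a time-lapse
print(change_per_person(longitudinal))  # change is only visible with repeated measures
```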

Key Connections

| Forward Connection | Why It Matters |
| --- | --- |
| Chapter 3 (Data Toolkit) | The dtypes attribute (pandas) tells you how Python stores each column, not its statistical type; you need to verify |
| Chapter 5 (Graphs) | Variable type determines graph choice: bar chart (categorical) vs. histogram (numerical) |
| Chapter 6 (Summaries) | Mean/SD for numerical; mode/frequency for categorical |
| Chapters 14-16 (Inference) | Test choice depends on variable type: z-test (proportions) vs. t-test (means) |
| Chapter 19 (Chi-Square) | Designed specifically for categorical data analysis |
| Chapter 22 (Regression) | Requires numerical outcome variable; categorical predictors need special handling |
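The Chapter 3 connection can be previewed with a short sketch. It assumes the pandas library is what reports dtypes, and the column names and values are made up. All three columns load as integers, yet only one of them is a numerical variable in the statistical sense.

```python
import pandas as pd

# Hypothetical data; column names and values are invented for illustration
df = pd.DataFrame({
    "zip_code": [90210, 10001, 60614],
    "severity": [1, 3, 2],   # coded 1 = mild, 2 = moderate, 3 = severe
    "age": [34, 61, 27],
})

# pandas reports how each column is stored: all three are integers
print(df.dtypes)

# Record the statistical type yourself
df["zip_code"] = df["zip_code"].astype(str)  # nominal: a label, not a quantity
df["severity"] = pd.Categorical(
    df["severity"], categories=[1, 2, 3], ordered=True  # ordinal: ranked codes
)
# age stays as it is: a genuinely numerical variable

print(df.dtypes)
```

After the conversion, asking pandas for the mean of zip_code or severity is no longer a silent operation, which is the safeguard you want.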

Anchor Example Updates

| Person | What You Learned About Their Data |
| --- | --- |
| Dr. Maya Chen | Flu surveillance data: mix of nominal (diagnosis, zip code), ordinal (severity), and numerical (age, days to recovery) variables |
| Alex Rivera | StreamVibe data: "watch time" definition matters; genre classification is complex; engagement tiers are ordinal, not numerical |
| Prof. Washington | Risk scores: ordinal or numerical depending on construction; racial categories are nominal with deep consequences |
| Sam Okafor | Basketball stats: position (nominal), draft round (ordinal), points/game (ratio); shooting percentage is a statistic estimating a parameter |

Common Mistakes to Avoid

  1. Averaging zip codes, ID numbers, or codes: being made of digits doesn't make a variable numerical
  2. Treating ordinal as continuous without acknowledging the simplification: averaging 1-5 ratings is common, but it assumes the gaps between ratings are equal
  3. Ignoring the data dictionary: coded values (77 = "Don't know") can corrupt calculations
  4. Confusing Python's dtype with statistical type: Python sees how the values are stored; you see what they mean
  5. Forgetting that classification decisions shape analysis: who gets counted, who gets categorized, and what gets measured are not neutral choices
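Mistake 3 is easy to reproduce. The sketch below uses a made-up survey column (hours_sleep) and a made-up data dictionary entry. Only the code 77 = "Don't know" comes from the list above, and everything else is an assumption for illustration.

```python
from statistics import mean

# Hypothetical data dictionary entry for one survey column
data_dictionary = {
    "hours_sleep": {
        "type": "numerical (continuous)",
        "units": "hours per night",
        "missing_codes": {77: "Don't know"},
    }
}

# Hypothetical responses; one respondent answered "Don't know"
responses = [7, 8, 6, 77, 7, 8]

# Ignoring the dictionary treats the code 77 as 77 hours of sleep
naive_mean = mean(responses)

# Consulting the dictionary removes coded values before any arithmetic
missing = data_dictionary["hours_sleep"]["missing_codes"]
valid = [r for r in responses if r not in missing]
clean_mean = mean(valid)

print(f"naive mean: {naive_mean:.2f}")
print(f"clean mean: {clean_mean:.2f}")
```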