Case Study 2: Patient Records and Data Types — When Numbers Aren't Really Numbers

Contributors to Introduction to Data Science

Tier 3 — Illustrative Example: This case study uses Elena, one of our anchor characters, in a constructed scenario designed to illustrate why data types matter in data science. The clinic, patient records, and specific data values are fictional. The data quality issues described — ZIP codes losing leading zeros, patient IDs being treated as numbers, dates causing calculation errors — are real and extremely common problems encountered by public health analysts, healthcare data workers, and data scientists across every industry.

The Setting

Elena works for a county public health department. It's Tuesday morning, and she's received a new file from a partner clinic — a spreadsheet of vaccination records that she needs to merge into the county's master database. The file has six columns:

Column	Example Value	What It Represents
patient_id	00847	Unique patient identifier
zip_code	02134	Patient's ZIP code
age	34	Patient's age in years
visit_date	01/15/2024	Date of vaccination visit
doses_received	2	Number of doses received
vaccination_rate	0.73	Proportion of population vaccinated in patient's ZIP

Elena opens the file, glances at it, and thinks, "This looks straightforward — mostly numbers." But she's about to discover one of the most fundamental lessons in data science: not everything that looks like a number should be treated as one.

The Problem: Numbers That Aren't Numbers

Elena starts entering the data into her Python notebook, treating each column the way it looks:

# Elena's first attempt — treating everything at face value
patient_id = 847
zip_code = 02134
age = 34
visit_date = "01/15/2024"
doses_received = 2
vaccination_rate = 0.73

She runs the cell and immediately hits her first error:

SyntaxError: leading zeros in decimal integer literals
             are not permitted; use an 0o prefix for
             octal literals

The ZIP code 02134 is being read as an integer literal with a leading zero, which Python 3 does not allow. In some languages (and in Python 2), a leading zero marks octal notation, so Python 3 rejects it outright to avoid confusion.

Elena's instinct is to just remove the leading zero:

zip_code = 2134

This runs without error. But she's just made a data integrity mistake that could have serious consequences. The ZIP code 02134 is a real ZIP code — it's in Boston. The number 2134 is meaningless as a ZIP code. That leading zero is part of the data, not a formatting artifact.

Here's the correct approach:

zip_code = "02134"    # ZIP code is text, not a number

🚪 Why This Is a Threshold Concept

This is one of the most important ideas in all of data science, and it goes beyond Python syntax: the data type you choose encodes your understanding of what the data means. A ZIP code is a label, not a quantity. You'd never add two ZIP codes together (02134 + 90210 = ?), average them, or sort them numerically and expect the result to be meaningful. ZIP codes are identifiers — they happen to be made of digits, but they're fundamentally text.

This distinction is so important that experienced data scientists often say: if you wouldn't do arithmetic with it, it's not a number — it's a string.
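
A minimal sketch of what that rule protects against, using the two ZIP codes mentioned above. The variable names are illustrative and are not part of Elena's notebook.

```python
# Illustrative sketch: ZIP codes behave like labels, not quantities.
zip_a = "02134"
zip_b = "90210"

# Converting to int silently drops the leading zero.
as_number = int(zip_a)
print(as_number)             # 2134

# Converting back to text does not restore it.
round_trip = str(as_number)
print(round_trip == zip_a)   # False

# Arithmetic runs without complaint but produces a value that means nothing.
meaningless_average = (int(zip_a) + int(zip_b)) / 2
print(meaningless_average)   # 46172.0
```

Nothing here raises an error, which is exactly the problem: Python will happily do the math even when the math has no meaning.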

Patient IDs: The Same Problem, Higher Stakes

Elena fixes the ZIP code and moves on to patient ID:

patient_id = 847   # Removed the leading zeros...

But the original value was 00847. If another record has patient_id = 847, are they the same patient? In the clinic's system, 00847 and 00848 are adjacent IDs, and the leading zeros might indicate a fixed-width format used by their database. If Elena stores these as integers, she loses that structure.

Even worse: what if patient IDs can contain letters? Some systems use IDs like MN-00847 or A00847. If Elena has built her code assuming patient IDs are integers, she'll get a crash the moment a non-numeric ID shows up.

The fix is the same:

patient_id = "00847"    # Patient ID is text, not a number
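
The two risks described above can be sketched in a few lines. The second ID and the variable names are hypothetical, chosen only to illustrate the point.

```python
# Illustrative sketch: integer IDs lose structure and fail on non-numeric formats.
id_from_clinic = "00847"
id_from_other_file = "847"   # hypothetical record from another source

# As strings, the two values are visibly different and can be reviewed.
print(id_from_clinic == id_from_other_file)            # False

# As integers, the difference disappears.
print(int(id_from_clinic) == int(id_from_other_file))  # True

# An ID containing letters cannot be converted at all.
try:
    int("MN-00847")
    conversion_failed = False
except ValueError:
    conversion_failed = True
print(conversion_failed)                               # True
```

Whether 00847 and 847 really are the same patient is a question for the clinic. Keeping the IDs as strings leaves that question visible instead of answering it by accident.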

Dates: The Trickiest Type of All

Elena looks at the date column: 01/15/2024. She correctly stores it as a string:

visit_date = "01/15/2024"

But now she wants to calculate how many days ago the visit was. She tries:

today = "03/10/2024"
days_since = today - visit_date

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Of course: you can't subtract strings. But you also can't just strip out the slashes and treat the dates as numbers. 3102024 - 1152024 is 1,950,000, which says nothing about how many days separate the two dates. Dates have structure (months, days, years) that simple numbers don't capture.

For now, Elena can extract useful information using string slicing:

visit_date = "01/15/2024"

month = visit_date[:2]
day = visit_date[3:5]
year = visit_date[6:]

print(f"Month: {month}")
print(f"Day: {day}")
print(f"Year: {year}")

Month: 01
Day: 15
Year: 2024

She can convert the year to an integer for arithmetic:

visit_year = int(year)
current_year = 2024
years_in_system = current_year - visit_year
print(f"Years in system: {years_in_system}")

Years in system: 0

This is basic and limited. Calculating the exact number of days between two dates with only string slicing and integer arithmetic would mean handling month lengths and leap years by hand. In Chapter 11, she'll learn about Python's datetime module and eventually pandas date handling, which make date arithmetic natural. But the key lesson is clear: dates are neither numbers nor plain strings. They're their own kind of thing, with their own rules.
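
As a brief preview of the datetime module, here is a minimal sketch of the calculation Elena wanted. It assumes every date in the file follows the MM/DD/YYYY layout shown in the example row.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # assumed layout: MM/DD/YYYY

visit = datetime.strptime("01/15/2024", DATE_FORMAT).date()
today = datetime.strptime("03/10/2024", DATE_FORMAT).date()

# Subtracting two date objects gives a timedelta, which accounts for
# month lengths and leap years.
days_since = (today - visit).days
print(days_since)  # 55
```

The date still arrives as a string. The difference is that it gets converted into a dedicated date type before any arithmetic happens.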

Age and Doses: When Numbers Are Numbers

Now Elena considers age and doses_received. These are genuine numbers:

Age: You can average ages, compare them (34 > 28), sort patients by age, compute age ranges. Age is a quantity — arithmetic makes sense.
Doses received: You can sum doses across patients, compute average doses, check if doses_received >= 2. Doses are a count — a true integer.

age = 34
doses_received = 2

# These operations make sense
print(f"Patient is over 65: {age > 65}")
print(f"Fully vaccinated: {doses_received >= 2}")

Patient is over 65: False
Fully vaccinated: True

The vaccination rate is a genuine float too — it represents a measured proportion:

vaccination_rate = 0.73
print(f"ZIP code rate: {vaccination_rate * 100:.1f}%")
print(f"Above target: {vaccination_rate >= 0.85}")

ZIP code rate: 73.0%
Above target: False
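
To see why these columns deserve numeric types, here is a small sketch of the operations listed above. The three patients' values are made up for illustration.

```python
from statistics import mean

# Hypothetical values for three patients, for illustration only.
ages = [34, 28, 71]
doses = [2, 1, 2]

ages_sorted = sorted(ages)
age_range = max(ages) - min(ages)
total_doses = sum(doses)
average_age = round(mean(ages), 1)

print(ages_sorted)   # [28, 34, 71]
print(age_range)     # 43
print(total_doses)   # 5
print(average_age)   # 44.3
```

Every one of these results answers a real question about the patients, which is the test for storing a column as a number.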

The Complete Fix: A Type-Aware Data Entry

Elena rewrites her data entry with correct types and clear documentation:

# Patient Record — with correct data types
# Identifiers (text — no arithmetic meaning)
patient_id = "00847"
zip_code = "02134"

# Quantities (numbers — arithmetic is meaningful)
age = 34
doses_received = 2
vaccination_rate = 0.73

# Dates (text for now — special handling needed)
visit_date = "01/15/2024"

# Type verification
print(f"patient_id:       {patient_id:>10}  type: {type(patient_id).__name__}")
print(f"zip_code:         {zip_code:>10}  type: {type(zip_code).__name__}")
print(f"age:              {age:>10}  type: {type(age).__name__}")
print(f"doses_received:   {doses_received:>10}  type: {type(doses_received).__name__}")
print(f"vaccination_rate: {vaccination_rate:>10}  type: {type(vaccination_rate).__name__}")
print(f"visit_date:       {visit_date:>10}  type: {type(visit_date).__name__}")

patient_id:            00847  type: str
zip_code:              02134  type: str
age:                      34  type: int
doses_received:            2  type: int
vaccination_rate:       0.73  type: float
visit_date:       01/15/2024  type: str

Now Elena has a rule she can apply to every dataset she encounters.

The Decision Framework: Number or String?

Elena writes the following decision framework in a Markdown cell in her notebook — a reference she'll use for every future project:

HOW TO DECIDE: Is this column a number or a string?

Ask yourself: Would I ever do arithmetic with this value?

  YES (add, subtract, multiply, divide, average, compare magnitude)
    → Store as int or float
    → Examples: age, temperature, price, count, rate, score

  NO (it's an identifier, label, code, or category)
    → Store as str
    → Examples: ZIP code, phone number, ID number, SSN,
      product code, country code, date (until you convert it)

WARNING SIGNS that a "number" is really a string:
  - Leading zeros matter (ZIP: 02134, ID: 00847)
  - The value has a fixed width/format (SSN: 123-45-6789)
  - Adding two values together would be meaningless
  - The value is a code in a coding system (ICD-10: E11.9)
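
The warning signs can be turned into a rough automated check. The sketch below is a hypothetical helper, not part of Elena's notebook, and its two checks are heuristics: they will miss some cases and should not replace reading the data dictionary.

```python
def warning_signs(value: str) -> list[str]:
    """Return reasons a digit-like value should probably stay a string.

    Hypothetical helper: the checks mirror two of the warning signs above.
    """
    signs = []
    # Sign 1: an all-digit value that starts with a zero.
    if value.isdigit() and len(value) > 1 and value.startswith("0"):
        signs.append("leading zero")
    # Sign 2: contains digits but does not parse as a plain number,
    # which suggests a formatted code (dashes, letters, and so on).
    try:
        float(value)
    except ValueError:
        if any(ch.isdigit() for ch in value):
            signs.append("digits mixed with letters or separators")
    return signs


for sample in ["02134", "00847", "123-45-6789", "E11.9", "34", "0.73"]:
    print(f"{sample!r}: {warning_signs(sample) or 'no warning signs'}")
```

A check like this cannot tell whether arithmetic on a value is meaningful. That question still needs a person who knows what the column represents.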

The Consequence of Getting It Wrong

To drive the point home, Elena demonstrates what happens when types are wrong. She creates two scenarios with the same data but different type choices:

# WRONG: ZIP code as integer
zip_wrong = 2134
print(f"ZIP (wrong): {zip_wrong}")
# The leading zero is gone forever

# RIGHT: ZIP code as string
zip_right = "02134"
print(f"ZIP (right): {zip_right}")

ZIP (wrong): 2134
ZIP (right): 02134

# WRONG: Age as string
age_wrong = "34"
# age_wrong + 1 would raise TypeError
# age_wrong > 65 would also raise TypeError in Python 3 (str vs. int)
# Comparing two strings runs, but character by character, not numerically!
print(f"'34' > '9': {'34' > '9'}")   # Alphabetic comparison!

# RIGHT: Age as integer
age_right = 34
print(f"34 > 9: {34 > 9}")           # Numeric comparison

'34' > '9': False
34 > 9: True

That last one is especially dangerous. When comparing two strings, Python goes character by character and compares each character's code point (for plain digits and letters, its position in the ASCII table). The character '3' comes before '9', so '34' > '9' is False, even though the number 34 is obviously greater than 9. If Elena were filtering patients by age with both the ages and the cutoff stored as strings, she could silently exclude patients from her analysis without any error message. The code would run, produce results, and the results would be wrong.

This is the kind of bug that doesn't crash your program — it corrupts your conclusions. In data science, silent errors are far more dangerous than loud ones.
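
Here is a small sketch of that silent failure in a filter. The list of ages is hypothetical and stands in for a column that was read as text.

```python
# Hypothetical ages as they might arrive if the column were read as text.
ages_as_text = ["34", "9", "70", "100"]

# String comparison: runs without error, returns the wrong patients.
over_65_wrong = [a for a in ages_as_text if a > "65"]
print(over_65_wrong)   # ['9', '70']

# Convert first, then compare numerically.
ages = [int(a) for a in ages_as_text]
over_65_right = [a for a in ages if a > 65]
print(over_65_right)   # [70, 100]
```

The string version includes a 9-year-old and drops a 100-year-old, and nothing in the output hints that anything went wrong.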

What Elena Learned

Elena adds a summary to her notebook:

Data types are decisions about meaning. Choosing int vs. str for a column isn't just a technical detail — it's a statement about what the data represents and what operations make sense.
If you wouldn't do math with it, don't store it as a number. ZIP codes, phone numbers, patient IDs, Social Security numbers, product codes — these are all strings, regardless of how they look.
Leading zeros are the canary in the coal mine. If a value has meaningful leading zeros (ZIP code 02134), storing it as an integer destroys information. Unless you know the original width of the field, you can't reliably put the zero back.
String comparisons are not numeric comparisons. "34" > "9" is False because Python compares strings character by character, using each character's code point rather than the value of the whole number. This can silently corrupt data analysis without raising any error.
Dates are their own category. They look like numbers, but arithmetic on date strings is meaningless. They need special handling — which she'll learn in Chapter 11.
The type() function is your first line of defense. When something doesn't work the way you expect, check type(). Very often, the issue is a type mismatch.
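
One way to make the type() habit routine is to check a whole record at once. The sketch below is one possible approach, assuming the record is stored as a dictionary. EXPECTED_TYPES and type_mismatches are hypothetical names introduced for this example.

```python
# Expected type for each field, following the case study's final record.
EXPECTED_TYPES = {
    "patient_id": str,
    "zip_code": str,
    "age": int,
    "doses_received": int,
    "vaccination_rate": float,
    "visit_date": str,
}


def type_mismatches(record: dict) -> dict:
    """Return {field: actual_type_name} for fields with an unexpected type."""
    return {
        field: type(record[field]).__name__
        for field, expected in EXPECTED_TYPES.items()
        if not isinstance(record[field], expected)
    }


record = {
    "patient_id": 847,        # wrong: should be "00847"
    "zip_code": "02134",
    "age": "34",              # wrong: should be 34
    "doses_received": 2,
    "vaccination_rate": 0.73,
    "visit_date": "01/15/2024",
}
print(type_mismatches(record))  # {'patient_id': 'int', 'age': 'str'}
```

A check like this catches mismatched types, but it cannot recover a leading zero that was already lost. It works best when run right after the data is first loaded.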

Discussion Questions

Phone numbers look like numbers but should be stored as strings. Can you think of three other real-world values that are made of digits but shouldn't be treated as numbers? For each, explain what goes wrong if you store them as integers.
Elena's decision framework asks "would I ever do arithmetic with this value?" Can you think of a case where this rule is ambiguous — a value where arithmetic sometimes makes sense but not always?
The case study showed that "34" > "9" is False using string comparison. What would "100" > "99" return, and why? What about "apple" > "banana"?
When Elena loads a CSV file in Chapter 6, and later uses pandas in Chapter 7, the column types she ends up with depend on the tool: some readers keep every field as text, while others, such as pandas, guess a type for each column automatically. Based on what you've learned in this case study, what could go wrong with automatic type detection? What should Elena check first when she loads a new dataset?
A colleague suggests: "Just store everything as strings to be safe." What are the disadvantages of this approach?