
Learning Objectives

  • Load CSV files with encoding, delimiter, and header options using pd.read_csv
  • Read Excel workbooks with sheet selection, header rows, and skiprows using pd.read_excel
  • Parse JSON data including nested structures using pd.read_json and pd.json_normalize
  • Query a SQLite database using basic SQL (SELECT, WHERE, JOIN, GROUP BY) via pd.read_sql
  • Compare file formats and recommend the appropriate one for a given scenario

Chapter 12: Getting Data from Files — CSVs, Excel, JSON, and Databases

"Data is a precious thing and will last longer than the systems themselves." — Tim Berners-Lee, inventor of the World Wide Web


Chapter Overview

Here is a truth we glossed over back in Chapter 7: in the real world, data almost never shows up as a single, clean CSV file sitting politely in your project folder.

It arrives as an Excel workbook with seven sheets, three of which are blank, and one of which has the column headers on row 4 because someone put a logo on top. It arrives as a JSON file from a web API where every record contains another record nested three levels deep. It arrives as a CSV exported from a European system where the delimiter is a semicolon and decimal numbers use commas instead of periods. And sometimes it lives in a database, and you need to write SQL to extract exactly the rows and columns you need.

If data science were a cooking show, the previous chapters taught you recipes. This chapter teaches you how to open the packaging. It's less glamorous, but here's the thing: you can't cook with ingredients you can't unwrap.

In Chapters 7 through 11, you loaded data with a simple pd.read_csv('file.csv') and moved on. That was intentional — we didn't want file format headaches distracting you from learning pandas. But now that you're comfortable with DataFrames, it's time to confront the messy reality of how data actually arrives. And the good news is that pandas has a function for almost everything.

In this chapter, you will learn to:

  1. Load CSV files with encoding, delimiter, and header options using pd.read_csv (all paths)
  2. Read Excel workbooks with sheet selection, header rows, and skiprows using pd.read_excel (all paths)
  3. Parse JSON data including nested structures using pd.read_json and pd.json_normalize (all paths)
  4. Query a SQLite database using basic SQL (SELECT, WHERE, JOIN, GROUP BY) via pd.read_sql (standard + deep dive paths)
  5. Compare file formats and recommend the appropriate one for a given scenario (all paths)

12.1 CSV: The Format That Looks Simple (Until It Isn't)

You've been using CSVs since Chapter 6. You might think there's nothing left to learn. Let's fix that.

What a CSV Actually Is

CSV stands for Comma-Separated Values. At its core, a CSV file is just a plain text file where each line is a row and commas separate the values within each row:

name,age,city
Alice,30,Portland
Bob,25,Denver

That's it. No formatting, no colors, no formulas, no data types. Just text with commas. This simplicity is CSV's greatest strength — virtually every program in existence can read and write CSV files. It's also its greatest weakness — because everything is text, there's no built-in way to specify that 30 is an integer, that Portland is a string, or that a date column should be parsed as a date.

In Chapter 6, you loaded CSVs with Python's built-in csv module, and everything was a string. In Chapter 7, you upgraded to pd.read_csv(), which tries to infer data types automatically. But read_csv is much more powerful than the simple calls you've been making. It has over 50 parameters. We'll cover the ones that matter most.
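You can see the type-inference difference in a few lines. This sketch reads the same three-line CSV from the example above with both tools, using io.StringIO so there's no file to manage:

```python
import csv
import io

import pandas as pd

csv_text = "name,age,city\nAlice,30,Portland\nBob,25,Denver\n"

# The csv module gives you strings, and nothing but strings
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[1])  # ['Alice', '30', 'Portland'] (age is the string '30')

# read_csv inspects the values and infers types
df = pd.read_csv(io.StringIO(csv_text))
print(df['age'].dtype)  # int64 (inferred as an integer column)
```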

Encoding: When Characters Go Wrong

Open a CSV file that was created on a computer in Brazil, Japan, or Germany, and you might see something like this:

nome,cidade,populaÃ§Ã£o
SÃ£o Paulo,SP,12325000

That's not a formatting glitch. It's an encoding problem.

Encoding is the system a computer uses to convert characters (letters, numbers, symbols) into bytes that can be stored in a file. Different encoding systems represent the same characters differently. The two you'll encounter most often are:

  • UTF-8: The modern standard. It can represent virtually every character in every language — accented letters, Chinese characters, Arabic script, emoji, you name it. Most files created after 2010 use UTF-8.
  • Latin-1 (also called ISO-8859-1): An older encoding that covers Western European languages. It can't handle characters outside that range. Many files exported from legacy systems, older Windows applications, or European government databases use Latin-1.

When a file's actual encoding doesn't match the one your program assumes, the bytes get misinterpreted. Sometimes you get an outright UnicodeDecodeError; sometimes the file loads but you see garbled characters — what programmers call mojibake. The fix is telling pandas which encoding to use:

# Default assumes UTF-8; on a Latin-1 file this often raises
# UnicodeDecodeError (or loads with garbled characters)
df = pd.read_csv('brazilian_cities.csv')

# Specify the correct encoding
df = pd.read_csv('brazilian_cities.csv', encoding='latin-1')

How do you know which encoding a file uses? Unfortunately, there's no perfectly reliable way. Here's a practical approach:

  1. Try UTF-8 first (the default). If it works without garbled characters, you're done.
  2. If you see garbled characters, try encoding='latin-1'. This covers about 90% of non-UTF-8 files you'll encounter in practice.
  3. If Latin-1 doesn't work, the chardet library can guess the encoding:
import chardet

with open('mystery_file.csv', 'rb') as f:
    result = chardet.detect(f.read(10000))
    print(result)
# {'encoding': 'Windows-1252', 'confidence': 0.73}

Action Checklist: Dealing with Encoding Issues

  • [ ] Try loading with default UTF-8 first
  • [ ] If you see garbled text, try encoding='latin-1'
  • [ ] If that fails, try encoding='cp1252' (Windows Western European)
  • [ ] As a last resort, use the chardet library to detect encoding
  • [ ] Once you find the right encoding, document it in a comment

Delimiters: Not Everything Uses Commas

A delimiter is the character that separates values in each row. Commas are the most common, but you'll encounter others:

| Delimiter | Common In | Example |
|---|---|---|
| , (comma) | US/UK data, most web exports | Alice,30,Portland |
| ; (semicolon) | European data (because commas are used for decimals) | Alice;30;Portland |
| \t (tab) | TSV files, database exports | Alice	30	Portland |
| \| (pipe) | Mainframe exports, some government data | Alice\|30\|Portland |

If you try to load a semicolon-delimited file without specifying the delimiter, pandas will treat each entire line as a single column. Your DataFrame will have one column and very confusing values:

# Wrong — pandas looks for commas, finds none
df = pd.read_csv('european_data.csv')
# Result: one column called "name;age;city" with values like "Alice;30;Portland"

# Right — tell pandas to split on semicolons
df = pd.read_csv('european_data.csv', sep=';')

For tab-delimited files, use sep='\t':

df = pd.read_csv('database_export.tsv', sep='\t')

Quick Tip: If you're not sure what delimiter a file uses, open it in a plain text editor (not Excel!) and look at the first few lines. The pattern is usually obvious.
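If you'd rather let Python do the looking, the standard library's csv.Sniffer can guess the delimiter from a sample of the text. A minimal sketch (the sample is inline here; with a real file you'd read the first couple of kilobytes and sniff that):

```python
import csv

sample = "name;age;city\nAlice;30;Portland\nBob;25;Denver\n"

# Sniffer inspects the sample and guesses which delimiter is in use;
# restricting the candidates makes the guess more reliable
dialect = csv.Sniffer().sniff(sample, delimiters=',;\t|')
print(repr(dialect.delimiter))  # ';'
```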

Headers, Skiprows, and Messy Files

Real-world CSV files don't always start with a clean header row on line 1. Sometimes there's metadata, a title, or blank lines at the top:

Report Generated: 2024-03-15
Department: Finance
(blank line)
name,amount,date
Alice,5000,2024-01-15
Bob,3200,2024-02-01

If you load this with a plain read_csv, pandas will think Report Generated: 2024-03-15 is the header. The fix:

# Skip the first 3 lines (metadata + blank line)
df = pd.read_csv('finance_report.csv', skiprows=3)

You can also skip specific rows by passing a list:

# Skip rows 0 and 2 (keep row 1)
df = pd.read_csv('messy.csv', skiprows=[0, 2])

And if your file has no header row at all — just data — tell pandas so it doesn't eat your first row:

# No header — pandas will auto-generate column names (0, 1, 2...)
df = pd.read_csv('no_header.csv', header=None)

# Better — provide your own column names
df = pd.read_csv('no_header.csv', header=None,
                 names=['name', 'age', 'city'])

Other Useful read_csv Parameters

Here are the parameters you'll use again and again:

| Parameter | What It Does | Example |
|---|---|---|
| encoding | Specifies character encoding | encoding='latin-1' |
| sep | Sets the delimiter | sep=';' or sep='\t' |
| header | Which row contains column names (or None) | header=None |
| names | Provide your own column names | names=['a', 'b', 'c'] |
| skiprows | Skip rows at the top | skiprows=3 |
| usecols | Load only specific columns | usecols=['name', 'age'] |
| nrows | Load only the first N rows | nrows=100 |
| na_values | Treat specific strings as NaN | na_values=['N/A', 'missing'] |
| dtype | Force specific data types | dtype={'zip': str} |
| parse_dates | Parse columns as dates | parse_dates=['date'] |

Two parameters deserve special attention:

usecols is your best friend for large files. If a CSV has 50 columns but you only need 3, loading all 50 wastes memory:

# Load only the columns you need
df = pd.read_csv('big_file.csv',
                 usecols=['country', 'year', 'coverage'])

dtype prevents a subtle but common bug. US ZIP codes like 02134 get interpreted as the integer 2134 — the leading zero vanishes. Force it to stay as a string:

df = pd.read_csv('addresses.csv', dtype={'zipcode': str})
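These parameters combine freely in a single call. Here's a sketch that exercises dtype, na_values, and parse_dates together on inline data (the column names are made up for the example):

```python
import io

import pandas as pd

csv_text = (
    "order_id,zipcode,amount,order_date\n"
    "1001,02134,59.99,2024-01-15\n"
    "1002,90210,N/A,2024-02-01\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'zipcode': str},        # keep the leading zero in 02134
    na_values=['N/A'],             # treat 'N/A' as missing
    parse_dates=['order_date'],    # parse as real datetimes
)
print(df['zipcode'].iloc[0])   # '02134', leading zero intact
print(df['order_date'].dtype)  # datetime64[ns]
```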

Debugging Walkthrough: The Mysterious Extra Column

You load a CSV and find an unnamed column called Unnamed: 0 full of numbers (0, 1, 2, 3...). What happened?

The CSV was probably saved with df.to_csv('file.csv'), which writes the DataFrame index as the first column by default. When you load it back, pandas interprets that index column as data and auto-names it Unnamed: 0.

Fix when loading: df = pd.read_csv('file.csv', index_col=0)

Fix when saving: df.to_csv('file.csv', index=False)

🔄 Check Your Understanding

  1. A colleague sends you a CSV file from France. When you load it, the text looks garbled and numbers have commas where you'd expect decimal points. What two read_csv parameters would you adjust?
  2. You receive a CSV that starts with 5 lines of metadata before the header row. Write the read_csv call to load it correctly.
  3. Why would you use dtype={'id': str} when loading a file, even if the ID column contains only numbers?

12.2 Excel: The Spreadsheet Everyone Loves to Hate

Excel is the most widely used data tool on the planet. According to Microsoft, over 750 million people use Excel. For better or worse, an enormous amount of the world's data lives in .xlsx files. You will encounter Excel files constantly in professional settings, and you need to know how to handle them.

Why Excel Files Are Tricky

Excel files are not text files. Unlike CSVs, you can't open an .xlsx file in a text editor and read it — it's a compressed binary format (technically a ZIP archive containing XML files). This means pandas needs an additional library to read them.

For .xlsx files (modern Excel), pandas uses a library called openpyxl. For older .xls files, it uses xlrd. You'll need to install these:

# In your terminal or Anaconda Prompt:
# pip install openpyxl

Most of the time, if openpyxl is installed, pd.read_excel just works:

df = pd.read_excel('sales_data.xlsx')

Sheet Selection

Here's where Excel gets interesting. A single Excel file can contain multiple sheets (tabs), each holding a different dataset. By default, read_excel loads only the first sheet:

# Loads the first sheet
df = pd.read_excel('quarterly_report.xlsx')

# Load a specific sheet by name
df = pd.read_excel('quarterly_report.xlsx',
                   sheet_name='Q3 Sales')

# Load a specific sheet by position (0-indexed)
df = pd.read_excel('quarterly_report.xlsx',
                   sheet_name=2)  # Third sheet

You can even load multiple sheets at once:

# Load all sheets — returns a dictionary of DataFrames
all_sheets = pd.read_excel('quarterly_report.xlsx',
                           sheet_name=None)

# Access a specific sheet from the dictionary
q1 = all_sheets['Q1 Sales']
q2 = all_sheets['Q2 Sales']

This is incredibly useful when departments send you a single workbook with data split across tabs. Instead of opening Excel and manually copying data, you load all sheets and concatenate them in pandas:

all_sheets = pd.read_excel('report.xlsx',
                           sheet_name=None)
combined = pd.concat(all_sheets.values(),
                     ignore_index=True)

Handling Messy Excel Layouts

Excel users love formatting. They add titles, merge cells, insert logos, leave blank rows for visual spacing, and put footnotes at the bottom. All of this makes the data harder to load programmatically. Here are the most common problems and their fixes.

Problem 1: Header isn't on row 1. Someone put a report title and date on the first two rows:

# Skip the first 3 rows (title, date, blank line)
df = pd.read_excel('report.xlsx', skiprows=3)

Problem 2: The header row exists but pandas missed it. Specify which row contains the headers:

# Row 4 (0-indexed) contains the column names
df = pd.read_excel('report.xlsx', header=4)

Problem 3: Extra columns you don't need. Load only specific columns by letter or name:

# By column letters (like Excel references)
df = pd.read_excel('report.xlsx', usecols='A:D')

# By column names
df = pd.read_excel('report.xlsx',
                   usecols=['Name', 'Revenue', 'Region'])

Problem 4: Merged cells. This is Excel's nastiest trick. When someone merges cells A1:A5 and puts "Region: North" in the merged cell, only the first row gets the value — the rest become NaN. The fix is usually a forward fill with ffill() after loading (the older fillna(method='ffill') spelling is deprecated in recent pandas):

df = pd.read_excel('merged_hell.xlsx')
df['region'] = df['region'].ffill()
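You can see the pattern with a small made-up DataFrame standing in for the loaded sheet:

```python
import pandas as pd

# What a merged-cell column typically looks like after loading:
# only the first row of each merged block has a value
df = pd.DataFrame({
    'region': ['North', None, None, 'South', None],
    'sales': [10, 12, 9, 14, 11],
})

df['region'] = df['region'].ffill()  # forward-fill the gaps
print(df['region'].tolist())  # ['North', 'North', 'North', 'South', 'South']
```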

Debugging Walkthrough: "No module named 'openpyxl'"

You run pd.read_excel('file.xlsx') and get: ImportError: Missing optional dependency 'openpyxl'.

Cause: pandas doesn't include openpyxl by default. It's a separate package.

Fix: Install it with pip install openpyxl (or conda install openpyxl if using Anaconda). Restart your Jupyter kernel after installing.

For .xls files (older format): You need xlrd instead: pip install xlrd

When to Use Excel vs. CSV

Here's a practical comparison:

| Scenario | Use CSV | Use Excel |
|---|---|---|
| Data pipeline / automation | Yes | Usually no |
| Sharing with non-technical people | Maybe | Yes |
| Multiple related tables | Multiple files | Multiple sheets |
| Data > 100,000 rows | Yes | Caution (Excel struggles) |
| Need to preserve formatting | No | Yes |
| Version control (Git) | Yes (text-based) | No (binary) |

In data science, CSV is the workhorse format for data you're going to analyze. Excel is the format you'll receive data in when someone else has been collecting it. Your workflow is often: receive Excel, load with pandas, analyze, export results back to Excel for stakeholders.

🔄 Check Your Understanding

  1. An Excel workbook has sheets named "2022", "2023", and "2024". Write the code to load all three sheets and combine them into a single DataFrame.
  2. Your colleague's Excel file has a company logo and title in the first 4 rows, with the actual data starting on row 5. What parameter do you use?
  3. Why might you choose to save your analysis results as Excel instead of CSV, even though you prefer CSV for your own work?

12.3 JSON: Data with Structure

If CSV is a flat table and Excel is a formatted table, then JSON (JavaScript Object Notation) is a tree. JSON can represent not just rows and columns, but nested, hierarchical data — lists inside dictionaries inside lists. This makes it the format of choice for web APIs (which we'll cover in Chapter 13) and modern applications.

What JSON Looks Like

You've actually seen JSON-like structure before — it looks almost identical to Python dictionaries and lists:

{
  "name": "Alice",
  "age": 30,
  "city": "Portland",
  "skills": ["Python", "SQL", "Tableau"]
}

A JSON file containing multiple records typically uses an array (list) of objects (dictionaries):

[
  {"name": "Alice", "age": 30, "city": "Portland"},
  {"name": "Bob", "age": 25, "city": "Denver"},
  {"name": "Carol", "age": 35, "city": "Seattle"}
]

Unlike CSV, JSON preserves data types — numbers stay numbers, strings stay strings, and null (JSON's version of None) is distinct from an empty string. JSON also supports nested structures, which is both its superpower and its headache.
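Python's json module makes the type preservation concrete. Parsing a small JSON string shows numbers arriving as numbers and null arriving as None:

```python
import json

record = json.loads('{"name": "Alice", "age": 30, "score": null, "note": ""}')

print(type(record['age']))      # <class 'int'>, not a string
print(record['score'] is None)  # True: null maps to Python's None
print(record['note'] == '')     # True: the empty string, distinct from null
```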

Loading Simple JSON

When JSON has a flat, table-like structure (an array of objects with the same keys), loading it is trivial:

df = pd.read_json('people.json')

That's it. pandas recognizes the structure and creates a DataFrame with columns matching the JSON keys.

The Nested JSON Problem

But JSON from real-world sources — especially APIs — is rarely flat. Consider this response from a weather API:

[
  {
    "city": "Portland",
    "temperature": {"value": 72, "unit": "F"},
    "conditions": {
      "description": "Partly cloudy",
      "wind": {"speed": 12, "direction": "NW"}
    }
  }
]

If you load this with pd.read_json(), the temperature and conditions columns will contain dictionaries — not very useful for analysis. This is nested JSON, and it requires a different tool.

json_normalize: Flattening the Tree

pd.json_normalize() takes nested JSON and flattens it into a table. It converts nested keys into dot-separated column names:

import json

with open('weather.json', 'r') as f:
    data = json.load(f)

df = pd.json_normalize(data)
print(df.columns.tolist())
['city', 'temperature.value', 'temperature.unit',
 'conditions.description', 'conditions.wind.speed',
 'conditions.wind.direction']

The nested dictionary {"value": 72, "unit": "F"} became two columns: temperature.value and temperature.unit. The doubly-nested wind data became conditions.wind.speed and conditions.wind.direction. This is what "flattening" means — taking a tree structure and spreading it out into a table.

You can control the separator (the default is .):

df = pd.json_normalize(data, sep='_')
# Columns: temperature_value, temperature_unit, etc.

Handling JSON Arrays Inside Records

Sometimes a JSON record contains a list (array) of items — for example, a student with multiple courses:

[
  {
    "student": "Alice",
    "gpa": 3.8,
    "courses": [
      {"name": "Statistics", "grade": "A"},
      {"name": "Python", "grade": "A-"}
    ]
  }
]

This is tricky because Alice is one student but has two courses. You need to decide: do you want one row per student, or one row per course?

For one row per course (which is usually what you want for analysis), use the record_path parameter:

df = pd.json_normalize(
    data,
    record_path='courses',
    meta=['student', 'gpa']
)
       name grade student  gpa
0  Statistics     A   Alice  3.8
1      Python    A-   Alice  3.8

The record_path tells json_normalize which nested list to expand into rows. The meta parameter specifies which parent-level fields to repeat for each row. This is one of those functions where the parameter names aren't obvious but the results are exactly what you need.
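Here's the same pattern with the data defined inline, so you can run it without a file:

```python
import pandas as pd

data = [{
    'student': 'Alice',
    'gpa': 3.8,
    'courses': [
        {'name': 'Statistics', 'grade': 'A'},
        {'name': 'Python', 'grade': 'A-'},
    ],
}]

# One row per course; student and gpa repeat on each row
df = pd.json_normalize(data, record_path='courses', meta=['student', 'gpa'])
print(df.shape)  # (2, 4)
```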

Working with the json Module Directly

Sometimes you need more control than pd.read_json() provides. Python's built-in json module lets you load raw JSON and then selectively extract what you need:

import json

with open('complex_data.json', 'r') as f:
    raw = json.load(f)

# Explore the structure
print(type(raw))          # list? dict?
print(type(raw[0]))       # What's inside?
print(raw[0].keys())      # What are the keys?

# Extract just the part you want; the exact path depends
# on the structure you discovered above
records = raw['data']['results']
df = pd.json_normalize(records)

This "load, explore, extract, normalize" pattern is extremely common when working with API responses. You'll use it heavily in Chapter 13.

Quick Reference: JSON Loading Patterns

| Scenario | Code |
|---|---|
| Flat JSON (array of objects) | pd.read_json('file.json') |
| Nested JSON | pd.json_normalize(data) |
| JSON with arrays inside records | pd.json_normalize(data, record_path='items', meta=['id']) |
| Need to explore structure first | json.load(f) then pd.json_normalize() |

🔄 Check Your Understanding

  1. What's the key difference between CSV and JSON when it comes to representing data types?
  2. You have a JSON record where "address" contains {"street": "123 Main", "city": "Portland", "state": "OR"}. After calling pd.json_normalize(), what column names would you expect?
  3. When would you use record_path in json_normalize? Give a one-sentence answer.

12.4 Databases and SQL: Asking Questions in a New Language

Every format we've covered so far — CSV, Excel, JSON — stores data in files. But a huge proportion of the world's data doesn't live in files. It lives in databases.

A database is a system for storing, organizing, and retrieving structured data. Think of it as a collection of tables (similar to DataFrames) managed by software that handles storage, security, concurrency (multiple users accessing data simultaneously), and efficient retrieval.

You interact with databases using SQL (Structured Query Language, pronounced "sequel" or "S-Q-L" — both are acceptable). SQL is a language specifically designed for asking questions of tabular data. It's not a general-purpose programming language like Python — you can't build a web app in SQL — but for querying data, it's exceptionally powerful and has been the standard for over 40 years.

Why Learn SQL in a Python Course?

Three reasons:

  1. SQL is everywhere. Most organizations store their critical data in databases, not files. If you want to access customer data, sales records, or operational metrics at most companies, you'll need SQL.
  2. SQL is expected. Every data science job posting lists SQL as a required skill. It's as fundamental to the field as Python.
  3. pandas and SQL complement each other. Use SQL to extract the data you need from a database, then use pandas to clean, transform, and analyze it.

SQLite: A Database in a File

Most databases (PostgreSQL, MySQL, SQL Server) are separate server applications that run on dedicated hardware. They're powerful but require setup, authentication, and network access. For learning, we'll use SQLite, which is delightfully simple: a SQLite database is a single file on your computer.

SQLite comes built into Python — no installation needed. And pandas can read from it directly.

Let's say you have a SQLite database file called university.db containing two tables: students and grades. Here's how to connect and query:

import sqlite3

# Create a connection to the database file
conn = sqlite3.connect('university.db')

# Read a SQL query directly into a DataFrame
df = pd.read_sql('SELECT * FROM students', conn)

# Always close the connection when you're done
conn.close()

That pd.read_sql() call is the bridge between SQL and pandas. You write SQL to describe what data you want, and pandas returns it as a DataFrame.

SQL Basics: Your First Queries

SQL has a declarative style — you describe what you want, not how to get it. This is different from Python, where you write step-by-step instructions.

SELECT — choosing columns:

-- Select specific columns
SELECT name, age, major FROM students

-- Select all columns
SELECT * FROM students

WHERE — filtering rows:

-- Students in Computer Science
SELECT name, gpa FROM students
WHERE major = 'Computer Science'

-- Students with GPA above 3.5
SELECT name, gpa FROM students
WHERE gpa > 3.5

-- Combine conditions with AND/OR
SELECT name, gpa FROM students
WHERE major = 'Computer Science' AND gpa > 3.5

ORDER BY — sorting results:

-- Sort by GPA, highest first
SELECT name, gpa FROM students
ORDER BY gpa DESC

-- Sort by name alphabetically
SELECT name, gpa FROM students
ORDER BY name ASC

GROUP BY — aggregating data:

-- Average GPA by major
SELECT major, AVG(gpa) as avg_gpa
FROM students
GROUP BY major

-- Count students per major
SELECT major, COUNT(*) as num_students
FROM students
GROUP BY major

JOIN — combining tables:

-- Combine students with their grades
SELECT s.name, g.course, g.grade
FROM students s
JOIN grades g ON s.student_id = g.student_id

A JOIN connects two tables using a shared column — here, student_id exists in both the students and grades tables. This is conceptually identical to pd.merge(), which you learned in Chapter 9.
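You can verify the correspondence yourself with an in-memory SQLite database (the table names and data here are invented for the demo):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(':memory:')  # a throwaway database in RAM

students = pd.DataFrame({'student_id': [1, 2], 'name': ['Alice', 'Bob']})
grades = pd.DataFrame({'student_id': [1, 1, 2],
                       'course': ['Statistics', 'Python', 'Statistics'],
                       'grade': ['A', 'A-', 'B']})
students.to_sql('students', conn, index=False)
grades.to_sql('grades', conn, index=False)

# The SQL JOIN...
sql_df = pd.read_sql(
    "SELECT s.name, g.course, g.grade "
    "FROM students s JOIN grades g ON s.student_id = g.student_id "
    "ORDER BY s.name, g.course",
    conn)
conn.close()

# ...produces the same rows as pd.merge on the shared key
merged = (students.merge(grades, on='student_id')
          [['name', 'course', 'grade']]
          .sort_values(['name', 'course'])
          .reset_index(drop=True))
print(sql_df.equals(merged))  # True
```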

Putting It Together: SQL with pandas

Here's a realistic workflow — querying a university database and analyzing the results in pandas:

import sqlite3
import pandas as pd

conn = sqlite3.connect('university.db')

query = """
SELECT s.name, s.major, g.course, g.grade
FROM students s
JOIN grades g ON s.student_id = g.student_id
WHERE s.major = 'Data Science'
ORDER BY s.name, g.course
"""

df = pd.read_sql(query, conn)
conn.close()

print(df.shape)
print(df.head())

Notice how the SQL query does the heavy lifting — selecting columns, joining tables, filtering, and sorting — and pandas receives a clean, ready-to-analyze DataFrame. This division of labor is the professional pattern: let the database do what it's good at (filtering and joining large datasets), then use pandas for what it's good at (cleaning, transforming, and visualizing).

The Connection String

The conn object is your connection to the database. For SQLite, it's simple:

conn = sqlite3.connect('path/to/database.db')

For other databases (PostgreSQL, MySQL, SQL Server), you use a connection string that specifies the database type, host, port, username, password, and database name. We won't cover those in detail here, but the pattern with pd.read_sql() remains the same — only the connection setup changes.

Debugging Walkthrough: "no such table"

You run pd.read_sql('SELECT * FROM students', conn) and get: OperationalError: no such table: students

Cause: The table name is wrong, or you're connected to the wrong database file.

Fix: List all tables in the database to see what's actually there:

tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type='table'",
    conn
)
print(tables)

🔄 Check Your Understanding

  1. What does SQL stand for, and what is it used for?
  2. How is SQLite different from databases like PostgreSQL or MySQL?
  3. Write a SQL query that selects the name and GPA of all students with a GPA above 3.0, sorted from highest to lowest GPA.
  4. What pandas function is conceptually similar to a SQL JOIN?

12.5 Format Comparison: Choosing the Right Tool

Now that you know how to load data from four different formats, when should you use each one? Here's the comparison table you'll want to bookmark:

| Feature | CSV | Excel | JSON | Database (SQLite) |
|---|---|---|---|---|
| Structure | Flat table | Flat table (multiple sheets) | Hierarchical / nested | Relational tables |
| Data types preserved? | No (everything is text) | Yes | Yes | Yes |
| Human-readable? | Yes (text editor) | No (binary format) | Yes (text editor) | No (binary format) |
| Max practical size | Millions of rows | ~1 million rows | Varies (memory-limited) | Billions of rows |
| Multiple tables? | Separate files | Multiple sheets | Nested objects | Multiple tables + JOINs |
| Common source | Data exports, open data | Business reports, manual entry | Web APIs, modern apps | Enterprise systems |
| pandas function | read_csv() | read_excel() | read_json() / json_normalize() | read_sql() |
| Extra library needed? | No | openpyxl | No | sqlite3 (built-in) |
| Version control friendly? | Yes | No | Yes | No |

When to recommend each format:

  • CSV when simplicity and compatibility matter, and the data is a single flat table.
  • Excel when sharing with non-technical stakeholders or when the data naturally has multiple related sheets.
  • JSON when the data is hierarchical or comes from a web API.
  • Database when the data is large, frequently updated, accessed by multiple users, or involves relationships between multiple tables.

Real-World Pattern: In practice, you'll often chain formats. An analyst might query a database with SQL, export the results as CSV, load them into pandas for analysis, and save the final report as Excel for their manager. Each format plays a role in the data's journey.


12.6 Code Implementation: A Complete Data Loading Toolkit

Let's build a practical function that can load data from any of these formats based on the file extension:

import pandas as pd
import json
import sqlite3

def load_data(filepath, **kwargs):
    """Load data from CSV, Excel, JSON, or SQLite.

    Detects format from file extension and passes
    additional keyword arguments to the appropriate
    pandas reader.
    """
    ext = filepath.rsplit('.', 1)[-1].lower()

    if ext == 'csv':
        return pd.read_csv(filepath, **kwargs)
    elif ext in ('xlsx', 'xls'):
        return pd.read_excel(filepath, **kwargs)
    elif ext == 'json':
        return pd.read_json(filepath, **kwargs)
    elif ext == 'db':
        query = kwargs.pop('query', 'SELECT * FROM data')
        conn = sqlite3.connect(filepath)
        df = pd.read_sql(query, conn)
        conn.close()
        return df
    else:
        raise ValueError(f"Unsupported format: {ext}")

This function won't cover every edge case, but it captures the pattern: detect the format, call the right reader, pass through any additional parameters. It's a great starting point for a personal utility library.


12.7 Project Checkpoint: Loading Supplementary Data

It's time to expand our running vaccination project. In previous chapters, you've been working with the WHO vaccination coverage dataset. Now you're going to bring in two additional data sources:

  1. An Excel file (population_data.xlsx) containing country population figures with data on separate sheets for different years
  2. A JSON file (country_metadata.json) containing country metadata — region, income group, and geographic coordinates

Step 1: Load the Population Data from Excel

# The population file has one sheet per year
pop_all = pd.read_excel(
    'population_data.xlsx',
    sheet_name=None  # Load all sheets
)

# Each sheet name is a year like "2020", "2021", etc.
pop_frames = []
for year, df in pop_all.items():
    df['year'] = int(year)
    pop_frames.append(df)

population = pd.concat(pop_frames, ignore_index=True)
print(f"Population data: {population.shape}")
print(population.head())

Step 2: Load the Country Metadata from JSON

import json

with open('country_metadata.json', 'r') as f:
    raw_meta = json.load(f)

# The JSON has nested geographic coordinates
metadata = pd.json_normalize(raw_meta)
print(f"Metadata: {metadata.shape}")
print(metadata.columns.tolist())

Step 3: Merge Everything Together

# Start with the vaccination data from earlier chapters
vacc = pd.read_csv('vaccination_coverage.csv')

# Merge population data
vacc = vacc.merge(
    population[['country_code', 'year', 'population']],
    on=['country_code', 'year'],
    how='left'
)

# Merge country metadata
vacc = vacc.merge(
    metadata[['country_code', 'income_group',
              'region_name']],
    on='country_code',
    how='left'
)

print(f"Enriched dataset: {vacc.shape}")
print(vacc.head())

This is a realistic data integration workflow. You loaded data from three different formats (CSV, Excel, JSON), each requiring different loading techniques, and merged them into a single analytical dataset. Elena does this kind of work every week in her public health role — pulling population figures from one source, vaccination data from another, and geographic metadata from a third.

What You've Built So Far (Project Progress):

  • Chapters 7-8: Loaded and cleaned the vaccination CSV
  • Chapters 9-10: Reshaped and transformed the data, cleaned text fields
  • Chapter 11: Added date handling and time series features
  • Chapter 12: Integrated population data (Excel) and country metadata (JSON)
  • Next: Chapter 13 will add live API data

12.8 Spaced Review: Chapters 1-11

Learning research shows that revisiting earlier material at spaced intervals dramatically improves retention. Let's do a quick check on concepts from previous chapters. Try answering these without looking back.

From Chapter 3 (Python Fundamentals I): What is the difference between = and == in Python? When do you use each one?

Check your answer: `=` is the assignment operator — it gives a name to a value (`x = 5`). `==` is the comparison operator — it checks whether two values are equal and returns `True` or `False` (`x == 5`). Using `=` when you mean `==` (or vice versa) is one of the most common beginner bugs.

From Chapter 7 (Intro to pandas): What is the difference between a Series and a DataFrame?

Check your answer: A Series is a single column of data (one-dimensional) with an index. A DataFrame is a table (two-dimensional) made up of multiple Series that share the same index. You can think of a DataFrame as a dictionary of Series.
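
The "dictionary of Series" picture can be made concrete in a few lines (the population figures below are illustrative, not exact):

```python
import pandas as pd

# A Series is one labelled column: values plus an index
pop = pd.Series([331, 67, 126], index=['USA', 'FRA', 'JPN'],
                name='population')

# A DataFrame is essentially a dict of Series sharing that index
df = pd.DataFrame({
    'population': pop,
    'continent': pd.Series(['NA', 'EU', 'AS'],
                           index=['USA', 'FRA', 'JPN']),
})

print(type(df['population']))  # selecting one column gives back a Series
```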

From Chapter 8 (Cleaning Messy Data): Name three types of data quality problems you might find when loading a new dataset.

Check your answer: Common data quality problems include: (1) missing values (NaN, blank cells), (2) inconsistent formatting (e.g., "USA" vs "United States" vs "US"), (3) wrong data types (numbers stored as strings), (4) duplicate rows, (5) outliers or impossible values (e.g., age = -5 or temperature = 999).
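
A quick first-look routine catches most of these. The DataFrame below is a deliberately messy, made-up example exhibiting the problems listed above:

```python
import pandas as pd
import numpy as np

# Hypothetical messy dataset: inconsistent names, numbers stored as
# strings, a missing value, a duplicate row, and an impossible reading
df = pd.DataFrame({
    'country': ['USA', 'United States', 'FRA', 'FRA'],
    'age': ['34', '34', '29', '29'],        # numbers stored as strings
    'temp': [36.6, np.nan, 999.0, 999.0],   # missing value + outlier
})

print(df.isna().sum())         # missing values per column
print(df.duplicated().sum())   # fully duplicated rows
print(df.dtypes)               # 'age' is object, not a numeric type
print(df['temp'].describe())   # a max of 999 flags an impossible value
```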

From Chapter 9 (Reshaping and Transforming): When would you use pd.melt() vs. pd.pivot_table()?

Check your answer: Use `melt()` to convert wide format to long format — when column headers contain data values (like year columns "2020", "2021", "2022") that should be in rows. Use `pivot_table()` to convert long format to wide format — when you want row values to become column headers, or when you need to aggregate data into a summary table.
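
The round trip between the two shapes makes the contrast vivid (the coverage numbers here are invented for illustration):

```python
import pandas as pd

# Wide format: year values live in the column headers
wide = pd.DataFrame({
    'country': ['USA', 'FRA'],
    '2020': [90, 95],
    '2021': [91, 96],
})

# melt: wide -> long (years move from headers into a 'year' column)
long = pd.melt(wide, id_vars='country',
               var_name='year', value_name='coverage')
print(long)

# pivot_table: long -> wide again (years back into column headers)
back = long.pivot_table(index='country', columns='year', values='coverage')
print(back)
```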

From Chapter 11 (Dates and Times): What does dt.month give you, and what object do you use it on?

Check your answer: `dt.month` extracts the month (as an integer, 1-12) from a datetime Series. You use it on a pandas Series that has been converted to datetime type (via `pd.to_datetime()`). For example: `df['date'].dt.month` returns a Series of month numbers.

Chapter Summary

This chapter was about meeting data where it lives. You learned four essential loading techniques:

  1. CSV — the universal plain-text format. You mastered encoding (encoding='latin-1'), delimiters (sep=';'), header management (skiprows, header), selective loading (usecols), and type control (dtype).

  2. Excel — the business world's format. You learned sheet selection (sheet_name), handling messy layouts (skiprows, header), and loading all sheets into a dictionary (sheet_name=None).

  3. JSON — the web's format. You learned to load flat JSON with pd.read_json(), flatten nested structures with pd.json_normalize(), and use record_path and meta to handle arrays inside records.

  4. Databases/SQL — the enterprise format. You wrote your first SQL queries (SELECT, WHERE, JOIN, GROUP BY) and used pd.read_sql() to bring database data into pandas.
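
The four loaders can be recapped in one self-contained sketch. To keep it runnable without files on disk, the CSV comes from an in-memory string, the database is an in-memory SQLite, and the Excel call is shown only as a comment (all filenames, table names, and values here are hypothetical):

```python
import io
import sqlite3
import pandas as pd

# 1. CSV -- a semicolon-delimited, comma-decimal "European" export
csv_text = "country;coverage\nFRA;96,0\n"
df_csv = pd.read_csv(io.StringIO(csv_text), sep=';', decimal=',')

# 2. Excel -- same pattern; shown as a comment since it needs a real file:
# df_xl = pd.read_excel('population.xlsx', sheet_name='Data', skiprows=3)

# 3. JSON -- nested records flattened with json_normalize
records = [{'country': 'FRA', 'stats': {'coverage': 96.0}}]
df_json = pd.json_normalize(records)   # yields a 'stats.coverage' column

# 4. SQL -- query an in-memory SQLite database with read_sql
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE vacc (country TEXT, coverage REAL)")
conn.execute("INSERT INTO vacc VALUES ('FRA', 96.0)")
df_sql = pd.read_sql(
    "SELECT country, coverage FROM vacc WHERE coverage > 90", conn)
conn.close()

print(df_csv, df_json, df_sql, sep='\n')
```

Whatever the container, the destination is always the same: a DataFrame you can clean, merge, and analyze with the tools from Chapters 7-11.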

The threshold concept for this chapter is that data doesn't live in one format. The real skill isn't memorizing every parameter of read_csv — it's the confidence to say, "I don't know what format this file is in, but I can figure it out and get the data into a DataFrame." That confidence comes from knowing the patterns, having a debugging toolkit, and being willing to inspect a file in a text editor when things go wrong.

In Chapter 13, we'll take this further: instead of loading data from files on your computer, you'll pull data directly from the internet — from web APIs and web pages. The same mental model applies: figure out what format the data is in, extract what you need, and get it into a DataFrame.


Concept Inventory Update

New concepts introduced in this chapter:

| # | Concept | First Introduced | Section |
|---|---------|------------------|---------|
| 1 | File encoding (UTF-8, Latin-1) | This chapter | 12.1 |
| 2 | Delimiters and format variations | This chapter | 12.1 |
| 3 | JSON and nested data structures | This chapter | 12.3 |
| 4 | Relational databases and SQL | This chapter | 12.4 |
| 5 | Data integration across formats | This chapter | 12.7 |

Running total: 49 concepts, 134 terms, 32 techniques introduced across Chapters 1-12.


Next up: Chapter 13 — Getting Data from the Web. You'll learn to talk to APIs, scrape web pages, and build automated data collection pipelines. The internet becomes your dataset.