Learning Objectives
- Load CSV files with encoding, delimiter, and header options using pd.read_csv
- Read Excel workbooks with sheet selection, header rows, and skiprows using pd.read_excel
- Parse JSON data including nested structures using pd.read_json and pd.json_normalize
- Query a SQLite database using basic SQL (SELECT, WHERE, JOIN, GROUP BY) via pd.read_sql
- Compare file formats and recommend the appropriate one for a given scenario
In This Chapter
- Chapter Overview
- 12.1 CSV: The Format That Looks Simple (Until It Isn't)
- 12.2 Excel: The Spreadsheet Everyone Loves to Hate
- 12.3 JSON: Data with Structure
- 12.4 Databases and SQL: Asking Questions in a New Language
- 12.5 Format Comparison: Choosing the Right Tool
- 12.6 Code Implementation: A Complete Data Loading Toolkit
- 12.7 Project Checkpoint: Loading Supplementary Data
- 12.8 Spaced Review: Chapters 1-11
- Chapter Summary
- Concept Inventory Update
Chapter 12: Getting Data from Files — CSVs, Excel, JSON, and Databases
"Data is a precious thing and will last longer than the systems themselves." — Tim Berners-Lee, inventor of the World Wide Web
Chapter Overview
Here is a truth that no textbook told you about in Chapter 7: in the real world, data almost never shows up as a single, clean CSV file sitting politely in your project folder.
It arrives as an Excel workbook with seven sheets, three of which are blank, and one of which has the column headers on row 4 because someone put a logo on top. It arrives as a JSON file from a web API where every record contains another record nested three levels deep. It arrives as a CSV exported from a European system where the delimiter is a semicolon and decimal numbers use commas instead of periods. And sometimes it lives in a database, and you need to write SQL to extract exactly the rows and columns you need.
If data science were a cooking show, the previous chapters taught you recipes. This chapter teaches you how to open the packaging. It's less glamorous, but here's the thing: you can't cook with ingredients you can't unwrap.
In Chapters 7 through 11, you loaded data with a simple pd.read_csv('file.csv') and moved on. That was intentional — we didn't want file format headaches distracting you from learning pandas. But now that you're comfortable with DataFrames, it's time to confront the messy reality of how data actually arrives. And the good news is that pandas has a function for almost everything.
In this chapter, you will learn to:
- Load CSV files with encoding, delimiter, and header options using pd.read_csv (all paths)
- Read Excel workbooks with sheet selection, header rows, and skiprows using pd.read_excel (all paths)
- Parse JSON data including nested structures using pd.read_json and pd.json_normalize (all paths)
- Query a SQLite database using basic SQL (SELECT, WHERE, JOIN, GROUP BY) via pd.read_sql (standard + deep dive paths)
- Compare file formats and recommend the appropriate one for a given scenario (all paths)
12.1 CSV: The Format That Looks Simple (Until It Isn't)
You've been using CSVs since Chapter 6. You might think there's nothing left to learn. Let's fix that.
What a CSV Actually Is
CSV stands for Comma-Separated Values. At its core, a CSV file is just a plain text file where each line is a row and commas separate the values within each row:
name,age,city
Alice,30,Portland
Bob,25,Denver
That's it. No formatting, no colors, no formulas, no data types. Just text with commas. This simplicity is CSV's greatest strength — virtually every program in existence can read and write CSV files. It's also its greatest weakness — because everything is text, there's no built-in way to specify that 30 is an integer, that Portland is a string, or that a date column should be parsed as a date.
In Chapter 6, you loaded CSVs with Python's built-in csv module, and everything was a string. In Chapter 7, you upgraded to pd.read_csv(), which tries to infer data types automatically. But read_csv is much more powerful than the simple calls you've been making. It has over 50 parameters. We'll cover the ones that matter most.
Encoding: When Characters Go Wrong
Open a CSV file that was created on a computer in Brazil, Japan, or Germany, and you might see something like this:
nome,cidade,populaÃ§Ã£o
SÃ£o Paulo,SP,12325000
That's not a formatting glitch. It's an encoding problem.
Encoding is the system a computer uses to convert characters (letters, numbers, symbols) into bytes that can be stored in a file. Different encoding systems represent the same characters differently. The two you'll encounter most often are:
- UTF-8: The modern standard. It can represent virtually every character in every language — accented letters, Chinese characters, Arabic script, emoji, you name it. Most files created after 2010 use UTF-8.
- Latin-1 (also called ISO-8859-1): An older encoding that covers Western European languages. It can't handle characters outside that range. Many files exported from legacy systems, older Windows applications, or European government databases use Latin-1.
When you open a Latin-1 file but your program assumes UTF-8, the bytes get misinterpreted, and you see garbled characters — what programmers call mojibake. The fix is telling pandas which encoding to use:
# Default — assumes UTF-8
df = pd.read_csv('brazilian_cities.csv')
# Specify the correct encoding
df = pd.read_csv('brazilian_cities.csv', encoding='latin-1')
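You can reproduce mojibake in plain Python to see exactly what's happening: encode some text as UTF-8, then decode those same bytes as Latin-1.

```python
text = 'São Paulo'

# 'ã' is two bytes in UTF-8 (0xC3 0xA3); Latin-1 reads each
# byte as its own character, so the two bytes become 'Ã£'
mojibake = text.encode('utf-8').decode('latin-1')
print(mojibake)  # SÃ£o Paulo
```

This is the same mismatch that produces garbled CSV files, just in miniature.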
How do you know which encoding a file uses? Unfortunately, there's no perfectly reliable way. Here's a practical approach:
- Try UTF-8 first (the default). If it works without garbled characters, you're done.
- If you see garbled characters, try encoding='latin-1'. This covers about 90% of non-UTF-8 files you'll encounter in practice.
- If Latin-1 doesn't work, the chardet library can guess the encoding:
import chardet
with open('mystery_file.csv', 'rb') as f:
    result = chardet.detect(f.read(10000))
print(result)
# {'encoding': 'Windows-1252', 'confidence': 0.73}
Action Checklist: Dealing with Encoding Issues
- [ ] Try loading with default UTF-8 first
- [ ] If you see garbled text, try encoding='latin-1'
- [ ] If that fails, try encoding='cp1252' (Windows Western European)
- [ ] As a last resort, use the chardet library to detect encoding
- [ ] Once you find the right encoding, document it in a comment
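The checklist above can be folded into a small helper. This is a sketch, not a pandas feature, and the name read_csv_any_encoding is our own invention. Note the ordering: cp1252 comes before latin-1 because Latin-1 maps every possible byte to some character, so it never raises an error; it can only produce wrong characters, which is why you should still eyeball the result.

```python
import pandas as pd

def read_csv_any_encoding(path, encodings=('utf-8', 'cp1252', 'latin-1'), **kwargs):
    """Try each encoding in order; return the first DataFrame that decodes."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc, **kwargs)
        except UnicodeDecodeError as err:
            last_error = err  # wrong encoding for these bytes, try the next
    raise last_error
```

Once the helper finds an encoding that works, document it in a comment so the next person doesn't repeat the detective work.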
Delimiters: Not Everything Uses Commas
A delimiter is the character that separates values in each row. Commas are the most common, but you'll encounter others:
| Delimiter | Common In | Example |
|---|---|---|
| , (comma) | US/UK data, most web exports | Alice,30,Portland |
| ; (semicolon) | European data (because commas are used for decimals) | Alice;30;Portland |
| \t (tab) | TSV files, database exports | Alice 30 Portland |
| \| (pipe) | Mainframe exports, some government data | Alice\|30\|Portland |
If you try to load a semicolon-delimited file without specifying the delimiter, pandas will treat each entire line as a single column. Your DataFrame will have one column and very confusing values:
# Wrong — pandas looks for commas, finds none
df = pd.read_csv('european_data.csv')
# Result: one column called "name;age;city" with values like "Alice;30;Portland"
# Right — tell pandas to split on semicolons
df = pd.read_csv('european_data.csv', sep=';')
For tab-delimited files, use sep='\t':
df = pd.read_csv('database_export.tsv', sep='\t')
Quick Tip: If you're not sure what delimiter a file uses, open it in a plain text editor (not Excel!) and look at the first few lines. The pattern is usually obvious.
Headers, Skiprows, and Messy Files
Real-world CSV files don't always start with a clean header row on line 1. Sometimes there's metadata, a title, or blank lines at the top:
Report Generated: 2024-03-15
Department: Finance
(blank line)
name,amount,date
Alice,5000,2024-01-15
Bob,3200,2024-02-01
If you load this with a plain read_csv, pandas will think Report Generated: 2024-03-15 is the header. The fix:
# Skip the first 3 lines (metadata + blank line)
df = pd.read_csv('finance_report.csv', skiprows=3)
You can also skip specific rows by passing a list:
# Skip rows 0 and 2 (keep row 1)
df = pd.read_csv('messy.csv', skiprows=[0, 2])
And if your file has no header row at all — just data — tell pandas so it doesn't eat your first row:
# No header — pandas will auto-generate column names (0, 1, 2...)
df = pd.read_csv('no_header.csv', header=None)
# Better — provide your own column names
df = pd.read_csv('no_header.csv', header=None,
names=['name', 'age', 'city'])
Other Useful read_csv Parameters
Here are the parameters you'll use again and again:
| Parameter | What It Does | Example |
|---|---|---|
| encoding | Specifies character encoding | encoding='latin-1' |
| sep | Sets the delimiter | sep=';' or sep='\t' |
| header | Which row contains column names (or None) | header=None |
| names | Provide your own column names | names=['a', 'b', 'c'] |
| skiprows | Skip rows at the top | skiprows=3 |
| usecols | Load only specific columns | usecols=['name', 'age'] |
| nrows | Load only the first N rows | nrows=100 |
| na_values | Treat specific strings as NaN | na_values=['N/A', 'missing'] |
| dtype | Force specific data types | dtype={'zip': str} |
| parse_dates | Parse columns as dates | parse_dates=['date'] |
Two parameters deserve special attention:
usecols is your best friend for large files. If a CSV has 50 columns but you only need 3, loading all 50 wastes memory:
# Load only the columns you need
df = pd.read_csv('big_file.csv',
usecols=['country', 'year', 'coverage'])
dtype prevents a subtle but common bug. US ZIP codes like 02134 get interpreted as the integer 2134 — the leading zero vanishes. Force it to stay as a string:
df = pd.read_csv('addresses.csv', dtype={'zipcode': str})
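You can watch the bug happen with a small in-memory example; io.StringIO lets read_csv treat a string as a file, so no file on disk is needed:

```python
import io
import pandas as pd

csv_text = "zipcode,city\n02134,Boston\n97201,Portland\n"

# Default: pandas infers an integer column, dropping the leading zero
df_inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype: the column stays text, zeros intact
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={'zipcode': str})

print(df_inferred['zipcode'].iloc[0])  # 2134 (leading zero lost)
print(df_typed['zipcode'].iloc[0])     # 02134 (preserved as text)
```

The same trick protects phone numbers, account IDs, and any other "numbers" that are really labels.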
Debugging Walkthrough: The Mysterious Extra Column
You load a CSV and find an unnamed column called Unnamed: 0 full of numbers (0, 1, 2, 3...). What happened?
The CSV was probably saved with df.to_csv('file.csv'), which writes the DataFrame index as the first column by default. When you load it back, pandas interprets that index column as data and auto-names it Unnamed: 0.
Fix when loading: df = pd.read_csv('file.csv', index_col=0)
Fix when saving: df.to_csv('file.csv', index=False)
🔄 Check Your Understanding
- A colleague sends you a CSV file from France. When you load it, the text looks garbled and numbers have commas where you'd expect decimal points. What two read_csv parameters would you adjust?
- You receive a CSV that starts with 5 lines of metadata before the header row. Write the read_csv call to load it correctly.
- Why would you use dtype={'id': str} when loading a file, even if the ID column contains only numbers?
12.2 Excel: The Spreadsheet Everyone Loves to Hate
Excel is the most widely used data tool on the planet. According to Microsoft, over 750 million people use Excel. For better or worse, an enormous amount of the world's data lives in .xlsx files. You will encounter Excel files constantly in professional settings, and you need to know how to handle them.
Why Excel Files Are Tricky
Excel files are not text files. Unlike CSVs, you can't open an .xlsx file in a text editor and read it — it's a compressed binary format (technically a ZIP archive containing XML files). This means pandas needs an additional library to read them.
For .xlsx files (modern Excel), pandas uses a library called openpyxl. For older .xls files, it uses xlrd. You'll need to install these:
# In your terminal or Anaconda Prompt:
# pip install openpyxl
Most of the time, if openpyxl is installed, pd.read_excel just works:
df = pd.read_excel('sales_data.xlsx')
Sheet Selection
Here's where Excel gets interesting. A single Excel file can contain multiple sheets (tabs), each holding a different dataset. By default, read_excel loads only the first sheet:
# Loads the first sheet
df = pd.read_excel('quarterly_report.xlsx')
# Load a specific sheet by name
df = pd.read_excel('quarterly_report.xlsx',
sheet_name='Q3 Sales')
# Load a specific sheet by position (0-indexed)
df = pd.read_excel('quarterly_report.xlsx',
sheet_name=2) # Third sheet
You can even load multiple sheets at once:
# Load all sheets — returns a dictionary of DataFrames
all_sheets = pd.read_excel('quarterly_report.xlsx',
sheet_name=None)
# Access a specific sheet from the dictionary
q1 = all_sheets['Q1 Sales']
q2 = all_sheets['Q2 Sales']
This is incredibly useful when departments send you a single workbook with data split across tabs. Instead of opening Excel and manually copying data, you load all sheets and concatenate them in pandas:
all_sheets = pd.read_excel('report.xlsx',
sheet_name=None)
combined = pd.concat(all_sheets.values(),
ignore_index=True)
Handling Messy Excel Layouts
Excel users love formatting. They add titles, merge cells, insert logos, leave blank rows for visual spacing, and put footnotes at the bottom. All of this makes the data harder to load programmatically. Here are the most common problems and their fixes.
Problem 1: Header isn't on row 1. Someone put a report title and date on the first two rows:
# Skip the first 3 rows (title, date, blank line)
df = pd.read_excel('report.xlsx', skiprows=3)
Problem 2: The header row exists but pandas missed it. Specify which row contains the headers:
# Row 4 (0-indexed) contains the column names
df = pd.read_excel('report.xlsx', header=4)
Problem 3: Extra columns you don't need. Load only specific columns by letter or name:
# By column letters (like Excel references)
df = pd.read_excel('report.xlsx', usecols='A:D')
# By column names
df = pd.read_excel('report.xlsx',
usecols=['Name', 'Revenue', 'Region'])
Problem 4: Merged cells. This is Excel's nastiest trick. When someone merges cells A1:A5 and puts "Region: North" in the merged cell, only the first row gets the value — the rest become NaN. The fix is usually a forward fill after loading (in current pandas, use .ffill(); the older fillna(method='ffill') spelling is deprecated):
df = pd.read_excel('merged_hell.xlsx')
df['region'] = df['region'].ffill()
Debugging Walkthrough: "No module named 'openpyxl'"
You run pd.read_excel('file.xlsx') and get: ImportError: Missing optional dependency 'openpyxl'.
Cause: pandas doesn't include openpyxl by default. It's a separate package.
Fix: Install it with pip install openpyxl (or conda install openpyxl if using Anaconda). Restart your Jupyter kernel after installing.
For .xls files (older format): You need xlrd instead: pip install xlrd
When to Use Excel vs. CSV
Here's a practical comparison:
| Scenario | Use CSV | Use Excel |
|---|---|---|
| Data pipeline / automation | Yes | Usually no |
| Sharing with non-technical people | Maybe | Yes |
| Multiple related tables | Multiple files | Multiple sheets |
| Data > 100,000 rows | Yes | Caution (Excel struggles) |
| Need to preserve formatting | No | Yes |
| Version control (Git) | Yes (text-based) | No (binary) |
In data science, CSV is the workhorse format for data you're going to analyze. Excel is the format you'll receive data in when someone else has been collecting it. Your workflow is often: receive Excel, load with pandas, analyze, export results back to Excel for stakeholders.
🔄 Check Your Understanding
- An Excel workbook has sheets named "2022", "2023", and "2024". Write the code to load all three sheets and combine them into a single DataFrame.
- Your colleague's Excel file has a company logo and title in the first 4 rows, with the actual data starting on row 5. What parameter do you use?
- Why might you choose to save your analysis results as Excel instead of CSV, even though you prefer CSV for your own work?
12.3 JSON: Data with Structure
If CSV is a flat table and Excel is a formatted table, then JSON (JavaScript Object Notation) is a tree. JSON can represent not just rows and columns, but nested, hierarchical data — lists inside dictionaries inside lists. This makes it the format of choice for web APIs (which we'll cover in Chapter 13) and modern applications.
What JSON Looks Like
You've actually seen JSON-like structure before — it looks almost identical to Python dictionaries and lists:
{
"name": "Alice",
"age": 30,
"city": "Portland",
"skills": ["Python", "SQL", "Tableau"]
}
A JSON file containing multiple records typically uses an array (list) of objects (dictionaries):
[
{"name": "Alice", "age": 30, "city": "Portland"},
{"name": "Bob", "age": 25, "city": "Denver"},
{"name": "Carol", "age": 35, "city": "Seattle"}
]
Unlike CSV, JSON preserves data types — numbers stay numbers, strings stay strings, and null (JSON's version of None) is distinct from an empty string. JSON also supports nested structures, which is both its superpower and its headache.
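You can see the type preservation with Python's built-in json module: numbers come back as numbers, and null becomes None, which is distinct from an empty string.

```python
import json

record = json.loads('{"name": "Alice", "age": 30, "note": null, "blank": ""}')

print(type(record['age']))     # <class 'int'>  (not a string)
print(record['note'] is None)  # True  (JSON null -> Python None)
print(record['blank'] == '')   # True  (empty string is still a string)
```

In a CSV, all three of those values would arrive as undifferentiated text.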
Loading Simple JSON
When JSON has a flat, table-like structure (an array of objects with the same keys), loading it is trivial:
df = pd.read_json('people.json')
That's it. pandas recognizes the structure and creates a DataFrame with columns matching the JSON keys.
The Nested JSON Problem
But JSON from real-world sources — especially APIs — is rarely flat. Consider this response from a weather API:
[
{
"city": "Portland",
"temperature": {"value": 72, "unit": "F"},
"conditions": {
"description": "Partly cloudy",
"wind": {"speed": 12, "direction": "NW"}
}
}
]
If you load this with pd.read_json(), the temperature and conditions columns will contain dictionaries — not very useful for analysis. This is nested JSON, and it requires a different tool.
json_normalize: Flattening the Tree
pd.json_normalize() takes nested JSON and flattens it into a table. It converts nested keys into dot-separated column names:
import json
with open('weather.json', 'r') as f:
    data = json.load(f)
df = pd.json_normalize(data)
print(df.columns.tolist())
['city', 'temperature.value', 'temperature.unit',
 'conditions.description', 'conditions.wind.speed',
 'conditions.wind.direction']
The nested dictionary {"value": 72, "unit": "F"} became two columns: temperature.value and temperature.unit. The doubly-nested wind data became conditions.wind.speed and conditions.wind.direction. This is what "flattening" means — taking a tree structure and spreading it out into a table.
You can control the separator (the default is .):
df = pd.json_normalize(data, sep='_')
# Columns: temperature_value, temperature_unit, etc.
Handling JSON Arrays Inside Records
Sometimes a JSON record contains a list (array) of items — for example, a student with multiple courses:
[
{
"student": "Alice",
"gpa": 3.8,
"courses": [
{"name": "Statistics", "grade": "A"},
{"name": "Python", "grade": "A-"}
]
}
]
This is tricky because Alice is one student but has two courses. You need to decide: do you want one row per student, or one row per course?
For one row per course (which is usually what you want for analysis), use the record_path parameter:
df = pd.json_normalize(
data,
record_path='courses',
meta=['student', 'gpa']
)
name grade student gpa
0 Statistics A Alice 3.8
1 Python A- Alice 3.8
The record_path tells json_normalize which nested list to expand into rows. The meta parameter specifies which parent-level fields to repeat for each row. This is one of those functions where the parameter names aren't obvious but the results are exactly what you need.
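Here is the same pattern as a self-contained snippet, with the student record defined inline so you can run it directly without a file:

```python
import pandas as pd

data = [
    {
        "student": "Alice",
        "gpa": 3.8,
        "courses": [
            {"name": "Statistics", "grade": "A"},
            {"name": "Python", "grade": "A-"},
        ],
    }
]

# One row per course; student and gpa are repeated from the parent level
df = pd.json_normalize(data, record_path='courses', meta=['student', 'gpa'])
print(df)
```

Two courses in the nested list become two rows, each carrying Alice's name and GPA.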
Working with the json Module Directly
Sometimes you need more control than pd.read_json() provides. Python's built-in json module lets you load raw JSON and then selectively extract what you need:
import json
with open('complex_data.json', 'r') as f:
    raw = json.load(f)
# Explore the structure
print(type(raw))      # list? dict?
print(type(raw[0]))   # If it's a list: what's inside?
print(raw[0].keys())  # What are the keys?
# Extract just the part you want (here, assuming a dict
# with the records nested under raw['data']['results'])
records = raw['data']['results']
df = pd.json_normalize(records)
This "load, explore, extract, normalize" pattern is extremely common when working with API responses. You'll use it heavily in Chapter 13.
Quick Reference: JSON Loading Patterns
| Scenario | Code |
|---|---|
| Flat JSON (array of objects) | pd.read_json('file.json') |
| Nested JSON | pd.json_normalize(data) |
| JSON with arrays inside records | pd.json_normalize(data, record_path='items', meta=['id']) |
| Need to explore structure first | json.load(f) then pd.json_normalize() |
🔄 Check Your Understanding
- What's the key difference between CSV and JSON when it comes to representing data types?
- You have a JSON record where "address" contains {"street": "123 Main", "city": "Portland", "state": "OR"}. After calling pd.json_normalize(), what column names would you expect?
- When would you use record_path in json_normalize? Give a one-sentence answer.
12.4 Databases and SQL: Asking Questions in a New Language
Every format we've covered so far — CSV, Excel, JSON — stores data in files. But a huge proportion of the world's data doesn't live in files. It lives in databases.
A database is a system for storing, organizing, and retrieving structured data. Think of it as a collection of tables (similar to DataFrames) managed by software that handles storage, security, concurrency (multiple users accessing data simultaneously), and efficient retrieval.
You interact with databases using SQL (Structured Query Language, pronounced "sequel" or "S-Q-L" — both are acceptable). SQL is a language specifically designed for asking questions of tabular data. It's not a general-purpose programming language like Python — you can't build a web app in SQL — but for querying data, it's exceptionally powerful and has been the standard for over 40 years.
Why Learn SQL in a Python Course?
Three reasons:
- SQL is everywhere. Most organizations store their critical data in databases, not files. If you want to access customer data, sales records, or operational metrics at most companies, you'll need SQL.
- SQL is expected. Every data science job posting lists SQL as a required skill. It's as fundamental to the field as Python.
- pandas and SQL complement each other. Use SQL to extract the data you need from a database, then use pandas to clean, transform, and analyze it.
SQLite: A Database in a File
Most databases (PostgreSQL, MySQL, SQL Server) are separate server applications that run on dedicated hardware. They're powerful but require setup, authentication, and network access. For learning, we'll use SQLite, which is delightfully simple: a SQLite database is a single file on your computer.
SQLite comes built into Python — no installation needed. And pandas can read from it directly.
Let's say you have a SQLite database file called university.db containing two tables: students and grades. Here's how to connect and query:
import sqlite3
# Create a connection to the database file
conn = sqlite3.connect('university.db')
# Read a SQL query directly into a DataFrame
df = pd.read_sql('SELECT * FROM students', conn)
# Always close the connection when you're done
conn.close()
That pd.read_sql() call is the bridge between SQL and pandas. You write SQL to describe what data you want, and pandas returns it as a DataFrame.
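If you don't have a university.db to practice on, you can build a tiny one yourself with the same sqlite3 module. The table layout here is our own invention, chosen to match the queries in this section:

```python
import sqlite3

conn = sqlite3.connect('university.db')

# A minimal students table matching the queries below
conn.execute("""
    CREATE TABLE IF NOT EXISTS students (
        student_id INTEGER PRIMARY KEY,
        name TEXT,
        major TEXT,
        gpa REAL
    )
""")

# Parameterized inserts (?, ?, ?) keep values safely separated from SQL
conn.executemany(
    "INSERT INTO students (name, major, gpa) VALUES (?, ?, ?)",
    [('Alice', 'Computer Science', 3.8),
     ('Bob', 'Data Science', 3.2),
     ('Carol', 'Data Science', 3.6)]
)
conn.commit()
conn.close()
```

After running this once, every query in the rest of the section has something to work against.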
SQL Basics: Your First Queries
SQL has a declarative style — you describe what you want, not how to get it. This is different from Python, where you write step-by-step instructions.
SELECT — choosing columns:
-- Select specific columns
SELECT name, age, major FROM students
-- Select all columns
SELECT * FROM students
WHERE — filtering rows:
-- Students in Computer Science
SELECT name, gpa FROM students
WHERE major = 'Computer Science'
-- Students with GPA above 3.5
SELECT name, gpa FROM students
WHERE gpa > 3.5
-- Combine conditions with AND/OR
SELECT name, gpa FROM students
WHERE major = 'Computer Science' AND gpa > 3.5
ORDER BY — sorting results:
-- Sort by GPA, highest first
SELECT name, gpa FROM students
ORDER BY gpa DESC
-- Sort by name alphabetically
SELECT name, gpa FROM students
ORDER BY name ASC
GROUP BY — aggregating data:
-- Average GPA by major
SELECT major, AVG(gpa) as avg_gpa
FROM students
GROUP BY major
-- Count students per major
SELECT major, COUNT(*) as num_students
FROM students
GROUP BY major
JOIN — combining tables:
-- Combine students with their grades
SELECT s.name, g.course, g.grade
FROM students s
JOIN grades g ON s.student_id = g.student_id
A JOIN connects two tables using a shared column — here, student_id exists in both the students and grades tables. This is conceptually identical to pd.merge(), which you learned in Chapter 9.
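To make the parallel concrete, here is the same join done in pandas with two toy DataFrames (the data is invented for the example):

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': [1, 2],
    'name': ['Alice', 'Bob'],
})
grades = pd.DataFrame({
    'student_id': [1, 1, 2],
    'course': ['Statistics', 'Python', 'Statistics'],
    'grade': ['A', 'A-', 'B+'],
})

# Equivalent to: SELECT s.name, g.course, g.grade
#                FROM students s JOIN grades g
#                ON s.student_id = g.student_id
joined = students.merge(grades, on='student_id')
print(joined[['name', 'course', 'grade']])
```

Either way, the shared student_id column is what lines the rows up.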
Putting It Together: SQL with pandas
Here's a realistic workflow — querying a university database and analyzing the results in pandas:
import sqlite3
import pandas as pd
conn = sqlite3.connect('university.db')
query = """
SELECT s.name, s.major, g.course, g.grade
FROM students s
JOIN grades g ON s.student_id = g.student_id
WHERE s.major = 'Data Science'
ORDER BY s.name, g.course
"""
df = pd.read_sql(query, conn)
conn.close()
print(df.shape)
print(df.head())
Notice how the SQL query does the heavy lifting — selecting columns, joining tables, filtering, and sorting — and pandas receives a clean, ready-to-analyze DataFrame. This division of labor is the professional pattern: let the database do what it's good at (filtering and joining large datasets), then use pandas for what it's good at (cleaning, transforming, and visualizing).
The Connection String
The conn object is your connection to the database. For SQLite, it's simple:
conn = sqlite3.connect('path/to/database.db')
For other databases (PostgreSQL, MySQL, SQL Server), you use a connection string that specifies the database type, host, port, username, password, and database name. We won't cover those in detail here, but the pattern with pd.read_sql() remains the same — only the connection setup changes.
Debugging Walkthrough: "no such table"
You run pd.read_sql('SELECT * FROM students', conn) and get: OperationalError: no such table: students
Cause: The table name is wrong, or you're connected to the wrong database file.
Fix: List all tables in the database to see what's actually there:
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type='table'",
    conn
)
print(tables)
🔄 Check Your Understanding
- What does SQL stand for, and what is it used for?
- How is SQLite different from databases like PostgreSQL or MySQL?
- Write a SQL query that selects the name and GPA of all students with a GPA above 3.0, sorted from highest to lowest GPA.
- What pandas function is conceptually similar to a SQL JOIN?
12.5 Format Comparison: Choosing the Right Tool
Now that you know how to load data from four different formats, when should you use each one? Here's the comparison table you'll want to bookmark:
| Feature | CSV | Excel | JSON | Database (SQLite) |
|---|---|---|---|---|
| Structure | Flat table | Flat table (multiple sheets) | Hierarchical / nested | Relational tables |
| Data types preserved? | No (everything is text) | Yes | Yes | Yes |
| Human-readable? | Yes (text editor) | No (binary format) | Yes (text editor) | No (binary format) |
| Max practical size | Millions of rows | ~1 million rows | Varies (memory-limited) | Billions of rows |
| Multiple tables? | Separate files | Multiple sheets | Nested objects | Multiple tables + JOINs |
| Common source | Data exports, open data | Business reports, manual entry | Web APIs, modern apps | Enterprise systems |
| pandas function | read_csv() | read_excel() | read_json() / json_normalize() | read_sql() |
| Extra library needed? | No | openpyxl | No | sqlite3 (built-in) |
| Version control friendly? | Yes | No | Yes | No |
When to recommend each format:
- CSV when simplicity and compatibility matter, and the data is a single flat table.
- Excel when sharing with non-technical stakeholders or when the data naturally has multiple related sheets.
- JSON when the data is hierarchical or comes from a web API.
- Database when the data is large, frequently updated, accessed by multiple users, or involves relationships between multiple tables.
Real-World Pattern: In practice, you'll often chain formats. An analyst might query a database with SQL, export the results as CSV, load them into pandas for analysis, and save the final report as Excel for their manager. Each format plays a role in the data's journey.
12.6 Code Implementation: A Complete Data Loading Toolkit
Let's build a practical function that can load data from any of these formats based on the file extension:
import pandas as pd
import json
import sqlite3
def load_data(filepath, **kwargs):
    """Load data from CSV, Excel, JSON, or SQLite.

    Detects format from file extension and passes
    additional keyword arguments to the appropriate
    pandas reader.
    """
    ext = filepath.rsplit('.', 1)[-1].lower()
    if ext == 'csv':
        return pd.read_csv(filepath, **kwargs)
    elif ext in ('xlsx', 'xls'):
        return pd.read_excel(filepath, **kwargs)
    elif ext == 'json':
        return pd.read_json(filepath, **kwargs)
    elif ext == 'db':
        query = kwargs.pop('query', 'SELECT * FROM data')
        conn = sqlite3.connect(filepath)
        df = pd.read_sql(query, conn)
        conn.close()
        return df
    else:
        raise ValueError(f"Unsupported format: {ext}")
This function won't cover every edge case, but it captures the pattern: detect the format, call the right reader, pass through any additional parameters. It's a great starting point for a personal utility library.
12.7 Project Checkpoint: Loading Supplementary Data
It's time to expand our running vaccination project. In previous chapters, you've been working with the WHO vaccination coverage dataset. Now you're going to bring in two additional data sources:
- An Excel file (population_data.xlsx) containing country population figures, with data on separate sheets for different years
- A JSON file (country_metadata.json) containing country metadata — region, income group, and geographic coordinates
Step 1: Load the Population Data from Excel
# The population file has one sheet per year
pop_all = pd.read_excel(
    'population_data.xlsx',
    sheet_name=None  # Load all sheets
)
# Each sheet name is a year like "2020", "2021", etc.
pop_frames = []
for year, df in pop_all.items():
    df['year'] = int(year)
    pop_frames.append(df)
population = pd.concat(pop_frames, ignore_index=True)
print(f"Population data: {population.shape}")
print(population.head())
Step 2: Load the Country Metadata from JSON
import json
with open('country_metadata.json', 'r') as f:
    raw_meta = json.load(f)
# The JSON has nested geographic coordinates
metadata = pd.json_normalize(raw_meta)
print(f"Metadata: {metadata.shape}")
print(metadata.columns.tolist())
Step 3: Merge Everything Together
# Start with the vaccination data from earlier chapters
vacc = pd.read_csv('vaccination_coverage.csv')
# Merge population data
vacc = vacc.merge(
    population[['country_code', 'year', 'population']],
    on=['country_code', 'year'],
    how='left'
)
# Merge country metadata
vacc = vacc.merge(
    metadata[['country_code', 'income_group',
              'region_name']],
    on='country_code',
    how='left'
)
print(f"Enriched dataset: {vacc.shape}")
print(vacc.head())
This is a realistic data integration workflow. You loaded data from three different formats (CSV, Excel, JSON), each requiring different loading techniques, and merged them into a single analytical dataset. Elena does this kind of work every week in her public health role — pulling population figures from one source, vaccination data from another, and geographic metadata from a third.
What You've Built So Far (Project Progress):
- Chapters 7-8: Loaded and cleaned the vaccination CSV
- Chapters 9-10: Reshaped and transformed the data, cleaned text fields
- Chapter 11: Added date handling and time series features
- Chapter 12: Integrated population data (Excel) and country metadata (JSON)
- Next: Chapter 13 will add live API data
12.8 Spaced Review: Chapters 1-11
Learning research shows that revisiting earlier material at spaced intervals dramatically improves retention. Let's do a quick check on concepts from previous chapters. Try answering these without looking back.
From Chapter 3 (Python Fundamentals I): What is the difference between = and == in Python? When do you use each one?
Check your answer
`=` is the assignment operator — it gives a name to a value (`x = 5`). `==` is the comparison operator — it checks whether two values are equal and returns `True` or `False` (`x == 5`). Using `=` when you mean `==` (or vice versa) is one of the most common beginner bugs.
From Chapter 7 (Intro to pandas): What is the difference between a Series and a DataFrame?
Check your answer
A Series is a single column of data (one-dimensional) with an index. A DataFrame is a table (two-dimensional) made up of multiple Series that share the same index. You can think of a DataFrame as a dictionary of Series.
From Chapter 8 (Cleaning Messy Data): Name three types of data quality problems you might find when loading a new dataset.
Check your answer
Common data quality problems include: (1) missing values (NaN, blank cells), (2) inconsistent formatting (e.g., "USA" vs "United States" vs "US"), (3) wrong data types (numbers stored as strings), (4) duplicate rows, (5) outliers or impossible values (e.g., age = -5 or temperature = 999).
From Chapter 9 (Reshaping and Transforming): When would you use pd.melt() vs. pd.pivot_table()?
Check your answer
Use `melt()` to convert wide format to long format — when column headers contain data values (like year columns "2020", "2021", "2022") that should be in rows. Use `pivot_table()` to convert long format to wide format — when you want row values to become column headers, or when you need to aggregate data into a summary table.
From Chapter 11 (Dates and Times): What does dt.month give you, and what object do you use it on?
Check your answer
`dt.month` extracts the month (as an integer, 1-12) from a datetime Series. You use it on a pandas Series that has been converted to datetime type (via `pd.to_datetime()`). For example: `df['date'].dt.month` returns a Series of month numbers.
Chapter Summary
This chapter was about meeting data where it lives. You learned four essential loading techniques:
- CSV — the universal plain-text format. You mastered encoding (encoding='latin-1'), delimiters (sep=';'), header management (skiprows, header), selective loading (usecols), and type control (dtype).
- Excel — the business world's format. You learned sheet selection (sheet_name), handling messy layouts (skiprows, header), and loading all sheets into a dictionary (sheet_name=None).
- JSON — the web's format. You learned to load flat JSON with pd.read_json(), flatten nested structures with pd.json_normalize(), and use record_path and meta to handle arrays inside records.
- Databases/SQL — the enterprise format. You wrote your first SQL queries (SELECT, WHERE, JOIN, GROUP BY) and used pd.read_sql() to bring database data into pandas.
The threshold concept for this chapter is that data doesn't live in one format. The real skill isn't memorizing every parameter of read_csv — it's the confidence to say, "I don't know what format this file is in, but I can figure it out and get the data into a DataFrame." That confidence comes from knowing the patterns, having a debugging toolkit, and being willing to inspect a file in a text editor when things go wrong.
In Chapter 13, we'll take this further: instead of loading data from files on your computer, you'll pull data directly from the internet — from web APIs and web pages. The same mental model applies: figure out what format the data is in, extract what you need, and get it into a DataFrame.
Concept Inventory Update
New concepts introduced in this chapter:
| # | Concept | First Introduced | Section |
|---|---|---|---|
| 1 | File encoding (UTF-8, Latin-1) | This chapter | 12.1 |
| 2 | Delimiters and format variations | This chapter | 12.1 |
| 3 | JSON and nested data structures | This chapter | 12.3 |
| 4 | Relational databases and SQL | This chapter | 12.4 |
| 5 | Data integration across formats | This chapter | 12.7 |
Running total: 49 concepts, 134 terms, 32 techniques introduced across Chapters 1-12.
Next up: Chapter 13 — Getting Data from the Web. You'll learn to talk to APIs, scrape web pages, and build automated data collection pipelines. The internet becomes your dataset.