Key Takeaways: Getting Data from Files

This is your reference card for Chapter 12 — the chapter where you learned to pry data out of whatever container it arrives in. Keep this nearby when you receive a new dataset and need to figure out how to load it.


Format Comparison at a Glance

Feature                 CSV                   Excel                          JSON                             Database (SQLite)
Structure               Flat table            Flat table (multiple sheets)   Hierarchical / nested            Relational tables
Data types preserved?   No                    Yes                            Yes                              Yes
Human-readable?         Yes                   No                             Yes                              No
Max practical size      Millions of rows      ~1M rows per sheet             Memory-limited                   Billions of rows
pandas function         read_csv()            read_excel()                   read_json() / json_normalize()   read_sql()
Extra library?          No                    openpyxl                       No                               sqlite3 (built-in)
Best for                Pipelines, open data  Business reports               Web APIs                         Enterprise data

read_csv() Essential Parameters

pd.read_csv(filepath,
    encoding='utf-8',      # Try 'latin-1' for European files
    sep=',',               # Use ';' for European, '\t' for TSV
    header=0,              # Row number with headers (None if no header)
    names=[...],           # Custom column names
    skiprows=N,            # Skip N rows at the top
    usecols=[...],         # Load only specific columns
    nrows=N,               # Load only first N rows
    na_values=[...],       # Treat these strings as NaN
    dtype={'col': str},    # Force column data types
    parse_dates=['col'],   # Parse columns as dates
    index_col=0            # Use column 0 as index (fixes Unnamed: 0)
)

Common problems and quick fixes:

Symptom                                Cause                     Fix
Garbled characters (mojibake)          Wrong encoding            encoding='latin-1'
One column with semicolons in values   Wrong delimiter           sep=';'
First row is metadata, not headers     Extra rows above header   skiprows=N
Leading zeros dropped (ZIP codes)      Type inference            dtype={'zip': str}
Unnamed: 0 column appears              Index was saved to file   index_col=0
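These fixes usually show up together in real files. A sketch reading a hypothetical European-style export (the file content, column names, and values are invented for the demo) that applies several parameters at once:

```python
import io
import pandas as pd

# Hypothetical export: semicolon delimiter, Latin-1 bytes, two metadata
# lines above the header, codes with leading zeros, decimal commas.
raw = (
    "Export 2024\n"
    "Région: Île-de-France\n"
    "name;zip;score\n"
    "Aurélie;00501;3,5\n"
    "Benoît;07008;4,0\n"
).encode("latin-1")

df = pd.read_csv(
    io.BytesIO(raw),      # stands in for a file path
    encoding="latin-1",   # fixes mojibake
    sep=";",              # fixes the one-giant-column symptom
    skiprows=2,           # metadata rows above the real header
    dtype={"zip": str},   # keeps leading zeros
    decimal=",",          # European decimal comma
)
print(df)
```

Adding one parameter at a time, rerunning, and checking df.head() after each change is the fastest way to converge on the right combination.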

read_excel() Essential Parameters

pd.read_excel(filepath,
    sheet_name=0,          # Sheet name (str) or index (int) or None (all)
    header=0,              # Row with column headers
    skiprows=N,            # Skip rows at top
    usecols='A:D',         # Column range (Excel-style) or list of names
    engine='openpyxl'      # Library that reads .xlsx (install separately)
)

Loading all sheets and combining:

all_sheets = pd.read_excel('file.xlsx', sheet_name=None)
combined = pd.concat(all_sheets.values(), ignore_index=True)
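Note that concatenating all_sheets.values() discards the sheet names. If you need to know which sheet each row came from, pass the dict itself to pd.concat so its keys become an index level. A sketch with in-memory frames standing in for the dict that sheet_name=None returns (sheet and column names are invented):

```python
import pandas as pd

# Stand-in for pd.read_excel('file.xlsx', sheet_name=None): a dict of frames
all_sheets = {
    "Q1": pd.DataFrame({"region": ["N", "S"], "sales": [10, 20]}),
    "Q2": pd.DataFrame({"region": ["N", "S"], "sales": [15, 25]}),
}

# Passing the dict itself (not .values()) keeps the keys; names labels
# that level so reset_index turns it into an ordinary 'sheet' column
combined = pd.concat(all_sheets, names=["sheet"]).reset_index(level="sheet")
print(combined)
```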

JSON Loading Patterns

Scenario                        Approach
Flat JSON (array of objects)    pd.read_json('file.json')
Nested JSON                     pd.json_normalize(data, sep='_')
Arrays inside records           pd.json_normalize(data, record_path='items', meta=['id', 'name'])
Complex / unknown structure     Load with json.load(), explore, then normalize

The "load, explore, extract, normalize" pattern:

import json
import pandas as pd

with open('file.json', 'r') as f:
    raw = json.load(f)
print(type(raw))             # What is it? list or dict?
print(raw[0].keys())         # What fields exist?
df = pd.json_normalize(raw)  # Flatten it
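For the arrays-inside-records case, a sketch with invented order data shows how record_path flattens the inner arrays while meta pulls parent fields down onto each flattened row:

```python
import pandas as pd

# Hypothetical nested records: each order carries an array of line items
data = [
    {"id": 1, "customer": "Ana",  "items": [{"sku": "A1", "qty": 2},
                                            {"sku": "B2", "qty": 1}]},
    {"id": 2, "customer": "Luis", "items": [{"sku": "A1", "qty": 5}]},
]

# One output row per inner item; parent fields repeat on each row
df = pd.json_normalize(data, record_path="items", meta=["id", "customer"])
print(df)
```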

SQL Basics Reference

Core SQL keywords:

SELECT columns        -- Choose which columns
FROM table            -- Specify the table
WHERE condition       -- Filter rows
GROUP BY column       -- Group for aggregation
ORDER BY column       -- Sort results (ASC or DESC)
JOIN table2 ON key    -- Combine tables on shared column

Common query patterns:

-- Select specific columns with a filter
SELECT name, gpa FROM students WHERE gpa > 3.5

-- Count rows by group
SELECT major, COUNT(*) as n FROM students GROUP BY major

-- Average by group, sorted
SELECT dept, AVG(salary) as avg_sal
FROM employees
GROUP BY dept
ORDER BY avg_sal DESC

-- Join two tables
SELECT s.name, g.course, g.grade
FROM students s
JOIN grades g ON s.student_id = g.student_id

Using SQL with pandas:

import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table', conn)
conn.close()
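Putting the query patterns and the pandas hookup together: a self-contained sketch using an in-memory SQLite database (the table name, columns, and values are invented to mirror the examples above). The ? placeholder with params is how to pass values into a query safely, rather than formatting them into the SQL string:

```python
import sqlite3
import pandas as pd

# In-memory database standing in for a real .db file
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER, name TEXT, major TEXT, gpa REAL);
    INSERT INTO students VALUES
        (1, 'Ana',  'CS',   3.9),
        (2, 'Ben',  'Math', 3.2),
        (3, 'Cara', 'CS',   3.6);
""")

# Filter and sort in SQL before the data ever reaches pandas
df = pd.read_sql(
    "SELECT name, gpa FROM students WHERE gpa > ? ORDER BY gpa DESC",
    conn,
    params=(3.5,),
)
conn.close()
print(df)
```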

Decision Flowchart: Which Format Do I Need?

Is the data in a file on disk?
├── Yes
│   ├── Is it a .csv or .tsv?  →  pd.read_csv()
│   ├── Is it a .xlsx or .xls? →  pd.read_excel()
│   ├── Is it a .json?         →  pd.read_json() or json_normalize()
│   └── Is it a .db?           →  sqlite3 + pd.read_sql()
└── No
    └── Is it from the web?    →  Chapter 13 (APIs and scraping)

Key Principles

  1. Try the simple load first. Start with default parameters. If it works, you're done. If not, diagnose and add parameters one at a time.

  2. Open mystery files in a text editor. Before guessing at parameters, look at the raw content. The delimiter, header, and encoding issues usually become obvious.

  3. Document your loading decisions. Future-you (or your colleague) will need to reload this data. A comment like # Latin-1 encoding, semicolon delimiter, header on row 4 saves hours.

  4. Let SQL do the heavy lifting. When querying databases, filter and aggregate in SQL before bringing data into pandas. Don't load 10 million rows just to use 1,000.

  5. Expect format inconsistencies. When combining data from multiple sources, column names, date formats, encoding, and ID conventions will differ. Budget time for standardization.
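Principle 2 works even without a text editor: peeking at the raw bytes from Python reveals the delimiter, header position, and any non-UTF-8 characters before you guess at parameters. A small sketch (the file and its contents are created here just for the demo):

```python
from pathlib import Path

# Hypothetical mystery file, written only so the demo is self-contained
path = Path("mystery.csv")
path.write_bytes("Report 2024\nname;city\nJosé;Córdoba\n".encode("latin-1"))

# Print the first few raw lines as bytes: the semicolon delimiter and the
# non-UTF-8 \xe9 byte are visible immediately
raw = path.read_bytes()
for line in raw.splitlines()[:5]:
    print(line)

path.unlink()  # clean up the demo file
```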