Key Takeaways: Getting Data from Files

This is your reference card for Chapter 12 — the chapter where you learned to pry data out of whatever container it arrives in. Keep this nearby when you receive a new dataset and need to figure out how to load it.


Format Comparison at a Glance

Feature                 CSV                   Excel                          JSON                             Database (SQLite)
Structure               Flat table            Flat table (multiple sheets)   Hierarchical / nested            Relational tables
Data types preserved?   No                    Yes                            Yes                              Yes
Human-readable?         Yes                   No                             Yes                              No
Max practical size      Millions of rows      ~1M rows per sheet             Memory-limited                   Billions of rows
pandas function         read_csv()            read_excel()                   read_json() / json_normalize()   read_sql()
Extra library?          No                    openpyxl                       No                               sqlite3 (built-in)
Best for                Pipelines, open data  Business reports               Web APIs                         Enterprise data

read_csv() Essential Parameters

pd.read_csv(filepath,
    encoding='utf-8',      # Try 'latin-1' for European files
    sep=',',               # Use ';' for European, '\t' for TSV
    header=0,              # Row number with headers (None if no header)
    names=[...],           # Custom column names
    skiprows=N,            # Skip N rows at the top
    usecols=[...],         # Load only specific columns
    nrows=N,               # Load only first N rows
    na_values=[...],       # Treat these strings as NaN
    dtype={'col': str},    # Force column data types
    parse_dates=['col'],   # Parse columns as dates
    index_col=0            # Use column 0 as index (fixes Unnamed: 0)
)

Common problems and quick fixes:

Symptom                                Cause                     Fix
Garbled characters (mojibake)          Wrong encoding            encoding='latin-1'
One column with semicolons in values   Wrong delimiter           sep=';'
First row is metadata, not headers     Extra rows above header   skiprows=N
Leading zeros dropped (ZIP codes)      Type inference            dtype={'zip': str}
Unnamed: 0 column appears              Index was saved to file   index_col=0
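These fixes usually show up together in real files. A sketch reading a hypothetical European-style export (the file content, column names, and values are invented for the demo) that applies several parameters at once:

```python
import io
import pandas as pd

# Hypothetical export: semicolon delimiter, Latin-1 bytes, two metadata
# lines above the header, codes with leading zeros, decimal commas.
raw = (
    "Export 2024\n"
    "Région: Île-de-France\n"
    "name;zip;score\n"
    "Aurélie;00501;3,5\n"
    "Benoît;07008;4,0\n"
).encode("latin-1")

df = pd.read_csv(
    io.BytesIO(raw),      # stands in for a file path
    encoding="latin-1",   # fixes mojibake
    sep=";",              # fixes the one-giant-column symptom
    skiprows=2,           # metadata rows above the real header
    dtype={"zip": str},   # keeps leading zeros
    decimal=",",          # European decimal comma
)
print(df)
```

Adding one parameter at a time, rerunning, and checking df.head() after each change is the fastest way to converge on the right combination.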

read_excel() Essential Parameters

pd.read_excel(filepath,
    sheet_name=0,          # Sheet name (str) or index (int) or None (all)
    header=0,              # Row with column headers
    skiprows=N,            # Skip rows at top
    usecols='A:D',         # Column range (Excel-style) or list of names
    engine='openpyxl'      # Library that reads .xlsx (install separately)
)

Loading all sheets and combining:

all_sheets = pd.read_excel('file.xlsx', sheet_name=None)
combined = pd.concat(all_sheets.values(), ignore_index=True)
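Note that concatenating all_sheets.values() discards the sheet names. If you need to know which sheet each row came from, pass the dict itself to pd.concat so its keys become an index level. A sketch with in-memory frames standing in for the dict that sheet_name=None returns (sheet and column names are invented):

```python
import pandas as pd

# Stand-in for pd.read_excel('file.xlsx', sheet_name=None): a dict of frames
all_sheets = {
    "Q1": pd.DataFrame({"region": ["N", "S"], "sales": [10, 20]}),
    "Q2": pd.DataFrame({"region": ["N", "S"], "sales": [15, 25]}),
}

# Passing the dict itself (not .values()) keeps the keys; names labels
# that level so reset_index turns it into an ordinary 'sheet' column
combined = pd.concat(all_sheets, names=["sheet"]).reset_index(level="sheet")
print(combined)
```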

JSON Loading Patterns

Scenario                        Approach
Flat JSON (array of objects)    pd.read_json('file.json')
Nested JSON                     pd.json_normalize(data, sep='_')
Arrays inside records           pd.json_normalize(data, record_path='items', meta=['id', 'name'])
Complex / unknown structure     Load with json.load(), explore, then normalize

The "load, explore, extract, normalize" pattern:

import json
import pandas as pd

with open('file.json', 'r') as f:
    raw = json.load(f)
print(type(raw))             # What is it? list or dict?
print(raw[0].keys())         # What fields exist?
df = pd.json_normalize(raw)  # Flatten it
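For the arrays-inside-records case, a sketch with invented order data shows how record_path flattens the inner arrays while meta pulls parent fields down onto each flattened row:

```python
import pandas as pd

# Hypothetical nested records: each order carries an array of line items
data = [
    {"id": 1, "customer": "Ana",  "items": [{"sku": "A1", "qty": 2},
                                            {"sku": "B2", "qty": 1}]},
    {"id": 2, "customer": "Luis", "items": [{"sku": "A1", "qty": 5}]},
]

# One output row per inner item; parent fields repeat on each row
df = pd.json_normalize(data, record_path="items", meta=["id", "customer"])
print(df)
```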

SQL Basics Reference

Core SQL keywords:

SELECT columns        -- Choose which columns
FROM table            -- Specify the table
WHERE condition       -- Filter rows
GROUP BY column       -- Group for aggregation
ORDER BY column       -- Sort results (ASC or DESC)
JOIN table2 ON key    -- Combine tables on shared column

Common query patterns:

-- Select specific columns with a filter
SELECT name, gpa FROM students WHERE gpa > 3.5

-- Count rows by group
SELECT major, COUNT(*) as n FROM students GROUP BY major

-- Average by group, sorted
SELECT dept, AVG(salary) as avg_sal
FROM employees
GROUP BY dept
ORDER BY avg_sal DESC

-- Join two tables
SELECT s.name, g.course, g.grade
FROM students s
JOIN grades g ON s.student_id = g.student_id

Using SQL with pandas:

import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table', conn)
conn.close()
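Putting the query patterns and the pandas hookup together: a self-contained sketch using an in-memory SQLite database (the table name, columns, and values are invented to mirror the examples above). The ? placeholder with params is how to pass values into a query safely, rather than formatting them into the SQL string:

```python
import sqlite3
import pandas as pd

# In-memory database standing in for a real .db file
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER, name TEXT, major TEXT, gpa REAL);
    INSERT INTO students VALUES
        (1, 'Ana',  'CS',   3.9),
        (2, 'Ben',  'Math', 3.2),
        (3, 'Cara', 'CS',   3.6);
""")

# Filter and sort in SQL before the data ever reaches pandas
df = pd.read_sql(
    "SELECT name, gpa FROM students WHERE gpa > ? ORDER BY gpa DESC",
    conn,
    params=(3.5,),
)
conn.close()
print(df)
```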

Decision Flowchart: Which Format Do I Need?

Is the data in a file on disk?
├── Yes
│   ├── Is it a .csv or .tsv?  →  pd.read_csv()
│   ├── Is it a .xlsx or .xls? →  pd.read_excel()
│   ├── Is it a .json?         →  pd.read_json() or json_normalize()
│   └── Is it a .db?           →  sqlite3 + pd.read_sql()
└── No
    └── Is it from the web?    →  Chapter 13 (APIs and scraping)

Key Principles

  1. Try the simple load first. Start with default parameters. If it works, you're done. If not, diagnose and add parameters one at a time.

  2. Open mystery files in a text editor. Before guessing at parameters, look at the raw content. The delimiter, header, and encoding issues usually become obvious.

  3. Document your loading decisions. Future-you (or your colleague) will need to reload this data. A comment like # Latin-1 encoding, semicolon delimiter, header on row 4 saves hours.

  4. Let SQL do the heavy lifting. When querying databases, filter and aggregate in SQL before bringing data into pandas. Don't load 10 million rows just to use 1,000.

  5. Expect format inconsistencies. When combining data from multiple sources, column names, date formats, encoding, and ID conventions will differ. Budget time for standardization.
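Principle 2 works even without a text editor: peeking at the raw bytes from Python reveals the delimiter, header position, and any non-UTF-8 characters before you guess at parameters. A small sketch (the file and its contents are created here just for the demo):

```python
from pathlib import Path

# Hypothetical mystery file, written only so the demo is self-contained
path = Path("mystery.csv")
path.write_bytes("Report 2024\nname;city\nJosé;Córdoba\n".encode("latin-1"))

# Print the first few raw lines as bytes: the semicolon delimiter and the
# non-UTF-8 \xe9 byte are visible immediately
raw = path.read_bytes()
for line in raw.splitlines()[:5]:
    print(line)

path.unlink()  # clean up the demo file
```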