Key Takeaways: Getting Data from Files
This is your reference card for Chapter 12 — the chapter where you learned to pry data out of whatever container it arrives in. Keep this nearby when you receive a new dataset and need to figure out how to load it.
Format Comparison at a Glance
| Feature | CSV | Excel | JSON | Database (SQLite) |
|---|---|---|---|---|
| Structure | Flat table | Flat table (multiple sheets) | Hierarchical / nested | Relational tables |
| Data types preserved? | No | Yes | Yes | Yes |
| Human-readable? | Yes | No | Yes | No |
| Max practical size | Millions of rows | ~1M rows | Memory-limited | Billions of rows |
| pandas function | read_csv() | read_excel() | read_json() / json_normalize() | read_sql() |
| Extra library? | No | openpyxl | No | sqlite3 (built-in) |
| Best for | Pipelines, open data | Business reports | Web APIs | Enterprise data |
read_csv() Essential Parameters
pd.read_csv(filepath,
encoding='utf-8', # Try 'latin-1' for European files
sep=',', # Use ';' for European, '\t' for TSV
header=0, # Row number with headers (None if no header)
names=[...], # Custom column names
skiprows=N, # Skip N rows at the top
usecols=[...], # Load only specific columns
nrows=N, # Load only first N rows
na_values=[...], # Treat these strings as NaN
dtype={'col': str}, # Force column data types
parse_dates=['col'], # Parse columns as dates
index_col=0 # Use column 0 as index (fixes Unnamed: 0)
)
Common problems and quick fixes:
| Symptom | Cause | Fix |
|---|---|---|
| Garbled characters (mojibake) | Wrong encoding | encoding='latin-1' |
| One column with semicolons in values | Wrong delimiter | sep=';' |
| First row is metadata, not headers | Title rows above the header | skiprows=N |
| Leading zeros dropped (ZIP codes) | Type inference | dtype={'zip': str} |
| Unnamed: 0 column appears | Index was saved | index_col=0 |
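Several of these fixes often apply to the same file. A minimal sketch of combining them in one call (the sample data and column names are invented; StringIO stands in for a real file path, where you would also pass encoding='latin-1' if needed):

```python
import pandas as pd
from io import StringIO

# Hypothetical messy export: a metadata line, semicolon delimiters,
# and ZIP codes with leading zeros
raw = (
    "Exported 2024-01-15\n"
    "zip;city;population\n"
    "01234;Springfield;30500\n"
    "02139;Cambridge;118400\n"
)

df = pd.read_csv(
    StringIO(raw),       # stands in for a real file path
    sep=';',             # semicolon delimiter
    skiprows=1,          # skip the metadata line above the header
    dtype={'zip': str},  # keep leading zeros
)
print(df['zip'].tolist())  # ['01234', '02139']
```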
read_excel() Essential Parameters
pd.read_excel(filepath,
sheet_name=0, # Sheet name (str) or index (int) or None (all)
header=0, # Row with column headers
skiprows=N, # Skip rows at top
usecols='A:D', # Column range (Excel-style) or list of names
engine='openpyxl' # Required library for .xlsx
)
Loading all sheets and combining:
all_sheets = pd.read_excel('file.xlsx', sheet_name=None)
combined = pd.concat(all_sheets.values(), ignore_index=True)
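If you need to know which sheet each row came from, tag the rows before concatenating. A sketch of the pattern (the dict below stands in for the result of read_excel(sheet_name=None); the sheet names and data are invented):

```python
import pandas as pd

# Stand-in for pd.read_excel('file.xlsx', sheet_name=None),
# which returns a dict of {sheet_name: DataFrame}
all_sheets = {
    'Q1': pd.DataFrame({'region': ['N'], 'sales': [100]}),
    'Q2': pd.DataFrame({'region': ['S'], 'sales': [200]}),
}

# Tag each row with its source sheet, then concatenate
combined = pd.concat(
    (df.assign(sheet=name) for name, df in all_sheets.items()),
    ignore_index=True,
)
print(combined['sheet'].tolist())  # ['Q1', 'Q2']
```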
JSON Loading Patterns
| Scenario | Approach |
|---|---|
| Flat JSON (array of objects) | pd.read_json('file.json') |
| Nested JSON | pd.json_normalize(data, sep='_') |
| Arrays inside records | pd.json_normalize(data, record_path='items', meta=['id', 'name']) |
| Complex / unknown structure | Load with json.load(), explore, then normalize |
The "load, explore, extract, normalize" pattern:
import json
with open('file.json', 'r') as f:
raw = json.load(f)
print(type(raw)) # What is it?
print(raw[0].keys()) # What fields exist?
df = pd.json_normalize(raw) # Flatten it
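For the "arrays inside records" case, record_path explodes the inner list into rows while meta carries the parent fields along. A small sketch with invented student data:

```python
import pandas as pd

# Hypothetical nested records: each student carries a list of 'grades'
raw = [
    {'id': 1, 'name': 'Ana', 'grades': [{'course': 'Math', 'score': 91},
                                        {'course': 'Bio',  'score': 85}]},
    {'id': 2, 'name': 'Ben', 'grades': [{'course': 'Math', 'score': 78}]},
]

# One output row per inner grade record; 'id' and 'name' are repeated
df = pd.json_normalize(raw, record_path='grades', meta=['id', 'name'])
print(df.columns.tolist())  # ['course', 'score', 'id', 'name']
print(len(df))              # 3 rows: one per grade record
```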
SQL Basics Reference
Core SQL keywords:
SELECT columns -- Choose which columns
FROM table -- Specify the table
WHERE condition -- Filter rows
GROUP BY column -- Group for aggregation
ORDER BY column -- Sort results (ASC or DESC)
JOIN table2 ON key -- Combine tables on shared column
Common query patterns:
-- Select specific columns with a filter
SELECT name, gpa FROM students WHERE gpa > 3.5
-- Count rows by group
SELECT major, COUNT(*) as n FROM students GROUP BY major
-- Average by group, sorted
SELECT dept, AVG(salary) as avg_sal
FROM employees
GROUP BY dept
ORDER BY avg_sal DESC
-- Join two tables
SELECT s.name, g.course, g.grade
FROM students s
JOIN grades g ON s.student_id = g.student_id
Using SQL with pandas:
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table', conn)
conn.close()
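When the filter value comes from a variable, pass it through the params argument rather than formatting it into the SQL string (this avoids quoting bugs and SQL injection). A runnable sketch against an in-memory database with invented sample data:

```python
import sqlite3
import pandas as pd

# In-memory database with invented sample rows
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE students (name TEXT, gpa REAL)')
conn.executemany('INSERT INTO students VALUES (?, ?)',
                 [('Ana', 3.9), ('Ben', 3.2), ('Cal', 3.7)])

# The ? placeholder is filled from params, not string formatting
df = pd.read_sql('SELECT name, gpa FROM students WHERE gpa > ?',
                 conn, params=(3.5,))
conn.close()
print(df['name'].tolist())  # ['Ana', 'Cal']
```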
Decision Flowchart: Which Format Do I Need?
Is the data in a file on disk?
├── Yes
│ ├── Is it a .csv or .tsv? → pd.read_csv()
│ ├── Is it a .xlsx or .xls? → pd.read_excel()
│ ├── Is it a .json? → pd.read_json() or json_normalize()
│ └── Is it a .db? → sqlite3 + pd.read_sql()
└── No
└── Is it from the web? → Chapter 13 (APIs and scraping)
Key Principles
- Try the simple load first. Start with default parameters. If it works, you're done. If not, diagnose and add parameters one at a time.
- Open mystery files in a text editor. Before guessing at parameters, look at the raw content. Delimiter, header, and encoding issues usually become obvious.
- Document your loading decisions. Future-you (or your colleague) will need to reload this data. A comment like # latin-1 encoding, semicolon delimiter, header on row 4 saves hours.
- Let SQL do the heavy lifting. When querying databases, filter and aggregate in SQL before bringing data into pandas. Don't load 10 million rows just to use 1,000.
- Expect format inconsistencies. When combining data from multiple sources, column names, date formats, encodings, and ID conventions will differ. Budget time for standardization.