Further Reading: Getting Data from Files

You've just learned to load data from four different formats, a skill you'll use in virtually every data science project. The resources below are grouped by source type, followed by next steps organized by what caught your interest.

Tier 1: Verified Sources

These are published books with full bibliographic details.

Wes McKinney, Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter (O'Reilly, 3rd edition, 2022). McKinney created pandas, and his book is the definitive reference for data loading. Chapter 6 covers reading and writing data in all the formats we discussed — CSV, Excel, JSON, SQL databases, and more (including HDF5 and Parquet, which we didn't cover). If you want to understand every parameter of read_csv(), this is where to look. The third edition is updated for modern pandas.

Alan Beaulieu, Learning SQL: Generate, Manipulate, and Retrieve Data (O'Reilly, 3rd edition, 2020). If the SQL section of this chapter sparked your interest, Beaulieu's book is the best beginner-friendly SQL resource available. It starts with simple SELECT statements and builds to complex joins, subqueries, and window functions. The examples use MySQL, but the SQL concepts transfer directly to SQLite, PostgreSQL, and other databases. Clear writing, good exercises, and a logical progression.

Anthony Molinaro and Robert de Graaf, SQL Cookbook: Query Solutions and Techniques for All SQL Users (O'Reilly, 2nd edition, 2021). More of a reference than a tutorial. If you already understand basic SQL and want to solve specific problems — "How do I find duplicate rows?" "How do I pivot data in SQL?" — this book has the patterns. Organized by task rather than by concept, so you can look up exactly what you need. Covers multiple database dialects.

Joel Grus, Data Science from Scratch: First Principles with Python (O'Reilly, 2nd edition, 2019). Chapter 9 covers getting data from files, web scraping, and APIs; a separate chapter on databases and SQL appears later in the book. Grus takes a hands-on, build-it-yourself approach that complements the pandas-centric methods we used. Good for understanding what's happening under the hood when you call read_csv().
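
To give a feel for that build-it-yourself style, here is a minimal sketch that parses CSV text with only Python's standard csv module. It is an illustration, not code from Grus's book; the sample data, the column names, and the load_rows helper are all made up for this example.

```python
import csv
import io

# Hypothetical sample data standing in for the contents of a small CSV file.
raw_text = "label,count\nalpha,10\nbeta,25\n"

def load_rows(text):
    """Parse CSV text into a list of dicts, converting 'count' to int."""
    reader = csv.DictReader(io.StringIO(text))  # first line becomes the keys
    rows = []
    for record in reader:
        # Every value arrives as a string; type conversion is our job.
        record["count"] = int(record["count"])
        rows.append(record)
    return rows

rows = load_rows(raw_text)
print(rows)
```

The manual int() conversion is the point: a loader like read_csv() infers column types for you, while the from-scratch version makes that step visible.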

Tier 2: Attributed Resources

These are well-known online resources and documentation. Because links change, we give enough detail to find each one by search instead of listing full URLs.

The pandas documentation — "IO Tools" section. The official pandas documentation has an extensive chapter on Input/Output that covers every file format pandas supports, including CSV, Excel, JSON, SQL, Parquet, HDF5, Feather, and more. Search for "pandas IO tools" to find it. The parameter tables for read_csv() and read_excel() are exhaustive — useful when you encounter an edge case the textbook didn't cover.

SQLite documentation (sqlite.org). SQLite's official website includes surprisingly readable documentation, including a complete SQL syntax reference. The "SQL As Understood By SQLite" page is a practical reference for all the SQL statements we covered and many we didn't. The SQLite project describes it as the most widely deployed database engine in the world (it ships inside phones, browsers, and countless desktop applications), so understanding it well is time well spent.

Joel Nothman, "Character encoding and pandas" (blog post, various versions). Several excellent blog posts explain the relationship between character encoding and data loading in Python and pandas. Searching for "pandas encoding utf-8 latin-1 guide" will surface practical tutorials. Understanding encoding saves hours of debugging when working with international data.
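
As a small illustration of why encoding matters, the sketch below encodes the same word two ways and shows what happens when the bytes are decoded with the wrong guess. It uses only built-in string methods; the decode_with_fallback helper is a hypothetical name for this example, not a pandas or standard-library function.

```python
# The same text saved under two different encodings produces different bytes.
text = "café"
utf8_bytes = text.encode("utf-8")      # 'é' becomes two bytes
latin1_bytes = text.encode("latin-1")  # 'é' becomes one byte

# Decoding UTF-8 bytes as Latin-1 does not fail; it silently garbles the text.
mojibake = utf8_bytes.decode("latin-1")

def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (text, encoding_used).

    Latin-1 accepts any byte sequence, so it should come last: it works
    as a catch-all and never raises UnicodeDecodeError.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")

print(mojibake)
print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(latin1_bytes))
```

The same idea applies when loading files with pandas: read_csv() accepts an encoding argument, and a garbled accent or a UnicodeDecodeError is usually a sign that the guess was wrong.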

Python documentation for the json module. Python's standard library documentation for the json module covers json.load(), json.loads(), json.dump(), and all formatting options. Search "python json module documentation" to find the official docs at docs.python.org.
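
A quick sketch of how those functions pair up: the versions ending in "s" work with strings, while the others work with file-like objects. The record below is made-up sample data, and an in-memory io.StringIO buffer stands in for a real file so the example runs anywhere.

```python
import io
import json

# Hypothetical sample record.
record = {"name": "Ada", "scores": [90, 85], "active": True}

# String round trip: dumps() serializes to a str, loads() parses it back.
text = json.dumps(record, indent=2, sort_keys=True)
parsed = json.loads(text)

# File round trip: dump() writes to a file-like object, load() reads from one.
buffer = io.StringIO()  # stands in for open("data.json", "w")
json.dump(record, buffer)
buffer.seek(0)          # rewind before reading
from_file = json.load(buffer)

print(text)
print(parsed == record, from_file == record)
```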

Recommended Next Steps

If you want more SQL practice: Download the "Chinook" sample database (a music store database commonly used for SQL learning — search for "Chinook SQLite database"). It has 11 related tables and is perfect for practicing SELECT, JOIN, and GROUP BY queries.
If you work with messy Excel files regularly: Explore the openpyxl library directly (not through pandas). It gives you cell-level control for reading and writing Excel files, including formatting, formulas, and charts. The official openpyxl documentation is well-organized.
If JSON and APIs intrigue you: Jump straight to Chapter 13, where you'll learn to pull JSON data directly from web APIs — turning the internet into your data source.
If you want to learn about data formats we didn't cover: Look into Parquet (a columnar format optimized for large datasets) and Feather (a fast binary format for DataFrames). Both are supported by pandas and are increasingly common in data engineering pipelines. McKinney's book covers both.
If the SQL section felt too brief: Consider working through an online SQL tutorial. "SQLBolt" (search for it by name) offers interactive browser-based SQL lessons with immediate feedback. "Mode Analytics SQL Tutorial" is another well-regarded free resource.
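
If you'd like to warm up before downloading a sample database, the sketch below builds a tiny in-memory SQLite database with Python's standard sqlite3 module and runs one query that combines SELECT, JOIN, and GROUP BY. The two tables and their rows are invented for this example; they are loosely inspired by a music store but are not the actual Chinook schema or data.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical practice tables (not the real Chinook schema).
conn.executescript("""
    CREATE TABLE artist (
        artist_id INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE album (
        album_id INTEGER PRIMARY KEY,
        title TEXT,
        artist_id INTEGER REFERENCES artist(artist_id)
    );
    INSERT INTO artist VALUES (1, 'Artist A'), (2, 'Artist B');
    INSERT INTO album VALUES
        (1, 'First', 1), (2, 'Second', 1), (3, 'Third', 2);
""")

# Count albums per artist: JOIN links the tables, GROUP BY collapses rows.
query = """
    SELECT artist.name, COUNT(album.album_id) AS album_count
    FROM artist
    JOIN album ON album.artist_id = artist.artist_id
    GROUP BY artist.name
    ORDER BY album_count DESC, artist.name
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)
```

Once a query like this feels comfortable, the same pattern scales up to a larger practice database with more tables to join.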