Chapter 11 Key Takeaways: Loading and Exploring Real Business Datasets
The Big Ideas
1. Loading data is not a trivial step.
The parameters you pass to pd.read_csv() and pd.read_excel() determine whether pandas correctly interprets separators, column headers, date values, missing value codes, and data types. Getting these wrong produces DataFrames that look fine on the surface but contain silent errors that corrupt your analysis.
2. Always explore before you analyze.
No matter how reliable your data source, never skip the systematic exploration phase. Missing values, inconsistent categories, wrong data types, and duplicate records are routine in real business data. Discovering them after an analysis is complete is far more costly than catching them up front.
3. .info() is your primary diagnostic tool.
A single call to df.info() shows column names, non-null counts per column, data types, and total memory usage. The non-null count column instantly reveals which columns have missing data. This is the first substantive thing to check after loading any new dataset.
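As a quick sketch of what this looks like in practice, the snippet below builds a small hypothetical DataFrame (the column names are invented for illustration) and calls `.info()`. Passing a buffer via the `buf` parameter also lets you capture the report as a string if you want to log it:

```python
import io

import pandas as pd

# Hypothetical toy data standing in for a freshly loaded dataset.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region": ["North", None, "South", "East"],  # one missing value
    "revenue": [120.5, 87.0, None, 220.0],       # one missing value
})

df.info()  # prints column names, non-null counts, dtypes, and memory usage

# The same report can be captured programmatically:
buf = io.StringIO()
df.info(buf=buf)
report = buf.getvalue()
```

Here the non-null counts (3 of 4 for `region` and `revenue`) immediately flag the columns with missing data.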
Core Functions and Parameters
Loading CSV Files
```python
df = pd.read_csv(
    "filename.csv",
    sep=",",                  # Separator (comma is the default)
    header=0,                 # Row containing column names (0 = first row)
    index_col=None,           # Column to use as the row index
    usecols=["a", "b", "c"],  # Load only these columns
    dtype={"id": str},        # Override type inference for specific columns
    parse_dates=["date"],     # Parse these columns as datetime
    encoding="utf-8",         # File text encoding
    na_values=["N/A", "--"],  # Extra strings to treat as missing
    nrows=1000,               # Load at most N rows
    skiprows=3,               # Skip N rows at the top before reading
)
```
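To make the effect of these parameters concrete, here is a minimal sketch that reads an inline CSV (a hypothetical stand-in for a real file, fed through `io.StringIO`) with `dtype`, `parse_dates`, and `na_values` set:

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for filename.csv.
raw = """id,date,region,units
001,2023-01-05,North,10
002,2023-01-06,--,N/A
003,2023-01-07,South,7
"""

df = pd.read_csv(
    io.StringIO(raw),
    dtype={"id": str},        # keep leading zeros in id instead of inferring int
    parse_dates=["date"],     # datetime64[ns] instead of object
    na_values=["N/A", "--"],  # treat these strings as missing
)
```

Without `dtype={"id": str}`, the IDs would load as integers and the leading zeros would vanish; without `na_values`, the `--` and `N/A` markers would load as ordinary strings and silently turn numeric columns into `object`.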
Loading Excel Files
```python
df = pd.read_excel(
    "filename.xlsx",
    sheet_name="Sheet1",  # Sheet name or index (0 = first sheet)
    header=0,             # Row containing column names
    usecols="A:F",        # Excel column range OR list of column names
    dtype={"id": str},    # Same as read_csv
)

# Load all sheets at once
all_sheets = pd.read_excel("file.xlsx", sheet_name=None)
# Returns: {"Sheet1": df1, "Sheet2": df2, ...}
```
Reading from a URL
```python
df = pd.read_csv("https://example.com/data.csv")
# All read_csv parameters work with URLs as well
```
The Systematic Inspection Workflow
Follow these 8 steps every time you receive a new dataset:
| Step | Command | What It Tells You |
|---|---|---|
| 1. Dimensions | `df.shape` | Row and column count |
| 2. Visual check | `df.head()`, `df.tail()` | Data format, presence of extra rows |
| 3. Types | `df.dtypes` | Whether columns are the expected type |
| 4. Full summary | `df.info()` | Missing values, types, and memory, all at once |
| 5. Statistics | `df.describe()` | Ranges, averages, outliers in numeric columns |
| 6. Unique values | `df.nunique()`, `df["col"].value_counts(dropna=False)` | Category cardinality, consistency issues |
| 7. Missing data | `df.isna().sum()` | Count of missing values per column |
| 8. Spot check | `df[df.isna().any(axis=1)]` | Which specific rows have missing values |
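The whole workflow fits in a few lines. The sketch below runs the steps on a small hypothetical DataFrame (the `region`/`revenue` columns are invented for illustration; in practice `df` comes from `read_csv` or `read_excel`):

```python
import pandas as pd

# Hypothetical toy dataset standing in for a freshly loaded file.
df = pd.DataFrame({
    "region": ["North", "South", None, "North"],
    "revenue": [100.0, 250.0, 75.0, 100.0],
})

n_rows, n_cols = df.shape                      # 1. dimensions
preview = df.head()                            # 2. visual check
types = df.dtypes                              # 3. types
df.info()                                      # 4. full summary (prints to stdout)
stats = df.describe()                          # 5. statistics (numeric columns)
cardinality = df.nunique()                     # 6. unique values per column
missing_per_col = df.isna().sum()              # 7. missing data
rows_with_missing = df[df.isna().any(axis=1)]  # 8. spot check the affected rows
```

Each result is cheap to compute, so there is no reason to skip any step even on large datasets.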
Understanding Data Types
| pandas dtype | Meaning | Watch For |
|---|---|---|
| `int64` | Integer | Becomes `float64` if any value is NaN |
| `float64` | Decimal number | May include NaN; check min/max for plausibility |
| `object` | String / mixed | May indicate a column that should be numeric (e.g., "$1,234") |
| `datetime64[ns]` | Date/time | Requires `parse_dates` at load time or explicit conversion |
| `bool` | True/False | Rare in CSV data; check whether 0/1 columns should be `bool` |
| `category` | Low-cardinality string | Memory-efficient; use for columns with few unique values |
The most common type inference failure: a column that should be numeric ends up as `object` because some values contain non-numeric characters (`$`, `,`, `%`, letters, or whitespace).
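One common way to repair such a column, sketched on a hypothetical `price` column, is to strip the offending characters with `.str.replace()` and then convert with `pd.to_numeric()`:

```python
import pandas as pd

# Hypothetical column that pandas inferred as object because of $ and , characters.
df = pd.DataFrame({"price": ["$1,234", "$567", "$89.50"]})

cleaned = (
    df["price"]
    .str.replace(r"[$,]", "", regex=True)  # strip currency symbol and thousands separator
    .str.strip()                           # remove stray whitespace
)
df["price"] = pd.to_numeric(cleaned)       # now a numeric dtype
```

Using `pd.to_numeric(cleaned, errors="coerce")` instead would turn any remaining unparseable values into NaN rather than raising, which is often the safer choice during exploration.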
Missing Values: Quick Reference
```python
# Count missing values per column
df.isna().sum()

# Show missing values as percentages
(df.isna().sum() / len(df) * 100).round(2)

# Show rows where any column has a missing value
df[df.isna().any(axis=1)]

# Show rows where a specific column has a missing value
df[df["region"].isna()]

# Always use dropna=False in value_counts to see NaN counts
df["region"].value_counts(dropna=False)
```
Memory Efficiency
- Numeric columns (`int64`, `float64`) use 8 bytes per value, which is extremely efficient.
- String columns (`object`) use roughly 50–200 bytes per value, far more memory.
- Convert low-cardinality string columns to the `category` dtype to save 70–90% of memory:

```python
df["region"] = df["region"].astype("category")
```

Use `df.memory_usage(deep=True)` to measure actual memory before and after optimization.
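A minimal before-and-after sketch, using a hypothetical `region` column repeated many times to mimic a low-cardinality business field:

```python
import pandas as pd

# Hypothetical low-cardinality column: 4 distinct values over 20,000 rows.
df = pd.DataFrame({"region": ["North", "South", "East", "West"] * 5_000})

before = df.memory_usage(deep=True)["region"]  # bytes as object dtype
df["region"] = df["region"].astype("category")
after = df.memory_usage(deep=True)["region"]   # bytes as category dtype

savings = 1 - after / before  # fraction of memory saved
```

With only four categories, each value is stored as a small integer code plus one shared copy of each label, which is where the large savings come from.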
The Acme Corp Sales Dataset (introduced this chapter)
File: acme_sales_2023.csv
Columns: date, region, salesperson, product_sku, category, units_sold, unit_price, discount_rate, total_revenue
Data quality issues discovered through exploration:
- 13 missing region values (0.52% of rows)
- 5 missing category values (0.20% of rows)
- Inconsistent category capitalization: "Office Chairs", "office chairs", "OFFICE CHAIRS"
- 8 exact duplicate rows
All of these issues are addressed in Chapter 12.
Common Mistakes to Avoid
Forgetting dropna=False in .value_counts()
You will miss NaN counts and undercount records in grouped analyses.
Not using parse_dates
Date columns loaded as strings cannot be filtered, grouped, or compared using time-based logic.
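The difference is easy to demonstrate. In the sketch below (hypothetical inline data via `io.StringIO`), the same CSV is loaded twice, and only the `parse_dates` version supports date accessors like `.dt.quarter`:

```python
import io

import pandas as pd

# Hypothetical inline CSV with a date column.
raw = "date,amount\n2023-01-15,100\n2023-06-30,200\n2023-12-01,300\n"

as_strings = pd.read_csv(io.StringIO(raw))                      # date stays object
as_dates = pd.read_csv(io.StringIO(raw), parse_dates=["date"])  # date is datetime64

# Time-based logic only works on the datetime version:
q1 = as_dates[as_dates["date"].dt.quarter == 1]  # rows from the first quarter
```

Calling `.dt.quarter` on the string-typed column would raise an `AttributeError`, which is often how this mistake first surfaces.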
Skipping .info() and jumping straight to analysis
You will not know about missing values or type issues until your analysis produces wrong results — or crashes.
Assuming the row count is correct
Always check .shape against the expected count. CRM exports, system reports, and scheduled file drops can fail silently and produce incomplete files.
Using "int" instead of "Int64" for nullable integers
If an integer column might have NaN values, use pandas' nullable integer type pd.Int64Dtype() or "Int64" (capital I) to avoid silent conversion to float64.
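A short sketch of the difference, using toy values:

```python
import numpy as np
import pandas as pd

# With the default integer type, a single NaN silently promotes
# the whole column to float64:
floaty = pd.Series([1, 2, np.nan])

# The nullable integer dtype keeps integer values and represents
# the missing entry as pandas' <NA>:
nullable = pd.Series([1, 2, None], dtype="Int64")
```

The same `dtype="Int64"` string works in the `dtype` parameter of `read_csv` and `read_excel`, so nullable integers can be requested at load time.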
One-Liner Summary
Load with intent, inspect systematically, document what you find, and fix nothing until you understand everything. The 20 minutes you spend on exploration saves hours of debugging incorrect analyses.