Chapter 11 Key Takeaways: Loading and Exploring Real Business Datasets


The Big Ideas

1. Loading data is not a trivial step. The parameters you pass to pd.read_csv() and pd.read_excel() determine whether pandas correctly interprets separators, column headers, date values, missing value codes, and data types. Getting these wrong produces DataFrames that look fine on the surface but contain silent errors that corrupt your analysis.

2. Always explore before you analyze. No matter how reliable your data source, never skip the systematic exploration phase. Missing values, inconsistent categories, wrong data types, and duplicate records are routine in real business data. Discovering them after an analysis is complete is far more costly than catching them up front.

3. .info() is your primary diagnostic tool. A single call to df.info() shows column names, non-null counts per column, data types, and total memory usage. The non-null count column instantly reveals which columns have missing data. This is the first substantive thing to check after loading any new dataset.
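A minimal sketch of what this first check looks like, using a small made-up DataFrame with one missing value:

```python
import io
import pandas as pd

# Tiny made-up dataset with one missing region, standing in for a real file
csv = io.StringIO("id,region,revenue\n1,North,100.0\n2,,250.5\n3,South,80.0")
df = pd.read_csv(csv)

# The "Non-Null Count" column reveals the gap immediately:
# region shows 2 non-null entries against 3 total rows
df.info()
```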


Core Functions and Parameters

Loading CSV Files

df = pd.read_csv(
    "filename.csv",
    sep=",",                  # Separator (comma is default)
    header=0,                 # Row containing column names (0 = first row)
    index_col=None,           # Column to use as the row index
    usecols=["a", "b", "c"],  # Load only these columns
    dtype={"id": str},        # Override type inference for specific columns
    parse_dates=["date"],     # Parse these columns as datetime
    encoding="utf-8",         # File text encoding
    na_values=["N/A", "--"],  # Extra strings to treat as missing
    nrows=1000,               # Load at most N rows
    skiprows=3                # Skip N rows at the top before reading
)

Loading Excel Files

df = pd.read_excel(
    "filename.xlsx",
    sheet_name="Sheet1",      # Sheet name or index (0 = first sheet)
    header=0,                 # Row containing column names
    usecols="A:F",            # Excel column range OR list of column names
    dtype={"id": str}         # Same as read_csv
)

# Load all sheets at once
all_sheets = pd.read_excel("file.xlsx", sheet_name=None)
# Returns: {"Sheet1": df1, "Sheet2": df2, ...}

Reading from a URL

df = pd.read_csv("https://example.com/data.csv")
# All read_csv parameters work with URLs

The Systematic Inspection Workflow

Follow these 8 steps every time you receive a new dataset:

Step              Command                                             What It Tells You
1. Dimensions     df.shape                                            Row and column count
2. Visual check   df.head(), df.tail()                                Data format, presence of extra rows
3. Types          df.dtypes                                           Whether columns are the expected type
4. Full summary   df.info()                                           Missing values, types, memory — all at once
5. Statistics     df.describe()                                       Ranges, averages, outliers in numeric columns
6. Unique values  df.nunique(), df["col"].value_counts(dropna=False)  Category cardinality, consistency issues
7. Missing data   df.isna().sum()                                     Count of missing values per column
8. Spot check     df[df.isna().any(axis=1)]                           Which specific rows have missing values
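Assuming a freshly loaded DataFrame named df, the eight steps above can be run as one short script (the tiny inline dataset here is made up for illustration):

```python
import io
import pandas as pd

# Made-up two-row dataset standing in for a freshly loaded file
df = pd.read_csv(io.StringIO("date,region,units\n2023-01-05,North,10\n2023-01-06,,7"))

print(df.shape)                                  # 1. dimensions
print(df.head()); print(df.tail())               # 2. visual check
print(df.dtypes)                                 # 3. types
df.info()                                        # 4. full summary
print(df.describe())                             # 5. statistics
print(df.nunique())                              # 6. cardinality per column
print(df["region"].value_counts(dropna=False))   # 6. category consistency
print(df.isna().sum())                           # 7. missing values per column
print(df[df.isna().any(axis=1)])                 # 8. which rows are affected
```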

Understanding Data Types

pandas dtype    Meaning                 Watch For
int64           Integer                 Becomes float64 if any value is NaN
float64         Decimal number          May include NaN; check min/max for plausibility
object          String / mixed          May indicate a column that should be numeric (e.g., "$1,234")
datetime64[ns]  Date/time               Requires parse_dates at load time or explicit conversion
bool            True/False              Rare in CSV data; check if 0/1 columns should be bool
category        Low-cardinality string  Memory-efficient; use for columns with few unique values

The most common type inference failure: A column that should be numeric ends up as object because some values contain non-numeric characters (currency symbols, thousands-separator commas, percent signs, letters, or whitespace).
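A common repair for this failure mode, sketched here with a made-up revenue column, is to strip the offending characters and then convert with pd.to_numeric:

```python
import pandas as pd

# Made-up example: revenue arrives as object because of "$" and ","
df = pd.DataFrame({"revenue": ["$1,234", "$567", "$2,000"]})
assert df["revenue"].dtype == object

# Strip currency symbols and thousands separators, then convert to numeric
df["revenue"] = pd.to_numeric(df["revenue"].str.replace(r"[$,]", "", regex=True))
```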


Missing Values: Quick Reference

# Count missing values per column
df.isna().sum()

# Show missing values as percentages
(df.isna().sum() / len(df) * 100).round(2)

# Show rows where any column has a missing value
df[df.isna().any(axis=1)]

# Show rows where a specific column has a missing value
df[df["region"].isna()]

# Always use dropna=False in value_counts to see NaN counts
df["region"].value_counts(dropna=False)

Memory Efficiency

  • Numeric columns (int64, float64) use 8 bytes per value — extremely efficient.
  • String columns (object) use roughly 50–200 bytes per value — much more memory.
  • Convert low-cardinality string columns to category dtype to save 70–90% of memory:
    df["region"] = df["region"].astype("category")

Use df.memory_usage(deep=True) to measure actual memory before and after optimization.
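A quick way to verify the savings yourself; the column below is made up, and exact byte counts will vary by platform and pandas version:

```python
import pandas as pd

# Made-up low-cardinality column: 4 region names repeated 10,000 times each
df = pd.DataFrame({"region": ["North", "South", "East", "West"] * 10_000})

before = df.memory_usage(deep=True)["region"]   # object dtype: per-string overhead
df["region"] = df["region"].astype("category")
after = df.memory_usage(deep=True)["region"]    # category: codes + small lookup table

print(f"object: {before:,} bytes, category: {after:,} bytes")
```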


The Acme Corp Sales Dataset (introduced this chapter)

File: acme_sales_2023.csv

Columns: date, region, salesperson, product_sku, category, units_sold, unit_price, discount_rate, total_revenue

Data quality issues discovered through exploration:

  • 13 missing region values (0.52% of rows)
  • 5 missing category values (0.20% of rows)
  • Inconsistent category capitalization: "Office Chairs", "office chairs", "OFFICE CHAIRS"
  • 8 exact duplicate rows

All of these issues are addressed in Chapter 12.
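The checks that surface these kinds of issues can be sketched as follows; the tiny DataFrame below is a made-up stand-in for acme_sales_2023.csv, built to exhibit the same problem types:

```python
import pandas as pd

# Made-up stand-in exhibiting a missing region, inconsistent capitalization,
# and one exact duplicate row (rows 0 and 3)
df = pd.DataFrame({
    "region":   ["North", None, "South", "North"],
    "category": ["Office Chairs", "office chairs", "OFFICE CHAIRS", "Office Chairs"],
})

print(df["region"].isna().sum())                  # count of missing regions
print(df["category"].value_counts(dropna=False))  # reveals capitalization variants
print(df.duplicated().sum())                      # count of exact duplicate rows
```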


Common Mistakes to Avoid

Forgetting dropna=False in .value_counts(): You will miss NaN counts and undercount records in grouped analyses.

Not using parse_dates: Date columns loaded as strings cannot be filtered, grouped, or compared using time-based logic.

Skipping .info() and jumping straight to analysis: You will not know about missing values or type issues until your analysis produces wrong results — or crashes.

Assuming the row count is correct: Always check .shape against the expected count. CRM exports, system reports, and scheduled file drops can fail silently and produce incomplete files.

Using "int" instead of "Int64" for nullable integers: If an integer column might have NaN values, use pandas' nullable integer type pd.Int64Dtype() or "Int64" (capital I) to avoid silent conversion to float64.
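A minimal demonstration of that last point:

```python
import pandas as pd

values = [1, 2, None]

# Default inference: the None forces the whole column to float64
floats = pd.Series(values)

# Nullable integer dtype keeps whole numbers and represents missing as <NA>
ints = pd.Series(values, dtype="Int64")

print(floats.dtype)  # float64
print(ints.dtype)    # Int64
```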


One-Liner Summary

Load with intent, inspect systematically, document what you find, and fix nothing until you understand everything. The 20 minutes you spend on exploration saves hours of debugging incorrect analyses.