In This Chapter
- Chapter Overview
- 11.1 Why Data Loading Matters More Than You Think
- 11.2 Setting Up: Importing pandas
- 11.3 Reading CSV Files with pd.read_csv()
- 11.4 Reading Excel Files with pd.read_excel()
- 11.5 Reading Data Directly from URLs
- 11.6 Introducing the Acme Corp Sales Dataset
- 11.7 Inspecting a New Dataset: Your Standard Toolkit
- 11.8 A Systematic Workflow for Exploring New Datasets
- 11.9 Understanding Type Inference in Depth
- 11.10 Practical Tips for Real Business Files
- 11.11 Complete Loading and Inspection Example
- 11.12 Chapter Summary
- Key Terms
- What Comes Next
Chapter 11: Loading and Exploring Real Business Datasets
"The first step in any analysis is understanding what you're actually looking at. A well-explored dataset is half an analysis done." — Priya Okonkwo, Senior Analyst, Acme Corp
Chapter Overview
Every data analysis project begins the same way: a file lands in your inbox, gets dropped into a shared drive, or arrives via an automated pipeline — and your job is to make sense of it. Before you can calculate a single average or draw a single chart, you need to load that data into Python, inspect it carefully, and understand its shape, contents, and quirks.
This chapter teaches you how to do exactly that with pandas, Python's premier data manipulation library. You will learn to read data from CSV files, Excel workbooks, and even live URLs on the internet. You will learn the systematic inspection techniques that professional analysts use whenever they encounter a new dataset. And you will meet the Acme Corp sales dataset — a realistic, messy collection of 2023 sales records — that will accompany you through this chapter and the next.
By the end of this chapter, you will be able to:
- Load CSV files with pd.read_csv() and control every aspect of how pandas interprets them
- Load Excel workbooks with pd.read_excel() and work with named sheets
- Read data directly from URLs without downloading files manually
- Systematically inspect a new dataset using .info(), .describe(), .head(), .tail(), .value_counts(), .nunique(), and .dtypes
- Understand how pandas infers data types and when that inference goes wrong
- Measure memory usage and understand its business implications
- Apply a repeatable, professional workflow for first contact with any new dataset
11.1 Why Data Loading Matters More Than You Think
Most beginners assume that loading data is the trivial part — you just point pandas at a file and everything works. In practice, the way you load a dataset determines whether your subsequent analysis is correct or subtly, silently wrong.
Consider a few scenarios that actually happen in business:
The dollar sign problem. Your company's financial system exports revenue figures as "$1,234.56" — a string, not a number. If you load the file without specifying the correct data types, pandas will keep those values as strings. Every calculation you perform — sums, averages, comparisons — will either fail with an error or produce nonsense.
The date problem. Sales dates arrive as "2023-01-15", "01/15/2023", or even "15-Jan-23" depending on who exported the file and what system generated it. If pandas reads them as plain strings, you cannot filter by month, calculate days between dates, or plot a time series correctly.
The encoding problem. A colleague in France sends you a customer list with names like "Léa Dubois" and "François Martin." If you open the file with the wrong text encoding, those names become garbled nonsense characters — and you may not notice until a customer complains.
The hidden header problem. Some reporting tools put metadata in the first few rows before the actual column names appear. If pandas reads row 0 as the header when the real header is in row 3, your entire DataFrame has wrong column names and a few garbage rows at the top.
None of these problems are exotic. They are the daily reality of business data. The parameters you pass to pd.read_csv() and pd.read_excel() are your first line of defense.
11.2 Setting Up: Importing pandas
Every script and notebook in this book begins the same way:
import pandas as pd
This single line imports the entire pandas library and gives it the alias pd. The alias is a universal convention — every pandas tutorial, Stack Overflow answer, and professional codebase uses pd. You should too.
If you receive an ImportError, pandas is not installed in your current environment. Install it with:
pip install pandas openpyxl
The openpyxl package is required for reading .xlsx files. It is a separate install but is commonly needed alongside pandas.
11.3 Reading CSV Files with pd.read_csv()
CSV (Comma-Separated Values) is the most common file format for business data. Virtually every system — accounting software, CRM platforms, e-commerce tools, spreadsheet applications — can export to CSV. pd.read_csv() is therefore the function you will use most often.
11.3.1 The Simplest Case
import pandas as pd
df = pd.read_csv("acme_sales_2023.csv")
print(df.head())
When this works perfectly, it is genuinely this simple. pandas reads the file, uses the first row as column names, infers data types for each column, and returns a DataFrame. For clean, well-formatted files, you may need nothing else.
But you have many options available for the cases where "simple" is not enough.
11.3.2 The sep Parameter: Handling Different Delimiters
The sep parameter specifies the character that separates values. The default is a comma (,), but business data frequently uses other separators.
# Tab-separated values (common from database exports and Excel "Save as Text")
df = pd.read_csv("sales_export.tsv", sep="\t")
# Semicolon-separated (standard in many European countries where comma is
# the decimal separator)
df = pd.read_csv("ventes_2023.csv", sep=";")
# Pipe-separated (used by some legacy ERP systems)
df = pd.read_csv("erp_export.txt", sep="|")
If you are not sure what separator a file uses, open it in a plain text editor (not Excel) and look at the first few lines.
11.3.3 The header Parameter: Where Are the Column Names?
By default, pandas assumes the first row (row 0) contains column names. Use header to change this.
# The column names are actually in row 3 (0-indexed)
# Rows 0, 1, and 2 contain system metadata
df = pd.read_csv("legacy_report.csv", header=3)
# The file has no header row at all — just data
# You must supply column names yourself
df = pd.read_csv("raw_data.csv",
header=None,
names=["date", "region", "revenue"])
11.3.4 The index_col Parameter: Choosing Your Row Index
By default, pandas creates a numeric index (0, 1, 2, ...) for rows. You can designate an existing column as the index instead.
# Use the 'transaction_id' column as the row index
df = pd.read_csv("transactions.csv", index_col="transaction_id")
# Use the first column (position 0) as the index
df = pd.read_csv("transactions.csv", index_col=0)
When should you set an index? When you have a natural unique identifier for each row — like a transaction ID, customer ID, or product SKU — that you will use frequently for lookups. If no such natural key exists, the default numeric index is perfectly fine.
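To see why a natural key pays off, here is a minimal sketch using a small inline dataset (the transaction IDs and values are invented for illustration). Once transaction_id is the index, .loc looks rows up by that key directly:

```python
import pandas as pd
from io import StringIO

# Hypothetical transactions file with a unique transaction_id column
csv_data = """transaction_id,region,total_revenue
TX-1001,Northeast,449.98
TX-1002,Southwest,764.99
TX-1003,Midwest,997.47
"""

df = pd.read_csv(StringIO(csv_data), index_col="transaction_id")

# Direct lookup by the natural key
row = df.loc["TX-1002"]
print(row["region"])  # Southwest
```

With the default numeric index you would instead filter with df[df["transaction_id"] == "TX-1002"], which is noisier for frequent lookups.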
11.3.5 The usecols Parameter: Loading Only the Columns You Need
Real business files often contain dozens of columns, many of which you do not need. Loading only the columns you need speeds up your script and reduces memory usage — which matters when working with large files.
# Load only these four columns by name
df = pd.read_csv("acme_sales_2023.csv",
usecols=["date", "region", "salesperson", "total_revenue"])
# Load columns by position (0-indexed): columns 0, 1, 4, and 7
df = pd.read_csv("acme_sales_2023.csv",
usecols=[0, 1, 4, 7])
This is especially valuable when you are working with files that have hundreds of columns but your analysis only touches five or six of them.
11.3.6 The dtype Parameter: Controlling Type Inference
pandas is good at inferring types, but it is not perfect. The dtype parameter lets you override pandas' guesses for specific columns.
# Tell pandas that product_sku should remain a string even if it looks like a number
# (e.g., SKUs like "00234" would otherwise become integer 234, losing the leading zeros)
df = pd.read_csv("acme_sales_2023.csv",
dtype={"product_sku": str})
# Force multiple columns to specific types
df = pd.read_csv("acme_sales_2023.csv",
dtype={
"product_sku": str,
"units_sold": int,
"discount_rate": float
})
A critical business example: any code that looks like a number but must preserve leading zeros — ZIP codes, product codes, account numbers — should be forced to str. "07030" (a New Jersey ZIP code) becomes 7030 as an integer, which is factually wrong.
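You can see the leading-zero problem directly with a small inline file (the names and ZIP codes here are made up for illustration):

```python
import pandas as pd
from io import StringIO

csv_data = "customer,zip_code\nAlice,07030\nBob,10001\n"

# Default inference turns the ZIP column into integers: the leading zero is lost
df_bad = pd.read_csv(StringIO(csv_data))
print(df_bad["zip_code"].iloc[0])   # 7030 (wrong)

# Forcing str preserves the code exactly as exported
df_good = pd.read_csv(StringIO(csv_data), dtype={"zip_code": str})
print(df_good["zip_code"].iloc[0])  # 07030 (correct)
```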
11.3.7 The parse_dates Parameter: Reading Dates as Dates
When you load a CSV, date columns arrive as plain strings by default. The parse_dates parameter instructs pandas to convert them to actual datetime objects, which unlocks a huge range of date and time operations.
# Parse the 'date' column as datetime
df = pd.read_csv("acme_sales_2023.csv",
parse_dates=["date"])
# Parse multiple date columns
df = pd.read_csv("projects.csv",
parse_dates=["start_date", "end_date", "invoice_date"])
After parsing, you can filter rows by date range, extract month or year, calculate the number of days between two dates, and group by week or quarter — none of which is possible when dates are stored as strings.
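A short sketch of those operations, using a few inline rows (the dates and revenue figures are invented) and the .dt accessor that parsed datetime columns provide:

```python
import pandas as pd
from io import StringIO

csv_data = """date,total_revenue
2023-01-15,449.98
2023-02-20,764.99
2023-02-28,997.47
"""

df = pd.read_csv(StringIO(csv_data), parse_dates=["date"])

# Extract calendar parts from each date
df["month"] = df["date"].dt.month

# Filter by a date range using plain string comparisons
feb = df[df["date"] >= "2023-02-01"]
print(len(feb))    # 2

# Days between the first and last transaction
span = df["date"].max() - df["date"].min()
print(span.days)   # 44
```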
11.3.8 The encoding Parameter: Handling International Text
Text files are stored in an encoding — a system that maps characters to bytes. The two most common encodings you will encounter are:
- UTF-8: The modern standard. Handles virtually every character in every language. This is what you should use for new files.
- Latin-1 (also called ISO-8859-1): An older standard that covers Western European languages. Files from older Windows systems often use this encoding or the closely related CP1252.
# Modern standard — works for most files
df = pd.read_csv("international_customers.csv", encoding="utf-8")
# For older files from Windows systems
df = pd.read_csv("legacy_export.csv", encoding="latin-1")
# If you are not sure, try utf-8 first, then latin-1
# A UnicodeDecodeError means you have the wrong encoding
The symptom of a wrong encoding is either a UnicodeDecodeError (Python refuses to read the file) or garbled characters in string columns (the file loads but special characters look like ’ or é).
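The "try utf-8 first, then latin-1" advice can be wrapped in a small helper. This is our own sketch (the function name and file path are hypothetical); note that latin-1 accepts any byte sequence, so it always "succeeds" as a last resort and you should still eyeball the loaded text:

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn, returning the first successful read."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as e:
            last_error = e
    raise last_error

# Hypothetical usage:
# df = read_csv_with_fallback("legacy_export.csv")
```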
11.3.9 The na_values Parameter: Recognizing Missing Data
pandas automatically treats blank cells, "NaN", "None", and "NA" as missing values. But business data often uses other conventions to signal missing data.
# Tell pandas that these specific strings mean "missing"
df = pd.read_csv("acme_sales_2023.csv",
na_values=["N/A", "n/a", "NULL", "-", "--", "TBD", "?", ""])
# Some systems export missing numbers as 99999 or -1 (sentinel values)
# You can handle these here too, though it is safer to convert them after loading
11.3.10 The nrows and skiprows Parameters: Loading Partial Data
When working with very large files (millions of rows), you often want to inspect a small sample before loading everything.
# Load only the first 1,000 rows — useful for quickly checking the file structure
df_sample = pd.read_csv("huge_transactions.csv", nrows=1000)
# Skip the first 5 rows (perhaps they contain copyright notices or metadata)
# and load from row 6 onward
df = pd.read_csv("formatted_report.csv", skiprows=5)
# Skip specific row numbers (for example, skip the second and fourth rows)
df = pd.read_csv("report.csv", skiprows=[1, 3])
nrows gives you a quick structural preview, but it always reads from the top of the file. For data that truly does not fit in memory, pd.read_csv() also accepts a chunksize parameter, which returns an iterator of DataFrames so you can process the file in pieces.
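When a file genuinely does not fit in memory, pd.read_csv() can return an iterator of smaller DataFrames via the chunksize parameter. A minimal sketch with inline data (the numbers are invented), computing a running total without holding all rows at once:

```python
import pandas as pd
from io import StringIO

# Ten small rows standing in for a file too large to load whole
csv_data = "region,total_revenue\n" + "\n".join(
    f"Northeast,{100 + i}" for i in range(10)
)

# chunksize=4 yields DataFrames of up to 4 rows each
total = 0.0
for chunk in pd.read_csv(StringIO(csv_data), chunksize=4):
    total += chunk["total_revenue"].sum()

print(total)  # 1045.0
```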
11.3.11 Combining Parameters in Practice
In real work, you almost always use several parameters together:
df = pd.read_csv(
"acme_sales_2023.csv",
parse_dates=["date"],
dtype={"product_sku": str},
na_values=["N/A", "n/a", "--"],
usecols=["date", "region", "salesperson", "product_sku",
"category", "units_sold", "unit_price",
"discount_rate", "total_revenue"]
)
This single call does the following: loads the file, parses the date column, preserves the product SKU as a string, recognizes several missing-value conventions, and loads only the columns needed for analysis. This is the kind of precise, intentional loading that prevents downstream errors.
11.4 Reading Excel Files with pd.read_excel()
Excel files (.xlsx, .xls) are ubiquitous in business. Even when companies use sophisticated databases and reporting tools, results often end up shared as Excel attachments. pd.read_excel() handles them gracefully.
11.4.1 Basic Excel Loading
import pandas as pd
# Load the first sheet of an Excel workbook
df = pd.read_excel("acme_customers.xlsx")
pandas reads the first worksheet by default. All the parameters you learned for read_csv() — header, usecols, dtype, na_values, parse_dates — work identically with read_excel().
11.4.2 The sheet_name Parameter: Choosing Which Sheet to Load
Excel workbooks often contain multiple sheets. Use sheet_name to select the one you want.
# Load a sheet by name
df_q1 = pd.read_excel("annual_report.xlsx", sheet_name="Q1 Sales")
df_q2 = pd.read_excel("annual_report.xlsx", sheet_name="Q2 Sales")
# Load a sheet by position (0-indexed — 0 is the first sheet, 1 is the second)
df_first = pd.read_excel("annual_report.xlsx", sheet_name=0)
df_second = pd.read_excel("annual_report.xlsx", sheet_name=1)
# Load ALL sheets at once — returns a dictionary of {sheet_name: DataFrame}
all_sheets = pd.read_excel("annual_report.xlsx", sheet_name=None)
for sheet_name, sheet_df in all_sheets.items():
print(f"Sheet '{sheet_name}': {sheet_df.shape[0]} rows, "
f"{sheet_df.shape[1]} columns")
Loading all sheets at once is extremely useful when you need to combine data from multiple sheets — for example, when each quarter's sales data is stored in a separate sheet and you need to analyze the full year.
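The combine step is typically a pd.concat over that dictionary. Since this sketch should run without an .xlsx file on disk, we build by hand a dict shaped like what sheet_name=None returns (the sheet names and figures are invented):

```python
import pandas as pd

# Equivalent to: all_sheets = pd.read_excel("annual_report.xlsx", sheet_name=None)
all_sheets = {
    "Q1 Sales": pd.DataFrame({"region": ["Northeast"], "total_revenue": [449.98]}),
    "Q2 Sales": pd.DataFrame({"region": ["Midwest"], "total_revenue": [997.47]}),
}

# Stack every sheet into one DataFrame, keeping track of each row's source sheet
full_year = pd.concat(all_sheets, names=["sheet", "row"]).reset_index(level="sheet")
print(full_year.shape)  # (2, 3)
```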
11.4.3 The header Parameter with Excel
Just as with CSV files, Excel workbooks sometimes have decorative headers, titles, or merged cells before the actual data begins.
# The actual column names are in row 4 (0-indexed: row index 3)
# Rows 0-2 contain a company title and report date
df = pd.read_excel("formatted_report.xlsx", header=3)
11.4.4 The usecols Parameter with Excel
For Excel files, usecols accepts Excel-style column letters as well as column names or position numbers.
# Load columns A through E using Excel column letters
df = pd.read_excel("report.xlsx", usecols="A:E")
# Load specific non-contiguous Excel columns
df = pd.read_excel("report.xlsx", usecols="A,C,E,G")
# Load by column name (exactly like read_csv)
df = pd.read_excel("acme_customers.xlsx",
usecols=["customer_id", "name", "email", "region", "segment"])
11.5 Reading Data Directly from URLs
One of pandas' most convenient features is the ability to read CSV files directly from a URL — no downloading required.
import pandas as pd
# Read a CSV directly from a web URL
url = "https://raw.githubusercontent.com/some-org/some-repo/main/sales_data.csv"
df = pd.read_csv(url)
# All the same parameters apply — you can parse dates, set dtypes, etc.
df = pd.read_csv(url, parse_dates=["date"], dtype={"id": str})
This is particularly useful for:
- Public datasets: Government data portals, academic repositories, and open data initiatives frequently publish data as downloadable CSV files at stable URLs.
- Internal shared resources: Some organizations host internal data files on intranet servers or cloud storage with public read access.
- Automated pipelines: When a data source publishes updated files at the same URL each day, your script can always read the latest data without manual downloads.
11.5.1 Handling URL Access Issues
Reading from URLs requires a network connection and depends on the remote server being available. Some URLs require authentication. Keep this in mind:
import pandas as pd
url = "https://example.com/data/sales.csv"
try:
df = pd.read_csv(url)
print(f"Successfully loaded {len(df)} rows")
except Exception as e:
print(f"Could not load URL: {e}")
# Fall back to a local cached copy
df = pd.read_csv("data/sales_cached.csv")
11.6 Introducing the Acme Corp Sales Dataset
For this chapter and Chapter 12, we work with two primary datasets from Acme Corp:
The acme_sales_2023.csv Dataset
This CSV file contains 2,500 individual sales transactions from 2023. The columns are:
| Column | Description | Example |
|---|---|---|
| date | Transaction date | 2023-01-03 |
| region | Sales region | Northeast, Southwest, Midwest, Southeast, West |
| salesperson | Name of the salesperson | Jordan Blake |
| product_sku | Product stock-keeping unit identifier | OFF-CH-0045 |
| category | Product category | Office Chairs, Desks, Storage, Technology |
| units_sold | Number of units in this transaction | 3 |
| unit_price | Price per unit at time of sale | 249.99 |
| discount_rate | Fractional discount applied | 0.10 |
| total_revenue | units_sold * unit_price * (1 - discount_rate) | 674.97 |
The acme_customers.xlsx Dataset
This Excel workbook contains customer account information with two sheets:
- Active Customers: Current accounts with contact information, account tier, and account manager
- Inactive Customers: Accounts that have not placed an order in over 12 months
Both datasets contain intentional data quality issues that we will discover in this chapter and fix in Chapter 12. This mirrors real-world work — data is rarely clean when it arrives.
11.7 Inspecting a New Dataset: Your Standard Toolkit
Once you have loaded a dataset, your first job is to understand it before you do anything else. Professional analysts follow a consistent inspection workflow. This section walks through every technique, using the Acme sales dataset as the example.
Assume throughout this section that the DataFrame has been loaded as:
import pandas as pd
df = pd.read_csv("acme_sales_2023.csv", parse_dates=["date"],
dtype={"product_sku": str})
11.7.1 Dimensions: .shape
The first thing to check is how big the dataset is. .shape returns a tuple of (rows, columns).
print(df.shape)
# (2500, 9)
This tells you immediately whether you got the data you expected. If your manager said "this file has about 2,500 sales records" and .shape shows (25, 9), something went wrong — maybe you loaded only the header or the file was truncated.
For more readable output, use f-strings:
rows, cols = df.shape
print(f"Dataset contains {rows:,} rows and {cols} columns")
# Dataset contains 2,500 rows and 9 columns
11.7.2 First and Last Rows: .head() and .tail()
.head(n) returns the first n rows (default 5). .tail(n) returns the last n rows (default 5).
# See the first 5 rows
print(df.head())
# See the first 10 rows
print(df.head(10))
# See the last 5 rows
print(df.tail())
These calls give you a visual sanity check: do the columns look right? Are the values in the expected format? Are there obvious problems in the first or last rows (common locations for totals, summary rows, or notes that got appended to the data)?
11.7.3 Column Names and Data Types: .dtypes
.dtypes shows you every column and the data type pandas assigned to it.
print(df.dtypes)
Output:
date datetime64[ns]
region object
salesperson object
product_sku object
category object
units_sold int64
unit_price float64
discount_rate float64
total_revenue float64
dtype: object
Understanding pandas data types:
| pandas dtype | What it means | Examples |
|---|---|---|
| int64 | Integer (whole number) | 1, 42, -7 |
| float64 | Floating-point number | 3.14, 249.99, -0.05 |
| object | String / mixed type | "Northeast", "Jordan Blake" |
| datetime64[ns] | Date and time | 2023-01-03 |
| bool | Boolean | True, False |
| category | Categorical (memory-efficient strings) | Useful for columns with few unique values |
The most important thing to check: are the types what you expect? If unit_price shows as object instead of float64, that means the values contain non-numeric characters like "$" or "," — and you have a cleaning problem. If date shows as object, it means dates were not parsed and will not work in time-based calculations.
11.7.4 Comprehensive Summary: .info()
.info() is the most informative single-command summary of a DataFrame. It shows:
- The total number of rows
- Each column name
- The number of non-null values in each column
- The data type of each column
- Total memory usage
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 2500 non-null datetime64[ns]
1 region 2487 non-null object
2 salesperson 2500 non-null object
3 product_sku 2500 non-null object
4 category 2495 non-null object
5 units_sold 2500 non-null int64
6 unit_price 2500 non-null float64
7 discount_rate 2500 non-null float64
8 total_revenue 2500 non-null float64
dtypes: float64(3), int64(1), object(4), datetime64[ns](1)
memory usage: 175.8 KB
Look at the "Non-Null Count" column carefully. region shows 2487 non-null out of 2500 rows — which means 13 rows are missing region data. category has 5 missing values. These are the data quality issues Priya will need to address.
The instant you see non-null counts that do not match the total row count, you have found missing data. Make a note of it.
11.7.5 Statistical Summary: .describe()
.describe() computes descriptive statistics for every numeric column.
print(df.describe())
Output:
units_sold unit_price discount_rate total_revenue
count 2500.000000 2500.000000 2500.000000 2500.000000
mean 3.814000 312.456789 0.087600 1053.234567
std 2.156789 198.234567 0.056789 789.234567
min 1.000000 12.990000 0.000000 12.990000
25% 2.000000 149.990000 0.050000 384.345678
50% 3.000000 279.990000 0.100000 839.940000
75% 5.000000 449.990000 0.150000 1574.919000
max 50.000000 1899.990000 0.300000 94999.500000
What to look for in .describe() output:
- count: Should match the total row count. If it does not, there are missing values in that column.
- min and max: Are they plausible? A minimum units_sold of -5 would indicate a data error (negative units sold). A maximum discount_rate of 2.5 when discounts should be between 0 and 1 would be a problem.
- mean vs median (50%): A large gap suggests skewed data or outliers. Here, a mean total_revenue of $1,053 vs a median of $840 suggests some very large transactions are pulling the average up.
- std (standard deviation): Relative to the mean, gives a sense of how spread out values are.
To include string columns in the description, use include="all":
print(df.describe(include="all"))
This adds count, unique, top, and freq rows for non-numeric columns, showing the most common value and how often it appears.
11.7.6 Counting Unique Values: .nunique() and .value_counts()
For columns containing categories, names, or codes, you want to know how many distinct values exist and what they are.
.nunique() counts the number of unique values in each column:
print(df.nunique())
Output:
date 248
region 5
salesperson 22
product_sku 127
category 8
units_sold 18
unit_price 89
discount_rate 7
total_revenue 832
dtype: int64
This immediately tells you that there are 5 regions, 22 salespersons, and 127 distinct product SKUs. If you expected only 4 regions, the 5 here prompts investigation — perhaps there is a typo or a new region was added.
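When the count surprises you, .unique() lists the actual distinct values so you can spot the culprit. A small sketch with an invented region column containing a trailing-space typo:

```python
import pandas as pd

# Hypothetical region values, including one typo variant with a trailing space
regions = pd.Series(
    ["Northeast", "Southwest", "Midwest", "Southeast", "Northeast ", "West"]
)

print(regions.nunique())         # 6, one more than the expected 5
print(sorted(regions.unique()))  # "Northeast " shows up as its own value
```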
.value_counts() goes deeper, showing you exactly what each unique value is and how many times it appears. You call it on a specific column (a Series):
# How many transactions in each region?
print(df["region"].value_counts())
Output:
Northeast 612
Southwest 521
Midwest 489
Southeast 468
West 397
Name: region, dtype: int64
# Proportions instead of counts
print(df["region"].value_counts(normalize=True).round(3))
Output:
Northeast 0.245
Southwest 0.208
Midwest 0.196
Southeast 0.187
West 0.159
Name: region, dtype: float64
# Value counts for category, sorted by name instead of frequency
print(df["category"].value_counts().sort_index())
11.7.7 Examining Missing Data
.info() shows you that missing data exists. To examine it in more detail:
# Count missing values in each column
print(df.isna().sum())
Output:
date 0
region 13
salesperson 0
product_sku 0
category 5
units_sold 0
unit_price 0
discount_rate 0
total_revenue 0
dtype: int64
# Show missing values as a percentage
missing_pct = (df.isna().sum() / len(df) * 100).round(2)
print(missing_pct)
Output:
date 0.00
region 0.52
salesperson 0.00
product_sku 0.00
category 0.20
units_sold 0.00
unit_price 0.00
discount_rate 0.00
total_revenue 0.00
dtype: float64
0.52% of rows are missing a region, and 0.20% are missing a category. These are small but real problems that must be addressed before analysis.
To see the actual rows with missing values:
# All rows where region is missing
print(df[df["region"].isna()])
# All rows where ANY column has a missing value
print(df[df.isna().any(axis=1)])
11.7.8 Memory Usage: .memory_usage()
Understanding memory usage matters when files are large. .memory_usage() shows how much RAM each column occupies.
# Memory usage per column in bytes
print(df.memory_usage(deep=True))
Output:
Index 20000
date 20000
region 166234
salesperson 185678
product_sku 182345
category 163891
units_sold 20000
unit_price 20000
discount_rate 20000
total_revenue 20000
dtype: int64
The deep=True parameter is important: without it, pandas shows a simplified estimate. With deep=True, it measures the actual memory used by string columns.
Notice that string columns (object dtype) use far more memory than numeric columns. units_sold, unit_price, etc. each use 20,000 bytes for 2,500 rows of int64 or float64 (8 bytes per value × 2,500). But region uses 166,234 bytes — about 66 bytes per row on average — because Python strings have significant overhead.
For columns with a small number of distinct values (like region with only 5 unique values, or category), converting to category dtype can dramatically reduce memory usage:
# Convert to category dtype for memory efficiency
df["region"] = df["region"].astype("category")
df["category"] = df["category"].astype("category")
# Check the new memory usage
print(df.memory_usage(deep=True))
For a dataset with 2,500 rows and only 5 regions, the category dtype typically reduces memory for that column by 70-80%.
11.8 A Systematic Workflow for Exploring New Datasets
When you receive any new dataset, follow this sequence. It takes about five minutes and will save you hours of debugging downstream.
Step 1: Load with Intent
Do not just pd.read_csv("file.csv") blindly. Think about:
- What separator does this file use?
- Are there any header rows to skip?
- Which columns contain dates that should be parsed?
- Are there any ID columns with leading zeros that must stay as strings?
- What does this system use to represent missing values?
import pandas as pd
df = pd.read_csv(
"acme_sales_2023.csv",
parse_dates=["date"],
dtype={"product_sku": str},
na_values=["N/A", "n/a", "--", ""]
)
Step 2: Check Dimensions
print(f"Shape: {df.shape}")
# Verify this matches what you expected
Step 3: Look at the Top and Bottom
print("=== FIRST 5 ROWS ===")
print(df.head())
print("\n=== LAST 5 ROWS ===")
print(df.tail())
The last few rows often contain totals, footnotes, or extra blank rows. The first few rows reveal the actual data format.
Step 4: Check Column Names and Types
print("=== DATA TYPES ===")
print(df.dtypes)
Are the types what you expected? Flag any that need correction.
Step 5: Run .info()
print("=== DATASET INFO ===")
df.info()
This is your primary tool for spotting missing values and type problems in one view.
Step 6: Statistical Summary
print("=== DESCRIPTIVE STATISTICS ===")
print(df.describe())
Check min/max values for plausibility. Look for surprising values.
Step 7: Unique Values for Categorical Columns
print("=== UNIQUE VALUES ===")
print(df.nunique())
# For each column with a small number of unique values
for col in ["region", "category"]:
print(f"\n{col.upper()} VALUE COUNTS:")
print(df[col].value_counts())
Step 8: Missing Value Analysis
print("=== MISSING VALUES ===")
missing = df.isna().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_report = pd.DataFrame({
"missing_count": missing,
"missing_pct": missing_pct
})
print(missing_report[missing_report["missing_count"] > 0])
Step 9: Document What You Found
Write down (in comments, a notebook cell, or a separate document) what you discovered:
- Total rows and columns
- Columns with missing values and percentages
- Unexpected values or ranges
- Type mismatches to fix
- Potential duplicate records to investigate
This documentation is the bridge between exploration and cleaning. In Chapter 12, Priya will take exactly this list of issues and work through them systematically.
11.9 Understanding Type Inference in Depth
pandas' type inference is both powerful and a source of subtle bugs. Understanding how it works helps you predict when it will get things wrong.
How pandas Infers Types
When pandas reads a CSV, it scans each column and applies these rules in order:
- If all non-null values can be parsed as integers, the column becomes int64.
- If all non-null values can be parsed as floating-point numbers (but not all as integers), the column becomes float64.
- If the column was explicitly listed in parse_dates, pandas attempts datetime parsing.
- Everything else becomes object (Python string).
Common Type Inference Failures
Integer columns with missing values become float.
This is one of the most common surprises. If units_sold has even one NaN, pandas cannot use int64 (which cannot hold NaN) and will use float64 instead. Your clean integer column of [1, 2, 3, NaN, 5] becomes [1.0, 2.0, 3.0, NaN, 5.0].
Newer pandas versions (1.0+) introduce nullable integer types (Int64 with a capital I) that solve this:
# After loading, convert to nullable integer
df["units_sold"] = df["units_sold"].astype("Int64") # Capital I!
Money columns with currency symbols become object.
A column containing "$249.99" cannot be parsed as a float because of the $. The entire column becomes object. You will need to strip the $ and convert — which is exactly what Chapter 12 covers.
Dates in non-standard formats fail silently or error.
If parse_dates=["date"] is set but the column contains "Jan 3, 2023" instead of "2023-01-03", pandas may fail to parse some or all values and return strings instead. Always verify your date columns with .dtypes after loading.
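For non-standard formats, pd.to_datetime with an explicit format string is more reliable than inference. A sketch with invented dates in the "Jan 3, 2023" style mentioned above:

```python
import pandas as pd

# Dates exported in a non-ISO, human-readable style
raw = pd.Series(["Jan 3, 2023", "Feb 14, 2023"])

# Spell out the format instead of relying on inference
parsed = pd.to_datetime(raw, format="%b %d, %Y")
print(parsed.dt.month.tolist())  # [1, 2]
```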
Mixed-type columns become object.
If a column contains mostly numbers but one row has a note like "see memo" in it, the entire column becomes object. This often happens in real spreadsheets where someone typed a note into a data column.
11.10 Practical Tips for Real Business Files
Dealing with Merged Cells in Excel
Excel files from finance or operations teams often use merged cells for formatting. pd.read_excel() handles this by filling only the first cell of a merged region and leaving the rest NaN. You will need to forward-fill the value with .ffill() — but that technique is covered in Chapter 12.
Preview Large Files Before Full Load
For files over 100MB, always preview first:
# Load just 100 rows to understand the structure
df_preview = pd.read_csv("massive_transactions.csv", nrows=100)
df_preview.info()
df_preview.head()
# Once you understand the structure, load the full file with correct parameters
Check File Size Before Loading
import os
file_path = "acme_sales_2023.csv"
size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {size_mb:.2f} MB")
Handling the BOM (Byte Order Mark)
Some CSV files exported from Excel contain a BOM (Byte Order Mark) at the start of the file, which causes the first column name to appear with an extra \ufeff prefix (or, under the wrong encoding, garbled characters such as ï»¿). The fix:
# encoding='utf-8-sig' handles files with a BOM
df = pd.read_csv("excel_export.csv", encoding="utf-8-sig")
11.11 Complete Loading and Inspection Example
Here is a complete, self-contained script that demonstrates everything in this chapter using data created inline (no external file needed). It creates a realistic Acme Corp sales DataFrame and runs the full inspection workflow.
```python
import pandas as pd
from io import StringIO

# Simulated CSV content — in real work this would be a file path
csv_data = """date,region,salesperson,product_sku,category,units_sold,unit_price,discount_rate,total_revenue
2023-01-03,Northeast,Jordan Blake,OFF-CH-0045,Office Chairs,2,249.99,0.10,449.98
2023-01-03,Southwest,Priya Okonkwo,DESK-SIT-012,Desks,1,899.99,0.15,764.99
2023-01-04,Midwest,Sandra Chen,TECH-MON-089,Technology,3,349.99,0.05,997.47
2023-01-05,,Marcus Webb,OFF-CH-0046,Office Chairs,1,299.99,0.00,299.99
2023-01-05,Northeast,Jordan Blake,STOR-CAB-023,Storage,4,129.99,0.10,467.96
2023-01-08,West,Sarah Kim,TECH-KEY-034,Technology,2,79.99,0.10,143.98
2023-01-09,Southeast,David Park,DESK-SIT-012,Desks,1,899.99,0.20,719.99
2023-01-10,Northeast,Jordan Blake,OFF-CH-0045,office chairs,2,249.99,0.10,449.98
"""

# Load from StringIO (simulates loading from a file)
df = pd.read_csv(
    StringIO(csv_data),
    parse_dates=["date"],
    dtype={"product_sku": str},
)

print("=" * 60)
print("ACME CORP SALES — INITIAL EXPLORATION")
print("=" * 60)

# Step 1: Dimensions
print(f"\nShape: {df.shape[0]} rows x {df.shape[1]} columns")

# Step 2: First rows
print("\n--- First 3 Rows ---")
print(df.head(3).to_string())

# Step 3: Data types
print("\n--- Data Types ---")
print(df.dtypes)

# Step 4: Info
print("\n--- Dataset Info ---")
df.info()

# Step 5: Statistical summary
print("\n--- Descriptive Statistics ---")
print(df.describe().round(2))

# Step 6: Unique values
print("\n--- Unique Value Counts per Column ---")
print(df.nunique())

print("\n--- Region Distribution ---")
print(df["region"].value_counts(dropna=False))

print("\n--- Category Distribution ---")
print(df["category"].value_counts(dropna=False))

# Step 7: Missing values
print("\n--- Missing Values ---")
missing = df.isna().sum()
print(missing[missing > 0])

# Step 8: Memory usage
print("\n--- Memory Usage (deep) ---")
print(df.memory_usage(deep=True))
```
11.12 Chapter Summary
Loading data is not a single function call — it is a series of deliberate decisions. The parameters you pass to `pd.read_csv()` and `pd.read_excel()` determine whether pandas correctly interprets your data's types, dates, separators, and missing values. Getting these right at the loading stage prevents a cascade of silent errors later.
Equally important is systematic inspection. The tools in pandas — `.shape`, `.head()`, `.tail()`, `.dtypes`, `.info()`, `.describe()`, `.value_counts()`, `.nunique()`, and `.memory_usage()` — together give you a complete picture of any dataset within minutes. Professional analysts do not skip this step, no matter how many times they have seen similar data.
In Chapter 12, you will take the skills from this chapter further: using the data quality issues discovered during exploration as the input to a systematic data cleaning pipeline.
Key Terms
DataFrame: The primary pandas data structure — a two-dimensional table with labeled rows and columns.
dtype: Short for "data type." In pandas, each column has a single dtype that determines what kind of values it holds and what operations are available.
Type inference: The process by which pandas automatically determines the dtype of each column when loading a file, based on the values it finds.
Missing value (NaN): A marker indicating that a value is absent. In pandas, `NaN` (Not a Number) is the most common missing-value marker; datetime columns use the related `NaT` (Not a Time).
Encoding: The system that maps text characters to binary data. UTF-8 and Latin-1 are the two most common encodings in business data.
Delimiter: The character (comma, tab, semicolon, pipe) that separates values in a text-based data file.
BOM (Byte Order Mark): A special character that some software inserts at the beginning of text files. Can cause unexpected characters to appear in column names.
What Comes Next
Chapter 12 picks up exactly where this chapter leaves off. Priya has explored `acme_sales_2023.csv` and found several issues: 13 rows with missing `region`, 5 rows with missing `category`, inconsistent capitalization in the `category` column (`"office chairs"` vs `"Office Chairs"`), and a `unit_price` column stored as strings with dollar signs in some versions of the file.
Chapter 12 teaches the complete toolkit for fixing all of these problems — and many others that appear routinely in real business data.