Chapter 11 Exercises: Loading and Exploring Real Business Datasets
These exercises are organized into five tiers of increasing difficulty. Complete the tiers in order — each builds on skills from the previous one. All exercises can be completed using inline data (StringIO or dict-based DataFrames) — no external files required.
Tier 1: Recall (Concept Checks)
These exercises test whether you can identify the correct function, parameter, or method for a given task.
Exercise 1.1 — Match the Parameter
For each scenario, identify which pd.read_csv() parameter you would use.
| Scenario | Parameter |
|---|---|
| The file uses semicolons instead of commas | _____ |
| The column names are in row 4, not row 0 | _____ |
| You want to load only 3 of 20 available columns | _____ |
| A product code column has values like "00234" that must keep their leading zeros | _____ |
| The date column contains values like "2023-01-15" and you want them as datetime objects | _____ |
| The file was created on a French Windows system | _____ |
| Your file uses "NULL" and "--" to mean missing | _____ |
| You only want to inspect the first 500 rows of a 5-million row file | _____ |
Answers: sep, header, usecols, dtype, parse_dates, encoding, na_values, nrows
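As a quick self-check, several of these parameters can be exercised in a single call. A minimal sketch using invented inline data (the file contents below are illustrative, not from the chapter):

```python
import pandas as pd
from io import StringIO

# A small semicolon-separated file with a metadata line, a product code
# with leading zeros, a date column, and custom missing-value markers.
raw = """exported 2023-04-01
code;order_date;amount
00234;2023-01-15;19.99
00567;2023-01-16;NULL
00891;--;24.50
"""

df = pd.read_csv(
    StringIO(raw),
    sep=";",                     # semicolon separator
    header=1,                    # column names are on the second line (row 1)
    dtype={"code": str},         # keep leading zeros in the product code
    parse_dates=["order_date"],  # convert the date column to datetime
    na_values=["NULL", "--"],    # treat these strings as missing
)
print(df)
print(df.dtypes)
```

A real file would be passed by path instead of `StringIO`; everything else stays the same.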
Exercise 1.2 — True or False
Answer True or False for each statement. If False, write the correct statement.
1. `pd.read_csv()` always uses a comma as the separator unless told otherwise.
2. Setting `header=None` means there is no data in the file.
3. `df.head()` returns the first 10 rows by default.
4. `.info()` shows the number of non-null values per column.
5. `.describe()` only works on numeric columns.
6. `.value_counts()` includes `NaN` values in its output by default.
7. `dtype={"id": str}` will raise an error if the `id` column contains numbers.
8. `.nunique()` returns the number of unique values in each column.
9. `pd.read_excel()` requires the `openpyxl` package to read `.xlsx` files.
10. `.memory_usage(deep=True)` gives a more accurate estimate than `.memory_usage()` alone.
Exercise 1.3 — Fill in the Blank
Complete each line of code with the correct method or parameter.
```python
import pandas as pd

# Load a CSV where the separator is a pipe character
df = pd.read_csv("data.csv", _____="|")

# Load only columns 'name', 'date', and 'amount'
df = pd.read_csv("data.csv", _____=["name", "date", "amount"])

# Show the last 10 rows of a DataFrame
print(df._____(10))

# Count missing values in each column
print(df._____().sum())

# Show the number of unique values per column
print(df._____())

# Show the most frequent values in the 'region' column, including NaN
print(df["region"]._____(dropna=False))

# Check the data type of each column
print(df._____)
```
Exercise 1.4 — Reading .info() Output
Given this .info() output, answer the questions below.
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   order_id     1500 non-null   object
 1   order_date   1500 non-null   datetime64[ns]
 2   customer_id  1500 non-null   object
 3   product      1482 non-null   object
 4   quantity     1500 non-null   int64
 5   revenue      1467 non-null   float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 90.3 KB
```
a) How many rows does this DataFrame have?
b) How many columns are missing at least one value?
c) How many rows are missing a product value?
d) Was parse_dates=["order_date"] likely used when loading this file? How do you know?
e) What is the most likely reason order_id is object type rather than int64?
f) How many rows have revenue data?
Tier 2: Application (Single-Concept Practice)
These exercises require you to write code that applies one or two concepts from the chapter.
Exercise 2.1 — Loading with the Right Parameters
Create the following CSV data as a string and load it correctly using pd.read_csv() with StringIO.
```
Report Title: Q1 2023 Regional Sales
Export Date: 2023-04-01
Source: Acme Corp CRM v4.2
region;salesperson;units_sold;sale_date;revenue
Northeast;Jordan Blake;5;2023-01-03;$1249.95
Southwest;Casey Rivera;3;2023-01-05;$899.97
Midwest;Alex Torres;8;2023-01-08;N/A
West;Sarah Kim;2;2023-01-09;$499.98
```
Requirements:
- Skip the 3 metadata rows at the top
- Use the correct separator (semicolons, not commas)
- Parse sale_date as a datetime
- Treat "N/A" as a missing value
- Keep revenue as a string for now (it has $ signs)
After loading, print the DataFrame and verify with .info().
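If you get stuck, here is one possible loading call (a sketch, not the only correct answer):

```python
import pandas as pd
from io import StringIO

raw = """Report Title: Q1 2023 Regional Sales
Export Date: 2023-04-01
Source: Acme Corp CRM v4.2
region;salesperson;units_sold;sale_date;revenue
Northeast;Jordan Blake;5;2023-01-03;$1249.95
Southwest;Casey Rivera;3;2023-01-05;$899.97
Midwest;Alex Torres;8;2023-01-08;N/A
West;Sarah Kim;2;2023-01-09;$499.98
"""

df = pd.read_csv(
    StringIO(raw),
    sep=";",
    skiprows=3,                 # the metadata lines above the header row
    parse_dates=["sale_date"],
    na_values=["N/A"],          # pandas treats "N/A" as missing by default,
                                # but listing it makes the intent explicit
    dtype={"revenue": str},     # keep the $-prefixed strings for now
)
print(df)
df.info()
```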
Exercise 2.2 — Systematic Exploration Drill
Using the following inline data, run the complete 8-step exploration workflow from Section 11.8.
```python
import pandas as pd

data = {
    "invoice_date": ["2023-06-01", "2023-06-03", "2023-06-03", "2023-06-07",
                     "2023-06-10", "2023-06-12", "2023-06-15", None],
    "client": ["Brightfield LLC", "Nova Dynamics", "Brightfield LLC",
               "OakPath Media", "Nova Dynamics", "Harrington & Sons",
               "LivTech Inc.", "OakPath Media"],
    "project_type": ["Strategy", "Technology", "Strategy", "Creative",
                     "technology", "Manufacturing", "Technology", "creative"],
    "hours": [12.5, 8.0, 6.5, None, 14.0, 20.0, 9.5, 7.0],
    "rate": [150.0, 175.0, 150.0, 125.0, 175.0, 150.0, 175.0, 125.0],
    "billed": [1875.0, 1400.0, 975.0, None, 2450.0, 3000.0, 1662.5, 875.0],
}
df = pd.DataFrame(data)
df["invoice_date"] = pd.to_datetime(df["invoice_date"])
```
Document at least 3 data quality issues you find.
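If you need a starting point, a typical pass looks roughly like the sketch below (the chapter's Section 11.8 defines the exact eight steps; this re-creates the DataFrame above and runs a common sequence):

```python
import pandas as pd

data = {
    "invoice_date": ["2023-06-01", "2023-06-03", "2023-06-03", "2023-06-07",
                     "2023-06-10", "2023-06-12", "2023-06-15", None],
    "client": ["Brightfield LLC", "Nova Dynamics", "Brightfield LLC",
               "OakPath Media", "Nova Dynamics", "Harrington & Sons",
               "LivTech Inc.", "OakPath Media"],
    "project_type": ["Strategy", "Technology", "Strategy", "Creative",
                     "technology", "Manufacturing", "Technology", "creative"],
    "hours": [12.5, 8.0, 6.5, None, 14.0, 20.0, 9.5, 7.0],
    "rate": [150.0, 175.0, 150.0, 125.0, 175.0, 150.0, 175.0, 125.0],
    "billed": [1875.0, 1400.0, 975.0, None, 2450.0, 3000.0, 1662.5, 875.0],
}
df = pd.DataFrame(data)
df["invoice_date"] = pd.to_datetime(df["invoice_date"])

print(df.shape)                   # overall size
print(df.head())                  # first rows
df.info()                         # dtypes and non-null counts
print(df.describe(include="all")) # summary statistics
print(df.isna().sum())            # missing values per column
print(df.nunique())               # unique values per column
print(df["project_type"].value_counts(dropna=False))  # category spread
print(df.duplicated().sum())      # fully duplicated rows
```

The `value_counts` line alone should surface one of the three quality issues you are asked to document.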
Exercise 2.3 — Excel Sheet Exploration
Using the following code to simulate an Excel workbook scenario, demonstrate how to load specific sheets and explore each one.
```python
import pandas as pd

# Simulate two sheets of data
q1_data = {
    "month": ["January", "February", "March"],
    "region": ["Northeast", "Southwest", "Midwest"],
    "revenue": [125000.00, 98500.00, 142300.00],
    "units": [450, 380, 512],
}
q2_data = {
    "month": ["April", "May", "June"],
    "region": ["Northeast", "Southwest", "Midwest"],
    "revenue": [134000.00, 107200.00, 156800.00],
    "units": [485, 412, 548],
}
q1 = pd.DataFrame(q1_data)
q2 = pd.DataFrame(q2_data)
```
Write code that:
a) Prints the shape of each "sheet" (DataFrame)
b) Compares total revenue between Q1 and Q2
c) Checks whether both DataFrames have the same columns
d) Combines them into a single DataFrame and inspects the result
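One possible sketch of parts (a) through (d), re-creating the two DataFrames for self-containment:

```python
import pandas as pd

q1 = pd.DataFrame({
    "month": ["January", "February", "March"],
    "region": ["Northeast", "Southwest", "Midwest"],
    "revenue": [125000.00, 98500.00, 142300.00],
    "units": [450, 380, 512],
})
q2 = pd.DataFrame({
    "month": ["April", "May", "June"],
    "region": ["Northeast", "Southwest", "Midwest"],
    "revenue": [134000.00, 107200.00, 156800.00],
    "units": [485, 412, 548],
})

# a) shape of each "sheet"
print("Q1 shape:", q1.shape, "| Q2 shape:", q2.shape)

# b) total revenue comparison
q1_rev, q2_rev = q1["revenue"].sum(), q2["revenue"].sum()
print(f"Q1 {q1_rev:,.2f} vs Q2 {q2_rev:,.2f} (change {q2_rev - q1_rev:+,.2f})")

# c) identical column sets, in the same order?
print("Same columns:", list(q1.columns) == list(q2.columns))

# d) combine and inspect the result
combined = pd.concat([q1, q2], ignore_index=True)
combined.info()
```

`ignore_index=True` rebuilds the row index so the combined frame runs 0 to 5 rather than repeating 0 to 2 twice.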
Exercise 2.4 — Missing Value Investigation
```python
import pandas as pd

employee_data = {
    "employee_id": ["E001", "E002", "E003", "E004", "E005",
                    "E006", "E007", "E008", "E009", "E010"],
    "name": ["Aisha Patel", "Bruno Santos", "Clara Kim",
             "Derek Jones", "Elena Vasquez", "Frank Chen",
             "Grace Liu", "Hassan Al-Farsi", "Irene Wong", "James Park"],
    "department": ["Sales", "HR", "Sales", None, "IT",
                   "HR", None, "Sales", "IT", "Finance"],
    "salary": [72000, 58000, 68000, 81000, 95000,
               None, 74000, 69000, 91000, 88000],
    "hire_date": ["2019-03-01", "2020-07-15", "2018-11-01", "2021-01-10",
                  "2017-06-20", "2022-09-01", "2023-02-14", None,
                  "2019-08-05", "2020-12-01"],
    "performance": ["Exceeds", "Meets", "Meets", "Exceeds", None,
                    "Meets", "Below", "Exceeds", None, "Meets"],
}
df = pd.DataFrame(employee_data)
```
Write code to:
a) Count missing values in each column
b) Show the percentage of missing values per column
c) Display the rows where department is missing
d) Display the rows where either salary OR performance is missing
e) Report the total number of cells in the entire DataFrame that contain missing values
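A possible sketch of parts (a) through (e), with the DataFrame re-created so the snippet runs standalone:

```python
import pandas as pd

df = pd.DataFrame({
    "employee_id": ["E001", "E002", "E003", "E004", "E005",
                    "E006", "E007", "E008", "E009", "E010"],
    "name": ["Aisha Patel", "Bruno Santos", "Clara Kim",
             "Derek Jones", "Elena Vasquez", "Frank Chen",
             "Grace Liu", "Hassan Al-Farsi", "Irene Wong", "James Park"],
    "department": ["Sales", "HR", "Sales", None, "IT",
                   "HR", None, "Sales", "IT", "Finance"],
    "salary": [72000, 58000, 68000, 81000, 95000,
               None, 74000, 69000, 91000, 88000],
    "hire_date": ["2019-03-01", "2020-07-15", "2018-11-01", "2021-01-10",
                  "2017-06-20", "2022-09-01", "2023-02-14", None,
                  "2019-08-05", "2020-12-01"],
    "performance": ["Exceeds", "Meets", "Meets", "Exceeds", None,
                    "Meets", "Below", "Exceeds", None, "Meets"],
})

# a) missing values per column
print(df.isna().sum())

# b) percentage missing per column (mean of a boolean mask is a fraction)
print((df.isna().mean() * 100).round(1))

# c) rows where department is missing
print(df[df["department"].isna()])

# d) rows where salary OR performance is missing (| is element-wise OR)
print(df[df["salary"].isna() | df["performance"].isna()])

# e) total missing cells in the whole frame
print(df.isna().sum().sum())
```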
Exercise 2.5 — Memory Usage Analysis
Load the following data and analyze memory usage:
```python
import pandas as pd

sales_data = {
    "transaction_id": range(1, 10001),
    "region": ["Northeast", "Southwest", "Midwest", "Southeast", "West"] * 2000,
    "category": ["Office Chairs", "Desks", "Technology", "Storage"] * 2500,
    "salesperson": ["Jordan Blake", "Casey Rivera", "Alex Torres",
                    "Sarah Kim", "Morgan Lee"] * 2000,
    "units_sold": [1, 2, 3, 4, 5] * 2000,
    "unit_price": [99.99, 199.99, 299.99, 399.99, 499.99] * 2000,
    "total_revenue": [149.99, 349.99, 749.99, 1199.99, 1999.99] * 2000,
}
df = pd.DataFrame(sales_data)
```
Write code to:
a) Show memory usage per column (in kilobytes)
b) Identify which columns are candidates for category dtype conversion
c) Convert the candidates to category dtype
d) Show the memory usage again and calculate the total savings in kilobytes
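One way to approach all four parts (the `nunique() < 50` threshold for picking candidates is a judgment call, not a rule from the chapter):

```python
import pandas as pd

df = pd.DataFrame({
    "transaction_id": range(1, 10001),
    "region": ["Northeast", "Southwest", "Midwest", "Southeast", "West"] * 2000,
    "category": ["Office Chairs", "Desks", "Technology", "Storage"] * 2500,
    "salesperson": ["Jordan Blake", "Casey Rivera", "Alex Torres",
                    "Sarah Kim", "Morgan Lee"] * 2000,
    "units_sold": [1, 2, 3, 4, 5] * 2000,
    "unit_price": [99.99, 199.99, 299.99, 399.99, 499.99] * 2000,
    "total_revenue": [149.99, 349.99, 749.99, 1199.99, 1999.99] * 2000,
})

# a) memory per column in KB; deep=True measures actual string storage
before = df.memory_usage(deep=True)
print((before / 1024).round(1))

# b) object columns with few unique values are good category candidates
candidates = [c for c in df.select_dtypes(include="object").columns
              if df[c].nunique() < 50]
print("Candidates:", candidates)

# c) convert the candidates in place
for col in candidates:
    df[col] = df[col].astype("category")

# d) measure again and report the total savings
after = df.memory_usage(deep=True)
print(f"Saved about {(before.sum() - after.sum()) / 1024:.0f} KB")
```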
Tier 3: Synthesis (Multi-Concept Problems)
These exercises require you to combine multiple concepts to solve a realistic business scenario.
Exercise 3.1 — The New Client File
You receive a CSV export from a client's order management system. The file has problems you need to identify before analysis.
```python
import pandas as pd
from io import StringIO

problematic_csv = """Generated by OrderMaster v3.1
Export timestamp: 2024-01-15 09:23:44
OrderID|CustomerName|OrderDate|ProductCode|Quantity|UnitCost|Status|Region
ORD-10001|Acme Corp|01/03/2023|PROD-001|3|$149.99|fulfilled|Northeast
ORD-10002|GlobalTech|2023-01-05|PROD-002|1|$799.00|Fulfilled|Southwest
ORD-10003|MidWest Supplies|January 8, 2023|PROD-001|5|$149.99|FULFILLED|Midwest
ORD-10004|Acme Corp||PROD-003|2|$299.99|Pending|Northeast
ORD-10005|CoastalCo|01/09/2023|PROD-002|1|$799.00|fulfilled|West
ORD-10001|Acme Corp|01/03/2023|PROD-001|3|$149.99|fulfilled|Northeast
ORD-10006|GlobalTech|2023-01-12|PROD-004|7|$49.99||Southeast
ORD-10007|MidWest Supplies|01/15/2023|PROD-001|2|$149.99|Cancelled|Midwest
"""
```
Your task:
a) Load the CSV correctly (handle the metadata rows, separator, and missing values)
b) Run the complete exploration workflow
c) List every data quality issue you find, organized by category:
   - Missing values
   - Type issues
   - Consistency issues
   - Duplicate records
d) For each issue, suggest the fix you would apply in Chapter 12
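A possible starting point for part (a): loading everything as text first is one defensive strategy, not the only valid one.

```python
import pandas as pd
from io import StringIO

problematic_csv = """Generated by OrderMaster v3.1
Export timestamp: 2024-01-15 09:23:44
OrderID|CustomerName|OrderDate|ProductCode|Quantity|UnitCost|Status|Region
ORD-10001|Acme Corp|01/03/2023|PROD-001|3|$149.99|fulfilled|Northeast
ORD-10002|GlobalTech|2023-01-05|PROD-002|1|$799.00|Fulfilled|Southwest
ORD-10003|MidWest Supplies|January 8, 2023|PROD-001|5|$149.99|FULFILLED|Midwest
ORD-10004|Acme Corp||PROD-003|2|$299.99|Pending|Northeast
ORD-10005|CoastalCo|01/09/2023|PROD-002|1|$799.00|fulfilled|West
ORD-10001|Acme Corp|01/03/2023|PROD-001|3|$149.99|fulfilled|Northeast
ORD-10006|GlobalTech|2023-01-12|PROD-004|7|$49.99||Southeast
ORD-10007|MidWest Supplies|01/15/2023|PROD-001|2|$149.99|Cancelled|Midwest
"""

orders = pd.read_csv(
    StringIO(problematic_csv),
    sep="|",
    skiprows=2,   # the two metadata lines above the header
    dtype=str,    # read everything as text first; fix types in Chapter 12
)
print(orders.shape)
print(orders.isna().sum())        # the empty OrderDate and Status fields
print(orders.duplicated().sum())  # the repeated ORD-10001 record
```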
Exercise 3.2 — Building a Reusable Exploration Function
Write a function called explore_dataset() that:
- Accepts a DataFrame and an optional name parameter
- Runs the complete 8-step exploration workflow
- Returns a dictionary containing: shape, missing_value_report, type_issues, and any categories with inconsistent capitalization
- Prints a nicely formatted report
Test your function on two different DataFrames from this chapter.
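A possible skeleton to build from. The heuristics here are assumptions, not chapter requirements: a regex match flags all-numeric text columns as type issues, and a lowercase collision flags inconsistent capitalization.

```python
import pandas as pd

def explore_dataset(df: pd.DataFrame, name: str = "dataset") -> dict:
    """Explore a DataFrame and return a machine-readable summary."""
    report = {
        "shape": df.shape,
        "missing_value_report": df.isna().sum().to_dict(),
        "type_issues": [],
        "inconsistent_capitalization": {},
    }
    for col in df.select_dtypes(include="object").columns:
        values = df[col].dropna().astype(str)
        if len(values) == 0:
            continue
        # All-numeric text suggests the column should be a numeric dtype
        if values.str.fullmatch(r"-?\d+(\.\d+)?").all():
            report["type_issues"].append(col)
        # Values that collide when lowercased signal inconsistent capitalization
        if values.nunique() != values.str.lower().nunique():
            report["inconsistent_capitalization"][col] = sorted(values.unique())

    print(f"=== {name}: {report['shape'][0]} rows x {report['shape'][1]} columns ===")
    print("Missing values per column:", report["missing_value_report"])
    if report["type_issues"]:
        print("Possible type issues:", report["type_issues"])
    if report["inconsistent_capitalization"]:
        print("Capitalization issues:", report["inconsistent_capitalization"])
    return report
```

Extending it to cover all eight workflow steps (head, describe, duplicates, and so on) is the rest of the exercise.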
Exercise 3.3 — Comparing Two Time Periods
```python
import pandas as pd

h1_data = {
    "month": [1, 2, 3, 4, 5, 6],
    "region": ["Northeast", "Northeast", "Southwest", "Southwest", "Midwest", "Midwest"],
    "revenue": [125000, 118500, 98200, 103400, 142300, 138700],
    "orders": [450, 428, 365, 389, 512, 498],
}
h2_data = {
    "month": [7, 8, 9, 10, 11, 12],
    "region": ["Northeast", "Northeast", "Southwest", "Southwest", "Midwest", "Midwest"],
    "revenue": [134000, 129800, 107200, 112600, 156800, 161400],
    "orders": [485, 471, 412, 431, 548, 562],
}
h1 = pd.DataFrame(h1_data)
h2 = pd.DataFrame(h2_data)
```
a) Explore both DataFrames and verify they have compatible structures
b) Compare total revenue for H1 vs H2
c) Compare average monthly orders for H1 vs H2
d) Which regions appear in both halves? Are their values consistent?
e) Combine the two DataFrames and produce a summary by region
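A possible sketch covering parts (b), (c), and (e), with the DataFrames re-created for self-containment:

```python
import pandas as pd

h1 = pd.DataFrame({
    "month": [1, 2, 3, 4, 5, 6],
    "region": ["Northeast", "Northeast", "Southwest", "Southwest", "Midwest", "Midwest"],
    "revenue": [125000, 118500, 98200, 103400, 142300, 138700],
    "orders": [450, 428, 365, 389, 512, 498],
})
h2 = pd.DataFrame({
    "month": [7, 8, 9, 10, 11, 12],
    "region": ["Northeast", "Northeast", "Southwest", "Southwest", "Midwest", "Midwest"],
    "revenue": [134000, 129800, 107200, 112600, 156800, 161400],
    "orders": [485, 471, 412, 431, 548, 562],
})

# b) total revenue per half
print("H1 revenue:", h1["revenue"].sum())
print("H2 revenue:", h2["revenue"].sum())

# c) average monthly orders per half
print("H1 avg orders:", h1["orders"].mean())
print("H2 avg orders:", h2["orders"].mean())

# e) combine and summarize by region
full_year = pd.concat([h1, h2], ignore_index=True)
summary = full_year.groupby("region")[["revenue", "orders"]].sum()
print(summary)
```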
Tier 4: Analysis (Open-Ended Exploration)
These exercises develop critical thinking about data quality and exploration strategy.
Exercise 4.1 — The Detective Exercise
```python
import pandas as pd

mystery_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2],
    "value_a": [100, 200, 150, 175, None, 225, 190, 210, None, 185, 200],
    "value_b": ["10%", "15%", "10%", "N/A", "20%", "15%", "10%",
                "twenty percent", "15%", "10%", "15%"],
    "category": ["Alpha", "Beta", "ALPHA", "Gamma", "beta", "Alpha",
                 "Gamma", "Beta", "GAMMA", "alpha", "Beta"],
    "date_str": ["2023-01-01", "Jan 2, 2023", "2023-01-03", "01/04/2023",
                 "2023-01-05", "January 6", "2023-01-07", "2023-01-08",
                 "2023/01/09", "10-01-2023", "Jan 2, 2023"],
})
```
Perform a full exploration of this DataFrame and write a "data quality incident report" that:
a) Lists every issue found, with the column name, issue type, and example values
b) Estimates the impact of each issue if the data were used for analysis without cleaning
c) Prioritizes the issues from most to least critical for a revenue analysis
d) Proposes specific cleaning operations for each issue
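A few probes that surface the buried issues, as a hedged starting point for part (a); the full incident report is up to you:

```python
import pandas as pd

mystery_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2],
    "value_a": [100, 200, 150, 175, None, 225, 190, 210, None, 185, 200],
    "value_b": ["10%", "15%", "10%", "N/A", "20%", "15%", "10%",
                "twenty percent", "15%", "10%", "15%"],
    "category": ["Alpha", "Beta", "ALPHA", "Gamma", "beta", "Alpha",
                 "Gamma", "Beta", "GAMMA", "alpha", "Beta"],
    "date_str": ["2023-01-01", "Jan 2, 2023", "2023-01-03", "01/04/2023",
                 "2023-01-05", "January 6", "2023-01-07", "2023-01-08",
                 "2023/01/09", "10-01-2023", "Jan 2, 2023"],
})

# Duplicate ids hint at a repeated record
print(mystery_df["id"].duplicated().sum())

# Collapsing case shows how many categories there really are
print(mystery_df["category"].str.lower().value_counts())

# value_b mixes percent strings, spelled-out words, and an "N/A" marker
print(mystery_df["value_b"].value_counts())

# date_str mixes several date formats; inspect the raw strings first
print(mystery_df["date_str"].unique())
```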
Exercise 4.2 — Designing an Exploration Checklist
Based on everything you have learned in this chapter, write a data exploration checklist for your team. The checklist should:
- Have at least 12 items
- Specify exactly what command(s) to run for each item
- Note what "passing" looks like vs. what a warning sign looks like
- Be applicable to any tabular dataset, not just sales data
Format it as a Python script that can be run on any DataFrame.
Tier 5: Extension (Beyond the Chapter)
Exercise 5.1 — Large File Strategy
You are told that a client's sales file is approximately 2.5 GB and contains about 15 million rows. You cannot load the entire file into memory on your laptop.
Write a strategy (in both English and Python pseudocode) for:
a) Getting the column names and data types without loading all 15 million rows
b) Getting a representative sample to understand the data distribution
c) Loading only the columns you need for your specific analysis task (revenue by region by month)
d) Handling the case where even your subset of columns is too large
Note: For parts a-c, write working Python code using a small test DataFrame. For part d, research the chunksize parameter of pd.read_csv() and describe how it works.
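For part (d), the core idea behind `chunksize` can be demonstrated on a tiny stand-in file (the data below is invented; a real 2.5 GB file would be passed by path):

```python
import pandas as pd
from io import StringIO

# A tiny stand-in for the huge sales file
big_csv = "region,month,revenue\nWest,1,100\nWest,2,150\nEast,1,200\nEast,2,50\n"

# chunksize makes read_csv return an iterator of DataFrames instead of one
# big frame; each chunk is aggregated and the partial results are combined,
# so only one chunk is ever in memory at a time.
totals = None
for chunk in pd.read_csv(StringIO(big_csv), chunksize=2):
    part = chunk.groupby(["region", "month"])["revenue"].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)

print(totals)
```

This pattern works for any aggregation that can be merged across chunks (sums, counts, min/max); averages need to be carried as separate sum and count partials.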
Exercise 5.2 — URL Data Loading
Write a script that:
a) Loads a CSV from a public URL (you may use a URL from a government data portal or public dataset repository)
b) Runs the complete exploration workflow on it
c) Handles potential failures gracefully with a try/except block that falls back to a local copy
d) Reports on any data quality issues found
Note: If internet access is not available, simulate the exercise by defining your own CSV string with StringIO and pretending it came from a URL.
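The fallback pattern for part (c) can be sketched as follows. The URL, function name, and fallback data are all invented for illustration; a real script would point the fallback at a local file path instead of an inline string.

```python
import pandas as pd
from io import StringIO

# Invented stand-in for a locally cached copy of the remote file
FALLBACK_CSV = "city,population\nSpringfield,30000\nShelbyville,25000\n"

def load_with_fallback(url: str) -> pd.DataFrame:
    """Try the remote URL first; fall back to the local copy on any failure."""
    try:
        return pd.read_csv(url)
    except Exception as exc:  # network errors, HTTP errors, parse errors
        print(f"Remote load failed ({exc!r}); using local fallback.")
        return pd.read_csv(StringIO(FALLBACK_CSV))

# The .invalid top-level domain is reserved and never resolves,
# so this call always exercises the fallback path.
df = load_with_fallback("https://example.invalid/data.csv")
print(df.shape)
```

Catching bare `Exception` is deliberate here: URL loading can fail in many ways (DNS, HTTP status, malformed content), and the goal is a graceful fallback rather than precise error handling.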
Exercise 5.3 — Connecting Exploration to Business Questions
You have been given the following business questions by a VP:
- "Which region had the highest total revenue in 2023?"
- "Who are our top 5 salespeople by units sold?"
- "What is the average discount rate we give on orders over $1,000?"
- "Are there any salespeople who consistently give discounts above 20%?"
Using the Acme sales data from this chapter:
a) For each business question, identify exactly which columns you need
b) Identify any data quality issues in those specific columns that would affect the answer
c) Write the exploration code to verify those columns are clean enough to use
d) Write preliminary analysis code to answer each question (even if the data needs some cleaning first)
Answer Hints (Selected Exercises)
Exercise 1.2 Answers:
1. True
2. False — it means there is no header row; data starts on row 0
3. False — default is 5 rows
4. True
5. False — it works on non-numeric columns too; with mixed dtypes it defaults to numeric only, and describe(include="all") covers every column
6. False — NaN is excluded by default; use dropna=False to include it
7. False — pandas will convert the numbers to strings
8. True
9. True
10. True
Exercise 1.3 Answers:
sep, usecols, tail, isna, nunique, value_counts, dtypes
Exercise 1.4 Answers:
a) 1,500 rows
b) 2 columns (product and revenue)
c) 18 rows (1500 - 1482)
d) Yes — order_date shows datetime64[ns], not object
e) Likely contains alphanumeric IDs (like "ORD-1001") or has leading zeros that must be preserved
f) 1,467 rows