Chapter 11 Exercises: Loading and Exploring Real Business Datasets

These exercises are organized into five tiers of increasing difficulty. Complete the tiers in order — each builds on skills from the previous one. All exercises can be completed using inline data (StringIO or dict-based DataFrames) — no external files required.


Tier 1: Recall (Concept Checks)

These exercises test whether you can identify the correct function, parameter, or method for a given task.


Exercise 1.1 — Match the Parameter

For each scenario, identify which pd.read_csv() parameter you would use.

Scenario Parameter
The file uses semicolons instead of commas _____
The column names are in row 4, not row 0 _____
You want to load only 3 of 20 available columns _____
A product code column has values like "00234" that must keep their leading zeros _____
The date column contains values like "2023-01-15" and you want them as datetime objects _____
The file was created on a French Windows system _____
Your file uses "NULL" and "--" to mean missing _____
You only want to inspect the first 500 rows of a 5-million row file _____

Answers: sep, header, usecols, dtype, parse_dates, encoding, na_values, nrows


Exercise 1.2 — True or False

Answer True or False for each statement. If False, write the correct statement.

  1. pd.read_csv() always uses a comma as the separator unless told otherwise.
  2. Setting header=None means there is no data in the file.
  3. df.head() returns the first 10 rows by default.
  4. .info() shows the number of non-null values per column.
  5. .describe() only works on numeric columns.
  6. .value_counts() includes NaN values in its output by default.
  7. dtype={"id": str} will raise an error if the id column contains numbers.
  8. .nunique() returns the number of unique values in each column.
  9. pd.read_excel() requires the openpyxl package to read .xlsx files.
  10. .memory_usage(deep=True) gives a more accurate estimate than .memory_usage() alone.

Exercise 1.3 — Fill in the Blank

Complete each line of code with the correct method or parameter.

import pandas as pd

# Load a CSV where the separator is a pipe character
df = pd.read_csv("data.csv", _____ = "|")

# Load only columns 'name', 'date', and 'amount'
df = pd.read_csv("data.csv", _____ = ["name", "date", "amount"])

# Show the last 10 rows of a DataFrame
print(df._____(10))

# Count missing values in each column
print(df._____().sum())

# Show the number of unique values per column
print(df._____())

# Show the most frequent values in the 'region' column, including NaN
print(df["region"]._____(dropna=False))

# Check the data type of each column
print(df._____)

Exercise 1.4 — Reading .info() Output

Given this .info() output, answer the questions below.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   order_id      1500 non-null   object
 1   order_date    1500 non-null   datetime64[ns]
 2   customer_id   1500 non-null   object
 3   product       1482 non-null   object
 4   quantity      1500 non-null   int64
 5   revenue       1467 non-null   float64
dtypes: float64(1), int64(1), object(3), datetime64[ns](1)
memory usage: 90.3 KB

a) How many rows does this DataFrame have?
b) How many columns are missing at least one value?
c) How many rows are missing a product value?
d) Was parse_dates=["order_date"] likely used when loading this file? How do you know?
e) What is the most likely reason order_id is object type rather than int64?
f) How many rows have revenue data?


Tier 2: Application (Single-Concept Practice)

These exercises require you to write code that applies one or two concepts from the chapter.


Exercise 2.1 — Loading with the Right Parameters

Create the following CSV data as a string and load it correctly using pd.read_csv() with StringIO.

Report Title: Q1 2023 Regional Sales
Export Date: 2023-04-01
Source: Acme Corp CRM v4.2

region;salesperson;units_sold;sale_date;revenue
Northeast;Jordan Blake;5;2023-01-03;$1249.95
Southwest;Casey Rivera;3;2023-01-05;$899.97
Midwest;Alex Torres;8;2023-01-08;N/A
West;Sarah Kim;2;2023-01-09;$499.98

Requirements:

- Skip the 4 metadata rows at the top
- Use the correct separator (semicolons, not commas)
- Parse sale_date as a datetime
- Treat "N/A" as a missing value
- Keep revenue as a string for now (it has $ signs)

After loading, print the DataFrame and verify with .info().
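If you get stuck, here is one possible parameter combination (a sketch, not the only valid one; skiprows=4 counts the blank line as one of the skipped rows):

```python
import pandas as pd
from io import StringIO

raw = """Report Title: Q1 2023 Regional Sales
Export Date: 2023-04-01
Source: Acme Corp CRM v4.2

region;salesperson;units_sold;sale_date;revenue
Northeast;Jordan Blake;5;2023-01-03;$1249.95
Southwest;Casey Rivera;3;2023-01-05;$899.97
Midwest;Alex Torres;8;2023-01-08;N/A
West;Sarah Kim;2;2023-01-09;$499.98
"""

# skiprows=4 clears the three metadata lines plus the blank line;
# dtype keeps revenue as text so the $ signs survive for later cleaning
df = pd.read_csv(
    StringIO(raw),
    sep=";",
    skiprows=4,
    parse_dates=["sale_date"],
    na_values=["N/A"],
    dtype={"revenue": str},
)

print(df)
df.info()
```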


Exercise 2.2 — Systematic Exploration Drill

Using the following inline data, run the complete 8-step exploration workflow from Section 11.8.

import pandas as pd

data = {
    "invoice_date": ["2023-06-01", "2023-06-03", "2023-06-03", "2023-06-07",
                     "2023-06-10", "2023-06-12", "2023-06-15", None],
    "client":       ["Brightfield LLC", "Nova Dynamics", "Brightfield LLC",
                     "OakPath Media", "Nova Dynamics", "Harrington & Sons",
                     "LivTech Inc.", "OakPath Media"],
    "project_type": ["Strategy", "Technology", "Strategy", "Creative",
                     "technology", "Manufacturing", "Technology", "creative"],
    "hours":        [12.5, 8.0, 6.5, None, 14.0, 20.0, 9.5, 7.0],
    "rate":         [150.0, 175.0, 150.0, 125.0, 175.0, 150.0, 175.0, 125.0],
    "billed":       [1875.0, 1400.0, 975.0, None, 2450.0, 3000.0, 1662.5, 875.0]
}

df = pd.DataFrame(data)
df["invoice_date"] = pd.to_datetime(df["invoice_date"])

Document at least 3 data quality issues you find.
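If you do not have Section 11.8 in front of you, the sequence below is one plausible reconstruction of the eight steps (the numbering is our own; run the same calls against the exercise data):

```python
import pandas as pd

# A small stand-in frame to demonstrate the sequence
df = pd.DataFrame({
    "client": ["Brightfield LLC", "Nova Dynamics", "Brightfield LLC"],
    "project_type": ["Strategy", "technology", "Strategy"],
    "hours": [12.5, None, 6.5],
})

print(df.shape)                                        # 1. dimensions
print(df.head())                                       # 2. first rows
print(df.dtypes)                                       # 3. column types
df.info()                                              # 4. non-null counts, memory
print(df.isna().sum())                                 # 5. missing values
print(df.describe(include="all"))                      # 6. summary statistics
print(df.nunique())                                    # 7. unique values
print(df["project_type"].value_counts(dropna=False))   # 8. category frequencies
```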


Exercise 2.3 — Excel Sheet Exploration

Using the following code to simulate an Excel workbook scenario, demonstrate how to load specific sheets and explore each one.

import pandas as pd

# Simulate two sheets of data
q1_data = {
    "month": ["January", "February", "March"],
    "region": ["Northeast", "Southwest", "Midwest"],
    "revenue": [125000.00, 98500.00, 142300.00],
    "units": [450, 380, 512]
}

q2_data = {
    "month": ["April", "May", "June"],
    "region": ["Northeast", "Southwest", "Midwest"],
    "revenue": [134000.00, 107200.00, 156800.00],
    "units": [485, 412, 548]
}

q1 = pd.DataFrame(q1_data)
q2 = pd.DataFrame(q2_data)

Write code that:

a) Prints the shape of each "sheet" (DataFrame)
b) Compares total revenue between Q1 and Q2
c) Checks whether both DataFrames have the same columns
d) Combines them into a single DataFrame and inspects the result
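One way to approach it (a sketch; pd.concat with ignore_index is the standard idiom for stacking same-shaped sheets):

```python
import pandas as pd

q1 = pd.DataFrame({
    "month": ["January", "February", "March"],
    "region": ["Northeast", "Southwest", "Midwest"],
    "revenue": [125000.00, 98500.00, 142300.00],
    "units": [450, 380, 512],
})
q2 = pd.DataFrame({
    "month": ["April", "May", "June"],
    "region": ["Northeast", "Southwest", "Midwest"],
    "revenue": [134000.00, 107200.00, 156800.00],
    "units": [485, 412, 548],
})

print(q1.shape, q2.shape)                        # a) shape of each "sheet"
print(q1["revenue"].sum(), q2["revenue"].sum())  # b) Q1 vs Q2 revenue
print(list(q1.columns) == list(q2.columns))      # c) identical columns?

# d) stack vertically; ignore_index renumbers the rows 0..5
combined = pd.concat([q1, q2], ignore_index=True)
print(combined.shape)
combined.info()
```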


Exercise 2.4 — Missing Value Investigation

import pandas as pd
import numpy as np

employee_data = {
    "employee_id": ["E001", "E002", "E003", "E004", "E005",
                    "E006", "E007", "E008", "E009", "E010"],
    "name":        ["Aisha Patel", "Bruno Santos", "Clara Kim",
                    "Derek Jones", "Elena Vasquez", "Frank Chen",
                    "Grace Liu", "Hassan Al-Farsi", "Irene Wong", "James Park"],
    "department":  ["Sales", "HR", "Sales", None, "IT",
                    "HR", None, "Sales", "IT", "Finance"],
    "salary":      [72000, 58000, 68000, 81000, 95000,
                    None, 74000, 69000, 91000, 88000],
    "hire_date":   ["2019-03-01", "2020-07-15", "2018-11-01", "2021-01-10",
                    "2017-06-20", "2022-09-01", "2023-02-14", None,
                    "2019-08-05", "2020-12-01"],
    "performance": ["Exceeds", "Meets", "Meets", "Exceeds", None,
                    "Meets", "Below", "Exceeds", None, "Meets"]
}

df = pd.DataFrame(employee_data)

Write code to:

a) Count missing values in each column
b) Show the percentage of missing values per column
c) Display the rows where department is missing
d) Display the rows where either salary OR performance is missing
e) Report the total number of cells in the entire DataFrame that contain missing values
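A possible solution sketch (one combination of isna() with boolean indexing; other formulations work equally well):

```python
import pandas as pd

df = pd.DataFrame({
    "employee_id": ["E001", "E002", "E003", "E004", "E005",
                    "E006", "E007", "E008", "E009", "E010"],
    "name":        ["Aisha Patel", "Bruno Santos", "Clara Kim",
                    "Derek Jones", "Elena Vasquez", "Frank Chen",
                    "Grace Liu", "Hassan Al-Farsi", "Irene Wong", "James Park"],
    "department":  ["Sales", "HR", "Sales", None, "IT",
                    "HR", None, "Sales", "IT", "Finance"],
    "salary":      [72000, 58000, 68000, 81000, 95000,
                    None, 74000, 69000, 91000, 88000],
    "hire_date":   ["2019-03-01", "2020-07-15", "2018-11-01", "2021-01-10",
                    "2017-06-20", "2022-09-01", "2023-02-14", None,
                    "2019-08-05", "2020-12-01"],
    "performance": ["Exceeds", "Meets", "Meets", "Exceeds", None,
                    "Meets", "Below", "Exceeds", None, "Meets"],
})

print(df.isna().sum())                                     # a) count per column
print((df.isna().mean() * 100).round(1))                   # b) percent per column
print(df[df["department"].isna()])                         # c) department missing
print(df[df["salary"].isna() | df["performance"].isna()])  # d) salary OR performance
print(df.isna().sum().sum())                               # e) total missing cells
```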


Exercise 2.5 — Memory Usage Analysis

Load the following data and analyze memory usage:

import pandas as pd

sales_data = {
    "transaction_id": range(1, 10001),
    "region":         (["Northeast", "Southwest", "Midwest", "Southeast", "West"] * 2000),
    "category":       (["Office Chairs", "Desks", "Technology", "Storage"] * 2500),
    "salesperson":    (["Jordan Blake", "Casey Rivera", "Alex Torres",
                        "Sarah Kim", "Morgan Lee"] * 2000),
    "units_sold":     ([1, 2, 3, 4, 5] * 2000),
    "unit_price":     ([99.99, 199.99, 299.99, 399.99, 499.99] * 2000),
    "total_revenue":  ([149.99, 349.99, 749.99, 1199.99, 1999.99] * 2000),
}

df = pd.DataFrame(sales_data)

Write code to:

a) Show memory usage per column (in kilobytes)
b) Identify which columns are candidates for category dtype conversion
c) Convert the candidates to category dtype
d) Show the memory usage again and calculate the total savings in kilobytes
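One possible approach (a sketch; the rule of thumb is that object columns with few unique values relative to their length are the best category candidates):

```python
import pandas as pd

df = pd.DataFrame({
    "transaction_id": range(1, 10001),
    "region":        ["Northeast", "Southwest", "Midwest", "Southeast", "West"] * 2000,
    "category":      ["Office Chairs", "Desks", "Technology", "Storage"] * 2500,
    "salesperson":   ["Jordan Blake", "Casey Rivera", "Alex Torres",
                      "Sarah Kim", "Morgan Lee"] * 2000,
    "units_sold":    [1, 2, 3, 4, 5] * 2000,
    "unit_price":    [99.99, 199.99, 299.99, 399.99, 499.99] * 2000,
    "total_revenue": [149.99, 349.99, 749.99, 1199.99, 1999.99] * 2000,
})

# a) per-column memory in KB; deep=True measures the actual string payloads
before = df.memory_usage(deep=True) / 1024
print(before.round(1))

# b) object columns with few unique values are good category candidates
print(df.select_dtypes("object").nunique())

# c) convert the candidates
for col in ["region", "category", "salesperson"]:
    df[col] = df[col].astype("category")

# d) measure again and report the savings
after = df.memory_usage(deep=True) / 1024
print(f"saved {(before - after).sum():.1f} KB")
```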


Tier 3: Synthesis (Multi-Concept Problems)

These exercises require you to combine multiple concepts to solve a realistic business scenario.


Exercise 3.1 — The New Client File

You receive a CSV export from a client's order management system. The file has problems you need to identify before analysis.

import pandas as pd
from io import StringIO

problematic_csv = """Generated by OrderMaster v3.1
Export timestamp: 2024-01-15 09:23:44

OrderID|CustomerName|OrderDate|ProductCode|Quantity|UnitCost|Status|Region
ORD-10001|Acme Corp|01/03/2023|PROD-001|3|$149.99|fulfilled|Northeast
ORD-10002|GlobalTech|2023-01-05|PROD-002|1|$799.00|Fulfilled|Southwest
ORD-10003|MidWest Supplies|January 8, 2023|PROD-001|5|$149.99|FULFILLED|Midwest
ORD-10004|Acme Corp||PROD-003|2|$299.99|Pending|Northeast
ORD-10005|CoastalCo|01/09/2023|PROD-002|1|$799.00|fulfilled|West
ORD-10001|Acme Corp|01/03/2023|PROD-001|3|$149.99|fulfilled|Northeast
ORD-10006|GlobalTech|2023-01-12|PROD-004|7|$49.99||Southeast
ORD-10007|MidWest Supplies|01/15/2023|PROD-001|2|$149.99|Cancelled|Midwest
"""

Your task:

a) Load the CSV correctly (handle the metadata rows, separator, and missing values)
b) Run the complete exploration workflow
c) List every data quality issue you find, organized by category:
   - Missing values
   - Type issues
   - Consistency issues
   - Duplicate records
d) For each issue, suggest the fix you would apply in Chapter 12
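A starting point for part (a) (the string is repeated here so the snippet runs standalone; skiprows=3 assumes the blank line counts as a skipped row):

```python
import pandas as pd
from io import StringIO

problematic_csv = """Generated by OrderMaster v3.1
Export timestamp: 2024-01-15 09:23:44

OrderID|CustomerName|OrderDate|ProductCode|Quantity|UnitCost|Status|Region
ORD-10001|Acme Corp|01/03/2023|PROD-001|3|$149.99|fulfilled|Northeast
ORD-10002|GlobalTech|2023-01-05|PROD-002|1|$799.00|Fulfilled|Southwest
ORD-10003|MidWest Supplies|January 8, 2023|PROD-001|5|$149.99|FULFILLED|Midwest
ORD-10004|Acme Corp||PROD-003|2|$299.99|Pending|Northeast
ORD-10005|CoastalCo|01/09/2023|PROD-002|1|$799.00|fulfilled|West
ORD-10001|Acme Corp|01/03/2023|PROD-001|3|$149.99|fulfilled|Northeast
ORD-10006|GlobalTech|2023-01-12|PROD-004|7|$49.99||Southeast
ORD-10007|MidWest Supplies|01/15/2023|PROD-001|2|$149.99|Cancelled|Midwest
"""

# skiprows=3 clears the two metadata lines and the blank line;
# empty fields (||) become NaN automatically
df = pd.read_csv(StringIO(problematic_csv), sep="|", skiprows=3)

print(df.shape)
print(df.duplicated().sum())   # exact duplicate rows
df.info()
```

From here, the exploration workflow should surface the mixed date formats, the $ prefixes in UnitCost, and the inconsistent Status capitalization.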


Exercise 3.2 — Building a Reusable Exploration Function

Write a function called explore_dataset() that:

- Accepts a DataFrame and an optional name parameter
- Runs the complete 8-step exploration workflow
- Returns a dictionary containing: shape, missing_value_report, type_issues, and any categories with inconsistent capitalization
- Prints a nicely formatted report

Test your function on two different DataFrames from this chapter.
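One possible skeleton to build from (the heuristics for type issues and capitalization below are our own illustrations, not the only valid ones):

```python
import pandas as pd

def explore_dataset(df, name="dataset"):
    """Run a basic exploration pass and return a summary dict."""
    print(f"=== Exploring {name} ===")
    print(df.shape)
    df.info()

    missing = df.isna().sum()

    # columns that look numeric but arrived as object dtype
    # (every non-null value is made of digits, ., ,, $, or %)
    type_issues = []
    for col in df.select_dtypes("object"):
        sample = df[col].dropna()
        if not sample.empty and sample.str.fullmatch(r"[\d.,$%]+").all():
            type_issues.append(col)

    # object columns whose values collide once lowercased
    inconsistent = [
        col for col in df.select_dtypes("object")
        if df[col].dropna().str.lower().nunique() < df[col].dropna().nunique()
    ]

    return {
        "shape": df.shape,
        "missing_value_report": missing[missing > 0].to_dict(),
        "type_issues": type_issues,
        "inconsistent_capitalization": inconsistent,
    }
```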


Exercise 3.3 — Comparing Two Time Periods

import pandas as pd

h1_data = {
    "month":   [1, 2, 3, 4, 5, 6],
    "region":  ["Northeast", "Northeast", "Southwest", "Southwest", "Midwest", "Midwest"],
    "revenue": [125000, 118500, 98200, 103400, 142300, 138700],
    "orders":  [450, 428, 365, 389, 512, 498]
}

h2_data = {
    "month":   [7, 8, 9, 10, 11, 12],
    "region":  ["Northeast", "Northeast", "Southwest", "Southwest", "Midwest", "Midwest"],
    "revenue": [134000, 129800, 107200, 112600, 156800, 161400],
    "orders":  [485, 471, 412, 431, 548, 562]
}

h1 = pd.DataFrame(h1_data)
h2 = pd.DataFrame(h2_data)

a) Explore both DataFrames and verify they have compatible structures
b) Compare total revenue for H1 vs H2
c) Compare average monthly orders for H1 vs H2
d) Which regions appear in both halves? Are the values consistent?
e) Combine the two DataFrames and produce a summary by region
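A possible solution sketch (set intersection for part d and a groupby for part e are one idiom among several):

```python
import pandas as pd

h1 = pd.DataFrame({
    "month":   [1, 2, 3, 4, 5, 6],
    "region":  ["Northeast", "Northeast", "Southwest", "Southwest", "Midwest", "Midwest"],
    "revenue": [125000, 118500, 98200, 103400, 142300, 138700],
    "orders":  [450, 428, 365, 389, 512, 498],
})
h2 = pd.DataFrame({
    "month":   [7, 8, 9, 10, 11, 12],
    "region":  ["Northeast", "Northeast", "Southwest", "Southwest", "Midwest", "Midwest"],
    "revenue": [134000, 129800, 107200, 112600, 156800, 161400],
    "orders":  [485, 471, 412, 431, 548, 562],
})

print(list(h1.columns) == list(h2.columns))       # a) compatible structure?
print(h1["revenue"].sum(), h2["revenue"].sum())   # b) total revenue H1 vs H2
print(h1["orders"].mean(), h2["orders"].mean())   # c) average monthly orders
print(set(h1["region"]) & set(h2["region"]))      # d) regions present in both

# e) combine and summarize by region
full_year = pd.concat([h1, h2], ignore_index=True)
print(full_year.groupby("region")[["revenue", "orders"]].sum())
```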


Tier 4: Analysis (Open-Ended Exploration)

These exercises develop critical thinking about data quality and exploration strategy.


Exercise 4.1 — The Detective Exercise

import pandas as pd

mystery_df = pd.DataFrame({
    "id":        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2],
    "value_a":   [100, 200, 150, 175, None, 225, 190, 210, None, 185, 200],
    "value_b":   ["10%", "15%", "10%", "N/A", "20%", "15%", "10%",
                  "twenty percent", "15%", "10%", "15%"],
    "category":  ["Alpha", "Beta", "ALPHA", "Gamma", "beta", "Alpha",
                  "Gamma", "Beta", "GAMMA", "alpha", "Beta"],
    "date_str":  ["2023-01-01", "Jan 2, 2023", "2023-01-03", "01/04/2023",
                  "2023-01-05", "January 6", "2023-01-07", "2023-01-08",
                  "2023/01/09", "10-01-2023", "Jan 2, 2023"]
})

Perform a full exploration of this DataFrame and write a "data quality incident report" that:

a) Lists every issue found, with the column name, issue type, and example values
b) Estimates the impact of each issue if the data were used for analysis without cleaning
c) Prioritizes the issues from most to least critical for a revenue analysis
d) Proposes specific cleaning operations for each issue


Exercise 4.2 — Designing an Exploration Checklist

Based on everything you have learned in this chapter, write a data exploration checklist for your team. The checklist should:

- Have at least 12 items
- Specify exactly what command(s) to run for each item
- Note what "passing" looks like vs. what a warning sign looks like
- Be applicable to any tabular dataset, not just sales data

Format it as a Python script that can be run on any DataFrame.


Tier 5: Extension (Beyond the Chapter)


Exercise 5.1 — Large File Strategy

You are told that a client's sales file is approximately 2.5 GB and contains about 15 million rows. You cannot load the entire file into memory on your laptop.

Write a strategy (in both English and Python pseudocode) for:

a) Getting the column names and data types without loading all 15 million rows
b) Getting a representative sample to understand the data distribution
c) Loading only the columns you need for your specific analysis task (revenue by region by month)
d) Handling the case where even your subset of columns is too large

Note: For parts a-c, write working Python code using a small test DataFrame. For part d, research the chunksize parameter of pd.read_csv() and describe how it works.
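To get you started on part (d), here is a toy demonstration of chunksize (the StringIO object stands in for the 2.5 GB file; with a real file you would pass its path):

```python
import pandas as pd
from io import StringIO

# Tiny stand-in for the huge file
big_file = StringIO(
    "region,month,revenue,notes\n"
    "Northeast,1,125000,a\n"
    "Southwest,1,98200,b\n"
    "Northeast,2,118500,c\n"
    "Midwest,2,142300,d\n"
)

# chunksize makes read_csv return an iterator of DataFrames, so only one
# chunk is in memory at a time; usecols drops the columns we don't need
totals = None
reader = pd.read_csv(big_file, usecols=["region", "month", "revenue"], chunksize=2)
for chunk in reader:
    part = chunk.groupby(["region", "month"])["revenue"].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)

print(totals)
```

The aggregate built this way matches what a single groupby on the full file would produce, because revenue sums can be accumulated chunk by chunk.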


Exercise 5.2 — URL Data Loading

Write a script that:

a) Loads a CSV from a public URL (you may use a URL from a government data portal or public dataset repository)
b) Runs the complete exploration workflow on it
c) Handles potential failures gracefully with a try/except block that falls back to a local copy
d) Reports on any data quality issues found

Note: If internet access is not available, simulate the exercise by defining your own CSV string with StringIO and pretending it came from a URL.
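A sketch of the fallback pattern for part (c) (the URL, column names, and local data here are made up for illustration; the .invalid domain is guaranteed never to resolve, so the except branch always runs):

```python
import pandas as pd
from io import StringIO

# Local fallback data, standing in for a cached copy of the remote file
LOCAL_COPY = "city,population\nSpringfield,167000\nRiverton,54000\n"

def load_with_fallback(url):
    """Try the remote CSV first; on any failure, use the local copy."""
    try:
        return pd.read_csv(url)
    except Exception as exc:   # network errors, 404s, parse failures...
        print(f"Remote load failed ({exc!r}); using local copy")
        return pd.read_csv(StringIO(LOCAL_COPY))

# A deliberately unreachable URL to exercise the fallback path
df = load_with_fallback("https://example.invalid/data.csv")
print(df.shape)
```

Catching the broad Exception is acceptable here because any failure mode should trigger the same fallback; in production you might log the specific exception type.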


Exercise 5.3 — Connecting Exploration to Business Questions

You have been given the following business questions by a VP:

  1. "Which region had the highest total revenue in 2023?"
  2. "Who are our top 5 salespeople by units sold?"
  3. "What is the average discount rate we give on orders over $1,000?"
  4. "Are there any salespeople who consistently give discounts above 20%?"

Using the Acme sales data from this chapter:

a) For each business question, identify exactly which columns you need
b) Identify any data quality issues in those specific columns that would affect the answer
c) Write the exploration code to verify those columns are clean enough to use
d) Write preliminary analysis code to answer each question (even if the data needs some cleaning first)


Answer Hints (Selected Exercises)

Exercise 1.2 Answers:

  1. True
  2. False — it means there is no header row; data starts on row 0
  3. False — default is 5 rows
  4. True
  5. False — describe(include="all") includes non-numeric columns
  6. False — NaN is excluded by default; use dropna=False to include it
  7. False — pandas will convert the numbers to strings
  8. True
  9. True
  10. True

Exercise 1.3 Answers: sep, usecols, tail, isna, nunique, value_counts, dtypes

Exercise 1.4 Answers: a) 1,500 rows b) 2 columns (product and revenue) c) 18 rows (1500 - 1482) d) Yes — order_date shows datetime64[ns], not object e) Likely contains alphanumeric IDs (like "ORD-1001") or has leading zeros that must be preserved f) 1,467 rows