Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
How to use these exercises: Work through the sections in order. Parts A-D focus on Chapter 11 material, building from recall to original analysis. Part M mixes in concepts from earlier chapters. You'll need Python, pandas, and numpy for most problems.
Difficulty key: 1-star: Foundational | 2-star: Intermediate | 3-star: Advanced | 4-star: Extension
Part A: Conceptual Understanding (1-star)
These questions check whether you absorbed the core ideas from the chapter.
Exercise 11.1 — Why dates need parsing
Explain why sorting a column of date strings in "MM/DD/YYYY" format produces incorrect results. Give a specific example of two dates that would sort wrong as strings but correctly as datetime objects.
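You can check the behavior directly. The pair below is chosen to show the inversion:

```python
import pandas as pd

dates = ["12/01/2023", "2/15/2023"]  # December 1 vs. February 15

# As strings, "12/01/2023" sorts first because "1" < "2" character by character
print(sorted(dates))  # ['12/01/2023', '2/15/2023']

# After parsing, chronological order puts February first
parsed = pd.to_datetime(pd.Series(dates), format="%m/%d/%Y")
print(parsed.sort_values().dt.strftime("%m/%d/%Y").tolist())  # ['02/15/2023', '12/01/2023']
```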
Guidance
String sorting is lexicographic (character by character, left to right). The string "12/01/2023" (December 1) comes before "2/15/2023" (February 15) because "1" < "2" in character comparison. But chronologically, February comes before December. After parsing to datetime, pandas sorts by actual chronological order. Any pair where a double-digit month (10, 11, 12) sorts before a single-digit month illustrates this.
Exercise 11.2 — Format codes
Write the format string for each of these date formats. Do not run any code — write them from memory.
"2023-03-15"(ISO format)"03/15/2023"(US format)"15-Mar-2023"(abbreviated month)"March 15, 2023 at 2:30 PM""Wed, 15 Mar 2023 14:30:00"
Guidance
1. `"%Y-%m-%d"` 2. `"%m/%d/%Y"` 3. `"%d-%b-%Y"` 4. `"%B %d, %Y at %I:%M %p"` 5. `"%a, %d %b %Y %H:%M:%S"` Remember: `%Y` is 4-digit year, `%y` is 2-digit. `%B` is full month name, `%b` is abbreviated. `%H` is 24-hour, `%I` is 12-hour (needs `%p` for AM/PM).Exercise 11.3 — Resampling vs. groupby
Explain the relationship between .resample() and .groupby(). In what situation would you use resample instead of groupby? Can you achieve the same result with groupby?
Guidance
`.resample()` is essentially a time-aware version of `.groupby()`. It groups data by time periods (weeks, months, quarters) and applies aggregation functions. You could achieve similar results with `groupby(df["date"].dt.month)`, but `resample` is more convenient because: (1) it works on a DatetimeIndex, which enforces proper date handling; (2) it handles edge cases like incomplete periods; (3) it supports upsampling; and (4) it uses standardized frequency aliases. Note also that grouping by `.dt.month` pools the same month across different years, whereas resampling keeps each year's months separate. Use `resample` when working with time series; use `groupby` when grouping by non-time categories.
Exercise 11.4 — Rolling windows explained
Explain in your own words what a 7-day rolling average is. Why is it so commonly used for daily data? What trade-off does the window size control?
Guidance
A 7-day rolling average replaces each day's value with the average of that day and the 6 preceding days. It's common because 7 days = one week, so it smooths out the day-of-week effect (weekdays vs. weekends). The window size controls a trade-off: smaller windows are more responsive to recent changes but noisier; larger windows are smoother but slower to react to real changes. A 3-day window is responsive but still shows fluctuations; a 30-day window is very smooth but might hide short-term trends.
Exercise 11.5 — Timedelta vs. DateOffset
Explain the difference between pd.Timedelta(days=30) and pd.DateOffset(months=1). Give a specific example where they produce different results when added to the same date.
Guidance
`Timedelta(days=30)` adds exactly 30 days — it doesn't know anything about months or calendars. `DateOffset(months=1)` adds one calendar month — it understands that months have different lengths. Example: starting from January 31: `Timestamp("2023-01-31") + Timedelta(days=30)` gives March 2 (30 days later), while `Timestamp("2023-01-31") + DateOffset(months=1)` gives February 28 (one calendar month later, adjusted because February doesn't have 31 days).
Part B: Applied Problems (2-star)
Exercise 11.6 — Parsing a messy date column
Parse the following Series into datetime objects. Handle errors gracefully.
dates = pd.Series([
"2023-03-15",
"03/20/2023",
"April 1, 2023",
"N/A",
"2023-05-10",
"not recorded",
"06-15-2023",
"",
"2023/07/04"
])
After parsing, report how many valid dates you have and how many became NaT.
Guidance
Use `pd.to_datetime(dates, errors="coerce")` for a first pass (in pandas 2.x, add `format="mixed"` so each entry is parsed individually). This will handle the common formats automatically and convert unparseable entries ("N/A", "not recorded", "") to `NaT`. Count valid dates with `.notna().sum()` and `NaT` values with `.isna().sum()`. You should get 6 valid dates and 3 `NaT` values.
Exercise 11.7 — Date arithmetic
Create a DataFrame with the following columns: patient_id, first_dose_date, second_dose_date. Include at least 6 patients with realistic dates. Then compute:
- The number of days between doses for each patient
- The average gap between doses
- Which patients waited more than 28 days
- A `booster_due` column showing 180 days after the second dose
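One possible setup (patient IDs, dates, and gaps are invented for illustration):

```python
import pandas as pd

# Hypothetical patient records
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "first_dose_date": pd.to_datetime([
        "2023-01-05", "2023-01-12", "2023-02-01",
        "2023-02-10", "2023-03-03", "2023-03-15"]),
})
df["second_dose_date"] = df["first_dose_date"] + pd.to_timedelta(
    [21, 28, 35, 42, 21, 30], unit="D")

# Gap in days, average gap, late patients, booster date
df["gap_days"] = (df["second_dose_date"] - df["first_dose_date"]).dt.days
late = df[df["gap_days"] > 28]
df["booster_due"] = df["second_dose_date"] + pd.Timedelta(days=180)
print(df["gap_days"].mean())  # 29.5
```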
Guidance
Subtract the date columns: `df["gap"] = df["second_dose_date"] - df["first_dose_date"]`. Get days with `.dt.days`. For the booster date: `df["booster_due"] = df["second_dose_date"] + pd.Timedelta(days=180)`. The average gap: `df["gap"].dt.days.mean()`.
Exercise 11.8 — Building and using a DatetimeIndex
Create a time series of daily temperature readings for one month (30 days). Use pd.date_range() to generate the dates and np.random.normal(20, 5, 30) for temperatures. Then:
- Set the date as the index
- Select all data from the second week (days 8-14)
- Find the hottest and coldest days
- Compute the 5-day rolling average
Guidance
dates = pd.date_range("2023-06-01", periods=30, freq="D")
temps = pd.Series(np.random.normal(20, 5, 30), index=dates, name="temp_c")
Select with partial string: `temps["2023-06-08":"2023-06-14"]`. Hottest: `temps.idxmax()`; coldest: `temps.idxmin()`. Rolling: `temps.rolling(5).mean()`.
Exercise 11.9 — Resampling daily to weekly and monthly
Using the daily vaccination data below, compute:
- Weekly total vaccinations (summing daily values)
- Monthly average daily vaccinations
- The month with the highest total vaccinations
- The week-over-week percentage change in total vaccinations
import numpy as np
np.random.seed(42)
dates = pd.date_range("2023-01-01", "2023-06-30", freq="D")
daily = pd.Series(
np.random.normal(5000, 800, len(dates)).clip(min=0).round(0),
index=dates, name="vaccinations")
Guidance
Weekly total: `daily.resample("W").sum()`. Monthly average: `daily.resample("ME").mean()` (in pandas older than 2.2, use `"M"` instead of `"ME"`). Month with highest total: `daily.resample("ME").sum().idxmax()`. Week-over-week change: `weekly = daily.resample("W").sum(); weekly.pct_change() * 100`.
Exercise 11.10 — Rolling vs. expanding
Using the same daily vaccination data from Exercise 11.9:
- Compute the 7-day rolling mean
- Compute the expanding (cumulative) mean
- Compare the two on day 30 — which is higher? Explain why.
- Compute the 7-day rolling standard deviation. What does a high standard deviation indicate?
Guidance
Rolling: `daily.rolling(7).mean()`. Expanding: `daily.expanding().mean()`. On day 30, the expanding mean is the average of all 30 days, while the rolling mean is the average of only the most recent 7 days. Whether rolling or expanding is higher depends on whether recent days were above or below the overall average. The rolling standard deviation measures recent volatility — a high value means daily vaccinations are fluctuating widely.
Exercise 11.11 — Extracting date components for analysis
Load or create a DataFrame with a date column spanning at least one full year. Add columns for: year, month, day of week (as name), quarter, and a boolean is_weekend. Then use groupby to answer:
- Which day of the week has the most/fewest events?
- Which quarter has the most events?
- What's the weekend vs. weekday ratio?
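A minimal sketch with a synthetic events column (the Poisson counts and seed are arbitrary choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
df = pd.DataFrame({"date": dates, "events": rng.poisson(10, len(dates))})

# Extract the components with .dt accessors
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_name"] = df["date"].dt.day_name()
df["quarter"] = df["date"].dt.quarter
df["is_weekend"] = df["date"].dt.dayofweek >= 5

by_day = df.groupby("day_name")["events"].sum()
busiest_day = by_day.idxmax()
by_quarter = df.groupby("quarter")["events"].sum()
counts = df.groupby("is_weekend")["events"].sum()
weekend_ratio = counts.loc[True] / counts.loc[False]
```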
Guidance
Use `.dt.year`, `.dt.month`, `.dt.day_name()`, `.dt.quarter`, `.dt.dayofweek >= 5`. Group by each component and count or sum. The weekend ratio: `df.groupby("is_weekend").size()` gives you counts, then divide weekend by weekday.
Exercise 11.12 — date_range for scheduling
Use pd.date_range() to generate:
- Every business day (Monday-Friday) in March 2023
- The first day of each quarter in 2023
- Every other Wednesday from January to June 2023
- The last business day of each month in 2023
Guidance
1. `pd.date_range("2023-03-01", "2023-03-31", freq="B")` — "B" is business day 2. `pd.date_range("2023-01-01", periods=4, freq="QS")` — "QS" is quarter start 3. `pd.date_range("2023-01-01", "2023-06-30", freq="2W-WED")` — every 2 weeks anchored to Wednesday 4. `pd.date_range("2023-01-01", "2023-12-31", freq="BME")` — "BME" is business month end (or use "BM" in older pandas)Exercise 11.13 — Time zones
Create a Timestamp for "2023-07-04 12:00:00" in US/Eastern timezone. Then:
- Convert it to UTC
- Convert it to US/Pacific
- Convert it to Europe/London
- Explain why Europe/London is NOT the same as UTC on July 4th
Guidance
ts = pd.Timestamp("2023-07-04 12:00:00", tz="US/Eastern")
ts.tz_convert("UTC") # 16:00 UTC
ts.tz_convert("US/Pacific") # 09:00 Pacific
ts.tz_convert("Europe/London") # 17:00 London
London is UTC+1 in summer (BST — British Summer Time), not UTC+0. This is why timezone names (which handle daylight saving) are preferred over fixed UTC offsets.
Part C: Real-World Scenarios (3-star)
Exercise 11.14 — Sales time series analysis
Create a simulated dataset of daily coffee shop sales for one year (365 days) with these characteristics:
- Base sales around $800/day
- A gradual upward trend
- Higher sales on weekdays than weekends
- Seasonal bump in winter (November-February)
- Random noise
Then perform a complete time series analysis:
1. Plot (describe or print) the daily, 7-day rolling, and 30-day rolling averages
2. Compute monthly total revenue
3. Identify the best and worst weeks
4. Calculate the weekend effect (percentage drop on weekends)
Guidance
Build the simulation by adding components: `baseline + trend + weekend_effect + seasonal + noise`. For the weekend effect, subtract $200 on Saturday/Sunday using `np.where(dates.dayofweek >= 5, -200, 0)`. For seasonality, use a sinusoidal pattern peaking in January. After creating the series, apply `resample` and `rolling` as practiced in the chapter.
Exercise 11.15 — Pandemic timeline analysis
Create a DataFrame with daily COVID-19 case counts for a fictional city over 6 months. Include a realistic wave pattern (exponential growth, peak, decline). Then:
- Compute the 7-day rolling average (what news outlets report)
- Compute the doubling time during the growth phase (how many days for cases to double)
- Identify the peak date (when the 7-day average was highest)
- Compute the 14-day change (current 7-day avg vs. 14 days ago) to show acceleration/deceleration
- Resample to weekly data and compute week-over-week percentage change
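A sketch of the wave and the requested statistics. The growth and decay constants are arbitrary; a growth term times a decay term produces a single rise-and-fall wave:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
days = np.arange(180)
dates = pd.date_range("2023-01-01", periods=180, freq="D")

# Exponential growth damped by a quadratic decay; multiplicative noise
wave = 100 * np.exp(0.08 * days) * np.exp(-0.0005 * days**2)
cases = pd.Series(wave * rng.lognormal(0, 0.1, 180), index=dates).round()

rolling_7 = cases.rolling(7).mean()      # what news outlets report
peak_date = rolling_7.idxmax()
change_14 = rolling_7.diff(14)           # positive = accelerating
weekly = cases.resample("W").sum()
wow_pct = weekly.pct_change() * 100      # week-over-week % change
```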
Guidance
For the wave pattern, use something like: `cases = (1000 * np.exp(0.05 * days) * np.exp(-0.001 * days**2))` with noise added. Doubling time: find periods where `rolling_avg / rolling_avg.shift(n)` is approximately 2 and compute `n`. The 14-day change: `rolling_7.diff(14)`. Peak: `rolling_7.idxmax()`.
Exercise 11.16 — Handling multiple date formats
You receive three CSV files from different international partners, each with date columns in different formats:
- File A (US): dates like "03/15/2023"
- File B (Europe): dates like "15.03.2023"
- File C (Japan): dates like "2023/03/15"
Write code to:
1. Parse each file's dates with the correct format string
2. Combine all three into a single DataFrame with consistent datetime dates
3. Verify there are no parsing errors by checking for NaT values
4. Sort the combined data chronologically
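A sketch using inline stand-ins for the three files (the column name `date` and the sample values are assumptions):

```python
import pandas as pd

# Simulated stand-ins for the three partner files
file_a = pd.DataFrame({"date": ["03/15/2023", "12/01/2023"]})
file_b = pd.DataFrame({"date": ["15.03.2023", "01.12.2023"]})
file_c = pd.DataFrame({"date": ["2023/03/15", "2023/12/01"]})

# Parse each with its own explicit format string
file_a["date"] = pd.to_datetime(file_a["date"], format="%m/%d/%Y")
file_b["date"] = pd.to_datetime(file_b["date"], format="%d.%m.%Y")
file_c["date"] = pd.to_datetime(file_c["date"], format="%Y/%m/%d")

combined = pd.concat([file_a, file_b, file_c], ignore_index=True)
assert combined["date"].isna().sum() == 0   # no parsing failures
combined = combined.sort_values("date").reset_index(drop=True)
```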
Guidance
Parse each separately with the appropriate format:
- File A: `format="%m/%d/%Y"`
- File B: `format="%d.%m.%Y"`
- File C: `format="%Y/%m/%d"`
Then `pd.concat()` the three DataFrames. Check for NaT with `.isna().sum()`. Sort with `.sort_values("date")`. The key lesson: always know the format of your date data before parsing.
Exercise 11.17 — Finding temporal patterns
Using a dataset of your creation with daily values spanning at least one year:
- Compute the average value by day of week — is there a weekly pattern?
- Compute the average value by month — is there a seasonal pattern?
- Compare Q1 vs Q3 — is there a significant difference?
- Find the longest streak of consecutive days above the overall median
- Identify any "anomaly" days (values more than 2 standard deviations from the 30-day rolling mean)
Guidance
For the day-of-week pattern: `data.groupby(data.index.dayofweek).mean()`. For streaks, create a boolean mask (above median), label consecutive runs with `(mask != mask.shift()).cumsum()`, then group by that label and count. For anomalies: compute `rolling_30 = data.rolling(30).mean()` and `rolling_std = data.rolling(30).std()`, then flag where `abs(data - rolling_30) > 2 * rolling_std`.
Exercise 11.18 — Event-driven time series
Create a time series with a known "intervention" point — for example, a city implements a new traffic law on day 100. Generate data where:
- Days 1-99: baseline behavior (e.g., average 50 accidents per day)
- Day 100: intervention
- Days 101-200: reduced behavior (e.g., average 35 accidents per day)
Then:
1. Compute a 14-day rolling average to visualize the change
2. Compare the mean before vs. after the intervention
3. Compute when the rolling average first drops below a threshold
4. Calculate how many days it took for the "new normal" to stabilize
Guidance
Generate data: `before = np.random.normal(50, 10, 99); after = np.random.normal(35, 8, 101)`. Concatenate and add dates. The rolling average will show a smooth transition. "Stabilization" could be defined as when the rolling standard deviation drops below a threshold, indicating consistent new behavior.
Part D: Synthesis Problems (3-star / 4-star)
Exercise 11.19 — Complete time series pipeline (3-star)
Build a complete time series analysis pipeline function that takes any time-indexed Series and produces:
1. Basic stats (min, max, mean, std)
2. Monthly summary (mean, min, max per month)
3. Rolling statistics (7-day and 30-day means)
4. Trend direction (is the 30-day rolling average going up or down?)
5. Anomaly detection (days more than 2 std from the rolling mean)
Test it on at least two different time series (e.g., temperature and sales).
Guidance
def analyze_timeseries(series, short_window=7, long_window=30):
    results = {}
    results["stats"] = series.describe()
    results["monthly"] = series.resample("ME").agg(["mean", "min", "max"])
    results["rolling_short"] = series.rolling(short_window).mean()
    results["rolling_long"] = series.rolling(long_window).mean()
    # Trend: compare the last long-window rolling value to the first valid one
    long_roll = results["rolling_long"].dropna()
    results["trend"] = "up" if long_roll.iloc[-1] > long_roll.iloc[0] else "down"
    # Anomalies: more than 2 rolling std from the rolling mean
    rolling_std = series.rolling(long_window).std()
    results["anomalies"] = series[
        abs(series - results["rolling_long"]) > 2 * rolling_std]
    return results
Exercise 11.20 — Multi-series comparison (3-star)
Create daily vaccination time series for three countries over 6 months. Each should have a different pattern (one growing, one stable, one declining). Then:
- Resample all three to weekly totals
- Compute the correlation between countries' weekly vaccination rates
- Find which country had the most consistent vaccination rate (lowest coefficient of variation)
- Identify any week where all three countries showed increases
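A sketch with three synthetic countries (trend slopes and noise levels are invented). The two partial edge weeks are dropped so they don't distort the consistency comparison:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
dates = pd.date_range("2023-01-01", periods=181, freq="D")
t = np.arange(181)

daily = pd.DataFrame({
    "growing":   5000 + 30 * t + rng.normal(0, 300, 181),
    "stable":    8000 + rng.normal(0, 300, 181),
    "declining": 9000 - 25 * t + rng.normal(0, 300, 181),
}, index=dates)

# Drop the first and last (partial) weeks so sums are comparable
weekly = daily.resample("W").sum().iloc[1:-1]

corr = weekly.corr()
cv = weekly.std() / weekly.mean()          # coefficient of variation
most_consistent = cv.idxmin()
all_up = (weekly.pct_change() > 0).all(axis=1)
weeks_all_up = weekly.index[all_up]
```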
Guidance
Use `pd.DataFrame` with three columns, one per country. Resample with `.resample("W").sum()`. Correlation: `.corr()`. Coefficient of variation: `std / mean`. For weeks where all increase: compute `weekly.pct_change() > 0` and use `.all(axis=1)` to find weeks where all columns are True.
Exercise 11.21 — Handling missing dates in a time series (4-star)
Create a daily time series with intentional gaps (missing dates). Then:
- Detect the gaps (which dates are missing?)
- Visualize the gap pattern (which months have the most missing data?)
- Fill gaps using three strategies: forward fill, linear interpolation, and filling with the rolling median
- Compare the three strategies — which seems most appropriate for your data?
- Document the percentage of data that was imputed
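A sketch of the gap workflow on synthetic data (the drop fraction and fill windows are arbitrary choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
full_range = pd.date_range("2023-01-01", "2023-06-30", freq="D")
values = pd.Series(rng.normal(50, 5, len(full_range)), index=full_range)

# Knock out ~10% of the days to create gaps
drop = rng.choice(len(values), size=len(values) // 10, replace=False)
sparse = values.drop(values.index[drop])

missing = full_range.difference(sparse.index)          # which dates are gone
by_month = missing.month.value_counts().sort_index()   # gap pattern by month

# Reindex to restore the missing dates as NaN, then fill three ways
reindexed = sparse.reindex(full_range)
filled_ffill = reindexed.ffill()
filled_interp = reindexed.interpolate()
filled_median = reindexed.fillna(
    reindexed.rolling(7, min_periods=1).median())
pct_imputed = reindexed.isna().mean() * 100
```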
Guidance
Create the full date range with `pd.date_range()`, create a Series with some values, then drop random entries. Detect gaps: `full_range.difference(series.index)`. Reindex to the full range: `series.reindex(full_range)`. Then apply `ffill()`, `interpolate()`, and `fillna(series.rolling(7, min_periods=1).median())`. Compare by looking at how each strategy handles gaps of different lengths.
Exercise 11.22 — Year-over-year analysis (4-star)
Create a daily time series spanning at least 2 full years. Then:
- Compute year-over-year change for each day (this year's value minus same day last year)
- Compute the same-month-previous-year comparison (this month's average vs. the same month last year)
- Handle the edge case: February 29 in a leap year
- Identify the period of greatest year-over-year improvement
Guidance
Year-over-year: `series.diff(365)` for a simple approach, or `series - series.shift(365)`. Same-month: resample to monthly means, then shift by 12 periods. February 29 is tricky — when shifting by 365 days, Feb 29 aligns with March 1. You may need to shift by 366 in leap year periods. The key learning: year-over-year calculations are conceptually simple but have calendar edge cases.
Part M: Mixed Review (2-star)
Exercise 11.23 — Combining text cleaning and date parsing (Ch. 10, 11)
You have a column with entries like:
entries = pd.Series([
"Vaccinated on March 15, 2023",
"Dose given: 03/20/2023",
"Date: 2023-04-01",
"received shot april 5 2023",
"N/A",
"vaccinated 05/10/23"
])
Use string methods (Ch. 10) to extract the date portion, then parse it (Ch. 11) into proper datetime objects. This will require a combination of regex extraction and pd.to_datetime with errors="coerce".
Guidance
Strategy 1: Try `pd.to_datetime(entries, errors="coerce")` first — it will catch some formats automatically. Strategy 2: Use regex to extract date-like patterns, then parse. For example, `entries.str.extract(r"(\d{1,2}/\d{1,2}/\d{2,4}|\d{4}-\d{2}-\d{2}|\w+ \d{1,2},? \d{4})")` to capture common date patterns, then `pd.to_datetime(extracted, errors="coerce")`. This exercise shows how text cleaning and date parsing are often intertwined.
Exercise 11.24 — Merging time series data (Ch. 9, 11)
Create two DataFrames:
1. Daily vaccination counts by date (365 rows)
2. Monthly population data by month (12 rows)
Merge them so each daily row has the corresponding month's population. Then compute "vaccinations per 100,000 population" for each day.
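A sketch using `Period` keys for the merge (vaccination counts and population figures are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
daily = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=365, freq="D"),
    "vaccinations": rng.integers(3000, 7000, 365),
})
monthly = pd.DataFrame({
    "year_month": pd.period_range("2023-01", periods=12, freq="M"),
    "population": np.linspace(1_000_000, 1_012_000, 12).round(),
})

# Shared monthly Period key, then a left merge onto the daily rows
daily["year_month"] = daily["date"].dt.to_period("M")
merged = daily.merge(monthly, on="year_month", how="left")
merged["per_100k"] = merged["vaccinations"] / merged["population"] * 100_000
```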
Guidance
Add a `year_month` column to both DataFrames for merging: `df["year_month"] = df["date"].dt.to_period("M")`. Then merge on `year_month`. Alternatively, use `merge_asof` if the dates don't align exactly. Compute the rate: `(vaccinations / population) * 100000`. This combines reshaping (Ch. 9) with date handling (Ch. 11).
Exercise 11.25 — Data cleaning pipeline with dates (Ch. 8, 10, 11)
Create a messy dataset with the following issues and clean it:
- Date column with mixed formats and some invalid entries ("TBD", "N/A")
- A text column with inconsistent entries (needs Ch. 10 cleaning)
- Missing values in numeric columns (needs Ch. 8 imputation)
- Duplicates based on date + ID combination
Apply the full cleaning pipeline: parse dates, clean text, handle missing values, remove duplicates. Then compute a weekly rolling average on the cleaned data.
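One possible end-to-end sketch (all values invented; `format="mixed"` requires pandas 2.0+):

```python
import numpy as np
import pandas as pd

# A small messy dataset exhibiting all four issues
raw = pd.DataFrame({
    "id":    [1, 2, 2, 3, 4, 5, 6, 7],
    "date":  ["2023-03-15", "03/20/2023", "03/20/2023", "TBD",
              "April 1, 2023", "N/A", "2023-04-10", "2023-04-12"],
    "site":  [" clinic a", "CLINIC B", "clinic b", "Clinic A",
              "clinic c ", "CLINIC A", "clinic b", "clinic c"],
    "doses": [120.0, np.nan, 95.0, 110.0, np.nan, 130.0, 88.0, 102.0],
})

# 1. Parse dates, coercing junk entries ("TBD", "N/A") to NaT
raw["date"] = pd.to_datetime(raw["date"], format="mixed", errors="coerce")
# 2. Normalize the text column (Ch. 10)
raw["site"] = raw["site"].str.strip().str.lower()
# 3. Impute missing numerics with the median (Ch. 8)
raw["doses"] = raw["doses"].fillna(raw["doses"].median())
# 4. Drop duplicate date + id combinations, then rows with no valid date
clean = raw.drop_duplicates(subset=["date", "id"]).dropna(subset=["date"])

# Time-based 7-day rolling average on the cleaned data
weekly_avg = clean.set_index("date")["doses"].sort_index().rolling("7D").mean()
```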