Chapter 11 Exercises: Working with Dates, Times, and Time Series Data
How to use these exercises: Work through the sections in order. Parts A-D focus on Chapter 11 material, building from recall to original analysis. Part M mixes in concepts from earlier chapters. You'll need Python, pandas, and numpy for most problems.
Difficulty key: 1-star: Foundational | 2-star: Intermediate | 3-star: Advanced | 4-star: Extension
Part A: Conceptual Understanding (1-star)
These questions check whether you absorbed the core ideas from the chapter.
Exercise 11.1 — Why dates need parsing
Explain why sorting a column of date strings in "MM/DD/YYYY" format produces incorrect results. Give a specific example of two dates that would sort wrong as strings but correctly as datetime objects.
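You can check the behavior directly. The pair below is chosen to show the inversion:

```python
import pandas as pd

dates = ["12/01/2023", "2/15/2023"]  # December 1 vs. February 15

# As strings, "12/01/2023" sorts first because "1" < "2" character by character
print(sorted(dates))  # ['12/01/2023', '2/15/2023']

# After parsing, chronological order puts February first
parsed = pd.to_datetime(pd.Series(dates), format="%m/%d/%Y")
print(parsed.sort_values().dt.strftime("%m/%d/%Y").tolist())  # ['02/15/2023', '12/01/2023']
```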
Guidance
String sorting is lexicographic (character by character, left to right). The string "12/01/2023" (December 1) comes before "2/15/2023" (February 15) because "1" < "2" in character comparison. But chronologically, February comes before December. After parsing to datetime, pandas sorts by actual chronological order. Any pair where a double-digit month (10, 11, 12) sorts before a single-digit month illustrates this.
Exercise 11.2 — Format codes
Write the format string for each of these date formats. Do not run any code — write them from memory.
"2023-03-15"(ISO format)"03/15/2023"(US format)"15-Mar-2023"(abbreviated month)"March 15, 2023 at 2:30 PM""Wed, 15 Mar 2023 14:30:00"
Guidance
1. `"%Y-%m-%d"` 2. `"%m/%d/%Y"` 3. `"%d-%b-%Y"` 4. `"%B %d, %Y at %I:%M %p"` 5. `"%a, %d %b %Y %H:%M:%S"` Remember: `%Y` is 4-digit year, `%y` is 2-digit. `%B` is full month name, `%b` is abbreviated. `%H` is 24-hour, `%I` is 12-hour (needs `%p` for AM/PM).Exercise 11.3 — Resampling vs. groupby
Explain the relationship between .resample() and .groupby(). In what situation would you use resample instead of groupby? Can you achieve the same result with groupby?
Guidance
`.resample()` is essentially a time-aware version of `.groupby()`. It groups data by time periods (weeks, months, quarters) and applies aggregation functions. You could achieve similar results with `groupby(df["date"].dt.month)`, but `resample` is more convenient because: (1) it works on a DatetimeIndex, which enforces proper date handling; (2) it handles edge cases like incomplete periods; (3) it supports upsampling; and (4) it uses standardized frequency aliases. Note also that grouping by `.dt.month` pools the same month across different years, whereas resampling keeps each year's months separate. Use `resample` when working with time series; use `groupby` when grouping by non-time categories.
Exercise 11.4 — Rolling windows explained
Explain in your own words what a 7-day rolling average is. Why is it so commonly used for daily data? What trade-off does the window size control?
Guidance
A 7-day rolling average replaces each day's value with the average of that day and the 6 preceding days. It's common because 7 days = one week, so it smooths out the day-of-week effect (weekdays vs. weekends). The window size controls a trade-off: smaller windows are more responsive to recent changes but noisier; larger windows are smoother but slower to react to real changes. A 3-day window is responsive but still shows fluctuations; a 30-day window is very smooth but might hide short-term trends.
Exercise 11.5 — Timedelta vs. DateOffset
Explain the difference between pd.Timedelta(days=30) and pd.DateOffset(months=1). Give a specific example where they produce different results when added to the same date.
Guidance
`Timedelta(days=30)` adds exactly 30 days — it doesn't know anything about months or calendars. `DateOffset(months=1)` adds one calendar month — it understands that months have different lengths. Example: starting from January 31: `Timestamp("2023-01-31") + Timedelta(days=30)` gives March 2 (30 days later), while `Timestamp("2023-01-31") + DateOffset(months=1)` gives February 28 (one calendar month later, adjusted because February doesn't have 31 days).
Part B: Applied Problems (2-star)
Exercise 11.6 — Parsing a messy date column
Parse the following Series into datetime objects. Handle errors gracefully.
dates = pd.Series([
"2023-03-15",
"03/20/2023",
"April 1, 2023",
"N/A",
"2023-05-10",
"not recorded",
"06-15-2023",
"",
"2023/07/04"
])
After parsing, report how many valid dates you have and how many became NaT.
Guidance
Use `pd.to_datetime(dates, errors="coerce")` for a first pass (in pandas 2.x, add `format="mixed"` so each entry is parsed individually). This will handle the common formats automatically and convert unparseable entries ("N/A", "not recorded", "") to `NaT`. Count valid dates with `.notna().sum()` and `NaT` values with `.isna().sum()`. You should get 6 valid dates and 3 `NaT` values.
Exercise 11.7 — Date arithmetic
Create a DataFrame with the following columns: patient_id, first_dose_date, second_dose_date. Include at least 6 patients with realistic dates. Then compute:
- The number of days between doses for each patient
- The average gap between doses
- Which patients waited more than 28 days
- A `booster_due` column showing 180 days after the second dose
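One possible setup (patient IDs, dates, and gaps are invented for illustration):

```python
import pandas as pd

# Hypothetical patient records
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "first_dose_date": pd.to_datetime([
        "2023-01-05", "2023-01-12", "2023-02-01",
        "2023-02-10", "2023-03-03", "2023-03-15"]),
})
df["second_dose_date"] = df["first_dose_date"] + pd.to_timedelta(
    [21, 28, 35, 42, 21, 30], unit="D")

# Gap in days, average gap, late patients, booster date
df["gap_days"] = (df["second_dose_date"] - df["first_dose_date"]).dt.days
late = df[df["gap_days"] > 28]
df["booster_due"] = df["second_dose_date"] + pd.Timedelta(days=180)
print(df["gap_days"].mean())  # 29.5
```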
Guidance
Subtract the date columns: `df["gap"] = df["second_dose_date"] - df["first_dose_date"]`. Get days with `.dt.days`. For the booster date: `df["booster_due"] = df["second_dose_date"] + pd.Timedelta(days=180)`. The average gap: `df["gap"].dt.days.mean()`.
Exercise 11.8 — Building and using a DatetimeIndex
Create a time series of daily temperature readings for one month (30 days). Use pd.date_range() to generate the dates and np.random.normal(20, 5, 30) for temperatures. Then:
- Set the date as the index
- Select all data from the second week (days 8-14)
- Find the hottest and coldest days
- Compute the 5-day rolling average
Guidance
dates = pd.date_range("2023-06-01", periods=30, freq="D")
temps = pd.Series(np.random.normal(20, 5, 30), index=dates, name="temp_c")
Select with partial string: `temps["2023-06-08":"2023-06-14"]`. Hottest: `temps.idxmax()`; coldest: `temps.idxmin()`. Rolling: `temps.rolling(5).mean()`.
Exercise 11.9 — Resampling daily to weekly and monthly
Using the daily vaccination data below, compute:
- Weekly total vaccinations (summing daily values)
- Monthly average daily vaccinations
- The month with the highest total vaccinations
- The week-over-week percentage change in total vaccinations
import numpy as np
np.random.seed(42)
dates = pd.date_range("2023-01-01", "2023-06-30", freq="D")
daily = pd.Series(
np.random.normal(5000, 800, len(dates)).clip(min=0).round(0),
index=dates, name="vaccinations")
Guidance
Weekly total: `daily.resample("W").sum()`. Monthly average: `daily.resample("ME").mean()` (in pandas older than 2.2, use `"M"` instead of `"ME"`). Month with highest total: `daily.resample("ME").sum().idxmax()`. Week-over-week change: `weekly = daily.resample("W").sum(); weekly.pct_change() * 100`.
Exercise 11.10 — Rolling vs. expanding
Using the same daily vaccination data from Exercise 11.9:
- Compute the 7-day rolling mean
- Compute the expanding (cumulative) mean
- Compare the two on day 30 — which is higher? Explain why.
- Compute the 7-day rolling standard deviation. What does a high standard deviation indicate?
Guidance
Rolling: `daily.rolling(7).mean()`. Expanding: `daily.expanding().mean()`. On day 30, the expanding mean is the average of all 30 days, while the rolling mean is the average of only the most recent 7 days. Whether rolling or expanding is higher depends on whether recent days were above or below the overall average. The rolling standard deviation measures recent volatility — a high value means daily vaccinations are fluctuating widely.
Exercise 11.11 — Extracting date components for analysis
Load or create a DataFrame with a date column spanning at least one full year. Add columns for: year, month, day of week (as name), quarter, and a boolean is_weekend. Then use groupby to answer:
- Which day of the week has the most/fewest events?
- Which quarter has the most events?
- What's the weekend vs. weekday ratio?
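A minimal sketch with a synthetic events column (the Poisson counts and seed are arbitrary choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
df = pd.DataFrame({"date": dates, "events": rng.poisson(10, len(dates))})

# Extract the components with .dt accessors
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_name"] = df["date"].dt.day_name()
df["quarter"] = df["date"].dt.quarter
df["is_weekend"] = df["date"].dt.dayofweek >= 5

by_day = df.groupby("day_name")["events"].sum()
busiest_day = by_day.idxmax()
by_quarter = df.groupby("quarter")["events"].sum()
counts = df.groupby("is_weekend")["events"].sum()
weekend_ratio = counts.loc[True] / counts.loc[False]
```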
Guidance
Use `.dt.year`, `.dt.month`, `.dt.day_name()`, `.dt.quarter`, `.dt.dayofweek >= 5`. Group by each component and count or sum. The weekend ratio: `df.groupby("is_weekend").size()` gives you counts, then divide weekend by weekday.
Exercise 11.12 — date_range for scheduling
Use pd.date_range() to generate:
- Every business day (Monday-Friday) in March 2023
- The first day of each quarter in 2023
- Every other Wednesday from January to June 2023
- The last business day of each month in 2023
Guidance
1. `pd.date_range("2023-03-01", "2023-03-31", freq="B")` — "B" is business day 2. `pd.date_range("2023-01-01", periods=4, freq="QS")` — "QS" is quarter start 3. `pd.date_range("2023-01-01", "2023-06-30", freq="2W-WED")` — every 2 weeks anchored to Wednesday 4. `pd.date_range("2023-01-01", "2023-12-31", freq="BME")` — "BME" is business month end (or use "BM" in older pandas)Exercise 11.13 — Time zones
Create a Timestamp for "2023-07-04 12:00:00" in US/Eastern timezone. Then:
- Convert it to UTC
- Convert it to US/Pacific
- Convert it to Europe/London
- Explain why Europe/London is NOT the same as UTC on July 4th
Guidance
ts = pd.Timestamp("2023-07-04 12:00:00", tz="US/Eastern")
ts.tz_convert("UTC") # 16:00 UTC
ts.tz_convert("US/Pacific") # 09:00 Pacific
ts.tz_convert("Europe/London") # 17:00 London
London is UTC+1 in summer (BST — British Summer Time), not UTC+0. This is why timezone names (which handle daylight saving) are preferred over fixed UTC offsets.
Part C: Real-World Scenarios (3-star)
Exercise 11.14 — Sales time series analysis
Create a simulated dataset of daily coffee shop sales for one year (365 days) with these characteristics:
- Base sales around $800/day
- A gradual upward trend
- Higher sales on weekdays than weekends
- Seasonal bump in winter (November-February)
- Random noise
Then perform a complete time series analysis:
1. Plot (describe or print) the daily, 7-day rolling, and 30-day rolling averages
2. Compute monthly total revenue
3. Identify the best and worst weeks
4. Calculate the weekend effect (percentage drop on weekends)
Guidance
Build the simulation by adding components: `baseline + trend + weekend_effect + seasonal + noise`. For the weekend effect, subtract $200 on Saturday/Sunday using `np.where(dates.dayofweek >= 5, -200, 0)`. For seasonality, use a sinusoidal pattern peaking in January. After creating the series, apply `resample` and `rolling` as practiced in the chapter.
Exercise 11.15 — Pandemic timeline analysis
Create a DataFrame with daily COVID-19 case counts for a fictional city over 6 months. Include a realistic wave pattern (exponential growth, peak, decline). Then:
- Compute the 7-day rolling average (what news outlets report)
- Compute the doubling time during the growth phase (how many days for cases to double)
- Identify the peak date (when the 7-day average was highest)
- Compute the 14-day change (current 7-day avg vs. 14 days ago) to show acceleration/deceleration
- Resample to weekly data and compute week-over-week percentage change
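A sketch of the wave and the requested statistics. The growth and decay constants are arbitrary; a growth term times a decay term produces a single rise-and-fall wave:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
days = np.arange(180)
dates = pd.date_range("2023-01-01", periods=180, freq="D")

# Exponential growth damped by a quadratic decay; multiplicative noise
wave = 100 * np.exp(0.08 * days) * np.exp(-0.0005 * days**2)
cases = pd.Series(wave * rng.lognormal(0, 0.1, 180), index=dates).round()

rolling_7 = cases.rolling(7).mean()      # what news outlets report
peak_date = rolling_7.idxmax()
change_14 = rolling_7.diff(14)           # positive = accelerating
weekly = cases.resample("W").sum()
wow_pct = weekly.pct_change() * 100      # week-over-week % change
```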
Guidance
For the wave pattern, use something like: `cases = (1000 * np.exp(0.05 * days) * np.exp(-0.001 * days**2))` with noise added. Doubling time: find periods where `rolling_avg / rolling_avg.shift(n)` is approximately 2 and compute `n`. The 14-day change: `rolling_7.diff(14)`. Peak: `rolling_7.idxmax()`.
Exercise 11.16 — Handling multiple date formats
You receive three CSV files from different international partners, each with date columns in different formats:
- File A (US): dates like "03/15/2023"
- File B (Europe): dates like "15.03.2023"
- File C (Japan): dates like "2023/03/15"
Write code to:
1. Parse each file's dates with the correct format string
2. Combine all three into a single DataFrame with consistent datetime dates
3. Verify there are no parsing errors by checking for NaT values
4. Sort the combined data chronologically
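A sketch using inline stand-ins for the three files (the column name `date` and the sample values are assumptions):

```python
import pandas as pd

# Simulated stand-ins for the three partner files
file_a = pd.DataFrame({"date": ["03/15/2023", "12/01/2023"]})
file_b = pd.DataFrame({"date": ["15.03.2023", "01.12.2023"]})
file_c = pd.DataFrame({"date": ["2023/03/15", "2023/12/01"]})

# Parse each with its own explicit format string
file_a["date"] = pd.to_datetime(file_a["date"], format="%m/%d/%Y")
file_b["date"] = pd.to_datetime(file_b["date"], format="%d.%m.%Y")
file_c["date"] = pd.to_datetime(file_c["date"], format="%Y/%m/%d")

combined = pd.concat([file_a, file_b, file_c], ignore_index=True)
assert combined["date"].isna().sum() == 0   # no parsing failures
combined = combined.sort_values("date").reset_index(drop=True)
```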
Guidance
Parse each separately with the appropriate format:
- File A: `format="%m/%d/%Y"`
- File B: `format="%d.%m.%Y"`
- File C: `format="%Y/%m/%d"`
Then `pd.concat()` the three DataFrames. Check for NaT with `.isna().sum()`. Sort with `.sort_values("date")`. The key lesson: always know the format of your date data before parsing.
Exercise 11.17 — Finding temporal patterns
Using a dataset of your creation with daily values spanning at least one year:
- Compute the average value by day of week — is there a weekly pattern?
- Compute the average value by month — is there a seasonal pattern?
- Compare Q1 vs Q3 — is there a significant difference?
- Find the longest streak of consecutive days above the overall median
- Identify any "anomaly" days (values more than 2 standard deviations from the 30-day rolling mean)
Guidance
For the day-of-week pattern: `data.groupby(data.index.dayofweek).mean()`. For streaks, create a boolean mask (above median), label consecutive runs with `(mask != mask.shift()).cumsum()`, then group by that label and count. For anomalies: compute `rolling_30 = data.rolling(30).mean()` and `rolling_std = data.rolling(30).std()`, then flag where `abs(data - rolling_30) > 2 * rolling_std`.
Exercise 11.18 — Event-driven time series
Create a time series with a known "intervention" point — for example, a city implements a new traffic law on day 100. Generate data where:
- Days 1-99: baseline behavior (e.g., average 50 accidents per day)
- Day 100: intervention
- Days 101-200: reduced behavior (e.g., average 35 accidents per day)
Then:
1. Compute a 14-day rolling average to visualize the change
2. Compare the mean before vs. after the intervention
3. Compute when the rolling average first drops below a threshold
4. Calculate how many days it took for the "new normal" to stabilize
Guidance
Generate data: `before = np.random.normal(50, 10, 99); after = np.random.normal(35, 8, 101)`. Concatenate and add dates. The rolling average will show a smooth transition. "Stabilization" could be defined as when the rolling standard deviation drops below a threshold, indicating consistent new behavior.
Part D: Synthesis Problems (3-star / 4-star)
Exercise 11.19 — Complete time series pipeline (3-star)
Build a complete time series analysis pipeline function that takes any time-indexed Series and produces:
1. Basic stats (min, max, mean, std)
2. Monthly summary (mean, min, max per month)
3. Rolling statistics (7-day and 30-day means)
4. Trend direction (is the 30-day rolling average going up or down?)
5. Anomaly detection (days more than 2 std from the rolling mean)
Test it on at least two different time series (e.g., temperature and sales).
Guidance
def analyze_timeseries(series, short_window=7, long_window=30):
    results = {}
    results["stats"] = series.describe()
    results["monthly"] = series.resample("ME").agg(["mean", "min", "max"])
    results["rolling_short"] = series.rolling(short_window).mean()
    results["rolling_long"] = series.rolling(long_window).mean()
    # Trend: compare the last long-window rolling value to the first valid one
    long_roll = results["rolling_long"].dropna()
    results["trend"] = "up" if long_roll.iloc[-1] > long_roll.iloc[0] else "down"
    # Anomalies: more than 2 rolling std from the rolling mean
    rolling_std = series.rolling(long_window).std()
    results["anomalies"] = series[
        abs(series - results["rolling_long"]) > 2 * rolling_std]
    return results
Exercise 11.20 — Multi-series comparison (3-star)
Create daily vaccination time series for three countries over 6 months. Each should have a different pattern (one growing, one stable, one declining). Then:
- Resample all three to weekly totals
- Compute the correlation between countries' weekly vaccination rates
- Find which country had the most consistent vaccination rate (lowest coefficient of variation)
- Identify any week where all three countries showed increases
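A sketch with three synthetic countries (trend slopes and noise levels are invented). The two partial edge weeks are dropped so they don't distort the consistency comparison:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
dates = pd.date_range("2023-01-01", periods=181, freq="D")
t = np.arange(181)

daily = pd.DataFrame({
    "growing":   5000 + 30 * t + rng.normal(0, 300, 181),
    "stable":    8000 + rng.normal(0, 300, 181),
    "declining": 9000 - 25 * t + rng.normal(0, 300, 181),
}, index=dates)

# Drop the first and last (partial) weeks so sums are comparable
weekly = daily.resample("W").sum().iloc[1:-1]

corr = weekly.corr()
cv = weekly.std() / weekly.mean()          # coefficient of variation
most_consistent = cv.idxmin()
all_up = (weekly.pct_change() > 0).all(axis=1)
weeks_all_up = weekly.index[all_up]
```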
Guidance
Use `pd.DataFrame` with three columns, one per country. Resample with `.resample("W").sum()`. Correlation: `.corr()`. Coefficient of variation: `std / mean`. For weeks where all increase: compute `weekly.pct_change() > 0` and use `.all(axis=1)` to find weeks where all columns are True.
Exercise 11.21 — Handling missing dates in a time series (4-star)
Create a daily time series with intentional gaps (missing dates). Then:
- Detect the gaps (which dates are missing?)
- Visualize the gap pattern (which months have the most missing data?)
- Fill gaps using three strategies: forward fill, linear interpolation, and filling with the rolling median
- Compare the three strategies — which seems most appropriate for your data?
- Document the percentage of data that was imputed
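A sketch of the gap workflow on synthetic data (the drop fraction and fill windows are arbitrary choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
full_range = pd.date_range("2023-01-01", "2023-06-30", freq="D")
values = pd.Series(rng.normal(50, 5, len(full_range)), index=full_range)

# Knock out ~10% of the days to create gaps
drop = rng.choice(len(values), size=len(values) // 10, replace=False)
sparse = values.drop(values.index[drop])

missing = full_range.difference(sparse.index)          # which dates are gone
by_month = missing.month.value_counts().sort_index()   # gap pattern by month

# Reindex to restore the missing dates as NaN, then fill three ways
reindexed = sparse.reindex(full_range)
filled_ffill = reindexed.ffill()
filled_interp = reindexed.interpolate()
filled_median = reindexed.fillna(
    reindexed.rolling(7, min_periods=1).median())
pct_imputed = reindexed.isna().mean() * 100
```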
Guidance
Create the full date range with `pd.date_range()`, create a Series with some values, then drop random entries. Detect gaps: `full_range.difference(series.index)`. Reindex to the full range: `series.reindex(full_range)`. Then apply `ffill()`, `interpolate()`, and `fillna(series.rolling(7, min_periods=1).median())`. Compare by looking at how each strategy handles gaps of different lengths.
Exercise 11.22 — Year-over-year analysis (4-star)
Create a daily time series spanning at least 2 full years. Then:
- Compute year-over-year change for each day (this year's value minus same day last year)
- Compute the same-month-previous-year comparison (this month's average vs. the same month last year)
- Handle the edge case: February 29 in a leap year
- Identify the period of greatest year-over-year improvement
Guidance
Year-over-year: `series.diff(365)` for a simple approach, or `series - series.shift(365)`. Same-month: resample to monthly means, then shift by 12 periods. February 29 is tricky — when shifting by 365 days, Feb 29 aligns with March 1. You may need to shift by 366 in leap year periods. The key learning: year-over-year calculations are conceptually simple but have calendar edge cases.
Part M: Mixed Review (2-star)
Exercise 11.23 — Combining text cleaning and date parsing (Ch. 10, 11)
You have a column with entries like:
entries = pd.Series([
"Vaccinated on March 15, 2023",
"Dose given: 03/20/2023",
"Date: 2023-04-01",
"received shot april 5 2023",
"N/A",
"vaccinated 05/10/23"
])
Use string methods (Ch. 10) to extract the date portion, then parse it (Ch. 11) into proper datetime objects. This will require a combination of regex extraction and pd.to_datetime with errors="coerce".
Guidance
Strategy 1: Try `pd.to_datetime(entries, errors="coerce")` first — it will catch some formats automatically. Strategy 2: Use regex to extract date-like patterns, then parse. For example, `entries.str.extract(r"(\d{1,2}/\d{1,2}/\d{2,4}|\d{4}-\d{2}-\d{2}|\w+ \d{1,2},? \d{4})")` to capture common date patterns, then `pd.to_datetime(extracted, errors="coerce")`. This exercise shows how text cleaning and date parsing are often intertwined.
Exercise 11.24 — Merging time series data (Ch. 9, 11)
Create two DataFrames:
1. Daily vaccination counts by date (365 rows)
2. Monthly population data by month (12 rows)
Merge them so each daily row has the corresponding month's population. Then compute "vaccinations per 100,000 population" for each day.
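A sketch using `Period` keys for the merge (vaccination counts and population figures are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
daily = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=365, freq="D"),
    "vaccinations": rng.integers(3000, 7000, 365),
})
monthly = pd.DataFrame({
    "year_month": pd.period_range("2023-01", periods=12, freq="M"),
    "population": np.linspace(1_000_000, 1_012_000, 12).round(),
})

# Shared monthly Period key, then a left merge onto the daily rows
daily["year_month"] = daily["date"].dt.to_period("M")
merged = daily.merge(monthly, on="year_month", how="left")
merged["per_100k"] = merged["vaccinations"] / merged["population"] * 100_000
```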
Guidance
Add a `year_month` column to both DataFrames for merging: `df["year_month"] = df["date"].dt.to_period("M")`. Then merge on `year_month`. Alternatively, use `merge_asof` if the dates don't align exactly. Compute the rate: `(vaccinations / population) * 100000`. This combines reshaping (Ch. 9) with date handling (Ch. 11).
Exercise 11.25 — Data cleaning pipeline with dates (Ch. 8, 10, 11)
Create a messy dataset with the following issues and clean it:
- Date column with mixed formats and some invalid entries ("TBD", "N/A")
- A text column with inconsistent entries (needs Ch. 10 cleaning)
- Missing values in numeric columns (needs Ch. 8 imputation)
- Duplicates based on date + ID combination
Apply the full cleaning pipeline: parse dates, clean text, handle missing values, remove duplicates. Then compute a weekly rolling average on the cleaned data.
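One possible end-to-end sketch (all values invented; `format="mixed"` requires pandas 2.0+):

```python
import numpy as np
import pandas as pd

# A small messy dataset exhibiting all four issues
raw = pd.DataFrame({
    "id":    [1, 2, 2, 3, 4, 5, 6, 7],
    "date":  ["2023-03-15", "03/20/2023", "03/20/2023", "TBD",
              "April 1, 2023", "N/A", "2023-04-10", "2023-04-12"],
    "site":  [" clinic a", "CLINIC B", "clinic b", "Clinic A",
              "clinic c ", "CLINIC A", "clinic b", "clinic c"],
    "doses": [120.0, np.nan, 95.0, 110.0, np.nan, 130.0, 88.0, 102.0],
})

# 1. Parse dates, coercing junk entries ("TBD", "N/A") to NaT
raw["date"] = pd.to_datetime(raw["date"], format="mixed", errors="coerce")
# 2. Normalize the text column (Ch. 10)
raw["site"] = raw["site"].str.strip().str.lower()
# 3. Impute missing numerics with the median (Ch. 8)
raw["doses"] = raw["doses"].fillna(raw["doses"].median())
# 4. Drop duplicate date + id combinations, then rows with no valid date
clean = raw.drop_duplicates(subset=["date", "id"]).dropna(subset=["date"])

# Time-based 7-day rolling average on the cleaned data
weekly_avg = clean.set_index("date")["doses"].sort_index().rolling("7D").mean()
```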