Chapter 11 Quiz: Working with Dates, Times, and Time Series Data

Instructions: This quiz tests your understanding of Chapter 11. Answer all questions before checking the solutions. For multiple choice, select the best answer — some options may be partially correct. For short answer questions, aim for 2-4 clear sentences. Total points: 100.


Section 1: Multiple Choice (8 questions, 5 points each)


Question 1. What happens when you sort a column of date strings in "MM/DD/YYYY" format without converting to datetime?

  • (A) The dates sort correctly because pandas recognizes date patterns in strings
  • (B) The dates sort lexicographically, which places December (12) before February (2)
  • (C) Pandas raises an error because it cannot sort date strings
  • (D) The dates sort correctly only if they are all in the same year
Answer **Correct: (B)**
  - **(A)** is incorrect. Pandas does not automatically recognize date patterns in string columns — it treats them as plain text.
  - **(B)** is correct. String sorting is character-by-character from left to right. "12/01/2023" comes before "2/15/2023" because "1" < "2" in ASCII. This means December sorts before February, March, etc. — chronologically wrong.
  - **(C)** Pandas can sort strings; it just sorts them alphabetically rather than chronologically.
  - **(D)** Even within the same year, the problem persists: "10/01/2023" (October) would sort before "2/01/2023" (February).
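To see the pitfall concretely, here is a short sketch with a few made-up date strings; sorting before and after conversion gives different orders:

```python
import pandas as pd

# Hypothetical date strings in "M/D/YYYY" style (not zero-padded)
s = pd.Series(["12/01/2023", "2/15/2023", "7/04/2023"])

# Lexicographic sort: "1" < "2" < "7", so December lands first
as_strings = s.sort_values().tolist()

# Chronological sort after conversion: February comes first
as_dates = pd.to_datetime(s, format="%m/%d/%Y").sort_values()
```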

Question 2. What does the format code %B represent in a date format string?

  • (A) Abbreviated month name (e.g., "Mar")
  • (B) Full month name (e.g., "March")
  • (C) Month number zero-padded (e.g., "03")
  • (D) Business day indicator
Answer **Correct: (B)**
  - **(A)** is `%b` (lowercase), which gives abbreviated month names like "Jan", "Feb", "Mar".
  - **(B)** is correct. `%B` (uppercase) gives the full month name: "January", "February", "March", etc.
  - **(C)** is `%m`, which gives the zero-padded month number: "01", "02", "03".
  - **(D)** There is no such format code. Business days are handled by frequency aliases ("B") in `date_range`, not by format codes.
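A minimal illustration of the three month-related codes using Python's `strftime` (the `%B`/`%b` results assume an English locale):

```python
from datetime import datetime

d = datetime(2023, 3, 15)

full_name = d.strftime("%B")   # full month name, e.g. "March"
abbr_name = d.strftime("%b")   # abbreviated name, e.g. "Mar"
padded_num = d.strftime("%m")  # zero-padded month number, e.g. "03"
```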

Question 3. What does pd.to_datetime(series, errors="coerce") do with unparseable values?

  • (A) Raises a ValueError for each unparseable value
  • (B) Skips unparseable values, leaving them as strings
  • (C) Converts unparseable values to NaT (Not a Time)
  • (D) Replaces unparseable values with the current date
Answer **Correct: (C)**
  - **(A)** describes the default behavior (`errors="raise"`), not `errors="coerce"`.
  - **(B)** describes `errors="ignore"`, which returns the original input unchanged if any value can't be parsed.
  - **(C)** is correct. `errors="coerce"` converts values that cannot be parsed into `NaT` (Not a Time), which is the datetime equivalent of `NaN`. This is the safest option for messy data because it lets you parse what you can and identify what you can't.
  - **(D)** Pandas never silently replaces values with the current date.
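A quick sketch of `coerce` on a toy Series with one unparseable value:

```python
import pandas as pd

raw = pd.Series(["2023-01-15", "not a date", "2023-03-10"])

parsed = pd.to_datetime(raw, errors="coerce")  # bad value becomes NaT
n_unparseable = parsed.isna().sum()            # count what failed
```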

Question 4. What is the result of pd.Timestamp("2023-01-31") + pd.DateOffset(months=1)?

  • (A) Timestamp('2023-03-03') (31 days after January 31)
  • (B) Timestamp('2023-02-28') (last day of February)
  • (C) An error because February doesn't have 31 days
  • (D) Timestamp('2023-03-01') (first day of March)
Answer **Correct: (B)**
  - **(A)** describes what `Timedelta(days=31)` would do — adding exactly 31 days.
  - **(B)** is correct. `DateOffset(months=1)` is calendar-aware. Adding one month to January 31 gives February 28 (or 29 in a leap year), because `DateOffset` adjusts to the last valid day of the target month. This is the key difference between `DateOffset` and `Timedelta`.
  - **(C)** `DateOffset` handles this gracefully instead of raising an error.
  - **(D)** March 1 is not "one month after January 31" in calendar logic.
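The contrast can be checked directly:

```python
import pandas as pd

jan31 = pd.Timestamp("2023-01-31")

by_offset = jan31 + pd.DateOffset(months=1)  # calendar-aware: Feb 28
by_delta = jan31 + pd.Timedelta(days=31)     # exactly 31 days: Mar 3
```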

Question 5. What does .resample("W").sum() do to a daily time series?

  • (A) Computes the sum of all values across the entire series, grouped by week number
  • (B) Aggregates daily values into weekly totals, creating one row per week
  • (C) Smooths the daily values using a 7-day rolling sum
  • (D) Fills missing values with the sum of neighboring values
Answer **Correct: (B)**
  - **(A)** is partially right about grouping by week but poorly phrased. `resample` creates time-based buckets and aggregates within each.
  - **(B)** is correct. `resample("W")` groups the data into weekly bins (by default ending on Sunday), and `.sum()` adds up all values within each week. The result is a new Series with one value per week, where each value is the total of the 7 daily values in that week.
  - **(C)** describes `rolling(7).sum()`, which is a sliding window, not resampling.
  - **(D)** describes interpolation or forward-fill, not resampling.
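A toy example with two full weeks of dummy data (all ones, so each weekly total is simply a count of days):

```python
import pandas as pd

# 14 days of dummy data starting on a Monday (2023-01-02)
daily = pd.Series(1, index=pd.date_range("2023-01-02", periods=14, freq="D"))

weekly = daily.resample("W").sum()  # one row per weekly bin
rolled = daily.rolling(7).sum()     # same length as the input
```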

Question 6. What is the .dt accessor in pandas?

  • (A) A method for converting strings to datetime objects
  • (B) An interface for accessing datetime properties (year, month, day) from a datetime Series
  • (C) A function for creating DatetimeIndex objects
  • (D) A parameter for specifying date formats
Answer **Correct: (B)**
  - **(A)** describes `pd.to_datetime()`, not the `.dt` accessor.
  - **(B)** is correct. The `.dt` accessor provides access to datetime properties and methods for a Series with datetime dtype. Just as `.str` provides string methods for string columns, `.dt` provides datetime methods: `.dt.year`, `.dt.month`, `.dt.day_name()`, `.dt.quarter`, etc.
  - **(C)** describes `pd.DatetimeIndex()` or `pd.date_range()`.
  - **(D)** The `format` parameter in `to_datetime()` handles format specification.
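A quick sketch of the accessor on a toy Series:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2023-03-15", "2023-07-04"]))

years = s.dt.year        # 2023, 2023
quarters = s.dt.quarter  # 1, 3
names = s.dt.day_name()  # weekday names as strings
```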

Question 7. Why does a 7-day rolling average produce NaN for the first 6 values?

  • (A) The first 6 values are always invalid in time series data
  • (B) There aren't enough data points to fill the window — you need 7 values to compute a 7-value average
  • (C) Pandas reserves the first 6 values as a warm-up period for time series calculations
  • (D) The rolling function only works with data after the first week
Answer **Correct: (B)**
  - **(A)** The first values are perfectly valid — they just can't participate in a full 7-day average yet.
  - **(B)** is correct. With the default trailing window, the 7-day average at day 3 would need days -3 through 3. Since days before day 1 don't exist, the calculation can't be performed, resulting in `NaN`. You can change this behavior with `min_periods` — e.g., `rolling(7, min_periods=1)` computes the average with whatever data is available, even if it's fewer than 7 points.
  - **(C)** There's no "warm-up period" concept in pandas.
  - **(D)** The function works on all data; it just can't compute a full window until enough data points exist.
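The leading-NaN behavior is easy to verify on dummy data:

```python
import pandas as pd

s = pd.Series(range(10), index=pd.date_range("2023-01-01", periods=10))

# First 6 entries are NaN; day 7 averages days 1-7 (values 0..6)
avg = s.rolling(7).mean()
```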

Question 8. Which approach is correct for parsing dates during CSV loading?

  • (A) pd.read_csv("data.csv", dtype={"date": "datetime64"})
  • (B) pd.read_csv("data.csv", parse_dates=["date"])
  • (C) pd.read_csv("data.csv", date_format="date")
  • (D) pd.read_csv("data.csv").astype({"date": "datetime64"})
Answer **Correct: (B)**
  - **(A)** The `dtype` parameter doesn't accept "datetime64" for date parsing — it's used for types like `int`, `float`, `str`.
  - **(B)** is correct. `parse_dates` accepts a list of column names (or indices) to parse as dates during loading. This is the cleanest approach because it handles the conversion in one step, before any analysis begins.
  - **(C)** Recent pandas versions do have a `date_format` parameter, but it expects a strftime format string (e.g., "%d-%m-%Y") and only takes effect together with `parse_dates` — passing a column name to it does nothing useful.
  - **(D)** While this might work for simple cases, `astype` with datetime can be fragile. `pd.to_datetime()` is the proper conversion function, and `parse_dates` does this automatically during loading.
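A self-contained sketch, reading from an in-memory string instead of a real file:

```python
import io

import pandas as pd

csv_text = "date,sales\n2023-01-15,100\n2023-02-20,150\n"

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
# df["date"] now has datetime64 dtype, ready for .dt and time filtering
```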

Section 2: True/False (3 questions, 5 points each)


Question 9. True or False: A Timedelta of 30 days is always equivalent to adding one month to a date.

Answer **False.** Months have different numbers of days: January has 31, February has 28 or 29, April has 30. Adding `Timedelta(days=30)` to January 1 gives January 31 — still in January, not February 1. Adding `DateOffset(months=1)` to January 1 gives February 1. The distinction becomes especially important at month boundaries: `Timedelta(days=30)` added to January 31 gives March 2, while `DateOffset(months=1)` gives February 28.
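Both claims are easy to check:

```python
import pandas as pd

jan1 = pd.Timestamp("2023-01-01")

plus_30_days = jan1 + pd.Timedelta(days=30)    # still January (Jan 31)
plus_1_month = jan1 + pd.DateOffset(months=1)  # February 1
```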

Question 10. True or False: NaT (Not a Time) behaves similarly to NaN — it is ignored by aggregations and can be detected with .isna().

Answer **True.** `NaT` is the datetime equivalent of `NaN`. It is returned when a date cannot be parsed (`errors="coerce"`), and it behaves like `NaN` in most contexts: it's skipped by aggregations such as `.min()`, `.max()`, and `.mean()`; it propagates through arithmetic (any operation involving `NaT` returns `NaT`); and it can be detected with `.isna()` or `.isnull()`. You can also filter it out with `.dropna()`.
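A small sketch on a toy Series with one bad entry:

```python
import pandas as pd

s = pd.to_datetime(
    pd.Series(["2023-01-15", "TBD", "2023-03-10"]), errors="coerce"
)

mask = s.isna()     # True where parsing produced NaT
valid = s.dropna()  # NaT rows removed
latest = s.max()    # aggregation skips NaT
```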

Question 11. True or False: When using partial string indexing on a DatetimeIndex, df["2023"] selects all rows from the year 2023.

Answer **True.** Partial string indexing is one of the most convenient features of DatetimeIndex. You can slice by year (`df.loc["2023"]`), year-month (`df.loc["2023-03"]`), or specific date ranges (`df.loc["2023-03-01":"2023-03-31"]`). Pandas automatically interprets the string as a time range and selects all matching rows. Note that recent pandas versions require the `.loc` spelling for DataFrames: plain `df["2023"]` was deprecated and later removed because it is ambiguous with column selection. This only works when the index is a DatetimeIndex — it won't work with a regular column unless you use boolean filtering.
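A tiny sketch with an index straddling a year boundary, using the `.loc` spelling that works across pandas versions:

```python
import pandas as pd

idx = pd.date_range("2022-12-30", periods=5, freq="D")
df = pd.DataFrame({"x": range(5)}, index=idx)

in_2023 = df.loc["2023"]  # only the rows dated 2023-01-01 onward
```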

Section 3: Short Answer (4 questions, 5 points each)


Question 12. Explain the difference between resampling and rolling windows. When would you use each?

Answer **Resampling** changes the frequency of the data by grouping values into time buckets (daily to weekly, hourly to daily) and aggregating. It reduces (or increases) the number of data points. The result has one value per time period.

**Rolling windows** compute statistics over a sliding window of fixed size without changing the frequency. The result has the same number of data points as the original, with each value representing the statistic computed over the surrounding window.

Use resampling when you want to change the time granularity (e.g., "show me monthly totals"). Use rolling windows when you want to smooth the data while keeping the original granularity (e.g., "show me the 7-day trend for each day").
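The shape difference is the giveaway; a sketch with 30 days of dummy data:

```python
import pandas as pd

daily = pd.Series(range(30), index=pd.date_range("2023-06-01", periods=30))

weekly = daily.resample("W").sum()  # frequency changes: one row per week
smooth = daily.rolling(7).mean()    # frequency unchanged: one row per day
```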

Question 13. What is the day-month ambiguity problem in date parsing? How do you avoid it?

Answer The day-month ambiguity occurs with dates like "01/02/2023" — it could be January 2 (month-first, US convention) or February 1 (day-first, European convention). When both the day and month values are 12 or less, there's no way to tell which interpretation is correct from the string alone. Avoid it by: (1) always specifying the `format` parameter in `pd.to_datetime()`, e.g., `format="%d/%m/%Y"` for day-first, (2) using the `dayfirst=True` parameter when your data uses day-first format, or (3) requiring ISO format (YYYY-MM-DD) in your data pipelines, which is unambiguous because the year comes first and clearly separates the interpretation.
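The ambiguity in action:

```python
import pandas as pd

ambiguous = "01/02/2023"

us_style = pd.to_datetime(ambiguous, format="%m/%d/%Y")  # January 2
eu_style = pd.to_datetime(ambiguous, format="%d/%m/%Y")  # February 1
eu_flag = pd.to_datetime(ambiguous, dayfirst=True)       # also February 1
```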

Question 14. Describe three different frequency aliases used with resample() or date_range(), and explain when you'd use each.

Answer
  - **"D" (Calendar Day):** Every calendar day. Use for daily time series data — temperature readings, daily sales, infection counts.
  - **"B" (Business Day):** Monday through Friday only. Use for financial data, stock prices, or any metric that only applies on workdays.
  - **"ME" (Month End):** Groups data into monthly buckets ending on the last day of each month. Use when you want monthly summaries — total monthly revenue, average monthly temperature, monthly vaccination counts. (Older pandas versions spell this alias "M".)

Other common aliases: "W" (weekly), "W-MON" (weekly anchored to Monday), "QS" (quarter start), "h" (hourly), "MS" (month start).

Question 15. What is the min_periods parameter in .rolling(), and why might you use it?

Answer `min_periods` sets the minimum number of data points required to compute a valid result within the rolling window. By default, `min_periods` equals the window size, so a 7-day rolling average requires all 7 values and produces `NaN` for the first 6 days. Setting `min_periods=1` allows the rolling calculation to start immediately, using however many points are available. On day 1, it's the average of 1 value; on day 2, the average of 2; and so on until the full window size is reached. This is useful when you want results for every row and can accept that early values are based on fewer data points, or when your data has gaps and you'd rather compute a partial-window average than produce `NaN`.

Section 4: Applied Scenarios (3 questions, 5 points each)


Question 16. You receive a CSV file where the date column contains values like "15-Mar-2023", "20-Apr-2023", and "TBD". Write the pandas code to: (a) load the file, (b) parse the dates with the correct format string, (c) handle the "TBD" entries, and (d) compute the number of days between the earliest and latest valid dates.

Answer
import pandas as pd

df = pd.read_csv("data.csv")

# Parse with format string, coercing unparseable entries to NaT
df["date"] = pd.to_datetime(
    df["date"],
    format="%d-%b-%Y",
    errors="coerce"
)

# Check how many TBD/invalid entries became NaT
print(f"Invalid dates: {df['date'].isna().sum()}")

# Date range of valid dates
valid_dates = df["date"].dropna()
date_span = (valid_dates.max() - valid_dates.min()).days
print(f"Date span: {date_span} days")
The format string `"%d-%b-%Y"` matches the day, abbreviated month name, year pattern of values like "15-Mar-2023". `errors="coerce"` converts "TBD" to `NaT` instead of raising an error. The date span is computed by subtracting the min from the max and extracting `.days`.

Question 17. You have daily website traffic data for 2023. Your manager asks: "Were our weekends busier than weekdays last quarter (Q4)?" Write the code to answer this question, starting from a DataFrame with columns date and visits.

Answer
import pandas as pd

# Ensure date is datetime
df["date"] = pd.to_datetime(df["date"])

# Filter to Q4 (October-December)
q4 = df[df["date"].dt.quarter == 4].copy()

# Add weekend flag
q4["is_weekend"] = q4["date"].dt.dayofweek >= 5

# Compare
comparison = q4.groupby("is_weekend")["visits"].agg(
    ["mean", "median", "count"])
comparison.index = ["Weekday", "Weekend"]
print(comparison)

# Direct answer
weekday_avg = comparison.loc["Weekday", "mean"]
weekend_avg = comparison.loc["Weekend", "mean"]
print(f"\nWeekend average: {weekend_avg:,.0f}")
print(f"Weekday average: {weekday_avg:,.0f}")
print(f"Weekend {'busier' if weekend_avg > weekday_avg else 'slower'} "
      f"by {abs(weekend_avg - weekday_avg) / weekday_avg * 100:.1f}%")
The key steps: filter to Q4 using `.dt.quarter`, create a boolean weekend indicator using `.dt.dayofweek >= 5`, and compare means with `groupby`.

Question 18. You need to report the 7-day rolling average of daily COVID cases, but your dataset has gaps (some days are missing). How do you handle this? Describe two approaches and their trade-offs.

Answer **Approach 1: Reindex to fill gaps with NaN, then use `min_periods`.**
full_range = pd.date_range(series.index.min(), series.index.max())
series = series.reindex(full_range)
rolling_avg = series.rolling(7, min_periods=4).mean()
This makes the gaps explicit. Setting `min_periods=4` means you'll compute an average even with only 4 of 7 days present. Trade-off: the rolling average on days near gaps is based on fewer data points and may be less reliable.

**Approach 2: Interpolate first, then compute the rolling average.**
full_range = pd.date_range(series.index.min(), series.index.max())
series = series.reindex(full_range).interpolate(method="linear")
rolling_avg = series.rolling(7).mean()
This fills gaps with estimated values based on surrounding data. Trade-off: interpolated values are not real data — they're estimates. This can mask real gaps (like a reporting system being down) and introduce artificial smoothness.

**Which to choose:** If the gaps are random and short (1-2 days), interpolation is reasonable. If gaps represent genuinely missing data (like a testing site being closed), `min_periods` is more honest because it doesn't invent data.

Section 5: Code Analysis (2 questions, 5 points each)


Question 19. What does the following code produce? Trace through each step.

import pandas as pd

dates = pd.date_range("2023-01-01", periods=14, freq="D")
values = [10, 12, 8, 15, 11, 5, 4, 13, 14, 9, 16, 12, 6, 3]

ts = pd.Series(values, index=dates)

result = ts.resample("W").agg(["sum", "mean", "max"])
print(result)
Answer `pd.date_range("2023-01-01", periods=14, freq="D")` creates dates from January 1 to January 14, 2023. `resample("W")` groups by week, ending on Sunday by default. January 1 (Sunday) forms its own week. January 2-8 (Mon-Sun) is the next week. January 9-14 (Mon-Sat) is a partial week. Result:
              sum       mean  max
2023-01-01     10  10.000000   10
2023-01-08     68   9.714286   15
2023-01-15     60  10.000000   16
  - Week ending Jan 1: just the single value 10, so sum, mean, and max are all 10.
  - Week ending Jan 8 (values for Jan 2-8): 12 + 8 + 15 + 11 + 5 + 4 + 13 = 68. Mean = 68/7 ≈ 9.71. Max = 15.
  - Week ending Jan 15 (values for Jan 9-14, a partial six-day bin): 14 + 9 + 16 + 12 + 6 + 3 = 60. Mean = 60/6 = 10.0. Max = 16.

Question 20. The following code has a bug. Identify it, explain the problem, and fix it.

import pandas as pd

df = pd.DataFrame({
    "date": ["2023-01-15", "2023-02-20", "2023-03-10"],
    "sales": [100, 150, 120]
})

# Goal: compute sales for Q1
q1_sales = df[df["date"].dt.quarter == 1]["sales"].sum()
print(f"Q1 sales: {q1_sales}")
Answer **The bug:** The `date` column is a string (dtype `object`), not a datetime. The `.dt` accessor only works on datetime columns. This code will raise an `AttributeError`: "Can only use .dt accessor with datetimelike values."

**Fix:** Convert the column to datetime before using `.dt`:
df["date"] = pd.to_datetime(df["date"])
q1_sales = df[df["date"].dt.quarter == 1]["sales"].sum()
print(f"Q1 sales: {q1_sales}")
Or, parse dates during DataFrame creation:
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-15", "2023-02-20", "2023-03-10"]),
    "sales": [100, 150, 120]
})
This is one of the most common datetime errors: forgetting that dates loaded from CSV files are strings until explicitly converted.