Learning Objectives
- Parse date strings in various formats into pandas Timestamp objects using to_datetime with format specifiers
- Perform date arithmetic using Timedelta, date_range, and DateOffset
- Set a datetime column as the DataFrame index and use time-based indexing and slicing
- Resample time series data to different frequencies (daily to weekly, monthly) with appropriate aggregation
- Compute rolling and expanding window statistics to smooth noisy time series data
In This Chapter
- Chapter Overview
- 11.1 Why Dates Are Hard (And Why Pandas Makes Them Easier)
- 11.2 Parsing Dates with pd.to_datetime()
- 11.3 The Timestamp Object: Anatomy of a Moment
- 11.4 Date Arithmetic: Timedelta and DateOffset
- 11.5 The DatetimeIndex: Making Time the Index
- 11.6 Resampling: Changing the Frequency of Time Series Data
- 11.7 Rolling Windows: Smoothing the Noise
- 11.8 Expanding Windows: Cumulative Statistics
- 11.9 Working with Time Zones
- 11.10 Practical Patterns: Common Date Operations
- 11.11 Project Checkpoint: Parsing Dates and Computing Rolling Averages
- 11.12 Common Pitfalls and How to Avoid Them
- 11.13 Spaced Review: Concepts from Chapters 1-10
- Chapter Summary
Chapter 11: Working with Dates, Times, and Time Series Data
"Time is the most underpurchased asset in the world." — Warren Buffett
Chapter Overview
Every interesting question about the real world involves time.
When did COVID-19 vaccination rates start to plateau? How do coffee shop sales change from Monday to Sunday? Is air quality getting better or worse over the last decade? When did the website crash, and how long was it down? All of these questions are impossible to answer without dates, times, and the ability to reason about data that changes over time.
And yet, time is one of the trickiest things to handle in data. Dates arrive as strings in a dozen different formats: "2023-03-15", "03/15/2023", "March 15, 2023", "15-Mar-23". Time zones create invisible traps: a meeting at "3:00 PM" means different things in New York and London. Leap years, daylight saving time, months with different numbers of days — the calendar is a maze of special cases that computers handle poorly unless you tell them exactly what to do.
The good news: pandas has already mapped this maze for you. The pd.to_datetime() function, the Timestamp type, the DatetimeIndex, and the resample method together form a powerful toolkit for working with time. By the end of this chapter, you'll be able to parse any date format, slice data by time ranges, aggregate daily data into weekly or monthly summaries, and compute rolling averages that smooth out day-to-day noise.
In this chapter, you will learn to:
- Parse date strings in various formats into pandas Timestamp objects using to_datetime with format specifiers
- Perform date arithmetic using Timedelta, date_range, and DateOffset
- Set a datetime column as the DataFrame index and use time-based indexing and slicing
- Resample time series data to different frequencies (daily to weekly, monthly) with appropriate aggregation
- Compute rolling and expanding window statistics to smooth noisy time series data
11.1 Why Dates Are Hard (And Why Pandas Makes Them Easier)
Let's start with a simple experiment:
import pandas as pd
dates = pd.Series(["2023-01-15", "2023-02-20", "2023-03-10"])
dates.sort_values()
0 2023-01-15
1 2023-02-20
2 2023-03-10
dtype: object
That sorted correctly. But watch what happens with a different format:
dates = pd.Series(["1/15/2023", "2/20/2023", "12/5/2023"])
dates.sort_values()
0 1/15/2023
2 12/5/2023
1 2/20/2023
dtype: object
"12/5/2023" sorted between January and February because pandas is sorting these as strings, and the string "12" comes after "1" but before "2" in lexicographic order. December ended up between January and February. This is not just wrong — it's the kind of wrong that's invisible in a large dataset.
The fix is to tell pandas that these are dates, not strings:
dates = pd.to_datetime(
pd.Series(["1/15/2023", "2/20/2023", "12/5/2023"]))
dates.sort_values()
0 2023-01-15
1 2023-02-20
2 2023-12-05
dtype: datetime64[ns]
Now they sort correctly. The dtype changed from object (string) to datetime64[ns] — pandas' native date type. This unlocks everything: proper sorting, date arithmetic, time-based filtering, and resampling.
The Fundamental Problem with Dates as Strings
Strings that look like dates are useless for analysis. You can't subtract one from another to find the number of days between them. You can't ask "give me all rows from March." You can't compute a weekly average. The string "2023-03-15" is just ten characters to Python — it has no idea that it represents the fifteenth of March.
Converting dates from strings to proper datetime objects is almost always one of the first things you should do when loading a new dataset. It's up there with checking for missing values and verifying data types. If your date column has dtype object, it's a string, and it needs conversion.
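To make that concrete, here's a minimal sketch (variable names are illustrative): subtracting two date strings raises a TypeError, while the same subtraction on parsed values yields a Timedelta.

```python
import pandas as pd

s = pd.Series(["2023-03-15", "2023-03-01"])

# Strings: subtraction is not defined
try:
    s.iloc[0] - s.iloc[1]
except TypeError:
    print("can't subtract strings")

# After conversion: subtraction yields a Timedelta
d = pd.to_datetime(s)
gap = d.iloc[0] - d.iloc[1]
print(gap.days)  # 14
```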
Check Your Understanding
- Why does sorting date strings in "MM/DD/YYYY" format produce incorrect results?
- What dtype does a properly parsed date column have in pandas?
- A colleague says "my date column looks fine, all the dates are there." Why might it still need conversion?
11.2 Parsing Dates with pd.to_datetime()
The pd.to_datetime() function is your main tool for converting strings to datetime objects. It's remarkably flexible.
Automatic Parsing
For common date formats, to_datetime just works:
pd.to_datetime("2023-03-15") # ISO format
pd.to_datetime("March 15, 2023") # Written out
pd.to_datetime("03/15/2023") # US format
pd.to_datetime("15-Mar-2023") # Abbreviated month
All of these produce:
Timestamp('2023-03-15 00:00:00')
You can also pass an entire Series or column:
df = pd.DataFrame({
"date_str": ["2023-01-15", "2023-02-20",
"2023-03-10", "2023-04-05"]
})
df["date"] = pd.to_datetime(df["date_str"])
print(df.dtypes)
date_str object
date datetime64[ns]
dtype: object
When Automatic Parsing Goes Wrong: The Day-Month Ambiguity
Here's a trap that catches people constantly:
pd.to_datetime("01/02/2023")
Timestamp('2023-01-02 00:00:00')
Is that January 2nd or February 1st? Pandas assumed the US format (month/day/year), but if your data is from Europe, Australia, or most of the world, it means February 1st. There's no way for pandas to know which you meant.
The solution is to be explicit about the format. Use dayfirst=True for day-first formats:
pd.to_datetime("01/02/2023", dayfirst=True)
Timestamp('2023-02-01 00:00:00')
Or, better yet, specify the exact format string:
pd.to_datetime("01/02/2023", format="%d/%m/%Y")
Timestamp('2023-02-01 00:00:00')
Format Codes: Speaking the Language of Dates
Format codes are the alphabet for date parsing. Each code represents a component of a date or time:
| Code | Meaning | Example |
|---|---|---|
| %Y | 4-digit year | 2023 |
| %y | 2-digit year | 23 |
| %m | Month (zero-padded) | 03 |
| %d | Day (zero-padded) | 15 |
| %B | Full month name | March |
| %b | Abbreviated month | Mar |
| %H | Hour (24-hour) | 14 |
| %I | Hour (12-hour) | 02 |
| %M | Minute | 30 |
| %S | Second | 45 |
| %p | AM/PM | PM |
| %A | Full weekday name | Wednesday |
| %a | Abbreviated weekday | Wed |
You combine these codes with the literal characters in your date string:
# "15-Mar-2023 14:30"
pd.to_datetime("15-Mar-2023 14:30", format="%d-%b-%Y %H:%M")
Timestamp('2023-03-15 14:30:00')
# "March 15, 2023 at 2:30 PM"
pd.to_datetime("March 15, 2023 at 2:30 PM",
format="%B %d, %Y at %I:%M %p")
Timestamp('2023-03-15 14:30:00')
Parsing an Entire Column
When you specify the format parameter, parsing is also much faster — sometimes 10x faster on large datasets — because pandas doesn't have to guess the format for each value:
df = pd.DataFrame({
"date_str": ["15/01/2023", "20/02/2023",
"10/03/2023", "05/04/2023"]
})
# Slow (pandas guesses format for each value)
df["date"] = pd.to_datetime(df["date_str"])
# Fast (you tell pandas the format)
df["date"] = pd.to_datetime(df["date_str"],
format="%d/%m/%Y")
For large datasets (hundreds of thousands of rows), always specify the format.
Handling Parsing Errors
Real data has bad dates. Someone typed "not available" or "TBD" in a date field. to_datetime will raise an error by default:
dates = pd.Series(["2023-01-15", "TBD", "2023-03-10"])
pd.to_datetime(dates) # Raises ValueError
Use errors="coerce" to turn unparseable values into NaT (Not a Time — the datetime equivalent of NaN):
pd.to_datetime(dates, errors="coerce")
0 2023-01-15
1 NaT
2 2023-03-10
dtype: datetime64[ns]
NaT behaves like NaN — it's ignored by aggregations, it propagates through arithmetic, and you can detect it with .isna().
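A quick sketch of that behavior, rebuilding the series above:

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["2023-01-15", "TBD", "2023-03-10"]),
    errors="coerce")

print(dates.isna())        # True only in position 1
print(dates.isna().sum())  # 1 unparseable value
print(dates.min())         # aggregations skip NaT: 2023-01-15
```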
Check Your Understanding
- What format string would you use for dates like "2023/Mar/15"?
- Why is specifying a format string faster than letting pandas guess?
- What does errors="coerce" do, and when would you use it?
11.3 The Timestamp Object: Anatomy of a Moment
Once you've parsed a date, you get a Timestamp object — pandas' representation of a single point in time. It has many useful attributes:
ts = pd.Timestamp("2023-03-15 14:30:00")
ts.year # 2023
ts.month # 3
ts.day # 15
ts.hour # 14
ts.minute # 30
ts.second # 0
ts.dayofweek # 2 (Wednesday; Monday=0)
ts.day_name() # 'Wednesday'
ts.month_name() # 'March'
ts.quarter # 1
ts.dayofyear # 74
ts.is_leap_year # False
For a column of datetime values, you access these through the .dt accessor (similar to .str for strings):
df = pd.DataFrame({
"date": pd.to_datetime(
["2023-01-15", "2023-06-20", "2023-12-25"])
})
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_name"] = df["date"].dt.day_name()
df["quarter"] = df["date"].dt.quarter
print(df)
date year month day_name quarter
0 2023-01-15 2023 1 Sunday 1
1 2023-06-20 2023 6 Tuesday 2
2 2023-12-25 2023 12 Monday 4
The .dt accessor is the datetime equivalent of the .str accessor you learned in Chapter 10. Just as .str.lower() applies a string method to every value, .dt.month extracts the month from every datetime value.
Formatting Dates for Output
The strftime() method (string-format-time) converts a datetime back to a string in any format you want:
ts = pd.Timestamp("2023-03-15 14:30:00")
ts.strftime("%B %d, %Y") # 'March 15, 2023'
ts.strftime("%m/%d/%y") # '03/15/23'
ts.strftime("%A, %B %d at %I:%M %p") # 'Wednesday, March 15 at 02:30 PM'
For a column:
df["date"].dt.strftime("%B %Y")
0 January 2023
1 June 2023
2 December 2023
dtype: object
This is useful for creating human-readable labels on charts or in reports.
11.4 Date Arithmetic: Timedelta and DateOffset
One of the biggest advantages of proper datetime objects is that you can do math with them.
Subtracting Dates Gives a Timedelta
start = pd.Timestamp("2023-01-15")
end = pd.Timestamp("2023-03-20")
difference = end - start
print(difference)
print(type(difference))
64 days 00:00:00
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
A Timedelta represents a duration — a length of time. You can extract components:
difference.days # 64
difference.total_seconds() # 5529600.0
This works on entire columns:
df = pd.DataFrame({
"dose_1": pd.to_datetime(["2023-01-15", "2023-02-01", "2023-03-10"]),
"dose_2": pd.to_datetime(["2023-02-12", "2023-02-28", "2023-04-07"])
})
df["gap"] = df["dose_2"] - df["dose_1"]
df["gap_days"] = df["gap"].dt.days
print(df)
dose_1 dose_2 gap gap_days
0 2023-01-15 2023-02-12 28 days 28
1 2023-02-01 2023-02-28 27 days 27
2 2023-03-10 2023-04-07 28 days 28
Adding Time to a Date
You can add a Timedelta to a Timestamp:
appointment = pd.Timestamp("2023-03-15")
follow_up = appointment + pd.Timedelta(days=28)
print(follow_up) # 2023-04-12 00:00:00
Create Timedelta objects in several ways:
pd.Timedelta(days=7) # 7 days
pd.Timedelta(hours=48) # 2 days
pd.Timedelta("3 days 4 hours") # 3 days, 4 hours
pd.Timedelta(weeks=2) # 14 days
DateOffset: Calendar-Aware Offsets
Timedelta counts exact durations. But sometimes you need calendar-aware offsets: "one month later" should go from January 15 to February 15, not add exactly 30 days.
from pandas.tseries.offsets import MonthEnd, MonthBegin, YearEnd
date = pd.Timestamp("2023-01-15")
date + pd.DateOffset(months=1) # 2023-02-15
date + pd.DateOffset(years=1) # 2024-01-15
date + MonthEnd(1) # 2023-01-31
date + MonthBegin(1) # 2023-02-01
DateOffset understands the calendar: adding one month to January 31 gives February 28 (or 29 in a leap year), not March 3.
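The contrast is easiest to see at a month boundary. A minimal sketch comparing an exact 30-day duration with a calendar-aware one-month offset, starting from January 31:

```python
import pandas as pd

jan31 = pd.Timestamp("2023-01-31")

# Exact duration: 30 days later lands in March
print(jan31 + pd.Timedelta(days=30))    # 2023-03-02
# Calendar-aware: clamps to the last day of February
print(jan31 + pd.DateOffset(months=1))  # 2023-02-28
```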
A Practical Example: Computing Days Since Vaccination
vaccination_dates = pd.to_datetime(pd.Series([
"2023-01-15", "2023-02-20", "2023-03-10",
"2023-04-05", "2023-05-12"
]))
today = pd.Timestamp("2023-12-01")
days_since = (today - vaccination_dates).dt.days
print(days_since)
0 320
1 284
2 266
3 240
4 203
dtype: int64
Now you can easily find patients due for a booster (say, more than 180 days since their last dose):
due_for_booster = days_since > 180
print(due_for_booster)
0 True
1 True
2 True
3 True
4 True
dtype: bool
Check Your Understanding
- What's the difference between pd.Timedelta(days=30) and pd.DateOffset(months=1)?
- What type do you get when you subtract one Timestamp from another?
- How would you compute the number of hours between two datetime values?
11.5 The DatetimeIndex: Making Time the Index
Here's where things get powerful. When you set a datetime column as your DataFrame's index, pandas unlocks a whole suite of time-aware operations.
Setting the Index
df = pd.DataFrame({
"date": pd.to_datetime([
"2023-01-01", "2023-01-02", "2023-01-03",
"2023-01-04", "2023-01-05"
]),
"vaccinations": [1200, 1350, 980, 1500, 1425]
})
df = df.set_index("date")
print(df)
vaccinations
date
2023-01-01 1200
2023-01-02 1350
2023-01-03 980
2023-01-04 1500
2023-01-05 1425
The index is now a DatetimeIndex:
type(df.index) # pandas.core.indexes.datetimes.DatetimeIndex
Time-Based Slicing
With a DatetimeIndex, you can slice using date strings — no need to create Timestamp objects:
# All rows from January 2 to January 4
df["2023-01-02":"2023-01-04"]
vaccinations
date
2023-01-02 1350
2023-01-03 980
2023-01-04 1500
# All rows in January 2023
df.loc["2023-01"]
vaccinations
date
2023-01-01 1200
2023-01-02 1350
2023-01-03 980
2023-01-04 1500
2023-01-05 1425
# All rows in 2023
df.loc["2023"]
This partial-string indexing is incredibly convenient. You can slice by year, year-month, or exact date range — and pandas figures out what you mean.
Creating a DatetimeIndex with date_range()
Sometimes you need to create a date sequence from scratch:
# Every day in January 2023
dates = pd.date_range(start="2023-01-01",
end="2023-01-31",
freq="D")
print(len(dates)) # 31
# Every Monday in Q1 2023
mondays = pd.date_range(start="2023-01-01",
end="2023-03-31",
freq="W-MON")
print(mondays[:5])
DatetimeIndex(['2023-01-02', '2023-01-09', '2023-01-16',
'2023-01-23', '2023-01-30'],
dtype='datetime64[ns]', freq='W-MON')
# First day of each month in 2023
month_starts = pd.date_range(start="2023-01-01",
periods=12,
freq="MS")
print(month_starts)
The freq parameter uses frequency aliases:
| Alias | Meaning |
|---|---|
| "D" | Calendar day |
| "B" | Business day (Mon-Fri) |
| "W" | Weekly (defaults to Sunday) |
| "W-MON" | Weekly, anchored to Monday |
| "MS" | Month start |
| "ME" | Month end |
| "QS" | Quarter start |
| "YS" | Year start |
| "h" | Hourly |
| "min" | Minutely |
A Practical Example: Building a Daily Vaccination Timeline
import numpy as np
# Create a year of daily vaccination data
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
np.random.seed(42)
# Simulate: baseline + trend + weekly pattern + noise
baseline = 5000
trend = np.linspace(0, 2000, len(dates))
weekly = 1000 * np.sin(
2 * np.pi * np.arange(len(dates)) / 7)
noise = np.random.normal(0, 500, len(dates))
daily_doses = pd.Series(
(baseline + trend + weekly + noise).clip(min=0),
index=dates,
name="daily_vaccinations"
).round(0).astype(int)
print(daily_doses.head(10))
2023-01-01 5748
2023-01-02 4431
2023-01-03 5824
2023-01-04 6262
2023-01-05 4883
2023-01-06 4387
2023-01-07 5606
2023-01-08 6019
2023-01-09 4614
2023-01-10 5577
Name: daily_vaccinations, dtype: int64
This is a time series — a sequence of values indexed by time. And now we can ask it all kinds of questions.
11.6 Resampling: Changing the Frequency of Time Series Data
Resampling is the process of converting time series data from one frequency to another. Daily data to weekly. Hourly data to daily. Monthly data to quarterly.
Downsampling: From Higher to Lower Frequency
Downsampling reduces the number of data points by aggregating them. The resample() method works like groupby() for time:
# Daily to weekly total
weekly = daily_doses.resample("W").sum()
print(weekly.head(5))
2023-01-01 5748
2023-01-08 37412
2023-01-15 39243
2023-01-22 40876
2023-01-29 42015
Name: daily_vaccinations, dtype: int64
# Daily to monthly average
monthly_avg = daily_doses.resample("ME").mean().round(0)
print(monthly_avg)
2023-01-31 5487.0
2023-02-28 5806.0
2023-03-31 6129.0
2023-04-30 6298.0
2023-05-31 6456.0
2023-06-30 6531.0
2023-07-31 6749.0
2023-08-31 6702.0
2023-09-30 6816.0
2023-10-31 6929.0
2023-11-30 6945.0
2023-12-31 7011.0
Name: daily_vaccinations, dtype: float64
You can see the upward trend clearly in the monthly averages — from about 5,500 doses per day in January to 7,000 in December.
Choosing the Right Aggregation Function
The aggregation function matters enormously:
# Total vaccinations per month (sum)
daily_doses.resample("ME").sum()
# Average daily vaccinations per month (mean)
daily_doses.resample("ME").mean()
# Peak daily vaccinations per month (max)
daily_doses.resample("ME").max()
# Number of days with data per month (count)
daily_doses.resample("ME").count()
Use sum when: You want the total over the period (total monthly sales, total weekly vaccinations).
Use mean when: You want the typical value during the period (average daily temperature, average daily vaccination rate).
Use max/min when: You want extremes (hottest day of the month, lowest daily sales).
Multiple Aggregations at Once
monthly_stats = daily_doses.resample("ME").agg(
["mean", "min", "max", "std"])
print(monthly_stats.head())
mean min max std
2023-01-31 5487.0 3844 7206 891.258...
2023-02-28 5806.0 4112 7498 873.445...
2023-03-31 6129.0 4501 7844 928.102...
2023-04-30 6298.0 4680 8001 875.339...
2023-05-31 6456.0 4890 8234 904.567...
Upsampling: From Lower to Higher Frequency
You can also go the other direction — from monthly to daily, for example. This creates gaps that need filling:
quarterly = daily_doses.resample("QS").mean()
print(quarterly)
2023-01-01 5807.3
2023-04-01 6428.6
2023-07-01 6756.1
2023-10-01 6961.8
Name: daily_vaccinations, dtype: float64
# Upsample to monthly, forward-fill to populate gaps
monthly = quarterly.resample("MS").ffill()
print(monthly)
2023-01-01 5807.3
2023-02-01 5807.3
2023-03-01 5807.3
2023-04-01 6428.6
2023-05-01 6428.6
...
ffill() (forward fill) copies each value forward until the next known value. You can also use bfill() (backward fill) or interpolate() for smoother transitions.
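Here's a minimal sketch of the difference, using a tiny quarterly series with made-up values: ffill() repeats the last known value, while interpolate() draws a straight line between the known points.

```python
import pandas as pd

# Two quarterly observations (values are made up)
q = pd.Series(
    [100.0, 400.0],
    index=pd.date_range("2023-01-01", periods=2, freq="QS"))

print(q.resample("MS").ffill())        # 100, 100, 100, 400
print(q.resample("MS").interpolate())  # 100, 200, 300, 400
```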
Check Your Understanding
- What's the difference between .resample("W").sum() and .resample("W").mean()?
- When resampling daily data to monthly, why might you choose sum for sales data but mean for temperature data?
- What does ffill() do when upsampling?
11.7 Rolling Windows: Smoothing the Noise
Daily data is noisy. Day-to-day fluctuations make it hard to see the underlying trend. Rolling windows solve this by computing a statistic (usually the mean) over a sliding window of fixed size.
The Basic Rolling Mean
# 7-day rolling average
rolling_7 = daily_doses.rolling(window=7).mean()
print(rolling_7.head(10))
2023-01-01 NaN
2023-01-02 NaN
2023-01-03 NaN
2023-01-04 NaN
2023-01-05 NaN
2023-01-06 NaN
2023-01-07 5344.7
2023-01-08 5280.6
2023-01-09 5374.1
2023-01-10 5410.6
Name: daily_vaccinations, dtype: float64
The first 6 values are NaN because you need 7 data points to compute a 7-day average. Starting from the 7th day, each value is the average of that day and the 6 preceding days.
Why Rolling Averages Matter
Let's compare the raw daily data with a 7-day rolling average:
comparison = pd.DataFrame({
"daily": daily_doses,
"7_day_avg": daily_doses.rolling(7).mean().round(0)
})
print(comparison.loc["2023-03-01":"2023-03-14"])
daily 7_day_avg
2023-03-01 5892 5842.0
2023-03-02 4501 5780.0
2023-03-03 6823 5946.0
2023-03-04 7123 6124.0
2023-03-05 5501 5976.0
2023-03-06 5210 5874.0
2023-03-07 6344 5913.0
2023-03-08 6112 5945.0
2023-03-09 4890 5857.0
2023-03-10 7003 6026.0
2023-03-11 6789 6007.0
2023-03-12 5544 5985.0
2023-03-13 5301 5998.0
2023-03-14 6450 6013.0
The daily values swing from 4,501 to 7,123. The 7-day average stays in a much narrower band (5,780 to 6,124). The rolling average smooths out the day-to-day noise and reveals the underlying trend.
This is exactly the calculation you see in news reports: "the 7-day rolling average of new COVID cases." It's the same technique applied to vaccination data, stock prices, website traffic, and countless other metrics.
Choosing the Window Size
The window size controls the trade-off between smoothness and responsiveness:
# Smaller window (3 days): responsive, still noisy
daily_doses.rolling(3).mean()
# Medium window (7 days): balanced — the default for most weekly patterns
daily_doses.rolling(7).mean()
# Larger window (30 days): very smooth, but slow to respond to changes
daily_doses.rolling(30).mean()
A 7-day window is the most common choice for daily data because it smooths out the day-of-week effect (weekdays vs. weekends) while still capturing weekly trends.
Centered vs. Trailing Windows
By default, the rolling window looks backward — the value for March 15 uses data from March 9-15. You can center the window instead:
# Centered: March 15 uses data from March 12-18
daily_doses.rolling(7, center=True).mean()
Centered windows are better for visualizing historical data (the smoothed value represents the middle of the window period). Trailing windows are better for real-time monitoring (you can compute today's value because it only uses past data).
Other Rolling Statistics
Rolling windows work with any aggregation, not just the mean:
daily_doses.rolling(7).std() # rolling standard deviation
daily_doses.rolling(7).min() # rolling minimum
daily_doses.rolling(7).max() # rolling maximum
daily_doses.rolling(7).median() # rolling median
daily_doses.rolling(7).sum() # rolling sum
The rolling standard deviation is particularly useful — a widening standard deviation suggests increasing volatility, while a narrowing one suggests stabilization.
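A sketch of that idea with made-up values, where the second half of the series swings much harder than the first:

```python
import pandas as pd

# Calm first five days, volatile last five (values are made up)
s = pd.Series([100, 102, 101, 103, 100, 140, 60, 150, 55, 145],
              index=pd.date_range("2023-01-01", periods=10, freq="D"))

vol = s.rolling(5).std()
print(round(vol.iloc[4], 1))   # 1.3  (std of the calm days)
print(round(vol.iloc[-1], 1))  # 48.1 (std of the volatile days)
```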
Minimum Periods
If you don't want NaN values at the start of your series, you can set min_periods:
# Compute rolling average even with fewer than 7 data points
daily_doses.rolling(7, min_periods=1).mean()
2023-01-01 5748.0 (average of 1 value)
2023-01-02 5089.5 (average of 2 values)
2023-01-03 5334.3 (average of 3 values)
...
2023-01-07 5344.7 (average of 7 values)
With min_periods=1, the first value is just itself (average of one number), the second is the average of two values, and so on. Once you reach the full window size, it behaves normally.
11.8 Expanding Windows: Cumulative Statistics
While rolling windows use a fixed-size window that moves through the data, expanding windows start at the beginning and grow to include all data up to the current point.
# Cumulative average (expanding window)
cumulative_avg = daily_doses.expanding().mean()
print(cumulative_avg.head(5))
2023-01-01 5748.0
2023-01-02 5089.5
2023-01-03 5334.3
2023-01-04 5566.3
2023-01-05 5429.6
Name: daily_vaccinations, dtype: float64
The first value is just itself. The second is the average of the first two. The third is the average of the first three. And so on. By the end of the year, it's the average of all 365 days.
Cumulative Sum
# Running total of vaccinations
cumulative_total = daily_doses.expanding().sum()
print(f"Total vaccinations by year end: "
f"{cumulative_total.iloc[-1]:,.0f}")
Total vaccinations by year end: 2,312,456
Expanding vs. Rolling: When to Use Each
| Use Case | Tool | Why |
|---|---|---|
| Smooth daily noise | Rolling (7-day) | Captures recent trends |
| Track cumulative totals | Expanding sum | Shows running total |
| Monitor overall average | Expanding mean | Shows long-term average |
| Detect recent volatility | Rolling std | Compares to recent behavior |
| Detect trend changes | Rolling (30-day) | Smooths short-term noise |
11.9 Working with Time Zones
Time zones are a source of endless bugs. A meeting at "3:00 PM EST" is "8:00 PM UTC" is "12:00 PM PST." If your data involves events from different time zones — server logs, flight schedules, international vaccination campaigns — you need to handle this explicitly.
Naive vs. Aware Timestamps
A naive timestamp has no timezone information:
naive = pd.Timestamp("2023-03-15 14:30:00")
print(naive.tzinfo) # None
An aware timestamp knows its timezone:
aware = pd.Timestamp("2023-03-15 14:30:00", tz="US/Eastern")
print(aware)
print(aware.tzinfo)
2023-03-15 14:30:00-04:00
US/Eastern
Converting Between Time Zones
eastern = pd.Timestamp("2023-03-15 14:30:00", tz="US/Eastern")
# Convert to UTC
utc = eastern.tz_convert("UTC")
print(utc) # 2023-03-15 18:30:00+00:00
# Convert to Pacific
pacific = eastern.tz_convert("US/Pacific")
print(pacific) # 2023-03-15 11:30:00-07:00
For a column:
# Add timezone to a naive column
df["date_utc"] = df["date"].dt.tz_localize("UTC")
# Convert to a different timezone
df["date_eastern"] = df["date_utc"].dt.tz_convert("US/Eastern")
Practical Advice on Time Zones
For most data science work with tabular data, here's the simple rule: store everything in UTC internally, convert to local time only for display. This avoids daylight saving time bugs, ambiguous times, and the chaos of comparing timestamps from different zones.
If your dataset doesn't specify time zones and you know all the data is from one zone, you can usually work with naive timestamps. Just document your assumption.
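A sketch of the store-in-UTC, display-locally pattern (the timestamps and zone names here are illustrative):

```python
import pandas as pd

# Events logged as naive local times in New York
events = pd.to_datetime(
    pd.Series(["2023-03-15 09:00", "2023-03-15 17:30"]))

# Attach the known source zone, then normalize to UTC for storage
stored = (events
          .dt.tz_localize("America/New_York")
          .dt.tz_convert("UTC"))
print(stored)  # 13:00 and 21:30 UTC (EDT is UTC-4 in March)

# Convert to a local zone only for display
print(stored.dt.tz_convert("Europe/London"))
```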
11.10 Practical Patterns: Common Date Operations
Here are operations you'll use constantly. Bookmark this section.
Extracting Components for Analysis
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek # 0=Monday
df["day_name"] = df["date"].dt.day_name()
df["is_weekend"] = df["date"].dt.dayofweek >= 5
df["quarter"] = df["date"].dt.quarter
df["week_of_year"] = df["date"].dt.isocalendar().week
Filtering by Date Range
# Using boolean masks
mask = (df["date"] >= "2023-06-01") & (df["date"] < "2023-07-01")
june_data = df[mask]
# Using DatetimeIndex slicing
df_indexed = df.set_index("date")
june_data = df_indexed.loc["2023-06"]
Grouping by Time Periods
# Average vaccination rate by month
df.groupby(df["date"].dt.month)["vaccinations"].mean()
# Total by quarter
df.groupby(df["date"].dt.quarter)["vaccinations"].sum()
# Compare weekdays vs. weekends
df.groupby(df["date"].dt.dayofweek >= 5)["vaccinations"].mean()
Finding Time Gaps
# Detect missing dates in a time series
full_range = pd.date_range(
df["date"].min(), df["date"].max(), freq="D")
missing = full_range.difference(df["date"])
print(f"Missing {len(missing)} days of data")
Computing Year-Over-Year Changes
# Shift data by one year for comparison
df_indexed["last_year"] = df_indexed["vaccinations"].shift(365)
df_indexed["yoy_change"] = (
df_indexed["vaccinations"] - df_indexed["last_year"])
11.11 Project Checkpoint: Parsing Dates and Computing Rolling Averages
Let's apply everything to Elena's vaccination project. She has daily vaccination counts by country, but the date column arrived as strings in an inconsistent format.
# Load the data
vax = pd.DataFrame({
"country": (["United States"] * 10 +
["Germany"] * 10 +
["Brazil"] * 10),
"date_str": [
"01/15/2023", "01/16/2023", "01/17/2023",
"01/18/2023", "01/19/2023", "01/20/2023",
"01/21/2023", "01/22/2023", "01/23/2023",
"01/24/2023",
"15.01.2023", "16.01.2023", "17.01.2023",
"18.01.2023", "19.01.2023", "20.01.2023",
"21.01.2023", "22.01.2023", "23.01.2023",
"24.01.2023",
"2023-01-15", "2023-01-16", "2023-01-17",
"2023-01-18", "2023-01-19", "2023-01-20",
"2023-01-21", "2023-01-22", "2023-01-23",
"2023-01-24"
],
"daily_vaccinations": [
45000, 52000, 48000, 51000, 55000,
30000, 28000, 53000, 49000, 54000,
12000, 13500, 11800, 14200, 13000,
8500, 7200, 12500, 13800, 14500,
85000, 92000, 78000, 95000, 88000,
65000, 58000, 91000, 86000, 93000
]
})
Step 1: Parse the dates
The date formats differ by country — US uses MM/DD/YYYY, Germany uses DD.MM.YYYY, Brazil uses ISO format. Elena handles each:
def parse_mixed_dates(row):
date_str = row["date_str"]
if "." in date_str and len(date_str) == 10:
return pd.to_datetime(date_str, format="%d.%m.%Y")
elif "/" in date_str:
return pd.to_datetime(date_str, format="%m/%d/%Y")
else:
return pd.to_datetime(date_str)
vax["date"] = vax.apply(parse_mixed_dates, axis=1)
Step 2: Compute rolling 7-day averages by country
vax = vax.sort_values(["country", "date"])
vax["rolling_7d"] = (vax
.groupby("country")["daily_vaccinations"]
.transform(lambda x: x.rolling(7, min_periods=1).mean())
.round(0))
print(vax[vax["country"] == "United States"][
["date", "daily_vaccinations", "rolling_7d"]])
date daily_vaccinations rolling_7d
0 2023-01-15 45000 45000.0
1 2023-01-16 52000 48500.0
2 2023-01-17 48000 48333.0
3 2023-01-18 51000 49000.0
4 2023-01-19 55000 50200.0
5 2023-01-20 30000 46833.0
6 2023-01-21 28000 44143.0
7 2023-01-22 53000 45286.0
8 2023-01-23 49000 44857.0
9 2023-01-24 54000 45714.0
The rolling average smooths out the dip on January 20-21 and shows a clearer picture of the vaccination pace.
Step 3: Identify weekend effects
vax["is_weekend"] = vax["date"].dt.dayofweek >= 5
weekend_effect = (vax
.groupby(["country", "is_weekend"])
["daily_vaccinations"].mean()
.round(0)
.unstack())
weekend_effect.columns = ["Weekday", "Weekend"]
weekend_effect["Drop %"] = (
(1 - weekend_effect["Weekend"] / weekend_effect["Weekday"]) * 100
).round(1)
print(weekend_effect)
Weekday Weekend Drop %
country
Brazil          85286.0  78000.0     8.5
Germany         12757.0  10567.0    17.2
United States   48429.0  42000.0    13.3
There's a clear weekend effect: vaccinations drop roughly 8-17% on weekends (January 21-22 in this sample) across all three countries. This is the kind of insight that time-aware analysis reveals.
Step 4: Set up for time-based analysis
# Create a time-indexed version for one country
us_vax = vax[vax["country"] == "United States"].set_index("date")
us_vax = us_vax[["daily_vaccinations"]]
# This enables time-based slicing
jan_20_onward = us_vax["2023-01-20":]
print(jan_20_onward)
Elena now has a clean, time-indexed DataFrame ready for visualization (Chapter 15) and statistical analysis (Chapter 19).
11.12 Common Pitfalls and How to Avoid Them
Pitfall 1: Mixed Date Formats in One Column
If a single column has "01/15/2023" and "15/01/2023" mixed together, to_datetime will silently misparse some entries. The date "01/02/2023" could be January 2 or February 1, and there's no way to tell without context.
Solution: Know your data source. If the format is consistent within the column, specify it with format=. If it's inconsistent, you need metadata (like the country of origin) to determine the format.
Pitfall 2: Two-Digit Years
pd.to_datetime("03/15/23")
Is that 1923 or 2023? Pandas uses a cutoff: years 00-68 become 2000-2068, and 69-99 become 1969-1999. This is almost always what you want, but if you're working with historical data, specify the format explicitly:
pd.to_datetime("03/15/23", format="%m/%d/%y") # 2023
pd.to_datetime("03/15/23", format="%m/%d/19%y") # ValueError: the literal "19" isn't in the string
For historical dates, use four-digit years in your data.
Pitfall 3: Forgetting to Convert
This is the most common pitfall. You load a CSV, start analyzing, and forget that the "date" column is still a string:
df = pd.read_csv("data.csv")
print(df["date"].dtype) # object <-- Still a string!
Always check dtypes after loading. Or, even better, parse dates during loading:
df = pd.read_csv("data.csv", parse_dates=["date"])
The parse_dates parameter tells read_csv to automatically parse the named columns as dates. This is the cleanest approach when the format is straightforward.
Pitfall 4: Off-by-One in Resampling
When resampling to weeks, what counts as "this week" depends on which day the week starts. By default, resample("W") ends weeks on Sunday. If your business week starts on Monday, use resample("W-MON").
Similarly, resample("ME") groups by month end. If you want month-start buckets, use resample("MS").
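A quick sketch showing how the anchor day changes the weekly bins (the data is a made-up run of fourteen days, one unit per day):

```python
import pandas as pd

# Fourteen daily values starting Sunday, January 1, 2023
s = pd.Series(1, index=pd.date_range("2023-01-01", periods=14, freq="D"))

# Weeks ending Sunday: January 1 sits alone in the first bin
print(s.resample("W").sum())      # 1, 7, 6
# Weeks ending Monday: January 1-2 share the first bin
print(s.resample("W-MON").sum())  # 2, 7, 5
```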
Check Your Understanding
- How would you parse dates during CSV loading instead of after?
- What's the risk of using to_datetime without a format string on a column with MM/DD/YYYY dates?
- Why might resample("W") and resample("W-MON") give different results?
11.13 Spaced Review: Concepts from Chapters 1-10
Take five minutes to answer these questions from earlier chapters without looking back.
From Chapter 1: What's the difference between a descriptive question and a predictive question? Which comes first in a typical analysis?
From Chapter 4: Write a function that takes a list of numbers and returns the average. Do it without using any library.
From Chapter 7: How do you select rows from a pandas DataFrame where a column meets a condition? Write the syntax.
From Chapter 8: What's the difference between dropna() and fillna()? When would you use each?
From Chapter 9: What does pd.merge(df1, df2, on="key", how="left") do? What happens to rows in df1 that have no match in df2?
From Chapter 10: What regex pattern would match a US phone number like "555-123-4567"? What does \d{3} mean?
Chapter Summary
Time is the backbone of most real-world data analysis. In this chapter, you learned to handle it properly.
Parsing converts date strings into datetime objects with pd.to_datetime(). Always specify the format when possible — it's faster, safer, and prevents day-month ambiguity.
Timestamp and the .dt accessor give you access to every component of a date: year, month, day, day of week, quarter, and more.
Timedelta and DateOffset enable date arithmetic: computing durations, adding intervals, and comparing dates across a dataset.
DatetimeIndex makes your DataFrame time-aware, enabling partial-string slicing ("give me all of March") and the powerful resample method.
Resampling changes the frequency of your time series: daily to weekly, hourly to daily, monthly to quarterly. The choice of aggregation function (sum vs. mean vs. max) determines what story the resampled data tells.
Rolling windows smooth noisy data by computing statistics over a sliding window. The 7-day rolling average is the workhorse of time series analysis, revealing trends hidden by day-to-day fluctuations.
Expanding windows compute cumulative statistics: running totals, running averages, running maximums.
Together, these tools let you ask and answer temporal questions: When did vaccination rates peak? How fast are they growing? Is there a weekend effect? What's the 30-day trend? These are the questions that drive real-world decisions, and now you have the tools to answer them.
What's Next
In Chapter 12, you'll learn to load data from multiple file formats — CSV, Excel, JSON, and databases. Not all data comes in one neat file, and the ability to combine data from different sources is essential for building comprehensive analyses. If you're working on the vaccination project, you'll load population data from Excel and country metadata from JSON.
This chapter covered date parsing, time arithmetic, resampling, and rolling windows. The exercises that follow will give you practice with each of these techniques on realistic datasets. Time series analysis rewards careful attention to detail — make sure your dates are parsed correctly before computing anything.