Learning Objectives
- Parse date strings in various formats into pandas Timestamp objects using to_datetime with format specifiers
- Perform date arithmetic using Timedelta, date_range, and DateOffset
- Set a datetime column as the DataFrame index and use time-based indexing and slicing
- Resample time series data to different frequencies (daily to weekly, monthly) with appropriate aggregation
- Compute rolling and expanding window statistics to smooth noisy time series data
In This Chapter
- Chapter Overview
- 11.1 Why Dates Are Hard (And Why Pandas Makes Them Easier)
- 11.2 Parsing Dates with pd.to_datetime()
- 11.3 The Timestamp Object: Anatomy of a Moment
- 11.4 Date Arithmetic: Timedelta and DateOffset
- 11.5 The DatetimeIndex: Making Time the Index
- 11.6 Resampling: Changing the Frequency of Time Series Data
- 11.7 Rolling Windows: Smoothing the Noise
- 11.8 Expanding Windows: Cumulative Statistics
- 11.9 Working with Time Zones
- 11.10 Practical Patterns: Common Date Operations
- 11.11 Project Checkpoint: Parsing Dates and Computing Rolling Averages
- 11.12 Common Pitfalls and How to Avoid Them
- 11.13 Spaced Review: Concepts from Chapters 1-10
- Chapter Summary
Chapter 11: Working with Dates, Times, and Time Series Data
"Time is the most underpurchased asset in the world." — Warren Buffett
Chapter Overview
Every interesting question about the real world involves time.
When did COVID-19 vaccination rates start to plateau? How do coffee shop sales change from Monday to Sunday? Is air quality getting better or worse over the last decade? When did the website crash, and how long was it down? All of these questions are impossible to answer without dates, times, and the ability to reason about data that changes over time.
And yet, time is one of the trickiest things to handle in data. Dates arrive as strings in a dozen different formats: "2023-03-15", "03/15/2023", "March 15, 2023", "15-Mar-23". Time zones create invisible traps: a meeting at "3:00 PM" means different things in New York and London. Leap years, daylight saving time, months with different numbers of days — the calendar is a maze of special cases that computers handle poorly unless you tell them exactly what to do.
The good news: pandas has already mapped this maze for you. The pd.to_datetime() function, the Timestamp type, the DatetimeIndex, and the resample method together form a powerful toolkit for working with time. By the end of this chapter, you'll be able to parse any date format, slice data by time ranges, aggregate daily data into weekly or monthly summaries, and compute rolling averages that smooth out day-to-day noise.
In this chapter, you will learn to:
- Parse date strings in various formats into pandas Timestamp objects using to_datetime with format specifiers
- Perform date arithmetic using Timedelta, date_range, and DateOffset
- Set a datetime column as the DataFrame index and use time-based indexing and slicing
- Resample time series data to different frequencies (daily to weekly, monthly) with appropriate aggregation
- Compute rolling and expanding window statistics to smooth noisy time series data
11.1 Why Dates Are Hard (And Why Pandas Makes Them Easier)
Let's start with a simple experiment:
import pandas as pd
dates = pd.Series(["2023-01-15", "2023-02-20", "2023-03-10"])
dates.sort_values()
0 2023-01-15
1 2023-02-20
2 2023-03-10
dtype: object
That sorted correctly. But watch what happens with a different format:
dates = pd.Series(["1/15/2023", "2/20/2023", "12/5/2023"])
dates.sort_values()
0 1/15/2023
2 12/5/2023
1 2/20/2023
dtype: object
"12/5/2023" sorted between January and February because pandas is sorting these as strings, and the string "12" comes after "1" but before "2" in lexicographic order. December ended up between January and February. This is not just wrong — it's the kind of wrong that's invisible in a large dataset.
The fix is to tell pandas that these are dates, not strings:
dates = pd.to_datetime(
pd.Series(["1/15/2023", "2/20/2023", "12/5/2023"]))
dates.sort_values()
0 2023-01-15
1 2023-02-20
2 2023-12-05
dtype: datetime64[ns]
Now they sort correctly. The dtype changed from object (string) to datetime64[ns] — pandas' native date type. This unlocks everything: proper sorting, date arithmetic, time-based filtering, and resampling.
The Fundamental Problem with Dates as Strings
Strings that look like dates are useless for analysis. You can't subtract one from another to find the number of days between them. You can't ask "give me all rows from March." You can't compute a weekly average. The string "2023-03-15" is just ten characters to Python — it has no idea that it represents the fifteenth of March.
Converting dates from strings to proper datetime objects is almost always one of the first things you should do when loading a new dataset. It's up there with checking for missing values and verifying data types. If your date column has dtype object, it's a string, and it needs conversion.
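To make that concrete, here's a minimal sketch (variable names are illustrative): subtracting two date strings raises a TypeError, while the same subtraction on parsed values yields a Timedelta.

```python
import pandas as pd

s = pd.Series(["2023-03-15", "2023-03-01"])

# Strings: subtraction is not defined
try:
    s.iloc[0] - s.iloc[1]
except TypeError:
    print("can't subtract strings")

# After conversion: subtraction yields a Timedelta
d = pd.to_datetime(s)
gap = d.iloc[0] - d.iloc[1]
print(gap.days)  # 14
```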
Check Your Understanding
- Why does sorting date strings in "MM/DD/YYYY" format produce incorrect results?
- What dtype does a properly parsed date column have in pandas?
- A colleague says "my date column looks fine, all the dates are there." Why might it still need conversion?
11.2 Parsing Dates with pd.to_datetime()
The pd.to_datetime() function is your main tool for converting strings to datetime objects. It's remarkably flexible.
Automatic Parsing
For common date formats, to_datetime just works:
pd.to_datetime("2023-03-15") # ISO format
pd.to_datetime("March 15, 2023") # Written out
pd.to_datetime("03/15/2023") # US format
pd.to_datetime("15-Mar-2023") # Abbreviated month
All of these produce:
Timestamp('2023-03-15 00:00:00')
You can also pass an entire Series or column:
df = pd.DataFrame({
"date_str": ["2023-01-15", "2023-02-20",
"2023-03-10", "2023-04-05"]
})
df["date"] = pd.to_datetime(df["date_str"])
print(df.dtypes)
date_str object
date datetime64[ns]
dtype: object
When Automatic Parsing Goes Wrong: The Day-Month Ambiguity
Here's a trap that catches people constantly:
pd.to_datetime("01/02/2023")
Timestamp('2023-01-02 00:00:00')
Is that January 2nd or February 1st? Pandas assumed the US format (month/day/year), but if your data is from Europe, Australia, or most of the world, it means February 1st. There's no way for pandas to know which you meant.
The solution is to be explicit about the format. Use dayfirst=True for day-first formats:
pd.to_datetime("01/02/2023", dayfirst=True)
Timestamp('2023-02-01 00:00:00')
Or, better yet, specify the exact format string:
pd.to_datetime("01/02/2023", format="%d/%m/%Y")
Timestamp('2023-02-01 00:00:00')
Format Codes: Speaking the Language of Dates
Format codes are the alphabet for date parsing. Each code represents a component of a date or time:
| Code | Meaning | Example |
|---|---|---|
| %Y | 4-digit year | 2023 |
| %y | 2-digit year | 23 |
| %m | Month (zero-padded) | 03 |
| %d | Day (zero-padded) | 15 |
| %B | Full month name | March |
| %b | Abbreviated month | Mar |
| %H | Hour (24-hour) | 14 |
| %I | Hour (12-hour) | 02 |
| %M | Minute | 30 |
| %S | Second | 45 |
| %p | AM/PM | PM |
| %A | Full weekday name | Wednesday |
| %a | Abbreviated weekday | Wed |
You combine these codes with the literal characters in your date string:
# "15-Mar-2023 14:30"
pd.to_datetime("15-Mar-2023 14:30", format="%d-%b-%Y %H:%M")
Timestamp('2023-03-15 14:30:00')
# "March 15, 2023 at 2:30 PM"
pd.to_datetime("March 15, 2023 at 2:30 PM",
format="%B %d, %Y at %I:%M %p")
Timestamp('2023-03-15 14:30:00')
Parsing an Entire Column
When you specify the format parameter, parsing is also much faster — sometimes 10x faster on large datasets — because pandas doesn't have to guess the format for each value:
df = pd.DataFrame({
"date_str": ["15/01/2023", "20/02/2023",
"10/03/2023", "05/04/2023"]
})
# Slow (pandas guesses format for each value)
df["date"] = pd.to_datetime(df["date_str"])
# Fast (you tell pandas the format)
df["date"] = pd.to_datetime(df["date_str"],
format="%d/%m/%Y")
For large datasets (hundreds of thousands of rows), always specify the format.
Handling Parsing Errors
Real data has bad dates. Someone typed "not available" or "TBD" in a date field. to_datetime will raise an error by default:
dates = pd.Series(["2023-01-15", "TBD", "2023-03-10"])
pd.to_datetime(dates) # Raises ValueError
Use errors="coerce" to turn unparseable values into NaT (Not a Time — the datetime equivalent of NaN):
pd.to_datetime(dates, errors="coerce")
0 2023-01-15
1 NaT
2 2023-03-10
dtype: datetime64[ns]
NaT behaves like NaN — it's ignored by aggregations, it propagates through arithmetic, and you can detect it with .isna().
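A quick sketch of that behavior, rebuilding the series above:

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["2023-01-15", "TBD", "2023-03-10"]),
    errors="coerce")

print(dates.isna())        # True only in position 1
print(dates.isna().sum())  # 1 unparseable value
print(dates.min())         # aggregations skip NaT: 2023-01-15
```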
Check Your Understanding
- What format string would you use for dates like "2023/Mar/15"?
- Why is specifying a format string faster than letting pandas guess?
- What does errors="coerce" do, and when would you use it?
11.3 The Timestamp Object: Anatomy of a Moment
Once you've parsed a date, you get a Timestamp object — pandas' representation of a single point in time. It has many useful attributes:
ts = pd.Timestamp("2023-03-15 14:30:00")
ts.year # 2023
ts.month # 3
ts.day # 15
ts.hour # 14
ts.minute # 30
ts.second # 0
ts.dayofweek # 2 (Wednesday; Monday=0)
ts.day_name() # 'Wednesday'
ts.month_name() # 'March'
ts.quarter # 1
ts.dayofyear # 74
ts.is_leap_year # False
For a column of datetime values, you access these through the .dt accessor (similar to .str for strings):
df = pd.DataFrame({
"date": pd.to_datetime(
["2023-01-15", "2023-06-20", "2023-12-25"])
})
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_name"] = df["date"].dt.day_name()
df["quarter"] = df["date"].dt.quarter
print(df)
date year month day_name quarter
0 2023-01-15 2023 1 Sunday 1
1 2023-06-20 2023 6 Tuesday 2
2 2023-12-25 2023 12 Monday 4
The .dt accessor is the datetime equivalent of the .str accessor you learned in Chapter 10. Just as .str.lower() applies a string method to every value, .dt.month extracts the month from every datetime value.
Formatting Dates for Output
The strftime() method (string-format-time) converts a datetime back to a string in any format you want:
ts = pd.Timestamp("2023-03-15 14:30:00")
ts.strftime("%B %d, %Y") # 'March 15, 2023'
ts.strftime("%m/%d/%y") # '03/15/23'
ts.strftime("%A, %B %d at %I:%M %p") # 'Wednesday, March 15 at 02:30 PM'
For a column:
df["date"].dt.strftime("%B %Y")
0 January 2023
1 June 2023
2 December 2023
dtype: object
This is useful for creating human-readable labels on charts or in reports.
11.4 Date Arithmetic: Timedelta and DateOffset
One of the biggest advantages of proper datetime objects is that you can do math with them.
Subtracting Dates Gives a Timedelta
start = pd.Timestamp("2023-01-15")
end = pd.Timestamp("2023-03-20")
difference = end - start
print(difference)
print(type(difference))
64 days 00:00:00
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
A Timedelta represents a duration — a length of time. You can extract components:
difference.days # 64
difference.total_seconds() # 5529600.0
This works on entire columns:
df = pd.DataFrame({
"dose_1": pd.to_datetime(["2023-01-15", "2023-02-01", "2023-03-10"]),
"dose_2": pd.to_datetime(["2023-02-12", "2023-02-28", "2023-04-07"])
})
df["gap"] = df["dose_2"] - df["dose_1"]
df["gap_days"] = df["gap"].dt.days
print(df)
dose_1 dose_2 gap gap_days
0 2023-01-15 2023-02-12 28 days 28
1 2023-02-01 2023-02-28 27 days 27
2 2023-03-10 2023-04-07 28 days 28
Adding Time to a Date
You can add a Timedelta to a Timestamp:
appointment = pd.Timestamp("2023-03-15")
follow_up = appointment + pd.Timedelta(days=28)
print(follow_up) # 2023-04-12 00:00:00
Create Timedelta objects in several ways:
pd.Timedelta(days=7) # 7 days
pd.Timedelta(hours=48) # 2 days
pd.Timedelta("3 days 4 hours") # 3 days, 4 hours
pd.Timedelta(weeks=2) # 14 days
DateOffset: Calendar-Aware Offsets
Timedelta counts exact durations. But sometimes you need calendar-aware offsets: "one month later" should go from January 15 to February 15, not add exactly 30 days.
from pandas.tseries.offsets import MonthEnd, MonthBegin, YearEnd
date = pd.Timestamp("2023-01-15")
date + pd.DateOffset(months=1) # 2023-02-15
date + pd.DateOffset(years=1) # 2024-01-15
date + MonthEnd(1) # 2023-01-31
date + MonthBegin(1) # 2023-02-01
DateOffset understands the calendar: adding one month to January 31 gives February 28 (or 29 in a leap year), not March 3.
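The contrast is easiest to see at a month boundary. A minimal sketch comparing an exact 30-day duration with a calendar-aware one-month offset, starting from January 31:

```python
import pandas as pd

jan31 = pd.Timestamp("2023-01-31")

# Exact duration: 30 days later lands in March
print(jan31 + pd.Timedelta(days=30))    # 2023-03-02
# Calendar-aware: clamps to the last day of February
print(jan31 + pd.DateOffset(months=1))  # 2023-02-28
```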
A Practical Example: Computing Days Since Vaccination
vaccination_dates = pd.to_datetime(pd.Series([
"2023-01-15", "2023-02-20", "2023-03-10",
"2023-04-05", "2023-05-12"
]))
today = pd.Timestamp("2023-12-01")
days_since = (today - vaccination_dates).dt.days
print(days_since)
0 320
1 284
2 266
3 240
4 203
dtype: int64
Now you can easily find patients due for a booster (say, more than 180 days since their last dose):
due_for_booster = days_since > 180
print(due_for_booster)
0 True
1 True
2 True
3 True
4 True
dtype: bool
Check Your Understanding
- What's the difference between pd.Timedelta(days=30) and pd.DateOffset(months=1)?
- What type do you get when you subtract one Timestamp from another?
- How would you compute the number of hours between two datetime values?
11.5 The DatetimeIndex: Making Time the Index
Here's where things get powerful. When you set a datetime column as your DataFrame's index, pandas unlocks a whole suite of time-aware operations.
Setting the Index
df = pd.DataFrame({
"date": pd.to_datetime([
"2023-01-01", "2023-01-02", "2023-01-03",
"2023-01-04", "2023-01-05"
]),
"vaccinations": [1200, 1350, 980, 1500, 1425]
})
df = df.set_index("date")
print(df)
vaccinations
date
2023-01-01 1200
2023-01-02 1350
2023-01-03 980
2023-01-04 1500
2023-01-05 1425
The index is now a DatetimeIndex:
type(df.index) # pandas.core.indexes.datetimes.DatetimeIndex
Time-Based Slicing
With a DatetimeIndex, you can slice using date strings — no need to create Timestamp objects:
# All rows from January 2 to January 4
df["2023-01-02":"2023-01-04"]
vaccinations
date
2023-01-02 1350
2023-01-03 980
2023-01-04 1500
# All rows in January 2023
df.loc["2023-01"]
vaccinations
date
2023-01-01 1200
2023-01-02 1350
2023-01-03 980
2023-01-04 1500
2023-01-05 1425
# All rows in 2023
df.loc["2023"]
This partial-string indexing is incredibly convenient. You can slice by year, year-month, or exact date range — and pandas figures out what you mean.
Creating a DatetimeIndex with date_range()
Sometimes you need to create a date sequence from scratch:
# Every day in January 2023
dates = pd.date_range(start="2023-01-01",
end="2023-01-31",
freq="D")
print(len(dates)) # 31
# Every Monday in Q1 2023
mondays = pd.date_range(start="2023-01-01",
end="2023-03-31",
freq="W-MON")
print(mondays[:5])
DatetimeIndex(['2023-01-02', '2023-01-09', '2023-01-16',
'2023-01-23', '2023-01-30'],
dtype='datetime64[ns]', freq='W-MON')
# First day of each month in 2023
month_starts = pd.date_range(start="2023-01-01",
periods=12,
freq="MS")
print(month_starts)
The freq parameter uses frequency aliases:
| Alias | Meaning |
|---|---|
| "D" | Calendar day |
| "B" | Business day (Mon-Fri) |
| "W" | Weekly (defaults to Sunday) |
| "W-MON" | Weekly, anchored to Monday |
| "MS" | Month start |
| "ME" | Month end |
| "QS" | Quarter start |
| "YS" | Year start |
| "h" | Hourly |
| "min" | Minutely |
A Practical Example: Building a Daily Vaccination Timeline
import numpy as np
# Create a year of daily vaccination data
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
np.random.seed(42)
# Simulate: baseline + trend + weekly pattern + noise
baseline = 5000
trend = np.linspace(0, 2000, len(dates))
weekly = 1000 * np.sin(
2 * np.pi * np.arange(len(dates)) / 7)
noise = np.random.normal(0, 500, len(dates))
daily_doses = pd.Series(
(baseline + trend + weekly + noise).clip(min=0),
index=dates,
name="daily_vaccinations"
).round(0).astype(int)
print(daily_doses.head(10))
2023-01-01 5748
2023-01-02 4431
2023-01-03 5824
2023-01-04 6262
2023-01-05 4883
2023-01-06 4387
2023-01-07 5606
2023-01-08 6019
2023-01-09 4614
2023-01-10 5577
Name: daily_vaccinations, dtype: int64
This is a time series — a sequence of values indexed by time. And now we can ask it all kinds of questions.
11.6 Resampling: Changing the Frequency of Time Series Data
Resampling is the process of converting time series data from one frequency to another. Daily data to weekly. Hourly data to daily. Monthly data to quarterly.
Downsampling: From Higher to Lower Frequency
Downsampling reduces the number of data points by aggregating them. The resample() method works like groupby() for time:
# Daily to weekly total
weekly = daily_doses.resample("W").sum()
print(weekly.head(5))
2023-01-01 5748
2023-01-08 37412
2023-01-15 39243
2023-01-22 40876
2023-01-29 42015
Name: daily_vaccinations, dtype: int64
# Daily to monthly average
monthly_avg = daily_doses.resample("ME").mean().round(0)
print(monthly_avg)
2023-01-31 5487.0
2023-02-28 5806.0
2023-03-31 6129.0
2023-04-30 6298.0
2023-05-31 6456.0
2023-06-30 6531.0
2023-07-31 6749.0
2023-08-31 6702.0
2023-09-30 6816.0
2023-10-31 6929.0
2023-11-30 6945.0
2023-12-31 7011.0
Name: daily_vaccinations, dtype: float64
You can see the upward trend clearly in the monthly averages — from about 5,500 doses per day in January to 7,000 in December.
Choosing the Right Aggregation Function
The aggregation function matters enormously:
# Total vaccinations per month (sum)
daily_doses.resample("ME").sum()
# Average daily vaccinations per month (mean)
daily_doses.resample("ME").mean()
# Peak daily vaccinations per month (max)
daily_doses.resample("ME").max()
# Number of days with data per month (count)
daily_doses.resample("ME").count()
Use sum when: You want the total over the period (total monthly sales, total weekly vaccinations).
Use mean when: You want the typical value during the period (average daily temperature, average daily vaccination rate).
Use max/min when: You want extremes (hottest day of the month, lowest daily sales).
Multiple Aggregations at Once
monthly_stats = daily_doses.resample("ME").agg(
["mean", "min", "max", "std"])
print(monthly_stats.head())
mean min max std
2023-01-31 5487.0 3844 7206 891.258...
2023-02-28 5806.0 4112 7498 873.445...
2023-03-31 6129.0 4501 7844 928.102...
2023-04-30 6298.0 4680 8001 875.339...
2023-05-31 6456.0 4890 8234 904.567...
Upsampling: From Lower to Higher Frequency
You can also go the other direction — from monthly to daily, for example. This creates gaps that need filling:
quarterly = daily_doses.resample("QS").mean()
print(quarterly)
2023-01-01 5807.3
2023-04-01 6428.6
2023-07-01 6756.1
2023-10-01 6961.8
Name: daily_vaccinations, dtype: float64
# Upsample to monthly, forward-fill to populate gaps
monthly = quarterly.resample("MS").ffill()
print(monthly)
2023-01-01 5807.3
2023-02-01 5807.3
2023-03-01 5807.3
2023-04-01 6428.6
2023-05-01 6428.6
...
ffill() (forward fill) copies each value forward until the next known value. You can also use bfill() (backward fill) or interpolate() for smoother transitions.
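Here's a minimal sketch of the difference, using a tiny quarterly series with made-up values: ffill() repeats the last known value, while interpolate() draws a straight line between the known points.

```python
import pandas as pd

# Two quarterly observations (values are made up)
q = pd.Series(
    [100.0, 400.0],
    index=pd.date_range("2023-01-01", periods=2, freq="QS"))

print(q.resample("MS").ffill())        # 100, 100, 100, 400
print(q.resample("MS").interpolate())  # 100, 200, 300, 400
```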
Check Your Understanding
- What's the difference between .resample("W").sum() and .resample("W").mean()?
- When resampling daily data to monthly, why might you choose sum for sales data but mean for temperature data?
- What does ffill() do when upsampling?
11.7 Rolling Windows: Smoothing the Noise
Daily data is noisy. Day-to-day fluctuations make it hard to see the underlying trend. Rolling windows solve this by computing a statistic (usually the mean) over a sliding window of fixed size.
The Basic Rolling Mean
# 7-day rolling average
rolling_7 = daily_doses.rolling(window=7).mean()
print(rolling_7.head(10))
2023-01-01 NaN
2023-01-02 NaN
2023-01-03 NaN
2023-01-04 NaN
2023-01-05 NaN
2023-01-06 NaN
2023-01-07 5344.7
2023-01-08 5280.6
2023-01-09 5374.1
2023-01-10 5410.6
Name: daily_vaccinations, dtype: float64
The first 6 values are NaN because you need 7 data points to compute a 7-day average. Starting from the 7th day, each value is the average of that day and the 6 preceding days.
Why Rolling Averages Matter
Let's compare the raw daily data with a 7-day rolling average:
comparison = pd.DataFrame({
"daily": daily_doses,
"7_day_avg": daily_doses.rolling(7).mean().round(0)
})
print(comparison.loc["2023-03-01":"2023-03-14"])
daily 7_day_avg
2023-03-01 5892 5842.0
2023-03-02 4501 5780.0
2023-03-03 6823 5946.0
2023-03-04 7123 6124.0
2023-03-05 5501 5976.0
2023-03-06 5210 5874.0
2023-03-07 6344 5913.0
2023-03-08 6112 5945.0
2023-03-09 4890 5857.0
2023-03-10 7003 6026.0
2023-03-11 6789 6007.0
2023-03-12 5544 5985.0
2023-03-13 5301 5998.0
2023-03-14 6450 6013.0
The daily values swing from 4,501 to 7,123. The 7-day average stays in a much narrower band (5,780 to 6,124). The rolling average smooths out the day-to-day noise and reveals the underlying trend.
This is exactly the calculation you see in news reports: "the 7-day rolling average of new COVID cases." It's the same technique applied to vaccination data, stock prices, website traffic, and countless other metrics.
Choosing the Window Size
The window size controls the trade-off between smoothness and responsiveness:
# Smaller window (3 days): responsive, still noisy
daily_doses.rolling(3).mean()
# Medium window (7 days): balanced — the default for most weekly patterns
daily_doses.rolling(7).mean()
# Larger window (30 days): very smooth, but slow to respond to changes
daily_doses.rolling(30).mean()
A 7-day window is the most common choice for daily data because it smooths out the day-of-week effect (weekdays vs. weekends) while still capturing weekly trends.
Centered vs. Trailing Windows
By default, the rolling window looks backward — the value for March 15 uses data from March 9-15. You can center the window instead:
# Centered: March 15 uses data from March 12-18
daily_doses.rolling(7, center=True).mean()
Centered windows are better for visualizing historical data (the smoothed value represents the middle of the window period). Trailing windows are better for real-time monitoring (you can compute today's value because it only uses past data).
Other Rolling Statistics
Rolling windows work with any aggregation, not just the mean:
daily_doses.rolling(7).std() # rolling standard deviation
daily_doses.rolling(7).min() # rolling minimum
daily_doses.rolling(7).max() # rolling maximum
daily_doses.rolling(7).median() # rolling median
daily_doses.rolling(7).sum() # rolling sum
The rolling standard deviation is particularly useful — a widening standard deviation suggests increasing volatility, while a narrowing one suggests stabilization.
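A sketch of that idea with made-up values, where the second half of the series swings much harder than the first:

```python
import pandas as pd

# Calm first five days, volatile last five (values are made up)
s = pd.Series([100, 102, 101, 103, 100, 140, 60, 150, 55, 145],
              index=pd.date_range("2023-01-01", periods=10, freq="D"))

vol = s.rolling(5).std()
print(round(vol.iloc[4], 1))   # 1.3  (std of the calm days)
print(round(vol.iloc[-1], 1))  # 48.1 (std of the volatile days)
```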
Minimum Periods
If you don't want NaN values at the start of your series, you can set min_periods:
# Compute rolling average even with fewer than 7 data points
daily_doses.rolling(7, min_periods=1).mean()
2023-01-01 5748.0 (average of 1 value)
2023-01-02 5089.5 (average of 2 values)
2023-01-03 5334.3 (average of 3 values)
...
2023-01-07 5344.7 (average of 7 values)
With min_periods=1, the first value is just itself (average of one number), the second is the average of two values, and so on. Once you reach the full window size, it behaves normally.
11.8 Expanding Windows: Cumulative Statistics
While rolling windows use a fixed-size window that moves through the data, expanding windows start at the beginning and grow to include all data up to the current point.
# Cumulative average (expanding window)
cumulative_avg = daily_doses.expanding().mean()
print(cumulative_avg.head(5))
2023-01-01 5748.0
2023-01-02 5089.5
2023-01-03 5334.3
2023-01-04 5566.3
2023-01-05 5429.6
Name: daily_vaccinations, dtype: float64
The first value is just itself. The second is the average of the first two. The third is the average of the first three. And so on. By the end of the year, it's the average of all 365 days.
Cumulative Sum
# Running total of vaccinations
cumulative_total = daily_doses.expanding().sum()
print(f"Total vaccinations by year end: "
f"{cumulative_total.iloc[-1]:,.0f}")
Total vaccinations by year end: 2,312,456
Expanding vs. Rolling: When to Use Each
| Use Case | Tool | Why |
|---|---|---|
| Smooth daily noise | Rolling (7-day) | Captures recent trends |
| Track cumulative totals | Expanding sum | Shows running total |
| Monitor overall average | Expanding mean | Shows long-term average |
| Detect recent volatility | Rolling std | Compares to recent behavior |
| Detect trend changes | Rolling (30-day) | Smooths short-term noise |
11.9 Working with Time Zones
Time zones are a source of endless bugs. A meeting at "3:00 PM EST" is "8:00 PM UTC" is "12:00 PM PST." If your data involves events from different time zones — server logs, flight schedules, international vaccination campaigns — you need to handle this explicitly.
Naive vs. Aware Timestamps
A naive timestamp has no timezone information:
naive = pd.Timestamp("2023-03-15 14:30:00")
print(naive.tzinfo) # None
An aware timestamp knows its timezone:
aware = pd.Timestamp("2023-03-15 14:30:00", tz="US/Eastern")
print(aware)
print(aware.tzinfo)
2023-03-15 14:30:00-04:00
US/Eastern
Converting Between Time Zones
eastern = pd.Timestamp("2023-03-15 14:30:00", tz="US/Eastern")
# Convert to UTC
utc = eastern.tz_convert("UTC")
print(utc) # 2023-03-15 18:30:00+00:00
# Convert to Pacific
pacific = eastern.tz_convert("US/Pacific")
print(pacific) # 2023-03-15 11:30:00-07:00
For a column:
# Add timezone to a naive column
df["date_utc"] = df["date"].dt.tz_localize("UTC")
# Convert to a different timezone
df["date_eastern"] = df["date_utc"].dt.tz_convert("US/Eastern")
Practical Advice on Time Zones
For most data science work with tabular data, here's the simple rule: store everything in UTC internally, convert to local time only for display. This avoids daylight saving time bugs, ambiguous times, and the chaos of comparing timestamps from different zones.
If your dataset doesn't specify time zones and you know all the data is from one zone, you can usually work with naive timestamps. Just document your assumption.
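A sketch of the store-in-UTC, display-locally pattern (the timestamps and zone names here are illustrative):

```python
import pandas as pd

# Events logged as naive local times in New York
events = pd.to_datetime(
    pd.Series(["2023-03-15 09:00", "2023-03-15 17:30"]))

# Attach the known source zone, then normalize to UTC for storage
stored = (events
          .dt.tz_localize("America/New_York")
          .dt.tz_convert("UTC"))
print(stored)  # 13:00 and 21:30 UTC (EDT is UTC-4 in March)

# Convert to a local zone only for display
print(stored.dt.tz_convert("Europe/London"))
```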
11.10 Practical Patterns: Common Date Operations
Here are operations you'll use constantly. Bookmark this section.
Extracting Components for Analysis
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek # 0=Monday
df["day_name"] = df["date"].dt.day_name()
df["is_weekend"] = df["date"].dt.dayofweek >= 5
df["quarter"] = df["date"].dt.quarter
df["week_of_year"] = df["date"].dt.isocalendar().week
Filtering by Date Range
# Using boolean masks
mask = (df["date"] >= "2023-06-01") & (df["date"] < "2023-07-01")
june_data = df[mask]
# Using DatetimeIndex slicing
df_indexed = df.set_index("date")
june_data = df_indexed.loc["2023-06"]
Grouping by Time Periods
# Average vaccination rate by month
df.groupby(df["date"].dt.month)["vaccinations"].mean()
# Total by quarter
df.groupby(df["date"].dt.quarter)["vaccinations"].sum()
# Compare weekdays vs. weekends
df.groupby(df["date"].dt.dayofweek >= 5)["vaccinations"].mean()
Finding Time Gaps
# Detect missing dates in a time series
full_range = pd.date_range(
df["date"].min(), df["date"].max(), freq="D")
missing = full_range.difference(df["date"])
print(f"Missing {len(missing)} days of data")
Computing Year-Over-Year Changes
# Shift data by one year for comparison
df_indexed["last_year"] = df_indexed["vaccinations"].shift(365)
df_indexed["yoy_change"] = (
df_indexed["vaccinations"] - df_indexed["last_year"])
11.11 Project Checkpoint: Parsing Dates and Computing Rolling Averages
Let's apply everything to Elena's vaccination project. She has daily vaccination counts by country, but the date column arrived as strings in an inconsistent format.
# Load the data
vax = pd.DataFrame({
"country": (["United States"] * 10 +
["Germany"] * 10 +
["Brazil"] * 10),
"date_str": [
"01/15/2023", "01/16/2023", "01/17/2023",
"01/18/2023", "01/19/2023", "01/20/2023",
"01/21/2023", "01/22/2023", "01/23/2023",
"01/24/2023",
"15.01.2023", "16.01.2023", "17.01.2023",
"18.01.2023", "19.01.2023", "20.01.2023",
"21.01.2023", "22.01.2023", "23.01.2023",
"24.01.2023",
"2023-01-15", "2023-01-16", "2023-01-17",
"2023-01-18", "2023-01-19", "2023-01-20",
"2023-01-21", "2023-01-22", "2023-01-23",
"2023-01-24"
],
"daily_vaccinations": [
45000, 52000, 48000, 51000, 55000,
30000, 28000, 53000, 49000, 54000,
12000, 13500, 11800, 14200, 13000,
8500, 7200, 12500, 13800, 14500,
85000, 92000, 78000, 95000, 88000,
65000, 58000, 91000, 86000, 93000
]
})
Step 1: Parse the dates
The date formats differ by country — US uses MM/DD/YYYY, Germany uses DD.MM.YYYY, Brazil uses ISO format. Elena handles each:
def parse_mixed_dates(row):
date_str = row["date_str"]
if "." in date_str and len(date_str) == 10:
return pd.to_datetime(date_str, format="%d.%m.%Y")
elif "/" in date_str:
return pd.to_datetime(date_str, format="%m/%d/%Y")
else:
return pd.to_datetime(date_str)
vax["date"] = vax.apply(parse_mixed_dates, axis=1)
Step 2: Compute rolling 7-day averages by country
vax = vax.sort_values(["country", "date"])
vax["rolling_7d"] = (vax
.groupby("country")["daily_vaccinations"]
.transform(lambda x: x.rolling(7, min_periods=1).mean())
.round(0))
print(vax[vax["country"] == "United States"][
["date", "daily_vaccinations", "rolling_7d"]])
date daily_vaccinations rolling_7d
0 2023-01-15 45000 45000.0
1 2023-01-16 52000 48500.0
2 2023-01-17 48000 48333.0
3 2023-01-18 51000 49000.0
4 2023-01-19 55000 50200.0
5 2023-01-20 30000 46833.0
6 2023-01-21 28000 44143.0
7 2023-01-22 53000 45286.0
8 2023-01-23 49000 44857.0
9 2023-01-24 54000 45714.0
The rolling average smooths out the dip on January 20-21 and shows a clearer picture of the vaccination pace.
Step 3: Identify weekend effects
vax["is_weekend"] = vax["date"].dt.dayofweek >= 5
weekend_effect = (vax
.groupby(["country", "is_weekend"])
["daily_vaccinations"].mean()
.round(0)
.unstack())
weekend_effect.columns = ["Weekday", "Weekend"]
weekend_effect["Drop %"] = (
(1 - weekend_effect["Weekend"] / weekend_effect["Weekday"]) * 100
).round(1)
print(weekend_effect)
Weekday Weekend Drop %
country
Brazil          85286.0  78000.0     8.5
Germany         12757.0  10567.0    17.2
United States   48429.0  42000.0    13.3
There's a clear weekend effect: vaccinations drop roughly 8-17% on weekends (January 21-22 in this sample) across all three countries. This is the kind of insight that time-aware analysis reveals.
Step 4: Set up for time-based analysis
# Create a time-indexed version for one country
us_vax = vax[vax["country"] == "United States"].set_index("date")
us_vax = us_vax[["daily_vaccinations"]]
# This enables time-based slicing
jan_20_onward = us_vax["2023-01-20":]
print(jan_20_onward)
Elena now has a clean, time-indexed DataFrame ready for visualization (Chapter 15) and statistical analysis (Chapter 19).
11.12 Common Pitfalls and How to Avoid Them
Pitfall 1: Mixed Date Formats in One Column
If a single column has "01/15/2023" and "15/01/2023" mixed together, to_datetime will silently misparse some entries. The date "01/02/2023" could be January 2 or February 1, and there's no way to tell without context.
Solution: Know your data source. If the format is consistent within the column, specify it with format=. If it's inconsistent, you need metadata (like the country of origin) to determine the format.
Pitfall 2: Two-Digit Years
pd.to_datetime("03/15/23")
Is that 1923 or 2023? Pandas uses a cutoff: years 00-68 become 2000-2068, and 69-99 become 1969-1999. This is almost always what you want, but if you're working with historical data, specify the format explicitly:
pd.to_datetime("03/15/23", format="%m/%d/%y") # 2023
pd.to_datetime("03/15/23", format="%m/%d/19%y") # ValueError: the literal "19" isn't in the string
For historical dates, use four-digit years in your data.
Pitfall 3: Forgetting to Convert
This is the most common pitfall. You load a CSV, start analyzing, and forget that the "date" column is still a string:
df = pd.read_csv("data.csv")
print(df["date"].dtype) # object <-- Still a string!
Always check dtypes after loading. Or, even better, parse dates during loading:
df = pd.read_csv("data.csv", parse_dates=["date"])
The parse_dates parameter tells read_csv to automatically parse the named columns as dates. This is the cleanest approach when the format is straightforward.
Pitfall 4: Off-by-One in Resampling
When resampling to weeks, what counts as "this week" depends on which day the week starts. By default, resample("W") ends weeks on Sunday. If your business week starts on Monday, use resample("W-MON").
Similarly, resample("ME") groups by month end. If you want month-start buckets, use resample("MS").
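A quick sketch showing how the anchor day changes the weekly bins (the data is a made-up run of fourteen days, one unit per day):

```python
import pandas as pd

# Fourteen daily values starting Sunday, January 1, 2023
s = pd.Series(1, index=pd.date_range("2023-01-01", periods=14, freq="D"))

# Weeks ending Sunday: January 1 sits alone in the first bin
print(s.resample("W").sum())      # 1, 7, 6
# Weeks ending Monday: January 1-2 share the first bin
print(s.resample("W-MON").sum())  # 2, 7, 5
```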
Check Your Understanding
- How would you parse dates during CSV loading instead of after?
- What's the risk of using to_datetime without a format string on a column with MM/DD/YYYY dates?
- Why might resample("W") and resample("W-MON") give different results?
11.13 Spaced Review: Concepts from Chapters 1-10
Take five minutes to answer these questions from earlier chapters without looking back.
From Chapter 1: What's the difference between a descriptive question and a predictive question? Which comes first in a typical analysis?
From Chapter 4: Write a function that takes a list of numbers and returns the average. Do it without using any library.
From Chapter 7: How do you select rows from a pandas DataFrame where a column meets a condition? Write the syntax.
From Chapter 8: What's the difference between dropna() and fillna()? When would you use each?
From Chapter 9: What does pd.merge(df1, df2, on="key", how="left") do? What happens to rows in df1 that have no match in df2?
From Chapter 10: What regex pattern would match a US phone number like "555-123-4567"? What does \d{3} mean?
Chapter Summary
Time is the backbone of most real-world data analysis. In this chapter, you learned to handle it properly.
Parsing converts date strings into datetime objects with pd.to_datetime(). Always specify the format when possible — it's faster, safer, and prevents day-month ambiguity.
Timestamp and the .dt accessor give you access to every component of a date: year, month, day, day of week, quarter, and more.
Timedelta and DateOffset enable date arithmetic: computing durations, adding intervals, and comparing dates across a dataset.
DatetimeIndex makes your DataFrame time-aware, enabling partial-string slicing ("give me all of March") and the powerful resample method.
Resampling changes the frequency of your time series: daily to weekly, hourly to daily, monthly to quarterly. The choice of aggregation function (sum vs. mean vs. max) determines what story the resampled data tells.
Rolling windows smooth noisy data by computing statistics over a sliding window. The 7-day rolling average is the workhorse of time series analysis, revealing trends hidden by day-to-day fluctuations.
Expanding windows compute cumulative statistics: running totals, running averages, running maximums.
Together, these tools let you ask and answer temporal questions: When did vaccination rates peak? How fast are they growing? Is there a weekend effect? What's the 30-day trend? These are the questions that drive real-world decisions, and now you have the tools to answer them.
What's Next
In Chapter 12, you'll learn to load data from multiple file formats — CSV, Excel, JSON, and databases. Not all data comes in one neat file, and the ability to combine data from different sources is essential for building comprehensive analyses. If you're working on the vaccination project, you'll load population data from Excel and country metadata from JSON.
This chapter covered date parsing, time arithmetic, resampling, and rolling windows. The exercises that follow will give you practice with each of these techniques on realistic datasets. Time series analysis rewards careful attention to detail — make sure your dates are parsed correctly before computing anything.