Chapter 13 Further Reading and Resources

Official Documentation

pandas User Guide — GroupBy https://pandas.pydata.org/docs/user_guide/groupby.html

The authoritative reference for .groupby(), .agg(), .transform(), and .apply(). The section on "Aggregation" includes the complete list of built-in aggregation functions and detailed examples of named aggregations. Read the "Transformation" section to understand when .transform() is the right choice over .agg().

pandas User Guide — Merging, Joining, Concatenating https://pandas.pydata.org/docs/user_guide/merging.html

Covers .merge(), .join(), and .concat() in depth with visual diagrams of each join type. Particularly useful for understanding how duplicate keys affect merge results and how to handle index-based joins.

pandas User Guide — Reshaping https://pandas.pydata.org/docs/user_guide/reshaping.html

Comprehensive coverage of .melt(), .pivot(), pd.pivot_table(), and pd.crosstab(). Includes worked examples of stacking and unstacking MultiIndex DataFrames.

pandas User Guide — Working with Text Data https://pandas.pydata.org/docs/user_guide/text.html

Complete reference for the .str accessor, including all available methods, regex integration, and performance notes on StringDtype vs. object dtype.

pandas User Guide — Time Series / Date Functionality https://pandas.pydata.org/docs/user_guide/timeseries.html

Covers pd.to_datetime(), the .dt accessor, DateOffset, Period, and resampling. The section on "Time/Date Components" lists every property available through .dt.

pandas User Guide — Windowing Operations https://pandas.pydata.org/docs/user_guide/window.html

Covers .rolling(), .expanding(), .ewm() (exponential weighted moving average), and online (streaming) statistics. Includes performance considerations and the min_periods parameter.


Books

Python for Data Analysis, 3rd Edition — Wes McKinney O'Reilly Media, 2022

Written by the creator of pandas. Chapters 10 (Data Aggregation and Group Operations) and 8 (Data Wrangling: Join, Combine, and Reshape) cover the material in this chapter at a deep technical level. Essential reading for anyone who wants to understand why pandas works the way it does, not just how to use it.

ISBN: 978-1098104030

Effective Pandas, 2nd Edition — Matt Harrison Self-published, 2023

A concise, opinionated guide to writing idiomatic pandas code. Matt Harrison covers the method-chaining style, named aggregations, and the reasoning behind using .transform() versus .agg(). Highly recommended for analysts who want to write cleaner, more Pythonic pandas code.

Python Data Science Handbook — Jake VanderPlas O'Reilly Media (also free online at https://jakevdp.github.io/PythonDataScienceHandbook/)

Chapter 3 covers pandas comprehensively, including aggregation, hierarchical indexing, and time series operations. The free online version is regularly updated.


Online Courses and Tutorials

Kaggle Learn — Pandas https://www.kaggle.com/learn/pandas

Six interactive lessons covering DataFrame operations, groupby, sorting, and data types. All exercises run in Kaggle notebooks with no local setup required. Particularly good for practicing groupby and merge.

Real Python — Pandas GroupBy: Your Guide to Grouping Data in Python https://realpython.com/pandas-groupby/

A thorough tutorial covering the full split-apply-combine workflow, with worked examples and clear explanations of the GroupBy object lifecycle.

Real Python — Pandas Merge, Join, Append, and Concatenate https://realpython.com/pandas-merge-join-and-concatenate/

Visual explanations of all join types with business-relevant examples.

Towards Data Science — Reshaping DataFrames: pivot, melt, stack, unstack https://towardsdatascience.com

Search for "pandas reshape pivot melt" to find several high-quality walkthroughs. The articles by Omer Ansari and Towards Data Science authors tend to include visual diagrams of the transformation.


SQL Parallels

If you have a SQL background, the following equivalences will help you map pandas operations to familiar concepts:

SQL pandas
SELECT col, SUM(revenue) FROM t GROUP BY col df.groupby("col")["revenue"].sum()
SELECT ... FROM t LEFT JOIN c ON t.id = c.id pd.merge(t, c, on="id", how="left")
UNION ALL pd.concat([df1, df2], ignore_index=True)
CASE WHEN revenue > 5000 THEN 'Large' ELSE 'Small' END df["revenue"].apply(lambda x: "Large" if x > 5000 else "Small")
PIVOT pd.pivot_table()
UNPIVOT .melt()
OVER (PARTITION BY region ORDER BY date ROWS 3 PRECEDING) .groupby("region")["revenue"].rolling(3)
SUM(...) OVER (ORDER BY date) .expanding().sum()

Advanced Topics to Explore Next

.resample() — Like .groupby() but specifically for time-indexed Series or DataFrames. Supports resampling to any frequency ("D" for daily, "W" for weekly, "ME" for month-end, "QE" for quarter-end). Covered in the pandas Time Series documentation.

pd.crosstab() — A convenience wrapper around pd.pivot_table() that automatically counts co-occurrences. Useful for frequency tables and contingency tables.

MultiIndex DataFrames — When you group by multiple columns or create pivot tables with multiple value columns, pandas creates hierarchical column indexes. Understanding how to index and slice MultiIndex objects unlocks more advanced data manipulation.

pd.cut() and pd.qcut() — Bin continuous variables into discrete categories. pd.cut() uses fixed bin edges; pd.qcut() uses quantiles to ensure equal-frequency bins.

Category dtype — Converting string columns with low cardinality (e.g., "region", "tier") to pd.CategoricalDtype can dramatically speed up groupby operations on large DataFrames.


Practice Datasets

The following public datasets are excellent for practicing the techniques in this chapter:

Superstore Sales Dataset — Available on Kaggle. Contains retail transaction data with region, category, sub-category, sales, profit, and discount columns. Perfect for groupby, pivot tables, and merge exercises.

NYC Yellow Taxi Trip Data — Available from the NYC Taxi & Limousine Commission. Large-scale time-series data with pickup/dropoff times, fare amounts, and trip distances. Excellent for date operations, rolling averages, and .resample().

Global COVID-19 Data (Johns Hopkins) — Available on GitHub. Wide-format time series data that must be melted into long format before analysis. A real-world use case for .melt().

Airbnb Listings Data (Inside Airbnb) — Multi-table dataset with listings, reviews, and calendar data. Requires multiple merge operations to combine into an analysis-ready DataFrame.