Further Reading: Data Cleaning and Preparation

Essential Textbooks

Data Wrangling and Cleaning

"Python for Data Analysis" by Wes McKinney (3rd Edition, 2022)
- Written by the creator of pandas
- Comprehensive coverage of data cleaning with pandas
- Best practices for data manipulation
- ISBN: 978-1098104030
- Essential for any Python data analyst
"Data Wrangling with Python" by Jacqueline Kazil and Katharine Jarmul
- Focus on practical data cleaning techniques
- Web scraping and data acquisition
- Real-world messy data examples
- ISBN: 978-1491948811
"Bad Data Handbook" edited by Q. Ethan McCallum
- Essays on handling problematic data
- Industry practitioners' perspectives
- Covers many edge cases
- ISBN: 978-1449321888

Missing Data

"Flexible Imputation of Missing Data" by Stef van Buuren (2nd Edition)
- Definitive resource on missing data handling
- Multiple imputation techniques
- Statistical theory and practical application
- ISBN: 978-1138588318
- Free online: https://stefvanbuuren.name/fimd/
"Statistical Analysis with Missing Data" by Little and Rubin
- Theoretical foundation for missing data
- MCAR, MAR, MNAR frameworks
- Academic but essential
- ISBN: 978-0471183860

Tidy Data and Data Design

"Tidy Data" by Hadley Wickham (2014)
- Foundational paper on data structure
- Journal of Statistical Software, 59(10)
- Free: https://www.jstatsoft.org/article/view/v059i10
- Defines principles used throughout data science
"The Grammar of Data Manipulation" (dplyr/pandas design)
- Concepts behind modern data manipulation
- Verb-based operations: filter, select, mutate, group
- R: https://dplyr.tidyverse.org/
- Python: pandas documentation

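To make the tidy-data principle concrete, here is a minimal sketch in plain Python that reshapes a wide table (one column per week) into a long one (one row per team-week observation). The team names, column names, and the `to_long` helper are hypothetical, chosen only for illustration. In pandas the same reshape is usually done with `DataFrame.melt`.

```python
# Hypothetical wide table: one row per team, one column per week's points.
wide_rows = [
    {"team": "Team A", "week_1": 24, "week_2": 31},
    {"team": "Team B", "week_1": 17, "week_2": 20},
]


def to_long(rows, id_key, value_name):
    """Reshape wide rows into tidy rows: one observation per row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_key:
                continue
            # Each non-identifier column becomes a (variable, value) pair.
            long_rows.append(
                {id_key: row[id_key], "variable": column, value_name: value}
            )
    return long_rows


tidy_rows = to_long(wide_rows, id_key="team", value_name="points")
for tidy_row in tidy_rows:
    print(tidy_row)
```

After the reshape, "week" is a value in a column, not part of a column name, so filtering and grouping by week are ordinary row operations.
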
Data Quality and Validation

"Data Quality: The Accuracy Dimension" by Jack E. Olson
- Comprehensive data quality framework
- Assessment methodologies
- Enterprise data quality management
- ISBN: 978-1558608917
"Great Expectations" Documentation
- Python library for data validation
- Declarative data quality rules
- https://greatexpectations.io/
- Industry-standard tool for data pipelines
"Pandera" Documentation
- DataFrame validation for pandas
- Schema-based data validation
- https://pandera.readthedocs.io/
- Integrates well with pandas workflows
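
The declarative, schema-based idea behind these validation tools can be illustrated without either library. The sketch below is plain Python and is not the Great Expectations or Pandera API. The schema layout, the column names, the week range of 1 to 20, and the `validate_rows` helper are all assumptions made for illustration.

```python
# Hypothetical schema: column name -> (expected type, predicate on the value).
# The 1-20 week range is an illustrative assumption, not a real league rule.
SCHEMA = {
    "team": (str, lambda v: len(v) > 0),
    "points": (int, lambda v: v >= 0),
    "week": (int, lambda v: 1 <= v <= 20),
}


def validate_rows(rows, schema):
    """Return a list of (row_index, column, problem) for every violation."""
    errors = []
    for i, row in enumerate(rows):
        for column, (expected_type, check) in schema.items():
            if column not in row:
                errors.append((i, column, "missing column"))
            elif not isinstance(row[column], expected_type):
                errors.append((i, column, "wrong type"))
            elif not check(row[column]):
                errors.append((i, column, "failed check"))
    return errors


rows = [
    {"team": "Team A", "points": 24, "week": 1},
    {"team": "Team B", "points": -3, "week": 2},    # negative score
    {"team": "Team C", "points": "17", "week": 3},  # score stored as text
]
errors = validate_rows(rows, SCHEMA)
for error in errors:
    print(error)
```

Keeping the rules in one data structure, separate from the code that applies them, is the design choice that both libraries build on: the rules can be reviewed, versioned, and reused across datasets.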

Sports Data Specific Resources

Books

"Analyzing Baseball Data with R" by Marchi, Albert, and Baumer
- While baseball-focused, excellent data cleaning examples
- Handling sports data quirks
- Reproducible analysis workflows
- ISBN: 978-0815353515
"Football Analytics with Python & R" by Eric A. Eager and Richard A. Erickson
- Direct application to football
- NFL data sources and cleaning
- Modern analytical techniques
- ISBN: 978-1492099611

Online Resources

CollegeFootballData.com API Documentation
- Official API for college football data
- Data dictionary and schema
- https://collegefootballdata.com/api/docs
- Python package: cfbd
nflfastR Documentation
- NFL play-by-play data processing
- Data cleaning best practices
- https://www.nflfastr.com/
- Python equivalent: nfl_data_py
Sports Reference Data Guide
- Understanding sports-reference.com data
- Glossary of statistics
- https://www.sports-reference.com/

Python Libraries for Data Cleaning

Core Libraries

pandas Documentation
- Official pandas documentation
- Data cleaning functions
- https://pandas.pydata.org/docs/
- Essential reference
NumPy Documentation
- Numerical operations
- Array manipulation
- https://numpy.org/doc/

Specialized Libraries

fuzzywuzzy (now maintained as thefuzz) / rapidfuzz
- Fuzzy string matching
- Name standardization
- https://github.com/maxbachmann/RapidFuzz
- For matching similar but not identical names
recordlinkage
- Entity resolution and deduplication
- Probabilistic matching
- https://recordlinkage.readthedocs.io/
- For advanced deduplication tasks
pyjanitor
- Pandas utilities for data cleaning
- Method chaining for cleaning pipelines
- https://pyjanitor-devs.github.io/pyjanitor/
- Convenient cleaning functions
dataprep
- Automatic data preparation
- Data profiling and cleaning
- https://dataprep.ai/
- Quick initial data exploration

Entity Resolution and Matching

"An Introduction to Entity Resolution" by Peter Christen
- Definitive textbook on entity matching
- Algorithms for record linkage
- ISBN: 978-3030495688
"Data Matching" by Peter Christen
- Practical entity resolution techniques
- Quality measures and evaluation
- ISBN: 978-3642311635

Data Pipeline Engineering

"Data Pipelines Pocket Reference" by James Densmore
- Modern data pipeline design
- ETL/ELT patterns
- ISBN: 978-1492087830
- Practical and concise
"Fundamentals of Data Engineering" by Reis and Housley
- Comprehensive data engineering coverage
- Pipeline architecture
- ISBN: 978-1098108304
- Modern best practices
Apache Airflow Documentation
- Workflow orchestration
- Pipeline scheduling and monitoring
- https://airflow.apache.org/
- Industry standard for pipelines

Version Control for Data

"DVC (Data Version Control)" Documentation
- Git for data and ML projects
- Track data changes
- https://dvc.org/
- Essential for reproducibility
"Git for Data Science" (Various Resources)
- Using git with data projects
- .gitignore for data files
- Large file handling with Git LFS

Online Courses

DataCamp: "Cleaning Data in Python"
- Interactive Python course
- Hands-on exercises
- https://www.datacamp.com/
- Good for beginners
Coursera: "Introduction to Data Cleaning" (University of Michigan)
- Part of larger data science program
- Academic but practical
- https://www.coursera.org/
Kaggle: Data Cleaning Course
- Free, interactive tutorials
- Real dataset examples
- https://www.kaggle.com/learn/data-cleaning

Blogs and Articles

Towards Data Science (Medium)
- Data cleaning articles
- pandas tips and tricks
- https://towardsdatascience.com/
Real Python
- Python tutorials including pandas
- Best practices
- https://realpython.com/
PyData Conference Talks
- YouTube channel with presentations
- pandas and data cleaning talks
- https://www.youtube.com/c/PyDataTV

Practice Resources

Datasets for Practice

Kaggle Datasets
- Real-world messy data
- Many sports datasets
- https://www.kaggle.com/datasets
Data.gov
- Government open data
- Varied data quality
- https://data.gov/
FiveThirtyEight Data
- Curated datasets
- Sports focus
- https://github.com/fivethirtyeight/data

Challenges

Kaggle Competitions
- Data cleaning is often key to winning
- Learn from top solutions
- https://www.kaggle.com/competitions

Academic Papers

"Data Cleaning: Problems and Current Approaches" by Rahm and Do (2000)
- Classic survey paper
- Taxonomy of data quality problems
- IEEE Data Engineering Bulletin, 23(4), 3-13
"Data Preprocessing for Machine Learning" (Various)
- Connection between cleaning and ML performance
- Feature engineering techniques
- Search on arXiv or Google Scholar

Suggested Learning Path

Weeks 1-2: Foundations

Read: McKinney's "Python for Data Analysis" Chapters 6-8
Practice: Basic pandas cleaning operations
Tool: Set up pandas and Jupyter environment

Weeks 3-4: Missing Data

Read: Van Buuren's FIMD (online chapters 1-3)
Practice: Imputation techniques on real data
Tool: Explore missingno library for visualization
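
As a starting point for the imputation practice, here is a minimal sketch of single imputation by group mean, written in plain Python. The player rows, the column names, and the `impute_group_mean` helper are hypothetical. This is a simple baseline only: van Buuren's book discusses why mean imputation understates variability and why multiple imputation is generally preferred.

```python
from statistics import mean

# Hypothetical player rows; None marks a missing value.
rows = [
    {"position": "QB", "weight": 220.0},
    {"position": "QB", "weight": None},
    {"position": "QB", "weight": 230.0},
    {"position": "WR", "weight": 195.0},
    {"position": "WR", "weight": None},
]


def impute_group_mean(rows, group_key, value_key):
    """Fill missing values with the mean of observed values in the same group."""
    observed = {}
    for row in rows:
        if row[value_key] is not None:
            observed.setdefault(row[group_key], []).append(row[value_key])
    group_means = {group: mean(values) for group, values in observed.items()}

    imputed = []
    for row in rows:
        new_row = dict(row)  # copy so the raw data stays untouched
        new_row["was_imputed"] = row[value_key] is None
        if row[value_key] is None:
            # Stays None if the group has no observed values at all.
            new_row[value_key] = group_means.get(row[group_key])
        imputed.append(new_row)
    return imputed


filled = impute_group_mean(rows, "position", "weight")
for filled_row in filled:
    print(filled_row)
```

The `was_imputed` flag keeps filled-in values distinguishable from observed ones, which makes it possible to check later how much a result depends on the imputation.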

Weeks 5-6: Name Standardization

Read: Christen's entity resolution overview
Practice: Build team/player name standardizer
Tool: Try fuzzywuzzy for fuzzy matching
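
A name standardizer can be prototyped before installing any fuzzy-matching package. The sketch below uses `difflib.get_close_matches` from the Python standard library as a stand-in for fuzzywuzzy or rapidfuzz, whose scoring differs. The canonical team list, the 0.8 cutoff, and the `standardize_team` helper are assumptions for illustration.

```python
from difflib import get_close_matches

# Hypothetical canonical names; a real project would load these from a
# reference table.
CANONICAL_TEAMS = ["Ohio State", "Oklahoma State", "Oregon State", "Michigan"]


def standardize_team(raw_name, canonical=CANONICAL_TEAMS, cutoff=0.8):
    """Map a raw name to its closest canonical name, or None if none is close."""
    cleaned = " ".join(raw_name.split())  # collapse repeated whitespace
    lookup = {name.lower(): name for name in canonical}
    matches = get_close_matches(cleaned.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


for raw in ["ohio  state", "Michgan", "Alabama"]:
    print(repr(raw), "->", standardize_team(raw))
```

Returning `None` instead of the best available guess is deliberate: similar canonical names (several here end in "State") make silent wrong matches a real risk, so unmatched names are better routed to manual review.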

Weeks 7-8: Integration and Validation

Read: Great Expectations documentation
Practice: Multi-source data merge
Tool: Implement validation rules

Weeks 9-10: Pipelines

Read: "Data Pipelines Pocket Reference"
Practice: Build reusable cleaning pipeline
Tool: Structure code as importable module
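
One possible shape for a reusable cleaning module is a set of small functions that each take and return a list of rows, plus a runner that applies them in order. The sketch below is plain Python. The step names, the `points` column, and the sample data are hypothetical.

```python
from functools import reduce


def strip_whitespace(rows):
    """Trim surrounding whitespace from every string value."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def parse_points(rows):
    """Convert the hypothetical 'points' column from text to int."""
    return [{**row, "points": int(row["points"])} for row in rows]


def run_pipeline(rows, steps):
    """Apply each cleaning step in order; each step maps rows to rows."""
    return reduce(lambda data, step: step(data), steps, rows)


CLEANING_STEPS = [strip_whitespace, drop_exact_duplicates, parse_points]

if __name__ == "__main__":
    raw = [
        {"team": " Team A ", "points": "24"},
        {"team": "Team A", "points": "24"},
        {"team": "Team B", "points": " 17"},
    ]
    print(run_pipeline(raw, CLEANING_STEPS))
```

Because the steps share one signature, they can be tested individually, reordered, or imported into a notebook without running the whole pipeline. Order matters here: whitespace is stripped first so that rows differing only in padding are recognized as duplicates.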

Citation Format

When citing these sources in data cleaning work, use a consistent format such as:

McKinney, W. (2022). Python for Data Analysis (3rd ed.). O'Reilly Media.

Van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.).
CRC Press. https://stefvanbuuren.name/fimd/

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10),
1-23. https://doi.org/10.18637/jss.v059.i10