Further Reading: Data Cleaning and Preparation
Essential Textbooks
Data Wrangling and Cleaning
-
"Python for Data Analysis" by Wes McKinney (3rd Edition, 2022)
- Written by the creator of pandas
- Comprehensive coverage of data cleaning with pandas
- Best practices for data manipulation
- ISBN: 978-1098104030
- Essential for any Python data analyst
-
"Data Wrangling with Python" by Jacqueline Kazil and Katharine Jarmul
- Focus on practical data cleaning techniques
- Web scraping and data acquisition
- Real-world messy data examples
- ISBN: 978-1491948811
-
"Bad Data Handbook" edited by Q. Ethan McCallum
- Essays on handling problematic data
- Industry practitioners' perspectives
- Covers many edge cases
- ISBN: 978-1449321888
Missing Data
-
"Flexible Imputation of Missing Data" by Stef van Buuren (2nd Edition)
- Definitive resource on missing data handling
- Multiple imputation techniques
- Statistical theory and practical application
- ISBN: 978-1138588318
- Free online: https://stefvanbuuren.name/fimd/
-
"Statistical Analysis with Missing Data" by Little and Rubin
- Theoretical foundation for missing data
- MCAR, MAR, MNAR frameworks
- Academic but essential
- ISBN: 978-0471183860
Tidy Data and Data Design
-
"Tidy Data" by Hadley Wickham (2014)
- Foundational paper on data structure
- Journal of Statistical Software, 59(10)
- Free: https://www.jstatsoft.org/article/view/v059i10
- Defines principles used throughout data science
-
"The Grammar of Data Manipulation" (dplyr/pandas design)
- Concepts behind modern data manipulation
- Verb-based operations: filter, select, mutate, group
- R: https://dplyr.tidyverse.org/
- Python: pandas documentation
Data Quality and Validation
-
"Data Quality: The Accuracy Dimension" by Jack E. Olson
- Comprehensive data quality framework
- Assessment methodologies
- Enterprise data quality management
- ISBN: 978-1558608917
-
"Great Expectations" Documentation
- Python library for data validation
- Declarative data quality rules
- https://greatexpectations.io/
- Industry-standard tool for data pipelines
-
"Pandera" Documentation
- DataFrame validation for pandas
- Schema-based data validation
- https://pandera.readthedocs.io/
- Integrates well with pandas workflows
Sports Data Specific Resources
Books
-
"Analyzing Baseball Data with R" by Marchi, Albert, and Baumer
- While baseball-focused, excellent data cleaning examples
- Handling sports data quirks
- Reproducible analysis workflows
- ISBN: 978-0815353515
-
"Football Analytics with Python & R" by Eric A. Eager and Richard A. Erickson
- Direct application to football
- NFL data sources and cleaning
- Modern analytical techniques
- ISBN: 978-1492099611
Online Resources
-
CollegeFootballData.com API Documentation
- Official API for college football data
- Data dictionary and schema
- https://collegefootballdata.com/api/docs
- Python package: cfbd
-
nflfastR Documentation
- NFL play-by-play data processing
- Data cleaning best practices
- https://www.nflfastr.com/
- Python equivalent: nfl_data_py
-
Sports Reference Data Guide
- Understanding sports-reference.com data
- Glossary of statistics
- https://www.sports-reference.com/
Python Libraries for Data Cleaning
Core Libraries
-
pandas Documentation
- Official pandas documentation
- Data cleaning functions
- https://pandas.pydata.org/docs/
- Essential reference
-
NumPy Documentation
- Numerical operations
- Array manipulation
- https://numpy.org/doc/
Specialized Libraries
-
fuzzywuzzy / rapidfuzz
- Fuzzy string matching
- Name standardization
- https://github.com/maxbachmann/RapidFuzz
- For matching similar but not identical names
-
recordlinkage
- Entity resolution and deduplication
- Probabilistic matching
- https://recordlinkage.readthedocs.io/
- For advanced deduplication tasks
-
pyjanitor
- Pandas utilities for data cleaning
- Method chaining for cleaning pipelines
- https://pyjanitor-devs.github.io/pyjanitor/
- Convenient cleaning functions
-
dataprep
- Automatic data preparation
- Data profiling and cleaning
- https://dataprep.ai/
- Quick initial data exploration
Entity Resolution and Matching
-
"An Introduction to Entity Resolution" by Peter Christen
- Definitive textbook on entity matching
- Algorithms for record linkage
- ISBN: 978-3030495688
-
"Data Matching" by Peter Christen
- Practical entity resolution techniques
- Quality measures and evaluation
- ISBN: 978-3642311635
Data Pipeline Engineering
-
"Data Pipelines Pocket Reference" by James Densmore
- Modern data pipeline design
- ETL/ELT patterns
- ISBN: 978-1492087830
- Practical and concise
-
"Fundamentals of Data Engineering" by Reis and Housley
- Comprehensive data engineering coverage
- Pipeline architecture
- ISBN: 978-1098108304
- Modern best practices
-
Apache Airflow Documentation
- Workflow orchestration
- Pipeline scheduling and monitoring
- https://airflow.apache.org/
- Industry standard for pipelines
Version Control for Data
-
"DVC (Data Version Control)" Documentation
- Git for data and ML projects
- Track data changes
- https://dvc.org/
- Essential for reproducibility
-
"Git for Data Science" (Various Resources)
- Using Git with data projects
- .gitignore for data files
- Large file handling with Git LFS
Online Courses
-
DataCamp: "Cleaning Data in Python"
- Interactive Python course
- Hands-on exercises
- https://www.datacamp.com/
- Good for beginners
-
Coursera: "Introduction to Data Cleaning" (University of Michigan)
- Part of larger data science program
- Academic but practical
- https://www.coursera.org/
-
Kaggle: Data Cleaning Course
- Free, interactive tutorials
- Real dataset examples
- https://www.kaggle.com/learn/data-cleaning
Blogs and Articles
-
Towards Data Science (Medium)
- Data cleaning articles
- pandas tips and tricks
- https://towardsdatascience.com/
-
Real Python
- Python tutorials including pandas
- Best practices
- https://realpython.com/
-
PyData Conference Talks
- YouTube channel with presentations
- pandas and data cleaning talks
- https://www.youtube.com/c/PyDataTV
Practice Resources
Datasets for Practice
-
Kaggle Datasets
- Real-world messy data
- Many sports datasets
- https://www.kaggle.com/datasets
-
Data.gov
- Government open data
- Varied data quality
- https://data.gov/
-
FiveThirtyEight Data
- Curated datasets
- Sports focus
- https://github.com/fivethirtyeight/data
Challenges
-
Kaggle Competitions
- Data cleaning is often key to winning
- Learn from top solutions
- https://www.kaggle.com/competitions
Academic Papers
-
"Data Cleaning: Problems and Current Approaches" by Rahm and Do (2000)
- Classic survey paper
- Taxonomy of data quality problems
- IEEE Data Engineering Bulletin
-
"Data Preprocessing for Machine Learning" (Various)
- Connection between cleaning and ML performance
- Feature engineering techniques
- Search on arXiv or Google Scholar
Suggested Learning Path
Week 1-2: Foundations
- Read: McKinney's "Python for Data Analysis" Chapters 6-8
- Practice: Basic pandas cleaning operations
- Tool: Set up pandas and Jupyter environment
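As a warm-up for the practice item above, here is a minimal sketch of three basic cleaning operations: trimming whitespace, mapping missing-value tokens to None, and coercing a numeric column. It uses only the standard library so that it runs anywhere; in pandas the same ideas would typically go through read_csv with its na_values option and the string methods on a column. The column names, the list of missing tokens, and the sample rows are made up for illustration.

```python
import csv
import io

# Assumption: these spellings all mean "missing" in the hypothetical source file.
MISSING_TOKENS = {"", "na", "n/a", "null", "-"}

# Made-up sample data with stray whitespace, a thousands separator, and a missing value.
RAW = """team,attendance
 Texas ,"1,234"
Oregon,N/A
"""


def clean_cell(value):
    """Trim whitespace and map missing-value tokens to None."""
    text = value.strip()
    return None if text.lower() in MISSING_TOKENS else text


def to_int(value):
    """Convert a cleaned string such as '1,234' to an int, passing None through."""
    return None if value is None else int(value.replace(",", ""))


def load_clean(text):
    """Parse CSV text and return a list of cleaned row dictionaries."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        cleaned = {key.strip(): clean_cell(value) for key, value in record.items()}
        cleaned["attendance"] = to_int(cleaned["attendance"])
        rows.append(cleaned)
    return rows


rows = load_clean(RAW)
```

Keeping the missing-token list in one place makes it easy to extend when a new data source introduces another spelling.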
Week 3-4: Missing Data
- Read: Van Buuren's FIMD (online chapters 1-3)
- Practice: Imputation techniques on real data
- Tool: Explore missingno library for visualization
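For the imputation practice above, the simplest starting point is single-value imputation. The sketch below fills missing values with the observed median and records which values were filled, so the imputation stays visible in later analysis. It is a standard-library illustration, not the multiple-imputation approach that van Buuren's book develops, and the column name and sample values are hypothetical.

```python
from statistics import median


def impute_median(rows, column):
    """Return copies of rows with None in `column` replaced by the observed median.

    A boolean `<column>_was_missing` flag is added to every row so that
    imputed values can be identified afterwards.
    """
    observed = [row[column] for row in rows if row[column] is not None]
    if not observed:
        raise ValueError(f"no observed values in column {column!r}")
    fill = median(observed)

    result = []
    for row in rows:
        missing = row[column] is None
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[column] = fill if missing else row[column]
        new_row[f"{column}_was_missing"] = missing
        result.append(new_row)
    return result


# Made-up values for illustration only.
games = [
    {"team": "Alabama", "rush_yards": 210},
    {"team": "Georgia", "rush_yards": None},
    {"team": "Ohio State", "rush_yards": 150},
    {"team": "Michigan", "rush_yards": 180},
]
imputed = impute_median(games, "rush_yards")
```

Median imputation shrinks the variance of the column and ignores relationships with other variables, which is one reason the readings above treat it as a baseline rather than a recommendation.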
Week 5-6: Name Standardization
- Read: Christen's entity resolution overview
- Practice: Build team/player name standardizer
- Tool: Try rapidfuzz (or fuzzywuzzy, now maintained as thefuzz) for fuzzy matching
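One possible shape for the team name standardizer in the practice item above is sketched below. To stay dependency-free it uses difflib from the standard library as a stand-in for the fuzzy-matching libraries listed earlier; the canonical list, the alias table, and the 0.8 cutoff are all assumptions chosen for illustration.

```python
from difflib import get_close_matches

# Hypothetical canonical spellings and hand-maintained aliases.
CANONICAL_TEAMS = ["Ohio State", "Michigan State", "Mississippi State", "Miami (FL)"]
ALIASES = {"miami fl": "Miami (FL)", "miami florida": "Miami (FL)"}


def _normalize(name):
    """Lowercase, drop periods, and collapse runs of whitespace."""
    return " ".join(name.lower().replace(".", "").split())


def standardize_team(raw, canonical=CANONICAL_TEAMS, aliases=ALIASES, cutoff=0.8):
    """Map a raw team name to its canonical spelling, or None if nothing is close.

    Resolution order: explicit alias, exact match after normalization,
    then the single best fuzzy match at or above `cutoff`.
    """
    key = _normalize(raw)
    if key in aliases:
        return aliases[key]
    lookup = {_normalize(name): name for name in canonical}
    if key in lookup:
        return lookup[key]
    matches = get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


fixed = standardize_team("Mississipi State")
unknown = standardize_team("Stanford")
```

Returning None for unmatched names, rather than the closest guess, keeps bad matches out of the data and gives a list of names to review by hand and add to the alias table.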
Week 7-8: Integration and Validation
- Read: Great Expectations documentation
- Practice: Multi-source data merge
- Tool: Implement validation rules
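Before adopting a validation library, it can help to write a few rules by hand to see what "declarative data quality rules" means in practice. The sketch below is plain Python and does not use the Great Expectations or Pandera APIs; the rule names, field names, and the week range are hypothetical.

```python
# Each rule is a human-readable name mapped to a predicate over one row.
RULES = {
    "home_score is a non-negative integer": lambda r: isinstance(r["home_score"], int)
    and r["home_score"] >= 0,
    "week is between 1 and 20": lambda r: 1 <= r["week"] <= 20,  # assumed season length
    "teams differ": lambda r: r["home_team"] != r["away_team"],
}


def validate(rows, rules=RULES):
    """Return a list of (row_index, rule_name) pairs for every failed check.

    A missing field or a value of the wrong type counts as a failure
    instead of raising, so one bad row does not stop the whole run.
    """
    failures = []
    for index, row in enumerate(rows):
        for name, check in rules.items():
            try:
                passed = check(row)
            except (KeyError, TypeError):
                passed = False
            if not passed:
                failures.append((index, name))
    return failures


# Made-up rows: the second one breaks two rules.
sample = [
    {"home_team": "Texas", "away_team": "Oregon", "home_score": 31, "week": 3},
    {"home_team": "Texas", "away_team": "Texas", "home_score": -3, "week": 4},
]
problems = validate(sample)
```

Collecting all failures instead of stopping at the first one makes the output usable as a data quality report after a multi-source merge.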
Week 9-10: Pipelines
- Read: "Data Pipelines Pocket Reference"
- Practice: Build reusable cleaning pipeline
- Tool: Structure code as importable module
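A reusable pipeline can be as simple as a list of small functions applied in order, each taking rows and returning new rows. The sketch below shows one way to structure that so the steps can live in an importable module and be tested individually; the step names and sample data are hypothetical.

```python
from functools import reduce


def strip_strings(rows):
    """Trim leading and trailing whitespace from every string value."""
    return [
        {key: value.strip() if isinstance(value, str) else value for key, value in row.items()}
        for row in rows
    ]


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row; values must be hashable."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def make_pipeline(*steps):
    """Compose cleaning steps into one function that applies them left to right."""

    def run(rows):
        return reduce(lambda current, step: step(current), steps, rows)

    return run


# Order matters: whitespace is stripped first so near-duplicates become exact duplicates.
clean = make_pipeline(strip_strings, drop_exact_duplicates)

# Made-up rows for illustration.
cleaned = clean(
    [
        {"team": " Texas ", "wins": 12},
        {"team": "Texas", "wins": 12},
        {"team": "Oregon", "wins": 11},
    ]
)
```

Because every step has the same signature, new steps such as name standardization or validation can be added to the pipeline without changing the existing ones.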
Citation Format
When referencing data cleaning work:
McKinney, W. (2022). Python for Data Analysis (3rd ed.). O'Reilly Media.
Van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.).
CRC Press. https://stefvanbuuren.name/fimd/
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10),
1-23. https://doi.org/10.18637/jss.v059.i10