Further Reading: Data Cleaning and Preparation

Essential Textbooks

Data Wrangling and Cleaning

  1. "Python for Data Analysis" by Wes McKinney (3rd Edition, 2022)

    • Written by the creator of pandas
    • Comprehensive coverage of data cleaning with pandas
    • Best practices for data manipulation
    • ISBN: 978-1098104030
    • Essential for any Python data analyst

  2. "Data Wrangling with Python" by Jacqueline Kazil and Katharine Jarmul

    • Focus on practical data cleaning techniques
    • Web scraping and data acquisition
    • Real-world messy data examples
    • ISBN: 978-1491948811

  3. "Bad Data Handbook" edited by Q. Ethan McCallum

    • Essays on handling problematic data
    • Industry practitioners' perspectives
    • Covers many edge cases
    • ISBN: 978-1449321888

Missing Data

  1. "Flexible Imputation of Missing Data" by Stef van Buuren (2nd Edition)

    • Definitive resource on missing data handling
    • Multiple imputation techniques
    • Statistical theory and practical application
    • ISBN: 978-1138588318
    • Free online: https://stefvanbuuren.name/fimd/

  2. "Statistical Analysis with Missing Data" by Little and Rubin

    • Theoretical foundation for missing data
    • MCAR, MAR, MNAR frameworks
    • Academic but essential
    • ISBN: 978-0471183860


Tidy Data and Data Design

  1. "Tidy Data" by Hadley Wickham (2014)

    • Foundational paper on data structure
    • Journal of Statistical Software, 59(10)
    • Free: https://www.jstatsoft.org/article/view/v059i10
    • Defines principles used throughout data science

  2. "The Grammar of Data Manipulation" (dplyr/pandas design)

    • Concepts behind modern data manipulation
    • Verb-based operations: filter, select, mutate, group
    • R: https://dplyr.tidyverse.org/
    • Python: pandas documentation
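The verb-based operations named in the entry above (filter, select, mutate, group) map onto pandas methods roughly as in the sketch below. The `games` table and its column names are hypothetical, invented only to show one possible correspondence; other pandas idioms would work equally well.

```python
import pandas as pd

# Hypothetical game-level data, invented for illustration.
games = pd.DataFrame(
    {
        "team": ["Ohio State", "Ohio State", "Michigan", "Michigan"],
        "opponent": ["Iowa", "Purdue", "Iowa", "Purdue"],
        "points": [31, 24, 17, 28],
        "yards": [450, 380, 290, 410],
        "plays": [70, 65, 58, 68],
    }
)

summary = (
    games
    # filter: keep rows where the team scored at least 20 points
    .query("points >= 20")
    # select: keep only the columns needed downstream
    .loc[:, ["team", "points", "yards", "plays"]]
    # mutate: derive a new column from existing ones
    .assign(yards_per_play=lambda d: d["yards"] / d["plays"])
    # group + summarize: one row per team
    .groupby("team", as_index=False)
    .agg(games=("points", "count"), avg_ypp=("yards_per_play", "mean"))
)

print(summary)
```

Chaining the steps keeps each verb on its own line, which is close to how the same pipeline would read in dplyr.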


Data Quality and Validation

  1. "Data Quality: The Accuracy Dimension" by Jack E. Olson

    • Comprehensive data quality framework
    • Assessment methodologies
    • Enterprise data quality management
    • ISBN: 978-1558608917

  2. "Great Expectations" Documentation

    • Python library for data validation
    • Declarative data quality rules
    • https://greatexpectations.io/
    • Industry-standard tool for data pipelines

  3. "Pandera" Documentation

    • DataFrame validation for pandas
    • Schema-based data validation
    • https://pandera.readthedocs.io/
    • Integrates well with pandas workflows

Sports Data Specific Resources

Books

  1. "Analyzing Baseball Data with R" by Marchi, Albert, and Baumer

    • While baseball-focused, excellent data cleaning examples
    • Handling sports data quirks
    • Reproducible analysis workflows
    • ISBN: 978-0815353515
  2. "Football Analytics with Python & R" by Eric A. Eager and Richard A. Erickson

    • Direct application to football
    • NFL data sources and cleaning
    • Modern analytical techniques
    • ISBN: 978-1492099611

Online Resources

  1. CollegeFootballData.com API Documentation

    • Official API for college football data
    • Data dictionary and schema
    • https://collegefootballdata.com/api/docs
    • Python package: cfbd
  2. nflfastR Documentation

    • NFL play-by-play data processing
    • Data cleaning best practices
    • https://www.nflfastr.com/
    • Python equivalent: nfl_data_py
  3. Sports Reference Data Guide

    • Understanding sports-reference.com data
    • Glossary of statistics
    • https://www.sports-reference.com/

Python Libraries for Data Cleaning

Core Libraries

  1. pandas Documentation

    • Official pandas documentation
    • Data cleaning functions
    • https://pandas.pydata.org/docs/
    • Essential reference
  2. NumPy Documentation

    • Numerical operations
    • Array manipulation
    • https://numpy.org/doc/

Specialized Libraries

  1. fuzzywuzzy (now published as thefuzz) / rapidfuzz

    • Fuzzy string matching
    • Name standardization
    • https://github.com/maxbachmann/RapidFuzz
    • For matching similar but not identical names
  2. recordlinkage

    • Entity resolution and deduplication
    • Probabilistic matching
    • https://recordlinkage.readthedocs.io/
    • For advanced deduplication tasks
  3. pyjanitor

    • Pandas utilities for data cleaning
    • Method chaining for cleaning pipelines
    • https://pyjanitor-devs.github.io/pyjanitor/
    • Convenient cleaning functions
  4. dataprep

    • Automatic data preparation
    • Data profiling and cleaning
    • https://dataprep.ai/
    • Quick initial data exploration

Entity Resolution and Matching

  1. "An Introduction to Entity Resolution" by Peter Christen

    • Definitive textbook on entity matching
    • Algorithms for record linkage
    • ISBN: 978-3030495688
  2. "Data Matching" by Peter Christen

    • Practical entity resolution techniques
    • Quality measures and evaluation
    • ISBN: 978-3642311635

Data Pipeline Engineering

  1. "Data Pipelines Pocket Reference" by James Densmore

    • Modern data pipeline design
    • ETL/ELT patterns
    • ISBN: 978-1492087830
    • Practical and concise
  2. "Fundamentals of Data Engineering" by Reis and Housley

    • Comprehensive data engineering coverage
    • Pipeline architecture
    • ISBN: 978-1098108304
    • Modern best practices
  3. Apache Airflow Documentation

    • Workflow orchestration
    • Pipeline scheduling and monitoring
    • https://airflow.apache.org/
    • Industry standard for pipelines

Version Control for Data

  1. "DVC (Data Version Control)" Documentation

    • Git for data and ML projects
    • Track data changes
    • https://dvc.org/
    • Essential for reproducibility
  2. "Git for Data Science" (Various Resources)

    • Using git with data projects
    • .gitignore for data files
    • Large file handling with Git LFS

Online Courses

  1. DataCamp: "Cleaning Data in Python"

    • Interactive Python course
    • Hands-on exercises
    • https://www.datacamp.com/
    • Good for beginners
  2. Coursera: "Introduction to Data Cleaning" (University of Michigan)

    • Part of larger data science program
    • Academic but practical
    • https://www.coursera.org/
  3. Kaggle: Data Cleaning Course

    • Free, interactive tutorials
    • Real dataset examples
    • https://www.kaggle.com/learn/data-cleaning

Blogs and Articles

  1. Towards Data Science (Medium)

    • Data cleaning articles
    • pandas tips and tricks
    • https://towardsdatascience.com/
  2. Real Python

    • Python tutorials including pandas
    • Best practices
    • https://realpython.com/
  3. PyData Conference Talks

    • YouTube channel with presentations
    • pandas and data cleaning talks
    • https://www.youtube.com/c/PyDataTV

Practice Resources

Datasets for Practice

  1. Kaggle Datasets

    • Real-world messy data
    • Many sports datasets
    • https://www.kaggle.com/datasets
  2. Data.gov

    • Government open data
    • Varied data quality
    • https://data.gov/
  3. FiveThirtyEight Data

    • Curated datasets
    • Sports focus
    • https://github.com/fivethirtyeight/data

Challenges

  1. Kaggle Competitions
    • Data cleaning is often key to winning
    • Learn from top solutions
    • https://www.kaggle.com/competitions

Academic Papers

  1. "Data Cleaning: Problems and Current Approaches" by Rahm and Do (2000)

    • Classic survey paper
    • Taxonomy of data quality problems
    • IEEE Data Engineering Bulletin
  2. "Data Preprocessing for Machine Learning" (Various)

    • Connection between cleaning and ML performance
    • Feature engineering techniques
    • Search on arXiv or Google Scholar

Suggested Learning Path

Weeks 1-2: Foundations

  • Read: McKinney's "Python for Data Analysis" Chapters 6-8
  • Practice: Basic pandas cleaning operations
  • Tool: Set up pandas and Jupyter environment
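As a starting point for the practice item above, here is a minimal sketch of a few basic pandas cleaning operations: normalizing column names, trimming and re-casing text, parsing numbers out of strings, and dropping duplicates. The roster data, the column names, and the `height_to_inches` helper are all hypothetical.

```python
import pandas as pd

# Hypothetical raw roster, invented for illustration.
raw = pd.DataFrame(
    {
        "Player Name": ["  john smith", "JANE DOE ", "john smith", "Al Jones"],
        "Height": ["6-2", "5-11", "6-2", None],
        "Weight": ["210", "185 lbs", "210", "n/a"],
    }
)


def height_to_inches(value):
    """Convert a 'feet-inches' string such as '6-2' to total inches."""
    if pd.isna(value):
        return float("nan")
    feet, inches = value.split("-")
    return int(feet) * 12 + int(inches)


clean = (
    raw
    # Normalize column names: "Player Name" -> "player_name"
    .rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    .assign(
        # Trim whitespace and use consistent capitalization
        player_name=lambda d: d["player_name"].str.strip().str.title(),
        # Parse "6-2" into 74 inches; missing stays missing
        height_in=lambda d: d["height"].map(height_to_inches),
        # Pull the leading digits out of "185 lbs"; "n/a" becomes NaN
        weight_lb=lambda d: pd.to_numeric(
            d["weight"].str.extract(r"(\d+)", expand=False), errors="coerce"
        ),
    )
    .drop(columns=["height", "weight"])
    # Rows that became identical after cleaning are removed here
    .drop_duplicates()
    .reset_index(drop=True)
)

print(clean)
```

Note that the two "john smith" rows only become exact duplicates after the text is normalized, which is why deduplication comes last.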

Weeks 3-4: Missing Data

  • Read: Van Buuren's FIMD (online chapters 1-3)
  • Practice: Imputation techniques on real data
  • Tool: Explore missingno library for visualization
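For a first imputation exercise, one simple baseline is to fill each gap with the median of a related group and keep a flag recording which values were filled. The sketch below assumes a hypothetical `players` table. It is single imputation, not the multiple imputation that van Buuren's book develops, so it understates the uncertainty in the filled values and is best treated as a comparison point.

```python
import numpy as np
import pandas as pd

# Hypothetical player measurements, invented for illustration.
players = pd.DataFrame(
    {
        "position": ["QB", "QB", "QB", "OL", "OL", "OL"],
        "weight_lb": [215.0, np.nan, 225.0, 310.0, 300.0, np.nan],
    }
)

# Record which values were missing before filling anything in,
# so later analysis can tell observed values from imputed ones.
players["weight_was_missing"] = players["weight_lb"].isna()

# Median weight within each position, aligned row by row to the original frame.
group_median = players.groupby("position")["weight_lb"].transform("median")

# Fill only the gaps; observed values are left untouched.
players["weight_imputed"] = players["weight_lb"].fillna(group_median)

print(players)
```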

Weeks 5-6: Name Standardization

  • Read: Christen's entity resolution overview
  • Practice: Build team/player name standardizer
  • Tool: Try rapidfuzz (or fuzzywuzzy) for fuzzy matching
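A team-name standardizer can be prototyped before installing any fuzzy-matching library. The sketch below substitutes the standard library's `difflib` for fuzzywuzzy or rapidfuzz, so its similarity scores will not match theirs. The canonical names, the alias table, the `standardize_team` function, and the 0.8 cutoff are all assumptions chosen for illustration.

```python
from difflib import get_close_matches

# Hypothetical canonical names and known aliases, invented for illustration.
CANONICAL = ["Ohio State", "Michigan State", "Mississippi", "Miami (FL)"]
ALIASES = {"ole miss": "Mississippi", "osu": "Ohio State"}


def standardize_team(name, cutoff=0.8):
    """Return the canonical team name, or None when no match is close enough."""
    # Normalize case and collapse repeated whitespace before comparing.
    key = " ".join(name.lower().split())

    # Exact alias lookups handle nicknames that no string metric would catch.
    if key in ALIASES:
        return ALIASES[key]

    # Otherwise fall back to approximate matching against canonical names.
    lookup = {c.lower(): c for c in CANONICAL}
    matches = get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


for raw_name in ["Michigan St", "  OLE   miss ", "Ohio Sate", "Stanford"]:
    print(repr(raw_name), "->", standardize_team(raw_name))
```

Returning `None` for unmatched names, rather than the nearest guess, makes it easy to route those rows to manual review.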

Weeks 7-8: Integration and Validation

  • Read: Great Expectations documentation
  • Practice: Multi-source data merge
  • Tool: Implement validation rules
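The merge and validation practice items can be combined in one small exercise. The sketch below uses plain pandas rather than the Great Expectations API: `merge` checks key uniqueness and records where each row came from, and a hypothetical `check_rules` function reports rule violations. The two source tables and the rules themselves are invented for illustration.

```python
import pandas as pd

# Hypothetical extracts from two sources, invented for illustration.
scores = pd.DataFrame({"game_id": [1, 2, 3], "home_points": [31, 24, -7]})
attendance = pd.DataFrame(
    {"game_id": [1, 2, 4], "attendance": [102000, 65000, 48000]}
)

# validate= raises MergeError if game_id is duplicated on either side;
# indicator= adds a _merge column recording which source each row came from.
merged = scores.merge(
    attendance, on="game_id", how="outer", validate="one_to_one", indicator=True
)


def check_rules(df):
    """Return a list of human-readable rule violations (empty when clean)."""
    problems = []

    # Rule 1: every game should appear in both sources.
    unmatched = df.loc[df["_merge"] != "both", "game_id"].tolist()
    if unmatched:
        problems.append(f"game_id present in only one source: {unmatched}")

    # Rule 2: points can never be negative.
    negative = df.loc[df["home_points"] < 0, "game_id"].tolist()
    if negative:
        problems.append(f"negative home_points for game_id: {negative}")

    return problems


problems = check_rules(merged)
for problem in problems:
    print(problem)
```

Collecting violations in a list, instead of raising on the first one, lets a single run report every problem in the merged data.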

Weeks 9-10: Pipelines

  • Read: "Data Pipelines Pocket Reference"
  • Practice: Build reusable cleaning pipeline
  • Tool: Structure code as importable module
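One way to make a cleaning pipeline reusable is to write each step as a small function that takes and returns a DataFrame, then chain the steps with `DataFrame.pipe`. The sketch below could be saved as a module (for example a hypothetical `cleaning.py`) and imported elsewhere. The step names and the sample data are assumptions, not a prescribed design.

```python
import pandas as pd


def normalize_columns(df):
    """Lower-case column names and replace spaces with underscores."""
    return df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))


def strip_strings(df, columns):
    """Trim surrounding whitespace in the named text columns."""
    out = df.copy()  # never modify the caller's frame in place
    for col in columns:
        out[col] = out[col].str.strip()
    return out


def drop_exact_duplicates(df):
    """Remove fully identical rows and renumber the index."""
    return df.drop_duplicates().reset_index(drop=True)


def clean_teams(df):
    """Run the cleaning steps in a fixed, documented order."""
    return (
        df
        .pipe(normalize_columns)
        .pipe(strip_strings, columns=["team"])
        .pipe(drop_exact_duplicates)
    )


if __name__ == "__main__":
    # Hypothetical input, invented for illustration.
    raw = pd.DataFrame({"Team ": [" Iowa", "Iowa ", "Purdue"], "Wins": [8, 8, 6]})
    print(clean_teams(raw))
```

Because every step returns a new DataFrame, each one can be unit-tested on its own and reordered without side effects.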

Citation Format

When citing the key references from this list, use a consistent format such as the following:

McKinney, W. (2022). Python for Data Analysis (3rd ed.). O'Reilly Media.

Van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.).
CRC Press. https://stefvanbuuren.name/fimd/

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10),
1-23. https://doi.org/10.18637/jss.v059.i10