Further Reading: Working with Text Data

You've built a solid foundation for working with text data in pandas. Whether you want to deepen your regex skills, explore natural language processing, or understand the theory behind pattern matching, here are resources organized by what caught your attention.


Tier 1: Verified Sources

These are published books and one freely available textbook draft, each with full bibliographic details.

Jeffrey E.F. Friedl, Mastering Regular Expressions (O'Reilly, 3rd edition, 2006). This is the definitive reference on regular expressions. Friedl explains not just the syntax but how regex engines work internally: why some patterns are fast and others are catastrophically slow, how backtracking works, and how to write efficient patterns. The book has dedicated chapters on Perl, Java, .NET, and PHP, and compares other flavors, including Python's, along the way. If the threshold concept of "regex as a mini-language" resonated with you and you want to truly master it, this is the book. The chapter on performance alone will save you hours of debugging.

Wes McKinney, Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter (O'Reilly, 3rd edition, 2022). Chapter 7 of McKinney's book covers string manipulation in pandas in detail, including the .str accessor methods, regex integration, and practical text cleaning workflows. Since McKinney created pandas, this is the authoritative source for understanding design decisions behind the string API. If you want more examples of .str.extract(), .str.findall(), and vectorized string operations, start here.

Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python (O'Reilly, 2009). Often called "the NLTK book," this text goes well beyond what we covered in this chapter, into tokenization, part-of-speech tagging, named entity recognition, and text classification. If the survey response categorization problem in Case Study 1 made you curious about more sophisticated text analysis, this book is your next step. A version updated for Python 3 and NLTK 3 is freely available online through the NLTK project website.

Al Sweigart, Automate the Boring Stuff with Python (No Starch Press, 2nd edition, 2019). Chapter 7 covers regular expressions with a practical, task-oriented approach: finding phone numbers, email addresses, and patterns in text files. Sweigart's style is beginner-friendly and example-driven. If you found this chapter's regex introduction helpful but want more practice with simple, concrete examples before tackling advanced patterns, this is an excellent next resource.

Daniel Jurafsky and James H. Martin, Speech and Language Processing (3rd edition draft, ongoing). This is the standard textbook for computational linguistics and natural language processing at the university level. The chapters on regular expressions, text normalization, and edit distance are directly relevant to what you learned in this chapter. The draft of the 3rd edition is freely available on the authors' Stanford website. It's more mathematically rigorous than the other recommendations here, but the early chapters are accessible to anyone with basic programming skills.


Tier 2: Attributed Resources

These are articles, talks, and online resources that are well-known in the data science and programming communities.

Regular Expressions 101 (regex101.com). An interactive online tool for building and testing regex patterns. You paste a pattern and test string, and it highlights matches, explains each part of the pattern in plain English, and shows capture group contents. It supports Python's regex flavor (and others). This is the single most useful tool for learning and debugging regex — bookmark it.

Python re module documentation (docs.python.org). The official documentation for re, the regular expression module in Python's standard library. It includes a thorough description of pattern syntax, all function signatures (search, match, findall, sub, compile), flag options (IGNORECASE, VERBOSE, MULTILINE), and examples. The "Regular Expression HOWTO" linked from the documentation is an excellent supplementary tutorial.
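As a quick orientation before reading the documentation, here is a minimal sketch of the functions and flags named above. The sample text and the phone-number pattern are invented for illustration; they do not come from the documentation or from this chapter's datasets.

```python
import re

# Hypothetical sample text, invented for this sketch.
text = "Call 555-1234 or 555-9876. Email: Ana@Example.com"

# re.VERBOSE lets a pattern span lines and carry its own comments.
phone = re.compile(
    r"""
    (\d{3})   # group 1: three-digit prefix
    -         # literal hyphen
    (\d{4})   # group 2: four-digit line number
    """,
    re.VERBOSE,
)

first = phone.search(text)         # first match anywhere in the string
all_pairs = phone.findall(text)    # one tuple of groups per match
at_start = phone.match(text)       # match() only tries position 0
masked = phone.sub("XXX-XXXX", text)

# re.IGNORECASE makes the lowercase pattern match "Example.com".
domain = re.search(r"@example\.com", text, re.IGNORECASE)

print(first.group(0))   # 555-1234
print(all_pairs)        # [('555', '1234'), ('555', '9876')]
print(at_start)         # None
print(masked)           # Call XXX-XXXX or XXX-XXXX. Email: Ana@Example.com
```

The difference between search and match is the detail most likely to surprise you: match returns None here because the text begins with "Call", not with digits.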

pandas string handling documentation (pandas.pydata.org). The "Working with text data" section of the pandas documentation covers every .str method with examples. It's particularly useful as a reference when you need a method you haven't used before — for instance, .str.pad(), .str.wrap(), or .str.get_dummies().
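To give a feel for two of the less common methods mentioned above, here is a small sketch of .str.pad() and .str.get_dummies(). The Series contents are hypothetical, made up for this example; consult the documentation for the full parameter lists.

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch.
codes = pd.Series(["7", "42", "305"])
tags = pd.Series(["fever|cough", "cough", "fatigue|fever"])

# Left-pad every string to width 5 using "0" as the fill character.
padded = codes.str.pad(5, side="left", fillchar="0")

# Split each string on "|" and build one 0/1 indicator column per tag.
dummies = tags.str.get_dummies(sep="|")

print(padded.tolist())         # ['00007', '00042', '00305']
print(list(dummies.columns))   # ['cough', 'fatigue', 'fever']
```

Each row of dummies marks which tags appeared in the corresponding string, which is a convenient way to turn a delimited free-text column into something you can count or filter on.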

Institute for Safe Medication Practices (ISMP), "List of Confused Drug Names." If Case Study 2 on medication name cleaning caught your attention, the ISMP publishes a regularly updated list of drug name pairs that are frequently confused. Search for "ISMP confused drug names" to find it. The list is widely used in healthcare settings to help prevent medication errors, and it illustrates why text standardization in healthcare is a matter of patient safety.


Recommended Next Steps

  • If you want more regex practice: Work through the exercises at regexone.com, which teaches regex interactively through increasingly complex challenges. Then try regex crossword puzzles at regexcrossword.com for a fun way to test your understanding.

  • If you want to explore NLP: Start with the NLTK book (Bird, Klein, and Loper) and experiment with tokenization, stemming, and simple text classification. The jump from regex-based text cleaning to NLP is a natural progression.

  • If you're interested in fuzzy matching: Explore the thefuzz library (formerly fuzzywuzzy), which uses edit distance to find similar strings. This handles the misspelling problem that regex can't solve — matching "metforman" to "metformin" even though no exact pattern would catch it.

  • If you're working with the vaccination project: Apply the text cleaning techniques from this chapter to any free-text columns in your dataset, then move on to Chapter 11 to tackle date and time data, or Chapter 12 to load additional data sources.

  • If you love the theory: Read about finite automata and formal language theory — the computer science foundations that explain why regex works and what its limitations are. Jurafsky and Martin's textbook covers this, as does any introductory theory of computation textbook.
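The fuzzy-matching suggestion above rests on edit distance, so a rough sketch of the idea may help before you install a library. The code below is a hand-written Levenshtein distance in plain Python, not the thefuzz API; the function names, the drug list, and the max_distance threshold are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def closest(name, candidates, max_distance=2):
    """Return the nearest candidate, or None if nothing is close enough.
    The threshold of 2 is an arbitrary choice for this sketch."""
    best = min(candidates, key=lambda c: edit_distance(name, c))
    return best if edit_distance(name, best) <= max_distance else None


# Hypothetical reference list, invented for this sketch.
known = ["metformin", "metoprolol", "methotrexate"]

print(edit_distance("metforman", "metformin"))   # 1
print(closest("metforman", known))               # metformin
print(closest("aspirin", known))                 # None
```

A real project would more likely call a library, which is faster and offers scoring options beyond raw distance, but the threshold question is the same either way: too strict and misspellings slip through, too loose and different drugs get merged.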