Chapter 25: Key Takeaways

Quick Reference Summary

Text Processing Fundamentals

| Concept | Description | Application |
|---|---|---|
| Tokenization | Split text into words/phrases | Basic unit for analysis |
| Normalization | Standardize text format | Consistent processing |
| Stop Words | Common words to filter | Reduce noise |
| N-grams | Consecutive word sequences | Capture phrases |
| TF-IDF | Term weighting scheme | Feature extraction |
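The first four concepts above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production tokenizer: the `STOP_WORDS` set, the function names, and the sample sentence are assumptions introduced here.

```python
import re

# Illustrative stop-word list (an assumption); real projects use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "for", "in", "with"}

def tokenize(text: str) -> list[str]:
    """Normalize to lowercase and split into word tokens, keeping internal hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop common words that carry little signal."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The quarterback threw for 324 yards")
content = remove_stop_words(tokens)
bigrams = ngrams(content, 2)
```

Removing stop words before building n-grams joins words that were not adjacent in the original sentence (here `threw` and `324`), so the order of these two steps is a design choice worth checking against the task.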

Essential Code Patterns

Text Cleaning:

import re

def clean_football_text(text: str) -> str:
    # Lowercase
    text = text.lower()
    # Remove special chars (keep hyphens)
    text = re.sub(r'[^\w\s\-]', ' ', text)
    # Normalize whitespace
    text = ' '.join(text.split())
    return text

Stat Extraction:

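import re

stat_pattern = re.compile(
    r'(\d+\.?\d*)\s*(yards?|tds?|catches|completions?)',
    re.IGNORECASE
)
# Run on raw text: clean_football_text replaces decimal points with spaces
text = "McCord threw for 324 yards and 3 TDs"  # illustrative input
stats = stat_pattern.findall(text)  # list of (value, unit) tuples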

Simple Sentiment:

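def sentiment_score(text, positive_words, negative_words):
    """Return the share of matched sentiment words that are positive (0.5 if none match)."""
    # split() keeps punctuation attached, so "elite," will not match "elite"
    words = text.lower().split()
    pos = sum(1 for w in words if w in positive_words)
    neg = sum(1 for w in words if w in negative_words)
    return pos / (pos + neg) if (pos + neg) > 0 else 0.5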

Named Entity Recognition

| Entity Type | Examples | Detection Method |
|---|---|---|
| Teams | Ohio State, Buckeyes | Alias mapping |
| Players | Kyle McCord | Name patterns |
| Positions | QB, WR, linebacker | Keyword lists |
| Stats | 324 yards, 3 TDs | Regex patterns |
| Coaches | Ryan Day | Gazetteers |
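Alias mapping and keyword lists, the two simplest detection methods above, might look like the sketch below. The `TEAM_ALIASES` table, the `POSITIONS` set, and both function names are hypothetical and cover only a handful of entries.

```python
import re

# Hypothetical alias table: lowercase surface form -> canonical team name.
TEAM_ALIASES = {
    "ohio state": "Ohio State",
    "buckeyes": "Ohio State",
    "michigan": "Michigan",
    "wolverines": "Michigan",
}

# Hypothetical keyword list for positions.
POSITIONS = {"qb", "wr", "linebacker"}

def find_teams(text: str) -> list[str]:
    """Return canonical team names in order of first mention, without repeats."""
    lowered = text.lower()
    hits = []
    for alias, team in TEAM_ALIASES.items():
        match = re.search(r"\b" + re.escape(alias) + r"\b", lowered)
        if match:
            hits.append((match.start(), team))
    teams = []
    for _, team in sorted(hits):
        if team not in teams:
            teams.append(team)
    return teams

def find_positions(text: str) -> list[str]:
    """Return every position keyword that appears as a whole word."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w in POSITIONS]

sentence = "The Buckeyes host Michigan; Ohio State QB Kyle McCord starts"
teams = find_teams(sentence)
positions = find_positions(sentence)
```

Mapping every alias to one canonical name means "Buckeyes" and "Ohio State" count as the same entity in later steps.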

Sentiment Analysis

Football-Specific Vocabulary:

| Positive Terms | Negative Terms |
|---|---|
| elite, dominant | concerns, limited |
| explosive, athletic | struggles, raw |
| polished, smart | stiff, inconsistent |
| cannon arm, burst | lacks, weak |

Attribute Lexicons:

ATTRIBUTES = {
    'arm_strength': {
        'positive': ['cannon', 'rocket', 'velocity', 'zip'],
        'negative': ['weak arm', 'lacks velocity', 'floats']
    },
    'athleticism': {
        'positive': ['explosive', 'dynamic', 'burst', 'agile'],
        'negative': ['stiff', 'plodding', 'slow', 'tight hips']
    }
}
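One way to turn a lexicon like this into per-attribute scores is sketched below. The helper names and the sample report are assumptions. Negative phrases are matched first and blanked out, so that "lacks velocity" is not also counted as a positive hit on "velocity".

```python
import re

ATTRIBUTES = {
    'arm_strength': {
        'positive': ['cannon', 'rocket', 'velocity', 'zip'],
        'negative': ['weak arm', 'lacks velocity', 'floats']
    },
    'athleticism': {
        'positive': ['explosive', 'dynamic', 'burst', 'agile'],
        'negative': ['stiff', 'plodding', 'slow', 'tight hips']
    }
}

def count_terms(text: str, terms: list[str]) -> tuple[int, str]:
    """Count whole-word matches of each term and blank them out of the text."""
    total = 0
    for term in terms:
        text, n = re.subn(r"\b" + re.escape(term) + r"\b", " ", text)
        total += n
    return total, text

def attribute_scores(report: str) -> dict[str, float]:
    """Score each attribute from 0 to 1; 0.5 means no lexicon term was found."""
    scores = {}
    for attribute, lexicon in ATTRIBUTES.items():
        # Negatives first, so multi-word negatives consume their positive words.
        neg, remaining = count_terms(report.lower(), lexicon['negative'])
        pos, _ = count_terms(remaining, lexicon['positive'])
        scores[attribute] = pos / (pos + neg) if pos + neg else 0.5
    return scores

scores = attribute_scores(
    "Rocket arm with real zip, but stiff in space and lacks velocity on deep outs."
)
```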

Topic Modeling

LDA Configuration:

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_components=10,      # Number of topics
    max_iter=20,          # Training iterations
    random_state=42
)

Common Football Topics:

1. Game Analysis (yards, touchdowns, defense)
2. Recruiting (commits, prospects, visits)
3. NFL Draft (round, pick, projection)
4. Transfers (portal, destination, NIL)
5. Injuries (out, questionable, surgery)

Feature Engineering

| Feature Type | Example | Use Case |
|---|---|---|
| TF-IDF | Word importance | Classification |
| Sentiment scores | 0-1 scale | Grade prediction |
| Entity counts | 3 team mentions | Context detection |
| N-gram presence | "arm strength" | Attribute extraction |
| Length features | Word count | Quality assessment |
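Three of these feature types (length, sentiment, and n-gram presence) can be combined into one feature dictionary, as in the sketch below. The function name, argument names, and sample inputs are assumptions; TF-IDF and entity counts are left out to keep it short.

```python
import re

def extract_features(report, positive_words, negative_words, phrases):
    """Build a flat feature dict from one scouting report."""
    words = re.findall(r"[a-z0-9'-]+", report.lower())
    pos = sum(1 for w in words if w in positive_words)
    neg = sum(1 for w in words if w in negative_words)
    features = {
        "word_count": len(words),                                # length feature
        "sentiment": pos / (pos + neg) if pos + neg else 0.5,    # 0-1 scale
    }
    # Pad with spaces so phrases only match on whole-word boundaries.
    joined = " " + " ".join(words) + " "
    for phrase in phrases:
        key = "has_" + phrase.replace(" ", "_")
        features[key] = int(f" {phrase} " in joined)             # n-gram presence
    return features

features = extract_features(
    "Elite arm strength, but raw footwork.",
    positive_words={"elite"},
    negative_words={"raw"},
    phrases=["arm strength", "tight hips"],
)
```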

Model Building Pipeline

Raw Text → Clean  → Tokenize → Vectorize → Model → Predict
   ↓         ↓         ↓           ↓         ↓       ↓
Scouting → Remove → Split    → TF-IDF    → GBM   → Grade
Report     noise    words      features            prediction
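With scikit-learn, the stages in this diagram can be chained into a single estimator. The sketch below is one possible arrangement, not a tuned model: the reports and grades are invented toy data, and the hyperparameters are arbitrary. `TfidfVectorizer` handles lowercasing and tokenization internally, so those stages do not appear as separate steps.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy training data, made up for this example.
reports = [
    "Elite arm talent with explosive athleticism",
    "Polished passer, smart decision maker",
    "Raw prospect who struggles with accuracy",
    "Stiff in space and inconsistent under pressure",
    "Dominant athlete with rare burst",
    "Limited arm strength, concerns about durability",
]
grades = [92.0, 88.0, 70.0, 65.0, 90.0, 68.0]

model = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),  # clean, tokenize, vectorize
    GradientBoostingRegressor(n_estimators=50, random_state=42),  # GBM model
)
model.fit(reports, grades)

predicted = model.predict(["Explosive athlete with elite burst"])
```

Keeping the vectorizer inside the pipeline means it is fitted only on training data during cross-validation, which avoids leaking vocabulary statistics from held-out reports.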

Performance Benchmarks

| Task | Typical Performance | Key Metrics |
|---|---|---|
| Sentiment Classification | 80-85% | F1, Accuracy |
| Entity Extraction | 85-90% | Precision, Recall |
| Grade Prediction | MAE: 4-5 points | MAE, R² |
| Topic Classification | 75-85% | F1, Accuracy |

Common Pitfalls

| Issue | Problem | Solution |
|---|---|---|
| Case sensitivity | "TD" vs "td" | Normalize case (e.g., lowercase) before matching |
| Multi-word terms | "tight end" split | N-grams or custom tokenizer |
| Context loss | "not good" → positive | Use negation handling |
| Domain vocabulary | General sentiment fails | Custom lexicons |
| Sparse data | Rare terms ignored | Lower min_df threshold |
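A simple form of negation handling flips the polarity of sentiment words that follow a negator. The sketch below is a rough heuristic under stated assumptions: the `NEGATORS` set, the three-token window, and the rule that punctuation ends a negation are all choices made for this example.

```python
# Illustrative negator list (an assumption); contractions are caught separately.
NEGATORS = {"not", "no", "never", "lacks", "without"}

def sentiment_with_negation(text, positive_words, negative_words, window=3):
    """Word-count sentiment where a negator flips the next few sentiment words."""
    pos = neg = 0
    remaining = 0  # tokens still inside an active negation
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'")
        if word in NEGATORS or word.endswith("n't"):
            remaining = window
        else:
            polarity = (word in positive_words) - (word in negative_words)
            if remaining > 0:
                polarity = -polarity
                remaining -= 1
            if polarity > 0:
                pos += 1
            elif polarity < 0:
                neg += 1
        if raw[-1] in ".,;:!?":
            remaining = 0  # punctuation closes the negation scope
    return pos / (pos + neg) if pos + neg else 0.5

flipped = sentiment_with_negation("not good", {"good"}, {"bad"})
scoped = sentiment_with_negation(
    "Not stiff, but explosive", {"explosive"}, {"stiff"}
)
```

Here "not good" scores 0.0 instead of 1.0, and the comma after "stiff" keeps the negation from reaching "explosive".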

Evaluation Metrics

Classification:

- Precision: True positives / Predicted positives
- Recall: True positives / Actual positives
- F1 Score: Harmonic mean of precision and recall
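The classification metrics can be computed directly from their definitions. The function name and the sample labels below are illustrative; each metric is returned as 0.0 when its denominator is zero, which is a convention chosen for this sketch.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0   # TP / predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0      # TP / actual positives
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    return precision, recall, f1

# 2 true positives, 1 false positive, 1 false negative
precision, recall, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```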

Regression:

- MAE: Average absolute error
- RMSE: Root mean squared error
- R²: Variance explained
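The regression metrics follow the same pattern. The sketch below uses made-up grades; the function name is an assumption, and R² is undefined (division by zero) when every true value is identical, which this short version does not guard against.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R²) for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                       # share of variance explained
    return mae, rmse, r2

mae, rmse, r2 = regression_metrics([80, 70, 90], [78, 74, 90])
```

RMSE is never smaller than MAE because squaring weights large errors more heavily; a wide gap between the two suggests a few badly missed predictions.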

Applications Summary

| Application | Input | Output |
|---|---|---|
| Draft Grade Prediction | Scouting report | Numeric grade |
| Player Comparison | Two reports | Similarity + differences |
| Media Monitoring | Article stream | Sentiment trends |
| Attribute Extraction | Report text | Skill scores |
| Topic Discovery | Article corpus | Topic distribution |
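As one example of the input/output shapes above, player comparison can be sketched with Jaccard similarity over the vocabulary of two reports. This is a deliberately simple stand-in (cosine similarity over TF-IDF vectors is a common alternative); the function name and sample reports are assumptions.

```python
import re

def compare_reports(report_a: str, report_b: str) -> dict:
    """Return a similarity score plus the terms unique to each report."""
    terms_a = set(re.findall(r"[a-z]+", report_a.lower()))
    terms_b = set(re.findall(r"[a-z]+", report_b.lower()))
    union = terms_a | terms_b
    shared = terms_a & terms_b
    return {
        # Jaccard similarity: shared vocabulary / combined vocabulary
        "similarity": len(shared) / len(union) if union else 0.0,
        "only_a": sorted(terms_a - terms_b),
        "only_b": sorted(terms_b - terms_a),
    }

comparison = compare_reports(
    "explosive athlete strong arm",
    "polished passer strong arm",
)
```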

Quick Implementation Checklist

For Scouting Report Analysis:

- [ ] Clean and normalize text
- [ ] Extract entities (players, teams, positions)
- [ ] Calculate sentiment scores
- [ ] Extract attribute mentions
- [ ] Generate structured output

For Media Monitoring:

- [ ] Process incoming articles
- [ ] Extract entity mentions
- [ ] Calculate sentiment per entity
- [ ] Track over time
- [ ] Generate alerts
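The media monitoring steps can be strung together as in the sketch below. It is a simplification under stated assumptions: each article's overall sentiment is used as a proxy for sentiment toward every entity it mentions, entities are matched by plain substring, and the function names, alert threshold, dates, and article text are all invented for illustration.

```python
import re
from collections import defaultdict

def track_entity_sentiment(articles, entities, positive_words, negative_words):
    """articles: iterable of (iso_date, text). Returns {entity: [(date, score), ...]}."""
    trends = defaultdict(list)
    for date, text in sorted(articles):          # ISO dates sort chronologically
        lowered = text.lower()
        words = re.findall(r"[a-z']+", lowered)
        pos = sum(1 for w in words if w in positive_words)
        neg = sum(1 for w in words if w in negative_words)
        score = pos / (pos + neg) if pos + neg else 0.5
        for entity in entities:
            if entity.lower() in lowered:        # entity mention found
                trends[entity].append((date, score))
    return dict(trends)

def sentiment_alerts(trends, threshold=0.3):
    """Flag entities whose most recent score fell below the threshold."""
    return [e for e, series in trends.items() if series[-1][1] < threshold]

articles = [
    ("2024-09-08", "Ohio State struggles again; offense looked weak"),
    ("2024-09-01", "Ohio State looked dominant and explosive"),
]
trends = track_entity_sentiment(
    articles, ["Ohio State"], {"dominant", "explosive"}, {"struggles", "weak"}
)
alerts = sentiment_alerts(trends)
```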

Industry Tools

| Tool | Use Case | Notes |
|---|---|---|
| NLTK | General NLP | Good for basics |
| spaCy | Production NER | Fast, accurate |
| scikit-learn | ML pipelines | TF-IDF, models |
| Hugging Face | Deep learning | BERT, transformers |
| TextBlob | Quick sentiment | Simple API |