Chapter 25: Key Takeaways

DataField.Dev

Quick Reference Summary

Text Processing Fundamentals

Concept	Description	Application
Tokenization	Split text into words/phrases	Basic unit for analysis
Normalization	Standardize text format	Consistent processing
Stop Words	Common words to filter	Reduce noise
N-grams	Consecutive word sequences	Capture phrases
TF-IDF	Term weighting scheme	Feature extraction
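The first four rows of the table can be illustrated with a short standard-library sketch. The stop-word list and the helper names (`tokenize`, `remove_stop_words`, `ngrams`) are assumptions made for this example, not part of the chapter's code, and the sentence is a made-up sample.

```python
import re
from typing import List

# Illustrative stop-word list (an assumption); real projects use a larger one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "with", "has"}

def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens, keeping inner hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop common words that carry little signal."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The quarterback has elite arm strength and a quick release.")
content = remove_stop_words(tokens)
bigrams = ngrams(content, 2)
```

Here `bigrams` contains "arm strength", which a single-word tokenizer would split apart. Note that filtering stop words before building n-grams also joins words that were not adjacent in the original sentence ("strength quick"), so some pipelines build n-grams first and filter afterwards.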

Essential Code Patterns

Text Cleaning:

import re

def clean_football_text(text: str) -> str:
    # Lowercase
    text = text.lower()
    # Remove special chars (keep hyphens)
    text = re.sub(r'[^\w\s\-]', ' ', text)
    # Normalize whitespace
    text = ' '.join(text.split())
    return text

Stat Extraction:

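# Match a number followed by a stat unit, e.g. "324 yards" or "3 TDs"
stat_pattern = re.compile(
    r'(\d+\.?\d*)\s*(yards?|tds?|catches|completions?)',
    re.IGNORECASE
)
# `text` is the input string; findall returns (value, unit) string tuples
stats = stat_pattern.findall(text)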
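As a worked example of the pattern above, the sketch below converts the matched tuples into numeric totals. The `UNIT_KEYS` mapping, the `extract_stats` helper, and the sample sentence are hypothetical additions for illustration.

```python
import re
from typing import Dict

stat_pattern = re.compile(
    r'(\d+\.?\d*)\s*(yards?|tds?|catches|completions?)',
    re.IGNORECASE
)

# Hypothetical mapping from matched spellings to one canonical key per stat.
UNIT_KEYS = {
    "yard": "yards", "yards": "yards",
    "td": "tds", "tds": "tds",
    "catches": "catches",
    "completion": "completions", "completions": "completions",
}

def extract_stats(text: str) -> Dict[str, float]:
    """Sum every matched value under its canonical stat key."""
    totals: Dict[str, float] = {}
    for value, unit in stat_pattern.findall(text):
        key = UNIT_KEYS[unit.lower()]
        totals[key] = totals.get(key, 0.0) + float(value)
    return totals

# Made-up sample sentence.
sample = "He threw for 324 yards and 3 TDs on 25 completions."
stats = extract_stats(sample)
```

One limitation worth knowing: the pattern does not handle thousands separators, so "1,234 yards" would be read as 234 yards unless commas are stripped from numbers first.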

Simple Sentiment:

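def sentiment_score(text, positive_words, negative_words):
    # Strip surrounding punctuation so "elite," still matches "elite"
    words = [w.strip('.,;:!?"\'()') for w in text.lower().split()]
    pos = sum(1 for w in words if w in positive_words)
    neg = sum(1 for w in words if w in negative_words)
    # Share of positive hits; 0.5 (neutral) when no sentiment words appear
    return pos / (pos + neg) if (pos + neg) > 0 else 0.5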

Named Entity Recognition

Entity Type	Examples	Detection Method
Teams	Ohio State, Buckeyes	Alias mapping
Players	Kyle McCord	Name patterns
Positions	QB, WR, linebacker	Keyword lists
Stats	324 yards, 3 TDs	Regex patterns
Coaches	Ryan Day	Gazetteers
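The alias-mapping and keyword-list methods in the table can be sketched with plain dictionaries and word-boundary regexes. The alias table, keyword set, and `find_entities` helper below are assumptions for illustration; a production system would load much larger lists.

```python
import re
from typing import Dict, List

# Hypothetical alias table: each lowercase alias maps to a canonical team name.
TEAM_ALIASES: Dict[str, str] = {
    "ohio state": "Ohio State",
    "buckeyes": "Ohio State",
}

# Hypothetical position keyword list.
POSITION_KEYWORDS = {"qb", "wr", "linebacker"}

def find_entities(text: str) -> Dict[str, List[str]]:
    """Return canonical team names and position keywords mentioned in text."""
    lowered = text.lower()
    teams = {
        canonical
        for alias, canonical in TEAM_ALIASES.items()
        if re.search(r"\b" + re.escape(alias) + r"\b", lowered)
    }
    positions = {
        keyword
        for keyword in POSITION_KEYWORDS
        if re.search(r"\b" + re.escape(keyword) + r"\b", lowered)
    }
    return {"teams": sorted(teams), "positions": sorted(positions)}

entities = find_entities("Buckeyes QB Kyle McCord led Ohio State to a win.")
```

Because both aliases resolve to the same canonical name, the team is reported once even though it is mentioned twice. The word boundaries keep short keywords such as "wr" from matching inside longer words.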

Sentiment Analysis

Football-Specific Vocabulary:

Positive Terms	Negative Terms
elite, dominant	concerns, limited
explosive, athletic	struggles, raw
polished, smart	stiff, inconsistent
cannon arm, burst	lacks, weak

Attribute Lexicons:

ATTRIBUTES = {
    'arm_strength': {
        'positive': ['cannon', 'rocket', 'velocity', 'zip'],
        'negative': ['weak arm', 'lacks velocity', 'floats']
    },
    'athleticism': {
        'positive': ['explosive', 'dynamic', 'burst', 'agile'],
        'negative': ['stiff', 'plodding', 'slow', 'tight hips']
    }
}
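One possible way to turn the lexicon above into per-attribute scores is sketched below. The `attribute_scores` and `count_and_strip` helpers are hypothetical. Negative phrases are matched and removed first so that "lacks velocity" is not also counted as a positive hit for "velocity".

```python
import re
from typing import Dict, List, Tuple

ATTRIBUTES = {
    'arm_strength': {
        'positive': ['cannon', 'rocket', 'velocity', 'zip'],
        'negative': ['weak arm', 'lacks velocity', 'floats']
    },
    'athleticism': {
        'positive': ['explosive', 'dynamic', 'burst', 'agile'],
        'negative': ['stiff', 'plodding', 'slow', 'tight hips']
    }
}

def count_and_strip(text: str, terms: List[str]) -> Tuple[int, str]:
    """Count whole-word matches of each term and remove them from the text."""
    count = 0
    for term in sorted(terms, key=len, reverse=True):
        text, hits = re.subn(r"\b" + re.escape(term) + r"\b", " ", text)
        count += hits
    return count, text

def attribute_scores(text: str) -> Dict[str, float]:
    """Return a 0-1 score per attribute; 0.5 when no lexicon term appears."""
    scores = {}
    for attribute, terms in ATTRIBUTES.items():
        neg, remaining = count_and_strip(text.lower(), terms['negative'])
        pos, _ = count_and_strip(remaining, terms['positive'])
        scores[attribute] = pos / (pos + neg) if (pos + neg) > 0 else 0.5
    return scores

# Made-up sample report text.
scores = attribute_scores(
    "Explosive athlete with real burst. Lacks velocity on deep outs "
    "and the ball floats."
)
```

For this sample the sketch yields 0.0 for arm strength (two negative hits, no positive) and 1.0 for athleticism (two positive hits, no negative).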

Topic Modeling

LDA Configuration:

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_components=10,      # Number of topics
    max_iter=20,          # Training iterations
    random_state=42
)

Common Football Topics:
1. Game Analysis (yards, touchdowns, defense)
2. Recruiting (commits, prospects, visits)
3. NFL Draft (round, pick, projection)
4. Transfers (portal, destination, NIL)
5. Injuries (out, questionable, surgery)

Feature Engineering

Feature Type	Example	Use Case
TF-IDF	Word importance	Classification
Sentiment scores	0-1 scale	Grade prediction
Entity counts	3 team mentions	Context detection
N-gram presence	"arm strength"	Attribute extraction
Length features	Word count	Quality assessment
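The last three rows of the table (entity counts, n-gram presence, length features) can be computed without any libraries. The `handcrafted_features` helper and its default alias and phrase lists are assumptions for this sketch.

```python
import re
from typing import Dict, Sequence

def handcrafted_features(
    text: str,
    team_aliases: Sequence[str] = ("ohio state", "buckeyes"),  # hypothetical
    phrases: Sequence[str] = ("arm strength",),                # hypothetical
) -> Dict[str, float]:
    """Build a small dictionary of numeric features from one document."""
    lowered = text.lower()
    words = re.findall(r"[a-z0-9']+", lowered)

    features = {"word_count": float(len(words))}

    # Entity count: total whole-word mentions of any team alias.
    features["team_mentions"] = float(sum(
        len(re.findall(r"\b" + re.escape(alias) + r"\b", lowered))
        for alias in team_aliases
    ))

    # N-gram presence: 1.0 if the phrase occurs anywhere, else 0.0.
    for phrase in phrases:
        key = "has_" + phrase.replace(" ", "_")
        features[key] = 1.0 if phrase in lowered else 0.0
    return features

# Made-up sample text.
features = handcrafted_features(
    "Ohio State QB with rare arm strength. The Buckeyes trust him."
)
```

Features like these are usually concatenated with TF-IDF columns before training, so keeping every value as a float makes that step simpler.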

Model Building Pipeline

Raw Text → Clean  → Tokenize → Vectorize → Model → Predict
   ↓         ↓         ↓          ↓          ↓
Scouting → Remove → Split    → TF-IDF    → GBM   → Grade
Report     noise    words      features            prediction
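The pipeline above maps naturally onto a scikit-learn `Pipeline`. The sketch below is one possible arrangement, assuming `TfidfVectorizer` for the clean, tokenize, and vectorize stages and `GradientBoostingRegressor` as the GBM. The reports and grades are made-up toy data, far too few to train a useful model.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up scouting reports with made-up grades.
reports = [
    "Elite arm strength with a cannon and polished footwork",
    "Explosive athlete with burst and dominant production",
    "Smart and polished but limited athleticism",
    "Raw prospect who struggles with inconsistent mechanics",
    "Stiff in space and lacks velocity on deep throws",
    "Weak arm with concerns about durability",
]
grades = [92.0, 90.0, 80.0, 68.0, 62.0, 60.0]

model = Pipeline([
    # Lowercases, tokenizes, and weights unigrams and bigrams in one step.
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    # Gradient-boosted trees predict the numeric grade.
    ("gbm", GradientBoostingRegressor(n_estimators=50, random_state=42)),
])
model.fit(reports, grades)

predicted = model.predict(["Explosive athlete with a cannon arm"])
```

Wrapping both stages in one `Pipeline` means the vectorizer's vocabulary is learned only from training data, which avoids leaking test-set terms during cross-validation.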

Performance Benchmarks

Task	Typical Performance	Key Metrics
Sentiment Classification	80-85%	F1, Accuracy
Entity Extraction	85-90%	Precision, Recall
Grade Prediction	MAE: 4-5 points	MAE, R²
Topic Classification	75-85%	F1, Accuracy

Common Pitfalls

Issue	Problem	Solution
Case sensitivity	"TD" vs "td"	Lowercase text during normalization, before matching
Multi-word terms	"tight end" split	N-grams or custom tokenizer
Context loss	"not good" → positive	Use negation handling
Domain vocabulary	General sentiment fails	Custom lexicons
Sparse data	Rare terms ignored	Lower min_df threshold
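Negation handling, the fix suggested for context loss, can be sketched as a small extension of lexicon-based scoring: a sentiment word flips polarity when a negator appears shortly before it. The `NEGATORS` set, the two-word window, and the function name are assumptions for illustration.

```python
import re
from typing import Set

# Hypothetical negator list; extend as needed for the corpus.
NEGATORS = {"not", "no", "never", "isn't", "doesn't", "won't"}

def sentiment_with_negation(
    text: str,
    positive_words: Set[str],
    negative_words: Set[str],
    window: int = 2,
) -> float:
    """Score text on a 0-1 scale, flipping words preceded by a negator."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = neg = 0
    for i, word in enumerate(words):
        if word not in positive_words and word not in negative_words:
            continue
        is_positive = word in positive_words
        # Look back up to `window` tokens for a negator.
        if any(prev in NEGATORS for prev in words[max(0, i - window):i]):
            is_positive = not is_positive
        if is_positive:
            pos += 1
        else:
            neg += 1
    return pos / (pos + neg) if (pos + neg) > 0 else 0.5

positive = {"good", "elite"}
negative = {"stiff", "weak"}
score = sentiment_with_negation("His footwork is not good", positive, negative)
```

With this rule "not good" counts as a negative hit, so `score` is 0.0 rather than 1.0. A fixed window is a blunt heuristic: it will mishandle long-range negation and double negatives, which is one reason dependency parsing or transformer models are used when accuracy matters.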

Evaluation Metrics

Classification:
- Precision: True positives / Predicted positives
- Recall: True positives / Actual positives
- F1 Score: Harmonic mean of precision and recall

Regression:
- MAE: Average absolute error
- RMSE: Root mean squared error
- R²: Variance explained
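The definitions above translate directly into code. The sketch below computes them from scratch for clarity; the function names and the toy labels are made up for this example, and in practice `sklearn.metrics` provides equivalent functions.

```python
import math
from typing import Sequence, Tuple

def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def regression_metrics(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> Tuple[float, float, float]:
    """Compute MAE, RMSE, and R² for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot if ss_tot else 0.0
    return mae, rmse, r2

precision, recall, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
mae, rmse, r2 = regression_metrics([80, 85, 90], [82, 85, 86])
```

In the classification example there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all equal 2/3. In the regression example the errors are -2, 0, and 4, giving an MAE of 2.0 and an R² of 0.6.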

Applications Summary

Application	Input	Output
Draft Grade Prediction	Scouting report	Numeric grade
Player Comparison	Two reports	Similarity + differences
Media Monitoring	Article stream	Sentiment trends
Attribute Extraction	Report text	Skill scores
Topic Discovery	Article corpus	Topic distribution
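As an illustration of the Player Comparison row, the sketch below compares two reports using cosine similarity over raw word counts and lists the terms unique to each. This is a deliberately simple stand-in (an assumption, not the chapter's method); TF-IDF vectors or embeddings would be the usual upgrade. The helper names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Union

def bag_of_words(text: str) -> Counter:
    """Count lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def compare_reports(first: str, second: str) -> Dict[str, Union[float, List[str]]]:
    """Return a similarity score plus the terms unique to each report."""
    a, b = bag_of_words(first), bag_of_words(second)
    return {
        "similarity": cosine_similarity(a, b),
        "only_in_first": sorted(a.keys() - b.keys()),
        "only_in_second": sorted(b.keys() - a.keys()),
    }

# Made-up two-word reports to keep the arithmetic easy to follow.
result = compare_reports("explosive athlete", "stiff athlete")
```

The two sample reports share one of two words each, so the similarity is about 0.5, and the difference lists isolate "explosive" and "stiff" as the distinguishing terms.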

Quick Implementation Checklist

For Scouting Report Analysis:
- [ ] Clean and normalize text
- [ ] Extract entities (players, teams, positions)
- [ ] Calculate sentiment scores
- [ ] Extract attribute mentions
- [ ] Generate structured output

For Media Monitoring:
- [ ] Process incoming articles
- [ ] Extract entity mentions
- [ ] Calculate sentiment per entity
- [ ] Track over time
- [ ] Generate alerts
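The last three media-monitoring steps (sentiment per entity, tracking over time, alerts) can be sketched with a small in-memory tracker. The `SentimentTracker` class, its window size, and its alert threshold are hypothetical choices for illustration; the scores are made up and assumed to be on the 0-1 scale used earlier.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional

class SentimentTracker:
    """Keep per-entity sentiment scores and flag entities trending negative."""

    def __init__(self, window: int = 3, alert_threshold: float = 0.3) -> None:
        self.window = window                    # Number of recent scores to average
        self.alert_threshold = alert_threshold  # Alert when the average drops below this
        self.history: Dict[str, List[float]] = defaultdict(list)

    def add(self, entity: str, score: float) -> None:
        """Record one article's sentiment score for an entity."""
        self.history[entity].append(score)

    def rolling_average(self, entity: str) -> Optional[float]:
        """Mean of the most recent scores, or None for an unseen entity."""
        recent = self.history.get(entity, [])[-self.window:]
        return mean(recent) if recent else None

    def alerts(self) -> List[str]:
        """Entities whose rolling average is below the alert threshold."""
        flagged = []
        for entity in self.history:
            average = self.rolling_average(entity)
            if average is not None and average < self.alert_threshold:
                flagged.append(entity)
        return sorted(flagged)

tracker = SentimentTracker()
for score in (0.8, 0.2, 0.1, 0.2):   # Made-up scores from four articles
    tracker.add("Ohio State", score)
tracker.add("Kyle McCord", 0.9)
flagged = tracker.alerts()
```

Only "Ohio State" is flagged here, because its three most recent scores average below 0.3. A real deployment would persist the history and key it by timestamp rather than arrival order.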

Industry Tools

Tool	Use Case	Notes
NLTK	General NLP	Good for basics
spaCy	Production NER	Fast, accurate
scikit-learn	ML pipelines	TF-IDF, models
Hugging Face	Deep learning	BERT, transformers
TextBlob	Quick sentiment	Simple API