Chapter 25: Key Takeaways
Quick Reference Summary
Text Processing Fundamentals
| Concept | Description | Application |
|---|---|---|
| Tokenization | Split text into words/phrases | Basic unit for analysis |
| Normalization | Standardize text format | Consistent processing |
| Stop Words | Common words to filter | Reduce noise |
| N-grams | Consecutive word sequences | Capture phrases |
| TF-IDF | Term weighting scheme | Feature extraction |
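The concepts in the table can be illustrated with a minimal, standard-library sketch. The stop-word set, the sample sentences, and the helper names (`tokenize`, `remove_stop_words`, `ngrams`, `tf_idf`) are assumptions made for illustration. The weighting uses the plain `tf * log(N / df)` form, which differs from the smoothed variant scikit-learn applies by default.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; real projects usually use a larger one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "for", "with"}

def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping hyphenated words."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def remove_stop_words(tokens):
    """Drop common words that carry little signal."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n=2):
    """Return consecutive n-token sequences joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Made-up sample sentences.
raw_docs = [
    "The quarterback threw for 324 yards",
    "The defense allowed 324 yards in the loss",
]
docs = [remove_stop_words(tokenize(d)) for d in raw_docs]
bigrams = ngrams(docs[0])
weights = tf_idf(docs)
```

Terms that appear in every document (here `324` and `yards`) receive a weight of zero, while terms unique to one document, such as `quarterback`, receive a positive weight.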
Essential Code Patterns
Text Cleaning:
import re
def clean_football_text(text: str) -> str:
    # Lowercase
    text = text.lower()
    # Remove special chars (keep hyphens)
    text = re.sub(r'[^\w\s\-]', ' ', text)
    # Normalize whitespace
    text = ' '.join(text.split())
    return text
Stat Extraction:
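# Run on the raw text: clean_football_text() replaces "." with a space,
# which would split decimals such as "6.5 yards".
stat_pattern = re.compile(
    r'(\d+\.?\d*)\s*(yards?|tds?|catches|completions?)',
    re.IGNORECASE
)
# Each match is a (value, unit) tuple of strings, e.g. ('324', 'yards')
stats = stat_pattern.findall(text)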
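As a usage sketch, the pattern above can be wrapped in a small helper that converts values to floats and normalizes unit spellings. The helper name `extract_stats`, the `UNIT_ALIASES` mapping, and the sample sentence are assumptions for illustration.

```python
import re

stat_pattern = re.compile(
    r'(\d+\.?\d*)\s*(yards?|tds?|catches|completions?)',
    re.IGNORECASE
)

# Hypothetical mapping from matched spellings to one canonical unit name.
UNIT_ALIASES = {
    "yard": "yards", "yards": "yards",
    "td": "tds", "tds": "tds",
    "catches": "catches",
    "completion": "completions", "completions": "completions",
}

def extract_stats(text):
    """Return a list of (value, unit) pairs with float values and canonical units."""
    return [
        (float(value), UNIT_ALIASES[unit.lower()])
        for value, unit in stat_pattern.findall(text)
    ]

# Made-up sample sentence.
sample = "McCord threw for 324 yards and 3 TDs on 22 completions."
stats = extract_stats(sample)
# stats == [(324.0, 'yards'), (3.0, 'tds'), (22.0, 'completions')]
```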
Simple Sentiment:
def sentiment_score(text, positive_words, negative_words):
    words = text.lower().split()
    pos = sum(1 for w in words if w in positive_words)
    neg = sum(1 for w in words if w in negative_words)
    # Share of positive hits in [0, 1]; 0.5 (neutral) when no lexicon word appears
    return pos / (pos + neg) if (pos + neg) > 0 else 0.5
Named Entity Recognition
| Entity Type | Examples | Detection Method |
|---|---|---|
| Teams | Ohio State, Buckeyes | Alias mapping |
| Players | Kyle McCord | Name patterns |
| Positions | QB, WR, linebacker | Keyword lists |
| Stats | 324 yards, 3 TDs | Regex patterns |
| Coaches | Ryan Day | Gazetteers |
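Two of the detection methods in the table, alias mapping and keyword lists, can be sketched in a few lines. The alias table, the position list, and the function names `find_teams` and `find_positions` are assumptions for illustration; a production system would likely load these from a maintained gazetteer.

```python
import re

# Hypothetical alias table: each alias maps to one canonical team name.
TEAM_ALIASES = {
    "ohio state": "Ohio State",
    "buckeyes": "Ohio State",
    "michigan": "Michigan",
    "wolverines": "Michigan",
}

# Hypothetical keyword list for positions.
POSITIONS = {"qb", "wr", "rb", "te", "quarterback", "linebacker"}

def find_teams(text):
    """Return canonical team names in order of first mention, without duplicates."""
    lowered = text.lower()
    hits = []
    for alias, canonical in TEAM_ALIASES.items():
        match = re.search(r"\b" + re.escape(alias) + r"\b", lowered)
        if match:
            hits.append((match.start(), canonical))
    ordered = []
    for _, name in sorted(hits):
        if name not in ordered:
            ordered.append(name)
    return ordered

def find_positions(text):
    """Return position keywords found in the text, in order of appearance."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t in POSITIONS]

sentence = "The Buckeyes QB led Ohio State past Michigan"
teams = find_teams(sentence)          # ['Ohio State', 'Michigan']
positions = find_positions(sentence)  # ['qb']
```

Mapping both "Buckeyes" and "Ohio State" to one canonical name keeps later counts and sentiment scores from being split across aliases.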
Sentiment Analysis
Football-Specific Vocabulary:
| Positive Terms | Negative Terms |
|---|---|
| elite, dominant | concerns, limited |
| explosive, athletic | struggles, raw |
| polished, smart | stiff, inconsistent |
| cannon arm, burst | lacks, weak |
Attribute Lexicons:
ATTRIBUTES = {
    'arm_strength': {
        'positive': ['cannon', 'rocket', 'velocity', 'zip'],
        'negative': ['weak arm', 'lacks velocity', 'floats']
    },
    'athleticism': {
        'positive': ['explosive', 'dynamic', 'burst', 'agile'],
        'negative': ['stiff', 'plodding', 'slow', 'tight hips']
    }
}
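One way to turn the lexicon above into per-attribute scores is to count whole-word phrase matches and take the positive share. This is a sketch under assumptions: the function names `count_phrase` and `score_attributes` and the sample report are hypothetical, and overlapping phrases (for example `velocity` inside `lacks velocity`) are counted on both sides.

```python
import re

ATTRIBUTES = {
    'arm_strength': {
        'positive': ['cannon', 'rocket', 'velocity', 'zip'],
        'negative': ['weak arm', 'lacks velocity', 'floats']
    },
    'athleticism': {
        'positive': ['explosive', 'dynamic', 'burst', 'agile'],
        'negative': ['stiff', 'plodding', 'slow', 'tight hips']
    }
}

def count_phrase(text, phrase):
    """Count whole-word occurrences of a (possibly multi-word) phrase."""
    return len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))

def score_attributes(report):
    """Return {attribute: positive share in [0, 1]} for attributes that are mentioned."""
    text = report.lower()
    scores = {}
    for attribute, lexicon in ATTRIBUTES.items():
        pos = sum(count_phrase(text, p) for p in lexicon['positive'])
        neg = sum(count_phrase(text, p) for p in lexicon['negative'])
        if pos + neg > 0:
            scores[attribute] = pos / (pos + neg)
    return scores

# Made-up scouting note.
report = "Cannon arm with real zip, but tight hips and a stiff lower half."
scores = score_attributes(report)
# scores == {'arm_strength': 1.0, 'athleticism': 0.0}
```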
Topic Modeling
LDA Configuration:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(
    n_components=10,  # Number of topics
    max_iter=20,      # Maximum training iterations
    random_state=42
)
Common Football Topics:
1. Game Analysis (yards, touchdowns, defense)
2. Recruiting (commits, prospects, visits)
3. NFL Draft (round, pick, projection)
4. Transfers (portal, destination, NIL)
5. Injuries (out, questionable, surgery)
Feature Engineering
| Feature Type | Example | Use Case |
|---|---|---|
| TF-IDF | Word importance | Classification |
| Sentiment scores | 0-1 scale | Grade prediction |
| Entity counts | 3 team mentions | Context detection |
| N-gram presence | "arm strength" | Attribute extraction |
| Length features | Word count | Quality assessment |
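Several of the feature types in the table can be computed together and returned as one flat dictionary per report. The function name `extract_features`, its parameters, and the sample report are assumptions for illustration.

```python
import re

def extract_features(report, team_aliases, phrases):
    """Build a flat feature dict: length, entity counts, and n-gram presence flags."""
    text = report.lower()
    tokens = re.findall(r"[a-z0-9']+", text)
    features = {
        # Length feature
        "word_count": len(tokens),
        # Entity count: total whole-word mentions across all aliases
        "team_mentions": sum(
            len(re.findall(r"\b" + re.escape(alias.lower()) + r"\b", text))
            for alias in team_aliases
        ),
    }
    # N-gram presence: 1 if the phrase occurs, else 0
    for phrase in phrases:
        key = "has_" + phrase.replace(" ", "_")
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        features[key] = int(re.search(pattern, text) is not None)
    return features

# Made-up report text.
report = ("Ohio State QB shows elite arm strength. "
          "Buckeyes fans expect Ohio State to lean on him.")
features = extract_features(
    report,
    team_aliases=["Ohio State", "Buckeyes"],
    phrases=["arm strength", "tight hips"],
)
# features == {'word_count': 16, 'team_mentions': 3,
#              'has_arm_strength': 1, 'has_tight_hips': 0}
```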
Model Building Pipeline
Raw Text  →  Clean   →  Tokenize  →  Vectorize  →  Model  →  Predict
   ↓           ↓           ↓             ↓           ↓
Scouting  →  Remove  →  Split     →  TF-IDF     →  GBM    →  Grade
Report       noise      words        features                prediction
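The pipeline in the diagram can be approximated with a scikit-learn `Pipeline`, where `TfidfVectorizer` handles cleaning, tokenizing, and vectorizing, and `GradientBoostingRegressor` stands in for the GBM step. The reports and grades below are fabricated placeholders for illustration only, and the hyperparameters are arbitrary assumptions.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Placeholder training data; real scouting reports and grades would go here.
reports = [
    "Elite arm talent with explosive athleticism and polished footwork",
    "Dominant pass rusher with burst off the edge",
    "Stiff hips and inconsistent technique, limited range",
    "Raw prospect who struggles against press coverage",
    "Smart, polished route runner with reliable hands",
    "Lacks velocity and floats throws under pressure",
]
grades = [92.0, 90.0, 68.0, 65.0, 85.0, 62.0]

model = Pipeline([
    # Lowercases, tokenizes, and builds unigram + bigram TF-IDF features
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    # Gradient-boosted trees regress the numeric grade from those features
    ("gbm", GradientBoostingRegressor(n_estimators=50, random_state=42)),
])
model.fit(reports, grades)

prediction = model.predict(["Explosive athlete with a cannon arm"])
print(f"Predicted grade: {prediction[0]:.1f}")
```

Keeping the vectorizer inside the pipeline means the same vocabulary and weighting learned during training are applied at prediction time.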
Performance Benchmarks
| Task | Typical Performance | Key Metrics |
|---|---|---|
| Sentiment Classification | 80-85% | F1, Accuracy |
| Entity Extraction | 85-90% | Precision, Recall |
| Grade Prediction | MAE: 4-5 points | MAE, R² |
| Topic Classification | 75-85% | F1, Accuracy |
Common Pitfalls
| Issue | Problem | Solution |
|---|---|---|
| Case sensitivity | "TD" vs "td" | Normalize abbreviations before lowercasing |
| Multi-word terms | "tight end" split | N-grams or custom tokenizer |
| Context loss | "not good" scored as positive | Use negation handling |
| Domain vocabulary | General sentiment fails | Custom lexicons |
| Sparse data | Rare terms ignored | Lower min_df threshold |
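The negation-handling fix from the table can be sketched as a small extension of lexicon scoring: a sentiment word flips polarity when a negator appears within a short window before it. The negator list, the window size, and the function name `sentiment_with_negation` are assumptions; this is a simple heuristic, not a full treatment of negation scope.

```python
import re

# Hypothetical negator list.
NEGATORS = {"not", "no", "never", "isn't", "doesn't", "won't", "can't"}

def sentiment_with_negation(text, positive_words, negative_words, window=2):
    """Positive share in [0, 1]; a negator within `window` tokens before a hit flips it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = neg = 0
    for i, token in enumerate(tokens):
        if token not in positive_words and token not in negative_words:
            continue
        negated = any(t in NEGATORS for t in tokens[max(0, i - window):i])
        # A positive word stays positive unless negated; a negative word
        # becomes positive only when negated.
        if (token in positive_words) != negated:
            pos += 1
        else:
            neg += 1
    return pos / (pos + neg) if (pos + neg) > 0 else 0.5

positive = {"good", "elite"}
negative = {"weak", "stiff"}

plain = sentiment_with_negation("good", positive, negative)                      # 1.0
negated = sentiment_with_negation("not good", positive, negative)                # 0.0
mixed = sentiment_with_negation("elite arm, never stiff", positive, negative)    # 1.0
```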
Evaluation Metrics
Classification:
- Precision: True positives / Predicted positives
- Recall: True positives / Actual positives
- F1 Score: Harmonic mean of precision and recall

Regression:
- MAE: Average absolute error
- RMSE: Root mean squared error
- R²: Variance explained
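The metric definitions above translate directly into code. The sketch below computes them with the standard library only; the function names and the small label and grade lists are made up for illustration.

```python
import math

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R² for numeric predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return mae, rmse, r2

# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
precision, recall, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])

# Made-up grades: errors of 2, -4, and 0 points.
mae, rmse, r2 = regression_metrics([80, 70, 90], [78, 74, 90])
```

For the sample data, precision, recall, and F1 all equal 2/3, MAE is 2.0, and R² is 0.9.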
Applications Summary
| Application | Input | Output |
|---|---|---|
| Draft Grade Prediction | Scouting report | Numeric grade |
| Player Comparison | Two reports | Similarity + differences |
| Media Monitoring | Article stream | Sentiment trends |
| Attribute Extraction | Report text | Skill scores |
| Topic Discovery | Article corpus | Topic distribution |
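As one example from the table, player comparison can be sketched with bag-of-words cosine similarity plus set differences. This is a minimal illustration under assumptions: the function names are hypothetical, and raw word counts are used where a fuller version might use TF-IDF weights and stop-word removal.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def compare_reports(report_a, report_b):
    """Return a similarity score plus the terms unique to each report."""
    a, b = bag_of_words(report_a), bag_of_words(report_b)
    return {
        "similarity": cosine_similarity(a, b),
        "only_a": sorted(a.keys() - b.keys()),
        "only_b": sorted(b.keys() - a.keys()),
    }

# Made-up report snippets.
result = compare_reports(
    "Explosive athlete with a cannon arm",
    "Explosive athlete with stiff hips",
)
# result["only_a"] == ['a', 'arm', 'cannon']; result["only_b"] == ['hips', 'stiff']
```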
Quick Implementation Checklist
For Scouting Report Analysis:
- [ ] Clean and normalize text
- [ ] Extract entities (players, teams, positions)
- [ ] Calculate sentiment scores
- [ ] Extract attribute mentions
- [ ] Generate structured output

For Media Monitoring:
- [ ] Process incoming articles
- [ ] Extract entity mentions
- [ ] Calculate sentiment per entity
- [ ] Track over time
- [ ] Generate alerts
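The media-monitoring steps can be sketched end to end with the standard library. Everything here is an assumption for illustration: the function name `monitor`, the alert threshold, the word lists, and the dated articles are made up, and each article's overall sentiment is attributed to every entity it mentions, which is a coarse approximation.

```python
from collections import defaultdict

def monitor(articles, entities, positive_words, negative_words, alert_below=0.4):
    """articles: iterable of (date, text). Returns (trends, alerts).

    trends maps (entity, date) to the mean sentiment of articles mentioning it;
    alerts lists the (entity, date) keys whose mean falls below the threshold.
    """
    daily = defaultdict(list)
    for date, text in articles:
        lowered = text.lower()
        words = [w.strip(".,;:!?") for w in lowered.split()]
        pos = sum(1 for w in words if w in positive_words)
        neg = sum(1 for w in words if w in negative_words)
        if pos + neg == 0:
            continue  # No sentiment signal in this article
        score = pos / (pos + neg)
        for entity in entities:
            if entity.lower() in lowered:
                daily[(entity, date)].append(score)
    trends = {key: sum(v) / len(v) for key, v in daily.items()}
    alerts = sorted(key for key, avg in trends.items() if avg < alert_below)
    return trends, alerts

# Made-up article stream.
articles = [
    ("2024-09-01", "Ohio State looked dominant and explosive."),
    ("2024-09-08", "Ohio State struggles again; offense inconsistent."),
    ("2024-09-08", "Michigan looked elite."),
]
trends, alerts = monitor(
    articles,
    entities=["Ohio State", "Michigan"],
    positive_words={"dominant", "explosive", "elite"},
    negative_words={"struggles", "inconsistent"},
)
# alerts == [('Ohio State', '2024-09-08')]
```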
Industry Tools
| Tool | Use Case | Notes |
|---|---|---|
| NLTK | General NLP | Good for basics |
| spaCy | Production NER | Fast, accurate |
| scikit-learn | ML pipelines | TF-IDF, models |
| Hugging Face | Deep learning | BERT, transformers |
| TextBlob | Quick sentiment | Simple API |