Chapter 25 Exercises: Natural Language Processing for Scouting
Exercise Set Overview
These exercises progress from basic text processing to advanced NLP applications in football scouting and media analysis.
Level 1: Foundational Exercises
Exercise 25.1: Text Preprocessing Pipeline
Objective: Build a text cleaning pipeline for football text data.
Task: Implement a comprehensive text preprocessor for scouting reports.
```python
import re
from typing import List, Dict


class FootballTextPreprocessor:
    """Preprocess football-related text data."""

    def __init__(self):
        self.stat_patterns = []
        self.football_terms = {}

    def clean_text(self, text: str) -> str:
        """
        Clean raw text:
        - Convert to lowercase
        - Remove special characters (keep hyphens, periods)
        - Normalize whitespace
        - Handle football-specific abbreviations
        """
        # Your code here
        pass

    def extract_numbers_with_context(self, text: str) -> List[Dict]:
        """
        Extract numbers with surrounding context.
        Return: [{'number': 42, 'context': 'rushed for 42 yards', 'type': 'rushing'}]
        """
        # Your code here
        pass

    def tokenize(self, text: str, remove_stopwords: bool = True) -> List[str]:
        """
        Tokenize text with football-aware processing.
        Handle multi-word terms like 'offensive line', 'running back'.
        """
        # Your code here
        pass


# Test cases
sample_texts = [
    "QB threw for 324 yards and 3 TDs against Cover-2 defense",
    "The RB averaged 5.2 YPC on 18 carries in the 4th quarter",
    "WR ran a 4.42 40-yard dash at the NFL Combine"
]
```
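Hint: if you are unsure where to start with `clean_text`, the sketch below (which reuses `sample_texts` from above) shows one possible shape. The abbreviation map is illustrative, not a required vocabulary.

```python
import re

# Illustrative abbreviation map, not a required vocabulary.
ABBREVIATIONS = {"td": "touchdown", "tds": "touchdowns", "int": "interception"}


def clean_text_sketch(text: str) -> str:
    """One possible cleaning pass: lowercase, strip stray symbols, expand abbreviations."""
    text = text.lower()
    # Keep letters, digits, periods, and hyphens; replace everything else with a space.
    text = re.sub(r"[^a-z0-9.\-\s]", " ", text)
    # Expand known abbreviations on word boundaries.
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{abbr}\b", full, text)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text_sketch(sample_texts[0]))
# qb threw for 324 yards and 3 touchdowns against cover-2 defense
```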
Exercise 25.2: Named Entity Recognition
Objective: Extract football entities from text.
Task: Build a basic NER system for football content.
```python
import re
from typing import Dict, List


class FootballNER:
    """Named Entity Recognition for football text."""

    def __init__(self):
        self.team_patterns = self._build_team_patterns()
        # Heuristic: a capitalized first name (or initials such as "J.J.")
        # followed by a capitalized surname, allowing internal capitals ("McCord").
        self.player_pattern = re.compile(
            r"\b([A-Z](?:\.[A-Z]\.|[a-z]+)\s+[A-Z][a-zA-Z'\-]+)\b"
        )

    def _build_team_patterns(self) -> Dict[str, str]:
        """Build pattern-to-canonical team name mapping."""
        return {
            'ohio state': 'Ohio State',
            'buckeyes': 'Ohio State',
            'osu': 'Ohio State',
            # Add more teams...
        }

    def extract_entities(self, text: str) -> Dict[str, List]:
        """
        Extract all entities from text.
        Returns:
            {
                'teams': ['Ohio State', 'Michigan'],
                'players': ['Kyle McCord', 'J.J. McCarthy'],
                'positions': ['QB', 'WR'],
                'stats': [{'value': 324, 'type': 'yards'}]
            }
        """
        # Your code here
        pass

    def link_entities(self, entities: Dict, context: Dict) -> Dict:
        """
        Link extracted entities to database records.
        Resolve ambiguous player names using team context.
        """
        # Your code here
        pass
```
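Hint: a minimal `extract_entities` can be built directly on the two patterns defined in `__init__`. The sketch below (which assumes the `FootballNER` class above) handles only teams and players and leaves positions and stats to you.

```python
import re
from typing import Dict, List


def extract_entities_sketch(ner: FootballNER, text: str) -> Dict[str, List]:
    """Minimal extraction: canonical team names plus heuristic player names."""
    lowered = text.lower()
    teams = sorted({
        canonical for alias, canonical in ner.team_patterns.items()
        if re.search(rf"\b{re.escape(alias)}\b", lowered)
    })
    players = ner.player_pattern.findall(text)
    return {'teams': teams, 'players': players, 'positions': [], 'stats': []}


print(extract_entities_sketch(FootballNER(), "Kyle McCord led the Buckeyes past Michigan"))
# {'teams': ['Ohio State'], 'players': ['Kyle McCord'], 'positions': [], 'stats': []}
```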
Exercise 25.3: Stat Extraction
Objective: Extract and normalize statistical mentions from text.
Task: Parse complex stat strings into structured data.
```python
from typing import Dict


def extract_game_stats(text: str) -> Dict:
    """
    Extract game statistics from text.

    Example input:
        "McCord went 21-of-28 for 286 yards with 2 TDs and 0 INTs.
        Henderson rushed 15 times for 98 yards and a touchdown."

    Expected output:
        {
            'passing': {
                'completions': 21,
                'attempts': 28,
                'yards': 286,
                'touchdowns': 2,
                'interceptions': 0
            },
            'rushing': {
                'carries': 15,
                'yards': 98,
                'touchdowns': 1
            }
        }
    """
    # Your code here
    pass


# Test with various stat formats
test_texts = [
    "Threw for 324 yards on 28 attempts",
    "Finished 18/24, 215 yards, 2 TD, 1 INT",
    "Ran for 125 yards on 22 carries (5.7 YPC)"
]
```
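Hint: one way in, before you handle every format, is a small set of regexes keyed by stat type. The sketch below covers only the "X-of-Y" and "X/Y" completion formats plus a yardage figure; it is a starting point, not a full solution.

```python
import re
from typing import Dict, Optional

# Matches completion/attempt pairs written as "21-of-28" or "18/24".
COMP_ATT = re.compile(r"(\d+)\s*(?:-of-|/)\s*(\d+)")


def extract_passing_line_sketch(text: str) -> Optional[Dict]:
    """Pull completions/attempts and yards when the text looks like a passing line."""
    match = COMP_ATT.search(text)
    if not match:
        return None
    result = {'completions': int(match.group(1)), 'attempts': int(match.group(2))}
    yards = re.search(r"(\d+)\s+yards", text, re.IGNORECASE)
    if yards:
        result['yards'] = int(yards.group(1))
    return result


print(extract_passing_line_sketch("Finished 18/24, 215 yards, 2 TD, 1 INT"))
# {'completions': 18, 'attempts': 24, 'yards': 215}
```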
Exercise 25.4: Position Normalization
Objective: Standardize position references in text.
Task: Map various position descriptions to standard abbreviations.
```python
from typing import List


class PositionNormalizer:
    """Normalize position references to standard abbreviations."""

    POSITION_MAP = {
        # Offensive positions
        'quarterback': 'QB',
        'signal caller': 'QB',
        'passer': 'QB',
        'running back': 'RB',
        'halfback': 'RB',
        'tailback': 'RB',
        # Add more mappings...
    }

    def normalize_position(self, text: str) -> str:
        """Convert position text to standard abbreviation."""
        # Your code here
        pass

    def extract_all_positions(self, text: str) -> List[str]:
        """Find and normalize all positions mentioned."""
        # Your code here
        pass


# Test with various phrasings
test_cases = [
    "The signal-caller showed poise under pressure",
    "As a pass-rusher, he dominated the offensive tackle",
    "The nickel corner covered the slot receiver effectively"
]
```
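Hint: most of the work in `normalize_position` is collapsing surface variation (case, hyphens, extra spaces) before the lookup. A sketch using the class's `POSITION_MAP` above:

```python
import re
from typing import Dict, Optional


def normalize_position_sketch(text: str, position_map: Dict[str, str]) -> Optional[str]:
    """Look up a position phrase after collapsing case, hyphens, and extra spaces."""
    key = re.sub(r"[-\s]+", " ", text.lower()).strip()
    return position_map.get(key)


print(normalize_position_sketch("Signal-Caller", PositionNormalizer.POSITION_MAP))  # QB
```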
Level 2: Intermediate Exercises
Exercise 25.5: Sentiment Analysis for Scouting
Objective: Analyze sentiment in scouting report language.
Task: Build a football-specific sentiment analyzer.
```python
from typing import Dict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


class ScoutingSentimentAnalyzer:
    """Analyze sentiment in scouting reports."""

    POSITIVE_INDICATORS = [
        'elite', 'excellent', 'outstanding', 'dominant', 'impressive',
        'quick', 'explosive', 'strong', 'polished', 'natural'
    ]

    NEGATIVE_INDICATORS = [
        'concerns', 'limited', 'struggles', 'inconsistent', 'raw',
        'stiff', 'lacks', 'poor', 'below-average', 'question'
    ]

    def __init__(self):
        self.vectorizer = TfidfVectorizer(ngram_range=(1, 2))
        self.model = LogisticRegression()

    def analyze_sentence(self, sentence: str) -> Dict:
        """
        Analyze sentiment of a single sentence.
        Returns:
            {
                'sentiment': 'positive' | 'negative' | 'neutral',
                'confidence': 0.0-1.0,
                'key_phrases': ['elite athleticism', 'questions about']
            }
        """
        # Your code here
        pass

    def analyze_report(self, report: str) -> Dict:
        """
        Analyze full scouting report.
        Returns:
            {
                'overall_sentiment': 0.0-1.0,
                'strengths': ['arm strength', 'pocket presence'],
                'weaknesses': ['footwork', 'reads'],
                'sentence_sentiments': [...]
            }
        """
        # Your code here
        pass


# Test with real scouting language
sample_report = """
Elite arm talent with the ability to make every throw. Shows excellent
pocket presence and poise under pressure. Some concerns about consistency
in his footwork, particularly on throws to his left. Lacks ideal size
for the position but compensates with quick release.
"""
```
Exercise 25.6: Attribute Extraction
Objective: Extract player attribute mentions from scouting reports.
Task: Identify and score attribute discussions.
```python
from typing import Dict


class AttributeExtractor:
    """Extract player attributes from scouting text."""

    ATTRIBUTE_LEXICONS = {
        'arm_strength': {
            'positive': ['cannon', 'rocket', 'strong arm', 'zip', 'velocity'],
            'negative': ['weak arm', 'limited arm', 'floats', 'lacks velocity']
        },
        'athleticism': {
            'positive': ['explosive', 'athletic', 'agile', 'quick', 'burst'],
            'negative': ['stiff', 'plodding', 'lacks athleticism', 'slow']
        },
        'football_iq': {
            'positive': ['smart', 'instinctive', 'anticipates', 'reads well'],
            'negative': ['slow processor', 'misreads', 'late', 'fooled']
        },
        # Add more attributes...
    }

    def extract_attributes(self, text: str) -> Dict[str, Dict]:
        """
        Extract and score attributes from text.
        Returns:
            {
                'arm_strength': {'mentioned': True, 'sentiment': 0.8, 'phrases': [...]},
                'athleticism': {'mentioned': True, 'sentiment': 0.6, 'phrases': [...]},
                ...
            }
        """
        # Your code here
        pass

    def compare_to_position_avg(self, attributes: Dict, position: str) -> Dict:
        """Compare extracted attributes to position averages."""
        # Your code here
        pass
```
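Hint: a first version of `extract_attributes` can be plain substring matching against the lexicons, with sentiment as the positive share of hits. A sketch assuming the `AttributeExtractor.ATTRIBUTE_LEXICONS` above:

```python
from typing import Dict


def extract_attributes_sketch(text: str) -> Dict[str, Dict]:
    """Phrase-match each attribute lexicon; sentiment is the positive share of hits."""
    lowered = text.lower()
    results = {}
    for attribute, lexicon in AttributeExtractor.ATTRIBUTE_LEXICONS.items():
        pos_hits = [p for p in lexicon['positive'] if p in lowered]
        neg_hits = [p for p in lexicon['negative'] if p in lowered]
        hits = pos_hits + neg_hits
        results[attribute] = {
            'mentioned': bool(hits),
            'sentiment': len(pos_hits) / len(hits) if hits else None,
            'phrases': hits,
        }
    return results


print(extract_attributes_sketch("Elite velocity on deep balls, but stiff in the open field"))
```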
Exercise 25.7: Topic Modeling
Objective: Discover discussion topics in football text collections.
Task: Build a topic model for football articles.
```python
from typing import Dict, List

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer


class FootballTopicModeler:
    """Discover topics in football text collections."""

    def __init__(self, n_topics: int = 10):
        self.n_topics = n_topics
        # Note: LDA is normally fit on raw term counts; CountVectorizer is a
        # common alternative to TF-IDF weighting here.
        self.vectorizer = TfidfVectorizer(
            max_features=2000,
            max_df=0.95,
            min_df=2,
            stop_words='english'
        )
        self.model = LatentDirichletAllocation(
            n_components=n_topics,
            random_state=42
        )

    def fit(self, documents: List[str]):
        """Fit topic model to document collection."""
        # Your code here
        pass

    def get_topics(self, n_words: int = 10) -> List[Dict]:
        """
        Get topics with top words.
        Returns:
            [
                {'topic_id': 0, 'words': ['quarterback', 'pass', ...], 'label': 'Passing Game'},
                ...
            ]
        """
        # Your code here
        pass

    def classify_document(self, document: str) -> Dict:
        """
        Classify new document into topics.
        Returns:
            {
                'primary_topic': 0,
                'topic_distribution': [0.4, 0.3, 0.1, ...],
                'topic_labels': ['Passing Game', 'Recruiting', ...]
            }
        """
        # Your code here
        pass


# Train on sample football articles
sample_articles = [
    "The quarterback prospect showed elite arm talent at the combine...",
    "Alabama secured another top recruiting class with 5-star commits...",
    "The defense struggled against the run, allowing 200+ yards...",
    # Add more articles
]
```
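Hint: `fit` and `get_topics` reduce to a few lines once the vectorizer and model are wired together. The sketch below assumes the `FootballTopicModeler` attributes above and leaves topic labeling as a manual step.

```python
from typing import Dict, List

import numpy as np


def fit_sketch(modeler: FootballTopicModeler, documents: List[str]) -> None:
    """Vectorize the corpus and fit the LDA model on the document-term matrix."""
    doc_term = modeler.vectorizer.fit_transform(documents)
    modeler.model.fit(doc_term)


def get_topics_sketch(modeler: FootballTopicModeler, n_words: int = 10) -> List[Dict]:
    """Return the top-weighted vocabulary terms for each fitted topic."""
    vocab = modeler.vectorizer.get_feature_names_out()
    topics = []
    for topic_id, weights in enumerate(modeler.model.components_):
        top_idx = np.argsort(weights)[::-1][:n_words]
        topics.append({'topic_id': topic_id, 'words': [vocab[i] for i in top_idx]})
    return topics
```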
Exercise 25.8: Text Similarity for Player Comparison
Objective: Compare players based on scouting report text.
Task: Build a text-based player comparison system.
```python
from typing import Dict, List, Tuple

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class PlayerTextComparator:
    """Compare players using text similarity."""

    def __init__(self):
        self.vectorizer = TfidfVectorizer(ngram_range=(1, 2))
        self.player_vectors = {}

    def add_player_report(self, player_id: str, report: str):
        """Add a player's scouting report."""
        # Your code here
        pass

    def find_similar_players(self, player_id: str, n: int = 5) -> List[Tuple[str, float]]:
        """
        Find most similar players based on scouting reports.
        Returns:
            [('Player B', 0.85), ('Player C', 0.72), ...]
        """
        # Your code here
        pass

    def compare_two_players(self, player1_id: str, player2_id: str) -> Dict:
        """
        Detailed comparison of two players.
        Returns:
            {
                'similarity_score': 0.75,
                'shared_strengths': ['arm strength', 'athleticism'],
                'differences': {'player1': ['size'], 'player2': ['speed']},
                'common_phrases': ['elite prospect', 'first round talent']
            }
        """
        # Your code here
        pass
```
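Hint: because the TF-IDF vocabulary changes as players are added, one workable design is to store raw reports and re-vectorize when a similarity query arrives. A sketch under that assumption, taking a plain `{player_id: report}` dict rather than the class:

```python
from typing import Dict, List, Tuple

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def find_similar_players_sketch(reports: Dict[str, str],
                                player_id: str,
                                n: int = 5) -> List[Tuple[str, float]]:
    """Rank other players by cosine similarity of TF-IDF report vectors."""
    ids = list(reports)
    matrix = TfidfVectorizer(ngram_range=(1, 2)).fit_transform([reports[i] for i in ids])
    sims = cosine_similarity(matrix[ids.index(player_id)], matrix).ravel()
    ranked = sorted(
        ((pid, float(score)) for pid, score in zip(ids, sims) if pid != player_id),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n]
```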
Level 3: Advanced Exercises
Exercise 25.9: Scouting Report Grade Prediction
Objective: Predict draft grades from scouting report text.
Task: Build a model to predict numerical grades.
```python
from typing import Dict, List

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class GradePredictor:
    """Predict draft grades from scouting reports."""

    def __init__(self):
        self.pipeline = Pipeline([
            ('vectorizer', TfidfVectorizer(max_features=1000)),
            ('regressor', GradientBoostingRegressor(n_estimators=100))
        ])

    def train(self, reports: List[str], grades: List[float]):
        """Train grade prediction model."""
        # Your code here
        pass

    def predict_grade(self, report: str) -> Dict:
        """
        Predict grade for new report.
        Returns:
            {
                'predicted_grade': 78.5,
                'confidence_interval': (72.0, 85.0),
                'key_positive_phrases': [...],
                'key_negative_phrases': [...]
            }
        """
        # Your code here
        pass

    def explain_prediction(self, report: str) -> Dict:
        """
        Explain what drove the grade prediction.
        Returns feature importance and influential phrases.
        """
        # Your code here
        pass
```
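Hint: `train` is a single `fit` call on the pipeline. The `predict_grade` sketch below returns only a point estimate, since an honest confidence interval needs something extra (for example, separate quantile-loss models):

```python
from typing import Dict, List


def train_sketch(predictor: GradePredictor, reports: List[str], grades: List[float]) -> None:
    """Fit the TF-IDF + gradient boosting pipeline end to end."""
    predictor.pipeline.fit(reports, grades)


def predict_grade_sketch(predictor: GradePredictor, report: str) -> Dict:
    """Point prediction only; see the docstring above for the fuller target schema."""
    grade = float(predictor.pipeline.predict([report])[0])
    return {'predicted_grade': round(grade, 1)}
```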
Exercise 25.10: Media Sentiment Tracking System
Objective: Track sentiment about teams/players over time.
Task: Build a real-time sentiment tracking system.
```python
from datetime import datetime, timedelta
from typing import Dict, List

import pandas as pd


class MediaSentimentTracker:
    """Track media sentiment over time."""

    def __init__(self):
        self.sentiment_analyzer = ScoutingSentimentAnalyzer()
        self.history = pd.DataFrame()

    def ingest_article(self,
                       article: str,
                       date: datetime,
                       source: str,
                       entities: List[str]):
        """Ingest a new article and update sentiment tracking."""
        # Your code here
        pass

    def get_sentiment_trend(self, entity: str, days: int = 30) -> pd.DataFrame:
        """
        Get sentiment trend for entity over time.
        Returns DataFrame with:
            - date
            - avg_sentiment
            - article_count
            - sentiment_std
        """
        # Your code here
        pass

    def detect_sentiment_shifts(self, entity: str, threshold: float = 0.2) -> List[Dict]:
        """
        Detect significant sentiment changes.
        Returns:
            [
                {
                    'date': datetime,
                    'shift': 0.3,
                    'direction': 'positive',
                    'possible_cause': 'Big win against rival'
                }
            ]
        """
        # Your code here
        pass

    def generate_report(self, entity: str) -> str:
        """Generate a human-readable sentiment report."""
        # Your code here
        pass
```
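Hint: if `self.history` is kept as a DataFrame with `date`, `entity`, and `sentiment` columns (an assumption, not a requirement), `get_sentiment_trend` is mostly a filtered groupby:

```python
import pandas as pd


def sentiment_trend_sketch(history: pd.DataFrame, entity: str, days: int = 30) -> pd.DataFrame:
    """Daily sentiment aggregates; assumes columns date (datetime64), entity, sentiment."""
    cutoff = pd.Timestamp.now() - pd.Timedelta(days=days)
    subset = history[(history['entity'] == entity) & (history['date'] >= cutoff)]
    return (
        subset.groupby(subset['date'].dt.date)['sentiment']
        .agg(avg_sentiment='mean', article_count='count', sentiment_std='std')
        .reset_index()
    )
```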
Exercise 25.11: Automated Scouting Report Generator
Objective: Generate scouting report summaries from structured data.
Task: Build a template-based report generator.
```python
from typing import Dict


class ScoutingReportGenerator:
    """Generate scouting reports from structured data."""

    TEMPLATES = {
        'arm_strength': {
            'high': "Possesses a cannon arm with elite velocity.",
            'medium': "Shows adequate arm strength for the position.",
            'low': "Limited arm strength may restrict throw types."
        },
        # Add more templates...
    }

    def __init__(self):
        self.templates = self.TEMPLATES

    def generate_report(self, player_data: Dict, stats: Dict, attributes: Dict) -> str:
        """
        Generate a scouting report from structured data.
        Args:
            player_data: {'name': 'John Smith', 'position': 'QB', ...}
            stats: {'pass_yards': 3245, 'td': 28, 'int': 8, ...}
            attributes: {'arm_strength': 0.8, 'athleticism': 0.6, ...}
        Returns:
            Full scouting report text (300-500 words)
        """
        # Your code here
        pass

    def generate_comparison_report(self, player1_data: Dict, player2_data: Dict) -> str:
        """Generate a comparison report for two players."""
        # Your code here
        pass
```
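Hint: the core of the generator is mapping each 0-1 attribute score to a template tier; the thresholds in the sketch below are illustrative, not prescribed:

```python
from typing import Dict


def attribute_sentences_sketch(attributes: Dict[str, float], templates: Dict) -> str:
    """Map each 0-1 attribute score to a template tier and join the sentences."""
    sentences = []
    for attribute, score in attributes.items():
        tiers = templates.get(attribute)
        if not tiers:
            continue
        tier = 'high' if score >= 0.7 else 'medium' if score >= 0.4 else 'low'
        sentences.append(tiers[tier])
    return " ".join(sentences)


print(attribute_sentences_sketch({'arm_strength': 0.8}, ScoutingReportGenerator.TEMPLATES))
# Possesses a cannon arm with elite velocity.
```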
Exercise 25.12: Transfer Portal Sentiment Analysis
Objective: Analyze sentiment around transfer portal decisions.
Task: Build a specialized sentiment analyzer for transfer news.
```python
from typing import Dict, List

import pandas as pd


class TransferPortalAnalyzer:
    """Analyze transfer portal sentiment and trends."""

    def __init__(self):
        self.sentiment_analyzer = ScoutingSentimentAnalyzer()

    def analyze_transfer_announcement(self, text: str) -> Dict:
        """
        Analyze sentiment around a transfer announcement.
        Returns:
            {
                'player': 'John Smith',
                'from_school': 'Alabama',
                'to_school': 'Ohio State',
                'sentiment': {
                    'fan_reaction': 0.7,
                    'media_reaction': 0.6,
                    'overall': 0.65
                },
                'reasons_mentioned': ['playing time', 'NIL'],
                'expected_impact': 'high'
            }
        """
        # Your code here
        pass

    def track_transfer_class_sentiment(self, school: str, articles: List[Dict]) -> Dict:
        """Track sentiment about a school's transfer class."""
        # Your code here
        pass

    def compare_school_portal_success(self, schools: List[str]) -> pd.DataFrame:
        """Compare schools' transfer portal sentiment."""
        # Your code here
        pass
```
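Hint: extracting the two schools often comes down to a pattern over the announcement text. The regex below handles only the literal "transfers from X to Y" phrasing and is meant as a first cut:

```python
import re
from typing import Dict, Optional

TRANSFER_PATTERN = re.compile(
    r"transfer(?:s|red|ring)?\s+from\s+([A-Z][\w .&'-]+?)\s+to\s+([A-Z][\w .&'-]+?)(?=[.,;]|$)"
)


def parse_transfer_sketch(text: str) -> Optional[Dict]:
    """Pull from/to schools when the text follows a 'transfers from X to Y' phrasing."""
    match = TRANSFER_PATTERN.search(text)
    if not match:
        return None
    return {'from_school': match.group(1).strip(), 'to_school': match.group(2).strip()}


print(parse_transfer_sketch("John Smith transfers from Alabama to Ohio State."))
# {'from_school': 'Alabama', 'to_school': 'Ohio State'}
```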
Level 4: Expert Challenges
Exercise 25.13: End-to-End NLP Pipeline
Objective: Build a complete scouting NLP system.
Task: Create an integrated pipeline from raw text to insights.
```python
from typing import Dict, List

import pandas as pd


class ComprehensiveScoutingNLP:
    """Complete NLP pipeline for scouting analysis."""

    def __init__(self):
        self.preprocessor = FootballTextPreprocessor()
        self.ner = FootballNER()
        self.sentiment = ScoutingSentimentAnalyzer()
        self.attributes = AttributeExtractor()
        self.grade_predictor = GradePredictor()

    def analyze_report(self, raw_text: str) -> Dict:
        """
        Complete analysis of a scouting report.
        Returns comprehensive analysis including:
        - Extracted entities
        - Sentiment analysis
        - Attribute scores
        - Predicted grade
        - Comparison to similar players
        - Confidence scores
        """
        # Your code here
        pass

    def batch_analyze(self, reports: List[Dict]) -> pd.DataFrame:
        """Analyze multiple reports and return structured results."""
        # Your code here
        pass

    def generate_insights(self, analysis_results: pd.DataFrame) -> Dict:
        """Generate high-level insights from batch analysis."""
        # Your code here
        pass
```
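Hint: once the earlier exercises are implemented, `analyze_report` is mostly composition. The sketch below simply chains the components built in this chapter; it will return empty results until those methods are filled in.

```python
from typing import Dict


def analyze_report_sketch(nlp: ComprehensiveScoutingNLP, raw_text: str) -> Dict:
    """Chain the chapter's components; each sub-result keeps its own schema."""
    cleaned = nlp.preprocessor.clean_text(raw_text) or raw_text
    return {
        'entities': nlp.ner.extract_entities(raw_text),
        'sentiment': nlp.sentiment.analyze_report(cleaned),
        'attributes': nlp.attributes.extract_attributes(cleaned),
        'predicted_grade': nlp.grade_predictor.predict_grade(cleaned),
    }
```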
Exercise 25.14: Real-Time Media Monitor
Objective: Monitor live media for football insights.
Task: Build a streaming media analysis system.
```python
import asyncio
from collections import deque
from typing import Dict


class RealTimeMediaMonitor:
    """Real-time monitoring and analysis of football media."""

    def __init__(self):
        self.article_buffer = deque(maxlen=1000)
        self.entity_tracker = {}
        self.alert_rules = []

    async def process_article(self, article: Dict) -> Dict:
        """
        Process an incoming article in real time.
        Returns:
            {
                'entities': [...],
                'sentiment': {...},
                'alerts': [...],
                'trending_topics': [...]
            }
        """
        # Your code here
        pass

    def add_alert_rule(self, rule: Dict):
        """
        Add an alerting rule.
        Example rule:
            {
                'entity': 'Ohio State',
                'condition': 'sentiment_drop',
                'threshold': 0.2,
                'action': 'notify'
            }
        """
        # Your code here
        pass

    async def run(self, article_stream):
        """Main processing loop."""
        # Your code here
        pass
```
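Hint: if `article_stream` is an async iterator of article dicts (one reasonable reading of this exercise), the main loop is a straightforward `async for`:

```python
from typing import AsyncIterator, Dict


async def run_sketch(monitor: RealTimeMediaMonitor, article_stream: AsyncIterator[Dict]) -> None:
    """Consume an async stream of article dicts and process each one as it arrives."""
    async for article in article_stream:
        result = await monitor.process_article(article)
        monitor.article_buffer.append(article)
        # In a real system, route result['alerts'] to a notifier here.
```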
Submission Guidelines
- Code Quality: Include docstrings and type hints
- Testing: Provide test cases for each function
- Sample Data: Include sample inputs and expected outputs
- Documentation: Explain your approach for complex algorithms
Evaluation Criteria
| Level | Criteria | Points |
|---|---|---|
| Level 1 | Correct preprocessing, entity extraction | 25 |
| Level 2 | Working sentiment and topic models | 30 |
| Level 3 | Advanced NLP applications | 30 |
| Level 4 | Complete pipeline integration | 15 |
Resources
- NLTK documentation
- spaCy tutorials for NER
- scikit-learn text processing guide
- Hugging Face transformers (optional, for advanced exercises)