Chapter 25 Exercises: Natural Language Processing for Scouting

Exercise Set Overview

These exercises progress from basic text processing to advanced NLP applications in football scouting and media analysis.


Level 1: Foundational Exercises

Exercise 25.1: Text Preprocessing Pipeline

Objective: Build a text cleaning pipeline for football text data.

Task: Implement a comprehensive text preprocessor for scouting reports.

import re
from typing import Dict, List, Tuple

class FootballTextPreprocessor:
    """Preprocess football-related text data."""

    def __init__(self):
        self.stat_patterns = []
        self.football_terms = {}

    def clean_text(self, text: str) -> str:
        """
        Clean raw text:
        - Convert to lowercase
        - Remove special characters (keep hyphens, periods)
        - Normalize whitespace
        - Handle football-specific abbreviations
        """
        # Your code here
        pass

    def extract_numbers_with_context(self, text: str) -> List[Dict]:
        """
        Extract numbers with surrounding context.
        Return: [{'number': 42, 'context': 'rushed for 42 yards', 'type': 'rushing'}]
        """
        # Your code here
        pass

    def tokenize(self, text: str, remove_stopwords: bool = True) -> List[str]:
        """
        Tokenize text with football-aware processing.
        Handle multi-word terms like 'offensive line', 'running back'
        """
        # Your code here
        pass

# Test cases
sample_texts = [
    "QB threw for 324 yards and 3 TDs against Cover-2 defense",
    "The RB averaged 5.2 YPC on 18 carries in the 4th quarter",
    "WR ran a 4.42 40-yard dash at the NFL Combine"
]
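
As a starting point, one minimal sketch of the number-with-context step might look like this. The whitespace tokenization and the `window` parameter are illustrative choices, not the required approach, and the `type` field from the docstring is left out:

```python
import re
from typing import Dict, List

def extract_numbers_with_context(text: str, window: int = 3) -> List[Dict]:
    """Find each bare number and return it with a few surrounding words."""
    tokens = text.split()
    results = []
    for i, tok in enumerate(tokens):
        match = re.fullmatch(r'(\d+(?:\.\d+)?)', tok)
        if match:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            results.append({
                'number': float(match.group(1)),
                'context': ' '.join(tokens[lo:hi]),
            })
    return results
```

Running this over the first sample text yields `324` with the context "QB threw for 324 yards and 3", which is enough context for a later classification pass to label it as a passing stat.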

Exercise 25.2: Named Entity Recognition

Objective: Extract football entities from text.

Task: Build a basic NER system for football content.

class FootballNER:
    """Named Entity Recognition for football text."""

    def __init__(self):
        self.team_patterns = self._build_team_patterns()
        # Handles "Kyle McCord" and initialed names like "J.J. McCarthy"
        self.player_pattern = re.compile(r'\b([A-Z](?:[a-z]+|\.[A-Z]\.)\s+[A-Z][A-Za-z]+)\b')

    def _build_team_patterns(self) -> Dict[str, str]:
        """Build pattern to canonical team name mapping."""
        return {
            'ohio state': 'Ohio State',
            'buckeyes': 'Ohio State',
            'osu': 'Ohio State',
            # Add more teams...
        }

    def extract_entities(self, text: str) -> Dict[str, List]:
        """
        Extract all entities from text.

        Returns:
            {
                'teams': ['Ohio State', 'Michigan'],
                'players': ['Kyle McCord', 'J.J. McCarthy'],
                'positions': ['QB', 'WR'],
                'stats': [{'value': 324, 'type': 'yards'}]
            }
        """
        # Your code here
        pass

    def link_entities(self, entities: Dict, context: Dict) -> Dict:
        """
        Link extracted entities to database records.
        Resolve ambiguous player names using team context.
        """
        # Your code here
        pass
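
One possible first cut at `extract_entities`, reduced to teams and players only. The alias table and the two-capitalized-words pattern are illustrative; note the pattern misses names like "McCord" (internal capital) and will false-positive on capitalized team names like "Michigan State":

```python
import re
from typing import Dict, List

TEAM_ALIASES = {'buckeyes': 'Ohio State', 'ohio state': 'Ohio State', 'osu': 'Ohio State'}
# Simple two-word pattern; misses "McCord"-style names and initialed names
PLAYER_RE = re.compile(r'\b([A-Z][a-z]+\s+[A-Z][a-z]+)\b')

def extract_entities(text: str) -> Dict[str, List[str]]:
    lowered = text.lower()
    teams = sorted({canon for alias, canon in TEAM_ALIASES.items() if alias in lowered})
    players = PLAYER_RE.findall(text)
    return {'teams': teams, 'players': players}
```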

Exercise 25.3: Stat Extraction

Objective: Extract and normalize statistical mentions from text.

Task: Parse complex stat strings into structured data.

def extract_game_stats(text: str) -> Dict:
    """
    Extract game statistics from text.

    Example input:
    "McCord went 21-of-28 for 286 yards with 2 TDs and 0 INTs.
     Henderson rushed 15 times for 98 yards and a touchdown."

    Expected output:
    {
        'passing': {
            'completions': 21,
            'attempts': 28,
            'yards': 286,
            'touchdowns': 2,
            'interceptions': 0
        },
        'rushing': {
            'carries': 15,
            'yards': 98,
            'touchdowns': 1
        }
    }
    """
    # Your code here
    pass

# Test with various stat formats
test_texts = [
    "Threw for 324 yards on 28 attempts",
    "Finished 18/24, 215 yards, 2 TD, 1 INT",
    "Ran for 125 yards on 22 carries (5.7 YPC)"
]
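
A minimal sketch for the passing portion, handling both the "21-of-28" and "18/24" formats from the test texts (the regex covers only these two forms; TD/INT parsing is left as part of the exercise):

```python
import re
from typing import Dict, Optional

def parse_passing_line(text: str) -> Optional[Dict[str, int]]:
    """Parse '21-of-28 for 286 yards' or '18/24, 215 yards' style lines."""
    m = re.search(r'(\d+)(?:-of-|/)(\d+).*?(\d+)\s+yards', text)
    if not m:
        return None
    comp, att, yards = map(int, m.groups())
    return {'completions': comp, 'attempts': att, 'yards': yards}
```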

Exercise 25.4: Position Normalization

Objective: Standardize position references in text.

Task: Map various position descriptions to standard abbreviations.

class PositionNormalizer:
    """Normalize position references to standard abbreviations."""

    POSITION_MAP = {
        # Offensive positions
        'quarterback': 'QB',
        'signal caller': 'QB',
        'passer': 'QB',
        'running back': 'RB',
        'halfback': 'RB',
        'tailback': 'RB',
        # Add more mappings...
    }

    def normalize_position(self, text: str) -> str:
        """Convert position text to standard abbreviation."""
        # Your code here
        pass

    def extract_all_positions(self, text: str) -> List[str]:
        """Find and normalize all positions mentioned."""
        # Your code here
        pass

# Test with various phrasings
test_cases = [
    "The signal-caller showed poise under pressure",
    "As a pass-rusher, he dominated the offensive tackle",
    "The nickel corner covered the slot receiver effectively"
]
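
One way to approach `extract_all_positions` is longest-phrase-first matching, so "running back" wins over any shorter overlapping phrase. The small map below (and the 'EDGE' label for pass rushers) is illustrative only:

```python
from typing import List

POSITION_MAP = {
    'signal caller': 'QB', 'signal-caller': 'QB', 'quarterback': 'QB',
    'running back': 'RB', 'tailback': 'RB',
    'pass rusher': 'EDGE', 'pass-rusher': 'EDGE',  # 'EDGE' label is an assumption
}

def extract_all_positions(text: str) -> List[str]:
    """Match longer phrases first; return each abbreviation once."""
    lowered = text.lower()
    found = []
    for phrase in sorted(POSITION_MAP, key=len, reverse=True):
        if phrase in lowered:
            abbr = POSITION_MAP[phrase]
            if abbr not in found:
                found.append(abbr)
    return found
```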

Level 2: Intermediate Exercises

Exercise 25.5: Sentiment Analysis for Scouting

Objective: Analyze sentiment in scouting report language.

Task: Build a football-specific sentiment analyzer.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

class ScoutingSentimentAnalyzer:
    """Analyze sentiment in scouting reports."""

    POSITIVE_INDICATORS = [
        'elite', 'excellent', 'outstanding', 'dominant', 'impressive',
        'quick', 'explosive', 'strong', 'polished', 'natural'
    ]

    NEGATIVE_INDICATORS = [
        'concerns', 'limited', 'struggles', 'inconsistent', 'raw',
        'stiff', 'lacks', 'poor', 'below-average', 'question'
    ]

    def __init__(self):
        self.vectorizer = TfidfVectorizer(ngram_range=(1, 2))
        self.model = LogisticRegression()

    def analyze_sentence(self, sentence: str) -> Dict:
        """
        Analyze sentiment of a single sentence.

        Returns:
            {
                'sentiment': 'positive' | 'negative' | 'neutral',
                'confidence': 0.0-1.0,
                'key_phrases': ['elite athleticism', 'questions about']
            }
        """
        # Your code here
        pass

    def analyze_report(self, report: str) -> Dict:
        """
        Analyze full scouting report.

        Returns:
            {
                'overall_sentiment': 0.0-1.0,
                'strengths': ['arm strength', 'pocket presence'],
                'weaknesses': ['footwork', 'reads'],
                'sentence_sentiments': [...]
            }
        """
        # Your code here
        pass

# Test with real scouting language
sample_report = """
Elite arm talent with the ability to make every throw. Shows excellent
pocket presence and poise under pressure. Some concerns about consistency
in his footwork, particularly on throws to his left. Lacks ideal size
for the position but compensates with quick release.
"""

Exercise 25.6: Attribute Extraction

Objective: Extract player attribute mentions from scouting reports.

Task: Identify and score attribute discussions.

class AttributeExtractor:
    """Extract player attributes from scouting text."""

    ATTRIBUTE_LEXICONS = {
        'arm_strength': {
            'positive': ['cannon', 'rocket', 'strong arm', 'zip', 'velocity'],
            'negative': ['weak arm', 'limited arm', 'floats', 'lacks velocity']
        },
        'athleticism': {
            'positive': ['explosive', 'athletic', 'agile', 'quick', 'burst'],
            'negative': ['stiff', 'plodding', 'lacks athleticism', 'slow']
        },
        'football_iq': {
            'positive': ['smart', 'instinctive', 'anticipates', 'reads well'],
            'negative': ['slow processor', 'misreads', 'late', 'fooled']
        },
        # Add more attributes...
    }

    def extract_attributes(self, text: str) -> Dict[str, Dict]:
        """
        Extract and score attributes from text.

        Returns:
            {
                'arm_strength': {'mentioned': True, 'sentiment': 0.8, 'phrases': [...]},
                'athleticism': {'mentioned': True, 'sentiment': 0.6, 'phrases': [...]},
                ...
            }
        """
        # Your code here
        pass

    def compare_to_position_avg(self,
                                 attributes: Dict,
                                 position: str) -> Dict:
        """Compare extracted attributes to position averages."""
        # Your code here
        pass
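
A bare-bones sketch of the extraction step, using substring matching against a single-attribute lexicon (the `len(pos) / total` sentiment convention and the trimmed phrase lists are illustrative):

```python
from typing import Dict

LEXICON = {
    'arm_strength': {'positive': ['cannon', 'strong arm', 'velocity'],
                     'negative': ['weak arm', 'lacks velocity']},
}

def score_attributes(text: str) -> Dict[str, Dict]:
    """Flag attribute mentions and score them by positive-hit share."""
    lowered = text.lower()
    out = {}
    for attr, lex in LEXICON.items():
        hits_pos = [p for p in lex['positive'] if p in lowered]
        hits_neg = [p for p in lex['negative'] if p in lowered]
        total = len(hits_pos) + len(hits_neg)
        out[attr] = {
            'mentioned': total > 0,
            'sentiment': len(hits_pos) / total if total else 0.0,
            'phrases': hits_pos + hits_neg,
        }
    return out
```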

Exercise 25.7: Topic Modeling

Objective: Discover discussion topics in football text collections.

Task: Build a topic model for football articles.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

class FootballTopicModeler:
    """Discover topics in football text collections."""

    def __init__(self, n_topics: int = 10):
        self.n_topics = n_topics
        # LDA models word counts, so feed it raw counts rather than TF-IDF weights
        self.vectorizer = CountVectorizer(
            max_features=2000,
            max_df=0.95,
            min_df=2,
            stop_words='english'
        )
        self.model = LatentDirichletAllocation(
            n_components=n_topics,
            random_state=42
        )

    def fit(self, documents: List[str]):
        """Fit topic model to document collection."""
        # Your code here
        pass

    def get_topics(self, n_words: int = 10) -> List[Dict]:
        """
        Get topics with top words.

        Returns:
            [
                {'topic_id': 0, 'words': ['quarterback', 'pass', ...], 'label': 'Passing Game'},
                ...
            ]
        """
        # Your code here
        pass

    def classify_document(self, document: str) -> Dict:
        """
        Classify new document into topics.

        Returns:
            {
                'primary_topic': 0,
                'topic_distribution': [0.4, 0.3, 0.1, ...],
                'topic_labels': ['Passing Game', 'Recruiting', ...]
            }
        """
        # Your code here
        pass

# Train on sample football articles
sample_articles = [
    "The quarterback prospect showed elite arm talent at the combine...",
    "Alabama secured another top recruiting class with 5-star commits...",
    "The defense struggled against the run, allowing 200+ yards...",
    # Add more articles
]

Exercise 25.8: Text Similarity for Player Comparison

Objective: Compare players based on scouting report text.

Task: Build a text-based player comparison system.

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class PlayerTextComparator:
    """Compare players using text similarity."""

    def __init__(self):
        self.vectorizer = TfidfVectorizer(ngram_range=(1, 2))
        self.player_vectors = {}

    def add_player_report(self, player_id: str, report: str):
        """Add a player's scouting report."""
        # Your code here
        pass

    def find_similar_players(self,
                              player_id: str,
                              n: int = 5) -> List[Tuple[str, float]]:
        """
        Find most similar players based on scouting reports.

        Returns:
            [('Player B', 0.85), ('Player C', 0.72), ...]
        """
        # Your code here
        pass

    def compare_two_players(self,
                             player1_id: str,
                             player2_id: str) -> Dict:
        """
        Detailed comparison of two players.

        Returns:
            {
                'similarity_score': 0.75,
                'shared_strengths': ['arm strength', 'athleticism'],
                'differences': {'player1': ['size'], 'player2': ['speed']},
                'common_phrases': ['elite prospect', 'first round talent']
            }
        """
        # Your code here
        pass
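
The core of `compare_two_players` is the similarity score itself, which can be sketched in a few lines (fitting the vectorizer on just the two reports is a simplification; the full exercise would fit it on the whole corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def report_similarity(report_a: str, report_b: str) -> float:
    """TF-IDF cosine similarity between two scouting reports."""
    vec = TfidfVectorizer()
    matrix = vec.fit_transform([report_a, report_b])
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])
```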

Level 3: Advanced Exercises

Exercise 25.9: Scouting Report Grade Prediction

Objective: Predict draft grades from scouting report text.

Task: Build a model to predict numerical grades.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline

class GradePredictor:
    """Predict draft grades from scouting reports."""

    def __init__(self):
        self.pipeline = Pipeline([
            ('vectorizer', TfidfVectorizer(max_features=1000)),
            ('regressor', GradientBoostingRegressor(n_estimators=100))
        ])

    def train(self, reports: List[str], grades: List[float]):
        """Train grade prediction model."""
        # Your code here
        pass

    def predict_grade(self, report: str) -> Dict:
        """
        Predict grade for new report.

        Returns:
            {
                'predicted_grade': 78.5,
                'confidence_interval': (72.0, 85.0),
                'key_positive_phrases': [...],
                'key_negative_phrases': [...]
            }
        """
        # Your code here
        pass

    def explain_prediction(self, report: str) -> Dict:
        """
        Explain what drove the grade prediction.

        Returns feature importance and influential phrases.
        """
        # Your code here
        pass
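
The train/predict cycle can be sketched end to end on toy data. Ridge stands in for the gradient-boosting regressor here purely to keep the example small; the grades and reports are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

def fit_grade_model(reports, grades):
    """Tiny text-regression pipeline: TF-IDF features into a linear model."""
    model = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('reg', Ridge(alpha=1.0)),
    ])
    model.fit(reports, grades)
    return model
```

Even with four training reports, a phrase like "elite" should pull the prediction above one built on "lacks".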

Exercise 25.10: Media Sentiment Tracking System

Objective: Track sentiment about teams/players over time.

Task: Build a real-time sentiment tracking system.

import pandas as pd
from datetime import datetime, timedelta

class MediaSentimentTracker:
    """Track media sentiment over time."""

    def __init__(self):
        self.sentiment_analyzer = ScoutingSentimentAnalyzer()
        self.history = pd.DataFrame()

    def ingest_article(self,
                       article: str,
                       date: datetime,
                       source: str,
                       entities: List[str]):
        """
        Ingest new article and update sentiment tracking.
        """
        # Your code here
        pass

    def get_sentiment_trend(self,
                            entity: str,
                            days: int = 30) -> pd.DataFrame:
        """
        Get sentiment trend for entity over time.

        Returns DataFrame with:
        - date
        - avg_sentiment
        - article_count
        - sentiment_std
        """
        # Your code here
        pass

    def detect_sentiment_shifts(self,
                                 entity: str,
                                 threshold: float = 0.2) -> List[Dict]:
        """
        Detect significant sentiment changes.

        Returns:
            [
                {
                    'date': datetime,
                    'shift': 0.3,
                    'direction': 'positive',
                    'possible_cause': 'Big win against rival'
                }
            ]
        """
        # Your code here
        pass

    def generate_report(self, entity: str) -> str:
        """Generate human-readable sentiment report."""
        # Your code here
        pass

Exercise 25.11: Automated Scouting Report Generator

Objective: Generate scouting report summaries from structured data.

Task: Build a template-based report generator.

class ScoutingReportGenerator:
    """Generate scouting reports from structured data."""

    TEMPLATES = {
        'arm_strength': {
            'high': "Possesses a cannon arm with elite velocity.",
            'medium': "Shows adequate arm strength for the position.",
            'low': "Limited arm strength may restrict throw types."
        },
        # Add more templates...
    }

    def __init__(self):
        self.templates = self.TEMPLATES

    def generate_report(self,
                        player_data: Dict,
                        stats: Dict,
                        attributes: Dict) -> str:
        """
        Generate a scouting report from structured data.

        Args:
            player_data: {'name': 'John Smith', 'position': 'QB', ...}
            stats: {'pass_yards': 3245, 'td': 28, 'int': 8, ...}
            attributes: {'arm_strength': 0.8, 'athleticism': 0.6, ...}

        Returns:
            Full scouting report text (300-500 words)
        """
        # Your code here
        pass

    def generate_comparison_report(self,
                                    player1_data: Dict,
                                    player2_data: Dict) -> str:
        """Generate a comparison report for two players."""
        # Your code here
        pass
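
The template-selection step can be sketched by bucketing each 0-1 attribute score and joining the chosen sentences; the 0.7/0.4 thresholds are arbitrary cutoffs for illustration:

```python
TEMPLATES = {
    'arm_strength': {
        'high': "Possesses a cannon arm with elite velocity.",
        'medium': "Shows adequate arm strength for the position.",
        'low': "Limited arm strength may restrict throw types.",
    },
}

def bucket(score: float) -> str:
    """Map a 0-1 score to a template tier (thresholds are assumptions)."""
    if score >= 0.7:
        return 'high'
    if score >= 0.4:
        return 'medium'
    return 'low'

def render_attributes(attributes):
    """Pick one template sentence per attribute based on its score."""
    sentences = [TEMPLATES[attr][bucket(score)]
                 for attr, score in attributes.items() if attr in TEMPLATES]
    return ' '.join(sentences)
```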

Exercise 25.12: Transfer Portal Sentiment Analysis

Objective: Analyze sentiment around transfer portal decisions.

Task: Build a specialized sentiment analyzer for transfer news.

class TransferPortalAnalyzer:
    """Analyze transfer portal sentiment and trends."""

    def __init__(self):
        self.sentiment_analyzer = ScoutingSentimentAnalyzer()

    def analyze_transfer_announcement(self, text: str) -> Dict:
        """
        Analyze sentiment around transfer announcement.

        Returns:
            {
                'player': 'John Smith',
                'from_school': 'Alabama',
                'to_school': 'Ohio State',
                'sentiment': {
                    'fan_reaction': 0.7,
                    'media_reaction': 0.6,
                    'overall': 0.65
                },
                'reasons_mentioned': ['playing time', 'NIL'],
                'expected_impact': 'high'
            }
        """
        # Your code here
        pass

    def track_transfer_class_sentiment(self,
                                        school: str,
                                        articles: List[Dict]) -> Dict:
        """Track sentiment about school's transfer class."""
        # Your code here
        pass

    def compare_school_portal_success(self,
                                       schools: List[str]) -> pd.DataFrame:
        """Compare schools' transfer portal sentiment."""
        # Your code here
        pass

Level 4: Expert Challenges

Exercise 25.13: End-to-End NLP Pipeline

Objective: Build a complete scouting NLP system.

Task: Create an integrated pipeline from raw text to insights.

class ComprehensiveScoutingNLP:
    """Complete NLP pipeline for scouting analysis."""

    def __init__(self):
        self.preprocessor = FootballTextPreprocessor()
        self.ner = FootballNER()
        self.sentiment = ScoutingSentimentAnalyzer()
        self.attributes = AttributeExtractor()
        self.grade_predictor = GradePredictor()

    def analyze_report(self, raw_text: str) -> Dict:
        """
        Complete analysis of scouting report.

        Returns comprehensive analysis including:
        - Extracted entities
        - Sentiment analysis
        - Attribute scores
        - Predicted grade
        - Comparison to similar players
        - Confidence scores
        """
        # Your code here
        pass

    def batch_analyze(self, reports: List[Dict]) -> pd.DataFrame:
        """Analyze multiple reports and return structured results."""
        # Your code here
        pass

    def generate_insights(self, analysis_results: pd.DataFrame) -> Dict:
        """Generate high-level insights from batch analysis."""
        # Your code here
        pass

Exercise 25.14: Real-Time Media Monitor

Objective: Monitor live media for football insights.

Task: Build a streaming media analysis system.

import asyncio
from collections import deque

class RealTimeMediaMonitor:
    """Real-time monitoring and analysis of football media."""

    def __init__(self):
        self.article_buffer = deque(maxlen=1000)
        self.entity_tracker = {}
        self.alert_rules = []

    async def process_article(self, article: Dict) -> Dict:
        """
        Process incoming article in real-time.

        Returns:
            {
                'entities': [...],
                'sentiment': {...},
                'alerts': [...],
                'trending_topics': [...]
            }
        """
        # Your code here
        pass

    def add_alert_rule(self, rule: Dict):
        """
        Add alerting rule.

        Example rule:
        {
            'entity': 'Ohio State',
            'condition': 'sentiment_drop',
            'threshold': 0.2,
            'action': 'notify'
        }
        """
        # Your code here
        pass

    async def run(self, article_stream):
        """Main processing loop."""
        # Your code here
        pass
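
The shape of the main loop can be sketched with an async generator standing in for the live feed (the `fake_stream` source and flat article dicts are stand-ins; a real system would plug in a feed client and per-article analysis):

```python
import asyncio
from collections import deque

async def monitor(article_stream, buffer_size=1000):
    """Consume an async iterable of articles, keeping a bounded buffer."""
    buffer = deque(maxlen=buffer_size)
    async for article in article_stream:
        buffer.append(article)  # per-article NER/sentiment would run here
    return list(buffer)

async def fake_stream():
    """Stand-in for a live article feed."""
    for i in range(3):
        yield {'id': i, 'text': f'article {i}'}
        await asyncio.sleep(0)  # yield control, as a real feed would

result = asyncio.run(monitor(fake_stream()))
```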

Submission Guidelines

  1. Code Quality: Include docstrings and type hints
  2. Testing: Provide test cases for each function
  3. Sample Data: Include sample inputs and expected outputs
  4. Documentation: Explain your approach for complex algorithms

Evaluation Criteria

Level     Criteria                                    Points
Level 1   Correct preprocessing, entity extraction      25
Level 2   Working sentiment and topic models            30
Level 3   Advanced NLP applications                     30
Level 4   Complete pipeline integration                 15

Resources

  • NLTK documentation
  • spaCy tutorials for NER
  • scikit-learn text processing guide
  • Hugging Face transformers (optional advanced)