Chapter 25 Quiz: Natural Language Processing for Scouting
Instructions
- 35 questions total
- Mix of multiple choice, true/false, and short answer
- Time limit: 45 minutes
- Passing score: 70%
Section 1: Text Processing Fundamentals (10 questions)
Question 1
What is the primary purpose of text preprocessing in NLP pipelines?
A) To reduce file size
B) To standardize text format for consistent analysis
C) To remove all numbers from text
D) To translate text to another language
Question 2
In football NLP, why might you want to preserve hyphens during text cleaning?
A) They look nice in output
B) Hyphenated terms like "play-action" and "run-pass" are meaningful
C) Standard practice requires keeping all punctuation
D) It makes tokenization faster
Question 3
True or False: Stop word removal should always be applied in football text analysis.
Question 4
Which tokenization challenge is unique to football text?
A) Handling Unicode characters
B) Multi-word terms like "tight end" and "offensive line"
C) Sentence boundary detection
D) Removing whitespace
Question 5
What regex pattern would best extract "21-of-28" completion statistics?
A) \d+-\d+
B) \d+-of-\d+
C) [0-9]+/[0-9]+
D) \d+ of \d+
Question 6
Short Answer: Explain why case normalization (converting to lowercase) might cause issues when processing football text. Give an example.
Question 7
The TF-IDF weighting scheme gives higher scores to terms that:
A) Appear frequently across all documents
B) Appear frequently in one document but rarely across the corpus
C) Are the longest words in the document
D) Appear at the beginning of documents
Question 8
True or False: N-grams are sequences of N consecutive words, and bigrams (N=2) can help capture phrases like "arm strength" as single features.
Question 9
When extracting statistics from text like "threw for 324 yards", the context word "yards" helps:
A) Determine the player position
B) Classify the statistic type
C) Calculate the team score
D) Identify the game location
Question 10
Which preprocessing step would help handle variations like "TD", "TDs", "touchdown", and "touchdowns"?
A) Stop word removal
B) Lemmatization/stemming
C) Term normalization mapping
D) Spell checking
Section 2: Named Entity Recognition (8 questions)
Question 11
In football NER, which entities are typically most difficult to disambiguate?
A) Team names
B) Player names (due to common surnames)
C) Position abbreviations
D) Statistics
Question 12
The pattern [A-Z][a-z]+ [A-Z][a-z]+ would match:
A) Any two words
B) Two capitalized words (potential player names)
C) Team abbreviations
D) Position codes
Question 13
True or False: Team nicknames like "Buckeyes" or "Crimson Tide" should be mapped to their official team names for consistent analysis.
Question 14
Short Answer: How would you distinguish between "Michigan" the team and "Michigan" the state in football text?
Question 15
Entity linking in football NER refers to:
A) Connecting mentions to database records
B) Drawing lines between players on a field
C) Linking social media accounts
D) Connecting paragraphs in a report
Question 16
Which approach would help resolve ambiguous player references like "Smith"?
A) Always pick the most famous player
B) Use surrounding context (team mentions, position) for disambiguation
C) Ignore ambiguous mentions
D) Ask the user to clarify
Question 17
True or False: Regular expressions alone are sufficient for production-quality football NER systems.
Question 18
When building a team name recognizer, including variations like "OSU", "Ohio St.", and "Buckeyes" mapping to "Ohio State" is an example of:
A) Machine learning
B) Alias mapping / gazetteers
C) Deep learning
D) Semantic analysis
Section 3: Sentiment and Topic Analysis (10 questions)
Question 19
In scouting report sentiment analysis, the word "explosive" is typically:
A) Negative (implies volatility)
B) Positive (describes athleticism)
C) Neutral (purely descriptive)
D) Context-dependent
Question 20
What challenge does football sentiment analysis face that general sentiment analysis doesn't?
A) Longer documents
B) Domain-specific vocabulary with different meanings
C) More grammatical errors
D) Foreign languages
Question 21
True or False: A word like "limited" is almost always negative in scouting reports.
Question 22
Topic modeling with LDA (Latent Dirichlet Allocation) assumes:
A) Each document contains exactly one topic
B) Documents are mixtures of topics, each topic is a distribution over words
C) Topics are predefined by the user
D) All documents have the same length
Question 23
Short Answer: Name three potential topics you might discover when applying LDA to a corpus of football articles.
Question 24
When analyzing scouting reports, why might sentence-level sentiment be more useful than document-level?
A) Faster to compute
B) Reports often contain both strengths and weaknesses in different sentences
C) Documents are too short
D) Required by industry standards
Question 25
The phrase "concerns about his footwork" in a scouting report would be classified as:
A) Strength identification
B) Weakness identification
C) Neutral observation
D) Statistical mention
Question 26
True or False: Higher TF-IDF scores indicate words that are both frequent in the document and rare across the corpus.
Question 27
Which metric best measures topic model quality?
A) Word count
B) Coherence score
C) Document length
D) Vocabulary size
Question 28
In a scouting context, what does attribute extraction typically aim to identify?
A) Player jersey numbers
B) Qualities like arm strength, athleticism, football IQ
C) Team schedules
D) Game scores
Section 4: Applications (7 questions)
Question 29
When building a text-based player comparison system, cosine similarity measures:
A) The absolute difference between word counts
B) The angle between document vectors (semantic similarity)
C) The length ratio of documents
D) The number of shared words
Question 30
True or False: Media sentiment about a team typically correlates with their recent win-loss record.
Question 31
Short Answer: Describe how you would use NLP to detect when a player's media perception is changing significantly.
Question 32
Which application would benefit most from real-time NLP processing?
A) Historical draft grade analysis
B) Transfer portal news monitoring
C) Career retrospectives
D) Rule book analysis
Question 33
When predicting draft grades from scouting reports, which feature type typically has the highest predictive power?
A) Document length
B) Number of sentences
C) Sentiment-weighted attribute mentions
D) Author name
Question 34
True or False: Scouting report language tends to be more formal and standardized than social media commentary.
Question 35
The main challenge in building automated scouting report generators is:
A) Database storage
B) Producing natural, varied language that reads like human writing
C) Processing speed
D) File format compatibility
Answer Key
Section 1: Text Processing Fundamentals
1. B) To standardize text format for consistent analysis - Preprocessing ensures consistent input for downstream NLP tasks.
2. B) Hyphenated terms like "play-action" and "run-pass" are meaningful - These compound terms have specific football meanings that would be lost if split.
3. False - Stop words should be removed selectively; some, like "no" (as in "no interceptions"), carry meaning. Context matters.
4. B) Multi-word terms like "tight end" and "offensive line" - Football has many multi-word position and concept names that should be treated as single tokens.
5. B) \d+-of-\d+ - This pattern specifically matches the "X-of-Y" completion format common in football statistics.
6. Sample Answer: Case normalization can erase information that capitalization carries. For example, lowercasing "TD" (touchdown) to "td" makes it harder to recognize as an acronym, so it may be confused with other meanings. Capitalized names such as "J.J. McCarthy" also lose the cue that marks them as proper nouns. One solution is to normalize acronyms before case conversion.
7. B) Appear frequently in one document but rarely across the corpus - TF-IDF rewards terms that are distinctive to specific documents.
8. True - Bigrams capture two-word phrases as single features, preserving multi-word concepts.
9. B) Classify the statistic type - The context word identifies this as a yardage statistic rather than touchdowns, completions, etc.
10. C) Term normalization mapping - Create a mapping that converts every variation to a canonical form (e.g., "TD", "TDs", and "touchdowns" all map to "touchdown").
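
The regex from Question 5 and the normalization mapping from Question 10 can be sketched in a few lines. This is a minimal illustration, not the chapter's implementation: the names `TERM_ALIASES`, `extract_completions`, and `normalize_terms` are hypothetical, and the alias table covers only the four variations named in Question 10.

```python
import re

# Hypothetical alias table (illustrative only): surface form -> canonical term.
TERM_ALIASES = {
    "td": "touchdown",
    "tds": "touchdown",
    "touchdown": "touchdown",
    "touchdowns": "touchdown",
}

# Matches the "X-of-Y" completion format, e.g. "21-of-28".
COMPLETION_RE = re.compile(r"\b(\d+)-of-(\d+)\b")

# Keeps hyphenated words such as "play-action" as a single token.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*|\d+")


def extract_completions(text):
    """Return (completions, attempts) pairs found in 'X-of-Y' form."""
    return [(int(c), int(a)) for c, a in COMPLETION_RE.findall(text)]


def normalize_terms(text):
    """Tokenize, then map known aliases to their canonical form.

    The alias lookup is case-insensitive; tokens without an alias are
    simply lowercased.
    """
    tokens = TOKEN_RE.findall(text)
    return [TERM_ALIASES.get(tok.lower(), tok.lower()) for tok in tokens]


if __name__ == "__main__":
    print(extract_completions("He went 21-of-28 for 324 yards."))
    print(normalize_terms("3 TDs on play-action"))
```

A lookup table is used here instead of stemming because an abbreviation like "TD" shares no stem with "touchdown", so only an explicit mapping can join them.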
Section 2: Named Entity Recognition
11. B) Player names (due to common surnames) - Many players share surnames (Smith, Jones, Williams), requiring context for disambiguation.
12. B) Two capitalized words (potential player names) - This pattern matches the typical "First Last" name format.
13. True - Normalizing team names ensures consistent entity references throughout analysis.
14. Sample Answer: Use context clues: If surrounded by football terms (played, beat, vs.), team mentions, or game contexts, it's likely the team. If surrounded by geographic or demographic terms (state of, located in, population), it's likely the state. Also check for team-specific terms nearby (Wolverines, Big Ten, etc.).
15. A) Connecting mentions to database records - Entity linking resolves text mentions to specific entities in a knowledge base.
16. B) Use surrounding context (team mentions, position) for disambiguation - Context provides clues for resolving ambiguous references.
17. False - Production systems typically combine regex with machine learning for better accuracy and handling of edge cases.
18. B) Alias mapping / gazetteers - A gazetteer is a dictionary mapping aliases to canonical entity names.
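
The gazetteer from Question 18 and the context-clue approach from Question 14 can be sketched as follows. This is a rough illustration under stated assumptions: `TEAM_ALIASES`, `TEAM_CUES`, `STATE_CUES`, `resolve_team`, and `classify_michigan` are hypothetical names, the cue lists contain only the terms mentioned in the sample answer, and simple substring counting stands in for whatever classifier a real system would use.

```python
# Hypothetical gazetteer: lowercase alias -> canonical team name.
# Only the aliases named in Question 18 are included.
TEAM_ALIASES = {
    "osu": "Ohio State",
    "ohio st.": "Ohio State",
    "buckeyes": "Ohio State",
    "ohio state": "Ohio State",
}

# Context cues taken from the Question 14 sample answer (illustrative only).
TEAM_CUES = ("played", "beat", "vs.", "wolverines", "big ten")
STATE_CUES = ("state of", "located in", "population")


def resolve_team(mention):
    """Map a team mention to its canonical name, or None if unknown."""
    return TEAM_ALIASES.get(mention.strip().lower())


def classify_michigan(sentence):
    """Guess whether 'Michigan' refers to the team or the state.

    Counts how many cue phrases of each kind occur in the sentence and
    returns 'team', 'state', or 'unknown' on a tie.
    """
    text = sentence.lower()
    team_hits = sum(cue in text for cue in TEAM_CUES)
    state_hits = sum(cue in text for cue in STATE_CUES)
    if team_hits > state_hits:
        return "team"
    if state_hits > team_hits:
        return "state"
    return "unknown"


if __name__ == "__main__":
    print(resolve_team("Ohio St."))
    print(classify_michigan("Michigan beat Ohio State in the Big Ten opener."))
```

Returning "unknown" on a tie keeps ambiguous mentions visible for later handling rather than forcing a guess.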
Section 3: Sentiment and Topic Analysis
19. B) Positive (describes athleticism) - In scouting, "explosive" refers to quick, powerful movements - a positive athletic attribute.
20. B) Domain-specific vocabulary with different meanings - Words like "limited," "raw," and "average" have specific connotations in scouting.
21. True - In scouting contexts, "limited" almost always indicates a weakness or deficiency.
22. B) Documents are mixtures of topics, each topic is a distribution over words - LDA's fundamental assumption.
23. Sample Answer: (1) Recruiting - commits, prospects, stars, visits; (2) Game Analysis - yards, touchdowns, defense, offense; (3) NFL Draft - pick, round, combine, projection; (4) Injuries - out, questionable, surgery, recovery; (5) Transfer Portal - entering, destination, NIL.
24. B) Reports often contain both strengths and weaknesses in different sentences - Scouting reports typically discuss both positives and negatives.
25. B) Weakness identification - "Concerns" signals a weakness or area for improvement.
26. True - TF-IDF balances term frequency (TF) with inverse document frequency (IDF).
27. B) Coherence score - Coherence measures how well the top words in each topic relate to each other semantically.
28. B) Qualities like arm strength, athleticism, football IQ - Attributes are the player characteristics scouts evaluate.
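
Sentence-level scoring with a domain lexicon (Questions 19, 24, and 25) might look like the sketch below. The lexicon, its polarity values, and the function names are assumptions made for illustration; a real lexicon would be far larger, and the naive sentence splitter would break on initials such as "J.J.".

```python
import re

# Hypothetical scouting lexicon: word -> polarity (+1 strength, -1 weakness).
# Polarities follow the answer key; the table itself is illustrative.
SCOUTING_LEXICON = {
    "explosive": 1,
    "concerns": -1,
    "limited": -1,
}


def split_sentences(report):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", report)
    return [p.strip() for p in parts if p.strip()]


def score_sentence(sentence):
    """Sum lexicon polarities over the lowercase words in one sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum(SCOUTING_LEXICON.get(word, 0) for word in words)


def label_report(report):
    """Label each sentence as 'strength', 'weakness', or 'neutral'."""
    labeled = []
    for sentence in split_sentences(report):
        score = score_sentence(sentence)
        if score > 0:
            label = "strength"
        elif score < 0:
            label = "weakness"
        else:
            label = "neutral"
        labeled.append((sentence, label))
    return labeled


if __name__ == "__main__":
    report = (
        "Explosive first step off the line. "
        "There are concerns about his footwork. "
        "He started 12 games."
    )
    for sentence, label in label_report(report):
        print(label, "|", sentence)
```

Scoring each sentence separately keeps a strength and a weakness in the same report from cancelling out, which is the point of Question 24.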
Section 4: Applications
29. B) The angle between document vectors (semantic similarity) - Cosine similarity measures how similar document vectors are directionally.
30. True - Media sentiment typically rises after wins and falls after losses, though other factors (injuries, controversies) also influence it.
31. Sample Answer: (1) Track sentiment scores over time using rolling averages; (2) Calculate sentiment change (delta) between periods; (3) Set threshold for "significant" change (e.g., >0.2 on 0-1 scale); (4) When threshold exceeded, flag as sentiment shift; (5) Analyze concurrent articles to identify potential cause; (6) Compare to baseline variance to avoid false positives from normal fluctuation.
32. B) Transfer portal news monitoring - Transfer portal news breaks quickly and requires real-time tracking.
33. C) Sentiment-weighted attribute mentions - The combination of which attributes are discussed and in what sentiment context best predicts grades.
34. True - Professional scouting reports follow more structured formats than informal social media posts.
35. B) Producing natural, varied language that reads like human writing - Natural language generation is challenging; avoiding repetitive, templated output requires sophisticated techniques.
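
Steps (1) through (4) of the Question 31 sample answer can be sketched as below. The function names, the window size of 3, and the input format (a chronological list of scores on a 0-1 scale) are assumptions for illustration; the baseline-variance check from step (6) and the cause analysis from step (5) are left out to keep the example short.

```python
from statistics import mean


def rolling_mean(values, window):
    """Mean of each run of `window` consecutive values."""
    return [
        mean(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


def detect_shifts(scores, window=3, threshold=0.2):
    """Flag points where sentiment moved by more than `threshold`.

    Compares each rolling average with the rolling average of the
    preceding, non-overlapping window. Returns (index, delta) pairs,
    where index is the position in `scores` of the last value in the
    later window.
    """
    smoothed = rolling_mean(scores, window)
    shifts = []
    for i in range(window, len(smoothed)):
        delta = smoothed[i] - smoothed[i - window]
        if abs(delta) > threshold:
            shifts.append((i + window - 1, round(delta, 3)))
    return shifts


if __name__ == "__main__":
    # Hypothetical sentiment scores: stable, then a sharp drop.
    scores = [0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.2, 0.2, 0.2]
    print(detect_shifts(scores))
```

Comparing non-overlapping windows avoids counting the same articles on both sides of the delta, at the cost of reacting a little later than a point-to-point comparison would.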
Scoring Guide
| Score | Grade | Feedback |
|---|---|---|
| 32-35 | A | Excellent NLP understanding |
| 28-31 | B | Good grasp of concepts, review sentiment analysis |
| 25-27 | C | Satisfactory, focus on applications |
| 21-24 | D | Needs improvement in core concepts |
| <21 | F | Re-study chapter material |