Quiz: Data Sources and Collection in Soccer

Test your understanding before moving to the next chapter. Target: 70% or higher to proceed. Time: ~35 minutes


Section 1: Multiple Choice (1 point each)

1. What is the primary difference between event data and tracking data?

  • A) Event data is more accurate than tracking data
  • B) Event data records discrete actions while tracking data records continuous positions
  • C) Tracking data only covers training sessions, not matches
  • D) Event data includes video while tracking data does not
Answer **B)** Event data records discrete actions while tracking data records continuous positions *Explanation:* Event data captures specific actions (passes, shots, tackles) as discrete records. Tracking data captures the continuous position of every player 25 times per second. Reference Section 2.1.

2. Approximately how many data points does tracking data generate per match at 25 frames per second?

  • A) 10,000 - 50,000
  • B) 100,000 - 500,000
  • C) 1 million - 2 million
  • D) 4 million - 6 million
Answer **D)** 4 million - 6 million *Explanation:* 25 frames/second × 90 minutes × 60 seconds = 135,000 frames. With 22 players × coordinates, plus ball, velocities, etc., this generates approximately 4-6 million data points per match. Reference Section 2.1.3.
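
The arithmetic in this explanation can be checked with a short sketch. It assumes a 90-minute match with no stoppage time and one (x, y) pair stored per tracked entity per frame; real provider formats may store more fields (velocities, IDs), so treat the totals as a rough order of magnitude.

```python
# Rough estimate of tracking-data volume for one match.
# Assumptions (not tied to any provider): each tracked entity is stored
# as an (x, y) pair per frame, and the match lasts exactly 90 minutes.
FRAMES_PER_SECOND = 25
MATCH_MINUTES = 90


def tracking_data_points(entities: int, values_per_entity: int) -> int:
    """Return the number of stored values for a full match."""
    frames = FRAMES_PER_SECOND * MATCH_MINUTES * 60
    return frames * entities * values_per_entity


frames = FRAMES_PER_SECOND * MATCH_MINUTES * 60
players_only = tracking_data_points(entities=22, values_per_entity=2)
with_ball = tracking_data_points(entities=23, values_per_entity=2)

print(f"Frames per match: {frames:,}")            # 135,000
print(f"Players only (x, y): {players_only:,}")   # 5,940,000
print(f"Players + ball (x, y): {with_ball:,}")    # 6,210,000
```

Under these assumptions the estimate lands at the upper end of the 4-6 million range quoted in the answer.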

3. Which of the following is NOT a typical method for collecting tracking data?

  • A) Optical cameras positioned around the stadium
  • B) GPS devices worn by players
  • C) Human coders watching live video feeds
  • D) Accelerometers in player equipment
Answer **C)** Human coders watching live video feeds *Explanation:* Human coders are used for event data collection, not tracking data. Tracking data is collected through automated systems: optical tracking (cameras) or wearable devices (GPS, accelerometers). Reference Section 2.2.2.

4. What are "freeze frames" in soccer data?

  • A) Paused video clips for analysis
  • B) Snapshots of all player positions at key moments
  • C) Frozen player statistics at mid-season
  • D) Error states in tracking data
Answer **B)** Snapshots of all player positions at key moments *Explanation:* Freeze frames capture the position of all 22 players (and the ball) at specific moments, typically during key events like shots. They provide spatial context without the full data volume of tracking data. Reference Section 2.1.4.
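
To make the idea concrete, here is a minimal sketch of how a freeze frame might be used to measure pressure on a shooter. The structure and field names (`location`, `teammate`) are simplified assumptions for illustration, not an exact provider schema, and the coordinates are invented.

```python
import math

# Hypothetical, simplified freeze frame for one shot. Field names and
# coordinates are illustrative assumptions, not real match data.
shot_location = (108.0, 40.0)
freeze_frame = [
    {"location": (110.0, 41.0), "teammate": False},  # defender close to shooter
    {"location": (119.0, 40.0), "teammate": False},  # goalkeeper
    {"location": (100.0, 30.0), "teammate": True},   # supporting attacker
    {"location": (95.0, 50.0), "teammate": False},   # defender far from the ball
]


def opponents_within(frame, origin, radius):
    """Count opponents whose position lies within `radius` of `origin`."""
    return sum(
        1
        for player in frame
        if not player["teammate"]
        and math.dist(player["location"], origin) <= radius
    )


nearby = opponents_within(freeze_frame, shot_location, radius=5.0)
print(f"Opponents within 5 units of the shooter: {nearby}")
```

This kind of spatial context is unavailable from the shot event alone, which records only the shooter's location and the outcome.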

5. Which data provider offers free "Open Data" for educational use?

  • A) Opta
  • B) Wyscout
  • C) StatsBomb
  • D) Second Spectrum
Answer **C)** StatsBomb *Explanation:* StatsBomb releases free event data from select competitions (World Cups, etc.) through their Open Data initiative. This has been crucial for soccer analytics education. Reference Section 2.4.2.

6. What is the typical accuracy of GPS-based player tracking compared to optical tracking?

  • A) GPS is more accurate (5-10cm vs 30-50cm)
  • B) Optical is more accurate (10-30cm vs 1-5m)
  • C) They are approximately equal (within 10cm)
  • D) It depends entirely on the specific system vendor
Answer **B)** Optical is more accurate (10-30cm vs 1-5m) *Explanation:* Optical tracking systems using stadium cameras achieve approximately 10-30cm accuracy, while GPS systems are typically accurate to 1-5 meters. However, GPS provides additional physical metrics. Reference Section 2.2.2.

7. Which of the following is a PRIMARY strength of Wyscout compared to other providers?

  • A) Highest data quality and event detail
  • B) Integrated video platform with broad league coverage
  • C) Free access to all data
  • D) Best tracking data coverage
Answer **B)** Integrated video platform with broad league coverage *Explanation:* Wyscout's key differentiator is its integrated video analysis platform combined with broad coverage of leagues including smaller and lower-division competitions. Reference Section 2.3.4.

8. How many events are typically recorded per match in event data?

  • A) 200-500
  • B) 500-1,000
  • C) 2,000-3,000
  • D) 10,000-15,000
Answer **C)** 2,000-3,000 *Explanation:* Event data providers typically tag 2,000-3,000 events per match, including all passes, shots, tackles, duels, and other actions. Reference Section 2.1.2.

9. What is the PRIMARY limitation of using only event data for analysis?

  • A) It's too expensive for most analysts
  • B) It doesn't capture off-ball movement and positioning
  • C) It's not available for major leagues
  • D) The data is always outdated by several weeks
Answer **B)** It doesn't capture off-ball movement and positioning *Explanation:* Event data only records what happens at the moment of discrete actions. Everything that occurs between events—player positioning, space creation, pressing movements—is invisible in event data. Reference Section 2.1.2.

10. Which format is recommended for storing large analytical soccer datasets?

  • A) CSV because it's human-readable
  • B) JSON because it handles hierarchical data
  • C) Parquet because it's fast and compressed
  • D) XML because it's a web standard
Answer **C)** Parquet because it's fast and compressed *Explanation:* Parquet format is recommended for large analytical datasets because it's columnar (efficient for analytical queries), compressed (smaller file sizes), and maintains data types. Reference Section 2.6.3.

Section 2: True/False (1 point each)

11. Event data is collected automatically by computer vision systems without human involvement.

Answer **False** *Explanation:* Event data is primarily collected by human coders watching live or recorded video. They use specialized software to log events in real time. Automated event detection is emerging but is not yet the primary method. Reference Section 2.2.1.

12. FBref provides free access to player statistics powered by StatsBomb data.

Answer **True** *Explanation:* FBref (Football Reference) provides free access to aggregated player and team statistics, with advanced stats powered by StatsBomb data since 2017-18. Reference Section 2.4.3.

13. All major European leagues have complete tracking data available for public access.

Answer **False** *Explanation:* Tracking data remains expensive and limited in availability. It requires either stadium infrastructure (optical) or team cooperation (GPS), and is not publicly available for most leagues. Reference Sections 2.1.3 and 2.3.5.

14. Inter-coder agreement for event data is typically 95-98%.

Answer **True** *Explanation:* Event data from professional providers achieves approximately 95-98% agreement between coders for most event classifications, though subjective events may have lower consistency. Reference Section 2.2.1.

15. Second Spectrum and TRACAB are primarily known for their event data products.

Answer **False** *Explanation:* Second Spectrum and TRACAB (ChyronHego) are tracking data providers, not event data providers. They use optical systems to capture player positions. Reference Section 2.3.5.

16. Missing data should always be filled in with average values to maintain dataset completeness.

Answer **False** *Explanation:* For analytical work, it's often better to exclude problematic data than to impute values. The appropriate handling depends on why data is missing and how it will be used. Imputation can introduce bias. Reference Section 2.5.4.
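
One way imputation can distort a dataset is sketched below: filling gaps with the mean keeps the average unchanged but shrinks the spread. The values are invented for illustration, and this shows only one kind of bias among several.

```python
import pandas as pd

# Illustrative values only: pass distances with two missing entries.
distances = pd.Series([5.0, 10.0, None, 30.0, None, 15.0])

observed = distances.dropna()                   # exclude missing values
imputed = distances.fillna(observed.mean())     # fill gaps with the mean

print(f"Mean (observed): {observed.mean():.2f}")
print(f"Mean (imputed):  {imputed.mean():.2f}")
print(f"Std (observed):  {observed.std():.2f}")
print(f"Std (imputed):   {imputed.std():.2f}")
```

The mean is identical in both cases, but the imputed series has a smaller standard deviation, so any analysis that depends on variability would be understated.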

Section 3: Fill in the Blank (1 point each)

17. __ data records the continuous position of every player at 25 frames per second.

Answer **Tracking** *Explanation:* Tracking data captures continuous positional information, distinguishing it from event data which only captures discrete actions.

18. The StatsBomb __ Data initiative provides free event data for educational use.

Answer **Open** *Explanation:* StatsBomb Open Data is the official name of their free data initiative, covering World Cups and other select competitions.

19. In the data pipeline, __ involves checking data for errors, inconsistencies, and quality issues before analysis.

Answer **Validation** *Explanation:* Data validation is the process of systematically checking data quality, including completeness, accuracy, and consistency checks.

20. __ tracking uses cameras positioned around the stadium to determine player positions through computer vision.

Answer **Optical** *Explanation:* Optical tracking systems use multiple high-definition cameras and computer vision algorithms to track player positions.

Section 4: Short Answer (2 points each)

21. Name three types of data quality issues that commonly occur in soccer event data and briefly explain each.

Sample Answer

1. **Missing Data:** Events or records absent from the dataset, such as entire matches missing or specific event types not recorded.
2. **Coordinate Errors:** Inaccurate location data where events are recorded outside pitch boundaries or with systematic biases.
3. **Inconsistent Classification:** The same action being classified differently by different coders or across different seasons/providers.

Other valid answers: temporal issues (events out of sequence), entity resolution problems (inconsistent player IDs), duplicate records.

*Key points for full credit:*

  • Three distinct types identified
  • Brief explanation of each
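
Two of these issues, missing data and duplicate records, can be detected with a few lines of pandas. The sketch below uses invented data, and the column names are assumptions rather than any provider's schema.

```python
import pandas as pd

# Illustrative events; column names are assumptions for this sketch.
events = pd.DataFrame({
    "match_id": [1, 1, 1, 1],
    "timestamp": [12.0, 15.5, 15.5, 20.1],
    "player_id": [7, 10, 10, None],   # last row has no player recorded
    "event": ["pass", "shot", "shot", "pass"],
})


def quality_report(df):
    """Count rows with any missing value and exact duplicate rows."""
    return {
        "rows": len(df),
        "rows_with_missing": int(df.isna().any(axis=1).sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


report = quality_report(events)
print(report)
```

Here the repeated shot at 15.5 seconds is counted as one duplicate, and the final pass is flagged for its missing `player_id`.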

22. Explain why a club analyst might need both event data AND tracking data for comprehensive analysis.

Sample Answer

Event data tells you what happened (passes, shots, tackles) but not the context of how it happened. Tracking data provides this context: where were teammates and opponents positioned? How much space was available? Was the player under pressure?

For example, understanding pressing effectiveness requires tracking data to see all 22 players' positions, not just the person who won the ball. Similarly, analyzing "progressive passes" is more meaningful with tracking data showing what space the pass opened up.

*Key points for full credit:*

  • Recognition that each data type captures different information
  • Example of analysis requiring both

23. A colleague wants to start learning soccer analytics but has no budget for data. What data sources would you recommend and why?

Sample Answer

I would recommend:

1. **StatsBomb Open Data:** Highest quality free data, includes World Cup matches with freeze frames. Available via Python API or direct download. Best for learning event data analysis.
2. **FBref:** Free aggregated statistics for major leagues. Good for learning to work with player/team statistics. Requires web scraping but is well-structured.
3. **Understat:** Free xG data and shot maps. Good for learning about expected goals concepts.

These sources cover the major learning needs without cost, though they have coverage limitations compared to paid providers.

*Key points for full credit:*

  • At least two specific free sources named
  • Brief justification for recommendations

Section 5: Code Analysis (2 points each)

24. The following code attempts to validate event coordinates. Identify the bug and explain how to fix it.

def validate_coordinates(df, pitch_length=120, pitch_width=80):
    """Check for invalid coordinates."""
    invalid = df[(df['x'] < 0) & (df['x'] > pitch_length)]
    return len(invalid)
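Answer **Bug:** The condition `(df['x'] < 0) & (df['x'] > pitch_length)` is impossible to satisfy. A value cannot simultaneously be less than 0 AND greater than `pitch_length`, so the function always returns 0. **Fix:** Change `&` to `|` (OR), and extend the check to `y` so both axes are validated: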
invalid = df[(df['x'] < 0) | (df['x'] > pitch_length) |
             (df['y'] < 0) | (df['y'] > pitch_width)]
*Explanation:* The original code uses AND when it should use OR—we want to find coordinates that are EITHER below 0 OR above the maximum.
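
For reference, a complete runnable version of the corrected function might look like the sketch below. The sample DataFrame is invented, and the 120 x 80 pitch dimensions simply follow the defaults in the question.

```python
import pandas as pd


def validate_coordinates(df, pitch_length=120, pitch_width=80):
    """Return the number of rows whose x or y lies outside the pitch."""
    out_of_bounds = (
        (df["x"] < 0) | (df["x"] > pitch_length)
        | (df["y"] < 0) | (df["y"] > pitch_width)
    )
    return int(out_of_bounds.sum())


# Illustrative data: the second row (x = 125) and the fourth row (y = -2)
# fall outside a 120 x 80 pitch.
events = pd.DataFrame({
    "x": [60.0, 125.0, 30.0, 10.0],
    "y": [40.0, 40.0, 79.9, -2.0],
})
invalid_count = validate_coordinates(events)
print(f"Invalid coordinates: {invalid_count}")
```

Note that comparisons against NaN evaluate to False, so rows with missing coordinates are not counted by this check and need a separate `isna()` test.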

25. What will the following code print, and what potential issue does it reveal about the data?

import pandas as pd

df = pd.DataFrame({
    'timestamp': [1.0, 2.5, 2.3, 4.0, 5.2],
    'event': ['pass', 'pass', 'shot', 'pass', 'pass']
})
df_sorted = df.sort_values('timestamp')
negative_diff = (df_sorted['timestamp'].diff() < 0).sum()
print(f"Out of sequence: {negative_diff}")
Answer **Output:** `Out of sequence: 0`

After sorting by timestamp, the order will be [1.0, 2.3, 2.5, 4.0, 5.2]. The `diff()` call will produce approximately [NaN, 1.3, 0.2, 1.5, 1.2], none of which is negative.

**Issue Revealed:** The original data had events out of sequence (2.5 before 2.3), which could indicate data quality problems. Sorting first hides exactly the problem the check is meant to find. Run on the original unsorted data, the same check finds 1 out-of-sequence event, so it should be run BEFORE sorting:
negative_diff_original = (df['timestamp'].diff() < 0).sum()  # Would return 1
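
The before-and-after comparison can be wrapped in a small helper, sketched here. The function name `count_out_of_sequence` is a hypothetical choice for this example, not part of pandas.

```python
import pandas as pd


def count_out_of_sequence(df, time_col="timestamp"):
    """Count rows whose timestamp is earlier than the previous row's."""
    return int((df[time_col].diff() < 0).sum())


df = pd.DataFrame({
    "timestamp": [1.0, 2.5, 2.3, 4.0, 5.2],
    "event": ["pass", "pass", "shot", "pass", "pass"],
})

before = count_out_of_sequence(df)                           # data as delivered
after = count_out_of_sequence(df.sort_values("timestamp"))   # after sorting

print(f"Out of sequence before sorting: {before}")
print(f"Out of sequence after sorting:  {after}")
```

Checking the data as delivered reports one out-of-sequence event, while the sorted copy reports none, which is why the check belongs before any sorting step.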

Scoring

| Section | Points | Your Score |
|---|---|---|
| Multiple Choice (1-10) | 10 | ___ |
| True/False (11-16) | 6 | ___ |
| Fill in Blank (17-20) | 4 | ___ |
| Short Answer (21-23) | 6 | ___ |
| Code Analysis (24-25) | 4 | ___ |
| Total | 30 | ___ |

Passing Score: 21/30 (70%)


Review Recommendations

  • Score < 50%: Re-read entire chapter, focusing on Sections 2.1-2.3
  • Score 50-69%: Review Sections 2.4 (Free Data) and 2.5 (Data Quality), complete exercises Parts A-C
  • Score 70-85%: Good understanding! Review any missed topics before proceeding
  • Score > 85%: Excellent! Ready for Chapter 3

Next Steps

If you scored 70% or higher:

  • Complete at least one case study from this chapter
  • Begin Chapter 3: Statistical Foundations for Soccer Analysis

If you scored below 70%:

  • Review the sections indicated above
  • Re-attempt the exercises in Parts A and B
  • Retake the quiz before proceeding