Chapter 23: Quiz

Instructions: Select the best answer for each question. Each question has exactly one correct answer unless otherwise stated.


Question 1. What is the primary advantage of video analysis over purely statistical analysis in soccer?

(a) Video analysis is faster to process (b) Video provides contextual information that numbers cannot capture, such as body shape, off-ball movement, and defensive structure (c) Video analysis is more objective than statistical analysis (d) Video data is easier to store than numerical data


Question 2. In a video tagging schema, the principle of "mutual exclusivity" means:

(a) Each event can only be tagged by one analyst (b) Categories within a single dimension should not overlap (c) Each match should only be tagged once (d) Tags should be exclusive to one team


Question 3. Cohen's kappa ($\kappa$) is used to measure:

(a) The accuracy of a computer vision model (b) The speed of a video tagging system (c) The agreement between two taggers, corrected for chance agreement (d) The number of events in a match


Question 4. A Cohen's kappa value of 0.45 would be interpreted as:

(a) Almost perfect agreement (b) Substantial agreement (c) Moderate agreement (d) Fair agreement -- the tagging schema likely needs refinement


Question 5. In the equation $\mathbf{I} \in \mathbb{R}^{H \times W \times C}$, what does $C$ represent for a standard color video frame?

(a) The number of cameras (b) The number of color channels (typically 3 for RGB) (c) The compression ratio (d) The number of classes for detection


Question 6. Why is HSV color space preferred over RGB for pitch detection in soccer video?

(a) HSV uses fewer bytes per pixel (b) HSV separates color information (hue) from brightness, making it easier to isolate the green pitch under varying lighting (c) HSV is faster to compute than RGB (d) HSV provides higher resolution


Question 7. A homography matrix is used to:

(a) Compress video files for storage (b) Transform coordinates from one plane to another, mapping pixel positions to real-world pitch coordinates (c) Detect edges in an image (d) Classify player jerseys by color


Question 8. How many point correspondences are needed at minimum to estimate a homography matrix?

(a) 2 (b) 3 (c) 4 (d) 8


Question 9. Transfer learning in the context of soccer CV refers to:

(a) Transferring video files between analysis platforms (b) Using a model pre-trained on a large general dataset and fine-tuning it on soccer-specific data (c) Transferring player tracking data between clubs (d) Moving a camera from one stadium to another


Question 10. Which object in soccer video is typically the most difficult to detect using CV, and why?

(a) Players, because they move unpredictably (b) The ball, because of its small size in wide-angle footage, motion blur, and frequent occlusion (c) The referee, because of the dark jersey (d) The goalposts, because they are partially obscured by the net


Question 11. In one-stage detectors such as RetinaNet and some YOLO variants, focal loss is used to:

(a) Improve detection of fast-moving objects (b) Address class imbalance by down-weighting easy, well-classified examples (c) Increase the resolution of small objects (d) Reduce computational cost


Question 12. Intersection over Union (IoU) is calculated as:

(a) The area of the predicted box divided by the area of the ground truth box (b) The area of intersection divided by the area of union of the predicted and ground truth boxes (c) The area of union divided by the area of intersection (d) The perimeter of intersection divided by the perimeter of union


Question 13. In multi-object tracking, the Hungarian algorithm is used for:

(a) Detecting objects in each frame (b) Estimating the homography matrix (c) Finding the optimal assignment between detections and existing tracks (d) Classifying team jerseys


Question 14. A Kalman filter in a tracking system maintains which type of information?

(a) Only the current position of each tracked object (b) A state estimate including position, dimensions, and their velocities, along with uncertainty (c) Only the color histogram of each player (d) The complete history of all past positions


Question 15. The single largest source of tracking errors in soccer CV is:

(a) Camera noise (b) Pitch markings confusing the detector (c) Occlusion, where one player obscures another from the camera's view (d) Changes in lighting during the match


Question 16. For team identification, why is only the torso region of the bounding box used rather than the full box?

(a) The torso region is always visible (b) The torso contains the jersey, which has the most distinctive team color, while legs/feet and the background add noise (c) The torso is computationally cheaper to process (d) The torso has the highest resolution in the image


Question 17. When projecting player positions from pixel coordinates to pitch coordinates, which point of the bounding box is typically used?

(a) The center of the bounding box (b) The top center (head position) (c) The bottom center (foot point), as it approximates the player's contact with the pitch (d) The top-left corner


Question 18. In the COCO keypoint format for pose estimation, how many keypoints are defined per person?

(a) 12 (b) 17 (c) 21 (d) 25


Question 19. Top-down pose estimation approaches:

(a) Detect all keypoints first, then group them into individuals (b) First detect persons (bounding boxes), then estimate pose within each box (c) Process the image from the top row of pixels to the bottom (d) Require cameras mounted above the pitch


Question 20. A pose-based "fatigue index" comparing trunk lean angle early vs. late in a match would be most reliable when:

(a) Using broadcast video with distant camera angles (b) Using close-up camera views or multi-camera setups with sufficient resolution (c) Applied only to goalkeepers (d) Calculated from a single frame


Question 21. Which type of soccer event is typically the MOST difficult for automated detection systems?

(a) Goals (b) Corner kicks (c) Fouls, because they require subjective judgment (d) Throw-ins


Question 22. A hybrid event detection system combines:

(a) Video from multiple cameras (b) Rule-based detection for well-defined events with ML-based detection for ambiguous events (c) Audio and visual features (d) GPS data with accelerometer data


Question 23. The "long tail" problem in soccer event detection refers to:

(a) Events that occur near the end of the match (b) The severe class imbalance where rare but important events (goals, red cards) have very few training examples (c) The delay between an event occurring and being detected (d) The length of video that must be processed


Question 24. Which emerging technology allows viewing a match from any angle, including perspectives that no physical camera captured?

(a) 4K broadcast cameras (b) 3D scene reconstruction using techniques like neural radiance fields (NeRFs) or Gaussian splatting (c) Drone-mounted cameras (d) Virtual reality headsets


Question 25. The most significant ethical concern with increasing use of CV in soccer is:

(a) The cost of computing hardware (b) The risk that expensive CV technology widens the competitive gap between wealthy and less wealthy clubs, combined with privacy implications of tracking individuals (c) The visual quality of the output (d) The speed of internet connections required


Answer Key

  1. (b) -- Video captures contextual, qualitative information (body orientation, off-ball movement, defensive shape) that statistical summaries cannot represent.

  2. (b) -- Mutual exclusivity means a single event cannot belong to two categories within the same classification dimension simultaneously.

  3. (c) -- Cohen's kappa measures inter-rater reliability, correcting for the agreement that would be expected by random chance.

  4. (c) -- On the commonly used Landis and Koch scale, a kappa of 0.45 falls in the "moderate" range (0.41--0.60). It sits near the bottom of that band, so the schema or tagger training likely still needs improvement.

  5. (b) -- $C$ represents the color channels. Standard RGB images have 3 channels (Red, Green, Blue).

  6. (b) -- HSV separates hue (the color itself) from saturation (color purity) and value (brightness), making color-based segmentation more robust to lighting changes.

  7. (b) -- A homography is a projective transformation that maps points between two planes, enabling pixel-to-pitch coordinate conversion.

  8. (c) -- A homography has 8 degrees of freedom (9 matrix elements minus 1 for overall scale), and each point correspondence provides 2 equations, so at least 4 correspondences are needed, with no three of the points collinear.

  9. (b) -- Transfer learning uses weights learned on large, general datasets as a starting point, then adapts them to the specific soccer domain with limited labeled data.

  10. (b) -- The ball is small (often fewer than 20 pixels in broadcast footage), moves fast (causing motion blur), and is frequently occluded by players' bodies.

  11. (b) -- Focal loss applies a modulating factor $(1-p_t)^\gamma$ that reduces the contribution of easy negatives (background regions), focusing training on hard examples.

  12. (b) -- IoU = area of intersection / area of union. It measures the overlap quality between predicted and ground truth bounding boxes.

  13. (c) -- The Hungarian algorithm solves the assignment problem, finding the minimum-cost matching between detections in the current frame and existing tracks.

  14. (b) -- The Kalman filter maintains a full state vector (position, size, velocities) along with a covariance matrix representing the uncertainty in each component.

  15. (c) -- Occlusion causes detections to be missed, tracks to be lost, and identities to be confused, especially during set pieces and dense defensive situations.

  16. (b) -- The torso region contains the jersey with the team's distinctive colors. Including legs (often white/black for all teams), boots, and background pitch adds noisy, non-discriminative color information.

  17. (c) -- The bottom center of the bounding box (foot point) approximates where the player contacts the pitch surface, which is the relevant coordinate for spatial analysis.

  18. (b) -- The COCO keypoint format defines 17 keypoints: nose, eyes (2), ears (2), shoulders (2), elbows (2), wrists (2), hips (2), knees (2), ankles (2).

  19. (b) -- Top-down approaches first run a person detector to get bounding boxes, then apply a pose estimation model within each cropped box.

  20. (b) -- Pose estimation requires sufficient pixel resolution (approximately 100+ pixels in height) for reliable keypoint detection, which close-up or multi-camera views provide.

  21. (c) -- Fouls require subjective judgment about intent, severity, and context that is difficult for automated systems to replicate. Typical F1 scores are 0.60--0.75.

  22. (b) -- Hybrid systems use rules for unambiguous events (ball crossing goal line = goal) and ML for events requiring interpretation (pass type, pressing triggers).

  23. (b) -- The "long tail" refers to the many rare event types that have very few training examples, creating severe class imbalance that standard ML approaches struggle with.

  24. (b) -- 3D scene reconstruction techniques (NeRFs, Gaussian splatting) can synthesize novel views from captured multi-view data, enabling virtual camera placement at arbitrary positions.

  25. (b) -- The dual concerns of competitive fairness (technology access gap between rich and poor clubs) and privacy (tracking and identifying individuals) represent the most significant ethical issues.
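
Worked sketch for answers 3 and 4 (Cohen's kappa). The snippet below is a minimal illustration, not a reference implementation: it computes $\kappa = (p_o - p_e) / (1 - p_e)$ for two taggers and maps the result to the Landis and Koch descriptive bands. The function names and the ten example tags are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two taggers who labeled the same sequence of events."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("taggers must label the same non-empty list of events")
    n = len(labels_a)
    # Observed agreement p_o: share of events where both taggers chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement p_e: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both taggers used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


def interpret_kappa(kappa):
    """Map kappa to the Landis and Koch descriptive bands."""
    if kappa < 0.0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, name in bands:
        if kappa <= upper:
            return name
    return "almost perfect"


# Hypothetical tags from two analysts for the same ten events.
tagger_a = ["pass", "pass", "shot", "pass", "tackle",
            "pass", "shot", "tackle", "pass", "pass"]
tagger_b = ["pass", "shot", "shot", "pass", "tackle",
            "pass", "pass", "tackle", "pass", "tackle"]

kappa = cohens_kappa(tagger_a, tagger_b)
print(round(kappa, 3), interpret_kappa(kappa))
```

Here the taggers agree on 7 of 10 events ($p_o = 0.7$) and the marginals give $p_e = 0.4$, so $\kappa$ is about 0.5 even though raw agreement is 70%. This is why kappa, rather than raw agreement, is used to judge a tagging schema.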
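
Worked sketch for answer 6 (HSV for pitch detection). The example below uses Python's standard `colorsys` module to show that a sunlit and a shaded grass pixel share the same hue while differing in value. The pixel colors, the hue range of 80 to 160 degrees, and the saturation threshold are illustrative assumptions, not calibrated values.

```python
import colorsys


def rgb_to_hsv_degrees(r, g, b):
    """Convert 8-bit RGB to (hue in degrees, saturation 0-1, value 0-1)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v


def looks_like_pitch(rgb, hue_range=(80.0, 160.0), min_saturation=0.25):
    """Classify a pixel as pitch using hue and saturation only.

    Value (brightness) is deliberately ignored, so shadows and
    floodlight changes do not move a grass pixel out of the mask.
    The thresholds are hypothetical defaults for this sketch.
    """
    hue, sat, _ = rgb_to_hsv_degrees(*rgb)
    return hue_range[0] <= hue <= hue_range[1] and sat >= min_saturation


# Hypothetical pixels: the same grass in sun and in shadow, plus a white line.
sunlit_grass = (40, 160, 40)
shaded_grass = (20, 80, 20)
white_line = (240, 240, 240)

for name, pixel in [("sunlit", sunlit_grass), ("shaded", shaded_grass),
                    ("line", white_line)]:
    hue, sat, val = rgb_to_hsv_degrees(*pixel)
    print(name, round(hue), round(sat, 2), round(val, 2), looks_like_pitch(pixel))
```

In RGB the two grass pixels differ in every channel, so a fixed RGB threshold would need to cover a wide, lighting-dependent range. In HSV only the value channel changes, which is the separation answer 6 describes.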
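
Worked sketch for answers 7, 8, and 17 (homography and foot point). The code below builds the 8 equations that four point correspondences provide, solves for the 8 unknown entries with $h_{33}$ fixed to 1, and projects a bounding box's bottom-center point onto the pitch. It is a dependency-free illustration under stated assumptions: the pixel coordinates of the pitch corners are invented, the pitch is assumed to be 105 m by 68 m, and fixing $h_{33} = 1$ assumes the pixel origin does not map to infinity. Production pipelines typically use a library routine with more than four points and outlier rejection.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate correspondences (e.g. three collinear points)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def estimate_homography(pixel_pts, pitch_pts):
    """Estimate a 3x3 homography (h33 = 1) from exactly four correspondences."""
    if len(pixel_pts) != 4 or len(pitch_pts) != 4:
        raise ValueError("this sketch expects exactly four correspondences")
    rows, rhs = [], []
    for (x, y), (u, v) in zip(pixel_pts, pitch_pts):
        # Each correspondence contributes two linear equations.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def pixel_to_pitch(H, point):
    """Apply H to a pixel point and divide by the homogeneous coordinate."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return u, v


def foot_point(box):
    """Bottom center of an (x1, y1, x2, y2) box: where the player meets the pitch."""
    x1, _, x2, y2 = box
    return (x1 + x2) / 2.0, y2


# Hypothetical pixel positions of the four pitch corners in one camera view.
pixel_corners = [(100, 900), (1820, 900), (1400, 300), (520, 300)]
pitch_corners = [(0, 0), (105, 0), (105, 68), (0, 68)]  # metres, assumed 105 x 68

H = estimate_homography(pixel_corners, pitch_corners)

# A hypothetical player box whose feet sit where the pitch diagonals cross in the image.
player_box = (930, 380, 990, 900 - 600 * 1720 / 2600)
print(pixel_to_pitch(H, foot_point(player_box)))  # close to the centre spot (52.5, 34.0)
```

Using the box center instead of the foot point would place the player further from the camera than they really are, because the homography is only valid for points lying on the pitch plane.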
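
Worked sketch for answer 11 (focal loss). The snippet below evaluates the binary focal loss $-\alpha_t (1 - p_t)^\gamma \log(p_t)$ for a single prediction and compares it with plain cross-entropy. The defaults $\gamma = 2$ and $\alpha = 0.25$ follow the values commonly quoted for RetinaNet; the example probabilities are hypothetical.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction p = P(object) and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(min(max(p_t, eps), 1.0))


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical predictions: an easy background patch and a hard, missed ball.
easy_background = (0.02, 0)   # model is already confident there is no object
hard_ball = (0.10, 1)         # model assigns low probability to a real ball

for name, (p, y) in [("easy background", easy_background), ("hard ball", hard_ball)]:
    ce = cross_entropy(p, y)
    fl = focal_loss(p, y)
    print(name, round(ce, 4), round(fl, 6), round(fl / ce, 6))
```

The easy background example keeps only a tiny fraction of its cross-entropy loss, while the hard ball example keeps a much larger share. Summed over the thousands of background locations in a frame, this is what stops easy negatives from dominating training.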
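
Worked sketch for answer 12 (IoU). The function below computes intersection over union for two axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates. The box format and the example boxes are assumptions made for this illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union = sum of both areas minus the overlap counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and ground-truth player boxes, shifted by 10 pixels.
predicted = (10, 10, 50, 90)
ground_truth = (20, 10, 60, 90)
print(iou(predicted, ground_truth))
```

For these boxes the overlap is 30 by 80 pixels (2400) and the union is 3200 + 3200 - 2400 = 4000, giving an IoU of 0.6. A 10-pixel shift on a 40-pixel-wide box already costs 40% of the score, which illustrates why IoU is a strict measure for small objects.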
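
Worked sketch for answer 13 (track-to-detection assignment). The code below finds the minimum-cost matching between predicted track positions and new detections. Note that it uses a brute-force search over permutations rather than the Hungarian algorithm itself: both return an optimal assignment, but brute force is only practical for a handful of objects. With SciPy available, `scipy.optimize.linear_sum_assignment` solves the same problem efficiently. The positions and the choice of Euclidean distance as the cost are illustrative assumptions.

```python
from itertools import permutations
import math


def cost_matrix(track_positions, detections):
    """cost[t][d] = Euclidean distance between track t's prediction and detection d."""
    return [[math.hypot(tx - dx, ty - dy) for (dx, dy) in detections]
            for (tx, ty) in track_positions]


def optimal_assignment(cost):
    """Return a tuple where result[t] is the detection index assigned to track t.

    Brute-force stand-in for the Hungarian algorithm; assumes there are
    at least as many detections as tracks.
    """
    n_tracks, n_dets = len(cost), len(cost[0])
    if n_tracks > n_dets:
        raise ValueError("this sketch assumes n_tracks <= n_detections")
    return min(permutations(range(n_dets), n_tracks),
               key=lambda perm: sum(cost[t][d] for t, d in enumerate(perm)))


# Hypothetical pitch positions in metres: two predicted tracks, two detections.
tracks = [(0.0, 0.0), (4.0, 0.0)]
detections = [(3.0, 0.0), (-4.0, 0.0)]

cost = cost_matrix(tracks, detections)
assignment = optimal_assignment(cost)
total = sum(cost[t][d] for t, d in enumerate(assignment))
print(assignment, total)
```

If track 0 simply grabbed its nearest detection (detection 0, at distance 3), track 1 would be left with detection 1 at distance 8, for a total of 11. The optimal matching pairs track 0 with detection 1 and track 1 with detection 0 for a total of 5, which is the global view that the assignment step provides.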