Quiz: Exploratory Data Analysis for Football

Target: 70% or higher to proceed.


Section 1: Multiple Choice (1 point each)

1. What is the primary purpose of Exploratory Data Analysis?

  • A) To prove a pre-determined hypothesis
  • B) To discover patterns and generate hypotheses before formal modeling
  • C) To create publication-ready visualizations
  • D) To clean data for machine learning
Answer **B)** To discover patterns and generate hypotheses before formal modeling *Explanation:* EDA is about understanding data through visualization and summary statistics before applying formal statistical tests or models.

2. Which visualization is most appropriate for showing the distribution of EPA values?

  • A) Pie chart
  • B) Bar chart
  • C) Histogram or density plot
  • D) Line chart
Answer **C)** Histogram or density plot *Explanation:* Continuous variables like EPA are best visualized with histograms or density plots that show the shape of the distribution.

3. When analyzing missing data in NFL play-by-play, why would receiver_player_name have more missing values than play_id?

  • A) Data collection errors
  • B) The field is only populated on pass plays with a targeted receiver, not on runs, sacks, or kicks
  • C) Receiver names are harder to record
  • D) The column is deprecated
Answer **B)** The field is only populated on pass plays with a targeted receiver, not on runs, sacks, or kicks *Explanation:* Many columns in PBP data are only populated for specific play types. Receiver info only applies when a pass has an intended target, so it is structurally missing on runs, sacks, kicks, and other plays without a target.

4. What does a positive skewness value indicate about a distribution?

  • A) The distribution is symmetric
  • B) The tail extends more to the left
  • C) The tail extends more to the right
  • D) The distribution has multiple peaks
Answer **C)** The tail extends more to the right *Explanation:* Positive skewness means the right tail (higher values) is longer, indicating more extreme high values than low values.
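To see the sign convention in practice, here is a minimal sketch using synthetic data (the values are generated for illustration and are not real play-by-play numbers). It relies on pandas' `Series.skew()`, which returns a sample skewness estimate.

```python
import numpy as np
import pandas as pd

# Synthetic right-tailed data (illustrative only, not real yardage or EPA)
rng = np.random.default_rng(0)
right_tailed = pd.Series(rng.exponential(scale=5.0, size=5000))

# Mirroring the data flips the long tail to the left
left_tailed = -right_tailed

skew_right = right_tailed.skew()
skew_left = left_tailed.skew()

print(f"right-tailed skew: {skew_right:.2f}")
print(f"left-tailed skew:  {skew_left:.2f}")
# In right-skewed data the mean is pulled above the median
print(f"mean {right_tailed.mean():.2f} vs median {right_tailed.median():.2f}")
```

The mirrored series has the same magnitude of skewness with the opposite sign, which is a quick way to check which direction a reported value refers to.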

5. Which type of plot is best for showing the relationship between two continuous variables?

  • A) Bar chart
  • B) Scatter plot
  • C) Pie chart
  • D) Box plot
Answer **B)** Scatter plot *Explanation:* Scatter plots show individual data points for two continuous variables, revealing patterns, trends, and outliers.

6. What is the main advantage of using hexbin plots over scatter plots for large datasets?

  • A) They are more colorful
  • B) They show density when points would otherwise overlap
  • C) They calculate correlations automatically
  • D) They use less computer memory
Answer **B)** They show density when points would otherwise overlap *Explanation:* With thousands of points, scatter plots become unreadable. Hexbin plots aggregate points into hexagonal bins with color indicating density.
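A minimal sketch of a hexbin plot follows, assuming matplotlib is available. The two variables are synthetic stand-ins for any pair of continuous play-level columns, and the `Agg` backend is chosen only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
n = 20000
# Synthetic stand-ins for two continuous variables
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

fig, ax = plt.subplots(figsize=(8, 6))
# Each hexagon is colored by how many points fall inside it
hb = ax.hexbin(x, y, gridsize=40, cmap="viridis")
fig.colorbar(hb, ax=ax, label="Number of points in bin")
ax.set_xlabel("x (synthetic)")
ax.set_ylabel("y (synthetic)")
ax.set_title("Hexbin plot: color shows point density")

counts = hb.get_array()  # one count per hexagon
plt.close(fig)
```

With 20,000 points a plain scatter plot would be a solid blob in the middle; the bin counts make the dense core and sparse edges visible.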

7. When binning a continuous variable like yards to go, what's the best practice?

  • A) Always use 10 equal-width bins
  • B) Use meaningful domain-specific boundaries (e.g., short, medium, long)
  • C) Use as many bins as possible for precision
  • D) Use only 2 bins (high/low)
Answer **B)** Use meaningful domain-specific boundaries (e.g., short, medium, long) *Explanation:* Domain knowledge should guide binning. For yards to go, categories like "short (1-3)", "medium (4-7)", "long (8+)" are more meaningful than arbitrary cuts.
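The short/medium/long split can be sketched with `pd.cut` and explicit edges. The column name `ydstogo` and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical yards-to-go values for a handful of plays
plays = pd.DataFrame({"ydstogo": [1, 3, 4, 7, 8, 10, 15]})

# Edges chosen so the labels match short (1-3), medium (4-7), long (8+).
# pd.cut uses right-closed intervals by default: (0, 3], (3, 7], (7, inf]
plays["distance_bucket"] = pd.cut(
    plays["ydstogo"],
    bins=[0, 3, 7, float("inf")],
    labels=["short (1-3)", "medium (4-7)", "long (8+)"],
)

print(plays)
```

Using an open-ended last bin avoids silently dropping unusually long distances, which a fixed upper edge would turn into missing values.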

8. What does a correlation coefficient of -0.7 indicate?

  • A) Strong positive relationship
  • B) Strong negative relationship
  • C) Weak relationship
  • D) No relationship
Answer **B)** Strong negative relationship *Explanation:* Correlation ranges from -1 to 1. Values near -1 indicate strong negative (inverse) relationships: as one variable increases, the other decreases.
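As an illustration, the sketch below builds two synthetic variables whose theoretical correlation is -0.7 and checks the estimate with pandas' `Series.corr()` (Pearson by default). The data is generated, not taken from any football source.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
# Mix x with independent noise so that corr(x, y) is about -0.7
y = -0.7 * x + np.sqrt(1 - 0.7**2) * rng.normal(size=n)

df = pd.DataFrame({"x": x, "y": y})
r = df["x"].corr(df["y"])  # Pearson correlation coefficient
print(f"r = {r:.3f}")
```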

9. Which is NOT a typical stage in the EDA process?

  • A) Data inspection
  • B) Univariate analysis
  • C) Hypothesis testing
  • D) Bivariate analysis
Answer **C)** Hypothesis testing *Explanation:* EDA focuses on exploration and hypothesis generation, not formal hypothesis testing, which comes later in confirmatory analysis.

10. When creating a team comparison visualization, what should always be included?

  • A) Animation effects
  • B) 3D graphics
  • C) Clear labels and a title stating the insight
  • D) As many colors as possible
Answer **C)** Clear labels and a title stating the insight *Explanation:* Good visualizations communicate clearly. Labels, titles, and meaningful annotations are essential; decorative elements are not.

Section 2: True/False (1 point each)

11. A box plot's whiskers always extend to the minimum and maximum values in the data.

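Answer **False** *Explanation:* By the common convention (Tukey's rule, the default in most plotting libraries), each whisker extends to the most extreme data point within 1.5× IQR of the nearest quartile. Values beyond that are drawn as individual outlier points, so the whiskers reach the minimum and maximum only when there are no outliers.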
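The 1.5× IQR rule can be worked through by hand. This is a minimal sketch with made-up values, using NumPy's default (linear interpolation) percentiles; other quartile definitions shift the fences slightly.

```python
import numpy as np

# Made-up values with one obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme points that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"whiskers: [{whisker_low}, {whisker_high}], outliers: {outliers}")
```

Here the maximum is 30, but the upper whisker stops at 9 and the value 30 is plotted separately as an outlier.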

12. EDA should always be completed before any data cleaning.

Answer **False** *Explanation:* EDA and data cleaning are iterative. Initial EDA reveals data quality issues, which inform cleaning, which then enables further EDA.

13. Correlation does not imply causation.

Answer **True** *Explanation:* Two variables may be correlated due to a third variable, reverse causation, or coincidence. Correlation alone cannot prove causal relationships.

14. When comparing distributions, overlapping histograms are always better than side-by-side box plots.

Answer **False** *Explanation:* Both have uses. Overlapping histograms show distribution shapes but can be hard to read. Box plots compactly compare medians and spread.

15. Principal Component Analysis (PCA) is a technique for reducing the number of variables while preserving information.

Answer **True** *Explanation:* PCA creates new variables (principal components) that capture the most variance from the original variables, enabling dimensionality reduction.
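A minimal sketch of PCA via the singular value decomposition, using NumPy only. The three "team stat" columns are synthetic and deliberately driven by one hidden factor, which is an assumption made so the reduction is easy to see.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
# Synthetic data: three columns that mostly reflect one hidden "quality" factor
quality = rng.normal(size=n)
X = np.column_stack([
    quality + 0.1 * rng.normal(size=n),
    0.8 * quality + 0.1 * rng.normal(size=n),
    -0.5 * quality + 0.1 * rng.normal(size=n),
])

# Standardize each column, then decompose
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Squared singular values are proportional to variance along each component
explained_ratio = S**2 / np.sum(S**2)
scores = Z @ Vt.T[:, :1]  # projection onto the first principal component

print("explained variance ratio:", explained_ratio.round(3))
```

Because the columns are highly redundant, the first component carries nearly all of the variance, so one derived variable can stand in for three.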

Section 3: Code Analysis (2 points each)

16. What does this code produce?

pbp['score_bucket'] = pd.cut(
    pbp['score_differential'],
    bins=[-50, -14, 0, 14, 50],
    labels=['Trailing big', 'Trailing', 'Winning', 'Winning big']
)
Answer Creates a new categorical column, `score_bucket`, that bins the score differential into four categories:
- Trailing big: score_differential in (-50, -14]
- Trailing: score_differential in (-14, 0]
- Winning: score_differential in (0, 14]
- Winning big: score_differential in (14, 50]

Note: The bins are exclusive on the left and inclusive on the right by default, so a tied game (0) is labeled "Trailing", and a value of exactly -50 or anything outside the range becomes NaN.
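To check the edge behavior on concrete values, here is a small sketch that applies the same bins to made-up score differentials.

```python
import pandas as pd

# Made-up score differentials, chosen to land on and around the bin edges
diffs = pd.Series([-21, -14, -3, 0, 7, 14, 28, 55])

buckets = pd.cut(
    diffs,
    bins=[-50, -14, 0, 14, 50],
    labels=["Trailing big", "Trailing", "Winning", "Winning big"],
)

print(pd.DataFrame({"score_differential": diffs, "score_bucket": buckets}))
```

Edge values go to the lower bin (-14 is "Trailing big", 0 is "Trailing", 14 is "Winning"), and 55 falls outside the bins and comes back as NaN. If ties deserve their own label, an extra bin edge around zero would be one way to get it.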

17. What's wrong with this visualization code?

plt.figure(figsize=(10, 6))
plt.scatter(team_stats['epa'], team_stats['success_rate'])
plt.show()
Answer **Missing essential elements:**
- No axis labels (`plt.xlabel()`, `plt.ylabel()`)
- No title (`plt.title()`)
- No indication of what each point represents

A better version (this assumes `team_stats` is indexed by team):
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.scatter(team_stats['epa'], team_stats['success_rate'])
plt.xlabel('EPA per Play')
plt.ylabel('Success Rate')
plt.title('Team Offensive Efficiency: EPA vs Success Rate')
# Label each point with its team (taken from the DataFrame index)
for team in team_stats.index:
    plt.annotate(team, (team_stats.loc[team, 'epa'],
                        team_stats.loc[team, 'success_rate']))
plt.show()

18. What will this code return?

pbp.groupby('down')['epa'].agg(['mean', 'std', 'count'])
Answer A DataFrame with:
- **Index**: down values (1, 2, 3, 4); plays with a missing down are dropped from the grouping
- **Columns**: `mean`, `std`, `count`
- **Content**: For each down, the mean EPA, the standard deviation of EPA, and the number of plays with a non-missing EPA

This provides a summary of the EPA distribution by down.
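A tiny made-up frame shows the shape of the result and how missing values are handled. The name `pbp_demo` and its values are hypothetical.

```python
import pandas as pd

# Tiny made-up sample, including a play with no down and a play with no EPA
pbp_demo = pd.DataFrame({
    "down": [1, 1, 2, 2, 3, 3, 4, None],
    "epa": [0.2, -0.1, 0.5, None, -0.8, 1.2, -1.5, 0.0],
})

summary = pbp_demo.groupby("down")["epa"].agg(["mean", "std", "count"])
print(summary)
```

The row with a missing down does not appear in the output, `count` ignores the missing EPA on second down, and any group with a single usable value has an undefined (NaN) standard deviation.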

19. Identify the issue with this missing data check:

missing_pct = df.isnull().sum() / len(df)
print(missing_pct)
Answer **Not necessarily wrong, but incomplete:**
1. The output will be hard to read without sorting:
```python
missing_pct = (df.isnull().sum() / len(df)).sort_values(ascending=False)
```
2. Should filter to show only columns with missing data:
```python
missing_pct = missing_pct[missing_pct > 0]
```
3. Format as percentage for readability:
```python
print((missing_pct * 100).round(2))
```
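The three refinements can be combined into one helper. This is a minimal sketch; `missing_summary` and the demo frame are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.Series:
    """Percent missing per column, only columns with gaps, largest first."""
    pct = df.isnull().mean() * 100  # mean of booleans is the fraction missing
    pct = pct[pct > 0].sort_values(ascending=False)
    return pct.round(2)


# Made-up frame: play_id is complete, the other columns have gaps
demo = pd.DataFrame({
    "play_id": [1, 2, 3, 4],
    "receiver_player_name": ["A.Smith", None, None, None],
    "epa": [0.3, -0.2, np.nan, 1.1],
})

result = missing_summary(demo)
print(result)
```

`df.isnull().mean()` is equivalent to `df.isnull().sum() / len(df)` and saves a step.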

Section 4: Short Answer (2 points each)

20. Explain why pass plays typically have higher variance in EPA than run plays.

Sample Answer Pass plays have higher variance because:
1. **Larger outcome range**: Passes can result in long completions (high positive EPA) or interceptions (very negative EPA), while runs have a narrower outcome distribution
2. **Binary nature of completion**: Incomplete passes are essentially zero-yard plays, while runs usually gain at least a few yards
3. **Big play potential**: Deep passes can gain 40+ yards in a single play, which is rare for runs
4. **Turnover risk**: Interceptions create large negative EPA swings that runs don't face (fumbles are less common)

This higher variance is why "pass to win, run to not lose" has become a common saying in analytics.

21. What are three things you should check when first inspecting a new football dataset?

Sample Answer When inspecting a new dataset, check:
1. **Data shape and coverage**: How many rows/columns? What time period does it cover? Are all teams/games represented?
2. **Missing data patterns**: Which columns have missing values? Are they missing at random or systematically (e.g., only for certain play types)?
3. **Data types and ranges**: Are numeric columns actually numeric? Are categorical columns properly categorized? Are values within expected ranges (e.g., downs should be 1-4)?

Additional checks: duplicate rows, column naming conventions, key column definitions.
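The three checks can be sketched on a small made-up frame. The column names follow common play-by-play conventions but are assumptions here, as is the deliberately invalid `down` value.

```python
import pandas as pd

# Made-up sample with a systematic gap and one out-of-range down
demo = pd.DataFrame({
    "play_type": ["pass", "pass", "run", "run", "pass"],
    "down": [1, 2, 3, 5, 4],
    "receiver_player_name": ["A.Smith", "B.Jones", None, None, None],
})

# 1. Shape and coverage
n_rows, n_cols = demo.shape

# 2. Missing data pattern: share of missing receiver names by play type
missing_by_type = (
    demo["receiver_player_name"].isnull().groupby(demo["play_type"]).mean()
)

# 3. Types and ranges: downs should be 1-4
bad_downs = demo[~demo["down"].isin([1, 2, 3, 4])]

print(n_rows, n_cols)
print(missing_by_type)
print(bad_downs)
```

A missing rate of 100% on runs against a much lower rate on passes points to structural missingness rather than a data collection problem.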

22. Why is it important to visualize data before calculating summary statistics?

Sample Answer Visualizing before summarizing is important because:
1. **Anscombe's Quartet**: Very different data patterns can produce nearly identical summary statistics (same mean, variance, correlation). Only visualization reveals the true pattern.
2. **Outlier detection**: Summary statistics are sensitive to outliers. A single extreme value can dramatically shift the mean. Visualization reveals whether outliers exist.
3. **Distribution shape**: Mean and standard deviation describe roughly symmetric, single-peaked data well but can mislead otherwise. A histogram reveals whether the data is approximately normal, skewed, bimodal, or has other characteristics that affect interpretation.
4. **Relationship patterns**: Correlation measures linear relationships. A scatter plot might reveal non-linear patterns that a correlation coefficient would miss.
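The first point can be checked numerically. The sketch below uses the first two datasets of Anscombe's quartet, typed in from the commonly published values: one is a noisy straight line, the other a smooth curve.

```python
import pandas as pd

# Datasets I and II of Anscombe's quartet (they share the same x values)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

df = pd.DataFrame({"x": x, "y1": y1, "y2": y2})

mean_y1, mean_y2 = df["y1"].mean(), df["y2"].mean()
r1 = df["x"].corr(df["y1"])
r2 = df["x"].corr(df["y2"])

print(f"mean y1 = {mean_y1:.2f}, mean y2 = {mean_y2:.2f}")
print(f"corr(x, y1) = {r1:.3f}, corr(x, y2) = {r2:.3f}")
```

The means and correlations agree to about two decimal places, yet a scatter plot of `y2` against `x` shows a clear curve that the summary numbers give no hint of.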

Section 5: Application (3 points each)

23. You're asked to compare passing efficiency between two quarterbacks. Describe your EDA approach in 3-4 steps.

Sample Answer

**Step 1: Filter and prepare data**
- Filter to pass plays for each QB
- Verify sample sizes are adequate (100+ dropbacks each)
- Check for missing EPA values

**Step 2: Univariate analysis for each QB**
- Calculate summary statistics (mean, median, std of EPA)
- Create histograms of EPA distribution for each
- Note any differences in distribution shape

**Step 3: Direct comparison**
- Create overlapping density plots or side-by-side box plots
- Calculate and compare key metrics: EPA/dropback, CPOE, success rate
- Test for statistical significance if sample sizes allow

**Step 4: Contextual analysis**
- Compare in different situations (down, field position, score)
- Look at supporting cast (receiver quality, O-line pressure)
- Visualize with a multi-panel comparison dashboard

24. A team's offensive coordinator asks: "Should we pass more in the red zone?" How would you use EDA to inform this question?

Sample Answer **Analysis approach:**
1. **Current state**: Calculate the team's current red zone pass/run ratio and compare to league average
2. **Efficiency comparison**:
   - EPA per play for passes vs runs in the red zone
   - Success rate for each play type
   - Touchdown rate by play type
3. **Down-specific analysis**:
   - Break down by down (1st vs 2nd vs 3rd/4th)
   - Where does each play type excel?
4. **Personnel considerations**:
   - QB red zone EPA vs league average
   - RB goal-line efficiency vs league average
   - Receiver target distribution in red zone
5. **Visualization**: Create a 4-panel dashboard showing:
   - Current play mix vs league
   - EPA by play type
   - Success rate by down and play type
   - Trend over the season

**Deliverable**: Summary report with specific recommendations based on where the data shows opportunity (e.g., "Pass rate on 1st-and-goal is 45% vs league average 52%, and your pass EPA in that situation is 0.3 vs run EPA of -0.1")
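The first two steps of this approach could be sketched as follows. The frame is made up, the column names (`yardline_100`, `play_type`, `epa`, `touchdown`) are assumed to follow common play-by-play conventions, and the red zone is taken to mean 20 yards or fewer from the end zone.

```python
import pandas as pd

# Made-up plays; yardline_100 is assumed to be yards from the opponent's end zone
pbp_demo = pd.DataFrame({
    "yardline_100": [5, 12, 18, 45, 8, 3, 19, 60],
    "play_type": ["pass", "run", "pass", "pass", "run", "run", "pass", "run"],
    "epa": [1.8, -0.3, -0.6, 0.4, 0.2, 2.1, 0.5, 0.1],
    "touchdown": [1, 0, 0, 0, 0, 1, 0, 0],
})

# Keep only red zone plays
red_zone = pbp_demo[pbp_demo["yardline_100"] <= 20]

# Current state: share of red zone plays that are passes
pass_rate = (red_zone["play_type"] == "pass").mean()

# Efficiency comparison by play type
rz_summary = red_zone.groupby("play_type").agg(
    plays=("epa", "size"),
    epa_per_play=("epa", "mean"),
    td_rate=("touchdown", "mean"),
)

print(f"red zone pass rate: {pass_rate:.0%}")
print(rz_summary)
```

With real data the `plays` column matters as much as the averages: red zone samples for a single team are small, so differences in EPA per play should be read with that in mind.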

Section 6: Matching (1 point each)

Match the visualization type with its best use case:

| Visualization | Use Case |
|---|---|
| 25a. Box plot | A. Showing distribution of a single continuous variable |
| 25b. Histogram | B. Comparing distributions across groups |
| 25c. Scatter plot | C. Showing categorical variable frequencies |
| 25d. Bar chart | D. Showing relationship between two continuous variables |

Answers
- **25a. B** - Box plot: Comparing distributions across groups (compact summary of median, quartiles, outliers)
- **25b. A** - Histogram: Showing distribution of a single continuous variable (reveals shape, center, spread)
- **25c. D** - Scatter plot: Showing relationship between two continuous variables (reveals correlation, patterns)
- **25d. C** - Bar chart: Showing categorical variable frequencies (easy to compare counts/percentages across categories)

Scoring

| Section | Points | Your Score |
|---|---|---|
| Multiple Choice (1-10) | 10 | ___ |
| True/False (11-15) | 5 | ___ |
| Code Analysis (16-19) | 8 | ___ |
| Short Answer (20-22) | 6 | ___ |
| Application (23-24) | 6 | ___ |
| Matching (25) | 4 | ___ |
| Total | 39 | ___ |

Passing Score: 28/39 (72%). A score of 27/39 is 69.2%, just short of the 70% target.