Chapter 4 Quiz: Exploratory Data Analysis for Basketball

Instructions

This quiz tests your understanding of exploratory data analysis concepts and techniques covered in Chapter 4. Answer all questions to the best of your ability. Each question is worth 1 point unless otherwise noted.

Time Limit: 45 minutes
Total Points: 32 (Questions 14 and 30 are worth 2 points each), plus up to 4 bonus points


Section A: Data Loading and Inspection (Questions 1-6)

Question 1

What pandas method provides a summary including column names, non-null counts, and data types?

A) df.describe()
B) df.info()
C) df.head()
D) df.summary()


Question 2

Which pandas method is most appropriate for viewing the distribution of categorical values like player positions?

A) df['position'].describe()
B) df['position'].value_counts()
C) df['position'].unique()
D) df['position'].hist()


Question 3

In a basketball dataset, if a column that should contain numeric values appears as 'object' dtype, what is the most likely cause?

A) The data is too large
B) The column contains non-numeric values like 'N/A' or text
C) pandas version is outdated
D) The file was compressed


Question 4

What is True Shooting Percentage (TS%) and what does it measure?

A) Field goals made divided by field goals attempted
B) A measure of shooting efficiency accounting for 2-pointers, 3-pointers, and free throws
C) Three-point percentage adjusted for volume
D) Points per minute played


Question 5

The formula for True Shooting Percentage is:

A) PTS / FGA
B) PTS / (2 * (FGA + FTA))
C) PTS / (2 * (FGA + 0.44 * FTA))
D) (FGM + 1.5 * 3PM) / FGA


Question 6

When creating derived features like points per minute during data inspection, which approach is most appropriate?

A) Modify the original data file
B) Create new columns in the working DataFrame
C) Store calculated values in separate variables
D) Calculate on-the-fly in every analysis


Section B: Data Cleaning (Questions 7-12)

Question 7

In basketball data, which of the following is NOT a valid reason for a player having a missing three-point percentage?

A) The player hasn't attempted any three-pointers
B) Data collection error
C) The player only plays in the post
D) The player's shooting percentage is too low to display


Question 8

Which statistical validation check makes sense for basketball data?

A) Points per game cannot exceed 100
B) Field goal percentage must be between 0 and 1
C) Assists must always equal turnovers
D) Minutes played must be exactly 48 for starters


Question 9

When standardizing team abbreviations in a dataset, what is the best practice?

A) Use the most common abbreviation found in the data
B) Create a mapping dictionary and apply it consistently
C) Delete rows with non-standard abbreviations
D) Convert all abbreviations to team full names


Question 10

A player appears multiple times in a season statistics dataset. What is the most likely legitimate explanation?

A) Data duplication error
B) The player was traded mid-season
C) The player played in multiple positions
D) Database synchronization issue


Question 11

What does "Missing Completely at Random" (MCAR) mean in the context of basketball data?

A) Missing values follow a random pattern related to the statistic
B) Missing values have no relationship to any variables in the dataset
C) Missing values occur randomly across different seasons
D) Missing values affect a random sample of players


Question 12

For a player with zero three-point attempts, the appropriate handling of their 3PT% is:

A) Set to 0%
B) Set to league average
C) Leave as NaN/missing
D) Impute with position average


Section C: Visualizing Distributions (Questions 13-18)

Question 13

What does a right-skewed distribution of NBA scoring averages indicate?

A) Most players score above average
B) Most players score below average, with a few high scorers pulling the mean up
C) Scoring is evenly distributed
D) The data has collection errors


Question 14 (2 points)

Match each visualization type with its best use case:

Visualization        Use Case
1. Histogram         A. Comparing distributions across groups
2. Box plot          B. Showing distribution shape with overlapping groups
3. Violin plot       C. Showing the frequency distribution of a single variable
4. KDE plot          D. Identifying outliers and quartiles

Question 15

In a box plot, what do the "whiskers" typically represent?

A) The standard deviation
B) 1.5 times the interquartile range from Q1 and Q3
C) The minimum and maximum values
D) The 10th and 90th percentiles


Question 16

What does it mean when the mean is significantly higher than the median in a points per game distribution?

A) The distribution is left-skewed
B) The distribution is right-skewed
C) The distribution is bimodal
D) There are data errors


Question 17

Which plot is most effective for comparing scoring distributions across all 5 positions simultaneously?

A) Multiple histograms
B) A single scatter plot
C) Grouped box plots or violin plots
D) A pie chart


Question 18

The Kernel Density Estimation (KDE) plot is preferred over a histogram when:

A) You want to see exact bin counts
B) You need to overlay multiple distributions for comparison
C) You have categorical data
D) You want to identify outliers


Section D: Visualizing Relationships (Questions 19-23)

Question 19

A correlation coefficient of r = 0.85 between minutes played and points scored indicates:

A) No relationship
B) A strong positive relationship
C) A strong negative relationship
D) A perfect relationship


Question 20

When creating a correlation heatmap, why is it useful to mask the upper triangle?

A) To hide negative correlations
B) The upper triangle contains the same information as the lower triangle
C) To improve rendering performance
D) To emphasize the diagonal


Question 21

What is the coefficient of determination (R-squared) and what does it measure?

A) The correlation coefficient squared, measuring the proportion of variance explained
B) The standard deviation of residuals
C) The statistical significance of a relationship
D) The slope of the regression line


Question 22

In a pair plot (scatter plot matrix), what information does the diagonal typically show?

A) The correlation coefficients
B) Distribution plots (histograms or KDE) for each variable
C) The identity line (y=x)
D) Summary statistics


Question 23

A bubble chart showing Points vs True Shooting % with bubble size representing minutes played allows you to see:

A) One dimension of data
B) Two dimensions of data
C) Three dimensions of data
D) Four dimensions of data


Section E: Time Series Analysis (Questions 24-26)

Question 24

What is the primary benefit of using rolling averages when analyzing player performance over a season?

A) It increases the total statistics
B) It smooths out game-to-game variance to reveal underlying trends
C) It corrects for missing games
D) It makes all players comparable


Question 25

When comparing cumulative points for multiple players across a season, what does a steeper slope indicate?

A) Lower scoring rate
B) Higher scoring rate
C) More consistent scoring
D) More games played


Question 26

The coefficient of variation (CV) is calculated as standard deviation divided by mean. A higher CV for a player's scoring indicates:

A) More consistent performance
B) Higher average scoring
C) More variable/inconsistent performance
D) Better efficiency


Section F: Shot Chart Analysis (Questions 27-30)

Question 27

In NBA shot chart coordinates, what do LOC_X and LOC_Y represent?

A) GPS coordinates of the arena
B) Position in tenths of feet from the basket
C) Pixel positions on a standard court image
D) Distance and angle from the basket


Question 28

Which visualization technique is best for showing shooting frequency across different court zones?

A) Basic scatter plot of makes/misses
B) Hexbin plot with shot counts
C) Line chart
D) Pie chart


Question 29

What is the approximate valid range for LOC_X coordinates in NBA shot data (in tenths of feet)?

A) -50 to 50
B) -100 to 100
C) -250 to 250
D) -500 to 500


Question 30 (2 points)

When creating a shot chart showing shooting efficiency by zone, why is it important to set a minimum shot count threshold (e.g., mincnt=3)?

Explain in 1-2 sentences.


Bonus Questions (2 points each)

Bonus Question 1

You're analyzing a player's shooting data and notice their three-point percentage is much higher in the first half of the season than the second half. Describe at least three hypotheses you would investigate and what visualizations you would create to explore them.


Bonus Question 2

Explain the difference between conducting EDA on a single player's season data versus league-wide data. What different questions would you ask, and what visualizations would be more appropriate for each?


Answer Key

Section A: Data Loading and Inspection

  1. B) df.info() - info() shows column names, data types, non-null counts, and memory usage. describe() shows statistical summaries for numeric columns.

  2. B) df['position'].value_counts() - value_counts() returns a Series with counts of unique values, ideal for categorical data.

  3. B) The column contains non-numeric values like 'N/A' or text - pandas assigns 'object' dtype when it can't interpret all values as a single numeric type.

  4. B) A measure of shooting efficiency accounting for 2-pointers, 3-pointers, and free throws - TS% is a comprehensive efficiency metric that weighs all scoring attempts.

  5. C) PTS / (2 * (FGA + 0.44 * FTA)) - The 0.44 factor estimates the proportion of free throw attempts that end possessions.

  6. B) Create new columns in the working DataFrame - This maintains the original data, allows reproducibility, and keeps derived features accessible for analysis.
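
The Section A answers can be tried out with a short sketch like the one below. The sample data, player names, and column names (PTS, FGA, FTA, MIN) are illustrative assumptions, not the chapter's dataset. Note that read_csv already treats 'N/A' as missing by default, so the sketch uses "--" as the non-numeric placeholder that forces a text dtype.

```python
import io

import pandas as pd

# Hypothetical season totals; names and values are illustrative assumptions.
# "--" stands in for a non-numeric placeholder in a column that should be numeric.
raw = io.StringIO(
    "player,position,PTS,FGA,FTA,MIN\n"
    "Player A,PG,1800,1300,400,2400\n"
    "Player B,C,900,600,300,1800\n"
    "Player C,PG,--,500,100,1200\n"
)
df = pd.read_csv(raw)

# Q1: column names, non-null counts, and dtypes in one summary.
df.info()

# Q2: counts of each category.
position_counts = df["position"].value_counts()

# Q3: the "--" placeholder keeps PTS from being parsed as a numeric column.
pts_numeric_before = pd.api.types.is_numeric_dtype(df["PTS"])
df["PTS"] = pd.to_numeric(df["PTS"], errors="coerce")  # "--" becomes NaN

# Q5 and Q6: derived features stored as new columns in the working DataFrame.
df["TS_PCT"] = df["PTS"] / (2 * (df["FGA"] + 0.44 * df["FTA"]))
df["PTS_PER_MIN"] = df["PTS"] / df["MIN"]

print(position_counts)
print(df[["player", "TS_PCT", "PTS_PER_MIN"]])
```

Keeping the derived columns in the DataFrame leaves the source file untouched, and the missing PTS value propagates to TS_PCT as NaN rather than silently becoming zero.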

Section B: Data Cleaning

  7. D) The player's shooting percentage is too low to display - A low percentage is still a valid percentage. The other options are legitimate reasons for a missing 3PT%.

  8. B) Field goal percentage must be between 0 and 1 - This is a logical constraint when FG% is stored as a proportion, since makes cannot exceed attempts. The other options are not valid rules: a 100-point cap is an arbitrary threshold rather than a logical bound, assists and turnovers are independent counts, and starters rarely play all 48 minutes.

  9. B) Create a mapping dictionary and apply it consistently - This ensures reproducibility and handles all cases systematically.

  10. B) The player was traded mid-season

    • Traded players often have separate rows for each team they played on during the season.

  11. B) Missing values have no relationship to any variables in the dataset

    • MCAR means the missingness is purely random, not related to observed or unobserved values.

  12. C) Leave as NaN/missing

    • A percentage based on zero attempts is mathematically undefined (0/0), not zero.
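
The cleaning steps behind these answers might look like the following minimal sketch. The DataFrame, the column names (FG3M, FG3A, and so on), and the TEAM_MAP dictionary are hypothetical; the abbreviation pairs are only examples of the kind of variants that appear across data sources.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the last row is deliberately invalid (FGM > FGA).
df = pd.DataFrame({
    "player": ["Player A", "Player A", "Player B", "Player C"],
    "team": ["BKN", "PHX", "BRK", "PHO"],
    "FGM": [200, 150, 300, 90],
    "FGA": [450, 320, 610, 80],
    "FG3M": [50, 40, 0, 10],
    "FG3A": [140, 100, 0, 30],
})

# Q9: one mapping dictionary, applied to the whole column.
TEAM_MAP = {"BRK": "BKN", "PHO": "PHX"}  # assumed variants, for illustration
df["team"] = df["team"].replace(TEAM_MAP)

# Q8: flag rows whose field goal percentage falls outside [0, 1].
df["FG_PCT"] = df["FGM"] / df["FGA"]
invalid_fg = df[(df["FG_PCT"] < 0) | (df["FG_PCT"] > 1)]

# Q12: zero attempts gives an undefined percentage, so keep it as NaN.
df["FG3_PCT"] = df["FG3M"] / df["FG3A"].replace(0, np.nan)

# Q10: players with rows for more than one team (for example, after a trade).
teams_per_player = df.groupby("player")["team"].nunique()
multi_team_players = teams_per_player[teams_per_player > 1].index.tolist()

print(invalid_fg[["player", "FG_PCT"]])
print(df[["player", "team", "FG3_PCT"]])
print(multi_team_players)
```

Flagging invalid rows instead of dropping them keeps the decision about what to do with them explicit and reviewable.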

Section C: Visualizing Distributions

  13. B) Most players score below average, with a few high scorers pulling the mean up

    • Right-skewed distributions have a long tail on the right side.

  14. Answers: 1-C, 2-D, 3-A or B, 4-B (2 points)

    • Histogram: Single variable frequency
    • Box plot: Outliers and quartiles, group comparison
    • Violin plot: Distribution shape comparison across groups
    • KDE: Smooth overlapping distributions

  15. B) 1.5 times the interquartile range from Q1 and Q3

    • This is the common convention: each whisker extends to the most extreme data point that lies within 1.5 times the IQR of the nearest quartile. Points beyond the whiskers are plotted as outliers.

  16. B) The distribution is right-skewed

    • When mean > median, the distribution has a longer right tail.

  17. C) Grouped box plots or violin plots

    • These allow direct comparison of distributions across multiple categories in a single plot.

  18. B) You need to overlay multiple distributions for comparison

    • KDE provides smooth curves that overlay well without the visual clutter of overlapping histograms.
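
The skew and whisker answers can be checked numerically. The sketch below uses a made-up set of scoring averages (an assumption, not real player data) and computes the quantities a box plot would draw under the 1.5 x IQR convention.

```python
import numpy as np

# Hypothetical points-per-game values with a few high scorers.
ppg = np.array([2.1, 3.4, 4.0, 5.2, 6.1, 7.5, 8.3, 9.8, 12.4, 15.0, 27.9, 31.2])

# Q13 and Q16: a mean above the median signals a right-skewed distribution.
mean_ppg = ppg.mean()
median_ppg = np.median(ppg)

# Q15: fences sit 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(ppg, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
inside = ppg[(ppg >= lower_fence) & (ppg <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences would be drawn as an individual outlier point.
outliers = ppg[(ppg < lower_fence) | (ppg > upper_fence)]

print(f"mean={mean_ppg:.2f}, median={median_ppg:.2f}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

In this sample the two highest scorers fall beyond the upper fence, so the upper whisker stops at the largest value inside it rather than at the maximum.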

Section D: Visualizing Relationships

  19. B) A strong positive relationship

    • r = 0.85 indicates strong positive correlation (typically r > 0.7 is considered strong).

  20. B) The upper triangle contains the same information as the lower triangle

    • Correlation matrices are symmetric, so the upper triangle is redundant.

  21. A) The correlation coefficient squared, measuring the proportion of variance explained

    • In simple linear regression, R-squared equals the squared correlation coefficient and indicates what proportion of the variance in Y is explained by X.

  22. B) Distribution plots (histograms or KDE) for each variable

    • The diagonal shows univariate distributions, while off-diagonal elements show bivariate relationships.

  23. C) Three dimensions of data

    • X-axis (TS%), Y-axis (Points), and bubble size (Minutes) represent three dimensions.
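
As a rough illustration of the correlation answers, the sketch below computes a correlation matrix, confirms that r squared matches the R-squared of a one-variable linear fit, and builds the boolean mask typically passed to a heatmap to hide the upper triangle. The data and column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical per-game averages for six players.
df = pd.DataFrame({
    "MIN": [12, 18, 24, 28, 32, 36],
    "PTS": [4.0, 7.5, 9.0, 14.0, 15.5, 22.0],
    "AST": [1.0, 2.5, 1.5, 4.0, 3.0, 6.5],
})

# Q19: Pearson correlation between minutes and points.
corr = df.corr()
r = corr.loc["MIN", "PTS"]
r_squared = r ** 2

# Q21: R-squared from a straight-line fit of PTS on MIN.
slope, intercept = np.polyfit(df["MIN"], df["PTS"], 1)
predicted = slope * df["MIN"] + intercept
ss_res = ((df["PTS"] - predicted) ** 2).sum()
ss_tot = ((df["PTS"] - df["PTS"].mean()) ** 2).sum()
r2_from_fit = 1 - ss_res / ss_tot

# Q20: the matrix is symmetric, so the upper triangle can be masked.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool))
lower_only = corr.mask(upper_mask)

print(f"r={r:.3f}, r^2={r_squared:.3f}, R^2 from fit={r2_from_fit:.3f}")
print(lower_only)
```

A mask built this way also hides the diagonal, which is always 1 and carries no information.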

Section E: Time Series Analysis

  24. B) It smooths out game-to-game variance to reveal underlying trends

    • Rolling averages reduce noise to show sustained performance changes.

  25. B) Higher scoring rate

    • A steeper slope in cumulative points means more points accumulated per game.

  26. C) More variable/inconsistent performance

    • Higher CV means higher relative variability compared to the mean.
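
The three ideas above (rolling averages, cumulative totals, and coefficient of variation) can be sketched on a small game log. The two players and their point totals are hypothetical, chosen so that both have the same average but very different consistency.

```python
import pandas as pd

# Hypothetical eight-game logs for two players with the same scoring average.
games = pd.DataFrame({
    "game": range(1, 9),
    "steady": [20, 22, 19, 21, 20, 23, 18, 21],
    "streaky": [5, 38, 12, 30, 8, 35, 10, 26],
})
players = ["steady", "streaky"]

# Q24: a 3-game rolling mean smooths game-to-game swings.
rolling_streaky = games["streaky"].rolling(window=3).mean()

# Q25: cumulative points; the average slope is total points / games played.
cumulative = games[players].cumsum()
avg_slope = cumulative.iloc[-1] / len(games)

# Q26: coefficient of variation = standard deviation / mean.
cv = games[players].std() / games[players].mean()

print(rolling_streaky.round(2).tolist())
print(avg_slope)
print(cv.round(3))
```

The first two rolling values are NaN because a 3-game window is not yet full. Both players average 20.5 points, so their cumulative lines end at the same height, but the streaky player's CV is several times larger.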

Section F: Shot Chart Analysis

  27. B) Position in tenths of feet from the basket

    • LOC_X and LOC_Y are measured in tenths of feet (decifeet) from the basket location.

  28. B) Hexbin plot with shot counts

    • Hexbin aggregates shots into bins, showing frequency through color intensity.

  29. C) -250 to 250

    • This represents 25 feet to the left and right of the basket, covering the full 50-foot width of the court.

  30. (2 points) Small sample sizes produce unreliable percentages. A zone with 1 shot made of 1 attempt shows 100% but is not meaningful. Minimum thresholds ensure displayed percentages are based on sufficient data to be reliable.
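
The reasoning in answer 30 can be shown without any plotting. The sketch below is a plain-Python analogue of a minimum-count filter, not the hexbin implementation itself: the shots, the square 5-foot bins, and the names BIN_SIZE and MIN_COUNT are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical shots: (LOC_X, LOC_Y, made), in tenths of feet from the basket.
shots = [
    (5, 10, 1), (8, 15, 1), (12, 5, 0), (3, 20, 1),   # near the rim
    (-220, 30, 0), (-225, 40, 1), (-230, 20, 0),      # left corner
    (150, 200, 1),                                    # a single make on the wing
]

BIN_SIZE = 50   # assumed square bins, 5 feet on a side
MIN_COUNT = 3   # zones with fewer attempts are hidden

# Q29: sanity-check the horizontal coordinate range.
assert all(-250 <= x <= 250 for x, _, _ in shots)

# Aggregate attempts and makes per zone.
attempts = defaultdict(int)
makes = defaultdict(int)
for x, y, made in shots:
    zone = (x // BIN_SIZE, y // BIN_SIZE)
    attempts[zone] += 1
    makes[zone] += made

# Raw percentages include the misleading 1-for-1 zone.
raw_pct = {zone: makes[zone] / attempts[zone] for zone in attempts}

# Q30: keep only zones with enough attempts to be meaningful.
reliable_pct = {
    zone: pct for zone, pct in raw_pct.items() if attempts[zone] >= MIN_COUNT
}

print(raw_pct)
print(reliable_pct)
```

Without the threshold, the lone wing shot would appear as the most efficient zone on the chart at 100%, even though it reflects a single attempt.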

Bonus Questions

Bonus 1: (2 points) Hypotheses and visualizations:

  1. Fatigue/Injury: Plot shooting percentage by month or by cumulative games played; examine the minutes trend.
  2. Defender adjustment: Compare shot distribution early vs late season; opposing teams may have adapted.
  3. Teammate changes: Check if lineup changes or trades affected shot selection.
  4. Shot selection changes: Compare shot zone distributions between halves using side-by-side shot charts.
  5. Random variance: Calculate confidence intervals; check if the difference is statistically significant.

Bonus 2: (2 points)

Single player analysis:

    • Questions: Performance trends, consistency, shot selection, comparison to career averages
    • Visualizations: Time series of stats, shot charts, game-by-game distributions

League-wide analysis:

    • Questions: Position differences, team comparisons, league trends, outlier identification
    • Visualizations: Grouped comparisons, histograms of stat distributions, correlation matrices, scatter plots with player annotations

The scale and context differ significantly, affecting both the questions asked and the appropriate statistical approaches.


Scoring Guide

Score      Grade  Feedback
28-36      A      Excellent understanding of EDA techniques
24-27      B      Good grasp of core concepts; review visualization selection
20-23      C      Adequate understanding; practice with more datasets
16-19      D      Review chapter material and complete exercises
Below 16   F      Seek additional help; re-read chapter before proceeding

Post-Quiz Reflection

After completing this quiz, consider:

  1. Which types of visualizations do you feel most/least comfortable creating?
  2. What data cleaning challenges have you encountered in your own work?
  3. How would you apply these EDA techniques to a new basketball dataset?
  4. What additional visualizations might be useful for basketball analysis?

Take time to revisit sections where you scored below 80% before moving to the next chapter.