Self-Assessment Quiz: Data Cleaning and Preparation
Test your understanding of data cleaning concepts and techniques for football analytics.
Section 1: Missing Data (Questions 1-7)
Question 1
A dataset of game weather conditions has temperature and wind data missing for all dome games. This pattern of missing data is best described as:
A) Missing Completely at Random (MCAR)
B) Missing at Random (MAR)
C) Missing Not at Random (MNAR)
D) Data entry error
Question 2
You have play-by-play data where 15% of the yards_gained values are missing. The missing values are scattered randomly across all games and plays. The most appropriate strategy is:
A) Delete all rows with missing yards_gained
B) Fill with the mean yards_gained value
C) Fill with the median yards_gained value
D) Fill with zero
Question 3
When calculating season-long completion percentage from game-by-game data where some games have missing attempt counts, you should:
A) Calculate the percentage for each game and average them
B) Exclude games with missing data and sum completions/attempts for the remaining games
C) Fill missing attempts with the season average, then calculate
D) Use median imputation for missing attempts
Question 4
A player statistics dataset has rushing yards missing for quarterbacks. What type of missingness is this?
A) MCAR - The values are randomly missing
B) MAR - Missingness is related to the position column
C) MNAR - The missing values would all be near zero
D) This is not missing data, it's a valid null
Question 5
You're imputing missing values for a quarterback's game-by-game passing yards. Which approach is MOST appropriate?
A) Fill with 0
B) Fill with the QB's season average
C) Fill with the league-wide average
D) Fill with a random sample from the QB's other games
Question 6
Before imputing missing values, you should:
A) Immediately apply the imputation method
B) Analyze why the data is missing
C) Delete rows with any missing values
D) Convert all values to the same data type
Question 7
What percentage of missing values in a column generally indicates you should consider dropping the column entirely?
A) More than 5%
B) More than 20%
C) More than 50%
D) Any missing values warrant dropping the column
Section 2: Data Standardization (Questions 8-13)
Question 8
The following team names all refer to the same team: "Ohio State", "OSU", "ohio st.", "OHIO STATE". The process of converting these to a single standard format is called:
A) Data normalization
B) Name standardization
C) Data deduplication
D) Schema matching
Question 9
When merging data from two sources, Source A uses "Texas A&M" and Source B uses "TAMU". The merge fails because:
A) The data types don't match
B) The values are not exactly equal
C) The columns have different names
D) One source has more rows than the other
Question 10
A player name appears as both "Mike Smith" and "Michael Smith" in different records. The best approach is to:
A) Treat them as different players
B) Create a mapping from nicknames to formal names
C) Delete one of the records
D) Change both to lowercase
Question 11
Which pandas method is most useful for standardizing case in a string column?
A) .str.lower()
B) .str.title()
C) .str.strip()
D) .str.replace()
Question 12
You're building a team name standardization function. It should:
A) Only handle exact matches in your mapping
B) Handle exact matches and return the original if not found
C) Convert unmatched names to uppercase
D) Raise an error for unmatched names
Question 13
When standardizing team names, what should you do with suffixes like "Crimson Tide" or "Buckeyes"?
A) Always keep them
B) Always remove them
C) Remove them before matching, add back if desired
D) Convert them to abbreviations
Section 3: Data Types and Validation (Questions 14-19)
Question 14
A column containing game IDs like "001", "002", "003" is stored as strings. What's the best data type to convert it to?
A) int64
B) float64
C) Keep as string (object)
D) category
Question 15
A down column should only contain values 1, 2, 3, or 4. Finding a value of 5 in this column indicates:
A) A valid fifth down penalty situation
B) A data quality issue that needs correction
C) An overtime indicator
D) A missing value placeholder
Question 16
Which pandas dtype is most memory-efficient for a column with only 10 unique team names appearing 10,000 times?
A) object (string)
B) category
C) int64
D) bool
Question 17
To validate that all values in a quarter column are between 1 and 5:
A) Use df['quarter'].between(1, 5).all()
B) Use df['quarter'].isin([1, 2, 3, 4, 5]).all()
C) Use df['quarter'].min() >= 1 and df['quarter'].max() <= 5
D) All of the above are valid approaches
Question 18
A yards_gained column has the value "12" (string) instead of 12 (integer). The best way to convert it is:
A) df['yards_gained'].astype(int)
B) pd.to_numeric(df['yards_gained'], errors='coerce')
C) int(df['yards_gained'])
D) df['yards_gained'].apply(int)
Question 19
You're converting a game_date column from strings like "09-01-2023" to datetime. Which function should you use?
A) df['game_date'].astype('datetime64')
B) pd.to_datetime(df['game_date'])
C) datetime.strptime(df['game_date'], '%m-%d-%Y')
D) df['game_date'].str.to_datetime()
Section 4: Outliers and Data Quality (Questions 20-24)
Question 20
A rushing play shows 85 yards gained. This is:
A) Definitely a data error
B) Definitely a legitimate big play
C) Needs investigation - could be either
D) Should be removed as an outlier
Question 21
Using the IQR method for outlier detection, a value is considered an outlier if:
A) It is more than 1 IQR from the median
B) It is more than 1.5 × IQR below Q1 or above Q3
C) It is more than 2 standard deviations from the mean
D) It is the maximum or minimum value
Question 22
In football analytics, which outlier handling approach is generally recommended?
A) Always remove all outliers automatically
B) Flag outliers for review rather than automatic removal
C) Replace all outliers with the mean
D) Outliers should never be touched
Question 23
A play-by-play dataset shows a down value of 0. This should be:
A) Kept as is - it might be a kickoff
B) Changed to 1 - it's probably a typo
C) Flagged as invalid - down must be 1-4
D) Deleted from the dataset
Question 24
You detect an outlier in a player's single-game rushing yards (250 yards). Before deciding how to handle it:
A) Replace it with the player's average
B) Delete the record
C) Verify against the official box score
D) Apply winsorization
Section 5: Merging and Integration (Questions 25-30)
Question 25
When joining play-by-play data (many rows per game) with game-level data (one row per game), the relationship is:
A) One-to-one
B) One-to-many
C) Many-to-one
D) Many-to-many
Question 26
A left join of games (left) with weather data (right) will:
A) Keep only games that have weather data
B) Keep all games, with null weather for missing matches
C) Keep all weather records, with null games for extras
D) Keep only the overlapping records
Question 27
After merging two datasets, you have more rows than expected. This is likely caused by:
A) Missing values in the join key
B) Duplicate join keys in one or both tables
C) Using the wrong join type
D) Column name mismatches
Question 28
The validate parameter in pd.merge() helps you:
A) Check that data types match
B) Ensure the join relationship is as expected
C) Remove duplicates automatically
D) Handle missing values in join keys
Question 29
When merging data from different seasons that might have different column names:
A) Rename columns to match before merging
B) Use an outer join to keep all columns
C) Use the suffixes parameter to distinguish columns
D) Merge on index instead of columns
Question 30
To add opponent conference information to a games dataset, you need:
A) One merge using the opponent team as the key
B) Two merges - one for home conference, one for opponent conference
C) A cross join between games and conferences
D) No merge - conferences can be inferred from team names
Answer Key
| Question | Answer | Explanation |
|---|---|---|
| 1 | B | MAR - missingness is related to a known variable (venue type), not the value itself |
| 2 | C | Median is robust to outliers in yards gained distribution |
| 3 | B | Exclude incomplete data and calculate true ratio from available data |
| 4 | D | Structural nulls - the stat is simply not recorded for the position, so the null is expected rather than missing |
| 5 | B | Player-specific average maintains individual baseline |
| 6 | B | Understanding missingness pattern determines appropriate handling |
| 7 | C | >50% missing usually means the column isn't useful |
| 8 | B | Name standardization converts variants to canonical form |
| 9 | B | String matching requires exact equality |
| 10 | B | Create a systematic nickname-to-formal mapping |
| 11 | B | .str.title() gives proper case for team/player names |
| 12 | B | Graceful handling of unknown values is important |
| 13 | C | Remove for matching flexibility, can add back for display |
| 14 | C | Converting "001" to an integer drops the leading zeros; IDs are labels, not quantities, so keep them as strings |
| 15 | B | Down can only be 1-4, value 5 is invalid |
| 16 | B | Category type stores unique values once with references |
| 17 | D | All three approaches correctly validate the range |
| 18 | B | to_numeric with coerce handles errors gracefully |
| 19 | B | pd.to_datetime is the standard pandas datetime conversion |
| 20 | C | 85-yard runs are rare but possible; verify before acting |
| 21 | B | Standard IQR method uses 1.5 × IQR beyond quartiles |
| 22 | B | Flagging allows human review of edge cases |
| 23 | C | Down 0 is not valid in football; flag for correction |
| 24 | C | Verify against authoritative source before modifying |
| 25 | C | Many plays join to one game (many-to-one) |
| 26 | B | Left join keeps all left table rows |
| 27 | B | Duplicates in join keys cause row multiplication |
| 28 | B | Validate checks one-to-one, one-to-many, etc. |
| 29 | A | Standardize column names for clean merge |
| 30 | A | Single merge of opponents to conference lookup table |
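The 1.5 × IQR rule behind questions 21-22 can be sketched in pandas as follows. The sample values are made up for illustration; note that the outlier is flagged for review, not deleted:

```python
import pandas as pd

# Hypothetical single-game rushing totals, including one big play
rush_yards = pd.Series([2, 3, 4, 5, 3, 6, 4, 85])

q1, q3 = rush_yards.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag rather than drop: an 85-yard run may be a legitimate play
outliers = rush_yards[(rush_yards < lower) | (rush_yards > upper)]
```

A flagged value would then be checked against an authoritative source (question 24) before any change is made.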
Scoring Guide
- 27-30 correct: Excellent! Strong grasp of data cleaning principles.
- 22-26 correct: Good understanding. Review missed topics.
- 17-21 correct: Fair. More practice with hands-on cleaning needed.
- Below 17: Review chapter material thoroughly.
Topics to Review by Question
| Questions | Topic |
|---|---|
| 1-7 | Missing data types and handling strategies |
| 8-13 | Name standardization and matching |
| 14-19 | Data types and validation |
| 20-24 | Outlier detection and handling |
| 25-30 | Merging and data integration |
Key Concepts Summary
Missing Data
- MCAR: Random missingness, safe to drop
- MAR: Related to observed data, can impute
- MNAR: Related to missing value itself, needs careful handling
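As a rough pandas sketch of these strategies (the column and player values are illustrative, not from any real dataset):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "player": ["A", "A", "B", "B", "B"],
    "pass_yards": [250.0, np.nan, 180.0, 220.0, np.nan],
})

# Quantify missingness per column before choosing a strategy
missing_share = df["pass_yards"].isna().mean()

# MCAR-style scattered gaps: the median is robust to outliers
median_filled = df["pass_yards"].fillna(df["pass_yards"].median())

# MAR tied to a known variable (the player): impute with each
# player's own average to preserve individual baselines
player_filled = df["pass_yards"].fillna(
    df.groupby("player")["pass_yards"].transform("mean")
)
```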
Standardization
- Use consistent naming conventions
- Build reusable mapping dictionaries
- Handle unknown values gracefully
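A minimal sketch of such a mapping, assuming lowercase keys and a pass-through fallback for unknown names (the mapping entries are examples only):

```python
# Map common variants to one canonical team name
TEAM_NAME_MAP = {
    "osu": "Ohio State",
    "ohio st.": "Ohio State",
    "ohio state": "Ohio State",
    "tamu": "Texas A&M",
    "texas a&m": "Texas A&M",
}

def standardize_team(name: str) -> str:
    # Normalize case and whitespace before looking up the mapping;
    # unknown names pass through unchanged rather than raising
    key = name.strip().lower()
    return TEAM_NAME_MAP.get(key, name.strip())
```

With this design, standardize_team("OHIO STATE") and standardize_team("osu") both return "Ohio State", while an unmapped name like "Akron" comes back unchanged, so new data does not break the pipeline.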
Validation
- Define expected ranges for each column
- Check data types match expectations
- Verify referential integrity
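These checks might look like the following sketch (column names and values are hypothetical):

```python
import pandas as pd

plays = pd.DataFrame({
    "down": [1, 2, 3, 4, 0],                     # 0 is invalid
    "quarter": [1, 2, 3, 4, 5],                  # 5 = overtime
    "yards_gained": ["12", "7", "-3", "0", "85"],  # strings by mistake
    "team": ["OSU", "OSU", "MICH", "MICH", "OSU"],
})

# Range check: flag invalid rows for review instead of silently dropping them
bad_downs = plays[~plays["down"].isin([1, 2, 3, 4])]

# .all() collapses the per-row check into a single pass/fail
quarters_ok = plays["quarter"].between(1, 5).all()

# Coerce strings to numbers; unparseable values become NaN for later review
plays["yards_gained"] = pd.to_numeric(plays["yards_gained"], errors="coerce")

# Low-cardinality strings are cheaper stored as the category dtype
plays["team"] = plays["team"].astype("category")
```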
Merging
- Understand relationship type before merging
- Use validate parameter to catch issues
- Check row counts before and after
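A small sketch of a safe many-to-one merge, using made-up game and play tables:

```python
import pandas as pd

games = pd.DataFrame({"game_id": [1, 2, 3], "home_team": ["A", "B", "C"]})
plays = pd.DataFrame({"game_id": [1, 1, 2, 2, 3], "yards": [5, 8, -2, 12, 0]})

rows_before = len(plays)

# Many plays join to one game row; validate="m:1" raises if the
# game table unexpectedly contains duplicate game_id keys
merged = plays.merge(games, on="game_id", how="left", validate="m:1")

# A clean many-to-one left join should not change the row count
assert len(merged) == rows_before
```

If a duplicate game_id slipped into the games table, the validate check would raise a MergeError instead of silently multiplying rows.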