Self-Assessment Quiz: Data Cleaning and Preparation
Test your understanding of data cleaning concepts and techniques for football analytics.
Section 1: Missing Data (Questions 1-7)
Question 1
A dataset of game weather conditions has temperature and wind data missing for all dome games. This pattern of missing data is best described as:
A) Missing Completely at Random (MCAR)
B) Missing at Random (MAR)
C) Missing Not at Random (MNAR)
D) Data entry error
Question 2
You have play-by-play data where 15% of the yards_gained values are missing. The missing values are scattered randomly across all games and plays. The most appropriate strategy is:
A) Delete all rows with missing yards_gained
B) Fill with the mean yards_gained value
C) Fill with the median yards_gained value
D) Fill with zero
Question 3
When calculating season-long completion percentage from game-by-game data where some games have missing attempt counts, you should:
A) Calculate the percentage for each game and average them
B) Exclude games with missing data and sum completions/attempts for the remaining games
C) Fill missing attempts with the season average, then calculate
D) Use median imputation for missing attempts
Question 4
A player statistics dataset has rushing yards missing for quarterbacks. What type of missingness is this?
A) MCAR - The values are randomly missing
B) MAR - Missingness is related to the position column
C) MNAR - The missing values would all be near zero
D) This is not missing data, it's a valid null
Question 5
You're imputing missing values for a quarterback's game-by-game passing yards. Which approach is MOST appropriate?
A) Fill with 0
B) Fill with the QB's season average
C) Fill with the league-wide average
D) Fill with a random sample from the QB's other games
Question 6
Before imputing missing values, you should:
A) Immediately apply the imputation method
B) Analyze why the data is missing
C) Delete rows with any missing values
D) Convert all values to the same data type
Question 7
What percentage of missing values in a column generally indicates you should consider dropping the column entirely?
A) More than 5%
B) More than 20%
C) More than 50%
D) Any missing values warrant dropping the column
Section 2: Data Standardization (Questions 8-13)
Question 8
The following team names all refer to the same team: "Ohio State", "OSU", "ohio st.", "OHIO STATE". The process of converting these to a single standard format is called:
A) Data normalization
B) Name standardization
C) Data deduplication
D) Schema matching
Question 9
When merging data from two sources, Source A uses "Texas A&M" and Source B uses "TAMU". The merge fails because:
A) The data types don't match
B) The values are not exactly equal
C) The columns have different names
D) One source has more rows than the other
Question 10
A player name appears as both "Mike Smith" and "Michael Smith" in different records. The best approach is to:
A) Treat them as different players
B) Create a mapping from nicknames to formal names
C) Delete one of the records
D) Change both to lowercase
Question 11
Which pandas method is most useful for standardizing case in a string column?
A) .str.lower()
B) .str.title()
C) .str.strip()
D) .str.replace()
Question 12
You're building a team name standardization function. It should:
A) Only handle exact matches in your mapping
B) Handle exact matches and return the original if not found
C) Convert unmatched names to uppercase
D) Raise an error for unmatched names
Question 13
When standardizing team names, what should you do with suffixes like "Crimson Tide" or "Buckeyes"?
A) Always keep them
B) Always remove them
C) Remove them before matching, add back if desired
D) Convert them to abbreviations
Section 3: Data Types and Validation (Questions 14-19)
Question 14
A column containing game IDs like "001", "002", "003" is stored as strings. What's the best data type to convert it to?
A) int64
B) float64
C) Keep as string (object)
D) category
Question 15
A down column should only contain values 1, 2, 3, or 4. Finding a value of 5 in this column indicates:
A) A valid fifth down penalty situation
B) A data quality issue that needs correction
C) An overtime indicator
D) A missing value placeholder
Question 16
Which pandas dtype is most memory-efficient for a column with only 10 unique team names appearing 10,000 times?
A) object (string)
B) category
C) int64
D) bool
Question 17
To validate that all values in a quarter column are between 1 and 5:
A) Use df['quarter'].between(1, 5).all()
B) Use df['quarter'].isin([1, 2, 3, 4, 5]).all()
C) Use df['quarter'].min() >= 1 and df['quarter'].max() <= 5
D) All of the above are valid approaches
Question 18
A yards_gained column has the value "12" (string) instead of 12 (integer). The best way to convert it is:
A) df['yards_gained'].astype(int)
B) pd.to_numeric(df['yards_gained'], errors='coerce')
C) int(df['yards_gained'])
D) df['yards_gained'].apply(int)
Question 19
You're converting a game_date column from strings like "09-01-2023" to datetime. Which function should you use?
A) df['game_date'].astype('datetime64')
B) pd.to_datetime(df['game_date'])
C) datetime.strptime(df['game_date'], '%m-%d-%Y')
D) df['game_date'].str.to_datetime()
Section 4: Outliers and Data Quality (Questions 20-24)
Question 20
A rushing play shows 85 yards gained. This is:
A) Definitely a data error
B) Definitely a legitimate big play
C) Needs investigation - could be either
D) Should be removed as an outlier
Question 21
Using the IQR method for outlier detection, a value is considered an outlier if:
A) It is more than 1 IQR from the median
B) It is more than 1.5 × IQR below Q1 or above Q3
C) It is more than 2 standard deviations from the mean
D) It is the maximum or minimum value
Question 22
In football analytics, which outlier handling approach is generally recommended?
A) Always remove all outliers automatically
B) Flag outliers for review rather than automatic removal
C) Replace all outliers with the mean
D) Outliers should never be touched
Question 23
A play-by-play dataset shows a down value of 0. This should be:
A) Kept as is - it might be a kickoff
B) Changed to 1 - it's probably a typo
C) Flagged as invalid - down must be 1-4
D) Deleted from the dataset
Question 24
You detect an outlier in a player's single-game rushing yards (250 yards). Before deciding how to handle it:
A) Replace it with the player's average
B) Delete the record
C) Verify against the official box score
D) Apply winsorization
Section 5: Merging and Integration (Questions 25-30)
Question 25
When joining play-by-play data (many rows per game) with game-level data (one row per game), the relationship is:
A) One-to-one
B) One-to-many
C) Many-to-one
D) Many-to-many
Question 26
A left join of games (left) with weather data (right) will:
A) Keep only games that have weather data
B) Keep all games, with null weather for missing matches
C) Keep all weather records, with null games for extras
D) Keep only the overlapping records
Question 27
After merging two datasets, you have more rows than expected. This is likely caused by:
A) Missing values in the join key
B) Duplicate join keys in one or both tables
C) Using the wrong join type
D) Column name mismatches
Question 28
The validate parameter in pd.merge() helps you:
A) Check that data types match
B) Ensure the join relationship is as expected
C) Remove duplicates automatically
D) Handle missing values in join keys
Question 29
When merging data from different seasons that might have different column names:
A) Rename columns to match before merging
B) Use an outer join to keep all columns
C) Use the suffixes parameter to distinguish columns
D) Merge on index instead of columns
Question 30
To add opponent conference information to a games dataset, you need:
A) One merge using the opponent team as the key
B) Two merges - one for home conference, one for opponent conference
C) A cross join between games and conferences
D) No merge - conferences can be inferred from team names
Answer Key
| Question | Answer | Explanation |
|---|---|---|
| 1 | B | MAR - missingness is related to a known variable (venue type), not the value itself |
| 2 | C | Median is robust to outliers in yards gained distribution |
| 3 | B | Exclude incomplete data and calculate true ratio from available data |
| 4 | D | Structural nulls - the stat is simply not recorded for the position, so the null is expected rather than missing |
| 5 | B | Player-specific average maintains individual baseline |
| 6 | B | Understanding missingness pattern determines appropriate handling |
| 7 | C | >50% missing usually means the column isn't useful |
| 8 | B | Name standardization converts variants to canonical form |
| 9 | B | String matching requires exact equality |
| 10 | B | Create a systematic nickname-to-formal mapping |
| 11 | B | .str.title() gives proper case for team/player names |
| 12 | B | Graceful handling of unknown values is important |
| 13 | C | Remove for matching flexibility, can add back for display |
| 14 | C | Converting "001" to an integer drops the leading zeros; IDs are labels, not quantities, so keep them as strings |
| 15 | B | Down can only be 1-4, value 5 is invalid |
| 16 | B | Category type stores unique values once with references |
| 17 | D | All three approaches correctly validate the range |
| 18 | B | to_numeric with coerce handles errors gracefully |
| 19 | B | pd.to_datetime is the standard pandas datetime conversion |
| 20 | C | 85-yard runs are rare but possible; verify before acting |
| 21 | B | Standard IQR method uses 1.5 × IQR beyond quartiles |
| 22 | B | Flagging allows human review of edge cases |
| 23 | C | Down 0 is not valid in football; flag for correction |
| 24 | C | Verify against authoritative source before modifying |
| 25 | C | Many plays join to one game (many-to-one) |
| 26 | B | Left join keeps all left table rows |
| 27 | B | Duplicates in join keys cause row multiplication |
| 28 | B | Validate checks one-to-one, one-to-many, etc. |
| 29 | A | Standardize column names for clean merge |
| 30 | A | Single merge of opponents to conference lookup table |
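The 1.5 × IQR rule behind questions 21-22 can be sketched in pandas as follows. The sample values are made up for illustration; note that the outlier is flagged for review, not deleted:

```python
import pandas as pd

# Hypothetical single-game rushing totals, including one big play
rush_yards = pd.Series([2, 3, 4, 5, 3, 6, 4, 85])

q1, q3 = rush_yards.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag rather than drop: an 85-yard run may be a legitimate play
outliers = rush_yards[(rush_yards < lower) | (rush_yards > upper)]
```

A flagged value would then be checked against an authoritative source (question 24) before any change is made.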
Scoring Guide
- 27-30 correct: Excellent! Strong grasp of data cleaning principles.
- 22-26 correct: Good understanding. Review missed topics.
- 17-21 correct: Fair. More practice with hands-on cleaning needed.
- Below 17: Review chapter material thoroughly.
Topics to Review by Question
| Questions | Topic |
|---|---|
| 1-7 | Missing data types and handling strategies |
| 8-13 | Name standardization and matching |
| 14-19 | Data types and validation |
| 20-24 | Outlier detection and handling |
| 25-30 | Merging and data integration |
Key Concepts Summary
Missing Data
- MCAR: Random missingness, safe to drop
- MAR: Related to observed data, can impute
- MNAR: Related to missing value itself, needs careful handling
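As a rough pandas sketch of these strategies (the column and player values are illustrative, not from any real dataset):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "player": ["A", "A", "B", "B", "B"],
    "pass_yards": [250.0, np.nan, 180.0, 220.0, np.nan],
})

# Quantify missingness per column before choosing a strategy
missing_share = df["pass_yards"].isna().mean()

# MCAR-style scattered gaps: the median is robust to outliers
median_filled = df["pass_yards"].fillna(df["pass_yards"].median())

# MAR tied to a known variable (the player): impute with each
# player's own average to preserve individual baselines
player_filled = df["pass_yards"].fillna(
    df.groupby("player")["pass_yards"].transform("mean")
)
```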
Standardization
- Use consistent naming conventions
- Build reusable mapping dictionaries
- Handle unknown values gracefully
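A minimal sketch of such a mapping, assuming lowercase keys and a pass-through fallback for unknown names (the mapping entries are examples only):

```python
# Map common variants to one canonical team name
TEAM_NAME_MAP = {
    "osu": "Ohio State",
    "ohio st.": "Ohio State",
    "ohio state": "Ohio State",
    "tamu": "Texas A&M",
    "texas a&m": "Texas A&M",
}

def standardize_team(name: str) -> str:
    # Normalize case and whitespace before looking up the mapping;
    # unknown names pass through unchanged rather than raising
    key = name.strip().lower()
    return TEAM_NAME_MAP.get(key, name.strip())
```

With this design, standardize_team("OHIO STATE") and standardize_team("osu") both return "Ohio State", while an unmapped name like "Akron" comes back unchanged, so new data does not break the pipeline.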
Validation
- Define expected ranges for each column
- Check data types match expectations
- Verify referential integrity
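These checks might look like the following sketch (column names and values are hypothetical):

```python
import pandas as pd

plays = pd.DataFrame({
    "down": [1, 2, 3, 4, 0],                     # 0 is invalid
    "quarter": [1, 2, 3, 4, 5],                  # 5 = overtime
    "yards_gained": ["12", "7", "-3", "0", "85"],  # strings by mistake
    "team": ["OSU", "OSU", "MICH", "MICH", "OSU"],
})

# Range check: flag invalid rows for review instead of silently dropping them
bad_downs = plays[~plays["down"].isin([1, 2, 3, 4])]

# .all() collapses the per-row check into a single pass/fail
quarters_ok = plays["quarter"].between(1, 5).all()

# Coerce strings to numbers; unparseable values become NaN for later review
plays["yards_gained"] = pd.to_numeric(plays["yards_gained"], errors="coerce")

# Low-cardinality strings are cheaper stored as the category dtype
plays["team"] = plays["team"].astype("category")
```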
Merging
- Understand relationship type before merging
- Use validate parameter to catch issues
- Check row counts before and after
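A small sketch of a safe many-to-one merge, using made-up game and play tables:

```python
import pandas as pd

games = pd.DataFrame({"game_id": [1, 2, 3], "home_team": ["A", "B", "C"]})
plays = pd.DataFrame({"game_id": [1, 1, 2, 2, 3], "yards": [5, 8, -2, 12, 0]})

rows_before = len(plays)

# Many plays join to one game row; validate="m:1" raises if the
# game table unexpectedly contains duplicate game_id keys
merged = plays.merge(games, on="game_id", how="left", validate="m:1")

# A clean many-to-one left join should not change the row count
assert len(merged) == rows_before
```

If a duplicate game_id slipped into the games table, the validate check would raise a MergeError instead of silently multiplying rows.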