Quiz: Python Tools for Soccer Analytics
Test your Python skills before moving to the next chapter. Target: 70% or higher to proceed. Time: ~35 minutes
Section 1: Multiple Choice (1 point each)
1. Which pandas method is used to filter rows based on a condition?
- A) `df.filter(condition)`
- B) `df[condition]` (boolean indexing)
- C) `df.select(condition)`
- D) `df.where(condition)`
Answer
**B)** `df[condition]` (boolean indexing)
*Explanation:* Boolean indexing with `df[condition]` is the standard pandas way to filter rows. `df.query()` is also valid for string-based conditions. Reference Section 4.3.3.

2. What is the output shape of `df.groupby('team')['goals'].sum()`?
- A) A DataFrame with teams as index
- B) A Series with teams as index
- C) A DataFrame with one column
- D) A single number
Answer
**B)** A Series with teams as index
*Explanation:* When a single column is selected from a groupby and aggregated with a single function, the result is a Series. Teams become the index, sums become the values. Reference Section 4.3.5.

3. Which merge type keeps all rows from the left DataFrame and matching rows from the right?
- A) Inner merge
- B) Outer merge
- C) Left merge
- D) Cross merge
Answer
**C)** Left merge
*Explanation:* A left merge keeps all rows from the left DataFrame and adds matching data from the right DataFrame (with NaN for non-matches). Reference Section 4.3.6.

4. What NumPy function generates random integers from a Poisson distribution?
- A) `np.random.randint()`
- B) `np.random.poisson()`
- C) `np.random.normal()`
- D) `np.random.exponential()`
Answer
**B)** `np.random.poisson()`
*Explanation:* `np.random.poisson(lam, size)` generates Poisson-distributed random integers with mean `lam`, which makes it well suited to simulating goal counts. (The parameter is named `lam` because `lambda` is a reserved word in Python.) Reference Section 4.4.5.

5. Which matplotlib function creates a new figure with multiple subplots?
- A) `plt.subplot()`
- B) `plt.figure()`
- C) `plt.subplots()`
- D) `plt.axes()`
Answer
**C)** `plt.subplots()`
*Explanation:* `plt.subplots(nrows, ncols)` creates a figure and an array of axes objects for multiple subplots. Reference Section 4.5.1.

6. What is the correct way to add multiple conditions in pandas filtering?
- A) `df[condition1 and condition2]`
- B) `df[condition1 & condition2]`
- C) `df[condition1 && condition2]`
- D) `df[(condition1) + (condition2)]`
Answer
**B)** `df[condition1 & condition2]`
*Explanation:* Use `&` for AND and `|` for OR in pandas boolean indexing. The Python keywords `and`/`or` don't work with Series. Wrap each condition in parentheses, because `&` and `|` bind more tightly than comparison operators such as `>` and `==`. Reference Section 4.3.3.

7. Which seaborn function creates a box plot?
- A) `sns.boxchart()`
- B) `sns.box()`
- C) `sns.boxplot()`
- D) `sns.whisker()`
Answer
**C)** `sns.boxplot()`
*Explanation:* `sns.boxplot(data=df, x='category', y='value')` creates box plots showing distribution statistics. Reference Section 4.5.4.

8. What does `df.memory_usage(deep=True)` return?
- A) CPU usage for DataFrame operations
- B) Memory usage per column in bytes
- C) Processing time for DataFrame creation
- D) Number of bytes in each cell
Answer
**B)** Memory usage per column in bytes
*Explanation:* `memory_usage(deep=True)` returns a Series of byte counts, one entry per column (plus the index by default), and measures actual memory consumption including the contents of object columns such as strings. Without `deep=True`, object columns show only pointer size. Reference Section 4.7.1.

9. Which pandas method converts a wide DataFrame to long format?
- A) `df.pivot()`
- B) `df.melt()`
- C) `df.stack()`
- D) `df.reshape()`
Answer
**B)** `df.melt()`
*Explanation:* `df.melt()` unpivots a DataFrame from wide to long format. `pivot()` does the opposite. Reference Section 4.3.6.

10. What is the primary advantage of NumPy vectorized operations over Python loops?
- A) They use less memory
- B) They are significantly faster
- C) They are easier to read
- D) They support more data types
Answer
**B)** They are significantly faster
*Explanation:* NumPy operations are implemented in optimized C code and operate on entire arrays at once, often 10-100x faster than equivalent Python loops. Reference Section 4.4.2.

Section 2: True/False (1 point each)
11. In pandas, df.loc[] is used for integer-based indexing.
Answer
**False**
*Explanation:* `df.loc[]` is for label-based indexing. `df.iloc[]` is for integer-based indexing. Reference Section 4.3.3.

12. Virtual environments help ensure project dependencies don't conflict.
Answer
**True**
*Explanation:* Virtual environments create isolated Python installations with their own packages, preventing version conflicts between projects. Reference Section 4.2.1.

13. The `groupby().agg()` method can apply multiple aggregation functions to multiple columns.
Answer
**True**
*Explanation:* `agg()` accepts a dictionary mapping columns to aggregations, e.g., `{'goals': ['sum', 'mean'], 'shots': 'count'}`. Reference Section 4.3.5.

14. Changing a pandas column to 'category' dtype always reduces memory usage.
Answer
**False**
*Explanation:* Category dtype reduces memory when there are few unique values relative to row count. For high-cardinality columns (many unique values), it may increase memory. Reference Section 4.7.1.

15. `plt.savefig()` must be called before `plt.show()` to save the figure.
Answer
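**True**
*Explanation:* In a typical script, the figure is closed once the `plt.show()` window is dismissed, so a later `plt.savefig()` usually writes a blank image. Calling `plt.gcf()` at that point returns a new, empty figure rather than the one that was shown. To save, call `savefig()` first, or keep a reference to the Figure object (for example from `fig, ax = plt.subplots()`) and call `fig.savefig()` on it. Reference Section 4.5.1.

Section 3: Fill in the Blank (1 point each)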
16. To calculate the correlation between two arrays x and y in NumPy, use np.________(x, y).
Answer
**corrcoef**
*Explanation:* `np.corrcoef(x, y)` returns a 2x2 correlation matrix; `np.corrcoef(x, y)[0, 1]` is the Pearson correlation coefficient between arrays x and y.

17. To create a DataFrame from a list of dictionaries, use pd.________(list_of_dicts).
Answer
**DataFrame**
*Explanation:* `pd.DataFrame([{'a': 1}, {'a': 2}])` creates a DataFrame where each dict becomes a row.

18. In matplotlib, the ________ method sets the x-axis label.
Answer
**set_xlabel** (or **xlabel** when using pyplot)
*Explanation:* For axes objects: `ax.set_xlabel('Label')`. For pyplot: `plt.xlabel('Label')`.

19. To read a CSV file with specific columns only, use the ________ parameter in `pd.read_csv()`.
Answer
**usecols**
*Explanation:* `pd.read_csv('file.csv', usecols=['col1', 'col2'])` loads only specified columns, reducing memory and load time.

20. The pandas method to remove duplicate rows is df.________().
Answer
**drop_duplicates**
*Explanation:* `df.drop_duplicates()` removes duplicate rows. Parameters like `subset` and `keep` control behavior.

Section 4: Code Output (2 points each)
21. What is the output?
```python
import pandas as pd

df = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B'],
    'goals': [2, 3, 1, 4]
})
result = df.groupby('team')['goals'].transform('sum')
print(result.tolist())
```
Answer
**[5, 5, 5, 5]**
*Explanation:* `transform()` returns a Series with the same index as the original DataFrame. Team A's sum (2 + 3 = 5) is assigned to both A rows, Team B's sum (1 + 4 = 5) to both B rows.

22. What is the output?
```python
import numpy as np

goals = np.array([0, 1, 2, 3, 4])
mask = goals > 1
print(goals[mask])
```
Answer
**[2 3 4]**
*Explanation:* Boolean indexing with `mask = [False, False, True, True, True]` selects only elements where the condition is True.

23. What does this return?
```python
import pandas as pd

df = pd.DataFrame({
    'home_goals': [2, 1, 3],
    'away_goals': [1, 1, 2]
})
df.query('home_goals > away_goals').shape[0]
```
Answer
**2**
*Explanation:* Two rows have home_goals > away_goals (2>1 and 3>2); the middle row is a 1-1 draw. `.shape[0]` returns the row count.

Section 5: Short Answer (2 points each)
24. Explain the difference between df.loc[] and df.iloc[] with a brief example.
Sample Answer
- `df.loc[]` uses **label-based** indexing (row/column names)
- `df.iloc[]` uses **integer-based** indexing (positions)

Example:
```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])
df.loc['y', 'A']   # Returns 2 (using label 'y' and column name 'A')
df.iloc[1, 0]      # Returns 2 (using row position 1, column position 0)
```
*Key points for full credit:*
- Clear distinction between labels and integers
- Correct example showing the difference
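
The distinction in question 24 matters most when the index labels are themselves integers that do not match row positions. The following is a minimal sketch, assuming a hypothetical `players` DataFrame indexed by shirt number; the names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: the index holds shirt numbers, not row positions.
players = pd.DataFrame(
    {"name": ["Keeper", "Striker", "Winger"], "goals": [0, 12, 7]},
    index=[1, 9, 11],
)

by_label = players.loc[9, "name"]       # label 9 -> the row whose index is 9
by_position = players.iloc[1]["name"]   # position 1 -> the second row

# Position 9 does not exist in a three-row frame, so iloc raises IndexError.
try:
    players.iloc[9]
except IndexError as exc:
    iloc_error = type(exc).__name__

# Label 0 is not a shirt number in the index, so loc raises KeyError.
try:
    players.loc[0]
except KeyError as exc:
    loc_error = type(exc).__name__

print(by_label, by_position, iloc_error, loc_error)
```

Both lookups reach the same row here, but for different reasons: `loc` matched the label 9, while `iloc` counted to the second row.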
25. Why is it important to use virtual environments for Python projects?
Sample Answer
Virtual environments provide:
1. **Dependency isolation:** Each project has its own packages, preventing version conflicts
2. **Reproducibility:** You can recreate the exact environment on another machine
3. **Clean testing:** Test package upgrades without affecting other projects
4. **Deployment ease:** Requirements can be exported and installed identically in production

Example: Project A needs pandas 1.5 and Project B needs pandas 2.0; virtual environments allow both to coexist.
*Key points for full credit:*
- At least two valid reasons
- Understanding of isolation concept

Scoring
| Section | Points | Your Score |
|---|---|---|
| Multiple Choice (1-10) | 10 | ___ |
| True/False (11-15) | 5 | ___ |
| Fill in Blank (16-20) | 5 | ___ |
| Code Output (21-23) | 6 | ___ |
| Short Answer (24-25) | 4 | ___ |
| Total | 30 | ___ |
Passing Score: 21/30 (70%)
Review Recommendations
- Score < 50%: Re-read entire chapter, complete Part A-C exercises
- Score 50-70%: Focus on pandas operations (Section 4.3) and NumPy (Section 4.4)
- Score 70-85%: Practice visualization exercises, review best practices
- Score > 85%: Excellent! Ready for Chapter 5
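
For readers sent back to Section 4.3, the pandas operations tested in questions 2, 3, 6, and 21 can be tried together in one short script. This is a minimal sketch, assuming hypothetical `matches` and `stadiums` tables whose names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
matches = pd.DataFrame({
    "team": ["A", "A", "B", "B", "C"],
    "goals": [2, 3, 1, 4, 0],
    "shots": [10, 12, 5, 15, 3],
})
stadiums = pd.DataFrame({
    "team": ["A", "B"],
    "stadium": ["North Park", "South Ground"],
})

# Multiple conditions: parenthesize each one and combine with & (AND) or | (OR).
efficient = matches[(matches["goals"] >= 2) & (matches["shots"] <= 12)]

# One selected column plus one aggregation yields a Series indexed by team.
totals = matches.groupby("team")["goals"].sum()

# transform() broadcasts each group's total back onto the original rows.
matches["team_total"] = matches.groupby("team")["goals"].transform("sum")

# A left merge keeps every row of `matches`; team C has no stadium, so it gets NaN.
merged = matches.merge(stadiums, on="team", how="left")

print(efficient)
print(totals)
print(merged)
```

Comparing the printed `totals` (one value per team) with the `team_total` column (one value per row) shows the difference between aggregating and transforming.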