Quiz: Python Tools for Soccer Analytics

Test your Python skills before moving to the next chapter. Target: 70% or higher to proceed. Time: ~35 minutes


Section 1: Multiple Choice (1 point each)

1. Which pandas method is used to filter rows based on a condition?

  • A) df.filter(condition)
  • B) df[condition] (boolean indexing)
  • C) df.select(condition)
  • D) df.where(condition)
Answer **B)** `df[condition]` (boolean indexing) *Explanation:* Boolean indexing with `df[condition]` is the standard pandas way to filter rows. `df.query()` is also valid for string-based conditions. Reference Section 4.3.3.
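
As a minimal sketch of the two filtering styles named above, the snippet below applies the same condition with boolean indexing and with `query()`. The `matches` DataFrame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical match data, invented for this example.
matches = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "goals": [0, 2, 3, 1],
})

# Boolean indexing: the comparison yields a True/False Series used as a row mask.
high_scoring = matches[matches["goals"] >= 2]

# The same filter written as a string expression.
high_scoring_q = matches.query("goals >= 2")

print(high_scoring["team"].tolist())  # ['B', 'C']
```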

2. What type of object does df.groupby('team')['goals'].sum() return?

  • A) A DataFrame with teams as index
  • B) A Series with teams as index
  • C) A DataFrame with one column
  • D) A single number
Answer **B)** A Series with teams as index *Explanation:* When groupby is applied to a single column with a single aggregation, the result is a Series. Teams become the index, sums become the values. Reference Section 4.3.5.
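
The short sketch below, using made-up data, checks the return type directly. It also shows that selecting the column with double brackets gives a DataFrame instead, which is a common source of confusion.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"team": ["A", "A", "B"], "goals": [2, 3, 1]})

# Single column selected with single brackets: the result is a Series.
totals = df.groupby("team")["goals"].sum()
print(type(totals).__name__)   # Series
print(totals.index.tolist())   # ['A', 'B']

# Double brackets select a list of columns: the result is a DataFrame.
totals_df = df.groupby("team")[["goals"]].sum()
print(type(totals_df).__name__)  # DataFrame
```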

3. Which merge type keeps all rows from the left DataFrame and matching rows from the right?

  • A) Inner merge
  • B) Outer merge
  • C) Left merge
  • D) Cross merge
Answer **C)** Left merge *Explanation:* A left merge keeps all rows from the left DataFrame and adds matching data from the right DataFrame (with NaN for non-matches). Reference Section 4.3.6.
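
A small illustration of the left-merge behavior, assuming two hypothetical tables (`players` and `teams`) that share a `team_id` key. The player whose team is missing from `teams` is kept, with a missing value in the merged column.

```python
import pandas as pd

# Hypothetical tables, invented for this example.
players = pd.DataFrame({"player": ["Ana", "Ben", "Cal"], "team_id": [1, 2, 3]})
teams = pd.DataFrame({"team_id": [1, 2], "team": ["Reds", "Blues"]})

# how="left" keeps every row of `players`, matched or not.
merged = players.merge(teams, on="team_id", how="left")

print(len(merged))                     # 3
print(merged["team"].isna().tolist())  # [False, False, True]
```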

4. What NumPy function generates random integers from a Poisson distribution?

  • A) np.random.randint()
  • B) np.random.poisson()
  • C) np.random.normal()
  • D) np.random.exponential()
Answer **B)** `np.random.poisson()` *Explanation:* `np.random.poisson(lam, size)` generates Poisson-distributed random numbers, ideal for simulating goals. The rate parameter is named `lam` because `lambda` is a reserved word in Python. Reference Section 4.4.5.
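
A minimal sketch of a goal simulation follows. It uses NumPy's `Generator` interface (`np.random.default_rng`), which takes the same `lam` and `size` arguments as the legacy `np.random.poisson`. The rate of 1.5 goals per match and the seed are assumptions chosen for illustration.

```python
import numpy as np

# Seeded generator so the run is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(seed=42)

# Simulate goals for 1000 matches with an assumed average of 1.5 goals per match.
home_goals = rng.poisson(lam=1.5, size=1000)

print(home_goals.shape)       # (1000,)
print(home_goals.dtype.kind)  # i  (integer counts)
```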

5. Which matplotlib function creates a new figure with multiple subplots?

  • A) plt.subplot()
  • B) plt.figure()
  • C) plt.subplots()
  • D) plt.axes()
Answer **C)** `plt.subplots()` *Explanation:* `plt.subplots(nrows, ncols)` creates a figure and array of axes objects for multiple subplots. Reference Section 4.5.1.
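
The sketch below creates a one-row, two-column layout and draws on each axes object. The data is invented, and the `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption so the sketch runs anywhere
import matplotlib.pyplot as plt

# One figure, two side-by-side axes.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

# Invented values for illustration.
axes[0].bar(["A", "B"], [5, 3])
axes[0].set_title("Goals")
axes[1].bar(["A", "B"], [12, 9])
axes[1].set_title("Shots")
fig.tight_layout()

print(axes.shape)  # (2,)
```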

6. What is the correct way to add multiple conditions in pandas filtering?

  • A) df[condition1 and condition2]
  • B) df[condition1 & condition2]
  • C) df[condition1 && condition2]
  • D) df[(condition1) + (condition2)]
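Answer **B)** `df[condition1 & condition2]` *Explanation:* Use `&` for AND and `|` for OR in pandas boolean indexing. The Python keywords `and`/`or` raise a `ValueError` when applied to Series because the truth value of a Series is ambiguous. Wrap each comparison in parentheses, since `&` and `|` bind more tightly than comparison operators such as `>` and `==`. Reference Section 4.3.3.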
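
A brief sketch with made-up data shows the element-wise operators in use, and what happens if the `and` keyword is used instead.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "team": ["A", "A", "B"],
    "goals": [2, 0, 3],
    "shots": [10, 4, 12],
})

# Each comparison is parenthesized, then combined element-wise.
both = df[(df["team"] == "A") & (df["goals"] > 0)]
either = df[(df["goals"] > 2) | (df["shots"] < 5)]
print(len(both), len(either))  # 1 2

# The `and` keyword asks for a single truth value of a whole Series and fails.
try:
    df[(df["team"] == "A") and (df["goals"] > 0)]
except ValueError as err:
    print("ValueError:", err)
```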

7. Which seaborn function creates a box plot?

  • A) sns.boxchart()
  • B) sns.box()
  • C) sns.boxplot()
  • D) sns.whisker()
Answer **C)** `sns.boxplot()` *Explanation:* `sns.boxplot(data=df, x='category', y='value')` creates box plots showing distribution statistics. Reference Section 4.5.4.

8. What does df.memory_usage(deep=True) return?

  • A) CPU usage for DataFrame operations
  • B) Memory usage per column in bytes
  • C) Processing time for DataFrame creation
  • D) Number of bytes in each cell
Answer **B)** Memory usage per column in bytes *Explanation:* `memory_usage(deep=True)` calculates actual memory consumption including object types (strings). Without `deep=True`, object columns show only pointer size. Reference Section 4.7.1.
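
To see the per-column result, the sketch below compares the default and `deep=True` reports on a made-up DataFrame. How much the string column differs between the two depends on the pandas version and on how strings are stored, so no specific byte counts are shown.

```python
import pandas as pd

# Made-up data: one string column, one integer column.
df = pd.DataFrame({
    "team": ["Arsenal", "Chelsea"] * 500,
    "goals": [1, 2] * 500,
})

shallow = df.memory_usage()        # Series of bytes per column (plus the index)
deep = df.memory_usage(deep=True)  # also counts the contents of object values

print(deep.index.tolist())  # ['Index', 'team', 'goals']
print(deep.sum())           # total bytes; the value varies by platform and version
```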

9. Which pandas method converts a wide DataFrame to long format?

  • A) df.pivot()
  • B) df.melt()
  • C) df.stack()
  • D) df.reshape()
Answer **B)** `df.melt()` *Explanation:* `df.melt()` unpivots a DataFrame from wide to long format. `pivot()` does the opposite. Reference Section 4.3.6.
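
A small round trip, using an invented `wide` table, illustrates the two directions: `melt()` stacks the goal columns into rows, and `pivot()` spreads them back out.

```python
import pandas as pd

# Invented wide table: one row per team, one column per venue.
wide = pd.DataFrame({
    "team": ["A", "B"],
    "home_goals": [10, 7],
    "away_goals": [6, 9],
})

# Wide to long: one row per (team, venue) pair.
long = wide.melt(id_vars="team", var_name="venue", value_name="goals")
print(long.shape)  # (4, 3)

# Long back to wide.
back = long.pivot(index="team", columns="venue", values="goals")
print(back.loc["A", "home_goals"])  # 10
```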

10. What is the primary advantage of NumPy vectorized operations over Python loops?

  • A) They use less memory
  • B) They are significantly faster
  • C) They are easier to read
  • D) They support more data types
Answer **B)** They are significantly faster *Explanation:* NumPy operations are implemented in optimized C code and operate on entire arrays at once, often 10-100x faster than equivalent Python loops on large arrays; the exact speedup depends on the operation and array size. Reference Section 4.4.2.
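
The sketch below computes the same quantity with a Python loop and with a vectorized expression, using invented numbers. It only checks that the two approaches agree; timing them (for example with `timeit`) is left to the reader because results vary by machine.

```python
import numpy as np

# Invented per-match totals.
shots = np.array([12, 8, 15, 10])
goals = np.array([2, 1, 3, 0])

# Loop version: one division per Python iteration.
conv_loop = [g / s for g, s in zip(goals, shots)]

# Vectorized version: a single array expression.
conv_vec = goals / shots

print(np.allclose(conv_loop, conv_vec))  # True
```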

Section 2: True/False (1 point each)

11. In pandas, df.loc[] is used for integer-based indexing.

Answer **False** *Explanation:* `df.loc[]` is for label-based indexing. `df.iloc[]` is for integer-based indexing. Reference Section 4.3.3.

12. Virtual environments help ensure project dependencies don't conflict.

Answer **True** *Explanation:* Virtual environments create isolated Python installations with their own packages, preventing version conflicts between projects. Reference Section 4.2.1.

13. The groupby().agg() method can apply multiple aggregation functions to multiple columns.

Answer **True** *Explanation:* `agg()` accepts a dictionary mapping columns to aggregations, e.g., `{'goals': ['sum', 'mean'], 'shots': 'count'}`. Reference Section 4.3.5.
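
As a sketch of the dictionary form, the example below applies the mapping from the explanation to a made-up DataFrame. Because one of the dictionary values is a list, the result has two-level column labels.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "team": ["A", "A", "B"],
    "goals": [2, 3, 1],
    "shots": [10, 12, 7],
})

summary = df.groupby("team").agg({"goals": ["sum", "mean"], "shots": "count"})

print(summary.columns.tolist())
# [('goals', 'sum'), ('goals', 'mean'), ('shots', 'count')]
print(summary.loc["A", ("goals", "sum")])  # 5
```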

14. Changing a pandas column to 'category' dtype always reduces memory usage.

Answer **False** *Explanation:* Category dtype reduces memory when there are few unique values relative to row count. For high-cardinality columns (many unique values), it may increase memory. Reference Section 4.7.1.
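
The comparison below is a rough sketch of both cases, using invented columns: one with two distinct values and one where every value is unique. The helper `mem` is a hypothetical convenience function, not part of pandas.

```python
import pandas as pd

n = 10_000
# Two distinct values repeated many times (low cardinality).
low_card = pd.Series(["Arsenal", "Chelsea"] * (n // 2))
# Every value distinct (high cardinality).
high_card = pd.Series([f"player_{i}" for i in range(n)])

def mem(s):
    """Bytes used by the values of a Series, excluding its index."""
    return s.memory_usage(index=False, deep=True)

print(mem(low_card.astype("category")) < mem(low_card))    # True
print(mem(high_card.astype("category")) > mem(high_card))  # True
```

In the high-cardinality case the category column has to store every distinct string plus an integer code per row, so nothing is saved.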

15. plt.savefig() must be called before plt.show() to save the figure.

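Answer **True** *Explanation:* In a script, `plt.show()` blocks until the window is closed, after which pyplot no longer tracks that figure, so a later `plt.savefig()` typically writes a blank image. Call `savefig()` first, or keep a reference to the figure object and save through it with `fig.savefig()` rather than relying on the current figure. Reference Section 4.5.1.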

Section 3: Fill in the Blank (1 point each)

16. To calculate the correlation between two arrays x and y in NumPy, use np.________(x, y).

Answer **corrcoef** *Explanation:* `np.corrcoef(x, y)[0, 1]` returns the Pearson correlation coefficient between arrays x and y.
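
A short sketch with invented shot and goal counts shows why the `[0, 1]` index is needed: `np.corrcoef` returns a full correlation matrix, not a single number.

```python
import numpy as np

# Invented per-player totals.
shots = np.array([5, 8, 12, 15, 20])
goals = np.array([0, 1, 1, 2, 3])

matrix = np.corrcoef(shots, goals)
print(matrix.shape)  # (2, 2)

# Off-diagonal entry: correlation between shots and goals.
r = matrix[0, 1]
print(round(r, 3))
```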

17. To create a DataFrame from a list of dictionaries, use pd.________(list_of_dicts).

Answer **DataFrame** *Explanation:* `pd.DataFrame([{'a': 1}, {'a': 2}])` creates a DataFrame where each dict becomes a row.

18. In matplotlib, the ________ method sets the x-axis label.

Answer **set_xlabel** (or **xlabel** when using pyplot) *Explanation:* For axes objects: `ax.set_xlabel('Label')`. For pyplot: `plt.xlabel('Label')`.

19. To read a CSV file with specific columns only, use the ________ parameter in pd.read_csv().

Answer **usecols** *Explanation:* `pd.read_csv('file.csv', usecols=['col1', 'col2'])` loads only specified columns, reducing memory and load time.
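
To try `usecols` without a file on disk, the sketch below reads from an in-memory string. The CSV content is invented for illustration.

```python
import io
import pandas as pd

# Invented CSV content standing in for a file on disk.
csv_text = "player,team,goals,assists\nAna,A,3,1\nBen,B,1,4\n"

# Only the listed columns are parsed and kept.
df = pd.read_csv(io.StringIO(csv_text), usecols=["player", "goals"])

print(df.columns.tolist())  # ['player', 'goals']
```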

20. The pandas method to remove duplicate rows is df.________().

Answer **drop_duplicates** *Explanation:* `df.drop_duplicates()` removes duplicate rows. Parameters like `subset` and `keep` control behavior.
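
The sketch below, on made-up rows, contrasts the default behavior with the `subset` and `keep` parameters mentioned above.

```python
import pandas as pd

# Made-up rows: the first two are exact duplicates.
df = pd.DataFrame({
    "match_id": [1, 1, 2, 2],
    "team": ["A", "A", "B", "B"],
    "goals": [2, 2, 1, 3],
})

# Default: drop rows that are identical in every column.
exact = df.drop_duplicates()
print(len(exact))  # 3

# Compare on match_id only and keep the last row seen for each value.
latest = df.drop_duplicates(subset="match_id", keep="last")
print(latest["goals"].tolist())  # [2, 3]
```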

Section 4: Code Output (2 points each)

21. What is the output?

import pandas as pd

df = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B'],
    'goals': [2, 3, 1, 4]
})

result = df.groupby('team')['goals'].transform('sum')
print(result.tolist())
Answer **[5, 5, 5, 5]** *Explanation:* `transform()` returns a Series with the same index as the original DataFrame. Team A's sum (5) is assigned to both A rows, Team B's sum (5) to both B rows.
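
Because `transform()` keeps the original row alignment, its result can be assigned straight back as a new column. The sketch below reuses the data from the question to compute each row's share of its team's goals; the column names `team_total` and `share` are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "goals": [2, 3, 1, 4],
})

# Group totals broadcast back to every row of the group.
df["team_total"] = df.groupby("team")["goals"].transform("sum")

# Row-level share of the team's goals.
df["share"] = df["goals"] / df["team_total"]

print(df["share"].tolist())  # [0.4, 0.6, 0.2, 0.8]
```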

22. What is the output?

import numpy as np

goals = np.array([0, 1, 2, 3, 4])
mask = goals > 1
print(goals[mask])
Answer **[2 3 4]** *Explanation:* Boolean indexing with `mask = [False, False, True, True, True]` selects only elements where the condition is True.

23. What does this return?

import pandas as pd

df = pd.DataFrame({
    'home_goals': [2, 1, 3],
    'away_goals': [1, 1, 2]
})

df.query('home_goals > away_goals').shape[0]
Answer **2** *Explanation:* Two rows have home_goals > away_goals (2>1 and 3>2). `.shape[0]` returns the row count.

Section 5: Short Answer (2 points each)

24. Explain the difference between df.loc[] and df.iloc[] with a brief example.

Sample Answer

  • `df.loc[]` uses **label-based** indexing (row/column names)
  • `df.iloc[]` uses **integer-based** indexing (positions)

Example:
df = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])

df.loc['y', 'A']   # Returns 2 (using label 'y')
df.iloc[1, 0]      # Returns 2 (using position 1)

*Key points for full credit:*

  • Clear distinction between labels and integers
  • Correct example showing the difference

25. Why is it important to use virtual environments for Python projects?

Sample Answer

Virtual environments provide:

  1. **Dependency isolation:** Each project has its own packages, preventing version conflicts
  2. **Reproducibility:** You can recreate the exact environment on another machine
  3. **Clean testing:** Test package upgrades without affecting other projects
  4. **Deployment ease:** Requirements can be exported and installed identically in production

Example: Project A needs pandas 1.5 and Project B needs pandas 2.0; virtual environments allow both to coexist.

*Key points for full credit:*

  • At least two valid reasons
  • Understanding of isolation concept

Scoring

| Section | Points | Your Score |
|---|---|---|
| Multiple Choice (1-10) | 10 | ___ |
| True/False (11-15) | 5 | ___ |
| Fill in Blank (16-20) | 5 | ___ |
| Code Output (21-23) | 6 | ___ |
| Short Answer (24-25) | 4 | ___ |
| Total | 30 | ___ |

Passing Score: 21/30 (70%)


Review Recommendations

  • Score < 50%: Re-read entire chapter, complete Part A-C exercises
  • Score 50-69%: Focus on pandas operations (Section 4.3) and NumPy (Section 4.4)
  • Score 70-85%: Practice visualization exercises, review best practices
  • Score > 85%: Excellent! Ready for Chapter 5