Quiz: Python for AI Engineering
Test your understanding before moving to the next chapter. Target: 70% or higher to proceed. Time: ~40 minutes
Section 1: Multiple Choice (1 point each)
1. Which of the following best explains why NumPy array operations are much faster than equivalent Python loops?
- A) NumPy uses a faster version of the Python interpreter
- B) NumPy operations dispatch to compiled C/Fortran code and can leverage SIMD instructions
- C) NumPy arrays are stored on the GPU by default
- D) NumPy uses JIT compilation to convert Python to machine code
Answer
**B)** NumPy operations dispatch to compiled C/Fortran code and can leverage SIMD instructions *Explanation:* NumPy's performance comes from its C-implemented ufuncs and its use of optimized BLAS libraries (such as OpenBLAS or MKL). These compiled routines process entire arrays in tight loops with SIMD (Single Instruction, Multiple Data) parallelism, avoiding the per-element overhead of the Python interpreter. Option A is wrong because NumPy does not change the interpreter; C is wrong because NumPy is CPU-based (CuPy handles GPUs); D describes Numba, not NumPy. See Section 5.2.4.
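A minimal timing sketch (illustrative only; exact speedups depend on hardware and array size) contrasting a Python-level loop with the equivalent vectorized ufunc call:

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

# Python-level loop: the interpreter handles every element individually
start = time.perf_counter()
total = 0.0
for v in x:
    total += v * v
loop_time = time.perf_counter() - start

# Vectorized: a single C-level ufunc call processes the whole array
start = time.perf_counter()
total_vec = np.sum(x * x)
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```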
2. What is the output shape of `np.ones((3, 1)) + np.ones((1, 4))`?
- A) `(3, 4)`
- B) `(3, 1)`
- C) `(1, 4)`
- D) This raises a broadcasting error
Answer
**A)** `(3, 4)` *Explanation:* Broadcasting rule: dimensions of size 1 are stretched to match the larger dimension. `(3, 1)` and `(1, 4)` become `(3, 4)` by stretching dim 1 of the first array to 4 and dim 0 of the second array to 3. This is the outer-product broadcasting pattern. See Section 5.2.3.
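A quick check you can run to confirm the resulting shape:

```python
import numpy as np

a = np.ones((3, 1))
b = np.ones((1, 4))

# Size-1 dimensions are stretched: (3, 1) + (1, 4) -> (3, 4)
c = a + b
print(c.shape)  # (3, 4)
```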
3. Which pandas method should you use to select rows by integer position?
- A) `df.loc[]`
- B) `df.iloc[]`
- C) `df.ix[]`
- D) `df.at[]`
Answer
**B)** `df.iloc[]` *Explanation:* `.iloc[]` uses integer-based (positional) indexing. `.loc[]` uses label-based indexing. `.ix[]` is deprecated and should never be used. `.at[]` is for accessing a single scalar value by label. See Section 5.3.2.
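A short illustration of the difference, using a hypothetical toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])

print(df.iloc[0])       # positional: first row, which happens to be labelled 'a'
print(df.loc['b'])      # label-based: row with index label 'b'
print(df.at['c', 'x'])  # single scalar by label: 30
```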
4. What is the primary purpose of the `strides` attribute of a NumPy array?
- A) To track how many times the array has been modified
- B) To specify how many bytes to jump in memory to move to the next element along each dimension
- C) To indicate the maximum size the array can grow to
- D) To store the original shape before reshaping
Answer
**B)** To specify how many bytes to jump in memory to move to the next element along each dimension *Explanation:* Strides are a tuple of byte counts that tell NumPy how to traverse the underlying memory buffer for each dimension. For a C-contiguous `float64` array of shape `(3, 4)`, strides are `(32, 8)`: skip 8 bytes (one float64) to move one column, skip 32 bytes (four float64s) to move one row. This mechanism enables views and non-contiguous slicing without copying data. See Section 5.2.1.
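A minimal sketch showing the strides of a C-contiguous array and of views taken from it:

```python
import numpy as np

a = np.zeros((3, 4), dtype=np.float64)
print(a.strides)          # (32, 8): 4 columns * 8 bytes per row step, 8 bytes per column step

# A transposed view reuses the same buffer with swapped strides -- no copy
print(a.T.strides)        # (8, 32)

# Slicing every other column doubles the column stride, still without copying
print(a[:, ::2].strides)  # (32, 16)
```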
5. You have a DataFrame `df` with a column `price` containing some missing values. Which approach correctly fills missing prices with the column median?
- A) `df['price'] = df['price'].fillna(df['price'].mean())`
- B) `df['price'] = df['price'].fillna(df['price'].median())`
- C) `df['price'].replace(np.nan, df['price'].median())`
- D) `df.dropna(subset=['price'])`
Answer
**B)** `df['price'] = df['price'].fillna(df['price'].median())` *Explanation:* `fillna()` replaces NaN values with the specified value. Option A fills with the mean, not the median as requested. Option C uses `replace` but does not assign the result back. Option D removes rows with missing values instead of filling them. The median is often preferred over the mean for imputation because it is robust to outliers. See Section 5.3.3.
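A small sketch (toy data assumed) of why the median is more robust than the mean when an outlier is present:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [10.0, 12.0, 11.0, np.nan, 5000.0]})  # 5000 is an outlier

print(df['price'].mean())    # 1258.25 -- dragged up by the outlier
print(df['price'].median())  # 11.5   -- unaffected by the outlier

df['price'] = df['price'].fillna(df['price'].median())
print(df)
```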
6. What is the output of the following code?

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = a[1:4]
b[0] = 99
print(a)
```
- A) `[1, 2, 3, 4, 5]`
- B) `[1, 99, 3, 4, 5]`
- C) `[99, 2, 3, 4, 5]`
- D) An error is raised
Answer
**B)** `[1, 99, 3, 4, 5]` *Explanation:* Basic slicing in NumPy creates a view, not a copy. `b = a[1:4]` makes `b` a view into elements at positions 1, 2, 3 of `a`. Modifying `b[0]` modifies `a[1]`. This is a frequent source of bugs for programmers coming from Python lists (where slicing always creates a copy). See Section 5.2.1.
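If an independent array is needed, an explicit copy avoids the aliasing shown above (minimal sketch):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = a[1:4].copy()   # explicit copy: b owns its own data
b[0] = 99
print(a)            # [1 2 3 4 5] -- a is unchanged
print(b.base is a)  # False; a plain slice without .copy() would report True
```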
7. Which profiling tool gives you line-by-line execution times within a single function?
- A) `cProfile`
- B) `timeit`
- C) `line_profiler`
- D) `memory_profiler`
Answer
**C)** `line_profiler` *Explanation:* `cProfile` profiles at the function level (how many calls and total time per function). `timeit` measures the total time of a specific code snippet. `line_profiler` (accessed via `%lprun` in Jupyter) shows the time spent on each individual line within a function, which is the most granular profiling available. `memory_profiler` tracks memory usage per line. See Section 5.5.2.
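A typical Jupyter workflow for `line_profiler` (sketch; assumes the `line_profiler` package is installed, and `slow_fn` is a hypothetical function defined in the notebook):

```python
# In a Jupyter notebook:
%load_ext line_profiler

def slow_fn(n):
    total = 0
    for i in range(n):   # line_profiler reports the time spent on each line
        total += i ** 2
    return total

# Profile slow_fn line by line while running it once
%lprun -f slow_fn slow_fn(1_000_000)
```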
8. What is the recommended way to solve the linear system $\mathbf{Ax} = \mathbf{b}$ in NumPy?
- A) `x = np.linalg.inv(A) @ b`
- B) `x = np.linalg.solve(A, b)`
- C) `x = A / b`
- D) `x = np.dot(np.linalg.inv(A), b)`
Answer
**B)** `x = np.linalg.solve(A, b)` *Explanation:* `solve` uses LU decomposition to compute the solution directly, which is both faster (fewer floating-point operations) and more numerically stable than explicitly computing the inverse. Computing `inv(A)` amplifies numerical errors, especially for ill-conditioned matrices. Options A and D both use the inferior inverse approach. Option C is element-wise division, which is mathematically incorrect. See Section 5.2.7 and Chapter 2.
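A short sketch comparing the two approaches on a random system (illustrative; the residual of the direct solve is typically at least as small as that of the explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
b = rng.standard_normal(500)

x_solve = np.linalg.solve(A, b)   # LU-based direct solve (preferred)
x_inv = np.linalg.inv(A) @ b      # explicit inverse (slower, less stable)

print(np.linalg.norm(A @ x_solve - b))  # residual of the direct solve
print(np.linalg.norm(A @ x_inv - b))    # residual of the inverse-based solve
```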
9. What does the `keepdims=True` parameter do in NumPy reduction operations like `np.sum()`?
- A) Keeps the original array unchanged
- B) Retains the reduced dimension as size 1, enabling broadcasting with the original array
- C) Prevents the function from modifying the input array
- D) Keeps the data type the same as the input
Answer
**B)** Retains the reduced dimension as size 1, enabling broadcasting with the original array *Explanation:* When you compute `X.sum(axis=1)` on a `(3, 4)` array, the result has shape `(3,)`. With `keepdims=True`, the result has shape `(3, 1)`, which can be broadcast against the original `(3, 4)` array. This is essential for operations like centering (`X - X.mean(axis=1, keepdims=True)`). See Section 5.2.3.
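A minimal sketch of row-wise centering with and without `keepdims`:

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(3, 4)

print(X.mean(axis=1).shape)                 # (3,)  -- subtracting this directly would raise a broadcasting error
print(X.mean(axis=1, keepdims=True).shape)  # (3, 1) -- broadcasts cleanly against (3, 4)

centered = X - X.mean(axis=1, keepdims=True)
print(centered.mean(axis=1))                # [0. 0. 0.]
```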
10. Which of the following is the most important benefit of using virtual environments for AI projects?
- A) They make Python code run faster
- B) They provide GPU access for deep learning
- C) They isolate project dependencies to prevent conflicts and ensure reproducibility
- D) They automatically optimize package installations
Answer
**C)** They isolate project dependencies to prevent conflicts and ensure reproducibility *Explanation:* Virtual environments create isolated Python installations where each project can have its own dependency versions. This prevents the "works on my machine" problem and ensures that experiments are reproducible. They do not affect performance (A), GPU access (B), or installation optimization (D). See Section 5.6.

Section 2: True/False (1 point each)
For each statement, indicate whether it is True or False.
11. NumPy arrays can contain elements of mixed types (e.g., integers and strings in the same array).
Answer
**False** (with nuance) *Explanation:* NumPy arrays are homogeneous -- all elements share a single dtype. If you try to create an array with mixed types, NumPy will upcast to the broadest type: `np.array([1, 'hello'])` creates an array of Unicode string dtype, with the integer converted to the string `'1'`, rather than raising an error.

12. The pandas `groupby().transform()` method returns a result with the same shape as the input DataFrame.
Answer
**True** *Explanation:* Unlike `groupby().agg()`, which returns one row per group, `transform()` broadcasts the group-level result back to the original DataFrame shape. This is useful for operations like computing z-scores within groups: `df.groupby('category')['value'].transform(lambda x: (x - x.mean()) / x.std())`. See Section 5.3.4.
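A runnable sketch contrasting the two result shapes (toy data assumed):

```python
import pandas as pd

df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'value': [1.0, 3.0, 10.0, 30.0]})

agg_result = df.groupby('category')['value'].mean()                # one row per group
trans_result = df.groupby('category')['value'].transform('mean')   # same length as df

print(agg_result.shape, trans_result.shape)  # (2,) (4,)
```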
13. In matplotlib, `plt.plot()` and the object-oriented `ax.plot()` produce identical results and are equally suitable for complex multi-panel figures.
Answer
**False** *Explanation:* While both can produce the same visual output for simple cases, the pyplot state-based API (`plt.plot()`) operates on the "current axes" implicitly, which becomes ambiguous and error-prone with multi-panel figures. The OO API (`ax.plot()`) explicitly specifies which axes to draw on, making complex layouts maintainable and less bug-prone. See Section 5.4.1.
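A minimal sketch of the OO style for a two-panel figure, where each call names its target axes explicitly:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, np.sin(x))   # unambiguous: draws on the left panel
ax2.plot(x, np.cos(x))   # unambiguous: draws on the right panel
ax1.set_title('sin(x)')
ax2.set_title('cos(x)')
fig.tight_layout()
plt.show()
```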
14. When using `pip freeze > requirements.txt`, the resulting file pins exact package versions, ensuring reproducibility.
Answer
**True** *Explanation:* `pip freeze` outputs every installed package with its exact version (e.g., `numpy==1.26.4`). This pins the versions so that `pip install -r requirements.txt` on another machine will install identical versions. Note that it does not pin the Python version itself or system-level dependencies, so full reproducibility may additionally require specifying the Python version and OS. See Section 5.6.

15. Broadcasting in NumPy always creates a physical copy of the smaller array to match the larger array's shape.
Answer
**False** *Explanation:* Broadcasting is a virtual mechanism -- it does not allocate memory for the expanded array. Instead, NumPy uses stride tricks to make the smaller array appear larger by repeating its values through zero-stride dimensions. This is why broadcasting is both memory-efficient and fast. See Section 5.2.3.
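The zero-stride trick can be inspected directly with `np.broadcast_to` (minimal sketch):

```python
import numpy as np

row = np.array([1.0, 2.0, 3.0, 4.0])        # shape (4,), strides (8,)
expanded = np.broadcast_to(row, (1000, 4))  # shape (1000, 4), but no new data is allocated

print(expanded.strides)                 # (0, 8): stride 0 along axis 0 -> the same row is reused
print(np.shares_memory(expanded, row))  # True: the "expanded" array is a view, not a copy
```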
16. The `%timeit` magic command in Jupyter automatically runs the code multiple times and reports statistical summaries to account for measurement noise.
Answer
**True** *Explanation:* `%timeit` automatically chooses an appropriate number of loops and repeats, then reports the mean and standard deviation across the repeated runs. This accounts for system noise (other processes, cache state, etc.) and gives more reliable timing than a single measurement. See Section 5.5.1.

Section 3: Fill in the Blank (1 point each)
17. In NumPy, the operation `a[a > 0]` is called __ indexing (or masking).
Answer
**boolean** *Explanation:* `a > 0` produces a boolean array, and using it as an index selects only the elements where the condition is True. This is called boolean indexing or boolean masking. See Section 5.2.5.
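A minimal sketch of boolean masking:

```python
import numpy as np

a = np.array([-2, -1, 0, 1, 2])
mask = a > 0      # array([False, False, False,  True,  True])
print(a[mask])    # [1 2]
print(a[a > 0])   # same result, written inline
```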
18. The pandas method __ is used to combine two DataFrames on a shared key column, similar to a SQL JOIN.
Answer
**merge** (or `pd.merge()`) *Explanation:* `pd.merge(left, right, on='key', how='inner')` performs SQL-style joins. The `how` parameter supports 'inner', 'left', 'right', and 'outer' joins. See Section 5.3.6.
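A short sketch of a left join on a shared key (toy frames assumed):

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['Ana', 'Ben', 'Cam']})
orders = pd.DataFrame({'customer_id': [1, 1, 3], 'amount': [25.0, 40.0, 10.0]})

# Keep every customer; customers without orders get NaN in 'amount'
merged = pd.merge(customers, orders, on='customer_id', how='left')
print(merged)
```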
19. When optimizing Python code, you should always __ before optimizing, to identify the actual bottleneck rather than guessing.
Answer
**profile** *Explanation:* "Premature optimization is the root of all evil" (Donald Knuth). Profiling with tools like cProfile, line_profiler, or memory_profiler identifies where the code actually spends its time, preventing you from optimizing fast parts while ignoring the real bottleneck. See Section 5.5.2.
20. A NumPy array stored in C-order (row-major) has the elements of each __ contiguous in memory, while Fortran-order has the elements of each __ contiguous.
Answer
**row; column** *Explanation:* C-order stores data row-by-row (the last index changes fastest), while Fortran-order stores column-by-column (the first index changes fastest). This affects cache performance when iterating over specific dimensions. See Section 5.2.2.
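The difference is visible in the strides (minimal sketch):

```python
import numpy as np

c = np.zeros((3, 4), dtype=np.float64, order='C')  # row-major
f = np.zeros((3, 4), dtype=np.float64, order='F')  # column-major

print(c.strides)  # (32, 8): elements within a row sit 8 bytes apart
print(f.strides)  # (8, 24): elements within a column sit 8 bytes apart
```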
Section 4: Short Answer (2 points each)
Write 2-4 sentences for each answer.
21. Explain why method chaining in pandas (e.g., `df.query(...).assign(...).groupby(...).agg(...)`) is preferred over storing intermediate variables for each step.
Sample Answer
Method chaining produces a clear, top-to-bottom data flow that reads like a recipe, making the transformation logic immediately apparent. It avoids polluting the namespace with intermediate variables that are only used once, reducing the chance of accidentally using a stale intermediate result. Chained intermediates can also be garbage-collected as soon as the next step runs, whereas named intermediates stay alive until they go out of scope. The approach mirrors the Unix pipe philosophy where each step transforms data and passes it forward.

*Key points for full credit:*
- Readability and clear data flow
- Avoids unnecessary intermediate variables
- Reduced risk of bugs from stale intermediates
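A small sketch of the chained style (hypothetical columns `a`, `b`, and `group` assumed):

```python
import pandas as pd

df = pd.DataFrame({'group': ['x', 'x', 'y'], 'a': [1, 2, 3], 'b': [4, 5, 6]})

result = (
    df
    .query('a > 1')                  # filter rows
    .assign(ab=lambda d: d.a * d.b)  # derive a new column
    .groupby('group')['ab']
    .agg('sum')                      # aggregate per group
)
print(result)
```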
22. A teammate writes: `for idx, row in df.iterrows(): df.loc[idx, 'new_col'] = row['a'] * row['b']`. Explain why this is poor practice and how to fix it.
Sample Answer
Using `iterrows()` to modify a DataFrame is poor practice for two reasons: (1) it processes one row at a time in a Python loop, making it orders of magnitude slower than vectorized operations on large DataFrames, and (2) modifying a DataFrame while iterating over it is explicitly discouraged by pandas and can trigger `SettingWithCopyWarning` or silently inconsistent results. The correct approach is `df['new_col'] = df['a'] * df['b']`, which uses vectorized multiplication. On a DataFrame with 1 million rows, the vectorized version is typically 100-1000x faster.

*Key points for full credit:*
- Performance: Python loop vs. vectorized operation
- Correctness: risks of modifying a DataFrame during iteration
- Correct fix using vectorized multiplication
23. Why should you call `set_global_seed()` at the beginning of every AI experiment, and what are its limitations?
Sample Answer
Setting a global seed ensures that random operations (data splitting, weight initialization, dropout masks, data shuffling) produce identical results across runs, enabling fair comparison of different model configurations. Without seed setting, apparent performance differences between experiments might be due to random variation rather than meaningful improvements. However, seed setting has limitations: it does not guarantee reproducibility across different hardware (GPU vs. CPU), different library versions, or non-deterministic operations like certain GPU-accelerated kernels. It also does not address other sources of non-reproducibility such as different OS versions or multithreaded race conditions.

*Key points for full credit:*
- Enables fair experiment comparison
- Covers multiple random sources (NumPy, Python random, PyTorch)
- Limitations: hardware differences, non-deterministic GPU ops
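A sketch of what such a helper typically looks like (the chapter's actual `set_global_seed()` may differ; the PyTorch calls are only attempted if torch is installed):

```python
import os
import random

import numpy as np

def set_global_seed(seed: int = 42) -> None:
    """Seed the common sources of randomness (sketch; the chapter's helper may differ)."""
    random.seed(seed)                       # Python's built-in RNG
    np.random.seed(seed)                    # NumPy's legacy global RNG
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import torch
        torch.manual_seed(seed)             # CPU RNG
        torch.cuda.manual_seed_all(seed)    # all CUDA devices, if present
    except ImportError:
        pass                                # PyTorch not installed; skip
```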
Section 5: Code Analysis (2 points each)
24. What is the output of the following code?
```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = a.T
b[0, 0] = 99
print(a[0, 0])
print(a.shape, b.shape)
```
Answer
```
99
(2, 3) (3, 2)
```
*Explanation:* `a.T` (transpose) returns a view of the array, not a copy. The transposed array `b` shares the same underlying data buffer as `a`. Modifying `b[0, 0]` changes the same memory location as `a[0, 0]`. The shapes are transposed: `a` is `(2, 3)` and `b` is `(3, 2)`. This is a common source of bugs when people assume transpose creates a copy.
25. The following code contains a bug. Identify it and explain how to fix it.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B'],
    'value': [10, 20, 30, np.nan, 50]
})

# Intent: Fill NaN with group mean
df['value_filled'] = df.groupby('group')['value'].mean()
print(df)
```
Answer
**Bug:** `groupby().mean()` returns a Series indexed by the group labels ('A', 'B'), not by the original DataFrame index. Assigning this back to the DataFrame produces NaN for all rows because the group-label index does not match the integer index.

**Fix:**

```python
df['value_filled'] = df.groupby('group')['value'].transform('mean')
```
*Explanation:* The `transform()` method applies the function within each group but broadcasts the result back to the original DataFrame shape, preserving the original index. Alternatively, you could use `df['value'].fillna(df.groupby('group')['value'].transform('mean'))` to only fill NaN values while keeping non-NaN values unchanged.
Section 6: Applied Problem (5 points)
26. You are given a dataset of 100,000 customer transactions with the following schema:
customer_id (int), transaction_date (datetime), amount (float),
category (str: 'electronics', 'groceries', 'clothing', 'dining'),
is_fraud (bool)
Using pandas and matplotlib, outline (with code) a complete analysis pipeline that:
a) (1 point) Loads the data, checks for missing values, and computes basic summary statistics.
b) (2 points) Creates a monthly spending trend plot (line chart) with separate lines for each category. The plot should have proper labels, legend, and formatting.
c) (2 points) Computes the fraud rate per category and per month, creates a heatmap visualization, and identifies the category-month combination with the highest fraud rate.
Answer
**a)** Data loading and exploration:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('transactions.csv', parse_dates=['transaction_date'])
print(f"Shape: {df.shape}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nSummary:\n{df.describe()}")
print(f"\nCategory distribution:\n{df['category'].value_counts()}")
print(f"\nFraud rate: {df['is_fraud'].mean():.4f}")
```
**b)** Monthly spending trends:

```python
df['month'] = df['transaction_date'].dt.to_period('M')
monthly_category = (
    df
    .groupby(['month', 'category'])['amount']
    .sum()
    .unstack(fill_value=0)
)

fig, ax = plt.subplots(figsize=(12, 6))
monthly_category.plot(ax=ax, linewidth=2, marker='o', markersize=4)
ax.set_xlabel('Month')
ax.set_ylabel('Total Spending ($)')
ax.set_title('Monthly Spending Trends by Category')
ax.legend(title='Category', frameon=True)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
**c)** Fraud rate analysis:

```python
fraud_pivot = df.pivot_table(
    values='is_fraud',
    index='category',
    columns=df['transaction_date'].dt.to_period('M'),
    aggfunc='mean'
)

fig, ax = plt.subplots(figsize=(14, 5))
sns.heatmap(fraud_pivot, annot=True, fmt='.3f', cmap='YlOrRd',
            ax=ax, linewidths=0.5)
ax.set_title('Fraud Rate by Category and Month')
ax.set_xlabel('Month')
ax.set_ylabel('Category')
plt.tight_layout()
plt.show()

# Find highest fraud rate
max_fraud = fraud_pivot.stack()
max_idx = max_fraud.idxmax()
print(f"Highest fraud rate: {max_fraud.max():.4f}")
print(f"Category: {max_idx[0]}, Month: {max_idx[1]}")
```
*Key points for full credit:*
- Correct use of pandas groupby/pivot_table
- Proper datetime handling with `.dt` accessor
- Clear, labeled visualizations
- Identification of the maximum fraud rate combination
Scoring
| Section | Points | Your Score |
|---|---|---|
| Multiple Choice (1-10) | 10 | ___ |
| True/False (11-16) | 6 | ___ |
| Fill in Blank (17-20) | 4 | ___ |
| Short Answer (21-23) | 6 | ___ |
| Code Analysis (24-25) | 4 | ___ |
| Applied Problem (26) | 5 | ___ |
| Total | 35 | ___ |
Passing Score: 25/35 (70%)
Review Recommendations
- Score < 50%: Re-read entire chapter, focusing on Sections 5.2 (NumPy) and 5.3 (pandas)
- Score 50-70%: Review Sections 5.4 (visualization) and 5.5 (profiling), redo exercises Part A-B
- Score 70-85%: Good understanding! Review any missed topics before proceeding
- Score > 85%: Excellent! Ready for Chapter 6: Supervised Learning