Exercises: Python for AI Engineering

These exercises progress from foundational concept checks to challenging applications. Estimated completion time: 4 hours.

Scoring Guide:
- ★ Foundational (5-10 min each)
- ★★ Intermediate (10-20 min each)
- ★★★ Challenging (20-40 min each)
- ★★★★ Advanced/Research (40+ min each)


Part A: Conceptual Understanding ★

Test your understanding of core concepts. No calculations required.

A.1. Explain why a Python list of 1,000,000 floats uses approximately 32 MB of memory while a NumPy array of the same data uses approximately 8 MB. What structural differences between Python objects and NumPy array elements account for this?

A.2. A colleague writes the following code and complains that it is slow:

result = []
for i in range(len(matrix)):
    for j in range(len(matrix[0])):
        result.append(matrix[i][j] ** 2)

Explain (a) why this is slow, and (b) how to rewrite it using NumPy vectorization.
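
If you want to back your explanation with numbers, here is a minimal timing harness you can adapt. The test matrix size, function names, and the NotImplementedError stub are illustrative; fill in your own vectorized rewrite.

import time
import numpy as np

# Hypothetical test input: a 1,000 x 1,000 nested Python list.
matrix = [[float(i * 1000 + j) for j in range(1000)] for i in range(1000)]

def squared_loop(m):
    """The colleague's loop-based version, unchanged."""
    result = []
    for i in range(len(m)):
        for j in range(len(m[0])):
            result.append(m[i][j] ** 2)
    return result

def squared_vectorized(m):
    """Your NumPy rewrite goes here."""
    raise NotImplementedError

for fn in (squared_loop, squared_vectorized):
    start = time.perf_counter()
    try:
        fn(matrix)
        print(f"{fn.__name__}: {time.perf_counter() - start:.3f} s")
    except NotImplementedError:
        print(f"{fn.__name__}: not implemented yet")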

A.3. Explain the three broadcasting rules in your own words. For each rule, provide a concrete example with specific array shapes.

A.4. Compare and contrast df.loc[] and df.iloc[] in pandas. When would you use each? Give a scenario where using the wrong one could lead to subtle bugs.

A.5. A colleague claims: "Jupyter notebooks are all you need for a production AI system." Explain why this is incorrect, and describe the appropriate role of notebooks in an AI engineering workflow.

A.6. True or False: "NumPy's @ operator for matrix multiplication always creates a new array in memory." Justify your answer with reference to how NumPy manages memory.

A.7. Explain the difference between a NumPy view and a NumPy copy. Why does basic slicing create a view while fancy indexing creates a copy? What are the implications for memory usage and data safety?
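
To ground your answer, a short experiment you can run (the array and slice choices are arbitrary):

import numpy as np

a = np.arange(10)

s = a[2:6]                      # basic slicing
f = a[[2, 3, 4, 5]]             # fancy (integer-array) indexing

print(s.base is a)              # True: the slice is a view onto a's buffer
print(f.base is a)              # False: fancy indexing produced a copy
print(np.shares_memory(a, s))   # True
print(np.shares_memory(a, f))   # False

s[0] = 99
print(a[2])                     # 99 -- writing through the view changed the original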

A.8. What is the Global Interpreter Lock (GIL), and why is it less of a problem for NumPy-heavy AI code than for general Python multithreading? Reference the "glue language" architecture in your answer.

A.9. Explain why np.linalg.solve(A, b) is preferred over np.linalg.inv(A) @ b for solving the linear system $\mathbf{Ax} = \mathbf{b}$. Provide both a numerical stability argument and a computational cost argument.
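
You can support the numerical-stability part of your answer empirically. The sketch below uses a Hilbert matrix as a classic ill-conditioned example; the size and random seed are arbitrary, and the exact error values will vary.

import numpy as np

rng = np.random.default_rng(0)

n = 12
A = 1.0 / (np.arange(1, n + 1)[:, None] + np.arange(n)[None, :])  # Hilbert matrix
x_true = rng.standard_normal(n)
b = A @ x_true

x_solve = np.linalg.solve(A, b)
x_inv = np.linalg.inv(A) @ b

print("cond(A):     ", np.linalg.cond(A))
print("solve error: ", np.linalg.norm(x_solve - x_true))
print("inv   error: ", np.linalg.norm(x_inv - x_true))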


Part B: Calculations and Short Code ★★

Apply concepts to solve problems. Show all work or provide complete code.

B.1. Determine the output shapes for each of the following broadcasting operations. If an operation would raise an error, explain why.

| Operation | Array A Shape | Array B Shape | Result Shape |
|-----------|---------------|---------------|--------------|
| A + B     | (3, 4)        | (4,)          | ?            |
| A * B     | (2, 3, 4)     | (3, 1)        | ?            |
| A - B     | (5, 1)        | (1, 6)        | ?            |
| A @ B     | (3, 4)        | (4, 2)        | ?            |
| A + B     | (3, 4)        | (3,)          | ?            |
| A * B     | (2, 1, 4)     | (3, 4)        | ?            |
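
After working the table by hand, you can check your answers programmatically. np.broadcast_shapes applies the broadcasting rules without allocating data; the A @ B row follows matmul rules instead, so it is checked separately.

import numpy as np

cases = [
    ("+", (3, 4), (4,)),
    ("*", (2, 3, 4), (3, 1)),
    ("-", (5, 1), (1, 6)),
    ("+", (3, 4), (3,)),
    ("*", (2, 1, 4), (3, 4)),
]

for op, shape_a, shape_b in cases:
    try:
        print(op, shape_a, shape_b, "->", np.broadcast_shapes(shape_a, shape_b))
    except ValueError as err:
        print(op, shape_a, shape_b, "-> error:", err)

# The @ row uses matrix-multiplication rules, not broadcasting:
print("@", (3, 4), (4, 2), "->", (np.zeros((3, 4)) @ np.zeros((4, 2))).shape)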

B.2. Given the following array, predict the output of each operation without running the code:

a = np.arange(12).reshape(3, 4)
# a = [[ 0,  1,  2,  3],
#      [ 4,  5,  6,  7],
#      [ 8,  9, 10, 11]]

a) a[1:, ::2]
b) a[a > 5]
c) a.sum(axis=0)
d) a.reshape(2, 6).strides (assuming float64)
e) np.where(a % 3 == 0, a, -1)

B.3. Write a vectorized NumPy function row_normalize(X) that normalizes each row of a matrix to have unit L2 norm. Verify that each output row has norm 1.0 (within floating-point tolerance).
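
A quick check harness for your implementation; it assumes you keep the name row_normalize and have defined it in the same session.

import numpy as np

X = np.random.default_rng(42).standard_normal((100, 8))
Y = row_normalize(X)   # your function from this exercise

assert Y.shape == X.shape
assert np.allclose(np.linalg.norm(Y, axis=1), 1.0)
print("all row norms are 1.0 within tolerance")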

B.4. Given the following pandas DataFrame:

df = pd.DataFrame({
    'department': ['eng', 'eng', 'sales', 'sales', 'eng', 'sales'],
    'salary': [120000, 95000, 85000, 110000, 105000, 90000],
    'years': [5, 2, 3, 8, 4, 1]
})

Write a single method chain that: (a) adds a column salary_per_year = salary / years, (b) groups by department, (c) computes the mean salary_per_year, and (d) sorts by mean descending.

B.5. Calculate the strides for a NumPy array with shape (3, 4, 5) and dtype float32 in C-order and Fortran-order. Show your calculations.
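
You can verify your hand calculation against NumPy itself; the shapes and dtype below match the problem statement.

import numpy as np

c_order = np.zeros((3, 4, 5), dtype=np.float32, order="C")
f_order = np.zeros((3, 4, 5), dtype=np.float32, order="F")

print("C-order strides:      ", c_order.strides)
print("Fortran-order strides:", f_order.strides)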

B.6. Write a pandas operation that performs a left join between a users DataFrame (columns: user_id, name) and an events DataFrame (columns: event_id, user_id, timestamp, event_type), then counts the number of events per user, filling zeros for users with no events.
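
A small synthetic dataset for testing your answer; the column names follow the problem statement, while the values are made up.

import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "name": ["ada", "ben", "chi", "dev"],
})

events = pd.DataFrame({
    "event_id": [10, 11, 12, 13, 14],
    "user_id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02",
                                 "2024-01-03", "2024-01-04"]),
    "event_type": ["click", "view", "click", "click", "view"],
})

# Users 3 and 4 have no events, so a correct solution should report 0 for them.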

B.7. The following code has a memory problem when processing a 10 GB CSV file on a machine with 16 GB RAM. Identify the issue and propose a fix:

df = pd.read_csv('huge_file.csv')
result = df.groupby('category').agg({'value': 'mean'})

Part C: Programming Challenges ★★-★★★

Implement solutions in Python. All code should be well-documented and tested.

C.1. Vectorized Statistics ★★

Write a function compute_statistics that takes a 2D NumPy array and returns a dictionary containing, for each column: mean, median, standard deviation, skewness, and kurtosis. The function must be fully vectorized (no Python loops over columns).

def compute_statistics(X: np.ndarray) -> dict[str, np.ndarray]:
    """
    Compute descriptive statistics for each column of X.

    Args:
        X: Array of shape (n_samples, n_features).

    Returns:
        Dictionary mapping statistic names to arrays of shape (n_features,).

    Examples:
        >>> X = np.array([[1, 2], [3, 4], [5, 6]])
        >>> stats = compute_statistics(X)
        >>> np.allclose(stats['mean'], [3.0, 4.0])
        True
    """
    # Your code here
    pass

# Test cases
X = np.random.randn(10000, 5)
stats = compute_statistics(X)
assert stats['mean'].shape == (5,)
assert np.allclose(stats['mean'], X.mean(axis=0))
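
If SciPy is installed, you can cross-check the higher moments as well. The key names 'skewness' and 'kurtosis' and the use of biased (population) moments are assumptions here; adapt them to your implementation.

import scipy.stats as sps  # optional cross-check; requires SciPy

# Key names and moment conventions below are assumptions -- match them to your code.
assert np.allclose(stats['skewness'], sps.skew(X, axis=0))
# scipy.stats.kurtosis returns *excess* kurtosis (0 for a normal distribution) by default.
assert np.allclose(stats['kurtosis'], sps.kurtosis(X, axis=0))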

C.2. Data Cleaning Pipeline ★★

Using pandas, create a function clean_dataset that takes a raw DataFrame and performs the following operations:
- Drops duplicate rows
- Fills missing numerical values with column medians
- Fills missing categorical values with the mode
- Converts string columns to lowercase
- Removes rows where any numerical column has a value beyond 3 standard deviations from the mean
- Returns the cleaned DataFrame and a report dictionary with counts of modifications

def clean_dataset(
    df: pd.DataFrame,
    numeric_cols: list[str],
    categorical_cols: list[str]
) -> tuple[pd.DataFrame, dict[str, int]]:
    """
    Clean a raw dataset with comprehensive handling.

    Args:
        df: Raw input DataFrame.
        numeric_cols: List of numerical column names.
        categorical_cols: List of categorical column names.

    Returns:
        Tuple of (cleaned_df, report) where report contains
        counts of each modification type.
    """
    # Your code here
    pass
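
A synthetic messy DataFrame you can use to exercise your function. The column names, value ranges, and the injected outlier are arbitrary; they are chosen only so that every cleaning step has something to do.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
raw = pd.DataFrame({
    "age":    rng.normal(35, 5, n).round(1),
    "salary": rng.normal(60000, 8000, n).round(-2),
    "city":   rng.choice(["NYC", "SF", "LA"], n),
})
raw.loc[::50, "age"] = np.nan              # sprinkle missing numeric values
raw.loc[::73, "city"] = None               # and missing categories
raw.loc[5, "salary"] = 10_000_000          # one extreme outlier
raw = pd.concat([raw, raw.head(3)], ignore_index=True)   # three duplicate rows

cleaned, report = clean_dataset(raw,
                                numeric_cols=["age", "salary"],
                                categorical_cols=["city"])
print(report)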

C.3. Custom Visualization Function ★★

Write a function plot_feature_importance that takes a list of feature names and their importance scores, then creates a horizontal bar chart with:
- Bars sorted by importance (highest at top)
- Color gradient from low to high importance
- Importance values annotated on each bar
- Clean formatting suitable for a report

def plot_feature_importance(
    feature_names: list[str],
    importances: np.ndarray,
    top_n: int = 15,
    title: str = "Feature Importance",
    save_path: str | None = None
) -> tuple[plt.Figure, plt.Axes]:
    """Create a publication-quality feature importance plot."""
    # Your code here
    pass
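
Synthetic inputs for trying out your plot; the feature names and random importances are purely illustrative.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
names = [f"feature_{i:02d}" for i in range(30)]
scores = rng.random(30)

fig, ax = plot_feature_importance(names, scores, top_n=10,
                                  title="Synthetic Feature Importance")
plt.show()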

C.4. Optimization Challenge ★★★

The following function computes the covariance matrix but is inefficient. Rewrite it to be at least 100x faster on a (10000, 100) array using NumPy vectorization.

def slow_covariance(X):
    """Compute covariance matrix using explicit loops."""
    n, d = X.shape
    mean = [sum(X[:, j]) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            for k in range(n):
                cov[i][j] += (X[k, i] - mean[i]) * (X[k, j] - mean[j])
            cov[i][j] /= (n - 1)
    return cov

Requirements:
- Your solution must produce results matching np.cov(X, rowvar=False) within 1e-10 tolerance
- It must not call np.cov; implement the vectorized math yourself
- Include benchmarking code showing the speedup
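
A benchmarking harness you can adapt; the name fast_covariance is a placeholder for your implementation, and the slow version is timed on a small slice only.

import time
import numpy as np

def fast_covariance(X):
    """Your vectorized implementation goes here."""
    raise NotImplementedError

X = np.random.default_rng(0).standard_normal((10000, 100))
reference = np.cov(X, rowvar=False)   # allowed for *checking*, not inside your solution

# Time the slow version on a small slice -- the full array would take far too long.
start = time.perf_counter()
slow_covariance(X[:200])
print(f"slow_covariance, 200 rows:  {time.perf_counter() - start:.2f} s")

try:
    start = time.perf_counter()
    fast = fast_covariance(X)
    elapsed = time.perf_counter() - start
    assert np.allclose(fast, reference, atol=1e-10)
    print(f"fast_covariance, all rows: {elapsed:.4f} s (matches np.cov)")
except NotImplementedError:
    print("implement fast_covariance first")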

C.5. Memory-Efficient Data Processing ★★★

Write a function that reads a large CSV file in chunks, computes per-column statistics (mean, std, min, max, count), and returns the aggregated results without ever loading the full file into memory.

def chunked_statistics(
    filepath: str,
    chunk_size: int = 100000,
    numeric_only: bool = True
) -> pd.DataFrame:
    """
    Compute column statistics from a CSV file using chunked reading.

    This uses Welford's online algorithm for numerically stable
    mean and variance computation.

    Args:
        filepath: Path to CSV file.
        chunk_size: Number of rows per chunk.
        numeric_only: If True, only compute stats for numeric columns.

    Returns:
        DataFrame with statistics as rows and columns as columns.
    """
    # Your code here
    pass
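
One way to test your function: write a moderately large CSV to disk, then compare your chunked statistics with what pandas computes when it can load the whole file (feasible only at this test size). The file path and the assumption that your result uses "mean" as a row label are placeholders; adjust to your layout.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
test_path = "chunk_test.csv"   # hypothetical test file
pd.DataFrame({
    "a": rng.standard_normal(1_000_000),
    "b": rng.integers(0, 100, 1_000_000),
}).to_csv(test_path, index=False)

chunked = chunked_statistics(test_path, chunk_size=100_000)
full = pd.read_csv(test_path)

# Means should agree to high precision; std may differ slightly depending on ddof.
for col in ["a", "b"]:
    assert abs(chunked.loc["mean", col] - full[col].mean()) < 1e-9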

C.6. Broadcasting Mastery ★★★

Implement the following operations using only broadcasting and vectorized NumPy operations (no loops, no np.apply_along_axis):

a) Given points $\mathbf{X} \in \mathbb{R}^{n \times d}$ and centroids $\mathbf{C} \in \mathbb{R}^{k \times d}$, compute the matrix of Euclidean distances $\mathbf{D} \in \mathbb{R}^{n \times k}$ where $D_{ij} = \|\mathbf{x}_i - \mathbf{c}_j\|_2$.

b) Given the distance matrix from (a), assign each point to its nearest centroid (return an array of cluster indices of shape $(n,)$).

c) Given points and cluster assignments, compute new centroids as the mean of all points assigned to each cluster.

These three operations form the core of the K-means algorithm that you will study in Chapter 7.
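
To validate your vectorized answers, you can compare against a deliberately naive, loop-based reference on a small input. The function names pairwise_distances, assign_clusters, and update_centroids are placeholders for whatever you write for parts (a)-(c).

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))      # n=50 points in d=3 dimensions
C = rng.standard_normal((4, 3))       # k=4 centroids

# Naive reference for part (a): fine for checking, too slow for real use.
D_ref = np.array([[np.linalg.norm(x - c) for c in C] for x in X])

D = pairwise_distances(X, C)               # your vectorized part (a)
assert np.allclose(D, D_ref)

labels = assign_clusters(D)                # your part (b); expected shape (50,)
assert np.array_equal(labels, D_ref.argmin(axis=1))

new_C = update_centroids(X, labels, k=4)   # your part (c); expected shape (4, 3)
assert new_C.shape == C.shape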

C.7. pandas Performance Comparison ★★

For a DataFrame with 1 million rows and 10 columns, benchmark the following three approaches to creating a new column based on a conditional:

a) Using iterrows()
b) Using apply()
c) Using np.where() / vectorized operations

Report the time for each approach and compute the speedup ratios.
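
A possible setup is sketched below. The column names and the conditional are arbitrary placeholders, and time.perf_counter is used rather than %timeit so the script runs outside Jupyter. Note that iterrows() is timed on a slice to keep the run time reasonable.

import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1_000_000, 10)),
                  columns=[f"col{i}" for i in range(10)])

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label:12s} {time.perf_counter() - start:8.3f} s")

# Example conditional: flag rows where col0 is positive.
timed("np.where", lambda: df.assign(flag=np.where(df["col0"] > 0, "high", "low")))
timed("apply", lambda: df.assign(flag=df["col0"].apply(lambda x: "high" if x > 0 else "low")))

def with_iterrows():
    subset = df.head(100_000)              # only 100k of the 1M rows
    flags = []
    for _, row in subset.iterrows():
        flags.append("high" if row["col0"] > 0 else "low")
    return subset.assign(flag=flags)

timed("iterrows*", with_iterrows)          # * partial run; scale up to estimate the full cost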


Part D: Analysis & Interpretation ★★-★★★

Apply concepts to realistic scenarios. Justify your reasoning.

D.1. Profiling Analysis ★★

You run cProfile on an AI training pipeline and get the following output (simplified):

   ncalls  tottime  cumtime  function
   100000    45.2    45.2    data_augmentation
     1000     3.1    48.3    train_batch
      100     0.5    48.8    train_epoch
    50000    12.7    12.7    compute_loss
   100000     8.3     8.3    forward_pass

a) Which function is the biggest bottleneck?
b) What optimization strategy would you prioritize?
c) If you reduce data_augmentation time by 50%, what is the expected overall speedup (use Amdahl's Law)?
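
For part (c), recall Amdahl's Law: if a fraction $p$ of the total runtime is spent in the code you optimize, and that code becomes $s$ times faster, the expected overall speedup is

$$\text{speedup} = \frac{1}{(1 - p) + p/s}.$$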

D.2. Environment Debugging ★★★

A colleague sends you a Jupyter notebook that works on their machine but fails on yours with the error: ImportError: cannot import name 'new_feature' from 'pandas'. They say "I just installed pandas." Provide a systematic debugging procedure, identifying at least five possible causes and their solutions.

D.3. Memory Layout Analysis ★★★

You have a 3D array of shape (1000, 100, 50) with dtype float64. You need to frequently compute:
- Operation A: Sum along axis 0 (sum over first dimension)
- Operation B: Sum along axis 2 (sum over last dimension)

a) Which memory layout (C-order or Fortran-order) is better for Operation A? For Operation B?
b) If you can only choose one layout, which would you choose and why?
c) Estimate the cache performance difference for a CPU with 64-byte cache lines.
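
You can test your reasoning for parts (a) and (b) empirically; the absolute timings will vary by machine.

import time
import numpy as np

a_c = np.random.default_rng(0).standard_normal((1000, 100, 50))   # C-order by default
a_f = np.asfortranarray(a_c)                                       # same values, Fortran order

for label, arr in [("C-order", a_c), ("F-order", a_f)]:
    for axis in (0, 2):
        start = time.perf_counter()
        for _ in range(20):
            arr.sum(axis=axis)
        print(f"{label}, sum over axis {axis}: {time.perf_counter() - start:.3f} s")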


Part E: Research & Extension ★★★★

Open-ended problems for deeper exploration.

E.1. Polars vs. pandas Performance Study

Install Polars (pip install polars) and rewrite three of the pandas operations from Section 5.3 using Polars' lazy evaluation API. Benchmark both implementations on a dataset with at least 1 million rows. Write a report comparing:
- Execution speed
- Memory usage
- API ergonomics
- Suitability for different use cases

E.2. NumPy Internals Deep Dive

Using NumPy's source code (available on GitHub), trace the execution path of np.dot(A, B) for two 2D float64 arrays. Identify:
- Where the BLAS dispatch happens
- What checks are performed on the input arrays
- How the output array is allocated

Write a short report (500-800 words) on what you discovered.

E.3. Custom Profiling Tool

Build a Python context manager, profile_block, that captures:
- Wall-clock time
- Peak memory usage
- Number of NumPy array allocations

Use it to profile a complete data pipeline (load -> clean -> transform -> visualize) and generate a report.
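
A minimal starting point covering the first two measurements, using time.perf_counter and tracemalloc from the standard library. Counting NumPy array allocations is left to you; the function and label names here are only a sketch.

import time
import tracemalloc
from contextlib import contextmanager

@contextmanager
def profile_block(label="block"):
    """Report wall-clock time and peak traced memory for the enclosed block.

    Counting NumPy array allocations is left as part of the exercise.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"[{label}] {elapsed:.3f} s, peak {peak / 1e6:.1f} MB")

# Example usage (assuming pandas is imported and data.csv exists):
# with profile_block("load"):
#     df = pd.read_csv("data.csv")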

E.4. Visualization Library Comparison

Create the same complex multi-panel figure using matplotlib, plotly, and altair. Compare:
- Lines of code required
- Interactivity capabilities
- Rendering performance
- Output format options

Write a recommendation guide for when to use each library in AI engineering.


Solutions

Selected solutions are available in:
- code/exercise-solutions.py (programming problems)
- appendices/g-answers-to-selected-exercises.md (odd-numbered problems)

Full solutions are available to instructors at the publisher's website.