

Learning Objectives

  • Apply advanced NumPy operations including broadcasting, vectorization, and memory-efficient array manipulation to implement the linear algebra operations from Chapter 2
  • Use pandas to load, clean, transform, and aggregate real-world datasets for AI pipelines
  • Create publication-quality visualizations with matplotlib and seaborn to communicate statistical insights from Chapter 4
  • Configure reproducible Jupyter notebook workflows for exploratory AI research
  • Profile Python code to identify bottlenecks and apply optimization strategies for numerical workloads
  • Manage virtual environments and dependencies to ensure reproducible AI experiments
  • Write production-quality Python code following PEP 8, type hints, and Google-style docstrings

Chapter 5: Python for AI Engineering

"First, solve the problem. Then, write the code." -- John Johnson

Chapter Overview

In the preceding four chapters, you built a formidable mathematical toolkit: linear algebra for representing data and transformations (Chapter 2), calculus and optimization for training models (Chapter 3), and probability and statistics for reasoning under uncertainty (Chapter 4). But mathematics on a whiteboard does not train a neural network or process a terabyte of sensor data. The bridge between theory and practice is software -- and in the world of AI engineering, that software is overwhelmingly written in Python.

Python's dominance in AI is no accident. Its readable syntax lowers the barrier to translating mathematical ideas into working code, while its ecosystem of numerical libraries -- NumPy, pandas, matplotlib, and beyond -- provides performance that rivals compiled languages for array-based computation. Every major deep learning framework (PyTorch, TensorFlow, JAX) exposes a Python interface. Every major cloud platform offers Python SDKs for ML services. When a new research paper drops on arXiv, the reference implementation is almost certainly in Python.

This chapter transforms you from a Python user into a Python AI engineer. You will learn not just what functions to call, but why certain patterns are fast and others are catastrophically slow, how memory layout affects numerical performance, and how to structure projects so they remain maintainable as they grow from a single notebook to a production system. By the end, you will have the practical skills to implement every algorithm in the remaining 35 chapters of this book.

In this chapter, you will learn to:

  • Leverage NumPy broadcasting and vectorization to eliminate slow Python loops
  • Manipulate and analyze structured datasets with pandas DataFrames
  • Build layered, publication-quality visualizations using the matplotlib object-oriented API
  • Set up efficient Jupyter notebook workflows for AI experimentation
  • Profile and optimize Python code for numerical workloads
  • Manage reproducible environments with virtual environments and dependency pinning


5.1 The Python Ecosystem for AI Engineering

Python was not designed for scientific computing. Guido van Rossum created it in the late 1980s as a general-purpose scripting language emphasizing readability. Yet by the mid-2000s, the combination of NumPy, SciPy, and matplotlib had turned Python into the de facto language for computational science, and the rise of deep learning after 2012 cemented its position as the undisputed language of AI.

5.1.1 Why Python Dominates AI

Several factors explain Python's dominance, and understanding them helps you use the language more effectively:

Expressiveness and Readability. Python's clean syntax maps naturally to mathematical notation. Compare the matrix multiplication $\mathbf{C} = \mathbf{A}\mathbf{B}$ in Python (C = A @ B) versus the equivalent in C (nested loops with index arithmetic). When you are translating the gradient computations from Chapter 3 or the Bayesian updates from Chapter 4 into code, this expressiveness is not a luxury -- it reduces bugs and accelerates experimentation.

The "Glue Language" Architecture. Python excels as an interface layer that coordinates high-performance components written in C, C++, and Fortran. When you call np.dot(A, B), you are not running Python arithmetic -- you are dispatching to an optimized BLAS implementation (often Intel MKL or OpenBLAS) that processes the computation in compiled code. This architecture gives you the convenience of Python with the speed of C.

Ecosystem Breadth. The Python Package Index (PyPI) hosts over 500,000 packages. For AI engineering, the critical stack includes:

Layer                | Libraries                        | Purpose
---------------------|----------------------------------|------------------------------------------------
Numerical Foundation | NumPy, SciPy                     | Array computation, linear algebra, optimization
Data Manipulation    | pandas, Polars                   | Tabular data processing and analysis
Visualization        | matplotlib, seaborn, plotly      | Static and interactive plotting
Machine Learning     | scikit-learn, XGBoost, LightGBM  | Classical ML algorithms
Deep Learning        | PyTorch, TensorFlow, JAX         | Neural network training and inference
NLP / LLM            | Hugging Face Transformers, spaCy | Language model deployment
Experiment Tracking  | MLflow, Weights & Biases         | Reproducibility and logging
Serving              | FastAPI, Flask, Ray Serve        | Model deployment

Community and Momentum. The network effect is self-reinforcing: researchers publish in Python, companies hire Python developers, universities teach Python, and new tools target Python first. When a new technique appears in a conference paper, the open-source implementation is available on GitHub in Python within days -- sometimes hours. This velocity of knowledge transfer is unmatched by any other programming language in the AI space.

5.1.2 Python's Limitations and How to Work Around Them

Honesty about Python's weaknesses makes you a better engineer:

The Global Interpreter Lock (GIL). CPython's GIL prevents true multi-threaded parallelism for CPU-bound Python code. For AI workloads, this is less painful than it sounds because NumPy releases the GIL during array operations, and deep learning frameworks use their own threading models. When you do need CPU parallelism in Python, use multiprocessing, concurrent.futures.ProcessPoolExecutor, or libraries like Dask and Ray.
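A minimal sketch of the process-based workaround, using `concurrent.futures.ProcessPoolExecutor` from the standard library (the `cpu_bound` function here is an illustrative stand-in for any loop-heavy pure-Python task):

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n: int) -> int:
    """A CPU-bound task in pure Python: sum of squares below n."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Each task runs in a separate process with its own interpreter
    # and its own GIL, so the four calls can use four cores at once.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(cpu_bound, [100_000] * 4))
    print(results)
```

The `if __name__ == "__main__"` guard matters on platforms that spawn (rather than fork) worker processes, since each worker re-imports the module.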

Dynamic Typing Overhead. Every Python object carries type information, reference counts, and other metadata. A Python float is not 8 bytes -- on 64-bit CPython it is a full object of 24 bytes, and a list stores an 8-byte pointer to each one. A Python list of one million floats therefore consumes on the order of 32 MB, while a NumPy float64 array of the same data uses 8 MB. The lesson: never store numerical data in Python lists when NumPy arrays are available.
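You can verify the overhead yourself with `sys.getsizeof` (the exact figures below assume 64-bit CPython; other builds differ slightly):

```python
import sys
import numpy as np

n = 1_000_000
py_list = [float(i) for i in range(n)]
arr = np.arange(n, dtype=np.float64)

# The list holds 8-byte pointers; each float is a separate heap object.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
print(f"list:  ~{list_bytes / 1e6:.0f} MB")  # pointers plus per-object overhead
print(f"array: {arr.nbytes / 1e6:.0f} MB")   # 8 bytes per element, nothing else
```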

Startup and Interpretation Cost. Python is interpreted, making it slower than compiled languages for loop-heavy code. The mitigation strategies -- vectorization, Cython, Numba JIT compilation -- are covered in Section 5.5.

Intuition: Think of Python as the conductor of an orchestra. The conductor does not play faster than any individual musician -- but the conductor is what makes the whole ensemble work together. Python coordinates high-performance C/Fortran libraries the same way a conductor coordinates instrumentalists.


5.2 NumPy: Advanced Array Operations

NumPy is the foundation upon which virtually every numerical Python library is built. In Chapter 2, you used NumPy for basic vector and matrix operations. Here we go much deeper, covering the internal mechanics that determine whether your code runs in milliseconds or minutes.

5.2.1 Array Internals: dtype, Shape, and Strides

A NumPy array is fundamentally a contiguous block of memory paired with metadata that describes how to interpret that memory. Understanding this metadata is the key to writing fast numerical code.

import numpy as np

# Create a 3x4 array of 64-bit floats
A = np.arange(12, dtype=np.float64).reshape(3, 4)

print(f"Data type:   {A.dtype}")        # float64
print(f"Shape:       {A.shape}")         # (3, 4)
print(f"Strides:     {A.strides}")       # (32, 8)
print(f"Flags:\n{A.flags}")

dtype specifies the type and size of each element. Common choices for AI work:

dtype   | Bytes | Use Case
--------|-------|-----------------------------------------
float64 | 8     | Default precision, scientific computing
float32 | 4     | Deep learning training and inference
float16 | 2     | Mixed-precision training, model storage
int64   | 8     | Indices, counts
int32   | 4     | Labels, categorical features
bool    | 1     | Masks, logical indexing

Common Pitfall: NumPy defaults to float64, but most deep learning frameworks use float32. Forgetting to cast your data to float32 before feeding it to PyTorch or TensorFlow will either cause errors or silently double your memory usage.
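A quick check worth running before handing arrays to a framework -- cast once, up front, and confirm the memory halves:

```python
import numpy as np

X = np.random.randn(1024, 1024)   # float64 by default
print(X.nbytes // 2**20)          # 8 (MiB)

X32 = X.astype(np.float32)        # explicit downcast before training
print(X32.nbytes // 2**20)        # 4 (MiB) -- half the memory
```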

Strides tell NumPy how many bytes to skip to move to the next element along each dimension. For a C-contiguous (row-major) array of float64, the stride along columns is 8 bytes (one element) and along rows is 8 * ncols bytes (one full row). Understanding strides explains why some operations produce views (no data copy) and others produce copies:

# Slicing produces a view -- no data is copied
row_slice = A[1, :]       # View into row 1
col_slice = A[:, 2]       # View into column 2

# Fancy indexing produces a copy
fancy = A[[0, 2], :]      # Copy of rows 0 and 2

# Verify with np.shares_memory
print(np.shares_memory(A, row_slice))   # True  (view)
print(np.shares_memory(A, fancy))       # False (copy)

5.2.2 Memory Layout: C-Order vs. Fortran-Order

NumPy supports two memory layouts:

  • C-order (row-major): Elements within a row are contiguous in memory. This is the default.
  • Fortran-order (column-major): Elements within a column are contiguous. Common in LAPACK routines.

The layout matters for performance because modern CPUs load data in cache lines (typically 64 bytes). Accessing elements that are contiguous in memory is dramatically faster than striding across large gaps.

# C-order: iterate over rows efficiently
C = np.ones((10000, 10000), order='C')

# Fortran-order: iterate over columns efficiently
F = np.ones((10000, 10000), order='F')

# Row sum is faster for C-order
%timeit C.sum(axis=1)   # Accessing contiguous rows
%timeit F.sum(axis=1)   # Striding across columns (slower)

# Column sum is faster for Fortran-order
%timeit C.sum(axis=0)   # Striding across rows (slower)
%timeit F.sum(axis=0)   # Accessing contiguous columns

Real-World Application: When you multiply matrices using NumPy's np.dot or the @ operator, the underlying BLAS library handles memory layout automatically. However, when you write custom operations that iterate over specific axes, choosing the right layout can yield 2-10x speedups on large arrays.
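When you control the access pattern, you can also control the layout: `np.ascontiguousarray` and `np.asfortranarray` copy an array into C-order or Fortran-order respectively. A sketch of the trade, assuming a workload that repeatedly reduces over axis 0:

```python
import numpy as np

X = np.ones((2000, 2000))        # C-order by default
Xf = np.asfortranarray(X)        # one-time copy into column-major layout

assert X.flags['C_CONTIGUOUS']
assert Xf.flags['F_CONTIGUOUS']

# If the hot loop sums columns many times, paying for one conversion
# up front lets every later pass walk contiguous memory.
col_sums = Xf.sum(axis=0)
```

The conversion itself costs a full copy, so it only pays off when the favorable access pattern is repeated.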

5.2.3 Broadcasting: Eliminating Loops Through Shape Alignment

Broadcasting is NumPy's mechanism for performing arithmetic on arrays of different shapes without explicitly copying data. It is arguably the single most important concept for writing fast numerical Python code.

The Broadcasting Rules:

  1. If the arrays have different numbers of dimensions, the shape of the smaller array is padded with ones on the left.
  2. Arrays with a size of 1 along a particular dimension act as if they had the size of the largest array in that dimension.
  3. If sizes along any dimension disagree and neither is 1, an error is raised.

# Example 1: Subtracting the mean from each column
data = np.random.randn(1000, 5)        # Shape: (1000, 5)
col_means = data.mean(axis=0)           # Shape: (5,)
centered = data - col_means             # Broadcasting: (1000, 5) - (5,)
# col_means is broadcast to (1000, 5) without copying

# Example 2: Computing a pairwise distance matrix
# Given N points in D dimensions, compute all N*N distances
points = np.random.randn(100, 3)       # 100 points in 3D

# Use broadcasting to compute pairwise differences
# points[:, np.newaxis, :] has shape (100, 1, 3)
# points[np.newaxis, :, :] has shape (1, 100, 3)
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]  # (100, 100, 3)
distances = np.sqrt((diff ** 2).sum(axis=2))                # (100, 100)

Connecting to Chapter 2: Recall from Section 2.3 that centering a data matrix requires subtracting the column-wise mean. Broadcasting makes this operation both concise and fast:

$$ \mathbf{X}_{\text{centered}} = \mathbf{X} - \mathbf{1}\boldsymbol{\mu}^{\top} $$

where $\boldsymbol{\mu} \in \mathbb{R}^d$ is the column-mean vector and $\mathbf{1}$ is the all-ones column vector that replicates $\boldsymbol{\mu}^{\top}$ across rows. In NumPy, this is simply X - mu, with broadcasting playing the role of $\mathbf{1}$.

# Centering + standardization (z-score normalization)
# As used in Chapter 4 for standardizing features
X = np.random.randn(500, 10)
mu = X.mean(axis=0)                # (10,)
sigma = X.std(axis=0)              # (10,)
X_standardized = (X - mu) / sigma  # Broadcasting handles both operations

Advanced Broadcasting Patterns. Broadcasting becomes particularly powerful for implementing mathematical operations that would otherwise require explicit loops. Here are several patterns that appear frequently in AI engineering:

# Pattern 1: Outer product via broadcasting
# Given vectors a (M,) and b (N,), compute the M x N outer product
a = np.array([1, 2, 3])
b = np.array([4, 5])
outer = a[:, np.newaxis] * b[np.newaxis, :]  # Shape: (3, 2)
# Equivalent to np.outer(a, b) but generalizes to higher dimensions

# Pattern 2: Batch matrix-vector multiplication
# Given a batch of matrices A (B, M, N) and vectors v (B, N)
# Compute A @ v for each item in the batch
B, M, N = 32, 5, 3
A = np.random.randn(B, M, N)
v = np.random.randn(B, N)
result = np.einsum('bij,bj->bi', A, v)  # Shape: (B, M)

# Pattern 3: Computing cosine similarity matrix
# Given two sets of vectors, compute all pairwise cosine similarities
X = np.random.randn(100, 64)  # 100 vectors of dimension 64
Y = np.random.randn(50, 64)   # 50 vectors of dimension 64
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
Y_norm = Y / np.linalg.norm(Y, axis=1, keepdims=True)
cosine_sim = X_norm @ Y_norm.T  # Shape: (100, 50), no loops needed

# Pattern 4: One-hot encoding via broadcasting
labels = np.array([0, 2, 1, 3])
n_classes = 4
one_hot = (labels[:, np.newaxis] == np.arange(n_classes)).astype(float)
# Shape: (4, 4), each row is a one-hot vector

These broadcasting patterns eliminate loops that would otherwise dominate computation time. The cosine similarity example, for instance, computes 5,000 similarity scores in a single expression that runs at near-C speed. As we will see in Chapters 7 through 10 when implementing machine learning algorithms, mastering these patterns is essential for writing efficient numerical code.

5.2.4 Vectorization: Replacing Loops with Array Operations

Vectorization means expressing computations as operations on entire arrays rather than looping over individual elements. This is critical because:

  1. It eliminates Python loop overhead. A Python for loop over one million elements incurs one million interpreter dispatches. A vectorized operation dispatches once to compiled C code.
  2. It enables SIMD instructions. Modern CPUs can process 4-8 floating-point operations simultaneously using SIMD (Single Instruction, Multiple Data) registers. NumPy's compiled routines exploit these instructions automatically.

# SLOW: Python loop to compute softmax
def softmax_loop(x: np.ndarray) -> np.ndarray:
    """Compute softmax using explicit Python loops."""
    result = np.zeros_like(x)
    for i in range(x.shape[0]):
        exp_sum = 0.0
        for j in range(x.shape[1]):
            exp_sum += np.exp(x[i, j])
        for j in range(x.shape[1]):
            result[i, j] = np.exp(x[i, j]) / exp_sum
    return result

# FAST: Vectorized softmax
def softmax_vectorized(x: np.ndarray) -> np.ndarray:
    """Compute softmax using vectorized NumPy operations.

    Uses the numerical stability trick of subtracting the max
    before exponentiating, as discussed in Chapter 3.

    Args:
        x: Input array of shape (batch_size, num_classes).

    Returns:
        Softmax probabilities of same shape as input.
    """
    x_max = x.max(axis=1, keepdims=True)     # Numerical stability
    exp_x = np.exp(x - x_max)                # Broadcasting
    return exp_x / exp_x.sum(axis=1, keepdims=True)  # Broadcasting

# Benchmark
x = np.random.randn(1000, 100)
# softmax_loop:       ~2.5 seconds
# softmax_vectorized:  ~0.003 seconds (800x faster)

Intuition: Vectorization is like the difference between carrying groceries one item at a time versus loading them all into a cart. The total weight is the same, but the overhead per item drops dramatically.

5.2.5 Advanced Indexing and Masking

NumPy provides several indexing mechanisms beyond basic slicing. These are essential for data preprocessing in AI pipelines.

# Boolean masking: select elements meeting a condition
data = np.random.randn(1000)
positive = data[data > 0]               # All positive values
outliers = data[np.abs(data) > 2.5]     # Values beyond 2.5 std devs

# Integer array indexing: select specific elements by position
indices = np.array([0, 5, 10, 15])
selected = data[indices]

# np.where: conditional element selection
clipped = np.where(data > 0, data, 0)   # ReLU activation

# np.select: multiple conditions
conditions = [data < -1, data > 1]
choices = [-1, 1]
categorized = np.select(conditions, choices, default=0)

# Combining masks with logical operators
mask = (data > -1) & (data < 1)          # Elements in [-1, 1]
filtered = data[mask]

5.2.6 Structured Arrays and Record Arrays

When working with heterogeneous tabular data at the NumPy level (before reaching for pandas), structured arrays are invaluable:

# Define a structured dtype for a dataset
dt = np.dtype([
    ('feature_1', np.float32),
    ('feature_2', np.float32),
    ('label', np.int32),
    ('weight', np.float64)
])

# Create structured array
dataset = np.zeros(1000, dtype=dt)
dataset['feature_1'] = np.random.randn(1000).astype(np.float32)
dataset['feature_2'] = np.random.randn(1000).astype(np.float32)
dataset['label'] = np.random.randint(0, 10, 1000)
dataset['weight'] = np.random.uniform(0.5, 2.0, 1000)

# Access fields by name
positive_class = dataset[dataset['label'] == 1]
weighted_mean = np.average(
    dataset['feature_1'],
    weights=dataset['weight']
)

5.2.7 Linear Algebra with np.linalg

Building on Chapter 2, here are the essential NumPy linear algebra operations you will use throughout this book:

from numpy.linalg import norm, solve, eig, svd, inv, det

A = np.random.randn(5, 5)
b = np.random.randn(5)

# Solve linear system Ax = b (prefer over inv(A) @ b)
x = solve(A, b)

# Eigendecomposition (Chapter 2, Section 2.5)
eigenvalues, eigenvectors = eig(A)

# Singular Value Decomposition (Chapter 2, Section 2.6)
U, sigma, Vt = svd(A)

# Matrix norms
frobenius = norm(A, 'fro')
spectral = norm(A, 2)

# Determinant and condition number
d = det(A)
condition = np.linalg.cond(A)

Common Pitfall: Never use np.linalg.inv(A) @ b to solve a linear system. Use np.linalg.solve(A, b) instead. The solve function is numerically more stable and approximately twice as fast because it uses LU decomposition without explicitly forming the inverse.
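A quick sanity check of the claim -- on a well-conditioned random system both routes agree, but `solve` avoids forming the inverse, and its residual advantage grows with the condition number of A:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
b = rng.standard_normal(200)

x_solve = np.linalg.solve(A, b)   # LU factorization, no explicit inverse
x_inv = np.linalg.inv(A) @ b      # forms the inverse -- avoid

# Compare the residuals ||Ax - b|| of the two solutions
print(np.linalg.norm(A @ x_solve - b))
print(np.linalg.norm(A @ x_inv - b))
```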


5.3 pandas for Data Manipulation

While NumPy excels at homogeneous numerical arrays, real-world AI datasets are messy: they contain mixed types, missing values, datetime indices, and categorical variables. pandas is the library that bridges this gap, providing the DataFrame abstraction for tabular data manipulation.

5.3.1 DataFrame Creation and Inspection

import pandas as pd

# Create from dictionary
df = pd.DataFrame({
    'user_id': range(1, 1001),
    'age': np.random.randint(18, 65, 1000),
    'income': np.random.lognormal(10.5, 0.8, 1000),
    'signup_date': pd.date_range('2020-01-01', periods=1000, freq='D'),
    'plan': np.random.choice(['free', 'basic', 'premium'], 1000,
                              p=[0.6, 0.3, 0.1]),
    'churn': np.random.binomial(1, 0.15, 1000)
})

# Inspection methods every AI engineer should memorize
print(df.shape)                # (1000, 6)
print(df.dtypes)               # Column types
print(df.describe())           # Summary statistics
df.info()                      # Memory usage, null counts (prints directly)
print(df.head(10))             # First 10 rows
print(df.isnull().sum())       # Missing value counts per column
print(df.nunique())            # Unique value counts

5.3.2 Selection and Filtering

pandas provides multiple indexing mechanisms. Use them consistently for clarity:

# Column selection
ages = df['age']                       # Single column -> Series
subset = df[['age', 'income']]         # Multiple columns -> DataFrame

# Row selection by label (.loc) and position (.iloc)
first_ten = df.iloc[:10]               # First 10 rows by position
specific = df.loc[df['plan'] == 'premium']  # Rows where plan is premium

# Combined row and column selection
premium_income = df.loc[
    df['plan'] == 'premium',
    ['user_id', 'income']
]

# Complex boolean filtering
high_value = df.loc[
    (df['income'] > 50000) &
    (df['plan'].isin(['basic', 'premium'])) &
    (df['churn'] == 0)
]

# Query syntax for readability
high_value_query = df.query(
    "income > 50000 and plan in ['basic', 'premium'] and churn == 0"
)

Best Practice: Use .loc for label-based indexing and .iloc for position-based indexing. Avoid the deprecated .ix accessor and avoid chained indexing like df[df['x'] > 0]['y'] which can produce a SettingWithCopyWarning and lead to subtle bugs.
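A minimal before/after sketch of the pitfall, using a toy frame (`df_demo` is illustrative, not part of the chapter's dataset):

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [-1, 2, 3], 'y': [10, 20, 30]})

# Risky: chained indexing may operate on a temporary copy, so the
# assignment can silently fail to modify df_demo.
# df_demo[df_demo['x'] > 0]['y'] = 0      # SettingWithCopyWarning

# Safe: a single .loc call with a row mask and a column label.
df_demo.loc[df_demo['x'] > 0, 'y'] = 0
print(df_demo['y'].tolist())              # [10, 0, 0]
```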

5.3.3 Data Cleaning and Transformation

Cleaning data is where AI engineers spend a significant fraction of their time. pandas makes the common operations concise:

# Handling missing values
df_clean = df.copy()
df_clean['income'] = df_clean['income'].fillna(df_clean['income'].median())
df_clean = df_clean.dropna(subset=['age'])  # Drop rows with missing age

# Type conversion
df_clean['plan'] = df_clean['plan'].astype('category')
df_clean['signup_date'] = pd.to_datetime(df_clean['signup_date'])

# String operations
df_clean['plan_upper'] = df_clean['plan'].str.upper()

# Creating new features (feature engineering preview for Chapter 9)
df_clean['income_log'] = np.log1p(df_clean['income'])
df_clean['age_group'] = pd.cut(
    df_clean['age'],
    bins=[17, 25, 35, 45, 55, 65],
    labels=['18-25', '26-35', '36-45', '46-55', '56-65']
)
df_clean['tenure_days'] = (
    pd.Timestamp('2023-01-01') - df_clean['signup_date']
).dt.days

# Applying custom functions
def income_bracket(income: float) -> str:
    """Categorize income into brackets.

    Args:
        income: Annual income in dollars.

    Returns:
        String label for income bracket.
    """
    if income < 30000:
        return 'low'
    elif income < 75000:
        return 'medium'
    else:
        return 'high'

df_clean['income_bracket'] = df_clean['income'].apply(income_bracket)

5.3.4 GroupBy: Split-Apply-Combine

The GroupBy pattern is the workhorse of data analysis. It splits data into groups, applies a function to each group, and combines the results:

# Basic aggregation
plan_stats = df.groupby('plan').agg(
    mean_income=('income', 'mean'),
    median_income=('income', 'median'),
    churn_rate=('churn', 'mean'),
    count=('user_id', 'count')
).round(2)

print(plan_stats)

# Multiple grouping columns
age_plan_stats = df_clean.groupby(['age_group', 'plan']).agg(
    mean_income=('income', 'mean'),
    churn_rate=('churn', 'mean')
).round(3)

# Custom aggregation functions
def iqr(x: pd.Series) -> float:
    """Compute interquartile range.

    Args:
        x: Numeric Series.

    Returns:
        Difference between 75th and 25th percentiles.
    """
    return x.quantile(0.75) - x.quantile(0.25)

income_spread = df.groupby('plan')['income'].agg(['mean', 'std', iqr])

# Transform: apply function and return same-shaped result
df['income_zscore'] = df.groupby('plan')['income'].transform(
    lambda x: (x - x.mean()) / x.std()
)

5.3.5 Method Chaining for Readable Pipelines

Method chaining produces clean, readable data transformations that read top-to-bottom like a recipe:

result = (
    df
    .query("age >= 25 and age <= 55")
    .assign(
        income_log=lambda x: np.log1p(x['income']),
        tenure_days=lambda x: (
            pd.Timestamp('2023-01-01') - x['signup_date']
        ).dt.days
    )
    .groupby('plan')
    .agg(
        avg_income=('income_log', 'mean'),
        churn_rate=('churn', 'mean'),
        avg_tenure=('tenure_days', 'mean'),
        n_users=('user_id', 'count')
    )
    .sort_values('churn_rate', ascending=True)
    .round(3)
)

Intuition: Method chaining is analogous to Unix pipes (cat file | grep pattern | sort | uniq). Each step transforms the data and passes it to the next step. This makes the pipeline's logic transparent and easy to debug by removing one step at a time.

5.3.6 Merging, Joining, and Concatenating

AI datasets often arrive as multiple tables that must be combined:

# Inner join (only matching keys)
users = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['A', 'B', 'C']})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'user_id': [1, 2, 2, 4],
    'amount': [50, 30, 70, 20]
})
merged = pd.merge(users, orders, on='user_id', how='inner')

# Left join (keep all users, NaN for missing orders)
merged_left = pd.merge(users, orders, on='user_id', how='left')

# Concatenating DataFrames vertically
batch_1 = df.iloc[:500]
batch_2 = df.iloc[500:]
combined = pd.concat([batch_1, batch_2], ignore_index=True)

# Pivot tables for cross-tabulation
pivot = df_clean.pivot_table(
    values='income',
    index='age_group',
    columns='plan',
    aggfunc='mean'
).round(0)

5.3.7 Performance Considerations for pandas

pandas can be slow on large datasets if used improperly. Key optimizations:

# 1. Use appropriate dtypes to reduce memory
df['plan'] = df['plan'].astype('category')    # String -> category
df['churn'] = df['churn'].astype(np.int8)     # int64 -> int8

# Check memory savings
print(df.memory_usage(deep=True))

# 2. Avoid iterrows() -- use vectorized operations
# SLOW
for idx, row in df.iterrows():
    df.loc[idx, 'new_col'] = row['income'] * 1.1

# FAST
df['new_col'] = df['income'] * 1.1

# 3. Use eval() for complex column expressions on large DataFrames
df.eval('income_ratio = income / income.mean()', inplace=True)

# 4. Read only needed columns from CSV
df_partial = pd.read_csv(
    'large_file.csv',
    usecols=['user_id', 'income', 'churn'],
    dtype={'user_id': np.int32, 'income': np.float32, 'churn': np.int8}
)

5.4 Visualization with matplotlib and seaborn

Visualization is not a decoration -- it is a tool for understanding. The distributions from Chapter 4, the decision boundaries from upcoming Chapter 6, and the loss curves from Chapter 12 all rely on your ability to create clear, informative plots.

5.4.1 The matplotlib Object-Oriented API

matplotlib has two interfaces: the pyplot state-based API (plt.plot(...)) and the object-oriented (OO) API. For anything beyond a quick exploratory plot, the OO API provides more control and produces more maintainable code.

import matplotlib.pyplot as plt
import matplotlib as mpl

# Set a clean default style
plt.style.use('seaborn-v0_8-whitegrid')
mpl.rcParams.update({
    'font.size': 12,
    'axes.titlesize': 14,
    'axes.labelsize': 12,
    'figure.figsize': (10, 6),
    'figure.dpi': 100,
    'savefig.dpi': 150,
    'savefig.bbox_inches': 'tight'
})

The Figure-Axes Model:

Every matplotlib visualization consists of a Figure (the entire canvas) containing one or more Axes (individual plots). This separation lets you build complex layouts:

# Single plot (x_data and y_data are 1-D arrays, e.g. epochs and losses)
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(x_data, y_data, color='#2196F3', linewidth=2, label='Training')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Training Loss Curve')
ax.legend()
plt.show()

# Multi-panel figure
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for i, ax in enumerate(axes.flat):
    ax.set_title(f'Panel {i+1}')
plt.tight_layout()
plt.show()

5.4.2 Essential Plot Types for AI Engineering

Different data types and questions demand different visualizations:

import seaborn as sns

# --- Distribution Plots ---

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram with KDE overlay
axes[0].hist(df['income'], bins=50, density=True, alpha=0.7,
             color='steelblue', edgecolor='white')
sns.kdeplot(data=df, x='income', ax=axes[0], color='darkred', linewidth=2)
axes[0].set_title('Income Distribution')
axes[0].set_xlabel('Income ($)')

# Box plot by category
sns.boxplot(data=df, x='plan', y='income', ax=axes[1],
            palette='viridis', order=['free', 'basic', 'premium'])
axes[1].set_title('Income by Plan')

# Violin plot (shows full distribution shape)
sns.violinplot(data=df, x='plan', y='income', ax=axes[2],
               palette='muted', order=['free', 'basic', 'premium'])
axes[2].set_title('Income Distribution by Plan')

plt.tight_layout()
plt.show()
# --- Relationship Plots ---

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot with color encoding
scatter = axes[0].scatter(
    df['age'], df['income'],
    c=df['churn'], cmap='coolwarm', alpha=0.5, s=20
)
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Income')
axes[0].set_title('Age vs. Income (colored by churn)')
plt.colorbar(scatter, ax=axes[0], label='Churned')

# Correlation heatmap
numeric_cols = df.select_dtypes(include=[np.number])
corr_matrix = numeric_cols.corr()
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0,
            ax=axes[1], fmt='.2f', square=True)
axes[1].set_title('Feature Correlation Matrix')

plt.tight_layout()
plt.show()

5.4.3 Statistical Visualization with seaborn

seaborn builds on matplotlib to provide statistical visualization primitives:

# Pair plot: all pairwise relationships
g = sns.pairplot(
    df_clean[['age', 'income', 'tenure_days', 'churn']],
    hue='churn',
    palette='Set1',
    diag_kind='kde',
    plot_kws={'alpha': 0.4, 's': 15}
)
g.fig.suptitle('Pairwise Feature Relationships', y=1.02)
plt.show()

# Joint plot with marginal distributions
g = sns.jointplot(
    data=df, x='age', y='income',
    hue='plan', kind='scatter', alpha=0.4
)
plt.show()

# Count plot for categorical variables
fig, ax = plt.subplots(figsize=(8, 5))
sns.countplot(data=df, x='plan', hue='churn', palette='pastel', ax=ax)
ax.set_title('Churn Distribution by Plan')
ax.legend(title='Churned', labels=['No', 'Yes'])
plt.show()

5.4.4 Publication-Quality Figures

When your plots go into papers, presentations, or reports, attention to detail matters:

def create_publication_figure(
    x: np.ndarray,
    y_train: np.ndarray,
    y_val: np.ndarray,
    title: str = "Model Performance",
    save_path: str | None = None
) -> tuple[plt.Figure, plt.Axes]:
    """Create a publication-quality training curve figure.

    Args:
        x: Array of epoch numbers.
        y_train: Training metric values per epoch.
        y_val: Validation metric values per epoch.
        title: Plot title.
        save_path: If provided, save figure to this path.

    Returns:
        Tuple of (Figure, Axes) objects.
    """
    fig, ax = plt.subplots(figsize=(8, 5))

    # Plot with clear visual hierarchy
    ax.plot(x, y_train, color='#1f77b4', linewidth=2,
            label='Training', marker='o', markersize=4)
    ax.plot(x, y_val, color='#ff7f0e', linewidth=2,
            label='Validation', marker='s', markersize=4,
            linestyle='--')

    # Annotate best validation point
    best_epoch = x[np.argmin(y_val)]
    best_val = y_val.min()
    ax.annotate(
        f'Best: {best_val:.4f}\n(epoch {best_epoch})',
        xy=(best_epoch, best_val),
        xytext=(best_epoch + 5, best_val + 0.05),
        arrowprops=dict(arrowstyle='->', color='gray'),
        fontsize=10, color='#ff7f0e'
    )

    # Labels and formatting
    ax.set_xlabel('Epoch', fontsize=12)
    ax.set_ylabel('Loss', fontsize=12)
    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.legend(fontsize=11, frameon=True, fancybox=True,
              shadow=True, loc='upper right')
    ax.grid(True, alpha=0.3)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    plt.tight_layout()

    if save_path:
        fig.savefig(save_path, dpi=300, bbox_inches='tight')

    return fig, ax

5.4.5 Choosing the Right Visualization

| Question You Are Asking | Plot Type | Library Call |
| --- | --- | --- |
| What is the distribution of a variable? | Histogram + KDE | sns.histplot(data, kde=True) |
| How do distributions compare across groups? | Box / Violin plot | sns.boxplot(data, x=..., y=...) |
| Is there a relationship between two variables? | Scatter plot | ax.scatter(x, y) |
| What are all pairwise relationships? | Pair plot | sns.pairplot(df) |
| How do features correlate? | Heatmap | sns.heatmap(df.corr()) |
| How does a metric change over time? | Line plot | ax.plot(time, metric) |
| What is the composition of categories? | Bar / Count plot | sns.countplot(data, x=...) |
| How does model performance vary? | Learning curve | Custom line plot |

5.5 Jupyter Workflows, Profiling, and Performance

5.5.1 Effective Jupyter Notebook Workflows

Jupyter notebooks are ubiquitous in AI research and experimentation. Used well, they accelerate iteration. Used poorly, they become unreadable tangles of unreproducible code.

Essential Magic Commands:

# Auto-reload modules you are developing
%load_ext autoreload
%autoreload 2

# Inline matplotlib rendering
%matplotlib inline

# Time a single statement
%timeit np.dot(A, B)

# Time a cell
%%timeit
result = A @ B
result += C

# Profile a function
%prun my_function(data)

# Line-by-line profiling (requires line_profiler package)
%load_ext line_profiler
%lprun -f my_function my_function(data)

# Memory profiling
%load_ext memory_profiler
%memit my_function(data)

# Show environment info
%env

# Display variables in scope
%whos

Notebook Best Practices for AI Projects:

  1. Start every notebook with a markdown cell containing the purpose, date, author, and data dependencies.
  2. Put all imports in the first code cell. This makes dependencies immediately visible.
  3. Use section headers (## Section Name) to create navigable structure.
  4. Keep cells short. Each cell should do one thing and produce one output. If a cell exceeds 20 lines, refactor it into a function.
  5. Restart and run all before sharing. Notebooks that only work when cells are executed in a specific non-linear order are a common source of irreproducible results.
  6. Extract reusable code into .py modules. Notebooks are for exploration; production logic belongs in importable modules.

Common Pitfall: Jupyter notebooks maintain hidden state. If you define a variable in cell 5, delete the cell, and then reference the variable in cell 10, it still works during your session but fails on a fresh restart. Always test with "Kernel > Restart & Run All."

5.5.2 Memory Profiling and GPU Memory Management

Before discussing CPU profiling, let us address a challenge unique to AI engineering: memory management. Deep learning models consume enormous amounts of memory, and understanding where memory goes is essential for training larger models and debugging out-of-memory (OOM) errors.

System Memory Profiling. Python's memory_profiler package provides line-by-line memory usage tracking:

# Install: pip install memory_profiler
from memory_profiler import profile

@profile
def memory_intensive_pipeline(n: int = 50000) -> np.ndarray:
    """Demonstrate memory-aware data processing.

    Args:
        n: Number of samples to process.

    Returns:
        Processed feature matrix.
    """
    # Step 1: Load data (~40 MB for float64)
    raw_data = np.random.randn(n, 100)

    # Step 2: Compute covariance (~80 KB, but intermediate is large)
    centered = raw_data - raw_data.mean(axis=0)
    cov_matrix = (centered.T @ centered) / n

    # Step 3: Free intermediate data we no longer need
    del centered  # Explicitly free ~40 MB

    # Step 4: Eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # Step 5: Project to top 10 components
    projected = raw_data @ eigenvectors[:, -10:]
    del raw_data  # Free original data

    return projected

The @profile decorator produces a line-by-line memory report showing exactly where allocations occur. This is invaluable for identifying unexpected memory spikes in data processing pipelines.

GPU Memory Management with PyTorch. When you move to deep learning in Chapters 11 through 22, GPU memory management becomes critical. PyTorch provides tools for monitoring and managing GPU memory:

# These examples require a CUDA-capable GPU
# import torch

# Check GPU memory usage
# print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
# print(f"Cached:    {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Common patterns for reducing GPU memory:
# 1. Use torch.no_grad() during inference
# 2. Delete intermediate tensors and call torch.cuda.empty_cache()
# 3. Use gradient checkpointing for very deep models
# 4. Reduce batch size (the most common fix for OOM errors)
# 5. Use mixed precision (float16 instead of float32)
# 6. Use gradient accumulation to simulate larger batch sizes

Key memory management rules for AI engineers:

  1. Tensors persist until garbage collected. A large intermediate tensor from a forward pass stays in GPU memory until all references to it are gone and Python's garbage collector runs.
  2. The computation graph consumes memory. During training, PyTorch stores all intermediate activations for backpropagation. This is often the largest memory consumer and scales with batch size and model depth.
  3. Cached allocations are not leaks. PyTorch's CUDA memory allocator caches freed memory for reuse rather than returning it to the OS, so torch.cuda.memory_reserved() may report high values even after tensors are freed.
  4. Profile before optimizing. Use torch.cuda.memory_summary() to get a detailed breakdown of memory usage before making changes.

5.5.3 Profiling Python Code

Before optimizing, you must measure. Python provides several profiling tools at different granularities.

cProfile: Function-Level Profiling

import cProfile
import pstats

def data_pipeline(n: int = 100000) -> np.ndarray:
    """Simulate a data processing pipeline.

    Args:
        n: Number of data points to process.

    Returns:
        Processed array of results.
    """
    data = np.random.randn(n, 50)
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    projected = centered @ eigenvectors[:, -10:]
    return projected

# Profile the pipeline
profiler = cProfile.Profile()
profiler.enable()
result = data_pipeline()
profiler.disable()

# Display results sorted by cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions

line_profiler: Line-Level Profiling

# Install: pip install line_profiler
# In Jupyter:
%load_ext line_profiler

def compute_pairwise_distances(points: np.ndarray) -> np.ndarray:
    """Compute pairwise Euclidean distances.

    Args:
        points: Array of shape (n_points, n_dims).

    Returns:
        Distance matrix of shape (n_points, n_points).
    """
    n = points.shape[0]
    # Using the identity: ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b
    sq_norms = np.sum(points ** 2, axis=1)
    distances_sq = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2 * points @ points.T
    distances_sq = np.maximum(distances_sq, 0)  # Numerical stability
    return np.sqrt(distances_sq)

%lprun -f compute_pairwise_distances compute_pairwise_distances(np.random.randn(1000, 50))

memory_profiler: Memory Usage Tracking

# Install: pip install memory_profiler
from memory_profiler import profile

@profile
def memory_intensive_function() -> np.ndarray:
    """Function that allocates large arrays to demonstrate memory profiling."""
    a = np.random.randn(10000, 10000)          # ~800 MB
    b = a @ a.T                                 # Another ~800 MB
    eigenvalues = np.linalg.eigvalsh(b)         # Small output
    del a, b                                    # Free memory
    return eigenvalues

5.5.4 Optimization Strategies

Once profiling reveals bottlenecks, apply these strategies in order of decreasing impact:

Strategy 1: Algorithmic Improvement

The biggest gains come from better algorithms, not faster code. For large inputs, an $O(n \log n)$ sort beats an $O(n^2)$ sort regardless of implementation language.

# O(n^2) pairwise comparison -- slow for large n
def find_duplicates_naive(arr: np.ndarray) -> list[tuple[int, int]]:
    """Find duplicate pairs using brute-force comparison."""
    duplicates = []
    for i in range(len(arr)):
        for j in range(i + 1, len(arr)):
            if arr[i] == arr[j]:
                duplicates.append((i, j))
    return duplicates

# O(n) average case using a hash map -- fast for large n
def find_duplicates_hash(arr: np.ndarray) -> list[tuple[int, int]]:
    """Find duplicate pairs using hash-based grouping."""
    from collections import defaultdict
    seen: dict[float, list[int]] = defaultdict(list)
    for i, val in enumerate(arr):
        seen[val].append(i)
    duplicates = []
    for indices in seen.values():
        if len(indices) > 1:
            for i in range(len(indices)):
                for j in range(i + 1, len(indices)):
                    duplicates.append((indices[i], indices[j]))
    return duplicates
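
A quick sanity check, using condensed copies of the two implementations above on illustrative data, confirms that both approaches return the same duplicate pairs:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
arr = rng.integers(0, 50, size=200)  # 200 draws from 50 values guarantees duplicates

# Condensed copy of find_duplicates_naive
naive = [(i, j) for i in range(len(arr))
         for j in range(i + 1, len(arr)) if arr[i] == arr[j]]

# Condensed copy of find_duplicates_hash
groups: dict[int, list[int]] = defaultdict(list)
for i, val in enumerate(arr.tolist()):
    groups[val].append(i)
hashed = [(g[a], g[b]) for g in groups.values()
          for a in range(len(g)) for b in range(a + 1, len(g))]

# Same duplicate pairs, independent of algorithm
assert sorted(naive) == sorted(hashed)
```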

Strategy 2: Vectorization (NumPy)

Replace Python loops with NumPy operations, as covered in Section 5.2.4.
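
As a reminder of the pattern, here is a loop and its vectorized equivalent on a toy polynomial evaluation (array names are illustrative):

```python
import numpy as np

x = np.random.default_rng(1).normal(size=100_000)

# Loop version: one Python-level iteration per element
out_loop = np.empty_like(x)
for i in range(x.size):
    out_loop[i] = x[i] ** 2 + 3.0 * x[i] + 1.0

# Vectorized version: the whole computation as array operations
out_vec = x ** 2 + 3.0 * x + 1.0

assert np.allclose(out_loop, out_vec)
```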

Strategy 3: Pre-allocation

# SLOW: Growing a list dynamically
results = []
for i in range(1000000):
    results.append(np.sin(i * 0.01))
results = np.array(results)

# FAST: Pre-allocate the output array and fill it in place
x = np.arange(1000000) * 0.01
results = np.empty(1000000)
np.sin(x, out=results)  # Vectorized write into the pre-allocated buffer

Strategy 4: Numba JIT Compilation

When pure NumPy vectorization is insufficient (e.g., complex loop logic that cannot be expressed as array operations), Numba compiles Python to machine code:

from numba import njit

@njit
def mandelbrot_numba(
    c_real: float,
    c_imag: float,
    max_iter: int = 100
) -> int:
    """Compute Mandelbrot escape time for a single point.

    Args:
        c_real: Real part of complex number.
        c_imag: Imaginary part of complex number.
        max_iter: Maximum iterations before declaring convergence.

    Returns:
        Number of iterations before escape (or max_iter).
    """
    z_real, z_imag = 0.0, 0.0
    for i in range(max_iter):
        z_real_new = z_real * z_real - z_imag * z_imag + c_real
        z_imag = 2.0 * z_real * z_imag + c_imag
        z_real = z_real_new
        if z_real * z_real + z_imag * z_imag > 4.0:
            return i
    return max_iter

# First call triggers compilation; subsequent calls run at C speed

Advanced: For GPU-accelerated computation, CuPy provides a NumPy-compatible interface that runs on NVIDIA GPUs. For many array workloads, replacing import numpy as np with import cupy as np yields 10-100x speedups on large arrays with little or no other code change. This becomes essential in Chapters 11-17 when training deep networks.

5.5.5 Benchmarking Best Practices

When comparing performance, follow these principles to get reliable measurements:

import timeit

def benchmark_function(
    func,
    *args,
    n_runs: int = 100,
    warmup: int = 5,
    **kwargs
) -> dict[str, float]:
    """Benchmark a function with proper warm-up and statistics.

    Args:
        func: Function to benchmark.
        *args: Positional arguments to pass to func.
        n_runs: Number of timed runs.
        warmup: Number of untimed warm-up runs.
        **kwargs: Keyword arguments to pass to func.

    Returns:
        Dictionary with mean, std, min, and max times in seconds.
    """
    # Warm-up runs (JIT compilation, cache warming)
    for _ in range(warmup):
        func(*args, **kwargs)

    # Timed runs
    times = []
    for _ in range(n_runs):
        start = timeit.default_timer()
        func(*args, **kwargs)
        elapsed = timeit.default_timer() - start
        times.append(elapsed)

    times_arr = np.array(times)
    return {
        'mean': float(times_arr.mean()),
        'std': float(times_arr.std()),
        'min': float(times_arr.min()),
        'max': float(times_arr.max()),
    }

5.6 Virtual Environments and Dependency Management

Reproducibility is a cornerstone of scientific practice, and in AI engineering, reproducibility starts with your software environment. A model that trains perfectly on your laptop but crashes on a colleague's machine is worse than useless.

5.6.1 Why Environments Matter

Python packages have complex dependency graphs. Installing one package can upgrade or downgrade another, breaking unrelated projects. Virtual environments solve this by creating isolated Python installations for each project.

Consider a concrete scenario: Project A requires numpy==1.24.0 and scipy==1.10.0, while Project B requires numpy==1.26.0 and scipy==1.12.0. Without virtual environments, installing Project B's dependencies would break Project A.

5.6.2 Virtual Environment Tools

venv (Built-in)

# Create a virtual environment
python -m venv .venv

# Activate (Linux/macOS)
source .venv/bin/activate

# Activate (Windows)
.venv\Scripts\activate

# Install packages
pip install numpy pandas matplotlib seaborn jupyter

# Freeze dependencies
pip freeze > requirements.txt

# Reproduce environment
pip install -r requirements.txt

# Deactivate
deactivate

conda (Anaconda / Miniconda)

conda manages both Python packages and non-Python dependencies (C libraries, CUDA toolkits):

# Create environment with specific Python version
conda create -n ai-project python=3.11

# Activate
conda activate ai-project

# Install packages (prefer conda-forge channel)
conda install -c conda-forge numpy pandas matplotlib seaborn jupyterlab

# Export environment
conda env export > environment.yml

# Reproduce environment
conda env create -f environment.yml

When to Use Which:

| Tool | Best For | Limitations |
| --- | --- | --- |
| venv + pip | Lightweight projects, CI/CD, Docker | Cannot manage non-Python deps |
| conda | Data science, GPU setup, cross-platform | Slower solver, larger disk usage |
| poetry | Library development, publishing to PyPI | Learning curve, less common in AI |
| uv | Fast environment creation, modern Python projects | Newer tool, ecosystem still growing |

5.6.3 Docker for Reproducible AI Environments

For production deployments and cross-platform reproducibility, Docker containers provide the strongest guarantee that your environment will behave identically on every machine. A typical AI project Dockerfile follows this pattern:

# Use an official NVIDIA CUDA base image for GPU support
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Set non-interactive mode for apt
ENV DEBIAN_FRONTEND=noninteractive

# Install Python and system dependencies
RUN apt-get update && apt-get install -y \
    python3.11 python3.11-venv python3-pip \
    libgl1-mesa-glx libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Create and activate virtual environment inside Docker
RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Copy and install requirements first (Docker layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy project source code
COPY src/ /app/src/
COPY configs/ /app/configs/
WORKDIR /app

# Entry point for training
CMD ["python", "-m", "src.models.train"]

The key Docker pattern for AI projects is layer caching: by copying requirements.txt before the source code, Docker reuses the expensive dependency installation layer when only your code changes. This reduces rebuild times from minutes to seconds.

Practical Tip: When working with large models, use Docker multi-stage builds to separate the training environment (which needs PyTorch, CUDA, and development tools) from the inference environment (which may only need ONNX Runtime and a lightweight web framework). This can reduce your deployment image from 15 GB to under 2 GB.
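
A minimal sketch of that multi-stage pattern (the stage names, base images, export step, and inference dependencies below are illustrative assumptions, not a tested configuration):

```dockerfile
# Stage 1: heavy training environment (illustrative)
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04 AS train
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ /app/src/
# ... run training here and export the model, e.g. to /app/model.onnx ...

# Stage 2: slim inference image -- copies only the exported artifact
FROM python:3.11-slim AS serve
RUN pip install --no-cache-dir onnxruntime
COPY --from=train /app/model.onnx /app/model.onnx
WORKDIR /app
CMD ["python", "-m", "src.serve"]
```

Only the final stage becomes the deployed image, so the CUDA toolkit and training dependencies never ship to production.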

5.6.4 Requirements Files and Pinning

A robust requirements.txt pins exact versions to ensure reproducibility:

# requirements.txt for AI Engineering Chapter 5 examples
numpy==1.26.4
pandas==2.2.0
matplotlib==3.8.2
seaborn==0.13.1
scipy==1.12.0
scikit-learn==1.4.0
jupyter==1.0.0
jupyterlab==4.0.10
line-profiler==4.1.2
memory-profiler==0.61.0
numba==0.59.0

Best Practice: Use pip freeze > requirements.txt for exact reproducibility in deployment. For development, maintain a hand-curated requirements.in with looser constraints (e.g., numpy>=1.24,<2.0) and use pip-compile (from pip-tools) to generate the pinned requirements.txt.
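
A minimal sketch of that two-file workflow (the packages and version ranges below are illustrative):

```text
# requirements.in -- hand-curated intent with loose constraints
numpy>=1.24,<2.0
pandas>=2.0

# Compile it to a fully pinned requirements.txt with pip-tools:
#   pip install pip-tools
#   pip-compile requirements.in --output-file requirements.txt
```

Developers edit only requirements.in; the generated requirements.txt is committed so deployments install exact, known-good versions.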

The Lock File Pattern. Modern dependency managers like Poetry and uv use a two-file approach that separates intent from resolution:

# pyproject.toml (intent -- what you want)
[project]
dependencies = [
    "numpy>=1.24,<2.0",
    "torch>=2.0",
    "pandas>=2.0",
]

# uv.lock or poetry.lock (resolution -- exact versions that work together)
# Auto-generated, contains exact versions and hashes for every
# direct and transitive dependency

This pattern gives you the best of both worlds: flexibility during development (your intent file accepts a range of versions) and exact reproducibility in deployment (the lock file pins everything). As we will explore in Chapters 34 and 35 on deployment, lock files are essential for ensuring that the model you validated in testing runs identically in production.

5.6.5 Project Structure for AI Projects

A well-organized project structure prevents the common situation where code, data, models, and notebooks become an unnavigable mess:

my-ai-project/
├── README.md
├── pyproject.toml           # Modern Python project config
├── requirements.txt         # Pinned dependencies
├── .gitignore
├── .env                     # API keys (never commit!)
│
├── data/
│   ├── raw/                 # Original, immutable data
│   ├── processed/           # Cleaned, transformed data
│   └── external/            # Third-party data sources
│
├── notebooks/
│   ├── 01-exploration.ipynb
│   ├── 02-feature-engineering.ipynb
│   └── 03-model-comparison.ipynb
│
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── __init__.py
│   │   ├── loader.py        # Data loading functions
│   │   └── preprocessor.py  # Data cleaning functions
│   ├── features/
│   │   ├── __init__.py
│   │   └── build_features.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── train.py
│   │   └── predict.py
│   └── visualization/
│       ├── __init__.py
│       └── plots.py
│
├── tests/
│   ├── test_data.py
│   ├── test_features.py
│   └── test_models.py
│
├── models/                  # Trained model artifacts
│   └── model_v1.pkl
│
├── reports/
│   └── figures/             # Generated figures for reports
│
└── configs/
    ├── train_config.yaml
    └── model_config.yaml

Real-World Application: This structure (inspired by Cookiecutter Data Science) is used at companies like Spotify, Airbnb, and Netflix for their ML projects. Adopting it from the start of a project saves enormous refactoring effort later.


5.7 Coding Best Practices for AI Projects

Writing code that runs is the minimum bar. Writing code that your colleagues (and future you) can understand, modify, and trust is the professional standard.

5.7.1 Type Hints for AI Code

Python's type hints document function contracts and enable static analysis with tools like mypy:

from typing import Optional
import numpy as np
import pandas as pd
from numpy.typing import NDArray

def normalize_features(
    X: NDArray[np.float64],
    method: str = "z-score",
    axis: int = 0,
    clip_range: Optional[tuple[float, float]] = None
) -> tuple[NDArray[np.float64], dict[str, NDArray[np.float64]]]:
    """Normalize features using the specified method.

    Implements normalization techniques discussed in Chapter 4
    for preprocessing data before model training.

    Args:
        X: Feature matrix of shape (n_samples, n_features).
        method: Normalization method. One of "z-score", "min-max",
            or "robust".
        axis: Axis along which to compute statistics.
            0 for column-wise, 1 for row-wise.
        clip_range: Optional (min, max) tuple to clip values after
            normalization.

    Returns:
        Tuple of (normalized_X, params) where params is a dict
        containing the statistics needed to transform new data.

    Raises:
        ValueError: If method is not recognized.
        ValueError: If X contains NaN values.

    Examples:
        >>> X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
        >>> X_norm, params = normalize_features(X, method="z-score")
        >>> np.allclose(X_norm.mean(axis=0), 0)
        True
    """
    if np.isnan(X).any():
        raise ValueError("Input contains NaN values. Clean data first.")

    if method == "z-score":
        mean = X.mean(axis=axis, keepdims=True)
        std = X.std(axis=axis, keepdims=True)
        std[std == 0] = 1.0  # Prevent division by zero
        X_norm = (X - mean) / std
        params = {"mean": mean.squeeze(), "std": std.squeeze()}
    elif method == "min-max":
        x_min = X.min(axis=axis, keepdims=True)
        x_max = X.max(axis=axis, keepdims=True)
        denom = x_max - x_min
        denom[denom == 0] = 1.0
        X_norm = (X - x_min) / denom
        params = {"min": x_min.squeeze(), "max": x_max.squeeze()}
    elif method == "robust":
        median = np.median(X, axis=axis, keepdims=True)
        q75 = np.percentile(X, 75, axis=axis, keepdims=True)
        q25 = np.percentile(X, 25, axis=axis, keepdims=True)
        iqr = q75 - q25
        iqr[iqr == 0] = 1.0
        X_norm = (X - median) / iqr
        params = {"median": median.squeeze(), "iqr": iqr.squeeze()}
    else:
        raise ValueError(
            f"Unknown method '{method}'. "
            f"Choose from 'z-score', 'min-max', 'robust'."
        )

    if clip_range is not None:
        X_norm = np.clip(X_norm, clip_range[0], clip_range[1])

    return X_norm, params

5.7.2 Google-Style Docstrings

All code in this book uses Google-style docstrings. The key sections are:

def function_name(param1: type, param2: type = default) -> return_type:
    """One-line summary of what the function does.

    Extended description explaining the function's purpose,
    algorithm, and any important caveats.

    Args:
        param1: Description of param1.
        param2: Description of param2. Defaults to default.

    Returns:
        Description of the return value. For complex returns,
        describe the structure.

    Raises:
        ValueError: When param1 is invalid.
        RuntimeError: When computation fails.

    Examples:
        >>> result = function_name(42, param2=0.5)
        >>> print(result)
        expected_output

    Note:
        Any mathematical notes, references to chapters, or
        implementation details.
    """

5.7.3 Debugging Machine Learning Code

Debugging ML code is fundamentally different from debugging traditional software. In conventional programming, a bug produces a crash or a visibly wrong output. In ML, a bug often produces a model that trains without errors but learns nothing useful -- or, worse, appears to learn but fails silently on edge cases. Here are systematic strategies for catching these subtle failures.

Shape Assertions. The most common ML bugs are shape mismatches. Insert shape assertions liberally during development:

def forward_pass(
    X: np.ndarray,
    W1: np.ndarray,
    b1: np.ndarray,
    W2: np.ndarray,
    b2: np.ndarray
) -> np.ndarray:
    """A two-layer neural network forward pass with shape checking.

    Args:
        X: Input of shape (batch_size, input_dim).
        W1: First layer weights (input_dim, hidden_dim).
        b1: First layer bias (hidden_dim,).
        W2: Second layer weights (hidden_dim, output_dim).
        b2: Second layer bias (output_dim,).

    Returns:
        Output logits of shape (batch_size, output_dim).
    """
    assert X.ndim == 2, f"Expected 2D input, got {X.ndim}D"
    batch_size, input_dim = X.shape
    assert W1.shape[0] == input_dim, (
        f"Weight shape {W1.shape} incompatible with input dim {input_dim}"
    )

    h = X @ W1 + b1       # (batch_size, hidden_dim)
    assert h.shape == (batch_size, W1.shape[1])

    h = np.maximum(h, 0)  # ReLU activation
    out = h @ W2 + b2     # (batch_size, output_dim)
    assert out.shape == (batch_size, W2.shape[1])

    return out
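
The payoff is that a transposed weight matrix fails loudly at the assertion with a readable message, rather than deep inside a library call. A small demonstration with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))       # (batch_size, input_dim)
W_good = rng.normal(size=(10, 16))  # (input_dim, hidden_dim)
W_bad = rng.normal(size=(16, 10))   # transposed by mistake

def checked_linear(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Matrix multiply with an explicit shape assertion."""
    assert X.shape[1] == W.shape[0], (
        f"Shape mismatch: {X.shape} @ {W.shape}"
    )
    return X @ W

h = checked_linear(X, W_good)       # fine: result has shape (32, 16)
try:
    checked_linear(X, W_bad)        # fails the assertion
except AssertionError as exc:
    print(exc)
```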

NaN and Inf Detection. Numerical instability often manifests as NaN or Inf values that propagate silently through computations. Add checks after critical operations:

def check_numerical_health(
    tensor: np.ndarray,
    name: str = "tensor"
) -> None:
    """Check for NaN and Inf values in a numerical array.

    Args:
        tensor: Array to check.
        name: Name for error messages.

    Raises:
        RuntimeError: If NaN or Inf values are detected.
    """
    if np.isnan(tensor).any():
        nan_count = np.isnan(tensor).sum()
        raise RuntimeError(
            f"{name} contains {nan_count} NaN values "
            f"(out of {tensor.size} total elements)"
        )
    if np.isinf(tensor).any():
        inf_count = np.isinf(tensor).sum()
        raise RuntimeError(
            f"{name} contains {inf_count} Inf values"
        )
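
The reason such guards are worth the noise: invalid values propagate through subsequent arithmetic without raising any exception, so a single bad element silently poisons an entire reduction. A small illustration:

```python
import numpy as np

x = np.array([1.0, 0.0, 2.0])
with np.errstate(divide='ignore'):
    logs = np.log(x)            # log(0) -> -inf, no exception raised
total = logs.sum()              # the single -inf poisons the whole sum
assert np.isinf(total)
assert np.isinf(logs).any()     # the health check above would catch this
```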

The Overfit-One-Batch Test. Before training on a full dataset, verify that your model can perfectly memorize a single small batch. If it cannot, there is a bug in your training loop, loss function, or model architecture. This technique, recommended by Andrej Karpathy in his influential "Recipe for Training Neural Networks" blog post, catches the vast majority of implementation bugs:

# The overfit-one-batch sanity check (assumes your own model and train_step)
single_batch_X = X_train[:32]
single_batch_y = y_train[:32]

for epoch in range(200):
    loss = train_step(model, single_batch_X, single_batch_y)
    if epoch % 50 == 0:
        print(f"Epoch {epoch}: loss = {loss:.6f}")

# If loss does not decrease to near zero, something is wrong
assert loss < 0.01, f"Model cannot overfit one batch! Final loss: {loss}"

Gradient Checking. When implementing custom backward passes, as we will do in Chapters 11 and 12, verify your analytical gradients against numerical finite-difference approximations:

def numerical_gradient(
    f, x: np.ndarray, epsilon: float = 1e-5
) -> np.ndarray:
    """Compute numerical gradient using central differences.

    Args:
        f: Scalar-valued function.
        x: Point at which to compute gradient.
        epsilon: Step size for finite differences.

    Returns:
        Numerical gradient approximation.
    """
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus.flat[i] += epsilon
        x_minus.flat[i] -= epsilon
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * epsilon)
    return grad
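
A quick check of the idea on a function whose gradient we know analytically: for $f(x) = \sum_i x_i^2$, the gradient is $2x$, and central differences recover it to high precision.

```python
import numpy as np

def f(x: np.ndarray) -> float:
    return float(np.sum(x ** 2))   # analytic gradient: 2x

x = np.array([1.0, -2.0, 3.0])
eps = 1e-5
# Central differences, one coordinate at a time (same scheme as above)
num_grad = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(x.size)
])
assert np.allclose(num_grad, 2 * x, atol=1e-6)
```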

Practical Tip: Keep a debugging checklist for ML code: (1) Can the model overfit one batch? (2) Does the loss decrease monotonically with a very small learning rate? (3) Are gradients neither vanishing nor exploding? (4) Do the data labels match the data inputs after shuffling? (5) Are the input features properly normalized? Systematically working through this checklist will resolve the vast majority of training failures, as we discuss further in Chapter 12.

5.7.4 Error Handling in Numerical Code

Numerical code fails in subtle ways: silent NaN propagation, integer overflow, and catastrophic cancellation. Defensive coding catches these early:

def safe_log_probability(
    probabilities: NDArray[np.float64],
    epsilon: float = 1e-10
) -> NDArray[np.float64]:
    """Compute log probabilities with numerical safety guards.

    Prevents log(0) = -inf and log(negative) = NaN, which are
    common failure modes in probability computations (Chapter 4).

    Args:
        probabilities: Array of probability values in [0, 1].
        epsilon: Small value to clamp probabilities away from zero.

    Returns:
        Array of log probabilities.

    Raises:
        ValueError: If any probability is negative.
    """
    probabilities = np.asarray(probabilities, dtype=np.float64)

    if (probabilities < 0).any():
        raise ValueError(
            f"Negative probabilities detected. "
            f"Min value: {probabilities.min():.6e}"
        )

    if (probabilities > 1.0 + 1e-6).any():
        import warnings
        warnings.warn(
            f"Probabilities exceed 1.0 (max: {probabilities.max():.6e}). "
            f"Clipping to [0, 1].",
            RuntimeWarning
        )
        probabilities = np.clip(probabilities, 0.0, 1.0)

    # Clamp to [epsilon, 1.0] before taking log
    safe_probs = np.clip(probabilities, epsilon, 1.0)
    return np.log(safe_probs)
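
A short demonstration of the clamp-then-log guard on data containing an exact zero (the epsilon value matches the function's default):

```python
import numpy as np

probs = np.array([0.5, 0.0, 1.0])          # includes an exact zero
eps = 1e-10
# Same guard as safe_log_probability: clamp away from zero before log
safe = np.log(np.clip(probs, eps, 1.0))
assert np.isfinite(safe).all()             # no -inf leaks downstream
assert np.isclose(safe[0], np.log(0.5))    # valid values are unchanged
```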

5.7.5 Testing Numerical Code

Testing AI code requires special attention to floating-point comparison:

import numpy as np

def test_normalize_features_zscore():
    """Test z-score normalization produces zero mean and unit variance."""
    X = np.random.randn(1000, 5) * 10 + 50  # Arbitrary mean and scale
    X_norm, params = normalize_features(X, method="z-score")

    # Use np.allclose for floating-point comparison
    np.testing.assert_allclose(
        X_norm.mean(axis=0), 0.0, atol=1e-10,
        err_msg="Mean should be approximately zero after z-score"
    )
    np.testing.assert_allclose(
        X_norm.std(axis=0), 1.0, atol=1e-10,
        err_msg="Std should be approximately one after z-score"
    )

def test_normalize_features_minmax_range():
    """Test min-max normalization produces values in [0, 1]."""
    X = np.random.randn(100, 3) * 100
    X_norm, _ = normalize_features(X, method="min-max")

    assert X_norm.min() >= 0.0, "Min-max values should be >= 0"
    assert X_norm.max() <= 1.0, "Min-max values should be <= 1"

def test_normalize_features_invalid_method():
    """Test that invalid method raises ValueError."""
    X = np.random.randn(10, 3)
    try:
        normalize_features(X, method="invalid")
        assert False, "Should have raised ValueError"
    except ValueError as e:
        assert "Unknown method" in str(e)

5.7.6 Code Organization Patterns

Configuration over Hardcoding:

from dataclasses import dataclass

@dataclass
class TrainingConfig:
    """Configuration for model training.

    Centralizes all hyperparameters and settings to make
    experiments reproducible and easy to modify.

    Attributes:
        learning_rate: Initial learning rate for optimizer.
        batch_size: Number of samples per training batch.
        epochs: Maximum number of training epochs.
        early_stopping_patience: Epochs to wait before stopping.
        random_seed: Seed for reproducibility.
    """
    learning_rate: float = 0.001
    batch_size: int = 64
    epochs: int = 100
    early_stopping_patience: int = 10
    random_seed: int = 42
    data_path: str = "data/processed/train.csv"
    model_save_path: str = "models/best_model.pkl"

Reproducibility Through Seed Setting:

def set_global_seed(seed: int = 42) -> None:
    """Set random seeds for reproducibility across all libraries.

    This function should be called at the beginning of every
    experiment to ensure deterministic results.

    Args:
        seed: Integer seed value.
    """
    import random
    random.seed(seed)
    np.random.seed(seed)

    # For PyTorch (Chapters 11+)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass
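
A quick verification of the pattern, using a condensed copy of the NumPy and stdlib portions of set_global_seed: re-seeding reproduces identical random draws.

```python
import random
import numpy as np

def set_seed(seed: int = 42) -> None:
    # Condensed version of set_global_seed (NumPy + stdlib only)
    random.seed(seed)
    np.random.seed(seed)

set_seed(42)
a = np.random.randn(3)
set_seed(42)
b = np.random.randn(3)
assert np.array_equal(a, b)   # identical draws after re-seeding
```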

5.8 Chapter Summary

This chapter armed you with the practical Python skills that underpin every AI project. You now understand not just what to call, but why certain patterns are fast and others are slow -- and that understanding will compound throughout the remaining chapters of this book.

Key Concepts

  1. Python's Architecture: Python serves as a coordination layer for high-performance C/Fortran libraries. Performance-critical code runs in compiled extensions, not the Python interpreter.
  2. NumPy Array Internals: dtype, shape, and strides determine how arrays are stored and accessed. Understanding these mechanics explains view-vs-copy behavior and cache performance.
  3. Broadcasting: NumPy's mechanism for operating on arrays of different shapes without explicit copying. It eliminates loops and is central to efficient numerical code.
  4. Vectorization: Expressing computations as array operations rather than element-wise loops, yielding 100-1000x speedups.
  5. pandas DataFrames: The standard abstraction for tabular data, supporting filtering, grouping, joining, and transformation through a fluent method-chaining API.
  6. matplotlib OO API: The Figure-Axes model provides fine-grained control for publication-quality visualizations.
  7. Profiling Before Optimizing: Use cProfile, line_profiler, and memory_profiler to identify actual bottlenecks before applying optimization strategies.
  8. Reproducible Environments: Virtual environments and pinned dependencies are non-negotiable for professional AI work.
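Concept 2 has a concrete consequence worth internalizing: basic slicing returns a view that shares the original array's buffer, while `.copy()` allocates fresh memory. A minimal sketch:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
v = a[:, ::2]            # basic slicing -> a view sharing a's buffer
v[0, 0] = 99
assert a[0, 0] == 99     # writing through the view mutated the original

c = a[:, ::2].copy()     # .copy() allocates a separate buffer
c[0, 1] = -1
assert a[0, 2] == 2      # the original is untouched

print(a.strides)         # bytes to step along each axis of the buffer
```

`np.shares_memory(a, v)` is a handy check when you are unsure whether an operation returned a view or a copy.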

Key Code Patterns

# Pattern 1: Vectorized operation with broadcasting
result = (X - X.mean(axis=0)) / X.std(axis=0)

# Pattern 2: pandas method chain
result = (
    df
    .query("condition")
    .assign(new_col=lambda x: transform(x['col']))
    .groupby('key')
    .agg(metric=('col', 'func'))
)

# Pattern 3: Publication figure
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(x, y, label='Data')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_title('Title')
ax.legend()
fig.tight_layout()

# Pattern 4: Reproducible experiment
set_global_seed(42)
config = TrainingConfig(learning_rate=0.001)

Decision Framework

When choosing a data structure:

├── Homogeneous numerical data?
│   ├── Yes → NumPy array
│   └── No → Continue
├── Tabular with mixed types?
│   ├── Yes → pandas DataFrame
│   └── No → Continue
└── Sparse data (mostly zeros)?
    ├── Yes → scipy.sparse
    └── No → Python dict/list
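To see why mostly-zero data wants a sparse format, compare storage costs. The sketch below hand-rolls a coordinate (COO) representation with plain NumPy purely for illustration; `scipy.sparse` provides production-grade versions of the same idea:

```python
import numpy as np

dense = np.zeros((1000, 1000))
dense[::50, ::50] = 1.0               # only 400 nonzeros out of a million cells

rows, cols = np.nonzero(dense)        # coordinate (COO) representation:
vals = dense[rows, cols]              # store indices + values, skip the zeros

dense_bytes = dense.nbytes
coo_bytes = rows.nbytes + cols.nbytes + vals.nbytes
print(dense_bytes, coo_bytes)         # the sparse form is far smaller
```

At roughly 8 MB dense versus a few kilobytes of triplets, the decision branch above is easy to justify whenever most entries are zero.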

When optimizing slow code:

├── Is the algorithm optimal? (O(n log n) vs O(n^2))
│   ├── No → Fix algorithm first
│   └── Yes → Continue
├── Can it be vectorized with NumPy?
│   ├── Yes → Vectorize
│   └── No → Continue
└── Can Numba JIT compile the loop?
    ├── Yes → Add @njit decorator
    └── No → Consider Cython or C extension
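The "can it be vectorized?" step usually pays off before reaching for Numba. A small sketch comparing a pure-Python double loop with its broadcast equivalent; both compute the same quantity, and timing them with `%timeit` typically shows an orders-of-magnitude gap:

```python
import numpy as np

def pairwise_loop(x):
    # O(n^2) pure-Python loops: sum of squared pairwise differences
    total = 0.0
    for i in range(len(x)):
        for j in range(len(x)):
            total += (x[i] - x[j]) ** 2
    return total

def pairwise_vectorized(x):
    # same computation via broadcasting: (n,1) - (1,n) -> (n,n)
    d = x[:, None] - x[None, :]
    return float((d ** 2).sum())

rng = np.random.default_rng(0)
x = rng.random(200)
assert np.isclose(pairwise_loop(x), pairwise_vectorized(x))
```

Note the space trade-off: the vectorized version materializes an (n, n) intermediate, so for very large n a chunked or Numba-compiled loop may be the better branch of the tree.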

What's Next

In Chapter 6: Supervised Learning -- Regression and Classification, you will apply everything from this chapter to implement your first machine learning algorithms. The NumPy vectorization skills will make gradient descent fast, the pandas skills will help you prepare training data, and the matplotlib skills will let you visualize decision boundaries and learning curves. The mathematical foundations from Chapters 2-4 and the computational toolkit from this chapter converge into working ML systems.

Before moving on, complete the exercises and quiz to solidify your understanding.


Chapter 5 Exercises -> exercises.md

Chapter 5 Quiz -> quiz.md

Case Study: Building an End-to-End Data Analysis Pipeline -> case-study-01.md

Case Study: Optimizing Python Code for Large-Scale Data Processing -> case-study-02.md