Chapter 12 Key Takeaways

The Big Picture

Training a deep network successfully requires orchestrating many interacting components: loss functions, optimizers, learning rate schedules, normalization layers, weight initialization, gradient clipping, and mixed precision. The difference between a model that converges and one that diverges often comes down to a handful of deliberate engineering choices. This chapter provides the complete toolkit for making deep networks learn reliably.


Loss Functions

| Task | Loss Function | Output Activation | PyTorch |
|---|---|---|---|
| Regression | MSE | Linear (none) | nn.MSELoss() |
| Regression with outliers | Huber / Smooth L1 | Linear (none) | nn.SmoothL1Loss() |
| Binary classification | BCE | None (raw logits; sigmoid applied internally) | nn.BCEWithLogitsLoss() |
| Multi-class classification | Cross-Entropy | None (raw logits) | nn.CrossEntropyLoss() |
| Class-imbalanced detection | Focal Loss | None (raw logits) | Custom implementation |
| Knowledge distillation | KL Divergence | Log-softmax | nn.KLDivLoss() |
  • Never apply softmax before CrossEntropyLoss in PyTorch. The loss function handles log-softmax internally for numerical stability.
  • Label smoothing replaces hard one-hot targets with soft targets, reducing overconfidence. Available via nn.CrossEntropyLoss(label_smoothing=0.1).
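
A minimal sketch of the raw-logits convention described above; the tensor shapes and class count are illustrative:

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 10)             # batch of 8, 10 classes; raw scores, no softmax
targets = torch.randint(0, 10, (8,))    # integer class indices

# CrossEntropyLoss applies log-softmax internally, so pass logits directly.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
loss = criterion(logits, targets)

# Binary case: BCEWithLogitsLoss also takes raw logits (the sigmoid happens inside).
binary_logits = torch.randn(8, 1)
binary_targets = torch.randint(0, 2, (8, 1)).float()
bce_loss = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)
```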

Optimizers

| Optimizer | Key Properties | Default LR | Best For |
|---|---|---|---|
| SGD + Momentum | Simple, generalizes well, needs tuning | 0.1 (with schedule) | CNNs, long training runs |
| Adam | Adaptive per-parameter LR, fast convergence | 3e-4 | Quick prototyping, GANs |
| AdamW | Decoupled weight decay, best generalization | 1e-4 to 5e-4 | Transformers, NLP, fine-tuning |
  • AdamW (decoupled weight decay) is preferred over Adam + L2 regularization because Adam's adaptive scaling interferes with standard L2 penalties.
  • Parameter groups allow different learning rates for different parts of the model---essential for fine-tuning pretrained models.
  • The three-step pattern is universal: optimizer.zero_grad(), loss.backward(), optimizer.step().
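
A minimal sketch of parameter groups and the three-step pattern, assuming a hypothetical two-part model (a stand-in "backbone" plus a new head) and illustrative learning rates:

```python
import torch
import torch.nn as nn

# Hypothetical model: the first layers stand in for a pretrained backbone.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
backbone, head = model[:2], model[2:]

# Parameter groups: smaller LR for pretrained layers, larger LR for the new head.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 3e-4},
    ],
    weight_decay=0.01,
)

# The universal three-step pattern for one batch:
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
optimizer.zero_grad()                        # clear gradients from the previous step
loss = nn.CrossEntropyLoss()(model(x), y)    # forward pass + loss
loss.backward()                              # backward pass
optimizer.step()                             # parameter update
```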

Learning Rate Schedules

  • Step decay: Multiply LR by a factor (e.g., 0.1) at predetermined epochs. Simple but requires knowing when to drop.
  • Cosine annealing: Smoothly reduces LR following a cosine curve. Consistently performs well with minimal tuning.
  • Linear warmup: Gradually increases LR from near-zero over the first few hundred/thousand steps. Critical for transformers and large batch sizes.
  • One-cycle policy: Increases LR to a peak, then decreases to well below the starting value. Often achieves best results in fewer epochs.
  • Per-epoch schedulers call scheduler.step() after each epoch; per-step schedulers call it after each batch. Mixing these up is a common bug.
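
One way to combine linear warmup with cosine decay as a per-step schedule is PyTorch's scheduler chaining; the step counts and factors below are illustrative:

```python
import torch

params = [torch.nn.Parameter(torch.randn(10))]    # stand-in for model.parameters()
optimizer = torch.optim.AdamW(params, lr=3e-4)

warmup_steps, total_steps = 500, 10_000
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=warmup_steps)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

for step in range(total_steps):
    # ... forward pass, loss.backward(), optimizer.step() ...
    scheduler.step()    # per-step scheduler: call once per batch, not once per epoch
```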

Normalization Layers

| Type | Normalizes Over | Use Case | Batch-Size Sensitive? |
|---|---|---|---|
| Batch Norm | (N, H, W) per channel | CNNs, large batches | Yes |
| Layer Norm | (C, H, W) per sample | Transformers, RNNs | No |
| Group Norm | Groups of channels per sample | Small batches, detection | No |
| Instance Norm | (H, W) per channel per sample | Style transfer | No |
  • Batch normalization requires calling model.eval() before inference to use running statistics instead of batch statistics.
  • Layer normalization is batch-size independent and is the standard for transformers.
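
A minimal sketch contrasting the four layers on one feature map and showing the train/eval distinction for batch norm; the shapes are illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 32, 16, 16)    # (N, C, H, W)

bn = nn.BatchNorm2d(32)           # normalizes over (N, H, W) per channel
ln = nn.LayerNorm([32, 16, 16])   # normalizes over (C, H, W) per sample
gn = nn.GroupNorm(8, 32)          # 8 groups of 4 channels, per sample
inorm = nn.InstanceNorm2d(32)     # normalizes over (H, W) per channel per sample

model = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), bn, nn.ReLU())
model.train()
_ = model(x)    # training mode: batch statistics are used, running stats are updated
model.eval()
_ = model(x)    # eval mode: stored running statistics are used instead
```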

Weight Initialization

| Activation | Initialization | Variance |
|---|---|---|
| ReLU | He / Kaiming | $\text{Var}(w) = 2/n_{\text{in}}$ |
| Sigmoid / Tanh | Xavier / Glorot | $\text{Var}(w) = 2/(n_{\text{in}} + n_{\text{out}})$ |
| GELU (Transformers) | Xavier or He | Depends on implementation |
  • Initializing all weights to zero fails to break symmetry---all neurons compute identical gradients and never differentiate.
  • Biases are almost always initialized to zero.
  • Proper initialization keeps activation variance and gradient variance stable across layers.
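
A minimal sketch of applying these rules with torch.nn.init, assuming ReLU activations; the helper name and model are placeholders:

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    # He / Kaiming init for ReLU; use nn.init.xavier_uniform_ for sigmoid/tanh.
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)    # biases start at zero

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.apply(init_weights)    # recursively applies the function to every submodule
```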

Gradient Clipping

  • Norm clipping (clip_grad_norm_): Scales the entire gradient vector if its norm exceeds a threshold. Preserves gradient direction.
  • Value clipping (clip_grad_value_): Clips individual gradient elements. Can change gradient direction.
  • Norm clipping with max_norm=1.0 is the standard approach and is essential for RNNs and transformers.
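
A minimal sketch of norm clipping placed between backward() and step(); the model and data are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))

optimizer.zero_grad()
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()

# Rescales the whole gradient vector if its L2 norm exceeds 1.0, preserving direction;
# clip_grad_value_ would instead clip each element independently.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```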

Mixed Precision Training

  • Uses FP16 for forward/backward passes (faster, less memory) and FP32 for parameter updates (maintains precision).
  • Loss scaling prevents FP16 gradient underflow by multiplying the loss before backward and dividing gradients after.
  • PyTorch's torch.amp autocast and GradScaler handle this automatically.
  • BF16 (bfloat16) has the same exponent range as FP32, eliminating the need for loss scaling on hardware that supports it.
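
A minimal sketch of the autocast + GradScaler pattern, assuming a recent PyTorch where GradScaler lives under torch.amp and a CUDA device is available:

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.amp.GradScaler("cuda")    # manages loss scaling for FP16

x = torch.randn(32, 256, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.amp.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.CrossEntropyLoss()(model(x), y)    # FP16 forward pass
scaler.scale(loss).backward()    # scale the loss so small FP16 gradients do not underflow
scaler.unscale_(optimizer)       # unscale before clipping so the threshold sees true values
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)           # skips the update if inf/NaN gradients are detected
scaler.update()                  # adjusts the loss-scale factor for the next iteration
```

With dtype=torch.bfloat16 on supporting hardware, the scaler can be dropped entirely, since BF16 shares FP32's exponent range.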

The Complete Training Loop

1. Initialize model with proper initialization
2. Configure optimizer with parameter groups
3. Set up learning rate scheduler (warmup + decay)
4. For each epoch:
   a. model.train()
   b. For each batch:
      - optimizer.zero_grad()
      - Forward pass (with autocast for mixed precision)
      - Compute loss
      - loss.backward() (with grad scaler for mixed precision)
      - Clip gradients
      - optimizer.step()
      - scheduler.step() (if per-step)
   c. scheduler.step() (if per-epoch)
   d. model.eval() + validation
   e. Save checkpoint if best
   f. Check early stopping
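
The outline above, condensed into a runnable sketch; the model, synthetic data, and hyperparameters are placeholders, and mixed precision and early stopping are omitted for brevity:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

# Synthetic stand-ins for real DataLoaders.
train_loader = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(20)]
val_loader = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(5)]
best_val = float("inf")

for epoch in range(10):
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    scheduler.step()    # per-epoch scheduler

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                       for x, y in val_loader) / len(val_loader)
    if val_loss < best_val:    # save checkpoint if best
        best_val = val_loss
        torch.save(model.state_dict(), "best_checkpoint.pt")
```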

Debugging Checklist

  1. Overfit one batch first: If the model cannot memorize a single batch, there is a bug in the pipeline.
  2. Check loss scale: Initial loss should be approximately $-\log(1/C) = \log C$ for $C$ classes with random initialization.
  3. Verify gradients are nonzero: Zero gradients indicate dead neurons, wrong activation, or a disconnected computation graph.
  4. Monitor gradient norms: Exploding norms suggest the learning rate is too high or initialization is poor.
  5. Ensure train/eval modes are correct: model.train() during training, model.eval() during evaluation.
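
A minimal sketch of checks 1 through 4: overfit a single fixed batch, compare the initial loss to $\log C$, and watch the gradient norm. The model, data, and learning rate are placeholders chosen so memorization happens quickly:

```python
import math
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 5))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)   # higher LR is fine for this check
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(16, 64), torch.randint(0, 5, (16,))      # one fixed batch

print(f"expected initial loss ~ {math.log(5):.3f}")          # -log(1/C) for C = 5

for step in range(500):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.4f}, grad norm {grad_norm.item():.3f}")

# A healthy pipeline drives this loss toward zero; a plateau points to a bug.
```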

Common Pitfalls

  1. Softmax before CrossEntropyLoss: The loss applies log-softmax internally, so an extra softmax flattens the logits and shrinks gradients, silently degrading training.
  2. Missing zero_grad(): Gradients accumulate silently across iterations.
  3. Wrong scheduler frequency: Calling a per-epoch scheduler every batch decays LR far too fast.
  4. Forgetting model.eval(): Batch norm uses incorrect statistics; dropout remains active during evaluation.
  5. Applying weight decay to bias and norm parameters: These should typically have zero weight decay; split them into their own parameter group, as in the sketch below.
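
A minimal sketch of that parameter-group split, using the common heuristic that 1-D parameters are biases or normalization scales; the model here is a placeholder:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.LayerNorm(64), nn.ReLU(), nn.Linear(64, 10))

decay, no_decay = [], []
for param in model.parameters():
    # Biases and norm scales/shifts are 1-D; keep them out of weight decay.
    (no_decay if param.ndim == 1 else decay).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=3e-4,
)
```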

Looking Ahead

  • Chapter 13 (Regularization and Generalization): Dropout, data augmentation, weight decay, early stopping, mixup/CutMix, and the science of preventing overfitting.
  • Chapter 14 (Convolutional Neural Networks): Applying the training techniques from this chapter to spatial architectures for computer vision.
  • Chapter 15 (Recurrent Neural Networks): Gradient clipping becomes essential; normalization choices differ for sequential data.