Chapter 12 Key Takeaways

The Big Picture

Training a deep network successfully requires orchestrating many interacting components: loss functions, optimizers, learning rate schedules, normalization layers, weight initialization, gradient clipping, and mixed precision. The difference between a model that converges and one that diverges often comes down to a handful of deliberate engineering choices. This chapter provides the complete toolkit for making deep networks learn reliably.


Loss Functions

| Task | Loss Function | Output Activation | PyTorch |
|---|---|---|---|
| Regression | MSE | Linear (none) | nn.MSELoss() |
| Regression with outliers | Huber / Smooth L1 | Linear (none) | nn.SmoothL1Loss() |
| Binary classification | BCE | None (raw logits; sigmoid applied internally) | nn.BCEWithLogitsLoss() |
| Multi-class classification | Cross-Entropy | None (raw logits) | nn.CrossEntropyLoss() |
| Class-imbalanced detection | Focal Loss | None (raw logits) | Custom implementation |
| Knowledge distillation | KL Divergence | Log-softmax | nn.KLDivLoss() |
  • Never apply softmax before CrossEntropyLoss in PyTorch. The loss function handles log-softmax internally for numerical stability.
  • Label smoothing replaces hard one-hot targets with soft targets, reducing overconfidence. Available via nn.CrossEntropyLoss(label_smoothing=0.1).
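
A minimal sketch of the raw-logits convention described above; the tensor shapes and class count are illustrative:

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 10)             # batch of 8, 10 classes; raw scores, no softmax
targets = torch.randint(0, 10, (8,))    # integer class indices

# CrossEntropyLoss applies log-softmax internally, so pass logits directly.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
loss = criterion(logits, targets)

# Binary case: BCEWithLogitsLoss also takes raw logits (the sigmoid happens inside).
binary_logits = torch.randn(8, 1)
binary_targets = torch.randint(0, 2, (8, 1)).float()
bce_loss = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)
```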

Optimizers

| Optimizer | Key Properties | Default LR | Best For |
|---|---|---|---|
| SGD + Momentum | Simple, generalizes well, needs tuning | 0.1 (with schedule) | CNNs, long training runs |
| Adam | Adaptive per-parameter LR, fast convergence | 3e-4 | Quick prototyping, GANs |
| AdamW | Decoupled weight decay, best generalization | 1e-4 to 5e-4 | Transformers, NLP, fine-tuning |
  • AdamW (decoupled weight decay) is preferred over Adam + L2 regularization because Adam's adaptive scaling interferes with standard L2 penalties.
  • Parameter groups allow different learning rates for different parts of the model---essential for fine-tuning pretrained models.
  • The three-step pattern is universal: optimizer.zero_grad(), loss.backward(), optimizer.step().
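
A minimal sketch of parameter groups and the three-step pattern, assuming a hypothetical two-part model (a stand-in "backbone" plus a new head) and illustrative learning rates:

```python
import torch
import torch.nn as nn

# Hypothetical model: the first layers stand in for a pretrained backbone.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
backbone, head = model[:2], model[2:]

# Parameter groups: smaller LR for pretrained layers, larger LR for the new head.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 3e-4},
    ],
    weight_decay=0.01,
)

# The universal three-step pattern for one batch:
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
optimizer.zero_grad()                        # clear gradients from the previous step
loss = nn.CrossEntropyLoss()(model(x), y)    # forward pass + loss
loss.backward()                              # backward pass
optimizer.step()                             # parameter update
```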

Learning Rate Schedules

  • Step decay: Multiply LR by a factor (e.g., 0.1) at predetermined epochs. Simple but requires knowing when to drop.
  • Cosine annealing: Smoothly reduces LR following a cosine curve. Consistently performs well with minimal tuning.
  • Linear warmup: Gradually increases LR from near-zero over the first few hundred/thousand steps. Critical for transformers and large batch sizes.
  • One-cycle policy: Increases LR to a peak, then decreases to well below the starting value. Often achieves best results in fewer epochs.
  • Per-epoch schedulers call scheduler.step() after each epoch; per-step schedulers call it after each batch. Mixing these up is a common bug.
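
One way to combine linear warmup with cosine decay as a per-step schedule is PyTorch's scheduler chaining; the step counts and factors below are illustrative:

```python
import torch

params = [torch.nn.Parameter(torch.randn(10))]    # stand-in for model.parameters()
optimizer = torch.optim.AdamW(params, lr=3e-4)

warmup_steps, total_steps = 500, 10_000
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=warmup_steps)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

for step in range(total_steps):
    # ... forward pass, loss.backward(), optimizer.step() ...
    scheduler.step()    # per-step scheduler: call once per batch, not once per epoch
```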

Normalization Layers

| Type | Normalizes Over | Use Case | Batch-Size Sensitive? |
|---|---|---|---|
| Batch Norm | (N, H, W) per channel | CNNs, large batches | Yes |
| Layer Norm | (C, H, W) per sample | Transformers, RNNs | No |
| Group Norm | Groups of channels per sample | Small batches, detection | No |
| Instance Norm | (H, W) per channel per sample | Style transfer | No |
  • Batch normalization requires calling model.eval() before inference to use running statistics instead of batch statistics.
  • Layer normalization is batch-size independent and is the standard for transformers.
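
A minimal sketch contrasting the four layers on one feature map and showing the train/eval distinction for batch norm; the shapes are illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 32, 16, 16)    # (N, C, H, W)

bn = nn.BatchNorm2d(32)           # normalizes over (N, H, W) per channel
ln = nn.LayerNorm([32, 16, 16])   # normalizes over (C, H, W) per sample
gn = nn.GroupNorm(8, 32)          # 8 groups of 4 channels, per sample
inorm = nn.InstanceNorm2d(32)     # normalizes over (H, W) per channel per sample

model = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), bn, nn.ReLU())
model.train()
_ = model(x)    # training mode: batch statistics are used, running stats are updated
model.eval()
_ = model(x)    # eval mode: stored running statistics are used instead
```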

Weight Initialization

| Activation | Initialization | Variance |
|---|---|---|
| ReLU | He / Kaiming | $\text{Var}(w) = 2/n_{\text{in}}$ |
| Sigmoid / Tanh | Xavier / Glorot | $\text{Var}(w) = 2/(n_{\text{in}} + n_{\text{out}})$ |
| GELU (Transformers) | Xavier or He | Depends on implementation |
  • Initializing all weights to zero fails to break symmetry---all neurons compute identical gradients and never differentiate.
  • Biases are almost always initialized to zero.
  • Proper initialization keeps activation variance and gradient variance stable across layers.
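
A minimal sketch of applying these rules with torch.nn.init, assuming ReLU activations; the helper name and model are placeholders:

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    # He / Kaiming init for ReLU; use nn.init.xavier_uniform_ for sigmoid/tanh.
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)    # biases start at zero

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.apply(init_weights)    # recursively applies the function to every submodule
```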

Gradient Clipping

  • Norm clipping (clip_grad_norm_): Scales the entire gradient vector if its norm exceeds a threshold. Preserves gradient direction.
  • Value clipping (clip_grad_value_): Clips individual gradient elements. Can change gradient direction.
  • Norm clipping with max_norm=1.0 is the standard approach and is essential for RNNs and transformers.
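
A minimal sketch of norm clipping placed between backward() and step(); the model and data are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))

optimizer.zero_grad()
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()

# Rescales the whole gradient vector if its L2 norm exceeds 1.0, preserving direction;
# clip_grad_value_ would instead clip each element independently.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```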

Mixed Precision Training

  • Uses FP16 for forward/backward passes (faster, less memory) and FP32 for parameter updates (maintains precision).
  • Loss scaling prevents FP16 gradient underflow by multiplying the loss before backward and dividing gradients after.
  • PyTorch's torch.amp autocast and GradScaler handle this automatically.
  • BF16 (bfloat16) has the same exponent range as FP32, eliminating the need for loss scaling on hardware that supports it.
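
A minimal sketch of the autocast + GradScaler pattern, assuming a recent PyTorch where GradScaler lives under torch.amp and a CUDA device is available:

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.amp.GradScaler("cuda")    # manages loss scaling for FP16

x = torch.randn(32, 256, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.amp.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.CrossEntropyLoss()(model(x), y)    # FP16 forward pass
scaler.scale(loss).backward()    # scale the loss so small FP16 gradients do not underflow
scaler.unscale_(optimizer)       # unscale before clipping so the threshold sees true values
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)           # skips the update if inf/NaN gradients are detected
scaler.update()                  # adjusts the loss-scale factor for the next iteration
```

With dtype=torch.bfloat16 on supporting hardware, the scaler can be dropped entirely, since BF16 shares FP32's exponent range.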

The Complete Training Loop

1. Initialize model with proper initialization
2. Configure optimizer with parameter groups
3. Set up learning rate scheduler (warmup + decay)
4. For each epoch:
   a. model.train()
   b. For each batch:
      - optimizer.zero_grad()
      - Forward pass (with autocast for mixed precision)
      - Compute loss
      - loss.backward() (with grad scaler for mixed precision)
      - Clip gradients
      - optimizer.step()
      - scheduler.step() (if per-step)
   c. scheduler.step() (if per-epoch)
   d. model.eval() + validation
   e. Save checkpoint if best
   f. Check early stopping
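
The outline above, condensed into a runnable sketch; the model, synthetic data, and hyperparameters are placeholders, and mixed precision and early stopping are omitted for brevity:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

# Synthetic stand-ins for real DataLoaders.
train_loader = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(20)]
val_loader = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(5)]
best_val = float("inf")

for epoch in range(10):
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    scheduler.step()    # per-epoch scheduler

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                       for x, y in val_loader) / len(val_loader)
    if val_loss < best_val:    # save checkpoint if best
        best_val = val_loss
        torch.save(model.state_dict(), "best_checkpoint.pt")
```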

Debugging Checklist

  1. Overfit one batch first: If the model cannot memorize a single batch, there is a bug in the pipeline.
  2. Check loss scale: Initial loss should be approximately $-\log(1/C) = \log C$ for $C$ classes with random initialization.
  3. Verify gradients are nonzero: Zero gradients indicate dead neurons, wrong activation, or a disconnected computation graph.
  4. Monitor gradient norms: Exploding norms suggest the learning rate is too high or initialization is poor.
  5. Ensure train/eval modes are correct: model.train() during training, model.eval() during evaluation.
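
A minimal sketch of checks 1 through 4: overfit a single fixed batch, compare the initial loss to $\log C$, and watch the gradient norm. The model, data, and learning rate are placeholders chosen so memorization happens quickly:

```python
import math
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 5))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)   # higher LR is fine for this check
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(16, 64), torch.randint(0, 5, (16,))      # one fixed batch

print(f"expected initial loss ~ {math.log(5):.3f}")          # -log(1/C) for C = 5

for step in range(500):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.4f}, grad norm {grad_norm.item():.3f}")

# A healthy pipeline drives this loss toward zero; a plateau points to a bug.
```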

Common Pitfalls

  1. Softmax before CrossEntropyLoss: The loss applies log-softmax internally, so an extra softmax flattens the logits and shrinks gradients, silently degrading training.
  2. Missing zero_grad(): Gradients accumulate silently across iterations.
  3. Wrong scheduler frequency: Calling a per-epoch scheduler every batch decays LR far too fast.
  4. Forgetting model.eval(): Batch norm uses incorrect statistics; dropout remains active during evaluation.
  5. Applying weight decay to bias and norm parameters: These should typically have zero weight decay; split them into their own parameter group, as in the sketch below.
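
A minimal sketch of that parameter-group split, using the common heuristic that 1-D parameters are biases or normalization scales; the model here is a placeholder:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.LayerNorm(64), nn.ReLU(), nn.Linear(64, 10))

decay, no_decay = [], []
for param in model.parameters():
    # Biases and norm scales/shifts are 1-D; keep them out of weight decay.
    (no_decay if param.ndim == 1 else decay).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=3e-4,
)
```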

Looking Ahead

  • Chapter 13 (Regularization and Generalization): Dropout, data augmentation, weight decay, early stopping, mixup/CutMix, and the science of preventing overfitting.
  • Chapter 14 (Convolutional Neural Networks): Applying the training techniques from this chapter to spatial architectures for computer vision.
  • Chapter 15 (Recurrent Neural Networks): Gradient clipping becomes essential; normalization choices differ for sequential data.