Training a deep network successfully requires orchestrating many interacting components: loss functions, optimizers, learning rate schedules, normalization layers, weight initialization, gradient clipping, and mixed precision. The difference between a model that converges and one that diverges often comes down to a handful of deliberate engineering choices. This chapter provides the complete toolkit for making deep networks learn reliably.
Loss Functions
| Task | Loss Function | Output Activation | PyTorch |
|---|---|---|---|
| Regression | MSE | Linear (none) | nn.MSELoss() |
| Regression with outliers | Huber / Smooth L1 | Linear (none) | nn.SmoothL1Loss() |
| Binary classification | BCE | None (raw logits; sigmoid applied internally) | nn.BCEWithLogitsLoss() |
| Multi-class classification | Cross-Entropy | None (raw logits) | nn.CrossEntropyLoss() |
| Class-imbalanced detection | Focal Loss | None (raw logits) | Custom implementation |
| Knowledge distillation | KL Divergence | Log-softmax | nn.KLDivLoss() |
Never apply softmax before CrossEntropyLoss in PyTorch. The loss function handles log-softmax internally for numerical stability.
Label smoothing replaces hard one-hot targets with soft targets, reducing overconfidence. Available via nn.CrossEntropyLoss(label_smoothing=0.1).
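A minimal sketch of the raw-logits convention together with label smoothing; the batch size, class count, and random tensors below are placeholders:

```python
import torch
import torch.nn as nn

# Hypothetical batch: 8 samples, 5 classes. Logits come straight from the
# final Linear layer -- no softmax applied by us.
logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))

# CrossEntropyLoss applies log-softmax internally; label smoothing softens
# the one-hot targets to reduce overconfidence.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
loss = criterion(logits, targets)
print(loss.item())
```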
Optimizers
| Optimizer | Key Properties | Typical LR | Best For |
|---|---|---|---|
| SGD + Momentum | Simple, generalizes well, needs tuning | 0.1 (with schedule) | CNNs, long training runs |
| Adam | Adaptive per-parameter LR, fast convergence | 3e-4 | Quick prototyping, GANs |
| AdamW | Decoupled weight decay, best generalization | 1e-4 to 5e-4 | Transformers, NLP, fine-tuning |
AdamW (decoupled weight decay) is preferred over Adam + L2 regularization because Adam's adaptive scaling interferes with standard L2 penalties.
Parameter groups allow different learning rates for different parts of the model---essential for fine-tuning pretrained models.
The three-step pattern is universal: optimizer.zero_grad(), loss.backward(), optimizer.step().
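A sketch combining both ideas, using a hypothetical two-part model (a backbone to fine-tune slowly plus a fresh head); the learning rates and weight decay are illustrative, not recommendations:

```python
import torch
import torch.nn as nn

class FineTuneModel(nn.Module):
    """Hypothetical model: a (pretend) pretrained backbone plus a new head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(128, 64)
        self.head = nn.Linear(64, 10)

    def forward(self, x):
        return self.head(torch.relu(self.backbone(x)))

model = FineTuneModel()

# Parameter groups: small LR for the backbone, larger LR for the new head.
optimizer = torch.optim.AdamW([
    {"params": model.backbone.parameters(), "lr": 1e-5},
    {"params": model.head.parameters(), "lr": 1e-4},
], weight_decay=0.01)

criterion = nn.CrossEntropyLoss()
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))

# The universal three-step pattern.
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```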
Learning Rate Schedules
Step decay: Multiply LR by a factor (e.g., 0.1) at predetermined epochs. Simple but requires knowing when to drop.
Cosine annealing: Smoothly reduces LR following a cosine curve. Consistently performs well with minimal tuning.
Linear warmup: Gradually increases LR from near-zero over the first few hundred/thousand steps. Critical for transformers and large batch sizes.
One-cycle policy: Increases LR to a peak, then decreases to well below the starting value. Often achieves best results in fewer epochs.
Per-epoch schedulers call scheduler.step() after each epoch; per-step schedulers call it after each batch. Mixing these up is a common bug.
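One way to combine linear warmup with cosine annealing is PyTorch's SequentialLR; the step counts below are placeholders, and the combined schedule here is stepped per batch:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 500, 10_000  # placeholder step counts

scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps),
        CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).sum()  # placeholder forward pass and loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # per-step schedule: called after every batch
```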
Normalization Layers
| Type | Normalizes Over | Use Case | Batch-Size Sensitive? |
|---|---|---|---|
| Batch Norm | (N, H, W) per channel | CNNs, large batches | Yes |
| Layer Norm | (C, H, W) per sample | Transformers, RNNs | No |
| Group Norm | Groups of channels per sample | Small batches, detection | No |
| Instance Norm | (H, W) per channel per sample | Style transfer | No |
Batch normalization requires calling model.eval() before inference to use running statistics instead of batch statistics.
Layer normalization is batch-size independent and is the standard for transformers.
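A minimal sketch of the train/eval distinction for batch norm, alongside a layer norm that behaves identically in both modes; the channel and spatial sizes are arbitrary:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(16)        # normalizes over (N, H, W) per channel
ln = nn.LayerNorm([16, 8, 8])  # normalizes over (C, H, W) per sample

x = torch.randn(4, 16, 8, 8)

bn.train()
y_train = bn(x)  # uses batch statistics and updates the running mean/var

bn.eval()
y_eval = bn(x)   # uses the stored running statistics instead

y_ln = ln(x)     # layer norm: same behavior regardless of batch size or mode
```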
Weight Initialization
Initializing all weights to zero fails to break symmetry: every neuron in a layer computes identical gradients and the neurons never differentiate.
Biases are almost always initialized to zero.
Proper initialization keeps activation variance and gradient variance stable across layers.
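A sketch of one common recipe, Kaiming (He) initialization for ReLU networks with zero biases, applied via model.apply(); the toy model and the choice of kaiming_normal_ are illustrative:

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """Kaiming-normal init for linear/conv weights; zero biases."""
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.apply(init_weights)  # applies init_weights to every submodule
```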
Gradient Clipping
Norm clipping (clip_grad_norm_): Scales the entire gradient vector if its norm exceeds a threshold. Preserves gradient direction.
Value clipping (clip_grad_value_): Clips individual gradient elements. Can change gradient direction.
Norm clipping with max_norm=1.0 is the standard approach and is essential for RNNs and transformers.
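A sketch showing where norm clipping sits in the update, between backward() and step(); the model and loss are placeholders:

```python
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

optimizer.zero_grad()
loss = model(torch.randn(8, 10)).sum()  # placeholder loss
loss.backward()

# Rescale the whole gradient vector if its global L2 norm exceeds 1.0;
# the direction is preserved, only the magnitude is capped.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
```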
Mixed Precision Training
Uses FP16 for forward/backward passes (faster, less memory) and FP32 for parameter updates (maintains precision).
Loss scaling prevents FP16 gradient underflow by multiplying the loss before backward and dividing gradients after.
PyTorch's torch.amp autocast and GradScaler handle this automatically.
BF16 (bfloat16) has the same exponent range as FP32, eliminating the need for loss scaling on hardware that supports it.
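A sketch of the autocast plus GradScaler pattern, assuming a CUDA device and a recent PyTorch (older versions expose the scaler as torch.cuda.amp.GradScaler); the model and batch are placeholders:

```python
import torch

device = "cuda"  # mixed precision as shown here assumes a GPU
model = torch.nn.Linear(128, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

# GradScaler handles loss scaling for FP16.
scaler = torch.amp.GradScaler("cuda")

x = torch.randn(32, 128, device=device)
y = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
with torch.amp.autocast(device_type="cuda", dtype=torch.float16):
    loss = criterion(model(x), y)  # forward pass runs in FP16 where safe

scaler.scale(loss).backward()      # scale the loss to avoid FP16 underflow
scaler.step(optimizer)             # unscales gradients, then steps the optimizer
scaler.update()                    # adjusts the scale factor for the next step
```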
The Complete Training Loop
1. Initialize model with proper initialization
2. Configure optimizer with parameter groups
3. Set up learning rate scheduler (warmup + decay)
4. For each epoch:
a. model.train()
b. For each batch:
- optimizer.zero_grad()
- Forward pass (with autocast for mixed precision)
- Compute loss
- loss.backward() (with grad scaler for mixed precision)
- Clip gradients
- optimizer.step()
- scheduler.step() (if per-step)
c. scheduler.step() (if per-epoch)
d. model.eval() + validation
e. Save checkpoint if best
f. Check early stopping
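A condensed sketch of the outline above, with a random tensor dataset, a toy model, and a one-cycle schedule standing in for real components; mixed precision and early stopping are omitted for brevity but would slot in as described earlier:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholders: any model, dataset, and hyperparameters would do here.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()

data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
train_loader = DataLoader(data, batch_size=64, shuffle=True)
val_loader = DataLoader(data, batch_size=64)

num_epochs = 5
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-4, total_steps=num_epochs * len(train_loader))

best_val = float("inf")
for epoch in range(num_epochs):
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()  # OneCycleLR is a per-step scheduler

    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for x, y in val_loader:
            x, y = x.to(device), y.to(device)
            val_loss += criterion(model(x), y).item() * len(x)
    val_loss /= len(val_loader.dataset)

    if val_loss < best_val:  # checkpoint on improvement
        best_val = val_loss
        torch.save(model.state_dict(), "best.pt")
```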
Debugging Checklist
Overfit one batch first: If the model cannot memorize a single batch, there is a bug in the pipeline (see the sketch after this checklist).
Check loss scale: Initial loss should be approximately $-\log(1/C)$ for $C$ classes with random initialization.
Verify gradients are nonzero: Zero gradients indicate dead neurons, wrong activation, or a disconnected computation graph.
Monitor gradient norms: Exploding norms suggest the learning rate is too high or initialization is poor.
Ensure train/eval modes are correct: model.train() during training, model.eval() during evaluation.
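A sketch of the first two checks, using a hypothetical 10-class model: the initial loss should sit near $\log C$, and a few hundred steps on a single batch should drive it toward zero:

```python
import math
import torch
import torch.nn as nn

num_classes = 10
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, num_classes))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 128), torch.randint(0, num_classes, (32,))

# Sanity check 1: initial loss should be close to -log(1/C) = log(C).
print(criterion(model(x), y).item(), "vs expected", math.log(num_classes))

# Sanity check 2: the model should drive the loss near zero on one batch.
for _ in range(500):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
print("loss after overfitting one batch:", loss.item())
```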
Common Pitfalls
Softmax before CrossEntropyLoss: Double-softmax kills gradients.
Missing zero_grad(): Gradients accumulate silently across iterations.
Wrong scheduler frequency: Calling a per-epoch scheduler every batch decays LR far too fast.
Forgetting model.eval(): Batch norm uses incorrect statistics; dropout remains active during evaluation.
Applying weight decay to bias and norm parameters: These should typically be excluded, i.e., given zero weight decay, as sketched below.
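A sketch of one common heuristic: treat all 1-D parameters (biases, norm scales and shifts) as a no-decay group; the model and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.LayerNorm(64), nn.ReLU(), nn.Linear(64, 10))

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Biases and normalization parameters are 1-D; exclude them from decay.
    if param.ndim == 1 or name.endswith(".bias"):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW([
    {"params": decay, "weight_decay": 0.01},
    {"params": no_decay, "weight_decay": 0.0},
], lr=3e-4)
```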
Looking Ahead
Chapter 13 (Regularization and Generalization): Dropout, data augmentation, weight decay, early stopping, mixup/CutMix, and the science of preventing overfitting.
Chapter 14 (Convolutional Neural Networks): Applying the training techniques from this chapter to spatial architectures for computer vision.