Chapter 7: Key Takeaways

  1. Initialization determines whether training starts at all. Weight variance must scale as $O(1/n_{\text{in}})$ to prevent activations from exploding or collapsing through layers. Xavier initialization ($\text{Var} = 2/(n_{\text{in}} + n_{\text{out}})$) preserves variance for linear and tanh activations. He initialization ($\text{Var} = 2/n_{\text{in}}$) corrects for ReLU's variance-halving property. Bad initialization is not a minor inconvenience — it can make training impossible, and the optimizer cannot recover from it.
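To make the scaling concrete, here is a small numpy sketch (width, depth, sample count, and seed are arbitrary illustration values) showing signal magnitude through a deep ReLU stack under $\text{Var} = 1/n_{\text{in}}$ versus He's $\text{Var} = 2/n_{\text{in}}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 20
x0 = rng.standard_normal((1000, n))

def final_std(weight_std):
    """Push activations through `depth` ReLU layers; report output std."""
    x = x0.copy()
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * weight_std
        x = np.maximum(x @ W, 0.0)            # ReLU halves the variance
    return x.std()

naive = final_std(1.0 / np.sqrt(n))           # Var = 1/n_in: signal decays
he = final_std(np.sqrt(2.0 / n))              # Var = 2/n_in: signal preserved
```

With the naive scale, each ReLU layer halves the variance, so after 20 layers the activations have shrunk by roughly $2^{-10}$ in standard deviation; the He scale compensates for exactly that factor of 2 and keeps the output std near 1.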

  2. Batch normalization helps by smoothing the loss landscape, not (primarily) by reducing internal covariate shift. Santurkar et al. (2018) demonstrated that BN's benefit comes from making the loss function more Lipschitz continuous, producing more predictive gradients and enabling higher learning rates. Layer normalization replaces batch normalization for transformers and any setting where batch statistics are unreliable (small batches, variable-length sequences, inference with batch size 1).
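The key property of layer normalization is that its statistics are computed per example over the feature axis, so they do not depend on the batch at all. A minimal numpy sketch (function and parameter names are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last (feature) axis; gamma/beta are learned."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Works identically with batch size 1, where batch statistics would be useless.
x = np.random.default_rng(0).standard_normal((1, 8))
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```

After normalization (with $\gamma = 1$, $\beta = 0$) each example has zero mean and unit variance across its features, regardless of what else is in the batch.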

  3. Dropout is stochastic regularization — an exponentially efficient ensemble. Zeroing neurons with probability $p$ during training is equivalent to averaging $2^n$ subnetworks at inference time. Inverted dropout (scaling by $1/(1-p)$ during training) is the standard implementation because it requires no modification at inference time. Dropout rates of 0.1-0.5 are typical, with higher rates for wider layers and lower rates near the input.
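The inverted-dropout trick can be sketched in a few lines of numpy (the function name and batch shape are illustrative):

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: scale at train time so inference is a no-op."""
    if not training or p == 0.0:
        return x                               # identity at inference
    mask = rng.random(x.shape) >= p            # keep with probability 1 - p
    return x * mask / (1.0 - p)                # rescale surviving units

rng = np.random.default_rng(0)
x = np.ones(10000)
y_train = dropout(x, p=0.3, training=True, rng=rng)
y_eval = dropout(x, p=0.3, training=False, rng=rng)
```

The $1/(1-p)$ scaling makes the expected activation at train time equal to the untouched activation at inference time, which is why no test-time rescaling is needed.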

  4. Use AdamW, not Adam with `weight_decay`, for proper regularization. L2 regularization in Adam couples the regularization strength to the adaptive learning rate, producing uneven regularization across parameters. Decoupled weight decay (AdamW) applies uniform regularization regardless of gradient history. This distinction is especially important for correlated features, where L2-via-Adam under-regularizes the features that need it most.
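A single-step sketch makes the coupling visible. This simplified update omits Adam's bias correction for brevity, and the hyperparameters and test weights are illustrative:

```python
import numpy as np

def adam_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              wd=0.0, decoupled=False):
    """One Adam/AdamW step (bias correction omitted to keep the sketch short)."""
    if not decoupled:
        g = g + wd * w                     # L2: decay gets rescaled by 1/sqrt(v)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    w = w - lr * m / (np.sqrt(v) + eps)
    if decoupled:
        w = w - lr * wd * w                # AdamW: uniform decay, ignores v
    return w, m, v

# A large and a small weight with zero data gradient: only decay acts.
w0 = np.array([100.0, 0.01])
zeros = np.zeros(2)
w_l2, _, _ = adam_step(w0, zeros, zeros, zeros, wd=0.1)
w_adamw, _, _ = adam_step(w0, zeros, zeros, zeros, wd=0.1, decoupled=True)
rel_l2 = (w0 - w_l2) / w0          # relative shrinkage varies by orders of magnitude
rel_adamw = (w0 - w_adamw) / w0    # identical relative shrinkage for both weights
```

Because Adam's adaptive denominator normalizes the gradient magnitude, L2-via-Adam shrinks both weights by roughly the same *absolute* amount, which barely touches the large weight while hammering the small one; decoupled decay shrinks every weight by the same *fraction* $\text{lr} \cdot \text{wd}$.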

  5. The one-cycle learning rate policy often provides the best convergence speed and generalization. The three-phase cycle (warmup, high-LR exploration, cosine annealing) exploits the insight that large learning rates act as regularizers (preventing sharp minima) while small learning rates enable fine convergence. The one-cycle policy frequently matches or exceeds hand-tuned schedules in 5-10x fewer epochs, with only one hyperparameter to set (maximum learning rate).
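A warmup-plus-cosine-annealing variant of the schedule can be written as a pure function of the step count. The warmup fraction and the floor learning rate below are illustrative choices, not prescribed values:

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_warmup=0.3, min_lr_ratio=0.01):
    """Linear warmup to max_lr, then cosine annealing down to a small floor."""
    warmup_steps = int(total_steps * pct_warmup)
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)        # phase 1: warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))      # decays 1 -> 0
    return max_lr * (min_lr_ratio + (1 - min_lr_ratio) * cosine)

lrs = [one_cycle_lr(s, total_steps=100, max_lr=1.0) for s in range(101)]
```

The only hyperparameter that genuinely needs tuning is `max_lr`; the shape of the curve (ramp up, brief peak, smooth decay) is fixed by the policy itself.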

  6. Mixed precision training (bf16 or fp16 with loss scaling) should be the default on modern hardware. The 2x memory savings and 2-8x speed improvement are too significant to ignore. bf16 (on Ampere+ GPUs and TPUs) is preferred because its wider dynamic range eliminates the need for loss scaling. Keep loss computation, softmax, normalization statistics, and optimizer state in fp32.
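Why fp16 needs loss scaling can be demonstrated directly with numpy's half-precision type (the gradient value and scale factor below are illustrative; real implementations scale the loss before backprop and keep a fp32 master copy of the weights):

```python
import numpy as np

grad = np.float32(1e-8)               # a realistically tiny gradient

# fp16's smallest representable positive value is ~6e-8, so 1e-8 flushes to 0.
underflowed = np.float16(grad)

# Scaling the loss (and hence every gradient) moves it into fp16's range.
scale = np.float32(1024.0)
scaled = np.float16(grad * scale)     # now representable
recovered = np.float32(scaled) / scale  # unscale in fp32 before the update
```

bf16 sidesteps this entirely: it keeps fp32's 8 exponent bits (so nothing this small underflows) at the cost of mantissa precision, which is why no loss scaling is required.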

  7. Debug training systematically: data first, then single-batch overfit, then gradient flow, then scheduling. A reproducible debugging playbook — verify the data pipeline, overfit one batch, check gradient norms per layer, use the learning rate finder, add regularization incrementally — eliminates guesswork and catches the most common failures (shuffled labels, dead neurons, incorrect learning rate) within minutes rather than days.
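The single-batch-overfit step can be sketched with a stand-in model: if even a tiny linear regressor cannot drive one batch's loss to near zero, something upstream (data, loss, or gradients) is broken. Shapes, seed, learning rate, and step count are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))            # one small batch
w_true = rng.standard_normal(4)
y = X @ w_true                             # noiseless targets: loss -> 0 is achievable

w = np.zeros(4)
for _ in range(5000):
    pred = X @ w
    grad = 2 * X.T @ (pred - y) / len(y)   # d(MSE)/dw
    w -= 0.1 * grad                        # plain gradient descent
loss = np.mean((X @ w - y) ** 2)           # should be ~0 if the pipeline is sound
```

The same check scales up directly: swap in the real model, loss, and one real batch, and a loss that plateaus far from zero points at a bug rather than a capacity limit.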