- Use Adam with $\beta_1 = 0.0$ and $\beta_2 = 0.9$ for WGAN-GP (note: $\beta_1 = 0$, not the default 0.9).
- Learning rates between $10^{-4}$ and $2 \times 10^{-4}$ work well for most architectures.
- If training diverges, reduce the learning rate rather than adding regularization.
- Save checkpoints
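To make the effect of $\beta_1 = 0$ concrete, here is a minimal single-parameter Adam update with the settings above ($\beta_1 = 0.0$, $\beta_2 = 0.9$, learning rate $10^{-4}$). This is a framework-free sketch for illustration; in PyTorch the equivalent is `torch.optim.Adam(params, lr=1e-4, betas=(0.0, 0.9))`.

```python
def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.0, beta2=0.9, eps=1e-8):
    """One Adam update for a scalar parameter (illustrative sketch)."""
    # With beta1 = 0 the first-moment estimate is just the current gradient,
    # i.e. the optimizer carries no momentum across steps.
    m = beta1 * m + (1 - beta1) * grad
    # Second-moment (squared-gradient) estimate with beta2 = 0.9.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction; the first-moment correction is a no-op when beta1 = 0.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v
```

Turning off momentum this way is what makes Adam stable when the critic's loss surface shifts every few steps, which is the usual motivation cited for these WGAN-GP settings.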