During DPO/RLHF training, monitor the following:

- Chosen reward: $\hat{r}(y_w)$ should increase
- Rejected reward: $\hat{r}(y_l)$ should decrease
- Reward margin: $\hat{r}(y_w) - \hat{r}(y_l)$ should increase, but not explode
- KL divergence from reference: should remain bounded (typically < 10 nats)
- Accuracy: fraction of pairs correctly ordered
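
As a rough illustration, the quantities above could be computed from per-sequence log-probabilities along the following lines. This is a minimal sketch, not a reference implementation: it assumes the DPO implicit reward $\hat{r}(y) = \beta\,(\log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x))$, and the function names `dpo_metrics` and `kl_estimate`, their arguments, and the default `beta` are hypothetical choices made for this example.

```python
from statistics import mean


def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Batch-level monitoring metrics from summed sequence log-probs.

    Assumes the DPO implicit reward beta * (log pi_theta - log pi_ref);
    beta=0.1 is an illustrative default, not a recommendation.
    """
    # Implicit rewards for the chosen (y_w) and rejected (y_l) responses.
    chosen = [beta * (p - r)
              for p, r in zip(policy_chosen_logps, ref_chosen_logps)]
    rejected = [beta * (p - r)
                for p, r in zip(policy_rejected_logps, ref_rejected_logps)]
    # Per-pair margin r(y_w) - r(y_l).
    margins = [c - r for c, r in zip(chosen, rejected)]
    return {
        "reward_chosen": mean(chosen),
        "reward_rejected": mean(rejected),
        "reward_margin": mean(margins),
        # A pair counts as correctly ordered when its margin is positive.
        "accuracy": mean(1.0 if m > 0 else 0.0 for m in margins),
    }


def kl_estimate(policy_logps, ref_logps):
    """Monte Carlo estimate of KL(pi_theta || pi_ref).

    Only meaningful if the sequences were sampled from the current policy;
    the result is in nats when the inputs are natural-log probabilities.
    """
    return mean(p - r for p, r in zip(policy_logps, ref_logps))


if __name__ == "__main__":
    metrics = dpo_metrics(
        policy_chosen_logps=[-10.0, -12.0],
        policy_rejected_logps=[-9.0, -8.0],
        ref_chosen_logps=[-11.0, -12.5],
        ref_rejected_logps=[-8.0, -8.5],
        beta=0.5,
    )
    print(metrics)
    print(kl_estimate([-1.0, -2.0], [-1.5, -3.0]))
```

Logging these values every few steps makes it easier to spot a margin that grows without bound or a KL estimate that drifts upward, both of which can signal the policy moving too far from the reference.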