The policy gradient would have very high variance, requiring many more episodes to converge. - The learning rate would need to be very small to prevent instability in high-reward episodes, but this would make learning painfully slow in low-reward episodes. - In sports betting environments where indi