- Are the metrics improving while qualitative assessments suggest no improvement?
- Are there statistical anomalies (clustering near thresholds, suspicious patterns)?
- Have definitions or categories been changed in ways that improve the metric without changing the reality?
- Is there a gap between pe