Case Study 02: Building a Quality Dashboard
Scenario Overview
CloudKitchen is a food delivery platform startup with 25 developers organized into four squads: Orders, Payments, Delivery, and Platform. The company adopted AI coding assistants nine months ago, and while feature velocity increased significantly, the CTO, James, grew concerned about quality trends. Production incidents were trending upward, customer complaints about order accuracy had increased, and the most experienced developers were spending an unsustainable amount of time firefighting.
James tasked senior engineer Sophia with building a quality monitoring dashboard that would give every squad visibility into their code quality and help the engineering organization make data-driven decisions about quality investments. This case study follows the eight-week project from conception to deployment.
The Problem
Before the dashboard project, CloudKitchen had quality tooling, but it was fragmented and invisible:
- Ruff and mypy ran in CI, but failures were treated as annoying speed bumps rather than quality signals. Developers frequently added # type: ignore comments to make mypy pass rather than fixing the underlying type issues.
- Test coverage was measured by Codecov, but nobody looked at the reports. Coverage had declined from 82% to 64% over nine months.
- Complexity metrics were not measured at all. Several modules had functions exceeding cyclomatic complexity of 30.
- Dependency vulnerabilities were scanned by Dependabot, but PRs from Dependabot sat unreviewed for weeks.
- Review metrics were not tracked. Some PRs were reviewed in hours; others sat for over a week.
The data existed in scattered tools and reports. What was missing was a unified view that made quality visible, trackable, and actionable.
Discovery and Design (Weeks 1-2)
Stakeholder Interviews
Sophia interviewed developers from each squad, the QA lead, and the CTO. Key findings:
Developers wanted: Fast feedback on their own code quality, comparison with team norms (not individual ranking), and clear thresholds for what "good enough" means.
Squad leads wanted: Trend visibility for their squad's codebase, ability to identify which modules need quality investment, and data to justify allocating time for refactoring.
CTO wanted: Organization-wide quality trends, correlation between quality metrics and production incidents, and leading indicators that predict quality problems before they become incidents.
Metric Selection
Based on stakeholder input, Sophia selected three tiers of metrics:
Tier 1 — Critical Health Indicators (visible on the main dashboard):
- Test coverage percentage (by squad and by module)
- Open high/critical vulnerability count
- Average cyclomatic complexity
- Production incident count (trailing 30 days)
- Mean time to review (PR open to first review)
Tier 2 — Detailed Quality Metrics (visible on drill-down pages):
- Cognitive complexity distribution
- Maintainability index by module
- Code duplication percentage
- Type coverage (percentage of functions with type hints)
- Linter violation trends
- Test pass rate and flaky test count
Tier 3 — Process Metrics (visible on team health page):
- PR size distribution
- Review comment density
- Time to merge
- Revert rate
- AI assistance disclosure rate
Architecture Decision
Sophia evaluated three architecture options:
Option A: SaaS platform (SonarQube Cloud). Full-featured but expensive ($2,400/month for their team size) and required sharing code with a third party, which conflicted with CloudKitchen's security policy.
Option B: Self-hosted SonarQube. Full-featured but heavy to operate—required a dedicated server, database, and ongoing maintenance. The Platform squad pushed back on the operational burden.
Option C: Custom lightweight dashboard. Build a data collection pipeline using existing tools (radon, ruff, mypy, pytest-cov) and a simple web frontend. Lower cost, full control, but required development effort.
Sophia chose Option C with a pragmatic twist: use Python scripts for data collection, store metrics in a PostgreSQL database (already used by the platform), and build the dashboard as a simple internal web app. The total infrastructure cost was near zero since it used existing resources.
Implementation (Weeks 3-6)
Data Collection Pipeline
Sophia built a data collection system (see code/case-study-code.py and code/example-03-quality-dashboard.py for implementation details) with three components:
1. Metric Collectors
Each collector was a Python module that ran a specific analysis tool and extracted structured data:
# Conceptual architecture of the collector system
collectors = {
"complexity": {
"tool": "radon",
"frequency": "per_commit",
"metrics": ["cyclomatic_complexity", "cognitive_complexity",
"maintainability_index"],
},
"coverage": {
"tool": "pytest-cov",
"frequency": "per_commit",
"metrics": ["line_coverage", "branch_coverage", "missing_lines"],
},
"lint": {
"tool": "ruff",
"frequency": "per_commit",
"metrics": ["violation_count", "violation_by_category",
"fixable_count"],
},
"types": {
"tool": "mypy",
"frequency": "per_commit",
"metrics": ["error_count", "note_count", "files_checked"],
},
"security": {
"tool": "bandit",
"frequency": "daily",
"metrics": ["high_count", "medium_count", "low_count",
"findings_by_cwe"],
},
"dependencies": {
"tool": "pip-audit",
"frequency": "daily",
"metrics": ["vulnerable_count", "outdated_count",
"severity_distribution"],
},
}
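Each collector reduced one tool's report to a flat set of named values. As a minimal sketch of the coverage collector (assuming the JSON report layout that coverage.py emits for pytest --cov-report=json; this function is illustrative, not the actual collect_coverage.py script):

```python
import json
from pathlib import Path


def collect_coverage(report_path: str = "coverage.json") -> dict[str, float]:
    """Parse a coverage.py JSON report into flat metric values.

    Assumes the report layout produced by `pytest --cov-report=json`,
    where aggregate numbers live under the "totals" key.
    """
    report = json.loads(Path(report_path).read_text())
    totals = report["totals"]
    return {
        "line_coverage": float(totals["percent_covered"]),
        "missing_lines": float(totals["missing_lines"]),
    }
```

The other collectors follow the same shape: run the tool (or read its report file), extract a handful of numbers, and hand them to the push step.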
2. Data Store
Metrics were stored in PostgreSQL with a simple schema:
CREATE TABLE quality_metrics (
id SERIAL PRIMARY KEY,
timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
repository VARCHAR(255) NOT NULL,
branch VARCHAR(255) NOT NULL,
commit_sha VARCHAR(40) NOT NULL,
squad VARCHAR(100),
metric_name VARCHAR(255) NOT NULL,
metric_value FLOAT NOT NULL,
metric_unit VARCHAR(50),
module_path VARCHAR(500),
metadata JSONB
);
CREATE INDEX idx_metrics_time ON quality_metrics (timestamp);
CREATE INDEX idx_metrics_repo ON quality_metrics (repository, metric_name);
CREATE INDEX idx_metrics_squad ON quality_metrics (squad, metric_name, timestamp);
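Pushing a collector's output into this table is mostly a matter of flattening it into one row per metric. A sketch of that step (the helper name and the psycopg-style %s placeholders are assumptions about how push_metrics.py worked, not its actual contents):

```python
from typing import Any, Optional


def metrics_to_rows(
    repository: str,
    branch: str,
    commit_sha: str,
    squad: Optional[str],
    metrics: dict[str, float],
    module_path: Optional[str] = None,
) -> list[tuple[Any, ...]]:
    """Flatten one collector's output into rows for quality_metrics.

    Each (metric_name, metric_value) pair becomes one row; the columns
    mirror the table above (timestamp defaults in the database).
    """
    return [
        (repository, branch, commit_sha, squad, name, float(value), module_path)
        for name, value in metrics.items()
    ]


# The rows pair naturally with a cursor.executemany() call:
INSERT_SQL = """
    INSERT INTO quality_metrics
        (repository, branch, commit_sha, squad,
         metric_name, metric_value, module_path)
    VALUES (%s, %s, %s, %s, %s, %s, %s)
"""
```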
3. CI Integration
The data collection ran as a post-merge CI step (not on PRs, to avoid noise):
# .github/workflows/quality-metrics.yml
name: Quality Metrics Collection
on:
push:
branches: [main]
jobs:
collect-metrics:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install dependencies
run: pip install radon ruff mypy bandit pip-audit pytest pytest-cov
- name: Collect complexity metrics
run: python scripts/collect_complexity.py
- name: Collect coverage metrics
run: pytest --cov=src --cov-report=json && python scripts/collect_coverage.py
- name: Collect lint metrics
run: python scripts/collect_lint.py
- name: Collect security metrics
run: python scripts/collect_security.py
- name: Push metrics to database
run: python scripts/push_metrics.py
env:
METRICS_DB_URL: ${{ secrets.METRICS_DB_URL }}
Dashboard Frontend
The dashboard was built as a lightweight Flask application with server-rendered HTML and Chart.js for visualizations. Sophia deliberately avoided a complex frontend framework to keep maintenance simple.
The main dashboard page displayed:
Header section: Overall quality score (composite of all Tier 1 metrics), trend arrow (up/down from previous week), and days since last production incident.
Squad comparison section: A card for each squad showing their key metrics with traffic-light indicators. No individual developer metrics were displayed—the team agreed that quality is a team responsibility.
Trend section: Sparkline charts showing 90-day trends for each Tier 1 metric. Each chart included threshold lines showing the target values.
Alert section: Active quality alerts based on configured thresholds (see Section 30.9 in the main chapter).
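The composite score shown in the header section can be computed as a weighted normalization of the Tier 1 metrics. The targets and weights below are illustrative assumptions, not the values CloudKitchen actually used:

```python
def composite_quality_score(metrics: dict[str, float]) -> float:
    """Combine Tier 1 metrics into a single 0-100 score.

    Each metric is normalized to 0-1 against an assumed target, then
    weighted. All targets, weights, and metric names here are
    illustrative.
    """
    # (target, weight, higher_is_better) per Tier 1 metric -- assumptions
    spec = {
        "test_coverage": (80.0, 0.30, True),
        "open_high_vulns": (0.0, 0.25, False),
        "avg_complexity": (6.0, 0.20, False),
        "incidents_30d": (5.0, 0.15, False),
        "review_time_h": (24.0, 0.10, False),
    }
    score = 0.0
    for name, (target, weight, higher_is_better) in spec.items():
        value = metrics[name]
        if higher_is_better:
            normalized = min(value / target, 1.0) if target else 1.0
        else:
            # Full marks at or below target, linear decay to 0 at 2x target.
            limit = target if target else 1.0
            normalized = max(0.0, min(1.0, 2.0 - value / limit))
        score += weight * normalized
    return round(100 * score, 1)
```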
Alert System
Sophia implemented tiered alerts:
alert_config = {
"test_coverage_drop": {
"condition": "coverage < previous_week_coverage - 2",
"severity": "warning",
"channel": "squad_channel",
"message": "Test coverage dropped by {delta}% this week in {squad}.",
},
"critical_vulnerability": {
"condition": "high_vuln_count > 0 or critical_vuln_count > 0",
"severity": "critical",
"channel": "engineering_all",
"message": "{count} critical/high vulnerabilities detected. "
"See dashboard for details.",
},
"complexity_threshold": {
"condition": "max_cyclomatic > 20",
"severity": "warning",
"channel": "squad_channel",
"message": "Function {function_name} in {file} has cyclomatic "
"complexity {value}. Consider refactoring.",
},
"review_bottleneck": {
"condition": "avg_review_time > 48",
"severity": "warning",
"channel": "squad_lead",
"message": "Average review time for {squad} is {hours}h. "
"Consider redistributing review load.",
},
}
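Because the conditions are stored as plain Python expressions over metric names, one lightweight way to evaluate them is eval with builtins disabled. This is a sketch of one possible evaluator, not necessarily what CloudKitchen shipped; it is only acceptable because the config is trusted internal code, never user input:

```python
def evaluate_alerts(alert_config: dict, metrics: dict[str, float]) -> list[str]:
    """Return the names of alerts whose condition holds for `metrics`.

    Condition strings are evaluated as Python expressions with builtins
    disabled. Safe only for trusted, internally-maintained config.
    """
    fired = []
    for name, alert in alert_config.items():
        try:
            if eval(alert["condition"], {"__builtins__": {}}, dict(metrics)):
                fired.append(name)
        except NameError:
            # A metric referenced by the condition was not collected this run.
            continue
    return fired
```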
Alerts were sent to Slack channels and appeared on the dashboard. Critical alerts also triggered a PagerDuty notification to the relevant squad lead.
Rollout (Weeks 6-7)
Soft Launch
Sophia rolled out the dashboard to the Platform squad first (her own squad) for a one-week soft launch. This surfaced several issues:
Data accuracy problems. The complexity collector was counting test files, which inflated average complexity numbers. Sophia added a file filter to exclude test directories.
Misleading comparisons. The Payments squad's codebase was much smaller than Orders, making percentage-based metrics misleading. Sophia added absolute numbers alongside percentages and normalized some metrics by code size.
Missing context. Raw numbers without context were confusing. Sophia added tooltips explaining what each metric means, what the threshold is, and what action to take if the metric is in the red zone.
Performance issues. The initial dashboard queried the database on every page load, which was slow. Sophia added a materialized view that refreshed every hour for the main dashboard and kept real-time queries only for drill-down pages.
Full Launch
After addressing soft launch feedback, Sophia presented the dashboard at the weekly engineering all-hands. She emphasized three points:
- This is a team tool, not a surveillance tool. No individual developer metrics are tracked or displayed.
- The goal is visibility and improvement, not punishment. Red metrics mean "this needs attention," not "someone failed."
- Every squad owns their metrics. Squad leads are responsible for understanding their metrics and making improvement plans.
Results (Week 8 and Beyond)
Immediate Impact
Within two weeks of the full launch:
- The Orders squad discovered that their coverage had dropped to 58% in a critical payment validation module. They allocated two days of focused test writing and brought it back to 79%.
- The Delivery squad found three functions with cyclomatic complexity above 25 in their routing engine. They refactored these into smaller, composable functions, reducing complexity to under 10 each.
- The Payments squad realized they had 14 Dependabot PRs that had been open for over 30 days. They held a "dependency day" and resolved all of them.
- The Platform squad noticed their average PR review time was 52 hours—more than double the 24-hour target. They implemented a review rotation system that reduced it to 18 hours.
Three-Month Outcomes
| Metric | Before Dashboard | Three Months After | Change |
|---|---|---|---|
| Average test coverage | 64% | 78% | +14 points |
| Production incidents (monthly) | 18 | 7 | -61% |
| Average cyclomatic complexity | 8.4 | 5.7 | -32% |
| High/critical vulnerabilities | 23 | 2 | -91% |
| Average PR review time | 47 hours | 16 hours | -66% |
| Maintainability index (avg) | 52 | 71 | +37% |
| Developer satisfaction (survey) | 3.1/5.0 | 4.2/5.0 | +35% |
Cultural Shift
The most significant impact was cultural. Sophia observed several behavioral changes:
Quality became a conversation topic. Squad standup meetings started including a "quality check" where the team glanced at their dashboard card. This took 30 seconds but kept quality top of mind.
Friendly competition emerged. Squads developed a healthy competitive dynamic around their metrics. The Delivery squad celebrated when they achieved the highest maintainability index score. The Orders squad responded by focusing on their own scores the following sprint.
Proactive debt management. Squad leads began proactively allocating 20% of each sprint to quality improvement, using the dashboard to identify the highest-impact work. Previously, quality work only happened in response to production incidents.
AI code scrutiny increased. Developers became more careful about reviewing AI-generated code before committing it, because they knew the metrics would reflect any quality degradation. The AI assistance disclosure rate in PR templates increased from 30% to 85%.
The Quality Ratchet
Three months after launch, James (the CTO) proposed implementing a quality ratchet based on the dashboard data. The team agreed on the following rules:
- Test coverage cannot drop below the current value (checked weekly)
- Average cyclomatic complexity cannot increase above the current value (checked monthly)
- No new high/critical vulnerabilities can be merged (checked per PR)
- Maintainability index cannot drop below current value (checked monthly)
The ratchet was implemented as a CI check that compared current metrics against stored thresholds. If a PR would cause a metric to cross the ratchet threshold, the CI check failed with a clear message explaining which metric was affected and by how much.
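The comparison logic behind such a CI check is small. A hedged sketch, with illustrative metric names and message wording:

```python
def check_ratchet(
    current: dict[str, float],
    thresholds: dict[str, float],
    higher_is_better: dict[str, bool],
) -> list[str]:
    """Compare current metrics against stored ratchet thresholds.

    Returns human-readable failure messages; an empty list means the
    ratchet check passes.
    """
    failures = []
    for name, threshold in thresholds.items():
        value = current[name]
        if higher_is_better[name]:
            if value < threshold:
                failures.append(
                    f"{name} would drop to {value} (ratchet floor: {threshold})"
                )
        elif value > threshold:
            failures.append(
                f"{name} would rise to {value} (ratchet ceiling: {threshold})"
            )
    return failures
```

On each passing week, the stored thresholds can then be advanced to the new, better values, which is what makes it a ratchet rather than a fixed gate.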
Technical Challenges and Solutions
Challenge 1: Metric Attribution
Problem: When multiple squads contributed to the same module, it was unclear which squad owned the quality of that module.
Solution: Sophia implemented a CODEOWNERS-based attribution system. Each file was attributed to the squad that owned it according to the CODEOWNERS file. Shared modules were attributed to the Platform squad with a note about contributing squads.
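A simplified version of that attribution logic, using fnmatch rather than the full gitignore-style semantics GitHub applies to CODEOWNERS, but keeping the real file's last-match-wins rule:

```python
from fnmatch import fnmatch
from typing import Optional


def squad_for_file(path: str, codeowners_lines: list[str]) -> Optional[str]:
    """Attribute a file to an owner using CODEOWNERS-style rules.

    Simplified sketch: patterns are matched with fnmatch (so `*` also
    crosses `/`), and the last matching rule wins, as in real
    CODEOWNERS files.
    """
    owner = None
    for line in codeowners_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *owners = line.split()
        # Treat directory patterns like "src/payments/" as a prefix match.
        if pattern.endswith("/"):
            if path.startswith(pattern.lstrip("/")):
                owner = owners[0] if owners else None
        elif fnmatch(path, pattern.lstrip("/")):
            owner = owners[0] if owners else None
    return owner
```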
Challenge 2: Historical Data
Problem: The dashboard launched with no historical data, making trends meaningless for the first month.
Solution: Sophia wrote a backfill script that ran the metric collectors against every merge commit from the past six months. This provided immediate trend visibility at launch.
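The driver for such a backfill can lean entirely on git. A sketch, assuming the mainline branch is named main (as in the CI workflow above); each returned sha would be checked out in turn and the collectors re-run against it:

```python
import subprocess


def parse_git_log(output: str) -> list[tuple[str, str]]:
    """Split `git log --format=%H|%cI` output into (sha, iso_date) pairs."""
    return [
        tuple(line.split("|", 1))
        for line in output.splitlines()
        if line.strip()
    ]


def merge_commits(since: str = "6 months ago") -> list[tuple[str, str]]:
    """List merge commits on main since `since` for the backfill loop."""
    out = subprocess.run(
        ["git", "log", "--merges", "--first-parent", "main",
         f"--since={since}", "--format=%H|%cI"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_git_log(out)
```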
Challenge 3: Gaming Metrics
Problem: One developer was adding trivial tests (testing that True is True) to inflate coverage numbers without actually testing meaningful behavior.
Solution: The team added mutation testing scores (using mutmut) as a secondary metric alongside coverage. Mutation testing measures whether tests actually catch bugs by introducing small changes to the code and checking whether tests fail. Trivial tests score poorly on mutation testing because they do not exercise real code paths.
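A toy version of what mutmut does makes the counter-incentive concrete: mutate one operator, re-run the tests, and see which tests notice. Real tools generate many mutants and run the full suite against each; everything below is illustrative:

```python
import ast


def mutate_add_to_sub(source: str) -> str:
    """Apply one classic mutation: replace every `+` with `-`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
            node.op = ast.Sub()
    return ast.unparse(tree)


source = "def total(a, b):\n    return a + b\n"
mutant = mutate_add_to_sub(source)

namespace: dict = {}
exec(mutant, namespace)

# A trivial test like `assert True` still passes against the mutant
# (bad: the mutant "survives")...
trivial_test_passes = True
# ...while a meaningful assertion kills it:
real_test_passes = namespace["total"](2, 3) == 5
```

A surviving mutant means the tests did not constrain that behavior, which is exactly the signal trivial coverage-padding tests fail to provide.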
Challenge 4: Dashboard Maintenance
Problem: As the project grew, new modules were not automatically included in metric collection.
Solution: Sophia configured the collectors to automatically discover Python packages under the src/ directory. New modules were included automatically with default thresholds, and squad leads could customize thresholds per module.
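The discovery step itself is a few lines of pathlib. A sketch, treating any directory containing an __init__.py under src/ as a package to be collected:

```python
from pathlib import Path


def discover_packages(src_root: str = "src") -> list[str]:
    """Find every Python package under src/ for metric collection.

    A package is any directory containing __init__.py; new modules are
    picked up automatically on the next collection run. Per-module
    threshold overrides would be layered on top of this list.
    """
    root = Path(src_root)
    return sorted(
        str(path.parent.relative_to(root))
        for path in root.rglob("__init__.py")
    )
```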
Cost Analysis
| Item | One-Time Cost | Ongoing Monthly Cost |
|---|---|---|
| Development (Sophia, 6 weeks at 50%) | 120 hours | — |
| Dashboard infrastructure | 0 (used existing DB and server) | $0 |
| CI compute time for metric collection | — | ~$50 (additional CI minutes) |
| Maintenance and improvements | — | ~8 hours/month |
| Total | 120 hours | 8 hours + $50 |
Estimated savings (monthly):
- Reduced production incidents: 11 fewer incidents x 4 hours average resolution = 44 hours
- Faster code reviews: 31 hours average reduction across team per month
- Proactive debt reduction: avoiding approximately 2 emergency refactoring sessions per quarter (estimated 40 hours each) = ~27 hours/month amortized
Net monthly benefit: ~102 developer-hours saved per month, far exceeding the 8-hour maintenance cost.
Lessons Learned
- Start with metrics people already care about. Sophia succeeded by beginning with metrics that addressed pain points the team had already identified (slow reviews, production incidents), not abstract quality scores.
- Team-level metrics, not individual metrics. The decision to avoid individual developer metrics was crucial for adoption. Quality is a team property, and the dashboard reinforced this philosophy.
- Make the dashboard accessible and fast. The dashboard was linked from the team's Slack workspace topic and loaded in under two seconds. If it had been buried in a tool or slow to load, adoption would have suffered.
- Context beats numbers. Raw numbers are meaningless without context. Every metric needed a threshold, a trend indicator, and a tooltip explaining what it means and what to do about it.
- Backfill historical data. Launching with six months of historical trends made the dashboard immediately useful, rather than requiring a month of data collection before trends became visible.
- Guard against gaming. Any metric system will be gamed if the incentives are wrong. Mutation testing was an effective counter to coverage gaming. Pairing metrics (coverage + mutation score, complexity + maintainability index) makes gaming harder because improving one metric at the expense of another is visible.
- Quality dashboards drive cultural change. The biggest impact was not the technical improvements but the cultural shift toward proactive quality management. Making quality visible made it a shared team value rather than an individual burden.
Conclusion
CloudKitchen's quality dashboard project demonstrates that effective quality monitoring does not require expensive enterprise tools. A thoughtful combination of open-source analysis tools, a simple data pipeline, and a well-designed frontend can provide the visibility needed to drive significant quality improvements.
The key insight is that quality metrics are only valuable when they are visible, contextual, and actionable. A dashboard that sits in a bookmark folder collecting dust is worthless. A dashboard that is part of the team's daily workflow—visible in standups, referenced in sprint planning, and celebrated when metrics improve—becomes a powerful engine for continuous quality improvement.
For AI-assisted development teams, quality dashboards serve an additional critical function: they provide objective evidence of whether AI-generated code is helping or hurting overall code quality. This evidence enables teams to make informed decisions about how and where to use AI tools most effectively.