Case Study 2: When AI Changes Your Estimates

A Team Recalibrates After Six Months of AI-Assisted Development

The Scenario

Priya Sharma leads a development team of eight at Meridian Financial Services, a mid-size fintech company. The team maintains and extends a loan origination platform -- a complex system handling loan applications, credit checks, document verification, compliance workflows, and integration with three external credit bureaus.

Six months ago, Meridian's CTO approved a pilot program for AI coding tools. Priya's team was the first to adopt Claude Code and Cursor. The CTO wanted data: how much does AI actually improve productivity, and can the company use this data to plan its 2026 roadmap?

Priya was initially skeptical. Her team's work was not simple CRUD development. Loan origination involves intricate business rules, compliance requirements, multi-step workflows, and integration with legacy systems. She suspected AI acceleration would be modest at best.

She was both right and wrong -- in ways that fundamentally changed how her team plans and estimates.


The Before: Pre-AI Estimation Baseline

Before adopting AI tools, Priya's team had a well-calibrated estimation process. They used story points on a modified Fibonacci scale and had four years of velocity data. The most recent four quarters:

Quarter            Average Velocity (points/sprint)   Estimation Accuracy
Q1 2025            42                                 87%
Q2 2025            45                                 91%
Q3 2025            43                                 89%
Q4 2025 (pre-AI)   44                                 88%

The team was mature, consistent, and predictable. Sprint planning meetings rarely had surprises. Stakeholders trusted the estimates.

Then AI arrived.


Month 1: The Honeymoon Phase

The first month of AI adoption was exhilarating. Developers reported completing tasks in a fraction of the time. A CRUD interface for loan document metadata that would have taken two days was done in three hours. An integration test suite for the credit check API was generated in an afternoon instead of two days. The team's velocity for the first AI-augmented sprint was 71 story points -- a 61% increase over their pre-AI average.

Priya was cautiously optimistic. The CTO was enthusiastic. The product owner started requesting more features per sprint.

But beneath the headline velocity number, problems were brewing.


Month 2: The Reality Check

By the second month, the team noticed several troubling patterns:

Pattern 1: The Review Backlog

Code was being generated faster than it could be reviewed. The team's two senior developers, who handled most code reviews, were overwhelmed. The pull request queue grew from an average of 3 open PRs to 14. Average review time increased from 4 hours to 2.5 days. Developers were context-switching between writing new code and waiting for reviews on completed code.

Pattern 2: The Consistency Problem

Each developer's AI tool produced code in a slightly different style. More problematically, the AI tools were generating different approaches to the same patterns. One developer's AI used the factory pattern for loan product creation; another's used the builder pattern. One AI generated compliance checks as decorators; another put them inline in the business logic. The codebase was becoming a patchwork.

Pattern 3: The Estimation Chaos

With AI, some tasks took 10% of the original estimate while others took 90%. A developer who finished a task estimated at a full day in 30 minutes was left with 7.5 hours of unplanned slack. But the next task might be a complex compliance workflow where AI was nearly useless, and the developer would blow the estimate by 2x.

Sprint 2 velocity: 58 story points. Sprint 3 velocity: 63 story points. Sprint 4 velocity: 49 story points. The consistency that Priya's team had spent years building was gone.

The product owner was frustrated. "How can I plan a roadmap if I can't predict what the team will deliver?"


Month 3: The Diagnostic Phase

Priya decided to pause the feature rush and invest a full sprint in understanding what was happening. She implemented detailed tracking:

For every task completed over two sprints, the team recorded:

  • Task type (categories they defined as a team)
  • Estimated hours (traditional)
  • Actual hours (with AI)
  • AI utilization level (high, medium, low, none)
  • Code quality rating (1-5, assessed during review)
  • Number of review iterations before approval

After two sprints of data collection (89 tasks total), the picture became clear:

Task Type Analysis:

Task Type             Count   Avg Traditional Est   Avg AI Actual   Acceleration   Avg Quality
CRUD/boilerplate      18      6 hours               1.2 hours       5.0x           3.8
UI components         12      8 hours               2.5 hours       3.2x           3.5
Test generation       15      4 hours               0.8 hours       5.0x           3.2
API integration       11      10 hours              3.5 hours       2.9x           3.9
Business logic        14      12 hours              8.0 hours       1.5x           4.2
Compliance rules      9       16 hours              14.0 hours      1.1x           4.5
Legacy system work    6       14 hours              12.0 hours      1.2x           4.3
Architecture/design   4       8 hours               7.5 hours       1.1x           N/A

The data told a striking story:

  1. AI acceleration was bimodal, not uniform. Tasks were either dramatically accelerated (roughly 3-5x) or barely affected (1.0-1.5x). The "moderate acceleration" category was nearly empty.

  2. Quality inversely correlated with acceleration. The fastest AI-generated code (CRUD, tests) had the lowest quality ratings, while the least-accelerated code (compliance, legacy work) had the highest quality. This made sense: developers spent more time reviewing and manually adjusting AI output for complex tasks, producing higher-quality results.

  3. Test generation was fast but shallow. AI-generated tests achieved high line coverage but often missed the nuanced edge cases that matter in financial software. The team had to spend additional time adding meaningful test cases after the AI generated the scaffolding.
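The aggregation behind the task-type table is straightforward to sketch. The snippet below shows one way such per-type acceleration and quality figures could be derived from the fields the team tracked; the records, field names, and function name are illustrative, not Meridian's actual data or tooling.

```python
from collections import defaultdict
from statistics import mean

# Each record mirrors the fields the team tracked per task.
# The values below are illustrative examples, not the team's real data.
tasks = [
    {"type": "CRUD/boilerplate", "est_hours": 6.0, "actual_hours": 1.2, "quality": 4},
    {"type": "CRUD/boilerplate", "est_hours": 5.0, "actual_hours": 1.0, "quality": 3},
    {"type": "Compliance rules", "est_hours": 16.0, "actual_hours": 14.0, "quality": 5},
]

def acceleration_by_type(tasks):
    """Group tasks by type and compute count, acceleration factor
    (avg traditional estimate / avg AI-assisted actual), and avg quality."""
    groups = defaultdict(list)
    for t in tasks:
        groups[t["type"]].append(t)
    report = {}
    for task_type, group in groups.items():
        avg_est = mean(t["est_hours"] for t in group)
        avg_actual = mean(t["actual_hours"] for t in group)
        report[task_type] = {
            "count": len(group),
            "acceleration": round(avg_est / avg_actual, 1),
            "avg_quality": round(mean(t["quality"] for t in group), 1),
        }
    return report
```

Run over two sprints of real records, this kind of report is what surfaced the bimodal pattern: acceleration factors cluster near 5x or near 1x, with little in between.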


Month 4: The Recalibration

Armed with data, Priya redesigned the team's estimation and planning process. The key changes:

Change 1: Two-Category Task Classification

Instead of a three-tier (fast/moderate/slow) classification, Priya found that their data supported only two meaningful categories:

  • AI-Fast Tasks (acceleration 2.5x+): CRUD, boilerplate, UI components, test scaffolding, API integration, documentation
  • AI-Slow Tasks (acceleration 1.0-1.5x): Business logic, compliance rules, legacy system work, architecture, security-sensitive code

She applied acceleration factors of 3.5x for AI-Fast and 1.2x for AI-Slow. The simplicity helped developers quickly categorize tasks during sprint planning.

Change 2: Quality Tax

For every AI-Fast task, Priya added a "quality tax" of 30% of the AI-estimated time. This time was explicitly allocated for:

  • Code review (more code means more review)
  • Style and pattern consistency checks
  • Security review for AI-generated code
  • Test enhancement (adding edge cases to AI-generated tests)

So a task estimated at 6 hours traditionally, with a 3.5x AI factor, would not be estimated at 1.7 hours. It would be estimated at 1.7 + 0.5 = 2.2 hours, accounting for the quality tax.
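The arithmetic of Changes 1 and 2 can be captured in a few lines. The factors come from the team's calibration as described above; the function itself is an illustrative sketch, not Meridian's actual tooling.

```python
# Calibrated factors from the team's data (Changes 1 and 2).
AI_FAST_FACTOR = 3.5   # CRUD, boilerplate, UI, test scaffolding, API integration
AI_SLOW_FACTOR = 1.2   # business logic, compliance, legacy, architecture, security
QUALITY_TAX = 0.30     # applied to AI-Fast tasks only

def ai_estimate(traditional_hours, ai_fast):
    """Return the AI-adjusted estimate in hours.

    AI-Fast tasks are divided by the large factor, then taxed 30% to
    cover review, consistency checks, security review, and test
    enhancement. AI-Slow tasks get only the modest 1.2x factor.
    """
    factor = AI_FAST_FACTOR if ai_fast else AI_SLOW_FACTOR
    base = traditional_hours / factor
    if ai_fast:
        base += base * QUALITY_TAX  # the quality tax
    return round(base, 1)
```

For the worked example above, a 6-hour traditional estimate classified as AI-Fast yields 2.2 hours; a 12-hour AI-Slow task yields 10.0 hours.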

Change 3: Architectural Guard Rails

Priya created a comprehensive architectural decision record (ADR) document and coding standards guide. This was fed to every AI tool as context. She also designated "pattern owners" -- senior developers responsible for approving any AI-generated code that introduced a new pattern or deviated from established patterns.

Change 4: Review Capacity Planning

The team explicitly allocated 20% of total sprint capacity to code review. This meant that when calculating how many story points to commit to, they reduced available capacity by 20% before applying velocity calculations. This was uncomfortable -- it looked like they were committing to less work -- but it was essential for maintaining quality.

Change 5: Split Velocity Tracking

Instead of a single velocity number, Priya tracked two velocities:

  • AI-Fast velocity: points completed for accelerated tasks
  • AI-Slow velocity: points completed for non-accelerated tasks

This allowed for more accurate sprint planning based on the expected mix of task types.
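Changes 4 and 5 combine naturally at sprint planning time: reserve review capacity first, then blend per-category throughput by the expected task mix. The sketch below illustrates one way to do this; the parameter names and throughput figures are hypothetical, not Meridian's actual numbers.

```python
REVIEW_RESERVE = 0.20  # 20% of sprint capacity held back for code review (Change 4)

def sprint_commitment(capacity_hours, fast_pts_per_hour, slow_pts_per_hour,
                      fast_share):
    """Points to commit for the next sprint.

    fast_pts_per_hour / slow_pts_per_hour: per-category throughput
    derived from the split velocity history (Change 5).
    fast_share: expected fraction of working hours going to AI-Fast
    tasks next sprint, set during planning.
    """
    effective = capacity_hours * (1 - REVIEW_RESERVE)  # reserve review time first
    fast_hours = effective * fast_share
    slow_hours = effective * (1 - fast_share)
    return round(fast_hours * fast_pts_per_hour + slow_hours * slow_pts_per_hour)
```

With illustrative inputs of 320 capacity hours (eight developers at 40 hours), a 50/50 task mix, and throughputs of 0.30 and 0.17 points per hour, the commitment comes out at about 60 points, consistent with the team's stabilized velocity.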


Months 5-6: The New Normal

With the recalibrated system in place, the team's performance stabilized:

Sprint      AI-Fast Points   AI-Slow Points   Total Points   Estimation Accuracy   Defects
Sprint 9    38               22               60             83%                   7
Sprint 10   35               25               60             88%                   5
Sprint 11   40               20               60             91%                   4
Sprint 12   37               24               61             90%                   4

Several things stand out:

  1. Velocity stabilized at ~60 points per sprint -- a 36% increase over the pre-AI average of 44, but well below the Month 1 peak of 71. This was sustainable, predictable velocity.

  2. Estimation accuracy recovered to near pre-AI levels (88-91% across Sprints 10-12) after dropping to as low as 62% during the chaotic early months.

  3. Defect rates dropped significantly from a peak of 14 per sprint in Month 2 to 4-5 per sprint, below even the pre-AI average of 6.

  4. The product owner could plan reliably again. With predictable velocity and clear acceleration categories, the roadmap became trustworthy.


The Stakeholder Conversation

At the six-month review, Priya presented the findings to the CTO and VP of Product. The conversation was nuanced.

VP of Product: "So we got a 36% velocity increase? The initial numbers suggested 60%. What happened?"

Priya: "The 60% increase was real but unsustainable. It came at the cost of quality, consistency, and estimation accuracy. The 36% increase is sustainable, predictable, and does not compromise code quality. In fact, our defect rate is now lower than it was before AI, because the quality tax process catches issues that used to slip through."

CTO: "Can we get to 60% sustainably? What would it take?"

Priya: "Potentially, over time. Our AI-Fast acceleration factor is improving as the team gets better at prompting. If we invest in tooling -- automated consistency checkers, AI-powered code review to augment human review -- we could reduce the quality tax and increase effective throughput. I would estimate we could reach a sustainable 45-50% improvement within a year."

VP of Product: "How should I plan the 2026 roadmap?"

Priya: "Use a 1.35x multiplier for overall team capacity. For quarters where the roadmap is heavy on new features with standard patterns, use 1.5x. For quarters focused on compliance work or legacy migration, use 1.15x. I will provide a feature-by-feature assessment of which acceleration category each roadmap item falls into."

The CTO approved Priya's recommendations and asked her to present the estimation framework to the other three development teams at Meridian.


The Framework Priya Developed

Priya codified her learnings into a framework that any team at Meridian could adopt:

Step 1: Establish Pre-AI Baseline (1-2 sprints) Before adopting AI tools, ensure you have reliable velocity and estimation data. You cannot measure improvement without a baseline.

Step 2: Unstructured Adoption (2-3 sprints) Let the team use AI tools freely and track detailed task-level data. Do not try to optimize yet. Collect data on actual acceleration by task type.

Step 3: Data Analysis (1 sprint) Analyze the collected data to identify your team's specific acceleration patterns. Do not assume generic industry numbers apply to your domain. Financial software, with its compliance and legacy concerns, sees very different acceleration than a greenfield web application.

Step 4: Calibrate and Implement (1 sprint) Design your team's specific estimation adjustments based on the data. Implement quality safeguards (quality tax, architectural guard rails, review capacity allocation).

Step 5: Stabilize and Track (3-4 sprints) Run the calibrated system and track accuracy. Adjust factors based on observed performance. Expect 2-3 sprints before the system stabilizes.

Step 6: Continuous Refinement (ongoing) Update acceleration factors quarterly as the team improves their AI skills and as AI tools evolve. What was an AI-Slow task today might become AI-Fast in six months as tools improve.

The entire calibration process takes 8-12 sprints (4-6 months). Teams that try to skip ahead -- applying aggressive acceleration factors without data -- will repeat the estimation chaos Priya's team experienced in Months 1-2.


Key Takeaways

  1. Initial AI velocity gains are not sustainable. The honeymoon phase produces impressive numbers that cannot be maintained without quality degradation. Plan for sustainable improvement (30-50% for most teams) rather than peak improvement (60-80%).

  2. AI acceleration is bimodal in domain-specific work. In specialized domains like fintech, tasks are either highly accelerated or barely affected. The "moderate acceleration" middle ground is thinner than generic frameworks suggest.

  3. Quality requires explicit investment. The "quality tax" concept -- allocating 30% of saved time back into quality assurance -- is essential for maintaining standards in AI-augmented projects.

  4. Estimation accuracy can recover. AI disrupts estimation in the short term, but with data-driven recalibration, teams can achieve estimation accuracy equal to or better than their pre-AI baseline.

  5. Code review is the critical constraint. In every month of the study, code review capacity was the factor most limiting sustainable velocity improvement. Investing in review efficiency (automated pre-review, pair review sessions, AI-assisted review) has the highest leverage.

  6. Domain matters enormously. The acceleration factors for a fintech team working with compliance requirements and legacy systems are very different from a startup building a greenfield web application. Every team must derive their own numbers from their own data.

  7. Stakeholder communication must be honest about the journey. Presenting the Month 1 velocity to executives creates expectations that cannot be sustained. Presenting the calibrated Month 5-6 data builds trust and enables reliable roadmap planning.


Discussion Questions

  1. Priya's team found that AI acceleration was bimodal rather than distributed across three tiers. Do you think this finding would generalize to other domains, or is it specific to financial software? What characteristics of a domain might produce a more even distribution of acceleration?

  2. The quality tax of 30% was derived empirically from Priya's team's data. How would you determine the right quality tax percentage for your team? What factors would make it higher or lower?

  3. During the diagnostic phase, Priya invested an entire sprint in data collection rather than feature delivery. How would you justify this investment to a product owner who is under pressure to deliver features?

  4. The CTO asked whether the team could reach 60% sustainable improvement. Priya estimated a year. What specific investments (tooling, training, process) would you recommend to close the gap between 36% and 60%?

  5. Priya discovered that AI-generated tests achieved high line coverage but missed important edge cases. How would you design a process to evaluate the quality of AI-generated tests, beyond simple coverage metrics?

  6. If Meridian's other three development teams work in different domains (mobile app development, data engineering, and DevOps/infrastructure), how would you expect their acceleration profiles to differ from Priya's team? What advice would you give each team based on the lessons from this case study?