Case Study 24-2: Raj's Migration Pre-Mortem — Finding 12 Failure Modes Before Launch

The Setup: A high-stakes database migration with a hard deadline, a large surface area, and the kind of complexity that makes experienced engineers nervous. One structured pre-mortem session. Twelve failure modes identified. Three serious enough to change the plan.


Context

Raj Patel is a senior software engineer at Clearfield Financial Technology, a mid-sized fintech company that provides portfolio management software to wealth management firms. His team is responsible for a major infrastructure project: migrating the core data platform from a legacy Oracle database to PostgreSQL.

The migration affects 40 tables, 200+ stored procedures, and 15 downstream applications. The primary drivers are cost reduction (Oracle licensing is expensive) and modernization (the PostgreSQL ecosystem has better tooling for the team's current workflow). The hard deadline is the end of the fiscal quarter — finance has already booked the Oracle license savings.

Raj has been running infrastructure migrations for eight years. He's seen them go wrong. The project plan is solid. The team is experienced. The testing framework is thorough. And yet, as the go-live date approaches, he has a persistent low-grade anxiety he can't quite articulate.

He decides to run a structured pre-mortem.


Setting Up the Pre-Mortem

Raj blocks two hours on a Thursday afternoon, three weeks before the scheduled launch date. He opens his AI assistant and starts with a prompt that gives the full project context:

Raj's setup prompt:

I want to run a pre-mortem on the following project:

We're migrating a financial services data platform from Oracle to PostgreSQL. 40 tables, 200+ stored procedures. 15 downstream applications read from this database. The data includes portfolio positions, transaction history, and client account data. We have a hard deadline in 3 weeks for regulatory and financial reasons. The team is experienced (3 engineers, 8-10 years each). We've been testing for 6 weeks on a 10% sample of production data. We have a rollback plan.

Imagine it is 3 weeks from now. The migration has gone badly. Financial data is affected. Some clients have incorrect portfolio calculations. Regulator notifications may be required.

Generate 10-12 distinct failure stories. Make each one specific and plausible — not generic. These are financial systems with real compliance implications. After the failure stories, identify the top 5 prevention priorities for the next 3 weeks.


The Failure Stories: Selected Output

The AI generates 12 failure scenarios. Raj reads each one with his engineer's eye, classifying them as he goes: "Plausible / Not applicable / Already mitigated."

Failure Story 1 — Data Type Conversion (Raj rates: HIGH PLAUSIBILITY)

"Testing was conducted on a 10% sample of production data. The migration launched on schedule. Within 48 hours, a data integrity issue emerged in the portfolio calculation engine — a decimal arithmetic rounding difference between Oracle and PostgreSQL affected certain portfolio rebalancing calculations. The issue appeared in less than 0.5% of portfolios but was concentrated in high-net-worth accounts with complex multi-currency positions — accounts that rarely appeared in the test sample. By the time the issue was detected, affected portfolios had been reported to clients with incorrect values."

Raj stares at this for a long moment. He knows about Oracle-to-PostgreSQL decimal handling differences. His team had tested for it. But had they tested it on the right data? He pulls up the test dataset composition. It's a random 10% sample. Multi-currency portfolios with complex positions represent about 2% of total accounts — meaning in a 10% sample, maybe 15-20 such accounts. Is that enough to catch edge cases in complex rebalancing logic?

He opens a task tracker and creates a ticket: "Expand test coverage for multi-currency portfolio calculations — targeted sampling, not random."
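The ticket's core idea, targeted rather than random sampling, can be sketched in a few lines. This is an illustrative sketch, not Raj's actual tooling: the `stratified_sample` helper, the 10,000-account population, and the 100-account floor are all invented for the example.

```python
import random

def stratified_sample(accounts, frac, min_per_stratum, key):
    """Sample `frac` of accounts overall, but guarantee at least
    `min_per_stratum` accounts from every stratum (e.g. account type)."""
    by_stratum = {}
    for acct in accounts:
        by_stratum.setdefault(key(acct), []).append(acct)
    sample = []
    for members in by_stratum.values():
        # Take whichever is larger: the proportional share or the floor.
        n = max(int(len(members) * frac), min(min_per_stratum, len(members)))
        sample.extend(random.sample(members, n))
    return sample

# Hypothetical population: 2% of 10,000 accounts are multi-currency.
accounts = [{"id": i, "type": "multi" if i < 200 else "single"}
            for i in range(10_000)]

targeted = stratified_sample(accounts, frac=0.10, min_per_stratum=100,
                             key=lambda a: a["type"])
multi_in_sample = sum(1 for a in targeted if a["type"] == "multi")

# A plain 10% random sample would contain only ~20 multi-currency accounts;
# the stratified version guarantees at least 100.
print(multi_in_sample)  # 100
```

The point of the floor parameter is exactly the gap Raj spotted: proportional sampling starves rare-but-risky strata, so the rare stratum needs an absolute minimum, not a percentage.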

Failure Story 3 — Stored Procedure Implicit Conversion (Raj rates: HIGH PLAUSIBILITY)

"Oracle's stored procedures handle certain implicit type conversions that PostgreSQL rejects or handles differently. The development team had audited the stored procedures for explicit type issues but had not systematically tested implicit conversions — behaviors Oracle performs silently that fail or produce different results in PostgreSQL. 23 of the 200+ stored procedures used date arithmetic constructs that worked correctly in Oracle but produced off-by-one errors in PostgreSQL under certain timezone conditions. The errors appeared in end-of-day processing, which runs between market close and overnight batch jobs."

Raj checks his stored procedure audit. It covers 186 of 212 stored procedures. The audit methodology checked for explicit type mismatches. He is not sure whether the implicit conversion patterns were checked systematically. He creates a second ticket: "Complete stored procedure audit — implicit type conversions, especially date/time arithmetic and timezone handling."
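The off-by-one failure mode is easy to demonstrate in miniature. The sketch below is illustrative Python, not one of the team's stored procedures; the 21:00 trade time and the fixed UTC-5 offset are invented, and a real audit would inspect the SQL itself. The mechanism is the one the story describes: the same instant falls on different calendar dates depending on the implicit timezone context.

```python
from datetime import datetime, timedelta, timezone

# An end-of-day trade timestamped at 21:00 in a UTC-5 zone (fixed offset
# used for illustration; real zones also shift with daylight saving time).
ny = timezone(timedelta(hours=-5))
trade_ts = datetime(2024, 1, 15, 21, 0, tzinfo=ny)

# Oracle evaluated the date arithmetic in the server's local zone. A
# PostgreSQL server running in UTC sees the same instant on the *next*
# calendar day, so "end of day" groups this trade into the wrong date.
local_business_date = trade_ts.date()
utc_business_date = trade_ts.astimezone(timezone.utc).date()

print(local_business_date)  # 2024-01-15
print(utc_business_date)    # 2024-01-16 -- off by one day at the boundary
```

Anything computed implicitly from "the current date" behaves the same way, which is why the ticket targets implicit conversions rather than explicit ones: the explicit ones were already audited.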

Failure Story 5 — Connection Pool Saturation Under Load (Raj rates: MEDIUM-HIGH PLAUSIBILITY)

"Testing was conducted with realistic data volumes but not realistic concurrent load. Production launch coincided with end-of-quarter reporting — the highest-traffic period of the year. PostgreSQL's default connection handling and connection pooling configuration differs from Oracle's. Under peak load, the connection pool saturated, causing timeouts in 3 of the 15 downstream applications. The applications failed ungracefully, causing data pipeline interruptions that required manual intervention to clear."

Raj's launch date is end of quarter. He had not specifically tested under end-of-quarter load. He creates a ticket: "Load test at 3x normal concurrent connections; verify connection pool configuration for PostgreSQL."
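Before running the full load test, the saturation risk can be sanity-checked on paper with Little's law (average concurrent connections equal arrival rate times average hold time). The numbers below are hypothetical, as is the pool size; they only illustrate why a pool that is comfortable at normal load can saturate at quarter end.

```python
def required_connections(queries_per_sec: float, avg_hold_sec: float) -> float:
    """Little's law: mean concurrent connections = arrival rate x hold time."""
    return queries_per_sec * avg_hold_sec

pool_size = 100  # hypothetical app-side / PgBouncer pool limit

normal_load = required_connections(queries_per_sec=300, avg_hold_sec=0.2)    # 60
quarter_end = required_connections(queries_per_sec=1200, avg_hold_sec=0.25)  # 300

print(normal_load <= pool_size)   # True  -- headroom on an average day
print(quarter_end <= pool_size)   # False -- saturates at quarter end
```

The back-of-envelope check does not replace the 3x load test in the ticket, but it tells you what concurrency level the test needs to reach to be meaningful.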

Failure Story 7 — Rollback Plan Not Actually Tested (Raj rates: MEDIUM-HIGH PLAUSIBILITY)

"The team had a documented rollback plan. When a data integrity issue was detected 6 hours post-launch, the rollback was initiated. It was discovered that the rollback procedure had been written but never executed in a realistic test environment. A step in the procedure assumed the Oracle database would still be in a consistent state at the point of rollback, but 6 hours of write activity in PostgreSQL had created a state that wasn't cleanly reversible. The rollback took 4 hours instead of the planned 45 minutes, during which both systems were unavailable."

Raj has a rollback plan. Has he run it? He checks the project documentation. The rollback procedure was last tested 8 weeks ago, before 6 weeks of ongoing development changes. He creates a ticket: "Execute full rollback drill in the week before go-live; verify under current system state, not the state from 8 weeks ago."

Failure Story 9 — Index Performance Regression (Raj rates: MEDIUM PLAUSIBILITY)

"Query performance looked fine in testing — response times stayed comfortably within the expected range on test runs. After production launch, several complex reporting queries that ran overnight began taking 3-4x longer than their Oracle equivalents. Investigation revealed that the PostgreSQL query planner made different index selection decisions than Oracle for the specific data distribution in production. The issue hadn't appeared in testing because the test dataset distribution was different from production. Overnight processing windows that ran in 4 hours now ran in 14 hours, missing the start-of-day deadline."

This one concerns Raj particularly because overnight processing windows are directly connected to regulatory reporting. He creates a ticket: "Performance test complex queries against production-representative data distribution; compare query plans between Oracle and PostgreSQL."
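One cheap way to start on that ticket is a structural diff of query plans before comparing timings: PostgreSQL's `EXPLAIN (FORMAT JSON)` returns a nested plan tree whose operator choices can be compared directly. The walker below is a sketch; the two hardcoded plans are invented stand-ins for output a real script would fetch from the test and production databases via a driver.

```python
def plan_operators(plan_node, acc=None):
    """Walk an EXPLAIN (FORMAT JSON) plan tree, collecting node types in order."""
    if acc is None:
        acc = []
    acc.append(plan_node["Node Type"])
    for child in plan_node.get("Plans", []):
        plan_operators(child, acc)
    return acc

# Shapes match PostgreSQL's EXPLAIN (FORMAT JSON) output ("Node Type",
# nested "Plans"); the operator choices here are made up for illustration.
test_env_plan = {"Node Type": "Hash Join",
                 "Plans": [{"Node Type": "Seq Scan"},
                           {"Node Type": "Index Scan"}]}
prod_like_plan = {"Node Type": "Nested Loop",
                  "Plans": [{"Node Type": "Seq Scan"},
                            {"Node Type": "Seq Scan"}]}

# Flag queries whose plans diverge between environments -- a smoke test
# that catches planner differences before any timing comparison.
print(plan_operators(test_env_plan) != plan_operators(prod_like_plan))  # True
```

A divergent operator list does not prove a regression, but it is exactly the symptom in the failure story: same query, different index selection, because the data distribution differs.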


The Full Tally

After working through all 12 failure stories, Raj's classification:

- Plausible, not mitigated (action required): 6 items
- Plausible, partially mitigated (enhancement needed): 4 items
- Not applicable or already fully addressed: 2 items

He has 10 new action items. He categorizes them by urgency:

Critical (must address before launch):

1. Expand multi-currency portfolio calculation test coverage
2. Complete implicit type conversion audit in stored procedures
3. Execute full rollback drill under current system state

Important (should address before launch):

4. Load test at peak concurrent connection levels
5. Performance test complex overnight queries with production-representative data distribution
6. Verify timezone handling in date arithmetic stored procedures

Should address, schedule after launch:

7-10. Four monitoring and alerting enhancements


Taking the Results to the Team

Raj doesn't just file the tickets. He calls a 30-minute team meeting.

He shares the pre-mortem exercise with context: "I ran a failure analysis on our migration. Here are the scenarios I think are most plausible. I want to validate my thinking with you before we commit to the additional work."

Two things happen in the meeting:

First, the team lead on the stored procedure audit confirms that the audit methodology did not specifically check implicit conversions — it was focused on explicit type mismatches. She estimates a complete check would take 3 days. They add it to the schedule.

Second, another engineer raises a failure mode that wasn't in Raj's AI-generated list at all: "What happens if a downstream application that we don't control has connection caching? The connection strings will be different after the migration. If an application is caching the old Oracle connection parameters and we can't force a restart during the cutover window, we might have an application that's routing to Oracle but everyone thinks it's on PostgreSQL."

Raj adds this to the checklist immediately. It would not have appeared in his AI output because it requires specific knowledge of one of their downstream applications — knowledge only his team has.


What the Pre-Mortem Found: The Three Critical Issues

Of the 10 action items, three proved to be genuinely critical — issues that, if they had reached production, would have caused incidents requiring client notification and possibly regulatory reporting.

Issue 1: Multi-currency decimal arithmetic

The expanded test coverage found four stored procedures where PostgreSQL's decimal arithmetic produced different rounding behavior than Oracle for multi-currency rebalancing calculations. The differences were small in absolute terms but would have caused portfolio value discrepancies for approximately 340 client accounts. The fix was a targeted stored procedure rewrite with explicit rounding specifications. Time to fix: 2 days.
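The class of bug is easy to reproduce in miniature with Python's `decimal` module. This is illustrative only; actual Oracle and PostgreSQL rounding depends on the column types and versions involved. It shows how two rounding modes disagree at a tie, and why the fix of pinning the rounding mode explicitly removes the ambiguity.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

# A hypothetical FX-converted per-unit value that lands exactly on a tie.
amount = Decimal("2.665")

half_up = amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
half_even = amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)

print(half_up)    # 2.67
print(half_even)  # 2.66 -- a one-cent difference per position

# Across thousands of positions, per-row differences like this accumulate
# into the portfolio-value discrepancies the expanded tests caught.
```

The rewritten stored procedures did the SQL equivalent: specify the rounding behavior explicitly instead of inheriting whatever the engine's default happens to be.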

Issue 2: End-of-day timezone handling

The implicit conversion audit found 7 stored procedures with timezone-dependent date arithmetic. In Oracle, these procedures had always run with the database server's local timezone as the implicit context. PostgreSQL required explicit timezone specification. In testing on a non-production server in the same timezone, the issue didn't appear. In production on a server in UTC, the end-of-day calculations would have run with incorrect date boundaries during daylight saving time transitions. Time to fix: 1.5 days.

Issue 3: Rollback drill failure

The rollback drill execution revealed that the procedure had a dependency on a specific Oracle system table that no longer existed in the version of Oracle running in the environment (the version had been updated since the procedure was written). The rollback would have worked in testing but would have failed in production. A revised rollback procedure was written and tested. Time to fix: 3 hours.


The Decision: Delay Two Weeks

The three critical fixes themselves required roughly 4.5 days of work, but validating them, re-testing the affected stored procedures, and verifying the rollback procedure added another 7 days. The total delay from pre-mortem to ready-to-launch came to roughly 11 working days, approximately two weeks.

Raj takes this to his manager and the project sponsor. He brings the pre-mortem documentation, the issue descriptions, and the estimated risk of proceeding without the fixes.

The conversation is direct: "We found three issues that would have caused client-visible data errors. Fixing them requires two weeks. The alternative is launching on schedule with known data integrity risks in a system that handles client financial data."

The decision to delay is made in under 10 minutes. The fiscal quarter deadline is pushed to the following quarter. Finance has to reverse a portion of the license savings booking. Nobody is happy about the delay.

But two weeks after the revised launch date, the migration completes without incident. No client notifications. No data integrity issues. No regulatory concerns. Zero production incidents in the first 30 days.


Reflection: What the Pre-Mortem Contributed

Raj documents his reflection for the team's knowledge base:

"The pre-mortem didn't find every risk. One of the most important items — the connection caching issue — came from a team member, not the AI. The pre-mortem created a structure for systematic thinking that made it easier for people to surface things they might otherwise have kept to themselves ('probably not worth raising this').

The AI's output was most valuable for the risks it found that I should have thought of but didn't: the test data distribution issue and the rollback drift issue. These were both failures in my own process — the test methodology and the rollback procedure maintenance — that I was too close to see.

What was interesting was how the fictional failure narrative made risks feel concrete in a way that a risk register list didn't. Reading a story about clients receiving incorrect portfolio values is different from reading 'Risk: data integrity issues.' The narrative format made the team take the risks seriously in a way that abstract risk registers often don't.

The two-week delay was painful. The alternative was worse."


Key Takeaways from This Case

  1. Pre-mortem narratives make risks concrete. The story format produces a different quality of engagement than risk registers. Reading a plausible failure scenario activates practical thinking rather than abstract risk assessment.

  2. AI pre-mortems surface general risks; team discussion surfaces specific ones. The most important item in Raj's final checklist came from a team member who had specific knowledge AI couldn't have. The pre-mortem created space for that contribution.

  3. Test methodology is often the risk. Several of Raj's critical issues weren't in the data or the code — they were in the way the testing was designed. AI-generated pre-mortems reliably identify testing scope issues because test coverage gaps are common failure patterns.

  4. The cost of a delay is usually lower than the cost of a production incident. For systems with data integrity or regulatory implications, the math almost always favors delay. The pre-mortem makes that case with evidence.

  5. Document the pre-mortem results. Having a written record of what was found and why decisions were made is valuable for post-launch review and for building organizational knowledge about migration risks.