Case Study 1: Zero-Downtime Schema Migration at a Payment Processor

"We Process $2.3 Billion a Day — Downtime Is Not an Option"


The Company

ClearPath Payments, headquartered in Atlanta, processes credit and debit card transactions for 45,000 merchants across the southeastern United States. Founded in 2004, ClearPath grew from a regional payment gateway into a full-service payment processor handling 28 million transactions per day, with a peak rate of 1,200 transactions per second during the holiday shopping season.

Their core transaction processing system runs on DB2 12 for z/OS in a four-member data sharing group. The primary database contains 380 tables, with the TRANSACTION table holding 4.2 billion rows across 48 range partitions (one per month, rolling four years). The system runs on two IBM z15 LPARs, each hosting two data sharing members.

ClearPath's contractual SLA with their merchant partners guarantees 99.995% uptime — no more than 26 minutes of downtime per year. Their actual track record is better: they had not experienced unplanned downtime in 31 months. Planned maintenance windows were limited to the first Sunday of each month, 2:00 AM to 5:00 AM Eastern, and even those windows saw reduced but continuous transaction volume from gas station pay-at-the-pump and 24-hour retail.

The Problem

In March 2024, the Payment Card Industry Data Security Standard (PCI DSS) version 4.0 mandated new data fields for enhanced transaction monitoring. ClearPath had until March 2025 to comply. The required changes were:

  1. New columns on TRANSACTION: DEVICE_FINGERPRINT (VARCHAR(256)), TRANSACTION_RISK_SCORE (DECIMAL(5,2)), AUTH_METHOD (CHAR(3)), GEOLOCATION_HASH (VARCHAR(64)).
  2. New table: TRANSACTION_RISK_DETAIL — stores detailed risk assessment data for flagged transactions.
  3. Column widening: MERCHANT_NAME from VARCHAR(40) to VARCHAR(100) — the new standard required longer merchant identifiers.
  4. New column on MERCHANT: MERCHANT_CATEGORY_CODE_V2 (VARCHAR(10)) — a replacement for the old 4-digit MCC with a new expanded code set.
  5. Type change on TRANSACTION_DETAIL: DETAIL_ID from INTEGER to BIGINT — the table was at 1.9 billion rows and the identity sequence was at 1,870,000,000.

The challenge was clear: these changes had to be applied to a system that processes $2.3 billion per day, with effectively zero tolerance for downtime.

The Team

ClearPath assembled a cross-functional migration team:

  • Raj Patel, Lead DBA — 18 years of z/OS DB2 experience, had performed four previous major version upgrades.
  • Diana Chen, Application Architect — owned the transaction processing middleware.
  • Marcus Webb, Performance Engineer — responsible for capacity planning and performance SLAs.
  • Sarah Knowles, Change Manager — ensured every change followed ClearPath's ITIL-based change management process.

Raj would later say: "We spent more time planning than executing. The planning took eleven weeks. The execution took three weekends. That ratio is about right for a change of this magnitude."

The Plan

The team divided the changes into three tiers based on risk:

Tier 1 (Zero Risk — Instant Operations)
  • Adding four nullable columns to TRANSACTION
  • Adding one nullable column to MERCHANT
  • Creating the new TRANSACTION_RISK_DETAIL table
  • Widening MERCHANT_NAME from VARCHAR(40) to VARCHAR(100)

Tier 2 (Low Risk — REORG Required)
  • Advisory REORG cleanup after the Tier 1 changes

Tier 3 (Medium Risk — Downtime Possible)
  • INTEGER to BIGINT conversion on TRANSACTION_DETAIL.DETAIL_ID

Tier 1 Execution: The Tuesday Morning That Nothing Happened

Tier 1 was scheduled for a Tuesday at 10:00 AM — deliberately during business hours. Raj wanted to demonstrate confidence in the safety of these operations, and scheduling during business hours ensured the full operations team was available.

The script contained 7 ALTER TABLE statements and 1 CREATE TABLE. Raj had tested the identical script in DEV, QA, and pre-production environments. Each execution completed in under 2 seconds.
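The production script itself is not reproduced in the case study, but from the change list it would have looked roughly like the following sketch. The CLEARPATH qualifier appears elsewhere in the document; the TRANSACTION_RISK_DETAIL column layout and its tablespace are assumptions for illustration.

-- Tier 1 sketch: metadata-only changes (column list from the
-- requirements; TRANSACTION_RISK_DETAIL layout is assumed)
ALTER TABLE CLEARPATH.TRANSACTION
    ADD COLUMN DEVICE_FINGERPRINT VARCHAR(256);
ALTER TABLE CLEARPATH.TRANSACTION
    ADD COLUMN TRANSACTION_RISK_SCORE DECIMAL(5,2);
ALTER TABLE CLEARPATH.TRANSACTION
    ADD COLUMN AUTH_METHOD CHAR(3);
ALTER TABLE CLEARPATH.TRANSACTION
    ADD COLUMN GEOLOCATION_HASH VARCHAR(64);

ALTER TABLE CLEARPATH.MERCHANT
    ADD COLUMN MERCHANT_CATEGORY_CODE_V2 VARCHAR(10);
ALTER TABLE CLEARPATH.MERCHANT
    ALTER COLUMN MERCHANT_NAME SET DATA TYPE VARCHAR(100);

CREATE TABLE CLEARPATH.TRANSACTION_RISK_DETAIL
    (TRANSACTION_ID     BIGINT        NOT NULL,
     RISK_FACTOR_CODE   CHAR(4)       NOT NULL,
     RISK_FACTOR_DETAIL VARCHAR(512),
     ASSESSED_AT        TIMESTAMP     NOT NULL)
    IN CLEARDB.RISKTS;

All of these are immediate catalog operations: nullable column additions, a VARCHAR widening, and a CREATE touch no data rows, which is why each run completed in seconds.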

At 10:00 AM, with Diana monitoring the application dashboards and Marcus watching the z/OS performance panels, Raj executed the script. Total elapsed time: 1.4 seconds. Transaction throughput did not waver. No application errors. The four new TRANSACTION columns entered ADVISORY REORG status — exactly as expected.

"That was anticlimactic," Diana said.

"That's the goal," Raj replied.

Tier 1 Aftermath: The Backfill Challenge

The new columns existed, but 4.2 billion existing rows had NULL values. The application team needed the TRANSACTION_RISK_SCORE column populated with a default value of 0.00 for historical records so that their new analytics queries would not have to handle NULLs.

Raj wrote a partition-by-partition backfill script:

-- Partition-by-partition backfill.
-- DB2 for z/OS has no PARTITION clause on UPDATE, so each partition
-- is targeted through its limit-key range. The partitioning column
-- name and the date values here are illustrative; run once per
-- partition, substituting that partition's actual limit keys.
UPDATE CLEARPATH.TRANSACTION
SET TRANSACTION_RISK_SCORE = 0.00
WHERE TRANSACTION_RISK_SCORE IS NULL
  AND TRAN_DATE >= '2021-01-01'    -- this partition's low limit key
  AND TRAN_DATE <  '2021-02-01';   -- next partition's limit key
COMMIT;

Each partition contained approximately 87 million rows. The UPDATE took 12-15 minutes per partition. Raj ran 4 partitions in parallel (one per data sharing member), completing the backfill in approximately 3 hours. The system continued processing live transactions throughout — the UPDATE used row-level locking and committed per partition, so lock contention was minimal.

The key insight: by running the backfill as a separate step from the ALTER, and by processing partition by partition, the operation was fully controllable. If any partition's UPDATE caused unexpected contention, Raj could pause it and reschedule.
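Because the backfill was pausable, progress had to be observable. A residual-NULL count per partition shows how much work remains; the sketch below assumes the same illustrative partitioning column, and uses WITH UR (uncommitted read) so the monitoring query itself takes no locks against the live workload.

-- Rows in one partition still awaiting the backfill
-- (partitioning column and limit keys assumed for illustration)
SELECT COUNT(*)
FROM CLEARPATH.TRANSACTION
WHERE TRANSACTION_RISK_SCORE IS NULL
  AND TRAN_DATE >= '2021-01-01'
  AND TRAN_DATE <  '2021-02-01'
WITH UR;

When the count reaches zero, that partition is done and the next one can be scheduled.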

Tier 2 Execution: The Saturday Night REORG

Two weeks after Tier 1, Raj scheduled the advisory REORG cleanup during the monthly maintenance window. The TRANSACTION tablespace had 48 partitions in ADVISORY REORG. Raj used SHRLEVEL(CHANGE) REORG to allow continued read/write access:

REORG TABLESPACE CLEARDB.TRANSTS PART 1:12
    SHRLEVEL CHANGE
    MAPPINGTABLE CLEARPATH.TRAN_MAPPING
    LOG NO
    DRAIN_WAIT 120
    RETRY 5
    RETRY_DELAY 60;
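The advisory state that drove this cleanup can be checked from the console before and after each batch. The sketch below uses the database and tablespace names from the REORG statement; partitions still carrying pending work show an AREO* status in the output, and the flag clears as each batch is reorganized.

-DISPLAY DATABASE(CLEARDB) SPACENAM(TRANSTS) ADVISORY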

He reorganized 12 partitions at a time across the 3-hour window. The first batch (partitions 1-12, the oldest data with lowest write activity) completed in 47 minutes. The second batch (partitions 13-24) took 52 minutes. The third batch (partitions 25-36, more recent and more actively written) took 68 minutes. The fourth batch (partitions 37-48, the most active) was deferred to the next maintenance window because Raj did not want to risk the drain phase conflicting with the end of the maintenance window.

During the REORG, transaction processing continued at normal volumes. Marcus reported a 3-4% increase in I/O wait times during the REORG — within acceptable limits.

Tier 3 Execution: The BIGINT Conversion

The INTEGER-to-BIGINT conversion on TRANSACTION_DETAIL was the most delicate operation. The table had 1.9 billion rows across 24 partitions. The ALTER would place the tablespace in REORG PENDING, making it inaccessible.

Raj's plan:

  1. Take a full image copy of the tablespace (backup).
  2. ALTER COLUMN DETAIL_ID SET DATA TYPE BIGINT.
  3. REORG TABLESPACE with SHRLEVEL NONE (offline — SHRLEVEL CHANGE was not available because the row format had changed).
  4. REBUILD INDEX.
  5. RUNSTATS.
  6. Verify.
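Expressed as statements and utility control cards, the sequence above would look roughly like this sketch. The tablespace name TRANDTLTS is an assumption; the case study names only the table.

-- 1. Full image copy before the change (SYSCOPY ddname in the JCL)
COPY TABLESPACE CLEARDB.TRANDTLTS
    FULL YES SHRLEVEL REFERENCE

-- 2. The type change itself (a fast catalog operation)
ALTER TABLE CLEARPATH.TRANSACTION_DETAIL
    ALTER COLUMN DETAIL_ID SET DATA TYPE BIGINT;

-- 3. Offline REORG to materialize the new row format
REORG TABLESPACE CLEARDB.TRANDTLTS
    SHRLEVEL NONE

-- 4. Rebuild all indexes on the tablespace
REBUILD INDEX(ALL) TABLESPACE CLEARDB.TRANDTLTS

-- 5. Refresh statistics for the optimizer
RUNSTATS TABLESPACE CLEARDB.TRANDTLTS TABLE(ALL) INDEX(ALL)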

The critical question was: how long would the offline REORG take? Raj tested in pre-production with a tablespace cloned from production. The REORG took 2 hours and 14 minutes for 1.9 billion rows.

ClearPath negotiated a special 4-hour maintenance window with their merchant partners — the first extended window in two years. Raj added 90 minutes of buffer to the 2:14 estimate.

On the night of the migration:

  • 1:00 AM: Full image copy completed (started at 11:00 PM).
  • 1:15 AM: ALTER TABLE executed (1 second).
  • 1:17 AM: REORG started.
  • 3:28 AM: REORG completed (2 hours 11 minutes — within estimate).
  • 3:35 AM: REBUILD INDEX completed.
  • 3:42 AM: RUNSTATS completed.
  • 3:45 AM: Verification queries confirmed row counts, max DETAIL_ID, and constraint validity.
  • 3:50 AM: Application team confirmed transactions processing normally.

Total downtime for TRANSACTION_DETAIL: 2 hours 35 minutes, well within the 4-hour window.
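The verification step can be a handful of queries. This sketch checks the three things the timeline mentions: the row count, the maximum key value, and (via the catalog) that the column really carries its new type.

-- Row count unchanged by the conversion
SELECT COUNT(*) FROM CLEARPATH.TRANSACTION_DETAIL WITH UR;

-- Maximum key: around 1.87 billion, now with BIGINT headroom
SELECT MAX(DETAIL_ID) FROM CLEARPATH.TRANSACTION_DETAIL WITH UR;

-- Catalog confirms the new data type
SELECT COLTYPE, LENGTH
FROM SYSIBM.SYSCOLUMNS
WHERE TBCREATOR = 'CLEARPATH'
  AND TBNAME    = 'TRANSACTION_DETAIL'
  AND NAME      = 'DETAIL_ID';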

What Went Wrong (And What Went Right)

The drain phase scare: During the Tier 2 REORG of partitions 25-36, the drain phase for partition 31 timed out on the first attempt. A long-running batch job had an open cursor against the partition. The RETRY parameter kicked in, waiting 60 seconds, then retrying the drain. On the second attempt, the batch job's cursor had closed naturally, and the drain succeeded. Without the RETRY configuration, the REORG would have failed, requiring manual intervention.

Raj's post-mortem note: "The RETRY parameters are not optional. They are your safety net for drain-phase contention. I now set RETRY to at least 5 on every SHRLEVEL CHANGE REORG."

The backfill performance surprise: The initial backfill test in QA processed 500,000 rows per minute. In production, it processed only 320,000 rows per minute because production had more concurrent activity competing for buffer pool and log resources. Raj had to adjust the parallelism from 6 partitions to 4 partitions to stay within acceptable resource consumption.

Lesson: "Always discount your QA performance numbers by at least 30% when planning production execution. QA does not have production's workload."

The success that mattered most: Zero unplanned downtime. Zero data loss. Zero application errors. The merchants never knew anything happened.

The Numbers

Metric                                     Value
Total tables modified                      3
Total columns added                        6
Total columns widened                      1
Total type changes                         1
New tables created                         1
Rows backfilled                            4.2 billion
Planning duration                          11 weeks
Execution duration                         3 weekends
Unplanned downtime                         0 minutes
Application errors from migration          0
Transactions processed during migration    ~84 million per weekend

Lessons Learned

  1. Tier your changes by risk. Not all schema changes are equal. Separating instant operations (Tier 1) from REORG-dependent operations (Tier 2) from downtime-required operations (Tier 3) allowed independent scheduling and validation.

  2. Test in a production-scale environment. The backfill rate difference between QA (500K rows/min) and production (320K rows/min) would have caused a timing overrun if the team had relied on QA numbers alone.

  3. RETRY parameters are mandatory for online REORG. The partition 31 drain timeout would have been a manual emergency without automatic retries.

  4. Backfill partition by partition. Processing 87 million rows per COMMIT (per partition) kept lock contention manageable. Attempting to UPDATE 4.2 billion rows in a single statement would have been catastrophic.

  5. Schedule risky changes in dedicated windows, not combined with other work. The BIGINT conversion got its own 4-hour window with the entire team focused on one operation. No distractions, no competing changes.

  6. Communicate early and specifically. ClearPath notified merchant partners about the extended maintenance window six weeks in advance, with a specific explanation of what was changing and why (PCI DSS compliance). Transparency built trust.


This case study illustrates Sections 16.2 (ALTER TABLE safe changes), 16.3 (pending states and REORG), and 16.12 (Meridian Bank schema evolution methodology).