Case Study 2: Transaction Design -- When COMMIT Frequency Matters


Background

Priya Sharma is the lead DBA at Meridian National Bank's distributed systems team, responsible for the Db2 for LUW environment that powers the digital banking platform. The platform handles online account management, mobile deposits, bill payments, and real-time balance inquiries for 1.2 million customers.

On a Tuesday afternoon, the operations team reports a cascading failure: the bill payment service is timing out, the mobile app shows stale balances, and the customer service dashboard is unresponsive. Priya is paged at 2:43 PM.


The Incident

Initial Investigation

Priya connects to the production database and immediately checks the lock situation:

-- Check for lock waits (MON_LOCKWAITS administrative view;
-- MON_GET_LOCKS and MON_GET_APPL_LOCKWAIT are table functions,
-- not SYSIBMADM views)
SELECT
    HLD_APPLICATION_HANDLE AS blocker_agent,
    TABNAME AS locked_table,
    LOCK_MODE,
    REQ_APPLICATION_HANDLE AS waiting_agent,
    LOCK_WAIT_ELAPSED_TIME
FROM SYSIBMADM.MON_LOCKWAITS
ORDER BY LOCK_WAIT_ELAPSED_TIME DESC;

The results are alarming: 847 agents are waiting for locks on the ACCOUNT table. All are blocked by a single agent -- agent 15823.

-- What is agent 15823 doing? (SNAPAPPL holds the metrics,
-- SNAPAPPL_INFO holds the application name and status)
SELECT a.AGENT_ID, i.APPL_NAME, i.APPL_STATUS, a.UOW_START_TIME,
       a.LOCKS_HELD, a.UOW_LOG_SPACE_USED
FROM SYSIBMADM.SNAPAPPL AS a
JOIN SYSIBMADM.SNAPAPPL_INFO AS i
    ON a.AGENT_ID = i.AGENT_ID
WHERE a.AGENT_ID = 15823;

The result:

  - APPL_NAME: INTEREST_CALC
  - APPL_STATUS: UOW Executing
  - UOW_START_TIME: 2025-11-18 14:01:03 (42 minutes ago)
  - LOCKS_HELD: 487,231
  - UOW_LOG_SPACE_USED: 2,915,860,480 bytes (roughly 2.8 GB)

Agent 15823 is the monthly interest calculation batch job. It has been running for 42 minutes, has accumulated nearly half a million locks, and has consumed 2.8 GB of transaction log -- all in a single uncommitted transaction.

Root Cause

Priya pulls up the interest calculation code. It was written six months ago by a developer who is no longer with the team:

CREATE PROCEDURE meridian.calculate_monthly_interest()
LANGUAGE SQL
BEGIN
    DECLARE v_account_id BIGINT;
    DECLARE v_balance DECIMAL(15,2);
    DECLARE v_rate DECIMAL(7,5);
    DECLARE v_interest DECIMAL(15,2);
    DECLARE v_done INTEGER DEFAULT 0;
    DECLARE interest_cursor CURSOR FOR
        SELECT account_id, current_balance, interest_rate
        FROM meridian.account
        WHERE account_type = 'SAVINGS'
          AND status = 'ACTIVE'
          AND current_balance > 0
        FOR UPDATE;

    DECLARE CONTINUE HANDLER FOR NOT FOUND SET v_done = 1;

    OPEN interest_cursor;

    interest_loop: LOOP
        FETCH interest_cursor INTO v_account_id, v_balance, v_rate;
        IF v_done = 1 THEN LEAVE interest_loop; END IF;

        SET v_interest = ROUND(v_balance * (v_rate / 12), 2);

        -- Update the account balance
        UPDATE meridian.account
        SET current_balance = current_balance + v_interest,
            last_interest_date = CURRENT DATE
        WHERE CURRENT OF interest_cursor;

        -- Record the interest transaction
        INSERT INTO meridian.transaction_history
            (transaction_id, account_id, transaction_type, amount,
             transaction_date, description)
        VALUES
            (NEXT VALUE FOR meridian.transaction_seq,
             v_account_id, 'INTEREST', v_interest,
             CURRENT DATE, 'Monthly interest credit');

    END LOOP;

    CLOSE interest_cursor;

    -- Single COMMIT at the very end
    COMMIT;
END

The fatal flaw: the procedure processes all 487,000 savings accounts in a single unit of work with no intermediate commits, so it holds a row lock on every account it touches until the final COMMIT. After roughly 350,000 accounts, Db2 escalated the accumulated row locks to an exclusive table lock on ACCOUNT to relieve lock-list memory pressure. That table lock blocked every other application from accessing account data -- including the online banking platform, mobile app, and customer service tools.
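Whether escalation fires is governed by the LOCKLIST and MAXLOCKS database configuration parameters: the lock-list memory (in 4 KB pages) and the percentage of it a single application may consume before its locks are escalated. A quick way to inspect both, sketched against the standard SYSIBMADM.DBCFG administrative view:

```sql
-- Inspect the escalation-related configuration parameters.
-- LOCKLIST is in 4 KB pages; MAXLOCKS is the percentage of the
-- lock list one application may hold before escalation.
SELECT NAME, VALUE
FROM SYSIBMADM.DBCFG
WHERE NAME IN ('locklist', 'maxlocks');
```

If LOCKLIST is set to AUTOMATIC, the self-tuning memory manager grows it under pressure, but a single transaction holding half a million locks can still outrun it.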

Immediate Resolution

Priya forces the application:

-- Force terminate the runaway application
CALL ADMIN_CMD('FORCE APPLICATION (15823)');

DB2 begins rolling back agent 15823's transaction. Because 42 minutes of work must be undone, the rollback takes 28 minutes. During this time, the ACCOUNT table remains locked. Total outage duration: 70 minutes.


The Fix: Redesigning the Transaction Strategy

Priya redesigns the interest calculation procedure with proper transaction control:

CREATE OR REPLACE PROCEDURE meridian.calculate_monthly_interest(
    IN p_batch_size INTEGER DEFAULT 5000,
    IN p_calc_date DATE DEFAULT CURRENT DATE
)
LANGUAGE SQL
BEGIN
    DECLARE v_account_id BIGINT;
    DECLARE v_balance DECIMAL(15,2);
    DECLARE v_rate DECIMAL(7,5);
    DECLARE v_interest DECIMAL(15,2);
    DECLARE v_batch_count INTEGER DEFAULT 0;
    DECLARE v_total_count INTEGER DEFAULT 0;
    DECLARE v_total_interest DECIMAL(15,2) DEFAULT 0;
    DECLARE v_last_processed BIGINT DEFAULT 0;
    DECLARE v_done INTEGER DEFAULT 0;

    -- The cursor is declared here because SQL PL requires all
    -- declarations before executable statements (variables, then
    -- cursors, then handlers). WITH HOLD keeps it open across the
    -- intermediate COMMITs below; its WHERE clause is evaluated
    -- at OPEN time, after v_last_processed has been set from the
    -- checkpoint.
    DECLARE interest_cursor CURSOR WITH HOLD FOR
        SELECT account_id, current_balance, interest_rate
        FROM meridian.account
        WHERE account_type = 'SAVINGS'
          AND status = 'ACTIVE'
          AND current_balance > 0
          AND account_id > v_last_processed
        ORDER BY account_id;

    DECLARE CONTINUE HANDLER FOR NOT FOUND SET v_done = 1;

    -- Check for restart: find the last processed account
    SELECT COALESCE(MAX(last_account_id), 0)
    INTO v_last_processed
    FROM meridian.batch_checkpoint
    WHERE batch_name = 'MONTHLY_INTEREST'
      AND batch_date = p_calc_date
      AND status = 'IN_PROGRESS';

    -- If no checkpoint exists, create one
    IF v_last_processed = 0 THEN
        INSERT INTO meridian.batch_checkpoint
            (batch_name, batch_date, status, last_account_id,
             rows_processed, start_time)
        VALUES
            ('MONTHLY_INTEREST', p_calc_date, 'IN_PROGRESS', 0,
             0, CURRENT TIMESTAMP);
        COMMIT;
    END IF;

    -- Process accounts in order, resuming from the checkpoint
    OPEN interest_cursor;

    interest_loop: LOOP
        FETCH interest_cursor INTO v_account_id, v_balance, v_rate;
        IF v_done = 1 THEN LEAVE interest_loop; END IF;

        SET v_interest = ROUND(v_balance * (v_rate / 12), 2);

        -- Update the account balance
        UPDATE meridian.account
        SET current_balance = current_balance + v_interest,
            last_interest_date = p_calc_date
        WHERE account_id = v_account_id;

        -- Record the interest transaction
        INSERT INTO meridian.transaction_history
            (transaction_id, account_id, transaction_type, amount,
             transaction_date, description)
        VALUES
            (NEXT VALUE FOR meridian.transaction_seq,
             v_account_id, 'INTEREST', v_interest,
             p_calc_date, 'Monthly interest credit');

        SET v_batch_count = v_batch_count + 1;
        SET v_total_count = v_total_count + 1;
        SET v_total_interest = v_total_interest + v_interest;

        -- Commit every p_batch_size rows
        IF v_batch_count >= p_batch_size THEN
            -- Update checkpoint
            UPDATE meridian.batch_checkpoint
            SET last_account_id = v_account_id,
                rows_processed = v_total_count,
                last_commit_time = CURRENT TIMESTAMP
            WHERE batch_name = 'MONTHLY_INTEREST'
              AND batch_date = p_calc_date;

            COMMIT;

            -- A cursor declared WITH HOLD stays open across the
            -- COMMIT, so the loop simply continues.
            SET v_batch_count = 0;
        END IF;

    END LOOP;

    -- Final commit and checkpoint update
    UPDATE meridian.batch_checkpoint
    SET status = 'COMPLETE',
        -- COALESCE guards the case where the cursor returned no
        -- rows and v_account_id was never assigned
        last_account_id = COALESCE(v_account_id, v_last_processed),
        rows_processed = v_total_count,
        total_interest = v_total_interest,
        end_time = CURRENT TIMESTAMP
    WHERE batch_name = 'MONTHLY_INTEREST'
      AND batch_date = p_calc_date;

    CLOSE interest_cursor;

    COMMIT;
END

Key Design Changes

1. Periodic commits (every 5,000 rows). Instead of one massive transaction, the procedure commits in batches. Each commit releases all locks, freeing the ACCOUNT table for other applications.

2. Checkpoint-based restart. The batch_checkpoint table records the last successfully processed account_id. If the job fails, it can be restarted from the last checkpoint instead of starting over.

3. Ordered processing. Accounts are processed in account_id order and the cursor starts from v_last_processed. This enables deterministic restart behavior.

4. No FOR UPDATE cursor. The original code fetched every row through a FOR UPDATE cursor, acquiring an update lock on each row it touched. The new code uses a read-only WITH HOLD cursor (so it survives the intermediate commits) and updates by primary key, so exclusive locks accumulate only within the current batch and are released at each commit.

5. Configurable batch size. The batch size is a parameter, allowing operations to tune it based on observed lock and log behavior.
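The checkpoint table itself is not shown in the procedure. A minimal definition consistent with the columns the procedure references might look like this (column names are taken from the procedure; the types and key are assumptions):

```sql
-- Sketch of the checkpoint table implied by the procedure.
-- Types and the primary key are assumptions, not the original DDL.
CREATE TABLE meridian.batch_checkpoint (
    batch_name        VARCHAR(64)   NOT NULL,
    batch_date        DATE          NOT NULL,
    status            VARCHAR(12)   NOT NULL,  -- IN_PROGRESS / COMPLETE
    last_account_id   BIGINT        NOT NULL DEFAULT 0,
    rows_processed    INTEGER       NOT NULL DEFAULT 0,
    total_interest    DECIMAL(15,2),
    start_time        TIMESTAMP,
    last_commit_time  TIMESTAMP,
    end_time          TIMESTAMP,
    PRIMARY KEY (batch_name, batch_date)
);
```

Making (batch_name, batch_date) the primary key has a useful side effect: once a run has completed for a given date, an accidental second run fails on the INSERT rather than crediting interest twice.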


Analyzing the Tradeoffs

Lock Behavior Comparison

| Metric                    | Original (no commits)                 | Redesigned (5,000-row batches)        |
|---------------------------|---------------------------------------|---------------------------------------|
| Max concurrent locks      | 487,000+ (all accounts)               | ~5,000 (one batch)                    |
| Lock escalation risk      | Guaranteed (escalated to table lock)  | Very low (well below threshold)       |
| Concurrent access blocked | All ACCOUNT table access for ~70 min  | Brief lock waits during batch commits |
| Other application impact  | Complete outage                       | Barely noticeable                     |
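Escalations are also visible in the database-level monitor counters, so the comparison can be verified after a run. A sketch using the MON_GET_DATABASE table function (counters are cumulative since database activation):

```sql
-- Check whether any lock escalations have occurred since the
-- database was activated; -2 means all members.
SELECT MEMBER, LOCK_ESCALS, LOCK_TIMEOUTS, DEADLOCKS
FROM TABLE(MON_GET_DATABASE(-2));
```

A nonzero LOCK_ESCALS after the batch window is a signal that the batch size or LOCKLIST needs revisiting.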

Log Volume Comparison

| Metric              | Original                 | Redesigned                              |
|---------------------|--------------------------|-----------------------------------------|
| Active log consumed | 2.8 GB (single UOW)      | ~28 MB per batch (released each commit) |
| Log-full risk       | High                     | Negligible                              |
| Rollback on failure | 42 minutes (entire UOW)  | < 5 seconds (current batch only)        |
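Active-log pressure of the kind the original 2.8 GB unit of work created can be watched directly. A sketch using the SYSIBMADM.LOG_UTILIZATION administrative view:

```sql
-- Current active-log consumption for the database.
SELECT DB_NAME,
       LOG_UTILIZATION_PERCENT,
       TOTAL_LOG_USED_KB,
       TOTAL_LOG_AVAILABLE_KB
FROM SYSIBMADM.LOG_UTILIZATION;
```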

Throughput Comparison

| Metric                      | Original                             | Redesigned                                       |
|-----------------------------|--------------------------------------|--------------------------------------------------|
| Total elapsed time          | ~42 min (before failure)             | ~38 min (complete)                               |
| Recovery time on failure    | 28 min rollback + restart from zero  | Restart from last checkpoint                     |
| Total time with one failure | 42 + 28 + 42 = 112 min               | 38 min + ~5 sec rollback + restart overhead      |

The redesigned procedure is actually slightly faster than the original: releasing locks and log space at each commit keeps the lock list small, avoids escalation processing entirely, and spares the engine the overhead of tracking hundreds of thousands of lock entries inside a single unit of work.


The Broader Lesson: Transaction Design Is Architecture

This incident illustrates a principle that many developers learn the hard way: transaction boundaries are an architectural decision, not an afterthought.

The original developer thought about the correctness of the interest calculation -- the SQL was correct, the arithmetic was right, the interest was credited properly. What they did not think about was:

  1. How long will this transaction run? In a multi-user database, transaction duration directly impacts concurrency.

  2. How many locks will this transaction hold? Lock accumulation is proportional to the number of rows modified within a single commit scope.

  3. How much log will this transaction generate? Log consumption determines whether the transaction can complete or will trigger a log-full condition.

  4. What happens if this transaction fails? Rollback time is roughly proportional to the amount of work to undo. The 42-minute transaction took 28 minutes to roll back; a 5,000-row batch rolls back in seconds.

  5. Can other applications continue working while this runs? If the batch locks critical tables, the entire application stack feels it.

The COMMIT Frequency Decision Framework

Priya documents a decision framework for future batch jobs at Meridian:

| Factor                  | Guidance                                                                 |
|-------------------------|--------------------------------------------------------------------------|
| Lock threshold          | Keep locks per transaction below 20% of LOCKLIST capacity to avoid escalation |
| Log consumption         | Keep per-transaction log usage below 10% of active log capacity          |
| Concurrency requirement | Higher concurrency = more frequent commits                               |
| Recovery time objective | Max rollback time should be < 1 minute for production batches            |
| Throughput              | Commits have overhead; find the balance between safety and speed         |
| Typical batch size      | 1,000-10,000 rows per commit for most workloads                          |
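The lock-threshold row can be turned into a concrete number per environment. The sketch below assumes roughly 256 bytes per lock entry; the exact size varies by Db2 version and platform, so treat that constant as an assumption to verify against your release's documentation:

```sql
-- LOCKLIST is configured in 4 KB pages. Assuming ~256 bytes per
-- lock entry, 20% of the lock list translates to roughly:
SELECT NAME,
       VALUE AS locklist_4k_pages,
       (INTEGER(VALUE) * 4096 / 256) / 5 AS approx_locks_at_20_pct
FROM SYSIBMADM.DBCFG
WHERE NAME = 'locklist';
```

If LOCKLIST is AUTOMATIC, substitute the currently tuned value reported by the memory tracker rather than the configured one.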

The Checkpoint Pattern

Every batch job at Meridian now follows the checkpoint pattern:

  1. Before starting: Check if a previous run was interrupted. If so, resume from the last checkpoint.
  2. During processing: Update a checkpoint record at every commit point, recording the last processed key and rows completed.
  3. On completion: Mark the checkpoint as COMPLETE.
  4. On failure: The checkpoint record stays in IN_PROGRESS state. The next run detects this and resumes.

This pattern transforms batch jobs from "all or nothing" operations into resumable, restartable processes.
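Step 1 of the pattern is a single query against the checkpoint table. A sketch of what a scheduler (or the procedure itself) runs before starting:

```sql
-- Was a previous run for today's batch interrupted?
SELECT last_account_id, rows_processed, last_commit_time
FROM meridian.batch_checkpoint
WHERE batch_name = 'MONTHLY_INTEREST'
  AND batch_date = CURRENT DATE
  AND status = 'IN_PROGRESS';
```

A row coming back means resume from last_account_id; no row means start fresh.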


Post-Incident Actions

  1. Immediate: The interest calculation procedure was rewritten with batch commits and deployed to production.

  2. Short-term: Priya audited all 23 batch procedures in the Meridian LUW environment. She found four others with similar "no commit" patterns and remediated them.

  3. Medium-term: The team established a code review checklist that includes:
     - Does the procedure commit at regular intervals?
     - Is there a checkpoint/restart mechanism?
     - Has the maximum transaction size been estimated and validated against log capacity?
     - Has the procedure been tested with production-scale data volumes?

  4. Long-term: Monitoring alerts were configured to detect:
     - any transaction holding locks for more than 5 minutes
     - any transaction consuming more than 500 MB of log
     - any lock escalation event

These alerts would have detected the original problem within minutes rather than waiting for user-facing failures.
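The first two alert conditions can be polled with the MON_GET_UNIT_OF_WORK table function; a sketch whose thresholds mirror the alert definitions above:

```sql
-- Flag units of work running longer than 5 minutes or holding
-- more than 500 MB (524,288,000 bytes) of active log.
SELECT APPLICATION_HANDLE,
       UOW_START_TIME,
       UOW_LOG_SPACE_USED
FROM TABLE(MON_GET_UNIT_OF_WORK(NULL, -2))
WHERE UOW_START_TIME < CURRENT TIMESTAMP - 5 MINUTES
   OR UOW_LOG_SPACE_USED > 524288000;
```

Run on a one-minute schedule, this would have flagged agent 15823 by 14:06, nearly forty minutes before the page went out.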


Discussion Questions

  1. The redesigned procedure commits every 5,000 rows. If Meridian's digital banking platform requires sub-second response times for account balance queries, is 5,000 the right number? How would you determine the optimal batch size?

  2. The checkpoint pattern assumes ordered processing (by account_id). What challenges arise if the batch must process accounts in a different order (for example, by branch, then by account)?

  3. A colleague suggests using a set-based approach instead of a cursor: UPDATE meridian.account SET current_balance = current_balance + ROUND(current_balance * (interest_rate / 12), 2) WHERE account_type = 'SAVINGS' AND status = 'ACTIVE'. This would process all accounts in one statement. What are the advantages and risks? Would you recommend it?

  4. The incident caused a 70-minute outage. What was the reputational cost to the bank? How would you calculate the business impact to justify the investment in redesigning batch procedures?

  5. On DB2 for z/OS, this interest calculation would typically run during the batch window when online systems are quiesced. How does the z/OS batch window model fundamentally differ from the LUW always-on model, and how does that difference affect transaction design?


Return to Chapter 9 | Continue to Key Takeaways