Case Study 2: Transaction Design -- When COMMIT Frequency Matters
Background
Priya Sharma is the lead DBA at Meridian National Bank's distributed systems team, responsible for the Db2 for LUW environment that powers the digital banking platform. The platform handles online account management, mobile deposits, bill payments, and real-time balance inquiries for 1.2 million customers.
On a Tuesday afternoon, the operations team reports a cascading failure: the bill payment service is timing out, the mobile app shows stale balances, and the customer service dashboard is unresponsive. Priya is paged at 2:43 PM.
The Incident
Initial Investigation
Priya connects to the production database and immediately checks the lock situation:
-- Check for lock waits (the MON_LOCKWAITS administrative view pairs
-- each waiting application with the application holding the lock)
SELECT
    HLD_APPLICATION_HANDLE AS blocker_agent,
    TABNAME AS locked_table,
    LOCK_MODE,
    REQ_APPLICATION_HANDLE AS waiting_agent,
    LOCK_WAIT_START_TIME
FROM SYSIBMADM.MON_LOCKWAITS
ORDER BY LOCK_WAIT_START_TIME;
The results are alarming: 847 agents are waiting for locks on the ACCOUNT table. All are blocked by a single agent -- agent 15823.
-- What is agent 15823 doing? (APPL_NAME and APPL_STATUS live in
-- SNAPAPPL_INFO; the UOW and lock counters live in SNAPAPPL)
SELECT i.AGENT_ID, i.APPL_NAME, i.APPL_STATUS,
       a.UOW_START_TIME, a.LOCKS_HELD, a.UOW_LOG_SPACE_USED
FROM SYSIBMADM.SNAPAPPL_INFO AS i
JOIN SYSIBMADM.SNAPAPPL AS a
  ON a.AGENT_ID = i.AGENT_ID
WHERE i.AGENT_ID = 15823;
The result:
- APPL_NAME: INTEREST_CALC
- APPL_STATUS: UOW Executing
- UOW_START_TIME: 2025-11-18 14:01:03 (42 minutes ago)
- LOCKS_HELD: 487,231
- UOW_LOG_SPACE_USED: 2,915,860,480 bytes (~2.8 GB)
Agent 15823 is the monthly interest calculation batch job. It has been running for 42 minutes, has accumulated nearly half a million locks, and has consumed 2.8 GB of transaction log -- all in a single uncommitted transaction.
Root Cause
Priya pulls up the interest calculation code. It was written six months ago by a developer who is no longer with the team:
CREATE PROCEDURE meridian.calculate_monthly_interest()
LANGUAGE SQL
BEGIN
    DECLARE v_account_id BIGINT;
    DECLARE v_balance DECIMAL(15,2);
    DECLARE v_rate DECIMAL(7,5);
    DECLARE v_interest DECIMAL(15,2);
    DECLARE v_done INTEGER DEFAULT 0;
    -- SQL PL declaration order: variables, then cursors, then handlers
    DECLARE interest_cursor CURSOR FOR
        SELECT account_id, current_balance, interest_rate
        FROM meridian.account
        WHERE account_type = 'SAVINGS'
          AND status = 'ACTIVE'
          AND current_balance > 0
        FOR UPDATE;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET v_done = 1;

    OPEN interest_cursor;
    interest_loop: LOOP
        FETCH interest_cursor INTO v_account_id, v_balance, v_rate;
        IF v_done = 1 THEN LEAVE interest_loop; END IF;
        SET v_interest = ROUND(v_balance * (v_rate / 12), 2);

        -- Update the account balance
        UPDATE meridian.account
        SET current_balance = current_balance + v_interest,
            last_interest_date = CURRENT DATE
        WHERE CURRENT OF interest_cursor;

        -- Record the interest transaction
        INSERT INTO meridian.transaction_history
            (transaction_id, account_id, transaction_type, amount,
             transaction_date, description)
        VALUES
            (NEXT VALUE FOR meridian.transaction_seq,
             v_account_id, 'INTEREST', v_interest,
             CURRENT DATE, 'Monthly interest credit');
    END LOOP;
    CLOSE interest_cursor;

    -- Single COMMIT at the very end
    COMMIT;
END
The fatal flaw: The procedure processes all 487,000 savings accounts in a single transaction with no intermediate commits. Every row it touches stays locked until the final COMMIT. After roughly 350,000 accounts, Db2 escalated the accumulated row locks to a table-level exclusive lock on the ACCOUNT table. This blocked every other application from accessing account data -- including the online banking platform, mobile app, and customer service tools.
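The escalation point can be sanity-checked from the lock-memory configuration: Db2 escalates when a single application's locks exceed the MAXLOCKS percentage of the LOCKLIST. The sketch below uses illustrative values -- the 4 KB LOCKLIST page size and ~128-byte per-lock overhead are assumptions for the arithmetic, not Meridian's actual configuration:

```python
# Rough estimate of the lock-escalation point for one application.
# Assumptions (illustrative): LOCKLIST is measured in 4 KB pages,
# each held lock costs ~128 bytes, and escalation fires when one
# application holds MAXLOCKS percent of the lock list.

PAGE_SIZE = 4096          # bytes per LOCKLIST page (assumed)
BYTES_PER_LOCK = 128      # assumed average lock-entry size

def escalation_threshold(locklist_pages: int, maxlocks_pct: int) -> int:
    """Approximate number of locks one application can hold
    before its row locks are escalated to a table lock."""
    total_lock_entries = (locklist_pages * PAGE_SIZE) // BYTES_PER_LOCK
    return total_lock_entries * maxlocks_pct // 100

# With, say, LOCKLIST = 12000 pages and MAXLOCKS = 90, escalation
# fires at roughly 345,600 locks -- consistent with the job stalling
# the system after ~350,000 accounts.
print(escalation_threshold(12000, 90))  # 345600
```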
Immediate Resolution
Priya forces the application:
-- Force terminate the runaway application
CALL ADMIN_CMD('FORCE APPLICATION (15823)');
Db2 begins rolling back agent 15823's transaction. Because 42 minutes of work must be undone, the rollback takes 28 minutes, and the ACCOUNT table remains locked the entire time. Total outage duration: 70 minutes.
The Fix: Redesigning the Transaction Strategy
Priya redesigns the interest calculation procedure with proper transaction control:
CREATE OR REPLACE PROCEDURE meridian.calculate_monthly_interest(
    IN p_batch_size INTEGER DEFAULT 5000,
    IN p_calc_date DATE DEFAULT CURRENT DATE
)
LANGUAGE SQL
BEGIN
    DECLARE v_account_id BIGINT;
    DECLARE v_balance DECIMAL(15,2);
    DECLARE v_rate DECIMAL(7,5);
    DECLARE v_interest DECIMAL(15,2);
    DECLARE v_batch_count INTEGER DEFAULT 0;
    DECLARE v_total_count INTEGER DEFAULT 0;
    DECLARE v_total_interest DECIMAL(15,2) DEFAULT 0;
    DECLARE v_last_processed BIGINT DEFAULT 0;
    DECLARE v_done INTEGER DEFAULT 0;
    -- WITH HOLD keeps the cursor open across the intermediate COMMITs;
    -- the variable v_last_processed is evaluated when the cursor opens
    DECLARE interest_cursor CURSOR WITH HOLD FOR
        SELECT account_id, current_balance, interest_rate
        FROM meridian.account
        WHERE account_type = 'SAVINGS'
          AND status = 'ACTIVE'
          AND current_balance > 0
          AND account_id > v_last_processed
        ORDER BY account_id;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET v_done = 1;

    -- Check for restart: find the last processed account
    SELECT COALESCE(MAX(last_account_id), 0)
      INTO v_last_processed
    FROM meridian.batch_checkpoint
    WHERE batch_name = 'MONTHLY_INTEREST'
      AND batch_date = p_calc_date
      AND status = 'IN_PROGRESS';

    -- If no checkpoint exists, create one
    IF v_last_processed = 0 THEN
        INSERT INTO meridian.batch_checkpoint
            (batch_name, batch_date, status, last_account_id,
             rows_processed, start_time)
        VALUES
            ('MONTHLY_INTEREST', p_calc_date, 'IN_PROGRESS', 0,
             0, CURRENT TIMESTAMP);
        COMMIT;
    END IF;

    -- Process accounts in order, resuming from the checkpoint
    OPEN interest_cursor;
    interest_loop: LOOP
        FETCH interest_cursor INTO v_account_id, v_balance, v_rate;
        IF v_done = 1 THEN LEAVE interest_loop; END IF;
        SET v_interest = ROUND(v_balance * (v_rate / 12), 2);

        -- Update the account balance
        UPDATE meridian.account
        SET current_balance = current_balance + v_interest,
            last_interest_date = p_calc_date
        WHERE account_id = v_account_id;

        -- Record the interest transaction
        INSERT INTO meridian.transaction_history
            (transaction_id, account_id, transaction_type, amount,
             transaction_date, description)
        VALUES
            (NEXT VALUE FOR meridian.transaction_seq,
             v_account_id, 'INTEREST', v_interest,
             p_calc_date, 'Monthly interest credit');

        SET v_batch_count = v_batch_count + 1;
        SET v_total_count = v_total_count + 1;
        SET v_total_interest = v_total_interest + v_interest;

        -- Commit every p_batch_size rows
        IF v_batch_count >= p_batch_size THEN
            -- Update the checkpoint inside the same unit of work,
            -- so the restart position commits with the rows it covers
            UPDATE meridian.batch_checkpoint
            SET last_account_id = v_account_id,
                rows_processed = v_total_count,
                last_commit_time = CURRENT TIMESTAMP
            WHERE batch_name = 'MONTHLY_INTEREST'
              AND batch_date = p_calc_date;
            COMMIT;
            SET v_batch_count = 0;
        END IF;
    END LOOP;
    CLOSE interest_cursor;

    -- Final commit and checkpoint update
    UPDATE meridian.batch_checkpoint
    SET status = 'COMPLETE',
        last_account_id = v_account_id,
        rows_processed = v_total_count,
        total_interest = v_total_interest,
        end_time = CURRENT TIMESTAMP
    WHERE batch_name = 'MONTHLY_INTEREST'
      AND batch_date = p_calc_date;
    COMMIT;
END
Key Design Changes
1. Periodic commits (every 5,000 rows). Instead of one massive transaction, the procedure commits in batches. Each commit releases all locks, freeing the ACCOUNT table for other applications.
2. Checkpoint-based restart. The batch_checkpoint table records the last successfully processed account_id. If the job fails, it can be restarted from the last checkpoint instead of starting over.
3. Ordered processing. Accounts are processed in account_id order and the cursor starts from v_last_processed. This enables deterministic restart behavior.
4. No FOR UPDATE cursor. The original code used FOR UPDATE which held locks on all fetched rows. The new code uses a regular cursor and updates by account_id, holding locks only within the current batch.
5. Configurable batch size. The batch size is a parameter, allowing operations to tune it based on observed lock and log behavior.
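Stripped of the SQL, the redesigned procedure follows a generic batch-commit shape. A minimal Python sketch of that driver loop -- `apply_row`, `commit`, and `checkpoint` are hypothetical stand-ins for the UPDATE/INSERT pair, the Db2 COMMIT, and the batch_checkpoint update:

```python
# Generic batch-commit driver with the same shape as the redesigned
# procedure. All names here are illustrative, not a Db2 API.

def run_batched(rows, apply_row, commit, checkpoint, batch_size=5000):
    """Apply `apply_row` to each (key, row) pair, committing (and
    recording a checkpoint) every `batch_size` rows and at the end."""
    in_batch = 0
    total = 0
    last_key = None
    for key, row in rows:
        apply_row(row)
        last_key = key
        in_batch += 1
        total += 1
        if in_batch >= batch_size:
            checkpoint(last_key, total)   # record restart position...
            commit()                      # ...in the same unit of work
            in_batch = 0
    checkpoint(last_key, total)
    commit()
    return total

# Example: 12 rows with a batch size of 5 -> commits after rows 5 and
# 10, plus a final commit covering the remaining 2.
commits = []
n = run_batched(((i, i) for i in range(12)), lambda r: None,
                lambda: commits.append("commit"),
                lambda k, t: None, batch_size=5)
print(n, len(commits))  # 12 3
```

Writing the checkpoint before each commit matters: the restart position is always persisted by the same unit of work it describes, so a crash can never leave the checkpoint ahead of (or behind) the committed rows.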
Analyzing the Tradeoffs
Lock Behavior Comparison
| Metric | Original (No Commits) | Redesigned (5,000-row batches) |
|---|---|---|
| Max concurrent locks | 487,000+ (all accounts) | ~5,000 (one batch) |
| Lock escalation risk | Guaranteed (escalated to table lock) | Very low (well below threshold) |
| Concurrent access blocked | All ACCOUNT table access for ~70 min | Brief lock waits during batch commits |
| Other application impact | Complete outage | Barely noticeable |
Log Volume Comparison
| Metric | Original | Redesigned |
|---|---|---|
| Active log consumed | 2.8 GB (single UOW) | ~28 MB per batch (released each commit) |
| Log-full risk | High | Negligible |
| Rollback on failure | 42 minutes (entire batch) | < 5 seconds (current batch only) |
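The "~28 MB per batch" figure follows directly from the incident numbers, assuming log volume scales roughly linearly with rows processed:

```python
# Back-of-the-envelope check of the log-volume table.
total_log_kb = 2_847_520   # log consumed by the single-UOW run
total_rows = 487_000       # savings accounts processed
batch_size = 5_000

kb_per_row = total_log_kb / total_rows      # ~5.8 KB of log per account
kb_per_batch = kb_per_row * batch_size      # ~28-29 MB per 5,000-row batch
print(round(kb_per_row, 1), round(kb_per_batch / 1024, 1))
```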
Throughput Comparison
| Metric | Original | Redesigned |
|---|---|---|
| Total elapsed time | ~42 min (before failure) | ~38 min (complete) |
| Recovery time on failure | 28 min rollback + restart from zero | Restart from last checkpoint |
| Total time with one failure | 42 + 28 + 42 = 112 min | ~38 min (seconds of rollback, then resume from checkpoint) |
The redesigned procedure is actually somewhat faster than the original, largely because it never triggers lock escalation and never forces Db2 to manage a half-million-entry lock structure in memory, and because it is not competing with hundreds of blocked applications.
The Broader Lesson: Transaction Design Is Architecture
This incident illustrates a principle that many developers learn the hard way: transaction boundaries are an architectural decision, not an afterthought.
The original developer thought about the correctness of the interest calculation -- the SQL was correct, the arithmetic was right, the interest was credited properly. What they did not think about was:
- How long will this transaction run? In a multi-user database, transaction duration directly impacts concurrency.
- How many locks will this transaction hold? Lock accumulation is proportional to the number of rows modified within a single commit scope.
- How much log will this transaction generate? Log consumption determines whether the transaction can complete or will trigger a log-full condition.
- What happens if this transaction fails? Rollback time is proportional to the amount of work to undo. A 42-minute transaction takes 28 minutes to roll back. A 5-second batch takes milliseconds.
- Can other applications continue working while this runs? If the batch locks critical tables, the entire application stack feels it.
The COMMIT Frequency Decision Framework
Priya documents a decision framework for future batch jobs at Meridian:
| Factor | Guidance |
|---|---|
| Lock threshold | Keep locks per transaction below 20% of LOCKLIST to avoid escalation |
| Log consumption | Keep per-transaction log usage below 10% of active log capacity |
| Concurrency requirement | Higher concurrency = more frequent commits |
| Recovery time objective | Max rollback time should be < 1 minute for production batches |
| Throughput | Commits have overhead; find the balance between safety and speed |
| Typical batch size | 1,000-10,000 rows per commit for most workloads |
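These limits compose mechanically: a safe upper bound on batch size is the strictest of the lock, log, and rollback constraints. A hedged sketch with made-up configuration numbers (none of these are Meridian's real values):

```python
# Derive a batch-size ceiling from the decision framework: take the
# minimum of the lock, log, and rollback-time limits. All inputs
# below are illustrative, not real configuration.

def safe_batch_size(locklist_locks, log_capacity_kb, kb_per_row,
                    rollback_rows_per_sec, max_rollback_sec=60,
                    locks_per_row=2):
    by_locks = locklist_locks * 20 // 100 // locks_per_row  # < 20% of LOCKLIST
    by_log = log_capacity_kb * 10 // 100 // kb_per_row      # < 10% of active log
    by_rollback = rollback_rows_per_sec * max_rollback_sec  # < 1 min rollback
    return min(by_locks, by_log, by_rollback)

# e.g. 400k lock-list entries, 20 GB of active log, ~6 KB of log per
# row (one UPDATE plus one INSERT), rollback undoing ~2,000 rows/sec:
print(safe_batch_size(400_000, 20_971_520, 6, 2_000))  # 40000
```

In this sketch the lock budget is the binding constraint; a 5,000-row batch then leaves comfortable headroom below every limit.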
The Checkpoint Pattern
Every batch job at Meridian now follows the checkpoint pattern:
- Before starting: Check if a previous run was interrupted. If so, resume from the last checkpoint.
- During processing: Update a checkpoint record at every commit point, recording the last processed key and rows completed.
- On completion: Mark the checkpoint as COMPLETE.
- On failure: The checkpoint record stays in IN_PROGRESS state. The next run detects this and resumes.
This pattern transforms batch jobs from "all or nothing" operations into resumable, restartable processes.
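The pattern itself is database-agnostic. Here is a compact sketch using Python's built-in sqlite3 as a stand-in for Db2; the table and column names echo the case study but are otherwise hypothetical:

```python
import sqlite3

# Minimal checkpoint-pattern demo, with sqlite3 standing in for Db2.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE batch_checkpoint
               (batch_name TEXT, status TEXT, last_key INTEGER)""")

def resume_point(con, name):
    """Return the last processed key of an interrupted run, or 0."""
    row = con.execute("""SELECT last_key FROM batch_checkpoint
                         WHERE batch_name = ? AND status = 'IN_PROGRESS'""",
                      (name,)).fetchone()
    return row[0] if row else 0

def save_checkpoint(con, name, last_key, done=False):
    """Record progress; mark COMPLETE when the batch finishes."""
    status = "COMPLETE" if done else "IN_PROGRESS"
    con.execute("DELETE FROM batch_checkpoint WHERE batch_name = ?", (name,))
    con.execute("INSERT INTO batch_checkpoint VALUES (?, ?, ?)",
                (name, status, last_key))
    con.commit()

# Simulate a run that dies after processing up to key 6000...
save_checkpoint(con, "MONTHLY_INTEREST", 6000)
# ...a restart that picks up where it left off...
start = resume_point(con, "MONTHLY_INTEREST")
# ...and a completed run, after which there is nothing to resume.
save_checkpoint(con, "MONTHLY_INTEREST", 10000, done=True)
print(start, resume_point(con, "MONTHLY_INTEREST"))  # 6000 0
```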
Post-Incident Actions
- Immediate: The interest calculation procedure was rewritten with batch commits and deployed to production.
- Short-term: Priya audited all 23 batch procedures in the Meridian LUW environment. She found four others with similar "no commit" patterns and remediated them.
- Medium-term: The team established a code review checklist that includes:
  - Does the procedure commit at regular intervals?
  - Is there a checkpoint/restart mechanism?
  - Has the maximum transaction size been estimated and validated against log capacity?
  - Has the procedure been tested with production-scale data volumes?
- Long-term: Monitoring alerts were configured to detect:
  - Any transaction holding locks for more than 5 minutes
  - Any transaction consuming more than 500 MB of log
  - Any lock escalation event
These alerts would have detected the original problem within minutes rather than waiting for user-facing failures.
Discussion Questions
1. The redesigned procedure commits every 5,000 rows. If Meridian's digital banking platform requires sub-second response times for account balance queries, is 5,000 the right number? How would you determine the optimal batch size?
2. The checkpoint pattern assumes ordered processing (by account_id). What challenges arise if the batch must process accounts in a different order (for example, by branch, then by account)?
3. A colleague suggests using a set-based approach instead of a cursor:

   UPDATE meridian.account
   SET current_balance = current_balance
         + ROUND(current_balance * (interest_rate / 12), 2)
   WHERE account_type = 'SAVINGS'
     AND status = 'ACTIVE';

   This would process all accounts in one statement. What are the advantages and risks? Would you recommend it?
4. The incident caused a 70-minute outage. What was the reputational cost to the bank? How would you calculate the business impact to justify the investment in redesigning batch procedures?
5. On Db2 for z/OS, this interest calculation would typically run during the batch window when online systems are quiesced. How does the z/OS batch window model fundamentally differ from the LUW always-on model, and how does that difference affect transaction design?