Case Study 2: When the Queue Ran Dry

Background

Midwest Healthcare Alliance (MHA) had successfully migrated its prescription claim payments from batch to real-time MQ-based processing six months earlier. The system processed approximately 30,000 prescription claims per day through a queue connecting the claims adjudication system to the pharmacy payment system. The system had been running flawlessly: zero dead-letter queue (DLQ) messages, zero reconciliation mismatches, and an average end-to-end latency of 8 seconds.

On a Tuesday morning, the pharmacy payment team noticed that no new payments had been processed since 2:14 AM. The MQ queue was empty. The monitoring program reported zero queue depth (which it interpreted as "healthy — no backlog"). The claims system was adjudicating claims normally. But no messages were arriving on the payment queue.

The Investigation

The MQ administrators investigated and found the problem: the MQ channel between the claims LPAR and the payment LPAR had been stopped by an automated process. A scheduled maintenance task, intended for a different queue manager, had accidentally targeted the production channel. The channel had been down for 7 hours.

During those 7 hours, the claims adjudication system continued to PUT messages — but they accumulated on the local transmission queue on the claims LPAR, not on the remote payment queue. The transmission queue had grown to 47,000 messages.

When the MQ channel was restarted at 9:30 AM, all 47,000 messages began flowing to the payment queue simultaneously. The payment consumer program, designed for steady-state processing of ~1,250 messages per hour, was suddenly faced with 47,000 messages. The DB2 subsystem supporting the payment program experienced extreme lock contention as the consumer attempted to process all messages at once. DB2 response times increased from 5 milliseconds to 3 seconds. Several CICS transactions timed out.

By 10:15 AM, the CICS region was in distress. The systems programmer was forced to cold-start the CICS region, which stopped all processing — including the payment consumer.

The Resolution

The recovery took the rest of the day:

  1. 10:15 AM: CICS cold-started. Payment processing stopped.
  2. 10:45 AM: MQ channel stopped to prevent message flood.
  3. 11:00 AM: A temporary "throttle" program was deployed to read messages from the queue one at a time, with a 100-millisecond delay between messages.
  4. 11:00 AM - 4:00 PM: The throttle program processed all 47,000 backlog messages at a controlled rate.
  5. 4:00 PM: MQ channel restarted with normal flow.
  6. 4:30 PM: System returned to normal operation.
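The temporary throttle in step 3 can be sketched as follows. This is an illustrative Python model, not the program MHA ran; the in-memory deque stands in for a destructive MQ GET, and the 100-millisecond delay matches the timeline above.

```python
import time
from collections import deque

def throttled_drain(queue, process, delay_seconds=0.1):
    """Read messages one at a time, pausing between each so the
    downstream database sees a steady trickle instead of a flood."""
    processed = 0
    while queue:
        message = queue.popleft()   # destructive read of one message
        process(message)            # e.g. apply the payment
        processed += 1
        time.sleep(delay_seconds)   # the 100 ms throttle
    return processed
```

At 100 ms of delay per message, 47,000 messages take at least ~78 minutes of pure waiting; that the drain actually took five hours suggests per-message processing time dominated the delay.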

Total downtime: ~8 hours. Total claims affected: approximately 47,000 prescriptions with delayed payments.

Root Causes

  1. Monitoring blind spot. The monitoring program only checked queue depth on the payment (remote) queue. It did not monitor the transmission queue on the claims LPAR. When the channel was down, the payment queue depth was zero (no messages arriving), which the monitor interpreted as healthy. In reality, messages were piling up on the transmission queue.

  2. No channel monitoring. The MQ channel status was not checked by any automated monitoring. The channel-down condition was invisible for 7 hours.

  3. No flow-control on the consumer. The consumer program processed messages as fast as they arrived, with no throttling. Under normal conditions, this was fine. Under burst conditions (47,000 messages at once), it overwhelmed DB2.

  4. Maintenance process targeting error. The root cause of the channel outage was a human error in the maintenance scheduling system. The wrong queue manager name was specified.
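Root causes 1 and 2 point to a single composite health check: watch the transmission queue and the channel, not just the destination queue. A minimal sketch in Python, where the queue and channel names are hypothetical and `get_depth`/`get_channel_status` stand in for whatever MQ administration interface is available:

```python
def check_message_path(get_depth, get_channel_status,
                       xmit_queue="CLAIMS.XMITQ",
                       channel="CLAIMS.TO.PAYMENT",
                       max_xmit_depth=100):
    """Alert on the two conditions MHA's monitor missed: a stopped
    channel, or messages piling up on the transmission queue."""
    alerts = []
    status = get_channel_status(channel)
    if status != "RUNNING":
        alerts.append(f"channel {channel} is {status}")
    depth = get_depth(xmit_queue)
    if depth > max_xmit_depth:
        alerts.append(f"transmission queue {xmit_queue} depth {depth} "
                      f"exceeds {max_xmit_depth}")
    return alerts
```

Run every few minutes, either alert would have fired long before 47,000 messages accumulated: a zero destination-queue depth is only "healthy" when the channel is running and the transmission queue is near empty.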

Lessons Learned

  1. Monitor the entire message path. Queue depth on the destination queue is not sufficient. You must also monitor transmission queues, channel status, and message age (how long the oldest message has been waiting).

  2. Build flow-control into consumers. A consumer that processes messages as fast as possible will fail under burst conditions. Production consumers should have configurable throttling: a maximum number of messages per minute that can be raised during catch-up without overwhelming DB2.

  3. Test failure scenarios, not just happy paths. The system had been tested with normal volumes and with the parallel-run reconciliation. It had never been tested with a 7-hour backlog arriving all at once.

  4. Automate maintenance safeguards. The maintenance process that stopped the channel should have required confirmation when targeting a production queue manager. A simple "are you sure?" prompt — or better, a separate approval workflow for production changes — would have prevented the outage.
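Lessons 2 and 3 combine naturally into a consumer with a configurable ceiling and a distinct catch-up mode. A sketch in Python, illustrative only; the specific rates are invented for the example (steady state here is ~1,250 messages per hour, so even the "normal" ceiling leaves headroom):

```python
import time

class ThrottledConsumer:
    """Process at most `rate` messages per minute. Catch-up mode
    raises the ceiling without removing it entirely."""
    NORMAL_RATE = 1500    # msgs/minute; hypothetical, above steady state
    CATCHUP_RATE = 6000   # msgs/minute during backlog drain; hypothetical

    def __init__(self, process):
        self.process = process
        self.rate = self.NORMAL_RATE

    def enter_catchup(self):
        self.rate = self.CATCHUP_RATE

    def exit_catchup(self):
        self.rate = self.NORMAL_RATE

    def drain(self, queue):
        count = 0
        while queue:
            # Recompute each iteration so a mode change takes effect
            # mid-drain.
            interval = 60.0 / self.rate
            self.process(queue.pop(0))
            count += 1
            time.sleep(interval)
        return count
```

The key design point is that catch-up is faster than normal but still bounded, so a 7-hour backlog drains in well under a shift without reproducing the lock contention that took down the CICS region.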

Discussion Questions

  1. Design a monitoring check that would have detected the channel-down condition within 5 minutes. What metric would you monitor, and what threshold would trigger an alert?
  2. Write a COBOL pseudocode sketch of a throttled consumer that processes a maximum of N messages per second.
  3. How would you design a "catch-up" mode for the consumer that processes backlog faster than normal but slower than "as fast as possible"?
  4. The transmission queue grew to 47,000 messages. What would have happened if the transmission queue's MAXDEPTH had been set to 10,000? Would that have been better or worse?
  5. What changes to the cutover plan (from this chapter) would help protect against similar infrastructure failures?