Chapter 25 Key Takeaways
Core Principles
- Parallel batch is arithmetic, not magic. Four partitions with balanced data and adequate resources reduce elapsed time by roughly 65-70%. The speedup is real but not linear — partitioning overhead, imbalance, and shared-resource contention consume some of the theoretical benefit.
- Three levels of parallelism are multiplicative. Application partitioning (splitting work across JCL jobs), DB2 parallelism (I/O, CP, and Sysplex parallelism within the database), and SORT parallelism (DFSORT hiperspace and multi-volume work files) compound each other. Use all three where applicable.
- The partition control table is the linchpin. It provides partition boundaries, status tracking, checkpoint data, timing, and reconciliation data in a single queryable structure. Every parallel batch design should start here.
- Partition balance determines actual speedup. A 4:1 imbalance ratio means your parallel batch is only as fast as the overloaded partition — potentially slower than a well-tuned 2-partition design. Compute balanced boundaries from actual data distribution, not assumptions.
- Partition-level restart is the operational safety net. When 1 of 4 partitions fails, restart only the failed partition from its checkpoint. Do not rerun the 3 successful partitions. This requires unique checkpoint IDs per partition and independent output files.
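The "arithmetic, not magic" point can be made concrete with a back-of-envelope estimate. The overhead and imbalance figures below are illustrative assumptions, not measurements from any particular system:

```python
def parallel_elapsed(serial_minutes, partitions,
                     overhead_fraction=0.05, imbalance_ratio=1.15):
    """Rough elapsed-time estimate for a partitioned batch run.

    overhead_fraction: split/merge and contention cost as a share of
    serial time (assumed value, for illustration only).
    imbalance_ratio: largest partition's load relative to a perfectly
    even split (assumed value, for illustration only).
    """
    per_partition = serial_minutes / partitions * imbalance_ratio
    overhead = serial_minutes * overhead_fraction
    return per_partition + overhead

serial = 120.0                       # a 2-hour serial job
par = parallel_elapsed(serial, 4)    # 120/4 * 1.15 + 6.0 = 40.5 minutes
saved = 1 - par / serial             # ~0.66 — in the 65-70% range
```

Pushing the assumed overhead or imbalance up quickly eats into the saving, which is why four partitions do not yield a 75% reduction.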
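One possible shape for the control table, and the restart query it enables, can be sketched with SQLite standing in for DB2. The column names are hypothetical illustrations, not a shop standard:

```python
import sqlite3

# Hypothetical partition control table; columns are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE PARTITION_CONTROL (
        RUN_ID          TEXT    NOT NULL,
        PARTITION_NO    INTEGER NOT NULL,
        LOW_KEY         TEXT    NOT NULL,  -- inclusive lower boundary
        HIGH_KEY        TEXT    NOT NULL,  -- inclusive upper boundary
        STATUS          TEXT    NOT NULL,  -- PENDING/RUNNING/DONE/FAILED
        CHECKPOINT_ID   TEXT,              -- unique restart token
        ROWS_PROCESSED  INTEGER DEFAULT 0,
        PRIMARY KEY (RUN_ID, PARTITION_NO))""")
conn.executemany(
    "INSERT INTO PARTITION_CONTROL VALUES (?,?,?,?,?,?,?)",
    [("R001", 1, "A", "F", "DONE",   "CKPT-R001-1", 250_000),
     ("R001", 2, "G", "M", "DONE",   "CKPT-R001-2", 248_000),
     ("R001", 3, "N", "S", "FAILED", "CKPT-R001-3", 91_000),
     ("R001", 4, "T", "Z", "DONE",   "CKPT-R001-4", 252_000)])

# Partition-level restart: rerun only what did not finish.
to_restart = [row[0] for row in conn.execute(
    "SELECT PARTITION_NO FROM PARTITION_CONTROL "
    "WHERE RUN_ID = 'R001' AND STATUS <> 'DONE'")]
# → [3]
```

Boundaries, status, checkpoint IDs, and counts all live in one queryable place, which is exactly what the restart and reconciliation rules below depend on.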
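Computing boundaries from actual distribution rather than assumptions amounts to cutting the sorted key population into equal-count slices. A minimal sketch, assuming the full key list is available (in practice you would sample or use catalog statistics):

```python
def balanced_boundaries(sorted_keys, partitions):
    """Upper boundary key for each partition except the last, chosen
    so every partition holds roughly the same number of records."""
    n = len(sorted_keys)
    return [sorted_keys[i * n // partitions - 1]
            for i in range(1, partitions)]

keys = list(range(1, 13))          # 12 keys, ascending
balanced_boundaries(keys, 4)       # → [3, 6, 9]
```

Equal-width key ranges would give the same answer only for uniform data; on skewed data, count-based cuts keep the imbalance ratio near 1:1 where width-based cuts do not.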
Design Rules
- Choose partition keys based on four properties: even distribution, no cross-partition dependencies, alignment with physical data layout, and compatibility with checkpoint/restart.
- Isolate resources per partition: separate output files, separate SORT work datasets, separate checkpoint datasets, separate DB2 threads. Two partitions writing to the same output file will corrupt data.
- Handle deadlocks in every partition-safe program. SQLCODE -911 and -913 must trigger rollback, delay, and retry — not abend. A 1-3 second delay with random component breaks the deadlock-retry-deadlock cycle.
- Commit frequently in parallel batch: every 500-2,000 rows, depending on lock escalation thresholds. Infrequent commits cause lock escalation, which causes tablespace-level locks, which block all other partitions.
- Reconcile after every merge. Sum partition record counts, compare to input total. Verify control totals. Check for gaps and overlaps in key ranges. Any discrepancy halts downstream processing.
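The rollback-delay-retry pattern can be sketched as follows. `unit_of_work` and `rollback` are hypothetical stand-ins for the program's EXEC SQL logic, and the exception's `sqlcode` attribute is an assumed convention for surfacing the DB2 return code:

```python
import random
import time

MAX_RETRIES = 3
DEADLOCK_SQLCODES = {-911, -913}   # deadlock/timeout codes from the text

def with_deadlock_retry(unit_of_work, rollback):
    """Run one unit of work, retrying only on deadlock/timeout."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return unit_of_work()
        except Exception as exc:
            code = getattr(exc, "sqlcode", None)
            if code not in DEADLOCK_SQLCODES or attempt == MAX_RETRIES:
                raise                 # genuine error or out of retries
            rollback()                # release every lock we hold
            time.sleep(1 + 2 * random.random())   # 1-3 s, randomized
```

The random component matters: if both deadlocked partitions wait exactly the same interval, they collide again on the retry.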
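The three reconciliation checks — count totals, overlaps, gaps — fit in a few lines. This sketch assumes dense integer keys for the gap check; other key domains need their own notion of "adjacent":

```python
def reconcile(input_total, partition_counts, key_ranges):
    """Fail loudly on any merge discrepancy.

    key_ranges: inclusive (low, high) integer ranges, one per
    partition, sorted by low key. Dense integer keys are assumed
    for the gap check.
    """
    if sum(partition_counts) != input_total:
        raise ValueError("record counts do not sum to input total")
    for (_, hi), (lo, _) in zip(key_ranges, key_ranges[1:]):
        if lo <= hi:
            raise ValueError(f"key ranges overlap at {lo}")
        if lo != hi + 1:
            raise ValueError(f"gap between {hi} and {lo}")

# Clean merge: counts sum to the input total, ranges tile the keyspace.
reconcile(1_000, [400, 350, 250], [(1, 400), (401, 750), (751, 1000)])
```

Raising rather than logging enforces the rule that any discrepancy halts downstream processing.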
DB2 Parallelism
- DEGREE(ANY) enables all three forms of DB2 parallelism. DB2 chooses the appropriate level based on the query, tablespace, and available resources. There is no way to force a specific degree.
- Size buffer pools for parallelism. Each parallel stream needs its own buffer pages. If QXDEGRD < QXDEGAT in accounting traces, a resource constraint is limiting parallelism — the QXREDRN code tells you which resource.
- Align application partitions with DB2 table partitions. This eliminates cross-partition lock contention and enables DB2 partition pruning. It is the single most effective design decision for DB2-centric parallel batch.
- In Sysplex parallelism, size the group buffer pool (GBP) for concurrent write traffic. An undersized GBP causes GBP-full conditions that force synchronous writes, destroying the performance benefit of multi-member parallelism.
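The achieved-vs-attempted degree check can be expressed as a small monitor, using the accounting-trace counter names as the text gives them and the thresholds from the summary table at the end of the chapter:

```python
def parallelism_achieved(qxdegrd, qxdegat):
    """Ratio of achieved to attempted parallel degree.

    Counter names follow the text; thresholds follow the
    formulas-and-thresholds table (target > 0.75, alert < 0.5).
    """
    ratio = qxdegrd / qxdegat if qxdegat else 1.0
    status = "ok" if ratio > 0.75 else ("alert" if ratio < 0.5 else "watch")
    return ratio, status

parallelism_achieved(3, 8)   # → (0.375, 'alert')
```

A ratio below target with a nonzero QXREDRN-style reason code is the cue to look at buffer pool sizing before anything else.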
Pipeline Design
- The critical path is determined by the slowest partition. In a fan-out/fan-in pipeline, everything downstream waits for the last partition to complete. Monitor partition elapsed times and investigate any partition that takes significantly longer than its peers.
- Pipeline steps can overlap if dependencies allow. Interest calculation and fraud scoring can run concurrently if they access different data. Use dependency analysis to identify which steps can overlap and which must be sequential.
- Fan-out patterns trade I/O for simplicity. A single-pass COBOL splitter reads the input once. DFSORT OUTFIL with multiple INCLUDE conditions reads it N times. DB2 query partitioning reads nothing — each partition queries its own range.
- DFSORT MERGE is O(N) for pre-sorted inputs. Always use MERGE instead of SORT when combining sorted partition outputs. MERGE requires no SORTWK datasets and completes in a single pass.
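The critical-path arithmetic is simple enough to show inline; the timings are illustrative:

```python
# Fan-out/fan-in: the merge cannot start until every partition is
# done, so pipeline elapsed time is the maximum, not the average.
partition_minutes = [30, 32, 31, 55]   # illustrative partition timings
elapsed = max(partition_minutes)       # 55 — set by the straggler
average = sum(partition_minutes) / len(partition_minutes)   # 37.0
```

The gap between `elapsed` and `average` is the cost of the straggler, and is exactly what the imbalance monitoring rules below are meant to catch.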
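The single-pass merge of pre-sorted inputs has a direct analogue in Python's standard library, which makes the O(N) claim easy to see — each input is consumed exactly once, with no sort workspace:

```python
import heapq

# Three already-sorted partition outputs, as after a fan-out run.
part1, part2, part3 = [1, 4, 7], [2, 5, 8], [3, 6, 9]

# Like DFSORT MERGE: one streaming pass, smallest head element first.
merged = list(heapq.merge(part1, part2, part3))
# → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

`heapq.merge` streams lazily, so it also mirrors MERGE's property of never needing the full inputs in memory at once.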
Operational Rules
- Recompute partition boundaries regularly. Data distributions shift as the business changes. Monthly recomputation prevents imbalance drift that silently erodes parallel batch performance.
- Monitor partition imbalance in real time. Alert when the fastest partition is more than 50% ahead of the slowest (by record count). This indicates a rebalancing need or a partition-specific resource problem.
- Coordinate parallel batch with all system stakeholders. DBAs must know the batch schedule (no tablespace maintenance during batch). Operations must know the parallel job structure (restart procedures differ from serial). Schedulers must define correct dependencies (merge waits for ALL partitions).
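The 50%-ahead alert rule reduces to a ratio check over in-flight progress counts. A minimal sketch, with the 1.5 threshold taken from the rule above:

```python
def imbalance_check(progress_counts, threshold=1.5):
    """progress_counts: rows processed so far by each live partition.
    Flags when the fastest partition is more than 50% ahead of the
    slowest by record count."""
    fastest, slowest = max(progress_counts), min(progress_counts)
    ratio = fastest / slowest if slowest else float("inf")
    return ratio, ratio > threshold

imbalance_check([120_000, 118_500, 119_200, 70_000])   # alerts: ~1.71
```

Feeding this from the ROWS_PROCESSED column of the partition control table gives real-time monitoring with no extra instrumentation.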
When NOT to Parallelize
- Do not parallelize jobs under 100K records or 10 minutes elapsed. The operational complexity exceeds the time savings. Save parallelism for jobs that matter.
- Do not parallelize without a natural partition key. If records have cross-record dependencies that cannot be contained within partitions, parallelism produces incorrect results. Analyze dependencies before designing partitions.
- Do not parallelize beyond your system's capacity. More partitions than available CPs, I/O paths, or DB2 threads means partitions wait for resources — adding overhead without adding parallelism. Use the formula: `PARTITION_COUNT = MIN(CPs, IO_paths/2, DB2_threads/2, 8)`.
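The capacity formula translates directly to code; integer division is assumed for the halved shared resources:

```python
def partition_count(cps, io_paths, db2_threads, cap=8):
    """PARTITION_COUNT = MIN(CPs, IO_paths/2, DB2_threads/2, 8),
    halving the shared resources with integer division."""
    return min(cps, io_paths // 2, db2_threads // 2, cap)

partition_count(6, 16, 12)   # → 6: CPs are the binding constraint
partition_count(4, 4, 20)    # → 2: I/O paths are the constraint
```

Whichever term wins the `min` names the resource to upgrade if you need more parallelism.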
Formulas and Thresholds
| Metric | Target | Alert |
|---|---|---|
| Partition imbalance ratio | < 1.2:1 | > 1.5:1 |
| DB2 parallelism achieved (QXDEGRD/QXDEGAT) | > 0.75 | < 0.5 |
| Checkpoint interval | 500-2,000 rows | N/A |
| Commit frequency | Every 500-2,000 updates | Lock escalation |
| Deadlock retry limit | 3 retries max | Any sustained deadlocks |
| Reconciliation delta | 0 | Any non-zero |
| Partition boundary recomputation | Monthly | Imbalance > 1.3:1 |