Chapter 25 Exercises: Parallel Batch Processing
Exercise 25.1 — Partition Boundary Calculation (Apply)
You have a VSAM KSDS with 12 million records. The key is a 10-digit account number. The account distribution is:
| Key Range | Record Count |
|---|---|
| 0000000001–1000000000 | 800,000 |
| 1000000001–2000000000 | 2,100,000 |
| 2000000001–3000000000 | 3,400,000 |
| 3000000001–4000000000 | 1,900,000 |
| 4000000001–5000000000 | 1,200,000 |
| 5000000001–6000000000 | 900,000 |
| 6000000001–7000000000 | 700,000 |
| 7000000001–8000000000 | 500,000 |
| 8000000001–9000000000 | 300,000 |
| 9000000001–9999999999 | 200,000 |
Part A: Calculate the three boundary keys that divide the file into four partitions with approximately equal record counts (about 3 million each).
Part B: Calculate the five boundary keys that divide the file into six partitions with approximately equal record counts (about 2 million each).
Part C: For each partitioning scheme, calculate the partition imbalance ratio (largest partition / smallest partition). Which scheme has better balance?
Part D: The 2000000001–3000000000 range has high density. If you split within this range, what additional information would you need to determine the optimal split point?
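The split-key interpolation that Parts A and B call for can be sketched in a few lines (Python here purely for the arithmetic): accumulate the density table, then interpolate inside whichever range the cumulative target falls in, assuming records are uniformly distributed within each range — the very assumption Part D asks you to question.

```python
# Density table from the exercise: (low key, high key, record count).
ranges = [
    (1,             1_000_000_000,   800_000),
    (1_000_000_001, 2_000_000_000, 2_100_000),
    (2_000_000_001, 3_000_000_000, 3_400_000),
    (3_000_000_001, 4_000_000_000, 1_900_000),
    (4_000_000_001, 5_000_000_000, 1_200_000),
    (5_000_000_001, 6_000_000_000,   900_000),
    (6_000_000_001, 7_000_000_000,   700_000),
    (7_000_000_001, 8_000_000_000,   500_000),
    (8_000_000_001, 9_000_000_000,   300_000),
    (9_000_000_001, 9_999_999_999,   200_000),
]

def split_keys(n_parts):
    """Interpolated boundary keys, assuming uniform density within a range."""
    total = sum(c for _, _, c in ranges)
    target = total / n_parts          # records per partition
    keys, cum, goal = [], 0, total / n_parts
    for lo, hi, cnt in ranges:
        # A goal may fall inside this range more than once (dense ranges).
        while goal <= cum + cnt and len(keys) < n_parts - 1:
            frac = (goal - cum) / cnt          # fraction of this range needed
            keys.append(int(lo + frac * (hi - lo)))
            goal += target
        cum += cnt
    return keys
```

For n_parts=4 two of the three keys land inside the dense 2000000001–3000000000 range, which is exactly the situation Part D probes.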
Exercise 25.2 — Partition Control Table Design (Apply)
Design a partition control table for a parallel batch system that processes three different jobs nightly: ACCTPOST (account posting), INTCALC (interest calculation), and STMTGEN (statement generation).
Part A: Write the CREATE TABLE DDL. Include columns for: job identification, run date, partition number, key boundaries, expected and actual record counts, status tracking, timing, checkpoint information, and error details.
Part B: Write the INSERT statements to populate the table for a four-partition ACCTPOST run on 2026-03-16. Use the partition boundaries from Exercise 25.1 Part A.
Part C: Write a SQL query that a monitoring script would use to check the status of all partitions for the current night's batch run, showing elapsed time for completed partitions and running time for active partitions.
Part D: Write a SQL query that identifies which partitions need to be restarted (status = 'F' for failed) and shows their last checkpoint key for restart positioning.
Exercise 25.3 — Partition-Safe COBOL Skeleton (Apply)
Write the WORKING-STORAGE SECTION and the main control flow (PROCEDURE DIVISION through paragraph structure) for a partition-safe COBOL program that:
- Reads its partition number from the JCL PARM
- Queries the partition control table for its key boundaries
- Updates its status to 'R' (running)
- Opens partition-specific input and output files
- Processes records within its key range
- Checkpoints every 1,000 records
- Handles deadlocks with retry (max 3 retries)
- Updates its status to 'C' (completed) or 'F' (failed) on exit
- Reports its record count and elapsed time
Do not write the detailed processing logic — focus on the partition framework: parameter handling, status management, checkpoint logic, and error handling structure.
Exercise 25.4 — Deadlock Analysis (Analyze)
Four partitions are processing account updates concurrently. Each partition updates rows in the ACCOUNT_MASTER table (LOCKSIZE ROW) and the TRANSACTION_HISTORY table (LOCKSIZE PAGE).
Part A: Partition 1 processes account 1000005 and inserts a row into TRANSACTION_HISTORY page 4472. Partition 2 processes account 1000006, which also maps to TRANSACTION_HISTORY page 4472. Explain why a deadlock can occur even though the partitions process different accounts.
Part B: How does LOCKSIZE PAGE on TRANSACTION_HISTORY create this problem? Would LOCKSIZE ROW eliminate it? What is the trade-off?
Part C: Partition 3 experiences lock escalation on ACCOUNT_MASTER because it accumulated 1,500 row locks (NUMLKTS = 1000). The escalation replaces the 1,500 row locks with a single gross tablespace lock. What happens to partitions 1, 2, and 4?
Part D: Propose three specific changes to eliminate or minimize deadlocks in this scenario. For each change, identify the trade-off.
Exercise 25.5 — DB2 Parallelism Configuration (Apply)
You have a DB2 subsystem with:
- 4 CPs (general purpose processors)
- 2 zIIPs
- ACCOUNT_MASTER table: 12 partitions, each on a separate volume
- Buffer pool BP2: 50,000 pages allocated for ACCOUNT_MASTER
- Current BIND: DEGREE(1) (parallelism disabled)
- Current PARAMDEG ZPARM: 4
Part A: You change the BIND to DEGREE(ANY). For a full tablespace scan of ACCOUNT_MASTER, what degree of I/O parallelism will DB2 attempt? What will limit the actual degree?
Part B: With CP parallelism enabled (DEGREE(ANY) and sufficient resources), how many parallel tasks might DB2 create for a scan of all 12 partitions? What limits the actual CP parallelism?
Part C: Your DB2 accounting trace shows QXDEGAT = 8 and QXDEGRD = 3. What does this mean? List three possible causes for the reduction.
Part D: You increase BP2 to 150,000 pages and change PARAMDEG to 8. Predict the effect on the same tablespace scan. What would you monitor to verify improvement?
Exercise 25.6 — Pipeline Dependency Diagram (Apply)
Design a complete batch pipeline for a bank's end-of-day processing with these jobs:
| Job | Input | Output | Serial Time | Parallelizable? |
|---|---|---|---|---|
| EXTRACT | Online DB | Transaction file | 20 min | No (single source) |
| VALIDATE | Transaction file | Valid/reject files | 45 min | Yes, by account range |
| POSTING | Valid file + Account DB | Updated account DB | 90 min | Yes, by account range |
| INTEREST | Account DB | Interest entries | 40 min | Yes, by account range |
| FEES | Account DB | Fee entries | 25 min | Yes, by account range |
| STATEMENTS | Account DB + Interest + Fees | Statement file | 60 min | Yes, by account hash |
| REPORTS | All outputs | Report file | 15 min | No (aggregation) |
Part A: Draw the dependency diagram showing which jobs can run in parallel. Indicate fan-out and fan-in points.
Part B: Assuming VALIDATE uses 4 partitions, POSTING uses 4, INTEREST and FEES each use 3, and STATEMENTS uses 6: calculate the critical path elapsed time.
Part C: The total serial time is 295 minutes (nearly 5 hours). What is the parallel elapsed time from Part B? What is the speedup ratio?
Part D: INTEREST and FEES both read from Account DB but do not modify it (they write to separate output tables). Can they run concurrently with each other? What about concurrently with the tail end of POSTING?
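One way to sanity-check the Part B and C arithmetic is a longest-path walk over the dependency graph. The sketch below assumes perfect speedup within each partitioned job (serial time divided by partition count) and no split/merge overhead, so treat its result as an optimistic lower bound.

```python
from functools import lru_cache

# Elapsed minutes per job at the stated partition counts, assuming
# perfect speedup and zero split/merge cost.
times = {
    "EXTRACT": 20, "VALIDATE": 45 / 4, "POSTING": 90 / 4,
    "INTEREST": 40 / 3, "FEES": 25 / 3, "STATEMENTS": 60 / 6,
    "REPORTS": 15,
}
# Predecessors of each job (the fan-in points); EXTRACT is the source.
deps = {
    "EXTRACT": [], "VALIDATE": ["EXTRACT"], "POSTING": ["VALIDATE"],
    "INTEREST": ["POSTING"], "FEES": ["POSTING"],
    "STATEMENTS": ["POSTING", "INTEREST", "FEES"],
    "REPORTS": ["STATEMENTS", "INTEREST", "FEES"],
}

@lru_cache(maxsize=None)
def finish(job):
    # Earliest finish = own duration after the slowest predecessor ends.
    return times[job] + max((finish(p) for p in deps[job]), default=0)

critical_path = finish("REPORTS")   # parallel elapsed estimate, minutes
speedup = 295 / critical_path       # vs. the 295-minute serial sum
```

The INTEREST leg dominates the FEES leg between POSTING and STATEMENTS, which is the concurrency question Part D raises.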
Exercise 25.7 — DFSORT Partition Split (Apply)
You have a flat file with 10 million records, record length 200, sorted by account number in positions 1-10.
Part A: Write the DFSORT OUTFIL control statements to split this file into four partitions by account number range. Assume boundaries at 2500000, 5000000, and 7500000.
Part B: The approach in Part A reads the input file four times (once per OUTFIL). Write an alternative using a single-pass DFSORT OUTFIL with multiple output files and INCLUDE conditions.
Part C: Write the DFSORT MERGE control statements to combine the four sorted partition outputs back into a single sorted file.
Part D: Write ICETOOL control statements that split the file into four partitions AND produce a record count for each partition, in a single ICETOOL invocation.
Exercise 25.8 — Partition-Level Restart Scenario (Analyze)
A four-partition account posting job runs nightly. Last night's run:
| Partition | Records | Checkpoint Interval | Last Checkpoint | Status |
|---|---|---|---|---|
| P1 | 2,500,000 | Every 1,000 | Record 2,500,000 | Complete |
| P2 | 2,500,000 | Every 1,000 | Record 2,500,000 | Complete |
| P3 | 2,500,000 | Every 1,000 | Record 1,847,000 | Failed |
| P4 | 2,500,000 | Every 1,000 | Record 2,500,000 | Complete |
P3 failed with SQLCODE -904 (resource unavailable) at record 1,847,523.
Part A: How many records in P3 were successfully committed to DB2? How many records were processed but not committed (in the current unit of work)?
Part B: On restart, P3 should reposition to what record? Why not record 1,847,523?
Part C: Records 1,847,001 through 1,847,523 were partially processed. Their UPDATE statements were rolled back when the unit of work abended, but any records already written to the partition's output file were not. Describe two strategies for handling these records on restart.
Part D: The merge job ran before P3 was restarted (operational error). It concatenated the outputs of P1, P2, and P4 — but P3's output is incomplete. What reconciliation check would have caught this? Write the SQL.
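The checkpoint arithmetic behind Parts A and B is generic: commits occur only at checkpoint boundaries, so everything after the last checkpoint is rolled back and must be reprocessed. A minimal sketch:

```python
def restart_position(fail_record, interval):
    """Return (last committed record, records rolled back at the abend)."""
    last_checkpoint = (fail_record // interval) * interval
    uncommitted = fail_record - last_checkpoint
    return last_checkpoint, uncommitted
```

Plugging in the failure point and the 1,000-record interval gives the repositioning point Part B asks about, and explains why the failing record itself is not the restart position.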
Exercise 25.9 — Hiperspace and Sort Optimization (Apply)
You sort a 5 GB dataset nightly. Current configuration:
- 3 SORTWK datasets, all on the same DASD volume
- MAINSIZE=4M
- No hiperspace (HIPRMAX=0)
- Elapsed time: 22 minutes
Part A: Identify three configuration problems and their likely impact on sort elapsed time.
Part B: Propose an optimized configuration. Specify SORTWK allocation (how many, which volumes), MAINSIZE, and HIPRMAX settings.
Part C: The dataset is 5 GB but available hiperspace is only 2 GB. Can DFSORT still use hiperspace? How does it handle the overflow?
Part D: Estimate the sort elapsed time with your optimized configuration. State your assumptions.
Exercise 25.10 — Monitoring Dashboard Design (Apply)
Design a monitoring solution for parallel batch processing that provides real-time visibility into partition execution.
Part A: Write a REXX exec (pseudocode is acceptable) that queries the partition control table every 60 seconds and displays a formatted status board showing each partition's status, elapsed time, record count, and percentage complete.
Part B: Define alert conditions and thresholds for: (1) partition running too long, (2) partition failed, (3) partition imbalance, (4) DB2 lock contention.
Part C: Write the SQL query that detects partition imbalance during execution — specifically, when the fastest-running partition is more than 50% ahead of the slowest-running partition (based on record counts).
Part D: After the parallel batch run completes, you need a post-mortem report. Write the SQL query that produces a summary showing each job's total elapsed time, partition count, fastest/slowest partition times, imbalance ratio, total records processed, and any partitions that required restart.
Exercise 25.11 — Cross-Partition Dependency Analysis (Analyze)
A batch program calculates running account balances. Each account's new balance depends on its previous balance and today's transactions. The formula is:
NEW-BALANCE = PREVIOUS-BALANCE + SUM(CREDITS) - SUM(DEBITS)
Part A: Can this program be partitioned by account number range? Explain why or why not, considering inter-record dependencies.
Part B: Now consider a transfer transaction: account A sends $500 to account B. If accounts A and B are in different partitions, what problem arises? How do you handle it?
Part C: Propose a two-pass design where pass 1 is fully parallelizable and pass 2 handles cross-partition transfers. Describe each pass.
Part D: What percentage of transactions are typically cross-partition transfers? (Hint: if partition boundaries split the account space roughly evenly, and transfers are random, what fraction cross a boundary?) How does this percentage affect the viability of the two-pass approach?
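The Part D hint can be checked both ways: in closed form, a transfer whose source and destination accounts are independent and uniform over n equal partitions stays inside one partition with probability 1/n, so the cross-partition fraction is 1 − 1/n. A quick Monte Carlo sketch confirms it:

```python
import random

def cross_fraction(n_parts, trials=100_000, seed=42):
    # Fraction of random transfers whose source and destination land in
    # different partitions; should approach 1 - 1/n_parts.
    rng = random.Random(seed)
    cross = sum(rng.randrange(n_parts) != rng.randrange(n_parts)
                for _ in range(trials))
    return cross / trials
```

With four partitions roughly three transfers in four cross a boundary, which is the viability question Part D asks you to weigh.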
Exercise 25.12 — Sysplex Parallelism Design (Analyze)
SecureFirst Insurance runs a 4-member DB2 data sharing group: DB2A (6 CPs), DB2B (6 CPs), DB2C (4 CPs), DB2D (4 CPs).
Part A: A Sysplex-parallel query scans all 20 partitions of the CLAIMS table. If DB2 distributes partitions proportionally to member capacity, how many partitions does each member process?
Part B: The coupling facility has a 4 GB group buffer pool for the CLAIMS tablespace. Each partition scan generates approximately 200 MB of GBP traffic. Is the GBP adequately sized for the parallel query? What happens if it is not?
Part C: DB2D is also handling online transaction processing during the batch window. Sysplex parallelism assigns it 4 partitions to scan. How does this affect online response time? What controls can limit the impact?
Part D: Yuki wants to run both application-level partitioning (4 batch partitions) AND Sysplex parallelism (within each batch partition). Each batch partition submits queries with DEGREE(ANY) against a 20-partition tablespace. Calculate the maximum number of concurrent DB2 tasks this creates. Is this practical?
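For Parts A and D, assuming DB2 spreads partitions strictly in proportion to CP capacity (real workload distribution also weighs current member utilization, which Part C probes), a largest-remainder apportionment plus the Part D multiplication looks like this:

```python
capacities = {"DB2A": 6, "DB2B": 6, "DB2C": 4, "DB2D": 4}  # CPs per member

def apportion(n_partitions, caps):
    """Distribute partitions proportionally to capacity (largest remainder)."""
    total = sum(caps.values())
    exact = {m: n_partitions * c / total for m, c in caps.items()}
    counts = {m: int(v) for m, v in exact.items()}
    # Hand leftover partitions to the members with the biggest remainders
    # so the counts still sum to n_partitions.
    leftover = n_partitions - sum(counts.values())
    for m in sorted(exact, key=lambda m: exact[m] - counts[m],
                    reverse=True)[:leftover]:
        counts[m] += 1
    return counts

# Part D upper bound: 4 batch partitions x up to one task per tablespace
# partition (20) under DEGREE(ANY).
max_tasks = 4 * 20
```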
Exercise 25.13 — Full Pipeline JCL (Apply)
Write complete JCL for a three-job parallel pipeline:
Job 1: SPLIT — Splits the input file into 4 partitions using DFSORT. Runs first, serially.
Jobs 2A–2D: PROCESS-P01 through PROCESS-P04 — Four parallel jobs, each running program ACCTPROC with its partition number in the PARM. Each job reads its partition input file, accesses DB2, and writes a partition output file.
Job 3: MERGE — Merges the four partition output files using DFSORT MERGE. Runs after all four PROCESS jobs complete.
Include:
- Complete DD statements for all files
- PARM specifications for ACCTPROC
- DFSORT control statements for SPLIT and MERGE
- DB2 attachment (DSNRLI) in the PROCESS steps
- Appropriate COND parameters
- Comments explaining the parallel flow
Exercise 25.14 — Capacity Planning for Parallelism (Analyze)
Your system has:
- 8 general CPs, typically 70% utilized during batch window
- 4 zIIPs, typically 40% utilized
- 16 DASD volumes for batch work files, 4 channels
- DB2 MAX_BATCH_CONNECTED = 50, current batch threads = 25
- JES2 initiators for batch = 20, current use = 12
You want to implement parallel batch with 6 partitions for the main posting job, 4 partitions for interest calculation, and 3 partitions for statement generation. Maximum concurrent partitions = 6 (posting and interest do not overlap).
Part A: Calculate CP utilization during peak parallelism (6 partitions). Assume each partition consumes 5% CP on average. Is the system capacity sufficient?
Part B: Calculate DB2 thread consumption during peak parallelism. Each partition requires 1 DB2 thread. Is the thread capacity sufficient?
Part C: Calculate DASD volume requirements. Each partition needs 2 work files (SORTWK + output) on separate volumes. During 6-partition processing, how many volumes are needed? Can the existing 16 volumes accommodate this?
Part D: You propose increasing to 8 partitions for the posting job. Recalculate all resource requirements. Identify the first bottleneck.
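The resource arithmetic for Parts A, B, and D reduces to three sums under the stated assumptions (5% CP per partition, 1 DB2 thread per partition, 2 dedicated work-file volumes per partition). Percentages are kept as whole numbers so the comparison is exact:

```python
def capacity_check(n_parts, base_cp=70, cp_per_part=5,
                   base_threads=25, max_threads=50,
                   volumes=16, vols_per_part=2):
    cp = base_cp + n_parts * cp_per_part      # total CP utilization, %
    threads = base_threads + n_parts          # DB2 threads in use
    vols = n_parts * vols_per_part            # work-file volumes needed
    return {"cp_util": cp, "cp_ok": cp < 100,
            "threads": threads, "threads_ok": threads <= max_threads,
            "volumes": vols, "volumes_ok": vols <= volumes}
```

At 6 partitions the CPs already sit at 70% + 6 × 5% = 100%; rerun with n_parts=8 to see which resource gives way first for Part D.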
Exercise 25.15 — Production Incident Response (Analyze)
It is 2:30 AM. The batch controller calls you. The end-of-day posting job (4 partitions) has been running since midnight. Partitions 1, 2, and 4 completed at 1:15 AM, but Partition 3 is still running after 2.5 hours (normal is 45 minutes).
Part A: List your first five diagnostic steps, in order.
Part B: You discover that Partition 3 is consuming almost no CPU but is in a "lock wait" state on DB2. The DISPLAY THREAD output shows it is waiting for a lock held by an online CICS transaction that is in "indoubt" status (CICS region crashed at 1:45 AM and has not been restarted). What is the root cause? What is the solution?
Part C: After resolving the lock, Partition 3 resumes but you realize it will not complete until 3:45 AM. The merge job, interest calculation, and statement generation are all waiting. The batch window closes at 4:30 AM. Can the remaining pipeline complete in 45 minutes? What can you do to compress the remaining schedule?
Part D: Write the operations procedure for this scenario: the steps to resolve the issue, restart processing, and verify completion. Include commands and SQL queries.
Exercise 25.16 — Partition Testing Strategy (Apply)
You have developed a partition-safe COBOL program and need to test it before production deployment.
Part A: Describe a unit test for partition boundary handling. What edge cases must you test? (Hint: what happens at the exact boundary value?)
Part B: Describe an integration test for two partitions running concurrently against the same DB2 table. What are you specifically testing?
Part C: Describe a stress test for deadlock handling. How do you deliberately provoke deadlocks to verify retry logic?
Part D: Describe a reconciliation test. How do you verify that the parallel run produces exactly the same results as a serial run?
Exercise 25.17 — Hash Partitioning Implementation (Apply)
Implement hash-based partitioning in COBOL for a 10-digit numeric account number and 4 partitions.
Part A: Write the COBOL code for a hash function that distributes account numbers evenly across 4 partitions. Use modulo arithmetic on the account number.
Part B: Account numbers are not uniformly distributed — numbers starting with 1 and 2 are heavily overrepresented. Does your hash function from Part A still distribute evenly? Why or why not?
Part C: Implement a better hash function that applies a mixing step before the modulo. Write the COBOL code.
Part D: Hash partitioning destroys key-order locality. If the downstream merge requires sorted output, what is the additional cost compared to key-range partitioning? Quantify in terms of sort operations.
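Part B's failure mode is easy to demonstrate outside COBOL. This Python sketch (the exercise wants COBOL, but the arithmetic carries over) compares a plain modulo with a multiplicative mix on a deliberately skewed key population:

```python
from collections import Counter

def plain_hash(acct, parts=4):
    # Part A style: straight modulo on the account number.
    return acct % parts

def mixed_hash(acct, parts=4):
    # Part C style: multiplicative mix first. Taking the HIGH bits of the
    # product matters -- a plain modulo after the multiply would preserve
    # the input's low-bit pattern. 2654435761 is the common Knuth
    # multiplier (2^32 divided by the golden ratio).
    h = (acct * 2654435761) % 2**32
    return h * parts // 2**32

# Skewed population: every account is a multiple of 4 inside the
# overrepresented 1xxxxxxxxx block.
skewed = [1_000_000_000 + 4 * i for i in range(1000)]
plain = Counter(plain_hash(a) for a in skewed)   # collapses to one partition
mixed = Counter(mixed_hash(a) for a in skewed)   # spreads across all four
```

The plain modulo sends every one of these keys to partition 0; the mixed version lands a few hundred in each partition.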
Exercise 25.18 — GBP Sizing for Sysplex Parallelism (Apply)
A DB2 data sharing group has 3 members. The ACCOUNT_MASTER tablespace has 8 partitions. Buffer pool BP2 on each member has 30,000 pages (4K pages).
Part A: During serial batch processing (1 member), GBP write activity is 500 pages/second. During 3-member Sysplex parallel processing, estimate the GBP write activity. Explain your reasoning.
Part B: The GBP is 2 GB. Each page entry in the GBP consumes approximately 4.1 KB (4K data + overhead). How many pages can the GBP hold? At the write rate from Part A, how quickly does the GBP fill?
Part C: When the GBP fills, DB2 must castout (write back to DASD) pages to make room. This introduces latency. What is the impact on parallel batch performance?
Part D: Calculate the minimum GBP size needed to avoid castout during a 15-minute parallel batch step. Assume pages are invalidated (and can be reused) after 10 seconds on average.
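The arithmetic for Parts A, B, and D fits in a few lines. Two loud assumptions: GBP writes scale linearly with active members (3 × 500 pages/s — exactly the estimate Part A asks you to justify or challenge), and in steady state the GBP holds roughly write rate × average page residency entries.

```python
write_rate = 3 * 500                            # pages/s, 3 members active
entry_kb = 4.1                                  # 4K page + directory overhead
gbp_entries = int(2 * 1024 * 1024 / entry_kb)   # entries a 2 GB GBP can hold
fill_seconds = gbp_entries / write_rate         # time to fill with no reuse
resident = write_rate * 10                      # Part D: 10 s avg residency
min_gbp_mb = resident * entry_kb / 1024         # steady-state footprint, MB
```

Note the contrast between Part B (a 2 GB GBP fills in under six minutes with no reuse) and Part D (with 10-second reuse, the steady-state footprint is tiny) — that gap is what the castout discussion in Part C turns on.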
Exercise 25.19 — Partition Imbalance Recovery (Analyze)
Your 6-partition interest calculation job shows this execution profile:
| Partition | Records | Elapsed Time | CPU Time |
|---|---|---|---|
| P1 | 1,900,000 | 11 min | 8 min |
| P2 | 2,100,000 | 12 min | 9 min |
| P3 | 2,000,000 | 38 min | 9 min |
| P4 | 1,800,000 | 10 min | 7 min |
| P5 | 2,200,000 | 13 min | 10 min |
| P6 | 2,000,000 | 11 min | 8 min |
Part A: Partition 3 has approximately the same record count and CPU time as other partitions, but 3x the elapsed time. What is the most likely cause? (Hint: elapsed time vs CPU time discrepancy.)
Part B: You investigate and find that P3's VSAM input file is on a volume with heavy concurrent I/O from an unrelated job. Propose an immediate fix and a long-term fix.
Part C: Calculate the overall pipeline delay caused by P3. If the merge job takes 5 minutes and downstream jobs take 30 minutes, how much total batch window time was wasted?
Part D: The imbalance ratio is 38/10 = 3.8:1. Your alert threshold is 1.5:1. Design an automated response that triggers when imbalance exceeds 1.5:1 during execution. What can automation do, and what requires human intervention?
Exercise 25.20 — Parallel Batch Design Review (Analyze)
A junior developer proposes this parallel batch design for CNB's end-of-day processing:
- Run 16 partitions for account posting (to maximize parallelism)
- All partitions write to the same output file using DISP=MOD
- No partition control table — partition boundaries are hardcoded in JCL PARMs
- Checkpoint every 100,000 records (to minimize checkpoint overhead)
- No deadlock handling (assumes ROW locking prevents all deadlocks)
- DFSORT merge of all 16 partition outputs using a single SORTWK dataset
- No reconciliation step — trust the return codes
Part A: Identify all problems with this design. For each problem, explain the specific failure mode it will cause in production.
Part B: Rank the problems from most critical to least critical. Justify your ranking.
Part C: Rewrite the design with corrections. Specify the exact partition count, file handling, checkpoint strategy, deadlock handling, sort configuration, and reconciliation approach you would use.
Part D: How would you communicate these issues to the junior developer constructively? Write a brief code review comment (3-4 sentences) for each of the top three issues.
Exercise 25.21 — IDCAMS REPRO Partitioning (Apply)
You have a VSAM KSDS with a 10-byte alphanumeric key (customer ID format: AAA9999999, where AAA is a 3-letter branch code and 9999999 is a sequence number).
Part A: Write IDCAMS REPRO statements to create 4 partition files, split by the first letter of the branch code: A–F, G–L, M–R, S–Z.
Part B: Branch codes are not uniformly distributed. 40% of records have branch codes starting with A–F (old East Coast branches). Propose a better split.
Part C: REPRO with FROMKEY positions directly through the index, so a keyed split avoids scanning. But suppose the split were done by record position using SKIP and COUNT instead: each REPRO would then read sequentially from the beginning of the KSDS, and the fourth REPRO would pass over 37.5 million records before reaching its start point. Calculate the total I/O overhead of four such sequential REPROs.
Part D: Propose a more I/O-efficient alternative to IDCAMS REPRO for splitting this KSDS. Compare the I/O cost.
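Part C's arithmetic is a triangular sum: if each of the four passes reads sequentially from record 1 to the end of its own quarter, the total reads are:

```python
total = 50_000_000
quarter = total // 4
# Pass i ends at record quarter * i, having also read everything before it.
records_read = sum(quarter * i for i in range(1, 5))
overhead = records_read - total      # reads beyond a single full pass
ratio = records_read / total         # total I/O relative to one pass
```

That 2.5x read multiple is the baseline a single-pass alternative in Part D has to beat.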
Exercise 25.22 — Temporal Partitioning Design (Analyze)
Federal Benefits must reprocess six months of benefit payments due to a legislative change. The volume is 180 million records (30 million per month).
Part A: Design a temporal partitioning scheme where each month is a partition. How many parallel partitions can you run? What limits you?
Part B: Month 3 has a legislative change that alters the calculation formula used by months 4, 5, and 6. This creates a dependency: month 3 must complete before months 4-6 can start. Redesign the pipeline to accommodate this dependency while maximizing parallelism.
Part C: Some beneficiaries have records in multiple months (ongoing payments). If month 3's recalculation changes a beneficiary's status, months 4-6 must use the updated status. How does this cross-partition dependency affect your design?
Part D: Sandra proposes running this reprocessing during the regular batch window alongside normal nightly processing. What resource conflicts do you anticipate? Propose a scheduling strategy.
Exercise 25.23 — Performance Modeling (Apply)
Model the parallel batch performance for a job with these serial characteristics:
- 10 million records
- Serial elapsed time: 120 minutes
- CPU time: 40 minutes
- I/O time: 70 minutes
- DB2 lock wait time: 10 minutes
Part A: Using Amdahl's Law, calculate the theoretical speedup for 4 partitions assuming the CPU portion is perfectly parallelizable but I/O and lock wait have 20% serial overhead each.
Part B: In practice, parallelism introduces overhead: partition setup (2 minutes fixed), partition control table access (0.5 minutes per partition), and merge (5 minutes fixed). Adjust your calculation from Part A.
Part C: At what partition count does the overhead exceed the benefit? (The point where adding more partitions increases total elapsed time.)
Part D: Plot (or describe) the curve of elapsed time vs. partition count from 1 to 16 partitions. Where is the sweet spot?
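A sketch of the Parts B–D model, with the stated assumptions baked in: CPU scales perfectly, I/O and lock wait each retain a 20% serial residue, and overhead is 2 minutes setup plus 5 minutes merge plus 0.5 minutes of control-table access per partition. (At n = 1 the formula still charges the overhead terms, so it slightly exceeds the 120-minute serial baseline.)

```python
def elapsed(n):
    cpu = 40 / n                       # perfectly parallelizable
    io = 70 * 0.20 + 70 * 0.80 / n     # 20% serial residue
    lock = 10 * 0.20 + 10 * 0.80 / n   # 20% serial residue
    overhead = 2 + 5 + 0.5 * n         # setup + merge + control table
    return cpu + io + lock + overhead

curve = {n: round(elapsed(n), 1) for n in range(1, 17)}   # Part D data
best = min(range(1, 17), key=elapsed)  # partition count minimizing elapsed
```

The curve falls steeply at first, then flattens as the serial residue dominates and the per-partition overhead term (0.5n) starts climbing — the crossover is Part C's answer under these assumptions.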
Exercise 25.24 — Comprehensive Design Exercise (Apply/Analyze)
Pinnacle Health processes 5 million medical claims nightly. Current serial processing takes 4 hours 50 minutes. The batch window is 3 hours.
The pipeline consists of:
1. CLAIM-VALIDATE (validate claim data, 45 min serial)
2. ELIGIBILITY-CHECK (verify member eligibility, 60 min serial)
3. ADJUDICATE (apply benefit rules, calculate payment, 90 min serial)
4. PROVIDER-PAY (generate provider payments, 30 min serial)
5. MEMBER-EOB (generate explanation of benefits, 40 min serial)
6. GL-POSTING (post to general ledger, 25 min serial)
Dependencies: VALIDATE → ELIGIBILITY → ADJUDICATE → {PROVIDER-PAY, MEMBER-EOB (parallel)} → GL-POSTING
Part A: Design the parallel pipeline. Determine which jobs to parallelize and with how many partitions. Choose partition keys for each parallelized job.
Part B: Calculate the critical path elapsed time with your design. Does it fit in the 3-hour window?
Part C: Design the partition control table, checkpoint strategy, and reconciliation approach for the ADJUDICATE step (the most complex and longest-running job).
Part D: Ahmad discovers that ADJUDICATE has cross-claim dependencies: some claims reference prior claims for the same member (pre-authorization, referral chains). These cross-references may span partitions. Design a solution that handles cross-partition claim references without serializing the entire ADJUDICATE step.
Exercise 25.25 — Parallel Batch Runbook (Apply)
Write a production runbook (operations procedure document) for a four-partition parallel batch job. The runbook should cover:
Part A: Normal execution procedure — step-by-step instructions for submitting the pipeline, monitoring progress, and verifying successful completion.
Part B: Single partition failure procedure — steps to diagnose, restart the failed partition, and resume the pipeline.
Part C: Multiple partition failure procedure — when two or more partitions fail, steps to assess whether restart or full rerun is appropriate.
Part D: Escalation procedure — when to call the on-call DBA, when to call the application developer, when to call the batch window manager, and what information each person needs.
Include specific commands (MVS, DB2, scheduler) and SQL queries for each step.