Case Study 2: Production JCL Standards and Error Recovery
Background
Continental Savings Bank operates over 400 production batch jobs nightly. In 2024, the bank experienced a significant incident: a poorly coded JCL job overwrote a critical master file because a STEPLIB pointed to a test load library, causing a test version of a program to execute in production. The recovery took eight hours and resulted in delayed statement processing for 50,000 customers.
Following this incident, the bank's CTO commissioned a JCL standards review led by senior systems programmer Lorraine Fujimoto. Her team developed a comprehensive set of JCL coding standards, restart procedures, and operational guidelines. This case study examines the standards through a concrete example: the bank's daily account reconciliation job, which was refactored to comply with the new standards.
The Standards Document: Key Rules
Lorraine's team established the following rules for all production JCL:
- JOBLIB is mandatory for production jobs. STEPLIB is only permitted within cataloged procedures to override environment-specific libraries.
- Every job must include a restart step that can be used to resume processing after an abend.
- DD overrides in procedure calls must be documented with comments explaining why the override is necessary.
- Output class management must follow the bank's classification scheme: Class A for printed reports, Class H for held output, Class X for job logs, and Class Z for purge-after-review.
- Referback notation should be used for consistency when multiple steps reference the same dataset attributes.
- All dataset names must follow the naming convention: HLQ.ENV.APP.TYPE.QUALIFIER.
The Reconciliation Job: Before Refactoring
The original JCL for the daily reconciliation job had several problems that Lorraine's team identified:
//*================================================================*
//* ORIGINAL JCL - DO NOT USE IN PRODUCTION *
//* This JCL contains multiple violations of coding standards. *
//* Shown here for educational comparison with the refactored *
//* version below. *
//*================================================================*
//RECON JOB (ACCT),'RECONCILIATION',CLASS=A
//STEP1 EXEC PGM=RECONPGM
//STEPLIB DD DSN=CONT.LOADLIB,DISP=SHR
// DD DSN=CONT.TEST.LOADLIB,DISP=SHR
//INFILE DD DSN=CONT.DAILY.TRANS,DISP=SHR
//OUTFILE DD DSN=CONT.RECON.REPORT,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(TRK,(10,5)),
// UNIT=SYSDA
//SYSOUT DD SYSOUT=*
//STEP2 EXEC PGM=RECONRPT
//INFILE DD DSN=CONT.RECON.REPORT,DISP=SHR
//REPORT DD SYSOUT=*
Problems identified:
- Missing MSGCLASS, MSGLEVEL, NOTIFY on the JOB statement.
- STEPLIB includes a test library in the concatenation. This is what caused the 2024 incident.
- No JOBLIB -- STEP1 relies on its own STEPLIB and STEP2 codes no library at all, risking inconsistency.
- Generic step names (STEP1, STEP2) provide no operational context.
- No COND or IF/THEN/ELSE for conditional execution.
- No DCB parameters on output datasets.
- SYSOUT=* uses the default class rather than an explicit class.
- No restart capability documented or coded.
- Dataset names do not follow the naming convention.
The Reconciliation Job: After Refactoring
//RECONJOB JOB (ACCTG01),'DAILY RECONCILIATION',
// CLASS=A,
// MSGCLASS=X,
// MSGLEVEL=(1,1),
// NOTIFY=&SYSUID,
// REGION=0M,
// RESTART=RCEXTRACT
//*================================================================*
//* JOB: RECONJOB - Daily Account Reconciliation *
//* OWNER: Accounting Department (ACCTG01) *
//* SCHEDULE: Daily at 20:00 EST via CA-7 *
//* ONCALL: Operations Center x4500 *
//* *
//* RESTART INSTRUCTIONS: *
//* If RCEXTRACT abends: Restart from RCEXTRACT. *
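//* No cleanup needed (EXTRACT is DISP=(NEW,CATLG,DELETE), so *
//* an abend deletes the partial dataset). *
//* If RCMATCH abends: Restart from RCMATCH. *
//* Delete any cataloged MATCHED, UNMATCHED, and EXCEPTIONS *
//* output first. Keep RECON.EXTRACT; it is RCMATCH input. *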
//* If RCRPT abends: Restart from RCRPT. *
//* No cleanup needed (SYSOUT output only). *
//* If RCNOTIFY abends: Restart from RCNOTIFY. *
//* No cleanup needed. *
//*================================================================*
//*
//*---------- JOBLIB: Production load library only -----------------*
//* STANDARD: JOBLIB must reference only production libraries. *
//* NEVER include test or QA libraries in JOBLIB. *
//*------------------------------------------------------------------*
//JOBLIB DD DSN=CONT.PROD.ACCTG.LOADLIB,DISP=SHR
// DD DSN=CONT.PROD.COMMON.LOADLIB,DISP=SHR
//*
//*================================================================*
//* RCEXTRACT: Extract today's transactions and GL balances *
//* Reads the daily transaction file and the general ledger *
//* summary, producing a reconciliation extract file. *
//*================================================================*
//RCEXTRACT EXEC PGM=RECONEXT
//TRANSIN DD DSN=CONT.PROD.ACCTG.DAILY.TRANS,DISP=SHR
//GLSUMM DD DSN=CONT.PROD.ACCTG.GL.SUMMARY,DISP=SHR
//EXTRACT DD DSN=CONT.PROD.ACCTG.RECON.EXTRACT,
// DISP=(NEW,CATLG,DELETE),
// DCB=(RECFM=FB,LRECL=300,BLKSIZE=27000),
// SPACE=(CYL,(10,5),RLSE),
// UNIT=SYSDA
//CTLCARD DD DSN=CONT.PROD.ACCTG.RECON.PARMS,DISP=SHR
//SYSOUT DD SYSOUT=X
//SYSUDUMP DD SYSOUT=X
//*
//*================================================================*
//* RCMATCH: Match extracted transactions against GL entries *
//* Produces matched, unmatched, and exception files. *
//*================================================================*
//RCMATCH EXEC PGM=RECONMCH,
// COND=(4,LT,RCEXTRACT)
//EXTRACT DD DSN=CONT.PROD.ACCTG.RECON.EXTRACT,DISP=SHR
//MATCHED DD DSN=CONT.PROD.ACCTG.RECON.MATCHED,
// DISP=(NEW,CATLG,DELETE),
// DCB=(RECFM=FB,LRECL=400,BLKSIZE=27200),
// SPACE=(CYL,(10,5),RLSE),
// UNIT=SYSDA
//UNMATCH DD DSN=CONT.PROD.ACCTG.RECON.UNMATCHED,
// DISP=(NEW,CATLG,DELETE),
// DCB=(RECFM=FB,LRECL=400,BLKSIZE=27200),
// SPACE=(CYL,(2,1),RLSE),
// UNIT=SYSDA
//EXCEPTN DD DSN=CONT.PROD.ACCTG.RECON.EXCEPTIONS,
// DISP=(NEW,CATLG,DELETE),
// DCB=(RECFM=FB,LRECL=400,BLKSIZE=27200),
// SPACE=(CYL,(1,1),RLSE),
// UNIT=SYSDA
//SYSOUT DD SYSOUT=X
//SYSUDUMP DD SYSOUT=X
//*
//*================================================================*
//* RCRPT: Generate reconciliation reports *
//* Uses cataloged procedure RECONRPT with DD overrides. *
//*================================================================*
//RCRPT EXEC RECONRPT,
// COND=(4,LT,RCMATCH)
//*---------- DD Override: RPTSTEP1.MATCHIN -------------------------*
//* Override the default matched file name in the procedure to *
//* use the output from RCMATCH step. This override is required *
//* because the procedure default references a different dataset. *
//*------------------------------------------------------------------*
//RPTSTEP1.MATCHIN DD DSN=CONT.PROD.ACCTG.RECON.MATCHED,
// DISP=SHR
//*---------- DD Override: RPTSTEP1.RPTOUT --------------------------*
//* Override output class from procedure default (A) to class H *
//* for held output, per production standards. *
//*------------------------------------------------------------------*
//RPTSTEP1.RPTOUT DD SYSOUT=H,
// DCB=(RECFM=FBA,LRECL=133,BLKSIZE=0)
//*---------- DD Override: RPTSTEP2.EXCPIN --------------------------*
//* Override the exception input file name. *
//*------------------------------------------------------------------*
//RPTSTEP2.EXCPIN DD DSN=CONT.PROD.ACCTG.RECON.EXCEPTIONS,
// DISP=SHR
//RPTSTEP2.RPTOUT DD SYSOUT=H,
// DCB=(RECFM=FBA,LRECL=133,BLKSIZE=0)
//*
//*================================================================*
//* RCNOTIFY: Send completion notification *
//* Runs even if previous steps had warnings (RC <= 4). *
//* Uses IF/THEN/ELSE for different notification paths. *
//*================================================================*
// IF (RCMATCH.RC <= 4) THEN
//RCNOTIFY EXEC PGM=IKJEFT01
//SYSTSPRT DD SYSOUT=X
//SYSTSIN DD *
SEND 'RECONJOB: Daily reconciliation completed successfully.' +
USER(ACCTOPS)
SEND 'RECONJOB: Reports available in HELD class H.' +
USER(ACCTOPS)
/*
// ELSE
//RCFAIL EXEC PGM=IKJEFT01
//SYSTSPRT DD SYSOUT=X
//SYSTSIN DD *
SEND 'RECONJOB: RECONCILIATION FAILED. Review job log immediately.' +
USER(ACCTOPS)
SEND 'RECONJOB: RECONCILIATION FAILED. Escalate to on-call x4500.' +
USER(OPER01)
/*
// ENDIF
//*
//*================================================================*
//* RCCLEAN: Cleanup intermediate files *
//* Runs only after successful completion of all processing steps. *
//*================================================================*
// IF (RCMATCH.RC <= 4 AND RCRPT.RC <= 4) THEN
//RCCLEAN EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=X
//SYSIN DD *
DELETE CONT.PROD.ACCTG.RECON.EXTRACT PURGE
DELETE CONT.PROD.ACCTG.RECON.MATCHED PURGE
SET MAXCC = 0
/*
// ENDIF
The Cataloged Procedure for Reports
//*================================================================*
//* RECONRPT - Cataloged procedure for reconciliation reports *
//* Stored in: CONT.PROD.PROCLIB(RECONRPT) *
//*================================================================*
//RECONRPT PROC
//*
//RPTSTEP1 EXEC PGM=RECRPT01
//STEPLIB DD DSN=CONT.PROD.ACCTG.LOADLIB,DISP=SHR
//MATCHIN DD DSN=CONT.PROD.ACCTG.RECON.MATCHED,DISP=SHR
//RPTOUT DD SYSOUT=A,
// DCB=(RECFM=FBA,LRECL=133,BLKSIZE=0)
//SYSOUT DD SYSOUT=X
//*
//RPTSTEP2 EXEC PGM=RECRPT02,
// COND=(4,LT,RPTSTEP1)
//STEPLIB DD DSN=CONT.PROD.ACCTG.LOADLIB,DISP=SHR
//EXCPIN DD DSN=CONT.PROD.ACCTG.RECON.EXCEPTIONS,DISP=SHR
//UNMTCHIN DD DSN=CONT.PROD.ACCTG.RECON.UNMATCHED,DISP=SHR
//RPTOUT DD SYSOUT=A,
// DCB=(RECFM=FBA,LRECL=133,BLKSIZE=0)
//SYSOUT DD SYSOUT=X
//*
// PEND
Solution Walkthrough
JOBLIB Concatenation Standards
The refactored JCL uses a JOBLIB with two libraries:
//JOBLIB DD DSN=CONT.PROD.ACCTG.LOADLIB,DISP=SHR
// DD DSN=CONT.PROD.COMMON.LOADLIB,DISP=SHR
The system searches these in order. Application-specific programs (RECONEXT, RECONMCH) are found in the first library. Common utility programs are found in the second. The critical rule: no test or QA libraries are ever included in production JOBLIB.
When a procedure step specifies its own STEPLIB, the JOBLIB is completely ignored for that step. This is why the procedure includes STEPLIB -- it needs the same libraries, but if the procedure were reused in a test job, the test job's JOBLIB would not contaminate the procedure's library search.
The distinction between JOBLIB and STEPLIB:
| Aspect | JOBLIB | STEPLIB |
|---|---|---|
| Scope | Entire job | Single step |
| When overridden | When STEPLIB is coded for a step | Never (it is the override) |
| Maximum concatenation | 16 libraries | 16 libraries |
| Placement in JCL | After JOB, before first EXEC | After EXEC for the step |
| In procedures | Not allowed | Allowed and common |
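The scope difference in the table can be seen in a small job. The sketch below is illustrative only: DEMOJOB, PROGA, PROGB, and CONT.PROD.SPECIAL.LOADLIB are hypothetical names that do not appear in the bank's JCL.

```jcl
//DEMOJOB  JOB (ACCTG01),'JOBLIB VS STEPLIB',CLASS=A,MSGCLASS=X
//JOBLIB   DD DSN=CONT.PROD.ACCTG.LOADLIB,DISP=SHR
//         DD DSN=CONT.PROD.COMMON.LOADLIB,DISP=SHR
//*
//* STEPA codes no STEPLIB: PROGA is searched for in the JOBLIB
//* concatenation first, then in the system libraries.
//STEPA    EXEC PGM=PROGA
//SYSOUT   DD SYSOUT=X
//*
//* STEPB codes a STEPLIB: for this step only, the JOBLIB is not
//* searched. PROGB must be in the STEPLIB or the system libraries.
//STEPB    EXEC PGM=PROGB
//STEPLIB  DD DSN=CONT.PROD.SPECIAL.LOADLIB,DISP=SHR
//SYSOUT   DD SYSOUT=X
```

Even if a copy of PROGB also exists in CONT.PROD.ACCTG.LOADLIB, STEPB does not load it from there, because the JOBLIB is not part of that step's search.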
DD Overrides in Procedure Calls
When a job invokes a cataloged procedure, the caller can override any DD statement in the procedure. The override syntax uses the format stepname.ddname:
//RCRPT EXEC RECONRPT,
// COND=(4,LT,RCMATCH)
//RPTSTEP1.MATCHIN DD DSN=CONT.PROD.ACCTG.RECON.MATCHED,
// DISP=SHR
//RPTSTEP1.RPTOUT DD SYSOUT=H,
// DCB=(RECFM=FBA,LRECL=133,BLKSIZE=0)
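RPTSTEP1.MATCHIN modifies the MATCHIN DD in the RPTSTEP1 step of the RECONRPT procedure. The override is applied parameter by parameter; it does not replace the whole statement. Each parameter coded on the override replaces the same parameter on the procedure DD, and any parameter not coded on the override is kept from the procedure. If the procedure's MATCHIN DD specifies DSN, DISP, and DCB, and the override specifies only DSN and DISP, the procedure's DCB still applies to the overriding dataset. That is convenient when the attributes match, but an inherited DCB, UNIT, or VOL that does not fit the new dataset can cause failures that are hard to spot, because the offending parameter is not visible in the calling job.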
Lorraine's standard requires that every DD override be preceded by a comment block explaining the reason for the override. This makes the intent clear to operations staff who review JCL during problem determination.
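The same comment standard can be applied when the caller adds DD statements that the procedure does not define. The sketch below is an assumption made for illustration: the bank's job does not add SYSUDUMP to the procedure steps. Statements are grouped by procedure step in the order the steps appear in RECONRPT, with added DDs placed after the overrides for their step, which is the conservative layout.

```jcl
//RCRPT    EXEC RECONRPT,
//         COND=(4,LT,RCMATCH)
//*---------- DD Override: RPTSTEP1.RPTOUT --------------------------*
//* Route the report to held class H instead of procedure class A.  *
//*------------------------------------------------------------------*
//RPTSTEP1.RPTOUT DD SYSOUT=H,
//         DCB=(RECFM=FBA,LRECL=133,BLKSIZE=0)
//*---------- DD Addition: RPTSTEP1.SYSUDUMP (illustrative) ---------*
//* RPTSTEP1 has no SYSUDUMP in the procedure, so this DD is added  *
//* to the step; it does not override an existing statement.        *
//*------------------------------------------------------------------*
//RPTSTEP1.SYSUDUMP DD SYSOUT=X
//*---------- DD Override: RPTSTEP2.RPTOUT --------------------------*
//* Same class change for the exception report.                     *
//*------------------------------------------------------------------*
//RPTSTEP2.RPTOUT DD SYSOUT=H,
//         DCB=(RECFM=FBA,LRECL=133,BLKSIZE=0)
//*---------- DD Addition: RPTSTEP2.SYSUDUMP (illustrative) ---------*
//* Added for the same reason as RPTSTEP1.SYSUDUMP.                 *
//*------------------------------------------------------------------*
//RPTSTEP2.SYSUDUMP DD SYSOUT=X
```

A ddname that does not exist in the named procedure step is treated as an addition, so a misspelled override name silently adds a new DD and leaves the procedure's original untouched.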
Referback Notation
Referback notation allows a DD statement to copy parameters from a previous DD statement in the same job. While the refactored example does not use extensive referbacks, the technique is valuable when multiple steps process the same dataset with identical attributes:
//* Example of referback notation (not in the main job above)
//STEP1 EXEC PGM=PROG1
//OUTPUT DD DSN=CONT.PROD.WORK.FILE,
// DISP=(NEW,CATLG,DELETE),
// DCB=(RECFM=FB,LRECL=300,BLKSIZE=27000),
// SPACE=(CYL,(10,5),RLSE),
// UNIT=SYSDA
//*
//STEP2 EXEC PGM=PROG2
//INPUT DD DSN=*.STEP1.OUTPUT,DISP=SHR
//OUTPUT2 DD DSN=CONT.PROD.WORK.FILE2,
// DISP=(NEW,CATLG,DELETE),
// DCB=*.STEP1.OUTPUT,
// SPACE=(CYL,(10,5),RLSE),
// UNIT=SYSDA
DSN=*.STEP1.OUTPUT is a referback to the dataset name in STEP1's OUTPUT DD. DCB=*.STEP1.OUTPUT copies all DCB attributes from that DD. This ensures consistency and reduces the chance of coding errors when the same attributes appear in multiple places.
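Referbacks come in three forms, depending on where the earlier DD lives: *.ddname (same step), *.stepname.ddname (an earlier step), and *.stepname.procstepname.ddname (a DD inside a procedure called by an earlier step). The sketch below shows the first and third forms. RCARCH, ARCHPGM, ARCHOUT, ARCHIDX, and the two ARCH dataset names are hypothetical, and the step is assumed to follow RCRPT in the refactored job.

```jcl
//RCARCH   EXEC PGM=ARCHPGM
//* Three-level form: the dataset named on the EXCPIN DD of
//* procedure step RPTSTEP2, reached through calling step RCRPT.
//EXCPIN   DD DSN=*.RCRPT.RPTSTEP2.EXCPIN,DISP=SHR
//ARCHOUT  DD DSN=CONT.PROD.ACCTG.RECON.ARCHIVE,
//         DISP=(NEW,CATLG,DELETE),
//         DCB=(RECFM=FB,LRECL=400,BLKSIZE=27200),
//         SPACE=(CYL,(1,1),RLSE),
//         UNIT=SYSDA
//* Same-step form: copy the DCB coded on ARCHOUT above.
//ARCHIDX  DD DSN=CONT.PROD.ACCTG.RECON.ARCHIDX,
//         DISP=(NEW,CATLG,DELETE),
//         DCB=*.ARCHOUT,
//         SPACE=(CYL,(1,1),RLSE),
//         UNIT=SYSDA
```

A DCB referback copies only the DCB subparameters coded on the referenced DD statement, not attributes read from the dataset label, so the referenced DD must code them explicitly.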
Output Class Management
The bank's output class scheme:
| Class | Purpose | Retention | Access |
|---|---|---|---|
| A | Printed reports | Print immediately | All users |
| H | Held output | Hold until released | Authorized users |
| X | Job logs | Hold for 7 days | Operations |
| Z | Purge-after-review | Hold for 24 hours | Operations |
| T | Test output | Hold for 4 hours | Developers |
The refactored JCL uses class X for diagnostic output (SYSOUT, SYSUDUMP) and class H for business reports. The original JCL used SYSOUT=*, which defaults to the MSGCLASS -- meaning business reports and diagnostic output went to the same class, making it difficult for operations to manage report distribution.
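How SYSOUT=* resolves can be seen next to an explicit class. In this sketch DEMOOUT and RPTPGM are hypothetical names; the classes follow the bank's scheme in the table above.

```jcl
//DEMOOUT  JOB (ACCTG01),'OUTPUT CLASSES',CLASS=A,MSGCLASS=X
//RPTSTEP  EXEC PGM=RPTPGM
//* SYSOUT=* takes the MSGCLASS value, so this output goes to
//* class X together with the job log.
//SYSOUT   DD SYSOUT=*
//* An explicit class keeps the business report separate:
//* class H is held until released.
//REPORT   DD SYSOUT=H
```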
Restart Procedures
The JOB statement includes RESTART=RCEXTRACT, which tells the system to begin execution at the RCEXTRACT step and skip every step before it. Because RCEXTRACT is the first step, the value shown has the same effect as a normal start. In normal operation, this parameter is commented out. When a restart is needed, operations edits the JCL to reinstate it and set the appropriate restart step.
The restart instructions in the comment block are critical. They document:
1. Which step to restart from for each possible failure point.
2. What cleanup is needed before restarting (deleting partial output datasets).
3. Whether the input data is still valid for reprocessing.
For a restart after RCMATCH abends:
//* RESTART PROCEDURE FOR RCMATCH FAILURE:
//* 1. Delete partial output:
//* DELETE CONT.PROD.ACCTG.RECON.MATCHED
//* DELETE CONT.PROD.ACCTG.RECON.UNMATCHED
//* DELETE CONT.PROD.ACCTG.RECON.EXCEPTIONS
//* 2. Edit JOB statement: RESTART=RCMATCH
//* 3. Resubmit job.
The RCEXTRACT step's output uses DISP=(NEW,CATLG,DELETE). The DELETE in the third positional (abnormal termination disposition) means the dataset is automatically deleted if the step abends. This simplifies restart because the partial dataset is cleaned up automatically.
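For the RCMATCH case, the edited JOB statement might look like the sketch below. The commented alternative, which restarts at a step inside the cataloged procedure, is not part of the bank's documented instructions and is shown only to illustrate the stepname.procstepname form of RESTART.

```jcl
//RECONJOB JOB (ACCTG01),'DAILY RECONCILIATION',
//         CLASS=A,
//         MSGCLASS=X,
//         MSGLEVEL=(1,1),
//         NOTIFY=&SYSUID,
//         REGION=0M,
//         RESTART=RCMATCH
//* Alternative (illustrative): restart at RPTSTEP2 inside the
//* RECONRPT procedure called by step RCRPT:
//*        RESTART=RCRPT.RPTSTEP2
```

On a restart, the steps before the restart point do not run, so a COND test that refers to one of them, such as COND=(4,LT,RCEXTRACT) on RCMATCH, is ignored and the step executes.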
SYSUDUMP DD Statement
The RCEXTRACT and RCMATCH steps each include a SYSUDUMP DD:
//SYSUDUMP DD SYSOUT=X
If the step abends, the system writes a formatted dump to this DD. Without SYSUDUMP (or SYSABEND or SYSMDUMP), the dump is lost, making problem diagnosis extremely difficult. Lorraine's standard requires SYSUDUMP on every step that executes a COBOL program.
The three dump DD names differ in what they capture:
| DD Name | Content | Size |
|---|---|---|
| SYSUDUMP | User regions only (WORKING-STORAGE, etc.) | Small |
| SYSABEND | User regions + system areas | Medium |
| SYSMDUMP | Machine-readable dump for IPCS analysis | Large |
SYSUDUMP is the standard choice for COBOL programs because it captures the WORKING-STORAGE and LINKAGE SECTION data that programmers need for debugging.
Common JCL Errors and Their Consequences
Lorraine's team documented the most common JCL errors they found during the standards review:
- JCL ERROR - IEF212I (DATA SET NOT FOUND): Occurs when a dataset named in a DD statement cannot be located. Often caused by misspelled dataset names or missing qualifiers.
- S806 ABEND: Program not found in any searched library. Usually caused by JOBLIB/STEPLIB pointing to the wrong library or a missing program member.
- S913 ABEND: RACF authorization failure. The job's user ID does not have access to a dataset. Common when moving JCL between environments without updating dataset names.
- S0C7 ABEND: Data exception in the COBOL program, but often caused by JCL problems -- for example, pointing to the wrong input file or a file with the wrong record format.
- S322 ABEND: Job exceeded its CPU time limit. May require increasing the TIME parameter on the JOB or EXEC statement.
- SB37 ABEND: Output dataset ran out of space. The SPACE parameter needs larger primary or secondary allocations.
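Two of these abends are usually addressed in the JCL itself. The sketch below shows where the TIME and SPACE parameters go. The step name, program name, dataset name, and all values are hypothetical and would need to be sized from actual run history.

```jcl
//* S322: allow up to 10 minutes of CPU time for this step.
//BIGSTEP  EXEC PGM=BIGPGM,TIME=(10,0)
//* SB37: larger primary and secondary allocations; RLSE returns
//* unused space when the dataset is closed.
//BIGOUT   DD DSN=CONT.PROD.ACCTG.WORK.BIGOUT,
//         DISP=(NEW,CATLG,DELETE),
//         DCB=(RECFM=FB,LRECL=400,BLKSIZE=27200),
//         SPACE=(CYL,(50,25),RLSE),
//         UNIT=SYSDA
//SYSOUT   DD SYSOUT=X
//SYSUDUMP DD SYSOUT=X
```

Raising a limit treats the symptom. A step that suddenly needs far more CPU time or space than usual may be looping or reading the wrong input, so the job log is worth checking before the values are changed.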
Discussion Questions
- The original JCL included CONT.TEST.LOADLIB in the STEPLIB concatenation. Explain the specific chain of events that could lead to production data corruption from this configuration. Why does the new standard prohibit test libraries in production JOBLIB/STEPLIB?
- DD overrides are merged with the procedure DD parameter by parameter; they do not replace the whole statement, so any parameter not coded on the override is inherited from the procedure. What are the implications of this behavior? Design a scenario where a parameter inherited from the procedure DD, and not restated on the override, would cause a failure.
- The restart instructions are documented in JCL comments. What are the limitations of this approach? How would you design an automated restart mechanism that does not depend on operations staff reading and following comment instructions?
- The output class scheme uses different classes for different types of output. How does this facilitate operations management in a large shop with hundreds of jobs? What would happen if all output went to the same class?
- Compare JOBLIB and STEPLIB in the context of a multi-application job that calls programs from three different application libraries. What are the trade-offs between using a single JOBLIB with three concatenated libraries versus using STEPLIB on each step?
- The SET MAXCC = 0 command in the IDCAMS cleanup step forces the return code to zero even if a DELETE fails (because the dataset might not exist). Is this good practice or does it mask real errors? How would you code the cleanup to handle both "dataset not found" (acceptable) and "dataset in use" (a real error) scenarios?