Case Study 2: Production JCL Standards and Error Recovery

Background

Continental Savings Bank operates over 400 production batch jobs nightly. In 2024, the bank experienced a significant incident: a poorly coded JCL job overwrote a critical master file because a STEPLIB pointed to a test load library, causing a test version of a program to execute in production. The recovery took eight hours and resulted in delayed statement processing for 50,000 customers.

Following this incident, the bank's CTO commissioned a JCL standards review led by senior systems programmer Lorraine Fujimoto. Her team developed a comprehensive set of JCL coding standards, restart procedures, and operational guidelines. This case study examines the standards through a concrete example: the bank's daily account reconciliation job, which was refactored to comply with the new standards.

The Standards Document: Key Rules

Lorraine's team established the following rules for all production JCL:

  1. JOBLIB is mandatory for production jobs. STEPLIB is only permitted within cataloged procedures to override environment-specific libraries.
  2. Every job must include a restart step that can be used to resume processing after an abend.
  3. DD overrides in procedure calls must be documented with comments explaining why the override is necessary.
  4. Output class management must follow the bank's classification scheme: Class A for printed reports, Class H for held output, Class X for job logs, and Class Z for purge-after-review.
  5. Referback notation should be used for consistency when multiple steps reference the same dataset attributes.
  6. All dataset names must follow the naming convention: HLQ.ENV.APP.TYPE.QUALIFIER.

The Reconciliation Job: Before Refactoring

The original JCL for the daily reconciliation job had several problems that Lorraine's team identified:

//*================================================================*
//* ORIGINAL JCL - DO NOT USE IN PRODUCTION                         *
//* This JCL contains multiple violations of coding standards.       *
//* Shown here for educational comparison with the refactored        *
//* version below.                                                   *
//*================================================================*
//RECON    JOB (ACCT),'RECONCILIATION',CLASS=A
//STEP1    EXEC PGM=RECONPGM
//STEPLIB  DD DSN=CONT.LOADLIB,DISP=SHR
//         DD DSN=CONT.TEST.LOADLIB,DISP=SHR
//INFILE   DD DSN=CONT.DAILY.TRANS,DISP=SHR
//OUTFILE  DD DSN=CONT.RECON.REPORT,
//            DISP=(NEW,CATLG,DELETE),
//            SPACE=(TRK,(10,5)),
//            UNIT=SYSDA
//SYSOUT   DD SYSOUT=*
//STEP2    EXEC PGM=RECONRPT
//INFILE   DD DSN=CONT.RECON.REPORT,DISP=SHR
//REPORT   DD SYSOUT=*

Problems identified:

  - Missing MSGCLASS, MSGLEVEL, and NOTIFY on the JOB statement.
  - STEPLIB includes a test library in the concatenation. This is what caused the 2024 incident.
  - No JOBLIB -- the library search is set step by step with STEPLIB (and STEP2 codes none), risking inconsistency.
  - Generic step names (STEP1, STEP2) provide no operational context.
  - No COND or IF/THEN/ELSE for conditional execution.
  - No DCB parameters on output datasets.
  - SYSOUT=* uses the default class rather than an explicit class.
  - No restart capability documented or coded.
  - Dataset names do not follow the naming convention.

The Reconciliation Job: After Refactoring

//RECONJOB JOB (ACCTG01),'DAILY RECONCILIATION',
//             CLASS=A,
//             MSGCLASS=X,
//             MSGLEVEL=(1,1),
//             NOTIFY=&SYSUID,
//             REGION=0M,
//             RESTART=RCEXTRACT
//*================================================================*
//* JOB: RECONJOB - Daily Account Reconciliation                    *
//* OWNER: Accounting Department (ACCTG01)                          *
//* SCHEDULE: Daily at 20:00 EST via CA-7                            *
//* ONCALL: Operations Center x4500                                  *
//*                                                                  *
//* RESTART INSTRUCTIONS:                                            *
//*   If RCEXTRACT abends: Restart from RCEXTRACT.                   *
//*     No cleanup needed (abend disposition DELETE removes output). *
//*   If RCMATCH  abends: Restart from RCMATCH.                      *
//*     Keep RECON.EXTRACT (input); delete any partial outputs.      *
//*   If RCRPT    abends: Restart from RCRPT.                        *
//*     No cleanup needed (SYSOUT output only).                      *
//*   If RCNOTIFY abends: Restart from RCNOTIFY.                     *
//*     No cleanup needed.                                           *
//*================================================================*
//*
//*---------- JOBLIB: Production load library only -----------------*
//*  STANDARD: JOBLIB must reference only production libraries.      *
//*  NEVER include test or QA libraries in JOBLIB.                   *
//*------------------------------------------------------------------*
//JOBLIB   DD DSN=CONT.PROD.ACCTG.LOADLIB,DISP=SHR
//         DD DSN=CONT.PROD.COMMON.LOADLIB,DISP=SHR
//*
//*================================================================*
//* RCEXTRACT: Extract today's transactions and GL balances          *
//* Reads the daily transaction file and the general ledger          *
//* summary, producing a reconciliation extract file.                *
//*================================================================*
//RCEXTRACT EXEC PGM=RECONEXT
//TRANSIN  DD DSN=CONT.PROD.ACCTG.DAILY.TRANS,DISP=SHR
//GLSUMM   DD DSN=CONT.PROD.ACCTG.GL.SUMMARY,DISP=SHR
//EXTRACT  DD DSN=CONT.PROD.ACCTG.RECON.EXTRACT,
//            DISP=(NEW,CATLG,DELETE),
//            DCB=(RECFM=FB,LRECL=300,BLKSIZE=27000),
//            SPACE=(CYL,(10,5),RLSE),
//            UNIT=SYSDA
//CTLCARD  DD DSN=CONT.PROD.ACCTG.RECON.PARMS,DISP=SHR
//SYSOUT   DD SYSOUT=X
//SYSUDUMP DD SYSOUT=X
//*
//*================================================================*
//* RCMATCH: Match extracted transactions against GL entries         *
//* Produces matched, unmatched, and exception files.                *
//*================================================================*
//RCMATCH  EXEC PGM=RECONMCH,
//             COND=(4,LT,RCEXTRACT)
//EXTRACT  DD DSN=CONT.PROD.ACCTG.RECON.EXTRACT,DISP=SHR
//MATCHED  DD DSN=CONT.PROD.ACCTG.RECON.MATCHED,
//            DISP=(NEW,CATLG,DELETE),
//            DCB=(RECFM=FB,LRECL=400,BLKSIZE=27200),
//            SPACE=(CYL,(10,5),RLSE),
//            UNIT=SYSDA
//UNMATCHED DD DSN=CONT.PROD.ACCTG.RECON.UNMATCHED,
//            DISP=(NEW,CATLG,DELETE),
//            DCB=(RECFM=FB,LRECL=400,BLKSIZE=27200),
//            SPACE=(CYL,(2,1),RLSE),
//            UNIT=SYSDA
//EXCEPTN  DD DSN=CONT.PROD.ACCTG.RECON.EXCEPTIONS,
//            DISP=(NEW,CATLG,DELETE),
//            DCB=(RECFM=FB,LRECL=400,BLKSIZE=27200),
//            SPACE=(CYL,(1,1),RLSE),
//            UNIT=SYSDA
//SYSOUT   DD SYSOUT=X
//SYSUDUMP DD SYSOUT=X
//*
//*================================================================*
//* RCRPT: Generate reconciliation reports                           *
//* Uses cataloged procedure RECONRPT with DD overrides.             *
//*================================================================*
//RCRPT    EXEC RECONRPT,
//             COND=(4,LT,RCMATCH)
//*---------- DD Override: RPTSTEP1.MATCHIN -------------------------*
//* Name the RCMATCH output explicitly so this job does not depend   *
//* on the procedure default (currently the same dataset) staying    *
//* unchanged.                                                       *
//*------------------------------------------------------------------*
//RPTSTEP1.MATCHIN DD DSN=CONT.PROD.ACCTG.RECON.MATCHED,
//            DISP=SHR
//*---------- DD Override: RPTSTEP1.RPTOUT --------------------------*
//* Override output class from procedure default (A) to class H      *
//* for held output, per production standards.                       *
//*------------------------------------------------------------------*
//RPTSTEP1.RPTOUT DD SYSOUT=H,
//            DCB=(RECFM=FBA,LRECL=133,BLKSIZE=0)
//*---------- DD Override: RPTSTEP2.EXCPIN --------------------------*
//* Override the exception input file name.                          *
//*------------------------------------------------------------------*
//RPTSTEP2.EXCPIN DD DSN=CONT.PROD.ACCTG.RECON.EXCEPTIONS,
//            DISP=SHR
//RPTSTEP2.RPTOUT DD SYSOUT=H,
//            DCB=(RECFM=FBA,LRECL=133,BLKSIZE=0)
//*
//*================================================================*
//* RCNOTIFY: Send completion notification                           *
//* Success path runs when RCMATCH ended with RC <= 4 and no step    *
//* abended. Testing ABEND lets the ELSE path run after an abend.    *
//* Uses IF/THEN/ELSE for different notification paths.              *
//*================================================================*
//         IF (RCMATCH.RC <= 4 AND ABEND=FALSE) THEN
//RCNOTIFY EXEC PGM=IKJEFT01
//SYSTSPRT DD SYSOUT=X
//SYSTSIN  DD *
  SEND 'RECONJOB: Daily reconciliation completed ' +
       'successfully.' USER(ACCTOPS)
  SEND 'RECONJOB: Reports available in HELD class H.' +
       USER(ACCTOPS)
/*
//         ELSE
//RCFAIL   EXEC PGM=IKJEFT01
//SYSTSPRT DD SYSOUT=X
//SYSTSIN  DD *
  SEND 'RECONJOB: *** RECONCILIATION FAILED *** ' +
       'Review job log immediately.' USER(ACCTOPS)
  SEND 'RECONJOB: *** RECONCILIATION FAILED *** ' +
       'Escalate to on-call x4500.' USER(OPER01)
/*
//         ENDIF
//*
//*================================================================*
//* RCCLEAN: Cleanup intermediate files                              *
//* Runs only after successful completion of all processing steps.   *
//*================================================================*
//         IF (RCMATCH.RC <= 4 AND RCRPT.RC <= 4) THEN
//RCCLEAN  EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=X
//SYSIN    DD *
  DELETE CONT.PROD.ACCTG.RECON.EXTRACT PURGE
  DELETE CONT.PROD.ACCTG.RECON.MATCHED PURGE
  SET MAXCC = 0
/*
//         ENDIF

The Cataloged Procedure for Reports

//*================================================================*
//* RECONRPT - Cataloged procedure for reconciliation reports       *
//* Stored in: CONT.PROD.PROCLIB(RECONRPT)                         *
//*================================================================*
//RECONRPT PROC
//*
//RPTSTEP1 EXEC PGM=RECRPT01
//STEPLIB  DD DSN=CONT.PROD.ACCTG.LOADLIB,DISP=SHR
//MATCHIN  DD DSN=CONT.PROD.ACCTG.RECON.MATCHED,DISP=SHR
//RPTOUT   DD SYSOUT=A,
//            DCB=(RECFM=FBA,LRECL=133,BLKSIZE=0)
//SYSOUT   DD SYSOUT=X
//*
//RPTSTEP2 EXEC PGM=RECRPT02,
//             COND=(4,LT,RPTSTEP1)
//STEPLIB  DD DSN=CONT.PROD.ACCTG.LOADLIB,DISP=SHR
//EXCPIN   DD DSN=CONT.PROD.ACCTG.RECON.EXCEPTIONS,DISP=SHR
//UNMTCHIN DD DSN=CONT.PROD.ACCTG.RECON.UNMATCHED,DISP=SHR
//RPTOUT   DD SYSOUT=A,
//            DCB=(RECFM=FBA,LRECL=133,BLKSIZE=0)
//SYSOUT   DD SYSOUT=X
//*
//         PEND

Solution Walkthrough

JOBLIB Concatenation Standards

The refactored JCL uses a JOBLIB with two libraries:

//JOBLIB   DD DSN=CONT.PROD.ACCTG.LOADLIB,DISP=SHR
//         DD DSN=CONT.PROD.COMMON.LOADLIB,DISP=SHR

The system searches these in order. Application-specific programs (RECONEXT, RECONMCH) are found in the first library. Common utility programs are found in the second. The critical rule: no test or QA libraries are ever included in production JOBLIB.

When a procedure step specifies its own STEPLIB, the JOBLIB is completely ignored for that step. This is why the procedure includes STEPLIB -- it needs the same libraries, but if the procedure were reused in a test job, the test job's JOBLIB would not contaminate the procedure's library search.

The distinction between JOBLIB and STEPLIB:

Aspect                 JOBLIB                            STEPLIB
---------------------  --------------------------------  --------------------------
Scope                  Entire job                        Single step
When overridden        When STEPLIB is coded for a step  Never (it is the override)
Maximum concatenation  16 libraries                      16 libraries
Placement in JCL       After JOB, before first EXEC      After EXEC for the step
In procedures          Not allowed                       Allowed and common
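
A minimal sketch of how the two statements interact. The job name (LIBDEMO) and program names (PROGA, PROGB) are hypothetical; only the library names come from the refactored job.

```jcl
//LIBDEMO  JOB (ACCTG01),'LIBRARY SEARCH DEMO',
//             CLASS=A,MSGCLASS=X,NOTIFY=&SYSUID
//*  JOBLIB follows the JOB statement and precedes the first EXEC.
//JOBLIB   DD DSN=CONT.PROD.ACCTG.LOADLIB,DISP=SHR
//*
//*  STEPA codes no STEPLIB, so PROGA is located through JOBLIB.
//STEPA    EXEC PGM=PROGA
//SYSOUT   DD SYSOUT=X
//*
//*  STEPB codes STEPLIB, so JOBLIB is not searched for this step.
//STEPB    EXEC PGM=PROGB
//STEPLIB  DD DSN=CONT.PROD.COMMON.LOADLIB,DISP=SHR
//SYSOUT   DD SYSOUT=X
```

In this sketch, if PROGB existed only in CONT.PROD.ACCTG.LOADLIB (and not in a system library), STEPB would be expected to fail with an S806 abend even though the JOBLIB names that library.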

DD Overrides in Procedure Calls

When a job invokes a cataloged procedure, the caller can override any DD statement in the procedure. The override syntax uses the format stepname.ddname:

//RCRPT    EXEC RECONRPT,
//             COND=(4,LT,RCMATCH)
//RPTSTEP1.MATCHIN DD DSN=CONT.PROD.ACCTG.RECON.MATCHED,
//            DISP=SHR
//RPTSTEP1.RPTOUT DD SYSOUT=H,
//            DCB=(RECFM=FBA,LRECL=133,BLKSIZE=0)

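RPTSTEP1.MATCHIN overrides the MATCHIN DD in the RPTSTEP1 step of the RECONRPT procedure. The override is applied parameter by parameter: each parameter coded on the override replaces the same parameter on the procedure DD, and any parameter the override does not code stays in effect (unless it is mutually exclusive with one that was coded). If the procedure's MATCHIN DD specifies DSN, DISP, and DCB, but the override only specifies DSN and DISP, the DCB from the procedure is still used, which can cause trouble when it describes a different record layout than the new dataset. To drop an inherited parameter, code its keyword with no value (for example, UNIT=); DCB subparameters are overridden or nullified individually.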

Lorraine's standard requires that every DD override be preceded by a comment block explaining the reason for the override. This makes the intent clear to operations staff who review JCL during problem determination.
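
The ordering rules for overrides are easy to get wrong, so a short sketch may help. It reuses the RECONRPT procedure from this case study; the added SYSUDUMP DD is an illustration and is not part of the bank's job. The sketch assumes the usual JCL rule that overrides for a procedure step are coded in the same order as the DD statements in that step, with any added DD statements after them.

```jcl
//RCRPT    EXEC RECONRPT
//*---------- DD Override: RPTSTEP1.MATCHIN -------------------------*
//* Overrides come first, in the order the DDs appear in RPTSTEP1.   *
//*------------------------------------------------------------------*
//RPTSTEP1.MATCHIN DD DSN=CONT.PROD.ACCTG.RECON.MATCHED,DISP=SHR
//RPTSTEP1.RPTOUT DD SYSOUT=H,
//            DCB=(RECFM=FBA,LRECL=133,BLKSIZE=0)
//*---------- Added DD: RPTSTEP1.SYSUDUMP (hypothetical) ------------*
//* Not in the procedure, so it follows the overrides for RPTSTEP1.  *
//*------------------------------------------------------------------*
//RPTSTEP1.SYSUDUMP DD SYSOUT=X
//*---------- DD Override: RPTSTEP2.RPTOUT --------------------------*
//* Statements for RPTSTEP2 follow all statements for RPTSTEP1.      *
//*------------------------------------------------------------------*
//RPTSTEP2.RPTOUT DD SYSOUT=H,
//            DCB=(RECFM=FBA,LRECL=133,BLKSIZE=0)
```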

Referback Notation

Referback notation allows a DD statement to copy parameters from a previous DD statement in the same job. While the refactored example does not use extensive referbacks, the technique is valuable when multiple steps process the same dataset with identical attributes:

//* Example of referback notation (not in the main job above)
//STEP1    EXEC PGM=PROG1
//OUTPUT   DD DSN=CONT.PROD.WORK.FILE,
//            DISP=(NEW,CATLG,DELETE),
//            DCB=(RECFM=FB,LRECL=300,BLKSIZE=27000),
//            SPACE=(CYL,(10,5),RLSE),
//            UNIT=SYSDA
//*
//STEP2    EXEC PGM=PROG2
//INPUT    DD DSN=*.STEP1.OUTPUT,DISP=SHR
//OUTPUT2  DD DSN=CONT.PROD.WORK.FILE2,
//            DISP=(NEW,CATLG,DELETE),
//            DCB=*.STEP1.OUTPUT,
//            SPACE=(CYL,(10,5),RLSE),
//            UNIT=SYSDA

DSN=*.STEP1.OUTPUT is a referback to the dataset name in STEP1's OUTPUT DD. DCB=*.STEP1.OUTPUT copies all DCB attributes from that DD. This ensures consistency and reduces the chance of coding errors when the same attributes appear in multiple places.
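
A referback can take three forms, depending on where the referenced DD lives. The sketch below continues the example above; STEP3, PROG3, CALLRPT, and the DD names are hypothetical.

```jcl
//*  CALLRPT invokes the RECONRPT procedure shown earlier.
//CALLRPT  EXEC RECONRPT
//*
//STEP3    EXEC PGM=PROG3
//IN1      DD DSN=CONT.PROD.WORK.FILE,DISP=SHR
//*  *.ddname: an earlier DD in the same step
//IN1AGAIN DD DSN=*.IN1,DISP=SHR
//*  *.stepname.ddname: a DD in an earlier step
//IN2      DD DSN=*.STEP1.OUTPUT,DISP=SHR
//*  *.stepname.procstepname.ddname: a DD inside a called procedure
//IN3      DD DSN=*.CALLRPT.RPTSTEP1.MATCHIN,DISP=SHR
```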

Output Class Management

The bank's output class scheme:

Class  Purpose             Retention           Access
-----  ------------------  ------------------  ----------------
A      Printed reports     Print immediately   All users
H      Held output         Hold until released Authorized users
X      Job logs            Hold for 7 days     Operations
Z      Purge-after-review  Hold for 24 hours   Operations
T      Test output         Hold for 4 hours    Developers

The refactored JCL uses class X for diagnostic output (SYSOUT, SYSUDUMP) and class H for business reports. The original JCL used SYSOUT=*, which sends output to the same class as the job's MSGCLASS (the installation default when MSGCLASS is not coded, as in the original) -- meaning business reports and diagnostic output went to the same class, making it difficult for operations to manage report distribution.
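
As a small illustration of the difference, the hypothetical step below (RPTDEMO and the TRACE DD are invented for this sketch) codes two explicit classes and one SYSOUT=*. Only the first two are routed predictably by the class scheme.

```jcl
//*  Hypothetical report step used only to contrast the codings.
//RPTDEMO  EXEC PGM=RECRPT01
//SYSOUT   DD SYSOUT=X        Diagnostics: job-log class
//RPTOUT   DD SYSOUT=H        Business report: held class
//TRACE    DD SYSOUT=*        Follows MSGCLASS, whatever it is
```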

Restart Procedures

The JOB statement includes RESTART=RCEXTRACT, which tells the system to begin execution at the RCEXTRACT step. Because RCEXTRACT is the first step, the listing as shown still runs the whole job. In normal operation, this parameter is commented out. When a restart is needed, operations edits the JCL to uncomment it and set the appropriate restart step.

The restart instructions in the comment block are critical. They document:

  1. Which step to restart from for each possible failure point.
  2. What cleanup is needed before restarting (deleting partial output datasets).
  3. Whether the input data is still valid for reprocessing.

For a restart after RCMATCH abends:

//* RESTART PROCEDURE FOR RCMATCH FAILURE:
//* 1. Delete partial output:
//*    DELETE CONT.PROD.ACCTG.RECON.MATCHED
//*    DELETE CONT.PROD.ACCTG.RECON.UNMATCHED
//*    DELETE CONT.PROD.ACCTG.RECON.EXCEPTIONS
//* 2. Edit JOB statement: RESTART=RCMATCH
//* 3. Resubmit job.

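The RCEXTRACT step's output uses DISP=(NEW,CATLG,DELETE). The DELETE in the third positional subparameter (the abnormal termination disposition) means the dataset is automatically deleted if the step abends. This simplifies restart because the partial dataset is cleaned up automatically. The same disposition is coded on the RCMATCH outputs, so the DELETE commands in the procedure above matter mainly when a dataset survives the failure: for example, when a step ends with a high return code instead of abending, the normal disposition (CATLG) applies and the dataset must be deleted before a rerun can allocate it as NEW.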
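
One common way to make the pre-restart cleanup tolerant of a dataset that may or may not exist is an IEFBR14 step with DISP=(MOD,DELETE,DELETE). The sketch below is an assumption about how the bank could code it, not part of the standards document; the job name RECONCLN and step name DELOUT are hypothetical.

```jcl
//RECONCLN JOB (ACCTG01),'RECON RESTART CLEAN',
//             CLASS=A,MSGCLASS=X,NOTIFY=&SYSUID
//*  Hypothetical pre-restart cleanup for an RCMATCH failure.
//*  DISP=MOD allocates the dataset when it is not cataloged, so
//*  the step ends normally whether or not the dataset survived.
//DELOUT   EXEC PGM=IEFBR14
//MATCHED  DD DSN=CONT.PROD.ACCTG.RECON.MATCHED,
//            DISP=(MOD,DELETE,DELETE),
//            SPACE=(TRK,(1,1)),UNIT=SYSDA
```

A similar DD would be repeated for each output dataset named in the restart procedure. Once the cleanup job ends, RECONJOB is resubmitted with RESTART=RCMATCH on its JOB statement.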

SYSUDUMP DD Statement

The RCEXTRACT and RCMATCH steps each include a SYSUDUMP DD:

//SYSUDUMP DD SYSOUT=X

If the step abends, the system writes a formatted dump to this DD. Without SYSUDUMP (or SYSABEND or SYSMDUMP), no dump is produced, making problem diagnosis extremely difficult. Lorraine's standard requires SYSUDUMP on every step that executes a COBOL program.

The three dump DD names differ in what they capture:

DD Name   Content                                    Size
--------  -----------------------------------------  ------
SYSUDUMP  User regions only (WORKING-STORAGE, etc.)  Small
SYSABEND  User regions + system areas                Medium
SYSMDUMP  Machine-readable dump for IPCS analysis    Large

SYSUDUMP is the standard choice for COBOL programs because it captures the WORKING-STORAGE and LINKAGE SECTION data that programmers need for debugging.

Common JCL Errors and Their Consequences

Lorraine's team documented the most common JCL errors they found during the standards review:

  1. JCL ERROR - IEF212I (DATA SET NOT FOUND): Occurs when a dataset in a DD statement does not exist. Often caused by misspelled dataset names or missing qualifiers.

  2. S806 ABEND: Program not found in any searched library. Usually caused by JOBLIB/STEPLIB pointing to the wrong library or a missing program member.

  3. S913 ABEND: RACF authorization failure. The job's user ID does not have access to a dataset. Common when moving JCL between environments without updating dataset names.

  4. S0C7 ABEND: Data exception in the COBOL program, but often caused by JCL problems -- for example, pointing to the wrong input file or a file with the wrong record format.

  5. S322 ABEND: Job exceeded its CPU time limit. May require increasing the TIME parameter on the JOB or EXEC statement.

  6. SB37 ABEND: Output dataset ran out of space. The SPACE parameter needs larger primary or secondary allocations.
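
For the last two abends, the fix is usually a parameter change. The values below are hypothetical and would need to be sized from actual run history; the step and dataset names are taken from the refactored job.

```jcl
//*  TIME=(minutes,seconds) sets the CPU time limit for the step;
//*  raising it addresses S322. Values here are illustrative only.
//RCMATCH  EXEC PGM=RECONMCH,
//             TIME=(10,0)
//*  Larger primary and secondary quantities reduce the risk of
//*  SB37; RLSE returns unused space when the dataset is closed.
//MATCHED  DD DSN=CONT.PROD.ACCTG.RECON.MATCHED,
//            DISP=(NEW,CATLG,DELETE),
//            DCB=(RECFM=FB,LRECL=400,BLKSIZE=27200),
//            SPACE=(CYL,(50,25),RLSE),
//            UNIT=SYSDA
```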

Discussion Questions

  1. The original JCL included CONT.TEST.LOADLIB in the STEPLIB concatenation. Explain the specific chain of events that could lead to a production data corruption from this configuration. Why does the new standard prohibit test libraries in production JOBLIB/STEPLIB?

  2. DD overrides modify the procedure DD parameter by parameter rather than replacing it wholesale. What are the implications of this behavior? Design a scenario where a parameter inherited from the procedure DD, which the override neither replaced nor nullified, would cause a failure.

  3. The restart instructions are documented in JCL comments. What are the limitations of this approach? How would you design an automated restart mechanism that does not depend on operations staff reading and following comment instructions?

  4. The output class scheme uses different classes for different types of output. How does this facilitate operations management in a large shop with hundreds of jobs? What would happen if all output went to the same class?

  5. Compare JOBLIB and STEPLIB in the context of a multi-application job that calls programs from three different application libraries. What are the trade-offs between using a single JOBLIB with three concatenated libraries versus using STEPLIB on each step?

  6. The SET MAXCC = 0 command in the IDCAMS cleanup step forces the return code to zero even if a DELETE fails (because the dataset might not exist). Is this good practice or does it mask real errors? How would you code the cleanup to handle both "dataset not found" (acceptable) and "dataset in use" (a real error) scenarios?