> "Your database is only as available as the weakest link in the chain — hardware, software, network, or human. HADR and replication are how DB2 LUW forges every link in steel."
In This Chapter
- 29.1 HA Concepts for DB2 LUW
- 29.2 HADR Architecture
- 29.3 HADR Synchronization Modes
- 29.4 Setting Up HADR Step by Step
- 29.5 Automatic Client Reroute (ACR)
- 29.6 HADR with Reads on Standby
- 29.7 Db2 pureScale
- 29.8 Q Replication
- 29.9 Change Data Capture (CDC)
- 29.10 Choosing the Right HA Solution
- 29.11 Meridian Bank LUW HA Design
- Spaced Review
- Summary
Chapter 29: HADR and Replication — High Availability for DB2 LUW
"Your database is only as available as the weakest link in the chain — hardware, software, network, or human. HADR and replication are how DB2 LUW forges every link in steel."
When Meridian National Bank's distributed applications — web portals, mobile APIs, analytics platforms — run on DB2 for Linux, UNIX, and Windows, the high availability challenge takes a different shape than on z/OS. There is no coupling facility, no Parallel Sysplex, no hardware-assisted global lock manager. Instead, DB2 LUW provides a rich set of software-based HA technologies: HADR for database-level failover, pureScale for continuous availability, and replication technologies (Q Replication, CDC) for data distribution and disaster recovery.
This chapter explores each of these technologies in depth — how they work internally, how to configure them, and when to choose one over another. By the end, you will be able to design and implement a comprehensive HA strategy for Meridian Bank's LUW environment.
If your primary platform is z/OS, Chapter 28 covers data sharing and Parallel Sysplex. But the concepts here — RPO, RTO, synchronous versus asynchronous replication, automatic failover — are universal. Every DBA should understand them.
29.1 HA Concepts for DB2 LUW
Defining Availability
Availability is measured as the percentage of time a system is operational and accessible:
Availability = (Total Time - Downtime) / Total Time * 100
The industry uses "nines" to express availability targets:
| Availability | Annual Downtime | Classification |
|---|---|---|
| 99% (two nines) | 3.65 days | Basic |
| 99.9% (three nines) | 8.76 hours | Standard |
| 99.99% (four nines) | 52.6 minutes | High availability |
| 99.999% (five nines) | 5.26 minutes | Continuous availability |
| 99.9999% (six nines) | 31.5 seconds | Ultra-high availability |
Meridian National Bank targets four nines (99.99%) for its customer-facing LUW databases — meaning no more than about 53 minutes of unplanned downtime per year.
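The "nines" arithmetic above is easy to verify. A minimal sketch in plain Python, using a 365-day year as the table does:

```python
# Convert an availability target ("nines") into allowed annual downtime,
# matching the availability table above (365-day year assumed).

def annual_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at a given availability %."""
    minutes_per_year = 365 * 24 * 60
    return minutes_per_year * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    # 99.99% works out to about 52.6 minutes/year — Meridian's budget
    print(f"{target}% -> {annual_downtime_minutes(target):.1f} minutes/year")
```

Running this reproduces the table: four nines allow roughly 52.6 minutes of downtime per year, five nines only about 5.3.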
RPO and RTO
Two metrics define HA requirements:
Recovery Point Objective (RPO): How much data can you afford to lose? Measured in time. An RPO of zero means no data loss under any circumstances. An RPO of 5 minutes means you can tolerate losing up to 5 minutes of committed transactions.
Recovery Time Objective (RTO): How quickly must the system be operational after a failure? An RTO of 30 seconds means the database must be accepting connections within 30 seconds of a failure.
| HA Technology | Typical RPO | Typical RTO |
|---|---|---|
| HADR SYNC mode | 0 (zero data loss) | 10-120 seconds |
| HADR NEARSYNC mode | Near-zero | 10-120 seconds |
| HADR ASYNC mode | Seconds | 10-120 seconds |
| HADR SUPERASYNC mode | Seconds to minutes | 10-120 seconds |
| Db2 pureScale | 0 (zero data loss) | Seconds |
| Q Replication | Seconds | Minutes (manual failover) |
| Backup/Restore | Hours | Hours |
The HA Spectrum
HA solutions exist on a spectrum from simple to sophisticated:
| | Backup/Restore | Log Shipping | HADR ASYNC | HADR SYNC | pureScale |
|---|---|---|---|---|---|
| RPO | Hours | Minutes | Seconds | Zero | Zero |
| RTO | Hours | Hour+ | < 2 min | < 2 min | Seconds |
| Relative cost | $ | $$ | $$$ | $$$$ | $$$$$ |
The right choice depends on the application's RPO/RTO requirements, budget, and operational complexity tolerance.
Types of Failures
Understanding what can fail is essential for selecting the right HA solution. Failures fall into several categories:
Hardware failures:
- Disk failure (single drive or entire storage array)
- Memory failure (ECC error, DIMM failure)
- CPU failure
- Network interface or switch failure
- Power supply failure
- Complete server failure

Software failures:
- DB2 engine crash (trap, assertion failure)
- Operating system crash (kernel panic)
- Storage driver or filesystem corruption
- Application-induced problems (runaway queries, connection exhaustion)

Human errors:
- Accidental DROP TABLE or DELETE without WHERE clause
- Misconfigured parameters causing performance collapse
- Incorrect maintenance procedures
- Unauthorized schema changes

Environmental failures:
- Data center power outage
- Cooling failure
- Natural disaster (flood, earthquake, fire)
- Network backbone failure
Each HA technology addresses different failure types. HADR handles hardware and software failures through standby failover. A delayed standby or point-in-time recovery addresses human errors. Geographic replication (Q Replication, CDC, or HADR ASYNC to a remote standby) addresses environmental failures. A comprehensive HA strategy layers multiple technologies to cover all categories.
Planned vs. Unplanned Downtime
Total downtime has two components:
Unplanned downtime: Caused by failures — hardware, software, human error. HADR and pureScale reduce this by providing rapid failover or continuous availability.
Planned downtime: Caused by maintenance — DB2 upgrades, OS patches, hardware replacement, schema changes. In a single-server environment, every maintenance activity that requires a database restart contributes to planned downtime. HADR with rolling upgrades and pureScale with online member maintenance can reduce planned downtime to near-zero.
For Meridian Bank, historical analysis shows that planned downtime (quarterly maintenance windows) accounted for 80% of total annual downtime before HADR was implemented. Eliminating planned downtime through HADR with reads-on-standby was actually more impactful than the unplanned failure protection.
29.2 HADR Architecture
Overview
High Availability Disaster Recovery (HADR) is DB2 LUW's primary HA feature. It maintains a synchronized copy of a database on a standby server by continuously shipping and replaying transaction log records from the primary to the standby.
The fundamental architecture:
┌─────────────┐ Log shipping ┌─────────────┐
│ PRIMARY │ ─────────────────────────► │ STANDBY │
│ Database │ TCP/IP │ Database │
│ │ ◄───────────────────────── │ │
│ (Read/ │ Acknowledgment │ (Log │
│ Write) │ │ Replay) │
└─────────────┘ └─────────────┘
│ │
▼ ▼
┌────────┐ ┌────────┐
│ Disk │ │ Disk │
│ Storage│ │ Storage│
└────────┘ └────────┘
Key architectural elements:
- Primary database — the active database serving all read/write traffic
- Standby database — receives log records from the primary and replays them continuously
- Log shipping — transaction log records are sent from primary to standby over TCP/IP
- Log replay — the standby applies log records to keep its database synchronized
- Takeover — when the primary fails, the standby is promoted to become the new primary
Log Shipping Mechanics
HADR ships log records at the log buffer level, not at the log file level. This means:
- Records are shipped as soon as they are written to the log buffer — before they are flushed to the log file on disk
- This provides much lower latency than file-level log shipping
- The primary maintains a TCP/IP connection to the standby and sends log records continuously
On the standby side:
- Log records arrive over the TCP connection
- They are written to the standby's local log files
- A log replay process reads the log files and applies the changes to the standby database
- The standby database pages are updated to reflect the primary's state
Three Standby Roles
DB2 10.1 and later support up to three standby databases in different roles:
Principal Standby
The primary failover target. It receives log records directly from the primary and can perform automatic takeover. There is exactly one principal standby.
Auxiliary Standby
Additional standbys that also receive log records from the primary. They provide extra redundancy but cannot perform automatic takeover. An auxiliary standby can be promoted to principal if the original principal fails. Up to two auxiliary standbys are supported.
Delayed Standby (Replay Delay)
A standby configured with a replay delay — it receives log records in real time but waits a configured interval (e.g., 1 hour) before replaying them. This protects against logical corruption: if someone accidentally drops a table, the delayed standby still has the table for up to the delay interval.
┌──────────────────┐
┌────►│ Principal Standby│ (immediate replay)
│ └──────────────────┘
┌─────────┐ │ ┌──────────────────┐
│ PRIMARY │───┼────►│ Auxiliary Standby │ (immediate replay)
└─────────┘ │ └──────────────────┘
│ ┌──────────────────┐
└────►│ Delayed Standby │ (replay delay: 1 hour)
└──────────────────┘
State Machine: HADR States
The primary and standby go through several states:
| State | Description |
|---|---|
| LOCAL_CATCHUP | Standby is replaying log records from its own local log files (initial catch-up) |
| REMOTE_CATCHUP_PENDING | Standby is connected but has not yet begun receiving logs from the primary |
| REMOTE_CATCHUP | Standby is receiving and replaying log records shipped from the primary's log path |
| PEER | Standby has caught up and receives log records directly from the primary's in-memory log buffer |
| DISCONNECTED_PEER | Connection lost while in PEER state; the primary blocks commits until the peer window expires |
| DISCONNECTED | TCP connection lost (or peer window expired); primary continues, standby waits for reconnection |
The target operational state is PEER — meaning the standby has replayed all received log records and is in sync with the primary within the bounds of the synchronization mode.
HADR Internals: Log Shipping Pipeline
Understanding the internal pipeline helps diagnose performance issues and plan capacity for the HADR infrastructure:
On the primary:
1. A transaction writes log records to the log buffer (in memory)
2. The HADR sender thread reads log records from the log buffer (or from log files if catching up)
3. Log records are sent over the TCP connection to the standby
4. Depending on the sync mode, the primary may wait for acknowledgment before completing the commit

On the standby:
1. The HADR receiver thread accepts log records from the TCP connection
2. Log records are written to the standby's local log files
3. The log replay thread reads log records from the local log files
4. Log records are applied to the standby database, updating pages in memory and flushing to disk
The pipeline has three potential bottlenecks:
- Network bandwidth: If the primary generates log records faster than the network can carry them, the standby falls behind. A busy OLTP system generating 50 MB/s of log data needs at least a 500 Mbps network link (with overhead), though 1 Gbps or 10 Gbps is recommended.
- Standby I/O capacity: The standby must write log records to disk (for durability) and apply changes to database pages. If the standby's storage is slower than the primary's, the replay falls behind.
- Standby CPU capacity: Log replay is CPU-intensive, especially for index maintenance. If the standby has fewer CPU resources than the primary, replay may lag.
IBM recommends that the standby have identical hardware specifications to the primary. This ensures that log replay can keep pace under all conditions and that the standby provides the same performance as the primary after a takeover.
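The network-bandwidth rule of thumb above can be expressed as a quick sizing calculation. A sketch — the 20% protocol-overhead factor is an assumption for illustration, not an HADR constant:

```python
# Back-of-the-envelope HADR network sizing: convert the primary's log
# generation rate (megabytes/s) into the minimum link bandwidth needed
# to keep the standby current. Overhead factor is an assumed allowance
# for TCP/HADR framing, not a DB2-documented value.

def required_link_mbps(log_rate_mb_per_s: float, overhead: float = 0.20) -> float:
    """Minimum network bandwidth in megabits/s for a given log rate."""
    raw_mbps = log_rate_mb_per_s * 8       # megabytes/s -> megabits/s
    return raw_mbps * (1 + overhead)       # allow for protocol overhead

# The 50 MB/s OLTP example: ~480 Mbps required, so a 500 Mbps link is
# the floor and 1-10 Gbps gives comfortable headroom.
print(required_link_mbps(50))
```

The same arithmetic, run in reverse, tells you the maximum sustainable log rate for a given WAN link before the standby starts to fall behind.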
The HADR_PEER_WINDOW Parameter
The HADR_PEER_WINDOW parameter deserves special attention because it affects availability behavior during network disruptions.
When the primary loses contact with the standby (network failure, standby crash), the primary enters the DISCONNECTED_PEER state. During the peer window interval:
- The primary continues to behave as if it were in PEER state: commits wait for standby acknowledgment, so transactions block while the standby is unreachable
- Zero-data-loss semantics are preserved — no transaction commits without the standby's acknowledgment
- If the standby reconnects within the peer window, log shipping resumes, blocked commits complete, and the pair catches up seamlessly
After the peer window expires:
- The primary transitions to DISCONNECTED state
- The primary no longer guarantees zero data loss (because the standby may have missed log records)
- Operations continue normally, but the RPO guarantee is temporarily suspended
A peer window of 0 means the primary immediately transitions to DISCONNECTED — no grace period. A peer window of 120 seconds (Meridian's setting) means the primary waits 2 minutes before giving up on the standby. Setting this value requires balancing:
- Too short: Brief network glitches cause unnecessary loss of PEER state
- Too long: During a real standby outage, commits block for the full peer window before the primary gives up, extending the effective outage for applications
29.3 HADR Synchronization Modes
The synchronization mode determines how the primary database interacts with the standby during transaction commits. This is the most critical HADR configuration decision because it directly affects:
- Data loss risk (RPO) — can committed transactions be lost if the primary fails?
- Transaction latency — does the primary wait for the standby during commit?
- Availability behavior — what happens if the standby becomes unreachable?
SYNC (Synchronous)
How it works: When a transaction commits on the primary, the commit does not complete until the standby has written the log records to its local disk and acknowledged receipt.
RPO: Zero. If the primary crashes at any point after a commit returns to the application, the standby has the log records on disk.
Latency impact: The commit must wait for:
1. Network round-trip to the standby (0.1-5 ms depending on distance)
2. Standby's disk write of the log records
3. Acknowledgment back to the primary
Typical penalty: 1-10 ms per commit, depending on network latency and standby storage speed.
Availability risk: If the standby becomes unreachable, the primary continues in disconnected peer state for a configurable timeout, after which it continues standalone. During the unreachable period, all commits block until the timeout expires.
Best for: Environments where zero data loss is mandatory and the standby is in the same data center (low network latency).
NEARSYNC (Near-Synchronous)
How it works: The commit does not complete until the standby has received the log records in memory (but not necessarily written them to disk).
RPO: Near-zero. Data loss only occurs in the extraordinarily unlikely scenario where the primary crashes AND the standby crashes (or loses power) simultaneously, before the standby flushes the received log records to disk.
Latency impact: Slightly less than SYNC because the standby does not perform a disk flush before acknowledging. The primary waits only for the network round-trip plus memory receipt.
Typical penalty: 0.5-5 ms per commit.
Best for: Most production environments. The marginal RPO risk compared to SYNC is negligible, while the performance improvement is measurable. This is the mode Meridian Bank uses.
ASYNC (Asynchronous)
How it works: The primary sends log records to the standby but does not wait for any acknowledgment before completing the commit. Log records are shipped as fast as the network allows, but the primary is always "ahead" of the standby.
RPO: Seconds. The amount of data loss depends on how far behind the standby is — typically a few seconds of transactions.
Latency impact: Zero. The commit path is identical to a non-HADR database. There is no added latency.
Best for: Environments where some data loss is tolerable (e.g., non-financial applications) or where the standby is geographically distant (cross-continent), making SYNC/NEARSYNC latency unacceptable.
SUPERASYNC
How it works: Similar to ASYNC, but with even fewer guarantees. The primary ships log records on a best-effort basis and never enters PEER state with the standby. If the network is slow or congested, the primary does not throttle itself, so the log gap between primary and standby is allowed to grow without bound.
RPO: Seconds to minutes. Under normal conditions, similar to ASYNC. Under network stress, the gap widens.
Latency impact: Zero.
Best for: Disaster recovery standbys at very long distances where network bandwidth is limited. SUPERASYNC is typically used for the auxiliary or delayed standby, not the principal standby.
Comparison Table
| Attribute | SYNC | NEARSYNC | ASYNC | SUPERASYNC |
|---|---|---|---|---|
| RPO | 0 | Near-zero | Seconds | Seconds-minutes |
| Commit latency added | 1-10 ms | 0.5-5 ms | 0 | 0 |
| Standby ack required | Disk write | Memory receive | No | No |
| Network disruption behavior | Commits block (then timeout) | Commits block (then timeout) | Commits continue | Commits continue |
| Typical distance | Same data center | Same metro area | Cross-region | Cross-continent |
| Real-world latency (100km) | ~3 ms added | ~2 ms added | None | None |
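The latency column of the comparison table can be captured in a rough mental model: SYNC pays the network round-trip plus the standby's log flush, NEARSYNC pays only the round-trip, and the asynchronous modes pay nothing. An illustrative sketch, not DB2's internal algorithm:

```python
# Toy model of the added commit latency per HADR sync mode, following
# the comparison table above. Inputs are assumptions you measure in
# your own environment (ping RTT, standby log-device flush time).

def added_commit_latency_ms(mode: str, rtt_ms: float, standby_flush_ms: float) -> float:
    mode = mode.upper()
    if mode == "SYNC":
        return rtt_ms + standby_flush_ms   # ack only after standby disk write
    if mode == "NEARSYNC":
        return rtt_ms                      # ack on memory receipt
    if mode in ("ASYNC", "SUPERASYNC"):
        return 0.0                         # commit never waits for the standby
    raise ValueError(f"unknown HADR sync mode: {mode}")

# Same-metro standby: ~2 ms RTT, ~1 ms standby log flush
for m in ("SYNC", "NEARSYNC", "ASYNC", "SUPERASYNC"):
    print(m, added_commit_latency_ms(m, rtt_ms=2.0, standby_flush_ms=1.0))
```

Plugging in your own measured RTT makes the SYNC-versus-NEARSYNC trade-off concrete before you commit to a mode.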
29.4 Setting Up HADR Step by Step
Prerequisites
Before configuring HADR, ensure:
- Identical database schema — the standby must be a restored image of the primary (same structure, same data)
- Compatible DB2 versions — primary and standby should run the same DB2 version (minor version differences are tolerated during rolling upgrades)
- Network connectivity — TCP/IP connectivity between primary and standby with sufficient bandwidth for the log generation rate
- Sufficient storage — the standby needs at least as much log and data storage as the primary
- Same operating system family — the primary and standby can be different Linux distributions but must be the same OS type (Linux-to-Linux, not Linux-to-Windows)
Step 1: Take a Backup of the Primary
# On the primary server
db2 BACKUP DATABASE MERIDIAN TO /db2backup COMPRESS
Step 2: Restore the Backup on the Standby
# On the standby server
db2 RESTORE DATABASE MERIDIAN FROM /db2backup REPLACE EXISTING
Step 3: Configure Database Parameters on the Primary
# On the primary server
db2 UPDATE DB CFG FOR MERIDIAN USING \
HADR_LOCAL_HOST primary.meridian.bank \
HADR_LOCAL_SVC 55001 \
HADR_REMOTE_HOST standby.meridian.bank \
HADR_REMOTE_SVC 55001 \
HADR_REMOTE_INST db2inst1 \
HADR_SYNCMODE NEARSYNC \
HADR_PEER_WINDOW 120 \
HADR_TIMEOUT 120 \
LOGINDEXBUILD ON
Step 4: Configure Database Parameters on the Standby
# On the standby server
db2 UPDATE DB CFG FOR MERIDIAN USING \
HADR_LOCAL_HOST standby.meridian.bank \
HADR_LOCAL_SVC 55001 \
HADR_REMOTE_HOST primary.meridian.bank \
HADR_REMOTE_SVC 55001 \
HADR_REMOTE_INST db2inst1 \
HADR_SYNCMODE NEARSYNC \
HADR_PEER_WINDOW 120 \
HADR_TIMEOUT 120 \
LOGINDEXBUILD ON
Step 5: Start HADR on the Standby First
# On the standby server — start standby FIRST
db2 DEACTIVATE DATABASE MERIDIAN
db2 START HADR ON DATABASE MERIDIAN AS STANDBY
Step 6: Start HADR on the Primary
# On the primary server
db2 START HADR ON DATABASE MERIDIAN AS PRIMARY
Step 7: Verify HADR Status
# On either server
db2pd -db MERIDIAN -hadr
# Or via SQL
db2 "SELECT HADR_ROLE, HADR_STATE, HADR_SYNCMODE,
HADR_CONNECT_STATUS, HADR_LOG_GAP,
PRIMARY_LOG_FILE, STANDBY_LOG_FILE,
PRIMARY_LOG_PAGE, STANDBY_LOG_PAGE
FROM TABLE(MON_GET_HADR(NULL))"
Expected output when fully synchronized:
HADR_ROLE = PRIMARY
HADR_STATE = PEER
HADR_SYNCMODE = NEARSYNC
HADR_CONNECT_STATUS = CONNECTED
HADR_LOG_GAP = 0
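A monitoring script can parse "NAME = VALUE" output like the above and alert whenever the pair leaves its healthy state. A minimal sketch — the `hadr_healthy` helper and its criteria are illustrative assumptions, not an IBM-supplied check:

```python
# Parse "NAME = VALUE" lines (as produced by the status query above)
# and flag the pair as healthy only when connected, in PEER state, and
# with no log gap. Sample text and criteria are illustrative.

def parse_hadr_status(text: str) -> dict:
    status = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            status[key.strip()] = value.strip()
    return status

def hadr_healthy(status: dict) -> bool:
    return (status.get("HADR_STATE") == "PEER"
            and status.get("HADR_CONNECT_STATUS") == "CONNECTED"
            and status.get("HADR_LOG_GAP") == "0")

sample = """\
HADR_ROLE = PRIMARY
HADR_STATE = PEER
HADR_SYNCMODE = NEARSYNC
HADR_CONNECT_STATUS = CONNECTED
HADR_LOG_GAP = 0
"""
print(hadr_healthy(parse_hadr_status(sample)))
```

In practice you would feed this the output of a scheduled `db2 "SELECT ... FROM TABLE(MON_GET_HADR(NULL))"` and page the on-call DBA on a `False`.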
Step 8: Test Takeover
A planned takeover (graceful):
# On the standby server
db2 TAKEOVER HADR ON DATABASE MERIDIAN
A forced takeover (when the primary is unreachable):
# On the standby server
db2 TAKEOVER HADR ON DATABASE MERIDIAN BY FORCE
Using db2haicu for Automated Setup
IBM provides the DB2 High Availability Instance Configuration Utility (db2haicu) for automated HADR configuration, particularly when integrated with cluster managers like IBM Tivoli System Automation, Pacemaker, or PowerHA:
db2haicu
# Interactive wizard that configures:
# - HADR parameters
# - Cluster manager integration
# - Automatic failover policies
# - Quorum device configuration
Important HADR Database Configuration Parameters
Beyond the core HADR parameters, several database configuration parameters affect HADR behavior and should be reviewed:
| Parameter | Recommended Value | Reason |
|---|---|---|
| LOGINDEXBUILD | ON | Ensures index builds are logged and replayed on standby |
| LOGARCHMETH1 | DISK:/archive/path | Required; HADR needs archival logging |
| LOGFILSIZ | Match primary and standby | Log file sizes must be identical |
| LOGPRIMARY | Sufficient for peak | Prevents log-full conditions during peak write periods |
| LOGSECOND | Generous (e.g., 100) | Provides buffer for long-running transactions |
| HADR_SPOOL_LIMIT | AUTOMATIC or a specific number of 4 KB pages | Caps log spooling on the standby when replay falls behind |
The HADR_SPOOL_LIMIT parameter is particularly important. When log replay on the standby cannot keep pace with the rate at which log records arrive, the standby spools the received-but-not-yet-replayed records to disk. If replay stalls for long enough (e.g., during a massive batch load on the primary or under heavy reads-on-standby traffic), an unlimited spool can fill the standby's storage. Setting HADR_SPOOL_LIMIT caps how much log data the standby will spool; once the limit is reached, the standby stops accepting new log records, which in SYNC and NEARSYNC modes back-pressures the primary.
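One way to size the spool limit is to multiply the peak log generation rate by the longest replay or shipping stall you are willing to absorb. A sketch, assuming the limit is expressed in 4 KB pages and using illustrative rate/outage numbers:

```python
# Sizing sketch for HADR_SPOOL_LIMIT: spool capacity = peak log rate
# multiplied by the tolerated stall. Assumes the parameter is given in
# 4 KB pages; the rate and stall values below are illustrative.

def spool_limit_pages(peak_log_mb_per_s: float, tolerated_stall_s: int) -> int:
    """4 KB pages needed to buffer tolerated_stall_s of peak log volume."""
    spool_mb = peak_log_mb_per_s * tolerated_stall_s
    return int(spool_mb * 1024 // 4)   # MB -> KB -> 4 KB pages

# Tolerate a 10-minute stall at a 20 MB/s peak log rate
print(spool_limit_pages(20, 600))
```

Whatever value the arithmetic produces, verify that the standby's log spool filesystem actually has that much free space before applying it.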
Takeover Types Explained
There are three distinct takeover operations, each for a different scenario:
Graceful takeover (TAKEOVER HADR ON DB ... ): Used for planned maintenance. The primary and standby coordinate the role switch. Both databases remain consistent, and the old primary becomes the new standby automatically. Zero data loss guaranteed. Requires both primary and standby to be connected and in PEER state.
Forced takeover (TAKEOVER HADR ON DB ... BY FORCE): Used when the primary is unreachable. The standby promotes itself to primary without coordinating with the old primary. Data loss may occur if the standby was not fully caught up. When the old primary comes back online, it cannot automatically become a standby — it must be reinitialized from a fresh backup of the new primary.
Forced takeover with peer window active (TAKEOVER HADR ON DB ... BY FORCE PEER WINDOW ONLY): A safer variant of forced takeover. This command only succeeds if the standby is within the peer window — meaning it was recently in PEER state and the data loss exposure is minimal. If the peer window has expired, the command fails, and the DBA must make an explicit decision to proceed with a full forced takeover.
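The decision logic behind the three variants fits in a few lines. This is a conceptual sketch — the function and its arguments are illustrative, not a DB2 API:

```python
# Conceptual model of the three TAKEOVER variants described above:
# graceful requires a synchronized (PEER) pair; BY FORCE PEER WINDOW
# ONLY additionally requires that the peer window has not yet expired;
# plain BY FORCE always proceeds. Names/arguments are illustrative.

def takeover_allowed(kind: str, state: str, seconds_since_peer: float,
                     peer_window_s: int) -> bool:
    if kind == "graceful":
        return state == "PEER"                       # coordinated role switch
    if kind == "force_peer_window_only":
        return seconds_since_peer <= peer_window_s   # minimal data-loss exposure
    if kind == "force":
        return True                                  # data loss is the DBA's call
    raise ValueError(f"unknown takeover kind: {kind}")

print(takeover_allowed("force_peer_window_only", "DISCONNECTED_PEER", 90, 120))
print(takeover_allowed("force_peer_window_only", "DISCONNECTED", 300, 120))
```

The PEER WINDOW ONLY variant is the one to script into automated failover: it succeeds exactly when the exposure is bounded, and its failure forces a human decision.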
29.5 Automatic Client Reroute (ACR)
The Problem
HADR solves the server-side failover problem — the standby takes over as the new primary. But what about the clients? After a takeover, every application connection to the old primary is broken. Applications need to reconnect to the new primary.
Without automatic client reroute, every application would need custom reconnection logic — detecting the broken connection, looking up the new primary's address, and reconnecting. This is error-prone and slow.
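Conceptually, the reroute loop the driver runs on the application's behalf looks like this sketch; `reroute_connect` and `fake_connect` are illustrative stand-ins, not the actual driver implementation:

```python
# Sketch of the reconnection logic ACR performs for the client: cycle
# through the primary and its alternates with bounded retries.
# connect_fn stands in for the real driver's connection attempt.

def reroute_connect(servers, connect_fn, max_retries=5):
    """Try each (host, port) in order, up to max_retries passes."""
    last_error = ConnectionError("no servers configured")
    for _ in range(max_retries):
        for host, port in servers:
            try:
                return connect_fn(host, port)
            except ConnectionError as exc:
                last_error = exc       # remember failure, try next server
    raise last_error

# Simulated failover: the old primary refuses, the standby (now the
# new primary after takeover) accepts.
def fake_connect(host, port):
    if host == "primary.meridian.bank":
        raise ConnectionError("connection refused")
    return f"connected to {host}:{port}"

servers = [("primary.meridian.bank", 50000), ("standby.meridian.bank", 50000)]
print(reroute_connect(servers, fake_connect))
```

ACR packages exactly this kind of loop (plus retry intervals and timeouts) inside the driver, so every application gets it without custom code.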
ACR Configuration
Automatic Client Reroute (ACR) is built into the DB2 client driver. It works by providing the client with alternate server information so that when a connection fails, the client automatically reconnects to the new primary.
Server-Side Configuration
On the primary database, configure the alternate server:
# On the primary server
db2 UPDATE ALTERNATE SERVER FOR DATABASE MERIDIAN \
USING HOSTNAME standby.meridian.bank PORT 50000
When a client connects to the primary, DB2 sends the alternate server information to the client driver. The client stores this information and uses it if the primary connection fails.
Client-Side Configuration
The client can also be pre-configured with alternate server information in the db2dsdriver.cfg file:
<!-- db2dsdriver.cfg on the client -->
<configuration>
  <dsncollection>
    <dsn alias="MERIDIAN" name="MERIDIAN" host="primary.meridian.bank" port="50000"/>
  </dsncollection>
  <databases>
    <database name="MERIDIAN" host="primary.meridian.bank" port="50000">
      <acr>
        <parameter name="enableAcr" value="true"/>
        <parameter name="enableSeamlessAcr" value="true"/>
        <parameter name="maxAcrRetries" value="5"/>
        <parameter name="acrRetryInterval" value="10"/>
        <alternateserverlist>
          <server name="standby" hostname="standby.meridian.bank" port="50000"/>
        </alternateserverlist>
      </acr>
    </database>
  </databases>
</configuration>
Seamless ACR
Standard ACR reconnects the client after a failover, but in-flight transactions are rolled back. The application receives an SQL error (-30108) and must retry the transaction.
Seamless ACR goes further. For read-only transactions or transactions that have not yet performed any updates, seamless ACR can reconnect transparently — the application does not see an error. This requires:
- enableSeamlessAcr=true in the driver configuration
- The transaction must be eligible (read-only or no work done yet)
Client Driver Support
ACR is supported by all modern DB2 client drivers:
| Driver | ACR Support | Seamless ACR |
|---|---|---|
| DB2 CLI / ODBC | Yes | Yes |
| JDBC Type 4 (JCC) | Yes | Yes |
| .NET Data Provider | Yes | Yes |
| Node.js ibm_db | Yes | Yes |
| Python ibm_db | Yes | Yes |
| Go go_ibm_db | Yes | Limited |
Workload Balancing (WLB)
When multiple members are available (e.g., in a pureScale environment or when reads-on-standby is configured), the DB2 client can distribute connections across members for load balancing:
<database name="MERIDIAN" host="primary.meridian.bank" port="50000">
<wlb>
<enablewlb>TRUE</enablewlb>
<maxconnections>0</maxconnections>
</wlb>
</database>
WLB queries the server for member weights and distributes new connections accordingly. This is particularly valuable for read-heavy workloads where the standby can offload reporting queries.
29.6 HADR with Reads on Standby
The Value Proposition
In a traditional HADR configuration, the standby database sits idle — it replays log records but serves no application traffic. This is wasteful. The standby has CPU, memory, and storage that could serve read-only workloads.
Reads on Standby (ROS) allows applications to run read-only queries against the HADR standby database while it continues to replay log records from the primary. This is valuable for:
- Reporting and analytics — offload long-running reports from the primary
- Data validation — verify data integrity without impacting production
- Development/testing — read production data for testing purposes
- Disaster preparedness — ensure the standby is functional by actually using it
How It Works
When a client connects to the standby, it receives a read-only connection. The standby continues replaying log records in the background. Read-only queries run under the uncommitted read (UR) isolation level — the only isolation level an active standby supports — so they may observe changes that are still uncommitted on the primary.
# Connect to the standby for read-only queries
db2 CONNECT TO MERIDIAN # standby automatically provides read-only access
# Execute read-only queries
db2 "SELECT * FROM ACCOUNTS WHERE BRANCH_ID = 100"
# Attempt to write — this will fail
db2 "INSERT INTO ACCOUNTS VALUES (...)"
# SQL1773N The statement or command requires functionality that is not
# supported on a read-enabled HADR standby database
Isolation Level Considerations
On the standby, the available isolation levels are restricted:
| Isolation Level | Available on Standby? | Notes |
|---|---|---|
| UR (Uncommitted Read) | Yes | The only supported level; may see uncommitted data from in-flight transactions on the primary |
| CS (Cursor Stability) | No | Requests fail with SQL1773N unless the DB2_STANDBY_ISO registry variable forces a downgrade to UR |
| RS (Read Stability) | No | Not supported on standby |
| RR (Repeatable Read) | No | Not supported on standby |
The restriction exists because the standby cannot take locks in the traditional sense — it is replaying log records, and lock conflicts with replay would cause the standby to stall.
Replay Delay Consideration
If the standby is configured with a replay delay (delayed standby), reads will see data that is behind the primary by the delay interval. This can be useful:
- Accidental deletion recovery: If someone drops a table at 2:00 PM, the delayed standby (with a 1-hour delay) still has the table until 3:00 PM
- Consistent historical view: Reporting queries see a stable point-in-time, not a moving target
But it can also be confusing — an application querying the standby will see data that is "old" compared to the primary. Ensure that applications and users understand this when using delayed standbys for reporting.
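The recovery window math is simple but worth making explicit. A sketch of the drop-at-2:00-PM example above:

```python
# Timing sketch for a delayed standby: an object dropped at time T on
# the primary stays recoverable on the standby until the drop record
# reaches the replay point, i.e. roughly T + the configured delay.

from datetime import datetime, timedelta

def recoverable_until(drop_time: datetime, replay_delay: timedelta) -> datetime:
    """Latest moment the dropped object still exists on the delayed standby."""
    return drop_time + replay_delay

# Table dropped at 2:00 PM with a 1-hour replay delay
deadline = recoverable_until(datetime(2024, 3, 1, 14, 0), timedelta(hours=1))
print(deadline)
```

The operational corollary: monitoring must detect the accidental drop, and the DBA must react, strictly inside that window — which is the real argument for choosing a delay longer than your mean time to notice a mistake.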
Performance Impact on Log Replay
Read-only queries on the standby compete with log replay for CPU and I/O resources. Under heavy read load:
- Log replay may slow down, causing the standby to fall behind the primary
- This increases the HADR log gap and, in SYNC/NEARSYNC modes, could impact primary performance
Monitor the HADR_LOG_GAP metric closely when enabling reads on standby. If the gap grows unacceptably, consider:
- Limiting the number of concurrent read connections to the standby
- Scheduling heavy reporting queries during off-peak hours
- Using an auxiliary standby for reads while keeping the principal standby dedicated to replay
29.7 Db2 pureScale
Overview
While HADR provides failover with minimal downtime, Db2 pureScale provides continuous availability — there is no failover because multiple members are active simultaneously, sharing the same data. It is the LUW equivalent of z/OS data sharing (Chapter 28), though the implementation differs significantly.
Architecture
pureScale uses a shared-everything architecture:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Member 1 │ │ Member 2 │ │ Member 3 │
│ (Active) │ │ (Active) │ │ (Active) │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
└──────┬──────┴──────┬──────┘
│ │
┌─────┴─────┐ ┌────┴──────┐
│ CF Node │ │ CF Node │
│ (Primary) │ │ (Standby) │
└─────┬─────┘ └────┬──────┘
│ │
└──────┬──────┘
│
┌──────┴──────┐
│ Shared Disk │
│ (GPFS) │
└─────────────┘
Key components:
Member Nodes
Each member is an active DB2 instance that can serve read/write traffic. All members access the same database on shared storage. Unlike HADR (where only the primary is active), all pureScale members are active simultaneously.
Cluster Caching Facility (CF) Nodes
CF nodes serve a role analogous to the z/OS coupling facility — they manage a global lock manager and a shared buffer pool (group buffer pool). CF nodes:
- Centralized lock management — all lock requests go through the CF
- Page coherency — the CF ensures that all members see consistent data
- Fast inter-member communication — using high-speed interconnects (InfiniBand or RDMA over Converged Ethernet)
There are typically two CF nodes — a primary and a standby. If the primary CF fails, the standby takes over.
Shared Storage (GPFS / Spectrum Scale)
All member nodes and CF nodes access the same storage through IBM Spectrum Scale (GPFS). The shared file system ensures that all members can read and write the same database files, log files, and configuration files.
How pureScale Differs from z/OS Data Sharing
| Aspect | z/OS Data Sharing | Db2 pureScale |
|---|---|---|
| Platform | z/OS on mainframe | Linux on x86/POWER |
| Coupling facility | Dedicated hardware or LPAR | Software CF process on dedicated node |
| Interconnect | CF links (InfiniBand, ISC) | InfiniBand or RoCE |
| Shared storage | DASD (ECKD) via FICON | GPFS via SAN |
| Max members | 32 | 128 |
| Lock management | Hardware-assisted in CF | Software-based in CF node |
| Maturity | 30+ years (since 1994) | Since 2009 |
Failure and Recovery
When a member node fails in pureScale:
- The CF detects the failure (heartbeat timeout)
- Other members continue serving traffic immediately — no failover needed
- The CF initiates member recovery — replaying the failed member's log to roll back in-flight transactions
- Client connections to the failed member are rerouted to surviving members via ACR
When a CF node fails:
- The standby CF takes over automatically
- Lock state and group buffer pool contents are rebuilt from member state
- Brief pause (seconds) while the CF transitions
When to Choose pureScale
pureScale is appropriate when:
- Zero downtime is required — no planned or unplanned outages
- Read/write scalability is needed — multiple active members serve the workload
- The budget supports it — pureScale requires multiple servers, InfiniBand networking, shared storage, and IBM licensing for each member
pureScale is not appropriate when:
- Budget is limited — HADR is much simpler and cheaper
- The workload is modest — a single server with HADR standby may suffice
- Geographic DR is the primary requirement — pureScale requires low-latency interconnects (same data center)
pureScale Operational Considerations
Operating a pureScale cluster requires specialized skills:
Adding and removing members: New members can be added to a running cluster without downtime. The db2iupdt command adds a member, and the CF automatically begins managing locks and cache for the new member. Similarly, members can be removed for maintenance.
CF node maintenance: The standby CF can be taken down for maintenance while the primary CF continues operating. After maintenance, the standby CF is resynchronized. To maintain the primary CF, roles are first swapped (the standby becomes primary) before the original primary is taken down.
Shared storage management: All members access the database through IBM Spectrum Scale (GPFS). The shared file system must be monitored for performance and capacity. GPFS provides its own HA through data replication across storage nodes.
Client routing with WLB: The DB2 client driver's workload balancing (WLB) feature distributes connections across pureScale members based on member weights. This provides automatic load balancing and transparent member failover at the client level.
Combining pureScale with HADR: For geographic DR, pureScale can be combined with HADR. The primary pureScale cluster replicates to a remote standby (either single-member or another pureScale cluster) via HADR ASYNC. This provides both continuous availability within the data center (pureScale) and geographic disaster protection (HADR).
29.8 Q Replication
Overview
Q Replication is an IBM replication technology that captures changes from a source database and applies them to a target database using IBM MQ (Message Queue) as the transport. It is designed for:
- Low-latency replication — changes are captured and applied in near-real-time
- Bidirectional replication — two databases can both accept writes and replicate changes to each other
- Cross-platform replication — DB2 LUW to DB2 z/OS, DB2 to Oracle, etc.
- Multi-target replication — one source can replicate to multiple targets
Architecture
┌─────────────┐ ┌──────────┐ ┌─────────────┐
│ SOURCE │ │ IBM MQ │ │ TARGET │
│ Database │ │ │ │ Database │
│ │ │ │ │ │
│ ┌─────────┐ │ Msgs │ ┌──────┐ │ Msgs │ ┌─────────┐ │
│ │ Q │─┼───────►│ │Queue │─┼───────►│ │ Q │ │
│ │ Capture │ │ │ └──────┘ │ │ │ Apply │ │
│ └─────────┘ │ │ │ │ └─────────┘ │
│ ▲ │ └──────────┘ │ │ │
│ │ │ │ ▼ │
│ DB2 Log │ │ Tables │
└─────────────┘ └─────────────┘
Q Capture
The Q Capture program reads the DB2 transaction log on the source database. For each committed transaction that affects a subscribed table, it constructs a message containing the change data and publishes it to an MQ queue.
Key characteristics:
- Log-based capture — no triggers, no additional writes to the source database
- Transactional consistency — changes are captured at commit boundaries
- Minimal source overhead — the capture program reads the log asynchronously
MQ Transport
IBM MQ provides the messaging infrastructure:
- Guaranteed delivery — messages are persistent and transactional
- Network independence — source and target can be on different networks, data centers, or continents
- Buffering — if the target is temporarily unavailable, messages queue up in MQ
Q Apply
The Q Apply program reads messages from the MQ queue and applies the changes to the target database. It:
- Preserves transaction boundaries — changes are applied in commit-consistent groups
- Handles conflicts — in bidirectional mode, conflicts are detected and resolved
- Maintains referential integrity — dependent changes are applied in the correct order
Bidirectional Replication
Q Replication's most powerful feature is bidirectional replication — two databases that both accept writes and replicate to each other. This enables:
- Active-active configurations — both sites serve read/write traffic
- Geographic distribution — users connect to the nearest site
- Load distribution — write traffic is split across sites
However, bidirectional replication introduces conflicts: what happens when the same row is updated at both sites simultaneously?
Conflict Resolution
Q Replication provides configurable conflict detection and resolution:
- Detection: Q Apply detects that a row has been modified at both source and target since the last replication
- Resolution strategies:
- Source wins — the incoming change overwrites the local change
- Target wins — the local change is preserved, incoming change is discarded
- Timestamp-based — the most recent change (by timestamp) wins
- Custom stored procedure — a user-defined procedure examines both versions and decides
For Meridian Bank, a source-wins strategy is typically used for reference data (branch addresses, interest rates), while transactional data (account balances, transaction records) uses application-level coordination to avoid conflicts entirely.
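The resolution strategies above can be modeled as a small decision function. This is an illustrative sketch, not Q Apply's actual implementation; the row dictionaries and the `updated_at` field are assumptions made for the example.

```python
from datetime import datetime, timezone

def resolve_conflict(incoming, local, strategy="timestamp"):
    """Pick the winning row version for a bidirectional-replication conflict.

    Strategies mirror the options described above: 'source' (incoming
    change wins), 'target' (local change wins), 'timestamp' (most
    recent change wins). A custom-procedure strategy would plug in here.
    """
    if strategy == "source":
        return incoming
    if strategy == "target":
        return local
    if strategy == "timestamp":
        return incoming if incoming["updated_at"] >= local["updated_at"] else local
    raise ValueError(f"unknown strategy: {strategy}")

# Example: the same branch-address row was updated at both sites.
site_a = {"branch_id": 7, "address": "12 Main St",
          "updated_at": datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)}
site_b = {"branch_id": 7, "address": "14 Main St",
          "updated_at": datetime(2024, 5, 1, 10, 5, tzinfo=timezone.utc)}

winner = resolve_conflict(site_b, site_a, strategy="timestamp")
print(winner["address"])  # the later change (site_b) wins
```

The timestamp strategy assumes the two sites' clocks are closely synchronized — one reason source-wins is often preferred for reference data.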
Latency and Throughput
Typical Q Replication latency:
| Scenario | Capture Latency | MQ Transport | Apply Latency | End-to-End |
|---|---|---|---|---|
| Same data center | <100 ms | <10 ms | <100 ms | <300 ms |
| Cross-region (500km) | <100 ms | 5-20 ms | <100 ms | <500 ms |
| Cross-continent | <100 ms | 50-200 ms | <100 ms | 200 ms - 1 sec |
Throughput depends on the transaction rate, row size, and MQ channel bandwidth. In practice, Q Replication can sustain thousands of transactions per second.
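The bandwidth dependence can be made concrete with a back-of-the-envelope bound. The figures below (channel speed, average change size) are illustrative assumptions, and the calculation ignores protocol overhead and batching.

```python
def replication_throughput_bound(bandwidth_mbps, avg_change_bytes):
    """Upper bound on changes/second an MQ channel can carry,
    ignoring protocol overhead and batching (illustrative only)."""
    bytes_per_sec = bandwidth_mbps * 1_000_000 / 8
    return bytes_per_sec / avg_change_bytes

# A 100 Mbps channel with ~1 KB per change row:
print(int(replication_throughput_bound(100, 1024)))  # ≈ 12207 changes/sec
```

Even this modest channel supports the "thousands of transactions per second" figure, so in practice Q Apply or the source transaction rate is usually the bottleneck before the network is.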
29.9 Change Data Capture (CDC)
Overview
Change Data Capture (CDC) — now part of IBM InfoSphere Data Replication — is a log-based replication technology that captures row-level changes from DB2 and delivers them to diverse targets in real-time. While Q Replication is specifically designed for DB2-to-DB2 replication via MQ, CDC is broader:
- Diverse targets: Kafka, flat files, HDFS, data warehouses, other RDBMS
- Flexible formats: JSON, Avro, CSV, relational row format
- Transformation capability: Data can be filtered, transformed, and enriched during replication
Architecture
┌─────────────┐ ┌─────────────┐ ┌─────────────────┐
│ SOURCE │ │ CDC Engine │ │ TARGETS │
│ DB2 LUW │ │ │ │ │
│ │ │ ┌─────────┐ │ │ ┌───────────┐ │
│ Transaction├────►│ │ Capture │ │────►│ │ Kafka │ │
│ Log │ │ │ Engine │ │ │ └───────────┘ │
│ │ │ └─────────┘ │ │ ┌───────────┐ │
│ │ │ ┌─────────┐ │ │ │ Data │ │
│ │ │ │ Apply │─┼────►│ │ Warehouse │ │
│ │ │ │ Engine │ │ │ └───────────┘ │
│ │ │ └─────────┘ │ │ ┌───────────┐ │
│ │ │ │ │ │ Flat Files│ │
│ │ │ │────►│ └───────────┘ │
└─────────────┘ └─────────────┘ └─────────────────┘
Log-Based Capture
Like Q Replication, CDC reads the DB2 transaction log — it does not use triggers or polling. This is critical for production databases because:
- No additional I/O on the source database — the capture reads log files that DB2 has already written
- No schema changes required — no trigger DDL, no audit columns
- Minimal latency — changes are captured within milliseconds of the commit
- Complete change record — the log contains before-images and after-images of every changed row
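The before-image/after-image pair is what makes a log-derived change record self-describing. The record shape below is an assumption for illustration, not a documented CDC wire format, but it shows why downstream consumers can compute deltas without querying the source.

```python
# Illustrative shape of a log-derived change record: the DB2 log gives
# CDC both the before-image and after-image of an updated row.
change = {
    "table": "MERIDIAN.ACCOUNTS",
    "op": "UPDATE",
    "before": {"ACCT_ID": 1001, "BALANCE": 2500.00},
    "after":  {"ACCT_ID": 1001, "BALANCE": 1900.00},
}

def changed_columns(rec):
    """Columns whose value differs between the before- and after-image."""
    return {k: (rec["before"][k], rec["after"][k])
            for k in rec["after"]
            if rec["before"].get(k) != rec["after"][k]}

print(changed_columns(change))  # {'BALANCE': (2500.0, 1900.0)}
```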
Integration with Apache Kafka
One of the most common CDC patterns today is streaming DB2 changes to Apache Kafka for consumption by downstream systems:
DB2 Transaction Log
│
▼
CDC Capture Engine
│
▼
Kafka Producer
│
▼
┌──────────────────────────────────────┐
│ Apache Kafka │
│ Topic: meridian.accounts.changes │
│ Topic: meridian.transactions.changes│
└──────┬──────────────┬────────────────┘
│ │
▼ ▼
Real-time Data Lake
Analytics (S3/HDFS)
Each committed transaction becomes one or more Kafka messages. Consumers subscribe to the topics and process changes in real-time. This pattern is commonly known as event streaming or event-driven architecture.
For Meridian Bank, the CDC-to-Kafka pipeline feeds:
- Fraud detection engine — analyzes transactions in real-time to detect anomalies
- Data warehouse — incrementally loads transaction data for analytics
- Regulatory reporting — captures account changes for compliance audit trails
- Customer notification service — triggers alerts for large withdrawals, low balances, etc.
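A consumer on the far side of such a pipeline is essentially a dispatcher keyed by topic. The sketch below models that routing with an in-memory message; the JSON payload, topic names beyond those in the diagram, and handler functions are illustrative assumptions — a real deployment would use a Kafka client library.

```python
import json

def handle_fraud(evt):        return f"fraud-check acct {evt['acct']}"
def handle_notification(evt): return f"notify acct {evt['acct']}"

# Which downstream handlers care about which CDC topics.
HANDLERS = {
    "meridian.transactions.changes": [handle_fraud, handle_notification],
    "meridian.accounts.changes":     [handle_notification],
}

def dispatch(topic, raw_message):
    """Decode one CDC message and fan it out to the topic's handlers."""
    evt = json.loads(raw_message)
    return [handler(evt) for handler in HANDLERS.get(topic, [])]

results = dispatch("meridian.transactions.changes",
                   '{"acct": 1001, "amount": -5000}')
print(results)  # both the fraud and notification handlers fire
```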
CDC vs. Q Replication
| Aspect | CDC | Q Replication |
|---|---|---|
| Primary use case | Data distribution to diverse targets | DB2-to-DB2 replication |
| Transport | Kafka, TCP/IP, flat files | IBM MQ |
| Bidirectional | No (unidirectional) | Yes |
| Target diversity | Any database, Kafka, files, HDFS | DB2 (and some others via adapters) |
| Conflict resolution | Not applicable (unidirectional) | Built-in |
| Transformation | Built-in ETL capabilities | Limited |
| Latency | Sub-second | Sub-second |
| Licensing | InfoSphere Data Replication | DB2 Advanced Edition or standalone |
ETL Patterns with CDC
CDC enables real-time ETL — replacing traditional batch ETL jobs that extract, transform, and load data on a schedule (e.g., nightly). With CDC:
- Extract: Changes are captured from the log in real-time
- Transform: The CDC engine applies mappings, filters, and transformations
- Load: Changed rows are applied to the target in near-real-time
This eliminates the batch window and provides the data warehouse with fresh data continuously. Meridian's analytics team can see a transaction in the data warehouse within seconds of it being committed in the operational database.
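The transform step in this pattern is typically a per-row filter-and-map. The sketch below shows the idea under assumed column names and rules (excluding test accounts, masking a tax ID); it is not the CDC engine's transformation language.

```python
def transform(row):
    """Filter/map one changed row before it reaches the warehouse."""
    if row["ACCT_TYPE"] == "TEST":
        return None  # internal test accounts are filtered out
    out = dict(row)
    out["TAX_ID"] = "***-**-" + row["TAX_ID"][-4:]  # mask PII
    return out

rows = [
    {"ACCT_ID": 1, "ACCT_TYPE": "CHECKING", "TAX_ID": "123-45-6789"},
    {"ACCT_ID": 2, "ACCT_TYPE": "TEST",     "TAX_ID": "000-00-0000"},
]
loaded = [t for t in (transform(r) for r in rows) if t is not None]
print(loaded[0]["TAX_ID"])  # ***-**-6789
```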
CDC Implementation Considerations
Deploying CDC in production requires attention to several practical considerations:
Log retention: The CDC capture engine reads the DB2 transaction log. If the capture engine is stopped or falls behind, DB2 must retain the unprocessed log records. This means the log archive retention policy must be sufficient to allow the CDC engine to catch up after a restart. Meridian Bank retains archive logs for 7 days specifically to support CDC recovery.
Schema evolution: When the source table's schema changes (ADD COLUMN, ALTER COLUMN), the CDC subscription must be updated to include the new columns. Some CDC products handle this automatically; others require manual subscription updates. IBM InfoSphere Data Replication provides auto-adaptation for most DDL changes.
Large object (LOB) handling: LOB columns can dramatically increase the volume of change data. If a table has a 10 MB LOB column and the LOB is updated, the entire LOB value flows through the CDC pipeline. For Meridian Bank, document storage tables with large LOBs are excluded from CDC; only the metadata columns are replicated.
Initial load: When a new table is added to CDC, the existing data must be loaded into the target before incremental changes can be applied. The CDC engine provides an initial load capability that reads the full table and publishes the rows as INSERT messages. During the initial load, the capture engine continues collecting incremental changes, which are applied after the load completes. This avoids any gap in the change stream.
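The load-then-apply ordering described above can be sketched as follows. This models only the handoff guarantee — buffered incremental changes are applied after the snapshot, in order — and is not the CDC engine's implementation.

```python
def initial_load_with_buffer(snapshot_rows, changes_during_load):
    """Snapshot first, then replay the changes buffered during the load."""
    target = {}
    # 1. Load the snapshot as INSERTs.
    for row in snapshot_rows:
        target[row["id"]] = row
    # 2. Apply the changes collected while the load ran, in order.
    for change in changes_during_load:
        if change["op"] == "DELETE":
            target.pop(change["id"], None)
        else:  # INSERT or UPDATE
            target[change["id"]] = {"id": change["id"], **change["row"]}
    return target

snapshot = [{"id": 1, "bal": 100}, {"id": 2, "bal": 200}]
buffered = [{"op": "UPDATE", "id": 1, "row": {"bal": 150}},
            {"op": "DELETE", "id": 2}]
result = initial_load_with_buffer(snapshot, buffered)
print(result)  # {1: {'id': 1, 'bal': 150}}
```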
Monitoring and alerting: CDC pipelines should be monitored for:
- Capture latency (time between a commit and the capture engine processing it)
- Apply latency (time between a commit and the data arriving at the target)
- Error rates (deserialization errors, target write failures)
- Throughput (messages per second, bytes per second)
IBM provides the InfoSphere Data Replication Management Console for CDC monitoring, and most organizations also integrate CDC metrics into their standard monitoring platforms (Prometheus, Grafana, Datadog).
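The two latency metrics reduce to timestamp arithmetic once a change's commit, capture, and apply times are known. The SLA thresholds below are illustrative assumptions, not IBM defaults.

```python
from datetime import datetime, timedelta, timezone

def latency_alerts(commit_ts, captured_ts, applied_ts,
                   capture_sla=timedelta(seconds=10),
                   apply_sla=timedelta(seconds=30)):
    """Check one change's capture and apply latency against SLAs."""
    alerts = []
    if captured_ts - commit_ts > capture_sla:
        alerts.append("capture latency over SLA")
    if applied_ts - commit_ts > apply_sla:
        alerts.append("apply latency over SLA")
    return alerts

commit = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
alerts = latency_alerts(commit,
                        commit + timedelta(seconds=2),    # captured quickly
                        commit + timedelta(seconds=45))   # applied late
print(alerts)  # only the apply-latency threshold is breached
```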
29.10 Choosing the Right HA Solution
Decision Framework
Choosing the right HA solution requires evaluating multiple factors. Here is a structured decision framework:
Factor 1: RPO Requirement
| RPO | Recommended Solutions |
|---|---|
| Zero (no data loss) | HADR SYNC/NEARSYNC, pureScale |
| Seconds | HADR ASYNC, Q Replication |
| Minutes | HADR SUPERASYNC, CDC |
| Hours | Log shipping, backup/restore |
Factor 2: RTO Requirement
| RTO | Recommended Solutions |
|---|---|
| Seconds (no perceptible outage) | pureScale |
| Under 2 minutes | HADR with automatic failover |
| Under 30 minutes | HADR with manual takeover, Q Replication |
| Hours | Backup/restore |
Factor 3: Budget and Complexity
| Solution | Infrastructure Cost | Operational Complexity | Licensing Cost |
|---|---|---|---|
| HADR | Low (2 servers) | Low | Included in DB2 |
| pureScale | High (multiple servers + InfiniBand + shared storage) | High | Premium licensing |
| Q Replication | Medium (2 servers + MQ infrastructure) | Medium | Additional licensing |
| CDC | Medium (2 servers + Kafka optional) | Medium | InfoSphere licensing |
Factor 4: Geographic Distribution
| Scenario | Recommended Solutions |
|---|---|
| Same data center | pureScale, HADR SYNC |
| Same metro area (<100km) | HADR NEARSYNC |
| Cross-region (100-1000km) | HADR ASYNC, Q Replication |
| Cross-continent (>1000km) | HADR SUPERASYNC, Q Replication, CDC |
Decision Matrix for Meridian National Bank
Meridian evaluates each technology for different workloads:
| Workload | RPO | RTO | Solution Selected | Rationale |
|---|---|---|---|---|
| Core banking (OLTP) | Near-zero | <2 min | HADR NEARSYNC | Best cost/benefit ratio |
| Customer portal (web) | Near-zero | <2 min | HADR NEARSYNC (same instance) | Shared with core banking |
| Analytics/reporting | Minutes | Hours | CDC to data warehouse | Different technology stack |
| Mobile API | Near-zero | <30 sec | HADR + ACR + reads-on-standby | Fast reroute + standby for reads |
| Regulatory archive | Hours | Hours | Backup/restore | Cost-effective for cold data |
| Cross-site DR | Seconds | <30 min | HADR ASYNC (auxiliary standby at DR site) | Distance precludes SYNC |
Combining Solutions
These technologies are not mutually exclusive. A comprehensive HA strategy often combines multiple approaches:
┌─────────────────┐
│ PRIMARY DB │
│ (Production) │
└────┬───┬───┬────┘
│ │ │
NEARSYNC │ │ │ Log-based
HADR │ │ │ CDC
│ │ │
┌──────────────┘ │ └──────────────┐
▼ │ ▼
┌─────────────────┐ │ ┌─────────────────┐
│ PRINCIPAL │ │ │ KAFKA CLUSTER │
│ STANDBY │ │ │ (CDC target) │
│ (same city) │ │ └────────┬────────┘
│ + Reads for │ │ │
│ reporting │ ASYNC ▼
└─────────────────┘ HADR ┌─────────────────┐
│ │ DATA WAREHOUSE │
▼ │ FRAUD ENGINE │
┌─────────────────┐ │ NOTIFICATIONS │
│ AUXILIARY │ └─────────────────┘
│ STANDBY │
│ (DR site, 500km)│
└─────────────────┘
29.11 Meridian Bank LUW HA Design
Environment Overview
Meridian National Bank's LUW environment runs on:
- Production server: IBM Power Systems S1022, 32 cores, 512 GB RAM, running RHEL 8.6
- Standby server: Identical hardware, same data center (different rack, different power feed)
- DR server: IBM Power Systems in a data center 400 km away
- DB2 version: Db2 11.5.8
- Database: MERIDIAN (core banking), 2 TB
HADR Configuration: NEARSYNC with Principal and Auxiliary Standbys
Meridian deploys HADR with two standbys:
Principal Standby (Same Data Center)
- Sync mode: NEARSYNC
- Purpose: Primary failover target, reads-on-standby for reporting
- RPO: Near-zero
- RTO: <60 seconds with automatic failover via Pacemaker
Auxiliary Standby (DR Site, 400 km Away)
- Sync mode: ASYNC
- Purpose: Disaster recovery
- RPO: 2-5 seconds
- RTO: <30 minutes (manual takeover after site declaration)
Data Center A Data Center B (DR)
┌──────────────────────────────────┐ ┌──────────────────┐
│ │ │ │
│ ┌──────────┐ ┌──────────┐ │ │ ┌──────────┐ │
│ │ PRIMARY │───►│ PRINCIPAL│ │ │ │ AUXILIARY│ │
│ │ MERIDIAN │ │ STANDBY │ │ │ │ STANDBY │ │
│ │ │ │ (ROS) │ │ │ │ (DR) │ │
│ └──────────┘ └──────────┘ │ │ └──────────┘ │
│ │ NEARSYNC │ │ ▲ │
│ │ │ │ │ │
│ └─────────────────────────┼───┼───────┘ │
│ ASYNC │ │ │
│ │ │ │
│ Pacemaker cluster (VIP) │ │ │
│ ┌──────────────────────────┐ │ │ │
│ │ VIP: 10.100.1.100 │ │ │ │
│ │ Monitors HADR state │ │ │ │
│ │ Triggers automatic │ │ │ │
│ │ takeover if primary fails│ │ │ │
│ └──────────────────────────┘ │ │ │
└──────────────────────────────────┘ └──────────────────┘
ACR Configuration for Applications
All Meridian applications use the following db2dsdriver.cfg:
<configuration>
<dsncollection>
<dsn alias="MERIDIAN" name="MERIDIAN"
host="10.100.1.100" port="50000">
<parameter name="enableACR" value="true"/>
<parameter name="enableSeamlessACR" value="true"/>
</dsn>
</dsncollection>
<databases>
<database name="MERIDIAN" host="10.100.1.100" port="50000">
<acr>
<enableacr>TRUE</enableacr>
<enableseamlessacr>TRUE</enableseamlessacr>
<acrtimeout>30</acrtimeout>
<maxacrretries>3</maxacrretries>
<acrretryinterval>10</acrretryinterval>
<alternateserverlist>
<server name="standby1"
hostname="10.100.1.101" port="50000"/>
</alternateserverlist>
</acr>
</database>
</databases>
</configuration>
The VIP (10.100.1.100) is managed by Pacemaker and moves to the standby server during failover. Applications connect to the VIP and use ACR as a secondary mechanism.
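It is worth sizing the worst case that these retry settings imply. The arithmetic below assumes the parameters behave as their names suggest — a per-attempt timeout, a fixed number of retries, and a fixed interval between them — which should be verified against the driver documentation for your client version.

```python
# Values from the db2dsdriver.cfg above.
acr_timeout_s = 30
max_retries = 3
retry_interval_s = 10

# Worst case: every attempt runs to its timeout, with the retry
# interval elapsing between consecutive attempts.
worst_case_s = max_retries * acr_timeout_s + (max_retries - 1) * retry_interval_s
print(worst_case_s)  # 110 seconds before ACR gives up
```

A bound like this matters when setting application-side timeouts: if the application gives up after 60 seconds, the later ACR retries can never succeed from its point of view.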
Reads-on-Standby for Reporting
Meridian's reporting applications connect directly to the principal standby:
# Reporting application connection string
DATABASE=MERIDIAN;HOSTNAME=10.100.1.101;PORT=50000;
CURRENTSCHEMA=MERIDIAN;CONCURRENTACCESSRESOLUTION=WAITFOROUTCOME;
Reports running on the standby include:
- Daily transaction summaries
- Customer account statements
- Branch performance dashboards
- Regulatory compliance reports (that do not need real-time data)
This offloads approximately 30% of the read workload from the primary, freeing CPU and I/O capacity for OLTP transactions.
CDC to Data Warehouse
Meridian deploys CDC to stream changes from the MERIDIAN database to its enterprise data warehouse:
DB2 MERIDIAN (Primary)
│
▼ (log reader)
CDC Capture Engine
│
▼
Apache Kafka (3-node cluster)
Topics:
meridian.accounts
meridian.transactions
meridian.customers
meridian.loans
│
├──► Kafka Connect → Data Warehouse (Db2 Warehouse)
├──► Fraud Detection Engine (Kafka Streams)
└──► Customer Notification Service (Spring Boot consumer)
The CDC pipeline achieves:
- End-to-end latency: <2 seconds from commit on primary to availability in Kafka
- Throughput: up to 50,000 changes per second
- Availability: Kafka's replication provides its own HA; if the CDC engine fails, it resumes from the last checkpoint in the DB2 log
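The checkpoint-based resume behavior can be sketched as follows. This is a simplified model — real DB2 log addressing (LSNs) and checkpoint persistence are more involved — but it shows why a restarted capture engine neither loses nor duplicates changes.

```python
class CaptureEngine:
    """Toy capture engine that checkpoints the last processed log position."""

    def __init__(self, checkpoint_lsn=0):
        self.checkpoint_lsn = checkpoint_lsn
        self.published = []

    def run(self, log_records):
        """Process (lsn, change) pairs newer than the checkpoint."""
        for lsn, change in log_records:
            if lsn <= self.checkpoint_lsn:
                continue  # already processed before the failure
            self.published.append(change)
            self.checkpoint_lsn = lsn  # persisted after each publish

log = [(1, "ins A"), (2, "upd A"), (3, "del B")]
engine = CaptureEngine()
engine.run(log[:2])          # engine stops here (simulated failure)
restarted = CaptureEngine(engine.checkpoint_lsn)
restarted.run(log)           # re-reads the log; records 1-2 are skipped
print(restarted.published)   # ['del B']
```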
Monitoring and Alerting
Meridian monitors the following HADR metrics continuously:
| Metric | Source | Alert Threshold |
|---|---|---|
| HADR state | MON_GET_HADR() | State != PEER for >30 seconds |
| Log gap (bytes) | MON_GET_HADR() | >100 MB |
| Log gap (time) | MON_GET_HADR() | >5 seconds |
| Connection status | MON_GET_HADR() | DISCONNECTED |
| Replay lag on standby | db2pd -hadr | >60 seconds |
| CDC latency | CDC dashboard | >10 seconds |
| Kafka consumer lag | Kafka metrics | >1000 messages |
Alerts feed into Meridian's enterprise monitoring system (IBM Instana / Prometheus + Grafana) with escalation to the on-call DBA within 5 minutes.
Disaster Recovery Procedure
In the event of a complete Data Center A failure:
- Declaration: Operations declares a disaster (may involve regulatory coordination)
- DNS update: Application DNS records are updated to point to Data Center B
- Forced takeover: run db2 TAKEOVER HADR ON DATABASE MERIDIAN BY FORCE on the auxiliary standby at the DR site
- Application restart: Applications reconnect to the DR database
- Verify: DBAs verify data integrity and confirm the RPO (check last committed transaction)
- Target RTO: 30 minutes from declaration to applications serving traffic at the DR site
Annual DR Testing
Meridian conducts a full DR test annually:
- Redirect a subset of test traffic to the DR standby
- Perform a controlled takeover at the DR site
- Run application validation tests
- Measure actual RTO and RPO
- Document findings and update the DR runbook
- Reverse the roles and restore normal operation
Spaced Review
From Chapter 3: SQL Foundations
Connection: Every SQL statement in your application — the SELECTs that fetch account balances, the INSERTs that record transactions, the UPDATEs that modify customer records — depends on the HA infrastructure described in this chapter. HADR transparently replicates these SQL operations from the primary to the standby by shipping the log records they generate. When ACR reroutes your application to the standby after a failure, the SQL interface is identical. The application does not know it is talking to a different server.
From Chapter 18: Concurrency and Locking
Connection: HADR's reads-on-standby feature (Section 29.6) restricts isolation levels to UR and CS. In Chapter 18, you learned that RS and RR provide stronger consistency guarantees through lock retention. On the standby, these stronger levels are not available because the standby does not participate in the primary's lock management. If your reporting queries require RS or RR semantics, they must run on the primary. Understanding the isolation level spectrum from Chapter 18 is essential for designing a reads-on-standby strategy.
From Chapter 20: Application Development
Connection: Chapter 20 covered application development patterns including connection management, error handling, and transaction design. HADR and ACR require applications to handle specific error codes gracefully. SQL error -30108 indicates that the connection was rerouted — the application should retry the transaction. Long-running transactions delay log replay on the standby and increase the HADR log gap. The transaction design principles from Chapter 20 — keep transactions short, commit frequently — are doubly important in an HADR environment.
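The retry pattern for a rerouted connection looks like the sketch below. The `RerouteError` class and the `run_transaction` callable are illustrative stand-ins for a real driver's exception and your transaction logic — actual drivers surface SQLCODE -30108 through their own error types.

```python
class RerouteError(Exception):
    """Stand-in for the driver error carrying SQLCODE -30108."""
    sqlcode = -30108

def with_reroute_retry(run_transaction, max_attempts=3):
    """Replay a transaction when ACR reroutes the connection mid-flight."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_transaction()
        except RerouteError:
            if attempt == max_attempts:
                raise  # give up; surface the error to the caller

attempts = []
def txn():
    attempts.append(1)
    if len(attempts) == 1:
        raise RerouteError()  # first try hits the failover window
    return "committed"

print(with_reroute_retry(txn))  # retried once, then committed
```

Note that only the whole transaction is safe to replay — retrying a single statement after reroute would lose the earlier, rolled-back statements of the unit of work.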
Summary
DB2 LUW provides a comprehensive portfolio of high availability technologies, each serving different points on the RPO/RTO/cost spectrum. HADR is the workhorse — a mature, well-understood technology that provides near-zero data loss and sub-minute recovery for most enterprise workloads. pureScale delivers continuous availability for the most demanding environments where even seconds of downtime are unacceptable. Q Replication enables bidirectional, active-active configurations and cross-platform data distribution. CDC feeds real-time change streams to modern data platforms like Kafka, enabling event-driven architectures.
For Meridian National Bank, the combination of HADR NEARSYNC (for primary HA), HADR ASYNC (for geographic DR), reads-on-standby (for reporting offload), and CDC-to-Kafka (for real-time data distribution) provides a robust, cost-effective HA strategy that meets the bank's availability, performance, and regulatory requirements.
No single technology solves every problem. The art of HA design lies in understanding your requirements — RPO, RTO, budget, operational capability — and selecting the right combination of technologies. This chapter has given you the knowledge to make those decisions with confidence.