> "Your database is only as available as the weakest link in the chain — hardware, software, network, or human. HADR and replication are how DB2 LUW forges every link in steel."
In This Chapter
- 29.1 HA Concepts for DB2 LUW
- 29.2 HADR Architecture
- 29.3 HADR Synchronization Modes
- 29.4 Setting Up HADR Step by Step
- 29.5 Automatic Client Reroute (ACR)
- 29.6 HADR with Reads on Standby
- 29.7 Db2 pureScale
- 29.8 Q Replication
- 29.9 Change Data Capture (CDC)
- 29.10 Choosing the Right HA Solution
- 29.11 Meridian Bank LUW HA Design
- Spaced Review
- Summary
Chapter 29: HADR and Replication — High Availability for DB2 LUW
"Your database is only as available as the weakest link in the chain — hardware, software, network, or human. HADR and replication are how DB2 LUW forges every link in steel."
When Meridian National Bank's distributed applications — web portals, mobile APIs, analytics platforms — run on DB2 for Linux, UNIX, and Windows, the high availability challenge takes a different shape than on z/OS. There is no coupling facility, no Parallel Sysplex, no hardware-assisted global lock manager. Instead, DB2 LUW provides a rich set of software-based HA technologies: HADR for database-level failover, pureScale for continuous availability, and replication technologies (Q Replication, CDC) for data distribution and disaster recovery.
This chapter explores each of these technologies in depth — how they work internally, how to configure them, and when to choose one over another. By the end, you will be able to design and implement a comprehensive HA strategy for Meridian Bank's LUW environment.
If your primary platform is z/OS, Chapter 28 covers data sharing and Parallel Sysplex. But the concepts here — RPO, RTO, synchronous versus asynchronous replication, automatic failover — are universal. Every DBA should understand them.
29.1 HA Concepts for DB2 LUW
Defining Availability
Availability is measured as the percentage of time a system is operational and accessible:
Availability = (Total Time - Downtime) / Total Time * 100
The industry uses "nines" to express availability targets:
| Availability | Annual Downtime | Classification |
|---|---|---|
| 99% (two nines) | 3.65 days | Basic |
| 99.9% (three nines) | 8.76 hours | Standard |
| 99.99% (four nines) | 52.6 minutes | High availability |
| 99.999% (five nines) | 5.26 minutes | Continuous availability |
| 99.9999% (six nines) | 31.5 seconds | Ultra-high availability |
Meridian National Bank targets four nines (99.99%) for its customer-facing LUW databases — meaning no more than about 53 minutes of unplanned downtime per year.
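The "nines" arithmetic above is easy to verify. A minimal sketch in plain Python, using a 365-day year as the table does:

```python
# Convert an availability target ("nines") into allowed annual downtime,
# matching the availability table above (365-day year assumed).

def annual_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at a given availability %."""
    minutes_per_year = 365 * 24 * 60
    return minutes_per_year * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    # 99.99% works out to about 52.6 minutes/year — Meridian's budget
    print(f"{target}% -> {annual_downtime_minutes(target):.1f} minutes/year")
```

Running this reproduces the table: four nines allow roughly 52.6 minutes of downtime per year, five nines only about 5.3.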
RPO and RTO
Two metrics define HA requirements:
Recovery Point Objective (RPO): How much data can you afford to lose? Measured in time. An RPO of zero means no data loss under any circumstances. An RPO of 5 minutes means you can tolerate losing up to 5 minutes of committed transactions.
Recovery Time Objective (RTO): How quickly must the system be operational after a failure? An RTO of 30 seconds means the database must be accepting connections within 30 seconds of a failure.
| HA Technology | Typical RPO | Typical RTO |
|---|---|---|
| HADR SYNC mode | 0 (zero data loss) | 10-120 seconds |
| HADR NEARSYNC mode | Near-zero | 10-120 seconds |
| HADR ASYNC mode | Seconds | 10-120 seconds |
| HADR SUPERASYNC mode | Seconds to minutes | 10-120 seconds |
| Db2 pureScale | 0 (zero data loss) | Seconds |
| Q Replication | Seconds | Minutes (manual failover) |
| Backup/Restore | Hours | Hours |
The HA Spectrum
HA solutions exist on a spectrum from simple to sophisticated:
| | Backup/Restore | Log Shipping | HADR ASYNC | HADR SYNC | pureScale |
|---|---|---|---|---|---|
| RPO | Hours | Minutes | Seconds | Zero | Zero |
| RTO | Hours | Hour+ | < 2 min | < 2 min | Seconds |
| Relative cost | $ | $$ | $$$ | $$$$ | $$$$$ |
The right choice depends on the application's RPO/RTO requirements, budget, and operational complexity tolerance.
Types of Failures
Understanding what can fail is essential for selecting the right HA solution. Failures fall into several categories:
Hardware failures:
- Disk failure (single drive or entire storage array)
- Memory failure (ECC error, DIMM failure)
- CPU failure
- Network interface or switch failure
- Power supply failure
- Complete server failure

Software failures:
- DB2 engine crash (trap, assertion failure)
- Operating system crash (kernel panic)
- Storage driver or filesystem corruption
- Application-induced problems (runaway queries, connection exhaustion)

Human errors:
- Accidental DROP TABLE or DELETE without WHERE clause
- Misconfigured parameters causing performance collapse
- Incorrect maintenance procedures
- Unauthorized schema changes

Environmental failures:
- Data center power outage
- Cooling failure
- Natural disaster (flood, earthquake, fire)
- Network backbone failure
Each HA technology addresses different failure types. HADR handles hardware and software failures through standby failover. A delayed standby or point-in-time recovery addresses human errors. Geographic replication (Q Replication, CDC, or HADR ASYNC to a remote standby) addresses environmental failures. A comprehensive HA strategy layers multiple technologies to cover all categories.
Planned vs. Unplanned Downtime
Total downtime has two components:
Unplanned downtime: Caused by failures — hardware, software, human error. HADR and pureScale reduce this by providing rapid failover or continuous availability.
Planned downtime: Caused by maintenance — DB2 upgrades, OS patches, hardware replacement, schema changes. In a single-server environment, every maintenance activity that requires a database restart contributes to planned downtime. HADR with rolling upgrades and pureScale with online member maintenance can reduce planned downtime to near-zero.
For Meridian Bank, historical analysis shows that planned downtime (quarterly maintenance windows) accounted for 80% of total annual downtime before HADR was implemented. Eliminating planned downtime through HADR with reads-on-standby was actually more impactful than the unplanned failure protection.
29.2 HADR Architecture
Overview
High Availability Disaster Recovery (HADR) is DB2 LUW's primary HA feature. It maintains a synchronized copy of a database on a standby server by continuously shipping and replaying transaction log records from the primary to the standby.
The fundamental architecture:
┌─────────────┐ Log shipping ┌─────────────┐
│ PRIMARY │ ─────────────────────────► │ STANDBY │
│ Database │ TCP/IP │ Database │
│ │ ◄───────────────────────── │ │
│ (Read/ │ Acknowledgment │ (Log │
│ Write) │ │ Replay) │
└─────────────┘ └─────────────┘
│ │
▼ ▼
┌────────┐ ┌────────┐
│ Disk │ │ Disk │
│ Storage│ │ Storage│
└────────┘ └────────┘
Key architectural elements:
- Primary database — the active database serving all read/write traffic
- Standby database — receives log records from the primary and replays them continuously
- Log shipping — transaction log records are sent from primary to standby over TCP/IP
- Log replay — the standby applies log records to keep its database synchronized
- Takeover — when the primary fails, the standby is promoted to become the new primary
Log Shipping Mechanics
HADR ships log records at the log buffer level, not at the log file level. This means:
- Records are shipped as soon as they are written to the log buffer — before they are flushed to the log file on disk
- This provides much lower latency than file-level log shipping
- The primary maintains a TCP/IP connection to the standby and sends log records continuously
On the standby side:
- Log records arrive over the TCP connection
- They are written to the standby's local log files
- A log replay process reads the log files and applies the changes to the standby database
- The standby database pages are updated to reflect the primary's state
Three Standby Roles
DB2 10.1 and later support up to three standby databases in different roles:
Principal Standby
The primary failover target. It receives log records directly from the primary and can perform automatic takeover. There is exactly one principal standby.
Auxiliary Standby
Additional standbys that also receive log records from the primary. They provide extra redundancy but cannot perform automatic takeover. An auxiliary standby can be promoted to principal if the original principal fails. Up to two auxiliary standbys are supported.
Delayed Standby (Replay Delay)
A standby configured with a replay delay — it receives log records in real time but waits a configured interval (e.g., 1 hour) before replaying them. This protects against logical corruption: if someone accidentally drops a table, the delayed standby still has the table for up to the delay interval.
┌──────────────────┐
┌────►│ Principal Standby│ (immediate replay)
│ └──────────────────┘
┌─────────┐ │ ┌──────────────────┐
│ PRIMARY │───┼────►│ Auxiliary Standby │ (immediate replay)
└─────────┘ │ └──────────────────┘
│ ┌──────────────────┐
└────►│ Delayed Standby │ (replay delay: 1 hour)
└──────────────────┘
State Machine: HADR States
The primary and standby go through several states:
| State | Description |
|---|---|
| LOCAL_CATCHUP | Standby is replaying log records from its own local log files (initial catch-up) |
| REMOTE_CATCHUP_PENDING | Standby is connected but has not yet begun receiving logs from the primary |
| REMOTE_CATCHUP | Standby is receiving and replaying log records shipped from the primary's log path |
| PEER | Standby has caught up and receives log records directly from the primary's in-memory log buffer |
| DISCONNECTED_PEER | Connection lost while in PEER state; the primary blocks commits until the peer window expires |
| DISCONNECTED | TCP connection lost (or peer window expired); primary continues, standby waits for reconnection |
The target operational state is PEER — meaning the standby has replayed all received log records and is in sync with the primary within the bounds of the synchronization mode.
HADR Internals: Log Shipping Pipeline
Understanding the internal pipeline helps diagnose performance issues and plan capacity for the HADR infrastructure:
On the primary:
1. A transaction writes log records to the log buffer (in memory)
2. The HADR sender thread reads log records from the log buffer (or from log files if catching up)
3. Log records are sent over the TCP connection to the standby
4. Depending on the sync mode, the primary may wait for acknowledgment before completing the commit

On the standby:
1. The HADR receiver thread accepts log records from the TCP connection
2. Log records are written to the standby's local log files
3. The log replay thread reads log records from the local log files
4. Log records are applied to the standby database, updating pages in memory and flushing to disk
The pipeline has three potential bottlenecks:
- Network bandwidth: If the primary generates log records faster than the network can carry them, the standby falls behind. A busy OLTP system generating 50 MB/s of log data needs at least a 500 Mbps network link (with overhead), though 1 Gbps or 10 Gbps is recommended.
- Standby I/O capacity: The standby must write log records to disk (for durability) and apply changes to database pages. If the standby's storage is slower than the primary's, the replay falls behind.
- Standby CPU capacity: Log replay is CPU-intensive, especially for index maintenance. If the standby has fewer CPU resources than the primary, replay may lag.
IBM recommends that the standby have identical hardware specifications to the primary. This ensures that log replay can keep pace under all conditions and that the standby provides the same performance as the primary after a takeover.
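The network-bandwidth rule of thumb above can be expressed as a quick sizing calculation. A sketch — the 20% protocol-overhead factor is an assumption for illustration, not an HADR constant:

```python
# Back-of-the-envelope HADR network sizing: convert the primary's log
# generation rate (megabytes/s) into the minimum link bandwidth needed
# to keep the standby current. Overhead factor is an assumed allowance
# for TCP/HADR framing, not a DB2-documented value.

def required_link_mbps(log_rate_mb_per_s: float, overhead: float = 0.20) -> float:
    """Minimum network bandwidth in megabits/s for a given log rate."""
    raw_mbps = log_rate_mb_per_s * 8       # megabytes/s -> megabits/s
    return raw_mbps * (1 + overhead)       # allow for protocol overhead

# The 50 MB/s OLTP example: ~480 Mbps required, so a 500 Mbps link is
# the floor and 1-10 Gbps gives comfortable headroom.
print(required_link_mbps(50))
```

The same arithmetic, run in reverse, tells you the maximum sustainable log rate for a given WAN link before the standby starts to fall behind.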
The HADR_PEER_WINDOW Parameter
The HADR_PEER_WINDOW parameter deserves special attention because it affects availability behavior during network disruptions.
When the primary loses contact with the standby (network failure, standby crash), the primary enters the DISCONNECTED_PEER state. During the peer window interval:
- The primary continues to behave as if it were in PEER state: commits wait for standby acknowledgment, so transactions block while the standby is unreachable
- Zero-data-loss semantics are preserved — no transaction commits without the standby's acknowledgment
- If the standby reconnects within the peer window, log shipping resumes, blocked commits complete, and the pair catches up seamlessly
After the peer window expires:
- The primary transitions to DISCONNECTED state
- The primary no longer guarantees zero data loss (because the standby may have missed log records)
- Operations continue normally, but the RPO guarantee is temporarily suspended
A peer window of 0 means the primary immediately transitions to DISCONNECTED — no grace period. A peer window of 120 seconds (Meridian's setting) means the primary waits 2 minutes before giving up on the standby. Setting this value requires balancing:
- Too short: Brief network glitches cause unnecessary loss of PEER state
- Too long: During a real standby outage, commits block for the full peer window before the primary gives up, extending the effective outage for applications
29.3 HADR Synchronization Modes
The synchronization mode determines how the primary database interacts with the standby during transaction commits. This is the most critical HADR configuration decision because it directly affects:
- Data loss risk (RPO) — can committed transactions be lost if the primary fails?
- Transaction latency — does the primary wait for the standby during commit?
- Availability behavior — what happens if the standby becomes unreachable?
SYNC (Synchronous)
How it works: When a transaction commits on the primary, the commit does not complete until the standby has written the log records to its local disk and acknowledged receipt.
RPO: Zero. If the primary crashes at any point after a commit returns to the application, the standby has the log records on disk.
Latency impact: The commit must wait for:
1. Network round-trip to the standby (0.1-5 ms depending on distance)
2. Standby's disk write of the log records
3. Acknowledgment back to the primary
Typical penalty: 1-10 ms per commit, depending on network latency and standby storage speed.
Availability risk: If the standby becomes unreachable, the primary continues in disconnected peer state for a configurable timeout, after which it continues standalone. During the unreachable period, all commits block until the timeout expires.
Best for: Environments where zero data loss is mandatory and the standby is in the same data center (low network latency).
NEARSYNC (Near-Synchronous)
How it works: The commit does not complete until the standby has received the log records in memory (but not necessarily written them to disk).
RPO: Near-zero. Data loss only occurs in the extraordinarily unlikely scenario where the primary crashes AND the standby crashes (or loses power) simultaneously, before the standby flushes the received log records to disk.
Latency impact: Slightly less than SYNC because the standby does not perform a disk flush before acknowledging. The primary waits only for the network round-trip plus memory receipt.
Typical penalty: 0.5-5 ms per commit.
Best for: Most production environments. The marginal RPO risk compared to SYNC is negligible, while the performance improvement is measurable. This is the mode Meridian Bank uses.
ASYNC (Asynchronous)
How it works: The primary sends log records to the standby but does not wait for any acknowledgment before completing the commit. Log records are shipped as fast as the network allows, but the primary is always "ahead" of the standby.
RPO: Seconds. The amount of data loss depends on how far behind the standby is — typically a few seconds of transactions.
Latency impact: Zero. The commit path is identical to a non-HADR database. There is no added latency.
Best for: Environments where some data loss is tolerable (e.g., non-financial applications) or where the standby is geographically distant (cross-continent), making SYNC/NEARSYNC latency unacceptable.
SUPERASYNC
How it works: Similar to ASYNC, but with even fewer guarantees. The primary ships log records on a best-effort basis and never enters PEER state with the standby. If the network is slow or congested, the primary does not throttle itself, so the log gap between primary and standby is allowed to grow without bound.
RPO: Seconds to minutes. Under normal conditions, similar to ASYNC. Under network stress, the gap widens.
Latency impact: Zero.
Best for: Disaster recovery standbys at very long distances where network bandwidth is limited. SUPERASYNC is typically used for the auxiliary or delayed standby, not the principal standby.
Comparison Table
| Attribute | SYNC | NEARSYNC | ASYNC | SUPERASYNC |
|---|---|---|---|---|
| RPO | 0 | Near-zero | Seconds | Seconds-minutes |
| Commit latency added | 1-10 ms | 0.5-5 ms | 0 | 0 |
| Standby ack required | Disk write | Memory receive | No | No |
| Network disruption behavior | Commits block (then timeout) | Commits block (then timeout) | Commits continue | Commits continue |
| Typical distance | Same data center | Same metro area | Cross-region | Cross-continent |
| Real-world latency (100km) | ~3 ms added | ~2 ms added | None | None |
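The latency column of the comparison table can be captured in a rough mental model: SYNC pays the network round-trip plus the standby's log flush, NEARSYNC pays only the round-trip, and the asynchronous modes pay nothing. An illustrative sketch, not DB2's internal algorithm:

```python
# Toy model of the added commit latency per HADR sync mode, following
# the comparison table above. Inputs are assumptions you measure in
# your own environment (ping RTT, standby log-device flush time).

def added_commit_latency_ms(mode: str, rtt_ms: float, standby_flush_ms: float) -> float:
    mode = mode.upper()
    if mode == "SYNC":
        return rtt_ms + standby_flush_ms   # ack only after standby disk write
    if mode == "NEARSYNC":
        return rtt_ms                      # ack on memory receipt
    if mode in ("ASYNC", "SUPERASYNC"):
        return 0.0                         # commit never waits for the standby
    raise ValueError(f"unknown HADR sync mode: {mode}")

# Same-metro standby: ~2 ms RTT, ~1 ms standby log flush
for m in ("SYNC", "NEARSYNC", "ASYNC", "SUPERASYNC"):
    print(m, added_commit_latency_ms(m, rtt_ms=2.0, standby_flush_ms=1.0))
```

Plugging in your own measured RTT makes the SYNC-versus-NEARSYNC trade-off concrete before you commit to a mode.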
29.4 Setting Up HADR Step by Step
Prerequisites
Before configuring HADR, ensure:
- Identical database schema — the standby must be a restored image of the primary (same structure, same data)
- Compatible DB2 versions — primary and standby should run the same DB2 version (minor version differences are tolerated during rolling upgrades)
- Network connectivity — TCP/IP connectivity between primary and standby with sufficient bandwidth for the log generation rate
- Sufficient storage — the standby needs at least as much log and data storage as the primary
- Same operating system family — the primary and standby can be different Linux distributions but must be the same OS type (Linux-to-Linux, not Linux-to-Windows)
Step 1: Take a Backup of the Primary
# On the primary server
db2 BACKUP DATABASE MERIDIAN TO /db2backup COMPRESS
Step 2: Restore the Backup on the Standby
# On the standby server
db2 RESTORE DATABASE MERIDIAN FROM /db2backup REPLACE EXISTING
Step 3: Configure Database Parameters on the Primary
# On the primary server
db2 UPDATE DB CFG FOR MERIDIAN USING \
HADR_LOCAL_HOST primary.meridian.bank \
HADR_LOCAL_SVC 55001 \
HADR_REMOTE_HOST standby.meridian.bank \
HADR_REMOTE_SVC 55001 \
HADR_REMOTE_INST db2inst1 \
HADR_SYNCMODE NEARSYNC \
HADR_PEER_WINDOW 120 \
HADR_TIMEOUT 120 \
LOGINDEXBUILD ON
Step 4: Configure Database Parameters on the Standby
# On the standby server
db2 UPDATE DB CFG FOR MERIDIAN USING \
HADR_LOCAL_HOST standby.meridian.bank \
HADR_LOCAL_SVC 55001 \
HADR_REMOTE_HOST primary.meridian.bank \
HADR_REMOTE_SVC 55001 \
HADR_REMOTE_INST db2inst1 \
HADR_SYNCMODE NEARSYNC \
HADR_PEER_WINDOW 120 \
HADR_TIMEOUT 120 \
LOGINDEXBUILD ON
Step 5: Start HADR on the Standby First
# On the standby server — start standby FIRST
db2 DEACTIVATE DATABASE MERIDIAN
db2 START HADR ON DATABASE MERIDIAN AS STANDBY
Step 6: Start HADR on the Primary
# On the primary server
db2 START HADR ON DATABASE MERIDIAN AS PRIMARY
Step 7: Verify HADR Status
# On either server
db2pd -db MERIDIAN -hadr
# Or via SQL
db2 "SELECT HADR_ROLE, HADR_STATE, HADR_SYNCMODE,
HADR_CONNECT_STATUS, HADR_LOG_GAP,
PRIMARY_LOG_FILE, STANDBY_LOG_FILE,
PRIMARY_LOG_PAGE, STANDBY_LOG_PAGE
FROM TABLE(MON_GET_HADR(NULL))"
Expected output when fully synchronized:
HADR_ROLE = PRIMARY
HADR_STATE = PEER
HADR_SYNCMODE = NEARSYNC
HADR_CONNECT_STATUS = CONNECTED
HADR_LOG_GAP = 0
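A monitoring script can parse "NAME = VALUE" output like the above and alert whenever the pair leaves its healthy state. A minimal sketch — the `hadr_healthy` helper and its criteria are illustrative assumptions, not an IBM-supplied check:

```python
# Parse "NAME = VALUE" lines (as produced by the status query above)
# and flag the pair as healthy only when connected, in PEER state, and
# with no log gap. Sample text and criteria are illustrative.

def parse_hadr_status(text: str) -> dict:
    status = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            status[key.strip()] = value.strip()
    return status

def hadr_healthy(status: dict) -> bool:
    return (status.get("HADR_STATE") == "PEER"
            and status.get("HADR_CONNECT_STATUS") == "CONNECTED"
            and status.get("HADR_LOG_GAP") == "0")

sample = """\
HADR_ROLE = PRIMARY
HADR_STATE = PEER
HADR_SYNCMODE = NEARSYNC
HADR_CONNECT_STATUS = CONNECTED
HADR_LOG_GAP = 0
"""
print(hadr_healthy(parse_hadr_status(sample)))
```

In practice you would feed this the output of a scheduled `db2 "SELECT ... FROM TABLE(MON_GET_HADR(NULL))"` and page the on-call DBA on a `False`.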
Step 8: Test Takeover
A planned takeover (graceful):
# On the standby server
db2 TAKEOVER HADR ON DATABASE MERIDIAN
A forced takeover (when the primary is unreachable):
# On the standby server
db2 TAKEOVER HADR ON DATABASE MERIDIAN BY FORCE
Using db2haicu for Automated Setup
IBM provides the DB2 High Availability Instance Configuration Utility (db2haicu) for automated HADR configuration, particularly when integrated with cluster managers like IBM Tivoli System Automation, Pacemaker, or PowerHA:
db2haicu
# Interactive wizard that configures:
# - HADR parameters
# - Cluster manager integration
# - Automatic failover policies
# - Quorum device configuration
Important HADR Database Configuration Parameters
Beyond the core HADR parameters, several database configuration parameters affect HADR behavior and should be reviewed:
| Parameter | Recommended Value | Reason |
|---|---|---|
| LOGINDEXBUILD | ON | Ensures index builds are logged and replayed on standby |
| LOGARCHMETH1 | DISK:/archive/path | Required; HADR needs archival logging |
| LOGFILSIZ | Match primary and standby | Log file sizes must be identical |
| LOGPRIMARY | Sufficient for peak | Prevents log-full conditions during peak write periods |
| LOGSECOND | Generous (e.g., 100) | Provides buffer for long-running transactions |
| HADR_SPOOL_LIMIT | AUTOMATIC or a specific number of 4 KB pages | Caps log spooling on the standby when replay falls behind |
The HADR_SPOOL_LIMIT parameter is particularly important. When log replay on the standby cannot keep pace with the rate at which log records arrive, the standby spools the received-but-not-yet-replayed records to disk. If replay stalls for long enough (e.g., during a massive batch load on the primary or under heavy reads-on-standby traffic), an unlimited spool can fill the standby's storage. Setting HADR_SPOOL_LIMIT caps how much log data the standby will spool; once the limit is reached, the standby stops accepting new log records, which in SYNC and NEARSYNC modes back-pressures the primary.
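One way to size the spool limit is to multiply the peak log generation rate by the longest replay or shipping stall you are willing to absorb. A sketch, assuming the limit is expressed in 4 KB pages and using illustrative rate/outage numbers:

```python
# Sizing sketch for HADR_SPOOL_LIMIT: spool capacity = peak log rate
# multiplied by the tolerated stall. Assumes the parameter is given in
# 4 KB pages; the rate and stall values below are illustrative.

def spool_limit_pages(peak_log_mb_per_s: float, tolerated_stall_s: int) -> int:
    """4 KB pages needed to buffer tolerated_stall_s of peak log volume."""
    spool_mb = peak_log_mb_per_s * tolerated_stall_s
    return int(spool_mb * 1024 // 4)   # MB -> KB -> 4 KB pages

# Tolerate a 10-minute stall at a 20 MB/s peak log rate
print(spool_limit_pages(20, 600))
```

Whatever value the arithmetic produces, verify that the standby's log spool filesystem actually has that much free space before applying it.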
Takeover Types Explained
There are three distinct takeover operations, each for a different scenario:
Graceful takeover (TAKEOVER HADR ON DB ... ): Used for planned maintenance. The primary and standby coordinate the role switch. Both databases remain consistent, and the old primary becomes the new standby automatically. Zero data loss guaranteed. Requires both primary and standby to be connected and in PEER state.
Forced takeover (TAKEOVER HADR ON DB ... BY FORCE): Used when the primary is unreachable. The standby promotes itself to primary without coordinating with the old primary. Data loss may occur if the standby was not fully caught up. When the old primary comes back online, it cannot automatically become a standby — it must be reinitialized from a fresh backup of the new primary.
Forced takeover with peer window active (TAKEOVER HADR ON DB ... BY FORCE PEER WINDOW ONLY): A safer variant of forced takeover. This command only succeeds if the standby is within the peer window — meaning it was recently in PEER state and the data loss exposure is minimal. If the peer window has expired, the command fails, and the DBA must make an explicit decision to proceed with a full forced takeover.
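The decision logic behind the three variants fits in a few lines. This is a conceptual sketch — the function and its arguments are illustrative, not a DB2 API:

```python
# Conceptual model of the three TAKEOVER variants described above:
# graceful requires a synchronized (PEER) pair; BY FORCE PEER WINDOW
# ONLY additionally requires that the peer window has not yet expired;
# plain BY FORCE always proceeds. Names/arguments are illustrative.

def takeover_allowed(kind: str, state: str, seconds_since_peer: float,
                     peer_window_s: int) -> bool:
    if kind == "graceful":
        return state == "PEER"                       # coordinated role switch
    if kind == "force_peer_window_only":
        return seconds_since_peer <= peer_window_s   # minimal data-loss exposure
    if kind == "force":
        return True                                  # data loss is the DBA's call
    raise ValueError(f"unknown takeover kind: {kind}")

print(takeover_allowed("force_peer_window_only", "DISCONNECTED_PEER", 90, 120))
print(takeover_allowed("force_peer_window_only", "DISCONNECTED", 300, 120))
```

The PEER WINDOW ONLY variant is the one to script into automated failover: it succeeds exactly when the exposure is bounded, and its failure forces a human decision.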
29.5 Automatic Client Reroute (ACR)
The Problem
HADR solves the server-side failover problem — the standby takes over as the new primary. But what about the clients? After a takeover, every application connection to the old primary is broken. Applications need to reconnect to the new primary.
Without automatic client reroute, every application would need custom reconnection logic — detecting the broken connection, looking up the new primary's address, and reconnecting. This is error-prone and slow.
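Conceptually, the reroute loop the driver runs on the application's behalf looks like this sketch; `reroute_connect` and `fake_connect` are illustrative stand-ins, not the actual driver implementation:

```python
# Sketch of the reconnection logic ACR performs for the client: cycle
# through the primary and its alternates with bounded retries.
# connect_fn stands in for the real driver's connection attempt.

def reroute_connect(servers, connect_fn, max_retries=5):
    """Try each (host, port) in order, up to max_retries passes."""
    last_error = ConnectionError("no servers configured")
    for _ in range(max_retries):
        for host, port in servers:
            try:
                return connect_fn(host, port)
            except ConnectionError as exc:
                last_error = exc       # remember failure, try next server
    raise last_error

# Simulated failover: the old primary refuses, the standby (now the
# new primary after takeover) accepts.
def fake_connect(host, port):
    if host == "primary.meridian.bank":
        raise ConnectionError("connection refused")
    return f"connected to {host}:{port}"

servers = [("primary.meridian.bank", 50000), ("standby.meridian.bank", 50000)]
print(reroute_connect(servers, fake_connect))
```

ACR packages exactly this kind of loop (plus retry intervals and timeouts) inside the driver, so every application gets it without custom code.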
ACR Configuration
Automatic Client Reroute (ACR) is built into the DB2 client driver. It works by providing the client with alternate server information so that when a connection fails, the client automatically reconnects to the new primary.
Server-Side Configuration
On the primary database, configure the alternate server:
# On the primary server
db2 UPDATE ALTERNATE SERVER FOR DATABASE MERIDIAN \
USING HOSTNAME standby.meridian.bank PORT 50000
When a client connects to the primary, DB2 sends the alternate server information to the client driver. The client stores this information and uses it if the primary connection fails.
Client-Side Configuration
The client can also be pre-configured with alternate server information in the db2dsdriver.cfg file:
<!-- db2dsdriver.cfg on the client -->
<configuration>
  <dsncollection>
    <dsn alias="MERIDIAN" name="MERIDIAN" host="primary.meridian.bank" port="50000"/>
  </dsncollection>
  <databases>
    <database name="MERIDIAN" host="primary.meridian.bank" port="50000">
      <acr>
        <parameter name="enableAcr" value="true"/>
        <parameter name="enableSeamlessAcr" value="true"/>
        <parameter name="maxAcrRetries" value="5"/>
        <parameter name="acrRetryInterval" value="10"/>
        <alternateserverlist>
          <server name="standby" hostname="standby.meridian.bank" port="50000"/>
        </alternateserverlist>
      </acr>
    </database>
  </databases>
</configuration>
Seamless ACR
Standard ACR reconnects the client after a failover, but in-flight transactions are rolled back. The application receives an SQL error (-30108) and must retry the transaction.
Seamless ACR goes further. For read-only transactions or transactions that have not yet performed any updates, seamless ACR can reconnect transparently — the application does not see an error. This requires:
- enableSeamlessAcr=true in the driver configuration
- The transaction must be eligible (read-only or no work done yet)
Client Driver Support
ACR is supported by all modern DB2 client drivers:
| Driver | ACR Support | Seamless ACR |
|---|---|---|
| DB2 CLI / ODBC | Yes | Yes |
| JDBC Type 4 (JCC) | Yes | Yes |
| .NET Data Provider | Yes | Yes |
| Node.js ibm_db | Yes | Yes |
| Python ibm_db | Yes | Yes |
| Go go_ibm_db | Yes | Limited |
Workload Balancing (WLB)
When multiple members are available (e.g., in a pureScale environment or when reads-on-standby is configured), the DB2 client can distribute connections across members for load balancing:
<database name="MERIDIAN" host="primary.meridian.bank" port="50000">
<wlb>
<enablewlb>TRUE</enablewlb>
<maxconnections>0</maxconnections>
</wlb>
</database>
WLB queries the server for member weights and distributes new connections accordingly. This is particularly valuable for read-heavy workloads where the standby can offload reporting queries.
29.6 HADR with Reads on Standby
The Value Proposition
In a traditional HADR configuration, the standby database sits idle — it replays log records but serves no application traffic. This is wasteful. The standby has CPU, memory, and storage that could serve read-only workloads.
Reads on Standby (ROS) allows applications to run read-only queries against the HADR standby database while it continues to replay log records from the primary. This is valuable for:
- Reporting and analytics — offload long-running reports from the primary
- Data validation — verify data integrity without impacting production
- Development/testing — read production data for testing purposes
- Disaster preparedness — ensure the standby is functional by actually using it
How It Works
When a client connects to the standby, it receives a read-only connection. The standby continues replaying log records in the background. Read-only queries run under the uncommitted read (UR) isolation level — the only isolation level an active standby supports — so they may observe changes that are still uncommitted on the primary.
# Connect to the standby for read-only queries
db2 CONNECT TO MERIDIAN # standby automatically provides read-only access
# Execute read-only queries
db2 "SELECT * FROM ACCOUNTS WHERE BRANCH_ID = 100"
# Attempt to write — this will fail
db2 "INSERT INTO ACCOUNTS VALUES (...)"
# SQL1773N The statement or command requires functionality that is not
# supported on a read-enabled HADR standby database
Isolation Level Considerations
On the standby, the available isolation levels are restricted:
| Isolation Level | Available on Standby? | Notes |
|---|---|---|
| UR (Uncommitted Read) | Yes | The only supported level; may see uncommitted data from in-flight transactions on the primary |
| CS (Cursor Stability) | No | Requests fail with SQL1773N unless the DB2_STANDBY_ISO registry variable forces a downgrade to UR |
| RS (Read Stability) | No | Not supported on standby |
| RR (Repeatable Read) | No | Not supported on standby |
The restriction exists because the standby cannot take locks in the traditional sense — it is replaying log records, and lock conflicts with replay would cause the standby to stall.
Replay Delay Consideration
If the standby is configured with a replay delay (delayed standby), reads will see data that is behind the primary by the delay interval. This can be useful:
- Accidental deletion recovery: If someone drops a table at 2:00 PM, the delayed standby (with a 1-hour delay) still has the table until 3:00 PM
- Consistent historical view: Reporting queries see a stable point-in-time, not a moving target
But it can also be confusing — an application querying the standby will see data that is "old" compared to the primary. Ensure that applications and users understand this when using delayed standbys for reporting.
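The recovery window math is simple but worth making explicit. A sketch of the drop-at-2:00-PM example above:

```python
# Timing sketch for a delayed standby: an object dropped at time T on
# the primary stays recoverable on the standby until the drop record
# reaches the replay point, i.e. roughly T + the configured delay.

from datetime import datetime, timedelta

def recoverable_until(drop_time: datetime, replay_delay: timedelta) -> datetime:
    """Latest moment the dropped object still exists on the delayed standby."""
    return drop_time + replay_delay

# Table dropped at 2:00 PM with a 1-hour replay delay
deadline = recoverable_until(datetime(2024, 3, 1, 14, 0), timedelta(hours=1))
print(deadline)
```

The operational corollary: monitoring must detect the accidental drop, and the DBA must react, strictly inside that window — which is the real argument for choosing a delay longer than your mean time to notice a mistake.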
Performance Impact on Log Replay
Read-only queries on the standby compete with log replay for CPU and I/O resources. Under heavy read load:
- Log replay may slow down, causing the standby to fall behind the primary
- This increases the HADR log gap and, in SYNC/NEARSYNC modes, could impact primary performance
Monitor the HADR_LOG_GAP metric closely when enabling reads on standby. If the gap grows unacceptably, consider:
- Limiting the number of concurrent read connections to the standby
- Scheduling heavy reporting queries during off-peak hours
- Using an auxiliary standby for reads while keeping the principal standby dedicated to replay
29.7 Db2 pureScale
Overview
While HADR provides failover with minimal downtime, Db2 pureScale provides continuous availability — there is no failover because multiple members are active simultaneously, sharing the same data. It is the LUW equivalent of z/OS data sharing (Chapter 28), though the implementation differs significantly.
Architecture
pureScale uses a shared-everything architecture:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Member 1 │ │ Member 2 │ │ Member 3 │
│ (Active) │ │ (Active) │ │ (Active) │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
└──────┬──────┴──────┬──────┘
│ │
┌─────┴─────┐ ┌────┴──────┐
│ CF Node │ │ CF Node │
│ (Primary) │ │ (Standby) │
└─────┬─────┘ └────┬──────┘
│ │
└──────┬──────┘
│
┌──────┴──────┐
│ Shared Disk │
│ (GPFS) │
└─────────────┘
Key components:
Member Nodes
Each member is an active DB2 instance that can serve read/write traffic. All members access the same database on shared storage. Unlike HADR (where only the primary is active), all pureScale members are active simultaneously.
Cluster Caching Facility (CF) Nodes
CF nodes serve a role analogous to the z/OS coupling facility — they manage a global lock manager and a shared buffer pool (group buffer pool). CF nodes:
- Centralized lock management — all lock requests go through the CF
- Page coherency — the CF ensures that all members see consistent data
- Fast inter-member communication — using high-speed interconnects (InfiniBand or RDMA over Converged Ethernet)
There are typically two CF nodes — a primary and a standby. If the primary CF fails, the standby takes over.
Shared Storage (GPFS / Spectrum Scale)
All member nodes and CF nodes access the same storage through IBM Spectrum Scale (GPFS). The shared file system ensures that all members can read and write the same database files, log files, and configuration files.
How pureScale Differs from z/OS Data Sharing
| Aspect | z/OS Data Sharing | Db2 pureScale |
|---|---|---|
| Platform | z/OS on mainframe | Linux on x86/POWER |
| Coupling facility | Dedicated hardware or LPAR | Software CF process on dedicated node |
| Interconnect | CF links (InfiniBand, ISC) | InfiniBand or RoCE |
| Shared storage | DASD (ECKD) via FICON | GPFS via SAN |
| Max members | 32 | 128 |
| Lock management | Hardware-assisted in CF | Software-based in CF node |
| Maturity | 30+ years (since 1994) | Since 2009 |
Failure and Recovery
When a member node fails in pureScale:
- The CF detects the failure (heartbeat timeout)
- Other members continue serving traffic immediately — no failover needed
- The CF initiates member recovery — replaying the failed member's log to roll back in-flight transactions
- Client connections to the failed member are rerouted to surviving members via ACR
When a CF node fails:
- The standby CF takes over automatically
- Lock state and group buffer pool contents are rebuilt from member state
- Brief pause (seconds) while the CF transitions
When to Choose pureScale
pureScale is appropriate when:
- Zero downtime is required — no planned or unplanned outages
- Read/write scalability is needed — multiple active members serve the workload
- The budget supports it — pureScale requires multiple servers, InfiniBand networking, shared storage, and IBM licensing for each member
pureScale is not appropriate when:
- Budget is limited — HADR is much simpler and cheaper
- The workload is modest — a single server with HADR standby may suffice
- Geographic DR is the primary requirement — pureScale requires low-latency interconnects (same data center)
pureScale Operational Considerations
Operating a pureScale cluster requires specialized skills:
Adding and removing members: New members can be added to a running cluster without downtime. The db2iupdt command adds a member, and the CF automatically begins managing locks and cache for the new member. Similarly, members can be removed for maintenance.
CF node maintenance: The standby CF can be taken down for maintenance while the primary CF continues operating. After maintenance, the standby CF is resynchronized. To maintain the primary CF, roles are first swapped (the standby becomes primary) before the original primary is taken down.
Shared storage management: All members access the database through IBM Spectrum Scale (GPFS). The shared file system must be monitored for performance and capacity. GPFS provides its own HA through data replication across storage nodes.
Client routing with WLB: The DB2 client driver's workload balancing (WLB) feature distributes connections across pureScale members based on member weights. This provides automatic load balancing and transparent member failover at the client level.
Combining pureScale with HADR: For geographic DR, pureScale can be combined with HADR. The primary pureScale cluster replicates to a remote standby (either single-member or another pureScale cluster) via HADR ASYNC. This provides both continuous availability within the data center (pureScale) and geographic disaster protection (HADR).
29.8 Q Replication
Overview
Q Replication is an IBM replication technology that captures changes from a source database and applies them to a target database using IBM MQ (Message Queue) as the transport. It is designed for:
- Low-latency replication — changes are captured and applied in near-real-time
- Bidirectional replication — two databases can both accept writes and replicate changes to each other
- Cross-platform replication — DB2 LUW to DB2 z/OS, DB2 to Oracle, etc.
- Multi-target replication — one source can replicate to multiple targets
Architecture
┌─────────────┐ ┌──────────┐ ┌─────────────┐
│ SOURCE │ │ IBM MQ │ │ TARGET │
│ Database │ │ │ │ Database │
│ │ │ │ │ │
│ ┌─────────┐ │ Msgs │ ┌──────┐ │ Msgs │ ┌─────────┐ │
│ │ Q │─┼───────►│ │Queue │─┼───────►│ │ Q │ │
│ │ Capture │ │ │ └──────┘ │ │ │ Apply │ │
│ └─────────┘ │ │ │ │ └─────────┘ │
│ ▲ │ └──────────┘ │ │ │
│ │ │ │ ▼ │
│ DB2 Log │ │ Tables │
└─────────────┘ └─────────────┘
Q Capture
The Q Capture program reads the DB2 transaction log on the source database. For each committed transaction that affects a subscribed table, it constructs a message containing the change data and publishes it to an MQ queue.
Key characteristics:
- Log-based capture — no triggers, no additional writes to the source database
- Transactional consistency — changes are captured at commit boundaries
- Minimal source overhead — the capture program reads the log asynchronously
MQ Transport
IBM MQ provides the messaging infrastructure:
- Guaranteed delivery — messages are persistent and transactional
- Network independence — source and target can be on different networks, data centers, or continents
- Buffering — if the target is temporarily unavailable, messages queue up in MQ
Q Apply
The Q Apply program reads messages from the MQ queue and applies the changes to the target database. It:
- Preserves transaction boundaries — changes are applied in commit-consistent groups
- Handles conflicts — in bidirectional mode, conflicts are detected and resolved
- Maintains referential integrity — dependent changes are applied in the correct order
Bidirectional Replication
Q Replication's most powerful feature is bidirectional replication — two databases that both accept writes and replicate to each other. This enables:
- Active-active configurations — both sites serve read/write traffic
- Geographic distribution — users connect to the nearest site
- Load distribution — write traffic is split across sites
However, bidirectional replication introduces conflicts: what happens when the same row is updated at both sites simultaneously?
Conflict Resolution
Q Replication provides configurable conflict detection and resolution:
- Detection: Q Apply detects that a row has been modified at both source and target since the last replication
- Resolution strategies:
- Source wins — the incoming change overwrites the local change
- Target wins — the local change is preserved, incoming change is discarded
- Timestamp-based — the most recent change (by timestamp) wins
- Custom stored procedure — a user-defined procedure examines both versions and decides
For Meridian Bank, a source-wins strategy is typically used for reference data (branch addresses, interest rates), while transactional data (account balances, transaction records) uses application-level coordination to avoid conflicts entirely.
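The resolution strategies above can be modeled as a small decision function. This is an illustrative sketch, not Q Apply's actual implementation; the row dictionaries and the `updated_at` field are assumptions made for the example.

```python
from datetime import datetime, timezone

def resolve_conflict(incoming, local, strategy="timestamp"):
    """Pick the winning row version for a bidirectional-replication conflict.

    Strategies mirror the options described above: 'source' (incoming
    change wins), 'target' (local change wins), 'timestamp' (most
    recent change wins). A custom-procedure strategy would plug in here.
    """
    if strategy == "source":
        return incoming
    if strategy == "target":
        return local
    if strategy == "timestamp":
        return incoming if incoming["updated_at"] >= local["updated_at"] else local
    raise ValueError(f"unknown strategy: {strategy}")

# Example: the same branch-address row was updated at both sites.
site_a = {"branch_id": 7, "address": "12 Main St",
          "updated_at": datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)}
site_b = {"branch_id": 7, "address": "14 Main St",
          "updated_at": datetime(2024, 5, 1, 10, 5, tzinfo=timezone.utc)}

winner = resolve_conflict(site_b, site_a, strategy="timestamp")
print(winner["address"])  # the later change (site_b) wins
```

The timestamp strategy assumes the two sites' clocks are closely synchronized — one reason source-wins is often preferred for reference data.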
Latency and Throughput
Typical Q Replication latency:
| Scenario | Capture Latency | MQ Transport | Apply Latency | End-to-End |
|---|---|---|---|---|
| Same data center | <100 ms | <10 ms | <100 ms | <300 ms |
| Cross-region (500km) | <100 ms | 5-20 ms | <100 ms | <500 ms |
| Cross-continent | <100 ms | 50-200 ms | <100 ms | 200 ms - 1 sec |
Throughput depends on the transaction rate, row size, and MQ channel bandwidth. In practice, Q Replication can sustain thousands of transactions per second.
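The bandwidth dependence can be made concrete with a back-of-the-envelope bound. The figures below (channel speed, average change size) are illustrative assumptions, and the calculation ignores protocol overhead and batching.

```python
def replication_throughput_bound(bandwidth_mbps, avg_change_bytes):
    """Upper bound on changes/second an MQ channel can carry,
    ignoring protocol overhead and batching (illustrative only)."""
    bytes_per_sec = bandwidth_mbps * 1_000_000 / 8
    return bytes_per_sec / avg_change_bytes

# A 100 Mbps channel with ~1 KB per change row:
print(int(replication_throughput_bound(100, 1024)))  # ≈ 12207 changes/sec
```

Even this modest channel supports the "thousands of transactions per second" figure, so in practice Q Apply or the source transaction rate is usually the bottleneck before the network is.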
29.9 Change Data Capture (CDC)
Overview
Change Data Capture (CDC) — now part of IBM InfoSphere Data Replication — is a log-based replication technology that captures row-level changes from DB2 and delivers them to diverse targets in real-time. While Q Replication is specifically designed for DB2-to-DB2 replication via MQ, CDC is broader:
- Diverse targets: Kafka, flat files, HDFS, data warehouses, other RDBMS
- Flexible formats: JSON, Avro, CSV, relational row format
- Transformation capability: Data can be filtered, transformed, and enriched during replication
Architecture
┌─────────────┐ ┌─────────────┐ ┌─────────────────┐
│ SOURCE │ │ CDC Engine │ │ TARGETS │
│ DB2 LUW │ │ │ │ │
│ │ │ ┌─────────┐ │ │ ┌───────────┐ │
│ Transaction├────►│ │ Capture │ │────►│ │ Kafka │ │
│ Log │ │ │ Engine │ │ │ └───────────┘ │
│ │ │ └─────────┘ │ │ ┌───────────┐ │
│ │ │ ┌─────────┐ │ │ │ Data │ │
│ │ │ │ Apply │─┼────►│ │ Warehouse │ │
│ │ │ │ Engine │ │ │ └───────────┘ │
│ │ │ └─────────┘ │ │ ┌───────────┐ │
│ │ │ │ │ │ Flat Files│ │
│ │ │ │────►│ └───────────┘ │
└─────────────┘ └─────────────┘ └─────────────────┘
Log-Based Capture
Like Q Replication, CDC reads the DB2 transaction log — it does not use triggers or polling. This is critical for production databases because:
- No additional I/O on the source database — the capture reads log files that DB2 has already written
- No schema changes required — no trigger DDL, no audit columns
- Minimal latency — changes are captured within milliseconds of the commit
- Complete change record — the log contains before-images and after-images of every changed row
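The before-image/after-image pair is what makes a log-derived change record self-describing. The record shape below is an assumption for illustration, not a documented CDC wire format, but it shows why downstream consumers can compute deltas without querying the source.

```python
# Illustrative shape of a log-derived change record: the DB2 log gives
# CDC both the before-image and after-image of an updated row.
change = {
    "table": "MERIDIAN.ACCOUNTS",
    "op": "UPDATE",
    "before": {"ACCT_ID": 1001, "BALANCE": 2500.00},
    "after":  {"ACCT_ID": 1001, "BALANCE": 1900.00},
}

def changed_columns(rec):
    """Columns whose value differs between the before- and after-image."""
    return {k: (rec["before"][k], rec["after"][k])
            for k in rec["after"]
            if rec["before"].get(k) != rec["after"][k]}

print(changed_columns(change))  # {'BALANCE': (2500.0, 1900.0)}
```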
Integration with Apache Kafka
One of the most common CDC patterns today is streaming DB2 changes to Apache Kafka for consumption by downstream systems:
DB2 Transaction Log
│
▼
CDC Capture Engine
│
▼
Kafka Producer
│
▼
┌──────────────────────────────────────┐
│ Apache Kafka │
│ Topic: meridian.accounts.changes │
│ Topic: meridian.transactions.changes│
└──────┬──────────────┬────────────────┘
│ │
▼ ▼
Real-time Data Lake
Analytics (S3/HDFS)
Each committed transaction becomes one or more Kafka messages. Consumers subscribe to the topics and process changes in real-time. This pattern is commonly known as event streaming or event-driven architecture.
For Meridian Bank, the CDC-to-Kafka pipeline feeds:
- Fraud detection engine — analyzes transactions in real-time to detect anomalies
- Data warehouse — incrementally loads transaction data for analytics
- Regulatory reporting — captures account changes for compliance audit trails
- Customer notification service — triggers alerts for large withdrawals, low balances, etc.
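A consumer on the far side of such a pipeline is essentially a dispatcher keyed by topic. The sketch below models that routing with an in-memory message; the JSON payload, topic names beyond those in the diagram, and handler functions are illustrative assumptions — a real deployment would use a Kafka client library.

```python
import json

def handle_fraud(evt):        return f"fraud-check acct {evt['acct']}"
def handle_notification(evt): return f"notify acct {evt['acct']}"

# Which downstream handlers care about which CDC topics.
HANDLERS = {
    "meridian.transactions.changes": [handle_fraud, handle_notification],
    "meridian.accounts.changes":     [handle_notification],
}

def dispatch(topic, raw_message):
    """Decode one CDC message and fan it out to the topic's handlers."""
    evt = json.loads(raw_message)
    return [handler(evt) for handler in HANDLERS.get(topic, [])]

results = dispatch("meridian.transactions.changes",
                   '{"acct": 1001, "amount": -5000}')
print(results)  # both the fraud and notification handlers fire
```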
CDC vs. Q Replication
| Aspect | CDC | Q Replication |
|---|---|---|
| Primary use case | Data distribution to diverse targets | DB2-to-DB2 replication |
| Transport | Kafka, TCP/IP, flat files | IBM MQ |
| Bidirectional | No (unidirectional) | Yes |
| Target diversity | Any database, Kafka, files, HDFS | DB2 (and some others via adapters) |
| Conflict resolution | Not applicable (unidirectional) | Built-in |
| Transformation | Built-in ETL capabilities | Limited |
| Latency | Sub-second | Sub-second |
| Licensing | InfoSphere Data Replication | DB2 Advanced Edition or standalone |
ETL Patterns with CDC
CDC enables real-time ETL — replacing traditional batch ETL jobs that extract, transform, and load data on a schedule (e.g., nightly). With CDC:
- Extract: Changes are captured from the log in real-time
- Transform: The CDC engine applies mappings, filters, and transformations
- Load: Changed rows are applied to the target in near-real-time
This eliminates the batch window and provides the data warehouse with fresh data continuously. Meridian's analytics team can see a transaction in the data warehouse within seconds of it being committed in the operational database.
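The transform step in this pattern is typically a per-row filter-and-map. The sketch below shows the idea under assumed column names and rules (excluding test accounts, masking a tax ID); it is not the CDC engine's transformation language.

```python
def transform(row):
    """Filter/map one changed row before it reaches the warehouse."""
    if row["ACCT_TYPE"] == "TEST":
        return None  # internal test accounts are filtered out
    out = dict(row)
    out["TAX_ID"] = "***-**-" + row["TAX_ID"][-4:]  # mask PII
    return out

rows = [
    {"ACCT_ID": 1, "ACCT_TYPE": "CHECKING", "TAX_ID": "123-45-6789"},
    {"ACCT_ID": 2, "ACCT_TYPE": "TEST",     "TAX_ID": "000-00-0000"},
]
loaded = [t for t in (transform(r) for r in rows) if t is not None]
print(loaded[0]["TAX_ID"])  # ***-**-6789
```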
CDC Implementation Considerations
Deploying CDC in production requires attention to several practical considerations:
Log retention: The CDC capture engine reads the DB2 transaction log. If the capture engine is stopped or falls behind, DB2 must retain the unprocessed log records. This means the log archive retention policy must be sufficient to allow the CDC engine to catch up after a restart. Meridian Bank retains archive logs for 7 days specifically to support CDC recovery.
Schema evolution: When the source table's schema changes (ADD COLUMN, ALTER COLUMN), the CDC subscription must be updated to include the new columns. Some CDC products handle this automatically; others require manual subscription updates. IBM InfoSphere Data Replication provides auto-adaptation for most DDL changes.
Large object (LOB) handling: LOB columns can dramatically increase the volume of change data. If a table has a 10 MB LOB column and the LOB is updated, the entire LOB value flows through the CDC pipeline. For Meridian Bank, document storage tables with large LOBs are excluded from CDC; only the metadata columns are replicated.
Initial load: When a new table is added to CDC, the existing data must be loaded into the target before incremental changes can be applied. The CDC engine provides an initial load capability that reads the full table and publishes the rows as INSERT messages. During the initial load, the capture engine continues collecting incremental changes, which are applied after the load completes. This avoids any gap in the change stream.
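The load-then-apply ordering described above can be sketched as follows. This models only the handoff guarantee — buffered incremental changes are applied after the snapshot, in order — and is not the CDC engine's implementation.

```python
def initial_load_with_buffer(snapshot_rows, changes_during_load):
    """Snapshot first, then replay the changes buffered during the load."""
    target = {}
    # 1. Load the snapshot as INSERTs.
    for row in snapshot_rows:
        target[row["id"]] = row
    # 2. Apply the changes collected while the load ran, in order.
    for change in changes_during_load:
        if change["op"] == "DELETE":
            target.pop(change["id"], None)
        else:  # INSERT or UPDATE
            target[change["id"]] = {"id": change["id"], **change["row"]}
    return target

snapshot = [{"id": 1, "bal": 100}, {"id": 2, "bal": 200}]
buffered = [{"op": "UPDATE", "id": 1, "row": {"bal": 150}},
            {"op": "DELETE", "id": 2}]
result = initial_load_with_buffer(snapshot, buffered)
print(result)  # {1: {'id': 1, 'bal': 150}}
```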
Monitoring and alerting: CDC pipelines should be monitored for:
- Capture latency (time between a commit and the capture engine processing it)
- Apply latency (time between a commit and the data arriving at the target)
- Error rates (deserialization errors, target write failures)
- Throughput (messages per second, bytes per second)
IBM provides the InfoSphere Data Replication Management Console for CDC monitoring, and most organizations also integrate CDC metrics into their standard monitoring platforms (Prometheus, Grafana, Datadog).
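The two latency metrics reduce to timestamp arithmetic once a change's commit, capture, and apply times are known. The SLA thresholds below are illustrative assumptions, not IBM defaults.

```python
from datetime import datetime, timedelta, timezone

def latency_alerts(commit_ts, captured_ts, applied_ts,
                   capture_sla=timedelta(seconds=10),
                   apply_sla=timedelta(seconds=30)):
    """Check one change's capture and apply latency against SLAs."""
    alerts = []
    if captured_ts - commit_ts > capture_sla:
        alerts.append("capture latency over SLA")
    if applied_ts - commit_ts > apply_sla:
        alerts.append("apply latency over SLA")
    return alerts

commit = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
alerts = latency_alerts(commit,
                        commit + timedelta(seconds=2),    # captured quickly
                        commit + timedelta(seconds=45))   # applied late
print(alerts)  # only the apply-latency threshold is breached
```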
29.10 Choosing the Right HA Solution
Decision Framework
Choosing the right HA solution requires evaluating multiple factors. Here is a structured decision framework:
Factor 1: RPO Requirement
| RPO | Recommended Solutions |
|---|---|
| Zero (no data loss) | HADR SYNC/NEARSYNC, pureScale |
| Seconds | HADR ASYNC, Q Replication |
| Minutes | HADR SUPERASYNC, CDC |
| Hours | Log shipping, backup/restore |
Factor 2: RTO Requirement
| RTO | Recommended Solutions |
|---|---|
| Seconds (no perceptible outage) | pureScale |
| Under 2 minutes | HADR with automatic failover |
| Under 30 minutes | HADR with manual takeover, Q Replication |
| Hours | Backup/restore |
Factor 3: Budget and Complexity
| Solution | Infrastructure Cost | Operational Complexity | Licensing Cost |
|---|---|---|---|
| HADR | Low (2 servers) | Low | Included in DB2 |
| pureScale | High (multiple servers + InfiniBand + shared storage) | High | Premium licensing |
| Q Replication | Medium (2 servers + MQ infrastructure) | Medium | Additional licensing |
| CDC | Medium (2 servers + Kafka optional) | Medium | InfoSphere licensing |
Factor 4: Geographic Distribution
| Scenario | Recommended Solutions |
|---|---|
| Same data center | pureScale, HADR SYNC |
| Same metro area (<100km) | HADR NEARSYNC |
| Cross-region (100-1000km) | HADR ASYNC, Q Replication |
| Cross-continent (>1000km) | HADR SUPERASYNC, Q Replication, CDC |
Decision Matrix for Meridian National Bank
Meridian evaluates each technology for different workloads:
| Workload | RPO | RTO | Solution Selected | Rationale |
|---|---|---|---|---|
| Core banking (OLTP) | Near-zero | <2 min | HADR NEARSYNC | Best cost/benefit ratio |
| Customer portal (web) | Near-zero | <2 min | HADR NEARSYNC (same instance) | Shared with core banking |
| Analytics/reporting | Minutes | Hours | CDC to data warehouse | Different technology stack |
| Mobile API | Near-zero | <30 sec | HADR + ACR + reads-on-standby | Fast reroute + standby for reads |
| Regulatory archive | Hours | Hours | Backup/restore | Cost-effective for cold data |
| Cross-site DR | Seconds | <30 min | HADR ASYNC (auxiliary standby at DR site) | Distance precludes SYNC |
Combining Solutions
These technologies are not mutually exclusive. A comprehensive HA strategy often combines multiple approaches:
┌─────────────────┐
│ PRIMARY DB │
│ (Production) │
└────┬───┬───┬────┘
│ │ │
NEARSYNC │ │ │ Log-based
HADR │ │ │ CDC
│ │ │
┌──────────────┘ │ └──────────────┐
▼ │ ▼
┌─────────────────┐ │ ┌─────────────────┐
│ PRINCIPAL │ │ │ KAFKA CLUSTER │
│ STANDBY │ │ │ (CDC target) │
│ (same city) │ │ └────────┬────────┘
│ + Reads for │ │ │
│ reporting │ ASYNC ▼
└─────────────────┘ HADR ┌─────────────────┐
│ │ DATA WAREHOUSE │
▼ │ FRAUD ENGINE │
┌─────────────────┐ │ NOTIFICATIONS │
│ AUXILIARY │ └─────────────────┘
│ STANDBY │
│ (DR site, 500km)│
└─────────────────┘
29.11 Meridian Bank LUW HA Design
Environment Overview
Meridian National Bank's LUW environment runs on:
- Production server: IBM Power Systems S1022, 32 cores, 512 GB RAM, running RHEL 8.6
- Standby server: Identical hardware, same data center (different rack, different power feed)
- DR server: IBM Power Systems in a data center 400 km away
- DB2 version: Db2 11.5.8
- Database: MERIDIAN (core banking), 2 TB
HADR Configuration: NEARSYNC with Principal and Auxiliary Standbys
Meridian deploys HADR with two standbys:
Principal Standby (Same Data Center)
- Sync mode: NEARSYNC
- Purpose: Primary failover target, reads-on-standby for reporting
- RPO: Near-zero
- RTO: <60 seconds with automatic failover via Pacemaker
Auxiliary Standby (DR Site, 400 km Away)
- Sync mode: ASYNC
- Purpose: Disaster recovery
- RPO: 2-5 seconds
- RTO: <30 minutes (manual takeover after site declaration)
Data Center A Data Center B (DR)
┌──────────────────────────────────┐ ┌──────────────────┐
│ │ │ │
│ ┌──────────┐ ┌──────────┐ │ │ ┌──────────┐ │
│ │ PRIMARY │───►│ PRINCIPAL│ │ │ │ AUXILIARY│ │
│ │ MERIDIAN │ │ STANDBY │ │ │ │ STANDBY │ │
│ │ │ │ (ROS) │ │ │ │ (DR) │ │
│ └──────────┘ └──────────┘ │ │ └──────────┘ │
│ │ NEARSYNC │ │ ▲ │
│ │ │ │ │ │
│ └─────────────────────────┼───┼───────┘ │
│ ASYNC │ │ │
│ │ │ │
│ Pacemaker cluster (VIP) │ │ │
│ ┌──────────────────────────┐ │ │ │
│ │ VIP: 10.100.1.100 │ │ │ │
│ │ Monitors HADR state │ │ │ │
│ │ Triggers automatic │ │ │ │
│ │ takeover if primary fails│ │ │ │
│ └──────────────────────────┘ │ │ │
└──────────────────────────────────┘ └──────────────────┘
ACR Configuration for Applications
All Meridian applications use the following db2dsdriver.cfg:
<configuration>
<dsncollection>
<dsn alias="MERIDIAN" name="MERIDIAN"
host="10.100.1.100" port="50000">
<parameter name="enableACR" value="true"/>
<parameter name="enableSeamlessACR" value="true"/>
</dsn>
</dsncollection>
<databases>
<database name="MERIDIAN" host="10.100.1.100" port="50000">
<acr>
<enableacr>TRUE</enableacr>
<enableseamlessacr>TRUE</enableseamlessacr>
<acrtimeout>30</acrtimeout>
<maxacrretries>3</maxacrretries>
<acrretryinterval>10</acrretryinterval>
<alternateserverlist>
<server name="standby1"
hostname="10.100.1.101" port="50000"/>
</alternateserverlist>
</acr>
</database>
</databases>
</configuration>
The VIP (10.100.1.100) is managed by Pacemaker and moves to the standby server during failover. Applications connect to the VIP and use ACR as a secondary mechanism.
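It is worth sizing the worst case that these retry settings imply. The arithmetic below assumes the parameters behave as their names suggest — a per-attempt timeout, a fixed number of retries, and a fixed interval between them — which should be verified against the driver documentation for your client version.

```python
# Values from the db2dsdriver.cfg above.
acr_timeout_s = 30
max_retries = 3
retry_interval_s = 10

# Worst case: every attempt runs to its timeout, with the retry
# interval elapsing between consecutive attempts.
worst_case_s = max_retries * acr_timeout_s + (max_retries - 1) * retry_interval_s
print(worst_case_s)  # 110 seconds before ACR gives up
```

A bound like this matters when setting application-side timeouts: if the application gives up after 60 seconds, the later ACR retries can never succeed from its point of view.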
Reads-on-Standby for Reporting
Meridian's reporting applications connect directly to the principal standby:
# Reporting application connection string
DATABASE=MERIDIAN;HOSTNAME=10.100.1.101;PORT=50000;
CURRENTSCHEMA=MERIDIAN;CONCURRENTACCESSRESOLUTION=WAITFOROUTCOME;
Reports running on the standby include:
- Daily transaction summaries
- Customer account statements
- Branch performance dashboards
- Regulatory compliance reports (that do not need real-time data)
This offloads approximately 30% of the read workload from the primary, freeing CPU and I/O capacity for OLTP transactions.
CDC to Data Warehouse
Meridian deploys CDC to stream changes from the MERIDIAN database to its enterprise data warehouse:
DB2 MERIDIAN (Primary)
│
▼ (log reader)
CDC Capture Engine
│
▼
Apache Kafka (3-node cluster)
Topics:
meridian.accounts
meridian.transactions
meridian.customers
meridian.loans
│
├──► Kafka Connect → Data Warehouse (Db2 Warehouse)
├──► Fraud Detection Engine (Kafka Streams)
└──► Customer Notification Service (Spring Boot consumer)
The CDC pipeline achieves:
- End-to-end latency: <2 seconds from commit on primary to availability in Kafka
- Throughput: up to 50,000 changes per second
- Availability: Kafka's replication provides its own HA; if the CDC engine fails, it resumes from the last checkpoint in the DB2 log
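The checkpoint-based resume behavior can be sketched as follows. This is a simplified model — real DB2 log addressing (LSNs) and checkpoint persistence are more involved — but it shows why a restarted capture engine neither loses nor duplicates changes.

```python
class CaptureEngine:
    """Toy capture engine that checkpoints the last processed log position."""

    def __init__(self, checkpoint_lsn=0):
        self.checkpoint_lsn = checkpoint_lsn
        self.published = []

    def run(self, log_records):
        """Process (lsn, change) pairs newer than the checkpoint."""
        for lsn, change in log_records:
            if lsn <= self.checkpoint_lsn:
                continue  # already processed before the failure
            self.published.append(change)
            self.checkpoint_lsn = lsn  # persisted after each publish

log = [(1, "ins A"), (2, "upd A"), (3, "del B")]
engine = CaptureEngine()
engine.run(log[:2])          # engine stops here (simulated failure)
restarted = CaptureEngine(engine.checkpoint_lsn)
restarted.run(log)           # re-reads the log; records 1-2 are skipped
print(restarted.published)   # ['del B']
```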
Monitoring and Alerting
Meridian monitors the following HADR metrics continuously:
| Metric | Source | Alert Threshold |
|---|---|---|
| HADR state | MON_GET_HADR() | State != PEER for >30 seconds |
| Log gap (bytes) | MON_GET_HADR() | >100 MB |
| Log gap (time) | MON_GET_HADR() | >5 seconds |
| Connection status | MON_GET_HADR() | DISCONNECTED |
| Replay lag on standby | db2pd -hadr | >60 seconds |
| CDC latency | CDC dashboard | >10 seconds |
| Kafka consumer lag | Kafka metrics | >1000 messages |
Alerts feed into Meridian's enterprise monitoring system (IBM Instana / Prometheus + Grafana) with escalation to the on-call DBA within 5 minutes.
Disaster Recovery Procedure
In the event of a complete Data Center A failure:
- Declaration: Operations declares a disaster (may involve regulatory coordination)
- DNS update: Application DNS records are updated to point to Data Center B
- Forced takeover: run db2 TAKEOVER HADR ON DATABASE MERIDIAN BY FORCE on the auxiliary standby at the DR site
- Application restart: Applications reconnect to the DR database
- Verify: DBAs verify data integrity and confirm the RPO (check last committed transaction)
- Target RTO: 30 minutes from declaration to applications serving traffic at the DR site
Annual DR Testing
Meridian conducts a full DR test annually:
- Redirect a subset of test traffic to the DR standby
- Perform a controlled takeover at the DR site
- Run application validation tests
- Measure actual RTO and RPO
- Document findings and update the DR runbook
- Reverse the roles and restore normal operation
Spaced Review
From Chapter 3: SQL Foundations
Connection: Every SQL statement in your application — the SELECTs that fetch account balances, the INSERTs that record transactions, the UPDATEs that modify customer records — depends on the HA infrastructure described in this chapter. HADR transparently replicates these SQL operations from the primary to the standby by shipping the log records they generate. When ACR reroutes your application to the standby after a failure, the SQL interface is identical. The application does not know it is talking to a different server.
From Chapter 18: Concurrency and Locking
Connection: HADR's reads-on-standby feature (Section 29.6) restricts isolation levels to UR and CS. In Chapter 18, you learned that RS and RR provide stronger consistency guarantees through lock retention. On the standby, these stronger levels are not available because the standby does not participate in the primary's lock management. If your reporting queries require RS or RR semantics, they must run on the primary. Understanding the isolation level spectrum from Chapter 18 is essential for designing a reads-on-standby strategy.
From Chapter 20: Application Development
Connection: Chapter 20 covered application development patterns including connection management, error handling, and transaction design. HADR and ACR require applications to handle specific error codes gracefully. SQL error -30108 indicates that the connection was rerouted — the application should retry the transaction. Long-running transactions delay log replay on the standby and increase the HADR log gap. The transaction design principles from Chapter 20 — keep transactions short, commit frequently — are doubly important in an HADR environment.
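The retry pattern for a rerouted connection looks like the sketch below. The `RerouteError` class and the `run_transaction` callable are illustrative stand-ins for a real driver's exception and your transaction logic — actual drivers surface SQLCODE -30108 through their own error types.

```python
class RerouteError(Exception):
    """Stand-in for the driver error carrying SQLCODE -30108."""
    sqlcode = -30108

def with_reroute_retry(run_transaction, max_attempts=3):
    """Replay a transaction when ACR reroutes the connection mid-flight."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_transaction()
        except RerouteError:
            if attempt == max_attempts:
                raise  # give up; surface the error to the caller

attempts = []
def txn():
    attempts.append(1)
    if len(attempts) == 1:
        raise RerouteError()  # first try hits the failover window
    return "committed"

print(with_reroute_retry(txn))  # retried once, then committed
```

Note that only the whole transaction is safe to replay — retrying a single statement after reroute would lose the earlier, rolled-back statements of the unit of work.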
Summary
DB2 LUW provides a comprehensive portfolio of high availability technologies, each serving different points on the RPO/RTO/cost spectrum. HADR is the workhorse — a mature, well-understood technology that provides near-zero data loss and sub-minute recovery for most enterprise workloads. pureScale delivers continuous availability for the most demanding environments where even seconds of downtime are unacceptable. Q Replication enables bidirectional, active-active configurations and cross-platform data distribution. CDC feeds real-time change streams to modern data platforms like Kafka, enabling event-driven architectures.
For Meridian National Bank, the combination of HADR NEARSYNC (for primary HA), HADR ASYNC (for geographic DR), reads-on-standby (for reporting offload), and CDC-to-Kafka (for real-time data distribution) provides a robust, cost-effective HA strategy that meets the bank's availability, performance, and regulatory requirements.
No single technology solves every problem. The art of HA design lies in understanding your requirements — RPO, RTO, budget, operational capability — and selecting the right combination of technologies. This chapter has given you the knowledge to make those decisions with confidence.