Case Study 29.1: HADR Implementation and Takeover Test

Background

Meridian National Bank's digital banking division runs its customer-facing applications on DB2 LUW 11.5. The environment consists of a single production server handling:

  • Online banking web portal: 3,000 concurrent users at peak
  • Mobile banking API: 8,000 API calls per minute
  • Internal teller application: 500 active sessions across 120 branches
  • Transaction processing rate: 2,800 TPS at peak

The database (MERIDIAN) is 1.8 TB on an IBM Power Systems S1022 running RHEL 8.6. There is no HA solution in place — the bank relies on nightly backups and has an estimated RTO of 4-6 hours for a catastrophic failure.

Following a board mandate to achieve 99.99% availability, the DBA team — led by senior DBA Marcus Rivera — is tasked with implementing HADR.

Challenge

  1. Implement HADR with near-zero data loss (RPO < 1 second)
  2. Achieve RTO under 2 minutes for automatic failover
  3. Enable reads-on-standby for the reporting workload
  4. Conduct a full takeover test to validate the configuration
  5. Complete the project within 6 weeks with zero impact to production

Phase 1: Infrastructure Preparation (Weeks 1-2)

Hardware Procurement

Marcus provisions a standby server identical to the primary:

  • IBM Power Systems S1022, 32 cores, 512 GB RAM
  • Same storage configuration (IBM FlashSystem 7300, 4 TB usable)
  • RHEL 8.6 with identical patch level
  • DB2 11.5.8 installed at the same fix pack level

Network Configuration

The standby is in the same data center but on a different rack with:

  • Separate power feed (UPS B)
  • Separate network switch (TOR-B)
  • Dedicated HADR network: 10 Gbps VLAN between primary and standby
  • Network latency: 0.3 ms (same data center)

Marcus configures dedicated network interfaces for HADR traffic to avoid competition with application traffic:

Primary HADR interface:  10.10.100.10 (eth2, VLAN 100)
Standby HADR interface:  10.10.100.20 (eth2, VLAN 100)

Pacemaker Cluster Setup

Marcus installs and configures Pacemaker for automatic failover:

  • Quorum device: a lightweight VM on a third server
  • Virtual IP (VIP): 10.10.1.100 — this is the address applications connect to
  • Pacemaker monitors HADR state and triggers takeover if the primary becomes unreachable
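With DB2 11.5.8, the Pacemaker stack can be stood up with the bundled db2cm utility rather than raw pcs commands. A hedged sketch, assuming hostnames prod1/prod2/qdev1 and interface eth0 (none of these names appear in the case study, and db2cm flags vary by fix pack, so check the db2cm reference for your level):

```shell
# Hedged sketch: db2cm wraps Pacemaker/Corosync setup for HADR.
# Hostnames and the Ethernet device are placeholders.

# Create the cluster domain across primary and standby
db2cm -create -cluster -domain meridian_dom \
      -host prod1 -publicEthernet eth0 \
      -host prod2 -publicEthernet eth0

# Put the instance (on each host) and the HADR database under cluster control
db2cm -create -instance db2inst1 -host prod1
db2cm -create -instance db2inst1 -host prod2
db2cm -create -db MERIDIAN -instance db2inst1

# The VIP applications connect to
db2cm -create -primaryVIP 10.10.1.100 -db MERIDIAN -instance db2inst1

# Quorum device on the third server
db2cm -create -qdevice qdev1
```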

Phase 2: HADR Configuration (Week 3)

Backup and Restore

# On primary
db2 BACKUP DATABASE MERIDIAN TO /db2backup COMPRESS INCLUDE LOGS

# Transfer backup to standby (dedicated 10 Gbps link)
rsync -av --progress /db2backup/MERIDIAN.0.* standby:/db2restore/

# On standby
db2 RESTORE DATABASE MERIDIAN FROM /db2restore REPLACE EXISTING
# Do NOT roll forward: START HADR ... AS STANDBY requires the database to
# remain in rollforward-pending state; HADR replays the remaining logs itself

The 1.8 TB backup took 45 minutes to create (compressed to 680 GB) and 20 minutes to transfer over the dedicated link. Restore completed in 38 minutes.

HADR Parameter Configuration

On the primary:

db2 UPDATE DB CFG FOR MERIDIAN USING \
    HADR_LOCAL_HOST  10.10.100.10 \
    HADR_LOCAL_SVC   55001 \
    HADR_REMOTE_HOST 10.10.100.20 \
    HADR_REMOTE_SVC  55001 \
    HADR_REMOTE_INST db2inst1 \
    HADR_SYNCMODE    NEARSYNC \
    HADR_PEER_WINDOW 120 \
    HADR_TIMEOUT     120 \
    LOGINDEXBUILD    ON

On the standby:

db2 UPDATE DB CFG FOR MERIDIAN USING \
    HADR_LOCAL_HOST  10.10.100.20 \
    HADR_LOCAL_SVC   55001 \
    HADR_REMOTE_HOST 10.10.100.10 \
    HADR_REMOTE_SVC  55001 \
    HADR_REMOTE_INST db2inst1 \
    HADR_SYNCMODE    NEARSYNC \
    HADR_PEER_WINDOW 120 \
    HADR_TIMEOUT     120 \
    LOGINDEXBUILD    ON

Starting HADR

# Standby first
ssh standby "db2 START HADR ON DATABASE MERIDIAN AS STANDBY"

# Then primary
db2 START HADR ON DATABASE MERIDIAN AS PRIMARY

Initial Synchronization

Marcus monitors the initial catch-up:

db2pd -db MERIDIAN -hadr
HADR_ROLE     = PRIMARY
HADR_STATE    = REMOTE_CATCHUP
HADR_SYNCMODE = NEARSYNC
LOG_GAP       = 245,760 bytes    # Catching up...

After 3 minutes:

HADR_ROLE     = PRIMARY
HADR_STATE    = PEER
HADR_SYNCMODE = NEARSYNC
LOG_GAP       = 0

HADR is in PEER state — fully synchronized.
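The peer check Marcus runs by hand can be wrapped into a small health-check script. A minimal sketch that parses db2pd output (the field layout is assumed to match the snippet above; the sample string stands in for live output):

```shell
# Hedged sketch: fail unless the HADR role/state pair is PRIMARY/PEER.
check_hadr() {
    # $1: output of `db2pd -db MERIDIAN -hadr`
    printf '%s\n' "$1" | awk -F' *= *' '
        /HADR_ROLE/  { role  = $2 }
        /HADR_STATE/ { state = $2 }
        END { print role "/" state
              exit (role == "PRIMARY" && state == "PEER") ? 0 : 1 }'
}

sample="HADR_ROLE     = PRIMARY
HADR_STATE    = PEER
HADR_SYNCMODE = NEARSYNC
LOG_GAP       = 0"

check_hadr "$sample"    # prints PRIMARY/PEER and exits 0 in peer state
```

In practice the sample would be replaced by a command substitution around db2pd, and the non-zero exit code wired into the monitoring system.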

Phase 3: ACR and Application Configuration (Week 4)

ACR Configuration

Marcus configures the alternate server on the primary:

db2 UPDATE ALTERNATE SERVER FOR DATABASE MERIDIAN \
    USING HOSTNAME 10.10.1.101 PORT 50000

He deploys the db2dsdriver.cfg to all application servers:

<configuration>
  <dsncollection>
    <dsn alias="MERIDIAN" name="MERIDIAN"
         host="10.10.1.100" port="50000">
    </dsn>
  </dsncollection>
  <databases>
    <database name="MERIDIAN" host="10.10.1.100" port="50000">
      <acr>
        <enableacr>TRUE</enableacr>
        <enableseamlessacr>TRUE</enableseamlessacr>
        <acrtimeout>30</acrtimeout>
        <maxacrretries>3</maxacrretries>
        <acrretryinterval>10</acrretryinterval>
        <alternateserverlist>
          <server name="standby"
                  hostname="10.10.1.101" port="50000"/>
        </alternateserverlist>
      </acr>
    </database>
  </databases>
</configuration>
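Before trusting ACR in a takeover, the deployed configuration can be checked from each application server with the driver's validate tool. A hedged sketch (the user name and password are placeholders):

```shell
# Hedged sketch: confirms the MERIDIAN alias resolves from db2dsdriver.cfg
# and that a test connection succeeds. appuser is a placeholder.
db2cli validate -dsn MERIDIAN -connect -user appuser -passwd '********'
```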

Application Error Handling Update

The Java application team adds HADR-aware error handling:

try {
    // Execute transaction
    connection.setAutoCommit(false);
    stmt.executeUpdate(transferSQL);
    connection.commit();
} catch (SQLException e) {
    if (e.getErrorCode() == -30108) {
        // ACR reroute occurred — retry the transaction
        log.warn("Connection rerouted via ACR, retrying transaction");
        retryTransaction();
    } else if (e.getErrorCode() == -4499) {
        // Connection lost, ACR will reconnect automatically
        log.warn("Connection lost, waiting for ACR reconnection");
        retryTransaction();
    } else {
        throw e;
    }
}

Reads-on-Standby Configuration

Marcus configures a dedicated connection pool for reporting applications that connects directly to the standby:

Reporting DSN: MERIDIAN_ROS
Host: 10.10.1.101 (standby direct address)
Port: 50000

The reporting team updates their Crystal Reports and Cognos connections to use the MERIDIAN_ROS data source.
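Reads-on-standby is not on by default; it is enabled through registry variables on the standby instance. A hedged sketch of the enablement, assuming the standby is reachable as "standby" (the restart sequencing is simplified):

```shell
# Hedged sketch: run against the standby; both registry variables require
# an instance restart to take effect.
ssh standby 'db2set DB2_HADR_ROS=ON'      # enable reads on standby
ssh standby 'db2set DB2_STANDBY_ISO=UR'   # readers run at uncommitted-read isolation
ssh standby 'db2 DEACTIVATE DATABASE MERIDIAN && db2stop && db2start'
ssh standby 'db2 START HADR ON DATABASE MERIDIAN AS STANDBY'
```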

Phase 4: Takeover Test (Week 5)

Test Plan

Marcus schedules a takeover test during a maintenance window (Saturday 11 PM to Sunday 3 AM). The test plan:

  1. Verify HADR state is PEER
  2. Run a synthetic workload simulating production traffic
  3. Execute a graceful takeover
  4. Measure RTO (time from takeover initiation to application reconnection)
  5. Verify data integrity (compare row counts and checksums)
  6. Reverse the roles (takeover back to the original primary)
  7. Conduct a forced takeover test (simulating sudden primary failure)

Test Execution

Test 1: Graceful Takeover

11:15 PM — Synthetic workload running at 1,500 TPS:

# On standby
db2 TAKEOVER HADR ON DATABASE MERIDIAN

# Takeover initiated at 23:15:00.000
# Standby promoted to PRIMARY at 23:15:08.342
# Old primary demoted to STANDBY at 23:15:08.342

RTO: 8.3 seconds from initiation to the standby accepting connections.

Application logs show:

  • 12 connections received SQL30108N (rerouted via ACR)
  • 3 in-flight transactions rolled back
  • All 12 connections successfully reconnected and resumed within 4 seconds
  • No data loss (verified via transaction count comparison)

Test 2: Forced Takeover (Simulated Failure)

12:30 AM — After reversing roles, Marcus simulates a primary failure:

# On primary: simulate a crash by killing the instance outright
kill -9 $(pgrep db2sysc)

# On standby: Pacemaker detects the failure and drives the forced takeover;
# the equivalent manual command (PEER WINDOW ONLY guards against data loss
# when HADR_PEER_WINDOW is configured, as here):
db2 TAKEOVER HADR ON DATABASE MERIDIAN BY FORCE PEER WINDOW ONLY

RTO: 23 seconds from the kill signal to the standby accepting connections. Pacemaker detected the failure in 12 seconds and triggered the takeover automatically.

Application logs show:

  • All 1,500 active connections dropped
  • ACR reconnected 1,487 connections within 15 seconds
  • 13 connections failed ACR (exceeded retry limit) and were re-established by the connection pool
  • 47 in-flight transactions rolled back
  • Zero committed transactions lost (verified by comparing the primary's last commit LSN with the standby's replayed LSN)

Data Integrity Verification

-- Compare row counts for critical tables
SELECT 'ACCOUNTS' AS TBL, COUNT(*) AS CNT FROM ACCOUNTS
UNION ALL
SELECT 'TRANSACTIONS', COUNT(*) FROM TRANSACTIONS
UNION ALL
SELECT 'CUSTOMERS', COUNT(*) FROM CUSTOMERS;

Row counts matched exactly between the pre-failure primary and the post-takeover standby.
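The test plan also calls for checksums, which row counts alone do not provide. A hedged sketch of a coarse per-table content fingerprint (TXN_ID and AMOUNT are hypothetical column names, not the real schema):

```shell
# Hedged sketch: summing a key column and an amount column gives a cheap
# fingerprint to compare between the pre-failure primary and the
# post-takeover standby. TXN_ID and AMOUNT are placeholder column names.
db2 "SELECT COUNT(*) AS CNT,
            SUM(BIGINT(TXN_ID)) AS ID_SUM,
            CAST(SUM(AMOUNT) AS DECIMAL(31,2)) AS AMT_SUM
     FROM TRANSACTIONS"
```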

Results

Metric                                 Target               Achieved
RPO                                    < 1 second           0 (zero data loss)
RTO (graceful)                         < 2 minutes          8.3 seconds
RTO (forced, with Pacemaker)           < 2 minutes          23 seconds
ACR reconnection                       < 30 seconds         4-15 seconds
Reporting offload (reads-on-standby)   30% of read traffic  35% achieved
Production impact during setup         Zero                 Zero

Lessons Learned

  1. Dedicated HADR network is essential. Initial testing over the shared production network showed variable latency (0.3-5 ms). The dedicated VLAN provides consistent 0.3 ms latency.

  2. Test ACR with real application code, not just db2 CLP. Two application modules had hardcoded connection strings that bypassed the db2dsdriver.cfg. These were discovered during the takeover test.

  3. LOGINDEXBUILD ON is non-negotiable. Without it, indexes on the standby may be marked invalid after a REORG INDEX operation on the primary. Marcus discovered this during a REORG maintenance window when the standby's indexes required rebuilding.

  4. Pacemaker quorum configuration matters. Initial setup without a quorum device led to a split-brain scenario during network testing. Adding a quorum device on a third server resolved this.

  5. Monitor HADR continuously, not just during tests. Marcus implemented Prometheus-based monitoring that polls MON_GET_HADR() every 10 seconds and alerts if the log gap exceeds 100 MB or the state leaves PEER for more than 30 seconds.
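The alert rule in lesson 5 reduces to a simple threshold comparison. A minimal sketch of the gap check (the live value would come from MON_GET_HADR(); a stand-in sample from the earlier catch-up snapshot is used here):

```shell
# Hedged sketch of the lesson-5 alert rule: page if the HADR log gap
# exceeds 100 MB. log_gap_bytes is a stand-in sample value.
gap_threshold=$((100 * 1024 * 1024))   # 100 MB in bytes
log_gap_bytes=245760                   # stand-in; query MON_GET_HADR() live

if [ "$log_gap_bytes" -gt "$gap_threshold" ]; then
    alert="ALERT: HADR log gap ${log_gap_bytes} bytes exceeds 100 MB"
else
    alert="OK: log gap ${log_gap_bytes} bytes within threshold"
fi
echo "$alert"
```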

Discussion Questions

  1. Why did Marcus choose NEARSYNC instead of SYNC mode? Under what circumstances would SYNC be the better choice?
  2. If Meridian later wants to add a DR standby at a remote site (400 km away), what changes to the HADR configuration are needed?
  3. The forced takeover RTO was 23 seconds, with 12 seconds consumed by Pacemaker failure detection. How could this be reduced?
  4. What is the risk of using reads-on-standby for the regulatory compliance report that runs for 2 hours? How might it impact HADR performance?