Chapter 29 Key Takeaways

HADR Fundamentals

  1. HADR is DB2 LUW's primary HA technology. It maintains a synchronized copy of the database on a standby server by continuously shipping and replaying transaction log records. It is included with all DB2 editions at no additional license cost.

  2. The synchronization mode is the most critical HADR decision. SYNC guarantees zero data loss but adds commit latency. NEARSYNC provides near-zero data loss with less latency. ASYNC adds no latency but permits seconds of data loss. Choose based on RPO requirements and network distance.

  3. NEARSYNC is the right default for most production environments. It provides near-zero RPO with minimal performance impact: data is lost only if the primary and the standby fail at the same time, before the standby has written the received log records to disk. SYNC is justified only when zero data loss is an absolute legal or regulatory requirement.

  4. Always start the standby first, then the primary. The standby must be listening before the primary attempts to connect. If the primary is started first, it waits for the standby until the HADR timeout expires and START HADR then fails unless BY FORCE is specified, which delays synchronization.

  5. Three standby roles serve different purposes. The principal standby is the automatic failover target. Auxiliary standbys provide additional redundancy or geographic DR. A delayed standby protects against logical corruption by replaying logs after a configurable delay.
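
The startup order in takeaway 4 can be captured in a small helper so that runbooks and automation never issue the commands in the wrong sequence. The following is a minimal sketch, not a complete procedure: the function name and the database name SALESDB are hypothetical, and it only builds the START HADR command strings without running them.

```python
def hadr_start_commands(db_name: str) -> list[tuple[str, str]]:
    """Return (role, command) pairs in the order they should be run.

    The standby is listed first so that it is already listening
    when the primary tries to connect.
    """
    return [
        ("standby", f"db2 start hadr on database {db_name} as standby"),
        ("primary", f"db2 start hadr on database {db_name} as primary"),
    ]


# SALESDB is a placeholder database name used only for illustration.
for role, command in hadr_start_commands("SALESDB"):
    print(f"[{role} host] {command}")
```

Each command must be run on the host that holds the corresponding role; the sketch leaves remote execution and error handling to the surrounding tooling.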

Automatic Client Reroute

  1. ACR is essential for transparent failover. Without ACR, applications need custom reconnection logic. With ACR, the DB2 client driver automatically reconnects to the standby after a failover. Configure it in db2dsdriver.cfg for all client applications.

  2. Seamless ACR minimizes application disruption. For read-only transactions or transactions with no work done yet, seamless ACR reconnects without the application seeing an error. For in-flight write transactions, SQL30108N is returned and the application must retry.

  3. Applications must handle SQL30108N gracefully. This error means the connection failed and was re-established on another server, and the in-flight transaction was rolled back. The application should catch it and retry the failed transaction from the beginning.
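
The retry behavior described in takeaway 3 can be sketched as a small wrapper around a transaction function. This is an illustrative sketch under stated assumptions: it assumes the driver surfaces the message ID SQL30108N in the exception text, and the names run_with_reroute_retry and transfer_funds are hypothetical. A real application would inspect the driver's own SQLCODE or SQLSTATE fields instead of matching on text.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

REROUTE_CODE = "SQL30108N"


def is_reroute_error(exc: Exception) -> bool:
    # Assumption: the message ID appears in the exception text.
    return REROUTE_CODE in str(exc)


def run_with_reroute_retry(
    txn: Callable[[], T],
    max_attempts: int = 3,
    backoff_seconds: float = 0.5,
) -> T:
    """Run txn(); if it fails with SQL30108N, rerun it from the beginning."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except Exception as exc:
            # Re-raise anything that is not a reroute, or the final failure.
            if not is_reroute_error(exc) or attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * attempt)
    raise AssertionError("unreachable")


# Simulated transaction: the first call fails as if a failover occurred.
calls = {"count": 0}


def transfer_funds() -> str:
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("SQL30108N  A connection failed but has been re-established.")
    return "committed"


result = run_with_reroute_retry(transfer_funds, backoff_seconds=0.0)
print(result, calls["count"])  # prints "committed 2"
```

The whole unit of work is passed in as a function so that the retry replays every statement of the transaction, not just the statement that happened to receive the error.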

Reads-on-Standby

  1. Reads-on-standby offloads reporting from the primary. The standby database can serve read-only queries while replaying logs. This is free capacity that would otherwise be wasted.

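  2. Only the UR isolation level is available on the standby. CS, RS, and RR are not supported because the standby cannot manage traditional locks while replaying the primary's log; a query that requests a higher level fails unless the DB2_STANDBY_ISO registry variable is set to UR, which downgrades the request to UR. Design your reporting queries accordingly.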

  3. Heavy reads on the standby can slow log replay. Monitor the HADR log gap when running read-intensive workloads on the standby. If the gap grows, the standby is falling behind the primary, which could increase RTO during a failover.
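
The log gap check in takeaway 3 can be turned into a simple trend test. The sketch below assumes that gap readings in bytes are collected periodically, for example from the HADR_LOG_GAP column of the MON_GET_HADR table function; verify that column against the documentation for the installed DB2 version. The function name and the three-sample window are hypothetical choices, not recommended values.

```python
# Assumed monitoring query; the code below does not execute it.
HADR_GAP_QUERY = "SELECT HADR_LOG_GAP FROM TABLE(MON_GET_HADR(NULL))"


def gap_is_growing(samples: list[int], min_samples: int = 3) -> bool:
    """Return True if the last min_samples readings are strictly increasing."""
    if min_samples < 2:
        raise ValueError("min_samples must be at least 2")
    if len(samples) < min_samples:
        return False  # not enough data to call it a trend
    recent = samples[-min_samples:]
    return all(earlier < later for earlier, later in zip(recent, recent[1:]))


# Made-up readings in bytes, taken at a fixed interval.
steady = [4096, 8192, 4096, 2048]
falling_behind = [4096, 65536, 1048576, 8388608]

print(gap_is_growing(steady))          # prints False
print(gap_is_growing(falling_behind))  # prints True
```

A strictly increasing window is a deliberately simple signal; an alert built on it would normally also apply an absolute threshold so that small fluctuations are not reported.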

pureScale

  1. pureScale provides continuous availability with multiple active read/write members. Unlike HADR (active-passive), pureScale runs multiple active members that all serve read/write traffic. There is no failover because all members are always active.

  2. pureScale requires significant infrastructure investment. It needs multiple servers, a low-latency interconnect (typically RDMA over InfiniBand or RoCE), shared storage (GPFS/Spectrum Scale), and premium licensing. It is justified for workloads that genuinely require sub-second RTO and continuous availability.

  3. pureScale is generally limited to same-data-center deployments. The low-latency interconnect requirement means pureScale members must be physically close. For geographic DR, pureScale must be combined with HADR or Q Replication.

Replication Technologies

  1. Q Replication enables bidirectional, active-active configurations. Both databases accept writes and replicate to each other via IBM MQ. Conflict detection and resolution are built in. This is the technology for multi-site active-active architectures.

  2. CDC streams real-time changes to diverse targets. Unlike Q Replication (DB2-to-DB2), CDC can feed Kafka, data warehouses, flat files, and other systems. It reads the DB2 log directly, with no triggers, no polling, and minimal source overhead.

  3. Q Replication and CDC serve different purposes. Q Replication is for database-to-database synchronization with transactional consistency. CDC is for data distribution and event streaming. Use Q Replication for active-active; use CDC for analytics pipelines.

Design Principles

  1. No single technology solves every HA requirement. The best architectures combine multiple technologies: HADR for local HA, Q Replication for cross-site synchronization, and CDC for analytics distribution. Understand each technology's strengths and use them together.

  2. RPO and RTO drive technology selection. Zero RPO with sub-second RTO demands pureScale. Near-zero RPO with sub-minute RTO is well served by HADR NEARSYNC. Cross-region DR with seconds of RPO uses HADR ASYNC or Q Replication.

  3. Test everything before you need it. Conduct regular takeover tests, failover drills, and replication validation. The first time you execute a forced takeover should not be during a real disaster at 3 AM. Schedule DR tests at least annually and document every finding.
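
The RPO and RTO rules of thumb in this chapter can be written down as a small decision function. This is a rough sketch of the summary above, not a sizing tool: the function name is hypothetical, the sub-second boundary is taken literally from the text, and combinations the chapter does not cover return a prompt to revisit the requirements.

```python
def suggest_ha_technology(
    rpo_seconds: float,
    rto_seconds: float,
    cross_region: bool = False,
) -> str:
    """Map RPO/RTO targets to the technology this chapter suggests."""
    if cross_region:
        # The chapter only covers cross-region DR with seconds of RPO.
        if rpo_seconds > 0:
            return "HADR ASYNC or Q Replication"
        return "no direct match: revisit requirements"
    if rpo_seconds == 0:
        # Zero data loss: pureScale for sub-second RTO, otherwise HADR SYNC.
        return "pureScale" if rto_seconds < 1 else "HADR SYNC"
    # Near-zero RPO within one site: the chapter's default choice.
    return "HADR NEARSYNC"


print(suggest_ha_technology(0, 0.5))                      # prints pureScale
print(suggest_ha_technology(0, 30))                       # prints HADR SYNC
print(suggest_ha_technology(1, 30))                       # prints HADR NEARSYNC
print(suggest_ha_technology(5, 120, cross_region=True))   # prints HADR ASYNC or Q Replication
```

Real selections also weigh cost, network distance, and operational skill, so the output is best treated as a starting point for discussion.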