Part VI: High Availability and Scalability

Production databases do not get the luxury of downtime. The transaction does not care that you are applying maintenance. The customer does not care that a disk failed. The regulator does not care that your data center lost power. The expectation in modern enterprise computing is continuous availability, and DB2 provides the mechanisms to deliver it — but only if you understand them, configure them correctly, and test them before you need them.

This part covers the technologies and architectures that keep DB2 running when hardware fails, when workloads grow beyond a single server's capacity, and when the business demands that the database be available in places and at scales it was not originally designed for.

What This Part Covers

Four chapters, each addressing a distinct aspect of availability and scalability. The first two are platform-specific. The last two apply to both.

Chapter 28 covers DB2 data sharing on z/OS — the Parallel Sysplex technology that allows multiple DB2 subsystems to share a single database simultaneously. Data sharing is the gold standard for database high availability, and it is unique to the z/OS platform. We cover the coupling facility, group buffer pools, the IRLM in a data sharing context, inter-system lock contention, workload balancing across members, and the operational procedures for planned and unplanned member outages. We address the performance considerations — the overhead of global locking, the cross-invalidation (XI) and short-on-storage conditions that indicate group buffer pool pressure, and the design patterns that minimize cross-member contention. If you work on z/OS, this chapter is essential. Data sharing is not a feature you bolt on after the fact; it is an architectural decision that influences your physical design, your application design, and your operational procedures.
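As a small preview of the operational side Chapter 28 covers, a data sharing group's health is typically checked with DB2 operator commands from the console. The command prefix (-DB2A) and group buffer pool name below are illustrative for one member of a group:

```
-DB2A DISPLAY GROUP DETAIL
-DB2A DISPLAY GBPOOL(GBP0) GDETAIL
```

The first command shows each member's status and the coupling facility structures the group depends on; the second reports group buffer pool statistics, including the cross-invalidation and castout activity that signal contention or pressure.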

Chapter 29 covers HADR (High Availability Disaster Recovery) on LUW — DB2's log-shipping replication technology for automatic failover. We cover the HADR architecture: the primary and standby roles, and the principal and auxiliary standbys used in multiple-standby configurations. We work through the synchronization modes (SYNC, NEARSYNC, ASYNC, SUPERASYNC) and the trade-offs each mode makes between data protection and performance impact. We cover automatic client reroute, the Db2 pureScale environment that provides active-active clustering analogous to z/OS data sharing, and the integration with operating system clustering tools like PowerHA and Pacemaker. Failover testing gets extensive treatment — you will practice planned and unplanned failover scenarios because the only time you want to discover a configuration problem is during a test, never during a real outage.
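To make the HADR discussion concrete, here is a minimal configuration sketch using the DB2 command line processor. The database name (BANKDB), hostnames, port, and instance name are illustrative; Chapter 29 works through the full parameter set:

```shell
# On the primary: point HADR at the standby and choose a sync mode
# (NEARSYNC trades a small data-loss window for lower commit latency)
db2 "UPDATE DB CFG FOR BANKDB USING \
     HADR_LOCAL_HOST dc1-db01 HADR_REMOTE_HOST dc2-db01 \
     HADR_LOCAL_SVC 51012 HADR_REMOTE_SVC 51012 \
     HADR_REMOTE_INST db2inst1 \
     HADR_SYNCMODE NEARSYNC HADR_TIMEOUT 120"

# Start the standby first, then the primary
db2 "START HADR ON DB BANKDB AS STANDBY"    # run on the standby server
db2 "START HADR ON DB BANKDB AS PRIMARY"    # run on the primary server

# Verify the pair is connected and check the log gap
db2pd -db BANKDB -hadr
```

The mirror-image configuration (with local and remote values swapped) is applied on the standby before it is started.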

Chapter 30 addresses partitioning and scalability — the strategies for distributing data and workload across multiple structures or systems. On z/OS, we cover partition-by-range universal table spaces, the operational benefits of partition independence (you can REORG one partition while the others remain available), and the performance implications of partition pruning during query processing. On LUW, we cover range partitioning, multi-dimensional clustering (MDC), and database partitioning (DPF) across multiple physical or logical nodes. We also cover federation — the ability to present data from multiple heterogeneous sources through a single DB2 interface — and the use of nicknames, wrappers, and server definitions to build federated architectures. Scalability is not just about handling more data; it is about handling more data without proportional increases in response time, administrative overhead, or complexity.
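A brief sketch of the LUW range-partitioning syntax covered in Chapter 30. The table, columns, and partition name are illustrative:

```sql
-- One partition per month; queries with a txn_date predicate touch
-- only the partitions they need (partition pruning).
CREATE TABLE transactions (
    txn_id    BIGINT        NOT NULL,
    txn_date  DATE          NOT NULL,
    amount    DECIMAL(15,2)
)
PARTITION BY RANGE (txn_date)
    (STARTING FROM ('2024-01-01') ENDING AT ('2024-12-31') EVERY 1 MONTH);

-- Roll out a closed month without taking the table offline
-- (the partition name here assumes the system-generated default):
ALTER TABLE transactions DETACH PARTITION part1 INTO transactions_jan2024;
```

The DETACH operation illustrates partition independence: the detached month becomes its own table for archival while the remaining partitions stay available.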

Chapter 31 takes DB2 into the cloud: IBM's Db2 on Cloud, Db2 Warehouse on Cloud, and the hybrid deployment patterns that connect on-premises DB2 to cloud-based resources. We cover the architectural differences between a self-managed DB2 instance and a managed cloud service, the migration paths for moving workloads from on-premises to cloud, and the operational model changes that cloud deployment demands. We also address the hybrid patterns that most enterprises actually use: a z/OS DB2 system of record on-premises, with cloud-based Db2 instances serving analytics, development, or geographic distribution needs. Cloud is not a destination; it is an additional deployment option, and understanding when it makes sense — and when it does not — requires the same engineering judgment that every other decision in this book requires.

Why It Matters

Downtime costs money. That statement is true for every database, but the magnitude varies enormously. For Meridian National Bank, an hour of unplanned downtime on the core banking database means ATMs that do not dispense cash, branches that cannot process transactions, online banking that returns errors, and regulatory reporting that misses its window. The direct costs are quantifiable. The reputational costs are not.

High availability is not a product you buy. It is an architecture you build, test, and maintain. DB2 provides the components — data sharing, HADR, pureScale, partitioning — but assembling those components into a system that actually survives failure requires understanding the failure modes, the recovery mechanisms, and the operational procedures that connect them.

I have seen organizations invest heavily in HA technology and then fail to test failover for two years. When they finally needed it, the configuration had drifted, the standby was behind, and the failover that was supposed to take 30 seconds took four hours. The technology worked. The operational discipline did not. This part covers both.

Scalability is equally important. Databases grow. Transaction volumes increase. Reporting requirements expand. If your architecture cannot accommodate growth without redesign, you are building a system with a built-in expiration date. The scalability technologies in this part — partitioning, data sharing, DPF, cloud elasticity — provide the mechanisms for growth, but using them effectively requires planning during the design phase, not as a reactive measure when the system is already under stress.

Platform-Specific Guidance

This is the most platform-divergent part of the book, and I want to be direct about how to navigate it.

If you are a z/OS professional, Chapter 28 is your priority. Data sharing is the defining high-availability technology on the mainframe, and deep expertise in it is a career differentiator. Read Chapter 29 for context — understanding HADR helps you communicate with colleagues who work on LUW — but your focus should be on Chapter 28 and then Chapters 30 and 31.

If you are a LUW professional, Chapter 29 is your priority. HADR is the technology you will configure, monitor, and rely on. Read Chapter 28 for context — understanding data sharing gives you perspective on what the mainframe world takes for granted — but your focus should be on Chapter 29 and then Chapters 30 and 31.

If you work on both platforms, read everything. You are the person who bridges the two worlds, and that bridge role is increasingly valuable as organizations integrate mainframe and distributed DB2 environments.

Chapters 30 and 31 are essential reading regardless of platform. Partitioning and cloud deployment are universal concerns.

The Meridian Bank HA Architecture

Meridian National Bank runs DB2 on both platforms. The core banking system runs on a z/OS Parallel Sysplex with a three-member data sharing group. The digital banking platform runs on LUW with HADR configured between a primary site and a disaster recovery site 200 miles away. The analytics environment runs on Db2 Warehouse on Cloud.

Throughout Part VI, we design, configure, and test the high-availability architecture for all three environments. We build the data sharing configuration for the z/OS members, including group buffer pool sizing and workload balancing. We configure HADR for the LUW environment and test failover under realistic conditions. We set up the cloud analytics environment and establish the data synchronization pipeline from the on-premises systems.

This multi-platform scenario is not contrived. It is the reality at most large financial institutions, and working through it gives you exposure to integration challenges that single-platform exercises miss.

How to Approach This Part

Start with the chapter that matches your primary platform (Chapter 28 for z/OS, Chapter 29 for LUW). Then read Chapter 30 on partitioning, which provides scalability foundations for both platforms. Finish with Chapter 31 on cloud, which increasingly applies to everyone.

Test failover in your lab. I cannot stress this enough. Reading about HADR or data sharing is necessary but not sufficient. You need to simulate failures — kill a member, drop a network connection, corrupt a log file — and observe how the system responds. The exercises in this part are designed for exactly this purpose. Do them.
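For the HADR case, a failover drill of the kind described above can be driven with two commands. The database name is illustrative:

```shell
# Planned role switch: run on the standby; the roles swap gracefully
db2 "TAKEOVER HADR ON DB BANKDB"

# Unplanned-failure drill: with the primary down (kill the instance,
# pull the network), force the standby to take over
db2 "TAKEOVER HADR ON DB BANKDB BY FORCE"

# Confirm the new roles and that the old primary reintegrates as standby
db2pd -db BANKDB -hadr
```

Time each step, write down what you observe, and compare against your recovery time objective — that comparison is the real output of the exercise.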

Chapters in This Part