Case Study 37.2: Building a DB2 Center of Excellence
Background
Northeastern Federal Credit Union (NFCU) is a mid-sized credit union with $6.8 billion in assets, 420,000 members, and 2,400 employees. Like many financial institutions of its size, NFCU's database management had evolved organically over two decades. The result was a fragmented landscape:
- 8 DB2 databases spread across three servers with no consistent configuration standards.
- 4 DBAs who had each developed their own practices, scripts, and naming conventions.
- No formal performance methodology — performance problems were addressed reactively, often by adding hardware.
- No capacity planning process — storage was added when tablespaces filled up, not before.
- Inconsistent security practices — each DBA managed security differently, with no unified role-based access model.
- No community involvement — none of the four DBAs attended conferences, published articles, or participated in user groups.
The CTO had a vision: transform the database team from a reactive operations group into a proactive, standards-driven "DB2 Center of Excellence" (CoE) that would be recognized internally as a strategic partner and externally as a model for other credit unions.
Phase 1: Assessment and Vision (Months 1-2)
The Honest Assessment
The CTO hired a consultant (not unlike James from Case Study 37.1) to conduct an honest assessment. The findings were sobering but not unusual for an organization of NFCU's size:
| Area | Current State | Target State |
|---|---|---|
| Standardization | Each DBA has their own approach | Documented standards for all operations |
| Performance | Reactive (fix when users complain) | Proactive (baseline, monitor, tune continuously) |
| Security | Inconsistent, individual grants | Unified RBAC model with RCAC where needed |
| Availability | HADR on 2 of 8 databases | HADR on all production databases |
| Monitoring | Manual checks, no centralized dashboard | Automated monitoring with alerting |
| Documentation | Scattered, incomplete | Centralized knowledge base |
| Skills | Strong individual skills, no cross-training | Cross-trained team with shared expertise |
| Community | No external engagement | Active IDUG membership, conference attendance |
Defining the CoE
The consultant and the CTO defined the DB2 Center of Excellence as:
"A team of DB2 professionals who operate as a shared service, providing standardized, documented, measurable database services to the organization. The CoE establishes best practices, provides training to application teams, conducts proactive performance management, and represents the organization in the broader DB2 community."
The CoE would be measured on five metrics:
1. Availability: 99.95% uptime across all production databases.
2. Performance: 95% of production queries completing within SLA thresholds.
3. Security: Zero audit findings related to database access controls.
4. Efficiency: Reduction in reactive incidents by 50% year over year.
5. Team Development: Each team member earns at least one IBM certification within 18 months.
Phase 2: Standards and Consolidation (Months 3-8)
Database Configuration Standards
The team's first task was agreeing on standards. This was harder than any technical challenge. Four experienced DBAs, each with their own preferences, had to agree on a single way of doing things.
The team lead (the most senior DBA, who had been designated the CoE manager) facilitated the process with a simple rule: "We will base our standards on IBM's best practices. Where IBM does not prescribe a specific approach, we will test alternatives and choose based on data, not preference."
The resulting standards document covered:
- Naming conventions: Schema names, table names, index names, tablespace names, buffer pool names. Example: indexes are named IX_<table>_<purpose>_<sequence> (e.g., IX_MEMBER_ACCTLOOKUP_01).
- Buffer pool configuration: Separate buffer pools for data, index, temp, and LOB pages. STMM enabled for automatic sizing.
- Tablespace design: Automatic storage for all new tablespaces. DMS with auto-resize for legacy tablespaces.
- Logging: Standardized log file sizes, archive log retention (14 days online, 90 days archive), dual archive log destinations.
- HADR: All production databases configured with HADR NEARSYNC mode and Automatic Client Reroute.
- Security: Role-based access control for all application access. No individual user grants on application tables. SECADM separated from DBADM.
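Standards like these are only useful if they are enforced. A minimal sketch of how the naming convention could be checked automatically (the regex and function name are illustrative, not part of NFCU's actual standards document):

```python
import re

# Hypothetical validator for the CoE index-naming standard:
# IX_<table>_<purpose>_<sequence>, with a two-digit sequence number.
INDEX_NAME_RE = re.compile(r"^IX_([A-Z][A-Z0-9]*)_([A-Z][A-Z0-9]*)_(\d{2})$")

def check_index_name(name: str) -> bool:
    """Return True if the index name follows the documented convention."""
    return INDEX_NAME_RE.match(name) is not None

print(check_index_name("IX_MEMBER_ACCTLOOKUP_01"))  # True
print(check_index_name("MEMBER_IDX1"))              # False
```

A check like this can run against the system catalog on a schedule, so drift from the standard is flagged rather than discovered during an incident.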
Database Consolidation
The eight databases were reviewed for consolidation opportunities. Analysis revealed:
- Two databases could be merged (they served the same application and had been split historically due to storage constraints that no longer existed).
- One database was a legacy system with zero active users — it was archived and decommissioned.
- Five databases remained as independent instances with justified separation.
Net result: eight databases reduced to five, a 37.5% reduction in the number of managed databases and a corresponding drop in operational overhead.
Configuration Migration
Each of the five remaining databases was reconfigured to match the new standards. This was done in a phased approach: non-production first, then production during maintenance windows. Each migration was documented with before/after comparisons and validated with regression testing.
Phase 3: Monitoring and Automation (Months 6-12)
Centralized Monitoring
The team deployed a monitoring infrastructure based on:
- DB2 monitoring table functions (the MON_GET_* family) for data collection.
- Python scripts for data extraction and formatting (two of the four DBAs learned Python during this phase).
- Grafana dashboards for visualization.
- PagerDuty for alert routing.
The dashboards covered:
- Buffer pool hit ratios (per buffer pool, per database).
- Lock wait time and deadlocks.
- Tablespace utilization with trend lines.
- HADR state and log gap.
- Top SQL by CPU consumption (refreshed every 5 minutes).
- Backup status (last successful backup, next scheduled backup).
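The hit-ratio panels boil down to one simple calculation over the read counters that MON_GET_BUFFERPOOL exposes. A sketch, using made-up counter values shaped like that function's output:

```python
# Buffer pool hit ratio from logical vs. physical read counters.
# Sample values are illustrative; real rows would come from
# MON_GET_BUFFERPOOL (e.g., POOL_DATA_L_READS, POOL_DATA_P_READS).
def hit_ratio(logical_reads: int, physical_reads: int) -> float:
    """Hit ratio as a percentage; treat an idle pool as 100%."""
    if logical_reads == 0:
        return 100.0
    return 100.0 * (1 - physical_reads / logical_reads)

rows = [("BP_DATA", 1_000_000, 42_000), ("BP_INDEX", 500_000, 1_500)]
for bp_name, l_reads, p_reads in rows:
    print(f"{bp_name}: {hit_ratio(l_reads, p_reads):.1f}%")
```

The Python extraction scripts compute metrics like this per buffer pool and per database, then push the results to Grafana.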
For the first time, NFCU's leadership could see the health of all databases at a glance.
Automated Maintenance
The team automated routine maintenance tasks that had previously been manual:
- RUNSTATS: Scheduled via cron for all tables with significant DML activity. Statistics currency was monitored on the Grafana dashboard.
- REORG: Automated detection of tables needing REORG (based on overflow rows and clustering ratio). REORG jobs were automatically queued for the next maintenance window.
- Backup verification: A Python script ran daily, checking backup history against the expected schedule. Any missed or failed backup triggered an alert.
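The core logic of the backup-verification script can be sketched in a few lines. The 24-hour expectation and the database names are illustrative; the real script would read the backup history recorded by Db2 rather than a hard-coded dictionary:

```python
from datetime import datetime, timedelta

# Hypothetical daily backup check: flag databases whose last
# successful backup is older than the expected interval.
def missed_backups(last_backup: dict, now: datetime,
                   max_age: timedelta = timedelta(hours=24)) -> list:
    """Return database names whose last successful backup is too old."""
    return sorted(db for db, ts in last_backup.items() if now - ts > max_age)

now = datetime(2024, 6, 1, 8, 0)
history = {
    "MEMBERS": datetime(2024, 6, 1, 2, 0),   # fresh
    "LOANS":   datetime(2024, 5, 30, 2, 0),  # stale -> alert
}
print(missed_backups(history, now))  # ['LOANS']
```

Any name returned would be routed to PagerDuty as an alert.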
The automation reduced the time each DBA spent on routine maintenance from approximately 8 hours per week to 2 hours per week — freeing 24 DBA-hours per week for proactive work.
Phase 4: Performance Methodology (Months 9-14)
From Reactive to Proactive
The most significant cultural change was shifting from reactive to proactive performance management. The old model:
1. A user calls the helpdesk: "The system is slow."
2. The helpdesk escalates to a DBA.
3. The DBA investigates.
4. The DBA finds and fixes the issue.
5. The user is satisfied (hours later).
The new model:
1. Monitoring detects a query exceeding its SLA threshold.
2. An alert triggers.
3. A DBA investigates before users notice.
4. The DBA resolves the issue.
5. Users experience no degradation.
Performance Baselines
The team captured performance baselines for each database using the methodology from Chapter 36. Baselines were refreshed quarterly. Any metric that deviated more than 20% from baseline triggered an investigation — even if no user complained.
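The 20% rule above is a simple relative-deviation test. A minimal sketch, with made-up metric names and values:

```python
# Illustrative check for the quarterly-baseline rule: any metric
# more than 20% away from its baseline triggers an investigation.
def deviates(current: float, baseline: float, threshold: float = 0.20) -> bool:
    """True if current deviates from baseline by more than the threshold."""
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / baseline > threshold

baseline = {"avg_lock_wait_ms": 12.0, "bp_hit_ratio": 97.0}
current  = {"avg_lock_wait_ms": 19.0, "bp_hit_ratio": 96.2}
flagged = [m for m in baseline if deviates(current[m], baseline[m])]
print(flagged)  # ['avg_lock_wait_ms'] -- 19 vs 12 is a ~58% increase
```

The point is that the threshold, not a user complaint, decides when a DBA looks at a metric.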
SQL Review Process
The CoE established a SQL review process for new application features. Developers submitted their SQL to the CoE for review before deployment. The CoE analyzed EXPLAIN output, checked for missing indexes, and provided optimization recommendations.
Initially, developers resisted this process as a bottleneck. The CoE addressed this by:
1. Committing to a 2-business-day turnaround for SQL reviews.
2. Providing a self-service "SQL pre-check" script that developers could run against their own queries.
3. Conducting monthly "SQL Clinic" sessions where developers could bring queries for real-time analysis and learning.
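A toy version of such a pre-check script might flag a few well-known anti-patterns before a statement ever reaches the CoE. The rules below are illustrative, not NFCU's actual checklist, and a real pre-check would also run EXPLAIN:

```python
import re

# Hypothetical self-service pre-check: pattern-match a statement
# against a short list of common SQL anti-patterns.
RULES = [
    (re.compile(r"select\s+\*", re.I), "SELECT * -- list only needed columns"),
    (re.compile(r"like\s+'%", re.I), "leading-wildcard LIKE defeats index use"),
    (re.compile(r"where\s+\w+\s*\(", re.I), "function on a WHERE column may block index use"),
]

def precheck(sql: str) -> list:
    """Return warnings for anti-patterns found in the statement."""
    return [msg for pattern, msg in RULES if pattern.search(sql)]

for w in precheck("SELECT * FROM member WHERE UPPER(last_name) LIKE '%SON'"):
    print("WARN:", w)
```

Developers who run a check like this themselves arrive at the formal review with the obvious problems already fixed, which keeps the 2-business-day turnaround realistic.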
Within six months, the SQL review process was accepted and even valued by the development teams. Incidents caused by poorly performing new SQL dropped by 73%.
Phase 5: Team Development (Months 1-18, Ongoing)
Certification Campaign
The CoE manager set a team goal: all four DBAs would earn at least one IBM certification within 18 months. The credit union funded exam fees and allocated 4 hours per week for study time.
Results:
- DBA 1 (most senior): Earned the Advanced DBA — Db2 for LUW certification.
- DBA 2: Earned the Professional DBA — Db2 for LUW certification.
- DBA 3: Earned the Professional DBA — Db2 for LUW certification and began studying for the Advanced exam.
- DBA 4 (most junior): Earned the Foundation (Database Associate) certification and began studying for the Professional exam.
Cross-Training
The team implemented a cross-training program. Each DBA had been the sole expert on "their" databases. If that person was on vacation, the database had no expert coverage. The cross-training program paired DBAs as buddies, with each pair responsible for documenting and teaching their databases to their buddy.
After 12 months, any DBA could manage any database in the environment — a critical improvement in operational resilience.
Conference Attendance
The credit union funded IDUG conference attendance for one team member per year (rotating). The first attendee returned with:
- Three specific technical improvements to implement (configuration changes she learned from an IBM engineer's presentation).
- A connection to a DBA at another credit union who shared their monitoring scripts.
- A presentation proposal for the following year's conference (which was accepted — NFCU's first-ever conference presentation).
Knowledge Sharing
The team established a weekly "DB2 Brown Bag" lunch meeting. Each week, one DBA presented a 20-minute topic — a new feature they learned, a problem they solved, or a concept they wanted to teach. These sessions built the team's collective knowledge and presentation skills.
Results After 18 Months
| Metric | Before CoE | After CoE (18 Months) | Target |
|---|---|---|---|
| Availability | 99.82% | 99.97% | 99.95% |
| Queries within SLA | ~85% (estimated) | 96.4% | 95% |
| Audit findings (annual) | 7 | 0 | 0 |
| Reactive incidents / month | 34 | 11 | <17 (50% reduction) |
| Team certifications | 0 | 5 (across 4 DBAs) | 4 (one per DBA) |
| Conference presentations | 0 | 1 | 1 |
| Routine maintenance hours / week | 32 | 8 | <12 |
| Databases managed | 8 | 5 | — |
| Documented standards | 0 pages | 84 pages | — |
Every target was met or exceeded.
The Intangible Benefits
Beyond the metrics, the CoE transformation produced benefits that are harder to quantify:
- Developer trust: Application teams now proactively consult the CoE during design, not just when things break. The DBA team went from being seen as gatekeepers to being seen as partners.
- Career satisfaction: All four DBAs reported higher job satisfaction. The combination of professional development (certifications, conferences), reduced firefighting, and increased proactive work made the role more engaging.
- Organizational recognition: The CTO presented the CoE model at a credit union industry conference as a case study in IT operational excellence. The DBA team went from being an invisible infrastructure function to being a recognized center of excellence.
- Recruitment: When the team added a fifth DBA position (due to growth), the posting attracted strong candidates — partly because the CoE's conference presentation and community involvement had made the team visible.
Lessons for Other Organizations
- Standards first, tools second. The monitoring dashboard is impressive, but the standards document is what made the team effective. Without agreed-upon standards, you are just monitoring chaos.
- Automation frees humans for human work. Automating RUNSTATS and backup verification did not reduce the team size. It freed the team to do work that machines cannot: performance analysis, architecture design, developer mentoring, and strategic planning.
- Certification is a team sport. Studying alone is hard. Studying as a team — with shared goals, dedicated time, and mutual support — is sustainable and builds team cohesion.
- Community involvement pays dividends. The connections made at IDUG conferences, the scripts shared by peers, and the reputation built through presentations all had tangible operational benefits.
- The CoE model works at any scale. NFCU has four DBAs. The CoE model works just as well with two DBAs or twenty. The principles — standardization, automation, proactive management, team development, community engagement — scale to any team size.
Discussion Questions
1. The CoE establishment process took 18 months. If you were asked to prioritize phases for a team that could only commit to 6 months of focused effort, which phases would you prioritize and why?
2. The SQL review process initially faced developer resistance. How would you handle a situation where a senior developer refuses to submit SQL for review, claiming it is unnecessary overhead?
3. The cross-training program paired DBAs as buddies. What are the risks of cross-training (e.g., knowledge breadth vs. depth), and how would you mitigate them?
4. The CTO funded conference attendance for one team member per year. How would you justify the business case for increasing this to two or three attendees per year?