30 min read

> Where you are: Part VII, Chapter 38 of 40 — the start of Administration and Career. Designing and querying a database is the beginning; operating one in production — backed up, monitored, tuned, recoverable — is the job that never ends. This...

Chapter 38: Database Administration — Backup, Monitoring, Tuning, and Keeping the Lights On

Where you are: Part VII, Chapter 38 of 40 — the start of Administration and Career. Designing and querying a database is the beginning; operating one in production — backed up, monitored, tuned, recoverable — is the job that never ends. This chapter is the operational craft.

Learning paths: 🏗️ DBA (core — this is your discipline); 💻 Developer (the operational reality behind your app); 📊 Analyst can skim. Many concepts build on Part IV's internals.


The job that never ends

A database isn't "done" when it's designed — it runs, indefinitely, serving an application that can't go down. Database administration is keeping it healthy: configured well, monitored for trouble, backed up so loss is recoverable, tuned for performance, and upgraded safely. Even if you're a developer (not a dedicated DBA), understanding these tasks makes you a better engineer — and on small teams, you are the DBA. This chapter is grounded in the internals of Part IV (the WAL, MVCC/VACUUM, the buffer pool, indexes) — administration is largely applying that understanding operationally.


Configuration

PostgreSQL's behavior is governed by postgresql.conf (and runtime ALTER SYSTEM/SET). The defaults are conservative (tuned to start anywhere); production needs deliberate settings. The most impactful are memory-related (rooted in the buffer pool and query execution, Ch. 28, 24):

Setting What it controls Rough guidance
shared_buffers the buffer pool (page cache in RAM) ~25% of system RAM
effective_cache_size the planner's estimate of total cache (OS + PG) ~50–75% of RAM (a hint, not an allocation)
work_mem memory per sort/hash operation modest (e.g., tens of MB); ×concurrency, so be careful
maintenance_work_mem memory for VACUUM, index builds larger (e.g., hundreds of MB)
max_connections concurrent connections allowed bounded (e.g., 100–200) — pool beyond it (Ch. 27)

shared_buffers (the buffer pool) and work_mem (sorts/hashes) directly affect the cache hits and disk-spilling sorts you saw in EXPLAIN (Ch. 24, 28). work_mem is per operation — set too high with many concurrent queries and you can exhaust RAM, so tune it against your concurrency. Tuning is iterative: change, measure, repeat (don't blindly copy settings from the internet).


Monitoring

You can't operate what you can't see. PostgreSQL exposes its state through statistics views and extensions:

  • pg_stat_activity — what's running right now: active queries, their state, how long they've run, what they're waiting on. Your first stop for "the database is slow/stuck." (Find that 9-day-old transaction from Chapter 28's bloat case here.)
  • pg_stat_statements (extension) — aggregated query statistics: which queries consume the most total time, run most often, are slowest. The tool for finding what to optimize (feeds Chapter 24's diagnosis).
  • pg_stat_user_tables — per-table activity, including n_dead_tup (dead tuples — Chapter 28's bloat signal) and last (auto)vacuum/analyze times.
  • pg_stat_user_indexes — index usage (idx_scan = 0 → unused index, Chapter 23's pruning case).
  • OS/infra metrics — CPU, memory, disk space and I/O, connection counts — via your monitoring stack (Prometheus + Grafana, cloud dashboards, pganalyze, etc.).

Monitor proactively: track slow queries, dead-tuple growth, connection counts, replication lag, and disk space before they become incidents. Set alerts (disk filling, replication lagging, connections near max, oldest-transaction age).


Backups and recovery (the most important responsibility)

If you do one thing as a DBA, it's ensuring recoverable backups. Loss happens — hardware failure, ransomware, an accidental DROP or missing WHERE (Chapter 13). Backups are the difference between an incident and a catastrophe (and a security control, Chapter 32). PostgreSQL offers:

  • Logical backups — pg_dump / pg_dumpall: export a database (or all) as SQL or an archive; restore with pg_restore. Flexible (restore a single table, move across versions), but slower for huge databases and a point-in-time snapshot only.
  • Physical backups — pg_basebackup: a binary copy of the data directory. Faster for large databases; the basis of replication and PITR.
  • WAL archiving + Point-In-Time Recovery (PITR): archive the write-ahead log (Chapter 28) continuously; combined with a base backup, you can restore to any moment — e.g., "to 14:59, one minute before the bad DELETE at 15:00." The gold standard for production.

Three rules that turn "we have backups" into "we're actually safe":

  1. 3-2-1: at least 3 copies, on 2 different media, with 1 off-site (and ideally immutable/offline against ransomware).
  2. Automate them — a backup that depends on someone remembering doesn't happen.
  3. Test your restores. A backup you have never restored is a hope, not a backup. Regularly restore to a scratch environment and verify. Untested backups failing at the worst moment is a classic disaster.

Backups and recovery, in depth

If database administration has one non-negotiable responsibility, it's recoverable backups — because everything else can be fixed, but lost data is gone, so understanding backups deeply is the most important operational knowledge. Loss happens (hardware failure, ransomware, an accidental DROP or missing WHERE), and backups are the difference between an incident and a catastrophe.

PostgreSQL offers a spectrum of backup approaches suited to different needs. Logical backups (pg_dump for one database, pg_dumpall for the whole cluster) export the data as SQL statements or a compressed archive that pg_restore reads back. They're flexible — you can restore a single table, move data across PostgreSQL versions, or inspect the human-readable SQL — but they're a point-in-time snapshot (the data as of when the dump ran) and slower for very large databases. Physical backups (pg_basebackup) make a binary copy of the entire data directory — faster for large databases, and the foundation for replication and the gold-standard recovery method. WAL archiving plus Point-In-Time Recovery (PITR) is that gold standard: by continuously archiving the write-ahead log (Chapter 28) alongside a base backup, you can restore to any moment in time — crucially, to just before a disaster. If a bad DELETE ran at 15:00, PITR lets you restore to 14:59, recovering everything up to the instant before the mistake. This ability to recover to an arbitrary point, not just the last nightly backup, is what makes PITR the production gold standard — it minimizes data loss to near-zero and lets you "rewind" past a disaster.

But having a backup mechanism is not the same as being safe, and three rules turn "we have backups" into "we're genuinely protected." First, the 3-2-1 rule: keep at least 3 copies of your data, on 2 different types of media, with 1 copy off-site (and ideally immutable or offline, so ransomware that encrypts your systems can't also encrypt your backups). A single backup on the same server as the data protects against almost nothing. Second, automate backups — a backup process that depends on someone remembering to run it will eventually be forgotten, precisely when it matters; scheduled, automated backups happen reliably. Third, and most violated, test your restoresa backup you have never successfully restored is not a backup, it's a hope. Regularly restore backups to a scratch environment and verify the data is complete and correct. The classic, devastating failure is discovering during a disaster that the backups are corrupt, incomplete, or unrestorable — the worst possible moment to learn your safety net has a hole. Test-restoring on a schedule turns the backup from a theoretical safety net into a proven one. These three rules — 3-2-1, automate, test-restore — are what separate organizations that survive data-loss events from those that don't, and they're the most important operational discipline a DBA maintains. Everything else in administration keeps the database running well; backups ensure that even when something goes catastrophically wrong, the data survives.


Configuration and monitoring, in depth

Two ongoing administration tasks — configuring the database for its workload and monitoring its health — deserve deeper treatment, because they're how you keep a database performing well and catch trouble before it becomes an outage. Both apply the internals understanding of Part IV operationally.

Configuration tunes PostgreSQL's behavior to your hardware and workload, and the most impactful settings are memory-related, rooted in the buffer pool and query execution (Chapters 28, 24). shared_buffers sizes the buffer pool (the page cache in RAM) — typically around 25% of system RAM, large enough to hold the hot working set so most accesses are fast cache hits. effective_cache_size tells the planner how much cache it can assume exists (OS plus PostgreSQL — a hint for planning, not an allocation), guiding it toward index scans when data is likely cached. work_mem sizes the memory each sort or hash operation can use before spilling to disk (Chapter 24's disk-spill problem) — set modestly because it's per operation and multiplies with concurrency (set it too high with many concurrent queries and you exhaust RAM). maintenance_work_mem (for VACUUM and index builds) can be larger since fewer of those run concurrently. max_connections bounds concurrent connections — kept sensible and supplemented by a pooler (Chapter 27) rather than raised to thousands. The crucial discipline is that tuning is iterative and workload-specific: change a setting, measure the effect, repeat — never blindly copy settings from the internet, because the right values depend on your hardware and workload. The defaults are conservative (tuned to start anywhere); production performance requires deliberate, measured tuning toward your actual situation.

Monitoring is how you see the database's state and catch trouble early, and PostgreSQL is richly observable (the self-describing principle of Chapter 2, applied to runtime). The key views, each tied to internals you understand: pg_stat_activity shows what's running right now (active queries, their state and duration, what they're waiting on) — your first stop for "the database is slow or stuck," and where you'd find the long-running transaction causing bloat (Chapter 28). pg_stat_statements aggregates query statistics (which queries consume the most total time, run most often, are slowest) — the tool for finding what to optimize, feeding Chapter 24's diagnosis. pg_stat_user_tables shows per-table activity including dead-tuple counts (n_dead_tup — Chapter 28's bloat signal) and last-vacuum times. pg_stat_user_indexes shows index usage (an index with zero scans is unused — Chapter 23's pruning). Plus OS-level metrics (CPU, memory, disk space, I/O, connection counts) via a monitoring stack. The discipline is proactive monitoring: track slow queries, dead-tuple growth, connection counts, replication lag, and disk space continuously, with alerts (disk filling, replication lagging, connections near max, oldest-transaction age) that warn you before trouble becomes an incident. Monitoring is internals knowledge applied to observation — the metrics are meaningful because you understand what they measure (dead tuples, cache hits, lag), and watching them lets you act (vacuum, tune, kill a transaction, add capacity) before users are affected. Configuration and monitoring together are the ongoing craft of keeping a database performing: configure it well for its workload, watch it continuously, and adjust as the workload evolves.


High availability and a 3 a.m. incident

Production databases must often stay available despite failures, and high availability (HA) — plus the reality of responding to incidents — is a core part of operations worth understanding, because "the database is down" is the incident a DBA most dreads and most prepares for. HA and incident response are where administration meets the pressure of keeping a critical system running.

High availability means the database survives the failure of a single server, achieved through replication and failover (Chapter 35). A standby replica continuously receives the primary's changes (via WAL streaming), and if the primary fails, the standby is promoted to become the new primary — failover — so the database comes back quickly on the standby rather than staying down until the primary is repaired. The key HA metrics are RPO (Recovery Point Objective — how much data you can afford to lose, determined by replication: synchronous = zero loss, asynchronous = up to the replication lag) and RTO (Recovery Time Objective — how quickly you must recover, determined by failover speed). HA setups range from manual failover (someone promotes the standby) to automatic failover (tools detect the primary's failure and promote a standby without human intervention, minimizing RTO). Managed cloud databases (RDS, Cloud SQL) typically provide automatic failover as a built-in feature — a major reason most teams use them, getting HA without operating it themselves. The HA design — replicas, sync or async, manual or automatic failover — is chosen by how much downtime and data loss the application can tolerate, balanced against cost and complexity.

But even with HA, incidents happen, and the 3 a.m. incident — the database alerting that something is wrong, at the worst hour — is the test of operational readiness. Picture it: an alert fires, the application is slow or down, and you must diagnose and resolve under pressure. This is where everything comes together. You check pg_stat_activity — is there a long-running query blocking others, or a transaction stuck "idle in transaction" causing bloat and locks? You check disk space — is the disk full (a common, preventable outage — a full disk halts the database because it can't write WAL)? You check pg_stat_statements and recent changes — did a new query or a data change suddenly degrade performance (stale statistics? a missing index after a big load)? You check replication — is a replica lagging or down, affecting failover readiness? You check the logs for errors. The internals knowledge of Part IV is what makes this diagnosis possible: you understand what pg_stat_activity shows, why a long transaction matters, why a full disk halts writes, why stale statistics cause bad plans. And the preparation — monitoring with alerts (so you're warned early), tested backups (so worst-case you can restore), a runbook (so you know the steps), HA (so a server failure is survivable) — is what turns a potential catastrophe into a managed incident. The 3 a.m. incident is why administration matters: a well-administered database (monitored, backed up, HA, with a runbook) makes the incident a brief, resolved event; a poorly-administered one makes it a prolonged outage with possible data loss. Preparing for the incident you hope never comes — through monitoring, backups, HA, and runbooks — is the essence of operational responsibility, and it's what lets a DBA sleep at night knowing that even at 3 a.m., the database (and the data) will be okay.


Routine maintenance: VACUUM and autovacuum

From Chapter 28: MVCC leaves dead tuples that VACUUM reclaims, and autovacuum does it automatically. The DBA's job is to ensure autovacuum keeps up:

  • Monitor bloat (n_dead_tup, table/index size vs. live rows). High-churn tables may need more aggressive autovacuum settings (per-table autovacuum_* parameters).
  • Watch for blockers — long transactions prevent cleanup (Chapter 28's bloat case); idle_in_transaction_session_timeout and monitoring the oldest transaction protect you.
  • Avoid transaction-ID wraparound — autovacuum handles "freezing," but a broken autovacuum can lead to an emergency; monitor age(datfrozenxid).
  • ANALYZE keeps planner statistics fresh (Chapter 24) — autovacuum does it, but run it manually after big bulk loads (Chapter 31).

Connection and disk management

  • Connections are bounded and expensive (Chapter 27). Set max_connections sensibly and use a pooler (PgBouncer) for high client counts — don't just raise max_connections to thousands (each is a process). Monitor active vs. idle connections.
  • Disk space — databases grow; WAL accumulates if archiving stalls; bloat consumes space. Monitor free disk and alert early — a full disk can halt the database (it can't write WAL). This is a common, avoidable outage.

Upgrades

  • Minor versions (e.g., 16.2 → 16.3) — bug/security fixes, same data format; a quick restart. Apply them promptly (security!).
  • Major versions (e.g., 16 → 17) — new features, different internal format; require pg_upgrade (fast, in-place) or dump/restore or logical replication (for near-zero downtime). Test on a copy first; major upgrades are a planned operation.
  • Stay reasonably current — old versions lose support and security patches; PostgreSQL releases yearly and supports ~5 years.

Managed vs. self-hosted

You can run PostgreSQL yourself or use a managed service (RDS, Cloud SQL, Aurora, AlloyDB):

  • Managed: the provider handles backups, replication, failover, patching, and much monitoring. You configure and tune; they operate the machinery. The right choice for most teams — less operational burden, built-in HA.
  • Self-hosted: full control and lower direct cost, but you do everything (backups, HA, upgrades, 24/7 ops). Justified when you need control the managed service doesn't offer, or at a scale where it's economical.

Either way, the concepts here apply — managed services don't remove the need to understand configuration, monitoring, backups, and tuning; they just operate some of it for you.


Security operations and upgrades

Two more operational responsibilities — keeping the database secure in operation and current through upgrades — round out the administration craft, and both connect to earlier chapters while adding the operational dimension. Security and upgrades are ongoing operational duties, not one-time setups.

Security operations is the operational side of the security principles from Chapter 32 — not just designing security (parameterized queries, least-privilege roles, encryption) but maintaining it in production. This includes managing roles and privileges as the system evolves (granting and revoking access as people and applications change, keeping least privilege as a living practice rather than a one-time grant), keeping the database patched against security vulnerabilities (applying minor versions promptly, since they include security fixes), managing the network exposure (the database behind a firewall, accessible only from authorized hosts via a restrictive pg_hba.conf), handling secrets (rotating credentials, keeping them out of code and in secret managers), and monitoring for security (audit logging via pgAudit, watching for unusual access patterns or failed-login spikes that might signal an attack). Security is not "set up once and done" but an ongoing operational concern: access needs change, vulnerabilities are discovered, threats evolve, and the DBA (or whoever operates the database) must continuously maintain the security posture. This is where Chapter 32's security design meets ongoing operation — the controls must be not just established but maintained, monitored, and updated. A database secured at launch but not maintained (unpatched, with stale over-broad grants, unmonitored) drifts toward insecurity; ongoing security operations keeps it protected.

Upgrades keep the database current — important for both features and, critically, security. PostgreSQL releases new major versions yearly (16, 17, …) and minor versions as needed (16.2, 16.3, …), and the upgrade discipline differs by type. Minor versions are bug and security fixes with the same data format — a quick restart, applied promptly because they include security patches you don't want to skip; falling behind on minor versions means running with known, unpatched vulnerabilities. Major versions add features but use a different internal format, requiring a real migration: pg_upgrade (fast, in-place), dump-and-restore, or logical replication (for near-zero-downtime upgrades of critical systems). Major upgrades are planned operations — tested on a copy first (to catch any behavior changes or compatibility issues), scheduled with appropriate care, with a rollback plan. The discipline is to stay reasonably current: PostgreSQL supports each major version for about five years, after which it stops receiving fixes (including security fixes), so a database on an out-of-support version is a growing risk. Staying within the supported window — applying minor versions promptly and major versions deliberately every few years — keeps the database secure, supported, and benefiting from PostgreSQL's continuous improvement (the features that, per theme #4, keep extending what PostgreSQL can do). Upgrades, like backups and security, are an ongoing operational responsibility: the database isn't a static artifact but a living system that must be kept current, secure, and supported throughout its life.


Managed services and operations as applied internals

Two final framings clarify the practice of administration: the role of managed cloud databases, and the recognition that administration is fundamentally the application of the internals understanding from Part IV. Both help you see operations clearly.

Managed cloud databases (Amazon RDS, Google Cloud SQL, Amazon Aurora, Google AlloyDB, Azure Database) have transformed how most teams operate PostgreSQL by automating much of the operational machinery — backups, replication, failover, patching, and substantial monitoring are handled by the provider. For most teams, managed PostgreSQL is the right default: it dramatically reduces operational burden (no manual backup setup, no manual failover configuration, automatic patching) and provides built-in high availability, letting a small team run a production database reliably without deep operational expertise or 24/7 staffing. Self-hosting (running PostgreSQL yourself) offers full control and lower direct cost but requires you to do everything — backups, HA, upgrades, monitoring, around-the-clock operations — justified only when you need control the managed service doesn't offer or operate at a scale where self-hosting is economical. The crucial point, though, is that managed services don't eliminate the need to understand administration — they operate some of the machinery, but you still configure and tune for your workload, monitor and respond to issues, design your backup and HA strategy, and diagnose incidents the automation doesn't catch. The concepts of this chapter — configuration, monitoring, backups, recovery, tuning — apply whether you run PostgreSQL yourself or use a managed service; the managed service just operates some of it for you. So understanding administration matters even (especially) when using managed databases: it's what lets you use them well, make sound operational decisions, and handle the problems that automation doesn't.

The deepest framing of administration is that it's applied internals — the operational application of the Part IV understanding (the WAL, MVCC and VACUUM, the buffer pool, indexes, transactions). Nearly every administration task is internals knowledge put to operational use. Backups and PITR are the WAL (Chapter 28) used for recovery. Tuning shared_buffers is sizing the buffer pool (Chapter 28) for the working set. Managing autovacuum is keeping the MVCC dead-tuple cleanup (Chapter 28) healthy. Monitoring dead tuples, cache hits, and long transactions is watching the internals (Chapter 28) in production. Diagnosing a slow query incident is reading plans and understanding scans/joins/statistics (Chapter 24). Choosing an HA strategy is applying replication (Chapter 35). This is why the chapter opened by grounding administration in Part IV: operations isn't a separate body of knowledge but the application of the internals understanding to keeping a database healthy in production. A practitioner who understands the internals deeply (Part IV) can operate a database competently (this chapter), because operations is those internals applied — they understand why a backup via WAL recovers correctly, why the buffer pool size matters, why long transactions cause bloat, why stale statistics slow queries, and so they can configure, monitor, tune, and troubleshoot with genuine understanding rather than cargo-culted recipes. This connection — administration as applied internals — is what makes the operational craft learnable and reliable: master the internals, and operations becomes their sensible application, not a separate mystery. It's theme #3 (understand the why) reaching all the way to production operations: the why you learned in Part IV is the foundation of operating databases well, which is the final technical skill of the database practitioner before the book turns, in its last chapter, to the career built on all of it.


Common mistakes

  • No tested backups — the cardinal sin; discovering at restore time that backups don't work.
  • Ignoring monitoring until an incident — flying blind; no early warning.
  • Letting the disk fill — a preventable outage (monitor free space!).
  • Blindly copying config from the internet — tune to your hardware/workload, iteratively.
  • Raising max_connections instead of pooling — exhausts memory (Chapter 27).
  • Neglecting autovacuum / long transactions — bloat (Chapter 28).
  • Falling far behind on versions — losing security patches and support.

Capacity planning and the maintenance mindset

Two ongoing aspects of administration — planning for growth and the routine-maintenance mindset — keep a database healthy over time rather than just at a moment, and understanding them completes the picture of operations as a continuous responsibility, not a one-time setup.

Capacity planning is anticipating growth and ensuring the database has the resources (storage, memory, CPU, connections) to handle it before it becomes a problem. Databases grow — more data accumulates, more users connect, more queries run — and a database that's fine today can hit a wall tomorrow if growth isn't planned for. The capacity-planning discipline is to monitor trends (disk usage growing how fast? connection counts trending where? query load increasing?) and project forward (at this rate, when do we run out of disk, hit the connection limit, exceed the cache?), so you can act ahead of the curve — add storage before the disk fills, add a read replica before the primary is overwhelmed, increase memory before the working set outgrows the cache. The alternative — reactive scaling, adding capacity only after hitting a limit — means outages and emergencies (the full disk that halts the database, the connection exhaustion that takes down the app). Proactive capacity planning, watching the trends and provisioning ahead of need, is what keeps growth from becoming crisis. This connects to the scaling ladder of Chapter 35: capacity planning tells you when you're approaching the need to climb a rung (vertical scaling, replicas, eventually sharding), so you climb deliberately ahead of the limit rather than scrambling after hitting it.

The maintenance mindset is recognizing that a database requires ongoing care — it's not "set up and forget," but a living system that needs continuous attention. Autovacuum must keep up with dead tuples (Chapter 28), or bloat accumulates; statistics must stay fresh (especially after bulk loads), or plans degrade; indexes may need occasional rebuilding on high-churn tables; long transactions must be watched (they block cleanup and cause bloat); minor versions must be applied (security patches); disk space must be monitored. None of this is dramatic, and most is automated (autovacuum, scheduled backups, monitoring alerts), but it requires ongoing attention — knowing it's happening, watching that it's working, and intervening when it isn't (autovacuum falling behind, a long transaction stuck, the disk filling). The maintenance mindset treats the database as something you tend continuously, like a garden, not something you build once and abandon. The DBA's quiet, ongoing work — checking that backups ran and restore, that autovacuum is keeping up, that disk space is healthy, that no long transactions are stuck, that replication is current — is what keeps the database reliably healthy, invisibly, so that the dramatic incidents are rare. This continuous care is the unglamorous heart of administration: not heroic firefighting (though that happens), but the steady, attentive maintenance that prevents most fires. A well-maintained database is a boring database — and in production, boring is exactly what you want.


The DBA role, DBRE, and operations as a discipline

It's worth situating database administration as a role and a discipline, because how database operations are done in organizations has evolved, and understanding the landscape helps you see where these skills fit in a career (the subject of Chapter 40). Operations is a substantial professional domain, and its modern forms are worth knowing.

The traditional DBA (Database Administrator) role is the specialist responsible for the databases — configuration, monitoring, backups, tuning, upgrades, the operational craft this chapter covers. In larger organizations, dedicated DBAs own the databases as their full-time discipline. But the role has evolved rather than disappeared, despite occasional "the DBA is dead" claims (which Chapter 40 addresses). The modern evolution is DBRE (Database Reliability Engineering) — applying the principles of site reliability engineering (SRE) to databases: treating database operations as a software-engineering problem, automating operational tasks (backups, failover, scaling, monitoring as code), defining reliability objectives (SLOs), and building self-healing, observable systems rather than manually administering databases. DBRE reflects that operations has become more automated and more engineering-driven — infrastructure as code, automated failover, comprehensive monitoring — but the underlying concerns (the database must be configured, monitored, backed up, tuned, recovered) are unchanged; they're just addressed through automation and engineering rather than manual administration. The cloud has accelerated this: managed databases (RDS, Cloud SQL, Aurora) automate much of the operational machinery (backups, failover, patching), shifting the human role from operating the machinery to configuring, tuning, and overseeing it. The DBA didn't die; the role moved up the stack, from manual operations toward engineering and oversight of automated operations.

For the practitioner, the implication is that operational skills remain valuable — even essential — whether you're a dedicated DBA/DBRE or a developer who must operate databases on a small team (where you are the DBA). Understanding configuration, monitoring, backups, recovery, tuning, and HA — even when much is automated or managed — is what lets you operate databases reliably, diagnose the incidents automation doesn't catch, and make sound operational decisions. The managed service handles the routine, but when something goes wrong (the 3 a.m. incident), or when you must tune for your workload, or when you must choose an HA strategy, the operational understanding this chapter provides is what you draw on. Operations is a genuine discipline — the craft of keeping databases healthy, available, and recoverable in production — and it's distinct from the design and querying skills of earlier parts (those create the database; operations runs it). Both matter: a beautifully-designed, well-queried database that isn't backed up, monitored, and maintained will eventually fail in production; one that is operated well serves reliably for years. This chapter is the operations discipline, and whether you become a DBRE, a developer-who-operates, or someone who works with a DBA, understanding it makes you more effective and more complete as a database professional — which is exactly the well-rounded competence the final chapters of the book aim to leave you with.


The operational runbook

A practice that ties administration together and makes it reliable — especially under pressure — is the runbook: documented procedures for the operational tasks and the incidents you might face, so that when something happens, you (or a teammate) follow a known, tested procedure rather than improvising under stress. The runbook is how operational knowledge becomes reliable, repeatable practice rather than tribal memory.

A good database runbook documents the routine procedures: how to take a backup and, crucially, how to restore one (with the restore tested, so the procedure is known to work); how to check the database's health (which views to query, what's normal versus alarming); how to apply a minor version upgrade; how to add a read replica. And it documents the incident procedures — what to do when "the database is slow" (check pg_stat_activity for blocking queries and long transactions, check pg_stat_statements for heavy queries, check disk space, check replication lag), what to do when "the database is down" (check the server, check for a full disk, fail over to a replica if the primary is dead and recovery will be slow), what to do when "a bad change was made" (assess the damage, restore via PITR to just before it if needed). The runbook turns the panicked, improvised response of an unprepared operator into the calm, procedural response of a prepared one — at 3 a.m., following a tested runbook beats trying to remember and reason under pressure. It also makes operations transferable: a documented runbook means the knowledge isn't locked in one person's head, so any team member (or a new hire) can operate the database, and the bus-factor risk of "only Sarah knows how to restore the database" is eliminated.

Building the runbook is itself a forcing function for good operations. To write "how to restore a backup," you must actually restore one (testing it — the test-restore discipline made concrete). To write "what to do when the database is slow," you must understand the diagnostic tools and what they show. To write "how to fail over," you must have a failover capability and know how to use it. So the act of creating the runbook ensures the procedures work and the capabilities exist — it surfaces the gaps ("we don't actually have a tested restore procedure," "we have no failover plan") while they're cheap to fix, rather than discovering them during an incident. A mature operations practice maintains its runbook as a living document, updated as the system evolves and as incidents reveal new scenarios to document (every incident should improve the runbook — "next time, check X first"). The runbook, then, is both a tool (the procedures you follow) and a discipline (the practice of documenting, testing, and improving them) — and it's the hallmark of operations done professionally rather than heroically. The unprepared operator improvises each incident anew, stressed and error-prone; the prepared one follows a tested runbook, calm and reliable. Writing the runbook for your own project's database — how to back up, restore, monitor, and respond to slowness — is exactly the operational discipline this chapter teaches, and it's the capstone-worthy artifact that demonstrates you can not just build a database but operate one responsibly.

That distinction — building versus operating — is the heart of why this chapter exists. The earlier parts taught you to create databases: model, design, query, optimize, secure, connect. This chapter teaches you to run them: configure, monitor, back up, recover, maintain, upgrade. Both are essential, and a database professional needs both — a beautifully-built database that's unbacked-up, unmonitored, and unmaintained will eventually fail in production, while a well-operated one serves reliably for years. Operations is where the database meets the unforgiving reality of production: real users, real failures, real data that can't be lost. Mastering it — grounded in the internals you understand and disciplined by practices like the runbook and the tested restore — is what completes you as a database practitioner, ready not just to build databases but to be trusted with them in production. And being trusted with data in production is, ultimately, what a database career is built on — which is exactly where the book turns next, after the capstone that assembles everything you've learned into one demonstrable piece of work.


Progressive project: operate your database

For your project (and your capstone):

  1. Set up a backuppg_dump your project database on a schedule; restore it to a scratch database and verify it works (the test-restore habit).
  2. Monitor — query pg_stat_activity (what's running), pg_stat_statements (top queries — enable the extension), and pg_stat_user_tables (dead tuples).
  3. Review config — note your shared_buffers, work_mem, max_connections; reason about whether they fit your workload.
  4. Check autovacuum — confirm it's running; look at n_dead_tup on a churned table.
  5. Write an operational runbook: how to back up, restore, monitor, and what to check if "the database is slow."

Summary

Database administration keeps a production database healthy and recoverable. Configuration (postgresql.conf) — especially memory (shared_buffers ≈ 25% RAM, work_mem per-operation, effective_cache_size) — should be tuned to your hardware/workload iteratively. Monitoring via pg_stat_activity (live queries), pg_stat_statements (top queries to optimize), and pg_stat_user_tables (dead tuples/bloat) lets you see and pre-empt trouble. Backups are the most important responsibility: pg_dump (logical), pg_basebackup (physical), and WAL archiving + PITR (restore to any moment) — following 3-2-1, automated, and test-restored (an untested backup is just hope). Routine maintenance ensures autovacuum keeps up (bloat, long-transaction blockers, fresh stats — Ch. 28). Manage connections (pool, don't over-raise max_connections) and disk space (monitor — a full disk halts the database). Upgrade minor versions promptly and major versions deliberately (pg_upgrade, tested). Managed services operate much of this for you and are the right default for most — but the concepts still apply.

You can now: - Tune the key configuration settings to your workload. - Monitor with pg_stat_activity/pg_stat_statements/pg_stat_user_tables. - Choose and perform backups (logical/physical/PITR), and test restores (3-2-1). - Keep autovacuum healthy and manage connections and disk space. - Plan minor/major upgrades and choose managed vs. self-hosted.

What's next. Chapter 39 — Capstone — bringing the whole book together: your complete, designed, queryable, indexed, secured, documented, application-connected database — a portfolio-ready piece that demonstrates real database engineering skill.


Practice in exercises.md, test yourself with the quiz, apply it in the case studies, review the key takeaways, and go deeper with further reading.