Prerequisites

  • Chapter 22
  • Chapter 21
  • Chapter 19
  • Chapter 10
  • Chapter 2

Learning Objectives

  • Run the six-phase NIST SP 800-61 incident-response lifecycle and explain why response is a continuous loop, not a linear sequence.
  • Build an incident-response plan, severity-classification matrix, playbooks, runbooks, and a communications plan before an incident occurs.
  • Triage an alert into a declared incident and make containment, eradication, and recovery decisions under uncertainty and time pressure.
  • Facilitate a tabletop exercise and walk a ransomware scenario end-to-end from a defender's seat.
  • Conduct a blameless post-incident review that produces durable improvements rather than blame.

Chapter 24: Incident Response: Preparation, Detection, Containment, Eradication, Recovery, and Lessons Learned

"Everyone has a plan until they get punched in the mouth." — Mike Tyson

Overview

At 6:42 on a Saturday morning, a backup-monitoring job at Meridian Regional Bank emailed the on-call engineer to report that the previous night's snapshot of the loan-origination file server had failed integrity verification. That is the kind of alert that gets snoozed. Backups fail for boring reasons all the time — a full disk, a flaky agent, a network hiccup. The engineer, half-awake, almost acknowledged it and went back to sleep. Then a second alert arrived, then a third, and by 6:51 the endpoint-detection tool was screaming about a process named svc_host32.exe spawning vssadmin.exe delete shadows /all on four servers at once. The boring backup failure was not boring. It was the first visible symptom of ransomware deleting the bank's ability to recover before it began encrypting files.

What happened over the next eleven hours decided whether Meridian lost a weekend or lost its license. And almost none of what mattered was invented during those eleven hours. The decisions that saved the bank — who has authority to disconnect a production server, which systems to isolate first, when to involve the lawyers and the regulators, whether to pay, how to tell 2.5 million customers the truth without causing a panic — had all been rehearsed, on paper, months earlier, in a conference room with coffee and a printed scenario. That rehearsal is the subject of this chapter's centerpiece, and the eleven-hour incident is the subject of its first case study.

This is the chapter where everything you have built so far gets used in anger. The detections from Chapter 22 fire. The privileged-access controls from Chapter 19 either contain the blast or fail to. The logs from Chapter 21 become the ground truth of what actually happened. Incident response is the discipline of preparing for, detecting, containing, eradicating, recovering from, and learning from security incidents — and the brutal premise of the whole discipline is in this chapter's first section: it is not if you will have a serious incident, it is when. Prevention buys you time and reduces frequency; it never reaches zero. The mature organization is not the one that never gets breached. It is the one that detects fast, contains decisively, recovers cleanly, and gets measurably better afterward.

In this chapter, you will learn to:

  • Run the six phases of the NIST SP 800-61 incident-response lifecycle and explain why it is a loop, not a line.
  • Write the artifacts you must have before an incident: an IR plan, a severity matrix, playbooks, runbooks, and a communications plan — and staff the roles, including the incident commander.
  • Triage an alert into a declared incident, then make containment, eradication, and recovery decisions under uncertainty, knowing the tradeoffs of each.
  • Facilitate a tabletop exercise and walk a ransomware scenario from first alert to recovery as Meridian's team will.
  • Run a blameless postmortem that converts a bad day into permanent improvement.

Learning Paths

Incident response is core to the SOC track, central to GRC's regulatory and reporting obligations, and heavily tested on both certification exams.

🛡️ SOC Analyst: This is your chapter. §24.3 (triage and analysis) is the work you do every shift; §24.4 (containment/eradication/recovery) is the work you escalate into. Run the tabletop in §24.5 with your team; it is the single highest-value exercise in the book for an analyst.

🏗️ Security Engineer: Focus on §24.2 (you build the tooling and access paths that make fast containment possible) and §24.4 (the eradication and recovery mechanics live in systems you designed).

📋 GRC: §24.2's communications plan and §24.6's lessons-learned process are yours, along with the breach-notification obligations threaded through §24.4 and §24.5. For a bank, the regulatory clock is a first-class part of the response.

📜 Certification Prep: The NIST lifecycle phases, the difference between an event and an incident, severity classification, and chain of command appear on both Security+ and CISSP. Memorize the six phases and their order; key-takeaways.md maps them to exam domains.


24.1 It's not if, it's when

Every chapter before this one has been about lowering the odds. You hardened the operating systems, segmented the network, deployed phishing-resistant authentication, vaulted the privileged accounts, stood up the SIEM, and built detections. Each control made an incident less likely and less severe. None of them made it impossible, because — as the asymmetry from Chapter 1 dictates — the attacker needs to be right once and you need to be right every time, forever. Multiply enough days by enough employees by enough internet-facing surface, and the probability of some incident over a multi-year horizon approaches one.
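
The "approaches one" claim can be made concrete with a little arithmetic. The sketch below assumes each day is an independent trial with the same small probability of a serious incident; both the independence assumption and the daily probability are illustrative, not measured values.

```python
def prob_at_least_one(p_per_day: float, days: int) -> float:
    """Chance of at least one incident over `days`, assuming independent days."""
    return 1.0 - (1.0 - p_per_day) ** days


# Hypothetical daily probability, chosen only to show the shape of the curve.
DAILY_P = 0.001

for years in (1, 3, 5, 10):
    print(f"{years:>2} years: {prob_at_least_one(DAILY_P, 365 * years):.1%}")
```

With these made-up inputs the result is roughly 31% after one year and above 97% after ten. The exact numbers matter less than the shape: a small per-day risk compounds toward near-certainty over a multi-year horizon.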

🚪 Threshold Concept: A security program is not judged by whether it gets breached. It is judged by what happens next. Two banks can suffer the identical intrusion; one detects it in twenty minutes, isolates three machines, and is back to normal by Monday, while the other discovers it ten weeks later from an FBI phone call, having lost its entire customer database. Same attack, opposite outcomes — and the difference is entirely incident-response capability. Once you internalize that response is a security control in its own right, you stop treating IR as the embarrassing thing that happens when "real" security fails, and start treating it as the layer that makes every other layer's inevitable failure survivable. This is Theme 4 — defense in depth assumes each layer fails — applied to the program itself.

Let us fix vocabulary, because precision here prevents panic later. Your monitoring generates a constant stream of events: a login, a file write, a firewall deny, a process launch. Most events are routine. A subset are security events — observable occurrences with security relevance, like a failed login or an antivirus hit. A still-smaller subset rise to the level of a security incident: a violation or imminent threat of violation of security policy, acceptable-use policy, or standard security practice — in plainer terms, an event (or a correlated set of events) that has actually harmed, or credibly threatens to harm, the confidentiality, integrity, or availability of your assets. The funnel matters operationally: a SOC drowns if it treats every event as an incident, and it dies if it treats a real incident as just another event. The whole front end of incident response, triage, is the act of deciding where on that funnel a given alert sits.

🔗 Connection: This funnel is why Chapter 21's work on taming alert fatigue and Chapter 22's work on high-fidelity detection engineering matter so much here. Every false positive that reaches an analyst is friction in triage; every false negative is an incident you never get to respond to. Good detection is the raw material of good response. The SIEM correlation rules you wrote in Chapter 21 are, functionally, the trip-wires of your IR program.

There is a cost to admitting "it's when, not if," and it is a healthy cost: it forces you to do the unglamorous, un-fundable-feeling work of preparation before there is a fire to justify it. The organizations that respond well to incidents are, without exception, the ones that prepared when nothing was wrong. The ones that respond badly are the ones that intended to "figure it out when it happens." You cannot improvise a chain of command, a communications plan, a legal-engagement process, and a containment strategy at 3 a.m. while your servers encrypt. You can only execute the plan you already have — or discover, expensively, that you do not have one.

That is the case for the rest of this chapter. We will spend the most time on preparation (§24.2), because it is where the leverage is, and on the tabletop (§24.5), because it is how you prove your preparation works before an attacker tests it for you.

24.2 Preparation: the plan, the roles, and the comms

Preparation is the first and most important phase of the NIST lifecycle, and it is the phase that pays for all the others. We will treat it in three parts: the plan (the governing document and its supporting playbooks and runbooks), the roles (who does what, and who is in charge), and the communications plan (who tells whom, and when — including the parts that have legal deadlines).

The six-phase lifecycle

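First, the frame everything hangs on. The NIST SP 800-61 lifecycle is the canonical U.S. model for incident handling, and it organizes response into a continuous loop of phases. The exact phrasing has evolved across revisions: Revision 2 groups the six activities named in this chapter's title into four phases, combining containment, eradication, and recovery into one, which is why Figure 24.1 shows four boxes. The working model every practitioner carries is: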

        ┌──────────────────────────────────────────────────────────┐
        │                                                          │
        ▼                                                          │
 ┌─────────────┐     ┌──────────────────────┐     ┌──────────────┐ │
 │ 1. PREPARE  │────▶│ 2. DETECT & ANALYZE  │────▶│ 3. CONTAIN,  │ │
 │             │     │   (triage, scope)    │     │  ERADICATE,  │ │
 │ plan,       │     │                      │     │  RECOVER     │ │
 │ playbooks,  │     │  is it real? how     │     │              │ │
 │ tools,      │◀────│  bad? how far?       │     │ stop spread, │ │
 │ training,   │     └──────────┬───────────┘     │ remove,      │ │
 │ comms,      │                │  ▲              │ restore      │ │
 │ access      │                │  │ (new info    └──────┬───────┘ │
 └─────────────┘                ▼  │  re-scopes)         │         │
        ▲                  (loop back as you             │         │
        │                   learn the truth)             ▼         │
        │                                       ┌──────────────────┴───┐
        └───────────────────────────────────────│ 4. POST-INCIDENT     │
              feeds improvements back into       │   ACTIVITY           │
              preparation for next time          │ (lessons learned,    │
                                                 │  blameless review)   │
                                                 └──────────────────────┘

  Figure 24.1 — The NIST SP 800-61 incident-response lifecycle. The arrows that
  matter most are the ones that loop: detection and containment iterate as you
  learn the true scope, and post-incident lessons feed directly back into
  preparation. IR is a cycle, not a checklist you run once top to bottom.

Notice three things about Figure 24.1. First, preparation feeds and is fed by everything — the lessons you learn at the end become the preparation for next time, which is why a mature program's IR capability ratchets upward incident by incident. Second, detection/analysis and containment iterate: you will contain based on what you know, then learn the attacker was also on three machines you missed, and loop back to re-scope and re-contain. Treating these as a strict one-way sequence is a classic novice error that gets attackers re-entry. Third, the model is deliberately phase-based, not time-based — you can be containing one part of an incident while still analyzing another. We will walk each phase concretely in §24.3 and §24.4; here in preparation, the job is to build the machinery that makes those phases executable.
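
One way to see why the loop matters is to write the figure down as a small state machine. This is a minimal sketch, not part of the NIST model itself: the `Phase` names and the `TRANSITIONS` table are assumptions read off Figure 24.1, and it tracks a single workstream (a real incident can have several workstreams in different phases at once).

```python
from enum import Enum


class Phase(Enum):
    PREPARE = "prepare"
    DETECT_ANALYZE = "detect and analyze"
    CONTAIN_ERADICATE_RECOVER = "contain, eradicate, recover"
    POST_INCIDENT = "post-incident activity"


# Allowed moves, read off Figure 24.1. The backward edges are the point.
TRANSITIONS = {
    Phase.PREPARE: {Phase.DETECT_ANALYZE},
    Phase.DETECT_ANALYZE: {Phase.CONTAIN_ERADICATE_RECOVER},
    Phase.CONTAIN_ERADICATE_RECOVER: {
        Phase.DETECT_ANALYZE,  # new information re-scopes the incident
        Phase.POST_INCIDENT,
    },
    Phase.POST_INCIDENT: {Phase.PREPARE},  # lessons feed preparation
}


def can_move(current: Phase, nxt: Phase) -> bool:
    """True if the lifecycle allows moving from `current` to `nxt`."""
    return nxt in TRANSITIONS[current]


print(can_move(Phase.CONTAIN_ERADICATE_RECOVER, Phase.DETECT_ANALYZE))
```

A strictly linear model would reject the containment-to-analysis move; here it is an expected transition, which matches the re-scope-and-re-contain pattern described above.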

The incident-response plan

The incident-response plan (IR plan) is the single governing document that defines how your organization handles incidents. It is approved by leadership, reviewed at least annually, and — critically — short enough that someone can actually use it during a crisis. A 200-page plan no one has read is a compliance artifact, not a response capability. A good IR plan answers, crisply:

  • What is an incident, and who can declare one? Definitions and the severity matrix (below), plus the explicit statement that any employee can report and a named role can declare.
  • Who is on the incident-response team, and what are their roles? Including the chain of command (below).
  • How do we classify severity, and what does each level trigger? The matrix drives notification, escalation, and resourcing.
  • What are our notification and escalation paths? Internal (management, legal, executives) and external (regulators, law enforcement, customers, cyber-insurer) — with the legally mandated timelines called out.
  • Where are the playbooks, runbooks, contact lists, and tools? And where is the offline copy, because if your ransomware encrypts the file share that holds the IR plan, you have a problem that is darkly funny only in hindsight.

⚠️ Common Pitfall: Storing the IR plan, contact lists, and network diagrams only on the systems the IR plan exists to protect. During a ransomware or domain-compromise incident, those systems may be encrypted, untrusted, or deliberately taken offline. Keep an out-of-band copy — printed in the incident commander's binder, in a separate cloud tenant, on a USB drive in the on-call bag. The same logic applies to communications: if the attacker is in your email and Teams, you cannot coordinate the response in your email and Teams. Have an out-of-band comms channel (a separate Signal group, a conference bridge, a phone tree) decided in advance.

Playbooks and runbooks

The IR plan is the constitution; playbooks and runbooks are the operating procedures. The distinction is worth getting right because the words are often used loosely:

A playbook is a scenario-specific response procedure — the steps, decisions, and roles for handling a particular type of incident: a ransomware playbook, a phishing playbook, a business-email-compromise playbook, a data-exfiltration playbook, a lost-laptop playbook. A playbook is written at the level of decisions and coordination: "Declare severity. Convene the IR team. Isolate affected hosts. Engage legal. Decide on containment strategy." It is what the incident commander and team leads follow.

A runbook is the step-by-step technical procedure for a specific task within a response — concrete enough that a tier-1 analyst can execute it correctly at 3 a.m. under stress: "How to isolate a Windows host in our EDR console," "How to disable an Active Directory account and revoke its sessions," "How to pull and preserve memory from a suspect server," "How to block a domain at the DNS resolver and the proxy." A runbook is to a playbook what a precise recipe is to a menu. The playbook says isolate the host; the runbook says exactly which buttons to click, in which tool, with which approvals, and how to verify it worked.
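
The difference becomes easier to audit if a runbook is stored as structured data rather than free text. The sketch below is one possible shape, assuming each step records an action, a verification, and an optional approver; the class names, the `gaps` helper, and the sample steps are all hypothetical and not drawn from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    action: str                      # what the analyst does
    verify: str                      # how they confirm it worked
    approval: Optional[str] = None   # role that must approve, if any


@dataclass
class Runbook:
    title: str
    steps: List[Step] = field(default_factory=list)

    def gaps(self) -> List[str]:
        """Return actions that have no verification step written down."""
        return [s.action for s in self.steps if not s.verify.strip()]


# Hypothetical content; tool names and approvals will differ per organization.
isolate_host = Runbook(
    title="Isolate a Windows host in the EDR console",
    steps=[
        Step("Confirm hostname against the alert", "Hostname matches the ticket"),
        Step("Apply network isolation in the EDR console",
             "Host shows as isolated; ping from a peer fails",
             approval="Incident commander"),
        Step("Record the action in the incident timeline", ""),
    ],
)

print(isolate_host.gaps())
```

Running a check like `gaps()` across every runbook during a quiet week is a cheap way to find the steps that tell an analyst what to do but not how to confirm it worked.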

🛡️ Defender's Lens: Attackers move fast — the time from initial access to ransomware deployment has compressed, in many reported cases, from days to hours. Your advantage is that you do not have to think during those hours if you have already thought. A runbook that turns "isolate the host" from a five-minute hunt-for-the-right-screen into a thirty-second known procedure is, in a fast-moving incident, the difference between containing three machines and containing three hundred. Speed of correct action is a defensive weapon, and pre-written runbooks are how you buy it. This is the operational reason Chapter 19's just-in-time privileged access and Chapter 22's saved hunting queries pay off here: the access and the queries are already staged.

You do not need a playbook for every conceivable incident on day one. Build them in priority order, by what is both likely and damaging (the risk thinking from Chapter 1): for most organizations that means ransomware, phishing/BEC, account compromise, and data exfiltration first. Meridian's Project Checkpoint in this chapter builds exactly that starter set.

Severity classification

You cannot respond proportionally if every incident is treated as either "ignore it" or "five-alarm fire." Severity classification is the rubric that maps an incident to a severity level, and the severity level drives everything downstream: how fast you respond, how far up the chain you escalate, how many people you pull in, and which external parties you notify. A workable matrix is small — three to five levels — and defined on axes the organization actually cares about. For Meridian:

| Severity | Definition (any one trigger) | Response | Notify |
| --- | --- | --- | --- |
| SEV-1 (Critical) | Confirmed compromise of customer data, core banking, or a domain controller; active ransomware; any event with imminent regulatory-notification or major-outage stakes | Full IR team activated immediately, 24/7; incident commander assigned; war room opened | CISO + CIO + CEO; Legal; cyber-insurer; prepare regulator/law-enforcement engagement |
| SEV-2 (High) | Compromise of a single privileged account or sensitive server; confirmed malware with lateral-movement potential; targeted attack contained but live | IR lead + relevant engineers engaged within 30 min; commander assigned if escalation likely | CISO; SOC manager; Legal on standby |
| SEV-3 (Medium) | Single non-privileged account compromise; commodity malware on one endpoint, contained; policy violation with security impact | Handled by SOC during business hours; documented | SOC manager |
| SEV-4 (Low) | Reconnaissance, blocked attack, isolated low-risk event requiring tracking | Logged and trended; no activation | (queue) |
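
Because any one trigger is enough, the matrix reduces to a "highest matching level wins" rule. The sketch below encodes Meridian's matrix that way; the `Finding` fields and the `classify` function are hypothetical names chosen for illustration, and a real implementation would take its triggers from the approved IR plan rather than from code.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    customer_data_core_or_dc_compromised: bool = False
    active_ransomware: bool = False
    regulatory_or_major_outage_stakes: bool = False
    privileged_account_or_sensitive_server: bool = False
    malware_with_lateral_movement: bool = False
    targeted_attack_still_live: bool = False
    non_privileged_account: bool = False
    contained_commodity_malware: bool = False
    policy_violation: bool = False


NOTIFY = {
    "SEV-1": ["CISO", "CIO", "CEO", "Legal", "cyber-insurer"],
    "SEV-2": ["CISO", "SOC manager", "Legal (standby)"],
    "SEV-3": ["SOC manager"],
    "SEV-4": [],  # tracked in the queue only
}


def classify(f: Finding) -> str:
    """Highest matching level wins, because any one trigger is sufficient."""
    if (f.customer_data_core_or_dc_compromised or f.active_ransomware
            or f.regulatory_or_major_outage_stakes):
        return "SEV-1"
    if (f.privileged_account_or_sensitive_server
            or f.malware_with_lateral_movement or f.targeted_attack_still_live):
        return "SEV-2"
    if (f.non_privileged_account or f.contained_commodity_malware
            or f.policy_violation):
        return "SEV-3"
    return "SEV-4"


level = classify(Finding(active_ransomware=True, non_privileged_account=True))
print(level, NOTIFY[level])
```

Checking the most severe triggers first means a finding that matches several rows is never under-classified.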

🔄 Check Your Understanding:

  1. In your own words, what is the difference between a security event and a security incident, and why does that distinction matter operationally for a SOC?
  2. A playbook says "isolate the affected host." Why is that instruction insufficient on its own at 3 a.m., and what artifact fills the gap?
  3. Why must an IR plan, contact list, and key runbooks have an out-of-band copy?

Answers

  1. An event is any observable occurrence; an incident is an event (or correlated set) that violates or imminently threatens security policy and harms, or credibly threatens, confidentiality, integrity, or availability. Operationally it matters because a SOC must triage the flood of events down to the few real incidents; treating all events as incidents causes burnout, while treating an incident as a mere event misses a breach.
  2. "Isolate the host" is a decision-level instruction; under stress an analyst needs the exact tool, screen, approvals, and verification steps. That is the runbook, the step-by-step technical procedure the playbook references.
  3. Because the systems the plan protects may be encrypted, untrusted, or taken offline during the very incident when you need the plan; an out-of-band copy (printed, separate tenant, USB) guarantees access. The same applies to communications channels.

The roles and the chain of command

When an incident is declared, ad-hoc heroics fail. What works is a defined team with a clear chain of command centered on an incident commander (IC): the single person with the authority to make and own decisions during the incident, coordinate the response, and serve as the point of contact, regardless of their day-to-day rank. The role is borrowed, deliberately, from emergency-services incident command: in a crisis, someone must be able to say "we are disconnecting that server now" and have it happen, without a committee. The IC does not have to be the most technical person in the room, and usually should not be, because the job is to coordinate and decide, not to type at a keyboard. The IC keeps the response organized, the decisions made, the communications flowing, and the responders undistracted.

A workable IR team structure, mapped to Meridian's people:

| Role | Responsibility | Meridian |
| --- | --- | --- |
| Incident Commander | Owns the incident; makes/approves decisions; coordinates; single point of contact; runs the cadence | Priya Nair (IR lead) for most; CISO Dana Okafor for SEV-1 |
| Technical lead(s) | Directs the hands-on investigation, containment, eradication | Sam Whitfield (infra/cloud); SOC analysts |
| Scribe | Maintains the incident timeline and decision log in real time | A rotating SOC analyst |
| Communications lead | Manages internal and external messaging per the comms plan | Dana / Comms / Elena (GRC) for regulatory |
| Legal / compliance | Advises on notification obligations, privilege, evidence, law enforcement | Outside counsel + Elena Vasquez |
| Subject-matter experts | Pulled in as needed (app owners, AD admins, cloud, vendor) | As required |

📟 War Story: A constructed but representative example. A mid-size firm suffered a serious intrusion and assembled twelve smart people in a room — with no one designated as in charge. For the first ninety minutes, three different engineers independently took containment actions: one pulled a server's network cable, another disabled the account the attacker was using, and a third started re-imaging a box — destroying, in the process, the memory evidence the first engineer had been about to capture and tipping off the attacker, who promptly switched to a backup foothold. None of them was wrong individually; collectively they were chaos. The lesson the firm wrote into its plan afterward was one line: declare an incident commander before taking any containment action. Coordination is not bureaucracy. In an incident, it is the difference between a response and a stampede.

The communications plan

The technical response is only half the incident; the other half is communications, and for a regulated organization it is the half with legal deadlines and the half most likely to turn a contained incident into a public catastrophe. A communications plan, decided in advance, answers: who needs to know, who decides what to say, what we say, and when the clock starts.

The audiences, roughly in order of immediacy:

  • Internal — the response team and leadership. Via the out-of-band channel, on a fixed cadence (e.g., the IC gives a status update every 30–60 minutes for a SEV-1, even if the update is "no change"). Predictable rhythm prevents the leadership-anxiety spiral where five executives each DM the responders for an update and grind the response to a halt.
  • Internal — the broader workforce. What do employees do? ("Do not turn off your computers. Do not discuss this externally. Report anything unusual to this number.") Said clearly, early, to prevent rumor and well-meaning interference.
  • Legal and the cyber-insurer. Often the first external calls. Legal may direct the investigation under privilege; many cyber-insurance policies require notification within a tight window and provide a pre-vetted IR firm and breach coach. Failing to notify the insurer promptly can void coverage — a brutal way to learn the comms plan mattered.
  • Regulators and law enforcement. For a U.S. bank, this is not optional and the timelines are set by law and regulation. The interagency Computer-Security Incident Notification Rule (OCC, Federal Reserve, FDIC) requires notifying the primary federal regulator as soon as possible and no later than 36 hours after determining that a qualifying "computer-security incident" (one that rises to the level of a "notification incident") has occurred. State breach-notification laws impose their own deadlines for notifying affected consumers (commonly worded as "without unreasonable delay," with hard outer limits that vary by state). The FBI/CISA may be engaged for serious incidents, especially ransomware. The point for the responder: the regulatory clock can start before you fully understand the incident, so legal and GRC must be at the table early, and the determination of whether a reportable incident occurred is itself a decision the IC must drive on a deadline.
  • Affected customers and the public. What, when, and how — coordinated with legal and communications. Honesty and timeliness here protect trust; obfuscation and delay destroy it, often more than the breach itself.
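
The 36-hour regulator window described above is simple to track once the determination time is recorded. The following is a minimal sketch with hypothetical timestamps and function names. It only does the arithmetic: deciding when a qualifying incident was "determined" is a legal judgment the code does not attempt.

```python
from datetime import datetime, timedelta, timezone

REGULATOR_WINDOW = timedelta(hours=36)  # outer limit cited in the text


def regulator_deadline(determined_at: datetime) -> datetime:
    """Latest notification time, counted from the determination."""
    return determined_at + REGULATOR_WINDOW


def hours_left(determined_at: datetime, now: datetime) -> float:
    """Hours remaining before the deadline (negative once it has passed)."""
    return (regulator_deadline(determined_at) - now).total_seconds() / 3600


# Hypothetical timestamps for illustration.
determined = datetime(2024, 6, 1, 9, 0, tzinfo=timezone.utc)
now = datetime(2024, 6, 1, 20, 0, tzinfo=timezone.utc)
print(regulator_deadline(determined).isoformat())
print(f"{hours_left(determined, now):.1f} hours left")
```

Using timezone-aware timestamps avoids an easy mistake when responders, counsel, and regulators sit in different time zones. The scribe's timeline is the natural place to record the determination time this calculation depends on.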

⚖️ Authorization & Ethics: Notification is not only a legal obligation; it is an ethical one. The people whose data you hold have a right to know when it is at risk so they can protect themselves — freeze credit, change passwords, watch their accounts. There is a powerful institutional temptation to minimize, delay, or spin, especially before the facts are clear. Resist it. The organizations that emerge from breaches with their reputations intact are, overwhelmingly, the ones that told the truth promptly and treated affected people as stakeholders rather than liabilities. "What do we legally have to disclose?" is the floor (Theme 5); "what do the people we serve deserve to know?" is the standard.

🔄 Check Your Understanding:

  1. Who is the incident commander, and why is it often a mistake for the most technical responder to also be the IC?
  2. For a U.S. bank, name two distinct external notification obligations and why the "clock" is a problem for responders.
  3. Why is an out-of-band communications channel part of preparation, not something to set up during the incident?

Answers

  1. The IC is the single person with authority to make and own decisions, coordinate the response, and act as the point of contact during an incident. The most technical person should usually not be IC because the IC's job is coordination and decision-making, not hands-on investigation; doing both badly serves neither, and a head-down technical responder cannot keep the whole response organized.
  2. (a) The federal banking regulator must be notified within 36 hours of determining a qualifying computer-security incident; (b) state breach-notification laws require notifying affected consumers within statutory timelines. The clock is a problem because it can start before the incident is fully understood, forcing notification decisions under uncertainty.
  3. Because the incident may compromise or require taking offline the normal channels (email, Teams) and the systems hosting the plan; if the attacker is in your email, you cannot coordinate the response in your email. The alternative channel must be agreed and tested beforehand.

24.3 Detection and analysis: triage

Preparation is the standing capability. Detection and analysis is where an actual incident begins — the phase in which a signal arrives and you decide, under uncertainty, whether it is nothing, something, or a catastrophe, and how far it has spread. The craft of this phase is triage.

Where incidents are detected

Incidents announce themselves through many channels, and a mature program watches all of them:

  • Automated detections — the SIEM correlation rules (Chapter 21), the EDR alerts, the IDS/IPS, the detection-engineering content and threat hunts from Chapter 22. This is the channel you most control and most invest in.
  • Human reports — an employee reports a phishing email or a weird pop-up; a user notices files renamed with a strange extension. Theme 3: the human is also a sensor. Meridian's whole IR posture began (Chapter 1) with a reported email.
  • Third-party notification — a partner, a customer, a researcher, a threat-intel feed, or, worst of all, law enforcement telling you that your data is for sale or your IP is attacking others. A distressing share of real breaches are discovered this way, which is itself a measure of detection gaps.

Triage: from alert to declared incident

Triage is the rapid initial assessment that answers four questions, fast, with incomplete information:

  1. Is it real? True positive or false positive? An alert is a hypothesis, not a verdict.
  2. How bad is it? What severity, by the matrix? What is at stake — which assets, which leg of the CIA triad?
  3. How far has it spread (scope)? One host or fifty? One account or the domain? This is the question novices under-ask and attackers exploit.
  4. What do we do right now? Escalate? Declare? Begin containment? Keep watching to learn more?

Triage is a decision under uncertainty, and the discipline is to make a proportional, reversible-where-possible first move quickly rather than a perfect move slowly. Here is a triage decision tree an analyst can carry, rendered as the kind of flow a runbook would formalize:

                         ALERT ARRIVES
                              │
                              ▼
                   ┌──────────────────────┐
                   │ Validate: real?      │── false positive ──▶ close,
                   │ (corroborate across  │                     tune the rule
                   │  log sources)        │                     (feed Ch.21)
                   └──────────┬───────────┘
                              │ likely true
                              ▼
                   ┌──────────────────────┐
                   │ Assess severity &    │
                   │ scope: which assets? │
                   │ priv account? data?  │
                   │ how many hosts?      │
                   └──────────┬───────────┘
                              │
              ┌───────────────┼─────────────────┐
              ▼               ▼                 ▼
        SEV-3/4 (low)    SEV-2 (high)     SEV-1 (critical:
        handle in SOC,   engage IR lead,  data/core/DC/
        document,        consider          ransomware)
        monitor          declaring         │
                              │             ▼
                              │     DECLARE INCIDENT:
                              └────▶ assign IC, open war room,
                                    start scribe/timeline,
                                    invoke the matching PLAYBOOK,
                                    notify per comms plan
                                            │
                                            ▼
                                  CONTAIN ◀──▶ keep ANALYZING
                                  (loop: scope grows as you learn)

  Figure 24.2 — A triage decision tree. The two judgment calls that most
  separate strong from weak responders: (1) corroborating across log sources
  before believing or dismissing an alert, and (2) asking "how far?" early,
  because under-scoping is what lets a contained-looking incident reignite.
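
  The branches of Figure 24.2 can also be sketched as a small function. This is an illustration only: the `triage` name and the action strings are hypothetical, the severity labels follow Meridian's matrix, and the two inputs (whether the alert is real, and how severe) are exactly the judgment calls the code cannot make for the analyst.

```python
from typing import List


def triage(is_true_positive: bool, severity: str) -> List[str]:
    """Return the next actions for an alert, following Figure 24.2."""
    if not is_true_positive:
        return ["close the alert", "tune the detection rule"]
    if severity in ("SEV-3", "SEV-4"):
        return ["handle in SOC", "document", "monitor"]
    if severity == "SEV-2":
        return ["engage IR lead", "consider declaring an incident"]
    if severity == "SEV-1":
        return [
            "declare incident",
            "assign incident commander",
            "open war room",
            "start scribe timeline",
            "invoke matching playbook",
            "notify per comms plan",
            "contain while continuing to analyze",
        ]
    raise ValueError(f"unknown severity: {severity!r}")


print(triage(True, "SEV-2"))
```

  Raising an error on an unrecognized severity is a deliberate choice in this sketch: an alert that falls outside the matrix should be noticed, not silently treated as low priority.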

🔗 Connection: Step one — corroborate across log sources — is exactly why Chapter 21's centralized SIEM and normalization matter. A single EDR alert is a hypothesis; the same activity confirmed by a firewall connection log, a Windows authentication event, and a proxy request is an incident. And step two's "which technique is this?" leans on Chapter 22: mapping the alert to a MITRE ATT&CK technique immediately tells you what the attacker likely did before this alert and what they will likely do next, which is how you scope intelligently instead of waiting to be surprised.
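
Corroboration can be approximated as "how many independent log sources saw this host near the alert time?" The sketch below assumes events have already been normalized into dictionaries with `source`, `host`, and `time` keys. That schema, the ten-minute window, and the host names are all assumptions for illustration, not the format of any specific SIEM.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set


def corroborating_sources(
    host: str,
    alert_time: datetime,
    events: List[Dict],
    window: timedelta = timedelta(minutes=10),  # assumed tolerance
) -> Set[str]:
    """Distinct log sources that saw this host near the alert time."""
    return {
        e["source"]
        for e in events
        if e["host"] == host and abs(e["time"] - alert_time) <= window
    }


# Hypothetical, already-normalized events.
t0 = datetime(2024, 6, 1, 6, 51)
events = [
    {"source": "edr", "host": "fs01", "time": t0},
    {"source": "firewall", "host": "fs01", "time": t0 + timedelta(minutes=2)},
    {"source": "windows_auth", "host": "fs01", "time": t0 - timedelta(minutes=4)},
    {"source": "proxy", "host": "ws17", "time": t0},
]
print(sorted(corroborating_sources("fs01", t0, events)))
```

A count of one source leaves the alert as a hypothesis; several agreeing sources move it toward a declared incident. The threshold is a judgment call for the analyst, not something this sketch decides.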

Scoping: the question that decides the incident

Of triage's four questions, scope — how far the attacker has reached — is the one that most often determines whether an incident is resolved or merely paused. New responders find the alerting host, clean it, and declare victory; the attacker, who established three footholds and an additional persistence mechanism, is back by morning. Disciplined scoping asks: Given this indicator, where else could the same attacker, technique, or tool be? Concretely:

  • Indicator pivoting. Take every known indicator of compromise — a file hash, a malicious domain, a source IP, an attacker-created account, a scheduled task — and search the entire environment for it (the ioc_match capability from Chapter 22). One malicious hash on one host becomes "that hash is on six hosts" in one query.
  • Identity pivoting. If an account is compromised, what did it touch? Where did it authenticate? What did it have access to (the access reviews from Chapter 18, the privileged inventory from Chapter 19)? A compromised privileged account is presumed to have reached everything it could reach until proven otherwise.
  • Timeline building. The scribe and analysts assemble a chronology from the logs (the ground truth) — first observed activity, lateral moves, persistence, objective. This both scopes the incident and seeds the Chapter 25 forensic investigation.
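The indicator pivot and the timeline step above can be sketched in a few lines. This is a minimal illustration and not the Chapter 22 ioc_match implementation: the function names pivot_iocs and build_timeline, the host names, the date, and the shape of the observations data are all assumptions made for the example.

```python
from datetime import datetime

def pivot_iocs(iocs: set[str], observations: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return {host: matched indicators} for every host showing at least one IoC."""
    hits = {}
    for host, artifacts in observations.items():
        matched = iocs & artifacts
        if matched:
            hits[host] = matched
    return hits

def build_timeline(events: list[tuple[str, str, str]]) -> list[tuple[datetime, str, str]]:
    """Sort (ISO timestamp, host, description) records into a chronology."""
    parsed = [(datetime.fromisoformat(ts), host, what) for ts, host, what in events]
    return sorted(parsed)

# Hypothetical data: indicator values echo the chapter's scenario, hosts are invented.
iocs = {"svc_host32.exe", "svc_backup"}
observations = {
    "fs-loan-01": {"svc_host32.exe", "svc_backup", "explorer.exe"},
    "fs-loan-02": {"svc_host32.exe"},
    "hr-app-01": {"svc_backup"},
    "web-01": {"nginx"},
}
scope = pivot_iocs(iocs, observations)

events = [
    ("2024-06-08T06:51:00", "fs-loan-01", "vssadmin delete shadows"),
    ("2024-06-08T06:12:00", "hr-app-01", "first logon by svc_backup"),
    ("2024-06-08T06:40:00", "fs-loan-02", "svc_host32.exe started"),
]
timeline = build_timeline(events)

for when, host, what in timeline:
    print(when.isoformat(), host, what)
print(sorted(scope))
```

One alerting host becomes three hosts in scope, and the earliest timeline entry points at a machine that never alerted at all, which is the under-scoping trap in miniature.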

⚠️ Common Pitfall: Under-scoping (cleaning the one host you found and stopping) and tipping off the attacker (taking loud containment actions before you understand the scope, causing a sophisticated adversary to burn their known footholds and activate hidden ones). These two pull in opposite directions — you must scope before you contain, but scoping takes time the attacker is using to spread. There is no formula; this is exactly the judgment a tabletop builds. The general guidance: for fast, destructive incidents (ransomware mid-encryption) contain immediately and scope in parallel; for slow, stealthy intrusions (an APT that has been resident for weeks), invest in quiet, thorough scoping before a coordinated containment, so you evict the attacker everywhere at once rather than playing whack-a-mole.

🔄 Check Your Understanding: 1. What four questions does triage answer, and which one do novices most often under-ask? 2. Why is corroborating an alert across multiple log sources a better first move than acting on a single alert? 3. When should you contain immediately versus scope-before-containing, and why?

Answers

  1. (1) Is it real? (2) How bad (severity, stakes)? (3) How far has it spread (scope)? (4) What do we do now? Novices most under-ask scope ("how far?"), which is what lets a contained-looking incident reignite from a foothold that was never found.
  2. A single alert is a hypothesis that may be a false positive or may be one symptom of a larger event; corroboration across sources (SIEM, firewall, auth, proxy) both confirms the alert is real and begins to reveal its true scope, preventing both wasted effort on false positives and under-scoping of real incidents.
  3. Contain immediately for fast, destructive incidents (active ransomware encryption) where every minute of spread is irreversible damage; scope thoroughly before containing for slow, stealthy intrusions, so you can evict the adversary everywhere simultaneously rather than tipping them off and triggering hidden footholds.

24.4 Containment, eradication, and recovery

Once an incident is declared and being scoped, the response moves into the phase that visibly fights the attacker: stop the spread (containment), remove the attacker and their tools (eradication), and restore normal operations safely (recovery). These overlap and iterate with analysis — you contain what you have scoped, keep scoping, and contain more.

Containment

Containment is action taken to limit the scope and magnitude of an incident — to stop the bleeding before you treat the wound. It is almost always the most urgent decision in a response, and it is genuinely hard because it trades competing goods. Containment splits into two:

Short-term containment stops immediate damage and spread, fast, accepting that it may be crude: isolate a host from the network (but leave it powered on to preserve memory evidence — see Chapter 25), disable a compromised account and revoke its active sessions and tokens, block a malicious IP or domain at the firewall/proxy/DNS resolver, pull a network segment, suspend a runaway service. Short-term containment is reversible and buys time.

Long-term containment is the more durable holding pattern that lets the business limp along while you prepare eradication and recovery: rebuilding a clean system to take over from an infected one, applying temporary firewall rules and access restrictions, deploying additional monitoring, resetting credentials broadly. It is what keeps the lights on between "we stopped the spread" and "we have fully evicted the attacker and restored."

The containment decision is a tradeoff among at least four competing concerns, and naming them is how you decide rationally under pressure rather than by instinct:

  Concern                            | Pulls you toward
  -----------------------------------+------------------------------------------------------------
  Stopping the damage                | Aggressive, immediate isolation: disconnect everything now
  Preserving evidence                | Not powering off (lose memory), not wiping (lose forensics): isolate, don't destroy
  Maintaining business operations    | Surgical containment: keep critical services running if at all possible
  Avoiding tipping off the attacker  | Quiet, coordinated containment: don't let a sophisticated adversary see you coming

A worked containment decision makes this concrete. Suppose triage at Meridian has confirmed ransomware actively encrypting files on three servers, spreading via a compromised domain-admin account, with the backup-deletion command already observed. Walk the four concerns: stopping the damage dominates utterly, because every minute is irreversible loss and the backups (your recovery path) are under active attack; evidence preservation matters but not at the cost of letting encryption continue (isolate the hosts, keep them powered, but do not delay); business operations must yield, because encrypted core systems are already not operating; tipping off the attacker is moot, because they are mid-objective and loud. Decision: immediate, aggressive containment — isolate the three servers at the network layer, disable the compromised domain-admin account and force-revoke its sessions/Kerberos tickets domain-wide, block the command-and-control domains, and consider an emergency segmentation cut (or, in extremis, a controlled disconnect of broader segments) to halt lateral spread. That is the opposite decision from a stealthy, weeks-resident APT, where you would scope quietly for days and contain everywhere at once. The incident type determines the containment posture — which is exactly the logic this chapter's bluekit module encodes.

🛡️ Defender's Lens: Notice how much faster the right containment is when the prerequisites were built earlier. "Disable the domain-admin account and revoke its sessions everywhere" is a thirty-second runbook if you have the privileged-account inventory from Chapter 19 and the identity tooling from Chapter 18 — and a frantic hour of "wait, how many domain-admin accounts do we even have, and how do we revoke Kerberos tickets?" if you do not. "Isolate the host" is one click if your EDR supports network isolation and your analysts have practiced the runbook — and a scramble to find the right switch port otherwise. Containment speed is mostly preparation speed.

Eradication

Eradication is the removal of the attacker, their access, and the artifacts of the compromise from the environment — so that recovery does not simply restore the breach. Eradication is where under-scoping comes home to roost: you can only eradicate what you found, so eradication is only as good as the scoping in §24.3. Typical eradication actions:

  • Remove malware and attacker tooling — but for anything seriously compromised, rebuild from known-good rather than "clean," because you can rarely prove you found every implant. The mantra for a deeply compromised host is wipe and reimage, do not disinfect.
  • Eliminate persistence mechanisms — the scheduled tasks, services, run keys, web shells, rogue accounts, and added SSH keys the attacker planted to survive a reboot. Miss one and the attacker returns.
  • Reset and rotate credentials — every credential the attacker could have accessed, presumed compromised. For a domain-controller compromise this can mean resetting the entire domain, including the krbtgt account (twice), because the attacker may hold forgeable Kerberos golden tickets. This is a heavy, disruptive action — and exactly the kind a bank rehearses, not improvises.
  • Close the initial access vector — patch the exploited vulnerability, fix the misconfiguration, remove the phished account's standing access. If you eradicate the attacker but leave the open door, you are merely waiting for re-entry.
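Because eradication is only as good as scoping, it can help to track eradication as a set difference between what scoping found and what has actually been removed. The sketch below is an assumption-laden illustration: the Foothold record, the remaining helper, and the host names are hypothetical, and a real tracker would live in the incident ticket rather than in a script.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Foothold:
    """One attacker artifact found during scoping (hypothetical record shape)."""
    host: str
    kind: str  # e.g. "implant", "scheduled_task", "rogue_account"

def remaining(scoped: set[Foothold], removed: set[Foothold]) -> list[Foothold]:
    """Footholds found during scoping that no eradication action has covered yet."""
    return sorted(scoped - removed, key=lambda f: (f.host, f.kind))

scoped = {
    Foothold("fs-loan-01", "implant"),
    Foothold("fs-loan-02", "scheduled_task"),
    Foothold("dc-01", "rogue_account"),
}
removed = {
    Foothold("fs-loan-01", "implant"),
    Foothold("dc-01", "rogue_account"),
}

todo = remaining(scoped, removed)
for item in todo:
    print("still present:", item.host, item.kind)
```

An empty result only proves that everything found has been removed; it says nothing about footholds scoping never discovered, which is why the rebuild-rather-than-clean rule still applies.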

Recovery

Recovery is the careful, verified restoration of systems to normal operation and the confirmation that the threat is truly gone. The temptation, after a hard incident, is to rush back online; the discipline is to restore deliberately. Recovery's components:

  • Restore from known-good backups — which is why the integrity and offline/immutable nature of backups (the thing the ransomware in our scenario tried to delete first) is a recovery prerequisite, and why "untested backups" was a CRITICAL risk all the way back in Chapter 1's case study. A backup you cannot restore from is not a backup.
  • Rebuild rather than restore where compromise is suspected, and validate the integrity of anything you bring back.
  • Restore in a prioritized, staged order — critical business functions first, per a business-impact analysis, not whatever is easiest.
  • Monitor intensively after recovery. The attacker may try to return, and a recovered-but-still-compromised environment is the worst of both worlds. Heightened monitoring on the affected systems and identities for a defined window is standard; many programs keep the incident formally open until this watch period passes clean.
  • Verify before declaring closure. Confirm the initial vector is closed, persistence is gone, credentials are rotated, and systems are clean and stable. Only then does the IC declare the incident recovered, then closed.
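Two of the recovery steps above, staged restoration and the closure check, reduce to simple logic. The following is a minimal sketch under stated assumptions: the tier numbers stand in for a business-impact analysis, and the system names, the CLOSURE_CHECKS labels, and both function names are hypothetical.

```python
def restore_order(systems: dict[str, int]) -> list[str]:
    """Order systems by business-impact tier (1 = most critical), then by name."""
    return sorted(systems, key=lambda name: (systems[name], name))

# Hypothetical labels for the verification items listed in the text.
CLOSURE_CHECKS = (
    "initial_vector_closed",
    "persistence_removed",
    "credentials_rotated",
    "systems_verified_clean",
    "watch_period_clean",
)

def ready_to_close(status: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (ok, missing); any check that is absent or False blocks closure."""
    missing = [check for check in CLOSURE_CHECKS if not status.get(check, False)]
    return (not missing, missing)

systems = {"loan-origination": 1, "intranet-wiki": 3, "core-banking": 1, "hr-portal": 2}
order = restore_order(systems)

status = {
    "initial_vector_closed": True,
    "persistence_removed": True,
    "credentials_rotated": True,
    "systems_verified_clean": True,
    # "watch_period_clean" not yet recorded: the monitoring window is still running
}
ok, missing = ready_to_close(status)
print(order)
print(ok, missing)
```

Treating a missing check as a failure is deliberate: closure should require positive evidence for every item, not the absence of bad news.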

🧩 Try It in the Lab: In your own VM sandbox, practice the recovery half safely without any malware. Take a snapshot of a clean VM, make some benign "changes" (create files, add a scheduled task, a local user), then practice (a) detecting those changes against your known-good baseline using the harden.py audit from Chapter 11, and (b) restoring the VM from the clean snapshot and verifying it matches the baseline. You will feel viscerally why a known-good baseline and a tested restore are the load-bearing parts of recovery — and why an organization that has never tested a restore is gambling its recovery on an untested assumption.
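If you do not have the Chapter 11 audit to hand, the baseline-comparison idea can be approximated with file hashes. This sketch is not harden.py and makes no claim about how that tool works; the snapshot and diff functions and the file names are assumptions, and it covers files only (not scheduled tasks or local users). It runs entirely inside a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path

def snapshot(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest

def diff(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two manifests and report added, removed, and changed files."""
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "changed": sorted(p for p in shared if baseline[p] != current[p]),
    }

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "app.conf").write_text("mode=prod\n")
    (root / "notes.txt").write_text("clean\n")
    baseline = snapshot(root)                         # the known-good state

    (root / "app.conf").write_text("mode=debug\n")    # benign "change"
    (root / "task.xml").write_text("<task/>\n")       # benign "addition"
    (root / "notes.txt").unlink()                     # benign "removal"
    changes = diff(baseline, snapshot(root))

print(changes)
```

After restoring the snapshot in your lab, rerunning the comparison should return three empty lists; anything else means the restore did not return you to the baseline.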

🔄 Check Your Understanding: 1. Distinguish short-term from long-term containment, and give one example of each. 2. Why is "wipe and reimage" preferred over "disinfect" for a seriously compromised host during eradication? 3. Why must backups be offline/immutable and tested for recovery to work — and which earlier chapter flagged untested backups as a top risk?

Answers

  1. Short-term containment stops immediate damage/spread quickly and reversibly (e.g., isolate a host from the network, disable a compromised account); long-term containment is a durable holding pattern that keeps the business running while you prepare eradication/recovery (e.g., stand up a clean replacement system, apply temporary firewall rules, broad credential reset).
  2. Because you can rarely prove you found and removed every implant or persistence mechanism on a deeply compromised host; rebuilding from known-good media guarantees a clean state, whereas "cleaning" risks leaving a hidden foothold from which the attacker returns.
  3. Ransomware (and destructive attackers) deliberately target backups to remove the recovery path (observed in our scenario as the shadow-copy deletion), so backups must be offline/immutable to survive the attack and tested so you know the restore actually works. Chapter 1's case study flagged untested backups as a CRITICAL risk.

24.5 The ransomware tabletop at Meridian

You cannot wait for a real ransomware attack to discover whether your IR plan works. You rehearse it. A tabletop exercise is a discussion-based simulation in which the response team walks through a realistic incident scenario step by step — making the decisions, invoking the playbooks, and stress-testing the plan, the roles, and the communications — without touching any production system. It is the cheapest, highest-leverage thing in this chapter: a conference room, a facilitator, an injected scenario, and the people who would actually respond. Meridian runs one quarterly. We will sit in on the ransomware tabletop, because ransomware against a bank is the scenario where every part of this chapter is exercised at once. (This walkthrough is constructed for teaching — Tier 3 — but built on the well-documented pattern of real critical-infrastructure ransomware such as the Colonial Pipeline incident, generalized.)

Setup. Priya Nair facilitates; she has a scripted scenario with timed injects (new pieces of information released as the exercise unfolds) that she does not show the team in advance. Present: Dana (CISO), Sam (engineering), two SOC analysts, Elena (GRC), and — joining by phone, because Meridian learned to include them — outside counsel and a representative from the cyber-insurer's breach team. The ground rule: respond exactly as you would in reality, using the actual plan, playbooks, and runbooks. "I would look it up" is not an answer; "I would open runbook IR-07 and isolate the host in the EDR console, which takes about thirty seconds and needs the on-call analyst's approval" is.

Inject 1 (T+0, Saturday 06:51). EDR alerts on four servers: a process svc_host32.exe is executing vssadmin.exe delete shadows /all. Backup-integrity jobs failed overnight. A SOC analyst is on call. — The team triages. Is it real? Four corroborating EDR alerts plus failed backups: yes. How bad? Shadow-copy deletion is a textbook ransomware precursor (an ATT&CK "Inhibit System Recovery" technique, per Chapter 22), on servers, attacking recovery — this is SEV-1. The on-call analyst declares an incident and pages Priya, who becomes incident commander and opens the war room and the out-of-band Signal channel. A scribe is assigned and starts the timeline. First real test passed: someone declared, fast, and a commander took the chair before anyone touched a server.

Inject 2 (T+8 min). Scoping shows the four servers were reached from a single account — svc_backup, a service account with domain-admin rights (a finding the team winces at — it should never have had that). The compromised account is a domain admin, so by the scoping discipline of §24.3 it is presumed to have reached everything it could reach. Scope is potentially the whole domain. Sam pulls the privileged inventory (Chapter 19) and confirms the blast radius. — Now the containment decision (the worked example from §24.4): active encryption is imminent, recovery is under attack, the account is loud and mid-objective. Decision: immediate aggressive containment. Runbook IR-07: isolate the four servers in EDR (keep powered, for evidence). Runbook IR-03: disable svc_backup, force-revoke its sessions and Kerberos tickets. Block the C2 domains at the resolver and proxy. Priya authorizes all three in parallel — because she has the authority to, and that authority was assigned in preparation.

Inject 3 (T+25 min). Encryption is confirmed on two of the four servers; files now carry a .MERIDIANLOCK extension. A ransom note appears: 75 Bitcoin, with a threat to leak exfiltrated customer data ("double extortion") if unpaid in 72 hours. This is the inject designed to test composure and the comms and legal machinery. Several things must happen at once, which is exactly why roles exist: (1) Technical — continue scoping (did data actually leave? check egress/proxy logs and DLP), confirm containment held, identify the initial vector. (2) Legal/insurer — counsel now directs the investigation under privilege; the insurer's breach coach is engaged; both were reachable because the comms plan listed them. (3) Regulatory clock — Elena flags that the 36-hour federal banking notification determination clock is now ticking and state breach-notification obligations may apply if customer data was exfiltrated; the IC must drive a determination on a deadline. (4) The pay-or-not question — surfaced now, decided by no one in the room alone.
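Inject 3 starts two clocks at once, and the scribe should put both on the board as absolute times. The arithmetic is sketched below with assumptions flagged: the calendar date, the use of UTC, and the 09:00 determination time are hypothetical, and when the 36-hour clock legally starts is a question for counsel, not for this script.

```python
from datetime import datetime, timedelta, timezone

def deadline(start: datetime, hours: int) -> datetime:
    """Absolute deadline a fixed number of hours after a start time."""
    return start + timedelta(hours=hours)

def hours_left(due: datetime, now: datetime) -> float:
    """Hours remaining until a deadline (negative once it has passed)."""
    return (due - now).total_seconds() / 3600

UTC = timezone.utc
incident_start = datetime(2024, 6, 8, 6, 51, tzinfo=UTC)   # T+0 (hypothetical date)
note_seen = incident_start + timedelta(minutes=25)          # T+25 min, ransom note
determination = datetime(2024, 6, 8, 9, 0, tzinfo=UTC)      # assumed determination time

ransom_due = deadline(note_seen, 72)       # attacker's 72-hour threat
notify_due = deadline(determination, 36)   # 36-hour notification clock

now = incident_start + timedelta(hours=6)  # T+6 hours
remaining = hours_left(notify_due, now)

print("ransom threat expires:", ransom_due.isoformat())
print("regulator notice due: ", notify_due.isoformat())
print(f"hours left to notify:  {remaining:.2f}")
```

Timezone-aware values are used on purpose: a response that spans a weekend can cross a daylight-saving change, and naive wall-clock arithmetic would then be off by an hour.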

⚠️ Common Pitfall: Treating "do we pay the ransom?" as a technical or even a purely financial decision to be made in the heat of the moment. It is a strategic, legal, ethical, and sometimes sanctions decision — paying certain sanctioned groups can itself be unlawful, and payment never guarantees a working decryptor or that exfiltrated data is actually deleted. The mature posture, set in preparation, is a written ransom policy and a default of "recover from backups, do not pay" that is only revisited by named executives with legal counsel when recovery is genuinely impossible. The tabletop's job is to make the team confront this before a real 72-hour clock is melting their judgment. Meridian's policy, validated here: do not pay if recoverable; if recovery is impossible, the decision rises to the CEO with counsel and the insurer, never to the responders.

Inject 4 (T+90 min). Containment is holding: no new encryption since T+30. Scoping finds the initial access: svc_backup's password was weak and reused, exposed in an earlier unrelated breach, and the account was internet-reachable through a misconfigured management interface. No evidence (yet) that bulk customer data was exfiltrated, though the attacker's note claims it. The team moves toward eradication and recovery planning while analysis continues. Eradication: rebuild the four servers from known-good media (not disinfect); rotate all privileged credentials and, given domain-admin compromise, plan the krbtgt double-reset; remove the management-interface exposure and any persistence found; fix the root cause, which is the over-privileged, weakly authenticated service account. Recovery: restore the loan-origination data from the offline, immutable backups (which survived because they were offline; the attacker's deletion command reached only the online shadow copies); restore critical services first; maintain heightened monitoring for two weeks. Elena drafts the regulator notification; counsel and comms draft holding statements for staff and, if exfiltration is confirmed, customers.

Inject 5 (T+6 hours, debrief in the exercise). Priya calls the scenario and runs the mini-retro: What worked? What broke? What do we change? The exercise surfaced four concrete gaps, which is the entire point of running it — a tabletop that finds no problems was facilitated too gently. Meridian's findings:

  1. svc_backup should never have been domain admin (a PAM finding → Chapter 19 remediation, tightened JIT and tiering).
  2. The out-of-band comms channel existed but half the team had not installed it — fixed by adding it to onboarding.
  3. The krbtgt reset runbook did not exist; nobody was sure of the exact procedure — a runbook was written the following week.
  4. The determination process for the 36-hour clock was unclear — Elena and counsel wrote a one-page decision aid.

📟 War Story: Why a bank rehearses ransomware specifically. The Colonial Pipeline ransomware incident in 2021 is the generalized model behind this scenario: an attacker gained initial access through a single exposed, weakly-authenticated account (a legacy VPN profile with a reused, no-MFA password), deployed ransomware that threatened core operations, and forced an agonizing set of decisions — including a controlled shutdown of the operational pipeline as a containment measure with enormous real-world consequences, a ransom payment made under duress (much of it later clawed back by the FBI), and a national-scale communications and regulatory crisis. The transferable lessons map exactly onto Meridian's tabletop: a single weak credential on an exposed interface was the initial vector; the hardest decisions were containment tradeoffs (what do we shut down, and what does that cost?) and the ransom question; and preparation — or its absence — shaped everything that followed. A bank that has walked this scenario on paper, quarterly, will not be making it up the morning it becomes real.

🔄 Check Your Understanding: 1. What is a tabletop exercise, and what does it mean to say "a tabletop that finds no problems was facilitated too gently"? 2. In the scenario, what made "immediate aggressive containment" the right call at T+8, and how did preparation make that containment fast? 3. Why is "do we pay the ransom?" not a decision for the responders in the war room to make on the spot?

Answers

  1. A tabletop is a discussion-based simulation in which the response team walks a realistic scenario step by step, exercising the plan, roles, playbooks, and comms without touching production. The point of running one is to find gaps in a safe setting; if it surfaces no problems, the facilitator did not inject enough realistic friction (timed injects, surprises, hard decisions).
  2. Active/imminent encryption with the recovery path (backups) under attack and a loud, mid-objective adversary means every minute is irreversible loss and tipping off the attacker is moot, so stopping the damage dominates. Preparation made it fast because the privileged inventory (Ch.19), EDR isolation runbook, and account-revocation runbook already existed, and the IC already had the authority to order them in parallel.
  3. Because it is a strategic, legal, ethical, and potentially sanctions-related decision (paying sanctioned groups may be unlawful; payment guarantees nothing) that must be governed by a pre-written ransom policy and escalated to named executives with counsel, not improvised under a melting clock by people whose job is the technical response.

24.6 Lessons learned without blame

The incident is contained, eradicated, recovered. The exhausted team wants to never speak of it again. That impulse, indulged, guarantees the same incident recurs — which is why the final phase of the lifecycle, post-incident activity, is non-negotiable, and why its central ritual is the blameless postmortem.

A blameless postmortem (or blameless post-incident review) is a structured retrospective that analyzes what happened and how the response went with the explicit ground rule that the goal is to improve systems and processes, not to assign individual blame. The premise, borrowed from aviation safety and site-reliability engineering, is empirical, not sentimental: people do not come to work to cause incidents. When something goes wrong, the productive question is almost never "who screwed up?" — it is "what about our systems, processes, defaults, and information made this failure possible, and even reasonable, for a competent person acting in good faith?" Blame produces silence: if responders fear punishment, they hide mistakes, near-misses go unreported, and the organization goes blind precisely where it most needs to see. Blamelessness produces truth, and truth produces durable fixes.

🚪 Threshold Concept: Blamelessness is not "no accountability" or "no consequences ever." It is the recognition that for learning, the individual is almost never the root cause, and treating them as one destroys your ability to find the real one. If a junior analyst clicked a phishing link, the blameless question is not "why were you so careless?" but "why was a single click able to do damage — where were the technical controls, the training, the email defenses, the least-privilege limits that should have made that click survivable?" The click is the trigger; the systemic gaps are the cause. Organizations that learn this stop punishing the people who report problems and start fixing the systems that produce them — and they get more reporting, more near-misses surfaced, and fewer repeat incidents. It is the single biggest cultural lever in security operations, and it is the seed of the SOC culture you will build in Chapter 37.

A good post-incident review, held within a week or two while memories are fresh, produces a written report and a short list of concrete, owned, deadlined action items. A workable agenda:

  1. Timeline. What happened, when, established from the logs and the scribe's record. Facts before opinions.
  2. What went well. Genuinely — name it, so you keep doing it. (At Meridian: the analyst declared fast; containment held.)
  3. What went poorly or got lucky. The gaps, the delays, the things that worked only because of luck. (The over-privileged service account; the missing runbook.)
  4. Root-cause analysis. Why did it happen and why did detection/response take as long as it did? Push past the trigger to the systemic causes — the classic "five whys" ends at a process or design gap, never at a person.
  5. Action items. Specific, assigned to a named owner, with a due date, and tracked to completion. "Improve security" is not an action item; "re-provision the svc_backup account to remove domain-admin rights and enforce JIT, owner: Sam, due: 30 days" is. These feed directly back into preparation, closing the loop in Figure 24.1.
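The difference between a real action item and a wish can be checked mechanically. The sketch below is illustrative only: the ActionItem record, the problems and overdue helpers, the dates, and especially the word-count test for vagueness are assumptions, and that last heuristic is deliberately crude.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done: bool = False

def problems(item: ActionItem) -> list[str]:
    """List the reasons an item is likely to evaporate (empty list = acceptable)."""
    issues = []
    if len(item.description.split()) < 4:   # crude stand-in for "too vague"
        issues.append("too vague")
    if not item.owner:
        issues.append("no owner")
    if item.due is None:
        issues.append("no due date")
    return issues

def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose due date has already passed."""
    return [i for i in items if not i.done and i.due is not None and i.due < today]

vague = ActionItem("Improve security")
solid = ActionItem(
    "Remove domain-admin rights from svc_backup and enforce JIT",
    owner="Sam",
    due=date(2024, 7, 8),   # hypothetical date, 30 days after the incident
)

print(problems(vague))
print(problems(solid))
print([i.description for i in overdue([vague, solid], today=date(2024, 7, 15))])
```

Running a check like this at the end of the review meeting, before anyone leaves the room, is one way to make sure every item has a name and a date while the people who can supply them are still present.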

⚠️ Common Pitfall: The postmortem that produces a beautiful document and zero change. Action items that are vague, unowned, undeadlined, or untracked evaporate, and the next incident re-teaches the same lesson at full price. The discipline is ownership and follow-through: each item has a name and a date, the list is reviewed until every item is closed, and the metrics improve (the mean-time-to-detect and mean-time-to-respond you will formalize in Chapter 36). A lesson is only "learned" when the system has changed so the failure cannot recur the same way.

There is also a metrics dimension, which we will develop fully in Chapter 36 but seed here: the post-incident review is where you capture mean time to detect (MTTD) and mean time to respond/recover (MTTR) for this incident, and track them across incidents over time. A program getting better detects faster and recovers faster, and the trend line of those two numbers is the most honest measure of whether your IR capability is actually maturing or just accumulating documents.
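As a preview of the arithmetic, both numbers are plain averages over per-incident intervals. Definitions vary between organizations; this sketch assumes MTTD runs from the earliest known attacker activity to detection and MTTR from detection to verified recovery. The field names and the two incident records are invented for illustration.

```python
from datetime import datetime
from statistics import mean

def _hours(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600

def mttd_hours(incidents: list[dict]) -> float:
    """Mean time to detect: earliest known activity -> detection (assumed definition)."""
    return mean(_hours(i["started"], i["detected"]) for i in incidents)

def mttr_hours(incidents: list[dict]) -> float:
    """Mean time to recover: detection -> verified recovery (assumed definition)."""
    return mean(_hours(i["detected"], i["recovered"]) for i in incidents)

# Hypothetical incident records captured at each post-incident review.
incidents = [
    {
        "started": datetime(2024, 6, 8, 6, 0),
        "detected": datetime(2024, 6, 8, 7, 0),     # 1 hour to detect
        "recovered": datetime(2024, 6, 8, 18, 0),   # 11 hours to recover
    },
    {
        "started": datetime(2024, 9, 14, 0, 0),
        "detected": datetime(2024, 9, 14, 3, 0),    # 3 hours to detect
        "recovered": datetime(2024, 9, 14, 8, 0),   # 5 hours to recover
    },
]

mttd = mttd_hours(incidents)
mttr = mttr_hours(incidents)
print(f"MTTD: {mttd:.1f} h, MTTR: {mttr:.1f} h")
```

With only a handful of incidents the mean is noisy, so the trend across reviews matters more than any single value.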

Project Checkpoint

Meridian's program reaches the artifact that makes every prior control survivable: an incident-response capability. The bluekit increment encodes the two decisions this chapter is built on — triage and containment posture.

Program increment — IR plan, playbooks, and the tabletop. Following the near-miss that started everything (Chapter 1) and now armed with detection (Chapter 22) and privileged-access controls (Chapter 19), Dana's team produces Meridian's first real IR program: a concise IR plan (definitions, the SEV-1–SEV-4 matrix, the chain of command with the incident-commander role, notification paths including the 36-hour banking clock), a starter set of four playbooks (ransomware, phishing/BEC, account compromise, data exfiltration) in risk priority, the supporting runbooks the tabletop proved were missing (host isolation, account disable/session-revoke, krbtgt reset), and an out-of-band comms plan. They validate it all by running the §24.5 ransomware tabletop and feeding its findings back into the plan. This artifact slots into the Chapter 38 capstone as the bank's response capability and underpins the SOC operating model of Chapter 37.

bluekit increment — ir.py. Two functions that turn this chapter's judgment into reusable code. triage(alert) maps an alert's signals to a severity and a recommended action (the §24.3 decision tree); containment(incident_type) returns the containment posture for an incident type (the §24.4 tradeoff logic). As always, the code is illustrative and never executed during authoring — the expected output is hand-traced.

# bluekit/ir.py  — Chapter 24 increment
"""Incident-response triage and containment helpers.

triage(alert):        signals -> (severity, action)   [the SS24.3 decision tree]
containment(type):    incident type -> posture        [the SS24.4 tradeoff logic]
"""

def triage(alert: dict) -> tuple[str, str]:
    """Map alert signals to (severity, recommended action). Highest trigger wins."""
    data_or_core = alert.get("affects") in {"customer_data", "core_banking", "domain_controller"}
    if alert.get("ransomware") or data_or_core:
        return ("SEV-1", "DECLARE: assign IC, open war room, invoke playbook, notify per comms plan")
    if alert.get("privileged_account") or alert.get("lateral_movement"):
        return ("SEV-2", "ENGAGE IR lead; consider declaring; begin scoping")
    if alert.get("malware") or alert.get("account_compromise"):
        return ("SEV-3", "HANDLE in SOC; document; monitor")
    return ("SEV-4", "LOG and trend")


def containment(incident_type: str) -> str:
    """Return the containment posture for an incident type (SS24.4 tradeoffs)."""
    fast_destructive = {"ransomware", "wiper", "active_exfiltration"}
    stealthy = {"apt", "persistent_intrusion", "insider_slow"}
    if incident_type in fast_destructive:
        return "IMMEDIATE aggressive containment: isolate now, scope in parallel (speed > stealth)"
    if incident_type in stealthy:
        return "QUIET thorough scoping first, then coordinated containment everywhere at once"
    return "PROPORTIONATE: contain the known footholds, keep scoping"


if __name__ == "__main__":
    print(triage({"ransomware": True, "affects": "core_banking"}))
    print(triage({"privileged_account": True}))
    print(triage({"malware": True}))
    print(containment("ransomware"))
    print(containment("apt"))

# Expected output:
# ('SEV-1', 'DECLARE: assign IC, open war room, invoke playbook, notify per comms plan')
# ('SEV-2', 'ENGAGE IR lead; consider declaring; begin scoping')
# ('SEV-3', 'HANDLE in SOC; document; monitor')
# IMMEDIATE aggressive containment: isolate now, scope in parallel (speed > stealth)
# QUIET thorough scoping first, then coordinated containment everywhere at once

Trace the first call by hand: {"ransomware": True, ...} → the first if is true → SEV-1 with the declare action. The second: no ransomware, affects is absent so data_or_core is false, but privileged_account is true → SEV-2. The third: only malware → falls to the third if → SEV-3. containment("ransomware") hits fast_destructive → the immediate posture; containment("apt") hits stealthy → the quiet-scope-first posture. The module is tiny, but it captures the two judgment calls (how bad is this, and how aggressively do I contain it?) that the rest of the chapter spent ten thousand words teaching you to make. You have written the decision logic of Meridian's response.

Summary

This chapter turned every prior control's inevitable failure into something survivable: an incident-response capability.

  • It's when, not if. Prevention lowers frequency and severity but never reaches zero; a program is judged by how it responds. An event is any observable occurrence; a security incident is an event that violates or imminently threatens security policy and harms (or credibly threatens) confidentiality, integrity, or availability.
  • The NIST SP 800-61 lifecycle is a loop: Prepare → Detect & Analyze → Contain, Eradicate, Recover → Post-Incident Activity → (back to Prepare). Detection/containment iterate as scope grows; lessons feed preparation.
  • Preparation is where the leverage is: an IR plan (definitions, severity matrix, escalation), playbooks (scenario-level decision procedures), runbooks (step-by-step technical tasks), a communications plan (internal, legal/insurer, regulators, customers — with statutory clocks like the 36-hour banking notification), the roles (a named incident commander with decision authority and a defined chain of command), and an out-of-band copy of all of it.
  • Severity classification (SEV-1–SEV-4) drives response speed, escalation, resourcing, and notification — proportional response, not all-or-nothing.
  • Triage answers four questions under uncertainty — is it real, how bad, how far (scope), what now — corroborating across log sources and asking "how far?" early. Under-scoping reignites incidents; tipping off the attacker burns known footholds.
  • Containment (short-term: fast/reversible; long-term: durable holding) trades stopping damage vs. preserving evidence vs. business continuity vs. stealth; the incident type sets the posture (immediate for ransomware; quiet-then-coordinated for stealthy APTs). Eradication removes the attacker, persistence, and access (wipe-and-reimage, rotate credentials, close the initial vector). Recovery restores from known-good, tested, offline backups, staged by priority, with heightened post-recovery monitoring.
  • The ransomware tabletop rehearses all of it on paper, quarterly; a tabletop that finds no problems was facilitated too gently.
  • Blameless postmortems find systemic causes, not individuals; they produce a short list of owned, deadlined, tracked action items that feed back into preparation. Track MTTD/MTTR (Chapter 36) as the honest measure of maturity.
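As a rough illustration of the MTTD/MTTR tracking mentioned in the last bullet, the sketch below averages detection and recovery intervals over a handful of incident records. The record layout, the sample timestamps, and the function names are all hypothetical, and the definitions used here (MTTD as start-to-detection, MTTR as detection-to-recovery) are one common convention assumed for the example; Chapter 36 is where the metrics are treated properly.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records (illustrative data, not from the chapter):
# (first malicious activity, detection time, recovery time).
INCIDENTS = [
    (datetime(2024, 1, 6, 2, 0), datetime(2024, 1, 6, 6, 0), datetime(2024, 1, 6, 18, 0)),
    (datetime(2024, 3, 2, 9, 0), datetime(2024, 3, 2, 11, 0), datetime(2024, 3, 2, 17, 0)),
]


def mean_hours(deltas):
    """Average a list of timedeltas and express the result in hours."""
    return mean([d.total_seconds() for d in deltas]) / 3600


def mttd_hours(incidents):
    """Mean time to detect: assumed here to be start -> detection."""
    return mean_hours([detected - started for started, detected, _ in incidents])


def mttr_hours(incidents):
    """Mean time to recover: assumed here to be detection -> recovery."""
    return mean_hours([recovered - detected for _, detected, recovered in incidents])


if __name__ == "__main__":
    print(f"MTTD: {mttd_hours(INCIDENTS):.1f} h")
    print(f"MTTR: {mttr_hours(INCIDENTS):.1f} h")
```

With the two sample records, detection took 4 and 2 hours and recovery took 12 and 6 hours, so the averages come out to 3.0 and 9.0 hours. Whatever definitions a program adopts, the timestamps should come from the scribe's incident timeline so that the trend is comparable from one quarter to the next.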

Spaced Review

Retrieval practice across recent and older chapters. Answer before scrolling up.

  1. (Ch. 22) During triage you map an alert to a MITRE ATT&CK technique. How does identifying the technique help you scope the incident — what does it tell you about what the attacker likely did before and after this alert?
  2. (Ch. 22) Distinguish an indicator of compromise (IoC) from a behavioral detection. In scoping a declared incident, why do you pivot on every IoC across the whole environment rather than just cleaning the host that alerted?
  3. (Ch. 21) Why is centralized log normalization in the SIEM a prerequisite for the triage step "corroborate the alert across multiple log sources"?
  4. (Ch. 21) Your ransomware playbook depends on detection rules firing. Connect alert fatigue (Ch. 21) to incident response: what does a SOC drowning in false positives do to your detection phase?

Answers

  1. ATT&CK techniques sit in a sequence of tactics (initial access → execution → persistence → privilege escalation → lateral movement → impact); identifying the technique of the alerting activity tells you which tactics likely preceded it (so you know where else to look, e.g., the persistence and credential-access steps that usually come before "inhibit system recovery") and which are likely to follow (so you can pre-empt them), which is how you scope intelligently instead of reactively.
  2. An IoC is a concrete artifact of a *known* compromise (a hash, domain, IP, attacker account); a behavioral detection flags *patterns of activity* (e.g., a process spawning shadow-copy deletion) regardless of specific artifacts. You pivot on every IoC across the whole environment because the same attacker/tool is very likely present on more than the one host that happened to alert; searching environment-wide turns one indicator into the true scope and prevents under-scoping.
  3. Because corroboration requires comparing events from different sources (EDR, firewall, authentication, proxy) in a common schema and timeline; without centralized collection and normalization those sources are siloed and time-mismatched, and an analyst cannot quickly confirm that one alert is the same activity another source saw.
  4. Alert fatigue degrades detection: analysts overwhelmed by false positives tune out, respond slowly, or miss the real alert among the noise, so a high false-positive rate effectively lowers your detection capability and lengthens MTTD, which is why tuning (Ch. 21) and high-fidelity detection engineering (Ch. 22) are IR prerequisites, not separate concerns.
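The environment-wide pivot described in answer 2 can be sketched in a few lines. This is a minimal illustration that assumes events have already been collected and normalized into a common schema (the point of answer 3); the field names, hostnames, hash values, and the `hosts_matching` helper are hypothetical and do not correspond to any particular SIEM's query API.

```python
# Hypothetical normalized events, roughly as a SIEM might expose them
# after parsing. Field names and values are illustrative only.
EVENTS = [
    {"host": "fs01", "source": "edr", "sha256": "aaa111", "dest_ip": None},
    {"host": "fs02", "source": "proxy", "sha256": None, "dest_ip": "203.0.113.7"},
    {"host": "ws17", "source": "edr", "sha256": "bbb222", "dest_ip": None},
    {"host": "db03", "source": "firewall", "sha256": None, "dest_ip": "203.0.113.7"},
]

# Indicators taken from the one host that alerted (hypothetical values).
IOCS = {"sha256": {"aaa111"}, "dest_ip": {"203.0.113.7"}}


def hosts_matching(events, iocs):
    """Return {host: set of log sources} for every event matching any IoC."""
    hits = {}
    for event in events:
        for field, values in iocs.items():
            if event.get(field) in values:
                hits.setdefault(event["host"], set()).add(event["source"])
    return hits


if __name__ == "__main__":
    for host, sources in sorted(hosts_matching(EVENTS, IOCS).items()):
        print(host, sorted(sources))
```

Here only fs01 raised the original alert, yet the pivot also surfaces fs02 and db03 because they contacted the same address. That gap between the alerting host and the matching hosts is the under-scoping the answer warns about.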

What's Next

You contained, eradicated, and recovered — but you also kept the compromised servers powered on, isolated rather than wiped, because you will need to know exactly what the attacker did, how they got in, and what they touched. That investigation is digital forensics, and Chapter 25 is its discipline for defenders: preserving evidence and chain of custody, acquiring disk and memory in the right order (order of volatility), reconstructing the attacker's timeline from artifacts, and scoping the breach with rigor — turning the messy aftermath of an incident into a defensible, evidence-based account of what truly happened. Incident response stops the bleeding; forensics establishes the truth, and the two are partners. The timeline your scribe started in this chapter is where the forensic investigation of the next one begins.