

Prerequisites

  • Chapter 10
  • Chapter 7
  • Chapter 6

Learning Objectives

  • Explain why logs are the ground truth of a security program and identify the highest-value log sources to collect first.
  • Design a log collection and normalization pipeline that turns heterogeneous raw events into a common schema.
  • Write correlation rules and detection use cases that turn normalized events into high-fidelity alerts.
  • Query a SIEM in SQL, SPL, and KQL to investigate an alert and hunt across collected logs.
  • Diagnose and reduce alert fatigue by tuning, suppressing, and aggregating noisy detections.
  • Distinguish a SIEM from a data lake and from a SOAR platform, and explain how detection-as-code and dashboards mature a SOC.

Chapter 21: Security Information and Event Management (SIEM): Centralized Logging and Correlation

"In God we trust. All others must bring data." — W. Edwards Deming (widely attributed)

Overview

For six days, an attacker had a foothold inside Meridian Regional Bank, and the evidence was sitting in the logs the entire time. A service account had logged into a server it had never touched before. Minutes later, that same account ran a command to enumerate the bank's Active Directory groups. An hour after that, a domain controller recorded the account being added to a privileged group. Three separate systems — a Windows server, a domain controller, and the endpoint detection agent — each wrote a truthful record of exactly what happened, at the moment it happened. And nobody noticed, because each record lived in a different place, in a different format, on a different team's machine, and no one was looking at all three together.

That is the failure a Security Information and Event Management system exists to prevent. Individually, each of those three events was unremarkable — service accounts log in, administrators enumerate groups, group memberships change, thousands of times a day at a bank. The attack was invisible in any single log. It was only obvious when you put the three events next to each other, in order: a service account that should never log in interactively did so, then reconnoitered, then escalated its own privileges, all within ninety minutes. The signal was not in any one event. The signal was in the correlation. Meridian had all the data and none of the visibility, which is the most common — and the most preventable — way that a breach becomes a disaster instead of a footnote.

This chapter is about closing that gap. We have spent the last twenty chapters building defenses — firewalls, hardened hosts, identity controls, encrypted channels — and instrumenting the network to see traffic (Chapter 10). Every one of those controls produces logs. A firewall logs the connections it allows and denies; a domain controller logs every authentication; a cloud platform logs every API call; an endpoint agent logs every process that runs. Apart, they are a thousand disconnected diaries. A SIEM is the discipline and the technology of bringing them together — collecting them, giving them a common language, watching them in concert, and turning the patterns that matter into alerts a human will actually act on. It is the central nervous system of a Security Operations Center, and in this chapter you stand one up for Meridian and write its first ten detections.

In this chapter, you will learn to:

  • Explain why logs are the ground truth of every investigation, and choose which log sources to collect first when you cannot collect everything.
  • Build a pipeline that collects and normalizes heterogeneous logs into one schema so that "the user", "the source IP", and "the action" mean the same thing no matter which system produced the event.
  • Write correlation rules and frame detection use cases that combine events across systems and time into alerts with real fidelity.
  • Query a SIEM fluently in the three dialects you will meet in the field — SQL, Splunk's SPL, and Microsoft's KQL — to investigate an alert and hunt.
  • Diagnose alert fatigue, the operational disease that quietly disables most SIEMs, and tune detections so analysts trust their queue.
  • Place the SIEM in the larger picture: how it differs from a data lake and a SOAR platform, and how detection-as-code and dashboards turn a pile of logs into a measurable program.

Learning Paths

The SIEM is the heart of security operations, so this chapter is core for the SOC track and important for everyone who feeds or funds it.

🛡️ SOC Analyst: This is your home. Read every section, but live in §21.3 (correlation and use cases), §21.4 (querying, the skill you will use on every shift), and §21.5 (alert fatigue, the difference between a SOC that works and one that drowns). The Project Checkpoint's siem.py mirrors what your SIEM does under the hood.

🏗️ Security Engineer: Focus on §21.2 (collection and normalization architecture; you will build the pipeline) and §21.6 (data lake vs. SIEM vs. SOAR, an architecture decision you will own). You design where the logs flow and how they are stored and retained.

📋 GRC: Skim §21.1 (why logging is also a compliance obligation; PCI-DSS and others mandate it) and §21.6 (log retention as a policy and legal question). You will help write the logging standard this chapter produces.

📜 Certification Prep: SIEM, log correlation, normalization, and alert tuning appear across CompTIA Security+ and CISSP (Domain 7, Security Operations). The terms in §21.1–§21.5 map directly to exam objectives; key-takeaways.md has the crosswalk.


21.1 Logs are the ground truth

Begin with a hard truth about defense: prevention always eventually fails, and when it does, the only thing standing between you and a guess is your logs. You cannot interview a server about what happened to it last Tuesday. You cannot ask a firewall to remember a connection it did not record. Whatever the system wrote down at the time is what you have, and whatever it did not write down is gone forever. This is why, before we talk about fancy correlation, we have to talk about the humble, unglamorous fact that everything in security operations rests on: logs are the ground truth.

A log is a timestamped record that a system writes describing something that happened — an event. A log source is any system or application that produces those records: a Windows server, a Linux host, a firewall, a web server, a DNS resolver, an Active Directory domain controller, an AWS account, an identity provider, an endpoint agent, an application your own developers wrote. Every meaningful action in a modern environment can leave a log somewhere, which means that — in principle — every attack leaves traces. The attacker who phished a credential, logged in, moved laterally, and exfiltrated data did each of those things on a system that could have logged it. Whether it actually did, and whether anyone collected and kept that log, is the entire question.

Why trust logs above all else? Because, handled correctly, they are the closest thing security has to objective evidence. A log written by the authentication service at 09:14:02 saying that user jchen logged in from 198.51.100.23 is a fact about the world, recorded by the system itself, not an inference or an opinion. When you investigate an incident, when you respond to it (Chapter 24), and when you later perform forensics on it (Chapter 25), logs are the spine of the timeline. A breach investigation is, at its core, the reconstruction of a sequence of events from the records the systems left behind. No logs, no reconstruction — just speculation.

🚪 Threshold Concept: Logs are the ground truth. Every detection you will ever write, every investigation you will ever run, and every incident timeline you will ever reconstruct is built on records that systems wrote at the time things happened. This reframes a great deal of security engineering: a control that does its job but logs nothing is half a control, because when it fails — and everything eventually fails (Theme 4) — you will have no idea how, when, or how far. "Is it logged?" becomes a question you ask about every system you build. Internalize this and you will start designing for visibility, not just for prevention.

There is a second reason to care about logs that has nothing to do with catching attackers: you are very likely required to keep them. Compliance is the floor, not the ceiling (Theme 5), and the floor here is concrete. PCI-DSS requires organizations that handle cardholder data to log access to it and to review those logs; it even specifies a minimum retention period: at least twelve months of audit log history, with the most recent three months immediately available for analysis. The GLBA Safeguards Rule that governs Meridian expects monitoring. Many regulations require that you be able to detect and investigate unauthorized access, and that is impossible without logs. So Meridian must collect and retain logs whether or not it ever writes a single detection rule. The opportunity, and the subject of this chapter, is to make those logs do double duty: satisfy the auditor and actually defend the bank.

What good logging looks like (and what an attacker does to it)

Not all logging is equal. A log is only useful if it captures the right events, with enough detail, with accurate and synchronized timestamps, and if it is stored somewhere the attacker cannot quietly erase. Each of those is a place attacks live.

Consider the attacker's perspective for a moment, because you cannot defend what you do not understand. An attacker who gets administrative access to a host can often turn off logging, clear the logs, or modify them to remove evidence of their presence — a family of techniques the MITRE ATT&CK framework (Chapter 2) groups under "Indicator Removal," including the very common move of clearing the Windows Security event log. If your logs live only on the machine that generated them, an attacker who owns that machine owns its memory of what they did. This is precisely why a core principle of secure logging is to ship logs off the box — forward them, in as close to real time as you can, to a central system the attacker does not control. Once an event has reached the SIEM, clearing it on the source host is too late: you already have the copy.

🛡️ Defender's Lens: Log clearing is itself a high-fidelity detection. On Windows, the act of clearing the Security log generates Event ID 1102; on Linux, gaps or truncation in auth.log or the systemd journal are suspicious. A mature SOC does not just collect logs — it alerts when logging stops. An attacker covering their tracks by disabling or clearing logs trips a wire by doing so. We turn this into one of Meridian's first ten use cases later in the chapter. The lesson generalizes: the attacker's attempt to become invisible is often the most visible thing they do.
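The two wires described above (a log-clear event and a source that goes quiet) can be sketched in a few lines of Python. This is a minimal illustration and not the chapter's siem.py: the `event_id` field, the host name FS02, and the 15-minute silence threshold are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

def find_log_clears(events):
    """Atomic rule: any Windows Security log clear (Event ID 1102)."""
    return [e for e in events
            if e.get("source") == "win_security" and e.get("event_id") == 1102]

def find_silent_hosts(last_seen, now, max_silence=timedelta(minutes=15)):
    """Return hosts whose most recent event is older than max_silence."""
    return sorted(h for h, ts in last_seen.items() if now - ts > max_silence)

# Hypothetical normalized events; "event_id" is an assumed extra field.
now = datetime(2025, 5, 14, 9, 30, tzinfo=timezone.utc)
events = [
    {"timestamp": now - timedelta(minutes=2), "source": "win_security",
     "host": "DC01", "event_id": 4624},   # ordinary successful logon
    {"timestamp": now - timedelta(minutes=40), "source": "win_security",
     "host": "FS02", "event_id": 1102},   # Security log cleared
]

# Track the newest event time per host so silence can be measured.
last_seen = {}
for e in events:
    host = e["host"]
    if host not in last_seen or e["timestamp"] > last_seen[host]:
        last_seen[host] = e["timestamp"]

cleared = [e["host"] for e in find_log_clears(events)]
silent = find_silent_hosts(last_seen, now)
print(cleared, silent)  # ['FS02'] ['FS02']
```

In this sample the same host trips both wires: it cleared its log and then sent nothing for forty minutes. The silence threshold should be tuned per source in practice, since a busy domain controller and a rarely used server have very different normal gaps.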

Timestamps deserve special attention because correlation depends on them entirely. If your firewall's clock is four minutes off from your domain controller's clock, then events that happened together will appear minutes apart in the SIEM, and a correlation rule that looks for "a login within sixty seconds of a firewall allow" will silently miss real attacks. The fix is mundane and non-negotiable: every log source synchronizes its clock to a common time source (NTP), and the SIEM normalizes everything to a single timezone. Coordinated Universal Time (UTC) is the standard choice, so that an analyst in one timezone and a log from another line up without arithmetic. We will treat UTC as the law for the rest of this chapter, and you should treat it as law in any real deployment.
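To make the timezone and drift problem concrete, here is a small Python sketch. It assumes a hypothetical firewall that logs local time at a fixed UTC-4 offset with no zone marker; the helper name `to_utc` is invented for this example.

```python
from datetime import datetime, timedelta, timezone

def to_utc(raw, fmt, utc_offset_hours=0):
    """Parse a zone-less timestamp recorded at a known fixed offset; return aware UTC."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(raw, fmt).replace(tzinfo=local_zone)
    return local.astimezone(timezone.utc)

# Assumed: the firewall writes local time (UTC-4); the domain controller writes UTC.
fw = to_utc("2025-05-14 05:14:07", "%Y-%m-%d %H:%M:%S", utc_offset_hours=-4)
dc = datetime.fromisoformat("2025-05-14T09:14:07+00:00")

window = timedelta(seconds=60)
print(fw.isoformat())          # 2025-05-14T09:14:07+00:00
print(abs(fw - dc) <= window)  # True: the two events line up once both are UTC

# The same firewall event with a clock running four minutes fast.
skewed = fw + timedelta(minutes=4)
within_window = abs(skewed - dc) <= window
print(within_window)           # False: the sixty-second rule silently misses it
```

Converting to UTC fixes the timezone mismatch, but no amount of parsing repairs a drifting clock. That part has to be solved at the source with NTP.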

🔄 Check Your Understanding:

  1. Why is forwarding logs to a central SIEM, rather than reviewing them on each host, a security control and not merely a convenience?
  2. An auditor asks Meridian to prove it can detect unauthorized access to cardholder data. Which property of "the ground truth" makes that possible, and what fails if logs are not retained long enough?
  3. Two log sources disagree about when an event happened by three minutes. Name the root cause and the fix, and explain which detection capability this protects.

Answers

  1. Because an attacker who compromises a host can clear or alter its local logs; once the event has been forwarded to a central system the attacker does not control, the evidence survives even if the source is fully compromised. Centralization is integrity for your evidence.
  2. Logs are objective records written by the systems themselves, so they can demonstrate access occurred (or that you would have seen it). If retention is too short, the records of an intrusion discovered months later (the common case) are already gone, so you can neither detect nor investigate it.
  3. Clock drift between sources; fix by synchronizing all sources to a common time source (NTP) and normalizing to UTC. This protects time-based correlation, which depends on events from different sources lining up accurately.

21.2 Collecting and normalizing at scale

Knowing you need logs is easy. Getting all of them, from hundreds of different systems that each speak their own dialect, into one place where you can reason about them together — that is the engineering problem, and it has two halves: collection (getting the data in) and normalization (making it speak one language).

Collection: getting the logs in

Logs reach a SIEM by one of a few well-worn paths, and a real environment uses all of them:

  • Agents. A small program installed on each host reads local logs and forwards them. Endpoint agents and log shippers (the open-source Filebeat/Fluentd family; vendor agents) live here. Agents are reliable and can collect rich data, but they are software you must deploy and maintain on every machine.
  • Syslog. The lingua franca of network devices and Unix systems. Firewalls, switches, routers, and Linux hosts emit syslog messages over the network (UDP or TCP, ideally TLS-encrypted) to a collector. Cheap and universal; the message format, however, is famously loose.
  • API pulls. Cloud and software-as-a-service systems do not run your agents, so you collect their logs by calling their APIs — pulling AWS CloudTrail (Chapter 15), Microsoft 365 audit logs, or an identity provider's sign-in logs on a schedule. Identity is the new perimeter, and these identity and cloud logs are among the most valuable you can collect.
  • Direct integrations / streaming. High-volume sources push to a message bus (a Kafka-style stream) that the SIEM consumes, decoupling the firehose of events from the system that has to keep up with them.

The first strategic decision in any SIEM program is what to collect, because you cannot afford to collect and retain everything at full fidelity: storage is finite, and commercial SIEMs that license by data volume charge for every gigabyte you ingest. You cannot defend what you cannot see, but you also cannot pay to keep every packet forever. So you prioritize by detection value: which sources, if you had them, would catch the most attacks? A defensible priority order for a typical enterprise looks like this, and it is the order Meridian follows:

Log-source priority (collect from the top down):
  1. Identity/authentication  — AD, Entra ID, IdP sign-ins, VPN  (the new perimeter)
  2. Endpoint detection (EDR) — process creation, persistence, defense-evasion
  3. Cloud control plane      — CloudTrail / Azure activity / GCP audit
  4. Network edge             — firewall allow/deny, proxy, DNS, IDS/IPS  (Ch.7, 10)
  5. Servers (Windows/Linux)  — Security log, sudo/auth, key services
  6. Critical applications    — core banking, web app, database access logs
  7. SaaS/email               — M365/Google audit, mail gateway

Why this order? Because identity and endpoint telemetry catch the techniques attackers most reliably use — credential abuse, privilege escalation, and code execution — and they catch them across the whole kill chain, not just at the perimeter. Network logs (which you built the capacity for in Chapter 10) are essential context, but a clever attacker can do tremendous damage entirely inside the perimeter, visible only in identity and endpoint logs. Start where the attacks are.

⚠️ Common Pitfall: "Collect everything and figure it out later." Teams new to SIEM, especially when sold a tool that promises to ingest anything, pipe in every log they can find. The result is a system so expensive and so noisy that it gets defunded or ignored — a SIEM full of debug logs from a printer and HTTP 200s from a load balancer, with the one authentication log that matters buried underneath. Collection without a detection purpose is just expensive hoarding. Collect what you have a use case for (§21.3), keep cheaper copies of the rest in a data lake (§21.6), and resist the firehose.

Normalization: making everything speak one language

Here is the problem normalization solves, shown concretely. Three different systems record what is essentially the same kind of event — "a user did something from somewhere" — and each describes it completely differently. Here is a raw authentication failure from a Linux host's auth.log:

May 14 09:14:07 web01 sshd[20881]: Failed password for jchen from 198.51.100.23 port 52344 ssh2

Here is a (simplified) Windows failed-logon event, as a SIEM might receive it in key-value form:

EventID=4625 TimeCreated=2025-05-14T09:14:07Z TargetUserName=jchen IpAddress=198.51.100.23 LogonType=10 Status=0xC000006A WorkstationName=DC01

And here is a line a firewall might emit for a denied connection from the same address:

2025-05-14 09:14:07 DENY TCP 198.51.100.23:52344 -> 10.20.4.7:22 rule=42 iface=outside

To a human, the common thread is obvious: the same source IP, the same moment, the same kind of activity. To a computer, these are three unrelated blobs of text. The user is called jchen in one, TargetUserName=jchen in another, and is absent from the third. The source address is from 198.51.100.23 here, IpAddress=198.51.100.23 there, and 198.51.100.23:52344 in the firewall line. The timestamp format is different in all three. You cannot write a single rule across them, and you cannot ask one question — "show me everything involving 198.51.100.23 in the last hour" — and get all three back, until they share a vocabulary.

Normalization is the process of transforming raw log events from many sources into a single common structure — a shared schema — so that the same concept always has the same field name and format. Parsing is the lower-level step within it: extracting the meaningful fields out of a raw message (pulling jchen, 198.51.100.23, and the timestamp out of that sshd line). You parse to get the fields out; you normalize to give them common names and formats. After normalization, all three events above carry the same field names — timestamp, user, src_ip, action, outcome, source — and the query that was impossible becomes trivial.

Modern SIEMs lean on published schemas so that everyone's fields mean the same thing: the Elastic Common Schema (ECS), the Open Cybersecurity Schema Framework (OCSF), and Splunk's Common Information Model (CIM) are the common ones. You do not have to invent field names; you map each source's fields onto an agreed model. The principle, however, is universal and older than any of these standards: decide what the canonical fields are, and translate every source into them on the way in.

Here is what our three raw events look like after normalization into a small common schema — exactly the transformation siem.py will perform in this chapter's Project Checkpoint:

Normalized events (common schema):
  {timestamp:2025-05-14T09:14:07Z, source:linux_auth, user:jchen,
   src_ip:198.51.100.23, action:login, outcome:failure, host:web01}
  {timestamp:2025-05-14T09:14:07Z, source:win_security, user:jchen,
   src_ip:198.51.100.23, action:login, outcome:failure, host:DC01}
  {timestamp:2025-05-14T09:14:07Z, source:firewall, user:null,
   src_ip:198.51.100.23, action:connection, outcome:deny, host:fw-edge}

Now a single question — "everything from 198.51.100.23 in the last hour" — returns all three, and a single rule — "five failed logins from one src_ip in five minutes" — works whether the failures came from Linux, Windows, or both. Normalization is the unglamorous foundation that makes correlation possible. Skip it and every detection becomes a bespoke, brittle, source-specific hack.
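The parse-then-normalize step described above can be sketched in Python for the three raw lines shown earlier. This is an illustrative sketch, not the Project Checkpoint's siem.py: the function names and regular expressions are invented here, the syslog year and UTC clock are assumptions (the sshd line carries neither), and the firewall's host name is passed in because the raw line does not contain it.

```python
import re
from datetime import datetime

ISO = "%Y-%m-%dT%H:%M:%SZ"
SSHD_FAIL = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) sshd\[\d+\]: "
    r"Failed password for (?P<user>\S+) from (?P<ip>\S+) port \d+")
FIREWALL = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<verdict>ALLOW|DENY) \w+ "
    r"(?P<ip>\d+\.\d+\.\d+\.\d+):\d+ -> ")

def normalize_sshd(line, year=2025):
    # Syslog omits the year and zone: assume the given year and a UTC clock.
    m = SSHD_FAIL.match(line)
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return {"timestamp": ts.strftime(ISO), "source": "linux_auth",
            "user": m["user"], "src_ip": m["ip"], "action": "login",
            "outcome": "failure", "host": m["host"]}

def normalize_windows(line):
    # Parse key=value pairs, then map vendor names onto the common schema.
    kv = dict(pair.split("=", 1) for pair in line.split())
    if kv["EventID"] != "4625":
        raise ValueError("this sketch only handles failed logons (4625)")
    # host follows the chapter's simplified mapping from WorkstationName.
    return {"timestamp": kv["TimeCreated"], "source": "win_security",
            "user": kv["TargetUserName"], "src_ip": kv["IpAddress"],
            "action": "login", "outcome": "failure",
            "host": kv["WorkstationName"]}

def normalize_firewall(line, host):
    # The device name is not in the raw line; the collector supplies it.
    m = FIREWALL.match(line)
    ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
    return {"timestamp": ts.strftime(ISO), "source": "firewall", "user": None,
            "src_ip": m["ip"], "action": "connection",
            "outcome": m["verdict"].lower(), "host": host}

events = [
    normalize_sshd("May 14 09:14:07 web01 sshd[20881]: Failed password for "
                   "jchen from 198.51.100.23 port 52344 ssh2"),
    normalize_windows("EventID=4625 TimeCreated=2025-05-14T09:14:07Z "
                      "TargetUserName=jchen IpAddress=198.51.100.23 LogonType=10 "
                      "Status=0xC000006A WorkstationName=DC01"),
    normalize_firewall("2025-05-14 09:14:07 DENY TCP 198.51.100.23:52344 -> "
                       "10.20.4.7:22 rule=42 iface=outside", host="fw-edge"),
]

# One question now reaches all three sources.
hits = [e for e in events if e["src_ip"] == "198.51.100.23"]
print(len(hits))  # 3
```

Each source gets its own small parser, but every parser returns the same seven keys. That separation is the design point: source-specific knowledge stays at the edge, and everything downstream only ever sees the common schema.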

🔗 Connection: This is the SIEM consuming the network-visibility layer you built in Chapter 10. The Zeek logs, NetFlow records, and firewall logs from that chapter are log sources here; summarize_flows and beacon_score produced the kind of network telemetry that, normalized into this schema, becomes one input among many that the SIEM correlates with identity and endpoint events. Chapter 10 gave you eyes on the wire; this chapter gives those eyes a brain that also sees identity, endpoints, and the cloud.

🔄 Check Your Understanding:

  1. Distinguish parsing from normalization in one sentence each.
  2. Why is collecting from identity and endpoint sources a higher priority than collecting every server's verbose debug log?
  3. After normalization, why can a single correlation rule work across Linux and Windows authentication events when it could not before?

Answers

  1. Parsing extracts the meaningful fields out of a raw log message; normalization maps those fields onto a single common schema (consistent names and formats) shared across all sources.
  2. Identity and endpoint logs catch the techniques attackers most reliably use (credential abuse, privilege escalation, code execution) across the whole kill chain, so they have far higher detection value per byte than verbose debug logs, which are voluminous and rarely security-relevant.
  3. Because after normalization both sources express the same concept with the same field names and formats (e.g., user, src_ip, outcome), so one rule referencing those fields matches events regardless of which system produced them.

21.3 Correlation rules and detection use cases

We now arrive at the reason a SIEM is more than an expensive search box. A search box answers questions you already know to ask. A SIEM, properly configured, watches for trouble on its own and tells you when it sees it — and the mechanism is correlation.

A correlation rule is a piece of logic that examines events — often from multiple sources, often across a window of time — and fires an alert when a defined pattern occurs. The single most important idea here is the one from this chapter's opening: an attack is frequently invisible in any one event and obvious only when several events are seen together. Correlation is how you see them together, automatically, at machine speed, across millions of events a day.

A use case (in detection, sometimes called a detection use case) is the higher-level thing a correlation rule serves: a specific, named threat scenario you have decided to detect, together with the logic, the data sources it needs, the alert it produces, and the response it triggers. "Detect brute-force attacks against the VPN" is a use case; the correlation rule is its implementation. Mature SOCs think in use cases, not rules, because a use case forces you to ask the questions that make a detection actually useful: What attacker behavior am I trying to catch? Which log sources reveal it? What is the false-positive risk? What does an analyst do when this fires? A rule without a use case is a tripwire with no one assigned to answer it.

A taxonomy of correlation, from simple to powerful

Correlation rules range from trivial to sophisticated. It helps to see the ladder:

  1. Single-event match (atomic). The simplest "correlation" is no correlation at all: a single event is inherently bad. A Windows Security log clear (Event ID 1102). A login from a country the company does not operate in. These are easy and often high-fidelity, but limited — attackers mostly avoid doing single obviously-bad things.
  2. Thresholding (one source, count over time). More than 10 failed logins for one account in 5 minutes — a brute-force pattern. More than 50 failed logins from one source IP across many accounts in 5 minutes — password spraying. The power here is counting: one failed login is nothing; fifty in a minute is an attack.
  3. Sequence / temporal (ordered events across sources). A failed-login burst against an account, followed within 10 minutes by a success for that account — a brute-force that worked. The order and the time window carry the meaning. This is the kind of correlation that caught nothing at Meridian in our opening because nobody had written it.
  4. Cross-source correlation (joining different systems). An IDS alert for an exploit attempt against a host, followed by that host making an outbound connection to a new external address — exploitation followed by command-and-control. The detection spans the network sensor and the firewall, which is exactly why both had to be normalized into one schema.
  5. Stateful / behavioral baselining. This service account has never logged in interactively before, and just did — a deviation from a learned baseline. This shades into the user-and-entity behavior analytics and machine-learning techniques we develop in Chapter 34; classic SIEMs approximate it with "first seen" logic and lookups.
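The thresholding rung of this ladder is simple enough to sketch in Python. The function below is a hypothetical helper, not a real SIEM API: it counts failed logins per key inside a sliding time window and alerts the first time the count exceeds a limit. The sample data (51 failures from one address across 51 invented account names) is illustrative.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone

def threshold_alerts(events, key, limit, window):
    """Alert once per key value when failures inside `window` exceed `limit`."""
    recent = defaultdict(deque)   # key value -> failure timestamps still in window
    alerted = set()
    alerts = []
    for e in sorted(events, key=lambda ev: ev["timestamp"]):
        if e["action"] != "login" or e["outcome"] != "failure":
            continue
        q = recent[e[key]]
        q.append(e["timestamp"])
        # Drop failures that have aged out of the sliding window.
        while q and e["timestamp"] - q[0] > window:
            q.popleft()
        if len(q) > limit and e[key] not in alerted:
            alerted.add(e[key])
            alerts.append((e[key], e["timestamp"], len(q)))
    return alerts

start = datetime(2025, 5, 14, 2, 0, 0, tzinfo=timezone.utc)
# Password spray: one source, a different account every two seconds.
events = [{"timestamp": start + timedelta(seconds=2 * i), "user": f"user{i:03d}",
           "src_ip": "203.0.113.77", "action": "login", "outcome": "failure"}
          for i in range(51)]
# Ordinary noise: one user mistyping a password three times.
events += [{"timestamp": start + timedelta(seconds=30 * i), "user": "jchen",
            "src_ip": "198.51.100.23", "action": "login", "outcome": "failure"}
           for i in range(3)]

spray = threshold_alerts(events, key="src_ip", limit=50, window=timedelta(minutes=5))
brute = threshold_alerts(events, key="user", limit=10, window=timedelta(minutes=5))
print(spray)  # one alert for 203.0.113.77, raised on the 51st failure
print(brute)  # []
```

The same function serves both patterns from the list: keyed on `user` it is a brute-force counter, and keyed on `src_ip` it is a spray counter. Note that the spray is invisible to the per-user rule, because each account only fails once.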

💡 Intuition: Why is correlation so much more powerful than single-event alerting? Because attacks are processes, not moments. The cyber kill chain (Chapter 2) is a sequence of stages, and each individual stage often looks like legitimate activity — a login, a query, a connection. It is the sequence and the combination that betray the attacker. Single-event detection forces you to find a moment that is unambiguously malicious, and skilled attackers specialize in making each moment look ordinary. Correlation lets you detect the shape of the whole attack even when no single piece is damning. This is why "logs are the ground truth" and "correlation" are inseparable: the ground truth is a sequence, and correlation reads sequences.

Worked example: brute force that succeeded

Let us make this concrete with the most classic detection use case there is, end to end. The threat: an attacker guesses or sprays passwords against Meridian's VPN until one works. In single events, this is a stream of ordinary failed and successful logins. As a sequence, it is unmistakable.

The use case, stated properly:

Use case:        VPN credential brute-force resulting in successful access
ATT&CK:          T1110 (Brute Force) -> T1078 (Valid Accounts)
Log sources:     VPN authentication (normalized: action=login)
Trigger logic:   >= 10 outcome=failure for one user within 5 min,
                 FOLLOWED BY >= 1 outcome=success for that same user within 10 min
Severity:        High (a successful unauthorized login is likely)
Response:        Disable account, force reset, hunt for post-login activity
False positives: A user fat-fingering a new password then succeeding (tune by
                 raising the failure threshold and checking source-IP novelty)

Normalized events the rule sees (illustrative; one account, one attacker source in the documentation range):

2025-05-14T02:01:10Z source=vpn user=mreyes src_ip=203.0.113.77 action=login outcome=failure
2025-05-14T02:01:14Z source=vpn user=mreyes src_ip=203.0.113.77 action=login outcome=failure
2025-05-14T02:01:19Z source=vpn user=mreyes src_ip=203.0.113.77 action=login outcome=failure
   ... (11 failures total in ~80 seconds) ...
2025-05-14T02:02:31Z source=vpn user=mreyes src_ip=203.0.113.77 action=login outcome=failure
2025-05-14T02:03:05Z source=vpn user=mreyes src_ip=203.0.113.77 action=login outcome=success

Eleven failures in about eighty seconds, then a success, for an account whose owner was asleep, from an IP the bank has never seen. Each line is mundane. The pattern is an alarm.

The detection as a SQL-style correlation query. SIEM query languages differ, but the logic is the same everywhere; here it is in portable SQL against a normalized events table, which makes the correlation explicit:

-- Brute force followed by success, per user, within a 10-minute window.
WITH failures AS (
  SELECT user, COUNT(*) AS fail_count, MIN(timestamp) AS first_fail
  FROM events
  WHERE source = 'vpn' AND action = 'login' AND outcome = 'failure'
    AND timestamp >= NOW() - INTERVAL '15' MINUTE
  GROUP BY user
  HAVING COUNT(*) >= 10
)
SELECT s.user, f.fail_count, s.timestamp AS success_time, s.src_ip
FROM failures f
JOIN events s
  ON s.user = f.user
 AND s.source = 'vpn' AND s.action = 'login' AND s.outcome = 'success'
 AND s.timestamp BETWEEN f.first_fail AND f.first_fail + INTERVAL '10' MINUTE;
-- Returns: mreyes, 11, 2025-05-14T02:03:05Z, 203.0.113.77  -> raise High alert

Read it as the use case in code: first find accounts with ten or more failures in the recent window (the brute-force burst), then join to find a success for that same account inside ten minutes of the burst's start (the attack working). A row in the result is an alert. We will write the same logic in SPL and KQL in §21.4 so you can see how the dialects compare; the correlate() function you build in the Project Checkpoint is a tiny version of exactly this idea.
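For readers who think more easily in a general-purpose language, the same use case can be sketched as a streaming check in Python. This is an illustration under stated assumptions, not the Project Checkpoint's correlate(): the function name is hypothetical, the timestamps between the listed ones are invented, and the ten-minute follow window is measured from the failure that completed the burst rather than from the burst's start as in the SQL.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone

def brute_force_then_success(events, min_failures=10,
                             burst_window=timedelta(minutes=5),
                             follow_window=timedelta(minutes=10)):
    """Alert when a VPN success follows a burst of failures for the same user."""
    failures = defaultdict(deque)  # user -> failure times inside the burst window
    burst_at = {}                  # user -> time of the latest burst-completing failure
    alerts = []
    for e in sorted(events, key=lambda ev: ev["timestamp"]):
        if e["source"] != "vpn" or e["action"] != "login":
            continue
        user, ts = e["user"], e["timestamp"]
        if e["outcome"] == "failure":
            q = failures[user]
            q.append(ts)
            while q and ts - q[0] > burst_window:
                q.popleft()
            if len(q) >= min_failures:
                burst_at[user] = ts
        elif e["outcome"] == "success" and user in burst_at:
            if ts - burst_at[user] <= follow_window:
                alerts.append({"user": user, "fail_count": len(failures[user]),
                               "success_time": ts, "src_ip": e["src_ip"]})
            # Reset state so one burst produces at most one alert.
            del burst_at[user]
            failures[user].clear()
    return alerts

def vpn(ts, user, ip, outcome):
    return {"timestamp": ts, "source": "vpn", "user": user, "src_ip": ip,
            "action": "login", "outcome": outcome}

t0 = datetime(2025, 5, 14, 2, 1, 10, tzinfo=timezone.utc)
# Eleven failures; the first three and the last match the listing, the rest are invented.
offsets = [0, 4, 9, 17, 25, 33, 41, 49, 57, 69, 81]
events = [vpn(t0 + timedelta(seconds=s), "mreyes", "203.0.113.77", "failure")
          for s in offsets]
events.append(vpn(datetime(2025, 5, 14, 2, 3, 5, tzinfo=timezone.utc),
                  "mreyes", "203.0.113.77", "success"))
# Benign: two typos, then a correct password.
events += [vpn(t0 + timedelta(seconds=s), "jchen", "198.51.100.23", o)
           for s, o in [(5, "failure"), (12, "failure"), (20, "success")]]

alerts = brute_force_then_success(events)
print([(a["user"], a["fail_count"]) for a in alerts])  # [('mreyes', 11)]
```

Processing events in time order makes the ordering requirement explicit: a success only counts if it arrives after the burst has been seen. The benign user never reaches the failure threshold, so that success passes silently.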

📟 War Story: Constructed, representative. A mid-size firm had bought a SIEM and proudly "had detections." But every detection was a single-event rule copied from the vendor's defaults, and the one that should have mattered — "successful login after many failures" — had been disabled months earlier because it was "too noisy" (it fired every time someone mistyped a new password and then got it right). When an attacker sprayed their VPN and got in, the SIEM saw all eleven failures and the success, and said nothing, because the rule that read the sequence was off and the rules that were on read only single events. The data was perfect. The correlation was missing. The fix was not a better tool; it was re-enabling the sequence rule and tuning it (raise the threshold, require a never-before-seen source IP) instead of deleting it — which is the entire subject of §21.5.

Detection-as-code: rules you can review, test, and version

There is a better way to manage detections than clicking them together in a vendor console where they live as un-versioned configuration nobody can review. Detection-as-code is the practice of writing, storing, reviewing, testing, and deploying detection rules the way software engineers manage code: as text files in version control, peer-reviewed before they go live, tested against sample data, and deployed through a pipeline. A correlation rule is logic; logic is code; code belongs in git.

The payoff is large. When detections live as code, you can see who changed a rule and why; you can review a new rule before it floods the queue; you can write a test that feeds the rule a known-malicious sample and a known-benign sample and confirms it catches one and ignores the other; and you can share rules across teams and even across organizations. The open Sigma format — a vendor-neutral YAML way to write detection rules that then compile to Splunk, Elastic, Microsoft Sentinel, and others — is the lingua franca of this movement, and you will meet it properly when we build a detection-engineering practice in Chapter 22. For now, hold the principle: a detection you cannot review, test, and version is a detection you cannot trust at scale.
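The "known-malicious sample and known-benign sample" test mentioned above can be sketched in Python. The rule format here is a made-up dictionary for illustration only; it is not Sigma syntax, and the rule id, field names, and account names are assumptions.

```python
# A rule stored as data, so a change shows up as a readable diff in review.
# Hypothetical format for illustration; this is not Sigma.
RULE = {
    "id": "win-security-log-cleared",
    "title": "Windows Security log cleared",
    "severity": "high",
    "match": {"source": "win_security", "event_id": 1102},
}

def rule_matches(rule, event):
    """True when every field in the rule's match block equals the event's value."""
    return all(event.get(field) == value
               for field, value in rule["match"].items())

# Sample events that would live next to the rule in version control.
MALICIOUS = {"source": "win_security", "event_id": 1102,
             "host": "DC01", "user": "svc-backup"}
BENIGN = {"source": "win_security", "event_id": 4624,   # ordinary successful logon
          "host": "DC01", "user": "jchen"}

def test_rule_fires_on_log_clear():
    assert rule_matches(RULE, MALICIOUS)

def test_rule_ignores_ordinary_logon():
    assert not rule_matches(RULE, BENIGN)

if __name__ == "__main__":
    test_rule_fires_on_log_clear()
    test_rule_ignores_ordinary_logon()
    print("rule tests passed")
```

A pipeline would run tests like these on every proposed change, so a rule that stops catching its malicious sample, or starts firing on its benign one, is rejected before it reaches the analysts' queue.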

🔄 Check Your Understanding:

  1. Why does a sequence correlation rule (failures then a success) catch an attack that single-event rules on "failed login" and "successful login" each miss?
  2. Distinguish a correlation rule from a use case. Which one forces you to define the analyst's response and the false-positive risk?
  3. State one concrete benefit of managing detections as code (in version control) rather than as clicks in a vendor console.

Answers

  1. Individually, a failed login and a successful login are both ordinary and constant; neither alone indicates an attack. The sequence — many failures immediately followed by a success for the same account — is the signature of a brute-force that worked, and only a rule that reads the ordered combination across time sees it. 2. A correlation rule is the logic that fires on a pattern; a use case is the named threat scenario it serves, including the data sources, severity, false-positive risk, and analyst response. The use case forces you to define response and false-positive handling. 3. Any of: changes are reviewable and attributable (who changed what, why); rules can be tested against known samples before deployment; rules are versioned and can be rolled back; rules can be shared/reused (e.g., via Sigma).

21.4 Querying: SQL, SPL, and KQL

Writing correlation rules is one half of the SOC's craft; the other half is querying — asking the SIEM questions by hand to investigate an alert, scope an incident, or hunt for something a rule did not catch. When a correlation rule fires, an analyst's first move is almost always a query: show me everything that account did in the last 24 hours; show me every host that talked to that IP; show me whether this happened anywhere else. Fluency in your SIEM's query language is the single most-used skill on a SOC shift, so we will look at the three dialects you are most likely to meet and write the same investigation in each.

The three are: SQL (Structured Query Language), the database language some SIEMs and most data lakes use directly; SPL (Search Processing Language), the language of Splunk, built around a left-to-right pipeline of commands separated by |; and KQL (Kusto Query Language), the language of Microsoft Sentinel and Microsoft 365 Defender, also pipeline-based. They look different but express the same operations: filter events, then transform, aggregate, and sort them.

The investigation: an alert has fired on source IP 203.0.113.77. You want every account that IP tried to log in as over the last hour, counted and sorted most-attempted first. This is the bread-and-butter triage query. Here is the same question in all three languages, against the normalized schema from §21.2.

In SQL:

SELECT user, COUNT(*) AS attempts
FROM events
WHERE action = 'login'
  AND src_ip = '203.0.113.77'
  AND timestamp >= NOW() - INTERVAL '1' HOUR
GROUP BY user
ORDER BY attempts DESC;

In SPL (Splunk) — note the pipeline; each | passes results to the next command:

index=auth action=login src_ip="203.0.113.77" earliest=-1h
| stats count AS attempts by user
| sort - attempts

In KQL (Microsoft Sentinel) — also a pipeline, with summarize doing the aggregation:

Events
| where action == "login" and src_ip == "203.0.113.77"
| where timestamp >= ago(1h)
| summarize attempts = count() by user
| sort by attempts desc

Set them side by side and the structure is identical: a filter on action, source, and time; a group-and-count; a descending sort. SQL puts the aggregation up front (SELECT ... GROUP BY); SPL and KQL read top-to-bottom as a pipeline (filter | aggregate | sort). Learn the shape of an investigation — filter, aggregate, sort, sometimes join — and you can move between SIEMs by learning each one's spelling for those same verbs. The expected result of all three is one small table:

user      attempts
mreyes    11
dokafor    1
jchen      1

— which immediately tells the analyst that mreyes was the spray's main target (the eleven attempts), confirming the correlation alert and pointing to the next query: did mreyes then succeed, and what did the account do after?
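The same filter, aggregate, sort shape can be sketched in plain Python over a list of normalized events, which is a useful way to check your understanding when no SIEM is at hand. The function name top_accounts, the fixed clock time, and the sample events are assumptions made up for this example; only the field names come from the normalized schema.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def top_accounts(events, src_ip, now, window=timedelta(hours=1)):
    """Filter by time, action, and source; count per user; sort descending."""
    cutoff = now - window
    counts = Counter(
        e["user"]
        for e in events
        if e["timestamp"] >= cutoff          # time bound first
        and e["action"] == "login"
        and e["src_ip"] == src_ip
    )
    return counts.most_common()              # list of (user, attempts), highest first


# Hypothetical sample data; the clock time is made up for the example.
now = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
recent = now - timedelta(minutes=30)
events = (
    [{"timestamp": recent, "action": "login", "user": "mreyes", "src_ip": "203.0.113.77"}] * 11
    + [{"timestamp": recent, "action": "login", "user": "dokafor", "src_ip": "203.0.113.77"}]
    + [{"timestamp": recent, "action": "login", "user": "jchen", "src_ip": "203.0.113.77"}]
    # Outside the one-hour window: must be ignored.
    + [{"timestamp": now - timedelta(hours=5), "action": "login", "user": "mreyes", "src_ip": "203.0.113.77"}]
    # Different source IP: must be ignored.
    + [{"timestamp": recent, "action": "login", "user": "mreyes", "src_ip": "198.51.100.4"}]
)

for user, attempts in top_accounts(events, "203.0.113.77", now):
    print(f"{user:<10}{attempts}")
```

The generator expression plays the role of WHERE, Counter plays GROUP BY with COUNT, and most_common plays ORDER BY ... DESC.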

🧩 Try It in the Lab: Stand up a free SIEM in your home lab — a single-node Elastic stack, Splunk Free, or Microsoft Sentinel on a trial tenant — and ingest a few days of your own machine's logs (your firewall, your SSH auth.log, your DNS). Then practice the same three investigations in whatever query language it speaks: (1) count events by source IP over the last day; (2) find the top processes or services that generated events; (3) write one threshold rule (e.g., "more than N failed logins from one source in 10 minutes"). You will learn more from one afternoon of querying real logs than from any amount of reading. Only ingest logs from systems you own.

⚠️ Common Pitfall: Querying without a time bound. The first filter in every SIEM query should constrain time. A query with no time window scans the entire dataset (potentially terabytes), which is slow, expensive, and on a busy SIEM can degrade the service for everyone. Notice that all three queries above carry a time constraint (>= NOW() - INTERVAL '1' HOUR, earliest=-1h, ago(1h)), even though each dialect writes it in a different position. Make "what time range?" the first question you answer, before "what am I looking for?"

🔄 Check Your Understanding: 1. SPL and KQL are described as "pipeline" languages. What does that mean, and how does the structure differ from SQL? 2. Why should a time bound be the first filter in any SIEM query? 3. In the side-by-side investigation, what does the result table tell the analyst to query next?

Answers

  1. A pipeline language reads top-to-bottom, with each command's output feeding the next via | (filter, then aggregate, then sort). SQL expresses the same operations but leads with the projection/aggregation (SELECT ... GROUP BY ... ORDER BY) rather than a left-to-right flow. 2. Without a time bound, the query scans the entire dataset (potentially terabytes), which is slow, costly, and can degrade the SIEM for other users; constraining time first makes the query tractable. 3. That mreyes had 11 attempts (the spray target), so the analyst should next check whether mreyes then succeeded and what the account did afterward — pivoting from the brute force to its consequences.

21.5 Taming alert fatigue

We now confront the disease that kills more SIEMs than any attacker. You can collect every log, normalize it perfectly, and write a hundred correlation rules — and still fail completely, because your analysts have stopped trusting the alerts. This is alert fatigue: the desensitization and degraded performance that occur when analysts face a volume of alerts — especially false alarms — too high to investigate meaningfully. It is not a personal failing of tired analysts; it is a predictable, measurable systems failure, and it is the most common reason a real attack gets missed by a SOC that "had the alert."

The root of it is the false positive: an alert that fires when there is no actual malicious activity — the detection said "attack" and there was none. (Its evil twin, the false negative, is an attack that occurs and produces no alert — the detection that was missing in our opening; we develop the false-negative side in Chapter 22.) Every false positive costs an analyst real time to investigate and dismiss, and worse, each one erodes trust. A queue that is 95% false positives trains analysts, correctly and rationally, to assume the next alert is also noise. The day the real one arrives, it looks exactly like the ninety-nine cries of wolf before it.

The arithmetic is brutal and worth internalizing. Suppose a SOC of five analysts can each properly investigate, say, twenty alerts in a shift. If your rules generate eight hundred alerts a day and 97% are false positives, you have roughly twenty-four true positives buried in seven hundred and seventy-six false ones, and a team that can only touch a hundred alerts total. The true positives are not missed because anyone is lazy; they are missed because they are statistically un-findable in that much noise. The asymmetry cuts here too (Theme 2): the attacker needs you to ignore one alert; you need to investigate them all. A flood of false positives hands the attacker that one ignored alert for free.
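The numbers in that paragraph are easy to reproduce. The last line goes one step beyond the text and rests on an added assumption: that analysts pick which alerts to work essentially at random, because nothing in the queue distinguishes real from false.

```python
analysts = 5
alerts_per_analyst = 20          # alerts one analyst can properly investigate per shift
daily_alerts = 800
false_positive_rate = 0.97

capacity = analysts * alerts_per_analyst                      # alerts the team can touch
false_positives = round(daily_alerts * false_positive_rate)
true_positives = daily_alerts - false_positives
never_looked_at = daily_alerts - capacity

# Added assumption: triage order is effectively random, so the share of
# true positives seen equals the share of the queue that gets investigated.
expected_true_positives_seen = true_positives * capacity / daily_alerts

print(capacity, false_positives, true_positives, never_looked_at)
print(expected_true_positives_seen)
```

Under that assumption the team sees about three of the twenty-four real events a day, and the other twenty-one are never opened.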

🚪 Threshold Concept: A detection's value is not how many attacks it can catch — it is how many attacks it catches per false positive it generates. A rule that catches every real attack but fires two hundred false alarms a day is worse than useless: it will be disabled, or worse, left on to bury the alerts that matter. Fidelity, not coverage, is the currency of a SOC. This reframes detection engineering entirely: you are not trying to alert on everything suspicious; you are trying to alert on things that are worth a human's time, and ruthlessly suppressing the rest. Every senior detection engineer has internalized that a quiet, trustworthy queue is a feature, not a sign you are missing things.

How to tame the queue

Reducing alert fatigue is a craft with a handful of reliable techniques, applied in roughly this order:

  1. Tune thresholds and conditions. The crudest false-positive source is a threshold set too low. "5 failed logins" fires on every forgotten password; "20 failed logins in 2 minutes from one source against multiple accounts" fires on spraying and almost nothing else. Add conditions that exclude the benign case: require the source IP to be one not seen before for that account, exclude known scanners and service accounts, require the success-after-failure to come from a different IP than the user's normal one.
  2. Allowlist the known-benign. Some sources are noisy and legitimate: a vulnerability scanner that the security team itself runs will look exactly like an attacker probing every host. Maintain allowlists (the scanner's IP, the backup service account, the monitoring system) so their expected behavior does not page anyone. Do this carefully — an allowlist is a hole in your detection — and review it.
  3. Aggregate and deduplicate. A hundred identical alerts about the same event should be one alert that says "100 times." Group related alerts into a single incident rather than a hundred tickets. A burst of failed logins from one source is one event to investigate, not forty.
  4. Risk-score and prioritize, don't binary-alert. Instead of every rule producing an equal page, have detections contribute to a score for an entity (this user did three medium-suspicious things; that is now worth looking at) and surface the highest-scoring entities. This risk-based alerting approach — the lineage of the analytics in Chapter 34 — turns a flood of weak signals into a ranked, manageable few.
  5. Suppress and schedule. Known maintenance windows, expected batch jobs, and deployment activity generate predictable noise; suppress alerts during them rather than investigating the same false positive every night at 2 a.m.
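Techniques 3 and 4 in the list above are the most mechanical, so they are the easiest to sketch. The following is a minimal illustration, not a description of how any particular SIEM implements them; the rule names, the per-rule weights, the score threshold, and the svc_backup account are all hypothetical.

```python
from collections import Counter, defaultdict


def aggregate(alerts):
    """Collapse identical (rule, entity) alerts into one record with a count."""
    counts = Counter((a["rule"], a["entity"]) for a in alerts)
    return [{"rule": r, "entity": e, "count": n} for (r, e), n in counts.items()]


def risk_scores(alerts, weights, threshold):
    """Sum rule weights per entity (each rule counted once per entity) and
    return the entities at or above `threshold`, highest score first."""
    rules_seen = defaultdict(set)
    for a in alerts:
        rules_seen[a["entity"]].add(a["rule"])
    scores = {ent: sum(weights.get(r, 0) for r in rules)
              for ent, rules in rules_seen.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(ent, score) for ent, score in ranked if score >= threshold]


# Hypothetical raw alert stream: one noisy burst, plus three medium signals
# that all point at the same service account.
alerts = (
    [{"rule": "failed_login_burst", "entity": "203.0.113.77"}] * 40
    + [{"rule": "interactive_logon", "entity": "svc_backup"},
       {"rule": "group_enumeration", "entity": "svc_backup"},
       {"rule": "priv_group_add", "entity": "svc_backup"}]
)
weights = {"failed_login_burst": 20, "interactive_logon": 30,
           "group_enumeration": 30, "priv_group_add": 30}

print(aggregate(alerts)[0])
print(risk_scores(alerts, weights, threshold=80))
```

Forty identical alerts become one record with a count of forty, and the service account surfaces at the top of the ranked list because three medium signals together cross a threshold that none of them crosses alone.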

The throughline is that tuning is not a one-time setup; it is continuous operations. Environments change, new applications appear, and a rule that was clean last quarter starts firing on a new legitimate behavior. A healthy SOC reviews its noisiest rules every week and asks of each false positive: can I add a condition that excludes this benign case without blinding myself to the real one? That question — narrow the rule to exclude the benign without losing the malicious — is the central skill of detection tuning, and it is the difference between a SIEM that defends and a SIEM that decorates.

⚠️ Common Pitfall: "Fixing" alert fatigue by disabling the noisy rule. This is the trap from the war story in §21.3, and it is everywhere. A rule fires too often, so someone turns it off — and now you have a false negative by choice: the attack that rule would have caught will sail through silently. Deleting a noisy detection trades a visible problem (too many alerts) for an invisible one (a blind spot), which is strictly worse because you will not know it is there until it is exploited. The discipline is to tune noisy rules — narrow them — not delete them. If a rule truly cannot be tuned to acceptable fidelity, that is a documented risk decision, not a quiet click.
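What "narrow it, do not delete it" can look like in practice is sketched below for the failed-login rule. The figures of 20 failures and 2 minutes come from technique 1 above; the minimum of five distinct accounts, the allowlisted scanner address, and the sample traffic are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

ALLOWLIST = {"198.51.100.10"}   # hypothetical: the security team's own scanner


def spray_sources(events, min_failures=20, min_accounts=5,
                  window=timedelta(minutes=2), allowlist=ALLOWLIST):
    """Return source IPs with >= min_failures failed logins against
    >= min_accounts distinct users inside any `window`, skipping the allowlist."""
    by_src = {}
    for e in sorted(events, key=lambda ev: ev["timestamp"]):
        if e["outcome"] == "failure" and e["src_ip"] not in allowlist:
            by_src.setdefault(e["src_ip"], []).append(e)

    flagged = []
    for src, fails in by_src.items():
        start = 0
        for end in range(len(fails)):
            # Slide the window start forward until it spans at most `window`.
            while fails[end]["timestamp"] - fails[start]["timestamp"] > window:
                start += 1
            in_window = fails[start:end + 1]
            users = {f["user"] for f in in_window}
            if len(in_window) >= min_failures and len(users) >= min_accounts:
                flagged.append(src)
                break
    return flagged


def failures(src_ip, users, count, start, step_seconds=5):
    """Build `count` failed-login events from one source, cycling through `users`."""
    return [{"timestamp": start + timedelta(seconds=step_seconds * i),
             "user": users[i % len(users)], "src_ip": src_ip, "outcome": "failure"}
            for i in range(count)]


start = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)   # made-up time
many_users = [f"user{i}" for i in range(10)]
events = (
    failures("192.0.2.5", ["jchen"], 3, start)            # a mistyped new password
    + failures("203.0.113.77", many_users, 20, start)     # a spray
    + failures("198.51.100.10", many_users, 40, start)    # the allowlisted scanner
)
print(spray_sources(events))
```

The mistyped password no longer fires, the scanner no longer pages anyone, and the spray still does. The rule got quieter without going blind.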

🔄 Check Your Understanding: 1. Explain why a SIEM with excellent coverage (rules for every threat) can still completely fail to detect a real attack. 2. A brute-force rule fires forty times a day, almost always on users who mistyped a new password. Name two tuning changes that would cut the false positives without blinding you to a real spray. 3. Why is disabling a noisy rule usually worse than living with its noise?

Answers

  1. Because excellent coverage with poor fidelity produces a flood of false positives; analysts, swamped and desensitized, cannot find the true positives in the noise, so a real attack's alert is statistically un-findable even though the rule "caught" it. Fidelity, not coverage, determines whether attacks are actually seen. 2. Any two of: raise the failure threshold and shorten the window (e.g., 20 in 2 minutes); require the failures to span multiple accounts from one source (spraying) or require a never-before-seen source IP; require the subsequent success to come from a different IP than the user's normal one; allowlist known service accounts. 3. Disabling it creates a false negative by choice — a permanent blind spot you will not notice until it is exploited — whereas the noise is at least visible; tuning narrows the rule without creating the blind spot.

21.6 Dashboards, metrics, and the bigger picture (SIEM vs. data lake vs. SOAR)

A SIEM that only pages analysts is doing half its job. The other half is visibility for humans: showing the state of detection and operations at a glance, measuring whether the SOC is improving, and connecting to the systems that respond. This section zooms out from individual rules to the operational picture, and clarifies three terms that are constantly confused — SIEM, data lake, and SOAR — because choosing among and combining them is a real architecture decision.

Dashboards and the metrics they show

A dashboard is a visual, at-a-glance display of metrics and events, built from SIEM queries that run continuously and render as charts, counts, and tables. Dashboards serve different audiences at different altitudes, and conflating them is a classic mistake:

  • An operational dashboard is for the SOC: open alerts by severity, the oldest un-triaged alert, alert volume over time, the noisiest rules this week, log-source health (is any critical source not sending data right now?). It answers "what needs attention this minute?"
  • An executive dashboard is for leadership: trends, not events. Mean time to detect and respond, alert volume and false-positive rate over months, detection coverage against MITRE ATT&CK, the status of the program. It answers "is our security operation getting better, and where are the gaps?"

The metrics that matter most — mean time to detect (MTTD), mean time to respond (MTTR), detection coverage, false-positive rate — are the bridge from operations to the boardroom, and they are developed fully in Chapter 36, which turns them into the story a CISO tells leadership. For now, note the crucial connection: the SIEM is where these metrics are born. You cannot measure mean time to detect if you do not have a timestamped record of when the activity happened and when the alert fired — which is to say, you cannot measure your security operation without the very logs this chapter is about. Logging is not only how you detect; it is how you know whether your detection is working.
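As a small illustration of why those timestamps matter, here is one way the two headline metrics might be computed from incident records. The definitions used (detection delay measured from activity to alert, response time measured from alert to resolution) are one common convention and an assumption here; organizations define the endpoints differently, and the two incidents are invented.

```python
from datetime import datetime, timezone
from statistics import mean


def minutes_between(earlier, later):
    """Elapsed minutes between two timezone-aware datetimes."""
    return (later - earlier).total_seconds() / 60


def mttd_mttr(incidents):
    """Return (MTTD, MTTR) in minutes.

    MTTD here = mean(detected - occurred); MTTR here = mean(resolved - detected).
    """
    mttd = mean([minutes_between(i["occurred"], i["detected"]) for i in incidents])
    mttr = mean([minutes_between(i["detected"], i["resolved"]) for i in incidents])
    return mttd, mttr


def utc(hour, minute):
    """Made-up UTC timestamp on an arbitrary example day."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


incidents = [
    {"occurred": utc(2, 0),  "detected": utc(2, 30),  "resolved": utc(4, 30)},
    {"occurred": utc(10, 0), "detected": utc(11, 30), "resolved": utc(12, 30)},
]
print(mttd_mttr(incidents))
```

Every input to the calculation is a logged timestamp. Remove the log that records when the activity occurred and the first metric cannot be computed at all.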

🔗 Connection: The dashboards and metrics here are the raw material for Chapter 36 (security metrics and board reporting). MTTD and MTTR — how fast you detect and respond — are computed from SIEM and incident data; detection coverage is your ATT&CK-mapped use cases (Chapter 22) counted against the framework. This chapter builds the instrument; Chapter 36 reads the dials and tells the board what they mean. A SOC that cannot show these numbers cannot defend its budget.

SIEM vs. data lake vs. SOAR

These three appear together constantly and do different jobs. Getting them straight is worth a careful paragraph each.

A SIEM is optimized for detection and investigation: it ingests, normalizes, correlates, alerts, and lets analysts query — but at a cost (storage and, in commercial products, licensing by data volume) that makes keeping everything forever impractical. It is the real-time brain.

A data lake is a large, low-cost store that holds vast amounts of raw or lightly-processed data — security logs included — cheaply and for a long time, in a flexible schema you apply when you read (schema-on-read) rather than when you write. Its strength is volume and retention at low cost; its weakness is that it does not, by itself, correlate or alert in real time. The modern pattern is to send high-value, detection-relevant logs to the SIEM and a fuller, cheaper copy of everything to a data lake, so you have real-time detection on what matters and a long, affordable archive for hunting (Chapter 22), forensics (Chapter 25), and compliance retention. The line between the two is blurring — some platforms query a lake directly with SIEM-like tooling — but the trade-off endures: fast and curated versus cheap and comprehensive.
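Schema-on-read is easier to see than to describe. In this sketch a Python list stands in for cheap object storage, and the raw keys are the VPN and Windows field names used in this chapter's siem.py; the function names and sample records are assumptions for illustration.

```python
import json

lake = []   # stand-in for cheap object storage: raw lines, stored untouched


def write_raw(line):
    """Schema-on-read: no parsing and no schema are applied at write time."""
    lake.append(line)


def read_logins(candidate_keys):
    """Apply a schema only now, at read time.

    `candidate_keys` maps each canonical field to the raw keys that may hold it.
    """
    rows = []
    for line in lake:
        record = json.loads(line)
        row = {}
        for field, keys in candidate_keys.items():
            # Take the first raw key present in this record, else None.
            row[field] = next((record[k] for k in keys if k in record), None)
        rows.append(row)
    return rows


write_raw('{"u": "mreyes", "ip": "203.0.113.77", "res": "fail"}')
write_raw('{"TargetUserName": "dokafor", "IpAddress": "203.0.113.77", "Status": "0x0"}')

schema = {"user": ["u", "TargetUserName"], "src_ip": ["ip", "IpAddress"]}
print(read_logins(schema))
```

Writing is trivially cheap because nothing is interpreted, and a hunter can decide months later which fields matter. The price is that every read pays the parsing cost, which is why the lake suits hunting and forensics better than real-time alerting.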

A SOAR — Security Orchestration, Automation, and Response — is the platform that acts on what the SIEM detects. Where the SIEM raises an alert, a SOAR runs an automated or semi-automated playbook in response: enrich the alert (look up the IP's reputation, pull the user's recent activity), take an action (disable the account, isolate the host, block the IP at the firewall), and create the incident ticket — all without waiting for a human to do it by hand. SOAR is how a SOC scales response the way the SIEM scales detection; it directly attacks alert fatigue by automating the repetitive triage that consumes analysts. We introduce it here as the natural partner of the SIEM and return to its role in incident response (Chapter 24) and SOC operations.
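The enrich, act, ticket flow of a playbook can be sketched in a few lines. This is a toy under stated assumptions, not any SOAR product's API: lookup_reputation is a stub with a hard-coded list, the disable and ticket functions are fakes passed in as parameters, and the rule "contain only when the IP is known-bad and severity is high" is an illustrative policy choice.

```python
def lookup_reputation(ip, known_bad=frozenset({"203.0.113.77"})):
    """Stub enrichment; a real playbook would query a threat-intelligence service."""
    return "malicious" if ip in known_bad else "unknown"


def run_playbook(alert, disable_account, open_ticket):
    """Enrich, contain (only when confident), then ticket. Returns the action log."""
    actions = []
    reputation = lookup_reputation(alert["src_ip"])
    actions.append(f"enrich: {alert['src_ip']} reputation={reputation}")

    if reputation == "malicious" and alert["severity"] == "high":
        disable_account(alert["user"])
        actions.append(f"contain: disabled {alert['user']}")
    else:
        actions.append("contain: skipped, left for analyst review")

    ticket_id = open_ticket(alert, list(actions))
    actions.append(f"ticket: {ticket_id}")
    return actions


# Fake integrations so the sketch runs without touching any real system.
disabled = []
tickets = []


def fake_disable(user):
    disabled.append(user)


def fake_ticket(alert, actions):
    tickets.append((alert["name"], actions))
    return f"INC-{len(tickets)}"


alert = {"name": "vpn_brute_force", "user": "mreyes",
         "src_ip": "203.0.113.77", "severity": "high"}
log = run_playbook(alert, fake_disable, fake_ticket)
print(log)
```

Passing the integrations in as parameters keeps the playbook logic testable in isolation, and the explicit "skipped" branch reflects a common design choice: automate containment only where confidence is high, and hand everything else to a human with the enrichment already attached.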

🔗 Connection: SOAR sits at the boundary of this chapter and Chapter 24 (incident response). The SIEM detects; the SOAR orchestrates the response — running the containment playbook (disable account, isolate host) that IR defines. Think of the SIEM as the smoke detector and the SOAR as the sprinkler system wired to it: detection without automated response still depends entirely on a human being awake and available, which is exactly where alert fatigue does its damage. The two together are how a small team defends a large bank.

Here is the whole pipeline this chapter has assembled, end to end, as one picture:

        ┌──────────────────────────── LOG SOURCES ────────────────────────────┐
        │  Identity/AD   EDR/Endpoint   Cloud(CloudTrail)   Firewall/DNS/IDS   │
        │  Servers       Applications    SaaS/Email          (Ch.10 network)   │
        └───────┬───────────┬──────────────┬───────────────────┬──────────────┘
                │ agents     │ syslog        │ API pull           │ stream
                ▼            ▼               ▼                    ▼
        ┌─────────────────────────── COLLECT ───────────────────────────┐
        │   ingest pipeline (forwarders, collectors, message bus)       │
        └───────────────────────────────┬───────────────────────────────┘
                                         ▼
        ┌────────────────────── NORMALIZE / PARSE ──────────────────────┐
        │   raw events  ->  common schema  (timestamp,user,src_ip,      │
        │                                   action,outcome,host,...)     │
        └───────────────┬───────────────────────────────┬───────────────┘
                        ▼                                ▼
        ┌──────────── CORRELATE ────────────┐   ┌──────── DATA LAKE ────────┐
        │  rules / use cases:               │   │  cheap, long retention,    │
        │  threshold, sequence, cross-source│   │  schema-on-read, hunting,  │
        │  -> ALERT (scored, deduplicated)  │   │  forensics, compliance     │
        └───────────────┬───────────────────┘   └────────────────────────────┘
                        ▼
        ┌──────────── ALERT / TRIAGE ───────┐         ┌──────── SOAR ─────────┐
        │  analyst queue (tuned for         │ ──────▶ │  playbook: enrich,     │
        │  fidelity, not volume)            │         │  contain, ticket       │
        └───────────────┬───────────────────┘         └────────────────────────┘
                        ▼
        ┌──────────── DASHBOARD / METRICS ──────────────────────────────┐
        │  operational (open alerts, source health) + executive         │
        │  (MTTD/MTTR, coverage, FP rate)  -> Ch.36 board reporting       │
        └────────────────────────────────────────────────────────────────┘

Figure 21.1 — The SIEM pipeline, end to end: sources are collected, normalized into a common schema, then correlated into scored alerts (with a cheaper full copy diverted to a data lake); alerts feed a tuned analyst queue and a SOAR for automated response; everything rolls up into operational and executive dashboards. Each stage in this chapter is one band of this diagram.

🔄 Check Your Understanding: 1. In one sentence each, distinguish a SIEM, a data lake, and a SOAR by the job each does. 2. Why is the SIEM where metrics like MTTD are "born"? 3. How does a SOAR directly attack alert fatigue?

Answers

  1. A SIEM ingests, normalizes, correlates, and alerts on logs for real-time detection and investigation; a data lake stores vast amounts of raw log data cheaply for long retention, hunting, and forensics (schema-on-read, no real-time correlation by itself); a SOAR orchestrates and automates the response to alerts via playbooks (enrich, contain, ticket). 2. Because MTTD requires a timestamped record of when activity occurred and when the alert fired — exactly the logged events and alerts the SIEM holds; without those records you cannot measure detection at all. 3. By automating the repetitive enrichment and triage (and sometimes containment) that consume analyst time on each alert, so humans spend their attention only where judgment is required — shrinking the per-alert cost that drives fatigue.

Project Checkpoint

This chapter adds Meridian's logging and monitoring standard to the security program and the siem.py module to bluekit.

Program increment — the logging & monitoring standard. After the six-day foothold described in this chapter's opening (caught, in the end, only by luck during an unrelated audit), Dana Okafor made centralized logging the next program priority, and the team — Sam architecting the pipeline, Marcus's SOC defining detections — produced Meridian's logging and monitoring standard. Its core decisions: (1) a prioritized log-source list (identity and endpoint first, then cloud, network, servers, and the core-banking and online-banking applications), with a documented owner and onboarding plan for each; (2) normalization to a common schema (mapped to an industry model) and mandatory UTC timestamps with NTP synchronization on every source; (3) log retention of at least one year for security-relevant logs (longer for those PCI-DSS and GLBA require), with hot storage in the SIEM and a cheaper data-lake archive; (4) a first-ten use-case catalog (below); and (5) a standing tuning process — weekly review of the noisiest rules and the false-positive rate. The standard explicitly states that detections are managed as code, in version control, so they can be reviewed and tested. It is the foundation the detection and hunting program (Chapter 22) and the incident-response plan (Chapter 24) build on, and the source of the metrics the board will see (Chapter 36).

Meridian's first ten detection use cases — the starter catalog every new SOC needs:

 1. Brute force followed by success (T1110 -> T1078)      [sequence]
 2. Password spraying: one source, many accounts          [threshold]
 3. Impossible travel: same user, two distant locations   [sequence/geo]
 4. Security/audit log cleared (Win 1102; log gaps)        [single-event]
 5. New privileged-group membership added                 [single-event]
 6. Service account interactive logon (never-before-seen) [behavioral]
 7. Disabled/expired account login attempt                [single-event]
 8. MFA disabled or reset for a user                       [single-event]
 9. Outbound connection to a known-bad / new external IP   [cross-source]
10. Mass file access or deletion (possible ransomware)     [threshold]

bluekit increment: siem.py. We implement, in miniature, the two operations at the heart of this chapter: normalize(raw, source) turns a source-specific raw event into the common schema, and correlate(events, rule) applies a simple threshold rule to a list of normalized events (a sequence rule follows the same pattern but must also track order). As always, the code is never executed during authoring; the expected output is hand-traced.

# bluekit/siem.py  — Chapter 21 increment
"""Minimal SIEM core: normalize raw events to a common schema, then correlate."""

def normalize(raw: dict, source: str) -> dict:
    """Map a source-specific raw event onto the common schema."""
    field_map = {                       # which raw key holds each canonical field
        "vpn":          {"user": "u",  "src_ip": "ip", "outcome": "res"},
        "win_security": {"user": "TargetUserName", "src_ip": "IpAddress",
                         "outcome": "Status"},
    }
    m = field_map[source]
    res = raw.get(m["outcome"], "")
    outcome = "success" if str(res) in ("success", "0x0", "0") else "failure"
    return {"timestamp": raw["ts"], "source": source, "action": "login",
            "user": raw.get(m["user"]), "src_ip": raw.get(m["src_ip"]),
            "outcome": outcome}

def correlate(events: list, rule: dict) -> list:
    """Threshold rule: alert if >= rule['count'] matching events for one user."""
    counts = {}
    for e in events:
        if all(e.get(k) == v for k, v in rule["match"].items()):
            counts[e["user"]] = counts.get(e["user"], 0) + 1
    return [{"alert": rule["name"], "user": u, "n": n}
            for u, n in counts.items() if n >= rule["count"]]

if __name__ == "__main__":
    raw = [{"ts": "…02:01:10Z", "u": "mreyes", "ip": "203.0.113.77", "res": "fail"},
           {"ts": "…02:01:14Z", "u": "mreyes", "ip": "203.0.113.77", "res": "fail"},
           {"ts": "…02:01:19Z", "u": "mreyes", "ip": "203.0.113.77", "res": "fail"}]
    evs = [normalize(r, "vpn") for r in raw]
    rule = {"name": "vpn_brute_force", "count": 3,
            "match": {"action": "login", "outcome": "failure"}}
    print(evs[0])
    print(correlate(evs, rule))

# Expected output:
# {'timestamp': '…02:01:10Z', 'source': 'vpn', 'action': 'login', 'user': 'mreyes', 'src_ip': '203.0.113.77', 'outcome': 'failure'}
# [{'alert': 'vpn_brute_force', 'user': 'mreyes', 'n': 3}]

Trace it: each raw VPN event is normalized — its res value "fail" is not in the success set, so outcome becomes "failure" and the field names become canonical. Then correlate counts events matching action=login, outcome=failure per user; mreyes has three, which meets the rule's threshold of three, so one alert is emitted. This twenty-five-line module is the conceptual core of a real SIEM: a normalizer and a correlator. Everything else a commercial SIEM adds — scale, more sources, richer rule logic, dashboards — is built on these two ideas you just implemented. You have written the heart of Meridian's monitoring.
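The correlate function above counts matching events but ignores their order, so it cannot express use case 1 (brute force followed by success). A sequence variant might look like the following sketch. It is not part of bluekit as given in this chapter: the name correlate_sequence, the rule keys, and the helper login are assumptions, the events are assumed to arrive already sorted by time, and no time window is enforced.

```python
def correlate_sequence(events: list, rule: dict) -> list:
    """Sequence rule: alert when a user has >= rule['failures'] consecutive
    failed logins followed by a success. Assumes `events` is in time order."""
    streak = {}                      # current run of failures per user
    alerts = []
    for e in events:
        if e.get("action") != "login":
            continue
        user = e["user"]
        if e["outcome"] == "failure":
            streak[user] = streak.get(user, 0) + 1
        else:                        # a success ends the run either way
            if streak.get(user, 0) >= rule["failures"]:
                alerts.append({"alert": rule["name"], "user": user,
                               "failures": streak[user]})
            streak[user] = 0
    return alerts


def login(user, outcome):
    """Build a normalized login event (timestamp omitted; list order stands in for time)."""
    return {"action": "login", "user": user, "outcome": outcome}


evs = ([login("mreyes", "failure")] * 11 + [login("mreyes", "success")]
       + [login("jchen", "failure")] * 2 + [login("jchen", "success")])
rule = {"name": "brute_force_then_success", "failures": 10}
print(correlate_sequence(evs, rule))
```

Eleven failures then a success for mreyes produces one alert; two typos then a success for jchen produces none. The only state the rule needs is a running failure count per user, which is what distinguishes reading a sequence from counting events.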

Summary

This chapter built the central nervous system of security operations.

  • Logs are the ground truth. Every detection, investigation, and incident timeline rests on records systems wrote at the time; a control that does not log is half a control. Logging is also a compliance obligation (PCI-DSS, GLBA), but its real value is defense.
  • A SIEM collects logs from many log sources, normalizes them into a common schema (after parsing out the fields), correlates events into alerts, and lets analysts query and visualize. Ship logs off the box to a central system the attacker cannot erase; synchronize clocks (NTP) and standardize on UTC.
  • Collect by detection value, not exhaustively: identity and endpoint first, then cloud, network, servers, applications. "Collect everything" produces an expensive, noisy SIEM that gets ignored.
  • A correlation rule fires on a pattern — single-event, threshold, sequence, cross-source, or behavioral. A use case is the named threat scenario a rule serves, defining its sources, severity, false-positive risk, and analyst response. Attacks are sequences; correlation reads sequences that single events miss.
  • Detection-as-code manages rules as version-controlled, reviewable, testable text (Sigma is the portable format) rather than un-auditable clicks.
  • Querying in SQL, SPL, and KQL expresses the same operations — filter, aggregate, sort, join — in different syntax; lead every query with a time bound.
  • Alert fatigue — desensitization from too many alerts, mostly false positives — is the most common reason a SOC misses a real attack. Fidelity, not coverage, is the currency of a SOC. Tame the queue by tuning thresholds/conditions, allowlisting known-benign sources, aggregating duplicates, risk-scoring, and suppressing scheduled noise — and tune noisy rules rather than disabling them (which creates a silent false negative).
  • A data lake stores everything cheaply for long retention and hunting; a SOAR automates the response to alerts via playbooks. The pattern: high-value logs to the SIEM, a cheap full copy to the lake, response orchestrated by SOAR.
  • Dashboards (operational vs. executive) and the metrics born in the SIEM — MTTD, MTTR, coverage, false-positive rate — are the bridge to board reporting (Chapter 36).
  • Meridian gained a logging & monitoring standard and its first ten use cases, and bluekit gained siem.py (normalize, correlate).

Spaced Review

Retrieval practice across this chapter and earlier ones. Answer before expanding.

  1. (This chapter) Why does a sequence correlation rule catch a brute-force-that-succeeded when single-event rules on "failed login" and "successful login" both miss it?
  2. (Chapter 10) The SIEM consumes the network-visibility layer you built earlier. What did NetFlow/flow data and a beaconing score give you at the network layer that becomes one input the SIEM correlates with identity and endpoint events?
  3. (Chapter 6) Why does normalizing all sources to a single timezone (UTC) and synchronizing clocks (NTP) matter for correlation, and what network-layer fact from the fundamentals makes accurate timestamps non-trivial across many devices?
  4. (This chapter) Distinguish a false positive from a false negative, and explain why "fixing" alert fatigue by disabling a noisy rule trades one for the other.
  5. (Chapter 10) Both Chapter 10 and this chapter insist "you can't defend what you can't see." How is the SIEM's answer to that maxim broader than network monitoring's?

Answers

  1. Individually, failed and successful logins are constant and ordinary; only the ordered combination (many failures immediately followed by a success for the same account) is the signature of a brute-force that worked, which a sequence rule reads and single-event rules cannot. 2. Flow data summarized who talked to whom at scale, and a beaconing score flagged the regular, automated callbacks characteristic of command-and-control; normalized into the SIEM, that network signal can be correlated with, say, the endpoint process and the identity that initiated it. 3. Correlation depends on events from different sources lining up in time; if clocks drift, co-occurring events appear minutes apart and time-windowed rules miss them. Many heterogeneous network devices each keep their own clock, so without NTP they drift apart; hence synchronize and normalize to UTC. 4. A false positive is an alert with no real attack; a false negative is a real attack with no alert. Disabling a noisy rule removes its false positives but guarantees a false negative for anything it would have caught: a silent, permanent blind spot, which is strictly worse than visible noise. 5. Network monitoring sees the wire; the SIEM correlates the wire with identity, endpoint, cloud, and application logs, so it can see attacks that are invisible on the network alone (e.g., an insider or credential abuse that never crosses a monitored link) by reading the full, cross-source sequence.

What's Next

You now have a SIEM that collects, normalizes, correlates, and alerts — and the discipline to keep its queue trustworthy. But the first ten use cases are a starting catalog, not a detection program, and a SIEM can only alert on patterns you have thought to define. Chapter 22 turns this instrument into a craft: threat detection and hunting. You will learn to turn threat intelligence and the MITRE ATT&CK framework into detections systematically, to write portable rules in Sigma (the detection-as-code format we previewed here), to hunt — to go looking, hypothesis in hand, for the attacker your rules did not catch — and to measure your detection coverage against the techniques real adversaries use. Where this chapter built the place the data lives and the engine that watches it, the next builds the intelligence about what to watch for — including hunting for the very SolarWinds-style beaconing that the best correlation rules are designed to surface.