Case Study 1: Standing Up Meridian's SIEM

"We had all the evidence and none of the visibility. The logs were telling the truth the whole time — to nobody." — Marcus Reyes, SOC Manager, Meridian Regional Bank (constructed)

Executive Summary

Six days. That is how long an attacker dwelt inside Meridian Regional Bank's network before an unrelated audit stumbled across the evidence — evidence that had been written, faithfully and in real time, by three different systems that never spoke to one another. The incident cost Meridian no money in the end, but it cost the security team something more useful: an argument they could not lose. Within a quarter, CISO Dana Okafor had funded a centralized Security Information and Event Management program, and engineer Sam Whitfield and SOC manager Marcus Reyes had to build it — deciding what to collect, how to normalize it, which ten detections to write first, and how to keep the alert queue trustworthy enough that the next six-day foothold would last six minutes.

This case study follows that build, from a blank architecture diagram to a working pipeline with its first ten use cases and a tuning discipline. You will watch the chapter's concepts stop being definitions and become decisions under real constraints: a fixed budget, a SIEM licensed by data volume, a small team, and a legacy environment that logs inconsistently. The scenario and all figures are constructed for teaching (Tier 3).

Skills applied: log-source prioritization; collection-method selection (agent / syslog / API); normalization to a common schema; writing correlation rules across the correlation ladder; framing detection use cases (sources, severity, response, false-positive risk); SIEM-vs-data-lake architecture; alert tuning; turning a near-miss into a logging & monitoring standard.

Background: the six-day foothold

Reconstructed after the fact, the intrusion was almost boring, which is the point. An attacker compromised the credentials of svc_app, a service account that an aging internal application used to talk to a database. Service accounts are supposed to run software, not sit at keyboards — but this one's password had been hard-coded in a configuration file (a problem Meridian's secrets-management work in Chapter 20 was meant to fix), and it had been over-privileged for convenience years earlier.

Here is what the systems recorded, in the order it happened (times UTC, normalized after the fact):

day1 14:02:10  win_security  user=svc_app  host=APPSRV9  action=login      outcome=success  logon_type=interactive
day1 14:09:33  edr           user=svc_app  host=APPSRV9  action=process    proc="net group /domain"
day1 15:12:04  win_security  user=svc_app  host=DC01     action=group_add  target_group="Domain Admins"  outcome=success
day3 02:40:17  edr           user=svc_app  host=APPSRV9  action=process    proc="rclone ... remote:exfil"
day6 11:55:02  (discovered by a quarterly access review: svc_app is in Domain Admins)

Three observations made by the post-incident review — each of which became a design requirement for the SIEM:

  1. Every step was logged. The interactive logon, the reconnaissance command, the privilege escalation, even the exfiltration tool's execution — all recorded by Windows and the endpoint agent at the moment they occurred. The data existed.
  2. No single event was an alarm. Service accounts log in. Administrators run net group. Group memberships change. The bank could not, and should not, page an analyst for any one of these in isolation. The signal was in the sequence.
  3. The logs lived in three silos. The Windows events sat in the local logs of the application server and of the domain controller; the endpoint events were in the EDR console. Nobody correlated them, and nobody was alerted when svc_app, a service account, logged in interactively for the first time in its life. There was no central brain.
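The second and third observations can be made concrete with a small sketch. It assumes the four logged events from the timeline have already been copied out of their silos into one list; the stage labels and the `correlate` helper are illustrative names, not part of Meridian's tooling. Once the events share a user and a clock, the sequence becomes alertable at the second stage:

```python
from collections import defaultdict

# Timeline events from the case study, tagged with the silo that held each one.
# The "stage" labels are an illustrative assumption added for this sketch.
EVENTS = [
    ("day1 14:02:10", "win_security@APPSRV9", "svc_app", "interactive_logon"),
    ("day1 14:09:33", "edr",                  "svc_app", "discovery"),
    ("day1 15:12:04", "win_security@DC01",    "svc_app", "privilege_escalation"),
    ("day3 02:40:17", "edr",                  "svc_app", "exfiltration"),
]

def correlate(events, min_stages=2):
    """Group events by user in time order and return, per user, the timestamp
    at which their activity first spans `min_stages` distinct stages."""
    by_user = defaultdict(list)
    for ts, silo, user, stage in sorted(events):
        by_user[user].append((ts, stage))
    alerts = {}
    for user, seq in by_user.items():
        stages = []
        for ts, stage in seq:
            if stage not in stages:
                stages.append(stage)
            if len(stages) >= min_stages and user not in alerts:
                alerts[user] = ts  # first moment the sequence is alertable
    return alerts

print(correlate(EVENTS))  # {'svc_app': 'day1 14:09:33'}
```

With only the first event available, `correlate` returns nothing, which is the point of observation 2: the signal is in the sequence, not in any one event.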

🔗 Connection: Notice the kill-chain shape (Chapter 2): valid-account access → discovery → privilege escalation → exfiltration. Each stage was an opportunity to detect — and each was missed not for lack of data but for lack of correlation. A SIEM exists precisely to turn that sequence into an alert at stage one or two, not a discovery at stage four during an audit.

Dana's framing to the board was deliberately plain: "We were not blind because we lacked cameras. We were blind because no one was watching the monitors, and the monitors were in three different rooms." The program increment that followed is what this case study builds.

Phase 1 — Deciding what to collect

Sam's first instinct, shared by every engineer handed a shiny new SIEM, was to collect everything. Marcus, who would have to run the thing, talked him out of it. Their SIEM was licensed by ingest volume; collecting every debug log in the bank would blow the budget on noise and bury the detections that mattered. So they applied the chapter's discipline: collect by detection value, top down.

They built a prioritized log-source list, and — crucially — assigned each source an owner and an onboarding method, because "we should collect AD logs" is a wish and "Sam onboards the domain controllers via the Windows agent by the 15th" is a plan:

Priority | Source                               | Method              | Owner     | Why first
1        | Active Directory / Entra ID sign-ins | agent + API (Entra) | Sam       | The foothold was a credential; identity is the new perimeter
2        | Endpoint detection (EDR)             | vendor agent        | Marcus    | Caught the recon and exfil tool; would have alerted at stage 2
3        | AWS CloudTrail                       | API pull            | Sam       | Cloud control plane; an attacker in AWS leaves traces only here
4        | Firewall / proxy / DNS               | syslog (TLS)        | Sam       | Network edge context (Chapters 7, 10); outbound C2 detection
5        | Windows / Linux servers              | agent / syslog      | Sam       | Where the interactive logon and group change were recorded
6        | Core banking + online banking apps   | agent / app logs    | app teams | Access to the crown jewels; high-value, custom detections
7        | M365 / mail gateway                  | API pull            | Marcus    | Phishing (the Chapter 1 near-miss) and account takeover

⚠️ Common Pitfall: Sam initially wanted the printer and badge-reader logs in "phase one" because they were easy syslog sources. Marcus pushed back with the question that governs every collection decision: "What use case does this serve, and is it higher-value than something we're not yet collecting?" The printers went to the backlog. Easy-to-collect is not the same as worth-collecting; collection effort should follow detection value, not the path of least resistance.
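Marcus's question can be turned into a mechanical gate on the source list. The sketch below is one possible shape, not Meridian's actual process: the `LogSource` record and `triage` function are hypothetical names, and the use-case strings are paraphrased from the table above.

```python
from dataclasses import dataclass

@dataclass
class LogSource:
    name: str
    method: str    # e.g. "agent", "syslog (TLS)", "API pull"
    owner: str
    use_case: str  # the detection this source serves; empty if none is named

def triage(sources):
    """Split candidates into phase one and backlog. A source with no named
    use case, owner, or onboarding method is a wish, not a plan."""
    phase_one, backlog = [], []
    for s in sources:
        ready = all(v.strip() for v in (s.method, s.owner, s.use_case))
        (phase_one if ready else backlog).append(s.name)
    return phase_one, backlog

candidates = [
    LogSource("Active Directory / Entra ID sign-ins", "agent + API (Entra)",
              "Sam", "credential misuse such as a service-account logon"),
    LogSource("Endpoint detection (EDR)", "vendor agent",
              "Marcus", "recon and exfil tooling on hosts"),
    LogSource("Printers", "syslog", "Sam", ""),  # easy to collect, no use case
]

phase_one, backlog = triage(candidates)
print(backlog)  # ['Printers']
```

The gate does not rank sources by value; it only refuses to schedule a source nobody can justify, which is the minimum the pitfall above asks for.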

The everything-else problem did not vanish — Meridian still had compliance retention obligations and a desire to hunt (Chapter 22) and do forensics (Chapter 25) across data the SIEM did not keep hot. The answer, decided here and detailed in Phase 5, was a two-destination architecture: high-value logs to the real-time SIEM, a cheaper full copy to a data lake.

Phase 2 — Normalizing the chaos

The moment the first sources came online, the silo problem reappeared in a new form: every source described the world differently. The same concept — a user authenticating from somewhere — arrived as a Linux sshd line, a Windows Event 4624/4625, an Entra sign-in JSON blob, and a VPN log, each with different field names, value encodings, and timestamp formats.

Sam mapped them all onto a small common schema, choosing canonical field names and translating each source on ingest. A representative slice of the mapping:

Common schema:  timestamp | source | host | user | src_ip | action | outcome | extra{}

  linux_auth (sshd):   "Failed password for jchen from 198.51.100.23"
     -> user=jchen, src_ip=198.51.100.23, action=login, outcome=failure
  win_security (4625): TargetUserName, IpAddress, Status=0xC000006A
     -> user=<TargetUserName>, src_ip=<IpAddress>, action=login, outcome=failure
  entra_signin (JSON): userPrincipalName, ipAddress, status.errorCode
     -> user=<userPrincipalName>, src_ip=<ipAddress>,
        outcome = (errorCode==0 ? success : failure)
  firewall (syslog):   "DENY TCP 198.51.100.23:52344 -> 10.20.4.7:22"
     -> src_ip=198.51.100.23, action=connection, outcome=deny
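Two of these mappings are simple enough to sketch as parsers. The regular expressions below are assumptions fitted to the sample lines shown above, not a complete grammar for sshd or any firewall vendor, and timestamp handling is left out to keep the example short.

```python
import re

# Patterns fitted to the sample lines above; real feeds need broader coverage.
SSHD_FAIL = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) from (?P<ip>\S+)")
FW_LINE = re.compile(
    r"(?P<verdict>DENY|ALLOW) (?P<proto>\w+) "
    r"(?P<src>[\d.]+):(?P<sport>\d+) -> (?P<dst>[\d.]+):(?P<dport>\d+)")

def normalize_sshd(line, host):
    """Map an sshd failed-password line onto the common schema."""
    m = SSHD_FAIL.search(line)
    if m is None:
        return None
    return {"source": "linux_auth", "host": host, "user": m["user"],
            "src_ip": m["ip"], "action": "login", "outcome": "failure",
            "extra": {}}

def normalize_firewall(line, host):
    """Map a firewall verdict line onto the common schema."""
    m = FW_LINE.search(line)
    if m is None:
        return None
    return {"source": "firewall", "host": host, "user": None,
            "src_ip": m["src"], "action": "connection",
            "outcome": m["verdict"].lower(),
            # Source-specific detail is kept rather than discarded.
            "extra": {"proto": m["proto"], "dst_ip": m["dst"],
                      "dst_port": int(m["dport"])}}

print(normalize_sshd("Failed password for jchen from 198.51.100.23", "web01"))
print(normalize_firewall("DENY TCP 198.51.100.23:52344 -> 10.20.4.7:22", "fw01"))
```

Both functions return `None` for lines they do not recognize, so an unparsed line can be counted and reviewed instead of being silently mis-mapped.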

Two decisions Sam made here mattered more than the field names:

  • UTC everywhere, NTP enforced. Several legacy servers had clocks minutes off true. Sam made NTP synchronization a hard requirement of the logging standard and normalized every timestamp to UTC, because — as the chapter stresses — sequence rules across sources are worthless if the sources disagree about when. The six-day-foothold reconstruction itself had required fixing one server's drift before the events lined up.
  • Map to an industry model, don't invent. Rather than inventing field names from scratch, Sam mapped onto an established schema so that, later, vendor and community detections (and the Sigma rules of Chapter 22) would work against Meridian's data without rewriting.
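The UTC decision is easy to illustrate. The sketch below assumes a source that logs naive local timestamps at a known fixed offset; the date, the offset, and the function names are hypothetical. It converts on ingest and measures how far a source's clock disagrees with the collector's receipt time.

```python
from datetime import datetime, timedelta, timezone

def to_utc(raw, fmt, utc_offset_hours=0):
    """Parse a naive source timestamp recorded at a fixed UTC offset and
    return it as an ISO-8601 string in UTC."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(raw, fmt).replace(tzinfo=tz)
    return local.astimezone(timezone.utc).isoformat()

def clock_drift_seconds(event_utc, received_utc):
    """Absolute gap between when the source says an event happened and when
    the collector received it; a persistently large gap suggests clock drift."""
    delta = datetime.fromisoformat(received_utc) - datetime.fromisoformat(event_utc)
    return abs(delta.total_seconds())

# A server logging in UTC-5 local time (hypothetical date and offset).
print(to_utc("2024-03-01 09:02:10", "%Y-%m-%d %H:%M:%S", -5))
# 2024-03-01T14:02:10+00:00
```

A fixed offset is a simplification; a source that observes daylight saving time would need a real time-zone database (for example the standard library's `zoneinfo`) rather than a constant.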

Normalization also surfaced a problem the team had not anticipated, and it is worth recording because it is universal: some of the most security-relevant fields were the hardest to extract. The Windows logon events, for instance, encode how a user logged in as a numeric "logon type" — interactive, network, service, remote-interactive — and that single field was the difference between "a service account did its normal job" and "a service account sat at a console," which was the entire signal in the six-day foothold. A naive normalization that mapped only user, IP, and outcome would have thrown that field away, blinding the very detection (use case #6) that mattered most. Sam learned to preserve source-specific high-value fields in an extra{} sub-object rather than flattening every event to a lowest common denominator. The lesson generalizes: normalize to a common schema for cross-source correlation, but do not let normalization lose the source-specific detail that powers your best detections. A common schema is a floor of shared fields, not a ceiling that discards everything else.
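A minimal sketch of that lesson, assuming the Windows logon event has already been parsed into a flat dictionary (the dictionary shape and the `normalize_4624` name are assumptions for illustration). The common fields are filled as usual, and the logon type is carried along in `extra` instead of being dropped:

```python
# Windows logon types relevant to this case study.
LOGON_TYPES = {2: "interactive", 3: "network", 5: "service",
               10: "remote-interactive"}

def normalize_4624(raw):
    """Normalize a successful-logon event while preserving the logon type."""
    code = int(raw["LogonType"])
    return {
        "source": "win_security",
        "host": raw["Computer"],
        "user": raw["TargetUserName"],
        "src_ip": raw.get("IpAddress"),
        "action": "login",
        "outcome": "success",
        # Floor, not ceiling: the high-value source-specific field survives.
        "extra": {"logon_type": LOGON_TYPES.get(code, f"type_{code}")},
    }

event = normalize_4624({"Computer": "APPSRV9", "TargetUserName": "svc_app",
                        "IpAddress": "-", "LogonType": "2"})
print(event["extra"])  # {'logon_type': 'interactive'}
```

Unknown codes are kept as `type_<n>` rather than discarded, so a logon type the mapping does not yet name is still visible to a later rule.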

🔗 Connection: The firewall and DNS sources here are the same network telemetry Chapter 10 taught the team to capture; the beacon_score and flow summaries from that chapter become normalized inputs the SIEM can correlate with identity and endpoint events. Chapter 10 gave Meridian eyes on the wire; this phase gives those eyes a shared language with the eyes on identity and endpoints.

Phase 3 — The first ten use cases

With data flowing and normalized, Marcus's SOC wrote detections. They resisted two temptations: importing hundreds of vendor-default rules (a fast path to alert fatigue), and trying to detect everything at once. Instead they wrote ten use cases — the starter catalog — each specified properly, mapped to the correlation ladder, and aimed squarely at the kind of attack that had just dwelt in the network for six days.

 1. Brute force followed by success        [sequence]      T1110 -> T1078
 2. Password spraying (one src, many users)[threshold]     T1110.003
 3. Impossible travel                       [sequence/geo]  T1078
 4. Security/audit log cleared              [single-event]  T1070.001
 5. New privileged-group membership added   [single-event]  T1098 / T1078
 6. Service account interactive logon       [behavioral]    T1078.002  <-- would have caught the foothold
 7. Disabled/expired account login attempt  [single-event]  T1078
 8. MFA disabled or reset for a user        [single-event]  T1556
 9. Outbound to known-bad / new external IP [cross-source]  T1071 / C2
10. Mass file access or deletion            [threshold]     T1486 (ransomware)

Use case #6 was personal. Had it existed, the foothold would have alerted at day1 14:02:10, the moment svc_app logged in interactively for the first time. The team specified it carefully, because behavioral "first-seen" rules are powerful but noisy in their first weeks (everything is "first seen" at first):

Use case #6:   Service account interactive logon (never-before-seen)
ATT&CK:        T1078.002 (Valid Accounts: Domain Accounts)
Log sources:   win_security (logon_type=interactive/remote-interactive), Entra sign-ins
Trigger:       a user on the SERVICE-ACCOUNT list performs an interactive logon
               AND (host, account) not seen interactive in the prior 30 days
Severity:      High (service accounts should never sit at a console)
Response:      Confirm with the account owner; if unexpected, disable + hunt for
               post-logon activity (recon, group changes, outbound connections)
False pos.:    legitimate break-glass admin use of a service account; a NEW service
               account in its first 30 days -> tune via an explicit allowlist + a
               grace period; require interactive (not service/network) logon type
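The specification above can be sketched as a small stateful check. This is an illustration under stated assumptions, not the SIEM's rule language: the event dictionary follows the common schema with the logon type in `extra`, `last_seen` is a hypothetical in-memory map standing in for the 30-day baseline, and the date is invented.

```python
from datetime import datetime, timedelta

INTERACTIVE = {"interactive", "remote-interactive"}

def service_account_interactive(event, service_accounts, allowlist,
                                last_seen, window_days=30):
    """Return True if `event` should alert under use case #6.

    last_seen maps (host, user) -> datetime of the previous interactive logon
    and is updated in place."""
    if event["action"] != "login":
        return False
    if event["user"] not in service_accounts:
        return False
    if event["extra"].get("logon_type") not in INTERACTIVE:
        return False  # service/network logons are the account's normal job
    if event["user"] in allowlist:
        return False  # documented break-glass exception
    key = (event["host"], event["user"])
    previous = last_seen.get(key)
    last_seen[key] = event["timestamp"]
    if previous is None:
        return True
    return event["timestamp"] - previous > timedelta(days=window_days)

seen = {}
logon = {"action": "login", "user": "svc_app", "host": "APPSRV9",
         "timestamp": datetime(2024, 3, 1, 14, 2, 10),  # hypothetical date
         "extra": {"logon_type": "interactive"}}
print(service_account_interactive(logon, {"svc_app"}, set(), seen))  # True
```

Because an empty `last_seen` makes every first logon look new, the same function also shows why a first-seen rule is noisy until a baseline has been built.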

And here is use case #1 — the classic — implemented as the kind of correlation query the SIEM runs continuously, against the normalized events table:

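-- Use case #1: brute force followed by success, per user.
-- A burst is 10+ failed logins within the last 15 minutes; the alert fires when
-- a success for the same user lands within 10 minutes of the burst's first failure.
-- (Quote "user" and "timestamp" if your SQL dialect reserves those words.)
WITH bursts AS (
  SELECT user, COUNT(*) AS fails, MIN(timestamp) AS t0
  FROM events
  WHERE action = 'login' AND outcome = 'failure'
    AND timestamp >= NOW() - INTERVAL '15' MINUTE
  GROUP BY user
  HAVING COUNT(*) >= 10
)
SELECT s.user, b.fails, s.src_ip, s.timestamp AS success_at
FROM bursts b
JOIN events s
  ON s.user = b.user AND s.action = 'login' AND s.outcome = 'success'
 AND s.timestamp BETWEEN b.t0 AND b.t0 + INTERVAL '10' MINUTE
WHERE NOT EXISTS (
  SELECT 1 FROM user_baseline u
  WHERE u.user = s.user AND u.usual_ip = s.src_ip
);
-- The NOT EXISTS clause is a TUNING condition: a success from the user's
-- normal device after a few fat-fingered passwords is NOT this attack.
-- Unlike a scalar "src_ip <> usual_ip" comparison, it still alerts for a user
-- with no baseline row and tolerates a user with several usual IPs.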
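Because detections are easier to trust when they can be tested, the same logic can be sketched outside the SIEM and run against synthetic events. The version below is an approximation, not a translation of the query: it assumes the events passed in are already limited to the lookback period, it treats `usual_ips` (a hypothetical user-to-set mapping) as the baseline, and it counts a user with no baseline entry as logging in from an unusual address.

```python
from datetime import datetime, timedelta

def brute_force_then_success(events, usual_ips, min_fails=10,
                             window=timedelta(minutes=10)):
    """Return (user, fails, src_ip, success_at) tuples for use case #1."""
    fails = {}
    for e in events:
        if e["action"] == "login" and e["outcome"] == "failure":
            fails.setdefault(e["user"], []).append(e["timestamp"])

    alerts = []
    for e in events:
        if e["action"] != "login" or e["outcome"] != "success":
            continue
        times = fails.get(e["user"], [])
        if len(times) < min_fails:
            continue
        t0 = min(times)
        in_window = t0 <= e["timestamp"] <= t0 + window
        unusual = e["src_ip"] not in usual_ips.get(e["user"], set())
        if in_window and unusual:
            alerts.append((e["user"], len(times), e["src_ip"], e["timestamp"]))
    return alerts

def sample_events(start=datetime(2024, 3, 1, 14, 0, 0)):
    """Ten failures one second apart, then a success two minutes after the
    first failure. All values are synthetic test data."""
    ip = "198.51.100.23"
    evs = [{"user": "jchen", "src_ip": ip, "action": "login",
            "outcome": "failure", "timestamp": start + timedelta(seconds=i)}
           for i in range(10)]
    evs.append({"user": "jchen", "src_ip": ip, "action": "login",
                "outcome": "success", "timestamp": start + timedelta(minutes=2)})
    return evs

# Alerts when the success comes from an address outside the baseline.
print(len(brute_force_then_success(sample_events(), {"jchen": {"10.20.4.50"}})))  # 1
```

Swapping the baseline to include 198.51.100.23 returns no alerts, which exercises the tuning condition directly.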

Just as important as the ten use cases the team wrote were the detections they deliberately deferred, because a starter catalog is defined as much by its scope as its contents. Marcus kept a short, explicit "not yet" list with the reason for each deferral, so that the gaps were chosen and visible rather than accidental:

Deliberately deferred (with reason and compensating coverage):
  - Data exfiltration via DNS tunneling   -> needs DNS log volume + analytics not yet
                                             tuned; partial cover from #9 (new-IP egress)
  - Insider bulk data access (no new IP)   -> needs per-user data-access baselines (Ch.34
                                             UEBA); accepted gap, flagged for the roadmap
  - Cloud (AWS) privilege escalation       -> CloudTrail onboarding in progress (source #3);
                                             use cases follow once data is flowing
  - Web-app attack patterns (SQLi/XSS)     -> owned by the app team's WAF logs; SOC backlog

The discipline here is the same one that governed collection: you cannot detect everything at once, so choose what to detect by risk, name what you are not detecting, and put the gaps on a roadmap rather than pretending they do not exist. A SOC that knows its blind spots can plan to close them; a SOC that does not know its blind spots simply has them. Writing the "not yet" list down turned Meridian's coverage from a vague hope into a managed backlog with owners and dates.

🛡️ Defender's Lens: Every one of these ten use cases is also the seed of a later chapter's capability. #1–#3 and #6 feed the detection-engineering and hunting practice of Chapter 22; #9 (C2 detection) ties to the network monitoring of Chapter 10 and the SolarWinds-style beaconing hunt of Chapter 22; #10 (ransomware) is the trigger that kicks off the incident-response tabletop of Chapter 24. A first detection catalog is, in effect, the table of contents of a security operations program.

🔄 Check Your Understanding: Use case #6 would have caught the foothold at stage one. Why did the team still specify a 30-day grace period and an allowlist for it, given how valuable it is? What attack does that grace period risk letting through, and how would you cover that gap? (Hint: think about what "first seen in 30 days" means for an attacker who waits, and what other use case — #5 — provides overlapping coverage.)

Phase 4 — The alert-fatigue reckoning

Two weeks after the detections went live, Marcus pulled the queue statistics and found the disease the chapter warns about, in miniature. The ten rules were generating about 220 alerts a day, and his SOC could properly investigate perhaps 60. The breakdown was familiar:

Daily alert volume by rule (week 2):
  #1 brute force-then-success      8   (mostly real-ish; keep)
  #2 password spraying            12   (real attack surface; keep)
  #3 impossible travel            70   <-- VPN exit nodes + corporate travel = noise
  #4 log cleared                   1   (high fidelity; keep)
  #5 new privileged group          6   (some legit IT changes; aggregate)
  #6 service-account interactive  95   <-- "first seen" firing on the 30-day backfill
  #9 outbound to new external IP  25   (CDN/SaaS churn; needs allowlists)
  others                           3
  ------------------------------------
  total                          220   (SOC capacity ~60)

The two offenders, #3 (impossible travel) and #6 (service-account logon), were drowning everything else. Critically, Marcus did not disable them, because both catch attacks Meridian genuinely fears. He tuned them, along with the next-noisiest rule, #9:

  • #6 service-account interactive logon (95 → ~3/day). The flood was the "first seen in 30 days" baseline firing on every service account during the initial backfill window. Fix: a one-time baseline build (don't alert during the first 30 days of data, only on genuinely new behavior thereafter), plus an explicit allowlist of the handful of service accounts that do legitimately log in interactively for break-glass operations, each with a documented justification.
  • #3 impossible travel (70 → ~5/day). The noise was corporate VPN exit nodes (a user appears to be in two cities because their traffic egresses from two data centers) and legitimate travel. Fix: allowlist the corporate VPN egress ranges, exclude cloud-provider IP space, and raise the "impossible" speed threshold so that plausible same-day flights do not trip it.
  • #9 outbound to new external IP (25 → ~6/day). Most "new" destinations were content-delivery networks and SaaS endpoints. Fix: allowlist known-good CDN/SaaS ranges and reputation-score the destination, so only genuinely unknown or known-bad IPs alert.
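The impossible-travel tuning can be sketched in a few lines. Everything numeric here is an assumption for illustration: the VPN egress range is a documentation prefix, the 1000 km/h threshold is a placeholder for whatever speed the team settled on, and the coordinates are approximate city centers.

```python
import ipaddress
import math
from datetime import datetime

VPN_EGRESS = [ipaddress.ip_network("203.0.113.0/24")]  # hypothetical range

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def is_impossible_travel(a, b, max_kmh=1000.0):
    """True if two logins imply a speed above max_kmh, unless either login
    came from an allowlisted corporate VPN egress range."""
    for login in (a, b):
        ip = ipaddress.ip_address(login["src_ip"])
        if any(ip in net for net in VPN_EGRESS):
            return False
    km = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
    hours = abs((b["timestamp"] - a["timestamp"]).total_seconds()) / 3600
    if hours == 0:
        return km > 0
    return km / hours > max_kmh

ny = {"src_ip": "198.51.100.23", "lat": 40.71, "lon": -74.01,
      "timestamp": datetime(2024, 3, 1, 9, 0)}
london = {"src_ip": "192.0.2.10", "lat": 51.51, "lon": -0.13,
          "timestamp": datetime(2024, 3, 1, 10, 0)}
print(is_impossible_travel(ny, london))  # True: one hour apart
```

Raising the threshold or widening the allowlist narrows the condition without removing the detection, which is the tune-not-disable pattern in miniature.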

After tuning, the queue settled to ~35 alerts/day against a capacity of ~60 — a queue the team could actually work, which is the entire goal.
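The arithmetic is worth checking against the week-2 table. Applying only the three reductions itemized above gives the figure below; the further drop to the reported ~35 is not broken down in the text, and presumably reflects additional tuning such as the aggregation flagged for #5, which is an assumption rather than something the case study states.

```python
# Daily alert counts from the week-2 table.
before = {"#1": 8, "#2": 12, "#3": 70, "#4": 1, "#5": 6,
          "#6": 95, "#9": 25, "others": 3}

# Post-tuning rates stated for the three tuned rules.
tuned = {"#6": 3, "#3": 5, "#9": 6}

after = {**before, **tuned}
capacity = 60

print(sum(before.values()))             # 220
print(sum(after.values()))              # 44
print(sum(after.values()) <= capacity)  # True
```

Either way the queue lands under the SOC's capacity of about 60 a day, which is the outcome that matters.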

🚪 Threshold Concept: Marcus's instinct to tune rather than disable is the dividing line between a SOC that defends and one that decorates. Every alert he eliminated was a false positive he excluded by narrowing a condition — not a detection he blinded himself to. The discipline scales: he scheduled a standing weekly review of the noisiest rules and the false-positive rate, because tuning is not setup, it is operations. A SIEM is a garden, not a statue.

⚖️ Authorization & Ethics: One tuning decision raised a flag worth noting. Allowlisting the break-glass service accounts in #6 creates a detection hole: by design, those accounts can now log in interactively without alerting. Marcus documented each allowlist entry with a justification and an owner, set a quarterly review, and designated use case #5 (privileged-group changes) as the compensating detection, so that even an allowlisted account's escalation would still fire. An allowlist is a deliberate, documented risk acceptance, never a quiet convenience.
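One way to keep an allowlist honest is to store the governance alongside each entry and check it mechanically. The record shape, the account names, and the 92-day approximation of "quarterly" below are all hypothetical choices for this sketch.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class AllowlistEntry:
    account: str
    justification: str
    owner: str
    last_reviewed: date
    compensating_rule: str = "#5 new privileged-group membership"

def needs_attention(entries, today, max_age_days=92):
    """Return accounts whose entry lacks a justification or owner, or whose
    last review is older than roughly one quarter."""
    stale = timedelta(days=max_age_days)
    return [e.account for e in entries
            if not e.justification.strip()
            or not e.owner.strip()
            or today - e.last_reviewed > stale]

entries = [
    AllowlistEntry("svc_breakglass1", "DR failover runbook", "Marcus",
                   date(2024, 1, 15)),
    AllowlistEntry("svc_breakglass2", "", "Sam", date(2024, 3, 1)),
]
print(needs_attention(entries, today=date(2024, 3, 20)))
# ['svc_breakglass2']
```

An entry that fails the check is a candidate for removal from the allowlist, which restores alerting for that account until someone re-justifies the exception.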

Phase 5 — The architecture and the standard

To square real-time detection against retention and cost, Sam finalized the two-destination architecture the team had sketched in Phase 1:

   sources ──► collect ──► normalize ──┬──► SIEM (hot, ~90 days)
                                        │      real-time correlation, alerting,
                                        │      analyst queries, dashboards
                                        └──► DATA LAKE (cold, ~13 months+)
                                               cheap full copy: hunting (Ch.22),
                                               forensics (Ch.25), PCI/GLBA retention

High-value, detection-relevant logs (identity, endpoint, security-relevant server and app events) go to the SIEM, kept hot for about ninety days for fast investigation. A fuller, cheaper copy of everything goes to the data lake, retained well over a year to satisfy PCI-DSS and GLBA and to support hunting and forensics across data the SIEM does not keep hot. The split accepts one documented risk: correlation in real time runs only over what is in the SIEM, so a detection that needs a source kept only in the lake must either be promoted to the SIEM or run as a slower, scheduled lake query.
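The routing decision itself is small. In the sketch below the `category` field and the category names are assumptions layered on the common schema; the retention figures are the approximate ones from the diagram.

```python
# Categories the text names as detection-relevant (labels are illustrative).
SIEM_CATEGORIES = {"identity", "endpoint", "server_security", "app_security"}

RETENTION_DAYS = {"siem": 90, "data_lake": 13 * 30}  # ~90 days hot, ~13 months cold

def route(event):
    """Every event gets a copy in the data lake; detection-relevant
    categories are also sent to the SIEM for real-time correlation."""
    destinations = ["data_lake"]
    if event.get("category") in SIEM_CATEGORIES:
        destinations.append("siem")
    return destinations

print(route({"category": "identity", "user": "svc_app"}))  # ['data_lake', 'siem']
print(route({"category": "printer"}))                      # ['data_lake']
```

The documented risk is visible in the second call: an event that lands only in the lake never reaches a real-time rule unless its category is promoted into `SIEM_CATEGORIES`.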

The reasoning behind the ninety-day hot window deserves a note, because it is a question every SIEM owner must answer and the wrong answer is expensive in one of two directions. Make the SIEM window too short and you cripple investigation: many intrusions are discovered weeks or months after they begin (even the six-day foothold was, in the end, found by an audit rather than an alert), and an investigator who needs to ask "when did this account first behave strangely?" must have those older events somewhere queryable. Make the SIEM window too long, on a platform licensed by data volume, and the cost balloons for data that is rarely touched. The hot/cold split resolves the tension: ninety days hot covers the overwhelming majority of investigations at speed, while the data lake holds thirteen-plus months cheaply for the rarer deep historical query and the compliance mandate. The decision rule Sam wrote into the standard was simple: a log's hot-retention period is set by how far back a real-time detection or a routine investigation needs it; everything beyond that goes cold but stays retained.

🔗 Connection: The "~90 days hot, ~13 months cold" choice is a log retention decision driven by two forces this book keeps returning to: detection/investigation need (operational) and compliance (Theme 5 — PCI-DSS Requirement 10 and GLBA set a floor you cannot go below). The same retained logs become the raw material for hunting (Chapter 22), forensics (Chapter 25), and the metrics a CISO reports to the board (Chapter 36). Retention is not a storage detail; it is the time-depth of your ability to see the past.

This all became Meridian's logging and monitoring standard, the program artifact this chapter contributes: the prioritized source list with owners; mandatory normalization to a common schema and UTC with NTP; one-year-plus retention with the hot/cold split; the first-ten use-case catalog; detections managed as code in version control; and a standing weekly tuning review. Dana now had an answer to the board's inevitable question — "would we catch it next time?" — that was not a hope but a documented capability with metrics behind it.

Discussion Questions

  1. Marcus refused to import the vendor's hundreds of default detection rules. Argue both sides: what is gained and what is risked by starting with ten hand-written use cases instead of a large prebuilt library?
  2. Use case #6 (service-account interactive logon) would have caught the foothold at stage one but was the noisiest rule in week two. How do you weigh a detection's value against its initial noise, and when is a high-value-but-noisy rule worth the tuning effort versus a different approach entirely?
  3. The team allowlisted break-glass service accounts in #6, creating a deliberate detection hole. What governance makes that acceptable rather than reckless, and what compensating control covers the gap?
  4. Sam wanted easy-to-collect sources (printers, badge readers) in phase one; Marcus prioritized by detection value. In your environment, name one "easy but low-value" source and one "harder but high-value" source, and defend the ordering.
  5. The SIEM-vs-data-lake split accepts that real-time correlation runs only over the hot SIEM data. When would that trade-off bite, and how would you decide whether to promote a lake-only source into the SIEM?

Your Turn

Take a small organization you know (or invent one) and reproduce Phases 1–3. (1) Write a prioritized log-source list of at least seven sources, each with a collection method and a one-line detection-value justification, ordered top-down. (2) Pick two raw log formats you have actually seen and write their normalized form in a small common schema. (3) Write three use cases from different rungs of the correlation ladder (one single-event, one threshold, one sequence), each fully specified (sources, trigger logic, severity, response, false-positive risk). For one of them, write the correlation query in SQL, SPL, or KQL. Keep it to two pages. If you cannot name the use case a source serves, that source belongs in your backlog, not your phase one.

Key Takeaways

  • A breach is rarely a failure to collect data; it is usually a failure to correlate it. Meridian had every event of a six-day foothold logged across three silos with no central brain.
  • Collect by detection value, top-down (identity and endpoint first), with an owner and an onboarding method per source. Easy-to-collect ≠ worth-collecting.
  • Normalize to a common schema and UTC, enforce NTP. Sequence correlation across sources is worthless if the sources disagree on field names or on when.
  • Start with a small, well-specified first-ten use-case catalog spanning the correlation ladder, not a flood of vendor defaults. Each use case names its sources, severity, response, and false-positive risk.
  • Tune, don't disable. The two noisiest rules were the most valuable; narrowing their conditions (baselines, allowlists, reputation, thresholds) cut a 220-alert queue to a workable ~35 without creating blind spots. Tuning is standing operations, reviewed weekly.
  • An allowlist is a documented risk acceptance with an owner, a review date, and a compensating control — never a quiet convenience.
  • The SIEM-vs-data-lake split (hot/curated vs. cheap/comprehensive) reconciles real-time detection with retention, hunting, forensics, and compliance — and the trade-off it accepts should be stated explicitly.
  • The whole effort becomes a logging & monitoring standard, the program artifact, and the source of the metrics the board will later see (Chapter 36).