Case Study 1: Meridian's Anomaly-Detection Pilot on Authentication Logs


"We are not buying an AI. We are testing whether one transparent detector, on one log source, catches a class of attack we currently miss — and whether my SOC can work its output without drowning." — Dana Okafor, CISO, Meridian Regional Bank (constructed)

Executive Summary

Meridian's detection program (Chapters 21–22) was good at finding known bad — indicator matches, signatures, correlation rules for specific scenarios. Its structural blind spot was an attacker using valid credentials: phish a password, log in "normally," and generate no signature and no indicator of compromise. To test whether unsupervised behavioral anomaly detection could cover that gap, CISO Dana Okafor authorized an eight-week pilot, run by junior SOC analyst Theo Brandt with engineer Sam Whitfield on the data plumbing. The mandate was deliberately humble: one log source (authentication), a narrow set of high-value entities (service and privileged accounts), the most transparent possible model (a per-feature z-score, not a black box), and a success metric of precision the analysts actually experienced, not the model's abstract accuracy.

This case study follows the pilot end to end: scoping it to dodge the base-rate trap, building a per-entity baseline that resists poisoning, hand-tracing the z-score that fired on a real attack, living through the false-positive load, and writing an honest verdict into the security program. You will see every concept from the chapter — supervised vs. unsupervised, the false-positive economics, baseline contamination as a poisoning risk, explainability, and human-in-the-loop — stop being theory and become operational decisions. The scenario and all figures are constructed for teaching (Tier 3).

Skills applied: scoping an ML detection pilot against the base rate; per-entity baselining; hand-computing z-score anomalies; tuning a threshold to SOC capacity; recognizing and mitigating baseline poisoning; enrichment to raise precision; distinguishing "anomalous" from "malicious"; writing a defensible, limits-honest program section.

Background

Meridian Regional Bank runs the hybrid environment you know: an on-prem Active Directory domain and a legacy core, plus AWS and a Microsoft Entra ID / M365 tenant. Authentication events from all of it already flow into the SIEM that Marcus Reyes's team stood up in Chapter 21, normalized into a common schema. The Chapter 22 detection-engineering work added IoC matching, Sigma rules, and a hunting practice on top.

The gap that prompted the pilot surfaced during a tabletop. Priya Nair, the incident-response lead, posed a scenario: "Suppose an attacker phishes a teller's password — but the teller has only a weak second factor, or the attacker session-hijacks past it. The attacker logs into our systems with a real credential. Walk me through which of our detections fire." The uncomfortable answer was none reliably would, until the attacker did something that matched a known indicator (connected to a known-bad IP, dropped known malware). An attacker disciplined enough to use valid credentials and fresh infrastructure could operate in a detection shadow. That is exactly the gap unsupervised behavioral detection is built to cover — and exactly the gap that no signature, by definition, can.

Dana set three guardrails in the kickoff, each a direct application of the chapter:

"Transparent before clever." The pilot would start with a z-score the team could explain on a whiteboard, not a vendor black box. If a detector cannot justify an alert in an incident review, it does not belong in a regulated bank's SOC.
"Lead generator, never an oracle." Output feeds human triage. No automated action, whether auto-disable or auto-block, would ever be wired to the model. (This guardrail would matter enormously in §34.4's poisoning and evasion threat model.)
"Measure what the analysts feel." Success is precision and alert volume the SOC can sustain, not the model's accuracy on a slide.

The Pilot

Phase 1 — Scoping to beat the base rate

Theo's first instinct, which he caught himself following, was to point the detector at all authentication events, roughly 1,000,000 per day across the bank. Sam stopped him with the §34.3 arithmetic on a napkin: even a strong detector (99% true-positive rate, 1% false-positive rate) run against a million events of which ~100 are malicious raises about 99 true alerts and about 10,000 false ones, a queue that is ~99% false positives. "Run it on everything," Sam said, "and we mute it in week one. Shrink the denominator first."
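Sam's napkin arithmetic is easy to reproduce. The sketch below is illustrative only: the function name `alert_queue` and its parameters are assumptions made for this example, and the inputs are the constructed figures quoted above (1,000,000 events, ~100 malicious, 99% true-positive rate, 1% false-positive rate).

```python
def alert_queue(total_events, malicious, tpr, fpr):
    """Return (true_alerts, false_alerts, precision) for one day of events.

    tpr: fraction of malicious events the detector flags.
    fpr: fraction of benign events the detector flags by mistake.
    """
    benign = total_events - malicious
    true_alerts = malicious * tpr
    false_alerts = benign * fpr
    precision = true_alerts / (true_alerts + false_alerts)
    return true_alerts, false_alerts, precision


# Figures from Sam's napkin: a strong detector pointed at everything.
true_alerts, false_alerts, precision = alert_queue(
    total_events=1_000_000, malicious=100, tpr=0.99, fpr=0.01
)

print(f"true alerts:          {true_alerts:.0f}")
print(f"false alerts:         {false_alerts:.0f}")
print(f"precision:            {precision:.1%}")
print(f"false share of queue: {1 - precision:.1%}")
```

With these inputs the queue holds roughly 99 true alerts against roughly 9,999 false ones, which is where the ~99% false-positive share comes from. Re-running the function with a smaller `total_events` is a quick way to see why shrinking the denominator moves precision more than a modest change in `tpr` does.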

So the pilot's scope became the highest-value, lowest-volume slice:

Scoping decision	Choice	Why
Entities	~80 service accounts + ~150 privileged human accounts (~230 total)	Highest impact if taken over; small enough to keep precision sane
Log source	Authentication only (already in the SIEM)	No new collection; reuse Chapter 21 pipeline
Excluded	The other ~1,650 standard user accounts	Deliberately out of scope for the pilot — base-rate discipline
Pre-filter	Only events for in-scope entities	Shrinks the denominator before the model runs

🛡️ Defender's Lens: Notice that the most important "model tuning" decision was made before any model ran — it was a scoping decision. Narrowing from 1,000,000 events to the few thousand involving 230 high-value accounts attacks the base-rate problem more powerfully than any algorithm improvement could. This is the §34.3 lesson in practice: precision is won by shrinking the haystack, not only by sharpening the needle.

Phase 2 — Choosing features and building a poison-resistant baseline

Theo selected four phase-1 features, each chosen because an account takeover would plausibly disturb it:

Failed-login count per night (a brute-force or credential-spray leaves a spike).
Distinct source IPs per day (a hijacked account often logs in from new places).
Login-hour profile (a service account that runs at 02:00 looking suddenly busy at 14:00 is odd).
First-time-system-access flag (an account touching a system it has never touched before).
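As a rough illustration of how the four features above might be derived from normalized authentication events, here is a minimal sketch. The event schema (`ts`, `src_ip`, `success`, `system`), the function name `extract_features`, and the sample values are all hypothetical; Meridian's actual SIEM schema is not given in this case study.

```python
from collections import Counter
from datetime import datetime


def extract_features(events, known_systems):
    """Summarize one entity's authentication events for one day.

    events: list of dicts with hypothetical keys ts, src_ip, success, system.
    known_systems: set of systems this entity has authenticated to before.
    """
    failed = sum(1 for e in events if not e["success"])
    distinct_ips = len({e["src_ip"] for e in events})
    hour_profile = Counter(e["ts"].hour for e in events)
    # Any attempt against a never-before-seen system counts as first-time access.
    new_systems = {e["system"] for e in events} - known_systems
    return {
        "failed_logins": failed,
        "distinct_source_ips": distinct_ips,
        "login_hours": dict(sorted(hour_profile.items())),
        "first_time_systems": sorted(new_systems),
    }


# Made-up events for a single service account on a single day.
events = [
    {"ts": datetime(2024, 1, 10, 2, 0), "src_ip": "10.0.0.5",
     "success": True, "system": "recon-db"},
    {"ts": datetime(2024, 1, 10, 2, 1), "src_ip": "10.0.0.5",
     "success": False, "system": "recon-db"},
    {"ts": datetime(2024, 1, 10, 14, 30), "src_ip": "10.0.9.77",
     "success": True, "system": "hr-share"},
]

features = extract_features(events, known_systems={"recon-db"})
print(features)
```

Each value in the returned dictionary maps to one feature in the list, so every later alert can name the exact feature that moved.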

For the baseline, Theo proposed a 30-day rolling per-entity window. Sam raised the concern that turns out to be the heart of §34.4: "If an attacker is already in one of these accounts when we start learning, their behavior is baked into the baseline, and we'll call it normal forever. And if they ramp slowly, the rolling window will keep absorbing them." This is data poisoning of an unsupervised model — not by an external dataset, but by the live environment the baseline learns from.

Their mitigation, written into the design:

Freeze and human-review the baseline weekly rather than letting it drift unchecked. A person eyeballs each entity's baseline for suspicious creep before it is accepted.
Use a robust spread where data is spiky. For features with occasional legitimate bursts, prefer median-based statistics so one big night does not inflate $\sigma$ and desensitize the detector.
Cross-check the initial baseline against known-clean periods and against peer accounts, so a single contaminated account stands out against its peer group.
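The second mitigation can be made concrete with a small comparison. The sketch below assumes one common robust choice, the median and the median absolute deviation (MAD) scaled by 1.4826 so that it is comparable to a standard deviation for roughly normal data; the case study does not say which median-based statistic the team used. The baseline with one legitimate burst night is made up for illustration.

```python
import statistics

MAD_SCALE = 1.4826  # makes MAD comparable to sigma for roughly normal data


def classic_z(value, baseline):
    """z-score using the mean and population standard deviation."""
    mu = statistics.mean(baseline)
    sigma = statistics.pstdev(baseline)
    return (value - mu) / sigma


def robust_z(value, baseline):
    """z-like score using the median and scaled MAD instead of mean and sigma."""
    med = statistics.median(baseline)
    mad = statistics.median([abs(x - med) for x in baseline])
    if mad == 0:
        # Perfectly flat baseline: any departure is infinitely unusual.
        return 0.0 if value == med else float("inf")
    return (value - med) / (MAD_SCALE * mad)


# Hypothetical baseline: nine ordinary nights plus one legitimate burst of 40.
bursty_baseline = [2, 3, 2, 4, 3, 2, 3, 4, 2, 40]
tonight = 9

z_classic = classic_z(tonight, bursty_baseline)
z_robust = robust_z(tonight, bursty_baseline)
print(f"classic z = {z_classic:.2f}")
print(f"robust  z = {z_robust:.2f}")
```

In this made-up case the single burst inflates the standard deviation enough that a night of 9 failures scores well under 1 on the classic z-score, while the median-based score stays above 3. That is the desensitization the design note warns about.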

⚠️ Common Pitfall: A naive "self-tuning" anomaly detector that continuously re-learns normal from whatever it sees is the easiest possible target for a patient attacker: every small malicious increment is folded into tomorrow's notion of normal. The fix is not cleverer math; it is slower, vetted adaptation and human-reviewed baselines — provenance and validation applied to the data the model learns from.
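A short simulation shows why the self-tuning design is so exposed. Everything here is constructed: the clean ten-night baseline reuses the failed-login values this case study traces by hand in Phase 3, the attacker's slow ramp is invented, and the threshold of 3 is the one shown in the pilot's alert output. The rolling detector re-learns from every night it sees; the frozen detector keeps the vetted baseline.

```python
import statistics
from collections import deque

THRESHOLD = 3.0


def z_score(value, baseline):
    """Population z-score of value against the baseline."""
    mu = statistics.mean(baseline)
    sigma = statistics.pstdev(baseline)
    return (value - mu) / sigma


clean = [2, 3, 2, 4, 3, 2, 3, 4, 2, 5]   # vetted baseline nights
ramp = [4, 5, 5, 6, 6, 7, 7, 8, 8, 9]    # hypothetical slow attacker ramp

rolling = deque(clean, maxlen=len(clean))  # self-tuning 10-night window
rolling_z, frozen_z = [], []

for value in ramp:
    rolling_z.append(z_score(value, list(rolling)))
    frozen_z.append(z_score(value, clean))
    rolling.append(value)  # tonight's behavior becomes part of "normal"

rolling_alerts = [n for n, z in enumerate(rolling_z, start=1) if z >= THRESHOLD]
frozen_alerts = [n for n, z in enumerate(frozen_z, start=1) if z >= THRESHOLD]

print("max rolling z:        ", round(max(rolling_z), 2))
print("rolling alert nights: ", rolling_alerts)
print("frozen alert nights:  ", frozen_alerts)
```

Under these assumptions the rolling window never reaches the threshold, even on the final night of 9 failures, because each increment has already raised both the mean and the spread. Against the frozen baseline the same ramp crosses the threshold partway through. A faster ramp or a longer window would change the exact numbers, not the shape of the problem.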

Phase 3 — The z-score that fired (hand-traced)

Three weeks in, the detector flagged the service account svc-reconcile, which runs the nightly account-reconciliation job and normally fails to authenticate a small, steady number of times against a flaky downstream system. Its failed-login baseline over the prior ten nights:

$$x = [\,2,\ 3,\ 2,\ 4,\ 3,\ 2,\ 3,\ 4,\ 2,\ 5\,]$$

Theo traced the arithmetic the same way you can:

Sum $= 30$, so the mean $\mu = 30/10 = 3.0$.
Deviations from the mean: $-1, 0, -1, 1, 0, -1, 0, 1, -1, 2$; squared: $1,0,1,1,0,1,0,1,1,4$; sum $= 10$.
Population variance $= 10/10 = 1.0$; standard deviation $\sigma = \sqrt{1.0} = 1.0$.

That night, svc-reconcile recorded 9 failed logins:

$$z = \frac{9 - 3.0}{1.0} = 6.0.$$

Six standard deviations above its own normal — far past the pilot's tuned threshold. The alert carried its own explanation, which is the whole point of starting transparent: "svc-reconcile failed-login z = 6.0 (9 tonight vs. baseline mean 3.0, sd 1.0)." An analyst could read that and immediately understand what the model was claiming.

  ALERT: svc-reconcile  feature=failed_logins_per_night
  baseline (10 nights): [2,3,2,4,3,2,3,4,2,5]   mean=3.0  sd=1.0
  tonight: 9            z = (9-3.0)/1.0 = 6.0   >= threshold(3)  ANOMALY
  enrichment: account=service  privilege=high  recent_change_ticket=NONE
  -> escalate to Tier 2 (high-value entity, no change ticket explains it)
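The hand trace can be checked in a few lines. This is a minimal sketch using Python's standard `statistics` module, not the pilot's actual implementation; it uses the population standard deviation (`pstdev`) because the trace divides by 10 rather than 9, and the variable names are chosen for this example.

```python
import statistics

baseline = [2, 3, 2, 4, 3, 2, 3, 4, 2, 5]  # prior ten nights of failed logins
tonight = 9
threshold = 3.0  # threshold shown in the alert text

mu = statistics.fmean(baseline)            # 30 / 10
sigma = statistics.pstdev(baseline, mu)    # sqrt(10 / 10), population form
z = (tonight - mu) / sigma

# Build the self-explaining alert line quoted in the text.
explanation = (
    f"svc-reconcile failed-login z = {z:.1f} "
    f"({tonight} tonight vs. baseline mean {mu:.1f}, sd {sigma:.1f})"
)
print(explanation)
print("ANOMALY" if z >= threshold else "within baseline")
```

Swapping `pstdev` for the sample form `stdev` would divide by 9 instead of 10 and give a slightly lower score (about 5.7), still far past the threshold. Whichever form a team picks, the alert text should use the same one so the numbers an analyst sees can be reproduced by hand.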

Theo did not stop at the anomaly. Following the right half of the pipeline (Figure 34.2), he enriched: was there a change ticket explaining new behavior? (No.) Was the account high-value? (Yes: a service account with access to reconciliation data.) He pulled the successful-auth events and source IPs and found logins from a new IP shortly after the failures. Investigation confirmed it: an attacker who had earlier phished a different employee had discovered the svc-reconcile account (whose credential, per the Chapter 19 finding, had never been rotated) and was brute-forcing its password. The anomaly was a real, weak signal of a real attack: exactly what the pilot was meant to find, and exactly what no signature would have caught.

🔄 Check Your Understanding: The alert fired on a single feature (failed logins). A full UEBA system would have fused this with the new source IP and any first-time-system access into one risk score. Would fusion have made this particular detection more confident, and what would it have cost in explainability? (Hint: weigh the §34.2 trade — fused scores are stronger but harder to justify.)

Phase 4 — Living with the false positives

The pilot was not all clean wins, and the honest part is the instructive part. Over eight weeks the detector surfaced 140 candidate anomalies. The large majority were benign:

A new nightly batch job Sam's own team had deployed shifted a service account's login-hour profile.
A service account legitimately migrated to a new host, changing its source IPs overnight.
An administrator working an unannounced maintenance window logged in at an odd hour from a new location — legitimate, but it should have been ticketed.
A dormant service account suddenly authenticated; investigation showed a forgotten but legitimate integration had been re-enabled.

None of these were attacks. All of them were, correctly, anomalies — which is precisely the threshold concept made painful: anomalous is not malicious. The detector did its job (flagging the unusual); the humans did theirs (judging intent). The false-positive load taught the team a concrete improvement: many benign anomalies coincided with a change ticket or a known maintenance window. So they added an enrichment step — automatically cross-reference each anomaly against the change-management system and the maintenance calendar before alerting — which cut the noise roughly in half without losing the three real findings. That single enrichment did more for the analysts' experience than any model change would have, illustrating §34.3's lever: enrich to raise the true-positive yield the team can afford to surface.
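The enrichment step might look something like the sketch below. The record layouts, the ticket ID, the account `svc-batch`, and the function names are all hypothetical, since the case study does not describe Meridian's change-management or calendar interfaces. One deliberate design assumption: explained anomalies are annotated and routed to a low-priority list rather than silently dropped, so a forged or overly broad ticket cannot make an alert vanish.

```python
from datetime import datetime


def find_explanation(anomaly, change_tickets, maintenance_windows):
    """Return a short reason if a ticket or window covers the anomaly, else None."""
    when = anomaly["ts"]
    for ticket in change_tickets:
        same_account = ticket["account"] == anomaly["account"]
        if same_account and ticket["start"] <= when <= ticket["end"]:
            return f"change ticket {ticket['id']}"
    for window in maintenance_windows:
        in_scope = anomaly["account"] in window["accounts"]
        if in_scope and window["start"] <= when <= window["end"]:
            return f"maintenance window {window['name']}"
    return None


def triage(anomalies, change_tickets, maintenance_windows):
    """Split anomalies into (escalate, annotated) without discarding any."""
    escalate, annotated = [], []
    for anomaly in anomalies:
        reason = find_explanation(anomaly, change_tickets, maintenance_windows)
        if reason is None:
            escalate.append(anomaly)
        else:
            annotated.append({**anomaly, "explained_by": reason})
    return escalate, annotated


# Made-up sample data for illustration.
tickets = [{"id": "CHG-1001", "account": "svc-batch",
            "start": datetime(2024, 1, 10, 0, 0),
            "end": datetime(2024, 1, 11, 0, 0)}]
windows = []
anomalies = [
    {"account": "svc-batch", "ts": datetime(2024, 1, 10, 14, 0),
     "feature": "login_hour"},
    {"account": "svc-reconcile", "ts": datetime(2024, 1, 10, 2, 0),
     "feature": "failed_logins_per_night"},
]

escalate, annotated = triage(anomalies, tickets, windows)
for a in escalate:
    print("ESCALATE:    ", a["account"], a["feature"])
for a in annotated:
    print("LOW PRIORITY:", a["account"], "explained by", a["explained_by"])
```

In this sample the batch-job anomaly is matched to its change ticket and demoted, while the svc-reconcile anomaly has no covering ticket and goes to an analyst.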

📟 War Story: The clarifying moment came when a board member, briefed on the pilot, asked Dana why the bank didn't "just turn on the AI and let it block the bad logins automatically." Dana's answer became a teaching line inside the SOC: "Because three of our 140 alerts were real and 137 were a backup job, a server migration, and a forgotten integration. If we'd auto-blocked, we would have broken reconciliation, taken down a migrated service, and disabled a working integration — to catch attacks a human caught anyway. The model finds leads. People close cases. Wiring a corruptible model to an automatic action is how you turn a detection tool into a self-inflicted denial of service — or a target an attacker pokes until it blocks something you need." That is guardrail #2, defended in the boardroom.

Phase 5 — The honest verdict, written into the program

At eight weeks, Theo wrote the pilot up for the program document with a verdict Dana insisted be honest to a fault:

Verdict: Behavioral anomaly detection on authentication logs is useful as a lead generator for high-value accounts and worthless as an unattended oracle. It caught three real findings the existing tooling structurally could not (an attack the SOC actioned, plus two process gaps worth knowing about). It was only viable because (a) scope was narrow, defeating the base-rate trap; (b) the baseline was human-reviewed weekly, blunting gradual poisoning; (c) every alert was explainable; and (d) a human stayed in the loop with no automated action wired to the model. Recommendation: graduate the pilot to production for service and privileged accounts only, keep the guardrails, add change-ticket enrichment as standard, and do not expand to all employees until precision at the larger scope is demonstrated; the base rate will be far less forgiving there.

This verdict, the design, and the guardrails became the Analytics and Behavioral Detection section of Meridian's security program (this chapter's Project Checkpoint), feeding the Chapter 38 capstone.

🔗 Connection: The pilot is the operational payoff of Chapters 21 (the SIEM that supplies the telemetry) and 22 (the detection-and-hunting program the anomaly detector complements). The anomaly layer does not replace IoC matching and Sigma rules; it covers the behavioral gap they cannot, and both feed the same triage queue. That is defense in depth applied to the detection function itself: known-bad signatures plus unknown-but-unusual behavior, with a human at the center of both.

Discussion Questions

The pilot's most consequential decision — scoping to 230 accounts — was made before any model ran. Argue why scoping is "model tuning," and identify another security problem where shrinking the denominator would beat improving the algorithm.
Sam insisted the baseline be human-reviewed weekly rather than continuously self-tuning. What attack does continuous self-tuning invite, and what is the cost of the weekly-review mitigation (think staffing and staleness)?
The team started with a transparent z-score instead of a more powerful black-box model. When, if ever, would Meridian be justified in trading explainability for accuracy here? What would have to be true first?
Of the 140 anomalies, 137 were benign. Was the pilot therefore a failure? Defend your answer using the difference between "anomalous" and "malicious" and the role of enrichment.
The board member wanted auto-blocking. Beyond the operational risk of breaking legitimate activity, what adversarial-ML risk (from §34.4) would auto-wiring a model to an action create?

Your Turn

Take an environment you can model (your lab, your team, or a constructed small business) and design a two-week anomaly-detection pilot for one log source. Specify: (1) the scope and why it dodges the base rate; (2) two or three features and what each would reveal; (3) the baseline window and how you would keep it from being poisoned; (4) the threshold and how you would tune it to a realistic analyst capacity; and (5) the guardrails (lead generator vs. oracle; human in the loop). Then hand-compute a z-score on a made-up baseline of 8–10 points plus one test value, and write the alert the way Theo did — with its own explanation. Finish with an honest one-paragraph verdict of what your pilot would and would not catch.

Key Takeaways

Scope beats algorithm. Narrowing an anomaly detector to high-value, low-volume entities defeats the base-rate trap more powerfully than any model improvement — precision is won by shrinking the haystack.
Per-entity baselines adapt the threshold to each account's own history, which is exactly what makes an attacker abusing a specific account stand out.
The baseline is a trust boundary. A continuously self-tuning baseline invites gradual data poisoning; freeze it, have a human review it, use robust statistics, and check it against peers.
Transparent first. A z-score that explains itself on a whiteboard is defensible in an incident review and to a regulator; start there and earn the right to anything more complex.
Anomalous ≠ malicious. The detector finds leads; humans close cases. 137 benign anomalies out of 140 is the normal, correct behavior of an honest unsupervised detector — enrichment, not the model, fixes the analyst's experience.
Lead generator, never an oracle. Wiring a corruptible model to an automatic action risks both self-inflicted denial of service and an adversarial-ML attack surface; keep a human in the loop.
A first ML pilot's verdict should be measured in precision and volume the SOC experiences, written honestly — including what it cannot catch.