Case Study 1: The Key in the Repository

"We didn't get phished. We didn't get exploited. We just left a key under the mat for six years and someone finally picked it up." — Sam Whitfield, Security Engineer, Meridian Regional Bank (constructed)

Executive Summary

A live Amazon Web Services (AWS) access key belonging to one of Meridian Regional Bank's oldest automated jobs was discovered on a contractor's personal laptop — not because the contractor was malicious, but because a routine clone of an internal Git repository carried the key out of the bank's control and into a third-party developer tool. No human credential was phished, no software was exploited, and no firewall failed. A secret had quietly sprawled into committed source code years earlier and finally leaked. This case study follows engineer Sam Whitfield and the SOC, led by Marcus Reyes, through detection, scoping, the only response that actually works (rotation), and the remediation that replaced the hard-coded key with a vault and — better still — eliminated the secret entirely by adopting workload identity. You will watch the chapter's terms become operational: secret, secret sprawl, secret leak, secrets vault, dynamic secrets, service account, workload identity, and the iron rule that a leaked secret must be rotated, not deleted. The scenario and all figures are constructed for teaching (Tier 3).

Skills applied: secret leak detection and scoping; reading cloud audit logs for misuse; emergency credential rotation; secrets-vault migration; workload-identity (IAM role) redesign; writing a secrets-management standard; distinguishing "delete the file" theater from real remediation.

Background

Meridian's environment, you will recall, is a museum and a startup in the same building: a legacy on-prem core, a twenty-year-old Active Directory domain, and a five-year-old AWS footprint bolted on top. The job at the center of this case lived in that bolted-on cloud layer. In 2019, an engineer named Devraj — long since moved to another team — needed a nightly process to copy a database snapshot from the on-prem world into an AWS bucket for disaster-recovery purposes. He wrote a short Python script, gave it an AWS access key so it could write to the bucket, and committed the script, key and all, to the bank's internal GitLab. The job ran flawlessly every night at 02:00 for six years.

The key was a service account credential in everything but name: a long-lived secret used by an automated process, with no human attached. And it had every disease the chapter warns about. It was static — the same value for six years. It was over-privileged — to save time in 2019, Devraj had attached a broad policy that granted not just write access to the one backup bucket but read access to every bucket in the account, including buckets that later came to hold exported reporting data with customer information. It had no owner after Devraj moved on. And it had sprawled: committed to source control, the key now existed in the repository's full history, in every developer's local clone, and in the continuous-integration cache that had built the job dozens of times.

🔗 Connection: Notice that nothing here is a software vulnerability in the Chapter 1 sense — there is no missing patch, no injectable input. The weakness is entirely a matter of machine identity hygiene: a secret stored where it should never live, scoped far beyond its need, owned by no one. This is why §20.1 insists that securing human authentication (Chapters 16–19) leaves you only half-defended if machine identity is a sprawl of hard-coded keys.

The Analysis

Phase 1 — Detection: how the leak surfaced

The leak did not announce itself. It surfaced through two unrelated threads that the SOC connected.

The first thread was a routine control Meridian had only recently switched on. As part of an early secrets-hygiene effort, Sam had enabled secret scanning across the bank's GitLab repositories, covering both current state and full history. The scanner, running the kind of pattern-matching the chapter describes, flagged the 2019 backup script: an AWS access key ID matching AKIA followed by sixteen uppercase letters or digits, sitting in plain sight in committed code. By itself this was an internal finding: the repository was private, so the immediate exposure was "every Meridian developer," which is bad but bounded.

[secret-scan] repo: ops/dr-backup   ref: history (commit a91f… , 2019-03-12)
  file: backup_snapshot.py:14
  finding: aws_access_key_id  AKIA................   (value redacted in report)
  severity: HIGH  — long-lived AWS key in committed source; in history + clones
  status: UNRESOLVED — assign owner, ROTATE, migrate to vault/role
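
The pattern match behind a finding like this can be sketched in a few lines of Python. This is a minimal illustration, not the scanner Meridian ran: the function name `scan_text` and the redaction format are assumptions, and the sample key is AWS's published documentation example, not a real credential. A real scanner would also walk every commit in history rather than a single string.

```python
import re

# Long-lived IAM user access key IDs start with "AKIA" and are 20 characters
# long: the prefix plus 16 uppercase letters or digits.
AWS_KEY_ID = re.compile(r"(?<![A-Z0-9])AKIA[A-Z0-9]{16}(?![A-Z0-9])")


def scan_text(path, text):
    """Return (path, line_number, redacted_key_id) for each key-shaped string."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AWS_KEY_ID.finditer(line):
            key_id = match.group(0)
            # Redact in the report so the scanner does not spread the secret further.
            findings.append((path, lineno, key_id[:4] + "*" * 16))
    return findings


# Hypothetical file contents; the key is AWS's documented example value.
sample = 'import boto3\naws_key = "AKIAIOSFODNN7EXAMPLE"\nprint("backup")\n'
print(scan_text("backup_snapshot.py", sample))
```

Note that the sketch only detects the key; as the status line in the report says, the finding still has to be assigned, rotated, and migrated.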

The second thread arrived three days later from the cloud side. Marcus's SOC had a CloudTrail-fed detection (the kind of behavioral rule §20.5 recommends and Chapter 21 will formalize in the SIEM) watching for AWS access keys used in ways that did not match their normal pattern. The backup key's normal pattern was unmistakable and boring: once a night, at 02:00 UTC, from a known on-prem egress address, calling exactly two API actions against exactly one bucket. The alert that fired was anything but boring.

ALERT  aws.key_behavior_anomaly  severity: HIGH
  principal: AKIA....  (svc: dr-backup job)
  baseline:  02:00 UTC daily; src=198.51.100.10 (on-prem egress);
             actions={s3:PutObject, s3:ListBucket} on dr-backup-bucket
  observed:  14:22 UTC; src=203.0.113.77 (unrecognized, non-corporate);
             actions={s3:ListAllMyBuckets, sts:GetCallerIdentity}
  note: key used outside scheduled window from new source; reconnaissance-shaped calls

Read that alert as Theo and Marcus did. A key that has, for six years, done one tiny job at one time from one place is suddenly being used in the middle of the afternoon, from an address nobody recognizes, to enumerate the account — ListAllMyBuckets and GetCallerIdentity are the calls an attacker makes first to answer "what is this key, and what can it see?" This is the §20.1 payoff stated as an incident: machine behavior is supposed to be boring, so a workload doing something it has never done is a stark, high-confidence signal. A human analyst staring at a person's logins might agonize over whether a 14:22 login from a new city is a traveling employee. There is no such ambiguity here. The backup job does not travel.
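
The comparison the alert performs can be sketched as a check of one API call against a stored baseline. This is an illustrative sketch only: the `Baseline` class, the `deviations` function, and the calendar date are assumptions, while the hour, addresses, and action names come from the alert above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Baseline:
    hour_utc: int          # hour in which the job normally runs
    source_ips: frozenset  # known egress addresses
    actions: frozenset     # API actions the job normally calls


def deviations(baseline, event_time, source_ip, action):
    """List the ways a single API call departs from the key's baseline."""
    reasons = []
    if event_time.astimezone(timezone.utc).hour != baseline.hour_utc:
        reasons.append("outside scheduled window")
    if source_ip not in baseline.source_ips:
        reasons.append("unrecognized source")
    if action not in baseline.actions:
        reasons.append("action never seen before")
    return reasons


dr_backup = Baseline(
    hour_utc=2,
    source_ips=frozenset({"198.51.100.10"}),
    actions=frozenset({"s3:PutObject", "s3:ListBucket"}),
)

# The calendar date is illustrative; the source gives only the time of day.
observed = datetime(2026, 6, 1, 14, 22, tzinfo=timezone.utc)
print(deviations(dr_backup, observed, "203.0.113.77", "s3:ListAllMyBuckets"))
```

A call that matches the baseline on all three dimensions returns an empty list. The afternoon call deviates on every one of them, which is what makes the signal high-confidence.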

Phase 2 — Scoping: how far did it go?

With two threads joined — a known-leaked key and that key now being exercised by an unknown party — Priya's incident-response instincts took over. The scoping question is always the same: what could this identity reach, and what did it actually touch?

What it could reach was the bad news. Because the 2019 policy granted read access to every bucket, the leaked key could read not only the disaster-recovery snapshots but the reporting-export buckets that, in the years since, had accumulated files containing customer account data. The blast radius of the over-privilege was far larger than the job's actual function — a textbook illustration of why least privilege for machines (§20.3) is not bureaucratic nicety but blast-radius control.

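What it actually touched was knowable, because the key's API calls were logged. (In AWS, object-level S3 calls such as GetObject are CloudTrail data events, recorded only when data-event logging is enabled; the timeline below implies Meridian had it on for these buckets.) The team pulled the full CloudTrail history for the principal and built a timeline: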

14:22:01  sts:GetCallerIdentity        src=203.0.113.77   (who am I?)
14:22:09  s3:ListAllMyBuckets          src=203.0.113.77   (what can I see?)
14:23:40  s3:ListBucket  reporting-exports  src=203.0.113.77   (probing a juicy bucket)
14:24:05  s3:GetObject   reporting-exports/2026-05-statements.csv.gz   src=203.0.113.77
[timeline ends; key disabled at 14:26 (see Phase 3)]

The honest finding: the attacker had listed the reporting-export bucket and retrieved one object before the key was killed — a real, if narrow, data exposure that triggered the bank's breach-assessment process (the legal and notification machinery is Chapter 28's territory, and forensic confirmation of exactly what left is Chapter 25's). The point for this chapter is that the audit trail of a single machine identity turned "we may have lost something unknowable" into "we know precisely which one object was read, by which key, from where, and when." That precision is a direct dividend of treating machine access as something to be logged and watched.
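
Building a per-credential timeline like the one above amounts to filtering log records by access key ID and sorting by time. The sketch below assumes the records are already loaded as dictionaries using standard CloudTrail field names (`eventTime`, `eventName`, `sourceIPAddress`, `userIdentity.accessKeyId`, `requestParameters`). The function name, the sample records, the calendar date, and the key IDs are illustrative, not Meridian's data.

```python
from datetime import datetime


def build_timeline(records, access_key_id):
    """Return time-sorted (time, event, source_ip, target) rows for one access key."""
    rows = []
    for rec in records:
        if rec.get("userIdentity", {}).get("accessKeyId") != access_key_id:
            continue  # a different principal; not part of this timeline
        when = datetime.fromisoformat(rec["eventTime"].replace("Z", "+00:00"))
        params = rec.get("requestParameters") or {}
        target = "/".join(
            p for p in (params.get("bucketName"), params.get("key")) if p
        )
        rows.append((when, rec["eventName"], rec["sourceIPAddress"], target))
    return sorted(rows)


KEY = "AKIAIOSFODNN7EXAMPLE"  # AWS's documented example key ID, used as a placeholder
sample_records = [
    {"eventTime": "2026-06-01T14:24:05Z", "eventName": "GetObject",
     "sourceIPAddress": "203.0.113.77", "userIdentity": {"accessKeyId": KEY},
     "requestParameters": {"bucketName": "reporting-exports",
                           "key": "2026-05-statements.csv.gz"}},
    {"eventTime": "2026-06-01T14:22:01Z", "eventName": "GetCallerIdentity",
     "sourceIPAddress": "203.0.113.77", "userIdentity": {"accessKeyId": KEY},
     "requestParameters": None},
    # CloudTrail records the s3:ListAllMyBuckets permission as the ListBuckets call.
    {"eventTime": "2026-06-01T14:22:09Z", "eventName": "ListBuckets",
     "sourceIPAddress": "203.0.113.77", "userIdentity": {"accessKeyId": KEY},
     "requestParameters": None},
    {"eventTime": "2026-06-01T14:23:00Z", "eventName": "PutObject",
     "sourceIPAddress": "198.51.100.20",
     "userIdentity": {"accessKeyId": "AKIAI44QH8DHBEXAMPLE"},
     "requestParameters": {"bucketName": "other-bucket", "key": "x"}},
]

for when, event, src, target in build_timeline(sample_records, KEY):
    print(when.strftime("%H:%M:%S"), event, src, target)
```

The unrelated fourth record is filtered out, and the remaining rows come back in chronological order regardless of how the log delivered them.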

⚠️ Common Pitfall: The team's first impulse, voiced by a junior responder, was "delete the key from the repo and force-push to clean the history — then we're safe." Sam stopped it cold. Rewriting history does not invalidate the key; it only hides the evidence and breaks everyone's clones. The key was still valid and still in the attacker's hands. The only action that severs the attacker's access is to rotate — disable the leaked key and issue a new credential. Cleaning history is worth doing afterward for hygiene, but it is not the response. The leak is the alert; rotation is the response.

Phase 3 — Containment and the only response that works

At 14:26, four minutes after the anomalous calls began, Sam disabled the leaked access key in AWS. That single action — invalidating the credential — instantly ended the attacker's access, regardless of how many copies of the key existed on how many machines. This is the chapter's iron rule in practice: a leaked secret cannot be un-leaked, so you make the leaked value worthless by rotating it.

But disabling the key broke the legitimate backup job too, which is exactly why rotation of unmanaged static secrets is the dreaded fire drill the chapter warns about. Sam had to provision a new mechanism for the job (see Phase 4), reconfigure the job, and confirm that the next backup ran, all under incident pressure. Had the secret been short-lived and vault-managed from the start, "rotate it" would have been routine rather than an emergency. The cost of the fire drill became Sam's strongest argument for the remediation that followed.

Containment actions, in order:

1. Disable the leaked key (14:26): severs attacker access immediately.
2. Preserve evidence: export the full CloudTrail history for the principal before anything changes, establishing exactly what was accessed (hand off to forensics/IR, Chapters 24–25).
3. Hunt for the same pattern elsewhere: Marcus's SOC swept for any other access key being used outside its baseline, on the reasonable theory that an attacker who found one leaked key may have found others. (None were active, but the sweep is the right reflex.)
4. Open the breach-assessment process: one customer-data object was read; GRC and legal engage.
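
The first containment action can be sketched with the IAM client from boto3, whose `update_access_key` and `list_access_keys` calls take the keyword arguments shown. The sketch accepts the client as a parameter so it can be exercised offline. The stub class, the user name `svc-dr-backup`, and the key ID are hypothetical; the source does not say which tool or IAM user Sam used.

```python
def disable_access_key(iam_client, user_name, access_key_id):
    """Mark one IAM access key Inactive and confirm the change took effect."""
    iam_client.update_access_key(
        UserName=user_name, AccessKeyId=access_key_id, Status="Inactive"
    )
    keys = iam_client.list_access_keys(UserName=user_name)["AccessKeyMetadata"]
    statuses = {k["AccessKeyId"]: k["Status"] for k in keys}
    return statuses.get(access_key_id) == "Inactive"


class StubIam:
    """Offline stand-in exposing the two IAM client methods used above."""

    def __init__(self, keys):
        self.keys = dict(keys)  # access key ID -> status

    def update_access_key(self, UserName, AccessKeyId, Status):
        self.keys[AccessKeyId] = Status

    def list_access_keys(self, UserName):
        return {"AccessKeyMetadata": [
            {"AccessKeyId": k, "Status": s} for k, s in self.keys.items()
        ]}


# Hypothetical user name and placeholder key ID.
stub = StubIam({"AKIAIOSFODNN7EXAMPLE": "Active"})
print(disable_access_key(stub, "svc-dr-backup", "AKIAIOSFODNN7EXAMPLE"))

# Against a real account, assuming boto3 is installed and the caller is authorized:
#   import boto3
#   disable_access_key(boto3.client("iam"), "svc-dr-backup", "<leaked key id>")
```

Setting the key to Inactive rather than deleting it outright keeps the key's record in place while the investigation continues, and the API calls it already made remain in the audit trail either way.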

Phase 4 — Remediation: from hard-coded key to no key at all

Now the engineering. Dana's instruction was the one from §20.6: "Fix it so this can't happen again." Sam had three escalating options, and chose the strongest the platform allowed.

Option A — vault the secret. Move the key out of source and into a secrets vault; have the job fetch it at runtime. This eliminates the sprawl (no copy in code, clones, or CI) and adds audit and rotation. It is a large improvement.

Option B — dynamic secret. Have the vault issue a short-lived, scoped credential each night, valid just long enough for the job to run. Now even a stolen credential expires within the hour. Better still.

Option C — eliminate the secret. Re-architect the job to run on AWS compute with an attached IAM role, so it retrieves temporary, auto-rotating credentials from the platform and holds no key at all. There is nothing to hard-code, nothing to leak, nothing to rotate manually. Best, because the most dangerous secret is the one that does not exist.

Sam chose Option C for the backup job, which now runs as a scheduled task on a role-bearing compute resource whose role is scoped, this time correctly, to write access on the disaster-recovery bucket and nothing else. He used Options A and B (a vault, with dynamic secrets where supported) for the cases where a true secret was unavoidable, such as an API key for an external payment vendor that does not support federated identity.

BEFORE (2019)                          AFTER (2026)
─────────────                          ───────────
backup_snapshot.py:                    backup job runs on role-bearing compute:
  aws_key = "AKIA....EXAMPLE"   <-- in   - NO key in code (workload identity)
  (read access to ALL buckets)            - temp creds auto-rotated by platform
  static, 6 yrs, no owner, in git         - scoped: write-only to dr-backup-bucket
                                          - owner: Platform team; reviewed quarterly
vendor API key (no federation):        vendor API key:
  (would also be hard-coded)              - stored in vault, fetched at runtime
                                          - dynamic where supported; short TTL
                                          - access logged; rotation automated
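
The "scoped: write-only" line in the AFTER column could be expressed as an IAM policy along the following lines. This is a sketch under assumptions: the source gives neither the 2019 policy nor the 2026 one, so both documents here are illustrative reconstructions, and `is_write_only` is a deliberately naive check, not a policy analyzer.

```python
import json

BUCKET = "dr-backup-bucket"  # bucket name as it appears in the alert baseline

# Illustrative least-privilege policy for the role attached to the backup job.
write_only_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "WriteBackupObjectsOnly",
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": [f"arn:aws:s3:::{BUCKET}/*"],
    }],
}

# Illustrative stand-in for the broad 2019 policy (exact contents are not in the source).
broad_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:Get*", "s3:List*"],
        "Resource": "*",
    }],
}


def _as_list(value):
    return value if isinstance(value, list) else [value]


def is_write_only(policy, bucket):
    """Naive check: every Allow grants only s3:PutObject on objects in one bucket."""
    prefix = f"arn:aws:s3:::{bucket}/"
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow":
            continue
        if any(action != "s3:PutObject" for action in _as_list(stmt["Action"])):
            return False
        if any(not res.startswith(prefix) for res in _as_list(stmt["Resource"])):
            return False
    return True


print(json.dumps(write_only_policy, indent=2))
print(is_write_only(write_only_policy, BUCKET), is_write_only(broad_policy, BUCKET))
```

If the job genuinely needs to list the bucket, as its old baseline suggests, `s3:ListBucket` on the bucket ARN itself would be a separate, equally narrow statement. With a role attached to the compute resource, an AWS SDK client created without explicit keys resolves temporary credentials from the platform, so the script carries no secret to scan for.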

🛡️ Defender's Lens: The before/after captures the chapter's whole thesis. The "before" column is a secret: static, sprawled, over-privileged, ownerless. The "after" column for the backup job has no secret at all — the attack that started this case is now literally impossible to repeat, because there is no key to find. For the one credential that could not be eliminated (the vendor key), the fallback is the next-best thing: vaulted, short-lived, scoped, logged. Defense in depth means you reach for "no secret," and where you cannot, you fall back to "best-managed secret."

Phase 5 — Turning the incident into a standard

The last move is the one that scales a single fix into program-wide protection: Sam wrote Meridian's secrets-management standard (the §20.6 nine-rule table) and the SOC permanently added the machine behavioral detections that caught the leak's use:

service account / access key used outside its baseline window, source, or geography;
any service account logging in interactively;
a sudden spike in vault secret requests, or a request for a secret a workload has never fetched;
first use of a long-dormant credential.
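
Two of these rules, the dormant credential and the never-before-fetched secret, reduce to simple comparisons against stored history. The sketch below is illustrative: the 90-day threshold, the function names, the workload name, and the vault paths are all assumptions, and a production rule would live in the SIEM rather than in a script.

```python
from datetime import datetime, timedelta, timezone

DORMANT_AFTER = timedelta(days=90)  # threshold is an assumption; tune per environment


def is_dormant_reuse(last_used, now, threshold=DORMANT_AFTER):
    """True when a credential is used again after a long silence."""
    return (now - last_used) > threshold


def is_new_secret_request(history, workload, secret_path):
    """True when a workload asks the vault for a secret it has never fetched."""
    return secret_path not in history.get(workload, set())


now = datetime(2026, 6, 1, 14, 22, tzinfo=timezone.utc)  # illustrative timestamp
print(is_dormant_reuse(datetime(2025, 11, 1, tzinfo=timezone.utc), now))

# Hypothetical vault request history: workload name -> secret paths fetched before.
history = {"dr-backup": {"vendor/payments/api-key"}}
print(is_new_secret_request(history, "dr-backup", "db/core/admin-password"))
```

Both checks lean on the property that made the original alert reliable: a workload's history is narrow and repetitive, so anything outside it is worth a look.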

Elena mapped the standard to PCI-DSS's key-protection and access-restriction requirements so the work also served the next audit (Chapter 28). And the asset the bank lacked when the case began, an inventory that answers "what machine identities do we have, what can each access, where is each secret, and when does it expire?", now exists where before there was only a six-year-old script nobody remembered.

Discussion Questions

The leak was caught by two controls — secret scanning of the repo and behavioral detection of the key's use. Which would you implement first if you could only build one this quarter, and why? What does each catch that the other misses?
Sam chose to eliminate the secret (Option C) for the backup job but vault it (Option A/B) for the vendor key. Defend that asymmetry. When is "best-managed secret" genuinely the ceiling rather than a compromise?
The over-privileged 2019 policy turned a backup key into a customer-data exposure. Argue why least privilege for machines deserves at least as much rigor as least privilege for humans, even though machines "won't abuse access on purpose."
The team explicitly rejected "delete the commit and force-push" as the response. In your own words, why is that step worse than useless as a containment action, even though it feels productive?
Where in this case did the audit trail of a single machine identity change the outcome? What would the investigation have looked like with no per-credential logging?

Your Turn

Take a service or automated job you know (a backup, a deployment pipeline, a monitoring agent, a scheduled report) and trace its machine identity. (a) What secret does it use, and where does that secret live — in code, in config, in an env var, in a vault, or nowhere because it uses workload identity? (b) What can that identity access, and is it scoped to need? (c) If its credential leaked today, what is your rotation plan, and how long would it take? (d) Could you eliminate the secret entirely with a platform-provided identity? Write a one-page "before/after" like the diagram in Phase 4. If you cannot answer (a) for some job in your environment, you have just found the same blind spot Meridian had.

Key Takeaways

A breach can start with no phishing and no exploit — only a secret that sprawled into code and finally leaked. Machine-identity hygiene is its own attack surface.
Machine behavior is boring, so a credential used outside its baseline (new time, source, or action) is a high-confidence detection — the §20.1 insight as an incident.
The only response to a confirmed leak is to rotate (disable + reissue). Deleting the commit or rewriting history hides evidence and breaks clones but leaves the secret valid in attackers' hands.
Over-privilege turns a small leak into a large one. Least privilege for machines is blast-radius control, not bureaucracy.
The remediation hierarchy: eliminate the secret (workload identity / IAM role) > dynamic, short-lived vaulted secret > vaulted static secret > hard-coded key. Reach for the top; the most dangerous secret is the one that does not exist.
A single incident becomes program-wide protection when it is turned into a standard plus standing detections — and into the asset inventory of machine identities the organization never had.