Case Study 2: Raj's Anomaly Hunt — AI-Assisted Log Analysis
The Problem
It is a Thursday morning at 8:47 AM when Raj's phone lights up with a Slack message from the on-call engineer: "Latency is elevated on payment-processor. Not alert-level but the trend is not good. Three days running."
Raj pulls up the monitoring dashboard. The P95 latency for the payment processing service is sitting at 380ms — below their 500ms SLA threshold, but 45% higher than the 260ms baseline from the previous two weeks. It has been trending upward for seventy-two hours without triggering an alert, which means it is real and sustained, not a transient spike.
The payment processor is a critical path service. Latency elevation here means users are experiencing perceptibly slower checkout flows. The on-call engineer has already checked the obvious suspects — CPU, memory, network — and found nothing unusual at the infrastructure level.
Raj has approximately five days of logs, roughly 2.3GB of JSON-formatted entries. Parsing, aggregating, and visualizing that volume manually in pandas would take him most of the day, so he decides to use AI to write the analysis pipeline.
Building the Analysis Pipeline
Raj opens a new conversation in Claude and begins with a representative log entry:
{
"timestamp": "2024-11-14T02:14:33.421Z",
"trace_id": "7f3a9c21-4b8e-4d12-b6c4-9e8f2a1c0d7b",
"service": "payment-processor",
"endpoint": "/api/v2/transactions/validate",
"method": "POST",
"status_code": 200,
"latency_ms": 412,
"upstream_service": "fraud-check",
"upstream_latency_ms": 287,
"client_ip": "203.0.113.47",
"user_agent": "PaymentSDK/2.3.1"
}
His prompt:
"I have structured JSON logs from a distributed payment processing service. Each entry has the structure shown above. I need to analyze 5 days of logs to understand a latency elevation that's been building for 3 days. Write Python code using pandas that: (1) loads and parses all JSON log files from a directory, (2) parses timestamp as datetime, (3) calculates P50, P95, and P99 latency by hour for the main service and by endpoint, (4) identifies 4-hour windows where P95 latency exceeds 350ms (our elevated threshold), (5) creates visualizations of the latency percentile trends over the 5 days, and (6) produces a summary CSV with hourly P95 stats."
Claude returns approximately 80 lines of Python code. Raj reads every line before running it.
He catches two issues:
Issue 1: The code calls json.loads() line by line, assuming one JSON object per line. His log format is in fact NDJSON (one object per line), so the assumption holds — but only because he happened to know that; the generated code never checked. He confirms it before proceeding.
Issue 2: The percentile calculation uses df.groupby('hour')['latency_ms'].quantile([0.5, 0.95, 0.99]). This is technically correct, but with a list of quantiles the result is a Series with an (hour, quantile) MultiIndex, and it needs an additional .unstack() call to produce the expected DataFrame shape for the CSV export. The code does not include this call, so the CSV export would silently write a misshapen long-format file instead of one row per hour. Raj adds the .unstack().
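The shape problem is easy to see on a toy frame (the values here are hypothetical, not Raj's data):

```python
import pandas as pd

# Toy stand-in for the parsed logs (hypothetical values).
df = pd.DataFrame({
    "hour": ["02:00", "02:00", "03:00", "03:00"],
    "latency_ms": [200, 400, 250, 450],
})

# With a list of quantiles, groupby().quantile() returns a Series whose
# index is a MultiIndex of (hour, quantile) -- a long shape that to_csv()
# would write as one row per (hour, percentile) pair.
stacked = df.groupby("hour")["latency_ms"].quantile([0.5, 0.95])

# .unstack() pivots the quantile level into columns: one row per hour,
# one column per percentile -- the shape the CSV export expects.
wide = stacked.unstack()
```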
He also notices that the timezone handling is implicit — timestamps are parsed as UTC but not labeled. Since his SLA thresholds and business understanding are in Eastern time, he adds explicit UTC-to-Eastern conversion.
He runs the corrected script. It processes 2.3GB of logs in approximately four minutes and produces the hourly percentile DataFrame and charts.
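A condensed sketch of the corrected pipeline, illustrative rather than Raj's actual script: the *.log glob pattern and function names are assumptions, but the NDJSON loading, UTC parsing, Eastern conversion, and the .unstack() fix follow the steps described above.

```python
import json
from pathlib import Path

import pandas as pd


def load_ndjson_logs(log_dir: str) -> pd.DataFrame:
    """Load all NDJSON log files (one JSON object per line) from a directory."""
    rows = []
    for path in Path(log_dir).glob("*.log"):  # file extension is an assumption
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    rows.append(json.loads(line))
    df = pd.DataFrame(rows)
    # Parse timestamps explicitly as UTC, then add an Eastern column so the
    # time windows line up with the SLA and business-hours framing.
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    df["timestamp_et"] = df["timestamp"].dt.tz_convert("America/New_York")
    return df


def hourly_percentiles(df: pd.DataFrame) -> pd.DataFrame:
    """P50/P95/P99 latency per hour, one row per hour (note the .unstack())."""
    hour = df["timestamp"].dt.floor("h")
    return (df.groupby(hour)["latency_ms"]
              .quantile([0.5, 0.95, 0.99])
              .unstack()
              .rename(columns={0.5: "p50", 0.95: "p95", 0.99: "p99"}))
```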
The Pattern Emerges
The latency percentile chart is immediately revealing. P95 latency shows a consistent pattern: elevated between approximately 02:00 and 04:00 UTC (9 PM to 11 PM Eastern, since November is on EST, UTC-5) on each of the three affected nights. Baseline P95 is around 260ms. During the elevated windows, P95 reaches 380-420ms. Outside those windows, latency is perfectly normal.
This is not a gradual degradation — it is a nightly, time-bounded event.
Raj submits his findings to Claude for hypothesis generation:
"I've found that P95 latency elevation is occurring in a 02:00-04:00 UTC window each night for 3 nights. The elevated windows show P95 latency 45-60% above baseline. Outside those windows, performance is normal. The service makes upstream calls to a fraud-check service. Generate 5 plausible hypotheses for what is causing this pattern."
The five hypotheses:

1. A scheduled batch job competing for database connection pool resources at that time
2. External API rate limiting: the upstream fraud check service may have a daily request quota that resets nightly
3. Certificate rotation or scheduled TLS handshake overhead
4. Garbage collection pauses in a JVM-based component triggered by nightly memory patterns
5. Scheduled database maintenance, backup, or replication lag in a downstream service
Raj works through elimination. He checks the batch job schedule: nothing runs at 02:00 UTC. He checks the database maintenance window: it runs at 05:00 UTC, outside the affected window. GC patterns and TLS certificate rotations would produce different signatures in the traces. That leaves hypothesis 2: external API rate limiting.
He pulls up the fraud check vendor's API documentation. Buried in the rate limiting section: "Daily request quotas reset at 02:00 UTC." He checks the service's fraud check call volume: approximately 12,000 calls per day. He checks the vendor's rate limit: 10,000 calls per day on their current contract tier.
The service has been exceeding its daily quota since approximately 02:00 UTC each night. When the quota is exceeded, the vendor returns 429 (Too Many Requests) errors. His service's retry logic handles these errors by waiting and retrying with exponential backoff — which looks like elevated latency rather than errors.
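A minimal sketch of the kind of retry logic described (hypothetical, not the service's actual client code) shows why 429s surface as latency rather than errors: each retry doubles the wait, and the caller observes the accumulated sleep time as request latency.

```python
import time


def call_with_backoff(send_request, max_retries=4, base_delay=0.5):
    """Call send_request(), retrying on HTTP 429 with exponential backoff.

    send_request is a hypothetical zero-argument callable returning a
    status code. A request that hits the exhausted quota can eventually
    succeed, but the caller sees the sum of all the waits as latency,
    not as an error -- exactly the signature in Raj's percentile charts.
    """
    for attempt in range(max_retries + 1):
        status = send_request()
        if status != 429:
            return status
        if attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, 4s...
    return status
```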
This explains the three-day onset: the traffic has been growing as the holiday season approaches, and three days ago it crossed the 10,000-call threshold. It will get worse.
The Security Discovery
Raj asks for a second analysis: the twenty slowest individual requests from the elevated windows.
"Write code to filter logs to the three elevated time windows (02:00-04:00 UTC on Nov 11, 12, and 13), then extract the 20 slowest individual requests by latency_ms. For each, provide: trace_id, timestamp, endpoint, latency_ms, upstream_latency_ms, status_code, client_ip, and user_agent."
The code runs. Raj examines the resulting table.
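The filter-and-rank step amounts to a few lines of pandas. This sketch assumes the column names from the log entry shown earlier; the window timestamps follow the dates in the prompt.

```python
import pandas as pd


def slowest_requests(df: pd.DataFrame, windows, n=20) -> pd.DataFrame:
    """Filter to the given (start, end) UTC windows; return the n slowest requests."""
    mask = pd.Series(False, index=df.index)
    for start, end in windows:
        mask |= (df["timestamp"] >= start) & (df["timestamp"] < end)
    cols = ["trace_id", "timestamp", "endpoint", "latency_ms",
            "upstream_latency_ms", "status_code", "client_ip", "user_agent"]
    return df.loc[mask, cols].nlargest(n, "latency_ms")


# The three elevated windows from the prompt: 02:00-04:00 UTC, Nov 11-13.
WINDOWS = [(pd.Timestamp(f"2024-11-{d}T02:00:00Z"),
            pd.Timestamp(f"2024-11-{d}T04:00:00Z")) for d in (11, 12, 13)]
```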
Seventeen of the twenty slowest requests come from five IP addresses. The user agent on these requests is not a standard browser or mobile app — it is "curl/7.68.0". The endpoints they are hitting are /api/v2/transactions/validate and /api/v2/payment-methods/verify — endpoints that would be useful for probing payment validation logic.
Raj looks at the request rate from these IPs. They are submitting requests at 2-3 per second, sustained over the two-hour window — not organic user behavior.
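The rate check itself is a short pandas expression. A sketch with a hypothetical helper name, where the window is the two-hour (7,200-second) elevated span:

```python
import pandas as pd


def per_ip_request_rates(df: pd.DataFrame, window_seconds: float) -> pd.Series:
    """Sustained request rate (req/s) per client IP over a fixed window,
    sorted so the heaviest senders come first."""
    counts = df.groupby("client_ip").size()
    return (counts / window_seconds).sort_values(ascending=False)
```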
The rate limit exhaustion is acting as a forcing function: once the legitimate fraud check quota is exhausted, the fraudulent probing requests, which also trigger fraud checks, take longer because the quota is already depleted. The latency elevation is a side effect of rate limit exhaustion, but the proximate cause of the slow requests is the probing traffic.
Raj escalates to the security team with his findings and trace data. He also opens a ticket with the fraud check vendor to upgrade the rate limit tier.
The Trust-But-Verify Postmortem
In his incident postmortem, Raj documents the role of AI-generated code in the investigation. Three points stand out:
The bugs he caught before running. The .unstack() omission would have caused a silent data structure error in the CSV export — not a runtime failure, just a badly structured output. The timezone handling would have made the time-window analysis incorrect. Both errors were caught by reading the code.
The hypothesis AI generated that he would have thought of anyway. Raj acknowledges that hypothesis 2 (external API rate limiting) was the first thing he would have investigated after seeing the nightly pattern. The AI enumeration of hypotheses did not add new insights here — it added speed and structure.
The numerical claims he verified. The rate limit threshold (10,000 calls), the traffic volume (approximately 12,000 calls), and the elapsed days of elevation were all verified against source data before he stated them in the incident report. He did not include any number in the report that he had not checked.
The security finding was not AI-generated — it was pattern recognition by a human who had domain knowledge about what a command-line HTTP client probing financial endpoints looks like. AI surfaced the top-20 slowest requests; a human recognized what they represented.
Resolution and Outcome
Within 24 hours of the investigation:

- Fraud check rate limit tier upgraded to 25,000 calls/day
- Security team blocked the five probing IP addresses at the WAF level
- A rate limit monitoring alert added to the payment processor service
- An alert added for patterns matching sustained curl-agent traffic to sensitive endpoints
The latency incident closes. The security incident is escalated to a formal investigation of what the probing traffic was attempting to accomplish.
Total analysis time for the log investigation: approximately four hours. Raj estimates the same investigation without AI code generation would have taken two full days — not because the analysis was conceptually harder, but because writing the log parsing and aggregation code from scratch is genuinely time-consuming.
The code Raj produces with AI assistance is reviewed, understood, and corrected before running. Every numerical claim in his incident report has a trace to verified data. The human judgment — what the pattern means, what the probing traffic represents, how to escalate — is entirely his.