Case Study 4.2: Raj's Security Audit — When Copilot Got the Crypto Wrong

Background

Raj was a senior software developer at a B2B SaaS company building a project management platform. With seven years of experience and a reputation as one of the stronger technical reviewers on the team, he was trusted to make technology decisions on features within his scope and to mentor junior developers on code quality.

His company had been an early adopter of GitHub Copilot. Raj had been using it for six months, and his overall experience was positive. He estimated it saved him somewhere between 20 and 40 percent of his time on routine development tasks: boilerplate, standard patterns, unit test scaffolding, documentation. He had become comfortable with the tool and had come to trust its suggestions by default.

The incident that changed his calibration happened during a sprint in which the team was building a user authentication module for a new enterprise tier of their product. Enterprise customers would have tighter compliance requirements, so this module needed to be built to a higher security standard than the existing basic authentication system.

The Development Session

Raj was building the password management component — specifically, the functions responsible for hashing passwords during account creation and verifying them during login. These are security-critical operations. Getting them wrong does not just introduce a bug; it creates a vulnerability that can expose every user's credentials if the database is ever breached.

He started with Copilot enabled, as was his habit. He wrote the function signature:

def hash_password(password: str) -> str:
    """Hash a password for storage. Returns a hex string."""

Copilot suggested a completion:

def hash_password(password: str) -> str:
    """Hash a password for storage. Returns a hex string."""
    import hashlib
    return hashlib.md5(password.encode()).hexdigest()

The code was syntactically correct. It would run. It would produce a deterministic hex string output. It looked like a password hashing function. Raj saw the suggestion, recognized hashlib as a standard library, saw that the structure was clean, and accepted it with a tab keypress.
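The trap is easy to reproduce. A minimal sketch of the accepted suggestion shows why it passed a quick scan: it runs, it is deterministic, and it returns a clean hex string. It is also unsalted, so identical passwords produce identical stored hashes:

```python
import hashlib

def hash_password(password: str) -> str:
    """The accepted suggestion: an unsalted, general-purpose MD5 digest."""
    return hashlib.md5(password.encode()).hexdigest()

# Deterministic and unsalted: every user with this password gets the
# exact same stored hash, so one precomputed lookup cracks them all.
h1 = hash_password("hunter2")
h2 = hash_password("hunter2")
print(h1 == h2)   # → True
print(len(h1))    # → 32 (a 128-bit digest as hex)
```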

He moved on to the verify function. Copilot suggested:

def verify_password(password: str, hashed: str) -> bool:
    """Verify a password against its hash."""
    import hashlib
    return hashlib.md5(password.encode()).hexdigest() == hashed

Again: syntactically correct, logically consistent with the hash function, looks clean. Accepted.

Raj continued developing the rest of the module — account creation flow, session management, token generation. Copilot was helpful throughout, and most of its suggestions in those areas were sound. By the end of the session, he had a complete authentication module that worked as intended in testing. He committed the code and submitted a pull request.

The Code Review

The pull request was reviewed by Priya, another senior developer on the team, as part of their standard review process. Priya was methodical in her reviews and had a habit of specifically examining any cryptographic or security-sensitive code line by line.

She flagged the hash_password function within ten minutes of opening the review.

Her comment on the pull request read:

"The password hashing here uses MD5. This is a critical security issue. MD5 is not a password hashing algorithm — it's a general-purpose cryptographic hash designed for speed, not security. A modern GPU can compute billions of MD5 hashes per second, making brute-force and rainbow table attacks trivially feasible. For this enterprise tier with compliance requirements, we need bcrypt, scrypt, or Argon2 — functions specifically designed for password hashing that incorporate salting and tunable cost factors. Please update before this merges."
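The speed gap Priya describes can be sketched with the standard library alone. The comparison below uses hashlib.scrypt as a stand-in for the memory-hard functions she names (bcrypt is a third-party package), and the iteration count and cost parameters are illustrative, not a vetted policy:

```python
import hashlib
import time

PASSWORD = b"correct horse battery staple"
SALT = b"sixteen_byte_sal"  # fixed here only so the two timings are comparable

# An attacker checking guesses against an MD5 table repeats this per guess.
start = time.perf_counter()
for _ in range(10_000):
    hashlib.md5(PASSWORD).hexdigest()
md5_elapsed = time.perf_counter() - start

# One derivation with a deliberately expensive, memory-hard KDF.
start = time.perf_counter()
key = hashlib.scrypt(PASSWORD, salt=SALT, n=2**14, r=8, p=1)
scrypt_elapsed = time.perf_counter() - start

print(f"10,000 MD5 digests:  {md5_elapsed:.4f}s")
print(f"1 scrypt derivation: {scrypt_elapsed:.4f}s")
# On typical hardware one scrypt call costs more than thousands of MD5
# digests combined; that asymmetry is what makes brute force expensive.
```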

Raj read the comment, recognized immediately that Priya was correct, and felt a particular kind of discomfort: he knew better than this. MD5 for password hashing was not a close call. It was definitively wrong, and he would have caught it instantly if he had been writing the code from scratch rather than accepting a Copilot suggestion.

He updated the implementation to use bcrypt:

import bcrypt

def hash_password(password: str) -> bytes:
    """Hash a password for storage using bcrypt."""
    salt = bcrypt.gensalt()
    return bcrypt.hashpw(password.encode(), salt)

def verify_password(password: str, hashed: bytes) -> bool:
    """Verify a password against its bcrypt hash."""
    return bcrypt.checkpw(password.encode(), hashed)

The bcrypt version is fundamentally different: it generates a cryptographically random salt automatically, incorporates the salt into the hash, and is intentionally computationally expensive — slowing brute-force attacks by orders of magnitude. The Copilot suggestion was not just weaker; it was categorically different in its security properties.
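Where adding the third-party bcrypt dependency is not an option, the same properties can be approximated with the standard library's hashlib.scrypt. This is a sketch, not a vetted configuration; the cost parameters (n, r, p) and the salt-plus-key storage layout are assumptions chosen for illustration:

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> bytes:
    """Derive a salted key with scrypt; store the salt alongside the key."""
    salt = os.urandom(16)  # fresh random salt per password
    key = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt + key  # first 16 bytes are the salt, the rest is the key

def verify_password(password: str, stored: bytes) -> bool:
    """Re-derive with the stored salt and compare in constant time."""
    salt, key = stored[:16], stored[16:]
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, key)

# Two hashes of the same password differ (random salt), yet both verify.
h1 = hash_password("hunter2")
h2 = hash_password("hunter2")
print(h1 != h2)                        # → True
print(verify_password("hunter2", h1))  # → True
print(verify_password("wrong", h1))    # → False
```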

The pull request was approved after the update.

Post-Incident Reflection

Raj spent time thinking about why he had accepted the MD5 suggestion without catching the problem. Several factors were at play.

The familiarity effect. Raj had accepted hundreds of Copilot suggestions over six months. The vast majority had been fine. The act of reviewing a suggestion had, over time, become faster and less critical: a quick scan rather than a careful evaluation. The routine of accepting Copilot code had eroded his scrutiny across the board.

The syntax vs. semantics gap. The MD5 code was syntactically correct. It used a standard library. It did exactly what it said it would do — hash a password. The semantic error — that this is the wrong kind of hash function for this purpose — required security-domain knowledge to identify, not general code review skills. Raj had that knowledge, but his review mode had shifted to "does this look right" rather than "is this correct."

The scope blindness problem. Raj was in flow on a large module. He was focused on the overall architecture, the session management logic, the token generation. The hash functions were a small piece of a large task. The mental context was "checking all these pieces work together" rather than "scrutinizing each security-critical line individually."

The confidence of the suggestion. Copilot did not flag any uncertainty about the suggestion. There was no indicator that the approach was debated or that alternatives existed. The suggestion was presented with the same confidence as all its other suggestions. The tool gave no signal that this was a security-sensitive decision point.

The Deeper Pattern: When AI Tools Know What to Write But Not What Is Safe

The MD5 suggestion illustrates a specific, important category of AI code error: code that is technically correct but embeds a security anti-pattern.

Copilot is trained on large amounts of code from public repositories. Those repositories include a lot of code that uses MD5 for various purposes — checksums, file integrity verification, non-sensitive identifiers. They also include older code that used MD5 for password hashing, from an era before current security guidance was widely understood and implemented. When Copilot sees a function signature suggesting password hashing, it pattern-matches to code that performs hashing, and MD5 is a common hash function in its training data. It does not have a concept of "this particular use of MD5 is dangerous in this context."

This is a general property of AI code generation: the model learns to produce code that looks like the code in its training data. Training data contains both good patterns and bad patterns, current practices and deprecated practices, secure implementations and vulnerable ones. The model does not reliably distinguish between them. It produces plausible code, not necessarily secure code.

What Raj Changed

The incident produced several lasting changes in how Raj used Copilot:

The Security Code Slow-Down Protocol. Raj made a rule for himself: any function touching authentication, authorization, cryptographic operations, session management, or data access control gets reviewed at full attention, line by line, independent of Copilot's confident suggestion. No flow-state quick-scan for these functions.

The OWASP Cross-Reference Habit. For cryptographic choices specifically, Raj added a step: after writing any function that makes a cryptographic algorithm choice, check the OWASP Cryptographic Storage Cheat Sheet and Password Storage Cheat Sheet to confirm the choice is current and appropriate. This takes three minutes and has caught three additional issues in the year since the incident.

The Security Review Flag in Comments. Raj started adding a comment above any security-sensitive function: # SECURITY CRITICAL: Review against OWASP before merge. This is visible to reviewers and creates a shared expectation that this code gets special attention.

The Junior Developer Conversation. Raj added the MD5 incident to his onboarding conversation with junior developers: "Copilot is very helpful for most things. For anything involving authentication, cryptography, or authorization, do not accept its suggestions without a deliberate security review. I have a specific example of why."

The Updated Copilot Mental Model. Before this incident, Raj had a general sense that Copilot needed more scrutiny for complex or unusual tasks. After it, he added a second axis: security sensitivity. Simple, well-understood, non-security-sensitive code can be accepted with light review. Complex code needs more review. Security-sensitive code needs deep review regardless of simplicity.

The Counterfactual: What If Priya Had Not Caught It?

If this code had been deployed to production with the enterprise tier, the risk would have been concrete:

  • Any breach of the database that exposed the password hash table would immediately expose all enterprise user passwords to brute-force cracking, because unsalted MD5 digests of common passwords can be recovered almost instantly through dictionary attacks or precomputed lookup tables.
  • Enterprise customers with compliance requirements (SOC 2, ISO 27001, HIPAA) would have been at direct risk of compliance failure.
  • The company's enterprise SaaS contract likely contained security warranties that would have been violated.
  • Discovery of the vulnerability — whether through a breach or a security audit — would have caused severe reputational and contractual damage.
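The first of those risks is not hypothetical. A toy dictionary attack against a leaked table of unsalted MD5 hashes fits in a dozen lines; real attackers do the same thing with GPU rigs and wordlists of billions of entries:

```python
import hashlib

# What a breached database would contain under the original implementation:
# usernames mapped to unsalted MD5 hashes.
leaked = {
    "alice": hashlib.md5(b"letmein").hexdigest(),
    "bob": hashlib.md5(b"dragon").hexdigest(),
}

# A toy wordlist; real attacks test billions of candidates per second.
wordlist = ["password", "123456", "letmein", "dragon", "qwerty"]

cracked = {}
for user, digest in leaked.items():
    for guess in wordlist:
        if hashlib.md5(guess.encode()).hexdigest() == digest:
            cracked[user] = guess
            break

print(cracked)  # → {'alice': 'letmein', 'bob': 'dragon'}
```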

The code passed automated testing because it was functionally correct. Without Priya's deliberate security review, it might have been deployed. The catch happened because of human expertise applied at the right time — not because of any automated process or the AI itself.

Calibration Lessons

Security-sensitive code is a specific, identifiable category requiring elevated review. It is not enough to have a general sense that "complex code needs more scrutiny." Security-sensitive code can be syntactically simple and still be critically wrong. The trigger for elevated review should be security sensitivity, not complexity.

AI tools do not distinguish safe from unsafe patterns. Copilot generated MD5 password hashing because it appears frequently in training data, not because it is appropriate. The model does not have a concept of deprecated security patterns or context-appropriate algorithm selection. The developer must supply that judgment.

Flow state and habit reduce scrutiny. Six months of accepting Copilot suggestions had shifted Raj's mental mode to "scan and accept" rather than "evaluate carefully." Security-critical code requires a deliberate gear shift back to full-attention evaluation, independent of how many previous suggestions have been fine.

Good code review is the last line of defense. This incident was caught not by automated tools, not by the AI itself, not by Raj's own review, but by a skilled human reviewer who specifically knew to look for cryptographic issues. Building and maintaining strong code review practices is essential precisely because AI tools can generate confident, syntactically correct, but semantically dangerous code.

The Red Flag, specifically. "Copilot uses MD5/SHA-1 for password hashing." This is now on Raj's permanent Red Flag list, and he brings it up whenever password hashing comes up in conversation with other developers. Personal, specific failure patterns are more memorable and actionable than general guidelines.