Case Study 1: The 50-Message Conversation

How a Long Vibe Coding Session Went Off the Rails---and How to Fix It


Background

Marcus is a junior developer building his first serious side project: a personal finance tracker called BudgetBuddy. It is a command-line application that reads bank transaction CSVs, categorizes spending, and generates monthly reports. Marcus has been using an AI coding assistant for about two weeks and is comfortable with basic prompting, but he has not thought much about context management.

On a Saturday afternoon, Marcus opens a new conversation with his AI assistant and begins coding. Over the next three hours, he sends 50 messages. By the end, he is frustrated, the code is full of inconsistencies, and he has spent more time fighting the AI than writing features.

This case study analyzes what went wrong and shows how Marcus could have achieved better results with proper context management.


The Conversation: What Actually Happened

Messages 1-5: A Promising Start

Marcus begins with a brief prompt:

Message 1: "Help me build a CLI personal finance tracker in Python.
It should read CSV files from my bank and categorize transactions."

The AI asks clarifying questions about the CSV format, categories, and output. After a few exchanges, it produces a solid Transaction dataclass and a CSVParser class. The code is clean, well-typed, and well-documented.

Verdict: This phase goes well. The AI is in its "ramp-up to peak" phase, and the task is focused enough to produce good results.
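To make the contrast with later phases concrete, here is a hypothetical sketch of what this early output might have looked like. Only the names Transaction, CSVParser, and the parse_csv method appear in the transcript; the specific fields and CSV column headers are assumptions for illustration.

```python
from __future__ import annotations

import csv
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass
class Transaction:
    """A single bank transaction, normalized from one CSV row."""

    posted: date
    description: str
    amount: float  # negative for debits, positive for credits
    category: str = "uncategorized"


class CSVParser:
    """Reads a bank CSV export and produces Transaction objects."""

    def parse_csv(self, path: Path) -> list[Transaction]:
        # Column names here ("Date", "Description", "Amount") are an
        # assumed bank export format, not one given in the case study.
        with path.open(newline="") as f:
            reader = csv.DictReader(f)
            return [
                Transaction(
                    posted=date.fromisoformat(row["Date"]),
                    description=row["Description"],
                    amount=float(row["Amount"]),
                )
                for row in reader
            ]
```

Code at this level of focus is exactly what a short, well-scoped conversation tends to produce; the trouble begins when such pieces must later be integrated.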

Messages 6-15: Feature Creep Without Context

Marcus starts adding features one at a time:

Message 6:  "Add a categorization engine using keyword matching."
Message 8:  "Now add a monthly summary report."
Message 10: "Add support for multiple bank CSV formats."
Message 12: "Add a budget feature where I can set limits per category."
Message 14: "Add a savings goal tracker."

Each feature is requested in isolation, without reference to how it should integrate with the existing code. The AI generates each feature as a new class, but the integration points become increasingly unclear. By message 15, there are six classes that Marcus is not sure how to connect.

Problem identified: Marcus is using an implicit "progressive disclosure" pattern, but without providing integration context. The AI does not know how these features should work together because Marcus has not described the overall architecture.

Messages 16-25: The First Signs of Trouble

Marcus asks the AI to build the CLI interface that ties everything together:

Message 16: "Now create a CLI using argparse that connects all these
features together."

The AI produces a CLI, but it uses different import paths than the earlier code. It also creates new helper functions that duplicate logic already present in the categorization engine. When Marcus points this out:

Message 18: "You're duplicating the categorization logic. Use the
CategorizeEngine class we built earlier."

The AI apologizes and rewrites the CLI, but the rewrite calls the categorization engine using its initial method signature: the AI has lost track of the fact that Marcus modified the engine in message 10 to support multiple bank formats.

Problem identified: The conversation history is now long enough that the AI is experiencing the "lost in the middle" effect. The modifications from messages 8-14 are in the low-attention zone.
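The drift behind the TypeError Marcus hits at message 26 can be reconstructed in miniature. This is a hypothetical sketch: the transcript gives us only the class name CategorizeEngine and the `bank_format` keyword from the error message; everything else is assumed.

```python
# The AI's rewrite reverted CategorizeEngine to its *initial* signature,
# which knew nothing about bank formats:
class CategorizeEngine:
    def __init__(self, rules: dict[str, str]):
        self.rules = rules


# ...while the CLI it generated still called the *modified* signature
# from message 10, which had added multi-bank support:
def build_engine() -> CategorizeEngine:
    # Hypothetical call site; "chase" is an assumed bank format name.
    return CategorizeEngine({"coffee": "dining"}, bank_format="chase")


# Calling build_engine() raises:
# TypeError: CategorizeEngine.__init__() got an unexpected keyword
# argument 'bank_format'
```

Each half is internally consistent; the bug exists only in the gap between two versions of the code that both live in the conversation history.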

Messages 26-35: Spiraling Confusion

Marcus tries to fix the integration issues by pasting error messages and stack traces:

Message 26: "I get this error when running the CLI:
TypeError: CategorizeEngine.__init__() got an unexpected
keyword argument 'bank_format'"

The AI fixes this specific error but introduces a new one. Marcus pastes the new error. The AI fixes that one. This whack-a-mole cycle continues for ten messages. Each fix is locally correct but globally inconsistent because the AI no longer has a coherent picture of the full codebase.

Problem identified: The conversation is now in full degradation mode. The AI is responding to the most recent message (the error) without maintaining consistency with the full codebase context. Marcus is not including the relevant file contents in his messages, so the AI is working from memory of code that it has partially forgotten.

Messages 36-45: Changing Direction Mid-Stream

Frustrated with the bugs, Marcus decides to pivot:

Message 36: "Actually, let's switch from argparse to Click for the CLI."
Message 38: "And let's use SQLite instead of CSV files for storage."
Message 40: "Actually, keep CSV for import but add SQLite for the
processed data."

Each of these messages represents a fundamental architectural change. The AI gamely tries to accommodate, but it is now working with a context that contains: the original CSV-only design, the argparse CLI, the Click migration, the SQLite-only storage, and the CSV+SQLite hybrid. All five versions are in the conversation history, and the AI is unclear about which represents the current desired state.

Problem identified: Major architectural changes mid-conversation create enormous context pollution. The conversation now contains more outdated information than current information, and the AI cannot reliably distinguish between them.

Messages 46-50: The Breaking Point

Message 46: "The code you just generated uses print() for output
but earlier you were using the logging module. Which is it?"

Message 48: "You just created a function called `parse_transactions()`
but we already have a method called `parse_csv()` in the CSVParser
class that does the same thing."

Message 50: "I give up. The code is a mess and nothing is consistent
anymore. Let's start over."

The conversation ends with Marcus having spent three hours and produced code he cannot use. The AI's final responses were contradictory, inconsistent, and low quality---not because the AI is bad, but because the context was degraded beyond recovery.


Analysis: What Went Wrong

Reviewing Marcus's 50-message conversation, we can identify six specific context management failures:

Failure 1: No Initial Architecture Prompt

Marcus started with a vague "help me build X" message instead of establishing the architecture, constraints, and conventions upfront. If he had front-loaded a detailed first message (Section 9.3), the AI would have had a coherent foundation to build upon.

Token cost of the failure: Approximately 8,000 tokens wasted on clarifying questions and code that had to be rewritten.

Failure 2: Feature-by-Feature Without Integration Context

Each feature was requested in isolation. Marcus never described how the features should work together or provided an architecture diagram. The AI optimized each feature locally without the ability to optimize globally.

Token cost of the failure: Approximately 12,000 tokens in duplicate logic and inconsistent interfaces.

Failure 3: No Anchor Messages

Marcus never restated his constraints or summarized the current state. By message 20, the critical information from messages 6-14 was deep in the low-attention zone.

Token cost of the failure: Approximately 6,000 tokens in corrections and fixes caused by the AI "forgetting" earlier decisions.

Failure 4: Error-Message Whack-a-Mole

Instead of providing the AI with the relevant file contents when debugging, Marcus only pasted error messages. The AI had to guess at the current state of the code, leading to fixes that introduced new problems.

Token cost of the failure: Approximately 10,000 tokens in the 10-message debugging cycle (messages 26-35).

Failure 5: Architectural Changes Without Fresh Start

Switching from argparse to Click, and from CSV to SQLite, were fundamental changes that warranted a fresh conversation. Instead, Marcus made these changes mid-stream, polluting the context with conflicting designs.

Token cost of the failure: Approximately 15,000 tokens in conflicting context and resulting confusion.

Failure 6: Never Starting Fresh

The most critical failure was continuing a single conversation for 50 messages. The conversation should have been split into at least three sessions (data models, storage, and CLI) with fresh starts and proper priming between them.

Total estimated waste: Approximately 51,000 tokens---nearly half the conversation---was spent on context management failures rather than productive code generation.


The Redesigned Conversation

Here is how Marcus could have accomplished the same goals in approximately 26-30 total messages across three focused conversations.

Conversation 1: Core Architecture and Data Model (8 turns)

Turn 1 (Priming + First Task):

I'm building a CLI personal finance tracker called BudgetBuddy.

Tech stack: Python 3.12, Click (CLI), SQLite (storage), CSV (import)
Architecture:
- CSVParser: reads bank CSV files, normalizes to Transaction objects
- TransactionStore: SQLite-backed storage for processed transactions
- Categorizer: keyword-based transaction categorization
- BudgetTracker: tracks spending against per-category budgets
- ReportGenerator: monthly/weekly spending reports
- CLI: Click-based interface tying everything together

Conventions:
- Dataclasses for data objects, type hints everywhere
- Google-style docstrings
- Logging module (not print) for all output
- pathlib.Path for file handling

Create the core data models first: Transaction, Category, Budget,
and MonthlyReport dataclasses.

Turns 2-7: Implement CSVParser and Categorizer using the scaffold-then-fill pattern.

Turn 8: Ask the AI to summarize the session and produce the final versions of all files.
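The data models requested in Turn 1 might come back looking something like the sketch below, following the stated conventions (dataclasses, type hints everywhere). The specific fields are assumptions; the four class names come from the priming message.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Category:
    """A spending category matched by keywords."""

    name: str
    keywords: list[str] = field(default_factory=list)


@dataclass
class Transaction:
    """A single normalized bank transaction."""

    posted: date
    description: str
    amount: float  # spend amounts stored as positive values here
    category: str = "uncategorized"


@dataclass
class Budget:
    """A monthly spending limit for one category."""

    category: str
    monthly_limit: float


@dataclass
class MonthlyReport:
    """Spending totals for one calendar month, keyed by category."""

    year: int
    month: int
    totals_by_category: dict[str, float] = field(default_factory=dict)

    @property
    def total_spent(self) -> float:
        return sum(self.totals_by_category.values())
```

Because these models are produced first and carried forward as pasted context, every later session builds against the same definitions instead of a half-remembered version of them.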

Conversation 2: Storage and Business Logic (10 turns)

Turn 1 (Priming):

Continuing work on BudgetBuddy. Here's where we left off:
[Paste summary from Conversation 1]
[Paste final data model code]
[Paste CSVParser and Categorizer interfaces]

Now implement the TransactionStore (SQLite-backed) and BudgetTracker.

Turns 2-9: Implement storage and budget features.

Turn 10: Summary and artifact collection.
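A minimal sketch of what the TransactionStore from this session might look like, using Python's built-in sqlite3 module. The class name comes from the architecture outline; the schema, method names, and the bundled Transaction dataclass (repeated here so the sketch is self-contained) are assumptions.

```python
import sqlite3
from dataclasses import dataclass
from datetime import date


@dataclass
class Transaction:
    posted: date
    description: str
    amount: float
    category: str = "uncategorized"


class TransactionStore:
    """SQLite-backed storage for processed transactions."""

    def __init__(self, db_path: str = ":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS transactions (
                   posted TEXT,
                   description TEXT,
                   amount REAL,
                   category TEXT)"""
        )

    def add(self, txn: Transaction) -> None:
        self.conn.execute(
            "INSERT INTO transactions VALUES (?, ?, ?, ?)",
            (txn.posted.isoformat(), txn.description, txn.amount, txn.category),
        )
        self.conn.commit()

    def total_for(self, category: str) -> float:
        row = self.conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM transactions WHERE category = ?",
            (category,),
        ).fetchone()
        return row[0]
```

Keeping storage in its own session means the AI can concentrate on SQL details without the CLI and parsing code competing for attention in the context window.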

Conversation 3: CLI and Integration (8-12 turns)

Turn 1 (Priming):

Final session for BudgetBuddy. Here's the complete current state:
[Summary of all components]
[All interfaces]
[Key implementation details for integration points]

Build the Click CLI that ties everything together. Commands needed:
import, categorize, budget, report, status.

Turns 2-10: Build and test the CLI, fixing integration issues with the actual code context available.

Result Comparison

Metric                  Original (1 session)    Redesigned (3 sessions)
----------------------  ----------------------  -----------------------
Total messages          50                      26-30
Estimated tokens        ~120,000                ~60,000
Wasted tokens           ~51,000                 ~5,000
Code consistency        Poor                    High
Final code usable?      No                      Yes
Developer frustration   High                    Low
Time spent              ~3 hours                ~2 hours

Key Lessons

  1. Front-load your architecture. Spending 5 minutes writing a detailed first message saves hours of rework later.

  2. Plan your sessions. Break complex projects into focused conversations of 8-12 turns each, with clear handoff summaries between them.

  3. Include file context when debugging. Do not just paste error messages---paste the relevant code so the AI can see the current state.

  4. Major architectural changes demand a fresh start. If you change your tech stack or architecture, start a new conversation. The old context will confuse more than it helps.

  5. Use anchor messages every 5-8 turns. Briefly restate your constraints and the current state to keep the AI aligned.

  6. Monitor for degradation signals. When the AI starts forgetting constraints, duplicating logic, or contradicting itself, it is time to summarize and start fresh---not time to push through another 20 messages.


Reflection Questions

  1. At what point in the original conversation would you have recommended Marcus start a new session? Why?

  2. Marcus's priming message in the redesigned Conversation 1 includes a full architecture outline. How would the conversation differ if he had discovered the need for SQLite storage midway through instead of planning it upfront?

  3. How would you handle a situation where you are in Conversation 2 and realize the data model from Conversation 1 needs a significant change?

  4. Marcus estimated three sessions. Could this project have been done in two? In four? What factors would influence that decision?