
# Chapter 9: Context Management and Conversation Design


## Learning Objectives

After completing this chapter, you will be able to:

- Remember the key characteristics and practical limits of context windows in current AI coding assistants
- Understand how conversations function as data structures that shape AI behavior and output quality
- Apply strategic information placement techniques to get better results from AI coding tools
- Apply multi-turn conversation patterns to guide AI through complex coding tasks
- Analyze when to continue a conversation versus starting fresh, based on context degradation signals
- Evaluate the effectiveness of different context priming and management strategies for various coding scenarios
- Create context budget plans for coding sessions that maximize AI output quality within token constraints

## Introduction

In Chapter 8, you learned the fundamentals of writing effective prompts---how to be clear, specific, and well-structured in your requests to AI coding assistants. But a single prompt, no matter how well crafted, is only part of the story. Most real vibe coding work happens across conversations---multi-turn exchanges where you and the AI collaborate over minutes or hours to build something meaningful.

The quality of those conversations depends enormously on something most developers never think about explicitly: context management. Context is everything the AI can "see" when generating a response---your current message, the conversation history, any files or system instructions you have provided, and the instructions baked into the tool itself. Managing this context well is the difference between an AI that feels like a brilliant collaborator and one that seems to forget what you said five minutes ago.

This chapter teaches you to think of your conversation with AI as a carefully designed data structure. You will learn what fits inside the AI's working memory, how to place information strategically, how to structure multi-turn interactions for maximum effectiveness, and how to plan your "context budget" before you start coding. These skills will transform your vibe coding sessions from frustrating guesswork into deliberate, productive collaboration.

> **Intuition**
>
> Think of a context window like a whiteboard in a meeting room. You and a colleague are solving a problem together, but the whiteboard has a fixed size. As you fill it up, you have to start erasing older content to make room for new ideas. If you erase the wrong thing---the problem statement, a key constraint, an important diagram---your colleague will make mistakes because they can no longer see that information. Context management is the art of deciding what stays on the whiteboard and what gets erased.


## 9.1 Understanding Context Windows in Practice

In Chapter 2, we explored context windows from a technical perspective---how transformers process sequences of tokens and how the attention mechanism works over those sequences. Now we need to understand what context windows mean in practice for your day-to-day vibe coding work.

### What Is a Context Window, Really?

A context window is the total amount of text the AI can consider at once when generating a response. It includes everything:

- The system prompt (hidden instructions from the tool itself)
- Your conversation history (every message you have sent and every response the AI has given)
- Any files, code, or documents you have attached or pasted
- Your current message
- The AI's response as it generates it

The context window is measured in tokens, which roughly correspond to word fragments. As a practical rule of thumb:

| Measure | Approximate token count |
|---|---|
| 1 word | ~1.3 tokens |
| 1 line of code | ~10-15 tokens |
| 1 page of text | ~300-400 tokens |
| 1 typical Python file (200 lines) | ~1,500-2,500 tokens |
| 1 full conversation turn (prompt + response) | ~500-3,000 tokens |
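For quick planning, you can turn the rule of thumb above into a rough estimator. This is only an approximation; each model's real tokenizer (typically a BPE variant) counts differently, so treat it as a planning aid, not an exact count:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~1.3 tokens-per-word rule of thumb."""
    return round(len(text.split()) * 1.3)
```

For exact counts, use the tokenizer library published by your model's vendor.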

### Current Context Window Sizes

As of 2025, common context window sizes are:

| Model / Tool | Context window | Practical capacity |
|---|---|---|
| Claude 3.5 Sonnet / Claude 4 | 200K tokens | ~150K usable (after system prompts) |
| GPT-4o | 128K tokens | ~100K usable |
| Gemini 1.5 Pro | 1M+ tokens | ~800K usable |
| GitHub Copilot (chat) | Varies by model | Typically 32K-128K |
| Cursor | Varies by model | Tool manages context automatically |

These numbers might seem large, but they fill up faster than you expect.

### The Practical Reality

Here is the surprise that catches most new vibe coders: having a large context window does not mean the AI uses all of it equally well. Research and practical experience show several important patterns:

**The "Lost in the Middle" effect.** AI models tend to pay the most attention to information at the beginning and end of their context window. Information buried in the middle of a long conversation can be effectively "forgotten" even though it is technically still in the window. This has been documented in research from Stanford and other institutions and has direct implications for how you structure conversations.

**Attention degrades with distance.** Even within the usable portion of the context, the AI pays less attention to information that appeared many turns ago. A constraint you mentioned in message 3 of a 30-message conversation may be functionally invisible by message 25.

**More context is not always better.** Stuffing the context window with everything you can think of actually hurts performance. Irrelevant information creates noise that the AI must filter through, increasing the chance it will miss or misinterpret the important parts.

> **Common Pitfall**
>
> A frequent mistake is pasting your entire codebase into the conversation and saying "here is all my code, now help me with X." This approach wastes tokens on irrelevant files, can push important context out of the window, and makes the AI's job harder by forcing it to figure out which parts matter. Instead, provide only the files and sections directly relevant to your current task.

### A Mental Model for Token Budgets

Think of your context window as a fixed financial budget. Every piece of information you include has a cost:

```
Total Budget: ~150,000 tokens (for a 200K model)
- System prompt overhead:       ~5,000 tokens   (tool's built-in instructions)
- Reserved for AI response:     ~4,000 tokens   (the code/text it will generate)
- Your available budget:        ~141,000 tokens

Typical allocations:
- File context (3-5 files):     ~8,000 tokens
- Conversation history:         ~15,000 tokens  (after 10-15 turns)
- Current prompt:               ~500 tokens
- Remaining buffer:             ~117,500 tokens
```

Early in a conversation, you have plenty of room. But after 30-40 exchanges, the conversation history alone can consume 50,000-80,000 tokens, severely limiting what else you can include.
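The budget arithmetic above is simple enough to encode. A minimal sketch, using the same illustrative numbers (real overheads vary by tool and model):

```python
USABLE_WINDOW = 150_000      # ~200K window minus vendor-side margins
SYSTEM_OVERHEAD = 5_000      # tool's built-in instructions
RESPONSE_RESERVE = 4_000     # room for the AI's reply

def remaining_buffer(file_tokens: int, history_tokens: int,
                     prompt_tokens: int) -> int:
    """Tokens left after fixed overheads and your own context."""
    available = USABLE_WINDOW - SYSTEM_OVERHEAD - RESPONSE_RESERVE
    return available - file_tokens - history_tokens - prompt_tokens
```

Plugging in the typical allocations above (8,000 + 15,000 + 500) leaves the ~117,500-token buffer shown in the breakdown.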

> **Real-World Application**
>
> In Claude Code (the CLI tool), you can see an estimate of your context usage as you work. Other tools like Cursor show a context indicator in their interface. Pay attention to these indicators---when you see context usage climbing above 60-70%, it is time to think about whether to continue or start a fresh conversation.
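If your tool exposes a raw token count rather than a percentage, a small helper can apply the 60-70% guideline. The thresholds here are the judgment call suggested above, not a vendor rule:

```python
def context_status(used_tokens: int, window: int = 200_000) -> str:
    """Map context usage onto the 60-70% warning zone."""
    frac = used_tokens / window
    if frac >= 0.70:
        return "start fresh (or summarize and reset)"
    if frac >= 0.60:
        return "plan a handoff soon"
    return "ok"
```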


## 9.2 The Conversation as a Data Structure

Most people think of a conversation with AI as a casual chat. Expert vibe coders think of it as a data structure that they are deliberately constructing to produce optimal outputs.

### The Structure of a Conversation

Every conversation with an AI coding assistant has a specific structure:

```
Conversation = {
    system_prompt: str,          # Hidden instructions from the tool
    messages: [
        {role: "user", content: "..."},       # Your first message
        {role: "assistant", content: "..."},  # AI's first response
        {role: "user", content: "..."},       # Your second message
        {role: "assistant", content: "..."},  # AI's second response
        ...
    ]
}
```

This is not a metaphor---it is literally how the data is structured when sent to the AI model's API. Understanding this structure gives you power over the conversation.
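Concretely, this is close to the JSON payload that chat-style model APIs accept (exact field names vary by vendor, so treat this as an illustrative sketch rather than any one API's schema):

```python
# Roughly the payload a chat-model API receives; the example content is made up.
conversation = {
    "system": "You are a senior Python backend developer.",
    "messages": [
        {"role": "user", "content": "Build a User model."},
        {"role": "assistant", "content": "Here is a first draft: ..."},
    ],
}

def append_turn(conv: dict, role: str, content: str) -> None:
    """Append-only: new turns always go at the end of the list."""
    conv["messages"].append({"role": role, "content": content})
```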

### Key Properties of This Data Structure

**It is append-only.** In most tools, you cannot go back and edit earlier messages. Each new message is appended to the end of the list. This means mistakes and misunderstandings accumulate unless you actively correct them.

**It is ordered.** The sequence of messages matters. The AI processes them in order, and information placement within this sequence affects how strongly it influences the output (more on this in Section 9.3).

**It has a fixed capacity.** When the conversation exceeds the context window, older messages are typically truncated or summarized. Different tools handle this differently---some drop old messages silently, others summarize them, and others warn you.

**Every message has a cost.** Both your messages and the AI's responses consume tokens. A verbose AI response that you did not need wastes budget that could have been used for more context.

### Thinking in Conversation Graphs

A linear conversation is the simplest case, but real development sessions often have a more complex logical structure:

```
Start
  |
  v
[Define the project] -----> [Set up data models]
                                    |
                                    v
                            [Build API endpoints] ---> [Add auth]
                                    |                      |
                                    v                      v
                            [Write tests]          [Fix auth bugs]
                                    |                      |
                                    +-------> [Integration testing]
```

The problem is that your conversation is linear even when your development process is not. When you context-switch from "build API endpoints" to "fix auth bugs" and back, the AI has to reconstruct the mental context for each branch from the linear message history. This is expensive and error-prone.

> **Best Practice**
>
> When your development work branches into parallel tracks (for example, building a feature while also fixing a bug), consider using separate conversations for each track rather than switching back and forth in one conversation. This keeps each conversation focused and avoids forcing the AI to context-switch.

### Information Density and Conversation Quality

Not all messages are created equal. Compare these two conversation histories:

**Low-density conversation (wasteful):**

```
User: Can you help me write a function?
AI:   Of course! What kind of function would you like?
User: A sorting function.
AI:   Sure! What language? What should it sort?
User: Python. It should sort a list of dictionaries.
AI:   What key should it sort by?
User: The "name" key.
```

This conversation burns seven messages (and hundreds of tokens) to convey information that could have been expressed in one.

**High-density conversation (efficient):**

```
User: Write a Python function that sorts a list of dictionaries by
      the "name" key, case-insensitive, with None values sorted last.
AI:   [produces complete, correct function]
```

The high-density version uses a fraction of the tokens and produces a better result because the AI has all the information it needs in a single message. This is not about being terse---it is about being information-rich.

> **Intuition**
>
> Think of each message in your conversation as a row in a database. You want high information density per row and minimal redundancy across rows. A well-designed conversation, like a well-designed database, stores the right information in the right place with minimal waste.


## 9.3 Strategic Information Placement

Where you place information within your conversation has a measurable impact on how effectively the AI uses it. This section covers the science and practice of strategic information placement.

### The Primacy and Recency Effects

Research on large language models confirms two effects that human memory researchers identified decades ago:

- **Primacy effect:** Information presented first is remembered well. The AI pays strong attention to the beginning of the context.
- **Recency effect:** Information presented last (most recently) is also remembered well. The current message and the most recent exchanges receive high attention.
- **Middle neglect:** Information in the middle of a long context receives the least attention.

This has direct implications for how you structure your conversations:

```
HIGH ATTENTION ZONE    <-- System prompt / first messages
    |
    | ... decreasing attention ...
    |
LOW ATTENTION ZONE     <-- Middle of long conversation
    |
    | ... increasing attention ...
    |
HIGH ATTENTION ZONE    <-- Most recent messages / current prompt
```

### Front-Loading Critical Information

Because of the primacy effect, your first message in a conversation is prime real estate. Use it to establish:

1. **The project context** --- What are you building? What is the tech stack?
2. **Key constraints** --- What rules must the AI follow throughout?
3. **Coding standards** --- Naming conventions, style guidelines, patterns to use
4. **Architecture decisions** --- The structure of the codebase, key design patterns

Here is an example of a well-crafted first message:

```
I'm building a REST API for a recipe management application.

Tech stack: Python 3.12, FastAPI, SQLAlchemy 2.0, PostgreSQL, Pydantic v2
Architecture: Clean architecture with separate domain, service, and API layers
Testing: pytest with pytest-asyncio for async tests
Style: PEP 8, Google-style docstrings, type hints everywhere

Key constraints:
- All endpoints must be async
- Use dependency injection for database sessions
- Return Pydantic models, never raw dicts
- All errors must return RFC 7807 problem detail responses

The codebase structure:
app/
  domain/     # Domain models and business logic
  services/   # Application services
  api/        # FastAPI routers and schemas
  db/         # SQLAlchemy models and session management
```

This message costs about 200 tokens but will save thousands of tokens over the conversation by preventing the AI from asking clarifying questions or making assumptions that violate your constraints.

### Restating Key Information

Because of middle neglect, you should periodically restate critical constraints and context in later messages. This does not mean copying your entire first message---instead, include brief reminders:

```
Now add the DELETE endpoint for recipes.
Remember: async, dependency injection for the session,
RFC 7807 error responses.
```

This 30-token reminder is much cheaper than fixing an AI response that forgot your constraints.

### The "Sandwich" Pattern

For individual prompts within a longer conversation, the sandwich pattern places critical information at both the beginning and end of your message:

```
[CONTEXT/CONSTRAINTS at the top]

[Details, code, examples in the middle]

[RESTATE KEY REQUIREMENTS at the bottom]
```

For example:

```
I need you to refactor this function to be async and use our
error handling pattern (RFC 7807 responses).

Here's the current function:
[... 50 lines of code ...]

Make sure the refactored version:
1. Is fully async
2. Uses RFC 7807 error responses
3. Keeps the same public interface
```
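The sandwich pattern is mechanical enough to script if you keep your constraints as a list of strings. A minimal helper sketch (the function name and format are illustrative, not from any tool):

```python
def sandwich_prompt(requirements: list[str], body: str) -> str:
    """Repeat the requirements at both high-attention ends of the prompt."""
    reqs = "\n".join(f"- {r}" for r in requirements)
    return (
        f"Requirements:\n{reqs}\n\n"
        f"{body}\n\n"
        f"Before finishing, re-check every requirement:\n{reqs}"
    )
```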

> **Advanced**
>
> Some AI coding tools support "pinned" context or project-level instructions (like Cursor's .cursorrules file or Claude Code's CLAUDE.md file). These mechanisms inject information at a high-attention position in the context for every message, effectively making certain instructions permanent. Use these features for project-wide constraints that should never be forgotten.


## 9.4 Multi-Turn Conversation Patterns

Most vibe coding happens across multiple turns. The pattern of those turns---how you sequence your requests---has an enormous impact on the quality of the results. Here are the most effective multi-turn patterns.

### Pattern 1: Progressive Disclosure

Instead of dumping all requirements at once, reveal them progressively in a logical order:

```
Turn 1: "Build a User data model with name, email, and password fields."
Turn 2: "Add email validation and password hashing to the model."
Turn 3: "Add a method to generate a JWT token for this user."
Turn 4: "Add rate limiting metadata---track login attempts and lockout time."
```

**When to use:** When building something complex where each layer depends on the previous one. This pattern lets the AI focus on one concern at a time and produces cleaner code.

**When to avoid:** When the requirements are interdependent and knowing all of them upfront would change the design. For example, if the rate limiting requires a different data model structure, the AI would need to refactor its earlier work.

### Pattern 2: Scaffold-Then-Fill

Ask the AI to create a skeleton first, then fill in the details:

```
Turn 1: "Create the file structure and class skeletons for a
         payment processing module. Include class names, method
         signatures, and docstrings, but use `pass` for all
         method bodies."

Turn 2: "Now implement the PaymentProcessor.process_payment() method."
Turn 3: "Now implement the RefundHandler.process_refund() method."
Turn 4: "Now implement the PaymentValidator class methods."
```

**When to use:** For complex modules where you want to review the architecture before committing to implementations. This also helps when you want to control the order in which things are built.
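A Turn 1 skeleton for the example above might look like the following sketch. The class and method names are hypothetical, and it uses `raise NotImplementedError` rather than `pass` so that calling an unfinished method fails loudly during review:

```python
class PaymentProcessor:
    """Coordinates payment capture against the payment gateway."""

    def process_payment(self, amount_cents: int, card_token: str) -> str:
        """Charge the card; return a transaction id."""
        raise NotImplementedError


class RefundHandler:
    """Issues refunds for completed transactions."""

    def process_refund(self, transaction_id: str) -> bool:
        """Refund a transaction; return True on success."""
        raise NotImplementedError
```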

### Pattern 3: Test-First Conversation

Write tests first, then ask the AI to implement the code:

```
Turn 1: "Here are the tests for a MarkdownParser class:
         [paste test file]
         Implement the MarkdownParser class that makes all
         these tests pass."

Turn 2: "Three tests are failing. Here are the error messages:
         [paste errors]
         Fix the implementation."

Turn 3: "All tests pass. Now add tests and implementation for
         nested blockquote support."
```

**When to use:** When you have clear specifications that can be expressed as tests. This pattern produces highly reliable code because the AI has unambiguous success criteria.
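As a concrete sketch of this pattern, here is the kind of test file you might paste in Turn 1 for a hypothetical `slugify` helper, together with the sort of implementation the AI might produce against it (all names are illustrative):

```python
import re

# What the AI might return once given the tests below.
def slugify(text: str) -> str:
    """Turn a title into a lowercase, hyphen-separated URL slug."""
    text = text.strip().lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)  # collapse punctuation and spaces
    return text.strip("-")

# The "spec as tests" you would paste in Turn 1.
def test_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"

def test_collapses_punctuation():
    assert slugify("One, Two & Three!") == "one-two-three"

def test_strips_edges():
    assert slugify("  -- Edge Case --  ") == "edge-case"
```

Because the tests are unambiguous, the follow-up turns ("three tests are failing, here are the errors") give the AI precise, checkable feedback.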

### Pattern 4: Review-and-Refine

Generate code, then iteratively improve it through review cycles:

```
Turn 1: "Write a function to parse CSV files with custom delimiters."
Turn 2: "This looks good, but add error handling for malformed rows
         and support for quoted fields."
Turn 3: "The error handling is good. Now optimize it for large files
         using a streaming approach instead of loading everything
         into memory."
Turn 4: "Add type hints and a docstring that includes usage examples."
```

**When to use:** When you are not sure exactly what you want upfront, or when you want to explore the solution space before committing to a final design.

### Pattern 5: Parallel Exploration

Use the conversation to explore multiple approaches before choosing one:

```
Turn 1: "I need to implement a caching layer. Show me three
         different approaches:
         1. Simple dict-based cache with TTL
         2. LRU cache using functools
         3. Redis-backed cache
         For each, show the core implementation and list pros/cons."

Turn 2: "I'll go with approach 1 (dict-based with TTL) but I want the
         interface from approach 3 so I can swap in Redis later.
         Build the full implementation with that interface."
```

**When to use:** When there are multiple valid approaches and you want the AI to help you evaluate tradeoffs before committing.

> **Real-World Application**
>
> Professional developers often combine these patterns within a single session. A typical workflow might look like: scaffold the module structure (Pattern 2), progressively implement each component (Pattern 1), review and refine each one (Pattern 4), then write tests to verify everything (Pattern 3). The key is being intentional about which pattern you are using at each stage.


## 9.5 Context Priming Techniques

Context priming is the practice of preparing the AI's "mental state" at the beginning of a conversation to improve all subsequent responses. Think of it as setting the stage before the performance begins.

### Technique 1: Role and Expertise Priming

Tell the AI what role it should play and what expertise it should bring:

```
You are an expert Python backend developer with deep experience
in FastAPI, SQLAlchemy, and microservice architecture. You write
clean, well-tested code following SOLID principles. You always
consider error handling, edge cases, and performance implications.
```

This is not just a fluffy preamble---it measurably improves code quality by activating relevant "knowledge" in the model's weights. The AI draws on different patterns when primed as a "backend expert" versus a "frontend developer" versus a "data scientist."

### Technique 2: Codebase Context Priming

Give the AI enough context about your existing codebase to generate compatible code:

Here's our project structure and key patterns:

We use a repository pattern for data access:

```python
class UserRepository:
    def __init__(self, session: AsyncSession):
        self.session = session

    async def get_by_id(self, user_id: UUID) -> User | None:
        result = await self.session.execute(
            select(User).where(User.id == user_id)
        )
        return result.scalar_one_or_none()
```

All our API endpoints follow this pattern:

```python
@router.get("/{item_id}", response_model=ItemResponse)
async def get_item(
    item_id: UUID,
    repo: ItemRepository = Depends(get_item_repository),
) -> ItemResponse:
    item = await repo.get_by_id(item_id)
    if not item:
        raise HTTPException(status_code=404)
    return ItemResponse.model_validate(item)
```

Follow these exact patterns for any new code.


By showing the AI *examples* of your existing patterns, you dramatically increase the consistency of generated code.

### Technique 3: Anti-Pattern Priming

Tell the AI what *not* to do. This is surprisingly effective:

```
Important constraints:
- Do NOT use global variables or module-level mutable state
- Do NOT use print() for logging; use the logging module
- Do NOT catch generic Exception; always catch specific exceptions
- Do NOT use string concatenation for SQL; always use parameterized queries
- Do NOT use datetime.now(); always use datetime.now(tz=UTC)
```


Negative constraints are powerful because they preempt common AI code generation patterns that may not match your project's standards.
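For illustration, here is a small function that satisfies every negative constraint in the list above. The `cursor` object, table, and column names are hypothetical placeholders following DB-API parameter style, and `timezone.utc` is used for an explicitly timezone-aware timestamp:

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)  # logging module, not print()

def record_login(cursor, user_id: int) -> None:
    """Record a login timestamp while honoring the constraints above."""
    # Parameterized query -- never build SQL by string concatenation.
    cursor.execute(
        "UPDATE users SET last_login = %s WHERE id = %s",
        (datetime.now(tz=timezone.utc), user_id),  # timezone-aware now
    )
    logger.info("Recorded login for user %s", user_id)
```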

### Technique 4: Output Format Priming

Tell the AI exactly how you want responses formatted:

```
When you write code, follow this format:
1. Start with a brief explanation of the approach (2-3 sentences)
2. Show the complete code in a single code block
3. List any assumptions you made
4. Note any potential issues or edge cases

Do NOT show multiple alternatives unless I ask.
Do NOT explain basic Python syntax.
```


This saves enormous context by preventing the AI from generating lengthy explanations you do not need.

> **Best Practice**
>
> Create a "priming template" for your most common types of coding sessions. Save it in a text file and paste it at the start of each new conversation. Most experienced vibe coders have 3-5 templates they use regularly: one for backend work, one for frontend, one for debugging, one for code review, and so on.

### Technique 5: Example-Driven Priming

Provide a concrete example of the input-output relationship you want:

```
I'm going to ask you to create database migration files. Here's an
example of our migration format:

Input: "Add a status enum column to the orders table with values:
pending, processing, shipped, delivered."

Expected output:

"""Add status column to orders table.

Revision ID: abc123
"""

from alembic import op
import sqlalchemy as sa

def upgrade():
    op.execute(
        "CREATE TYPE order_status AS ENUM "
        "('pending', 'processing', 'shipped', 'delivered')"
    )
    op.add_column('orders',
        sa.Column('status',
                  sa.Enum('pending', 'processing', 'shipped', 'delivered',
                          name='order_status'),
                  nullable=False, server_default='pending'))

def downgrade():
    op.drop_column('orders', 'status')
    op.execute("DROP TYPE order_status")

Now generate a migration for: [your actual request]
```


This "few-shot" approach (covered in more depth in Chapter 12) is one of the most reliable ways to get the AI to produce code in exactly your preferred format.

---

## 9.6 Managing Long Conversations

Every conversation has a lifecycle. It starts sharp and focused, and over time---as context accumulates and topics shift---it gradually degrades. Understanding and managing this lifecycle is essential for productive vibe coding.

### The Conversation Lifecycle

```
Quality
  ^
  |           *  ***
  |          *      **
  |         *         ***
  |        *             ***
  |      **
  |   ***
  | **
  +-----------------------------> Turns
  0     5    10    15    20    25

  Phase 1:     Phase 2:      Phase 3:
  Ramp-up      Peak          Degradation
```

**Phase 1: Ramp-up (turns 1-3).** You are establishing context. The AI does not yet have a full picture of your project. Responses improve as you provide more information.

**Phase 2: Peak performance (turns 3-15).** The AI has enough context to produce excellent results but has not yet accumulated so much history that it starts losing track. This is your productive sweet spot.

**Phase 3: Degradation (turns 15+).** The conversation history is getting long. The AI may start forgetting earlier constraints, repeating mistakes, or generating code that contradicts earlier decisions. The exact onset depends on how token-heavy your turns are.

### Signs of Context Degradation

Watch for these warning signs that your conversation is degrading:

1. **The AI forgets constraints you set earlier.** It generates code that violates rules you established at the start.
2. **Responses contradict earlier responses.** The AI suggests an approach it previously argued against, or changes variable names inconsistently.
3. **The AI repeats itself.** It re-explains things it already explained or regenerates code it already wrote.
4. **Quality decreases.** Code becomes sloppier, error handling gets worse, or the AI starts taking shortcuts.
5. **The AI seems confused about the current state.** It references code or files that have been replaced or asks questions you already answered.

### Strategies for Extending Conversation Life

**Strategy 1: Summarize and Reset.** Periodically ask the AI to summarize the current state, then include that summary in a fresh conversation:

```
Summarize everything we've built so far: the data models, API
endpoints, and any key decisions we've made. Format it as a
concise technical summary I can use to start a new conversation.
```


**Strategy 2: Anchor Messages.** Periodically restate the most important context:

```
Before we continue, let me restate our key constraints:
- Async everywhere
- Repository pattern for data access
- Pydantic v2 models for all API responses
- RFC 7807 error responses

Current state: We've completed the User and Recipe models, and the
CRUD endpoints for both. Now let's work on the search functionality.
```


**Strategy 3: Trim Verbose Responses.** If the AI is generating long explanations you do not need, tell it to be concise:

```
From now on, show only the code. Skip explanations unless I
specifically ask for them.
```


This can save thousands of tokens per response.

**Strategy 4: Use Focused Sub-Conversations.** For complex tasks, break your work into focused conversations instead of one marathon session:

- Conversation 1: Design the data models
- Conversation 2: Implement the API endpoints (paste in the final models from Conversation 1)
- Conversation 3: Write the tests (paste in the final API code from Conversation 2)

> **Common Pitfall**
>
> Some developers avoid starting new conversations because they feel like they are "losing" all the context they built up. In reality, a fresh conversation with a well-crafted summary of the previous one often produces *better* results than continuing a degraded conversation. The summary acts as a distilled, high-quality context that the AI can use more effectively than a sprawling 50-message history.

---

## 9.7 When to Start Fresh vs. Continue

This is one of the most important practical decisions in vibe coding. Starting fresh too often wastes time re-establishing context. Continuing too long leads to degraded outputs. Here is a framework for making this decision.

### Signals to Continue the Current Conversation

- The AI is still producing high-quality, consistent code
- You are still working on the same feature or component
- The conversation is under 15-20 turns
- The AI correctly remembers your constraints and conventions
- You have not significantly changed topics or requirements

### Signals to Start a New Conversation

- You are starting a different feature or component
- The conversation has exceeded 20-25 turns with heavy code exchanges
- The AI is showing signs of context degradation (Section 9.6)
- You have changed your mind about a fundamental design decision
- You are switching from one phase of development to another (for example, from implementation to testing)
- The AI has made the same mistake more than twice

### The "Fresh Start" Protocol

When you decide to start a new conversation, follow this protocol to minimize context loss:

1. **Ask the AI to summarize.** Before ending the old conversation, ask:
   ```
   Summarize our progress so far. Include:
   - What we've built (files and their purposes)
   - Key design decisions and why we made them
   - Current state of the code
   - What still needs to be done
   ```

2. **Gather the artifacts.** Copy the final version of any code the AI generated. You want the *code itself*, not the conversation about it.

3. **Start the new conversation with a priming message.** Combine your standard project priming with the summary and relevant code:
   ```
   I'm continuing work on a FastAPI recipe management API.
   [Project constraints from your template]

   Here's where we left off:
   [Paste the summary]

   Here's the current code:
   [Paste relevant files]

   Next task: [What you want to work on]
   ```

This "fresh start with context" approach gives you the benefits of both a clean context window and continuity with your previous work.
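The third step of the protocol is mechanical enough to automate. A sketch that assembles the fresh-start message from your priming template, the summary, and the files you gathered (function name and message layout are illustrative):

```python
def fresh_start_message(template: str, summary: str,
                        files: dict[str, str], next_task: str) -> str:
    """Combine template + summary + current code + next task into one prompt."""
    parts = [template, "Here's where we left off:\n" + summary]
    for name, code in files.items():
        parts.append(f"Current code ({name}):\n{code}")
    parts.append(f"Next task: {next_task}")
    return "\n\n".join(parts)
```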

> **Real-World Application**
>
> Many experienced vibe coders develop a rhythm: they work in focused sessions of 10-20 turns, take a brief pause to review and summarize, then start fresh. This rhythm maps well to the Pomodoro technique or similar time-management approaches. Some developers even automate the summarization step with scripts that extract key information from their conversation history.

### The Decision Flowchart

```
Has the conversation exceeded 20 turns?
|
+-- No --> Is the AI still performing well?
|          |
|          +-- Yes --> CONTINUE
|          |
|          +-- No --> Are there specific degradation signs?
|                     |
|                     +-- Yes --> START FRESH
|                     |
|                     +-- No --> Try an anchor message first
|
+-- Yes --> Are you still on the same feature?
            |
            +-- No --> START FRESH
            |
            +-- Yes --> Is the AI still performing well?
                        |
                        +-- Yes --> CONTINUE (with caution)
                        |
                        +-- No --> START FRESH
```
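The flowchart reduces to a few conditionals. A sketch with the four answers as parameters (names and return strings are illustrative):

```python
def next_step(turns: int, performing_well: bool,
              degradation_signs: bool = False,
              same_feature: bool = True) -> str:
    """Encode the continue-vs-fresh decision flowchart."""
    if turns <= 20:
        if performing_well:
            return "CONTINUE"
        if degradation_signs:
            return "START FRESH"
        return "TRY AN ANCHOR MESSAGE FIRST"
    if not same_feature:
        return "START FRESH"
    return "CONTINUE (with caution)" if performing_well else "START FRESH"
```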


---

## 9.8 System Prompts and Personas

System prompts are special instructions that sit at the very beginning of the context window, in a high-attention position. They shape the AI's behavior for the entire conversation. Understanding how to use them---and the tools that let you set them---is a key context management skill.

### What System Prompts Do

System prompts serve several functions:

1. **Set the AI's role and expertise level.** "You are a senior Python developer specializing in async web applications."
2. **Establish behavioral rules.** "Always include error handling. Never use deprecated APIs."
3. **Define output format.** "Respond with code only. No explanations unless asked."
4. **Provide persistent context.** "This project uses FastAPI 0.110+ with Pydantic v2."

Because system prompts sit at the beginning of the context (the primacy position), they have outsized influence on the AI's behavior compared to the same instructions placed later in the conversation.

### Tool-Specific System Prompt Mechanisms

Different AI coding tools give you different levels of control over system prompts:

**Claude Code** uses a `CLAUDE.md` file in your project root. Any instructions in this file are automatically included as context for every conversation:

```markdown
# CLAUDE.md

## Project: Recipe API
- Framework: FastAPI with async/await
- Database: PostgreSQL via SQLAlchemy 2.0
- All responses use Pydantic v2 models
- Follow Google Python Style Guide
- Always include type hints

## Code Conventions
- Use `UUID` for all primary keys
- Repository pattern for data access
- RFC 7807 for error responses

```

**Cursor** uses a `.cursorrules` file with similar project-level instructions.

**ChatGPT** allows custom instructions in the user settings that apply to all conversations.

**API access** (if you are building tools or using the API directly) gives you direct control over the system prompt parameter.

### Designing Effective System Prompts

The best system prompts are:

- **Specific rather than generic.** "Use SQLAlchemy 2.0 style queries with `select()`" beats "use modern SQLAlchemy."
- **Action-oriented.** "Always validate input parameters at the start of each function" beats "input validation is important."
- **Concise.** Every token in the system prompt is consumed on every message in the conversation. Long system prompts are expensive.
- **Prioritized.** Put the most important instructions first, in case the model pays less attention to later parts.
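The conciseness rule is easy to quantify: the system prompt travels with every request, so its cost multiplies by the number of turns. A rough sketch using the common ~4-characters-per-token heuristic (real tokenizers vary, so treat these numbers as estimates):

```python
def estimate_tokens(text: str) -> int:
    """Rough token count via the ~4 characters/token rule of thumb."""
    return max(1, len(text) // 4)

def system_prompt_cost(prompt: str, expected_turns: int) -> int:
    """Total tokens a system prompt consumes across a conversation,
    since it is re-sent with every request."""
    return estimate_tokens(prompt) * expected_turns
```

By this estimate, a 2,000-character system prompt in a 30-turn session costs roughly 15,000 tokens across the conversation, which is often worth trimming.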

### Personas for Different Tasks

You can dramatically change the AI's behavior by defining different personas for different tasks:

**The Architect:**

```
You are a software architect reviewing code for design issues.
Focus on: separation of concerns, dependency management,
interface design, and scalability. Do not comment on style
or formatting issues.
```

**The Security Reviewer:**

```
You are a security engineer reviewing code for vulnerabilities.
Check for: injection attacks, authentication bypasses, data
exposure, insecure defaults, and missing input validation.
Rate each finding by severity (Critical/High/Medium/Low).
```

**The Performance Optimizer:**

```
You are a performance engineer. Analyze code for: unnecessary
allocations, N+1 queries, missing indexes, blocking operations,
and inefficient algorithms. Suggest specific optimizations with
expected impact.
```

Each persona focuses the AI's attention on a different dimension of code quality, producing more thorough and relevant feedback than a generic review.

> **Advanced**
>
> You can combine personas with output format instructions to create powerful analysis tools. For example, a "Security Reviewer" persona combined with "Output findings as a JSON array with fields: line_number, severity, category, description, recommendation" creates an automated security scanner that produces structured, parseable output.
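On the consuming side, structured output is only useful if you validate it before trusting it. A small sketch that checks the returned JSON array against the field names requested above (a hypothetical helper, not part of any tool):

```python
import json

REQUIRED_FIELDS = {"line_number", "severity", "category", "description", "recommendation"}
SEVERITIES = {"Critical", "High", "Medium", "Low"}

def parse_findings(raw: str) -> list[dict]:
    """Parse and validate the JSON array a structured-output persona returns.

    Raises ValueError if the output does not match the requested schema.
    """
    findings = json.loads(raw)
    if not isinstance(findings, list):
        raise ValueError("expected a JSON array of findings")
    for finding in findings:
        missing = REQUIRED_FIELDS - finding.keys()
        if missing:
            raise ValueError(f"finding missing fields: {missing}")
        if finding["severity"] not in SEVERITIES:
            raise ValueError(f"unknown severity: {finding['severity']}")
    return findings
```

Validation like this catches the occasional malformed response before it propagates into downstream tooling.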


---

## 9.9 File and Codebase Context Strategies

One of the most common context management challenges is figuring out which files to include in your conversation and how to present them. Including too little context leads to code that does not integrate well with your project. Including too much wastes tokens and can confuse the AI.

### Strategy 1: The Relevant Subset

For most tasks, you only need to include files that are directly relevant to the task at hand. Ask yourself:

- What file am I modifying?
- What files does it import from?
- What files import from it?
- Are there tests for this file?
- Are there configuration files that affect this code?

For example, if you are adding a new endpoint to your API, you might include:

```
1. The router file you're modifying        (directly relevant)
2. The data model it will use              (dependency)
3. The repository it will call             (dependency)
4. The Pydantic schema it will return      (dependency)
5. An existing endpoint as an example      (pattern reference)
```

You probably do not need to include: the database migration files, unrelated routers, the frontend code, or the deployment configuration.
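The "what does it import from?" question can even be answered mechanically. A sketch using Python's `ast` module to list a file's imports from your own package (the `app` package name is an assumption carried over from the examples above):

```python
import ast

def local_imports(source: str, package: str = "app") -> set[str]:
    """List modules imported from your own package -- candidates for
    inclusion as file context alongside the file you are editing."""
    prefix = package + "."
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(a.name for a in node.names if a.name.startswith(prefix))
        elif isinstance(node, ast.ImportFrom) and node.module and node.module.startswith(prefix):
            found.add(node.module)
    return found
```

Running this over the file you plan to modify gives you a starting list of dependency files to consider including.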

### Strategy 2: Interface-Only Context

When a file is relevant but large, you can include just the interfaces---the class definitions, method signatures, and docstrings---without the implementation details:

```python
# Instead of pasting the entire 500-line UserRepository,
# paste just the interface:

class UserRepository:
    """Handles all database operations for User entities."""

    async def get_by_id(self, user_id: UUID) -> User | None: ...
    async def get_by_email(self, email: str) -> User | None: ...
    async def create(self, user: UserCreate) -> User: ...
    async def update(self, user_id: UUID, data: UserUpdate) -> User: ...
    async def delete(self, user_id: UUID) -> bool: ...
    async def list(self, offset: int, limit: int) -> list[User]: ...
```

This gives the AI enough information to write code that interacts with the repository correctly, without wasting tokens on implementation details it does not need.
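Producing such interface-only views by hand is tedious, but Python's `ast` module can do it mechanically. A sketch (assumes Python 3.9+ for `ast.unparse`) that keeps signatures and docstrings while replacing every function body with `...`:

```python
import ast

def extract_interface(source: str) -> str:
    """Strip function bodies from a module, keeping signatures and
    docstrings, to produce an interface-only version for pasting."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            new_body = []
            if ast.get_docstring(node) is not None:
                new_body.append(node.body[0])  # keep the docstring expression
            new_body.append(ast.Expr(ast.Constant(...)))  # replace the rest with `...`
            node.body = new_body
    return ast.unparse(tree)
```

Run a large module's source through this before pasting it, and you keep the information the AI needs at a fraction of the token cost.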

### Strategy 3: Contextual Snippets

Instead of including entire files, include only the relevant sections with enough surrounding context to be clear:

```python
# From app/models/user.py (lines 45-62):
class User(Base):
    __tablename__ = "users"

    id: Mapped[UUID] = mapped_column(primary_key=True, default=uuid4)
    email: Mapped[str] = mapped_column(unique=True, index=True)
    name: Mapped[str] = mapped_column(String(100))
    hashed_password: Mapped[str]
    is_active: Mapped[bool] = mapped_column(default=True)
    created_at: Mapped[datetime] = mapped_column(default=func.now())

# From app/schemas/user.py (lines 10-25):
class UserResponse(BaseModel):
    id: UUID
    email: str
    name: str
    is_active: bool
    created_at: datetime

    model_config = ConfigDict(from_attributes=True)
```

Note how each snippet includes its source file and line numbers. This helps the AI understand where the code lives and how it relates to the broader codebase.

### Strategy 4: Tree-and-Summary

For large codebases, provide the directory tree and a brief description of each major component:

```
project/
  app/
    __init__.py          # App factory
    config.py            # Environment-based configuration (uses pydantic-settings)
    models/
      __init__.py
      base.py            # SQLAlchemy declarative base, common mixins
      user.py            # User model (id, email, name, password, timestamps)
      recipe.py          # Recipe model (id, title, ingredients JSON, user_id FK)
    api/
      __init__.py
      deps.py            # FastAPI dependencies (get_db, get_current_user)
      users.py           # User CRUD endpoints
      recipes.py         # Recipe CRUD endpoints + search
      auth.py            # Login, register, token refresh
    services/
      __init__.py
      auth.py            # JWT creation/validation, password hashing
      search.py          # Full-text search with PostgreSQL tsvector
  tests/
    conftest.py          # Shared fixtures (test DB, client, auth headers)
    test_users.py        # User endpoint tests
    test_recipes.py      # Recipe endpoint tests
  alembic/               # Database migrations
  pyproject.toml         # Dependencies and tool configuration
```

This gives the AI a "map" of the codebase at minimal token cost. When you then ask it to work on a specific file, it understands where that file fits in the overall architecture.
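If your tool does not generate such a map for you, a few lines of Python can. A sketch that walks a project and returns an indented tree, skipping common noise directories (you would add the per-file descriptions by hand):

```python
import os

def tree(root: str, prefix: str = "") -> list[str]:
    """Produce an indented directory listing suitable for pasting as a
    low-cost codebase map. Skips common noise directories."""
    skip = {".git", "__pycache__", ".venv", "node_modules"}
    lines = []
    for name in sorted(os.listdir(root)):
        if name in skip:
            continue
        path = os.path.join(root, name)
        if os.path.isdir(path):
            lines.append(f"{prefix}{name}/")
            lines.extend(tree(path, prefix + "  "))  # recurse into subdirectory
        else:
            lines.append(f"{prefix}{name}")
    return lines
```

Something like `print("\n".join(tree(".")))` run from the project root gives you a paste-ready skeleton in seconds.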

> **Best Practice**
>
> Many AI coding tools can read files directly from your filesystem (Claude Code, Cursor, and others). When available, let the tool read files rather than pasting them manually. The tool can often select relevant context more efficiently than you can, and it ensures the AI sees the actual current state of the file rather than a potentially stale copy.

### Strategy 5: Diff-Based Context

When you are fixing a bug or modifying existing code, providing a diff is often more efficient than providing the full file:

The following change introduced a bug in our search endpoint. Here's the diff:

```diff
@@ -45,8 +45,8 @@ async def search_recipes(
     query: str,
     db: AsyncSession = Depends(get_db),
 ):
-    results = await search_service.search(db, query)
-    return [RecipeResponse.model_validate(r) for r in results]
+    results = await search_service.search(db, query, limit=10)
+    return results  # BUG: returns raw ORM objects instead of schemas
```

The AI can immediately see what changed and focus on the specific issue rather than reading through the entire file.
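If you do not have a diff handy from version control, the standard library can produce one. A sketch using `difflib.unified_diff` on the old and new versions of a file:

```python
import difflib

def make_diff(old: str, new: str, filename: str) -> str:
    """Produce a unified diff -- usually far fewer tokens than both full files."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    ))
```

The `a/` and `b/` prefixes mirror the convention `git diff` uses, so the output looks familiar to both you and the AI.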


---

## 9.10 Context Budget Planning

Just as you would not start a software project without estimating the time and resources needed, you should not start a vibe coding session without planning your context budget. This section introduces a systematic approach to context planning.

### The Context Budget Framework

Before starting a coding session, estimate how your context window will be allocated:

```
Total Available Context: ~150,000 tokens

Fixed Costs (always present):
  System prompt / tool overhead:       5,000 tokens
  Project instructions (CLAUDE.md):    1,000 tokens
  Reserved for AI response:            4,000 tokens
  ─────────────────────────────────────────────
  Fixed total:                        10,000 tokens

Variable Costs (this session):
  File context (estimate files needed): _____ tokens
  Conversation history budget:          _____ tokens
  Current prompt:                       _____ tokens
  ─────────────────────────────────────────────
  Variable total:                       _____ tokens

Available for conversation:     150,000 - fixed - variable
```
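This arithmetic is simple enough to script. A sketch that mirrors the framework above, with the worked example's numbers as defaults (substitute measurements from your own tool):

```python
def remaining_budget(total: int = 150_000, *, system: int = 5_000,
                     project: int = 1_000, response_reserve: int = 4_000,
                     files: int = 0, history: int = 0, prompt: int = 0) -> int:
    """Tokens left for further conversation after fixed and variable costs."""
    fixed = system + project + response_reserve
    variable = files + history + prompt
    return total - fixed - variable
```

For example, `remaining_budget(files=12_000, history=25_000, prompt=500)` reports 102,500 tokens still available for the rest of the session.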

### Planning by Task Type

Different types of coding tasks have different context needs:

**Simple function implementation:**

```
File context:    1,000 tokens  (one file + interface of one dependency)
Conversation:    5,000 tokens  (3-5 turns)
Current prompt:    300 tokens
─────────────────────────
Total variable:  6,300 tokens  (very light; plenty of room)
```

**New feature across multiple files:**

```
File context:   12,000 tokens  (5-8 files, mix of full and interface)
Conversation:   25,000 tokens  (15-20 turns of implementation)
Current prompt:    500 tokens
─────────────────────────
Total variable: 37,500 tokens  (moderate; may need mid-session summary)
```

**Major refactoring:**

```
File context:   30,000 tokens  (10-15 files showing before state)
Conversation:   40,000 tokens  (20-30 turns of careful refactoring)
Current prompt:  1,000 tokens
─────────────────────────
Total variable: 71,000 tokens  (heavy; definitely split into sessions)
```

**Debugging a complex issue:**

```
File context:   15,000 tokens  (relevant files + stack trace + logs)
Conversation:   20,000 tokens  (back-and-forth investigation)
Current prompt:  2,000 tokens  (detailed error report)
─────────────────────────
Total variable: 37,000 tokens  (moderate; keep debug context focused)
```

### Practical Budget Optimization Techniques

**Technique 1: Pre-summarize large files.** Before pasting a 500-line file, create a summary version that captures the essentials in 50-100 lines. Include full details only for the sections directly relevant to your task.

**Technique 2: Use code references instead of code.** Instead of pasting an entire dependency, just describe it:

```
The UserRepository class (in app/repositories/user.py) has the
standard CRUD methods: get_by_id, create, update, delete, and
list_all. All are async and accept an AsyncSession parameter.
```

**Technique 3: Plan your conversation turns.** Before starting, estimate how many turns you will need. If you think you will need more than 15-20 turns, plan a midpoint summary and fresh start.

**Technique 4: Front-load, do not drip-feed.** Providing all necessary context in the first message is more token-efficient than spreading it across multiple back-and-forth exchanges. Compare:

```
# Drip-feeding (wasteful): 6 messages, ~2,000 tokens
User: Write a sort function.
AI:   What language?
User: Python.
AI:   What data type?
User: List of dicts.
AI:   What key?
User: "name", case-insensitive.

# Front-loading (efficient): 2 messages, ~400 tokens
User: Write a Python function to sort a list of dicts by the
      "name" key, case-insensitive, with None values last.
AI:   [complete implementation]
```

The front-loaded version uses 80% fewer tokens and produces a better result.

> **Intuition**
>
> Context budget planning is like packing for a trip. You have a suitcase of fixed size (the context window). If you pack carefully---rolling clothes instead of folding, bringing versatile items, leaving behind things you will not need---you can fit everything you need. If you throw things in randomly, you will run out of space and find yourself without something important.

### Session Planning Template

Here is a template you can use to plan your vibe coding sessions:

```markdown
## Session Plan

**Goal:** [What you want to accomplish]
**Estimated turns:** [How many conversation turns you expect]
**Context strategy:** [Which file context strategy you'll use]

### Files to Include
1. [filename] - [why it's needed] - [full/interface/snippet]
2. [filename] - [why it's needed] - [full/interface/snippet]
3. ...

### Priming Information
- Role: [What persona/expertise to prime]
- Constraints: [Key rules for this session]
- Patterns: [Code examples to follow]

### Conversation Plan
- Turn 1: [Priming + first task]
- Turns 2-N: [Sequence of subsequent tasks]
- Checkpoint: [When to summarize / evaluate whether to continue]

### Success Criteria
- [ ] [What "done" looks like]
- [ ] [Quality checks to perform]
```

> **Common Pitfall**
>
> Do not over-plan to the point where planning takes longer than the actual coding session. For simple tasks (a few functions, a quick fix), you can skip formal planning entirely. The planning template is most valuable for complex, multi-file tasks that will involve extended conversations.


---

## Chapter Summary

Context management is the hidden skill that separates productive vibe coders from frustrated ones. The key insights from this chapter are:

1. **Context windows are finite and uneven.** The AI pays the most attention to the beginning and end of the context, with information in the middle receiving less attention.
2. **Your conversation is a data structure.** Treat it as an append-only, ordered sequence with fixed capacity. Design it deliberately.
3. **Information placement matters.** Front-load critical context, use the sandwich pattern for important messages, and periodically restate key constraints.
4. **Multi-turn patterns are tools.** Choose the right pattern---progressive disclosure, scaffold-then-fill, test-first, review-and-refine, or parallel exploration---for each task.
5. **Priming sets the stage.** Invest tokens in role priming, codebase context, anti-patterns, and output format instructions at the start of each conversation.
6. **Conversations degrade over time.** Watch for signs of context degradation and know when to summarize and start fresh.
7. **File context is a strategic choice.** Use the relevant subset, interface-only, snippets, tree-and-summary, or diff-based approaches depending on the task.
8. **Budget your context.** Plan your token allocation before starting a session, especially for complex tasks.

Mastering these skills will make your conversations with AI coding assistants dramatically more productive. In the next chapter, we will build on this foundation to explore specification-driven prompting---a technique that uses detailed specifications to get precise, reliable code from AI on the first attempt.


---

## Looking Ahead

In Chapter 10: Specification-Driven Prompting, you will learn how to translate vague ideas into precise specifications that AI can implement reliably. You will see how requirements documents, user stories, API specifications, and schemas can serve as powerful prompts that produce production-quality code with minimal iteration. The context management skills from this chapter will be essential---writing good specifications requires knowing how to fit detailed requirements into the context window effectively.


**Cross-references:** For technical details on context windows and tokenization, see Chapter 2, Sections 2.5-2.6. For the fundamentals of writing effective prompts that this chapter builds upon, see Chapter 8. For advanced multi-turn conversation techniques, see Chapter 11: Iterative Refinement and Conversation Patterns.