Case Study 2: Custom Development Tools Suite
Background
Forge Labs is a startup building a real-time collaboration platform. The engineering team of 25 developers moves fast, shipping multiple times per day with a custom deployment pipeline built on Kubernetes. The team has adopted Claude Code as their primary AI coding assistant, but they find themselves repeatedly hitting the same friction points:
- Deployment complexity. Their deployment process involves checking service health, running database migrations, deploying to a canary environment, monitoring error rates, and then promoting to production. Engineers must remember each step and execute them manually.
- Inconsistent code generation. The AI generates code that works but does not follow Forge Labs' specific patterns — their custom error handling framework, their logging conventions, their authentication middleware approach.
- Test coverage gaps. The team requires 80% test coverage for all services, but engineers often forget to check coverage before submitting pull requests.
- Environment drift. Local development environments frequently diverge from staging and production, causing "works on my machine" issues.
- Dependency management. With 15 microservices sharing common libraries, dependency version conflicts are a regular headache.
Marcus Rivera, the team's platform lead, decides to build a comprehensive suite of custom development tools that address all five pain points through a combination of MCP servers and slash commands.
Design Philosophy
Marcus establishes three design principles for the tool suite:
Principle 1: Encode the "Forge Way." Every tool should embed Forge Labs' specific conventions, patterns, and requirements. The AI should not just generate code — it should generate Forge Labs code.
Principle 2: Guard the guardrails. Tools should make it easy to do the right thing and hard to do the wrong thing. Safety checks should be built into workflows, not bolted on after the fact.
Principle 3: Compose, don't monolith. Small, focused tools that can be combined are better than large tools that try to do everything. This follows the Unix philosophy and aligns with how AI agents chain tool calls.
The Tool Suite
MCP Server: forge-dev-tools
Marcus builds a single MCP server with tools organized into categories:
Deployment Tools
@tool(
    name="check_deploy_readiness",
    description=(
        "Check if a service is ready for deployment. Verifies: test "
        "suite passes, coverage meets 80% threshold, no uncommitted "
        "changes, Docker image builds successfully, database migrations "
        "are up to date, and the dependency audit is clean. Returns a "
        "detailed readiness report. Run this BEFORE any deployment."
    ),
    input_schema={
        "type": "object",
        "properties": {
            "service": {
                "type": "string",
                "description": "Service name (e.g., 'user-service', 'collab-engine')",
            },
            "environment": {
                "type": "string",
                "enum": ["staging", "production"],
            },
        },
        "required": ["service", "environment"],
    },
)
async def check_deploy_readiness(arguments: dict) -> list[TextContent]:
    service = arguments["service"]
    environment = arguments["environment"]
    checks = {
        "tests_pass": await run_test_suite(service),
        "coverage_meets_threshold": await check_coverage(service, threshold=80),
        "no_uncommitted_changes": await check_git_clean(service),
        "docker_builds": await build_docker_image(service, dry_run=True),
        "migrations_current": await check_migrations(service),
        "dependency_audit_clean": await audit_dependencies(service),
    }
    all_passed = all(c["passed"] for c in checks.values())
    return [TextContent(
        type="text",
        text=json.dumps({
            "service": service,
            "environment": environment,
            "ready": all_passed,
            "checks": checks,
            "recommendation": (
                f"Service {service} is ready for deployment to {environment}."
                if all_passed
                else f"Service {service} is NOT ready. Fix the failing "
                f"checks before deploying."
            ),
        }),
    )]
The deployment tool suite also includes deploy_to_canary, check_canary_health, promote_to_production, and rollback_deployment. Each tool handles one step, allowing the AI to chain them together as an agent workflow (as described in Chapter 36) while giving the human engineer visibility and control at each step.
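The canary tools themselves are not shown in this case study. As a rough illustration of the promotion logic such an agent workflow relies on, the decision after a canary deploy can be reduced to a small gate function; the thresholds and function name below are invented for the sketch, not Forge Labs policy:

```python
# Hypothetical canary gate: decide the next deployment step from canary
# metrics. Thresholds are illustrative, not Forge Labs' real policy.

def next_deploy_step(error_rate: float, p95_latency_ms: float,
                     baseline_error_rate: float) -> str:
    """Return the name of the tool the agent should call next."""
    # Roll back if the canary errors noticeably more than the baseline.
    if error_rate > max(0.01, baseline_error_rate * 2):
        return "rollback_deployment"
    # Hold and keep monitoring if latency regressed but errors look fine.
    if p95_latency_ms > 500:
        return "check_canary_health"
    return "promote_to_production"
```

The agent would call check_canary_health, feed the metrics into a gate like this, and then invoke whichever deployment tool it names, with the human confirming each step.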
Code Quality Tools
@tool(
    name="check_forge_standards",
    description=(
        "Analyze code against Forge Labs coding standards. Checks: "
        "error handling uses ForgeError hierarchy, logging follows "
        "structured logging format, authentication uses AuthMiddleware, "
        "database queries use the repository pattern, API responses "
        "follow the standard envelope format. Returns specific "
        "violations with line numbers and suggested fixes."
    ),
    input_schema={
        "type": "object",
        "properties": {
            "file_path": {
                "type": "string",
                "description": "Path to the file to check",
            },
            "fix_suggestions": {
                "type": "boolean",
                "description": "Include code fix suggestions",
                "default": True,
            },
        },
        "required": ["file_path"],
    },
)
async def check_forge_standards(arguments: dict) -> list[TextContent]:
    file_path = arguments["file_path"]
    include_fixes = arguments.get("fix_suggestions", True)
    violations = []
    content = Path(file_path).read_text()
    lines = content.split("\n")

    # Check: Error handling uses ForgeError hierarchy
    for i, line in enumerate(lines, 1):
        if "raise Exception(" in line or "raise ValueError(" in line:
            violation = {
                "rule": "forge-errors",
                "line": i,
                "message": "Use ForgeError hierarchy instead of built-in exceptions",
                "severity": "error",
            }
            if include_fixes:
                violation["fix"] = line.replace(
                    "raise Exception(", "raise ForgeError("
                ).replace(
                    "raise ValueError(", "raise ForgeValidationError("
                )
            violations.append(violation)

    # Check: Logging follows structured format
    for i, line in enumerate(lines, 1):
        if "print(" in line and "# noqa" not in line:
            violations.append({
                "rule": "forge-logging",
                "line": i,
                "message": "Use structured logger instead of print()",
                "severity": "warning",
                "fix": line.replace("print(", "logger.info(") if include_fixes else None,
            })

    # Check: API responses use standard envelope
    if "return jsonify(" in content and '"data"' not in content:
        violations.append({
            "rule": "forge-api-envelope",
            "line": 0,
            "message": "API responses must use the standard envelope format: "
                       '{"data": ..., "meta": ..., "errors": ...}',
            "severity": "error",
        })

    return [TextContent(
        type="text",
        text=json.dumps({
            "file": file_path,
            "violations": violations,
            "total_violations": len(violations),
            "passed": len(violations) == 0,
        }),
    )]
Environment Tools
@tool(
    name="compare_environments",
    description=(
        "Compare configuration between two environments (local, staging, "
        "production). Shows differences in environment variables, "
        "dependency versions, database schema, and feature flags. "
        "Use this to debug 'works on my machine' issues."
    ),
    input_schema={
        "type": "object",
        "properties": {
            "env_a": {
                "type": "string",
                "enum": ["local", "staging", "production"],
            },
            "env_b": {
                "type": "string",
                "enum": ["local", "staging", "production"],
            },
            "service": {
                "type": "string",
                "description": "Service to compare",
            },
        },
        "required": ["env_a", "env_b", "service"],
    },
)
async def compare_environments(arguments: dict) -> list[TextContent]:
    env_a = arguments["env_a"]
    env_b = arguments["env_b"]
    service = arguments["service"]
    config_a = await get_environment_config(service, env_a)
    config_b = await get_environment_config(service, env_b)

    differences = []
    all_keys = set(config_a.keys()) | set(config_b.keys())
    for key in sorted(all_keys):
        val_a = config_a.get(key)
        val_b = config_b.get(key)
        if val_a != val_b:
            differences.append({
                "key": key,
                env_a: _redact_sensitive(key, val_a),
                env_b: _redact_sensitive(key, val_b),
                "risk": _assess_drift_risk(key, val_a, val_b),
            })

    return [TextContent(
        type="text",
        text=json.dumps({
            "service": service,
            "environments": [env_a, env_b],
            "differences": differences,
            "total_differences": len(differences),
            "high_risk_count": sum(
                1 for d in differences if d["risk"] == "high"
            ),
        }),
    )]

def _redact_sensitive(key: str, value) -> str:
    """Redact sensitive values in environment comparisons."""
    sensitive_patterns = ["password", "secret", "key", "token", "credential"]
    if any(pattern in key.lower() for pattern in sensitive_patterns):
        if value is None:
            return "<not set>"
        return f"<redacted, length={len(str(value))}>"
    return str(value) if value is not None else "<not set>"
Dependency Management Tools
@tool(
    name="check_dependency_conflicts",
    description=(
        "Check for dependency version conflicts across Forge Labs "
        "microservices. Identifies cases where different services "
        "use different versions of the same library, which can cause "
        "compatibility issues. Returns conflicts sorted by severity."
    ),
    input_schema={
        "type": "object",
        "properties": {
            "services": {
                "type": "array",
                "items": {"type": "string"},
                "description": "List of services to check. If empty, checks all.",
                "default": [],
            },
        },
    },
)
async def check_dependency_conflicts(arguments: dict) -> list[TextContent]:
    services = arguments.get("services", [])
    if not services:
        services = await list_all_services()

    # Collect all dependency versions across services
    dependency_map = {}
    for service in services:
        deps = await get_service_dependencies(service)
        for dep_name, version in deps.items():
            if dep_name not in dependency_map:
                dependency_map[dep_name] = {}
            dependency_map[dep_name][service] = version

    # Find conflicts
    conflicts = []
    for dep_name, versions in dependency_map.items():
        unique_versions = set(versions.values())
        if len(unique_versions) > 1:
            conflicts.append({
                "dependency": dep_name,
                "versions": dict(versions),
                "unique_versions": list(unique_versions),
                "severity": (
                    "high" if _is_major_version_conflict(unique_versions)
                    else "medium" if _is_minor_version_conflict(unique_versions)
                    else "low"
                ),
            })

    conflicts.sort(key=lambda c: {"high": 0, "medium": 1, "low": 2}[c["severity"]])
    return [TextContent(
        type="text",
        text=json.dumps({
            "services_checked": services,
            "conflicts": conflicts,
            "total_conflicts": len(conflicts),
            "high_severity": sum(1 for c in conflicts if c["severity"] == "high"),
        }),
    )]
Slash Command Suite
Marcus also creates a suite of slash commands stored in the repository's .claude/commands/ directory:
/forge:new-service
Create a new Forge Labs microservice following our standard template.
First, read these reference files to understand our conventions:
- services/user-service/src/app.py (application setup pattern)
- services/user-service/src/middleware/auth.py (authentication pattern)
- services/user-service/src/middleware/error_handler.py (error handling pattern)
- services/user-service/src/models/base.py (model pattern)
- services/user-service/tests/conftest.py (test setup pattern)
- shared/forge-errors/src/forge_errors.py (error hierarchy)
- shared/forge-logger/src/forge_logger.py (logging setup)
Create a new service named: $ARGUMENTS
Include:
1. Application setup with health check endpoint
2. Authentication middleware
3. Error handling middleware using ForgeError hierarchy
4. Structured logging with forge-logger
5. Database connection with the repository pattern
6. Dockerfile following our multi-stage build pattern
7. docker-compose.yml for local development
8. Test setup with fixtures and conftest.py
9. README.md with setup instructions
10. CI configuration file
Follow Forge Labs naming conventions:
- Service directory: services/<service-name>/
- Source: src/
- Tests: tests/
- Configuration: config/
/forge:deploy
Help me deploy a service to an environment.
Service and environment: $ARGUMENTS
Follow this deployment workflow:
1. Run the check_deploy_readiness tool for this service and environment
2. If any checks fail, help me fix them before proceeding
3. If deploying to production, first deploy to canary with deploy_to_canary
4. Wait for canary health check with check_canary_health
5. If canary is healthy, proceed with promote_to_production
6. If canary is unhealthy, run rollback_deployment
IMPORTANT: Always ask for my confirmation before:
- Deploying to canary
- Promoting to production
- Rolling back
Never proceed without explicit user approval for deployment actions.
/forge:debug
Help me debug an issue in a Forge Labs service.
Issue description: $ARGUMENTS
Follow this debugging workflow:
1. Use compare_environments to check for environment drift between local and staging
2. Search for related error patterns in the service's logs using search_logs
3. Check the service's dependencies with get_service_info for potential upstream issues
4. Look for recent changes in the service that might have introduced the issue
5. Search past post-mortems for similar symptoms using search_postmortems
For each potential cause you identify, provide:
- Evidence (what points to this cause)
- Confidence level (high/medium/low)
- Suggested fix
- How to verify the fix
Middleware Pipeline
Marcus implements a middleware pipeline for all tools with special attention to deployment safety:
class DeploymentSafetyMiddleware:
    """Middleware that enforces deployment safety rules."""

    PRODUCTION_TOOLS = [
        "promote_to_production",
        "rollback_deployment",
    ]

    async def check_safety(self, context: dict) -> list[TextContent] | None:
        tool_name = context["tool_name"]
        # Enforce deployment hours (9 AM - 4 PM local time, weekdays)
        if tool_name in self.PRODUCTION_TOOLS:
            now = datetime.now()
            if now.weekday() >= 5:  # Weekend
                return [TextContent(
                    type="text",
                    text=json.dumps({
                        "error": "Production deployments are not allowed "
                                 "on weekends.",
                        "policy": "Deploy Monday-Friday, 9 AM - 4 PM",
                        "suggestion": "Schedule this deployment for the "
                                      "next business day.",
                    }),
                )]
            if now.hour < 9 or now.hour >= 16:
                return [TextContent(
                    type="text",
                    text=json.dumps({
                        "error": "Production deployments are not allowed "
                                 "outside business hours.",
                        "policy": "Deploy Monday-Friday, 9 AM - 4 PM",
                        "current_time": now.strftime("%I:%M %p"),
                    }),
                )]
        return None

pipeline = MiddlewarePipeline()
pipeline.add_pre_handler(LoggingMiddleware().log_request)
pipeline.add_pre_handler(DeploymentSafetyMiddleware().check_safety)
pipeline.add_pre_handler(RateLimitMiddleware(max_calls=30, window_seconds=60).check_rate_limit)
pipeline.add_post_handler(LoggingMiddleware().log_response)
pipeline.add_post_handler(MetricsMiddleware().record_metrics)
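The MiddlewarePipeline class itself is assumed from earlier material. The wiring above relies on one contract: a pre-handler that returns a non-None result short-circuits the tool call, while post-handlers only observe. A minimal sketch of that contract:

```python
import asyncio

class MiddlewarePipeline:
    """Minimal sketch: pre-handlers may short-circuit, post-handlers observe."""

    def __init__(self):
        self.pre_handlers = []
        self.post_handlers = []

    def add_pre_handler(self, handler):
        self.pre_handlers.append(handler)

    def add_post_handler(self, handler):
        self.post_handlers.append(handler)

    async def run(self, context, handler):
        # Any pre-handler returning a non-None value blocks the tool call
        # and its result is sent back to the model instead.
        for pre in self.pre_handlers:
            blocked = await pre(context)
            if blocked is not None:
                return blocked
        result = await handler(context)
        for post in self.post_handlers:
            await post(context, result)
        return result

# Example handlers for the sketch (hypothetical, for illustration only)
async def deny_production(context):
    if context.get("tool_name") == "promote_to_production":
        return "blocked by policy"
    return None

async def fake_tool(context):
    return "deployed"
```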
Testing Approach
Marcus implements testing at three levels:
Unit tests verify each tool independently with mocked dependencies. For example, the check_forge_standards tool is tested with various code samples — some conforming, some violating standards:
@pytest.mark.asyncio
async def test_detects_bare_exception_raises():
    """Verify that raising bare Exception is flagged."""
    test_file = create_temp_file("""
        def process_payment(amount):
            if amount <= 0:
                raise Exception("Invalid amount")
            return True
    """)
    result = await check_forge_standards({"file_path": test_file})
    data = json.loads(result[0].text)
    assert not data["passed"]
    assert any(v["rule"] == "forge-errors" for v in data["violations"])

@pytest.mark.asyncio
async def test_passes_forge_error_usage():
    """Verify that ForgeError usage is accepted."""
    test_file = create_temp_file("""
        from forge_errors import ForgeValidationError

        def process_payment(amount):
            if amount <= 0:
                raise ForgeValidationError("Invalid amount")
            return True
    """)
    result = await check_forge_standards({"file_path": test_file})
    data = json.loads(result[0].text)
    assert data["passed"]
Integration tests verify the MCP protocol flow end-to-end.
Scenario tests simulate real workflows. Marcus records actual developer sessions and replays them as test scenarios, verifying that the tools provide correct and helpful responses.
Results After Six Months
The tool suite produces measurable improvements:
- Deployment incidents dropped by 60%. The check_deploy_readiness tool catches issues that previously made it to production. The deployment safety middleware prevents deployments during high-risk windows.
- Code standard violations in pull requests decreased by 70%. Engineers use the check_forge_standards tool during development, and the /forge:new-service command generates code that follows conventions from the start.
- Environment drift issues decreased by 80%. The compare_environments tool makes differences visible before they cause problems.
- Developer onboarding accelerated. New engineers use /forge:new-service to create their first service on day one, following all team conventions without needing to learn them first.
- Dependency conflicts are caught earlier. The weekly dependency check (automated via a cron-triggered tool call) identifies version divergence before it causes integration failures.
The most significant cultural impact is that team conventions are now executable, not just documented. Instead of a style guide that engineers read once and forget, the AI assistant actively applies conventions during every coding session.
Challenges and Iterations
Challenge 1: Tool description tuning. Initial tool descriptions were too technical. Engineers' AI assistants would not use tools because the descriptions did not match the natural language of developer questions. Marcus ran workshops where engineers shared their actual prompts, and he rewrote descriptions to match the language developers actually use.
Challenge 2: False positives in code standards. The check_forge_standards tool initially flagged too many issues, including cases where the "violation" was intentional. Marcus added a # forge-ignore comment mechanism and a configuration file for per-project rule customization.
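The case study does not show how the checker honors the suppression comment. One simple approach, sketched here with an invented directive syntax that allows either a bare ignore or a rule-specific one:

```python
# Hypothetical suppression check for the standards tool. Supports
# "# forge-ignore" (ignore all rules on the line) and
# "# forge-ignore: rule-a, rule-b" (ignore only the named rules).

def is_suppressed(line: str, rule: str) -> bool:
    """Return True if the given rule is suppressed on this line."""
    if "# forge-ignore" not in line:
        return False
    _, _, directive = line.partition("# forge-ignore")
    directive = directive.strip()
    if not directive:
        return True  # bare ignore suppresses every rule on the line
    return rule in {r.strip() for r in directive.lstrip(":").split(",")}
```

Inside check_forge_standards, each rule check would then skip lines for which is_suppressed(line, rule) is true before recording a violation.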
Challenge 3: Deployment tool permissions. Some engineers were uncomfortable with the AI having any role in deployments. Marcus addressed this by ensuring all deployment tools require explicit human confirmation at each step and by implementing comprehensive audit logging that records every action.
Challenge 4: Tool maintenance burden. As Forge Labs' conventions evolved, tools needed updating. Marcus solved this by storing convention rules in configuration files (not hard-coded) and by building a tool that generates tool validation rules from convention documentation.
Key Takeaways
- Custom tools encode culture. The most valuable aspect is not automation — it is making team conventions executable and consistently applied through every AI interaction.
- Safety by design. Deployment tools should have built-in safety checks, confirmation requirements, and restricted operating hours. It is easier to build safety in from the start than to add it later.
- Tools are a team sport. The most effective tools emerged from workshops where the whole team contributed their pain points and workflow knowledge. Tools built in isolation by one engineer often missed the workflows others relied on.
- Iterate based on usage data. Tool call logs reveal which tools are actually used, which are ignored, and which produce errors. This data drives prioritization of improvements.
- Slash commands are the gateway. Many engineers started with slash commands (which are simple and familiar) before exploring the more powerful MCP tools. The slash commands served as an on-ramp to the full tool suite.
Discussion Questions
- How would you handle the situation where a tool's safety check (like the deployment hours restriction) needs to be overridden for an emergency hotfix? What safeguards would you put in place?
- The check_forge_standards tool performs static analysis based on string matching. What are the limitations of this approach, and how would you improve it with AST-based analysis?
- Marcus chose to build a single MCP server with all tools. Under what circumstances would splitting into multiple servers be beneficial? What trade-offs would that introduce?
- How would you extend the tool suite to support A/B testing of deployment strategies (e.g., canary vs. blue-green vs. rolling deployments)?
- As the team grows, how would you govern who can modify the slash commands and tool definitions? What review process would you establish?