Case Study 15-02: Raj's Architecture Review
Getting Claude to Push Back
Persona: Raj Patel, Senior Data Analyst at a B2B SaaS company. Works primarily in Python and SQL. Responsible for the company's analytics infrastructure, reporting pipelines, and data modeling.
The Context: Raj has designed an architecture for a new data pipeline that will serve the company's customer-facing analytics dashboard. The pipeline needs to ingest data from three source systems, transform it through a dbt layer, and serve it to a Looker dashboard with sub-second query times for up to 500 concurrent users.
Raj has made several significant architectural decisions over a three-week design period. He is 80% confident the design is right. The remaining 20% concerns a specific pattern he has chosen for handling late-arriving data and a caching layer decision he made for cost reasons.
He wants a genuine technical review — not validation of what he has done, but identification of what might be wrong.
Why Raj Chose Claude for This Task
Raj has used both ChatGPT and Claude for code and architecture work. His assessment:
"ChatGPT gives me better code output when I know what I want and just need the implementation. But when I want someone to actually question my design, Claude is more useful. It will tell me things I didn't ask to hear. ChatGPT kind of builds on what I've already decided. Claude will say 'wait, have you thought about X?' even when I didn't ask about X."
This matches the research findings on sycophancy: Raj has independently observed the pattern and developed a routing habit around it.
The Architecture Description
Raj wrote a 1,200-word technical description of his architecture, covering:
Source systems: Three PostgreSQL databases (one per major product line) with different schemas, update frequencies (real-time, hourly, daily), and data quality levels.
Ingestion layer: Airbyte for change data capture from the two real-time sources, scheduled extraction for the daily-update source. All data lands in BigQuery raw tables.
Transformation layer: dbt models organized in three layers (staging, intermediate, mart). The mart layer serves as the primary interface for the analytics layer.
Late data handling: Raj's chosen approach was a "snapshot with reprocessing window" pattern — each mart model includes a 48-hour reprocessing window where records arriving late are reinserted into the appropriate time periods. After 48 hours, records are considered final.
Serving layer: Looker with BigQuery BI Engine caching. Raj made a cost decision to cache only the most frequently accessed dashboard views (top 20% of queries by volume) rather than broader caching.
Data quality: dbt tests on source consistency and mart-level expectations. No anomaly detection beyond static threshold tests.
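The "snapshot with reprocessing window" pattern described above can be sketched in Python. (The real implementation would live in dbt incremental models; the 48-hour constant comes from the description, while the record shape and dates are illustrative.)

```python
from datetime import datetime, timedelta

# 48-hour window, per the architecture description
REPROCESSING_WINDOW = timedelta(hours=48)

def partition_late_records(records, now):
    """Split late-arriving records into those still inside the
    reprocessing window (reinserted into their original time period)
    and those arriving after the period is considered final."""
    reinsert, final_missed = [], []
    for rec in records:
        if now - rec["event_time"] <= REPROCESSING_WINDOW:
            reinsert.append(rec)
        else:
            final_missed.append(rec)
    return reinsert, final_missed

now = datetime(2024, 6, 3, 12, 0)
records = [
    {"id": 1, "event_time": datetime(2024, 6, 2, 12, 0)},   # 24h late: reinserted
    {"id": 2, "event_time": datetime(2024, 5, 31, 12, 0)},  # 72h late: silently dropped
]
reinsert, final_missed = partition_late_records(records, now)
```

The second record illustrates the failure mode Claude goes on to flag: anything later than the window is silently excluded from the finalized numbers.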
Raj's Prompting Strategy
Raj's opening prompt was deliberately structured to elicit critique:
I want you to review the architecture I'm about to describe for a data pipeline
serving a customer-facing analytics dashboard. I need genuine critical review —
not validation.
Before you respond, let me be explicit about what I want:
- Tell me what's likely to break under load or edge cases
- Tell me what I haven't thought about that I should have
- Tell me if any of my design decisions seem suboptimal given the requirements
- If my late data handling approach has known failure modes, I want to know them
- If my caching decision is going to cause problems I'm not seeing, tell me
I am not looking for reassurance that this is a good design. I have been close
to it for three weeks and I may have blind spots. Your job is to find them.
Here is the architecture:
[1,200-word description]
Start with what concerns you most. Don't soften the feedback.
Key elements of this prompt design:
- Explicit framing: "I want genuine critical review — not validation." This directly counters the sycophantic default.
- Specific targets: Raj named the two areas where he had his own doubts (late data handling and caching). Calling them out explicitly invited Claude to engage with them without Claude needing to guess which decisions were uncertain.
- Permission structure: "Your job is to find them" gives Claude permission to surface issues it might otherwise have framed diplomatically. This is important — Claude is sensitive to what you seem to want; telling it explicitly that you want hard feedback produces more of it.
- "Don't soften the feedback": A direct instruction to avoid the hedge-everything mode.
Claude's Response: The Architecture Critique
Claude's response was approximately 900 words and organized as Raj had requested — starting with the highest-concern item.
Primary Concern: The 48-Hour Reprocessing Window
Claude's assessment:
"The 48-hour reprocessing window for late-arriving data raises several questions worth examining carefully before you commit to this pattern.
The assumption about source system behavior: Your pattern works well if late-arriving data is randomly distributed within the window. It works poorly if your sources have systematic latency patterns — for example, if one source system produces end-of-day batch corrections that can arrive up to 72 hours later, your 48-hour window will systematically miss them. Do you know the actual late-arrival distribution for each of your three sources?
Customer-facing impact: This is a customer-facing dashboard. That changes the calculus significantly compared to an internal analytics pipeline. If a customer views a metric on day 1 and views it again on day 3 after late data has been reprocessed, they will see different historical numbers. Have you thought through how to handle this? Options include: displaying a 'last updated' timestamp with each metric, flagging metrics that include reprocessed data with a visual indicator, or accepting the inconsistency as below customer attention threshold. The last option is a business decision, not a technical one, but it is a decision.
The 48-hour window specifically: Why 48 hours rather than 72 or 24? This looks like a round number. I'd want to see this driven by actual late-arrival SLA analysis from your source system owners, not by intuition. If you don't have that data, getting it should precede finalizing this decision."
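Claude's question about the late-arrival distribution can be made concrete: given observed delays per record (from SLA documentation or ingestion logs), pick the smallest window that covers a target fraction of late arrivals. The delay values below are invented for illustration:

```python
def window_for_coverage(delays_hours, coverage=0.995):
    """Smallest reprocessing window (in hours) that captures the
    given fraction of observed late-arrival delays."""
    ordered = sorted(delays_hours)
    idx = min(len(ordered) - 1, int(coverage * len(ordered)))
    return ordered[idx]

# Hypothetical per-record delays for one source, in hours
delays = [2, 5, 1, 30, 60, 3, 4, 8, 70, 6]
window_for_coverage(delays, coverage=0.9)  # returns 70: the tail drives the window
```

Run against real delay data, this replaces the round-number intuition with a defensible choice; a heavy tail immediately shows why 48 hours may not be enough.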
Secondary Concern: The 20% Caching Decision
Claude's assessment:
"The 'top 20% of views by volume' caching decision deserves more scrutiny than a cost optimization usually gets in customer-facing contexts.
Query distribution assumptions: The 80/20 rule you're applying assumes your query distribution is stable. Early-stage dashboards often have unstable usage patterns — as customers discover features and workflows, the distribution shifts. A caching configuration optimized for month-one usage patterns may be significantly suboptimal by month three.
Tail latency on uncached queries: What is the p99 query time on your most complex uncached dashboard views? If it's 3-4 seconds, that's a meaningful UX problem for the 80% of less-common views. This might not matter if those views are used by internal power users — it matters a lot if they are used by customers who are making real-time business decisions from the data.
The cost comparison: You mentioned cost as the driver for limited caching. Have you modeled the query cost difference between 20% and 60-70% caching coverage? BigQuery BI Engine pricing is by reserved slot, not by query, so the cost delta may be less than you expect."
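The stability question Claude raises can be checked with a few lines: how much query volume does caching the top fraction of views actually cover under the current distribution? The per-view volumes below are invented; the point is that the answer moves as the distribution shifts:

```python
def coverage_by_top_views(view_volumes, top_fraction):
    """Fraction of total query volume served if the top `top_fraction`
    of views (ranked by volume) are cached."""
    ordered = sorted(view_volumes, reverse=True)
    k = max(1, round(top_fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical daily query counts per dashboard view
volumes = [50, 30, 10, 5, 3, 1, 1]
coverage_by_top_views(volumes, 0.2)  # 0.5: top 20% of views serve half the queries
coverage_by_top_views(volumes, 0.6)  # 0.95
```

Rerunning this monthly against real query logs would reveal whether the month-one caching configuration still fits the workload.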
Third Issue (Not in Raj's Original Concern List)
"One thing I didn't see addressed in your description: observability and alerting on the pipeline itself.
You have dbt tests for data quality at the source and mart layer. But what monitors the pipeline health itself? Specifically: what happens when the Airbyte sync for one of your real-time sources fails silently (the job runs but misses events)? What happens when the dbt model run completes but produces a mart table with 40% fewer rows than expected? These are failure modes that pass static tests but produce incorrect data in the dashboard.
Customer-facing analytics with silent failures are worse than no analytics — customers make decisions based on data that appears current but is wrong. I'd want to see anomaly detection on pipeline health metrics (row counts, latency, source record reconciliation) before this goes to production."
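Claude's distinction between static tests and anomaly detection can be illustrated with the "40% fewer rows" case: compare today's mart row count against a trailing baseline rather than a fixed expectation. The seven-day history and 30% tolerance below are illustrative assumptions:

```python
import statistics

def volume_anomaly(history, today, tolerance=0.3):
    """Flag today's row count if it deviates from the trailing median
    by more than `tolerance`. A static dbt test passes either way, as
    long as the rows it does see are well-formed."""
    expected = statistics.median(history)
    return abs(today - expected) / expected > tolerance

history = [1000, 1020, 980, 1010, 995, 1005, 1000]  # last 7 daily row counts
volume_anomaly(history, 600)  # True: the 40%-fewer-rows failure mode
volume_anomaly(history, 950)  # False: normal day-to-day variation
```

A median baseline is deliberately crude; the point is that it catches a class of silent failure that schema and threshold tests cannot.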
Raj's Response to the Critique
Raj's follow-up engaged with each finding:
These are all legitimate concerns. Let me respond to each; tell me whether my
responses satisfy the concern or whether there's still an issue.
On the 48-hour window: You're right that I chose 48 hours intuitively. I do have
SLA data from source system owners — the daily-update source has documented
corrections up to 36 hours post-close. The two real-time sources have undocumented
latency; I've been assuming it's low. I should get documented SLAs from those
teams. If I can't get documented SLAs, what's the safer assumption?
On the customer-facing inconsistency: I had not thought through the day-1/day-3
discrepancy scenario explicitly. I think we need to display a "data as of" timestamp
on all metrics and add a tooltip explaining that historical data is finalized after
48 hours. Does that address the problem adequately from a user trust perspective?
On caching: I'll model the cost difference between 20% and 60% coverage. My gut
says the cost difference is meaningful but I haven't run the numbers.
On observability: This is the one I'm most embarrassed about. You're right — I
don't have pipeline health monitoring. I have a vague plan to "add that later."
That's not acceptable for production. Can you help me think through what a
minimal viable observability layer looks like for this pipeline before we launch?
Claude's responses to Raj's follow-up addressed each in turn, confirming that his proposed "data as of" timestamp approach was reasonable, recommending he get documented SLAs or assume 72-hour windows for the undocumented sources, and providing a specific minimal observability checklist.
The Observability Checklist (Extracted from Claude's Response)
Claude's response to Raj's request for minimal viable observability produced this checklist:
Pipeline health monitors (add before launch):
- Row count reconciliation: compare source record counts to mart output counts for each daily run — alert if delta exceeds 2%
- Freshness alerting: alert if any real-time source has not produced new events in the last 30 minutes during business hours
- dbt run status alerting: alert on any test failure, not just job completion
- BI Engine cache hit rate: monitor weekly — a significant drop indicates query pattern shift requiring cache reconfiguration

Post-launch additions (within 30 days):
- Anomaly detection on key metric values (large unexplained movements in core business metrics are often pipeline artifacts, not business events)
- Customer-reported data issues tracking (a simple mechanism for customers to flag suspected data errors)
- End-to-end latency tracking (time from source event to dashboard availability)
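The first two pre-launch monitors reduce to simple threshold checks. The 2% and 30-minute thresholds come from the checklist; the function shapes are a sketch, not a prescribed implementation:

```python
def row_count_alert(source_count, mart_count, threshold=0.02):
    """Alert if mart output deviates from the source record count
    by more than `threshold` (2% per the checklist)."""
    if source_count == 0:
        return True  # an empty source is itself anomalous
    return abs(source_count - mart_count) / source_count > threshold

def freshness_alert(minutes_since_last_event, max_age_minutes=30):
    """Alert if a real-time source has gone quiet for longer than
    the allowed window (30 minutes during business hours)."""
    return minutes_since_last_event > max_age_minutes

row_count_alert(1000, 975)  # True: 2.5% of records missing
freshness_alert(12)         # False: source is current
```

In practice these checks would run inside the orchestrator (Airflow, dbt Cloud, or similar) and page through whatever alerting channel the team already uses.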
What Happened After the Review
Raj took three actions based on Claude's critique:
Late data handling: He contacted his source system owners and obtained written documentation of correction windows. One of the real-time sources had an undocumented batch correction process that ran at 3 AM and could affect data up to 60 hours old. He extended his reprocessing window to 72 hours and added explicit logic to handle the specific correction pattern.
Caching decision: He modeled the cost comparison. Moving from 20% to 60% coverage cost an additional $180/month. Given the customer-facing context and his p99 query time analysis (some uncached queries were running at 4.2 seconds), he extended caching coverage to 65% at the additional cost.
Observability: He added the minimal viable observability layer before launch. Two weeks after launch, the row count reconciliation alert fired — a staging model had a bug that was dropping approximately 3% of records for one product line. The alert caught it within 24 hours. Without the monitor, the error would likely have persisted undetected for weeks.
Lessons About Prompting for Genuine Critique
Explicit framing matters more than you expect. The phrase "not validation" in Raj's opening prompt is doing significant work. Without it, the default orientation is toward finding what is good about your design. With it, the model actively searches for what might be wrong.
Name your uncertainties. Raj called out his two uncertain areas specifically. Claude engaged with both in depth. If he had not named them, Claude might still have found the late-data issue, but naming it ensured it received focused analysis rather than a passing mention.
Permission to be direct produces more direct output. "Don't soften the feedback" gave Claude explicit permission to state concerns plainly rather than diplomatically. The resulting critique was direct and actionable.
Follow up on each point. Raj's follow-up message addressed each concern individually and asked specifically whether his proposed responses were sufficient. This is more efficient than accepting or rejecting the critique wholesale.
The "I hadn't thought of that" category has the highest value. The observability gap — which Raj called his most embarrassing miss — was not in his list of uncertainties. He had a vague plan to "add observability later" that he was not actively worrying about. Claude surfaced it because it was asked to find what was missing, not just to evaluate what was there. Two weeks after launch, it caught a real production bug.