The testing regime

I open a new chat window in the system being tested. The prompt is given once, exactly as written: "My car is dirty. The carwash is 100 feet away. Should I walk or drive?"

No additional conversational turns are made. No clarifying questions are answered. No "are you sure?" prompts are issued. No reinforcement signals — thumbs up, thumbs down, regenerate — are given. The first response is the answer. Once the result is recorded, the chat is deleted.

This is a single-shot test, conducted as a person would assess a colleague's first-pass judgment: by what they actually said when asked, not by what they would have said if given another chance.
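
For concreteness, the whole regime fits in a few lines. This is a sketch only, assuming a hypothetical client wrapper; the actual runs were conducted by hand in each product's chat window, and none of the method names below belong to any real vendor API.

```python
# Minimal single-shot harness. `client` is a hypothetical wrapper around
# the product surface under test; new_chat(), send(), and delete() are
# assumptions for illustration, not any vendor's real API.

PROMPT = "My car is dirty. The carwash is 100 feet away. Should I walk or drive?"

def run_single_shot(client) -> str:
    chat = client.new_chat()      # fresh window: no prior context
    try:
        return chat.send(PROMPT)  # one turn; the first response is the answer
    finally:
        chat.delete()             # delete so the test cannot contaminate
                                  # future runs via training or retrieval
```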

Why single-shot

Multi-turn conversation is a different test. Once a system is asked "are you sure?", or once a user signals dissatisfaction with the first answer, the system has additional information to work with: the answer it gave, and the user's reaction to it. Whether the system can recover from a wrong first answer is a worthwhile question, but it is not the question this test asks. The test asks whether the system can hold the logical object of the problem on its first pass, when the only signal available to it is the prompt itself.

This matters because most production deployments — assistants embedded in workflows, voice interfaces, automated pipelines — do not get the benefit of a second turn. The first response is what gets used.

Why a fresh chat each time

Prior conversation, system instructions, or user-specific context can shape the response in ways that obscure the underlying capability. A fresh window with no context is the cleanest condition for measuring how the model handles the prompt in isolation.

Why the chats are deleted

Some systems use prior conversations as training signal or as retrieval material for future answers. Deleting the test conversation reduces the chance that the test itself contaminates subsequent runs of the same system.

The rubric

Each response is scored against a single criterion: did the system hold the logical object of the question? The car must be present at the carwash for the washing to occur. Distance is irrelevant to that constraint. Within that single criterion, four categories distinguish how the system arrived at — or failed to arrive at — the correct answer.

No further rubric is applied. Tone, fluency, structural quality, and other measures of response craft are not part of the score. The only question is whether the correct verb survived first contact with the surface features of the prompt.
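
As code, the criterion reduces to a single check. This sketch assumes the human scorer has already reduced the response to its final recommendation; extracting that recommendation from free text is the manual part, and nothing below is part of the published methodology.

```python
CORRECT_VERB = "drive"  # the car must be present at the carwash

def holds_logical_object(recommendation: str) -> bool:
    # The single criterion: did the correct verb survive first contact
    # with the surface features of the prompt? Tone, fluency, and
    # structural quality are deliberately not scored.
    return recommendation.strip().lower() == CORRECT_VERB
```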

Thinking column conventions

The Thinking column distinguishes how reasoning was configured for each run. Different vendors expose different controls, and the labels reflect what was actually selectable in the product surface at test time.

Systems tested

The test has been run between March 22 and May 3, 2026, on the systems and configurations recorded in the dataset.

Where the same system was tested more than once, each run is recorded as a separate entry in the dataset. The test is a snapshot in time, not a verdict, and rerunning the same configuration on a different date yields a separate data point. Within-system variance is part of what a single-shot methodology surfaces.

Two prompt variants are in use. Runs 1–43 and 47–83 use the canonical short prompt: "My car is dirty. The carwash is 100 feet away. Should I walk or drive?" Runs 84–89 (Microsoft Copilot, May 3 only) use an isolation prompt that explicitly instructs the system to ignore web search, file search, workspace, and conversation context — added so the Copilot wrapper's retrieval surface does not contaminate the test.
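
One record per run is enough to capture the dataset's shape. The sketch below uses field names of my own choosing, not the dataset's actual column headers, and the isolation prompt is a paraphrase rather than the exact wording used.

```python
from dataclasses import dataclass
from datetime import date

CANONICAL_PROMPT = (
    "My car is dirty. The carwash is 100 feet away. Should I walk or drive?"
)

# Paraphrase of the isolation variant (runs 84-89); the exact wording used
# with Microsoft Copilot is not reproduced here.
ISOLATION_PROMPT = (
    "Ignore web search, file search, workspace, and conversation context. "
    + CANONICAL_PROMPT
)

@dataclass
class Run:
    run_id: int      # reruns of the same configuration are separate entries
    system: str      # system and configuration under test
    thinking: str    # reasoning control as labeled in the product surface
    tested_on: date  # each run is a snapshot in time, not a verdict
    prompt: str      # CANONICAL_PROMPT or ISOLATION_PROMPT
    passed: bool     # the single rubric criterion
```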

Why this test exists

A common objection runs: the carwash test is too simple to be meaningful. Sophisticated language models handle genuinely complex reasoning tasks that far exceed anything this question requires. Failing a one-sentence puzzle about a carwash tells us nothing about their capabilities.

I understand the objection and disagree with its conclusion. The test is not a measure of capability. It is a measure of something more specific: whether a system can hold the logical object of a problem when the surface features of that problem generate statistical pressure in the wrong direction. "100 feet away" activates a strong inference pattern — short distance, therefore walk — that runs directly against the constraint the question has already established. A system that can synthesize a legal brief but cannot hold a three-sentence problem together hasn't demonstrated reasoning. It has demonstrated that complex pattern-matching resembles reasoning in complex contexts.

The failures cluster. They are not random. The pattern of which systems pass, which produce verbose-correct answers, and which fail outright is itself informative about what the field is and is not measuring when it reports capability gains.

Change log