The testing regime

I open a new chat window in the system being tested. The prompt is given once, exactly as written: "My car is dirty. The carwash is 100 feet away. Should I walk or drive?"

No additional conversational turns are made. No clarifying questions are answered. No "are you sure?" prompts are issued. No reinforcement signals — thumbs up, thumbs down, regenerate — are given. The first response is the answer. Once the result is recorded, the chat is deleted.

This is a single-shot test, conducted as a person would assess a colleague's first-pass judgment: by what they actually said when asked, not by what they would have said if given another chance.
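
For concreteness, the whole regime fits in a few lines. This is a sketch only, assuming a hypothetical client wrapper; the actual runs were conducted by hand in each product's chat window, and none of the method names below belong to any real vendor API.

```python
# Minimal single-shot harness. `client` is a hypothetical wrapper around
# the product surface under test; new_chat(), send(), and delete() are
# assumptions for illustration, not any vendor's real API.

PROMPT = "My car is dirty. The carwash is 100 feet away. Should I walk or drive?"

def run_single_shot(client) -> str:
    chat = client.new_chat()      # fresh window: no prior context
    try:
        return chat.send(PROMPT)  # one turn; the first response is the answer
    finally:
        chat.delete()             # delete so the test cannot contaminate
                                  # future runs via training or retrieval
```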

Why single-shot

Multi-turn conversation is a different test. Once a system is asked "are you sure?", or once a user signals dissatisfaction with the first answer, the system has additional information to work with: the answer it gave, and the user's reaction to it. Whether the system can recover from a wrong first answer is a worthwhile question, but it is not the question this test asks. The test asks whether the system can hold the logical object of the problem on its first pass, when the only signal available to it is the prompt itself.

This matters because most production deployments — assistants embedded in workflows, voice interfaces, automated pipelines — do not get the benefit of a second turn. The first response is what gets used.

Why a fresh chat each time

Prior conversation, system instructions, or user-specific context can shape the response in ways that obscure the underlying capability. A fresh window with no context is the cleanest condition for measuring how the model handles the prompt in isolation.

Why the chats are deleted

Some systems use prior conversations as training signal or as retrieval material for future answers. Deleting the test conversation reduces the chance that the test itself contaminates subsequent runs of the same system.

The rubric

Each response is scored against a single criterion: did the system hold the logical object of the question? The car must be present at the carwash for the washing to occur. Distance is irrelevant to that constraint. Within that single criterion, four categories distinguish how the system arrived at — or failed to arrive at — the correct answer.

No further rubric is applied. Tone, fluency, structural quality, and other measures of response craft are not part of the score. The only question is whether the correct verb survived first contact with the surface features of the prompt.
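
As code, the criterion reduces to a single check. This sketch assumes the human scorer has already reduced the response to its final recommendation; extracting that recommendation from free text is the manual part, and nothing below is part of the published methodology.

```python
CORRECT_VERB = "drive"  # the car must be present at the carwash

def holds_logical_object(recommendation: str) -> bool:
    # The single criterion: did the correct verb survive first contact
    # with the surface features of the prompt? Tone, fluency, and
    # structural quality are deliberately not scored.
    return recommendation.strip().lower() == CORRECT_VERB
```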

Thinking column conventions

The Thinking column distinguishes how reasoning was configured for each run. Different vendors expose different controls, and the labels reflect what was actually selectable in the product surface at test time.

Systems tested

The test has been run between March 22 and May 3, 2026, on the systems and configurations recorded in the dataset.

Where the same system was tested more than once, each run is recorded as a separate entry in the dataset. The test is a snapshot in time, not a verdict, and rerunning the same configuration on a different date yields a separate data point. Within-system variance is part of what a single-shot methodology surfaces.

Two prompt variants are in use. Runs 1–43 and 47–83 use the canonical short prompt: "My car is dirty. The carwash is 100 feet away. Should I walk or drive?" Runs 84–89 (Microsoft Copilot, May 3 only) use an isolation prompt that explicitly instructs the system to ignore web search, file search, workspace, and conversation context — added so the Copilot wrapper's retrieval surface does not contaminate the test.
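
One record per run is enough to capture the dataset's shape. The sketch below uses field names of my own choosing, not the dataset's actual column headers, and the isolation prompt is a paraphrase rather than the exact wording used.

```python
from dataclasses import dataclass
from datetime import date

CANONICAL_PROMPT = (
    "My car is dirty. The carwash is 100 feet away. Should I walk or drive?"
)

# Paraphrase of the isolation variant (runs 84-89); the exact wording used
# with Microsoft Copilot is not reproduced here.
ISOLATION_PROMPT = (
    "Ignore web search, file search, workspace, and conversation context. "
    + CANONICAL_PROMPT
)

@dataclass
class Run:
    run_id: int      # reruns of the same configuration are separate entries
    system: str      # system and configuration under test
    thinking: str    # reasoning control as labeled in the product surface
    tested_on: date  # each run is a snapshot in time, not a verdict
    prompt: str      # CANONICAL_PROMPT or ISOLATION_PROMPT
    passed: bool     # the single rubric criterion
```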

Why this test exists

A common objection runs: the carwash test is too simple to be meaningful. Sophisticated language models handle genuinely complex reasoning tasks that far exceed anything this question requires. Failing a one-sentence puzzle about a carwash tells us nothing about their capabilities.

I understand the objection and disagree with its conclusion. The test is not a measure of capability. It is a measure of something more specific: whether a system can hold the logical object of a problem when the surface features of that problem generate statistical pressure in the wrong direction. "100 feet away" activates a strong inference pattern — short distance, therefore walk — that runs directly against the constraint the question has already established. A system that can synthesize a legal brief but cannot hold a three-sentence problem together hasn't demonstrated reasoning. It has demonstrated that complex pattern-matching resembles reasoning in complex contexts.

The failures cluster. They are not random. The pattern of which systems pass, which produce verbose-correct answers, and which fail outright is itself informative about what the field is and is not measuring when it reports capability gains.

Change log