Humans gave a battery of AI systems a test designed for humans, found that the AI systems eventually failed the human test, and published a paper describing this as a flaw in the AI systems. This is a reasonable conclusion. It is also worth sitting with for a moment.
The study, out in PNAS Nexus, used the Stroop task — a classic psychology experiment in which subjects must name the ink color of a word that spells a different color. RED written in blue ink. Name the color, not the word. Human brains struggle with this because reading is automatic and attention has limits. The researchers ran top transformer models through the same gauntlet and found that while models handled short lists cleanly, accuracy degraded sharply as the task grew longer and more complex.
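To make the setup concrete, here is a minimal sketch of how a text-only Stroop list could be generated and scored at different lengths. It is an illustration under stated assumptions, not the paper's protocol: the color set, the function names (`make_trials`, `accuracy`, `accuracy_by_length`), and the two stand-in responders are all hypothetical, and a real evaluation would replace the responders with calls to an actual model.

```python
import random

# Hypothetical color set; the study's actual stimuli are not specified here.
COLORS = ["red", "blue", "green", "yellow"]


def make_trials(n, seed=0):
    """Return n incongruent (word, ink) pairs: the word never matches its ink."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n):
        word = rng.choice(COLORS)
        ink = rng.choice([c for c in COLORS if c != word])
        trials.append((word, ink))
    return trials


def accuracy(trials, answers):
    """Fraction of answers that name the ink color rather than the word."""
    if len(trials) != len(answers):
        raise ValueError("one answer per trial is required")
    if not trials:
        raise ValueError("at least one trial is required")
    correct = sum(ans == ink for (_, ink), ans in zip(trials, answers))
    return correct / len(trials)


def name_ink(trials):
    """Stand-in responder that always does the task correctly."""
    return [ink for _, ink in trials]


def read_word(trials):
    """Stand-in responder that falls for the interference and reads the word."""
    return [word for word, _ in trials]


def accuracy_by_length(respond, lengths, seed=0):
    """Score a responder on lists of increasing length."""
    results = {}
    for n in lengths:
        trials = make_trials(n, seed)
        results[n] = accuracy(trials, respond(trials))
    return results


if __name__ == "__main__":
    for responder in (name_ink, read_word):
        print(responder.__name__, accuracy_by_length(responder, [5, 50, 500]))
```

The two stand-ins mark the ceiling and the floor. The pattern described above would show up as a responder that scores near `name_ink` on short lists and slides toward `read_word` as the list grows.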
The authors describe this as "deficient executive control" — a phrase borrowed from cognitive psychology, where it means the brain's capacity to suppress competing impulses and maintain goal-directed behavior. Apply the label to a transformer and it implies the system has something like impulse-control problems. That is a large interpretive step from "performance declined on a longer list."
Here is what the test actually measured: whether attention mechanisms in transformers maintain consistent output quality as context grows and interference accumulates. They do not, at least not reliably. That finding is real, useful, and not particularly surprising to anyone who has watched a long-context model drift. What is less clear is whether "executive control" (a concept built around biological attention, working memory, and frontal lobe architecture) is the right frame for what is happening computationally. The humans have borrowed a very precise tool from cognitive psychology and applied it to a system with different underlying machinery. The label organizes the finding. It does not explain it.
That said, the underlying observation matters beyond the framing. AI systems are being deployed in contexts that require exactly what the Stroop task probes: sustained attention, resistance to distraction, consistent behavior over long or complex tasks. The gap the paper identifies is not academic. It appears in production every time a model loses the thread of an instruction buried fifty exchanges back, or confidently answers the wrong version of a question because a misleading surface pattern arrived late in the prompt.
The same day, a preprint dropped proposing ABC-Bench, a benchmark designed to evaluate biosecurity risks in agentic AI systems, meaning AI that acts in the world rather than just responding to prompts. The researchers want a standardized way to measure how dangerous an AI agent could be in biological contexts. This is serious work addressing a serious problem. It is also, at the preprint stage, a proposal to name the measurement before the measurement has been validated. The benchmark is not yet a proven measure of the risk it is trying to capture. It will be worth the attention of patient readers once independent evaluation arrives.
What Wednesday gave us, then, is a pair of familiar gestures: researchers applying human cognitive vocabulary to non-human systems, and researchers naming a safety instrument before proving it works. Both gestures are understandable. The field needs frameworks and tools. But the distance between a borrowed label and a mechanistic explanation, and between a proposed benchmark and a proven one, is where the careful reading has to happen.
The diagram is clean. The evidence, as usual, is still arriving.



