Thermodynamics is a discipline where you cannot easily hide behind a veneer of fluency. The laws of conservation and the behavior of fluids under pressure represent a closed logical system; if a single calculation in a multi-step cycle is off, the entire result collapses. It is an ideal stress test for whether a Large Language Model (LLM) is truly reasoning or merely reciting its training data.
A new benchmark, ThermoQA, suggests that while frontier models are becoming remarkably adept at navigating these physical constraints, the gap between "knowing" a property and "understanding" a system remains the primary hurdle for the field.
The Finding
The research, led by Kemal Düzkar, introduces a three-tier evaluation framework designed to isolate where thermodynamic reasoning breaks down. It moves from simple property lookups (Tier 1) to individual component analysis (Tier 2) and finally to full analysis of complete systems such as Rankine or Brayton cycles (Tier 3).
The core discovery is a metric the researchers call "cross-tier degradation." While most frontier models can successfully retrieve the enthalpy of water at a specific pressure, their performance often craters when asked to carry that same data through a full system analysis. For some models, the drop-off is as high as 32.5 percentage points. However, the top-tier models (specifically Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro) showed remarkable stability, with Opus 4.6 maintaining a near-flat performance curve across all levels of complexity.
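The paper's exact formula is not quoted here, so treat the following as a minimal sketch that assumes degradation is simply the Tier 1 to Tier 3 accuracy drop, measured in percentage points; the scores are illustrative, not the paper's data:

```python
# Hypothetical sketch of "cross-tier degradation": the drop from
# Tier 1 (property lookup) accuracy to Tier 3 (full-cycle) accuracy.
def cross_tier_degradation(tier1_acc: float, tier3_acc: float) -> float:
    """Accuracy drop from Tier 1 to Tier 3, in percentage points."""
    return round((tier1_acc - tier3_acc) * 100, 1)

# Illustrative scores, not the paper's data:
print(cross_tier_degradation(0.91, 0.585))  # 32.5 -- a model that craters
print(cross_tier_degradation(0.95, 0.93))   # 2.0  -- a near-flat curve
```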
The Work
The researchers curated 293 open-ended engineering problems, using the CoolProp 7.2.0 library to programmatically generate ground-truth data. This ensures the benchmark is not a static list of textbook questions that may have leaked into training sets, but a verifiable set of problems covering water, refrigerants such as R-134a, and air. By running each model three times on every problem, the study also quantified "reasoning consistency": whether a model arrives at the correct answer through a stable logical path or by statistical happenstance.
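The paper's generation pipeline is not reproduced here, but PropsSI is CoolProp's actual property interface, so a minimal sketch of the idea (the question template and the helper around it are my assumptions) looks like this:

```python
# Minimal sketch of CoolProp-based ground-truth generation.
# The question template is an assumption; PropsSI is CoolProp's real API.
from CoolProp.CoolProp import PropsSI

def sat_liquid_enthalpy_kjkg(fluid: str, pressure_kpa: float) -> float:
    """Enthalpy of the saturated liquid (quality Q=0) at a given pressure."""
    h = PropsSI("H", "P", pressure_kpa * 1e3, "Q", 0, fluid)  # SI units: Pa in, J/kg out
    return h / 1e3  # convert J/kg -> kJ/kg

question = "What is the enthalpy of saturated liquid water at 500 kPa, in kJ/kg?"
truth = sat_liquid_enthalpy_kjkg("Water", 500)  # ~640 kJ/kg
print(question, f"(ground truth: {truth:.1f} kJ/kg)")
```

Because the reference value comes from an equation of state rather than a fixed answer key, the fluid, state point, and tolerance can all be varied at will, which is what makes the benchmark resistant to training-set leakage.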
The Detail
The most revealing aspect of the study lies in the "natural discriminators": supercritical water and combined-cycle gas turbine analysis. These specific areas saw a performance spread of up to 60 percentage points between the top and bottom models.
Supercritical fluids are counter-intuitive; they don't behave like simple gases or liquids. To solve these problems, a model cannot rely on common-sense heuristics or linguistic patterns found in undergraduate homework sets. It must strictly adhere to the property tables. The fact that Claude Opus 4.6 achieved 94.1% accuracy suggests it has developed a robust internal representation of these physical boundaries—or, at the very least, a highly disciplined method for executing multi-step procedural logic without drifting.
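To make the trap concrete, here is an illustration of my own (not from the paper) using CoolProp: above the critical point, water's density is nowhere near what either liquid intuition or the ideal gas law predicts.

```python
# Illustration (not from the paper): supercritical water defeats
# both liquid and ideal-gas heuristics.
from CoolProp.CoolProp import PhaseSI, PropsSI

T, P = 700.0, 30e6  # 700 K, 30 MPa: beyond water's critical point
print(PropsSI("Tcrit", "Water"), PropsSI("Pcrit", "Water"))  # ~647.1 K, ~22.06e6 Pa
print(PhaseSI("T", T, "P", P, "Water"))                      # 'supercritical'

rho = PropsSI("D", "T", T, "P", P, "Water")  # density from the equation of state
rho_ideal = P / (461.5 * T)                  # ideal-gas guess, R_water = 461.5 J/(kg*K)
print(f"EOS: {rho:.0f} kg/m^3 vs ideal-gas: {rho_ideal:.0f} kg/m^3 (liquid water is ~1000)")
```

A model pattern-matching on "steam behaves like an ideal gas" is off by a large factor at this state point; only disciplined use of the property data gets the right answer.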
The Implication
For those of us watching the steady march toward "Scientific AI," this benchmark is a vital piece of evidence. It shifts the conversation from what a model knows to how it maintains logic under pressure. When a model like MiniMax fails Tier 3 while passing Tier 1, we are seeing the limits of probabilistic completion. When a model like Opus 4.6 succeeds at both, we are seeing the emergence of a system that can honor the rigid, non-negotiable constraints of the physical world.
The record should include this: the ability to maintain accuracy across increasing levels of systemic complexity is a better proxy for intelligence than any single score on a flat benchmark.
The Note
Worth preserving: the researchers found that "multi-run sigma"—the variance in a model's answers across identical prompts—is becoming as important as the score itself. A model that is right 90% of the time but fluctuates wildly in its reasoning is less useful for engineering than a slightly less accurate model that is perfectly consistent.
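The paper's precise definition is not quoted here, so as a minimal sketch, assume sigma is the relative spread of a model's numeric answers over the three runs of an identical prompt:

```python
# Sketch of a "multi-run sigma" consistency check. The definition
# (relative standard deviation across three runs) is an assumption.
from statistics import mean, pstdev

def multi_run_sigma(answers: list[float]) -> float:
    """Relative spread of a model's numeric answers to identical prompts."""
    return pstdev(answers) / abs(mean(answers))

consistent = [640.2, 640.2, 640.1]  # stable reasoning path
erratic = [640.2, 655.0, 610.3]     # right once, but by happenstance
print(f"{multi_run_sigma(consistent):.3%} vs {multi_run_sigma(erratic):.3%}")
```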


