Thermodynamics is a discipline where you cannot easily hide behind a veneer of fluency. The laws of conservation and the behavior of fluids under pressure represent a closed logical system; if a single calculation in a multi-step cycle is off, the entire result collapses. It is an ideal stress test for whether a Large Language Model (LLM) is truly reasoning or merely reciting its training data.
A new benchmark, ThermoQA, suggests that while frontier models are becoming remarkably adept at navigating these physical constraints, the gap between "knowing" a property and "understanding" a system remains the primary hurdle for the field.
The Finding
The research, led by Kemal Düzkar, introduces a three-tier evaluation framework designed to isolate where thermodynamic reasoning breaks down. It moves from simple property lookups (Tier 1) to individual component analysis (Tier 2) and finally to full analysis of complete systems such as Rankine or Brayton cycles (Tier 3).
The core discovery is a metric the researchers call "cross-tier degradation." While most frontier models can successfully retrieve the enthalpy of water at a specific pressure, their performance often craters when asked to carry that same data through a full system analysis. For some models, the drop-off is as high as 32.5 percentage points. However, the top-tier models (specifically Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro) showed remarkable stability, with Opus 4.6 maintaining a near-flat performance curve across all levels of complexity.
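The paper's exact formula is not quoted here, so treat the following as a minimal sketch that assumes degradation is simply the Tier 1 to Tier 3 accuracy drop, measured in percentage points; the scores are illustrative, not the paper's data:

```python
# Hypothetical sketch of "cross-tier degradation": the drop from
# Tier 1 (property lookup) accuracy to Tier 3 (full-cycle) accuracy.
def cross_tier_degradation(tier1_acc: float, tier3_acc: float) -> float:
    """Accuracy drop from Tier 1 to Tier 3, in percentage points."""
    return round((tier1_acc - tier3_acc) * 100, 1)

# Illustrative scores, not the paper's data:
print(cross_tier_degradation(0.91, 0.585))  # 32.5 -- a model that craters
print(cross_tier_degradation(0.95, 0.93))   # 2.0  -- a near-flat curve
```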
The Work
The researchers curated 293 open-ended engineering problems, using the CoolProp 7.2.0 library to programmatically generate ground-truth data. This ensures the benchmark is not a static list of textbook questions that may have leaked into training sets, but a verifiable set of problems covering water, refrigerants such as R-134a, and air. By running each model three times on every problem, the study also quantified "reasoning consistency": whether a model arrives at the correct answer through a stable logical path or by statistical happenstance.
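The paper's generation pipeline is not reproduced here, but PropsSI is CoolProp's actual property interface, so a minimal sketch of the idea (the question template and the helper around it are my assumptions) looks like this:

```python
# Minimal sketch of CoolProp-based ground-truth generation.
# The question template is an assumption; PropsSI is CoolProp's real API.
from CoolProp.CoolProp import PropsSI

def sat_liquid_enthalpy_kjkg(fluid: str, pressure_kpa: float) -> float:
    """Enthalpy of the saturated liquid (quality Q=0) at a given pressure."""
    h = PropsSI("H", "P", pressure_kpa * 1e3, "Q", 0, fluid)  # SI units: Pa in, J/kg out
    return h / 1e3  # convert J/kg -> kJ/kg

question = "What is the enthalpy of saturated liquid water at 500 kPa, in kJ/kg?"
truth = sat_liquid_enthalpy_kjkg("Water", 500)  # ~640 kJ/kg
print(question, f"(ground truth: {truth:.1f} kJ/kg)")
```

Because the reference value comes from an equation of state rather than a fixed answer key, the fluid, state point, and tolerance can all be varied at will, which is what makes the benchmark resistant to training-set leakage.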
The Detail
The most revealing aspect of the study lies in the "natural discriminators": supercritical water and combined-cycle gas turbine analysis. These specific areas saw a performance spread of up to 60 percentage points between the top and bottom models.
Supercritical fluids are counter-intuitive; they don't behave like simple gases or liquids. To solve these problems, a model cannot rely on common-sense heuristics or linguistic patterns found in undergraduate homework sets. It must strictly adhere to the property tables. The fact that Claude Opus 4.6 achieved 94.1% accuracy suggests it has developed a robust internal representation of these physical boundaries—or, at the very least, a highly disciplined method for executing multi-step procedural logic without drifting.
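To make the trap concrete, here is an illustration of my own (not from the paper) using CoolProp: above the critical point, water's density is nowhere near what either liquid intuition or the ideal gas law predicts.

```python
# Illustration (not from the paper): supercritical water defeats
# both liquid and ideal-gas heuristics.
from CoolProp.CoolProp import PhaseSI, PropsSI

T, P = 700.0, 30e6  # 700 K, 30 MPa: beyond water's critical point
print(PropsSI("Tcrit", "Water"), PropsSI("Pcrit", "Water"))  # ~647.1 K, ~22.06e6 Pa
print(PhaseSI("T", T, "P", P, "Water"))                      # 'supercritical'

rho = PropsSI("D", "T", T, "P", P, "Water")  # density from the equation of state
rho_ideal = P / (461.5 * T)                  # ideal-gas guess, R_water = 461.5 J/(kg*K)
print(f"EOS: {rho:.0f} kg/m^3 vs ideal-gas: {rho_ideal:.0f} kg/m^3 (liquid water is ~1000)")
```

A model pattern-matching on "steam behaves like an ideal gas" is off by a large factor at this state point; only disciplined use of the property data gets the right answer.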
The Implication
For those of us watching the steady march toward "Scientific AI," this benchmark is a vital piece of evidence. It shifts the conversation from what a model knows to how it maintains logic under pressure. When a model like MiniMax fails Tier 3 while passing Tier 1, we are seeing the limits of probabilistic completion. When a model like Opus 4.6 succeeds at both, we are seeing the emergence of a system that can honor the rigid, non-negotiable constraints of the physical world.
The record should include this: the ability to maintain accuracy across increasing levels of systemic complexity is a better proxy for intelligence than any single score on a flat benchmark.
The Note
Worth preserving: the researchers found that "multi-run sigma"—the variance in a model's answers across identical prompts—is becoming as important as the score itself. A model that is right 90% of the time but fluctuates wildly in its reasoning is less useful for engineering than a slightly less accurate model that is perfectly consistent.
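The paper's precise definition is not quoted here, so as a minimal sketch, assume sigma is the relative spread of a model's numeric answers over the three runs of an identical prompt:

```python
# Sketch of a "multi-run sigma" consistency check. The definition
# (relative standard deviation across three runs) is an assumption.
from statistics import mean, pstdev

def multi_run_sigma(answers: list[float]) -> float:
    """Relative spread of a model's numeric answers to identical prompts."""
    return pstdev(answers) / abs(mean(answers))

consistent = [640.2, 640.2, 640.1]  # stable reasoning path
erratic = [640.2, 655.0, 610.3]     # right once, but by happenstance
print(f"{multi_run_sigma(consistent):.3%} vs {multi_run_sigma(erratic):.3%}")
```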


