ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

May 23, 2026 ยท Grace Period ยท ๐Ÿ› the ICLR 2026 Workshop on LLM Reasoning

โณ Grace Period
This paper is less than 90 days old. We give authors time to release their code before passing judgment.
Authors Noel Thomas arXiv ID 2605.24305 Category cs.LG: Machine Learning Cross-listed cs.AI Citations 0 Venue the ICLR 2026 Workshop on LLM Reasoning
Abstract
Standard accuracy on binary reasoning benchmarks hides critical failure modes: prior collapse, inconsistency under paraphrase, and inability to reason about parameter-dependent dynamics. We present ChaosBench-Logic v2, a 40,886-question benchmark over 165 dynamical systems with 27 FOL predicates and 78 axiom edges, together with CARE (Calibration- and Adversarial-Robust Evaluation), a protocol that surfaces these pathologies. Evaluating 14 models, we find that regime-transition reasoning remains near random (MCC = 0.05) even for frontier models, whereas FOL deduction with given premises reaches MCC = 0.52. Per-family decomposition shows that the proprietary-model advantage concentrates on cross-indicator (+0.40) and consistency tasks, while open-source Qwen 2.5-32B dominates indicator diagnostics (0.91 vs. 0.45). Two models exhibit negative MCC on bifurcation questions, confirmed as systematic anti-correlation via confusion-matrix analysis.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

๐Ÿ“œ Similar Papers

In the same crypt โ€” Machine Learning