From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI
arXiv:2604.08603v1 Announce Type: new
Abstract: Existing LLM-based agent systems share a common architectural failure: they answer from the unrestricted knowledge space without first simulating how active business scenarios reshape that space for the event at hand -- producing decisions that are fl...
RAMP: Hybrid DRL for Online Learning of Numeric Action Models
arXiv:2604.08685v1 Announce Type: new
Abstract: Automated planning algorithms require an action model specifying the preconditions and effects of each action, but obtaining such a model is often hard. Learning action models from observations is feasible, but existing algorithms for numeric domains ...
Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets
arXiv:2604.07355v1 Announce Type: new
Abstract: We introduce Prediction Arena, a benchmark for evaluating AI models' predictive accuracy and decision-making by enabling them to trade autonomously on live prediction markets with real capital. Unlike synthetic benchmarks, Prediction Arena tests model...
LLM-Generated Fault Scenarios for Evaluating Perception-Driven Lane Following in Autonomous Edge Systems
arXiv:2604.07362v1 Announce Type: new
Abstract: Deploying autonomous vision systems on edge devices faces a critical challenge: resource constraints prevent real-time and predictable execution of comprehensive safety tests. Existing validation methods depend on static datasets or manual fault injec...
BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis
arXiv:2604.07361v1 Announce Type: new
Abstract: Graph Neural Networks (GNNs) have been widely used in diverse brain network analysis tasks based on preprocessed functional magnetic resonance imaging (fMRI) data. However, their performances are constrained due to high feature sparsity and inherent l...
Benchmark Shadows: Data Alignment, Parameter Footprints, and Generalization in Large Language Models
arXiv:2604.07363v1 Announce Type: new
Abstract: Large language models often achieve strong benchmark gains without corresponding improvements in broader capability. We hypothesize that this discrepancy arises from differences in training regimes induced by data distribution. To investigate this, we...
$S^3$: Stratified Scaling Search for Test-Time in Diffusion Language Models
arXiv:2604.06260v1 Announce Type: new
Abstract: Test-time scaling investigates whether a fixed diffusion language model (DLM) can generate better outputs when given more inference compute, without additional training. However, naive best-of-$K$ sampling is fundamentally limited because it repeatedl...
FLeX: Fourier-based Low-rank EXpansion for multilingual transfer
arXiv:2604.06253v1 Announce Type: new
Abstract: Cross-lingual code generation is critical in enterprise environments where multiple programming languages coexist. However, fine-tuning large language models (LLMs) individually for each language is computationally prohibitive. This paper investigates...
Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse
arXiv:2604.06228v1 Announce Type: new
Abstract: We introduce probabilistic language tries (PLTs), a unified representation that makes explicit the prefix structure implicitly defined by any generative model over sequences. By assigning to each outgoing edge the conditional probability of the corres...
Spectral Edge Dynamics Reveal Functional Modes of Learning
arXiv:2604.06256v1 Announce Type: new
Abstract: Training dynamics during grokking concentrate along a small number of dominant update directions -- the spectral edge -- which reliably distinguishes grokking from non-grokking regimes. We show that standard mechanistic interpretability tools (head at...
A Benchmark of Classical and Deep Learning Models for Agricultural Commodity Price Forecasting on A Novel Bangladeshi Market Price Dataset
arXiv:2604.06227v1 Announce Type: new
Abstract: Accurate short-term forecasting of agricultural commodity prices is critical for food security planning and smallholder income stabilisation in developing economies, yet machine-learning-ready datasets for this purpose remain scarce in South Asia. Thi...
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
arXiv:2604.06233v1 Announce Type: new
Abstract: Safety-trained language models routinely refuse requests for help circumventing rules. But not all rules deserve compliance. When users ask for help evading rules imposed by an illegitimate authority, rules that are deeply unjust or absurd in their co...
Toward Reducing Unproductive Container Moves: Predicting Service Requirements and Dwell Times
arXiv:2604.06251v1 Announce Type: new
Abstract: This article presents the results of a data science study conducted at a container terminal, aimed at reducing unproductive container moves through the prediction of service requirements and container dwell times. We develop and evaluate machine learn...
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
arXiv:2604.06277v1 Announce Type: new
Abstract: Existing hallucination detection methods for large language models (LLMs) rely on external verification at inference time, requiring gold answers, retrieval systems, or auxiliary judge models. We ask whether this external supervision can instead be di...
Operational Noncommutativity in Sequential Metacognitive Judgments
arXiv:2604.04938v1 Announce Type: new
Abstract: Metacognition, understood as the monitoring and regulation of one's own cognitive processes, is inherently sequential: an agent evaluates an internal state, updates it, and may then re-evaluate under modified criteria. Order effects in cognition are w...
Algebraic Structure Discovery for Real World Combinatorial Optimisation Problems: A General Framework from Abstract Algebra to Quotient Space Learning
arXiv:2604.04941v1 Announce Type: new
Abstract: Many combinatorial optimisation problems hide algebraic structures that, once exposed, shrink the search space and improve the chance of finding the global optimal solution. We present a general framework that (i) identifies algebraic structure, (ii) ...
Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya
arXiv:2604.04937v1 Announce Type: new
Abstract: Large language models produce fluent text but struggle with systematic reasoning, often hallucinating confident but unfounded claims. When Apple researchers added irrelevant context to mathematical problems, LLM performance degraded by 65% Apple Machi...
Position: Science of AI Evaluation Requires Item-level Benchmark Data
arXiv:2604.03244v1 Announce Type: new
Abstract: AI evaluations have become the primary evidence for deploying generative AI systems across high-stakes domains. However, current evaluation paradigms often exhibit systemic validity failures. These issues, ranging from unjustified design choices to mi...
Toward Full Autonomous Laboratory Instrumentation Control with Large Language Models
arXiv:2604.03286v1 Announce Type: new
Abstract: The control of complex laboratory instrumentation often requires significant programming expertise, creating a barrier for researchers lacking computational skills. This work explores the potential of large language models (LLMs), such as ChatGPT, and...
IC3-Evolve: Proof-/Witness-Gated Offline LLM-Driven Heuristic Evolution for IC3 Hardware Model Checking
arXiv:2604.03232v1 Announce Type: new
Abstract: IC3, also known as property-directed reachability (PDR), is a commonly-used algorithm for hardware safety model checking. It checks if a state transition system complies with a given safety property. IC3 either returns UNSAFE (indicating property viol...
To Throw a Stone with Six Birds: On Agents and Agenthood
arXiv:2604.03239v1 Announce Type: new
Abstract: Six Birds Theory (SBT) treats macroscopic objects as induced closures rather than primitives. Empirical discussions of agency often conflate persistence (being an object) with control (making a counterfactual difference), which makes agency claims dif...
DRAFT: Task Decoupled Latent Reasoning for Agent Safety
arXiv:2604.03242v1 Announce Type: new
Abstract: The advent of tool-using LLM agents shifts safety monitoring from output moderation to auditing long, noisy interaction trajectories, where risk-critical evidence is sparse-making standard binary supervision poorly suited for credit assignment. To add...
arXiv:2604.03240v1 Announce Type: new
Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding generation in external knowledge, yielding relevance responses that are aligned with factual evidence and evolving corpora. Standard RAG pipelines construct contex...
Integrating Artificial Intelligence, Physics, and Internet of Things: A Framework for Cultural Heritage Conservation
arXiv:2604.03233v1 Announce Type: new
Abstract: The conservation of cultural heritage increasingly relies on integrating technological innovation with domain expertise to ensure effective monitoring and predictive maintenance. This paper presents a novel framework to support the preservation of cul...