Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
arXiv:2604.09574v1 Announce Type: new
Abstract: The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-ce...
AHC: Meta-Learned Adaptive Compression for Continual Object Detection on Memory-Constrained Microcontrollers
arXiv:2604.09576v1 Announce Type: new
Abstract: Deploying continual object detection on microcontrollers (MCUs) with under 100KB memory requires efficient feature compression that can adapt to evolving task distributions. Existing approaches rely on fixed compression strategies (e.g., FiLM conditio...
LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
arXiv:2604.09554v1 Announce Type: new
Abstract: Optimism for accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training dedicated foundation models on scientific data to agentic autonomous hypothesis generation systems to AI-dr...
GNN-as-Judge: Unleashing the Power of LLMs for Graph Learning with GNN Feedback
arXiv:2604.08553v1 Announce Type: new
Abstract: Large Language Models (LLMs) have shown strong performance on text-attributed graphs (TAGs) due to their superior semantic understanding ability on textual node features. However, their effectiveness as predictors in the low-resource setting, where la...
Ranked Activation Shift for Post-Hoc Out-of-Distribution Detection
arXiv:2604.08572v1 Announce Type: new
Abstract: State-of-the-art post-hoc out-of-distribution detection methods rely on intermediate layer activation editing. However, they exhibit inconsistent performance across datasets and models. We show that this instability is driven by differences in the act...
QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
arXiv:2604.08570v1 Announce Type: new
Abstract: Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBenc...
arXiv:2604.08571v1 Announce Type: new
Abstract: While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly overfit to standard textual formatting. We propose a perturbation pipeline consisting of 14 techniques ...
Parameterized Complexity Of Representing Models Of MSO Formulas
arXiv:2604.08707v1 Announce Type: new
Abstract: Monadic second order logic (MSO2) plays an important role in parameterized complexity due to the Courcelle's theorem. This theorem states that the problem of checking if a given graph has a property specified by a given MSO2 formula can be solved by a...
OpenKedge: Governing Agentic Mutation with Execution-Bound Safety and Evidence Chains
arXiv:2604.08601v1 Announce Type: new
Abstract: The rise of autonomous AI agents exposes a fundamental flaw in API-centric architectures: probabilistic systems directly execute state mutations without sufficient context, coordination, or safety guarantees. We introduce OpenKedge, a protocol that re...
From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI
arXiv:2604.08603v1 Announce Type: new
Abstract: Existing LLM-based agent systems share a common architectural failure: they answer from the unrestricted knowledge space without first simulating how active business scenarios reshape that space for the event at hand -- producing decisions that are fl...
RAMP: Hybrid DRL for Online Learning of Numeric Action Models
arXiv:2604.08685v1 Announce Type: new
Abstract: Automated planning algorithms require an action model specifying the preconditions and effects of each action, but obtaining such a model is often hard. Learning action models from observations is feasible, but existing algorithms for numeric domains ...
Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets
arXiv:2604.07355v1 Announce Type: new
Abstract: We introduce Prediction Arena, a benchmark for evaluating AI models' predictive accuracy and decision-making by enabling them to trade autonomously on live prediction markets with real capital. Unlike synthetic benchmarks, Prediction Arena tests model...
LLM-Generated Fault Scenarios for Evaluating Perception-Driven Lane Following in Autonomous Edge Systems
arXiv:2604.07362v1 Announce Type: new
Abstract: Deploying autonomous vision systems on edge devices faces a critical challenge: resource constraints prevent real-time and predictable execution of comprehensive safety tests. Existing validation methods depend on static datasets or manual fault injec...
BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis
arXiv:2604.07361v1 Announce Type: new
Abstract: Graph Neural Networks (GNNs) have been widely used in diverse brain network analysis tasks based on preprocessed functional magnetic resonance imaging (fMRI) data. However, their performances are constrained due to high feature sparsity and inherent l...
Benchmark Shadows: Data Alignment, Parameter Footprints, and Generalization in Large Language Models
arXiv:2604.07363v1 Announce Type: new
Abstract: Large language models often achieve strong benchmark gains without corresponding improvements in broader capability. We hypothesize that this discrepancy arises from differences in training regimes induced by data distribution. To investigate this, we...
$S^3$: Stratified Scaling Search for Test-Time in Diffusion Language Models
arXiv:2604.06260v1 Announce Type: new
Abstract: Test-time scaling investigates whether a fixed diffusion language model (DLM) can generate better outputs when given more inference compute, without additional training. However, naive best-of-$K$ sampling is fundamentally limited because it repeatedl...
FLeX: Fourier-based Low-rank EXpansion for multilingual transfer
arXiv:2604.06253v1 Announce Type: new
Abstract: Cross-lingual code generation is critical in enterprise environments where multiple programming languages coexist. However, fine-tuning large language models (LLMs) individually for each language is computationally prohibitive. This paper investigates...
Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse
arXiv:2604.06228v1 Announce Type: new
Abstract: We introduce probabilistic language tries (PLTs), a unified representation that makes explicit the prefix structure implicitly defined by any generative model over sequences. By assigning to each outgoing edge the conditional probability of the corres...
Spectral Edge Dynamics Reveal Functional Modes of Learning
arXiv:2604.06256v1 Announce Type: new
Abstract: Training dynamics during grokking concentrate along a small number of dominant update directions -- the spectral edge -- which reliably distinguishes grokking from non-grokking regimes. We show that standard mechanistic interpretability tools (head at...
A Benchmark of Classical and Deep Learning Models for Agricultural Commodity Price Forecasting on A Novel Bangladeshi Market Price Dataset
arXiv:2604.06227v1 Announce Type: new
Abstract: Accurate short-term forecasting of agricultural commodity prices is critical for food security planning and smallholder income stabilisation in developing economies, yet machine-learning-ready datasets for this purpose remain scarce in South Asia. Thi...
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
arXiv:2604.06233v1 Announce Type: new
Abstract: Safety-trained language models routinely refuse requests for help circumventing rules. But not all rules deserve compliance. When users ask for help evading rules imposed by an illegitimate authority, rules that are deeply unjust or absurd in their co...
Toward Reducing Unproductive Container Moves: Predicting Service Requirements and Dwell Times
arXiv:2604.06251v1 Announce Type: new
Abstract: This article presents the results of a data science study conducted at a container terminal, aimed at reducing unproductive container moves through the prediction of service requirements and container dwell times. We develop and evaluate machine learn...
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
arXiv:2604.06277v1 Announce Type: new
Abstract: Existing hallucination detection methods for large language models (LLMs) rely on external verification at inference time, requiring gold answers, retrieval systems, or auxiliary judge models. We ask whether this external supervision can instead be di...
Operational Noncommutativity in Sequential Metacognitive Judgments
arXiv:2604.04938v1 Announce Type: new
Abstract: Metacognition, understood as the monitoring and regulation of one's own cognitive processes, is inherently sequential: an agent evaluates an internal state, updates it, and may then re-evaluate under modified criteria. Order effects in cognition are w...