PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation
arXiv:2606.12616v1 Announce Type: new
Abstract: Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work...
Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation
arXiv:2606.12594v1 Announce Type: new
Abstract: Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) an...
arXiv:2606.12587v1 Announce Type: new
Abstract: Traditionally, decision support studies how humans use machine learning models to make better decisions. In modern agentic systems, this division of roles is increasingly reversed: AI agents act on behalf of users, while humans and tools become suppo...
Arbor: Tree Search as a Cognition Layer for Autonomous Agents
arXiv:2606.12563v1 Announce Type: new
Abstract: Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action spaces. Prior autonomous optimization systems operate on isolated targets with stateless evaluation....
ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs
arXiv:2606.12451v1 Announce Type: new
Abstract: Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. As embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, parametric tool retrieval ...
The benchmark gap, explained: What AI leaderboards measure and what they miss
Every frontier model now scores above 88% on MMLU. So why does a 37% gap still exist between lab benchmark scores and real-world AI deployment performance? We explain why the tests keep lying, and what rigorous evaluation actually looks like.
From Explicit Elements to Implicit Intent: A Predefined Library for Auditable Behavioral Inference
arXiv:2606.11207v1 Announce Type: new
Abstract: We present SemantiClean, a modular framework for extracting structured semantic signals from e-commerce session data and driving pluggable inference targets including purchase intent, customer segmentation, and product affinity through a shared elemen...
Restless bandits with imperfect binary feedback: PCL-indexability analysis and computation
arXiv:2606.11192v1 Announce Type: new
Abstract: We study restless bandits with binary latent states and imperfect binary feedback, motivated by opportunistic spectrum access with sensing errors. For the associated belief-state model, we develop a partial conservation laws (PCL)-based analytical and...
To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending
arXiv:2606.11201v1 Announce Type: new
Abstract: The wide deployment of LLMs has made model alignment necessary so that newly trained models respond safely and effectively to user instructions. Among different methods, inference-time alignment is often cheaper as it intervenes (i.e., offers guidance...
Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention
arXiv:2606.11205v1 Announce Type: new
Abstract: Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements. We introduce dual-stance evaluation, which tests both sta...
ProHiFlo: Hierarchical Flow Matching with Functional Guidance for De Novo Protein Generation
arXiv:2606.11243v1 Announce Type: new
Abstract: De novo protein generation has transformative potential in therapeutic design, enzyme engineering, and synthetic biology. While diffusion-based and flow matching approaches have achieved progress, they typically operate at single resolution and lack m...
Position: Hippocampal Explicit Memory Is the Cornerstone for AGI
arXiv:2606.11245v1 Announce Type: new
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, raising expectations for Artificial General Intelligence (AGI). This position paper argues that integrating explicit memory is the cornerstone for advancing L...
arXiv:2606.11337v1 Announce Type: new
Abstract: Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a larg...
Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents
arXiv:2606.11349v1 Announce Type: new
Abstract: In hierarchical reasoning, failures often originate at intermediate decision points where the agent commits to a wrong branch without recognizing that it lacks critical information. Rather than treating clarification as an external uncertainty trigger...
Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline
arXiv:2606.11379v1 Announce Type: new
Abstract: Pre-mediation, the preparatory phase preceding direct human negotiation, plays a critical role in achieving mutually beneficial agreements, yet is often omitted due to cost, time, and limited access to trained mediators. We introduce an automated medi...
A classic brain test exposed AI's biggest weakness
Researchers gave top AI models a classic attention test used in psychology and found a major flaw. While the models could correctly name colors in short lists, their performance deteriorated sharply as the task became longer and more complex. Some leading systems fell from over 90% accuracy to nearl...
Mitigating Manifold Departure: Uncertainty-Aware Subspace Rectification for Trustworthy MLLM Decoding
arXiv:2606.09859v1 Announce Type: new
Abstract: MLLMs frequently hallucinate objects inconsistent with visual inputs. This issue is typically attributed to the over-reliance on language priors, which can override the visual context. Recent training-free decoding strategies address this by penalizin...
Mechanistic Analysis of Alignment Algorithms in Language Models
arXiv:2606.09850v1 Announce Type: new
Abstract: Post-training alignment algorithms are predominantly evaluated as black boxes, obscuring how they reshape language models' internal computations. We present a systematic mechanistic analysis of six preference-optimization methods: PPO, DPO, SimPO, ORP...
SynIB: Informational Bottleneck for Maximizing Synergy in Multimodal Learning
arXiv:2606.09853v1 Announce Type: new
Abstract: A central objective in multimodal learning is to capture synergy: task-relevant information that arises only from the joint use of multiple modalities, and is not available from any single modality alone. While most approaches operate at the architect...
Uncertainty-aware Multi-fidelity Closure via Conditional Normalizing Flows
arXiv:2606.09857v1 Announce Type: new
Abstract: Reduced-order models (ROMs) provide an efficient surrogate for complex multiscale systems, but their predictive accuracy is often compromised by truncation errors and the inadequate representation of interactions between resolved and unresolved scales...
From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
arXiv:2606.10147v1 Announce Type: new
Abstract: Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through ...