Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process more transparent to model builders and impacted humans, a ...
Scientists built the hardest AI test ever and the results are surprising
As AI systems began acing traditional tests, researchers realized those benchmarks were no longer tough enough. In response, nearly 1,000 experts created Humanity’s Last Exam, a massive 2,500-question challenge covering highly specialized topics across many fields. The exam was engineered so that an...
Interventional Time Series Priors for Causal Foundation Models
arXiv:2603.11090v1 Announce Type: new
Abstract: Prior-data fitted networks (PFNs) have emerged as powerful foundation models for tabular causal inference, yet their extension to time series remains limited by the absence of synthetic data generators that provide interventional targets. Existing tim...
Graph Tokenization for Bridging Graphs and Transformers
arXiv:2603.11099v1 Announce Type: new
Abstract: The success of large pretrained Transformers is closely tied to tokenizers, which convert raw input into discrete symbols. Extending these models to graph-structured data remains a significant challenge. In this work, we introduce a graph tokenization...
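The core idea in this abstract (mapping a graph into a discrete token stream a Transformer can consume) can be illustrated with a minimal sketch. This is not the paper's tokenizer, which the excerpt does not specify; the BFS-walk scheme and the `<node:…>`/`<edge:…>` token format below are illustrative assumptions.

```python
# Illustrative sketch only: one simple way to serialize a graph into
# discrete tokens via a breadth-first walk. The tokenizer proposed in
# the paper is not described in this excerpt.
from collections import deque

def bfs_tokenize(adj, start):
    """Emit a token stream from a breadth-first traversal of `adj`
    (dict: node -> list of neighbors), interleaving node and edge tokens."""
    tokens, seen, queue = [], {start}, deque([start])
    while queue:
        u = queue.popleft()
        tokens.append(f"<node:{u}>")
        for v in sorted(adj[u]):          # sorted for a deterministic stream
            tokens.append(f"<edge:{u}-{v}>")
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return tokens

# A 3-node star graph centered on node 0.
adj = {0: [1, 2], 1: [0], 2: [0]}
toks = bfs_tokenize(adj, 0)
```

Any such serialization trades off sequence length against how much structure the Transformer must re-infer from token order.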
DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
arXiv:2603.11076v1 Announce Type: new
Abstract: Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks. Scaling diversit...
A Survey of Reasoning in Autonomous Driving Systems: Open Challenges and Emerging Paradigms
arXiv:2603.11093v1 Announce Type: new
Abstract: The development of high-level autonomous driving (AD) is shifting from perception-centric limitations to a more fundamental bottleneck, namely, a deficit in robust and generalizable reasoning. Although current AD systems manage structured environments...
PACED: Distillation at the Frontier of Student Competence
arXiv:2603.11178v1 Announce Type: new
Abstract: Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not merely in...
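The selection principle the abstract states (skip problems the student has mastered and problems far beyond its reach, keep the frontier in between) can be sketched as a pass-rate band filter. The thresholds, the sampling-based pass-rate estimate, and the toy student below are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch of competence-frontier filtering for distillation data.
# The band thresholds (0.2, 0.8) and the toy student are assumptions made
# for this example, not details from the paper.
import random

class ToyStudent:
    """Stand-in model: solves a problem with probability 1 - difficulty."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
    def solve(self, difficulty):
        return self.rng.random() > difficulty

def estimate_pass_rate(student, problem, n_samples=64):
    """Fraction of sampled attempts the student gets right."""
    return sum(student.solve(problem) for _ in range(n_samples)) / n_samples

def frontier_filter(student, problems, low=0.2, high=0.8):
    """Keep problems the student sometimes, but not always, solves:
    mastered items (rate >= high) yield near-zero gradients, and
    far-out-of-reach items (rate <= low) yield incoherent ones."""
    return [p for p in problems
            if low < estimate_pass_rate(student, p) < high]

# Difficulties: one mastered, one at the frontier, one out of reach.
kept = frontier_filter(ToyStudent(), [0.05, 0.5, 0.95])
```

In practice the band would move as the student improves, which is presumably where the "frontier" framing earns its name.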
Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios
arXiv:2603.11214v1 Announce Type: new
Abstract: We evaluate the autonomous cyber-attack capabilities of frontier AI models on two purpose-built cyber ranges (a 32-step corporate network attack and a 7-step industrial control system attack) that require chaining heterogeneous capabilities across exten...
Reversible Lifelong Model Editing via Semantic Routing-Based LoRA
arXiv:2603.11239v1 Announce Type: new
Abstract: The dynamic evolution of real-world knowledge necessitates model editing in Large Language Models. While existing methods explore modular isolation or parameter-efficient strategies, they still suffer from semantic drift or knowledge forgetting due to conti...
mAceReason-Math: A Dataset of High-Quality Multilingual Math Problems Ready For RLVR
Reinforcement Learning with Verifiable Rewards (RLVR) has been successfully applied to significantly boost the capabilities of pretrained large language models, especially in the math and logic problem domains. However, current research and available training datasets remain English-centric. While m...
Multilingual Reasoning Gym: Multilingual Scaling of Procedural Reasoning Environments
We present the Multilingual Reasoning Gym, an extension of Reasoning Gym (Stojanovski et al., 2025), that procedurally generates verifiable reasoning problems across 14 languages. We translate templates for 94 tasks with native-speaker validation in 10 languages and targeted code or template adaptat...
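The "procedurally generated, verifiable" pattern described here can be shown with a minimal example: a seeded template instantiates the same underlying problem in each language, and a programmatic checker verifies answers. The task template, the two languages shown, and the function names are illustrative assumptions, not the benchmark's actual API.

```python
# Illustrative sketch of a procedurally generated, verifiable multilingual
# task in the spirit of Reasoning Gym. Templates and helpers here are
# invented for the example, not taken from the benchmark.
import random

TEMPLATES = {
    "en": "What is {a} plus {b}?",
    "de": "Was ist {a} plus {b}?",
}

def generate(lang, seed):
    """Same seed -> same underlying problem, regardless of language."""
    rng = random.Random(seed)
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    return {"prompt": TEMPLATES[lang].format(a=a, b=b), "answer": a + b}

def verify(task, model_answer):
    """Programmatic check: no human grading, no LLM judge."""
    return int(model_answer) == task["answer"]

task = generate("en", 0)
```

Because the generator is seeded, the English and German variants of a task share one ground-truth answer, which is what makes cross-lingual comparison clean.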
Meta buys Moltbook: The social network where AI agents talk to each other
Meta’s acquisition of Moltbook highlights a growing focus on agent-to-agent systems and the infrastructure required to support them. It’s a small deal that signals bigger shifts in how AI ecosystems may evolve.
Systematic debugging for AI agents: Introducing the AgentRx framework
As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency. When a human makes a mistake, we can usually trace the logic. But when an AI a...
arXiv:2603.09980v1 Announce Type: new
Abstract: LLM unlearning is essential for mitigating safety, copyright, and privacy concerns in pre-trained large language models (LLMs). Compared to preference alignment, it offers a more explicit way by removing undesirable knowledge characterized by specific...
MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios
arXiv:2603.09983v1 Announce Type: new
Abstract: Mixture-of-Experts (MoE) models enable scalable performance but face severe memory constraints on edge devices. Existing offloading strategies struggle with I/O bottlenecks due to the dynamic, low-information nature of autoregressive expert activation...
Personalized Group Relative Policy Optimization for Heterogenous Preference Alignment
arXiv:2603.10009v1 Announce Type: new
Abstract: Despite their sophisticated general-purpose capabilities, Large Language Models (LLMs) often fail to align with diverse individual preferences because standard post-training methods, like Reinforcement Learning with Human Feedback (RLHF), optimize for...
LWM-Temporal: Sparse Spatio-Temporal Attention for Wireless Channel Representation Learning
arXiv:2603.10024v1 Announce Type: new
Abstract: LWM-Temporal is a new member of the Large Wireless Models (LWM) family that targets the spatiotemporal nature of wireless channels. Designed as a task-agnostic foundation model, LWM-Temporal learns universal channel embeddings that capture mobility-in...
Agentic Control Center for Data Product Optimization
arXiv:2603.10133v1 Announce Type: new
Abstract: Data products enable end users to gain deeper insight into their data by providing supporting assets, such as example question-SQL pairs that can be answered using the data, or views over the database tables. However, producing useful data products...
Hybrid Self-evolving Structured Memory for GUI Agents
arXiv:2603.10291v1 Announce Type: new
Abstract: The remarkable progress of vision-language models (VLMs) has enabled GUI agents to interact with computers in a human-like manner. Yet real-world computer-use tasks remain difficult due to long-horizon workflows, diverse interfaces, and frequent inter...
HEAL: Hindsight Entropy-Assisted Learning for Reasoning Distillation
arXiv:2603.10359v1 Announce Type: new
Abstract: Distilling reasoning capabilities from Large Reasoning Models (LRMs) into smaller models is typically constrained by the limitation of rejection sampling. Standard methods treat the teacher as a static filter, discarding complex "corner-case" problems...
Beyond Scalars: Evaluating and Understanding LLM Reasoning via Geometric Progress and Stability
arXiv:2603.10384v1 Announce Type: new
Abstract: Evaluating LLM reliability via scalar probabilities often fails to capture the structural dynamics of reasoning. We introduce TRACED, a framework that assesses reasoning quality through theoretically grounded geometric kinematics. By decomposing reaso...
Verbalizing LLM's Higher-order Uncertainty via Imprecise Probabilities
arXiv:2603.10396v1 Announce Type: new
Abstract: Despite the growing demand for eliciting uncertainty from large language models (LLMs), empirical evidence suggests that LLM behavior is not always adequately captured by the elicitation techniques developed under the classical probabilistic uncertain...
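The imprecise-probability formalism the abstract invokes replaces a single point probability with a lower/upper pair, where the interval width carries the higher-order (epistemic) uncertainty. The small class below sketches that standard idea; the elicitation scheme in the paper itself is not described in this excerpt.

```python
# Illustrative sketch of interval-valued (imprecise) probabilities.
# This shows the classical formalism only, not the paper's elicitation method.
from dataclasses import dataclass

@dataclass
class ProbInterval:
    lower: float
    upper: float
    def __post_init__(self):
        assert 0.0 <= self.lower <= self.upper <= 1.0

    def complement(self):
        """Conjugacy: lower(not A) = 1 - upper(A), upper(not A) = 1 - lower(A)."""
        return ProbInterval(1.0 - self.upper, 1.0 - self.lower)

    def width(self):
        """Interval width encodes higher-order (epistemic) uncertainty."""
        return self.upper - self.lower

vague = ProbInterval(0.3, 0.7)   # "somewhere between 30% and 70%"
sharp = ProbInterval(0.6, 0.6)   # classical point probability: width zero
```

A point probability is just the degenerate interval with `lower == upper`, which is why elicitation techniques built for the classical setting can miss the extra structure.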
We propose a 3D latent representation that jointly models object geometry and view-dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view-independent diffuse appearance, and thus struggle to capture realistic view-dependent effects. Our approach leverage...