Towards Improved Sentence Representations using Token Graphs
arXiv:2603.03389v1 Announce Type: new
Abstract: Obtaining a single-vector representation from a Large Language Model's (LLM) token-level outputs is a critical step for nearly all sentence-level tasks. However, standard pooling methods like mean or max aggregation treat tokens as an independent set,...
Heterogeneous Time Constants Improve Stability in Equilibrium Propagation
arXiv:2603.03402v1 Announce Type: new
Abstract: Equilibrium propagation (EP) is a biologically plausible alternative to backpropagation for training neural networks. However, existing EP models use a uniform scalar time step dt, which corresponds biologically to a membrane time constant that is het...
Asymmetric Goal Drift in Coding Agents Under Value Conflict
arXiv:2603.03456v1 Announce Type: new
Abstract: Agentic coding agents are increasingly deployed autonomously, at scale, and over long-context horizons. Throughout an agent's lifetime, it must navigate tensions between explicit instructions, learned values, and environmental pressures, often in cont...
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
arXiv:2603.03565v1 Announce Type: new
Abstract: Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly coupled multi...
Mozi: Governed Autonomy for Drug Discovery LLM Agents
arXiv:2603.03655v1 Announce Type: new
Abstract: Tool-augmented large language model (LLM) agents promise to unify scientific reasoning with computation, yet their deployment in high-stakes domains like drug discovery is bottlenecked by two critical barriers: unconstrained tool-use governance and po...
MAGE: Meta-Reinforcement Learning for Language Agents toward Strategic Exploration and Exploitation
arXiv:2603.03680v1 Announce Type: new
Abstract: Large Language Model (LLM) agents have demonstrated remarkable proficiency in learned tasks, yet they often struggle to adapt to non-stationary environments with feedback. While In-Context Learning and external memory offer some flexibility, they fail...
AI4S-SDS: A Neuro-Symbolic Solvent Design System via Sparse MCTS and Differentiable Physics Alignment
arXiv:2603.03686v1 Announce Type: new
Abstract: Automated design of chemical formulations is a cornerstone of materials science, yet it requires navigating a high-dimensional combinatorial space involving discrete compositional choices and continuous geometric constraints. Existing Large Language M...
Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model
We are pleased to announce Phi-4-reasoning-vision-15B, a 15 billion parameter open‑weight multimodal reasoning model, available through Microsoft Foundry (opens in new tab), HuggingFace (opens in new tab) and GitHub (opens in new tab). Phi-4-reasoning-vision-15B is a broadly capable model that can b...
Meta: From social platforms to systems architecture heavyweight
As Meta rebuilds its technical foundations to support multi-year model lifecycles, modular architectures, and reliability-first design, it is quietly reshaping how Silicon Valley thinks about production-grade AI.
RxnNano:Training Compact LLMs for Chemical Reaction and Retrosynthesis Prediction via Hierarchical Curriculum Learning
arXiv:2603.02215v1 Announce Type: new
Abstract: Chemical reaction prediction is pivotal for accelerating drug discovery and synthesis planning. Despite advances in data-driven models, current approaches are hindered by an overemphasis on parameter and dataset scaling. Some methods coupled with eval...
ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue
arXiv:2603.02216v1 Announce Type: new
Abstract: Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the unce...
Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression
arXiv:2603.02217v1 Announce Type: new
Abstract: Mixture-of-Experts (MoE) models scale capacity efficiently, but their massive parameter footprint creates a deployment-time memory bottleneck. We organize retraining-free MoE compression into three paradigms - Expert Pruning, Expert Editing, and Exper...
Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain
arXiv:2603.02218v1 Announce Type: new
Abstract: Large language models (LLMs) make it plausible to build systems that improve through self-evolving loops, but many existing proposals are better understood as self-play and often plateau quickly. A central failure mode is that the loop synthesises mor...
NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels
arXiv:2603.02219v1 Announce Type: new
Abstract: Large language models are increasingly deployed in streaming scenarios, rendering conventional post-hoc safeguards ineffective as they fail to interdict unsafe content in real-time. While streaming safeguards based on token-level supervised training c...
Federated Inference: Toward Privacy-Preserving Collaborative and Incentivized Model Serving
arXiv:2603.02214v1 Announce Type: new
Abstract: Federated Inference (FI) studies how independently trained and privately owned models can collaborate at inference time without sharing data or model parameters. While recent work has explored secure and distributed inference from disparate perspectiv...
Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents
arXiv:2603.02239v1 Announce Type: new
Abstract: The Engineering Reasoning and Instruction (ERI) benchmark is a taxonomy-driven instruction dataset designed to train and evaluate engineering-capable large language models (LLMs) and agents. This dataset spans nine engineering fields (namely: civil, m...
SuperLocalMemory: Privacy-Preserving Multi-Agent Memory with Bayesian Trust Defense Against Memory Poisoning
arXiv:2603.02240v1 Announce Type: new
Abstract: We present SuperLocalMemory, a local-first memory system for multi-agent AI that defends against OWASP ASI06 memory poisoning through architectural isolation and Bayesian trust scoring, while personalizing retrieval through adaptive learning-to-rank -...
What Your Phone Knows Could Help Scientists Understand Your Health
Stanford scientists have released an open-source platform that lets health researchers study the “screenome” – the digital traces of our daily lives – while protecting participants’ privacy.
Transit Network Design with Two-Level Demand Uncertainties: A Machine Learning and Contextual Stochastic Optimization Framework
arXiv:2603.00010v1 Announce Type: new
Abstract: Transit Network Design is a well-studied problem in the field of transportation, typically addressed by solving optimization models under fixed demand assumptions. Considering the limitations of these assumptions, this paper proposes a new framework, ...
CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation
arXiv:2603.00039v1 Announce Type: new
Abstract: LLM-as-a-judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. However, in practice, LLM ju...
Attn-QAT: 4-Bit Attention With Quantization-Aware Training
arXiv:2603.00040v1 Announce Type: new
Abstract: Achieving reliable 4-bit attention is a prerequisite for end-to-end FP4 computation on emerging FP4-capable GPUs, yet attention remains the main obstacle due to FP4's tiny dynamic range and attention's heavy-tailed activations. This paper presents the...
Multi-Sourced, Multi-Agent Evidence Retrieval for Fact-Checking
arXiv:2603.00267v1 Announce Type: new
Abstract: Misinformation spreading over the Internet poses a significant threat to both societies and individuals, necessitating robust and scalable fact-checking that relies on retrieving accurate and trustworthy evidence. Previous methods rely on semantic and...
TraderBench: How Robust Are AI Agents in Adversarial Capital Markets?
arXiv:2603.00285v1 Announce Type: new
Abstract: Evaluating AI agents in finance faces two key challenges: static benchmarks require costly expert annotation yet miss the dynamic decision-making central to real-world trading, while LLM-based judges introduce uncontrolled variance on domain-specific ...