Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents
arXiv:2603.02239v1 Announce Type: new
Abstract: The Engineering Reasoning and Instruction (ERI) benchmark is a taxonomy-driven instruction dataset designed to train and evaluate engineering-capable large language models (LLMs) and agents. This dataset spans nine engineering fields (namely: civil, m...
SuperLocalMemory: Privacy-Preserving Multi-Agent Memory with Bayesian Trust Defense Against Memory Poisoning
arXiv:2603.02240v1 Announce Type: new
Abstract: We present SuperLocalMemory, a local-first memory system for multi-agent AI that defends against OWASP ASI06 memory poisoning through architectural isolation and Bayesian trust scoring, while personalizing retrieval through adaptive learning-to-rank -...
What Your Phone Knows Could Help Scientists Understand Your Health
Stanford scientists have released an open-source platform that lets health researchers study the “screenome” – the digital traces of our daily lives – while protecting participants’ privacy.
Transit Network Design with Two-Level Demand Uncertainties: A Machine Learning and Contextual Stochastic Optimization Framework
arXiv:2603.00010v1 Announce Type: new
Abstract: Transit Network Design is a well-studied problem in the field of transportation, typically addressed by solving optimization models under fixed demand assumptions. Considering the limitations of these assumptions, this paper proposes a new framework, ...
CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation
arXiv:2603.00039v1 Announce Type: new
Abstract: LLM-as-a-judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. However, in practice, LLM ju...
Attn-QAT: 4-Bit Attention With Quantization-Aware Training
arXiv:2603.00040v1 Announce Type: new
Abstract: Achieving reliable 4-bit attention is a prerequisite for end-to-end FP4 computation on emerging FP4-capable GPUs, yet attention remains the main obstacle due to FP4's tiny dynamic range and attention's heavy-tailed activations. This paper presents the...
Multi-Sourced, Multi-Agent Evidence Retrieval for Fact-Checking
arXiv:2603.00267v1 Announce Type: new
Abstract: Misinformation spreading over the Internet poses a significant threat to both societies and individuals, necessitating robust and scalable fact-checking that relies on retrieving accurate and trustworthy evidence. Previous methods rely on semantic and...
TraderBench: How Robust Are AI Agents in Adversarial Capital Markets?
arXiv:2603.00285v1 Announce Type: new
Abstract: Evaluating AI agents in finance faces two key challenges: static benchmarks require costly expert annotation yet miss the dynamic decision-making central to real-world trading, while LLM-based judges introduce uncontrolled variance on domain-specific ...
DIG to Heal: Scaling General-purpose Agent Collaboration via Explainable Dynamic Decision Paths
arXiv:2603.00309v1 Announce Type: new
Abstract: The increasingly popular agentic AI paradigm promises to harness the power of multiple, general-purpose large language model (LLM) agents to collaboratively complete complex tasks. While many agentic AI systems utilize predefined workflows or agent ro...
How Well Do Multimodal Models Reason on ECG Signals?
arXiv:2603.00312v1 Announce Type: new
Abstract: While multimodal large language models offer a promising solution to the "black box" nature of health AI by generating interpretable reasoning traces, verifying the validity of these traces remains a critical challenge. Existing evaluation methods are...
EmCoop: A Framework and Benchmark for Embodied Cooperation Among LLM Agents
arXiv:2603.00349v1 Announce Type: new
Abstract: Real-world scenarios increasingly require multiple embodied agents to collaborate in dynamic environments under embodied constraints, as many tasks exceed the capabilities of any single agent. Recent advances in large language models (LLMs) enable hig...
On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment
With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering...
How a HAI Seed Grant Helped Launch a Disease-Fighting AI Platform
Stanford scientists in Senegal hunting for schistosomiasis—a parasitic disease infecting 200+ million people worldwide—used AI to transform local field work into satellite-powered disease mapping.
Learning to Reason for Hallucination Span Detection
Large language models (LLMs) often generate hallucinations: unsupported content that undermines reliability. While most prior work frames hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision-making process. Thi...
Detoxifying LLMs via Representation Erasure-Based Preference Optimization
arXiv:2602.23391v1 Announce Type: new
Abstract: Large language models (LLMs) trained on web-scale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations, but not r...
U-CAN: Utility-Aware Contrastive Attenuation for Efficient Unlearning in Generative Recommendation
arXiv:2602.23400v1 Announce Type: new
Abstract: Generative Recommendation (GenRec) typically leverages Large Language Models (LLMs) to redefine personalization as an instruction-driven sequence generation task. However, fine-tuning on user logs inadvertently encodes sensitive attributes into model ...
Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG
arXiv:2602.23410v1 Announce Type: new
Abstract: Brain foundation models have achieved remarkable advances across a wide range of neuroscience tasks. However, most existing models are limited to a single functional modality, restricting their ability to exploit complementary spatiotemporal dynamics ...
arXiv:2602.23413v1 Announce Type: new
Abstract: Recent work such as AlphaEvolve has shown that combining LLM-driven optimization with evolutionary search can effectively improve programs, prompts, and algorithms across domains. In this paradigm, previously evaluated solutions are reused to guide th...
HumanMCP: A Human-Like Query Dataset for Evaluating MCP Tool Retrieval Performance
arXiv:2602.23367v1 Announce Type: new
Abstract: Model Context Protocol (MCP) servers contain a collection of thousands of open-source standardized tools, linking LLMs to external systems; however, existing datasets and benchmarks lack realistic, human-like user queries, which remains a critical gap in ...
An Agentic LLM Framework for Adverse Media Screening in AML Compliance
arXiv:2602.23373v1 Announce Type: new
Abstract: Adverse media screening is a critical component of anti-money laundering (AML) and know-your-customer (KYC) compliance processes in financial institutions. Traditional approaches rely on keyword-based searches that generate high false-positive rates o...
Planning under Distribution Shifts with Causal POMDPs
arXiv:2602.23545v1 Announce Type: new
Abstract: In the real world, planning is often challenged by distribution shifts. As such, a model of the environment obtained under one set of conditions may no longer remain valid as the distribution of states or the environment dynamics change, which in turn...
To Deceive is to Teach? Forging Perceptual Robustness via Adversarial Reinforcement Learning
arXiv:2602.22227v1 Announce Type: new
Abstract: Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) exhibit perceptual fragility when confronted with visually complex scenes. This weakness stems from a reliance on finite training datasets, which are prohibitively expensi...