How Well Do Multimodal Models Reason on ECG Signals?
arXiv:2603.00312v1 Announce Type: new
Abstract: While multimodal large language models offer a promising solution to the "black box" nature of health AI by generating interpretable reasoning traces, verifying the validity of these traces remains a critical challenge. Existing evaluation methods are...
EmCoop: A Framework and Benchmark for Embodied Cooperation Among LLM Agents
arXiv:2603.00349v1 Announce Type: new
Abstract: Real-world scenarios increasingly require multiple embodied agents to collaborate in dynamic environments under embodied constraints, as many tasks exceed the capabilities of any single agent. Recent advances in large language models (LLMs) enable hig...
On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment
With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering...
How a HAI Seed Grant Helped Launch a Disease-Fighting AI Platform
Stanford scientists in Senegal hunting for schistosomiasis—a parasitic disease infecting 200+ million people worldwide—used AI to transform local field work into satellite-powered disease mapping.
Learning to Reason for Hallucination Span Detection
Large language models (LLMs) often generate hallucinations, unsupported content that undermines reliability. While most prior works frame hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision-making process. Thi...
Detoxifying LLMs via Representation Erasure-Based Preference Optimization
arXiv:2602.23391v1 Announce Type: new
Abstract: Large language models (LLMs) trained on webscale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations, but not r...
U-CAN: Utility-Aware Contrastive Attenuation for Efficient Unlearning in Generative Recommendation
arXiv:2602.23400v1 Announce Type: new
Abstract: Generative Recommendation (GenRec) typically leverages Large Language Models (LLMs) to redefine personalization as an instruction-driven sequence generation task. However, fine-tuning on user logs inadvertently encodes sensitive attributes into model ...
Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG
arXiv:2602.23410v1 Announce Type: new
Abstract: Brain foundation models have achieved remarkable advances across a wide range of neuroscience tasks. However, most existing models are limited to a single functional modality, restricting their ability to exploit complementary spatiotemporal dynamics ...
arXiv:2602.23413v1 Announce Type: new
Abstract: Recent work such as AlphaEvolve has shown that combining LLM-driven optimization with evolutionary search can effectively improve programs, prompts, and algorithms across domains. In this paradigm, previously evaluated solutions are reused to guide th...
HumanMCP: A Human-Like Query Dataset for Evaluating MCP Tool Retrieval Performance
arXiv:2602.23367v1 Announce Type: new
Abstract: Model Context Protocol (MCP) servers contain a collection of thousands of open-source standardized tools, linking LLMs to external systems; however, existing datasets and benchmarks lack realistic, human-like user queries, remaining a critical gap in ...
An Agentic LLM Framework for Adverse Media Screening in AML Compliance
arXiv:2602.23373v1 Announce Type: new
Abstract: Adverse media screening is a critical component of anti-money laundering (AML) and know-your-customer (KYC) compliance processes in financial institutions. Traditional approaches rely on keyword-based searches that generate high false-positive rates o...
Planning under Distribution Shifts with Causal POMDPs
arXiv:2602.23545v1 Announce Type: new
Abstract: In the real world, planning is often challenged by distribution shifts. As such, a model of the environment obtained under one set of conditions may no longer remain valid as the distribution of states or the environment dynamics change, which in turn...
To Deceive is to Teach? Forging Perceptual Robustness via Adversarial Reinforcement Learning
arXiv:2602.22227v1 Announce Type: new
Abstract: Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) exhibit perceptual fragility when confronted with visually complex scenes. This weakness stems from a reliance on finite training datasets, which are prohibitively expensi...
Improving Spatial Allocation for Energy System Coupling with Graph Neural Networks
arXiv:2602.22249v1 Announce Type: new
Abstract: In energy system analysis, coupling models with mismatched spatial resolutions is a significant challenge. A common solution is assigning weights to high-resolution geographic units for aggregation, but traditional models are limited by using only a s...
Zatom-1: A Multimodal Flow Foundation Model for 3D Molecules and Materials
arXiv:2602.22251v1 Announce Type: new
Abstract: General-purpose 3D chemical modeling encompasses molecules and materials, requiring both generative and predictive capabilities. However, most existing AI approaches are optimized for a single domain (molecules or materials) and a single task (generat...
Graph Your Way to Inspiration: Integrating Co-Author Graphs with Retrieval-Augmented Generation for Large Language Model Based Scientific Idea Generation
arXiv:2602.22215v1 Announce Type: new
Abstract: Large Language Models (LLMs) demonstrate potential in the field of scientific idea generation. However, the generated results often lack controllable academic context and traceable inspiration pathways. To bridge this gap, this paper proposes a scient...
FIRE: A Comprehensive Benchmark for Financial Intelligence and Reasoning Evaluation
arXiv:2602.22273v1 Announce Type: new
Abstract: We introduce FIRE, a comprehensive benchmark designed to evaluate both the theoretical financial knowledge of LLMs and their ability to handle practical business scenarios. For theoretical assessment, we curate a diverse set of examination questions d...
arXiv:2602.22287v1 Announce Type: new
Abstract: Abstractions of causal models allow for the coarsening of models such that relations of cause and effect are preserved. Whereas abstractions focus on the relation between two models, in this paper we study a framework for causal embeddings which enabl...
Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents
arXiv:2602.22302v1 Announce Type: new
Abstract: Traditional software relies on contracts -- APIs, type systems, assertions -- to specify and enforce correct behavior. AI agents, by contrast, operate on prompts and natural language instructions with no formal behavioral specification. This gap is th...
Vibe Researching as Wolf Coming: Can AI Agents with Skills Replace or Augment Social Scientists?
arXiv:2602.22401v1 Announce Type: new
Abstract: AI agents -- systems that execute multi-step reasoning workflows with persistent state, tool access, and specialist skills -- represent a qualitative shift from prior automation technologies in social science. Unlike chatbots that respond to isolated ...
Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments
Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize relevance, we leverage two complementary objectives: behavioral relevance (results users tend to click or download) and textual relevance (a result’s ...
The Way We Notice, That's What Really Matters: Instantiating UI Components with Distinguishing Variations
Front-end developers author UI components to be broadly reusable by parameterizing visual and behavioral properties. While flexible, this makes instantiation harder, as developers must reason about numerous property values and interactions. In practice, they must explore the component’s large design...