When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems
arXiv:2601.16280v1 Announce Type: new
Abstract: Multi-agent systems powered by large language models (LLMs) are transforming enterprise automation, yet systematic evaluation methodologies for assessing tool-use reliability remain underdeveloped. We introduce a comprehensive diagnostic framework tha...
DSGym: A Holistic Framework for Evaluating and Training Data Science Agents
arXiv:2601.16344v1 Announce Type: new
Abstract: Data science agents promise to accelerate discovery and insight-generation by turning data into executable analyses and findings. Yet existing data science benchmarks fall short due to fragmented evaluation interfaces that make cross-benchmark compari...
SemanticALLI: Caching Reasoning, Not Just Responses, in Agentic Systems
arXiv:2601.16286v1 Announce Type: new
Abstract: Agentic AI pipelines suffer from a hidden inefficiency: they frequently reconstruct identical intermediate logic, such as metric normalization or chart scaffolding, even when the user's natural language phrasing is entirely novel. Conventional boundar...
SycoEval-EM: Sycophancy Evaluation of Large Language Models in Simulated Clinical Encounters for Emergency Care
arXiv:2601.16529v1 Announce Type: new
Abstract: Large language models (LLMs) show promise in clinical decision support yet risk acquiescing to patient pressure for inappropriate care. We introduce SycoEval-EM, a multi-agent simulation framework evaluating LLM robustness through adversarial patient ...
Doc2AHP: Inferring Structured Multi-Criteria Decision Models via Semantic Trees with LLMs
arXiv:2601.16479v1 Announce Type: new
Abstract: While Large Language Models (LLMs) demonstrate remarkable proficiency in semantic understanding, they often struggle to ensure structural consistency and reasoning reliability in complex decision-making tasks that demand rigorous logic. Although class...
Analyzing Neural Network Information Flow Using Differential Geometry
arXiv:2601.16366v1 Announce Type: new
Abstract: This paper provides a fresh view of the neural network (NN) data flow problem, i.e., identifying the NN connections that are most important for the performance of the full model, through the lens of graph theory. Understanding the NN data flow provide...
FedUMM: A General Framework for Federated Learning with Unified Multimodal Models
arXiv:2601.15390v1 Announce Type: new
Abstract: Unified multimodal models (UMMs) are emerging as strong foundation models that can do both generation and understanding tasks in a single architecture. However, they are typically trained in centralized settings where all training and downstream datas...
Empowering LLMs for Structure-Based Drug Design via Exploration-Augmented Latent Inference
arXiv:2601.15333v1 Announce Type: new
Abstract: Large Language Models (LLMs) possess strong representation and reasoning capabilities, but their application to structure-based drug design (SBDD) is limited by insufficient understanding of protein structures and unpredictable molecular generation. T...
arXiv:2601.15337v1 Announce Type: new
Abstract: Users should not be systemically disadvantaged by the language they use for interacting with LLMs; i.e. users across languages should get responses of similar quality irrespective of language used. In this work, we create a set of real-world open-ende...
VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration
arXiv:2601.14440v1 Announce Type: new
Abstract: Vision-language models (VLMs) lag behind text-only language models on mathematical reasoning when the same problems are presented as images rather than text. We empirically characterize this as a modality gap: the same question in text form yields mar...
On the Generalization Gap in LLM Planning: Tests and Verifier-Reward RL
arXiv:2601.14456v1 Announce Type: new
Abstract: Recent work shows that fine-tuned Large Language Models (LLMs) can achieve high valid plan rates on PDDL planning tasks. However, it remains unclear whether this reflects transferable planning competence or domain-specific memorization. In this work, ...
Call2Instruct: Automated Pipeline for Generating Q&A Datasets from Call Center Recordings for LLM Fine-Tuning
arXiv:2601.14263v1 Announce Type: new
Abstract: The adaptation of Large-Scale Language Models (LLMs) to specific domains depends on high-quality fine-tuning datasets, particularly in instructional format (e.g., Question-Answer - Q&A). However, generating these datasets, particularly from unstructur...
Quality or Quantity? Error-Informed Selective Online Learning with Gaussian Processes in Multi-Agent Systems: Extended Version
arXiv:2601.14275v1 Announce Type: new
Abstract: Effective cooperation is pivotal in distributed learning for multi-agent systems, where the interplay between the quantity and quality of the machine learning models is crucial. This paper reveals the irrationality of indiscriminate inclusion of all m...
Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct
arXiv:2601.14277v1 Announce Type: new
Abstract: Quantization is a practical technique for making large language models easier to deploy by reducing the precision used to store and operate on model weights. This can lower memory use and improve runtime feasibility on constrained hardware, which is e...
Divide and Refine: Enhancing Multimodal Representation and Explainability for Emotion Recognition in Conversation
arXiv:2601.14274v1 Announce Type: new
Abstract: Multimodal emotion recognition in conversation (MERC) requires representations that effectively integrate signals from multiple modalities. These signals include modality-specific cues, information shared across modalities, and interactions that emerg...
arXiv:2601.14266v1 Announce Type: new
Abstract: While most LLMs are autoregressive, diffusion-based LLMs have recently emerged as an alternative method for generation. Greedy Coordinate Gradient (GCG) attacks have proven effective against autoregressive models, but their applicability to diffusion ...
Dynamical Systems Analysis Reveals Functional Regimes in Large Language Models
arXiv:2601.11622v1 Announce Type: new
Abstract: Large language models perform text generation through high-dimensional internal dynamics, yet the temporal organisation of these dynamics remains poorly understood. Most interpretability approaches emphasise static representations or causal interventi...
AdaFRUGAL: Adaptive Memory-Efficient Training with Dynamic Control
arXiv:2601.11568v1 Announce Type: new
Abstract: Training Large Language Models (LLMs) is highly memory-intensive due to optimizer state overhead. The FRUGAL framework mitigates this with gradient splitting, but its static hyperparameters -- the subspace ratio ($\rho$) and update frequency ($T$) -- ...
arXiv:2601.11604v1 Announce Type: new
Abstract: Multi-objective reinforcement learning (MORL) enables agents to optimize vector-valued rewards while respecting user preferences. CAPQL, a preference-conditioned actor-critic method, achieves this by conditioning on weight vectors w and restricts data...
Discrete Semantic States and Hamiltonian Dynamics in LLM Embedding Spaces
arXiv:2601.11572v1 Announce Type: new
Abstract: We investigate the structure of Large Language Model (LLM) embedding spaces using mathematical concepts, particularly linear algebra and the Hamiltonian formalism, drawing inspiration from analogies with quantum mechanical systems. Motivated by the ob...
GRADE: Replacing Policy Gradients with Backpropagation for LLM Alignment
arXiv:2601.11574v1 Announce Type: new
Abstract: Reinforcement learning from human feedback (RLHF) has become the dominant paradigm for aligning large language models with human preferences. However, policy gradient methods such as PPO suffer from high variance gradient estimates, requiring careful ...
CSyMR: Benchmarking Compositional Symbolic Music Reasoning With MIR Tool Integration
arXiv:2601.11556v1 Announce Type: new
Abstract: Large Language Models (LLMs) are leveraged in symbolic music reasoning, yet existing benchmarks emphasize isolated knowledge or atomic analyses rather than the integrative compositional reasoning needed to connect musical structures. To address this, ...
Reasoning Stabilization Point: A Training-Time Signal for Stable Evidence and Shortcut Reliance
arXiv:2601.11625v1 Announce Type: new
Abstract: Fine-tuning pretrained language models can improve task performance while subtly altering the evidence a model relies on. We propose a training-time interpretability view that tracks token-level attributions across finetuning epochs. We define explana...
MIMIC-RD: Can LLMs differentially diagnose rare diseases in real-world clinical settings?
arXiv:2601.11559v1 Announce Type: new
Abstract: Despite rare diseases affecting 1 in 10 Americans, their differential diagnosis remains challenging. Due to their impressive recall abilities, large language models (LLMs) have been recently explored for differential diagnosis. Existing approaches to ...