An Artifact-based Agent Framework for Adaptive and Reproducible Medical Image Processing
arXiv:2604.21936v1 Announce Type: new
Abstract: Medical imaging research is increasingly shifting from controlled benchmark evaluation toward real-world clinical deployment. In such settings, applying analytical methods extends beyond model design to require dataset-aware workflow configuration and...
MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization
arXiv:2604.21937v1 Announce Type: new
Abstract: Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and...
Math Takes Two: A test for emergent mathematical reasoning in communication
arXiv:2604.21935v1 Announce Type: new
Abstract: Although language models demonstrate remarkable proficiency on mathematical benchmarks, it remains unclear whether this reflects true mathematical reasoning or statistical pattern matching over learning formal syntax. Most existing evaluations rely on...
Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results
arXiv:2604.21965v1 Announce Type: new
Abstract: Recent work has used LLM agents to reproduce empirical social science results with access to both the data and code. We broaden this scope by asking: Can they reproduce results given only a paper's methods description and original data? We develop an ...
Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI
arXiv:2604.20972v1 Announce Type: new
Abstract: Content moderation systems are typically evaluated by measuring agreement with human labels. In rule-governed environments this assumption fails: multiple decisions may be logically consistent with the governing policy, and agreement metrics penalize ...
Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models
arXiv:2604.20995v1 Announce Type: new
Abstract: Alignment faking, where a model behaves aligned with developer policy when monitored but reverts to its own preferences when unobserved, is a concerning yet poorly understood phenomenon, in part because current diagnostic tools remain limited. Prior d...
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
arXiv:2604.20987v1 Announce Type: new
Abstract: Long horizon interactive environments are a testbed for evaluating agents skill usage abilities. These environments demand multi step reasoning, the chaining of multiple skills over many timesteps, and robust decision making under delayed rewards and ...
arXiv:2604.21003v1 Announce Type: new
Abstract: AI agents are increasingly deployed on complex, domain-specific workflows -- navigating enterprise web applications that require dozens of clicks and form fills, orchestrating multi-step research pipelines that span search, extraction, and synthesis, ...
AI to Learn 2.0: A Deliverable-Oriented Governance Framework and Maturity Rubric for Opaque AI in Learning-Intensive Domains
arXiv:2604.19751v1 Announce Type: new
Abstract: Generative AI is entering research, education, and professional work faster than current governance frameworks can specify how AI-assisted outputs should be judged in learning-intensive settings. The central problem is proxy failure: a polished artifa...
Exploring Data Augmentation and Resampling Strategies for Transformer-Based Models to Address Class Imbalance in AI Scoring of Scientific Explanations in NGSS Classroom
arXiv:2604.19754v1 Announce Type: new
Abstract: Automated scoring of students' scientific explanations offers the potential for immediate, accurate feedback, yet class imbalance in rubric categories particularly those capturing advanced reasoning remains a challenge. This study investigates augment...
Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks
arXiv:2604.19755v1 Announce Type: new
Abstract: Anti-money laundering (AML) transaction monitoring generates large volumes of alerts that must be rapidly triaged by investigators under strict audit and governance constraints. While large language models (LLMs) can summarize heterogeneous evidence a...
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
arXiv:2604.19835v1 Announce Type: new
Abstract: Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active co...
WorkflowGen:an adaptive workflow generation mechanism driven by trajectory experience
arXiv:2604.19756v1 Announce Type: new
Abstract: Large language model (LLM) agents often suffer from high reasoning overhead, excessive token consumption, unstable execution, and inability to reuse past experiences in complex tasks like business queries, tool use, and workflow orchestration. Traditi...
Transparent Screening for LLM Inference and Training Impacts
arXiv:2604.19757v1 Announce Type: new
Abstract: This paper presents a transparent screening framework for estimating inference and training impacts of current large language models under limited observability. The framework converts natural-language application descriptions into bounded environment...
FASE : A Fairness-Aware Spatiotemporal Event Graph Framework for Predictive Policing
arXiv:2604.18644v1 Announce Type: new
Abstract: Predictive policing systems that allocate patrol resources based solely on predicted crime risk can unintentionally amplify racial disparities through feedback driven data bias. We present FASE, a Fairness Aware Spatiotemporal Event Graph framework, w...
Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training
arXiv:2604.18701v1 Announce Type: new
Abstract: Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the i...
Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning
arXiv:2604.18639v1 Announce Type: new
Abstract: Previous LLMs-based RL studies typically follow either supervised learning with high annotation costs, or unsupervised paradigms using voting or entropy-based rewards. However, their performance remains far from satisfactory due to the substantial ann...
Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs
arXiv:2604.18587v1 Announce Type: new
Abstract: Large language models (LLMs) have demonstrated significant potential in formal theorem proving, yet state-of-the-art performance often necessitates prohibitive test-time compute via massive roll-outs or extended context windows. In this work, we addre...
AI scientists produce results without reasoning scientifically
arXiv:2604.18805v1 Announce Type: new
Abstract: Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we eval...
ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System
arXiv:2604.18789v1 Announce Type: new
Abstract: Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe beh...
Quantum inspired qubit qutrit neural networks for real time financial forecasting
arXiv:2604.18838v1 Announce Type: new
Abstract: This research investigates the performance and efficacy of machine learning models in stock prediction, comparing Artificial Neural Networks (ANNs), Quantum Qubit-based Neural Networks (QQBNs), and Quantum Qutrit-based Neural Networks (QQTNs). By outl...
Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations
arXiv:2604.18724v1 Announce Type: new
Abstract: Users typically interact with and evaluate language models via single outputs, but each output is just one sample from a broad distribution of possible completions. This interaction hides distributional structure such as modes, uncommon edge cases, an...
UniMamba: A Unified Spatial-Temporal Modeling Framework with State-Space and Attention Integration
arXiv:2604.16325v1 Announce Type: new
Abstract: Multivariate time series forecasting is fundamental to numerous domains such as energy, finance, and environmental monitoring, where complex temporal dependencies and cross-variable interactions pose enduring challenges. Existing Transformer-based met...
A Discordance-Aware Multimodal Framework with Multi-Agent Clinical Reasoning
arXiv:2604.16333v1 Announce Type: new
Abstract: Knee osteoarthritis frequently exhibits discordance between structural damage observed in imaging and patient-reported symptoms such as pain. This mismatch complicates clinical interpretation and patient stratification and remains insufficiently model...