Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results
arXiv:2604.21965v1 Announce Type: new
Abstract: Recent work has used LLM agents to reproduce empirical social science results with access to both the data and code. We broaden this scope by asking: Can they reproduce results given only a paper's methods description and original data? We develop an ...
Rethinking Publication: A Certification Framework for AI-Enabled Research
arXiv:2604.22026v1 Announce Type: new
Abstract: AI research pipelines now produce a growing share of publishable academic output, including work that meets existing peer-review standards for quality and novelty. Yet the publication system was built on the assumption of universal human authorship an...
Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI
arXiv:2604.20972v1 Announce Type: new
Abstract: Content moderation systems are typically evaluated by measuring agreement with human labels. In rule-governed environments this assumption fails: multiple decisions may be logically consistent with the governing policy, and agreement metrics penalize ...
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
arXiv:2604.20987v1 Announce Type: new
Abstract: Long-horizon interactive environments are a testbed for evaluating agents' skill-usage abilities. These environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision making under delayed rewards and ...
Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models
arXiv:2604.20995v1 Announce Type: new
Abstract: Alignment faking, where a model behaves aligned with developer policy when monitored but reverts to its own preferences when unobserved, is a concerning yet poorly understood phenomenon, in part because current diagnostic tools remain limited. Prior d...
arXiv:2604.21003v1 Announce Type: new
Abstract: AI agents are increasingly deployed on complex, domain-specific workflows -- navigating enterprise web applications that require dozens of clicks and form fills, orchestrating multi-step research pipelines that span search, extraction, and synthesis, ...
AI to Learn 2.0: A Deliverable-Oriented Governance Framework and Maturity Rubric for Opaque AI in Learning-Intensive Domains
arXiv:2604.19751v1 Announce Type: new
Abstract: Generative AI is entering research, education, and professional work faster than current governance frameworks can specify how AI-assisted outputs should be judged in learning-intensive settings. The central problem is proxy failure: a polished artifa...
Exploring Data Augmentation and Resampling Strategies for Transformer-Based Models to Address Class Imbalance in AI Scoring of Scientific Explanations in NGSS Classroom
arXiv:2604.19754v1 Announce Type: new
Abstract: Automated scoring of students' scientific explanations offers the potential for immediate, accurate feedback, yet class imbalance in rubric categories, particularly those capturing advanced reasoning, remains a challenge. This study investigates augment...
Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks
arXiv:2604.19755v1 Announce Type: new
Abstract: Anti-money laundering (AML) transaction monitoring generates large volumes of alerts that must be rapidly triaged by investigators under strict audit and governance constraints. While large language models (LLMs) can summarize heterogeneous evidence a...
Learning Long-Term Motion Embeddings for Efficient Kinematics Generation
Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of ma...
WorkflowGen: an adaptive workflow generation mechanism driven by trajectory experience
arXiv:2604.19756v1 Announce Type: new
Abstract: Large language model (LLM) agents often suffer from high reasoning overhead, excessive token consumption, unstable execution, and inability to reuse past experiences in complex tasks like business queries, tool use, and workflow orchestration. Traditi...
Transparent Screening for LLM Inference and Training Impacts
arXiv:2604.19757v1 Announce Type: new
Abstract: This paper presents a transparent screening framework for estimating inference and training impacts of current large language models under limited observability. The framework converts natural-language application descriptions into bounded environment...
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
arXiv:2604.19835v1 Announce Type: new
Abstract: Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active co...
ParaRNN: Large-Scale Nonlinear RNNs, Trainable in Parallel
Recurrent Neural Networks (RNNs) are naturally suited to efficient inference, requiring far less memory and compute than attention-based architectures, but the sequential nature of their computation has historically made it impractical to scale up RNNs to billions of parameters. A new advancement fr...
AutoAdapt: Automated domain adaptation for large language models
Deploying large language models (LLMs) in real-world, high-stakes settings is harder than it should be. In domains like law, medicine, and cloud incident response, performance and reliability can quickly break down because adapting models to domain-specific requirements is a slow and ma...
Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs
arXiv:2604.18587v1 Announce Type: new
Abstract: Large language models (LLMs) have demonstrated significant potential in formal theorem proving, yet state-of-the-art performance often necessitates prohibitive test-time compute via massive roll-outs or extended context windows. In this work, we addre...
Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning
arXiv:2604.18639v1 Announce Type: new
Abstract: Previous LLM-based RL studies typically follow either supervised learning with high annotation costs, or unsupervised paradigms using voting or entropy-based rewards. However, their performance remains far from satisfactory due to the substantial ann...
FASE: A Fairness-Aware Spatiotemporal Event Graph Framework for Predictive Policing
arXiv:2604.18644v1 Announce Type: new
Abstract: Predictive policing systems that allocate patrol resources based solely on predicted crime risk can unintentionally amplify racial disparities through feedback-driven data bias. We present FASE, a Fairness-Aware Spatiotemporal Event Graph framework, w...
Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training
arXiv:2604.18701v1 Announce Type: new
Abstract: Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the i...
Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations
arXiv:2604.18724v1 Announce Type: new
Abstract: Users typically interact with and evaluate language models via single outputs, but each output is just one sample from a broad distribution of possible completions. This interaction hides distributional structure such as modes, uncommon edge cases, an...
ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System
arXiv:2604.18789v1 Announce Type: new
Abstract: Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe beh...
AI scientists produce results without reasoning scientifically
arXiv:2604.18805v1 Announce Type: new
Abstract: Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we eval...