PathoSage: Towards Multi-Source Evidence Adjudication in Pathology via Experience-Aware Agentic Workflow
arXiv:2606.07549v1 Announce Type: new
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) and agent workflows have shown strong promise for computational pathology, yet reliable patch-level reasoning remains challenging. End-to-end pathology MLLMs often hallucinate morphological f...
MedicalRec: Medical recommender system for image classification without retraining
arXiv:2606.07553v1 Announce Type: new
Abstract: The emergence of machine learning and deep learning has revolutionized the efficiency of diagnostic, therapeutic, and administrative systems in healthcare. However, this rapid adoption has come at the cost of requiring significant computing power and ...
SPIN: Decentralized Swarm Control via Tensorized Policy Coordination
arXiv:2606.07557v1 Announce Type: new
Abstract: Decentralized multi-agent swarm coordination on resource-constrained edge platforms remains fundamentally bottlenecked by the exponential scaling of joint action spaces and high-latency communication overhead. This paper introduces the Swarm Policy In...
Emergence via Phase Transitions: Mechanism Landscapes and Universal Convergence Across Complex Systems
arXiv:2606.07563v1 Announce Type: new
Abstract: Across machine learning, biology, and physics, independently evolving systems often converge toward strikingly similar high-level structures despite radically different microscopic details. Grokking circuits converge across random seeds, evolutionary ...
Boundary Variance Inflation Causes Acquisition Bias in Gaussian Processes
arXiv:2606.07561v1 Announce Type: new
Abstract: Gaussian processes with stationary kernels on bounded domains exhibit inflated posterior variance near the boundary. Despite being a long-recognized artifact in geostatistics and a source of over-exploration in Bayesian optimization, the causes and ef...
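The boundary effect named in the first sentence of this abstract can be reproduced in a toy setting. The sketch below is not the paper's analysis: the squared-exponential kernel, the lengthscale of 0.2, the five design points on [0, 1], and all function names are assumptions chosen for illustration. It compares the posterior variance at the boundary point 0.0 with that at the interior point 0.4, both of which lie 0.1 from the nearest observation.

```python
import math

def rbf(a, b, lengthscale=0.2):
    # Stationary squared-exponential kernel (assumed for illustration).
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def posterior_variance(x_star, train_x, noise=1e-6):
    # GP posterior variance: k(x*, x*) - k*^T (K + noise * I)^{-1} k*.
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(train_x)]
         for i, a in enumerate(train_x)]
    k_star = [rbf(x_star, a) for a in train_x]
    alpha = solve(K, k_star)
    return rbf(x_star, x_star) - sum(k * a for k, a in zip(k_star, alpha))

# Hypothetical design: five evenly spaced observations inside [0, 1].
train_x = [0.1, 0.3, 0.5, 0.7, 0.9]

var_boundary = posterior_variance(0.0, train_x)  # data on one side only
var_interior = posterior_variance(0.4, train_x)  # data on both sides

print(f"boundary variance: {var_boundary:.4f}")
print(f"interior variance: {var_interior:.4f}")
```

In this toy setup the boundary point has observations on only one side, so its posterior variance comes out larger than the interior point's even though both are equally far from their nearest observation. A variance-driven acquisition function would therefore tend to favor the boundary.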
arXiv:2606.06518v1 Announce Type: new
Abstract: Sudoku is a representative constraint satisfaction problem that requires global structural reasoning under strict discrete constraints. Existing work on solving Sudoku mainly focuses on two dominant approaches, i.e., traditional heuristic and deep ...
Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory
arXiv:2606.06523v1 Announce Type: new
Abstract: Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal methods for specifyi...
CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions
arXiv:2606.06526v1 Announce Type: new
Abstract: Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-p...
SafeGene: Reusable Adapters for Transferable Safety Alignment
arXiv:2606.06519v1 Announce Type: new
Abstract: Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful. This create...
Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios
arXiv:2606.06546v1 Announce Type: new
Abstract: Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly to long-tail pedagogi...
FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models
arXiv:2606.06547v1 Announce Type: new
Abstract: Diffusion Large Language Models (dLLMs) refine tokens iteratively but commit them irreversibly, leading to a "stability lag" where early decisions remain fragile even after being written. We reveal that Post-Training Quantization (PTQ) error easily fl...
MacArena: Benchmarking Computer Use Agents on an Online macOS Environment
arXiv:2606.06560v1 Announce Type: new
Abstract: Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld, which serve both as e...
Temporal Preference Concepts and their Functions in a Large Language Model
arXiv:2606.05194v1 Announce Type: new
Abstract: Large Language Models (LLMs) are increasingly being deployed to make decisions that require trading off near-term gains against long-term consequences, yet little is known about how they internally represent or resolve these tradeoffs. In this work, w...
The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models
arXiv:2606.05169v1 Announce Type: new
Abstract: We give a stereological theory of LLM benchmark coverage. For any suite with effective dimensionality d_eff, the visible Hausdorff distance between two convex capability profiles consistent with the same scores is bounded by epsilon + C R m^(-1/(d_eff...
ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models
arXiv:2606.05170v1 Announce Type: new
Abstract: At matched accuracy, open-weight LLMs differ substantially in the shape of their error severity distribution -- a difference invisible to the scalar error rate. Hallucination benchmarks report a single error count and treat all errors as equivalent, y...
Staged Factorial Screening for Budget-Constrained Micro-Pretraining
arXiv:2606.05186v1 Announce Type: new
Abstract: Budget-constrained micro-pretraining often requires triaging many candidate recipes on a shared accelerator before larger search budgets are spent. We study whether a staged fractional-factorial workflow can recover stable early effect structure in th...
What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems
arXiv:2606.05304v1 Announce Type: new
Abstract: Multi-agent systems (MAS) built on large language models are typically organized around roles, pipelines, and turn schedules, while the content that agents pass to one another is often left as unconstrained natural language. However, this free-form co...
Uncertainty Aware Functional Behavior Prediction and Material Fatigue Assessment for Circular Factory
arXiv:2606.05334v1 Announce Type: new
Abstract: Returned products in circular factories re-enter production with heterogeneous degradation states, usage histories, and remaining capability. Reuse cannot be decided from the current inspection alone, because future function fulfillment and component ...
I Know What You Meme, Even If it Emerged Today: Understanding Evolving Memes through Open-World Knowledge Acquisition
arXiv:2606.05316v1 Announce Type: new
Abstract: Multimodal memes are dynamic and often require up-to-date background knowledge for interpretation. Existing methods often overlook such knowledge or rely on fixed parametric knowledge of pretrained models that may be incomplete, outdated, or unavailab...
How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment
arXiv:2606.05256v1 Announce Type: new
Abstract: This study analyzes a publicly released dataset from a discontinued field experiment on Reddit's r/ChangeMyView. The intervention, conducted by unknown, external researchers and halted following ethical backlash, involved undisclosed AI-generated acco...
GITCO: Gated Inference-Time Context Optimization in TSFMs
arXiv:2606.05332v1 Announce Type: new
Abstract: Patch-based Time Series Foundation Models (TSFMs) suffer from context poisoning: structurally anomalous patches capture disproportionate attention and silently degrade zero-shot forecast quality. We propose improving TSFM accuracy at inference time by...
Stumbling Into AI Emotional Dependence: How Routine AI Interactions Reshape Human Connection
arXiv:2606.04150v1 Announce Type: new
Abstract: Public discourse and emerging policy typically assume that AI emotional support is a deliberate act: a lonely user consciously seeking comfort from a dedicated companion chatbot. In this paper, we draw on emerging empirical evidence and argue that thi...
Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification
arXiv:2606.04037v1 Announce Type: new
Abstract: Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prom...
Position: Deployed Reinforcement Learning should be Continual
arXiv:2606.04029v1 Announce Type: new
Abstract: Reinforcement Learning (RL) has received increasing attention and adoption in real-world use cases. Most of these systems follow a train-then-fix paradigm, where trained agents do not learn while interacting with the world until performance degrades a...