Scientists built the hardest AI test ever and the results are surprising
As AI systems began acing traditional tests, researchers realized those benchmarks were no longer tough enough. In response, nearly 1,000 experts created Humanity’s Last Exam, a massive 2,500-question challenge covering highly specialized topics across many fields. The exam was engineered so that an...
Before quantum computing arrives, this startup wants enterprises already running on it
After selling his AI startup to AMD for $665 million, Peter Sarlin is back with QuTwo, a new venture building the infrastructure it believes enterprises will need when quantum computing finally arrives.
Truecaller now lets you hang up on scammers — on behalf of your family
Caller identity platform Truecaller recently launched a new feature that lets one person become an admin of a family group, get alerts about fraud calls received by other members, and even end a call on their behalf if they suspect a family member might get scammed. The company, which has over 450 m...
Interventional Time Series Priors for Causal Foundation Models
arXiv:2603.11090v1 Announce Type: new
Abstract: Prior-data fitted networks (PFNs) have emerged as powerful foundation models for tabular causal inference, yet their extension to time series remains limited by the absence of synthetic data generators that provide interventional targets. Existing tim...
Graph Tokenization for Bridging Graphs and Transformers
arXiv:2603.11099v1 Announce Type: new
Abstract: The success of large pretrained Transformers is closely tied to tokenizers, which convert raw input into discrete symbols. Extending these models to graph-structured data remains a significant challenge. In this work, we introduce a graph tokenization...
DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
arXiv:2603.11076v1 Announce Type: new
Abstract: Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks. Scaling diversit...
A Survey of Reasoning in Autonomous Driving Systems: Open Challenges and Emerging Paradigms
arXiv:2603.11093v1 Announce Type: new
Abstract: The development of high-level autonomous driving (AD) is shifting from perception-centric limitations to a more fundamental bottleneck, namely, a deficit in robust and generalizable reasoning. Although current AD systems manage structured environments...
PACED: Distillation at the Frontier of Student Competence
arXiv:2603.11178v1 Announce Type: new
Abstract: Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not merely in...
Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios
arXiv:2603.11214v1 Announce Type: new
Abstract: We evaluate the autonomous cyber-attack capabilities of frontier AI models on two purpose-built cyber ranges, a 32-step corporate network attack and a 7-step industrial control system attack, that require chaining heterogeneous capabilities across exten...
Reversible Lifelong Model Editing via Semantic Routing-Based LoRA
arXiv:2603.11239v1 Announce Type: new
Abstract: The dynamic evolution of the real world necessitates model editing within Large Language Models. While existing methods explore modular isolation or parameter-efficient strategies, they still suffer from semantic drift or knowledge forgetting due to conti...
mAceReason-Math: A Dataset of High-Quality Multilingual Math Problems Ready For RLVR
Reinforcement Learning with Verifiable Rewards (RLVR) has been successfully applied to significantly boost the capabilities of pretrained large language models, especially in the math and logic problem domains. However, current research and available training datasets remain English-centric. While m...
Multilingual Reasoning Gym: Multilingual Scaling of Procedural Reasoning Environments
We present the Multilingual Reasoning Gym, an extension of Reasoning Gym (Stojanovski et al., 2025), that procedurally generates verifiable reasoning problems across 14 languages. We translate templates for 94 tasks with native-speaker validation in 10 languages and targeted code or template adaptat...
Defense official reveals how AI chatbots could be used for targeting decisions
The US military might use generative AI systems to rank lists of targets and make recommendations about which to strike first, which would then be vetted by humans, according to a Defense official with knowledge of the matter. The disclosure about how the military may use AI chatbots comes as the Pe...
Can AI help predict which heart-failure patients will worsen within a year?
Researchers at MIT, Mass General Brigham, and Harvard Medical School developed a deep-learning model to forecast a patient’s heart failure prognosis up to a year in advance.
Stanford Researchers Release OpenJarvis: A Local-First Framework for Building On-Device Personal AI Agents with Tools, Memory, and Learning
Stanford researchers have introduced OpenJarvis, an open-source framework for building personal AI agents that run entirely on-device. The project comes from Stanford’s Scaling Intelligence Lab and is presented as both a research platform and deployment-ready infrastructure for local-first AI system...
Facebook Marketplace now lets Meta AI respond to buyers’ messages
When buyers inquire about an item’s availability, sellers can use Meta AI to automatically draft replies using information from their listing, such as the description, availability, pickup location, and price.
Clarifai 12.2: Three-Command CLI Workflow for Model Deployment
Clarifai 12.2 introduces a three-command CLI workflow for model deployment. Initialize, test locally, and deploy to production with automatic GPU selection and infrastructure provisioning.
Meta buys Moltbook: The social network where AI agents talk to each other
Meta’s acquisition of Moltbook highlights a growing focus on agent-to-agent systems and the infrastructure required to support them. It’s a small deal that signals bigger shifts in how AI ecosystems may evolve.
Systematic debugging for AI agents: Introducing the AgentRx framework
As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency. When a human makes a mistake, we can usually trace the logic. But when an AI a...