https://arxiv.org/api/OfCkJjHK3YSYveYN8ZmZkNJtOBM2026-04-11T14:34:30Z17145031515http://arxiv.org/abs/2604.07530v1The Shrinking Lifespan of LLMs in Science2026-04-08T19:12:09ZScaling laws describe how language model capabilities grow with compute and data, but say nothing about how long a model matters once released. We provide the first large-scale empirical account of how scientists adopt and abandon language models over time. We track 62 LLMs across over 108k citing papers (2018-2025), each with at least three years of post-release data, and classify every citation as active adoption or background reference to construct per-model adoption trajectories that raw citation counts cannot resolve. We find three regularities. First, scientific adoption follows an inverted-U trajectory: usage rises after release, peaks, and declines as newer models appear, a pattern we term the \textit{scientific adoption curve}. Second, this curve is compressing: each additional release year is associated with a 27\% reduction in time-to-peak adoption ($p < 0.001$), robust to minimum-age thresholds and controls for model size. Third, release timing dominates model-level attributes as a predictor of lifecycle dynamics. Release year explains both time-to-peak and scientific lifespan more strongly than architecture, openness, or scale, though model size and access modality retain modest predictive power for total adoption volume. Together, these findings complement scaling laws with adoption-side regularities and suggest that the forces driving rapid capability progress may be the same forces compressing scientific relevance.2026-04-08T19:12:09ZAna Trišovićhttp://arxiv.org/abs/2604.05333v2Graph of Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills2026-04-08T19:04:13ZSkill usage has become a core component of modern agent systems and can substantially improve agents' ability to complete complex tasks. 
In real-world settings, where agents must monitor and interact with numerous personal applications, web browsers, and other environment interfaces, skill libraries can scale to thousands of reusable skills. Scaling to larger skill sets introduces two key challenges. First, loading the full skill set saturates the context window, driving up token costs, hallucination, and latency. In this paper, we present Graph of Skills (GoS), an inference-time structural retrieval layer for large skill libraries. GoS constructs an executable skill graph offline from skill packages, then at inference time retrieves a bounded, dependency-aware skill bundle through hybrid semantic-lexical seeding, reverse-weighted Personalized PageRank, and context-budgeted hydration. On SkillsBench and ALFWorld, GoS improves average reward by 43.6% over the vanilla full skill-loading baseline while reducing input tokens by 37.8%, and generalizes across three model families: Claude Sonnet, GPT-5.2 Codex, and MiniMax. Additional ablation studies across skill libraries ranging from 200 to 2,000 skills further demonstrate that GoS consistently outperforms both vanilla skills loading and simple vector retrieval in balancing reward, token efficiency, and runtime.2026-04-07T02:09:11Z13 pages of main text, 13 pages of appendix. Core contribution by Dawei Liu and Zongxia Li. Project page: https://github.com/davidliuk/graph-of-skillsDawei LiuZongxia LiHongyang DuXiyang WuShihang GuiYongbei KuangLichao Sunhttp://arxiv.org/abs/1912.08786v2Why we need an AI-resilient society2026-04-08T18:59:51ZThree generations of software have transformed the role of artificial intelligence in society. In the first, programmers wrote explicit logic; in the second, neural networks learned programs from data; in the third, large language models turn natural language itself into a programming interface. 
These shifts have consequences that reach far beyond computer science, reshaping how societies generate knowledge, make decisions, and govern themselves. While generative adversarial networks introduced the era of deepfakes and synthetic media, large language models have added an entirely new class of systemic risks. This report applies a forensic-psychology profiling methodology to characterize AI based on nine documented features: hallucinations, bias and toxicity, sycophancy and echo chambers, fabrication and credulity, knowledge without understanding, discontinuity and the inability to learn from experience, jagged intelligence and scaling limits, shortcuts and fractured representations, and cognitive atrophy. The resulting profile reveals an "entity" that confabulates fluently, mirrors its users' biases, possesses encyclopedic recall without causal understanding, and erodes the competence of those who depend on it. The implications extend to institutional erosion across law, academia, journalism, and democratic governance. To address these challenges, this report proposes a three-pillar framework for AI resilience: cognitive sovereignty, which preserves the capacity for independent judgment; measurable control, which translates ethical commitments into enforceable standards and red lines; and partial autonomy, which maintains human agency at critical decision points. This report is an updated and extended version of arXiv:1912.08786v1.2019-12-18T18:36:20ZFor associated TEDx video, see https://youtu.be/f6c2ngp7rqYThomas Bartz-Beielsteinhttp://arxiv.org/abs/2503.01870v2Transforming the Voice of the Customer: Large Language Models for Identifying Customer Needs2026-04-08T18:51:54ZIdentifying customer needs (CNs) is fundamental to product innovation and marketing strategy. Yet for over thirty years, Voice-of-the-Customer (VOC) applications have relied on professional analysts to manually interpret qualitative data and formulate "jobs to be done." 
This task is cognitively demanding, time-consuming, and difficult to scale. While current practice uses machine learning to screen content, the critical final step of precisely formulating CNs relies on expert human judgment. We conduct a series of studies with market research professionals to evaluate whether Large Language Models (LLMs) can automate CN abstraction. Across various product and service categories, we demonstrate that supervised fine-tuned (SFT) LLMs perform at least as well as professional analysts and substantially better than foundational LLMs. These results generalize to alternative foundational LLMs and require relatively "small" models. The abstracted CNs are well-formulated, sufficiently specific to guide innovation, and grounded in source content without hallucination. Our analysis suggests that SFT training enables LLMs to learn the underlying syntactic and semantic conventions of professional CN formulation rather than relying on memorized CNs. Automation of tedious tasks transforms the VOC approach by enabling the discovery of high-leverage insights at scale and by refocusing analysts on higher-value-added tasks.2025-02-25T21:55:35ZArtem TimoshenkoChengfeng MaoJohn R. Hauserhttp://arxiv.org/abs/2604.07513v1SYN-DIGITS: A Synthetic Control Framework for Calibrated Digital Twin Simulation2026-04-08T18:50:01ZAI-based persona simulation -- often referred to as digital twin simulation -- is increasingly used for market research, recommender systems, and social sciences. Despite their flexibility, large language models (LLMs) often exhibit systematic bias and miscalibration relative to real human behavior, limiting their reliability. 
Inspired by synthetic control methods from causal inference, we propose SYN-DIGITS (SYNthetic Control Framework for Calibrated DIGItal Twin Simulation), a principled and lightweight calibration framework that learns latent structure from digital-twin responses and transfers it to align predictions with human ground truth. SYN-DIGITS operates as a post-processing layer on top of any LLM-based simulator and thus is model-agnostic. We develop a latent factor model that formalizes when and why calibration succeeds through latent space alignment conditions, and we systematically evaluate ten calibration methods across thirteen persona constructions, three LLMs, and two datasets. SYN-DIGITS supports both individual-level and distributional simulation for previously unseen questions and unobserved populations, with provable error guarantees. Experiments show that SYN-DIGITS achieves up to 50% relative improvements in individual-level correlation and 50--90% relative reductions in distributional discrepancy compared to uncalibrated baselines.2026-04-08T18:50:01ZGrace Jiarui FanChengpiao HuangTianyi PengKaizheng WangYuhang Wuhttp://arxiv.org/abs/2604.07512v1Rhizome OS-1: Rhizome's Semi-Autonomous Operating System for Small Molecule Drug Discovery2026-04-08T18:49:08ZWe introduce a semi-autonomous discovery system in which multi-modal AI agents function as a multi-disciplinary discovery team, acting as computational chemists, medicinal chemists, and patent agents, writing and executing analysis code, visually evaluating molecular candidates, assessing patentability, and adapting generation strategy from empirical screening feedback, while r1, a 246M-parameter Graph Neural Network (GNN) trained on 800M molecules, generates novel chemical matter directly on molecular graphs. Agents executed two campaigns in oncology (BCL6, EZH2), formulating medicinal chemistry hypotheses across three strategy tiers and generating libraries of 2,355-2,876 novel molecules per target. 
Across both targets, 91.9% of generated Murcko scaffolds are absent from ChEMBL for their respective targets, with Tanimoto distances of 0.56-0.69 to the nearest known active, confirming that the engine produces structurally distinct chemical matter rather than recapitulating known compounds. Binding affinity predictions using Boltz-2 were calibrated against ChEMBL experimental data, achieving Spearman correlations of -0.53 to -0.64 and ROC AUC values of 0.88 to 0.93. These results demonstrate that semi-autonomous agent systems, equipped with graph-native generative tools and physics-informed scoring, provide a foundation for a modern operating system for small molecule discovery. We show that Rhizome OS-1 enables a new paradigm for early-stage drug discovery by supporting scaled, rapid, and adaptive inverse design.2026-04-08T18:49:08ZYiwen WangGregory SinenkaXhuliano Bracehttp://arxiv.org/abs/2604.07506v1ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework2026-04-08T18:46:12ZReward Models (RMs) are critical components in the Reinforcement Learning from Human Feedback (RLHF) pipeline, directly determining the alignment quality of Large Language Models (LLMs). Recently, Generative Reward Models (GRMs) have emerged as a superior paradigm, offering higher interpretability and stronger generalization than traditional scalar RMs. However, existing methods for GRMs focus primarily on outcome-level supervision, neglecting analytical process quality, which constrains their potential. To address this, we propose ReflectRM, a novel GRM that leverages self-reflection to assess analytical quality and enhance preference modeling. ReflectRM is trained under a unified generative framework for joint modeling of response preference and analysis preference. During inference, we use its self-reflection capability to identify the most reliable analysis, from which the final preference prediction is derived. 
Experiments across four benchmarks show that ReflectRM consistently improves performance, achieving an average accuracy gain of +3.7 on Qwen3-4B. Further experiments confirm that response preference and analysis preference are mutually reinforcing. Notably, ReflectRM substantially mitigates positional bias, yielding a +10.2 improvement compared with leading GRMs and establishing itself as a more stable evaluator.2026-04-08T18:46:12ZPreprintKai QinLiangxin LiuYu LiangLongzheng WangYan WangYueyang ZhangLong XiaZhiyuan SunHoude LiuDaiting Shihttp://arxiv.org/abs/2210.01881v2Tractable Uncertainty-Aware Meta-Learning2026-04-08T18:43:06ZMeta-learning is a popular approach for learning new tasks with limited data by leveraging the commonalities among different tasks. However, meta-learned models can perform poorly when context data is too limited, or when data is drawn from an out-of-distribution (OoD) task. Especially in safety-critical settings, this necessitates an uncertainty-aware approach to meta-learning. In addition, the often multimodal nature of task distributions can pose unique challenges to meta-learning methods. To this end, we present LUMA, a meta-learning method for regression that (1) makes probabilistic predictions on in-distribution tasks efficiently, (2) is capable of detecting OoD context data, and (3) handles heterogeneous, multimodal task distributions effectively. The strength of our framework lies in its solid theoretical basis, enabling analytically tractable Bayesian inference on a linearized model for principled uncertainty estimation and robust generalization. We achieve this by adopting a probabilistic perspective and learning a parametric, tunable task distribution via Bayesian inference on a linearized neural network, leveraging Gaussian process theory. Moreover, we make our approach computationally tractable by leveraging a low-rank prior covariance learning scheme based on the Fisher Information Matrix. 
Our numerical analysis demonstrates that LUMA quickly adapts to new tasks and remains accurate even in low-data regimes; it effectively detects OoD tasks; and that both of these properties continue to hold for multimodal task distributions.2022-10-04T20:02:25ZYoung-Jin ParkCesar AlmecijaApoorva SharmaNavid Azizanhttp://arxiv.org/abs/2604.07502v1Beyond Human-Readable: Rethinking Software Engineering Conventions for the Agentic Development Era2026-04-08T18:38:49ZFor six decades, software engineering principles have been optimized for a single consumer: the human developer. The rise of agentic AI development, where LLM-based agents autonomously read, write, navigate, and debug codebases, introduces a new primary consumer with fundamentally different constraints. This paper presents a systematic analysis of human-centric conventions under agentic pressure and proposes a key design principle: semantic density optimization, eliminating tokens that carry zero information while preserving tokens that carry high semantic value. We validate this principle through a controlled experiment on log format token economy across four conditions (human-readable, structured, compressed, and tool-assisted compressed), demonstrating a counterintuitive finding: aggressive compression increased total session cost by 67% despite reducing input tokens by 17%, because it shifted interpretive burden to the model's reasoning phase. 
We extend this principle to propose the rehabilitation of classical anti-patterns, introduce the program skeleton concept for agentic code navigation, and argue for a fundamental decoupling of semantic intent from human-readable representation.2026-04-08T18:38:49ZDmytro Ustynovhttp://arxiv.org/abs/2604.07494v1Triage: Routing Software Engineering Tasks to Cost-Effective LLM Tiers via Code Quality Signals2026-04-08T18:34:44ZContext: AI coding agents route every task to a single frontier large language model (LLM), paying premium inference cost even when many tasks are routine.
Objectives: We propose Triage, a framework that uses code health metrics -- indicators of software maintainability -- as a routing signal to assign each task to the cheapest model tier whose output passes the same verification gate as the expensive model.
Methods: Triage defines three capability tiers (light, standard, heavy -- mirroring, e.g., Haiku, Sonnet, Opus) and routes tasks based on pre-computed code health sub-factors and task metadata. We design an evaluation comparing three routing policies on SWE-bench Lite (300 tasks across three model tiers): heuristic thresholds, a trained ML classifier, and a perfect-hindsight oracle.
Results: We analytically derived two falsifiable conditions under which the tier-dependent asymmetry (medium LLMs benefit from clean code while frontier models do not) yields cost-effective routing: the light-tier pass rate on healthy code must exceed the inter-tier cost ratio, and code health must discriminate the required model tier with at least a small effect size ($\hat{p} \geq 0.56$).
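The two conditions stated in the Results can be written as a simple check. The sketch below is only an illustration, not the authors' implementation: the class and field names are hypothetical, and the inter-tier cost ratio is assumed here to mean light-tier cost divided by heavy-tier cost, which the abstract does not spell out. Only the 0.56 threshold comes from the text.

```python
from dataclasses import dataclass

# Threshold for the effect size, as quoted in the abstract.
P_HAT_THRESHOLD = 0.56


@dataclass(frozen=True)
class RoutingInputs:
    """Hypothetical container for the quantities named in the abstract."""
    light_pass_rate_healthy: float  # light-tier pass rate on healthy code, in [0, 1]
    cost_ratio: float               # ASSUMPTION: light-tier cost / heavy-tier cost
    p_hat: float                    # effect size: how well code health separates tiers


def routing_is_cost_effective(x: RoutingInputs) -> bool:
    """Return True only when both stated conditions hold."""
    # Condition 1: the light-tier pass rate must exceed the inter-tier cost ratio.
    pass_rate_ok = x.light_pass_rate_healthy > x.cost_ratio
    # Condition 2: code health must discriminate the required tier (p_hat >= 0.56).
    effect_size_ok = x.p_hat >= P_HAT_THRESHOLD
    return pass_rate_ok and effect_size_ok


if __name__ == "__main__":
    example = RoutingInputs(light_pass_rate_healthy=0.6, cost_ratio=0.2, p_hat=0.60)
    print(routing_is_cost_effective(example))
```

Because both conditions are falsifiable, either one failing on measured data is enough to reject cost-effective routing under this reading.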
Conclusion: Triage transforms a diagnostic code quality metric into an actionable model-selection signal. We present a rigorous evaluation protocol to test the cost--quality trade-off and identify which code health sub-factors drive routing decisions.2026-04-08T18:34:44Z5 pages, 1 figureLech Madeyskihttp://arxiv.org/abs/2604.07492v1Cluster Attention for Graph Machine Learning2026-04-08T18:33:29ZMessage Passing Neural Networks have recently become the most popular approach to graph machine learning tasks; however, their receptive field is limited by the number of message passing layers. To increase the receptive field, Graph Transformers with global attention have been proposed; however, global attention does not take into account the graph topology and thus lacks graph-structure-based inductive biases, which are typically very important for graph machine learning tasks. In this work, we propose an alternative approach: cluster attention (CLATT). We divide graph nodes into clusters with off-the-shelf graph community detection algorithms and let each node attend to all other nodes in each cluster. CLATT provides large receptive fields while still having strong graph-structure-based inductive biases. We show that augmenting Message Passing Neural Networks or Graph Transformers with CLATT significantly improves their performance on a wide range of graph datasets including datasets from the recently introduced GraphLand benchmark representing real-world applications of graph machine learning.2026-04-08T18:33:29ZOleg PlatonovLiudmila Prokhorenkovahttp://arxiv.org/abs/2604.07490v1Enabling Intrinsic Reasoning over Dense Geospatial Embeddings with DFR-Gemma2026-04-08T18:31:38ZRepresentation learning for geospatial and spatio-temporal data plays a critical role in enabling general-purpose geospatial intelligence. 
Recent geospatial foundation models, such as the Population Dynamics Foundation Model (PDFM), encode complex population and mobility dynamics into compact embeddings. However, their integration with Large Language Models (LLMs) remains limited. Existing approaches to LLM integration treat these embeddings as retrieval indices or convert them into textual descriptions for reasoning, introducing redundancy, token inefficiency, and numerical inaccuracies. We propose Direct Feature Reasoning-Gemma (DFR-Gemma), a novel framework that enables LLMs to reason directly over dense geospatial embeddings. DFR aligns high-dimensional embeddings with the latent space of an LLM via a lightweight projector, allowing embeddings to be injected as semantic tokens alongside natural language instructions. This design eliminates the need for intermediate textual representations and enables intrinsic reasoning over spatial features. To evaluate this paradigm, we introduce a multi-task geospatial benchmark that pairs embeddings with diverse question-answer tasks, including feature querying, comparison, and semantic description. Experimental results show that DFR allows LLMs to decode latent spatial patterns and perform accurate zero-shot reasoning across tasks, while significantly improving efficiency compared to text-based baselines. Our results demonstrate that treating embeddings as primary data inputs provides a more direct, efficient, and scalable approach to multimodal geospatial intelligence.2026-04-08T18:31:38ZXuechen ZhangAviv SlobodkinJoydeep PaulMandar SharmaSamet OymakShravya ShettyGautam Prasadhttp://arxiv.org/abs/2602.10527v2AI-PACE: A Framework for Integrating AI into Medical Education2026-04-08T18:29:33ZThe integration of artificial intelligence (AI) into healthcare is accelerating, yet medical education has not kept pace with these technological advancements. 
This paper synthesizes current knowledge on AI in medical education through a comprehensive analysis of the literature, identifying key competencies, curricular approaches, and implementation strategies. The aim is to highlight the critical need for structured AI education across the medical learning continuum and to offer a framework for curriculum development. The findings presented suggest that effective AI education requires longitudinal integration throughout medical training, interdisciplinary collaboration, and balanced attention to both technical fundamentals and clinical applications. This paper serves as a foundation for medical educators seeking to prepare future physicians for an AI-enhanced healthcare environment.2026-02-11T04:52:26ZVersion 2: Revisions after round 1 of peer review. Paper under consideration at npj Digital Medicine. 12 pages, 2 figures, 2 tablesScott P. McGrathKatherine K. KimKarnjit JohlHaibo WangNick Andersonhttp://arxiv.org/abs/2601.04068v3Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models2026-04-08T18:29:12ZAligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that efficiently generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. 
Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.2026-01-07T16:32:17ZCVPR 2026Zitong HuangKaidong ZhangYukang DingChao GaoRui DingYing ChenWangmeng Zuohttp://arxiv.org/abs/2604.07487v1CLEAR: Context Augmentation from Contrastive Learning of Experience via Agentic Reflection2026-04-08T18:26:59ZLarge language model agents rely on effective model context to obtain task-relevant information for decision-making. Many existing context engineering approaches rely primarily on context generated from past experience and on retrieval mechanisms that reuse this context. However, retrieved context from past tasks must be adapted by the execution agent to fit new situations, placing additional reasoning burden on the underlying LLM. To address this limitation, we propose a generative context augmentation framework using Contrastive Learning of Experience via Agentic Reflection (CLEAR). CLEAR first employs a reflection agent to perform contrastive analysis over past execution trajectories and summarize useful context for each observed task. These summaries are then used as supervised fine-tuning data to train a context augmentation model (CAM). We then further optimize CAM using reinforcement learning, where the reward signal is obtained by running the task execution agent. 
By learning to generate task-specific knowledge rather than retrieve knowledge from the past, CAM produces context that is better tailored to the current task. We conduct comprehensive evaluations on the AppWorld and WebShop benchmarks. Experimental results show that CLEAR consistently outperforms strong baselines. It improves task completion rate from 72.62% to 81.15% on the AppWorld test set and average reward from 0.68 to 0.74 on a subset of WebShop, compared with the baseline agent. Our code is publicly available at https://github.com/awslabs/CLEAR.2026-04-08T18:26:59ZLinbo LiuGuande WuHan DingYawei WangQiang ZhouYuzhe LuZhichao XuHuan SongPanpan XuLin Lee Cheong
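
The CLEAR figures quoted above are given as before/after pairs. As a small worked example, the sketch below derives the absolute and relative gains from those four numbers only. The function names are illustrative and the derived percentages are not reported in the abstract itself.

```python
def absolute_gain(baseline: float, improved: float) -> float:
    """Difference between the improved and baseline values."""
    return improved - baseline


def relative_gain(baseline: float, improved: float) -> float:
    """Gain expressed as a fraction of the baseline value."""
    return (improved - baseline) / baseline


# Before/after values as quoted in the abstract.
APPWORLD_BASELINE, APPWORLD_CLEAR = 72.62, 81.15   # task completion rate, percent
WEBSHOP_BASELINE, WEBSHOP_CLEAR = 0.68, 0.74       # average reward

if __name__ == "__main__":
    print(f"AppWorld: +{absolute_gain(APPWORLD_BASELINE, APPWORLD_CLEAR):.2f} points, "
          f"{relative_gain(APPWORLD_BASELINE, APPWORLD_CLEAR):.1%} relative")
    print(f"WebShop:  +{absolute_gain(WEBSHOP_BASELINE, WEBSHOP_CLEAR):.2f} reward, "
          f"{relative_gain(WEBSHOP_BASELINE, WEBSHOP_CLEAR):.1%} relative")
```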