https://arxiv.org/api/04go67DtTsVB5RaiJKBthzYkZK0
2026-04-21T11:08:51Z
17341269015

http://arxiv.org/abs/2604.16729v1
Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis
Updated: 2026-04-17T22:04:16Z
State-of-the-art large language models (LLMs) show high performance in general visual question answering. However, a fundamental limitation remains: current architectures lack the native 3D spatial reasoning required for direct analysis of volumetric medical imaging, such as CT or MRI. Emerging agentic AI offers a new solution, eliminating the need for intrinsic 3D processing by enabling LLMs to orchestrate and leverage specialized external tools. Yet, the feasibility of such agentic frameworks in complex, multi-step radiological workflows remains underexplored. In this work, we present a training-free agentic pipeline for automated brain MRI analysis. Validating our methodology on several LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5) with off-the-shelf domain-specific tools, our system autonomously executes complex end-to-end workflows, including preprocessing (skull stripping, registration), pathology segmentation (glioma, meningioma, metastases), and volumetric analysis. We evaluate our framework across increasingly complex radiological tasks, from single-scan segmentation and volumetric reporting to longitudinal response assessment requiring multi-timepoint comparisons. We analyze the impact of architectural design by comparing single-agent models against multi-agent "domain-expert" collaborations. Finally, to support rigorous evaluation of future agentic systems, we introduce and release a benchmark dataset of image-prompt-answer tuples derived from public BraTS data. Our results demonstrate that agentic AI can solve highly complex neuro-radiological image analysis tasks through tool use without the need for training or fine-tuning.
Published: 2026-04-17T22:04:16Z
Authors: Ayhan Can Erdur, Daniel Scholz, Jiazhen Pan, Benedikt Wiestler, Daniel Rueckert, Jan C. Peeken

http://arxiv.org/abs/2604.16723v1
Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training
Updated: 2026-04-17T21:54:15Z
Large Language Models (LLMs) have demonstrated potential in automating scientific ideation, yet current approaches relying on iterative prompting or complex multi-agent architectures often suffer from hallucination or computational inefficiency. A critical bottleneck in applying Reinforcement Learning (RL) to this open-ended domain is reward hacking -- where models exploit imperfect evaluation proxies to maximize scores without producing genuine scientific innovation. To address these limitations, we propose an RL framework explicitly tailored for high-quality scientific idea generation. We propose the first multi-agent reward function designed to serve as a judge, decoupling methodological validation from implementation details while providing strict binary rewards that are robust to reward hacking. To effectively optimize against this sparse signal, we utilize an unbiased variant of Group Relative Policy Optimization to mitigate artificial length bias. We grounded our training in ICLR-320, a curated dataset of problem-solution pairs extracted from ICLR 2024 proceedings. Experiments demonstrate that our framework significantly outperforms state-of-the-art baselines across expert-evaluated metrics of novelty, feasibility, and effectiveness.
Published: 2026-04-17T21:54:15Z
Authors: Moein Salimi, Babak Hosseini Mohtasham, Amin Aghakasiri, Mahdi Naieni, Amir Hossein Qeysarbeigi, Mohammad Masih Shalchian Nazer, Zahra Azar, Mahdi Jafari Siavoshani, Mohammad Hossein Rohban

http://arxiv.org/abs/2604.16721v1
Late Fusion Neural Operators for Extrapolation Across Parameter Space in Partial Differential Equations
Updated: 2026-04-17T21:52:37Z
Developing neural operators that accurately predict the behavior of systems governed by partial differential equations (PDEs) across unseen parameter regimes is crucial for robust generalization in scientific and engineering applications.
In practical applications, variations in physical parameters induce distribution shifts between training and prediction regimes, making extrapolation a central challenge. As a result, the way parameters are incorporated into neural operator models plays a key role in their ability to generalize, particularly when state and parameter representations are entangled. In this work, we introduce the Late Fusion Neural Operator, an architecture that disentangles learning state dynamics from parameter effects, improving predictive performance both within and beyond the training distribution. Our approach combines neural operators for learning latent state representations with sparse regression to incorporate parameter information in a structured manner. Across four benchmark PDEs including advection, Burgers, and both 1D and 2D reaction-diffusion equations, the proposed method consistently outperforms Fourier Neural Operator and CAPE-FNO. Late Fusion Neural Operators consistently achieve the best performance in all experiments, with an average RMSE reduction of 72.9% in-domain and 71.8% out-domain compared to the second-best method. These results demonstrate strong generalization across both in-domain and out-domain parameter regimes.
Published: 2026-04-17T21:52:37Z
Authors: Eva van Tegelen, Taniya Kapoor, George A. K. van Voorn, Peter van Heijster, Ioannis N. Athanasiadis

http://arxiv.org/abs/2512.10211v2
ID-PaS+: Identity-Aware Predict-and-Search for General Mixed-Integer Linear Programs
Updated: 2026-04-17T21:33:34Z
Mixed-Integer Linear Programs (MIPs) are powerful and flexible tools for modeling a wide range of real-world combinatorial optimization problems. Predict-and-Search methods operate by using a predictive model to estimate promising variable assignments and then guiding a search procedure toward high-quality solutions. Recent research has demonstrated that incorporating machine learning (ML) into the Predict-and-Search framework significantly enhances its performance.
Still, it is restricted to binary-only problems and overlooks the presence of fixed variable structures that commonly arise in real-world settings. This work extends the current Predict-and-Search (PAS) framework to parametric general MIPs and introduces ID-PAS+, an identity-aware learning framework that enables the ML model to handle heterogeneous variable types more effectively. Experiments on several real-world large-scale problems demonstrate that ID-PAS+ consistently achieves superior performance compared to the state-of-the-art solver Gurobi and PAS.
Published: 2025-12-11T01:58:28Z
Authors: Junyang Cai, El Mehdi Er Raqabi, Pascal Van Hentenryck, Bistra Dilkina

http://arxiv.org/abs/2604.16715v1
Scalable and Adaptive Parallel Training of Graph Transformer on Large Graphs
Updated: 2026-04-17T21:29:35Z
Graph foundation models have demonstrated remarkable adaptability across diverse downstream tasks through large-scale pretraining on graphs. However, existing implementations of the backbone model, graph transformers, are typically limited to single-GPU systems, leading to long training times or out-of-memory issues on large graphs. Moreover, parallelizing graph transformer training over the full graph is challenging, as efficiency depends heavily on both the graph structure and system characteristics, such as bandwidth and memory capacity.
In this work, we introduce a distributed training framework for graph transformers, which automatically selects and optimizes parallelization strategies based on the graph structure and hardware configuration. With our implementation of distributed sparse operations, we accelerate sparse graph attention by up to 3.8x and reduce memory consumption by 78% compared to state-of-the-art frameworks. On large graph benchmarks, our proposed framework achieves up to 6x speedup with system scaling up to 8 GPUs. These results demonstrate that the proposed framework improves the scalability of graph transformers, bringing them closer to serving as practical graph foundation models.
Published: 2026-04-17T21:29:35Z
Comment: Accepted to the 63rd ACM/IEEE Design Automation Conference (DAC 2026)
Authors: Jun-Liang Lin, Kamesh Madduri, Mahmut Taylan Kandemir

http://arxiv.org/abs/2604.16706v1
Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
Updated: 2026-04-17T21:15:35Z
Automated evaluation of tool-using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp-Bench, a 2,000-task benchmark with 2,300 traces across four domains, nine production LLMs, and a 100-label human-validated subset. We quantify judge reliability, characterize error propagation, and evaluate a runtime mitigation. Substring-based judging agrees with human annotation at kappa=0.049 (chance-level); a three-LLM ensemble reaches kappa=0.432 (moderate) with a conservative bias. Under validated evaluation, a parameter-level injection propagates to a wrong final answer with human-calibrated probability approximately 0.62 (range 0.46-0.73 across models). Rejection (catching bad parameters) and recovery (correcting after acceptance) are independent model capabilities (Spearman rho=0.126, p=0.747).
A tuned runtime interceptor reduces hallucination on GPT-4o-mini by 23.0 percentage points under a concurrent n=600 control, but shows no significant effect on Gemini-2.0-Flash, whose aggressive parameter rejection eliminates the target failure mode. All code, data, traces, and human labels are released at https://github.com/bhaskargurram-ai/agenthallu-bench.
Published: 2026-04-17T21:15:35Z
Comment: 9 pages, 5 figures, 12 tables (8 main + 4 supplementary). Under review at Information Processing & Management. Code and data: https://github.com/bhaskargurram-ai/agenthallu-bench
Authors: Bhaskar Gurram

http://arxiv.org/abs/2604.16704v1
The impact of postediting on AI generative translation in Yemeni context: Translating literary prose by ChatGPT
Updated: 2026-04-17T21:12:56Z
This study examines the role of artificial intelligence in translation, focusing on ChatGPT, specifically ChatGPT-4, and the extent to which human postediting is required in literary translation. A mixed-method approach was adopted, involving 30 professional translators who evaluated and postedited AI-generated translations of selected Arabic and English literary texts. The results show that although AI improves translation speed and accessibility, it remains limited in handling cultural, stylistic, and figurative aspects of language. Participants generally confirmed the necessity of human postediting, particularly in novels and drama. The findings indicate an emerging human-machine collaboration model rather than replacement of human translators. The study concludes that AI should be used as a supportive tool, while human expertise remains essential for ensuring translation quality and cultural appropriateness.
Published: 2026-04-17T21:12:56Z
Comment: 20 pages, 4 Tables
Authors: Nasim Al-wagieh (Ibb University), Mohammed Q. Shormani (Ibb University)

http://arxiv.org/abs/2412.02617v2
Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback
Updated: 2026-04-17T21:00:45Z
Large text-to-video models hold immense potential for a wide range of downstream applications.
However, they struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. In this work, we investigate the use of feedback to enhance the quality of object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively overcome movement misalignment and yield realistic object interactions? We first point out that offline RL-finetuning algorithms for text-to-video models can be shown to be equivalent when derived from a unified probabilistic objective. This perspective highlights that no method is algorithmically dominant in principle; rather, what matters is the properties of the reward and the data. While human feedback is less scalable, vision-language models could perceive video scenes as humans do. We then propose leveraging vision-language models to provide perceptual feedback specifically tailored to object dynamics in videos. Compared to popular video quality metrics measuring alignment or dynamics, the experiments demonstrate that our approach with binary AI feedback drives the most significant improvements in the quality of interaction scenes in video, as confirmed by AI, human, and quality metric evaluations.
Notably, we observe substantial gains when using signals from vision-language models, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.
Published: 2024-12-03T17:44:23Z
Comment: Website: https://sites.google.com/view/aif-dynamic-t2v/
Authors: Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, Sherry Yang

http://arxiv.org/abs/2604.16696v1
LOD-Net: Locality-Aware 3D Object Detection Using Multi-Scale Transformer Network
Updated: 2026-04-17T20:52:20Z
3D object detection in point cloud data remains a challenging task due to the sparsity and lack of global structure inherent in the input. In this work, we propose a novel Multi-Scale Attention (MSA) mechanism integrated into the 3DETR architecture to better capture both local geometry and global context. Our method introduces an upsampling operation that generates high-resolution feature maps, enabling the network to better detect smaller and semantically related objects. Experiments conducted on the ScanNetv2 dataset demonstrate that our 3DETR + MSA model improves detection performance, achieving a gain of almost 1% in mAP@25 and 4.78% in mAP@50 over the baseline. While applying MSA to the 3DETR-m variant shows limited improvement, our analysis reveals the importance of adapting the upsampling strategy for lightweight models. These results highlight the effectiveness of combining hierarchical feature extraction with attention mechanisms in enhancing 3D scene understanding.
Published: 2026-04-17T20:52:20Z
Authors: Mustaqeem Khan, Aidana Nurakhmetova, Wail Gueaieb, Abdulmotaleb El Saddik

http://arxiv.org/abs/2604.16694v1
RankGuide: Tensor-Rank-Guided Routing and Steering for Efficient Reasoning
Updated: 2026-04-17T20:51:04Z
Large reasoning models (LRMs) enhance problem-solving capabilities by generating explicit multi-step chains of thought (CoT) reasoning; however, they incur substantial inference latency and computational overhead.
To mitigate this issue, recent works have explored model collaboration paradigms, where small reasoning models (SRMs) generate intermediate reasoning steps to achieve a better accuracy-latency trade-off. Despite recent progress, effectively and efficiently detecting and mitigating SRM failures in collaborative systems remains a key challenge. To address this issue, we analyze SRM inference in both the generated text and hidden-state spaces, and identify three types of failure modes: overconfidence, uncertainty, and heavy revalidation. Building on these insights, we propose RankGuide, a framework that improves the efficiency and effectiveness of SRM-LRM collaboration through tensor-rank-guided routing and steering. Specifically, RankGuide leverages a routing signal that incorporates tensor-rank signals derived from consecutive hidden states to detect when SRMs are likely to fail and selectively invoke LRMs. In addition, we introduce a tensor-rank-filtered steering vector extraction method to modulate the reasoning trajectory of SRMs, thereby improving their generation quality. By improving both routing and steering through tensor-rank signals, RankGuide enables SRM-LRM collaborative systems to achieve more efficient reasoning with fewer steps and improved accuracy. Experiments on multiple reasoning benchmarks demonstrate the efficacy of RankGuide in reducing latency by up to 1.75x compared to LRM, while maintaining competitive accuracy relative to prior methods.
Published: 2026-04-17T20:51:04Z
Authors: Jiayi Tian, Yupeng Su, Ryan Solgi, Souvik Kundu, Zheng Zhang

http://arxiv.org/abs/2604.16689v1
The Query Channel: Information-Theoretic Limits of Masking-Based Explanations
Updated: 2026-04-17T20:46:25Z
Masking-based post-hoc explanation methods, such as KernelSHAP and LIME, estimate local feature importance by querying a black-box model under randomized perturbations.
This paper formulates this procedure as communication over a query channel, where the latent explanation acts as a message and each masked evaluation is a channel use. Within this framework, the complexity of the explanation is captured by the entropy of the hypothesis class, while the query interface supplies information at a rate determined by an identification capacity per query. We derive a strong converse showing that, if the explanation rate exceeds this capacity, the probability of error in exact recovery necessarily converges to one for any sequence of explainers and decoders. We also prove an achievability result establishing that a sparse maximum-likelihood decoder attains reliable recovery when the rate lies below capacity. A Monte Carlo estimator of mutual information yields a non-asymptotic query benchmark that we use to compare optimal decoding with Lasso- and OLS-based procedures that mirror LIME and KernelSHAP. Experiments reveal a range of query budgets where information theory permits reliable explanations but standard convex surrogates still fail. Finally, we interpret super-pixel resolution and tokenization for neural language models as a source-coding choice that sets the entropy of the explanation and show how Gaussian noise and nonlinear curvature degrade the query channel, induce waterfall and error-floor behavior, and render high-resolution explanations unattainable.
Published: 2026-04-17T20:46:25Z
Authors: Erciyes Karakaya, Ozgur Ercetin

http://arxiv.org/abs/2604.16687v1
Agentic Risk-Aware Set-Based Engineering Design
Updated: 2026-04-17T20:45:35Z
This paper introduces a multi-agent framework guided by Large Language Models (LLMs) to assist in the early stages of engineering design, a phase often characterized by vast parameter spaces and inherent uncertainty.
Operating under a human-in-the-loop paradigm and demonstrated on the canonical problem of aerodynamic airfoil design, the framework employs a team of specialized agents: a Coding Assistant, a Design Agent, a Systems Engineering Agent, and an Analyst Agent, all coordinated by a human Manager. Integrated within a set-based design philosophy, the process begins with a collaborative phase where the Manager and Coding Assistant develop a suite of validated tools, after which the agents execute a structured workflow to systematically explore and prune a large set of initial design candidates. A key contribution of this work is the explicit integration of formal risk management, employing the Conditional Value-at-Risk (CVaR) as a quantitative metric to filter designs that exhibit a high probability of failing to meet performance requirements, specifically the target coefficient of lift. The framework automates labor-intensive initial exploration through a global sensitivity analysis conducted by the Analyst Agent, which generates actionable heuristics to guide the other agents. The process culminates by presenting the human Manager with a curated final set of promising design candidates, augmented with high-fidelity Computational Fluid Dynamics (CFD) simulations. This approach effectively leverages AI to handle high-volume analytical tasks, thereby enhancing the decision-making capability of the human expert in selecting the final, risk-assessed design.
Published: 2026-04-17T20:45:35Z
Authors: Varun Kumar, George Em Karniadakis

http://arxiv.org/abs/2604.16686v1
No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation
Updated: 2026-04-17T20:45:02Z
Large language models (LLMs) can answer questions and summarize documents when conditioned on external contexts (e.g., retrieved evidence), yet context use remains unreliable: models may overwrite an already-correct output (neutral regression) even when the context is non-informative.
We formalize neutral regression as a do-no-harm requirement and quantify it by measuring accuracy drops on baseline-correct items under answer-consistent contexts. We propose No-Worse Context-Aware Decoding (NWCAD), a decode-time adapter built on a two-stream setup with a two-stage gate: it backs off to no-context decoding when the context is non-informative, and otherwise uses context-conditioned decoding with a CAD-style fallback under uncertainty. We evaluate NWCAD on benchmarks that separate do-no-harm reliability from context utilization (accuracy gains on genuinely helpful contexts). NWCAD prevents neutral regression on baseline-correct items while preserving strong context-driven accuracy on helpful contexts.
Published: 2026-04-17T20:45:02Z
Comment: Findings at ACL 2026
Authors: Yufei Tao, Ameeta Agrawal

http://arxiv.org/abs/2604.16685v1
Graph Transformer-Based Pathway Embedding for Cancer Prognosis
Updated: 2026-04-17T20:41:58Z
Accurate prediction of cancer progression remains a challenge due to the high heterogeneity of molecular omics data across patients. While biologically informed models have improved the interpretability of these predictions, a persistent limitation lies in how they encode individual genes to construct pathway representations. Existing hierarchical models typically derive gene features by directly mapping raw molecular inputs, whereas integration frameworks often rely on simple statistical aggregations of patient-level signals. These approaches often fail to explicitly learn a shared base representation for each gene, thereby limiting the expressiveness and biological accuracy of downstream pathway embeddings. To address this, we introduce PATH, a modulation-based, patient-conditioned gene embedding strategy. PATH represents a paradigm shift by starting from a shared base embedding for each gene, preserving a stable biological identity across the population, and then dynamically adapting it using patient-specific copy number variation (CNV) and mutation signals.
This allows the model to capture subtle individual molecular variations while maintaining a consistent latent understanding of the gene itself. We integrate PATH into a graph transformer framework that models interactions among biologically connected pathways through pathway-guided attention. Across pancancer metastasis prediction, PATH achieves an F1 score of 0.8766, representing an 8.8 percent improvement over the current SOTA multi-omics benchmarks. Beyond superior predictive accuracy, our approach identifies biologically meaningful pathways and, crucially, reveals disease-state-specific pathway rewiring, offering new insights into the evolving pathway-pathway interactions that drive cancer progression.
Published: 2026-04-17T20:41:58Z
Comment: 25 pages, 5 figures
Authors: Koushik Howlader, Md Tauhidul Islam, Wei Le

http://arxiv.org/abs/2604.16683v1
Rewind-IL: Online Failure Detection and State Respawning for Imitation Learning
Updated: 2026-04-17T20:41:14Z
Imitation learning has enabled robots to acquire complex visuomotor manipulation skills from demonstrations, but deployment failures remain a major obstacle, especially for long-horizon action-chunked policies. Once execution drifts off the demonstration manifold, these policies often continue producing locally plausible actions without recovering from the failure. Existing runtime monitors either require failure data, over-trigger under benign feature drift, or stop at failure detection without providing a recovery mechanism. We present Rewind-IL, a training-free online safeguard framework for generative action-chunked imitation policies. Rewind-IL combines a zero-shot failure detector based on Temporal Inter-chunk Discrepancy Estimate (TIDE), calibrated with split conformal prediction, with a state-respawning mechanism that returns the robot to a semantically verified safe intermediate state.
Offline, a vision-language model identifies recovery checkpoints in demonstrations, and the frozen policy encoder is used to construct a compact checkpoint feature database. Online, Rewind-IL monitors self-consistency in overlapping action chunks, tracks similarity to the checkpoint library, and, upon failure, rewinds execution to the latest verified safe state before restarting inference from a clean policy state. Experiments on real-world and simulated long-horizon manipulation tasks, including transfer to flow-matching action-chunked policies, demonstrate that policy-internal consistency coupled with semantically grounded respawning offers a practical route to improved reliability in imitation learning. Supplemental materials are available at https://sjay05.github.io/rewind-il
Published: 2026-04-17T20:41:14Z
Comment: 9 pages, 8 figures, 6 tables. Project page at https://sjay05.github.io/rewind-il
Authors: Gehan Zheng, Sanjay Seenivasan, Matthew Johnson-Roberson, Weiming Zhi