https://arxiv.org/api/ER6v0lW5IIT6PRIHJ8p0fttP6VQ 2026-04-11T19:37:55Z 171450 360 15 http://arxiv.org/abs/2604.07276v1 Making Room for AI: Multi-GPU Molecular Dynamics with Deep Potentials in GROMACS 2026-04-08T16:40:08Z GROMACS is a de-facto standard for classical Molecular Dynamics (MD). The rise of AI-driven interatomic potentials that pursue near-quantum accuracy at MD throughput now poses a significant challenge: embedding neural-network inference into multi-GPU simulations retaining high-performance. In this work, we integrate the MLIP framework DeePMD-kit into GROMACS, enabling domain-decomposed, GPU-accelerated inference across multi-node systems. We extend the GROMACS NNPot interface with a DeePMD backend, and we introduce a domain decomposition layer decoupled from the main simulation. The inference is executed concurrently on all processes, with two MPI collectives used each step to broadcast coordinates and to aggregate and redistribute forces. We train an in-house DPA-1 model (1.6 M parameters) on a dataset of solvated protein fragments. We validate the implementation on a small protein system, then we benchmark the GROMACS-DeePMD integration with a 15,668 atom protein on NVIDIA A100 and AMD MI250x GPUs up to 32 devices. Strong-scaling efficiency reaches 66% at 16 devices and 40% at 32; weak-scaling efficiency is 80% to 16 devices and reaches 48% (MI250x) and 40% (A100) at 32 devices. Profiling with the ROCm System profiler shows that >90% of the wall time is spent in DeePMD inference, while MPI collectives contribute <10%, primarily since they act as a global synchronization point. The principal bottlenecks are the irreducible ghost-atom cost set by the cutoff radius, confirmed by a simple throughput model, and load imbalance across ranks. These results demonstrate that production MD with near ab initio fidelity is feasible at scale in GROMACS. 2026-04-08T16:40:08Z Luca Pennati Andong Hu Ivy Peng Lukas Müllender Stefano Markidis http://arxiv.org/abs/2410.22177v2 Analyzing Multimodal Interaction Strategies for LLM-Assisted Manipulation of 3D Scenes 2026-04-08T16:38:20Z As more applications of large language models (LLMs) for 3D content for immersive environments emerge, it is crucial to study user behaviour to identify interaction patterns and potential barriers to guide the future design of immersive content creation and editing systems which involve LLMs. In an empirical user study with 12 participants, we combine quantitative usage data with post-experience questionnaire feedback to reveal common interaction patterns and key barriers in LLM-assisted 3D scene editing systems. We identify opportunities for improving natural language interfaces in 3D design tools and propose design recommendations for future LLM-integrated 3D content creation systems. Through an empirical study, we demonstrate that LLM-assisted interactive systems can be used productively in immersive environments. 2024-10-29T16:15:59Z Published in the IEEE VR (IEEE Conference on Virtual Reality and 3D User Interfaces) 2025 Proceedings Junlong Chen Jens Grubert Per Ola Kristensson 10.1109/VR59515.2025.00045 http://arxiv.org/abs/2604.07274v1 A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering 2026-04-08T16:37:22Z Large language models (LLMs) have demonstrated strong capabilities in medical question answering; however, purely parametric models often suffer from knowledge gaps and limited factual grounding. Retrieval-augmented generation (RAG) addresses this limitation by integrating external knowledge retrieval into the reasoning process. Despite increasing interest in RAG-based medical systems, the impact of individual retrieval components on performance remains insufficiently understood. This study presents a systematic evaluation of retrieval-augmented medical question answering using the MedQA USMLE benchmark and a structured textbook-based knowledge corpus. We analyze the interaction between language models, embedding models, retrieval strategies, query reformulation, and cross-encoder reranking within a unified experimental framework comprising forty configurations. Results show that retrieval augmentation significantly improves zero-shot medical question answering performance. The best-performing configuration was dense retrieval with query reformulation and reranking achieved 60.49% accuracy. Domain-specialized language models were also found to better utilize retrieved medical evidence than general-purpose models. The analysis further reveals a clear tradeoff between retrieval effectiveness and computational cost, with simpler dense retrieval configurations providing strong performance while maintaining higher throughput. All experiments were conducted on a single consumer-grade GPU, demonstrating that systematic evaluation of retrieval-augmented medical QA systems can be performed under modest computational resources. 2026-04-08T16:37:22Z Nusrat Sultana Abdullah Muhammad Moosa Kazi Afzalur Rahman Sajal Chandra Banik http://arxiv.org/abs/2604.07264v1 Validated Intent Compilation for Constrained Routing in LEO Mega-Constellations 2026-04-08T16:29:25Z Operating LEO mega-constellations requires translating high-level operator intents ("reroute financial traffic away from polar links under 80 ms") into low-level routing constraints -- a task that demands both natural language understanding and network-domain expertise. We present an end-to-end system comprising three components: (1) a GNN cost-to-go router that distills Dijkstra-quality routing into a 152K-parameter graph attention network achieving 99.8% packet delivery ratio with 17x inference speedup; (2) an LLM intent compiler that converts natural language to a typed constraint intermediate representation using few-shot prompting with a verifier-feedback repair loop, achieving 98.4% compilation rate and 87.6% full semantic match on feasible intents in a 240-intent benchmark (193 feasible, 47 infeasible); and (3) an 8-pass deterministic validator with constructive feasibility certification that achieves 0% unsafe acceptance on all 47 infeasible intents (30 labeled + 17 discovered by Pass 8), with 100% corruption detection across 240 structural corruption tests and 100% on 15 targeted adversarial attacks. End-to-end evaluation across four constrained routing scenarios confirms zero constraint violations with both routers. We further demonstrate that apparent performance gaps in polar-avoidance scenarios are largely explained by topological reachability ceilings rather than routing quality, and that the LLM compiler outperforms a rule-based baseline by 46.2 percentage points on compositional intents. Our system bridges the semantic gap between operator intent and network configuration while maintaining the safety guarantees required for operational deployment. 2026-04-08T16:29:25Z 9 pages, 2 figures Yuanhang Li http://arxiv.org/abs/2604.07253v1 Designing Safe and Accountable GenAI as a Learning Companion with Women Banned from Formal Education 2026-04-08T16:14:26Z In gender-restrictive and surveilled contexts, where access to formal education may be restricted for women, pursuing education involves safety and privacy risks. When women are excluded from schools and universities, they often turn to online self-learning and generative AI (GenAI) to pursue their educational and career aspirations. However, we know little about what safe and accountable GenAI support is required in the context of surveillance, household responsibilities, and the absence of learning communities. We present a remote participatory design study with 20 women in Afghanistan, informed by a recruitment survey (n = 140), examining how participants envision GenAI for learning and employability. Participants describe using GenAI less as an information source and more as an always-available peer, mentor, and source of career guidance that helps compensate for the absence of learning communities. At the same time, they emphasize that this companionship is constrained by privacy and surveillance risks, contextually unrealistic and culturally unsafe support, and direct-answer interactions that can undermine learning by creating an illusion of progress. Beyond eliciting requirements, envisioning the future with GenAI through participatory design was positively associated with significant increases in participants' aspirations (p=.01), perceived agency (p=.01), and perceived avenues (p=.03). These outcomes show that accountable and safe GenAI is not only about harm reduction but can also actively enable women to imagine and pursue viable learning and employment futures. Building on this, we translate participants' proposals into accountability-focused design directions that center on safety-first interaction and user control, context-grounded support under constrained resources, and offer pedagogically aligned assistance that supports genuine learning rather than quick answers. 2026-04-08T16:14:26Z This work has been accepted at ACM Conference on Fairness, Accountability, and Transparency 2026 as a full paper. Please cite the peer-reviewed version Hamayoon Behmanush Freshta Akhtari Ingmar Weber Vikram Kamath Cannanure 10.1145/3805689.3806466. http://arxiv.org/abs/2604.07240v1 $k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture 2026-04-08T16:06:43Z We introduce a code-based challenge for automated, open-ended mathematical discovery based on the $k$-server conjecture, a central open problem in competitive analysis. The task is to discover a potential function satisfying a large graph-structured system of simple linear inequalities. The resulting evaluation procedure is sound but incomplete: any violated inequality definitively refutes a candidate, whereas satisfying all inequalities does not by itself constitute a proof of the corresponding conjecture's special case. Nevertheless, a candidate that passes all constraints would be strong evidence toward a valid proof and, to the best of our knowledge, no currently known potential achieves this under our formulation in the open $k=4$ circle case. As such, a successful candidate would already be an interesting contribution to the $k$-server conjecture, and could become a substantial theoretical result when paired with a full proof. Experiments on the resolved $k=3$ regime show that current agentic methods can solve nontrivial instances, and in the open $k=4$ regime they reduce the number of violations relative to existing potentials without fully resolving the task. Taken together, these results suggest that the task is challenging but plausibly within reach of current methods. Beyond its relevance to the $k$-server community, where the developed tooling enables researchers to test new hypotheses and potentially improve on the current record, the task also serves as a useful \emph{benchmark} for developing code-based discovery agents. In particular, our $k=3$ results show that it mitigates important limitations of existing open-ended code-based benchmarks, including early saturation and the weak separation between naive random baselines and more sophisticated methods. 2026-04-08T16:06:43Z Kirill Brilliantov Etienne Bamas Emmanuel Abbé http://arxiv.org/abs/2507.22025v4 UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding 2026-04-08T16:03:48Z The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from a dilemma for reasoning designs, ineffective reward, and visual noise. To address these issues, we introduce UI-AGILE for enhancing GUI agents at both training and inference. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: 1) a continuous reward function to incentivize high-precision grounding; 2) a ``Simple Thinking'' reward to balance planning with speed and grounding accuracy; and 3) a cropping-based resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present decomposed grounding with selection to dramatically improve grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves the state-of-the-art grounding performance on two benchmarks ScreenSpot-Pro and ScreenSpot-v2 while it also exhibits strong general agent capabilities. For instance, using both our training and inference enhancement methods brings 23\% grounding accuracy improvement over the best baseline on ScreenSpot-Pro. We provide the code in https://github.com/KDEGroup/UI-AGILE. 2025-07-29T17:22:07Z Shuquan Lian Yuhang Wu Jia Ma Yifan Ding Zihan Song Bingqi Chen Xiawu Zheng Hui Li Rongrong Ji http://arxiv.org/abs/2601.15474v2 BadImplant: Injection-based Multi-Targeted Graph Backdoor Attack 2026-04-08T16:01:56Z Graph neural network (GNN) have demonstrated exceptional performance in solving critical problems across diverse domains yet remain susceptible to backdoor attacks. Existing studies on backdoor attack for graph classification are limited to single target attack using subgraph replacement based mechanism where the attacker implants only one trigger into the GNN model. In this paper, we introduce the first multi-targeted backdoor attack for graph classification task, where multiple triggers simultaneously redirect predictions to different target labels. Instead of subgraph replacement, we propose subgraph injection which preserves the structure of the original graphs while poisoning the clean graphs. Extensive experiments demonstrate the efficacy of our approach, where our attack achieves high attack success rates for all target labels with minimal impact on the clean accuracy. Experimental results on five dataset demonstrate the superior performance of our attack framework compared to the conventional subgraph replacement-based attack. Our analysis on four GNN models confirms the generalization capability of our attack which is effective regardless of the GNN model architectures and training parameters settings. We further investigate the impact of the attack design parameters including injection methods, number of connections, trigger sizes, trigger edge density and poisoning ratios. Additionally, our evaluation against state-of-the-art defenses (randomized smoothing and fine-pruning) demonstrates the robustness of our proposed multi-target attacks. This work highlights the GNN vulnerability against multi-targeted backdoor attack in graph classification task. Our source codes will be available at https://github.com/SiSL-URI/Multi-Targeted-Graph-Backdoor-Attack. 2026-01-21T21:23:51Z Md Nabi Newaz Khan Abdullah Arafat Miah Yu Bi http://arxiv.org/abs/2411.18084v2 From Exploration to Revelation: Detecting Dark Patterns in Mobile Apps 2026-04-08T15:46:50Z Mobile apps are essential in daily life but frequently employ deceptive patterns, such as visual emphasis or linguistic nudging, to manipulate user behavior. Existing research largely relies on manual detection, which is time-consuming and cannot keep pace with rapidly evolving apps. Although recent work has explored automated approaches, these methods are limited to intra-page patterns, depend on manual app exploration, and lack flexibility. To address these limitations, we present AppRay, a system that integrates task-oriented app exploration with automated deceptive pattern detection to reduce manual effort, expand detection coverage, and improve performance. AppRay operates in two stages. First, it combines large language model-guided task-oriented exploration with random exploration to capture diverse user interface (UI) states. Second, it detects both intra-page and inter-page deceptive patterns using a contrastive learning-based multi-label classifier augmented with a rule-based refiner for context-aware detection. We contribute two datasets, AppRay-Tainted-UIs and AppRay-Benign-UIs, comprising 2,185 deceptive pattern instances, including 149 intra-page cases, spanning 16 types across 876 deceptive and 871 benign UIs, while preserving UI relationships. Experimental results show that AppRay achieves macro/micro averaged precision of 0.92/0.85, recall of 0.86/0.88, and F1 scores of 0.89/0.85, yielding 27.14% to 1200% improvements over prior methods and enabling effective detection of previously unexplored deceptive patterns. 2024-11-27T06:39:35Z 45 pages, 11 figures. Accepted by TOSEM2026 Jieshan Chen Zhen Wang Jiamou Sun Zhenchang Xing Qinghua Lu Qing Huang Xiwei Xu Liming Zhu http://arxiv.org/abs/2604.07223v1 TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories 2026-04-08T15:46:14Z As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks ($ρ=0.79$) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, actually improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks. 2026-04-08T15:46:14Z Yen-Shan Chen Sian-Yao Huang Cheng-Lin Yang Yun-Nung Chen http://arxiv.org/abs/2512.19576v5 LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller 2026-04-08T15:38:50Z Attitude control is essential for many satellite missions. Classical controllers, however, are time-consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive control strategies through autonomous interaction with a simulation environment. Overcoming the Sim2Real gap, which involves deploying an agent trained in simulation onto the real physical satellite, remains a significant challenge. In this work, we present the first successful in-orbit demonstration of an AI-based attitude controller for inertial pointing maneuvers. The controller was trained entirely in simulation and deployed to the InnoCube 3U nanosatellite, which was developed by the Julius-Maximilians-Universität Würzburg in cooperation with the Technische Universität Berlin, and launched in January 2025. We present the AI agent design, the methodology of the training procedure, the discrepancies between the simulation and the observed behavior of the real satellite, and a comparison of the AI-based attitude controller with the classical PD controller of InnoCube. Steady-state metrics confirm the robust performance of the AI-based controller during repeated in-orbit maneuvers. 2025-12-22T17:00:25Z Accepted for publication in IEEE Access (DOI: 10.1109/ACCESS.2026.3678816). This is the author's version which has not been fully edited and content may change prior to final publication. 20 pages, 15 figures, 18 tables. The maneuver telemetry datasets are available in the GitHub repository under https://github.com/kdjebko/lelar-in-orbit-data IEEE Access, vol. 14, pp. 49348-49367, 2026 Kirill Djebko Tom Baumann Erik Dilger Frank Puppe Sergio Montenegro 10.1109/ACCESS.2026.3678816 http://arxiv.org/abs/2602.03604v3 A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures 2026-04-08T15:27:24Z We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self-contained implementations that illustrate how representation learning techniques developed for image-level self-supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single-GPU training within a few hours, making energy-based self-supervised learning accessible for research and education. We provide ablations of JEA components on CIFAR-10. Probing these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi-step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at https://github.com/facebookresearch/eb_jepa. 2026-02-03T14:56:24Z v2: clarify confusion in definition of JEPAs vs. regularization-based JEPAs v3: Camera-ready of ICLR world models workshop, fixed formatting and ViT config / results Basile Terver Randall Balestriero Megi Dervishi David Fan Quentin Garrido Tushar Nagarajan Koustuv Sinha Wancong Zhang Mike Rabbat Yann LeCun Amir Bar http://arxiv.org/abs/2604.07191v1 Mixture Proportion Estimation and Weakly-supervised Kernel Test for Conditional Independence 2026-04-08T15:17:19Z Mixture proportion estimation (MPE) aims to estimate class priors from unlabeled data. This task is a critical component in weakly supervised learning, such as PU learning, learning with label noise, and domain adaptation. Existing MPE methods rely on the \textit{irreducibility} assumption or its variant for identifiability. In this paper, we propose novel assumptions based on conditional independence (CI) given the class label, which ensure identifiability even when irreducibility does not hold. We develop method of moments estimators under these assumptions and analyze their asymptotic properties. Furthermore, we present weakly-supervised kernel tests to validate the CI assumptions, which are of independent interest in applications such as causal discovery and fairness evaluation. Empirically, we demonstrate the improved performance of our estimators compared with existing methods and that our tests successfully control both type I and type II errors.\label{key} 2026-04-08T15:17:19Z AISTATS 2026 Yushi Hirose Akito Narahara Takafumi Kanamori http://arxiv.org/abs/2604.07190v1 The ATOM Report: Measuring the Open Language Model Ecosystem 2026-04-08T15:15:50Z We present a comprehensive adoption snapshot of the leading open language models and who is building them, focusing on the ~1.5K mainline open models from the likes of Alibaba's Qwen, DeepSeek, Meta's Llama, that are the foundation of an ecosystem crucial to researchers, entrepreneurs, and policy advisors. We document a clear trend where Chinese models overtook their counterparts built in the U.S. in the summer of 2025 and subsequently widened the gap over their western counterparts. We study a mix of Hugging Face downloads and model derivatives, inference market share, performance metrics and more to make a comprehensive picture of the ecosystem. 2026-04-08T15:15:50Z 23 pages, 17 figures Nathan Lambert Florian Brand http://arxiv.org/abs/2604.07182v1 TeaLeafVision: An Explainable and Robust Deep Learning Framework for Tea Leaf Disease Classification 2026-04-08T15:11:33Z As the worlds second most consumed beverage after water, tea is not just a cultural staple but a global economic force of profound scale and influence. More than a mere drink, it represents a quiet negotiation between nature, culture, and the human desire for a moment of reflection. So, the precise identification and detection of tea leaf disease is crucial. With this goal, we have evaluated several Convolutional Neural Networks (CNN) models, among them three shows noticeable performance including DenseNet201, MobileNetV2, InceptionV3 on the teaLeafBD dataset. teaLeafBD dataset contains seven classes, six disease classes and one healthy class, collected under various field conditions reflecting real world challenges. Among the CNN models, DenseNet201 has achieved the highest test accuracy of 99%. In order to enhance the model reliability and interpretability, we have implemented Gradient weighted Class Activation Mapping (Grad CAM), occlusion sensitivity analysis and adversarial training techniques to increase the noise resistance of the model. Finally, we have developed a prototype in order to leverage the models capabilities on real life agriculture. This paper illustrates the deep learning models capabilities to classify the disease in real life tea leaf disease detection and management. 2026-04-08T15:11:33Z Rafi Ahamed Sidratul Moon Nafsin Md Abir Rahman Tasnia Tarannum Roza Munaia Jannat Easha Abu Raihan