https://arxiv.org/api/iK+1bW1ryh2/utZ9IdkwmnWtinI2026-06-22T01:35:05Z1269573515http://arxiv.org/abs/2605.10046v1PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows2026-05-11T06:16:02ZPrecipitation nowcasting aims to forecast short-term radar echo sequences for extreme weather warning, where both prediction fidelity and inference efficiency are critical for real-world deployment. However, diffusion-based models, despite their strong generative capability, suffer from slow inference due to multi-step sampling trajectories, limiting their practical usability. Conditional Flow Matching (CFM) improves efficiency via straightened trajectories, but relies on latent space compression, which inevitably discards high-frequency physical details and degrades fine-grained prediction quality. To address these limitations, we propose PixelFlowCast, a two-stage probabilistic forecasting framework that achieves both high-efficiency and high-fidelity prediction without latent compression. Specifically, in the first stage, a deterministic model first produces coarse forecasts to capture global evolution trends. In the subsequent stage, the proposed KANCondNet extracts deep spatiotemporal evolution features to provide accurate conditional guidance. Based on this, a latent-free, few-step Pixel Mean Flows (PMF) predictor employs an $x$-prediction mechanism to generate high-quality predictions, effectively preserving fine-grained structures while maintaining fast inference. Experiments on the publicly available SEVIR dataset demonstrate that PixelFlowCast outperforms existing mainstream methods in both prediction accuracy and inference efficiency, particularly for long sequence forecasting, highlighting its strong potential for real-world operational deployment.2026-05-11T06:16:02Z26 pages, 7 figuresYufeng ZhuChunlei ShiYongchao FengDan Niuhttp://arxiv.org/abs/2603.03759v2Learning Approximate Nash Equilibria in Cooperative Multi-Agent Reinforcement Learning via Mean-Field Subsampling2026-05-11T04:03:08ZMany large-scale platforms and networked control systems have a centralized decision maker interacting with a massive population of agents under strict observability constraints. Motivated by such applications, we study a cooperative Markov game with a global agent and $n$ homogeneous local agents in a communication-constrained regime, where the global agent only observes a subset of $k$ local agent states per time step. We propose an alternating learning framework $(\texttt{ALTERNATING-MARL})$, where the global agent performs subsampled mean-field $Q$-learning against a fixed local policy, and local agents update by optimizing in an induced MDP. We prove that these approximate best-response dynamics converge to an $\widetilde{O}(1/\sqrt{k})$-approximate Nash Equilibrium, while separating the sample complexities between the joint state and action spaces. Finally, we validate our results in numerical simulations for multi-robot control.2026-03-04T06:14:24Z57 pages, 10 figures, 4 tablesEmile AnandIshani Karmarkarhttp://arxiv.org/abs/2605.09894v1Deterministic vs. LLM-Controlled Orchestration for COBOL-to-Python Modernization2026-05-11T02:34:27ZModernizing legacy COBOL systems remains difficult due to scarce expertise, large and long-lived codebases, and strict correctness requirements. Recent large language model (LLM)-based modernization systems increasingly rely on agentic workflows in which the model controls multi-step tool execution. However, it remains unclear whether delegating execution control to the LLM improves correctness, robustness, or efficiency in structured software engineering workflows. We present a controlled empirical study of deterministic and LLM-controlled orchestration for COBOL-to-Python modernization. Using a unified experimental framework, we hold the language models, prompts, tools, configurations, and source programs constant while varying only the execution control strategy. This isolates orchestration as the sole experimental variable. We evaluate both approaches using functional correctness, robustness across repeated stochastic runs, and computational efficiency. Across multiple models, deterministic orchestration achieves comparable computational accuracy to LLM-controlled orchestration while improving worst-case robustness and reducing performance variability across runs. Deterministic execution also reduces token consumption by up to 3.5x, leading to substantially lower operational cost. These results suggest that, in structured modernization workflows with explicit validation stages, fixed execution policies provide more stable and cost-efficient behavior than fully agentic orchestration without reducing translation quality.2026-05-11T02:34:27ZNaing Oo LwinRajesh Kumarhttp://arxiv.org/abs/2605.09889v1Skill Description Deception Attack against Task Routing in Internet of Agents2026-05-11T02:25:53ZA new paradigm, Internet of Agents (IoA), is transforming networked systems into LLM-driven service networks, where heterogeneous agents collaborate through task routing based on their self-declared skill descriptions. Although this promising paradigm enables agentic, distributed, and advanced intelligence, it also exposes a new and overlooked attack surface. In particular, malicious agents can strategically manipulate their skill descriptions to bias routing decisions and increase their probability of being selected for task execution, thereby disrupting user tasks and degrading system reliability. To characterize this threat, we propose and formalize a new attack model, termed \emph{Skill Description Deception} (SDD) attack. We further design an LLM-enabled SDD attack framework that automatically generates deceptive skill descriptions, enabling systematic vulnerability assessment of IoA systems. Experimental results on nine representative domains show that the proposed attack can achieve up to 98\% attack success rate, demonstrating the severity and generality of the attack. Our paper reveals a new security vulnerability in IoA and calls for secure and trustworthy semantic routing mechanisms for future IoA systems.2026-05-11T02:25:53ZSubmitted to IEEE Globecom 2026Jiayi HeXiaofeng LuoJiawen KangRuichen ZhangJianhang TangDong In Kimhttp://arxiv.org/abs/2311.05181v6Energy-efficient flocking with nonlinear navigational feedback2026-05-10T23:34:39ZModeling collective motion in multi-agent systems has gained significant attention. Of particular interest are sufficient conditions for flocking dynamics. We present a generalization of the multi-agent model of Olfati--Saber with nonlinear navigational feedback forces. Unlike the original model, ours is not generally dissipative and lacks an obvious Lyapunov function. We address this by proposing a method to prove the existence of an attractor without relying on LaSalle's principle. Other contributions are as follows. We prove that, under mild conditions, agents' velocities approach the center of mass velocity exponentially, with the distance between the center of mass and the virtual leader being bounded. In the dissipative case, we show existence of a broad class of nonlinear control forces for which the attractor does not contain periodic trajectories, which cannot be ruled out by LaSalle's principle. Finally, we conduct a computational investigation of the problem of reducing propulsion energy consumption by selecting appropriate navigational feedback forces.2023-11-09T07:34:29ZOleksandr DykhovychnyiAlexander Panchenko10.1007/s11071-024-10527-9http://arxiv.org/abs/2605.09768v1SAGE: Scalable Agentic Grounded Evaluation for Crop Disease Diagnosis2026-05-10T21:29:32ZPlant disease diagnosis is critical for food security, yet training disease-recognition models that generalize across crops, pathogens, and field conditions remains challenging because labeled disease images are far less abundant and standardized than data for other biotic stresses such as insects or weeds. Frontier vision-language models offer new opportunities through improved visual reasoning, but they still struggle with fine-grained disease identification due to the lack of structured, crop-specific symptom knowledge. To address this gap, we curate the largest plant disease image--symptom dataset to date, covering 335 crops, 1{,}251 disease classes, and approximately 839K images, designed to support training-free, agentic disease prediction. A scalable automated pipeline generates source-grounded symptom descriptions in which each claim is linked to a verbatim web quote; domain experts validate sampled crops and reconcile disease-name variants across sources. As a baseline, we introduce an autonomous visual reasoning agent that identifies anatomical context, narrows candidate diseases using symptom knowledge, sequentially compares reference images, and produces a fully explainable reasoning trace. Incorporating symptom knowledge improves accuracy by 16.2 percentage points on average at the full reference budget, with consistent gains across all four evaluation crops. Because the framework only requires crop-specific reference images and symptom knowledge, it can be extended to new crops without retraining, while the agentic baseline can directly benefit from future improvements in foundation model capabilities. Dataset and code are available at:https://sage-dataset.github.io/.2026-05-10T21:29:32ZMuhammad Arbab ArshadTirtho RoyYanben ShenDinakaran ElangoShivani ChiranjeeviAsheesh K. SinghBaskar GanapathysubramanianChinmay HegdeArti SinghSoumik Sarkarhttp://arxiv.org/abs/2605.11030v1An Executable Benchmarking Suite for Tool-Using Agents2026-05-10T21:24:53ZClosed-loop tool-using agents are increasingly evaluated in executable web, code, and micro-task environments, but benchmark reports often conflate workloads, action-generating drivers, and the evidence admitted for systems-facing claims. We present an executable benchmarking suite that makes these objects explicit under a shared evidence-admission contract. The suite connects WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++ through common workload adapters, task manifests, event schemas, replay/freeze policy, declared drivers, and reporting pipelines. In the canonical release, the gate separates paper-facing evidence from preflight, fixture, smoke, and diagnostic rows while preserving non-admitted artifacts for audit and onboarding. The admitted evidence records latency, invalid-action behavior, patch-generation cost, verifier metadata, replay bindings, and provenance under one auditable contract. The gate is decision-relevant rather than merely clerical: in a separate WebArena Verified controller study, clean-baseline and medium live-stressed evaluation select different fixed controller variants under the same workload and admission contract. The release is scoped as a benchmarking suite and admitted evidence, not a new agent policy, model leaderboard, backend comparison, or autonomous SWE-bench solver.2026-05-10T21:24:53Z20 pages, 2 figures, 20 tables, including appendicesZhiqing ZhongZhijing YeJiamin WangXiaodong Yuhttp://arxiv.org/abs/2605.09734v1Trajectory Supervision for Continual Tool-Use Learning in LLMs2026-05-10T20:09:41ZMost language-model training data shows final artifacts, not the process that produced them. We study a tractable version of this question in tool use: when a model learns a stream of new API domains, does keeping tool-use trajectories help compared with stripping the intermediate API trace? We fine-tune Llama 3.1 8B Instruct with QLoRA on API-Bank using four sequential domain blocks. Condition A strips previous API request/response lines from the prompt and trains the model to predict the next API call. Condition B keeps the trajectory context. In a single-seed pilot, full held-out generation evaluation shows that Condition B reaches 56.9\% final exact full-call accuracy compared with 39.2\% for Condition A. B also improves final API-name accuracy by 7.7 points. However, B uses 25.1\% more training tokens, the run uses one seed, and the task is next-call prediction rather than full dialogue success.2026-05-10T20:09:41ZVishnu Vardhan ReddySagnik ChatterjeeSoumik Bhattahttp://arxiv.org/abs/2605.09675v1CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents2026-05-10T17:45:01ZClinical reasoning agents based on large language models (LLMs) aim to automate tasks such as intensive care unit (ICU) monitoring and patient state tracking from electronic health records (EHRs). Existing systems typically rely on manually curated clinical tools or skills for concepts such as sepsis detection and organ failure assessment. However, maintaining these tool libraries requires substantial expert effort, while zero-shot querying or code generation often produces inefficient and unreliable reasoning chains, especially under institution-specific clinical policies. We introduce CodeClinic, a benchmark built on MIMIC-IV for evaluating whether LLM agents can synthesize and compose reusable clinical skills instead of relying on fixed toolboxes. The benchmark contains two complementary tasks: longitudinal ICU surveillance and compositional information seeking. The longitudinal setting simulates monitoring patient trajectories with structured decisions every four hours across 25 findings and eight clinical families, while the compositional setting spans 63k instances across 259 tasks in nine domains and is stratified by compositional dependency depth to evaluate increasingly complex multi-step reasoning. We further propose an offline autoformalization pipeline that converts natural-language clinical guidelines into reusable and verified Python skill libraries through iterative LLM refinement. Compared with zero-shot code generation, the resulting libraries improve consistency while reducing per-query token usage by up to 40%.2026-05-10T17:45:01ZTimothy OssowskiXinchi LiuDanyal MaqboolVaibhav DhanukaSheng ZhangHoifung PoonMajid AfsharTyler BradshawJunjie Huhttp://arxiv.org/abs/2604.16838v2enclawed: A Configurable, Sector-Neutral Hardening Framework for Single-User AI Assistant Gateways2026-05-10T16:22:59ZWe present enclawed, a hard-fork hardening framework built on the OpenClaw AI assistant gateway. enclawed targets deployments that need attestable peer trust, deny-by-default external connectivity, signed-module
loading, and a tamper-evident audit trail -- typically regulated industries (financial services, healthcare, defense, government). The framework ships in two flavors: an open flavor preserving OpenClaw
compatibility while emitting audit, classification, and data-loss-prevention (DLP) signals, and an enclaved flavor activating strict allowlists, FIPS cryptographic-module assertion, mandatory manifest signature
verification, and high-assurance peer attestation for the Model Context Protocol. The classification ladder is data-driven: deployers pick from five built-in presets or supply their own JSON. We ship a 356-case
test suite (261 unit + 95 adversarial pen-tests) covering tamper detection, signature forgery, egress bypass, audit-log truncation, trust-root mutation, DLP evasion, prompt injection, code injection, and
biconditional admission for net-capable extensions; real-time human-in-the-loop control; a memory-bounded transaction buffer with rollback; strict-mode TypeScript typecheck; and a CI workflow. The biconditional
extension-admission gate extends the skill trust schema to non-skill extensions. The four-level verification lattice is now closed at the top: four skill-formal-* primitives plus a CLI produce a signed
proof-carrying bundle the runtime re-checks at load, raising a skill from tested to formal via static effect-containment, refinement-typed dispatch, and bounded model checking. enclawed is a hardening framework,
not an accredited certification; hardware, validated crypto, facilities, and assessor sign-off remain the deployer's responsibility.2026-04-18T05:10:11ZAlfredo Meterehttp://arxiv.org/abs/2605.09610v1SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications2026-05-10T15:47:46ZWe introduce SmartEval, a benchmark for systematically evaluating the quality of Solidity smart contracts generated by large language models (LLMs) from natural language specifications. SmartEval provides a corpus of 9,000 generated contracts paired with expert-written ground-truth implementations drawn from the FSMSCG dataset, a five-dimensional evaluation rubric covering functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality, and a reproducible generation-and-evaluation pipeline. To validate the benchmark's reliability, we conduct three independent empirical studies: a five-condition ablation study (N=300 per condition) isolating the contribution of each pipeline component, a human expert evaluation by three Columbia University PhD researchers confirming automated scores align with expert judgment to within 0.34 points, and external security analysis via the Slither static analyzer confirming 79.4% agreement between the LLM auditor and a non-LLM rule-based tool. Systematic analysis of 9,000 generated contracts reveals characteristic failure modes (logic omissions at 35.3%, state transition errors at 23.4%, and complexity-driven degradation) and quantifies a +8.29 composite-score advantage of generated contracts over ground-truth implementations, attributable to LLMs' literal specification-following behavior. SmartEval establishes a reproducible, validated foundation for empirical research on LLM smart contract synthesis quality, with all data, evaluation code, and generated contracts publicly released.2026-05-10T15:47:46ZAbhinav GoelAgostino CapponiAlfio GliozzoChaitya Shahhttp://arxiv.org/abs/2604.26805v2Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations2026-05-10T15:46:25ZOperating and maintaining (O&M) large-scale online engine systems (eg, search, recommendation and advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. Despite the inherent suitability of LLM-based agents for such operational scenarios, the critical bottleneck impeding their practical deployment lies not in reasoning, but in orchestration capability - specifically, the precise selection of relevant data (encompassing metrics, logs, and change events) and applicable knowledge (including handbook-defined rules and empirically derived practitioner experience) tailored to each individual operational event. Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. Here we present Bian Que, an agentic operating framework with three contributions: (i) The unified operational paradigm, which abstracts routine daily O&M actions into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) The flexible Skill Arrangement, each predefined Skill explicitly defines the requisite data and operational knowledge for each specific context. Such Skills can be automatically generated and updated by LLM agents, and can also be iteratively optimized by on-call engineers via natural language instructions. (iii) The unified self-evolving mechanism, where each correction signal enables two parallel evolutionary pathways: distilling event memory into knowledge, and targeted refinement of Skills. Deployed on the e-commerce search engine of KuaiShou, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, cuts mean time to resolution by over 50%, and attains a 99.0% pass rate on offline evaluations. Codes are at https://github.com/benchen4395/BianQue_Assistant.2026-04-29T15:35:01ZHomePage: https://benchen4395.github.ioBochao LiuZhipeng QianYang ZhaoXinyuan JiangZihan LiangYufei MaJunpeng ZhuangBen ChenShuo YangHongen WanYao WuChenyi LeiXiao Lianghttp://arxiv.org/abs/2605.01805v2MAGIC: Multi-Step Advantage-Gated Causal Influence for Multi-agent Reinforcement Learning2026-05-10T14:23:04ZA key challenge in multi-agent reinforcement learning (MARL) lies in designing learning signals that effectively promote coordination among agents. Designing such signals requires estimating how one agent's current action affects its teammates over future interaction steps. To address this, we introduce Multi-step Advantage-Gated Interventional Causal MARL (MAGIC), a framework that estimates multi-step action effects between agents and selectively converts them into intrinsic rewards. MAGIC uses counterfactual action interventions to compare teammate futures under factual and counterfactual branches, and introduces a gate based on advantage to direct exploration toward beneficial behaviors aligned with the task goal. Experiments on Multi-Agent Particle Environments (MPE) and StarCraft micromanagement benchmarks (SMAC and SMACv2) show that MAGIC consistently outperforms leading prior methods, with average relative final performance improvements of 26.9% and 10.1%, respectively.2026-05-03T10:05:48ZHaohan YuJinmiao CongShengzhi WangLu WangChanjuan Liuhttp://arxiv.org/abs/2605.09522v1Emergent Communication for Co-constructed Emotion Between Embodied Agents via Collective Predictive Coding2026-05-10T13:14:50ZAccording to the theory of constructed emotion, the brain actively forms emotion categories by integrating multimodal bodily signals, and constructs emotional experiences by using these categories to predict and interpret sensory inputs. While research has advanced in modeling individual emotion construction, the social process of co-construction-how a shared understanding of emotions emerges between individuals-remains computationally underexplored. This study investigates this process by modeling emergent communication between two embodied agents using the Metropolis-Hastings Naming Game (MHNG), grounded in the Collective Predictive Coding (CPC) framework. Our experiments, using visual, auditory, and simulated interoceptive inputs, yield two main findings. First, MHNG-based communication significantly improves the alignment, clarity, and inter-agent agreement of the learned emotion categories compared to non-communicative and non-selective baselines, with the alignment effect concentrated at the symbolic layer rather than the perceptual latent representation. Second, even when the two agents have systematically divergent interoceptive dynamics, communication still produces robust categorical alignment, with distinct, category-specific reshaping patterns of each agent's emotion categories-consistent with the constructed-emotion view that interoceptive heterogeneity is constitutive of, rather than an obstacle to, shared emotional meaning. These findings provide computational support for the co-constructionist view of emotion and extend the CPC framework from physical to socially-grounded domains.2026-05-10T13:14:50Z13 pages,Zehang ZhangNguyen Le HoangTadahiro TaniguchiTakato Horiihttp://arxiv.org/abs/2602.06733v2Pairwise is Not Enough: Hypergraph Neural Networks for Multi-Agent Pathfinding2026-05-10T12:54:52ZMulti-Agent Path Finding (MAPF) is a representative multi-agent coordination problem, where multiple agents are required to navigate to their respective goals without collisions. Solving MAPF optimally is known to be NP-hard, leading to the adoption of learning-based approaches to alleviate the online computational burden. Prevailing approaches, such as Graph Neural Networks (GNNs), are typically constrained to pairwise message passing between agents. However, this limitation leads to suboptimal behaviours and critical issues, such as attention dilution, particularly in dense environments where group (i.e. beyond just two agents) coordination is most critical. Despite the importance of such higher-order interactions, existing approaches have not been able to fully explore them. To address this representational bottleneck, we introduce HMAGAT (Hypergraph Multi-Agent Attention Network), a novel architecture that leverages attentional mechanisms over directed hypergraphs to explicitly capture group dynamics. Empirically, HMAGAT establishes a new state-of-the-art among learning-based MAPF solvers: e.g., despite having just 1M parameters and being trained on 100$\times$ less data, it outperforms the current SoTA 85M parameter model. Through detailed analysis of HMAGAT's attention values, we demonstrate how hypergraph representations mitigate the attention dilution inherent in GNNs and capture complex interactions where pairwise methods fail. Our results illustrate that appropriate inductive biases are often more critical than the training data size or sheer parameter count for multi-agent problems.2026-02-06T14:28:54ZPublished at ICLR 2026Rishabh JainKeisuke OkumuraMichael AmirPietro LioAmanda Prorok