http://arxiv.org/abs/2504.06254v3Fixing Non-blocking Data Structures for Better Compatibility with Memory Reclamation Schemes2026-03-09T17:06:19ZWe present a new technique, Safe Concurrent Optimistic Traversals (SCOT), to address a well-known problem related to optimistic traversals with classical and more recent safe memory reclamation (SMR) schemes, such as Hazard Pointers (HP), Hazard Eras (HE), Interval-Based Reclamation (IBR), and Hyaline. Unlike Epoch-Based Reclamation (EBR), these (robust) schemes protect against stalled threads but lack support for well-known data structures with optimistic traversals, e.g., Harris' list and the Natarajan-Mittal tree. Such schemes are either incompatible with them or need changes with performance trade-offs (e.g., the Harris-Michael list).
SCOT keeps existing SMR schemes intact and retains performance benefits of original data structures. We implement and evaluate SCOT with Harris' list and the Natarajan-Mittal tree, but it is also applicable to other data structures. Furthermore, we provide a simple modification for wait-free traversals. We observe similar performance speedups (e.g., Harris vs. Harris-Michael lists) that were previously available only to EBR users. Our version of the tree also achieves very high throughput, comparable to that of EBR, which is often treated as a practical upper bound.2025-04-08T17:57:14ZProceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP 2026)Md Amit Hasan AroviRuslan Nikolaev10.1145/3774934.3786455http://arxiv.org/abs/2507.20201v2Silent Self-Stabilising Leader Election in Programmable Matter Systems with Holes2026-03-09T16:50:35ZLeader election is a fundamental problem in distributed computing, particularly within programmable matter systems, where coordination among simple computational entities is crucial for solving complex tasks. In these systems, particles (i.e., constant-memory computational entities) operate in a regular triangular grid as described in the geometric Amoebot model. While leader election has been extensively studied in non self-stabilising settings, self-stabilising solutions remain more limited. In this work, we study the problem of self-stabilising leader election in connected (but not necessarily simply connected) configurations. We present the first self-stabilising algorithm for connected programmable matter systems that guarantees the election of a unique leader under an unfair scheduler, for oblivious particles (i.e., particles with no persistent memory) that share a common sense of direction. Our approach leverages particle movement, a capability not previously exploited in the self-stabilising context. 
We show that movement in conjunction with particles sharing a sense of orientation and operating in a grid can overcome classical impossibility results for constant-memory systems established by Dolev, Gouda and Schneider (1999).2025-07-27T09:39:10Z20 pages, accepted at SIROCCO 2026Jérémie ChalopinShantanu DasMaria Kokkouhttp://arxiv.org/abs/2506.11024v4Co-LoRA: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients2026-03-09T13:50:42ZAs AI becomes more personal, e.g., Agentic AI, there is an increasing need for personalizing models for various use cases. Personalized federated learning (PFL) enables each client to collaboratively leverage other clients' knowledge for better adaptation to the task of interest, without privacy risks. Despite its potential, existing PFL methods remain confined to rather simplified scenarios where data and models are the same across clients. To move towards realistic scenarios, we move beyond these restrictive assumptions by addressing both data and model heterogeneity. We propose a task-relevance-aware model aggregation strategy to reduce parameter interference under heterogeneous data. Moreover, we introduce Co-LoRA, a dimension-invariant module that enables knowledge sharing across heterogeneous architectures. To mimic the real-world task diversity, we propose a multi-modal PFL benchmark spanning 40 distinct tasks with distribution shifts over time. Extensive experiments show that our proposed method significantly outperforms the state-of-the-art PFL methods under heterogeneous scenarios.2025-05-20T09:17:07ZICLR 2026Minhyuk SeoTaeheon KimHankook LeeJonghyun ChoiTinne Tuytelaarshttp://arxiv.org/abs/2603.08288v1A Blockchain-based Traceability System for AI-Driven Engine Blade Inspection2026-03-09T12:06:56ZAircraft engine blade maintenance relies on inspection records shared across manufacturers, airlines, maintenance organizations, and regulators. 
Yet current systems are fragmented, difficult to audit, and vulnerable to tampering. This paper presents BladeChain, a blockchain-based system providing immutable traceability for blade inspections throughout the component life cycle. BladeChain is the first system to integrate multi-stakeholder endorsement, automated inspection scheduling, AI model provenance, and cryptographic evidence binding, delivering auditable maintenance traceability for aerospace deployments. Built on a four-stakeholder Hyperledger Fabric network (OEM, Airline, MRO, Regulator), BladeChain captures every life-cycle event in a tamper-evident ledger. A chaincode-enforced state machine governs blade status transitions and automatically triggers inspections when configurable flight hour, cycle, or calendar thresholds are exceeded, eliminating manual scheduling errors. Inspection artifacts are stored off-chain in IPFS and linked to on-chain records via SHA-256 hashes, with each inspection record capturing the AI model name and version used for defect detection. This enables regulators to audit both what defects were found and how they were found. The detection module is pluggable, allowing organizations to adopt or upgrade inspection models without modifying the ledger or workflows. We built a prototype and evaluated it on workloads of up to 100 blades, demonstrating 100% life cycle completion with consistent throughput of 26 operations per minute. A centralized SQL baseline quantifies the consensus overhead and highlights the security trade-off. Security validation confirms tamper detection within 17 ms through hash verification.2026-03-09T12:06:56ZMahmoud HafezEman OudaMohammed A. 
Mohammed EltoumKhaled SalahYusra Abdulrahmanhttp://arxiv.org/abs/2603.08278v1TA-RNN-Medical-Hybrid: A Time-Aware and Interpretable Framework for Mortality Risk Prediction2026-03-09T11:49:42ZAccurate and interpretable mortality risk prediction in intensive care units (ICUs) remains a critical challenge due to the irregular temporal structure of electronic health records (EHRs), the complexity of longitudinal disease trajectories, and the lack of clinically grounded explanations in many data-driven models. To address these challenges, we propose \textit{TA-RNN-Medical-Hybrid}, a time-aware and knowledge-enriched deep learning framework that jointly models longitudinal clinical sequences and irregular temporal dynamics through explicit continuous-time encoding, along with standardized medical concept representations. The proposed framework extends time-aware recurrent modeling by integrating explicit continuous-time embeddings that operate independently of visit indexing, SNOMED-based disease representations, and a hierarchical dual-level attention mechanism that captures both visit-level temporal importance and feature/concept-level clinical relevance. This design enables accurate mortality risk estimation while providing transparent and clinically meaningful explanations aligned with established medical knowledge. We evaluate the proposed approach on the MIMIC-III critical care dataset and compare it against strong time-aware and sequential baselines. Experimental results demonstrate that TA-RNN-Medical-Hybrid consistently improves predictive performance in terms of AUC, accuracy, and recall-oriented F$_2$-score. Moreover, qualitative analysis shows that the model effectively decomposes mortality risk across time and clinical concepts, yielding interpretable insights into disease severity, chronicity, and temporal progression. 
Overall, the proposed framework bridges the gap between predictive accuracy and clinical interpretability, offering a scalable and transparent solution for high-stakes ICU decision support systems.2026-03-09T11:49:42ZZahra JafariAzadeh ZamanifarAmirfarhad Farhadihttp://arxiv.org/abs/2603.08192v1A Hodge-Based Framework for Service Operational Analysis in Serverless Platforms2026-03-09T10:19:46ZIn this paper we propose a method for analyzing services deployed in serverless platforms. These services typically consist of orchestrated functions that can exhibit complex and non-conservative information flows due to the interaction of independently deployed functions under coarse-grained control mechanisms. We introduce a topological model of serverless services and make use of the Hodge decomposition to partition observed operational flows into locally correctable components and globally persistent harmonic modes. Our analysis shows that harmonic flows naturally arise from different kinds of interactions among functions and should be interpreted as structural properties of serverless systems rather than configuration errors. We present a systematic methodology for analyzing inter-function flows and deriving actionable remediation strategies, including dumping effects to contain the effects of harmonic inefficiencies as an alternative to completely restructuring the topological model of the service. Experimental results confirm that the proposed approach can uncover latent architectural structures leading to inefficiencies.2026-03-09T10:19:46ZSubmitted for journal publicationGianluca RealiMauro Femminellahttp://arxiv.org/abs/2603.08003v1SafarDB: FPGA-Accelerated Distributed Transactions via Replicated Data Types2026-03-09T06:16:44ZData replication is a critical aspect of data center design, as it ensures high availability, scalability, and fault tolerance. 
However, replicas need to be coordinated to maintain convergence and database integrity constraints under transactional workloads. Commutative Replicated Data Types (RDTs) provide convergence for conflict-free objects using relaxed consistency, and Well-coordinated Replicated Data Types (WRDTs) provide convergence and integrity for general objects using a hybrid model, relaxed when possible and strong when necessary. While state-of-the-art hardware acceleration of RDT uses Remote Direct Memory Access (RDMA), we observe that trends towards lower latency and higher throughput have driven recent data center architectures to leverage FPGAs as application accelerators. In contrast to deploying an FPGA-based Smart NIC, this paper connects an FPGA accelerator card directly to the network, which allows a complete redesign of the NIC to match the needs of the FPGA-hosted application. We co-design a network-attached FPGA replication engine with an FPGA-resident network interface, enabling near-network execution of replicated transactions and direct invocation of FPGA-resident operators. Following this approach, we introduce SafarDB, FPGA-accelerated Conflict-Free Replicated Data Types (CRDTs) and WRDTs. SafarDB accelerates both relaxed and strongly ordered replication paths; when strong ordering is required, SafarDB accelerates the underlying consensus control path. SafarDB improves CRDT latency and throughput by 7.0X and 5.3X, and WRDT latency and throughput by 12X and 6.8X compared to a state-of-the-art RDMA-based implementation. 
Further, experiments demonstrate that SafarDB is more resilient to crash-failures than existing CPU/RDMA-based CRDT and WRDT implementations, and SafarDB can detect leader failures and elect new leaders much faster than previously possible.2026-03-09T06:16:44ZJavad SaberlatibariPrithviraj YuvarajMohsen LesaniPhilip BriskMohammad Sadoghihttp://arxiv.org/abs/2603.07992v1SI-ChainFL: Shapley-Incentivized Secure Federated Learning for High-Speed Rail Data Sharing2026-03-09T05:57:36ZIn high-speed rail (HSR) systems, federated learning (FL) enables cross-departmental flow prediction without sharing raw data. However, existing schemes suffer from two key limitations: (1) insufficient incentives, leading to free-riding and model poisoning; and (2) centralized aggregation, which introduces a single point of failure. We propose a secure and efficient framework, SI-ChainFL, that addresses these issues by combining contribution-aware incentives with decentralized aggregation. First, we quantify client contributions using a Shapley value metric that jointly considers rare-event utility, data diversity, data quality, and timeliness. To reduce computational overhead, we further develop a rare-positive-driven client clustering strategy to accelerate Shapley estimation. Moreover, we design a blockchain-based consensus protocol for decentralized aggregation, where aggregation eligibility is tied to Shapley incentives. This design motivates clients to submit high-quality updates and enables efficient and secure global aggregation. Experiments on MNIST, CIFAR-10, CIFAR-100, and an HSR flow dataset show that SI-ChainFL remains effective under 90% malicious clients in PA attacks, achieving 14.12% higher accuracy than RAGA. 
Theoretical analysis further guarantees an upper bound on performance.2026-03-09T05:57:36Z17 pages, 19 figuresMingjie ZhaoCheng DaiFei ChenXin ChenKaoru OtaMianxiong DongBing Guohttp://arxiv.org/abs/2603.07982v1ACE-GF-based Attestation Relay for PQC - Lightweight Mempool Propagation Without On-Path Proofs2026-03-09T05:41:31ZIn post-quantum blockchain settings, objects that require validity proofs (e.g., blob roots, execution-layer or consensus-layer signature aggregates) must be broadcast through mempool and relay networks. Recursive STARKs have been proposed to aggregate such proofs so that each node forwards one proof per tick plus objects without proofs, capping per-node proof bandwidth at roughly 128 KB degree per tick. We observe that propagation does not inherently require validity proofs on the path, only a lightweight assurance that an object is eligible for relay. We present AR-ACE (ACE-GF-based Attestation Relay for PQC), in which relay nodes forward objects plus compact attestations (e.g., identity-bound signatures or commitments) and do not generate, hold, or forward any full validity proof. Only the builder (or final verifier) performs a single aggregated validity proof over the set of objects it includes. This proof-off-path design removes proof overhead from the propagation path entirely, yielding an order-of-magnitude reduction in proof-related relay bandwidth relative to proof-carrying propagation. When instantiated with ACE-GF-derived attestation keys, AR-ACE preserves a unified identity story with on-chain authorization and is PQC-ready. 
We specify a protocol model, state design goals and security considerations, define security games, and provide a structural bandwidth comparison with recursive-STARK-based propagation.2026-03-09T05:41:31Z12 pagesJian Sheng Wanghttp://arxiv.org/abs/2506.08528v4EROICA: Online Performance Troubleshooting for Large-scale Model Training2026-03-09T05:14:48ZTroubleshooting performance problems of large model training (LMT) is immensely challenging, due to unprecedented scales of modern GPU clusters, the complexity of software-hardware interactions, and the data intensity of the training process. Existing troubleshooting approaches designed for traditional distributed systems or datacenter networks fall short and can hardly apply to real-world training systems. In this paper, we present EROICA, the first online troubleshooting system that provides both fine-grained observation based on profiling, and coverage of all machines in GPU clusters, to diagnose performance issues in production, including both hardware and software problems (or the mixture of both). EROICA effectively summarizes runtime behavior patterns of LMT function executions via online profiling, and leverages differential observability to localize the root cause with minimal production impact. EROICA has been deployed as a production service for large-scale GPU clusters of ~100,000 GPUs for 1.5 years. It has diagnosed a variety of difficult performance issues with 97.5% success.2025-06-10T07:46:14ZYu GuanZhiyu YinHaoyu ChenSheng ChengChaojie YangKun QianTianyin XuPengcheng ZhangYang ZhangHanyu ZhaoYong LiWei LinDennis CaiEnnan Zhaihttp://arxiv.org/abs/2511.20982v3DOPD: A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving2026-03-09T04:31:26ZTo meet strict Service-Level Objectives (SLOs), contemporary Large Language Models (LLMs) decouple the prefill and decoding stages and place them on separate GPUs to mitigate the distinct bottlenecks inherent to each phase. 
However, the heterogeneity of LLM workloads causes producer-consumer imbalance between the two instance types in such a disaggregated architecture. To address this problem, we propose DOPD (Dynamic Optimal Prefill/Decoding), a dynamic LLM inference system that adjusts instance allocations to achieve an optimal prefill-to-decoding (P/D) ratio based on real-time load monitoring. Combined with an appropriate request-scheduling policy, DOPD effectively resolves imbalances between prefill and decoding instances and mitigates resource allocation mismatches due to mixed-length requests under high concurrency. Experimental evaluations show that, compared with vLLM and DistServe (representative aggregation-based and disaggregation-based approaches), DOPD improves overall system goodput by up to 1.5X, decreases P90 time-to-first-token (TTFT) by up to 67.5%, and decreases P90 time-per-output-token (TPOT) by up to 22.8%. Furthermore, our dynamic P/D adjustment technique performs proactive reconfiguration based on historical load, achieving over 99% SLO attainment while using fewer additional resources.2025-11-26T02:27:10Z14 pagesIEEE Transactions on Services Computing 2026Junhan LiaoMinxian XuWanyi ZhengYan WangKejiang YeRajkumar BuyyaChengzhong Xuhttp://arxiv.org/abs/2507.17128v2Auto-scaling Approaches for Microservice Applications: A Survey and Taxonomy2026-03-09T04:14:28ZMicroservice applications are created as loosely coupled application components and they leverage cloud elasticity to reduce costs and increase development speed. However, microservice applications exhibit complex interactions among dynamically evolving services and highly variable workloads, posing significant challenges to auto-scaling mechanisms. Key issues include service dependency management, performance profiling, anomaly detection, workload characterization, and fine-grained resource allocation. 
To address these challenges, recent auto-scaling approaches leverage historical and runtime data to adapt resource provisioning and optimize system efficiency. Since 2018, marked by the graduation of Kubernetes as the first Cloud Native Computing Foundation (CNCF) project, microservice applications have been widely deployed on standardized orchestration platforms, fundamentally shifting auto-scaling from coarse-grained to service-level, dependency-aware strategies. Accordingly, this paper surveys state-of-the-art auto-scaling approaches for microservice applications since 2018 and presents a taxonomy along five dimensions: infrastructure, architecture, scaling methods, optimization objectives, and behavior modeling. These perspectives collectively target key objectives, including resource efficiency, cost efficiency, and Service Level Agreement (SLA) assurance, aiming to balance system optimization with SLA compliance. We further present a comprehensive comparison and in-depth analysis of representative approaches, examining their core features, strengths, limitations, and applicable scenarios, as well as their performance across diverse environments and workload conditions.2025-07-23T02:04:40Z14 pagesMinxian XuJunhan LiaoLinfeng WenHuaming WuKejiang YeRajkumar BuyyaChengzhong Xuhttp://arxiv.org/abs/2603.07850v1A Lock-Free, Fully GPU-Resident Architecture for the Verification of Goldbach's Conjecture2026-03-08T23:58:47ZWe present a fully device-resident, multi-GPU architecture for the large-scale computational verification of Goldbach's conjecture. In prior work, a segmented double-sieve eliminated monolithic VRAM bottlenecks but remained constrained by host-side sieve construction and PCIe transfer latency. In this work, we migrate the entire segment generation pipeline to the GPU using highly optimised L1 shared-memory tiling, achieving near-zero host-device communication during the critical verification path. 
To fully leverage heterogeneous multi-GPU clusters, we introduce an asynchronous, lock-free work-stealing pool that replaces static workload partitioning with atomic segment claiming, enabling $99.7$% parallel efficiency at 2 GPUs and $98.6$% at $4$ GPUs. We further implement strict mathematical overflow guards guaranteeing the soundness of the 64-bit verification pipeline up to its theoretical ceiling of $1.84 \times 10^{19}$. On the same hardware, the new architecture achieves a $45.6\times$ algorithmic speedup over its host-coupled predecessor at N = $10^{10}$. End-to-end, the framework verifies Goldbach's conjecture up to $10^{12}$ in $36.5$ seconds on a single NVIDIA RTX 5090, and up to $10^{13}$ in $133.5$ seconds on a four-GPU system. All code is open-source and reproducible on commodity hardware.2026-03-08T23:58:47Z14 pages, 4 figures, 3 tables. The presented work details a major architectural overhaul: migration of the segmented sieve to GPU L1 shared memory and the implementation of a lock-free multi-GPU work pool. Source code available at: https://github.com/isaac-6/goldbach-gpuIsaac Llorente-Saguerhttp://arxiv.org/abs/2603.07770v1ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs2026-03-08T19:20:25ZAlthough existing frameworks for large language model (LLM) inference on CPUs are mature, they fail to fully exploit the computation potential of many-core CPU platforms. Many-core CPUs are widely deployed in web servers and high-end networking devices, and are typically organized into multiple NUMA nodes that group cores and memory. Current frameworks largely overlook the substantial overhead of cross-NUMA memory access, limiting inference scalability and intelligence enabling on such platforms. To address this limitation, we build ArcLight, a lightweight LLM inference architecture designed from the ground up for many-core CPUs. 
ArcLight integrates efficient memory management and thread scheduling, and introduces finely controlled tensor parallelism to mitigate the cross-node memory access wall. Experimental results show that ArcLight significantly surpasses the performance ceiling of mainstream frameworks, achieving up to 46% higher inference throughput. Moreover, ArcLight maintains compatibility with arbitrary CPU devices. ArcLight is publicly available at https://github.com/OpenBMB/ArcLight.2026-03-08T19:20:25Z13 figures, 1 tableYuzhuang XuXu HanYuxuan LiWanxiang Chehttp://arxiv.org/abs/2510.19805v2Next Generation Cloud-native In-Memory Stores: From Redis to Valkey and Beyond2026-03-08T18:02:45ZIn-memory key-value datastores have become indispensable building blocks of modern cloud-native infrastructures, yet their evolution faces scalability, compatibility, and sustainability constraints. The current literature lacks an experimental evaluation of state-of-the-art tools in the domain. This study addressed this timely gap by benchmarking Redis alternatives and systematically evaluating Valkey, KeyDB, and Garnet under realistic workloads within Kubernetes deployments. The results demonstrate clear trade-offs among the benchmarked data systems. Our study presents a comprehensive performance and viability assessment of the emerging in-memory key-value stores. Metrics include throughput, tail latency, CPU and memory efficiency, and migration complexity. We highlight trade-offs between performance, compatibility, and long-term viability, including project maturity, community support, and sustained development.2025-10-22T17:40:17ZThe first author was neither informed nor did he give his consent to the publication of the paper on arXiv. Further, the submission contains multiple major errors in the reported numerical results, and the related conclusions are not supported by the underlying benchmark data. 
The first author does not stand behind the paperCarl-Johan Fauvelle Munck af RosenschöldFeras M. AwayshehAhmad Awad