https://arxiv.org/api/eFCaJAzlP+QhfrnaoxX9l4AFmwc 2026-04-10T20:35:39Z 27953 465 15 http://arxiv.org/abs/2504.06254v3 Fixing Non-blocking Data Structures for Better Compatibility with Memory Reclamation Schemes 2026-03-09T17:06:19Z

We present a new technique, Safe Concurrent Optimistic Traversals (SCOT), to address a well-known problem related to optimistic traversals with classical and more recent safe memory reclamation (SMR) schemes, such as Hazard Pointers (HP), Hazard Eras (HE), Interval-Based Reclamation (IBR), and Hyaline. Unlike Epoch-Based Reclamation (EBR), these (robust) schemes protect against stalled threads but lack support for well-known data structures with optimistic traversals, e.g., Harris' list and the Natarajan-Mittal tree. Such schemes are either incompatible with them or need changes with performance trade-offs (e.g., the Harris-Michael list). SCOT keeps existing SMR schemes intact and retains performance benefits of original data structures. We implement and evaluate SCOT with Harris' list and the Natarajan-Mittal tree, but it is also applicable to other data structures. Furthermore, we provide a simple modification for wait-free traversals. We observe similar performance speedups (e.g., Harris vs. Harris-Michael lists) that were previously available only to EBR users. Our version of the tree also achieves very high throughput, comparable to that of EBR, which is often treated as a practical upper bound.

2025-04-08T17:57:14Z Proceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP 2026) Md Amit Hasan Arovi Ruslan Nikolaev 10.1145/3774934.3786455 http://arxiv.org/abs/2507.20201v2 Silent Self-Stabilising Leader Election in Programmable Matter Systems with Holes 2026-03-09T16:50:35Z

Leader election is a fundamental problem in distributed computing, particularly within programmable matter systems, where coordination among simple computational entities is crucial for solving complex tasks. In these systems, particles (i.e., constant-memory computational entities) operate in a regular triangular grid as described in the geometric Amoebot model. While leader election has been extensively studied in non-self-stabilising settings, self-stabilising solutions remain more limited. In this work, we study the problem of self-stabilising leader election in connected (but not necessarily simply connected) configurations. We present the first self-stabilising algorithm for connected programmable matter systems that guarantees the election of a unique leader under an unfair scheduler, for oblivious particles (i.e., particles with no persistent memory) that share a common sense of direction. Our approach leverages particle movement, a capability not previously exploited in the self-stabilising context. We show that movement, in conjunction with particles sharing a sense of orientation and operating in a grid, can overcome classical impossibility results for constant-memory systems established by Dolev, Gouda and Schneider (1999).

2025-07-27T09:39:10Z 20 pages, accepted at SIROCCO 2026 Jérémie Chalopin Shantanu Das Maria Kokkou http://arxiv.org/abs/2506.11024v4 Co-LoRA: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients 2026-03-09T13:50:42Z

As AI becomes more personal, e.g., Agentic AI, there is an increasing need for personalizing models for various use cases. Personalized federated learning (PFL) enables each client to collaboratively leverage other clients' knowledge for better adaptation to the task of interest, without privacy risks. Despite its potential, existing PFL methods remain confined to rather simplified scenarios where data and models are the same across clients. To move towards realistic scenarios, we move beyond these restrictive assumptions by addressing both data and model heterogeneity. We propose a task-relevance-aware model aggregation strategy to reduce parameter interference under heterogeneous data. Moreover, we introduce Co-LoRA, a dimension-invariant module that enables knowledge sharing across heterogeneous architectures. To mimic real-world task diversity, we propose a multi-modal PFL benchmark spanning 40 distinct tasks with distribution shifts over time. Extensive experiments show that our proposed method significantly outperforms state-of-the-art PFL methods under heterogeneous scenarios.

2025-05-20T09:17:07Z ICLR 2026 Minhyuk Seo Taeheon Kim Hankook Lee Jonghyun Choi Tinne Tuytelaars http://arxiv.org/abs/2603.08288v1 A Blockchain-based Traceability System for AI-Driven Engine Blade Inspection 2026-03-09T12:06:56Z

Aircraft engine blade maintenance relies on inspection records shared across manufacturers, airlines, maintenance organizations, and regulators. Yet current systems are fragmented, difficult to audit, and vulnerable to tampering. This paper presents BladeChain, a blockchain-based system providing immutable traceability for blade inspections throughout the component life cycle. BladeChain is the first system to integrate multi-stakeholder endorsement, automated inspection scheduling, AI model provenance, and cryptographic evidence binding, delivering auditable maintenance traceability for aerospace deployments. Built on a four-stakeholder Hyperledger Fabric network (OEM, Airline, MRO, Regulator), BladeChain captures every life-cycle event in a tamper-evident ledger. A chaincode-enforced state machine governs blade status transitions and automatically triggers inspections when configurable flight hour, cycle, or calendar thresholds are exceeded, eliminating manual scheduling errors. Inspection artifacts are stored off-chain in IPFS and linked to on-chain records via SHA-256 hashes, with each inspection record capturing the AI model name and version used for defect detection. This enables regulators to audit both what defects were found and how they were found. The detection module is pluggable, allowing organizations to adopt or upgrade inspection models without modifying the ledger or workflows. We built a prototype and evaluated it on workloads of up to 100 blades, demonstrating 100% life-cycle completion with consistent throughput of 26 operations per minute. A centralized SQL baseline quantifies the consensus overhead and highlights the security trade-off. Security validation confirms tamper detection within 17 ms through hash verification.
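
The hash-based evidence binding and threshold-triggered scheduling described above can be illustrated with a minimal sketch. This is not BladeChain's chaincode: the function names, the record fields, and the threshold values are hypothetical, and only the use of SHA-256 and of flight hour, cycle, and calendar thresholds comes from the abstract.

```python
import hashlib


def artifact_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest that would be stored on-chain."""
    return hashlib.sha256(data).hexdigest()


def is_untampered(data: bytes, on_chain_digest: str) -> bool:
    """Recompute the digest of the off-chain artifact and compare it."""
    return artifact_digest(data) == on_chain_digest


def inspection_due(flight_hours: float, cycles: int, days_since_last: int,
                   limits: dict) -> bool:
    """True when any configured threshold is exceeded (hypothetical fields)."""
    return (flight_hours > limits["flight_hours"]
            or cycles > limits["cycles"]
            or days_since_last > limits["calendar_days"])


# Hypothetical inspection record: the artifact itself stays off-chain,
# only its digest and the model provenance are recorded.
artifact = b"inspection image bytes"
record = {
    "blade_id": "B-001",            # assumed identifier format
    "model_name": "detector",       # assumed model name
    "model_version": "1.0",         # assumed version string
    "artifact_sha256": artifact_digest(artifact),
}

# Assumed threshold values, for illustration only.
limits = {"flight_hours": 500.0, "cycles": 300, "calendar_days": 180}
```

Verifying an artifact then amounts to calling `is_untampered(artifact, record["artifact_sha256"])`; any change to the stored bytes changes the digest and the check fails.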

2026-03-09T12:06:56Z Mahmoud Hafez Eman Ouda Mohammed A. Mohammed Eltoum Khaled Salah Yusra Abdulrahman http://arxiv.org/abs/2603.08278v1 TA-RNN-Medical-Hybrid: A Time-Aware and Interpretable Framework for Mortality Risk Prediction 2026-03-09T11:49:42Z

Accurate and interpretable mortality risk prediction in intensive care units (ICUs) remains a critical challenge due to the irregular temporal structure of electronic health records (EHRs), the complexity of longitudinal disease trajectories, and the lack of clinically grounded explanations in many data-driven models. To address these challenges, we propose TA-RNN-Medical-Hybrid, a time-aware and knowledge-enriched deep learning framework that jointly models longitudinal clinical sequences and irregular temporal dynamics through explicit continuous-time encoding, along with standardized medical concept representations. The proposed framework extends time-aware recurrent modeling by integrating explicit continuous-time embeddings that operate independently of visit indexing, SNOMED-based disease representations, and a hierarchical dual-level attention mechanism that captures both visit-level temporal importance and feature/concept-level clinical relevance. This design enables accurate mortality risk estimation while providing transparent and clinically meaningful explanations aligned with established medical knowledge. We evaluate the proposed approach on the MIMIC-III critical care dataset and compare it against strong time-aware and sequential baselines. Experimental results demonstrate that TA-RNN-Medical-Hybrid consistently improves predictive performance in terms of AUC, accuracy, and recall-oriented F$_2$-score. Moreover, qualitative analysis shows that the model effectively decomposes mortality risk across time and clinical concepts, yielding interpretable insights into disease severity, chronicity, and temporal progression. Overall, the proposed framework bridges the gap between predictive accuracy and clinical interpretability, offering a scalable and transparent solution for high-stakes ICU decision support systems.

2026-03-09T11:49:42Z Zahra Jafari Azadeh Zamanifar Amirfarhad Farhadi http://arxiv.org/abs/2603.08192v1 A Hodge-Based Framework for Service Operational Analysis in Serverless Platforms 2026-03-09T10:19:46Z

In this paper we propose a method for analyzing services deployed in serverless platforms. These services typically consist of orchestrated functions that can exhibit complex and non-conservative information flows due to the interaction of independently deployed functions under coarse-grained control mechanisms. We introduce a topological model of serverless services and make use of the Hodge decomposition to partition observed operational flows into locally correctable components and globally persistent harmonic modes. Our analysis shows that harmonic flows naturally arise from different kinds of interactions among functions and should be interpreted as structural properties of serverless systems rather than configuration errors. We present a systematic methodology for analyzing inter-function flows and deriving actionable remediation strategies, including dumping effects to contain the effects of harmonic inefficiencies as an alternative to completely restructuring the topological model of the service. Experimental results confirm that the proposed approach can uncover latent architectural structures leading to inefficiencies.

2026-03-09T10:19:46Z Submitted for journal publication Gianluca Reali Mauro Femminella http://arxiv.org/abs/2603.08003v1 SafarDB: FPGA-Accelerated Distributed Transactions via Replicated Data Types 2026-03-09T06:16:44Z

Data replication is a critical aspect of data center design, as it ensures high availability, scalability, and fault tolerance. However, replicas need to be coordinated to maintain convergence and database integrity constraints under transactional workloads. Commutative Replicated Data Types (RDTs) provide convergence for conflict-free objects using relaxed consistency, and Well-coordinated Replicated Data Types (WRDTs) provide convergence and integrity for general objects using a hybrid model, relaxed when possible and strong when necessary. While state-of-the-art hardware acceleration of RDT uses Remote Direct Memory Access (RDMA), we observe that trends towards lower latency and higher throughput have driven recent data center architectures to leverage FPGAs as application accelerators. In contrast to deploying an FPGA-based Smart NIC, this paper connects an FPGA accelerator card directly to the network, which allows a complete redesign of the NIC to match the needs of the FPGA-hosted application. We co-design a network-attached FPGA replication engine with an FPGA-resident network interface, enabling near-network execution of replicated transactions and direct invocation of FPGA-resident operators. Following this approach, we introduce SafarDB, FPGA-accelerated Conflict-Free Replicated Data Types (CRDTs) and WRDTs. SafarDB accelerates both relaxed and strongly ordered replication paths; when strong ordering is required, SafarDB accelerates the underlying consensus control path. SafarDB improves CRDT latency and throughput by 7.0X and 5.3X, and WRDT latency and throughput by 12X and 6.8X compared to a state-of-the-art RDMA-based implementation. Further, experiments demonstrate that SafarDB is more resilient to crash-failures than existing CPU/RDMA-based CRDT and WRDT implementations, and SafarDB can detect leader failures and elect new leaders much faster than previously possible.
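
To make the convergence property of conflict-free replicated objects concrete, here is a minimal sketch of a grow-only counter, a textbook CRDT. It is not SafarDB's FPGA design or its API; the class and method names are illustrative assumptions. Each replica increments only its own slot, and merging takes the element-wise maximum, so merges commute and replicas converge regardless of delivery order.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id: int, n_replicas: int):
        self.replica_id = replica_id
        self.counts = [0] * n_replicas

    def increment(self, amount: int = 1) -> None:
        # A replica only ever writes to its own slot.
        self.counts[self.replica_id] += amount

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self) -> int:
        return sum(self.counts)


# Two replicas update concurrently, then exchange state in either order.
a = GCounter(replica_id=0, n_replicas=2)
b = GCounter(replica_id=1, n_replicas=2)
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
```

After the exchange both replicas report the same value without any coordination. Objects with integrity constraints (for example a balance that must stay non-negative) are the case where, per the abstract, a strongly ordered path is still needed.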

2026-03-09T06:16:44Z Javad Saberlatibari Prithviraj Yuvaraj Mohsen Lesani Philip Brisk Mohammad Sadoghi http://arxiv.org/abs/2603.07992v1 SI-ChainFL: Shapley-Incentivized Secure Federated Learning for High-Speed Rail Data Sharing 2026-03-09T05:57:36Z

In high-speed rail (HSR) systems, federated learning (FL) enables cross-departmental flow prediction without sharing raw data. However, existing schemes suffer from two key limitations: (1) insufficient incentives, leading to free-riding and model poisoning; and (2) centralized aggregation, which introduces a single point of failure. We propose a secure and efficient framework, SI-ChainFL, that addresses these issues by combining contribution-aware incentives with decentralized aggregation. First, we quantify client contributions using a Shapley value metric that jointly considers rare-event utility, data diversity, data quality, and timeliness. To reduce computational overhead, we further develop a rare-positive-driven client clustering strategy to accelerate Shapley estimation. Moreover, we design a blockchain-based consensus protocol for decentralized aggregation, where aggregation eligibility is tied to Shapley incentives. This design motivates clients to submit high-quality updates and enables efficient and secure global aggregation. Experiments on MNIST, CIFAR-10, CIFAR-100, and an HSR flow dataset show that SI-ChainFL remains effective under 90% malicious clients in PA attacks, achieving 14.12% higher accuracy than RAGA. Theoretical analysis further guarantees an upper bound on performance.
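
As background for the contribution metric, the sketch below computes exact Shapley values by averaging each player's marginal contribution over all join orders. It is a generic illustration, not SI-ChainFL's estimator: the toy utility function and client names are assumptions, and the exact computation is factorial in the number of clients, which is the cost that clustering-based approximations aim to avoid.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley values: average marginal contribution over all orders."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        previous = utility(coalition)
        for p in order:
            coalition = coalition | {p}
            current = utility(coalition)
            totals[p] += current - previous
            previous = current
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Toy utility (an assumption): each client adds a fixed amount, and
# clients "a" and "b" together unlock an extra bonus of 2.
WEIGHTS = {"a": 1.0, "b": 2.0, "c": 3.0}


def toy_utility(coalition):
    base = sum(WEIGHTS[p] for p in coalition)
    bonus = 2.0 if {"a", "b"} <= coalition else 0.0
    return base + bonus


values = shapley_values(list(WEIGHTS), toy_utility)
```

In this toy game the bonus is split evenly between the two clients that jointly create it, and the values sum to the utility of the full coalition, which is the efficiency property that makes Shapley values attractive for incentive design.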

2026-03-09T05:57:36Z 17 pages, 19 figures Mingjie Zhao Cheng Dai Fei Chen Xin Chen Kaoru Ota Mianxiong Dong Bing Guo http://arxiv.org/abs/2603.07982v1 ACE-GF-based Attestation Relay for PQC - Lightweight Mempool Propagation Without On-Path Proofs 2026-03-09T05:41:31Z

In post-quantum blockchain settings, objects that require validity proofs (e.g., blob roots, execution-layer or consensus-layer signature aggregates) must be broadcast through mempool and relay networks. Recursive STARKs have been proposed to aggregate such proofs so that each node forwards one proof per tick plus objects without proofs, capping per-node proof bandwidth at roughly 128 KB degree per tick. We observe that propagation does not inherently require validity proofs on the path, only a lightweight assurance that an object is eligible for relay. We present AR-ACE (ACE-GF-based Attestation Relay for PQC), in which relay nodes forward objects plus compact attestations (e.g., identity-bound signatures or commitments) and do not generate, hold, or forward any full validity proof. Only the builder (or final verifier) performs a single aggregated validity proof over the set of objects it includes. This proof-off-path design removes proof overhead from the propagation path entirely, yielding an order-of-magnitude reduction in proof-related relay bandwidth relative to proof-carrying propagation. When instantiated with ACE-GF-derived attestation keys, AR-ACE preserves a unified identity story with on-chain authorization and is PQC-ready. We specify a protocol model, state design goals and security considerations, define security games, and provide a structural bandwidth comparison with recursive-STARK-based propagation.

2026-03-09T05:41:31Z 12 pages Jian Sheng Wang http://arxiv.org/abs/2506.08528v4 EROICA: Online Performance Troubleshooting for Large-scale Model Training 2026-03-09T05:14:48Z

Troubleshooting performance problems of large model training (LMT) is immensely challenging, due to unprecedented scales of modern GPU clusters, the complexity of software-hardware interactions, and the data intensity of the training process. Existing troubleshooting approaches designed for traditional distributed systems or datacenter networks fall short and can hardly apply to real-world training systems. In this paper, we present EROICA, the first online troubleshooting system that provides both fine-grained observation based on profiling, and coverage of all machines in GPU clusters, to diagnose performance issues in production, including both hardware and software problems (or the mixture of both). EROICA effectively summarizes runtime behavior patterns of LMT function executions via online profiling, and leverages differential observability to localize the root cause with minimal production impact. EROICA has been deployed as a production service for large-scale GPU clusters of ~100,000 GPUs for 1.5 years. It has diagnosed a variety of difficult performance issues with 97.5% success.

2025-06-10T07:46:14Z Yu Guan Zhiyu Yin Haoyu Chen Sheng Cheng Chaojie Yang Kun Qian Tianyin Xu Pengcheng Zhang Yang Zhang Hanyu Zhao Yong Li Wei Lin Dennis Cai Ennan Zhai http://arxiv.org/abs/2511.20982v3 DOPD: A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving 2026-03-09T04:31:26Z

To meet strict Service-Level Objectives (SLOs), contemporary Large Language Models (LLMs) decouple the prefill and decoding stages and place them on separate GPUs to mitigate the distinct bottlenecks inherent to each phase. However, the heterogeneity of LLM workloads causes producer-consumer imbalance between the two instance types in such a disaggregated architecture. To address this problem, we propose DOPD (Dynamic Optimal Prefill/Decoding), a dynamic LLM inference system that adjusts instance allocations to achieve an optimal prefill-to-decoding (P/D) ratio based on real-time load monitoring. Combined with an appropriate request-scheduling policy, DOPD effectively resolves imbalances between prefill and decoding instances and mitigates resource allocation mismatches due to mixed-length requests under high concurrency. Experimental evaluations show that, compared with vLLM and DistServe (representative aggregation-based and disaggregation-based approaches), DOPD improves overall system goodput by up to 1.5X, decreases P90 time-to-first-token (TTFT) by up to 67.5%, and decreases P90 time-per-output-token (TPOT) by up to 22.8%. Furthermore, our dynamic P/D adjustment technique performs proactive reconfiguration based on historical load, achieving over 99% SLO attainment while using fewer additional resources.

2025-11-26T02:27:10Z 14 pages IEEE Transactions on Services Computing 2026 Junhan Liao Minxian Xu Wanyi Zheng Yan Wang Kejiang Ye Rajkumar Buyya Chengzhong Xu http://arxiv.org/abs/2507.17128v2 Auto-scaling Approaches for Microservice Applications: A Survey and Taxonomy 2026-03-09T04:14:28Z

Microservice applications are created as loosely coupled application components and they leverage cloud elasticity to reduce costs and increase development speed. However, microservice applications exhibit complex interactions among dynamically evolving services and highly variable workloads, posing significant challenges to auto-scaling mechanisms. Key issues include service dependency management, performance profiling, anomaly detection, workload characterization, and fine-grained resource allocation. To address these challenges, recent auto-scaling approaches leverage historical and runtime data to adapt resource provisioning and optimize system efficiency. Since 2018, marked by the graduation of Kubernetes as the first Cloud Native Computing Foundation (CNCF) project, microservice applications have been widely deployed on standardized orchestration platforms, fundamentally shifting auto-scaling from coarse-grained to service-level, dependency-aware strategies. Accordingly, this paper surveys state-of-the-art auto-scaling approaches for microservice applications since 2018 and presents a taxonomy along five dimensions: infrastructure, architecture, scaling methods, optimization objectives, and behavior modeling. These perspectives collectively target key objectives, including resource efficiency, cost efficiency, and Service Level Agreement (SLA) assurance, aiming to balance system optimization with SLA compliance. We further present a comprehensive comparison and in-depth analysis of representative approaches, examining their core features, strengths, limitations, and applicable scenarios, as well as their performance across diverse environments and workload conditions.

2025-07-23T02:04:40Z 14 pages Minxian Xu Junhan Liao Linfeng Wen Huaming Wu Kejiang Ye Rajkumar Buyya Chengzhong Xu http://arxiv.org/abs/2603.07850v1 A Lock-Free, Fully GPU-Resident Architecture for the Verification of Goldbach's Conjecture 2026-03-08T23:58:47Z

We present a fully device-resident, multi-GPU architecture for the large-scale computational verification of Goldbach's conjecture. In prior work, a segmented double-sieve eliminated monolithic VRAM bottlenecks but remained constrained by host-side sieve construction and PCIe transfer latency. In this work, we migrate the entire segment generation pipeline to the GPU using highly optimised L1 shared-memory tiling, achieving near-zero host-device communication during the critical verification path. To fully leverage heterogeneous multi-GPU clusters, we introduce an asynchronous, lock-free work-stealing pool that replaces static workload partitioning with atomic segment claiming, enabling $99.7$% parallel efficiency at 2 GPUs and $98.6$% at $4$ GPUs. We further implement strict mathematical overflow guards guaranteeing the soundness of the 64-bit verification pipeline up to its theoretical ceiling of $1.84 \times 10^{19}$. On the same hardware, the new architecture achieves a $45.6\times$ algorithmic speedup over its host-coupled predecessor at N = $10^{10}$. End-to-end, the framework verifies Goldbach's conjecture up to $10^{12}$ in $36.5$ seconds on a single NVIDIA RTX 5090, and up to $10^{13}$ in $133.5$ seconds on a four-GPU system. All code is open-source and reproducible on commodity hardware.
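
The verification task itself can be sketched on a small scale. The code below is a CPU-only illustration, not the paper's GPU pipeline: it builds a plain sieve of Eratosthenes and checks each even number in a segment for a prime pair. The function names and the segment size are assumptions. In a multi-worker version, each worker would claim the next segment index from a shared atomic counter instead of iterating over the list sequentially.

```python
from math import isqrt


def prime_sieve(limit: int) -> bytearray:
    """Sieve of Eratosthenes; is_prime[n] is 1 when n is prime (limit >= 2)."""
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0:2] = b"\x00\x00"
    for p in range(2, isqrt(limit) + 1):
        if is_prime[p]:
            # Clear every multiple of p starting at p*p.
            is_prime[p * p::p] = bytearray(len(range(p * p, limit + 1, p)))
    return is_prime


def goldbach_witness(n: int, is_prime: bytearray):
    """Smallest prime p with n - p also prime, or None if no pair exists."""
    for p in range(2, n // 2 + 1):
        if is_prime[p] and is_prime[n - p]:
            return p
    return None


def verify_segment(lo: int, hi: int, is_prime: bytearray) -> bool:
    """Check every even n in [lo, hi] with n >= 4."""
    start = max(4, lo + (lo % 2))
    return all(goldbach_witness(n, is_prime) is not None
               for n in range(start, hi + 1, 2))


LIMIT = 10_000
SEGMENT = 1_000  # assumed segment size, for illustration only
sieve = prime_sieve(LIMIT)
segments = [(lo, min(lo + SEGMENT - 1, LIMIT)) for lo in range(4, LIMIT + 1, SEGMENT)]
all_verified = all(verify_segment(lo, hi, sieve) for lo, hi in segments)
```

The stated ceiling of $1.84 \times 10^{19}$ is consistent with the range of an unsigned 64-bit integer, since $2^{64} \approx 1.8447 \times 10^{19}$; Python integers do not overflow, so the overflow guards mentioned in the abstract have no counterpart in this sketch.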

2026-03-08T23:58:47Z 14 pages, 4 figures, 3 tables. The presented work details a major architectural overhaul: migration of the segmented sieve to GPU L1 shared memory and the implementation of a lock-free multi-GPU work pool. Source code available at: https://github.com/isaac-6/goldbach-gpu Isaac Llorente-Saguer http://arxiv.org/abs/2603.07770v1 ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs 2026-03-08T19:20:25Z

Although existing frameworks for large language model (LLM) inference on CPUs are mature, they fail to fully exploit the computation potential of many-core CPU platforms. Many-core CPUs are widely deployed in web servers and high-end networking devices, and are typically organized into multiple NUMA nodes that group cores and memory. Current frameworks largely overlook the substantial overhead of cross-NUMA memory access, limiting inference scalability and intelligence enabling on such platforms. To address this limitation, we build ArcLight, a lightweight LLM inference architecture designed from the ground up for many-core CPUs. ArcLight integrates efficient memory management and thread scheduling, and introduces finely controlled tensor parallelism to mitigate the cross-node memory access wall. Experimental results show that ArcLight significantly surpasses the performance ceiling of mainstream frameworks, achieving up to 46% higher inference throughput. Moreover, ArcLight maintains compatibility with arbitrary CPU devices. ArcLight is publicly available at https://github.com/OpenBMB/ArcLight.

2026-03-08T19:20:25Z 13 figures, 1 table Yuzhuang Xu Xu Han Yuxuan Li Wanxiang Che http://arxiv.org/abs/2510.19805v2 Next Generation Cloud-native In-Memory Stores: From Redis to Valkey and Beyond 2026-03-08T18:02:45Z

In-memory key-value datastores have become indispensable building blocks of modern cloud-native infrastructures, yet their evolution faces scalability, compatibility, and sustainability constraints. The current literature lacks an experimental evaluation of state-of-the-art tools in the domain. This study addresses this gap by benchmarking Redis alternatives and systematically evaluating Valkey, KeyDB, and Garnet under realistic workloads within Kubernetes deployments. The results demonstrate clear trade-offs among the benchmarked data systems. Our study presents a comprehensive performance and viability assessment of the emerging in-memory key-value stores. Metrics include throughput, tail latency, CPU and memory efficiency, and migration complexity. We highlight trade-offs between performance, compatibility, and long-term viability, including project maturity, community support, and sustained development.

2025-10-22T17:40:17Z The first author was neither informed nor did he give his consent to the publication of the paper on arXiv. Further, the submission contains multiple major errors in the reported numerical results, and the related conclusions are not supported by the underlying benchmark data. The first author does not stand behind the paper. Carl-Johan Fauvelle Munck af Rosenschöld Feras M. Awaysheh Ahmad Awad