http://arxiv.org/abs/2603.05666v1Why Ethereum Needs Fairness Mechanisms that Do Not Depend on Participant Altruism2026-03-05T20:32:36ZEthereum's ideals of decentralization and censorship resistance are undermined in practice, motivating ongoing efforts to reestablish these properties. Existing proposals for fairness mechanisms depend on the assumption that a sufficient fraction of block proposers adhere to Ethereum's protocols as intended. We refer to such proposers as altruistic, as this behavior may come at the cost of reduced revenue. Prior analyses indicate that a consistent share of 91 percent of proposers delegate block construction to centralized services, effectively signing externally constructed blocks blindly, and are thus not considered altruistic. To assess whether the remaining 9 percent of proposers genuinely exhibit altruistic behavior, we conducted an empirical analysis and found that an additional 6.1 percent also interact with such external services. Further, we found that less than 1.4 percent of proposers consistently acted in accordance with Ethereum's decentralization and censorship resistance objectives. These findings suggest that relying solely on the mere presence of altruistic proposers is insufficient to ensure that proposed fairness mechanisms reestablish Ethereum's ideals, highlighting the need for additional incentive- or penalty-based mechanisms.2026-03-05T20:32:36Z8 pages, 4 figuresPatrick SpiesbergerNils Henrik BeyerHannes Hartensteinhttp://arxiv.org/abs/2502.09922v3λScale: Enabling Fast Scaling for Serverless Large Language Model Inference2026-03-05T19:50:13ZServerless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup overhead.
This poses a significant challenge in efficiently scaling model instances to accommodate dynamic, bursty workloads commonly observed in real-world inference services. In this paper, we introduce λScale, an efficient serverless inference system to achieve fast model scaling. The key idea behind λScale is to leverage high-speed RDMA networks between GPU nodes for fast model multicast, while enabling distributed inference execution during model transmission -- referred to as "execute-while-load". λScale proposes an efficient model scaling scheme, λPipe, which supports adaptive model multicast and dynamically constructs execution pipelines across receiving nodes for collaborative, distributed inference. Additionally, λScale supports efficient model management across GPU and host memory, allowing fast scaling for models across different storage tiers. Evaluation results show that λScale enables fast model scaling and effectively handles load spikes, achieving up to 5x tail-latency improvement and 31.3% cost reduction compared to state-of-the-art solutions on real-world LLM inference traces.2025-02-14T05:21:48ZMinchen YuRui YangChaobo JiaZhaoyuan SuSheng YaoTingfeng LanYuchen YangZirui WangYue ChengWei WangAo WangRuichuan Chenhttp://arxiv.org/abs/2603.06709v1The Need for Quantitative Resilience Models and Metrics in Classical-Quantum Computing Systems2026-03-05T19:07:43ZIncreasingly deeper integration of HPC resources and QPUs unveils new challenges in computer architecture and engineering. As a consequence, dependability arises again as a concern encompassing resilience, reproducibility and security. The properties of quantum computing systems involve a reinterpretation of these factors in retrodictive, predictive, and prescriptive ways. We state here that resilience must become an \emph{a priori} design constraint rather than an afterthought of HPC-QPU integration. 
This article describes the need for conceptual and quantitative models to estimate and assess the resilience of hybrid classical-quantum computing infrastructure. We suggest how resilience methods in civil engineering can apply at various levels of the classical-quantum computing stack. We also discuss implications of a model of end-user value for the estimation of consequences resulting from the propagation of vulnerabilities from a given level of the stack upwards. Finally, we argue that new resilience models can help assess the impact of improving specific components in quantum technology stacks and provide a clearer picture of the value of separation of concerns across different layers. Ultimately, HPC-QPU integration will increasingly demand more precise statements about the cost-benefit ratio of specific system improvements and their cascading consequences against estimates of delivered value to users.2026-03-05T19:07:43Z16 pages, 8 figuresSantiago Núñez-Corraleshttp://arxiv.org/abs/2309.09359v2Concurrent Deterministic Skiplist and Other Data Structures2026-03-05T18:38:35ZSkiplists are used in a variety of applications for storing data subject to order criteria. In this article we discuss the design, analysis and performance of a concurrent deterministic skiplist on many-core NUMA nodes. We also evaluate the performance of a concurrent lock-free unbounded queue implementation and two concurrent multi-reader, multi-writer (MWMR) hash table implementations and compare them with those from Intel's Thread Building Blocks (TBB) library. We introduce strategies for memory management that reduce page faults and cache misses for the memory access patterns in these data structures.
This paper proposes hierarchical usage of concurrent data structures in programs to improve memory latencies by reducing memory accesses from remote NUMA nodes.2023-09-17T19:50:26ZAparna Sasidharanhttp://arxiv.org/abs/2412.10733v2Universal Pattern Formation by Oblivious Robots Under Sequential Schedulers2026-03-05T17:24:36ZWe study the computational power that oblivious robots operating in the plane have under sequential schedulers. We show that this power is much stronger than the obvious capacity these schedulers offer of breaking symmetry, and thus to create a leader. In fact, we prove that under any sequential scheduler, robots are capable of solving problems that are unsolvable even with a leader under the fully synchronous scheduler FSYNC. More precisely, we consider the class of pattern formation problems, and focus on the most general problem in this class, Universal Pattern Formation (UPF), which requires the robots to form every pattern given in input, starting from any initial configuration (where some robots may occupy the same point, hence forming a multiplicity). We first show that UPF is unsolvable under FSYNC, even if the robots are endowed with additional strong capabilities (multiplicity detection, rigid movement, agreement on coordinate systems, presence of a unique leader). On the other hand, we prove that, except for point formation (Gathering), UPF is solvable under any sequential scheduler without any additional assumptions. We then turn our attention to the Gathering problem, and prove that weak multiplicity detection (the ability to detect a multiplicity but not the exact number of robots forming it) is necessary and sufficient for solvability under sequential schedulers. 
The results obtained show that the computational power of the robots under FSYNC (where Gathering is solvable without any multiplicity detection) and that under sequential schedulers are orthogonal.2024-12-14T08:06:37ZPaola FlocchiniAlfredo NavarraDebasish PattanayakFrancesco PiselliNicola Santorohttp://arxiv.org/abs/2603.05366v1Radiation Hydrodynamics at Scale: Comparing MPI and Asynchronous Many-Task Runtimes with FleCSI2026-03-05T16:44:53ZWriting efficient distributed code remains a labor-intensive and complex endeavor. To simplify application development, the Flexible Computational Science Infrastructure (FleCSI) framework offers a user-oriented, high-level programming interface that is built upon a task-based runtime model. Internally, FleCSI integrates state-of-the-art parallelization backends, including MPI and the asynchronous many-task runtimes (AMTRs) Legion and HPX, enabling applications to fully leverage asynchronous parallelism. In this work, we benchmark two applications using FleCSI's three backends on up to 1024 nodes, intending to quantify the advantages and overheads introduced by the AMTR backends. As representative applications, we select a simple Poisson solver and the multidimensional radiation hydrodynamics code HARD. In the communication-focused Poisson solver benchmark, FleCSI achieves over 97% parallel efficiency using the MPI backend under weak scaling on up to 131072 cores, indicating that only minimal overhead is introduced by its abstraction layer. While the Legion backend exhibits notable overheads and scaling limitations, the HPX backend introduces only marginal overhead compared to MPI+Kokkos. However, the scalability of the HPX backend is currently limited due to the usage of non-optimized HPX collective operations. In the computation-focused radiation hydrodynamics benchmarks, the performance gap between the MPI and HPX backends fades. 
On fewer than 64 nodes, the HPX backend outperforms MPI+Kokkos, achieving an average speedup of 1.31 under weak scaling and up to 1.27 under strong scaling. For the hydrodynamics-only HARD benchmark, the HPX backend demonstrates superior performance on fewer than 32 nodes, achieving speedups of up to 1.20 relative to MPI and up to 1.64 relative to MPI+Kokkos.2026-03-05T16:44:53Z10 pages, 7 figures, 1 table, 28th Workshop on Advances in Parallel and Distributed Computational ModelsAlexander StrackHartmut KaiserDirk Pflügerhttp://arxiv.org/abs/2602.15356v2Co-Design and Evaluation of a CPU-Free MPI GPU Communication Abstraction and Implementation2026-03-05T16:35:20ZRemoving the CPU from the communication fast path is essential to efficient GPU-based ML and HPC application performance. However, existing GPU communication APIs either continue to rely on the CPU for communication or rely on APIs that place significant synchronization burdens on programmers. In this paper we describe the design, implementation, and evaluation of an MPI-based GPU communication API enabling easy-to-use, high-performance, CPU-free communication. This API builds on previously proposed MPI extensions and leverages HPE Slingshot 11 network card capabilities. We demonstrate the utility and performance of the API by showing how the API naturally enables CPU-free gather/scatter halo exchange communication primitives in the Cabana/Kokkos performance portability framework, and through a performance comparison with Cray MPICH on the Frontier and Tuolumne supercomputers. Results from this evaluation show up to a 50% reduction in medium message latency in simple GPU ping-pong exchanges and a 28% speedup improvement when strong scaling a halo-exchange benchmark to 8,192 GPUs of the Frontier supercomputer.2026-02-17T04:50:06ZPatrick G. BridgesUniversity of New MexicoDerek SchaferUniversity of New MexicoJack LangeOak Ridge National LaboratoryJames B. 
WhiteOak Ridge National LaboratoryAnthony SkjellumTennessee Technological UniversityEvan SuggsTennessee Technological UniversityThomas HinesTennessee Technological UniversityPurushotham BangaloreUniversity of AlabamaMatthew G. F. DosanjhSandia National LaboratoriesWhit SchonbeinSandia National Laboratorieshttp://arxiv.org/abs/2603.05241v1A monitoring system for collecting and aggregating metrics from distributed clouds2026-03-05T14:56:37ZApplications requiring real-time processing of large volumes of data have been the main driver for rethinking the traditional cloud, giving rise to novel cloud models. Distributed cloud (DC) is a model that allows users to dynamically create and dispose of strategically located ad-hoc clouds that contain resources best tailored to their needs. It is essential for this model to provide a high degree of observability for it to be viable in real-world scenarios. In this paper, we present the design and implementation of a monitoring system that collects metrics from DCs and makes them accessible to diverse clients. Agents running on nodes are responsible for collecting machine-, container-, and application-level metrics. During the health-check protocol, that data is transferred from the node to the DC's control plane running inside the cloud. There, it is persisted and served via multiple APIs, including a streaming API. 
Moreover, node metrics are aggregated for every DC in order to provide a more comprehensive view of the system's state.2026-03-05T14:56:37Z2025 IEEE 23rd Jubilee International Symposium on Intelligent Systems and Informatics (SISY)Tamara RankovićMateja RilakJanko RakonjacMiloš Simić10.1109/SISY67000.2025.11205413http://arxiv.org/abs/2407.15738v6Parallel Split Learning with Global Sampling2026-03-05T14:34:53ZParallel split learning (PSL) suffers from two intertwined issues: the effective batch size grows with the number of clients, and data that is not identically and independently distributed (non-IID) skews global batches. We present parallel split learning with global sampling (GPSL), a server-driven scheme that fixes the global batch size while computing per-client batch-size schedules using pooled-level proportions. The actual samples are drawn locally without replacement by each selected client. This eliminates per-class rounding, decouples the effective batch from the client count, and makes each global batch distributionally equivalent to centralized uniform sampling without replacement. Consequently, we obtain finite-population deviation guarantees via Serfling's inequality, yielding a zero rounding bias compared to local sampling schemes. GPSL is a drop-in replacement for PSL with negligible overhead and scales to large client populations. In extensive experiments on CIFAR-10/100 and ResNet-18/34 under non-IID splits, GPSL stabilizes optimization and achieves centralized-like accuracy, while fixed local batching trails by up to 60%. Furthermore, GPSL shortens training time by avoiding inflation of training steps induced by data-depletion. These findings suggest GPSL is a promising and scalable approach for learning in resource-constrained environments.2024-07-22T15:41:23ZAccepted at the 2025 IEEE 3rd International Conference on Foundation and Large Language Models (FLLM). 
This version corresponds to the accepted manuscriptMohammad KohankhakiAhmad AyadMahdi BarhoushAnke Schmeink10.1109/FLLM67465.2025.11391108http://arxiv.org/abs/2603.05217v1Scaling Real-Time Traffic Analytics on Edge-Cloud Fabrics for City-Scale Camera Networks2026-03-05T14:30:10ZReal-time city-scale traffic analytics requires processing 100s-1000s of CCTV streams under strict latency, bandwidth, and compute limits. We present a scalable AI-driven Intelligent Transportation System (AIITS) designed to address multi-dimensional scaling on an edge-cloud fabric. Our platform transforms live multi-camera video feeds into a dynamic traffic graph through a DNN inferencing pipeline, complemented by real-time nowcasting and short-horizon forecasting using Spatio-Temporal GNNs. Using a testbed to validate in a Bengaluru neighborhood, we ingest 100+ RTSP feeds from Raspberry Pis, while Jetson Orin edge accelerators perform high-throughput detection and tracking, producing lightweight flow summaries for cloud-based GNN inference. A capacity-aware scheduler orchestrates load-balancing across heterogeneous devices to sustain real-time performance as stream counts increase. To ensure continuous adaptation, we integrate SAM3 foundation-model assisted labeling and Continuous Federated Learning to update DNN detectors on the edge. Experiments show stable ingestion up to 2000 FPS on Jetson Orins, low-latency aggregation, and accurate and scalable ST-GNN forecasts for up to 1000 streams. A planned live demonstration will scale the full pipeline to 1000 streams, showcasing practical, cross-fabric scalability.2026-03-05T14:30:10ZAccepted at TCSC SCALE Challenge 2026. 
To appear in the Proceedings of IEEE/ACM CCGRID Workshops, Sydney, 2026Akash SharmaPranjal NamanRoopkatha BanerjeePriyanshu PansariSankalp GawaliMayank AryaSharath ChandraArun JosephrajRakshit RameshPunit RathoreAnirban ChakrabortyRaghu KrishnapuramVijay KovvaliYogesh Simmhanhttp://arxiv.org/abs/2603.05118v1Leveraging Structural Knowledge for Solving Election in Anonymous Networks with Shared Randomness2026-03-05T12:44:24ZWe study the classical Election problem in anonymous networks, where solutions can rely on the use of random bits, which may be either shared or unshared among nodes. We provide a complete characterization of the conditions under which a randomized Election algorithm exists, for arbitrary structural knowledge. Our analysis considers both Las Vegas and Monte Carlo randomized algorithms, under the assumptions of shared and unshared randomness. In our setting, random sources are considered shared if the output bits are identical across specific subsets of nodes. The algorithms and impossibility proofs are extensions of those of [5] for the deterministic setting. Our results are a complete generalization of those from [8]. Moreover, as applications, we consider several specific types of knowledge: no knowledge, a bound on the size, a bound on the number of nodes sharing a source, the size, or the full topology of the network. For each of them, we show how the general characterizations apply, showing they actually correspond to classes of structural knowledge. We also describe how randomized Election algorithms from the literature fit in this landscape.
We therefore provide a comprehensive picture illustrating how knowledge influences the computability of the Election problem in arbitrary anonymous graphs with shared randomness.2026-03-05T12:44:24ZFull version of Sirocco'2026Jérémie ChalopinEmmanuel Godardhttp://arxiv.org/abs/2603.05087v1PromptTuner: SLO-Aware Elastic System for LLM Prompt Tuning2026-03-05T11:58:55ZPrompt tuning has become a prominent strategy for enhancing the performance of Large Language Models (LLMs) on downstream tasks. Many IT enterprises now offer Prompt-Tuning-as-a-Service to fulfill the growing demand for prompt tuning LLMs on downstream tasks. Their primary objective is to satisfy users' Service Level Objectives (SLOs) while reducing resource provisioning costs. Nevertheless, our characterization analysis for existing deep learning resource management systems reveals that they are insufficient to optimize these objectives for LLM prompt tuning workloads.
In this paper, we introduce PromptTuner, an SLO-aware elastic system to optimize LLM prompt tuning. It contains two innovations. (1) We design a Prompt Bank to identify efficient initial prompts to expedite the convergence of prompt tuning. (2) We develop a Workload Scheduler to enable fast resource allocation to reduce SLO violations and resource costs. In our evaluation, PromptTuner reduces SLO violations by 4.0x and 7.9x, and lowers costs by 1.6x and 4.5x, compared to INFless and ElasticFlow respectively.2026-03-05T11:58:55ZWei GaoPeng SunDmitrii UstiugovTianwei ZhangYonggang Wenhttp://arxiv.org/abs/2602.13046v2Classification of Local Optimization Problems in Directed Cycles2026-03-05T09:29:48ZWe present a complete classification of the distributed computational complexity of local optimization problems in directed cycles for both the deterministic and the randomized LOCAL model. We show that for any local optimization problem $Π$ (that can be of the form min-sum, max-sum, min-max, or max-min, for any local cost or utility function over some finite alphabet), and for any constant approximation ratio $α$, the task of finding an $α$-approximation of $Π$ in directed cycles has one of the following complexities:
1. $O(1)$ rounds in deterministic LOCAL, $O(1)$ rounds in randomized LOCAL,
2. $Θ(\log^* n)$ rounds in deterministic LOCAL, $O(1)$ rounds in randomized LOCAL,
3. $Θ(\log^* n)$ rounds in deterministic LOCAL, $Θ(\log^* n)$ rounds in randomized LOCAL,
4. $Θ(n)$ rounds in deterministic LOCAL, $Θ(n)$ rounds in randomized LOCAL.
Moreover, for any given $Π$ and $α$, we can determine the complexity class automatically, with an efficient (centralized, sequential) meta-algorithm, and we can also efficiently synthesize an asymptotically optimal distributed algorithm.
Before this work, similar results were only known for local search problems (e.g., locally checkable labeling problems). The family of local optimization problems is a strict generalization of local search problems, and it contains numerous commonly studied distributed tasks, such as the problems of finding approximations of the maximum independent set, minimum vertex cover, minimum dominating set, and minimum vertex coloring.2026-02-13T16:03:14Z26 pages, 2 figuresThomas BoudierFabian KuhnAugusto ModaneseRonja StimpertJukka Suomelahttp://arxiv.org/abs/2603.04235v22-Coloring Cycles in One Round2026-03-05T09:04:42ZWe show that there is a one-round randomized distributed algorithm that can 2-color cycles such that the expected fraction of monochromatic edges is less than 0.24118. We also show that a one-round algorithm cannot achieve a fraction less than 0.23879. Before this work, the best upper and lower bounds were 0.25 and 0.2. Our proof was largely discovered and developed by large language models, and both the upper and lower bounds have been formalized in Lean 4.2026-03-04T16:18:30Z9 pages, 3 figuresMaxime FlinAlesya RaevskayaRonja StimpertJukka SuomelaQingxin Yanghttp://arxiv.org/abs/2603.04937v1FluxSieve: Unifying Streaming and Analytical Data Planes for Scalable Cloud Observability2026-03-05T08:36:59ZDespite many advances in query optimization, indexing techniques, and data storage, modern data platforms still face difficulties in delivering robust query performance under high concurrency and computationally intensive queries. This challenge is particularly pronounced in large-scale observability platforms handling high-volume, high-velocity data records. For instance, recurrent, expensive filtering queries at query time impose substantial computational and storage overheads in the analytical data plane. 
In this paper, we propose FluxSieve, a unified architecture that reconciles traditional pull-based query processing with push-based stream processing by embedding a lightweight in-stream precomputation and filtering layer directly into the data ingestion path. This avoids the complexity and operational burden of running queries in dedicated stream processing frameworks. Concretely, this work (i) introduces a foundational architecture that unifies streaming and analytical data planes via in-stream filtering and records enrichment, (ii) designs a scalable multi-pattern matching mechanism that supports concurrent evaluation and on-the-fly updates of filtering rules with minimal per-record overhead, (iii) demonstrates how to integrate this ingestion-time processing with two open-source analytical systems -- Apache Pinot as a Real-Time Online Analytical Processing (RTOLAP) engine and DuckDB as an embedded analytical database, and (iv) performs comprehensive experimental evaluation of our approach. Our evaluation across different systems, query types, and performance metrics shows up to orders-of-magnitude improvements in query performance at the cost of negligible additional storage and very low computational overhead.2026-03-05T08:36:59ZAdriano VogelSören HenningOtmar Ertl