https://arxiv.org/api/oKZFjvbazdUtLj+rmhUiVrYw/2g
2026-04-14T16:26:41Z

http://arxiv.org/abs/2604.09591v1
Simplicity Scales
2026-03-04T03:25:11Z
The dominant data interchange formats encode integers using a variable number of bytes or represent floating-point numbers as variable-length UTF-8 strings. The decoder must inspect each byte for a continuation bit or parse each character individually, producing data-dependent branches that stall modern CPU pipelines. Protocol Buffers pays this cost on every integer, field tag, and length prefix. JSON pays it on every value.
We present Bebop, a serialization format where every data type uses a fixed number of bytes. A 32-bit integer is always four bytes. Decoding becomes a single memory read with no conditionals. Across 19 decode workloads, Bebop decodes 9--213$\times$ faster than Protocol Buffers. On a 1536-dimension embedding vector, Bebop decodes in 2.8 nanoseconds versus 111 nanoseconds for Protocol Buffers and 4.69 microseconds for simdjson, a 1,675$\times$ gap. On records above 64 KB, the decoder achieves 86% of peak memory bandwidth. The CPU is no longer the bottleneck.
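The contrast between the two decoding styles can be sketched in a few lines of Python. This is an illustration only, not Bebop's or Protocol Buffers' actual implementation: the function names are hypothetical, the varint follows the base-128 continuation-bit scheme the abstract describes, and little-endian byte order for the fixed-width case is an assumption.

```python
import struct


def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer as a base-128 varint (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("only unsigned integers are supported in this sketch")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)  # final byte: continuation bit clear
            return bytes(out)


def decode_varint(buf: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode a varint; the loop exit depends on the data in every byte."""
    result = 0
    shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # data-dependent branch on the continuation bit
            return result, offset
        shift += 7


def decode_fixed_u32(buf: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode a fixed-width 32-bit integer: one read, no per-byte inspection.

    Little-endian order is assumed here for illustration.
    """
    (value,) = struct.unpack_from("<I", buf, offset)
    return value, offset + 4
```

The varint decoder cannot know how many bytes to consume until it has examined each one, whereas the fixed-width decoder always advances by four bytes regardless of the value.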
We also present a transport-agnostic RPC protocol built on the same wire format. The protocol introduces batch pipelining, where dependent cross-service calls execute in a single round trip with server-side dependency resolution. It deploys over HTTP/1.1, HTTP/2, and binary transports without proxies, removing the HTTP/2 requirement that limits gRPC on serverless platforms and in browsers.
2026-03-04T03:25:11Z
Andrew Sampson (6OVER3 Institute), Yuta Saito (GoodNotes), Ronny Chan (6OVER3 Institute)

http://arxiv.org/abs/2603.03592v1
SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training
2026-03-03T23:51:10Z
Decentralized training introduces critical security risks when executed across untrusted, geographically distributed nodes. While existing Byzantine-tolerant literature addresses data parallel (DP) training through robust aggregation methods, pipeline parallelism (PP) presents fundamentally distinct challenges. In PP, model layers are distributed across workers where the activations and their gradients flow between stages rather than being aggregated, making traditional DP approaches inapplicable. We propose SENTINEL, a verification mechanism for PP training without computation duplication. SENTINEL employs lightweight momentum-based monitoring using exponential moving averages (EMAs) to detect corrupted inter-stage communication. Unlike existing Byzantine-tolerant approaches for DP that aggregate parameter gradients across replicas, our approach verifies sequential activation/gradient transmission between layers. We provide theoretical convergence guarantees for this new setting that recover classical convergence rates when relaxed to standard training.
Experiments demonstrate successful training of up to 4B-parameter LLMs across untrusted distributed environments with up to 176 workers while maintaining model convergence and performance.
2026-03-03T23:51:10Z
70 pages, 22 figures, 20 tables
Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin P Hewa Koneputugodage, Gil Avraham, Yan Zuo, Violetta Shevchenko, Alexander Long

http://arxiv.org/abs/2510.12469v2
Proof of Cloud: Data Center Execution Assurance for Confidential VMs
2026-03-03T21:53:51Z
Confidential Virtual Machines (CVMs) protect data in use by running workloads within hardware-enforced Trusted Execution Environments (TEEs). However, existing CVM attestation mechanisms only certify what code is running, not where it is running. Commercial TEEs mitigate passive physical attacks through memory encryption but explicitly exclude active hardware tampering (memory interposers, physical side channels, ...). Yet current attestations provide no cryptographic evidence that a CVM executes on hardware residing within a trusted data center where such attacks would not take place. This gap enables proxy attacks in which valid attestations are combined across machines to falsely attest trusted execution.
To bridge this gap, we introduce Data Center Execution Assurance (DCEA), a design that generates a cryptographic Proof of Cloud by binding CVM attestation to platform-level Trusted Platform Module (TPM) evidence. DCEA combines two independent roots of trust, the TEE manufacturer and the infrastructure provider, by cross-linking runtime TEE measurements with the vTPM-measured boot CVM state. This binding ensures that CVM execution, vTPM quotes, and platform provenance all originate from the same physical chassis.
We formalize the environment's provenance and show that DCEA prevents advanced relay attacks, including a novel mix-and-match proxy attack. Using the AGATE framework in the Universal Composability model, we prove that DCEA emulates an ideal location-aware TEE even under a malicious host software stack. We implement DCEA on Google Cloud bare-metal Intel TDX instances using Intel TXT and evaluate its performance, demonstrating practical overheads and deployability. DCEA refines the CVM threat model and enables verifiable execution-location guarantees for privacy-sensitive workloads.
2025-10-14T13:01:48Z
Filip Rezabek, Moe Mahhouk, Andrew Miller, Quintus Kilbourn, Georg Carle, Jonathan Passerat-Palmbach

http://arxiv.org/abs/2603.03470v1
Bisynchronous FIFOs and the FITO Category Mistake: Silicon-Proven Interaction Primitives for Distributed Coordination
2026-03-03T19:28:59Z
Bisynchronous FIFOs -- hardware buffers that mediate data transfer between independent clock domains without a shared global timebase -- have been designed, formally verified, and commercially deployed in silicon for over four decades. We survey this literature from Chapiro's 1984 GALS thesis through Cummings's Gray-code pointer techniques, Chelcea and Nowick's mixed-timing interfaces, Greenstreet's STARI protocol, and the 2015 NVIDIA pausible bisynchronous FIFO, and argue that this body of work constitutes a silicon-proven existence proof against the Forward-In-Time-Only (FITO) assumption that pervades distributed systems. The central claim is that interaction-based synchronization primitives -- handshakes, mutual exclusion, and causal flow control -- can replace timestamp-based coordination at the most demanding levels of digital engineering, directly undermining the FITO assumption in protocols such as PTP, TSN, and conventional Ethernet.
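As background for the Gray-code pointer techniques mentioned above, the property that makes Gray code useful for pointers crossing clock domains is that successive values differ in exactly one bit, so a pointer sampled mid-transition is off by at most one position. The following software sketch illustrates that property only; the function names are hypothetical and it does not model any particular FIFO design from the surveyed literature.

```python
def to_gray(n: int) -> int:
    """Convert a binary counter value to reflected-binary Gray code."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Convert a Gray-code value back to binary by XOR-folding its shifts."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bits_changed(a: int, b: int) -> int:
    """Count the bit positions in which two values differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    depth = 16  # a 4-bit pointer, chosen only for illustration
    for i in range(depth):
        nxt = (i + 1) % depth
        print(
            f"{i:2d} -> {nxt:2d}: "
            f"binary flips {bits_changed(i, nxt)} bit(s), "
            f"Gray flips {bits_changed(to_gray(i), to_gray(nxt))} bit(s)"
        )
```

A plain binary counter can flip several bits in one increment (for example 7 to 8 flips four), which is what makes sampling it from an unrelated clock domain unsafe.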
We draw a structural parallel between on-chip bisynchronous coordination and the Open Atomic Ethernet (OAE) architecture, and identify the handshake -- not the timestamp -- as the fundamental primitive for coordination between independent causal domains.
2026-03-03T19:28:59Z
Paul Borrill

http://arxiv.org/abs/2603.00766v2
Black Hole Search: Dynamics, Distribution, and Emergence
2026-03-03T19:25:54Z
A black hole is a malicious node in a graph that destroys resources entering into it without leaving any trace. The problem of Black Hole Search (BHS) using mobile agents requires that at least one agent survives and terminates after locating the black hole. Recently, this problem has been studied on 1-bounded 1-interval connected dynamic graphs \cite{BHS_gen}, where there is a footprint graph, and at most one edge can disappear from the footprint in a round, provided that the graph remains connected. In this setting, the authors in \cite{BHS_gen} proposed an algorithm that solves the BHS problem when all agents start from a single node (rooted initial configuration). They also proved that at least $2δ_{BH} + 1$ agents are necessary to solve the problem when agents are initially placed arbitrarily across the nodes of the graph (scattered initial configuration), where $δ_{BH}$ denotes the degree of the black hole. In this work, we present an algorithm that solves the BHS problem using $2δ_{BH} + 17$ initially scattered agents. Our result matches asymptotically with the rooted algorithm of \cite{BHS_gen} under the same model assumptions.
Further, we study the Eventual Black Hole Search (\textsc{Ebhs}) problem, in which the black hole may appear at any node and at any time during the execution of the algorithm, destroying all agents located on that node at the time of its appearance. However, the black hole cannot emerge at the home base in round~0, where the home base is the node at which all agents are initially co-located. Once the black hole appears, it remains active at that node for the rest of the execution. This problem has been studied on static rings~\cite{Bonnet25}; here we extend it to arbitrary static graphs and provide a solution using four agents. Moreover, it does not require any knowledge of global parameters or additional model assumptions.
2026-02-28T18:22:48Z
Tanvir Kaur, Ashish Saxena, Partha Sarathi Mandal, Kaushik Mondal

http://arxiv.org/abs/2603.03089v1
Serverless Abstractions for Short-Running, Lightweight Streams
2026-03-03T15:31:42Z
Serverless computing and stream processing represent two dominant paradigms for event-driven data processing, yet both make assumptions that render them inefficient for short-running, lightweight, and unpredictable streams that require stateful processing. We propose stream functions as a novel extension of the Function-as-a-Service model that treats short streams as the unit of execution, state, and scaling. Stream functions process streams via an iterator-based interface, enabling seamless inter-event logic while retaining the elasticity and scale-to-zero capabilities offered by serverless platforms. Our evaluation shows that stream functions reduce the processing overhead by ~99% compared to a mature stream processing engine in a video-processing use case.
By providing comparable performance to serverless functions with stream semantics, stream functions provide an effective and efficient abstraction for a class of workloads underserved by existing models.
2026-03-03T15:31:42Z
Accepted for publication at the 4th Workshop on SErverless Systems, Applications and MEthodologies (SESAME '26)
Natalie Carl, Niklas Kowallik, Constantin Stahl, Trever Schirmer, Tobias Pfandzelter, David Bermbach

http://arxiv.org/abs/2603.03023v1
Dynamic Contract Analysis for Parallel Programming Models
2026-03-03T14:15:29Z
Parallel programming in high-performance computing depends on low-level APIs such as MPI, requiring users to manage synchronization and resources manually. Several correctness checking tools exist to support bug-free code development, though most target a single programming model, limiting their applicability. Our previous work, the static analysis tool CoVer, leverages a contract-based approach enabling users to specify custom error-checking rules and support emerging or unconventional programming models without requiring extensive new tooling. However, static analysis cannot fully reason about runtime-dependent behavior such as pointer aliasing or indirect control flow. To address this, we present CoVer-Dynamic, a dynamic analysis extension that reuses CoVer's contract language to provide a unified static-dynamic verification framework. By enforcing the same contracts at runtime, CoVer-Dynamic improves classification accuracy and eliminates false positives on standardized MPI and OpenSHMEM benchmarks, while detecting errors beyond the reach of static analysis alone. Our evaluation shows that CoVer-Dynamic consistently outperforms the state-of-the-art correctness checker MUST, averaging a 2x speedup. Finally, our results show limitations in the expressiveness of the contract language, motivating future work to support more error classes.
2026-03-03T14:15:29Z
A peer-reviewed version is to be published by IEEE as part of the IPDPS HIPS workshop proceedings.
This is the originally submitted article.
Yussur Mustafa Oraji, Alexander Hück, Christian Bischof

http://arxiv.org/abs/2512.08725v2
Spatio-Temporal Shifting to Reduce Carbon, Water, and Land-Use Footprints of Cloud Workloads
2026-03-03T14:14:03Z
In this paper, we investigate the potential of spatial and temporal cloud workload shifting to reduce carbon, water, and land use footprints. Specifically, we perform a simulation study leveraging publicly available data on the cloud infrastructure of major providers (AWS and Azure) as well as real-world workload traces (big data analytics and FaaS) and grid mix data to consider two different scenarios. Our simulation results indicate that spatial shifting can substantially lower carbon, water, and land use footprints. In the FaaS applications, spatiotemporal workload shifting achieves carbon savings of up to 85%, water savings of around 50%, and reductions in land use of up to 45%, all while optimizing for the respective factors. Mixed optimization yields results comparable to those of land use alone. For big data workloads, spatiotemporal shifting delivers reductions of up to 45% in carbon emissions, 40% in water consumption, and nearly 40% in land use when optimized for the respective factors. Temporal shifting also decreases the footprint, though to a lesser extent. When applied together, the two strategies yield the greatest overall reduction, driven mainly by spatial shifting, with temporal adjustments providing an additional, incremental benefit.
Sensitivity analysis demonstrates that such shifting is robust to prediction errors in grid mix data and to variations across different seasons.
2025-12-09T15:39:06Z
This is a pre-print of our paper currently under review
Giulio Attenni, Youssef Moawad, Novella Bartolini, Lauritz Thamsen

http://arxiv.org/abs/2603.03007v1
Breaking the Prototype Bias Loop: Confidence-Aware Federated Contrastive Learning for Highly Imbalanced Clients
2026-03-03T14:01:08Z
Local class imbalance and data heterogeneity across clients often trap prototype-based federated contrastive learning in a prototype bias loop: biased local prototypes induced by imbalanced data are aggregated into biased global prototypes, which are repeatedly reused as contrastive anchors, accumulating errors across communication rounds. To break this loop, we propose Confidence-Aware Federated Contrastive Learning (CAFedCL), a novel framework that improves the prototype aggregation mechanism and strengthens the contrastive alignment guided by prototypes. CAFedCL employs a confidence-aware aggregation mechanism that leverages predictive uncertainty to downweight high-variance local prototypes. In addition, generative augmentation for minority classes and geometric consistency regularization are integrated to stabilize the structure between classes. From a theoretical perspective, we provide an expectation-based analysis showing that our aggregation reduces estimation variance, thereby bounding global prototype drift and ensuring convergence.
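One generic way to downweight high-variance local prototypes is inverse-variance weighting, sketched below. This is an assumption made for illustration: the abstract does not specify CAFedCL's weighting rule, and the function name, the per-client scalar variance, and the `eps` parameter are all hypothetical.

```python
def aggregate_prototypes(prototypes, variances, eps=1e-8):
    """Combine per-client prototypes of one class into a global prototype.

    prototypes: list of equal-length feature vectors, one per client.
    variances:  one non-negative uncertainty estimate per client (assumed scalar).
    Each client's weight is 1 / (variance + eps), so noisier clients count less.
    """
    if len(prototypes) != len(variances) or not prototypes:
        raise ValueError("need one variance per prototype and at least one client")
    weights = [1.0 / (v + eps) for v in variances]
    total = sum(weights)
    dim = len(prototypes[0])
    return [
        sum(w * p[d] for w, p in zip(weights, prototypes)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    local = [[0.0, 0.0], [10.0, 10.0]]
    # Equal confidence gives the plain mean of the two prototypes.
    print(aggregate_prototypes(local, [1.0, 1.0]))
    # Lower confidence in the second client pulls the result toward the first.
    print(aggregate_prototypes(local, [1.0, 9.0]))
```

With equal variances this reduces to a plain average; as one client's variance grows, its prototype contributes less to the global anchor.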
Extensive experiments under varying levels of class imbalance and data heterogeneity demonstrate that CAFedCL consistently outperforms representative federated baselines in both accuracy and client fairness.
2026-03-03T14:01:08Z
Tian-Shuang Wu, Shen-Huan Lyu, Ning Chen, Yi-Xiao He, Bing Tang, Baoliu Ye, Qingfu Zhang

http://arxiv.org/abs/2603.02971v1
Scalable Mesh Coupling for Atmospheric Wave Simulation
2026-03-03T13:25:39Z
We describe the application of a scalable algorithm for interpolating solution data in the overlapping mesh region of two solvers. This feature is essential to obtain a globally consistent solution for in-situ coupled atmospheric wave simulation. We provide timings and discuss a real-world application run.
2026-03-03T13:25:39Z
5 pages, 6 figures, presented at SIAM International Meshing Roundtable 2026
Hannes Brandt, Tim Griesbach, Matthew Zettergren, Scott Aiton, Jonathan Snively, Donna Calhoun, Carsten Burstedde

http://arxiv.org/abs/2603.02885v1
MuxTune: Efficient Multi-Task LLM Fine-Tuning in Multi-Tenant Datacenters via Spatial-Temporal Backbone Multiplexing
2026-03-03T11:34:49Z
Parameter-Efficient Fine-Tuning (PEFT) is widely applied as the backend of fine-tuning APIs for large language model (LLM) customization in datacenters. Service providers deploy separate instances for individual PEFT tasks, giving rise to prominent resource inefficiencies, including (1) GPU underutilization from small-scale, PEFT-native operators and (2) device stalls from communication delays and data dependencies in parallelized execution. To address these issues, this paper presents MuxTune, a fine-tuning system that enables resource-efficient concurrent execution of multiple PEFT tasks. The key idea is to multiplex the backbone across independent tasks in a spatial-temporal manner for improved utilization and reduced stalls.
Building on flexible, modularized backbone sharing via unified PEFT representations, MuxTune proposes a hierarchical co-scheduling scheme with task, operator, and data-level optimizations. Specifically, it fuses tasks through a hybrid of spatial and temporal multiplexing, and orchestrates multi-task operator execution in two-tiered hybrid parallelism. Additionally, MuxTune employs chunk-based data alignment to mitigate inter-task ineffective tokens. Experimental results demonstrate that MuxTune achieves up to $2.33\times$ higher throughput and $5.29\times$ memory reduction compared to three state-of-the-art baselines.
2026-03-03T11:34:49Z
Chunyu Xue, Yi Pan, Weihao Cui, Quan Chen, Shulai Zhang, Bingsheng He, Minyi Guo

http://arxiv.org/abs/2512.22420v4
Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving
2026-03-03T09:33:44Z
Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Existing speculative decoding methods use fixed lengths and cannot adapt to workload changes or decide when to stop speculation. The cost of restarting speculative inference also remains unquantified. Under high load, the benefit of speculation diminishes, while retaining the draft model reduces KV-cache capacity, limiting batch size and degrading throughput. To overcome this, we propose Nightjar, a resource-aware adaptive speculative framework. It first adjusts to the request load by dynamically selecting the optimal speculative length for different batch sizes. Crucially, Nightjar proactively disables speculative decoding when the MAB planner determines that speculation is no longer beneficial, and during the disabled phase, offloads the draft model to the CPU only under GPU memory pressure.
This reclaims memory for the KV cache, thereby facilitating larger batch sizes and maximizing overall system throughput. Experiments show that Nightjar achieves on average 27.29% higher throughput and up to 20.18% lower latency compared to standard speculative decoding under dynamic request arrival rates in real-time LLM serving scenarios.
2025-12-27T00:57:55Z
Rui Li, Zhaoning Zhang, Libo Zhang, Huaimin Wang, Xiang Fu, Zhiquan Lai

http://arxiv.org/abs/2510.14686v2
xLLM Technical Report
2026-03-03T07:38:09Z
We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework designed for high-performance, large-scale enterprise-grade serving, with deep optimizations for diverse AI accelerators. To address these challenges, xLLM builds a novel decoupled service-engine architecture. At the service layer, xLLM-Service features an intelligent scheduling module that efficiently processes multimodal requests and co-locates online and offline tasks through unified elastic scheduling to maximize cluster utilization. This module also relies on a workload-adaptive dynamic Prefill-Decode (PD) disaggregation policy and a novel Encode-Prefill-Decode (EPD) disaggregation policy designed for multimodal inputs. Furthermore, it incorporates a distributed architecture to provide global KV Cache management and robust fault-tolerant capabilities for high availability. At the engine layer, xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources. This is achieved through comprehensive multi-layer execution pipeline optimizations, an adaptive graph mode, and xTensor memory management. xLLM-Engine further integrates algorithmic enhancements such as optimized speculative decoding and dynamic EPLB, collectively serving to substantially boost throughput and inference efficiency. Extensive evaluations demonstrate that xLLM delivers significantly superior performance and resource efficiency.
Under identical TPOT constraints, xLLM achieves throughput up to 1.7x that of MindIE and 2.2x that of vLLM-Ascend with Qwen-series models, while maintaining an average throughput of 1.7x that of MindIE with Deepseek-series models. The xLLM framework is publicly available at https://github.com/jd-opensource/xllm and https://github.com/jd-opensource/xllm-service.
2025-10-16T13:53:47Z
39 pages
Tongxuan Liu, Tao Peng, Peijun Yang, Xiaoyang Zhao, Xiusheng Lu, Weizhe Huang, Zirui Liu, Xiaoyu Chen, Zhiwei Liang, Jun Xiong, Donghe Jin, Minchao Zhang, Jinrong Guo, Yingxu Deng, Xu Zhang, Xianzhe Dong, Siqi Wang, Siyu Wu, Yu Wu, Zihan Tang, Yuting Zeng, Yanshu Wang, Jinguang Liu, Meng Kang, Menxin Li, Yunlong Wang, Yiming Liu, Xiaolong Ma, Yifan Wang, Yichen Zhang, Jinrun Yin, Keyang Zheng, Jiawei Yin, Jun Zhang, Ziyue Wang, Xiaobo Lin, Liangyu Liu, Liwei Lan, Yang Liu, Chunhua Peng, Han Liu, Songcheng Ren, Xuezhu Wang, Yunheng Shen, Yi Wang, Guyue Liu, Yitao Hu, Hui Chen, Tong Yang, Hailong Yang, Jing Li, Guiguang Ding, Ke Zhang

http://arxiv.org/abs/2603.02661v1
Blockchain Communication Vulnerabilities
2026-03-03T06:50:47Z
Blockchains are diverse in the way they handle communications between their nodes to disseminate information, mitigate attacks, and agree on the next block. While security vulnerabilities have been identified, they rely on an attack custom-made for a specific blockchain communication protocol. To our knowledge, the vulnerabilities of multiple blockchain communication protocols to adversarial conditions have never been compared.
In this paper, we empirically compare the vulnerabilities of the communication protocols of five modern in-production blockchains, Algorand, Aptos, Avalanche, Redbelly and Solana, when attacked in five different ways. We conclude that Algorand is vulnerable to packet loss attacks, Aptos is vulnerable to targeted load attacks and leader isolation attacks, Avalanche is vulnerable to transient failure attacks, Redbelly's performance is impacted by packet loss attacks and Solana is vulnerable to stopping attacks and leader isolation attacks. Our system is open source.
2026-03-03T06:50:47Z
17 pages, 11 figures
Andrei Lebedev, Vincent Gramoli

http://arxiv.org/abs/2603.03383v1
Accelerating OpenPangu Inference on NPU via Speculative Decoding
2026-03-03T06:50:31Z
To mitigate the Memory Wall bottleneck encountered by Large Language Models (LLMs) during inference on \textbf{NPU} hardware, and to address the scarcity of native support for mainstream speculative decoding algorithms on domestic infrastructure, this study presents an end-to-end speculative inference acceleration scheme for OpenPangu-7B.
2026-03-03T06:50:31Z
Yuntao Dai, Jing Wu, Hang Gu, Teng Wang