https://arxiv.org/api/RuVs8oIw8PDY/YBEwx68btyJEYg 2026-06-10T21:04:25Z 28838 345 15 http://arxiv.org/abs/2512.04320v2 VLCs: Managing Parallelism with Virtualized Libraries 2026-05-22T19:23:20Z

As the complexity and scale of modern parallel machines continue to grow, programmers increasingly rely on composition of software libraries to encapsulate and exploit parallelism. However, many libraries are not designed with composition in mind and assume they have exclusive access to all resources. Using such libraries concurrently can result in contention and degraded performance. Prior solutions involve modifying the libraries or the OS, which is often infeasible. We propose Virtual Library Contexts (VLCs), which are process subunits that encapsulate sets of libraries and associated resource allocations. VLCs control the resource utilization of these libraries without modifying library code. This enables the user to partition resources between libraries to prevent contention, or load multiple copies of the same library to allow parallel execution of otherwise thread-unsafe code within the same process. In this paper, we describe and evaluate C++ and Python prototypes of VLCs. Experiments show VLCs enable a speedup up to 2.85x on benchmarks including applications using OpenMP, OpenBLAS, and LibTorch. Source code of VLCs is available at https://github.com/pecos/Virtual-Library-Context.

2025-12-03T23:11:02Z In Proceedings of the 2025 ACM Symposium on Cloud Computing (SoCC '25) Proceedings of the 2025 ACM Symposium on Cloud Computing (2025) 629-643 Yineng Yan William Ruys Hochan Lee Ian Henriksen Arthur Peters Sean Stephens Bozhi You Henrique Fingler Martin Burtscher Milos Gligoric Keshav Pingali Mattan Erez George Biros Christopher J. Rossbach 10.1145/3772052.3772265 http://arxiv.org/abs/2605.24096v1 The Time is Here for Just-in-Time Systems: Challenges and Opportunities 2026-05-22T18:03:41Z

Core systems like key-value stores have historically taken years to build, and are designed to be general so as to amortize cost across deployments, paying a significant performance cost. We argue that LLM-based coding agents now make a different approach tractable: Just-in-Time Systems, in which the entire system is synthesized from scratch, specialized to the environment, workload, and required system properties. We present a JIT system synthesis pipeline, Jitskit, and explore its effectiveness in synthesizing key-value stores from spec cards that span different YCSB workloads, deployment constraints (e.g., compute resources), and system properties (e.g., consistency and durability). Jitskit iteratively refines a system implementation to match the specification against an evolving evaluation test suite. The resulting synthesized systems are performant, beating comparable state-of-the-art systems on 18 of 18 specs tried, by up to 4.6x over the best off-the-shelf baseline on the most favorable spec. Naively running Claude Code either reward-hacks or underperforms Jitskit by up to 5.4x. We discuss the challenges we overcame in building Jitskit and our key takeaways.

2026-05-22T18:03:41Z preprint Shu Liu Alexander Krentsel Shubham Agarwal Mert Cemri Ziming Mao Soujanya Ponnapalli Alexandros G. Dimakis Sylvia Ratnasamy Matei Zaharia Aditya Parameswaran Ion Stoica http://arxiv.org/abs/2605.23850v1 Enhancing Energy Efficiency in Scientific Workflows through CFD based PIVAEs 2026-05-22T17:04:49Z

The growing complexity and scale of scientific workflows in high performance computing (HPC) environments have led to significant challenges in managing energy consumption without compromising computational performance. Traditional scheduling strategies often fail to account for the complex interplay between thermal dynamics, workload diversity, and system scalability, leading to inefficient and unsustainable energy usage. This paper introduces a novel, scalable, and AI-assisted scheduling framework for optimizing energy consumption in HPC environments without compromising performance. Central to our approach is the integration of Computational Fluid Dynamics (CFD) with a Physics-Informed Variational Autoencoder (PIVAE), enabling the generation of physically realistic synthetic workload data that bridges the gap between thermodynamic behavior and scheduler decision-making in complex, multi-scale HPC environments. By categorizing workflows based on resource utilization profiles, we evaluate multiple scheduling strategies such as Locality Aware and Speculative Aware Scheduling. These workflows, ranging from event reconstruction to anomaly detection, represent diverse computational intensities. Our results show that modest reductions in CPU performance (e.g., to 15%) can yield substantial energy savings (up to 10%) with only minor turnaround time increases (approximately 5-6%), identifying an optimal operational sweet spot. This work demonstrates how physics-informed generative modeling can enable adaptive, sustainable, and data-efficient scheduling for next-generation HPC infrastructures.

2026-05-22T17:04:49Z Ali Zahir Ashiq Anjum Mark Wilkinson Jeyan Thiyagalingam http://arxiv.org/abs/2605.23816v1 SDNator is Not Another SDN Controller: Enabling Extensible Data-Driven Control in Cyber-Physical Systems 2026-05-22T16:16:59Z

An SDN-like centralized control architecture is increasingly popular and has been widely explored in cyber-physical systems (CPS) such as manufacturing, internet-of-things, and autonomous vehicle systems for higher flexibility, programmability and scalability. However, no existing frameworks can offer domain-agnostic, easily extensible support for data-driven CPS applications. In this work, we design, implement, and open-source SDNator, the first framework to enable extensible, data-driven control in CPS. SDNator embraces an application- and data-driven design where applications function as data consumers and producers to collectively define the workflows of the controller. SDNator also incorporates two data store backends to support both event-driven and data-driven programming patterns. Benchmarks show that SDNator is highly scalable, and delivers comparable performance to Ryu, a widely used SDN controller. Moreover, we demonstrate the capabilities and usability of SDNator through our case studies of manufacturing and networking systems. By integrating applications from respective domains, we build different "controllers" for different scenarios. Most notably, we leverage SDNator to implement the first digital-twin-equipped central controller for additive manufacturing fleets. We show through extensive and realistic simulations that SDNator-based scheduling can (1) significantly shorten production time and improve reliability in the presence of anomalies compared to decentralized approaches, and (2) flexibly adjust and optimize production plans upon urgent requests such as producing Personal Protective Equipment during the COVID-19 pandemic.

2026-05-22T16:16:59Z Y. Lin R. Zhang E. Balta X. Zhu J. Zhang K. Barton D. Tilbury Z. Mao http://arxiv.org/abs/2605.23815v1 A Pragmatic Approach to Learned Indexing in RocksDB: Targeted Optimizations with Minimal System Modification 2026-05-22T16:16:52Z

Learned indexes have emerged as a promising alternative to traditional index structures, offering higher throughput and lower memory usage by approximating the cumulative key distribution function with lightweight models. Despite these benefits, adoption in production systems remains limited, partly because learned indexes that support concurrency and persistence as effectively as, e.g., the B+-Tree, do not yet exist, while many research prototypes introduce substantial complexity. In this paper, we investigate whether off-the-shelf learned indexes can be integrated into a production database with minimal storage-engine redesign. Using RocksDB as a case study, we exploit its separation between in-memory Memtables and immutable on-disk files to deploy specialized indexes at each level. We show that directly applying existing learned indexes is insufficient under write-heavy workloads because frequent Memtable replacement prevents models from fully adapting. To address this, we introduce a reuse mechanism that preserves structural knowledge across Memtable instances. At the storage level, we replace RocksDB's disk index with a learned index without modifying the storage layer or read path. We further adapt a read-only learned index to be block-aware, enabling worst-case single-I/O lookups. We implement these techniques in MountDB, an extension of RocksDB. Experiments on large-scale workloads with diverse data distributions and access patterns show up to 1.5X higher write throughput and 2.1X higher read throughput than state-of-the-art systems, demonstrating that established learned indexes can be integrated into production systems with minimal overhead and substantial performance benefits.
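To make the idea of "approximating the cumulative key distribution function with lightweight models" concrete, here is a minimal sketch of a toy learned index. It is not MountDB's design or any published index: the class name `LinearLearnedIndex`, the single linear model, and the error-bounded search window are all illustrative assumptions, and the key set is assumed to be non-empty and numeric.

```python
from bisect import bisect_left


class LinearLearnedIndex:
    """Toy learned index (hypothetical): one linear model plus a bounded local search."""

    def __init__(self, keys):
        self.keys = sorted(keys)  # assumes at least one numeric key
        n = len(self.keys)
        self.lo = self.keys[0]
        hi = self.keys[-1]
        # Linear approximation of the key CDF, scaled to array positions.
        self.slope = (n - 1) / (hi - self.lo) if hi != self.lo else 0.0
        # The worst prediction error on the build set bounds the search window.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        # Predicted position, clamped to the valid index range.
        p = int(round(self.slope * (key - self.lo)))
        return min(max(p, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        p = self._predict(key)
        left = max(p - self.max_err, 0)
        right = min(p + self.max_err + 1, len(self.keys))
        # Binary search only inside the error-bounded window around the prediction.
        i = bisect_left(self.keys, key, left, right)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

The model replaces a tree traversal with one multiplication and a short search; the more closely the keys follow the fitted line, the smaller `max_err` and the search window become.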

2026-05-22T16:16:52Z Shubham Vashisth Olivier Michaud Bettina Kemme Oana Balmau http://arxiv.org/abs/2511.15503v2 DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures 2026-05-22T15:57:54Z

High-performance Host processors can integrate Processing-In-Memory (PIM) devices, which can accelerate memory-intensive kernels of Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging the large memory bandwidth available at PIM cores. However, the Host processor needs consecutive elements distributed across DRAM banks, while PIM cores need consecutive elements within their local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM devices. Current compilation approaches lack systematic optimization for diverse ML kernels and multiple PIM devices, and may largely ignore data rearrangement costs during the compute code optimization step. We show that data rearrangements and compute code optimization are interdependent, and need to be jointly optimized during the tuning process. Therefore, we design DCC, the first data-centric ML compiler for PIM systems that co-optimizes data rearrangements and compute code in a unified tuning process. DCC integrates a multi-layer PIM abstraction to support multiple PIM backends. DCC enables effective co-optimization of data partitioning strategies with compute loop partitioning schemes. DCC applies PIM-specific code optimizations, and leverages a fast and accurate performance prediction model to select the best-performing code schedule for a given kernel on a target PIM architecture. Our evaluations on various individual ML kernels show that DCC achieves up to 7.68x speedup (2.21x average) on HBM-PIM, and up to 13.17x speedup (3.92x average) on AttAcc PIM, over GPU-only execution. In end-to-end LLM inference, DCC on AttAcc accelerates GPT-3 and LLaMA-2 by 4.52x average (up to 7.71x in LLaMA-2) over GPU. DCC is open-sourced at https://github.com/SPIN-Research-Group/DCC.
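The layout mismatch described in this abstract can be illustrated with a deliberately simplified model. The sketch below assumes element-granularity interleaving and equal-sized contiguous chunks per bank; real DRAM address mappings are device-specific, and the function names are hypothetical and not part of DCC.

```python
def host_interleaved_layout(data, num_banks):
    """Host-friendly layout: element i goes to bank i % num_banks."""
    banks = [[] for _ in range(num_banks)]
    for i, x in enumerate(data):
        banks[i % num_banks].append(x)
    return banks


def pim_local_layout(data, num_banks):
    """PIM-friendly layout: each bank holds one contiguous chunk of the data."""
    chunk = -(-len(data) // num_banks)  # ceiling division
    return [list(data[b * chunk:(b + 1) * chunk]) for b in range(num_banks)]


def elements_moved(data, num_banks):
    """Count elements whose bank differs between the two layouts."""
    chunk = -(-len(data) // num_banks)
    return sum(1 for i in range(len(data)) if i % num_banks != i // chunk)
```

In this toy model, 6 of 8 elements change banks when switching layouts across 4 banks, which hints at why rearrangement cost cannot be ignored when tuning the compute code.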

2025-11-19T14:58:16Z Peiming Yang Sankeerth Durvasula Ivan Fernandez Mohammad Sadrosadati Onur Mutlu Gennady Pekhimenko Christina Giannoula http://arxiv.org/abs/2602.14289v2 Parallel Sparse and Data-Sparse Factorization-based Linear Solvers 2026-05-22T15:57:25Z

Efficient solutions of large-scale, ill-conditioned and indefinite algebraic equations are ubiquitously needed in numerous computational fields, including multiphysics simulations, machine learning, and data science. Because of their robustness and accuracy, direct solvers are crucial components in building a scalable solver toolchain. In this chapter, we will review recent advances of sparse direct solvers along two axes: 1) reducing communication and latency costs in both task- and data-parallel settings, and 2) reducing computational complexity via low-rank and other compression techniques such as hierarchical matrix algebra. In addition to algorithmic principles, we also illustrate the key parallelization challenges and best practices to deliver high speed and reliability on modern heterogeneous parallel machines.

2026-02-15T19:40:14Z Xiaoye Sherry Li Yang Liu http://arxiv.org/abs/2605.23707v1 Flare: Leveraging Serverless Elasticity to Absorb Microservice Load Spikes 2026-05-22T14:54:40Z

Online services strive to maintain application responsiveness even when the traffic is unpredictable and fluctuating. Today's online services are commonly deployed as chains of microservices, each microservice packaged as one or more containers inside virtual machines (VMs). While performant and affordable when the load is steady, VM-based deployments are known to be slow to scale when the load spikes, resulting in degraded performance for end-users of the service. To avoid such performance degradations, service providers can over-provision their deployments; however, such a strategy is costly and inefficient, leaving resources under-utilized for extended periods. To address the challenge of unpredictable load spikes, we propose Flare, a hybrid microservice architecture that combines VMs with serverless computing. Flare utilizes VMs to cost-effectively handle steady workloads and leverages serverless elasticity to absorb traffic spikes. When a spike occurs, Flare detects which specific service(s) are overloaded and shifts the excess load of only those services to serverless, thus minimizing the cost overhead. Flare seamlessly integrates into existing auto-scaling and serverless infrastructure, requiring minimal changes to the control plane and no modifications to the application.
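The per-service spillover idea ("shifts the excess load of only those services to serverless") can be sketched as follows. This is only an illustration under stated assumptions: load and capacity are expressed in requests per second, the capacity of each service's VM deployment is known in advance, and the function name `split_load` is hypothetical rather than Flare's actual interface or detection mechanism.

```python
def split_load(demand_rps, vm_capacity_rps):
    """Keep load up to VM capacity on VMs; route only the excess to serverless."""
    plan = {}
    for service, demand in demand_rps.items():
        capacity = vm_capacity_rps.get(service, 0)
        on_vm = min(demand, capacity)
        # Services within capacity get a serverless share of zero.
        plan[service] = {"vm": on_vm, "serverless": demand - on_vm}
    return plan
```

Only overloaded services incur serverless cost in this sketch; services that stay within their VM capacity are left untouched.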

2026-05-22T14:54:40Z Dilina Dehigama Shyam Jesalpura David Schall Antonios Katsarakis Marios Kogias Rakesh Kumar Boris Grot http://arxiv.org/abs/2605.23677v1 AMP: Arc Multi-Proposer Protocol with Bounded Inclusion Guarantees 2026-05-22T14:26:10Z

Blockchain systems that settle financial transactions face a structural tension: the single validator that assembles each block holds unilateral power over transaction inclusion and ordering. Traditional markets curb this very power through front-running and market-manipulation laws. Regulators have flagged the absence of such rules as a first-order concern for blockchain-based financial infrastructure. In response, we introduce AMP, a multi-proposer protocol, on top of the Tendermint consensus algorithm, where no validator can control the flow of transactions into blocks. Instead, dedicated nodes called proposers sit between users and validators. They collect user transactions, group them into payloads, and broadcast the payloads to all validators. Consequently, there is no mempool, and AMP applies the design principle of separating dissemination from agreement, which can lead to higher throughput. Validators publicly attest to receiving payloads and run consensus to decide the set of payloads to include in the next block. When all correct validators attest to a given payload, AMP guarantees that the payload will be included in the next block; a block thus contains payloads from multiple proposers, allowing for bulk finalization. This bounded inclusion guarantee, along with a deterministic ordering algorithm that is run over all payloads included in a block, curbs the power of any single validator. Validators no longer control what is included in a block, nor can they arbitrarily order the contents of blocks.

2026-05-22T14:26:10Z Daniel Cason Gordon Liao Sergio Mena Nenad Milošević Adi Seredinschi Alessandro Sforzin João Sousa Preston Vander Vos http://arxiv.org/abs/2605.23648v1 Herring: Parallel Batch-Order-Fairness on DAG-based Blockchain Consensus 2026-05-22T14:00:33Z

Transaction ordering attacks extract billions of dollars annually from decentralized finance users in the form of Maximal Extractable Value (MEV). Byzantine Fault-Tolerant (BFT) consensus protocols guarantee total order but place no constraint on how that order is chosen, leaving the door open for adversarial reordering. Batch-order-fairness (batch-OF) protocols close this gap, but existing designs pay a steep performance price for this guarantee. Leader-based protocols such as Themis concentrate all fairness decisions at a single replica, while recent DAG-based proposals FairDAG and DAG of DAGs (DoD) force their fairness layer into strictly serial execution despite running on multi-proposer DAGs. We present Herring, the first γ-batch-OF DAG BFT protocol whose fairness layer parallelizes the dominant graph construction cost across committed subdags. Herring combines post-consensus graph construction with explicit missing edge resolution piggybacked on the DAG's reliable broadcast layer, a pairing that turns fair ordering from a per-round serial bottleneck into a CPU-bound task. We also uncover previously unreported liveness vulnerabilities in both FairDAG-RL and DoD that a malicious client can trigger to halt the fairness layer indefinitely, and propose patches that we integrate into our reimplementations. We implement Herring on top of the Rust implementation of Narwhal & Tusk and evaluate it against FairDAG-RL, DoD-W, and Themis. Herring tracks the throughput of Narwhal & Tusk closely up to roughly 10,000 tx/s, achieves roughly 90% higher saturation throughput than FairDAG-RL and 100% higher than DoD-W, and substantially reduces execution latency at saturation.
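As a rough illustration of the graph construction step that fair-ordering protocols rely on, the sketch below builds pairwise precedence edges from replicas' local receive orders. It is a generic simplification and not Herring's algorithm: it assumes every replica reports every transaction, ignores Byzantine behavior and missing-edge resolution, and uses the hypothetical name `fairness_edges`.

```python
from itertools import combinations


def fairness_edges(local_orders, gamma):
    """Return edges (a, b) where at least gamma * n replicas received a before b."""
    n = len(local_orders)
    # Position of each transaction in each replica's local order.
    positions = [{tx: i for i, tx in enumerate(order)} for order in local_orders]
    txs = set().union(*map(set, local_orders))
    edges = set()
    for a, b in combinations(sorted(txs), 2):
        # Only replicas that saw both transactions vote on the pair.
        both = [pos for pos in positions if a in pos and b in pos]
        a_first = sum(1 for pos in both if pos[a] < pos[b])
        b_first = len(both) - a_first
        if a_first >= gamma * n:
            edges.add((a, b))
        if b_first >= gamma * n:
            edges.add((b, a))
    return edges
```

Such a graph can contain cycles when replica preferences conflict, which is one reason batch-OF protocols order batches of transactions rather than individual transactions.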

2026-05-22T14:00:33Z Marko Putnik Jérémie Decouchant http://arxiv.org/abs/2605.23566v1 Multi-Factor Trust-Driven Secure Communication Model for Cloud-Based Digital Twins 2026-05-22T12:31:26Z

Cloud-based Digital Twin (DT) platforms enable real-time monitoring, simulation, and collaborative decision-making across distributed clients. However, ensuring secure and trustworthy communication remains a critical challenge due to heterogeneous client behavior, resource contention, and evolving adversarial threats. This paper proposes the Multi-Factor Trust-Driven Secure Communication (MT-SeCom) framework to enforce resilient and intelligent collaboration in DT-enabled cloud environments. MT-SeCom operates through four coordinated phases: (i) Multi-Factor Trust Monitoring, capturing temporal, contextual, and federated trust signals; (ii) Adaptive Trust Evaluation, adjusting trust weights based on network dynamics and threat intensity; (iii) Transformer-Based Trusted Client Classification, combining anomaly detection with supervised learning to accurately identify malicious or unreliable nodes; and (iv) Resilient Communication Management, optimizing routing, isolating compromised clients, and ensuring service continuity. A real-world testbed and comprehensive experiments demonstrate that MT-SeCom significantly enhances secure communication, mitigates cascading adversarial effects, and maintains high resilience under fluctuating attack conditions. MT-SeCom achieves an average 18.7% improvement in threat detection accuracy and a 24.3% reduction in anomaly occurrences compared to existing methods, confirming its robustness, scalability, and practical suitability for heterogeneous cloud-based DT ecosystems.

2026-05-22T12:31:26Z 10 pages, 5 figures IEEE Transactions on Industrial Informatics, published in 2026 Deepika Saxena Ashutosh Kumar Singh 10.1109/TII.2026.3669993 http://arxiv.org/abs/2602.20887v3 A Morton-Type Space-Filling Curve for Pyramid Subdivision and Hybrid Adaptive Mesh Refinement 2026-05-22T11:47:03Z

The forest-of-refinement-trees approach allows for dynamic adaptive mesh refinement (AMR) at negligible cost. While originally developed for quadrilateral and hexahedral elements, previous work established the theory and algorithms for unstructured meshes of simplicial and prismatic elements. To harness the full potential of tree-based AMR for three-dimensional mixed-element meshes, this paper introduces the pyramid as a new functional element type; its primary purpose is to connect tetrahedral and hexahedral elements without hanging edges. We present a well-defined space-filling curve (SFC) for the pyramid and detail how the unique challenges on the element and forest level associated with the pyramidal refinement are resolved. We propose the necessary functional design and generalize the fundamental global parallel algorithms for refinement, coarsening, partitioning, and face ghost exchange to fully support this new element. Our demonstrations confirm the efficiency and scalability of this complete, hybrid-element dynamic AMR framework.
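For readers unfamiliar with Morton-type space-filling curves, the sketch below shows the classic Morton (Z-order) index for cubic cells, obtained by interleaving coordinate bits. It is the standard hexahedral construction only; the pyramid curve introduced in the paper is different and is not reproduced here. The function names and the default of 10 bits per coordinate are illustrative assumptions.

```python
def morton_encode_3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Morton (Z-order) index."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)       # x takes bit positions 0, 3, 6, ...
        code |= ((y >> i) & 1) << (3 * i + 1)   # y takes bit positions 1, 4, 7, ...
        code |= ((z >> i) & 1) << (3 * i + 2)   # z takes bit positions 2, 5, 8, ...
    return code


def morton_decode_3d(code, bits=10):
    """Invert morton_encode_3d by de-interleaving the bits."""
    x = y = z = 0
    for i in range(bits):
        x |= ((code >> (3 * i)) & 1) << i
        y |= ((code >> (3 * i + 1)) & 1) << i
        z |= ((code >> (3 * i + 2)) & 1) << i
    return x, y, z
```

Sorting cells by this index linearizes the mesh, so partitioning across processes reduces to cutting a one-dimensional array into contiguous ranges.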

2026-02-24T13:23:53Z David Knapp Johannes Albrecht Holke Thomas Spenke Carsten Burstedde Lukas Dreyer http://arxiv.org/abs/2605.04842v2 Communication Offloading on SmartNIC DPUs: A Quantitative Approach 2026-05-22T11:19:18Z

SmartNIC Data Processing Units (DPUs) offer a promising solution for saving high-end CPU resources by offloading tasks to programmable cores near the network interface. In this work, we explore the feasibility of SmartNIC DPUs in supporting an asynchronous communication model called "fire-and-forget", particularly its core message routing service. We design a communication offloading engine called Buddy that decouples communication tasks from the application process. Buddy runs flexibly on SmartNIC DPUs such as the Nvidia BlueField-3 DPU and generic x86 CPUs. Our evaluation on five applications identifies the memory-to-communication ratio as a key predictor of the offloading performance. Host-dominated workloads, such as Quicksilver and Sparse Matrix Transpose, achieved up to 1.55x speedup with communication offloaded to the DPU. We further identify a 625x increase in DRAM traffic due to the absence of Direct Cache Access support on the DPU, highlighting a critical need in future SmartNIC designs.

2026-05-06T12:41:56Z To appear in Euro-Par 2026 Jacob Wahlgren Andong Hu Roger Pearce Maya Gokhale Ivy Peng http://arxiv.org/abs/2605.10860v2 Closer in the Gap: Towards Portable Performance on RISC-V Vector Processors 2026-05-22T10:00:35Z

The RISC-V Vector Extension (RVV) is a cornerstone for supporting compute throughput in scientific and machine learning workloads. Yet compiler support and performance monitoring on real RVV 1.0 hardware are still evolving. In this work, we design a suite of assembly microbenchmarks to establish performance ceilings and calibrate performance counters on RVV hardware. Leveraging the assembly benchmarks, we find that predication overhead and strided loads pose performance challenges that current compiler cost models do not yet fully address. Moreover, we present the first evaluation of GCC 15 and LLVM 21 autovectorization in HPC and ML proxy applications. GCC 15 outperforms LLVM 21 in four out of six applications. LLVM 21 only outperforms GCC 15 in SGEMM and DGEMM, driven by more aggressive instruction reduction confirmed through validated perf counters on the RVV hardware. We further show that the default LMUL selection in compilers performs close to the optimal. To study RVV support for production-level applications, we also evaluate the state-vector quantum simulator, Google's Qsim, with both manual RVV intrinsics and compiler auto-vectorization, revealing immaturity in current RVV compilers for complicated memory access patterns.

2026-05-11T17:08:29Z To be published in the 32nd European Conference on Parallel and Distributed Processing (Euro-Par 2026) Ruimin Shi Maya Gokhale Pei-Hung Lin Xavier Teruel Ivy Peng http://arxiv.org/abs/2605.23432v1 Multi-Round Visibility: A Post-Consensus Ordering Layer for DAG-Based BFT 2026-05-22T09:44:34Z

Directed acyclic graph (DAG)-based Byzantine Fault-Tolerant (BFT) protocols achieve high throughput by decoupling dissemination from agreement and allowing many vertices to be committed concurrently. This same concurrency, however, weakens ordering evidence at the execution boundary: once units are committed in a shared DAG frontier, their final linearization is driven by traversal or deterministic tie-breaking rather than verifiable structural precedence. Prior fair-ordering designs address ambiguity by collecting or reconstructing transaction-level ordering evidence within the consensus workflow. While effective, this couples ordering with agreement and places ordering logic on the critical path. This paper presents Multi-Round Visibility (MRV), a post-consensus structural ordering layer for DAG-based BFT that reinterprets the committed DAG as an ordering evidence substrate. Committed vertices inherently carry authenticated creator, round, and ancestry metadata, enabling replicas to derive multi-round structural visibility without extra consensus-path messages. MRV accumulates this visibility within a bounded evidence horizon, compares concurrently committed atomic units of fairness (AUFs) after they coexist in the DAG, and derives precedence constraints from Byzantine-robust visibility advantages. When the DAG lacks such constraints, MRV exposes and resolves the remaining ambiguity through deterministic graph completion rather than hiding it inside traversal rules. We implement MRV on a Narwhal/Tusk-based prototype. Evaluation across 5-50 replicas under various fault settings shows MRV preserves the high-throughput regime of the DAG-BFT stack, proving it provides post-consensus structural ordering without burdening the consensus-critical path.

2026-05-22T09:44:34Z Pengkun Ren Dong Hai Nasrin Sohrabi Zahir Tari