https://arxiv.org/api/RuVs8oIw8PDY/YBEwx68btyJEYg 2026-06-10T21:04:25Z 2883834515

http://arxiv.org/abs/2512.04320v2
VLCs: Managing Parallelism with Virtualized Libraries
2026-05-22T19:23:20Z
As the complexity and scale of modern parallel machines continue to grow, programmers increasingly rely on composition of software libraries to encapsulate and exploit parallelism. However, many libraries are not designed with composition in mind and assume they have exclusive access to all resources. Using such libraries concurrently can result in contention and degraded performance. Prior solutions involve modifying the libraries or the OS, which is often infeasible.
We propose Virtual Library Contexts (VLCs), which are process subunits that encapsulate sets of libraries and associated resource allocations. VLCs control the resource utilization of these libraries without modifying library code. This enables the user to partition resources between libraries to prevent contention, or load multiple copies of the same library to allow parallel execution of otherwise thread-unsafe code within the same process.
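As a rough illustration of what partitioning resources between libraries can mean in practice, the sketch below splits a set of CPU ids into disjoint groups, one per library context. The function name `partition_cpus`, the context names, and the proportional-share policy are assumptions made for this example; the abstract does not describe the actual VLC interface or allocation policy.

```python
def partition_cpus(cpu_ids, shares):
    """Split cpu_ids into disjoint groups sized in proportion to shares.

    `shares` maps a (hypothetical) context name to a positive weight.
    Every CPU id is assigned to exactly one context.
    """
    cpus = sorted(cpu_ids)
    total = sum(shares.values())
    partitions = {}
    start = 0
    cumulative = 0
    for name, weight in shares.items():
        cumulative += weight
        # Cut points follow the cumulative share, so the last group
        # always ends at len(cpus) and no CPU is left unassigned.
        end = round(len(cpus) * cumulative / total)
        partitions[name] = cpus[start:end]
        start = end
    return partitions


# Hypothetical contexts: one for an OpenBLAS-heavy phase, one for OpenMP code.
groups = partition_cpus(range(8), {"openblas": 3, "openmp": 1})
```

On Linux, a group like `groups["openblas"]` could be handed to `os.sched_setaffinity` to pin the calling process to those cores; whether VLCs rely on that mechanism is not stated here.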
In this paper, we describe and evaluate C++ and Python prototypes of VLCs. Experiments show that VLCs enable speedups of up to 2.85x on benchmarks including applications using OpenMP, OpenBLAS, and LibTorch. Source code of VLCs is available at https://github.com/pecos/Virtual-Library-Context.
2025-12-03T23:11:02Z
In Proceedings of the 2025 ACM Symposium on Cloud Computing (SoCC '25)
Proceedings of the 2025 ACM Symposium on Cloud Computing (2025) 629-643
Yineng Yan, William Ruys, Hochan Lee, Ian Henriksen, Arthur Peters, Sean Stephens, Bozhi You, Henrique Fingler, Martin Burtscher, Milos Gligoric, Keshav Pingali, Mattan Erez, George Biros, Christopher J. Rossbach
10.1145/3772052.3772265

http://arxiv.org/abs/2605.24096v1
The Time is Here for Just-in-Time Systems: Challenges and Opportunities
2026-05-22T18:03:41Z
Core systems like key-value stores have historically taken years to build, and are designed to be general so as to amortize cost across deployments, paying a significant performance cost. We argue that LLM-based coding agents now make a different approach tractable: Just-in-Time Systems, in which the entire system is synthesized from scratch, specialized to the environment, workload, and required system properties. We present a JIT system synthesis pipeline, Jitskit, and explore its effectiveness in synthesizing key-value stores from spec cards that span different YCSB workloads, deployment constraints (e.g., compute resources), and system properties (e.g., consistency and durability). Jitskit iteratively refines a system implementation to match the specification against an evolving evaluation test suite. The resulting synthesized systems are performant, beating comparable state-of-the-art systems on 18 of 18 specs tried, by up to 4.6x over the best off-the-shelf baseline on the most favorable spec. Naively running Claude Code either reward-hacks or underperforms Jitskit by up to 5.4x.
We discuss the challenges we overcame in building Jitskit and our key takeaways.
2026-05-22T18:03:41Z
preprint
Shu Liu, Alexander Krentsel, Shubham Agarwal, Mert Cemri, Ziming Mao, Soujanya Ponnapalli, Alexandros G. Dimakis, Sylvia Ratnasamy, Matei Zaharia, Aditya Parameswaran, Ion Stoica

http://arxiv.org/abs/2605.23850v1
Enhancing Energy Efficiency in Scientific Workflows through CFD based PIVAEs
2026-05-22T17:04:49Z
The growing complexity and scale of scientific workflows in high performance computing (HPC) environments have led to significant challenges in managing energy consumption without compromising computational performance. Traditional scheduling strategies often fail to account for the complex interplay between thermal dynamics, workload diversity, and system scalability, leading to inefficient and unsustainable energy usage. This paper introduces a novel, scalable, and AI-assisted scheduling framework for optimizing energy consumption in HPC environments without compromising performance. Central to our approach is the integration of Computational Fluid Dynamics (CFD) with a Physics-Informed Variational Autoencoder (PIVAE), enabling the generation of physically realistic synthetic workload data that bridges the gap between thermodynamic behavior and scheduler decision-making in complex, multi-scale HPC environments. By categorizing workflows based on resource utilization profiles, we evaluate multiple scheduling strategies such as Locality Aware and Speculative Aware Scheduling. These workflows, ranging from event reconstruction to anomaly detection, represent diverse computational intensities. Our results show that modest reductions in CPU performance (e.g., to 15%) can yield substantial energy savings (up to 10%) with only minor turnaround time increases (approximately 5-6%), identifying an optimal operational sweet spot.
This work demonstrates how physics-informed generative modeling can enable adaptive, sustainable, and data-efficient scheduling for next-generation HPC infrastructures.
2026-05-22T17:04:49Z
Ali Zahir, Ashiq Anjum, Mark Wilkinson, Jeyan Thiyagalingam

http://arxiv.org/abs/2605.23816v1
SDNator is Not Another SDN Controller: Enabling Extensible Data-Driven Control in Cyber-Physical Systems
2026-05-22T16:16:59Z
An SDN-like centralized control architecture is increasingly popular and has been widely explored in cyber-physical systems (CPS) such as manufacturing, internet-of-things, and autonomous vehicle systems for higher flexibility, programmability and scalability. However, no existing frameworks can offer domain-agnostic, easily extensible support for data-driven CPS applications. In this work, we design, implement, and open-source SDNator, the first framework to enable extensible, data-driven control in CPS. SDNator embraces an application- and data-driven design where applications function as data consumers and producers to collectively define the workflows of the controller. SDNator also incorporates two data store backends to support both event-driven and data-driven programming patterns. Benchmarks show that SDNator is highly scalable, and delivers comparable performance to Ryu, a widely used SDN controller.
Moreover, we demonstrate the capabilities and usability of SDNator through our case studies of manufacturing and networking systems. By integrating applications from respective domains, we build different "controllers" for different scenarios. Most notably, we leverage SDNator to implement the first digital-twin-equipped central controller for additive manufacturing fleets. We show through extensive and realistic simulations that SDNator-based scheduling can (1) significantly shorten production time and improve reliability in the presence of anomalies compared to decentralized approaches, and (2) flexibly adjust and optimize production plans upon urgent requests such as producing Personal Protective Equipment during the COVID-19 pandemic.
2026-05-22T16:16:59Z
Y. Lin, R. Zhang, E. Balta, X. Zhu, J. Zhang, K. Barton, D. Tilbury, Z. Mao

http://arxiv.org/abs/2605.23815v1
A Pragmatic Approach to Learned Indexing in RocksDB: Targeted Optimizations with Minimal System Modification
2026-05-22T16:16:52Z
Learned indexes have emerged as a promising alternative to traditional index structures, offering higher throughput and lower memory usage by approximating the cumulative key distribution function with lightweight models. Despite these benefits, adoption in production systems remains limited, partly because learned indexes that support concurrency and persistence as effectively as, e.g., the B+-Tree, do not yet exist, while many research prototypes introduce substantial complexity. In this paper, we investigate whether off-the-shelf learned indexes can be integrated into a production database with minimal storage-engine redesign. Using RocksDB as a case study, we exploit its separation between in-memory Memtables and immutable on-disk files to deploy specialized indexes at each level. We show that directly applying existing learned indexes is insufficient under write-heavy workloads because frequent Memtable replacement prevents models from fully adapting.
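To make the idea of approximating the cumulative key distribution with a lightweight model concrete, here is a minimal sketch of a read-only learned index: a single straight line predicts a key's position in a sorted array, and a recorded worst-case error bounds the final search. The class name `LinearLearnedIndex` and the single-segment linear model are assumptions chosen for brevity; they are not the indexes evaluated in the paper, and the key set is assumed to be non-empty.

```python
import bisect


class LinearLearnedIndex:
    """Toy read-only learned index over a sorted list of integer keys."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        self.lo = self.keys[0]
        hi = self.keys[-1]
        # Straight-line approximation of the key CDF scaled to positions.
        self.slope = (n - 1) / (hi - self.lo) if hi != self.lo else 0.0
        # Worst prediction error over the build-time keys; lookups only
        # need to search this far on either side of the prediction.
        self.max_err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        pos = int(self.slope * (key - self.lo))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None."""
        guess = self._predict(key)
        left = max(guess - self.max_err, 0)
        right = min(guess + self.max_err + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, left, right)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None


index = LinearLearnedIndex([2, 3, 5, 8, 13, 21, 34, 55])
position = index.lookup(13)
```

In this toy version the model and its error bound are fitted once over a fixed key set, so both have to be rebuilt from scratch whenever the underlying structure is replaced.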
To address this, we introduce a reuse mechanism that preserves structural knowledge across Memtable instances. At the storage level, we replace RocksDB's disk index with a learned index without modifying the storage layer or read path. We further adapt a read-only learned index to be block-aware, enabling worst-case single-I/O lookups. We implement these techniques in MountDB, an extension of RocksDB. Experiments on large-scale workloads with diverse data distributions and access patterns show up to 1.5X higher write throughput and 2.1X higher read throughput than state-of-the-art systems, demonstrating that established learned indexes can be integrated into production systems with minimal overhead and substantial performance benefits.
2026-05-22T16:16:52Z
Shubham Vashisth, Olivier Michaud, Bettina Kemme, Oana Balmau

http://arxiv.org/abs/2511.15503v2
DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures
2026-05-22T15:57:54Z
High-performance Host processors can integrate Processing-In-Memory (PIM) devices, which can accelerate memory-intensive kernels of Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging the large memory bandwidth available at PIM cores. However, the Host processor needs consecutive elements distributed across DRAM banks, while PIM cores need consecutive elements within their local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM devices. Current compilation approaches lack systematic optimization for diverse ML kernels and multiple PIM devices, and may largely ignore data rearrangement costs during the compute code optimization step. We show that data rearrangements and compute code optimization are interdependent, and need to be jointly optimized during the tuning process.
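The layout mismatch between Host and PIM can be illustrated with a small model. The sketch below assumes the simplest possible mapping: the Host interleaves element i into bank `i % num_banks`, while the PIM layout stores contiguous chunks per bank. The function names and this mapping are assumptions made for illustration; real DRAM address mappings and the layouts DCC generates are more involved.

```python
def host_layout(data, num_banks):
    """Interleave consecutive elements across banks (element i -> bank i % num_banks)."""
    banks = [[] for _ in range(num_banks)]
    for i, x in enumerate(data):
        banks[i % num_banks].append(x)
    return banks


def pim_layout(data, num_banks):
    """Keep consecutive elements together: each bank holds one contiguous chunk."""
    chunk = -(-len(data) // num_banks)  # ceiling division
    return [list(data[b * chunk:(b + 1) * chunk]) for b in range(num_banks)]


def rearrange_host_to_pim(banks):
    """Convert a host-interleaved layout to the PIM-local layout.

    Returns (new_banks, moved), where `moved` counts elements that end up
    in a different bank, a crude proxy for the rearrangement cost.
    """
    num_banks = len(banks)
    n = sum(len(b) for b in banks)
    # Recover logical order: element i lives at banks[i % B][i // B].
    logical = [banks[i % num_banks][i // num_banks] for i in range(n)]
    chunk = -(-n // num_banks)
    moved = sum(1 for i in range(n) if i % num_banks != i // chunk)
    return pim_layout(logical, num_banks), moved


interleaved = host_layout(list(range(8)), 4)
local, moved = rearrange_host_to_pim(interleaved)
```

With 8 elements and 4 banks, only the first and last elements stay in place under this model. The number of moved elements grows with the data size, which is why the cost of such rearrangements has to be weighed when a compute schedule is chosen.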
Therefore, we design DCC, the first data-centric ML compiler for PIM systems that co-optimizes data rearrangements and compute code in a unified tuning process. DCC integrates a multi-layer PIM abstraction to support multiple PIM backends. DCC enables effective co-optimization of data partitioning strategies with compute loop partitioning schemes. DCC applies PIM-specific code optimizations, and leverages a fast and accurate performance prediction model to select the best-performing code schedule for a given kernel on a target PIM architecture. Our evaluations on various individual ML kernels show that DCC achieves up to 7.68x speedup (2.21x average) on HBM-PIM, and up to 13.17x speedup (3.92x average) on AttAcc PIM, over GPU-only execution. In end-to-end LLM inference, DCC on AttAcc accelerates GPT-3 and LLaMA-2 by 4.52x average (up to 7.71x in LLaMA-2) over GPU. DCC is open-sourced at https://github.com/SPIN-Research-Group/DCC.
2025-11-19T14:58:16Z
Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu, Gennady Pekhimenko, Christina Giannoula

http://arxiv.org/abs/2602.14289v2
Parallel Sparse and Data-Sparse Factorization-based Linear Solvers
2026-05-22T15:57:25Z
Efficient solutions of large-scale, ill-conditioned and indefinite algebraic equations are needed in numerous computational fields, including multiphysics simulations, machine learning, and data science. Because of their robustness and accuracy, direct solvers are crucial components in building a scalable solver toolchain. In this chapter, we will review recent advances in sparse direct solvers along two axes: 1) reducing communication and latency costs in both task- and data-parallel settings, and 2) reducing computational complexity via low-rank and other compression techniques such as hierarchical matrix algebra.
In addition to algorithmic principles, we also illustrate the key parallelization challenges and best practices to deliver high speed and reliability on modern heterogeneous parallel machines.
2026-02-15T19:40:14Z
Xiaoye Sherry Li, Yang Liu

http://arxiv.org/abs/2605.23707v1
Flare: Leveraging Serverless Elasticity to Absorb Microservice Load Spikes
2026-05-22T14:54:40Z
Online services strive to maintain application responsiveness even when the traffic is unpredictable and fluctuating. Today's online services are commonly deployed as chains of microservices, each microservice packaged as one or more containers inside virtual machines (VMs). While performant and affordable when the load is steady, VM-based deployments are known to be slow to scale when the load spikes, resulting in degraded performance for end-users of the service. To avoid such performance degradations, service providers can over-provision their deployments; however, such a strategy is costly and inefficient, leaving resources under-utilized for extended periods.
To address the challenge of unpredictable load spikes, we propose Flare, a hybrid microservice architecture that combines VMs with serverless computing. Flare utilizes VMs to cost-effectively handle steady workloads and leverages serverless elasticity to absorb traffic spikes. When a spike occurs, Flare detects which specific service(s) are overloaded and shifts the excess load of only those services to serverless, thus minimizing the cost overhead. Flare seamlessly integrates into existing auto-scaling and serverless infrastructure, requiring minimal changes to the control plane and no modifications to the application.
2026-05-22T14:54:40Z
Dilina Dehigama, Shyam Jesalpura, David Schall, Antonios Katsarakis, Marios Kogias, Rakesh Kumar, Boris Grot

http://arxiv.org/abs/2605.23677v1
AMP: Arc Multi-Proposer Protocol with Bounded Inclusion Guarantees
2026-05-22T14:26:10Z
Blockchain systems that settle financial transactions face a structural tension: the single validator that assembles each block holds unilateral power over transaction inclusion and ordering. Traditional markets curb this very power through front-running and market-manipulation laws. Regulators have flagged the absence of such rules as a first-order concern for blockchain-based financial infrastructure. In response, we introduce AMP, a multi-proposer protocol, on top of the Tendermint consensus algorithm, where no validator can control the flow of transactions into blocks. Instead, dedicated nodes called proposers sit between users and validators. They collect user transactions, group them into payloads, and broadcast the payloads to all validators. Consequently, there is no mempool, and AMP applies the design principle of separating dissemination from agreement, which can lead to higher throughput. Validators publicly attest to receiving payloads and run consensus to decide the set of payloads to include in the next block.
When all correct validators attest to a given payload, AMP guarantees that payload will be included in the next block; a block thus contains payloads from multiple proposers, allowing for bulk finalization. This bounded inclusion guarantee, along with a deterministic ordering algorithm that runs over all payloads included in a block, curbs the power of any single validator. Validators no longer control what is included in a block, nor can they arbitrarily order the contents of blocks.
2026-05-22T14:26:10Z
Daniel Cason, Gordon Liao, Sergio Mena, Nenad Milošević, Adi Seredinschi, Alessandro Sforzin, João Sousa, Preston Vander Vos

http://arxiv.org/abs/2605.23648v1
Herring: Parallel Batch-Order-Fairness on DAG-based Blockchain Consensus
2026-05-22T14:00:33Z
Transaction ordering attacks extract billions of dollars annually from decentralized finance users in the form of Maximal Extractable Value (MEV). Byzantine Fault-Tolerant (BFT) consensus protocols guarantee total order but place no constraint on how that order is chosen, leaving the door open for adversarial reordering. Batch-order-fairness (batch-OF) protocols close this gap, but existing designs pay a steep performance price for this guarantee. Leader-based protocols such as Themis concentrate all fairness decisions at a single replica, while recent DAG-based proposals FairDAG and DAG of DAGs (DoD) force their fairness layer into strictly serial execution despite running on multi-proposer DAGs.
We present Herring, the first γ-batch-OF DAG BFT protocol whose fairness layer parallelizes the dominant graph construction cost across committed subdags. Herring combines post-consensus graph construction with explicit missing edge resolution piggybacked on the DAG's reliable broadcast layer, a pairing that turns fair ordering from a per-round serial bottleneck into a CPU-bound task. We also uncover previously unreported liveness vulnerabilities in both FairDAG-RL and DoD that a malicious client can trigger to halt the fairness layer indefinitely, and propose patches that we integrate into our reimplementations.
We implement Herring on top of the Rust implementation of Narwhal & Tusk and evaluate it against FairDAG-RL, DoD-W, and Themis. Herring tracks the throughput of Narwhal & Tusk closely up to roughly 10,000 tx/s, achieves roughly 90% higher saturation throughput than FairDAG-RL and 100% higher than DoD-W, and substantially reduces execution latency at saturation.
2026-05-22T14:00:33Z
Marko Putnik, Jérémie Decouchant

http://arxiv.org/abs/2605.23566v1
Multi-Factor Trust-Driven Secure Communication Model for Cloud-Based Digital Twins
2026-05-22T12:31:26Z
Cloud-based Digital Twin (DT) platforms enable real-time monitoring, simulation, and collaborative decision-making across distributed clients. However, ensuring secure and trustworthy communication remains a critical challenge due to heterogeneous client behavior, resource contention, and evolving adversarial threats. This paper proposes the Multi-Factor Trust-Driven Secure Communication (MT-SeCom) framework to enforce resilient and intelligent collaboration in DT-enabled cloud environments. MT-SeCom operates through four coordinated phases: (i) Multi-Factor Trust Monitoring, capturing temporal, contextual, and federated trust signals; (ii) Adaptive Trust Evaluation, adjusting trust weights based on network dynamics and threat intensity; (iii) Transformer-Based Trusted Client Classification, combining anomaly detection with supervised learning to accurately identify malicious or unreliable nodes; and (iv) Resilient Communication Management, optimizing routing, isolating compromised clients, and ensuring service continuity. A real-world testbed and comprehensive experiments demonstrate that MT-SeCom significantly enhances secure communication, mitigates cascading adversarial effects, and maintains high resilience under fluctuating attack conditions.
MT-SeCom achieves an average 18.7% improvement in threat detection accuracy and a 24.3% reduction in anomaly occurrences compared to existing methods, confirming its robustness, scalability, and practical suitability for heterogeneous cloud-based DT ecosystems.
2026-05-22T12:31:26Z
10 pages, 5 figures
IEEE Transactions on Industrial Informatics, published in 2026
Deepika Saxena, Ashutosh Kumar Singh
10.1109/TII.2026.3669993

http://arxiv.org/abs/2602.20887v3
A Morton-Type Space-Filling Curve for Pyramid Subdivision and Hybrid Adaptive Mesh Refinement
2026-05-22T11:47:03Z
The forest-of-refinement-trees approach allows for dynamic adaptive mesh refinement (AMR) at negligible cost. While originally developed for quadrilateral and hexahedral elements, previous work established the theory and algorithms for unstructured meshes of simplicial and prismatic elements. To harness the full potential of tree-based AMR for three-dimensional mixed-element meshes, this paper introduces the pyramid as a new functional element type; its primary purpose is to connect tetrahedral and hexahedral elements without hanging edges. We present a well-defined space-filling curve (SFC) for the pyramid and detail how the unique challenges on the element and forest level associated with the pyramidal refinement are resolved. We propose the necessary functional design and generalize the fundamental global parallel algorithms for refinement, coarsening, partitioning, and face ghost exchange to fully support this new element.
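For readers unfamiliar with Morton-type curves, the sketch below shows the classic Morton (Z-order) index for cubical cells, obtained by interleaving the bits of the integer coordinates. It is background only: the pyramid curve introduced in the paper is a different construction that the abstract does not spell out, and the function names and the `bits` parameter are assumptions of this example.

```python
def morton_encode_3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Morton index."""
    index = 0
    for b in range(bits):
        index |= ((x >> b) & 1) << (3 * b)
        index |= ((y >> b) & 1) << (3 * b + 1)
        index |= ((z >> b) & 1) << (3 * b + 2)
    return index


def morton_decode_3d(index, bits=10):
    """Invert morton_encode_3d: recover (x, y, z) from a Morton index."""
    x = y = z = 0
    for b in range(bits):
        x |= ((index >> (3 * b)) & 1) << b
        y |= ((index >> (3 * b + 1)) & 1) << b
        z |= ((index >> (3 * b + 2)) & 1) << b
    return x, y, z


# Order the eight cells of a 2x2x2 grid along the curve.
cells = [(x, y, z) for x in range(2) for y in range(2) for z in range(2)]
ordered = sorted(cells, key=lambda c: morton_encode_3d(*c))
```

Sorting cells by this index linearizes the mesh, so a partition across processes can be formed by cutting the sorted list into contiguous ranges.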
Our demonstrations confirm the efficiency and scalability of this complete, hybrid-element dynamic AMR framework.
2026-02-24T13:23:53Z
David Knapp, Johannes Albrecht Holke, Thomas Spenke, Carsten Burstedde, Lukas Dreyer

http://arxiv.org/abs/2605.04842v2
Communication Offloading on SmartNIC DPUs: A Quantitative Approach
2026-05-22T11:19:18Z
SmartNIC Data Processing Units (DPUs) offer a promising solution for saving high-end CPU resources by offloading tasks to programmable cores near the network interface. In this work, we explore the feasibility of SmartNIC DPUs in supporting an asynchronous communication model called "fire-and-forget", particularly its core message routing service. We design a communication offloading engine called Buddy that decouples communication tasks from the application process. Buddy runs flexibly on SmartNIC DPUs such as the Nvidia BlueField-3 DPU and generic x86 CPUs. Our evaluation results on five applications identify the memory-to-communication ratio as a key predictor of the offloading performance. Host-dominated workloads, such as Quicksilver and Sparse Matrix Transpose, achieved up to 1.55x speedup with communication offloaded to the DPU. We further identify a 625x increase in DRAM traffic due to the absence of Direct Cache Access support on the DPU, highlighting a critical need in future SmartNIC designs.
2026-05-06T12:41:56Z
To appear in Euro-Par 2026
Jacob Wahlgren, Andong Hu, Roger Pearce, Maya Gokhale, Ivy Peng

http://arxiv.org/abs/2605.10860v2
Closer in the Gap: Towards Portable Performance on RISC-V Vector Processors
2026-05-22T10:00:35Z
The RISC-V Vector Extension (RVV) is a cornerstone for supporting compute throughput in scientific and machine learning workloads. Yet compiler support and performance monitoring on real RVV 1.0 hardware are still evolving. In this work, we design a suite of assembly microbenchmarks to establish performance ceilings and calibrate performance counters on RVV hardware.
Leveraging the assembly benchmarks, we find that predication overhead and strided loads pose performance challenges that current compiler cost models do not yet fully address. Moreover, we present the first evaluation of GCC 15 and LLVM 21 autovectorization in HPC and ML proxy applications. GCC 15 outperforms LLVM 21 in four out of six applications. LLVM 21 only outperforms GCC 15 in SGEMM and DGEMM, driven by more aggressive instruction reduction confirmed through validated perf counters on the RVV hardware. We further show that the default LMUL selection in compilers performs close to the optimal. To study RVV support for product-level applications, we also evaluate the state-vector quantum simulator, Google's Qsim, with both manual RVV intrinsics and compiler auto-vectorization, revealing the immaturity of current RVV compilers for complicated memory access patterns.
2026-05-11T17:08:29Z
To be published in the 32nd European Conference on Parallel and Distributed Processing (Euro-Par 2026)
Ruimin Shi, Maya Gokhale, Pei-Hung Lin, Xavier Teruel, Ivy Peng

http://arxiv.org/abs/2605.23432v1
Multi-Round Visibility: A Post-Consensus Ordering Layer for DAG-Based BFT
2026-05-22T09:44:34Z
Directed acyclic graph (DAG)-based Byzantine Fault-Tolerant (BFT) protocols achieve high throughput by decoupling dissemination from agreement and allowing many vertices to be committed concurrently. This same concurrency, however, weakens ordering evidence at the execution boundary: once units are committed in a shared DAG frontier, their final linearization is driven by traversal or deterministic tie-breaking rather than verifiable structural precedence. Prior fair-ordering designs address ambiguity by collecting or reconstructing transaction-level ordering evidence within the consensus workflow. While effective, this couples ordering with agreement and places ordering logic on the critical path.
This paper presents Multi-Round Visibility (MRV), a post-consensus structural ordering layer for DAG-based BFT that reinterprets the committed DAG as an ordering evidence substrate. Committed vertices inherently carry authenticated creator, round, and ancestry metadata, enabling replicas to derive multi-round structural visibility without extra consensus-path messages. MRV accumulates this visibility within a bounded evidence horizon, compares concurrently committed atomic units of fairness (AUFs) after they coexist in the DAG, and derives precedence constraints from Byzantine-robust visibility advantages. When the DAG lacks such constraints, MRV exposes and resolves the remaining ambiguity through deterministic graph completion rather than hiding it inside traversal rules. We implement MRV on a Narwhal/Tusk-based prototype. Evaluation across 5-50 replicas under various fault settings shows MRV preserves the high-throughput regime of the DAG-BFT stack, proving it provides post-consensus structural ordering without burdening the consensus-critical path.
2026-05-22T09:44:34Z
Pengkun Ren, Dong Hai, Nasrin Sohrabi, Zahir Tari