https://arxiv.org/api/0e+spXwfCR/FPXZfkAE/r0rYTfM
2026-03-24T11:11:12Z

http://arxiv.org/abs/2603.04027v1
Performance Optimization in Stream Processing Systems: Experiment-Driven Configuration Tuning for Kafka Streams
2026-03-04T13:04:03Z
Configuring stream processing systems for efficient performance, especially in cloud-native deployments, is a challenging and largely manual task. We present an experiment-driven approach for automated configuration optimization that combines three phases: Latin Hypercube Sampling for initial exploration, Simulated Annealing for guided stochastic search, and Hill Climbing for local refinement. The workflow is integrated with the cloud-native Theodolite benchmarking framework, enabling automated experiment orchestration on Kubernetes and early termination of underperforming configurations. In an experimental evaluation with Kafka Streams and a Kubernetes-based cloud testbed, our approach identifies configurations that improve throughput by up to 23% over the default. The results indicate that Latin Hypercube Sampling with early termination and Simulated Annealing are particularly effective in navigating the configuration space, whereas additional fine-tuning via Hill Climbing yields limited benefits.
2026-03-04T13:04:03Z
Accepted for the 9th Workshop on Hot Topics in Cloud Computing Performance (HotCloudPerf 2026) at ACM/SPEC ICPE 2026
David Chen, Sören Henning, Kassiano Matteussi, Rick Rabiser
10.1145/3777911.3800636

http://arxiv.org/abs/2603.03932v1
Selecting Offline Reinforcement Learning Algorithms for Stochastic Network Control
2026-03-04T10:41:10Z
Offline Reinforcement Learning (RL) is a promising approach for next-generation wireless networks, where online exploration is unsafe and large amounts of operational data can be reused across the model lifecycle.
However, the behavior of offline RL algorithms under genuinely stochastic dynamics -- inherent to wireless systems due to fading, noise, and traffic mobility -- remains insufficiently understood. We address this gap by evaluating Bellman-based (Conservative Q-Learning), sequence-based (Decision Transformers), and hybrid (Critic-Guided Decision Transformers) offline RL methods in an open-access stochastic telecom environment (mobile-env). Our results show that Conservative Q-Learning consistently produces more robust policies across different sources of stochasticity, making it a reliable default choice in lifecycle-driven AI management frameworks. Sequence-based methods remain competitive and can outperform Bellman-based approaches when sufficient high-return trajectories are available. These findings provide practical guidance for offline RL algorithm selection in AI-driven network control pipelines, such as O-RAN and future 6G functions, where robustness and data availability are key operational constraints.
2026-03-04T10:41:10Z
Long version 12 pages, double column including Appendix. Short version accepted at NOMS2026-IPSN, Rome, Italy
Nicolas Helson, Pegah Alizadeh, Anastasios Giovanidis

http://arxiv.org/abs/2603.08745v1
ChatNeuroSim: An LLM Agent Framework for Automated Compute-in-Memory Accelerator Deployment and Optimization
2026-03-04T03:34:12Z
Compute-in-Memory (CIM) architectures have been widely studied for deep neural network (DNN) acceleration by reducing data transfer overhead between the memory and computing units. In conventional CIM design flows, system-level CIM simulators (such as NeuroSim) are leveraged for design space exploration (DSE) across different hardware configurations and DNN workloads. However, CIM designers need to invest substantial effort in interpreting simulator manuals and understanding complex parameter dependencies.
Moreover, extensive design-simulation iterations are often required to identify optimal CIM configurations under hardware constraints. These challenges severely prolong the DSE cycle and hinder rapid CIM deployment. To address these challenges, this work proposes ChatNeuroSim, a large language model (LLM)-based agent framework for automated CIM accelerator deployment and optimization. ChatNeuroSim automates the entire CIM workflow, including task scheduling, request parsing and adjustment, parameter dependency checking, script generation, and simulation execution. It also integrates the proposed CIM optimizer using design space pruning, enabling rapid identification of optimal configurations for different DNN workloads. ChatNeuroSim is evaluated on extensive request-level testbenches and demonstrates correct simulation and optimization behavior, validating its effectiveness in automatic request parsing and task execution. Furthermore, the proposed design space pruning technique accelerates the CIM optimization process compared to a no-pruning baseline. In the case study optimizing Swin Transformer Tiny under 22 nm technology, the proposed CIM optimizer achieves a 0.42$\times$-0.79$\times$ average runtime reduction compared to the same optimization algorithm without design space pruning.
2026-03-04T03:34:12Z
30 pages, 16 figures
Ming-Yen Lee, Shimeng Yu

http://arxiv.org/abs/2603.02510v1
ParEVO: Synthesizing Code for Irregular Data: High-Performance Parallelism through Agentic Evolution
2026-03-03T01:41:07Z
The transition from sequential to parallel computing is essential for modern high-performance applications but is hindered by the steep learning curve of concurrent programming. This challenge is magnified for irregular data structures (such as sparse graphs, unbalanced trees, and non-uniform meshes) where static scheduling fails and data dependencies are unpredictable.
Current Large Language Models (LLMs) often fail catastrophically on these tasks, generating code plagued by subtle race conditions, deadlocks, and sub-optimal scaling.
We bridge this gap with ParEVO, a framework designed to synthesize high-performance parallel algorithms for irregular data. Our contributions include: (1) The Parlay-Instruct Corpus, a curated dataset of 13,820 tasks synthesized via a "Critic-Refine" pipeline that explicitly filters for empirically performant algorithms that effectively utilize Work-Span parallel primitives; (2) specialized DeepSeek, Qwen, and Gemini models fine-tuned to align probabilistic generation with the rigorous semantics of the ParlayLib library; and (3) an Evolutionary Coding Agent (ECA) that improves the "last mile" of correctness by iteratively repairing code using feedback from compilers, dynamic race detectors, and performance profilers.
On the ParEval benchmark, ParEVO achieves an average 106x speedup (with a maximum of 1103x) across the suite, and a robust 13.6x speedup specifically on complex irregular graph problems, outperforming state-of-the-art commercial models. Furthermore, our evolutionary approach matches state-of-the-art expert human baselines, achieving up to a 4.1x speedup on specific highly-irregular kernels. Source code and datasets are available at https://github.com/WildAlg/ParEVO.
2026-03-03T01:41:07Z
Liu Yang, Zeyu Nie, Andrew Liu, Felix Zou, Deniz Altinbüken, Amir Yazdanbakhsh, Quanquan C. Liu

http://arxiv.org/abs/2603.03376v1
Comparison of Credential Management Systems Based on the Standards of IEEE, ETSI, and YD/T 3957-2021
2026-03-03T00:19:32Z
As V2X (Vehicle-to-Everything) technology becomes increasingly prevalent, the security of V2X networks has garnered growing attention worldwide. In North America, the IEEE 1609 series standards are primarily used, while Europe adopts the ETSI series standards, and China has also established its industry standard, YD/T 3957-2021, among others. Although these standards share some commonalities, they also exhibit differences. To achieve compatibility across these standards, analyzing their similarities and differences is a crucial issue. Therefore, this study focuses on analyzing the three major standards mentioned above, discussing aspects such as certificate formats, signed message formats, and certificate request processes. Additionally, this research evaluates the efficiency of different cryptographic methods, including NIST P-256 and SM2-256, SHA-256 and SM3-256, as well as AES-128 and SM4-128. Finally, the study implements these three major standards on V2X devices and compares the efficiency of message signing and signature verification in V2X systems, providing a reference for the development of a secure certificate management system for V2X networks.
2026-03-03T00:19:32Z
Abel C. H. Chen

http://arxiv.org/abs/2603.02621v1
GoldbachGPU: An Open Source GPU-Accelerated Framework for Verification of Goldbach's Conjecture
2026-03-02T15:51:57Z
We present GoldbachGPU, an open-source framework for large-scale computational verification of Goldbach's conjecture using commodity GPU hardware. Prior GPU-based approaches reported a hard memory ceiling near 10^11 due to monolithic prime-table allocation. We show that this limitation is architectural rather than fundamental: a dense bit-packed prime representation provides a 16x reduction in memory footprint, and a segmented double-sieve design removes the VRAM ceiling entirely. By inverting the verification loop and combining a GPU fast-path with a multi-phase primality oracle, the framework achieves exhaustive verification up to 10^12 on a single NVIDIA RTX 3070 (8 GB VRAM), with no counterexamples found. Each segment requires 14 MB of VRAM, yielding O(N) wall-clock time and O(1) memory in N. A rigorous CPU fallback guarantees mathematical completeness, though it was never invoked in practice. An arbitrary-precision checker using GMP and OpenMP extends single-number verification to 10^10000 via a synchronised batch-search strategy. The segmented architecture also exhibits clean multi-GPU scaling on data-centre hardware (tested on 8 x H100). All code is open-source, documented, and reproducible on both commodity and high-end hardware.
2026-03-02T15:51:57Z
11 pages, 7 tables, 2 figures. Accompanies the v1.1.0 release of GoldbachGPU (Zenodo DOI: https://zenodo.org/records/18837081)
Isaac Llorente-Saguer

http://arxiv.org/abs/2603.01915v1
Fast Entropy Decoding for Sparse MVM on GPUs
2026-03-02T14:28:48Z
We present a novel, practical approach to speed up sparse matrix-vector multiplication (SpMVM) on GPUs. The novel key idea is to apply lossless entropy coding to further compress the sparse matrix when stored in one of the commonly supported formats.
Our method is based on dtANS, our new lossless compression method that improves the entropy coding technique of asymmetric numeral systems (ANS) specifically for fast parallel GPU decoding when used in tandem with SpMVM. We apply dtANS on the widely used CSR format and present extensive benchmarks on the SuiteSparse collection of matrices against the state-of-the-art cuSPARSE library. On matrices with at least 2^(15) entries and at least 10 entries per row on average, our compression reduces the matrix size over the smallest cuSPARSE format (CSR, COO and SELL) in almost all cases and up to 11.77 times. Further, we achieve an SpMVM speedup for the majority of matrices with at least 2^(25) nonzero entries. The best speedup is 3.48x. We also show that we can improve over the AI-based multi-format AlphaSparse in an experiment that is limited due to its extreme computation overhead. We provide our code as an open source C++/CUDA header library, which includes both compression and multiplication kernels.
2026-03-02T14:28:48Z
To appear in 40th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2026. Reproducibility Appendix available at https://doi.org/10.5281/zenodo.18694064
Emil Schätzle, Tommaso Pegolotti, Markus Püschel

http://arxiv.org/abs/2601.10729v2
OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration
2026-03-02T08:37:18Z
Serving long-context LLMs is challenging because request lengths and batch composition vary during token generation, causing the memory footprint to fluctuate significantly at runtime. Offloading KV caches to host memory limits effective memory usage, but existing static and predetermined offloading strategies cannot adapt to the rapidly shifting memory demands of long-context serving. This often leads to excessive CPU-to-GPU KV transfers that translate into latency spikes and frequent SLO violations.
To address these challenges, we introduce OrbitFlow, a fine-grained and adaptive KV cache management system that meets latency SLOs in long-context LLM serving. OrbitFlow employs a lightweight ILP solver to decide which layers' KV caches to retain on the GPU for each request, within memory capacity constraints. It continuously refines KV placements based on runtime feedback when the active plan becomes suboptimal during token generation. Under heavy load, OrbitFlow invokes a fallback mechanism to temporarily defer in-flight requests with large memory footprints, preserving overall SLO attainment. Our experiments demonstrate that OrbitFlow improves SLO attainment for TPOT and TBT by up to 66% and 48%, respectively, while reducing the 95th percentile latency by 38% and achieving up to 3.3x higher throughput compared to existing offloading methods.
2026-01-05T04:02:34Z
Accepted at the 52nd International Conference on Very Large Data Bases (VLDB 2026). Xinyue Ma and Heelim Hong contributed equally (co-first authors)
Xinyue Ma, Heelim Hong, Taegeon Um, Jongseop Lee, Seoyeong Choy, Woo-Yeon Lee, Myeongjae Jeon

http://arxiv.org/abs/2603.02271v1
Characterizing VLA Models: Identifying the Action Generation Bottleneck for Edge AI Architectures
2026-03-01T01:09:55Z
Vision-Language-Action (VLA) models are an emerging class of workloads critical for robotics and embodied AI at the edge. As these models scale, they demonstrate significant capability gains, yet they must be deployed locally to meet the strict latency requirements of real-time applications. This paper characterizes VLA performance on two generations of edge hardware, viz. the Nvidia Jetson Orin and Thor platforms. Using MolmoAct-7B, a state-of-the-art VLA model, we identify a primary execution bottleneck: up to 75% of end-to-end latency is consumed by the memory-bound action-generation phase. Through analytical modeling and simulations, we project the hardware requirements for scaling to 100B parameter models.
We also explore the impact of high-bandwidth memory technologies and processing-in-memory (PIM) as promising future pathways in edge systems for embodied AI.
2026-03-01T01:09:55Z
3 Pages 4 Figures for Workshop paper
Manoj Vishwanathan, Suvinay Subramanian, Anand Raghunathan

http://arxiv.org/abs/2603.00551v1
GCL-Sampler: Discovering Kernel Similarity for Sampled GPU Simulation via Graph Contrastive Learning
2026-02-28T09:03:18Z
GPU architectural simulation is orders of magnitude slower than native execution, necessitating workload sampling for practical speedups. Existing methods rely on hand-crafted features with limited expressiveness, yielding either aggressive sampling with high errors or conservative sampling with constrained speedups. To address these issues, we propose GCL-Sampler, a sampling framework that leverages Relational Graph Convolutional Networks with contrastive learning to automatically discover high-dimensional kernel similarities from trace graphs. By encoding instruction sequences and data dependencies into graph embeddings, GCL-Sampler captures rich structural and semantic properties of program execution, enabling both high fidelity and substantial speedup. Evaluations on extensive benchmarks show that GCL-Sampler achieves 258.94x average speedup against full workload with 0.37% error, outperforming state-of-the-art methods, PKA (129.23x, 20.90%), Sieve (94.90x, 4.10%) and STEM+ROOT (56.57x, 0.38%).
2026-02-28T09:03:18Z
Jiaqi Wang, Jingwei Sun, Jiyu Luo, Han Li, Guangzhong Sun

http://arxiv.org/abs/2603.00549v1
PM2Lat: Highly Accurate and Generalized Prediction of DNN Execution Latency on GPUs
2026-02-28T08:56:09Z
We present PM2Lat, a fast and generalized framework for accurately predicting the latency of deep neural network models on GPUs, with special focus on NVIDIA. Unlike prior methods that rely on deep learning models or handcrafted heuristics, PM2Lat leverages the Single-Instruction-Multiple-Thread architecture of GPUs to model execution time of DNN models.
First, we dive into fine-grained GPU operation modeling by studying computational behavior and memory access patterns. After identifying these characteristics, we found that different GPU kernels exhibit significant performance disparities, even when serving the same purpose. Hence, the core idea of PM2Lat is to differentiate kernels based on their configurations and analyze them accordingly. This kernel-aware modeling enables PM2Lat to achieve consistently low prediction error across diverse data types and hardware platforms. In addition, PM2Lat generalizes beyond standard matrix multiplication to support complex GPU kernels such as Triton, Flash Attention, and Cutlass Attention. Experimental results show that PM2Lat consistently achieves error rates below 10% across different data types and hardware platforms on Transformer models, outperforming the state-of-the-art NeuSight by 10-20% for FP32 and by at least 50% for BF16. When applied to diverse kernels, the error rate is maintained at 3-8%.
2026-02-28T08:56:09Z
Truong-Thanh Le, Hoang-Loc La, Amir Taherkordi, Frank Eliassen, Phuong Hoai Ha, Peiyuan Guan

http://arxiv.org/abs/2603.00326v1
Vectorized Adaptive Histograms for Sparse Oblique Forests
2026-02-27T21:36:44Z
Classification using sparse oblique random forests provides guarantees on uncertainty and confidence while controlling for specific error types. However, they use more data and more compute than other tree ensembles because they create deep trees and need to sort or histogram linear combinations of data at runtime. We provide a method for dynamically switching between histograms and sorting to find the best split. We further optimize histogram construction using vector intrinsics. Evaluating this on large datasets, our optimizations speed up training by 1.7-2.5x compared to existing oblique forests and 1.5-2x compared to standard random forests.
We also provide a GPU and hybrid CPU-GPU implementation.
2026-02-27T21:36:44Z
Ariel Lubonja, Jungsang Yoon, Haoyin Xu, Yue Wan, Yilin Xu, Richard Stotz, Mathieu Guillame-Bert, Joshua T. Vogelstein, Randal Burns

http://arxiv.org/abs/2601.17551v2
GreenServ: Energy-Efficient Context-Aware Dynamic Routing for Multi-Model LLM Inference
2026-02-27T12:53:15Z
Large language models (LLMs) demonstrate remarkable capabilities, but their broad deployment is limited by significant computational resource demands, particularly energy consumption during inference. Static, one-model-fits-all inference strategies are often inefficient, as they do not exploit the diverse range of available models or adapt to varying query requirements.
This paper presents GreenServ, a dynamic, context-aware routing framework that optimizes the trade-off between inference accuracy and energy efficiency. GreenServ extracts lightweight contextual features from each query, including task type, semantic cluster, and text complexity, and routes queries to the most suitable model from a heterogeneous pool, based on observed accuracy and energy usage. We employ a multi-armed bandit approach to learn adaptive routing policies online. This approach operates under partial feedback, eliminates the need for extensive offline calibration, and streamlines the integration of new models into the inference pipeline.
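The routing idea above can be pictured with a toy bandit. The sketch below is not GreenServ's algorithm: it swaps in a plain epsilon-greedy rule, keyed on a single context string, purely to illustrate learning a routing policy online from partial feedback. The class name `EpsilonGreedyRouter`, the `energy_weight` trade-off parameter, and the assumption that accuracy and energy are both normalized to [0, 1] are all hypothetical.

```python
import random
from collections import defaultdict


class EpsilonGreedyRouter:
    """Toy contextual bandit: one set of arms (models) per context key."""

    def __init__(self, models, epsilon=0.1, energy_weight=0.5, seed=0):
        self.models = list(models)
        self.epsilon = epsilon
        self.energy_weight = energy_weight  # hypothetical accuracy/energy trade-off
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)     # (context, model) -> number of pulls
        self.values = defaultdict(float)   # (context, model) -> running mean reward

    def choose(self, context):
        # Try every model once per context before exploiting.
        untried = [m for m in self.models if self.counts[(context, m)] == 0]
        if untried:
            return untried[0]
        # Explore with probability epsilon, otherwise pick the best-known model.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)
        return max(self.models, key=lambda m: self.values[(context, m)])

    def update(self, context, model, accuracy, energy):
        # Partial feedback: only the chosen model's outcome is observed.
        # Assumes accuracy and energy are both scaled to [0, 1].
        reward = accuracy - self.energy_weight * energy
        key = (context, model)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]
```

In use, a caller would ask `choose(context)` for each incoming query, run the selected model, and feed the measured accuracy and energy back through `update`. Because no offline calibration table is needed, a new model can be added by appending it to `models`; it is tried once per context and then competes on its observed reward.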
We evaluated GreenServ across five benchmark tasks and a pool of 16 contemporary open-access LLMs. Experimental results show that GreenServ consistently outperforms static (single-model) and random baselines. In particular, compared to random routing, GreenServ achieved a 22% increase in accuracy while reducing cumulative energy consumption by 31%. Finally, we evaluated GreenServ with RouterBench, achieving an average accuracy of 71.7% with a peak accuracy of 75.7%. All artifacts are open-source and available on GitHub: https://github.com/TZData1/llm-inference-router
2026-01-24T18:42:16Z
Paper under submission
Thomas Ziller, Shashikant Ilager, Alessandro Tundo, Ezio Bartocci, Leonardo Mariani, Ivona Brandic

http://arxiv.org/abs/2602.23935v1
Green or Fast? Learning to Balance Cold Starts and Idle Carbon in Serverless Computing
2026-02-27T11:35:15Z
Serverless computing simplifies cloud deployment but introduces new challenges in managing service latency and carbon emissions. Reducing cold-start latency requires retaining warm function instances, while minimizing carbon emissions favors reclaiming idle resources. This balance is further complicated by time-varying grid carbon intensity and varying workload patterns, under which static keep-alive policies are inefficient. We present LACE-RL, a latency-aware and carbon-efficient management framework that formulates serverless pod retention as a sequential decision problem. LACE-RL uses deep reinforcement learning to dynamically tune keep-alive durations, jointly modeling cold-start probability, function-specific latency costs, and real-time carbon intensity. Using the Huawei Public Cloud Trace, we show that LACE-RL reduces cold starts by 51.69% and idle keep-alive carbon emissions by 77.08% compared to Huawei's static policy, while achieving better latency-carbon trade-offs than state-of-the-art heuristic and single-objective baselines, approaching Oracle performance.
2026-02-27T11:35:15Z
Bowen Sun, Christos D. Antonopoulos, Evgenia Smirni, Bin Ren, Nikolaos Bellas, Spyros Lalis

http://arxiv.org/abs/2602.23598v1
QoSFlow: Ensuring Service Quality of Distributed Workflows Using Interpretable Sensitivity Models
2026-02-27T01:59:05Z
With the increasing importance of distributed scientific workflows, there is a critical need to ensure Quality of Service (QoS) constraints, such as minimizing time or limiting execution to resource subsets. However, the unpredictable nature of workflow behavior, even with similar configurations, makes it difficult to provide QoS guarantees. For effective reasoning about QoS scheduling, we introduce QoSFlow, a performance modeling method that partitions a workflow's execution configuration space into regions with similar behavior. Each region groups configurations with comparable execution times according to a given statistical sensitivity, enabling efficient QoS-driven scheduling through analytical reasoning rather than exhaustive testing. Evaluation on three diverse workflows shows that QoSFlow's execution recommendations outperform the best-performing standard heuristic by 27.38%. Empirical validation confirms that QoSFlow's recommended configurations consistently match measured execution outcomes across different QoS constraints.
2026-02-27T01:59:05Z
To be published in the 40th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2026
Md Hasanur Rashid, Jesun Firoz, Nathan R. Tallent, Luanzheng Guo, Meng Tang, Dong Dai
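
The QoSFlow abstract describes grouping configurations into regions with comparable execution times. As a rough illustration only, and not the paper's sensitivity model, the sketch below uses a simple greedy rule: sort configurations by measured runtime and open a new region whenever a runtime exceeds the region's fastest member by more than a relative tolerance. The function name `partition_by_runtime`, the `tolerance` parameter, and the sample runtimes are hypothetical.

```python
def partition_by_runtime(runtimes, tolerance=0.10):
    """Greedily group configurations with comparable runtimes.

    `runtimes` maps a configuration name to a measured execution time.
    A configuration joins the current region if its runtime is within
    `tolerance` (relative) of that region's fastest member.
    """
    regions = []
    for config, runtime in sorted(runtimes.items(), key=lambda kv: kv[1]):
        if regions and runtime <= regions[-1]["base"] * (1 + tolerance):
            regions[-1]["configs"].append(config)
        else:
            # Start a new region anchored at this (faster-than-the-rest) runtime.
            regions.append({"base": runtime, "configs": [config]})
    return regions


# Hypothetical measurements in seconds, for illustration only.
sample = {"a": 100, "b": 105, "c": 130, "d": 139, "e": 200}
regions = partition_by_runtime(sample, tolerance=0.10)
```

With regions like these, a scheduler could reason about a QoS constraint per region instead of per configuration, for example by treating every configuration in the fastest region as interchangeable and choosing among them by resource availability.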