Staging Blocked Evaluation over Structured Sparse Matrices

2026-02-12T15:46:43Z

The matrices used in many computational settings are naturally sparse, holding a small percentage of nonzero elements. Storing such matrices in specialized sparse formats enables algorithms that avoid wasting computation on zeros, significantly accelerating common matrix computations like sparse matrix-vector multiplication (SpMV) and sparse matrix-matrix multiplication (SpMM). In many real-world sparse matrices, however, nonzero elements are densely clustered in subregions of the matrix. For matrices that feature this sort of structured sparsity, hybrid formats can further improve performance by representing these subregions as dense blocks. Existing hybrid formats either fix the dimensions of dense blocks, padding irregular regions with zeros and wasting computation, or incur run-time overhead when iterating over variable-sized blocks. This paper presents SABLE, a framework for accelerating structured sparse matrix computations by using staging to achieve the best of both of these approaches. Ahead of execution, SABLE inspects the matrix to identify variable-sized dense subregions, which it stores using a new hybrid format. It then eliminates the overhead typically associated with variable-sized blocks by using staging to generate specialized code that is amenable to vectorization. We evaluate SABLE on SpMV and SpMM kernels using matrices from the popular SuiteSparse data set. SABLE outperforms the best available SpMV baseline by ${\sim}$10\% on average, and SpMM baselines by ${\sim}$20\%. When parallelized, SABLE achieves further speedups of up to ${\sim}7\times$ on SpMV and SpMM over the best fully-sparse baseline when using 8 threads.
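
The staging idea described above can be illustrated with a small sketch. This is not SABLE's format or code generator: the block layout `(row0, col0, height, width)`, the function names, and the use of Python source generation are all assumptions made for illustration. The point is only that block bounds become literal constants in the generated kernel, so no block-shape bookkeeping remains at run time.

```python
def stage_spmv(blocks, rest):
    """Generate Python source for an SpMV kernel specialized to one matrix.

    blocks: list of (row0, col0, height, width) dense regions (hypothetical layout)
    rest:   list of (row, col) positions of the remaining scattered nonzeros
    Only the sparsity structure is baked into the code; values arrive at run time.
    """
    lines = ["def spmv(block_vals, rest_vals, x, y):"]
    for b, (r0, c0, h, w) in enumerate(blocks):
        # Bounds and offsets are emitted as constants for this specific block.
        lines.append(f"    vals = block_vals[{b}]")
        lines.append(f"    for i in range({h}):")
        lines.append("        row = vals[i]")
        lines.append("        acc = 0.0")
        lines.append(f"        for j in range({w}):")
        lines.append(f"            acc += row[j] * x[{c0} + j]")
        lines.append(f"        y[{r0} + i] += acc")
    for k, (r, c) in enumerate(rest):
        # Scattered nonzeros become straight-line code.
        lines.append(f"    y[{r}] += rest_vals[{k}] * x[{c}]")
    lines.append("    return y")
    return "\n".join(lines)


def compile_kernel(source):
    """Turn the generated source into a callable."""
    namespace = {}
    exec(source, namespace)
    return namespace["spmv"]


# 4x4 example: a dense 2x2 block in the top-left corner plus two diagonal entries.
blocks = [(0, 0, 2, 2)]
rest = [(2, 2), (3, 3)]
kernel = compile_kernel(stage_spmv(blocks, rest))
y = kernel([[[1.0, 2.0], [3.0, 4.0]]], [5.0, 6.0],
           [1.0, 2.0, 3.0, 4.0], [0.0] * 4)
```

A real implementation would emit compiled, vectorizable code rather than interpreted Python; the sketch only shows where the specialization happens.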

Designing Scalable Rate Limiting Systems: Algorithms, Architecture, and Distributed Solutions

2026-02-12T09:11:08Z

Designing a rate limiter that is simultaneously accurate, available, and scalable is a fundamental challenge in distributed systems, primarily because of the trade-offs among algorithmic precision, availability, consistency, and partition tolerance. This article presents a concrete architecture for a distributed rate limiting system in a production-grade environment. Our design uses Redis, an in-memory data store, together with its Sorted Set data structure, which provides $O(\log N)$-time operations on the stored entries and so keeps latency low while maintaining precision. The core contribution is quantifying the accuracy and memory cost trade-off of the Rolling Window algorithm, which we implement, against the Token Bucket and Fixed Window algorithms. In addition, we explain how server-side Lua scripting is critical to bundling cleanup, counting, and insertion into a single atomic operation, thereby eliminating race conditions in concurrent environments. We propose a three-layer architecture that manages the storage and updating of the limit rules. By loading scripts once and hashing the rule parameters, rules can be changed without modifying the cached scripts. Furthermore, we analyze the deployment of this architecture on a Redis Cluster, which provides availability and scalability through data sharding and replication. We explain why accepting AP (Availability and Partition Tolerance) from the CAP theorem is the pragmatic engineering trade-off for this use case.
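
The three steps that the abstract bundles into one atomic operation (cleanup, counting, insertion) can be sketched without Redis. The class below is an in-memory stand-in, not the article's implementation: a sorted Python list plays the role of the Sorted Set, a lock plays the role of the atomic Lua script, and the class and parameter names are hypothetical.

```python
import bisect
import threading


class SlidingWindowLimiter:
    """In-memory sketch of a rolling-window limiter (stand-in for a sorted set)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._events = {}              # key -> sorted list of request timestamps
        self._lock = threading.Lock()  # makes the three steps one atomic unit

    def allow(self, key, now):
        with self._lock:
            events = self._events.setdefault(key, [])
            # 1. Cleanup: drop timestamps that fell out of the window.
            cutoff = now - self.window
            del events[:bisect.bisect_right(events, cutoff)]
            # 2. Count: reject if the window is already full.
            if len(events) >= self.limit:
                return False
            # 3. Insert: record this request.
            # Note: the binary search is O(log N), but list insertion is O(N);
            # a real sorted set avoids that cost.
            bisect.insort(events, now)
            return True


limiter = SlidingWindowLimiter(limit=2, window_seconds=10.0)
decisions = [limiter.allow("user", t) for t in (0.0, 1.0, 2.0, 10.5, 10.6)]
```

Passing `now` explicitly keeps the sketch deterministic; a deployed limiter would need a single agreed clock source, which the abstract does not specify.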

Resource-Efficient RGB-Only Action Recognition for Edge Deployment

2026-02-11T13:01:56Z

Action recognition on edge devices poses stringent constraints on latency, memory, storage, and power consumption. While auxiliary modalities such as skeleton and depth information can enhance recognition performance, they often require additional sensors or computationally expensive pose-estimation pipelines, limiting practicality for edge use. In this work, we propose a compact RGB-only network tailored for efficient on-device inference. Our approach builds upon an X3D-style backbone augmented with Temporal Shift, and further introduces selective temporal adaptation and parameter-free attention. Extensive experiments on the NTU RGB+D 60 and 120 benchmarks demonstrate a strong accuracy-efficiency balance. Moreover, deployment-level profiling on the Jetson Orin Nano verifies a smaller on-device footprint and practical resource utilization compared to existing RGB-based action recognition techniques.

Supercharging Packet-level Network Simulation of Large Model Training via Memoization and Fast-Forwarding

2026-02-11T08:06:52Z

Packet-level discrete-event simulation (PLDES) is a prevalent tool for evaluating the detailed performance of large model training. Although PLDES offers high fidelity and generality, its slow performance has plagued networking practitioners. Existing optimization techniques either simplify the network model, resulting in large errors, or execute it in parallel on multiple processors, which bounds the achievable speedup. This paper explores an alternative optimization direction that reduces the computational load of PLDES while maintaining high fidelity. Our key insight is that, in distributed LLM training, packet-level traffic often exhibits repetitive contention patterns and steady states in which flow rates stabilize. Ignoring these redundant discrete events speeds up the simulation considerably with negligible error. We realize this idea in Wormhole, a user-transparent PLDES kernel that automatically memoizes unsteady states and skips steady states. Wormhole adopts network partitioning, state memoization and reuse, and rate-based steady-state identification to accurately determine the periods of each flow's steady state, while maintaining simulation consistency after fast-forwarding. Experiments demonstrate that Wormhole can achieve a 744x speedup over the original ns-3 (510x for an MoE workload), with a bounded error of <1%. Applying current multithreading parallel techniques and Wormhole together allows a 1012x speedup, reducing the simulation time for one GPT-13B training run on 128 GPUs from 9 hours to 5 minutes.
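
As a rough illustration of rate-based steady-state identification, the sketch below scans a flow's sampled rates for the first window in which every sample stays close to the window mean. The window length, the relative tolerance, and the function name are assumptions; the abstract does not state Wormhole's actual criterion.

```python
def steady_state_start(rates, window=4, rel_tol=0.01):
    """Return the index where the earliest stable window begins, or None.

    A window is treated as stable when every sample lies within rel_tol of the
    window mean (a hypothetical criterion chosen for this sketch).
    """
    for start in range(len(rates) - window + 1):
        chunk = rates[start:start + window]
        mean = sum(chunk) / window
        if mean > 0 and all(abs(r - mean) <= rel_tol * mean for r in chunk):
            return start
    return None


# A flow that ramps up and then settles at 10 units per sampling interval.
samples = [1.0, 4.0, 7.0, 9.0, 10.0, 10.0, 10.0, 10.0, 10.0]
start = steady_state_start(samples)
```

Once such a point is found, a simulator could in principle advance the flow analytically instead of processing each packet event, which is the fast-forwarding the abstract refers to.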

Mitigating GIL Bottlenecks in Edge AI Systems

2026-02-11T06:49:13Z

Deploying Python-based AI agents on resource-constrained edge devices presents a critical runtime optimization challenge: high thread counts are needed to mask I/O latency, yet Python's Global Interpreter Lock (GIL) serializes execution. We demonstrate that naive thread pool scaling causes a "saturation cliff": a performance degradation of >= 20% at overprovisioned thread counts (N >= 512) on edge-representative configurations. We present a lightweight profiling tool and adaptive runtime system that uses a Blocking Ratio metric (beta) to distinguish genuine I/O wait from GIL contention. Our library-based solution achieves 96.5% of optimal performance without manual tuning, outperforming multiprocessing (which is limited by ~8x memory overhead on devices with 512 MB-2 GB RAM) and asyncio (which blocks during CPU-bound phases). Evaluation across seven edge AI workload profiles, including real ML inference with ONNX Runtime MobileNetV2, demonstrates 93.9% average efficiency. Comparative experiments with Python 3.13t (free-threading) show that while GIL elimination enables ~4x throughput on multi-core edge devices, the saturation cliff persists on single-core devices due to context-switching overhead, validating our beta metric for both GIL and no-GIL environments. This work provides a practical optimization strategy for memory-constrained edge AI systems where traditional solutions fail.

XLB: A High Performance Layer-7 Load Balancer for Microservices using eBPF-based In-kernel Interposition

2026-02-10T07:12:24Z

L7 load balancers are a fundamental building block in microservices as they enable fine-grained traffic distribution. Compared to monolithic applications, microservices demand higher performance and stricter isolation from load balancers. This is due to the increased number of instances, longer service chains, and the necessity for co-location with services on the same host. Traditional sidecar-based load balancers are ill-equipped to meet these demands, often resulting in significant performance degradation. In this work, we present XLB, a novel architecture that reshapes L7 load balancers as in-kernel interposition operating at the socket layer. We leverage eBPF to implement the core load balancing logic in the kernel, and address the connection management and state maintenance challenges through novel socket-layer redirection and nested eBPF map designs. XLB eliminates the extra overhead of scheduling, communication, and data movement, resulting in a more lightweight, scalable, and efficient L7 load balancer architecture. With over 50 microservice instances, XLB achieves up to 1.5x higher throughput and 60% lower end-to-end latency than the widely used microservice load balancers Istio and Cilium.

Generalizing Scaling Laws for Dense and Sparse Large Language Models

2026-02-09T19:31:28Z

Despite recent advances in large language models (LLMs), predicting the optimal model size for LLM pretraining, or allocating resources optimally, remains a challenge. Several efforts have addressed this challenge by proposing empirical scaling laws, but almost all of them are architecture-specific (dense or sparse). In this work we revisit existing empirical scaling laws and propose a generalized scaling law that provides a unified framework applicable to both dense and sparse large language models. We compare the proposed law with existing scaling laws and demonstrate that it captures their scaling behavior. Further, we show an IsoFLOP comparison between the proposed law and the state-of-the-art scaling law to illustrate its effectiveness for very large Mixture-of-Experts (MoE) LLMs like DeepSeek-V3. The proposed law can be used to estimate the best model hyperparameters (model size, tokens, and compute) for a given sparsity, or to identify the optimal sparsity for given model hyperparameters.

A Machine Learning accelerated geophysical fluid solver

2026-02-09T13:55:26Z

Machine learning methods have been successful in many areas, such as image classification and natural language processing. However, it remains unclear how best to apply ML to areas with mathematical constraints, such as solving PDEs. Among the various approaches to applying ML techniques to PDEs, the data-driven discretization method is a promising way of accelerating and improving existing PDE solvers on structured grids: it predicts the coefficients of quasi-linear stencils for computing values or derivatives of a function at given positions. It can improve the accuracy and stability of low-resolution simulations compared with traditional finite difference or finite volume schemes. It can also borrow from traditional numerical schemes, for example by adopting finite-volume-type formulations to satisfy conservation laws. In this thesis, we implement classic solvers for the shallow water and Euler equations under a different framework. Experiments show that our classic solver performs much better than the PyClaw solver. We then propose four different deep neural networks for the ML-based solver. The results indicate that two of these approaches can produce satisfactory solutions.
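
To make the link between stencil coefficients and conservation concrete, the following sketch applies one finite-volume step to 1D linear advection on a periodic grid. It is an illustration under stated assumptions, not the thesis's solver: the equation, the function name, and the coefficient dictionaries are hypothetical. In a data-driven discretization, a network would supply the stencil coefficients; here two fixed choices stand in for it.

```python
def fv_step(u, coeffs, velocity, dx, dt):
    """One finite-volume step for u_t + velocity * u_x = 0 on a periodic grid.

    coeffs maps a cell offset to the weight used to reconstruct u at a cell's
    left face (a learned model would predict these weights).
    """
    n = len(u)
    face = [sum(w * u[(i + k) % n] for k, w in coeffs.items()) for i in range(n)]
    flux = [velocity * f for f in face]
    # The right face of cell i is the left face of cell i + 1, so every flux is
    # added to one cell and subtracted from its neighbor.
    return [u[i] - dt / dx * (flux[(i + 1) % n] - flux[i]) for i in range(n)]


UPWIND = {-1: 1.0}             # face value taken from the cell to the left
CENTERED = {-1: 0.5, 0: 0.5}   # average of the two neighboring cells

u0 = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
u_upwind = fv_step(u0, UPWIND, velocity=1.0, dx=1.0, dt=0.5)
u_centered = fv_step(u0, CENTERED, velocity=1.0, dx=1.0, dt=0.5)
```

Because the update is written in flux form, the sum of the cell values is preserved for any coefficient choice, which is why a finite-volume formulation gives conservation regardless of what predicts the coefficients. Stability is a separate matter and does depend on the coefficients.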

Efficient Graph Knowledge Distillation from GNNs to Kolmogorov--Arnold Networks via Self-Attention Dynamic Sampling

2026-02-09T06:37:04Z

The recent success of graph neural networks (GNNs) in modeling complex graph-structured data has fueled interest in deploying them on resource-constrained edge devices. However, their substantial computational and memory demands present ongoing challenges. Knowledge distillation (KD) from GNNs to MLPs offers a lightweight alternative, but MLPs remain limited by fixed activations and the absence of neighborhood aggregation, constraining distilled performance. To tackle these intertwined limitations, we propose SA-DSD, a novel self-attention-guided dynamic sampling distillation framework. To the best of our knowledge, this is the first work to employ an enhanced Kolmogorov-Arnold Network (KAN) as the student model. We improve Fourier KAN (FR-KAN+) with learnable frequency bases, phase shifts, and optimized algorithms, substantially improving nonlinear fitting capability over MLPs while preserving low computational complexity. To explicitly compensate for the absence of neighborhood aggregation that is inherent to both MLPs and KAN-based students, SA-DSD leverages a self-attention mechanism to dynamically identify influential nodes, construct adaptive sampling probability matrices, and enforce teacher-student prediction consistency. Extensive experiments on six real-world datasets demonstrate that, under inductive and most transductive settings, SA-DSD surpasses three GNN teachers by 3.05%-3.62% and improves FR-KAN+ by 15.61%. Moreover, it achieves a 16.69x parameter reduction and a 55.75% decrease in average runtime per epoch compared to key benchmarks.

A quantum-inspired multi-level tensor-train monolithic space-time method for nonlinear PDEs

2026-02-08T12:38:16Z

We propose a multilevel tensor-train (TT) framework for solving nonlinear partial differential equations (PDEs) in a global space-time formulation. While space-time TT solvers have demonstrated significant potential for compressed high-dimensional simulations, the literature contains few systematic comparisons with classical time-stepping methods, limited error convergence analyses, and little quantitative assessment of the impact of TT rounding on numerical accuracy. Likewise, existing studies fail to demonstrate performance across a diverse set of PDEs and parameter ranges. In practice, monolithic Newton iterations may stagnate or fail to converge in strongly nonlinear, stiff, or advection-dominated regimes, where poor initial guesses and severely ill-conditioned space-time Jacobians hinder robust convergence. We overcome this limitation by introducing a coarse-to-fine multilevel strategy fully embedded within the TT format. Each level refines both spatial and temporal resolutions while transferring the TT solution through low-rank prolongation operators, providing robust initializations for successive Newton solves. Residuals, Jacobians, and transfer operators are represented directly in TT and solved with the adaptive-rank DMRG algorithm. Numerical experiments for a selection of nonlinear PDEs including Fisher-KPP, viscous Burgers, sine-Gordon, and KdV cover diffusive, convective, and dispersive dynamics, demonstrating that the multilevel TT approach consistently converges where single-level space-time Newton iterations fail. In dynamic, advection-dominated (nonlinear) scenarios, multilevel TT surpasses single-level TT, achieving high accuracy with significantly reduced computational cost, particularly when high-fidelity numerical simulation is required.

On Resolving Non-Preemptivity in Multitask Scheduling: An Optimal Algorithm in Deterministic and Stochastic Worlds

2026-02-08T01:21:37Z

The efficient scheduling of multi-task jobs across multiprocessor systems has become increasingly critical with the rapid expansion of computational systems. This challenge, known as Multiprocessor Multitask Scheduling (MPMS), is essential for optimizing the performance and scalability of applications in fields such as cloud computing and deep learning. In this paper, we study the MPMS problem under both deterministic and stochastic models, where each job is composed of multiple tasks and can only be completed when all its tasks are finished. We introduce $\mathsf{NP}$-$\mathsf{SRPT}$, a non-preemptive variant of the SRPT algorithm, designed to accommodate scenarios with non-preemptive tasks. Our algorithm achieves a competitive ratio of $\ln α + β + 1$ for minimizing response time, where $α$ represents the ratio of the largest to the smallest job workload, and $β$ captures the ratio of the largest non-preemptive task workload to the smallest job workload. We further establish that this competitive ratio is order-optimal when the number of processors is fixed. For the stochastic $\mathsf{M}$/$\mathsf{G}$/$\mathsf{N}$ system, we prove that $\mathsf{NP}$-$\mathsf{SRPT}$ achieves asymptotically optimal mean response time as the traffic intensity approaches $1$, assuming a task size distribution with finite support. Moreover, the asymptotic optimality extends to task size distributions with infinite support under mild probabilistic assumptions, including the standard $\mathsf{M}$/$\mathsf{M}$/$\mathsf{N}$ model. Finally, we extend the analysis to the setting of unknown job sizes, proving that non-preemptive adaptations of the $\mathsf{M\text{-}Gittins}$ and $\mathsf{M\text{-}SERPT}$ policies achieve asymptotic optimality and near-optimality, respectively, for a broad class of job size distributions. Experimental results validate the effectiveness of $\mathsf{NP}$-$\mathsf{SRPT}$.
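
The competitive-ratio expression above is easy to evaluate for a concrete instance. The sketch below follows the two ratio definitions given in the abstract; the function name and the example workloads are hypothetical.

```python
import math


def np_srpt_ratio_bound(job_workloads, max_nonpreemptive_task):
    """Evaluate ln(alpha) + beta + 1 for a given set of workloads.

    alpha: largest job workload / smallest job workload
    beta:  largest non-preemptive task workload / smallest job workload
    """
    smallest = min(job_workloads)
    alpha = max(job_workloads) / smallest
    beta = max_nonpreemptive_task / smallest
    return math.log(alpha) + beta + 1


# Jobs of total size 2, 8 and 16; the largest indivisible task has size 4.
# Here alpha = 8 and beta = 2, so the bound is ln(8) + 3.
bound = np_srpt_ratio_bound([2.0, 8.0, 16.0], max_nonpreemptive_task=4.0)
```

The example shows how the two terms behave differently: the spread of job sizes enters only logarithmically, while the size of the largest non-preemptive task enters linearly.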

BitLogic: Training Framework for Gradient-Based FPGA-Native Neural Networks

2026-02-07T06:32:44Z

The energy and latency costs of deep neural network inference are increasingly driven by deployment rather than training, motivating hardware-specialized alternatives to arithmetic-heavy models. Field-Programmable Gate Arrays (FPGAs) provide an attractive substrate for such specialization, yet existing FPGA-based neural approaches are fragmented and difficult to compare. We present BitLogic, a fully gradient-based, end-to-end trainable framework for FPGA-native neural networks built around Lookup Table (LUT) computation. BitLogic replaces multiply-accumulate operations with differentiable LUT nodes that map directly to FPGA primitives, enabling native binary computation, sparse connectivity, and efficient hardware realization. The framework offers a modular functional API supporting diverse architectures, along with learned encoders, hardware-aware heads, and multiple boundary-consistent LUT relaxations. An automated Register Transfer Level (RTL) export pipeline translates trained PyTorch models into synthesizable HDL, ensuring equivalence between software and hardware inference. Experiments across standard vision benchmarks and heterogeneous hardware platforms demonstrate competitive accuracy and substantial gains in FPGA efficiency, including 72.3% test accuracy on CIFAR-10 achieved with fewer than 0.3M logic gates, while attaining sub-20 ns single-sample inference using only LUT resources.
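
A LUT node and one possible continuous relaxation of it can be sketched in a few lines. The abstract does not say which relaxations BitLogic uses; the multilinear form below is an assumption, chosen because it agrees with the hard lookup whenever the inputs are exactly 0 or 1. All names are hypothetical.

```python
from itertools import product


def lut_hard(table, bits):
    """Look up a k-input LUT: table has 2**k entries, bits is a tuple of 0/1."""
    index = 0
    for b in bits:
        index = (index << 1) | b   # first input is the most significant bit
    return table[index]


def lut_soft(table, probs):
    """Multilinear relaxation: expected table value when input i is 1 with
    probability probs[i]. It is a polynomial in probs and in the table
    entries, so gradients with respect to both are well defined."""
    total = 0.0
    for bits in product((0, 1), repeat=len(probs)):
        weight = 1.0
        for b, p in zip(bits, probs):
            weight *= p if b else 1.0 - p
        total += weight * lut_hard(table, bits)
    return total


XOR = [0, 1, 1, 0]                  # 2-input truth table
hard = lut_hard(XOR, (1, 0))
soft_corner = lut_soft(XOR, (1.0, 0.0))
soft_mid = lut_soft(XOR, (0.5, 0.5))
```

The cost of the soft evaluation grows as 2**k in the number of inputs, which is tolerable only because LUT fan-in is small.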

Multiserver-job Response Time under Multilevel Scaling

2026-02-06T17:24:47Z

We study the multiserver-job setting in the load-focused multilevel scaling limit, where system load approaches capacity much faster than the growth of the number of servers $n$. We consider the ``1 and $n$'' system, where each job requires either one server or all $n$. Within the multilevel scaling limit, we examine three regimes: load dominated by $n$-server jobs, 1-server jobs, or balanced. In each regime, we characterize the asymptotic growth rate of the boundary of the stability region and the scaled mean queue length. We demonstrate that mean queue length peaks near balanced load via theory, numerics, and simulation.

Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference

2026-02-06T06:18:29Z

Mixture-of-Experts (MoE) has become a practical architecture for scaling LLM capacity while keeping per-token compute modest, but deploying MoE models on a single, memory-limited GPU remains difficult because expert weights dominate the HBM footprint. Existing expert offloading and prefetching systems reduce the resident set, yet they often pay expert-loading costs on the critical path when activation becomes dense. Post-training quantization (PTQ) lowers the footprint without transfers, but prevailing pipelines fix expert bit-widths offline and assume routing remains stable, even though MoE expert utilization is heavy-tailed and the hot set can shift across workloads. We present DynaExq, a runtime-aware mixed-precision serving system that treats single-GPU MoE inference under a hard HBM envelope as an online, budget-constrained precision allocation problem. The key insight is to keep the experts that dominate runtime traffic resident at higher precision, while maintaining a low-precision fallback for the remaining experts, so the system can reduce transfer volume and avoid the waiting latency that limits offloading and prefetching under dense activation. DynaExq estimates long-horizon expert hotness from router traces, selects a per-layer high-precision resident set via a budget-feasible top-$n$ rule, and applies promotions and demotions asynchronously through stable expert handles so the forward pass always executes on a fully materialized expert version. Across Qwen3-MoE-30B/80B and six benchmarks, DynaExq improves accuracy over static PTQ on Qwen3-80B (73.09% to 77.57%) under comparable device-memory budgets and achieves up to 2.73x higher throughput than offloading/prefetch baselines at batch size 32.
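
The per-layer selection step can be sketched as follows. This is not DynaExq's algorithm: the exponential moving average used for hotness, the assumption that each expert is resident in exactly one precision, and all names and sizes are illustrative choices.

```python
def update_hotness(hotness, counts, decay=0.9):
    """Blend one window of per-expert token counts into a running estimate
    (an exponential moving average, assumed here for illustration)."""
    return [decay * h + (1.0 - decay) * c for h, c in zip(hotness, counts)]


def select_high_precision(hotness, hi_bytes, lo_bytes, budget_bytes):
    """Pick the hottest experts to keep at high precision within a budget.

    Every expert is assumed resident at low precision unless promoted, so
    each promotion costs the extra (hi_bytes - lo_bytes). Requires
    hi_bytes > lo_bytes and integer byte counts.
    """
    num_experts = len(hotness)
    base = num_experts * lo_bytes
    if budget_bytes < base:
        raise ValueError("budget cannot hold the low-precision fallback")
    n = min(num_experts, (budget_bytes - base) // (hi_bytes - lo_bytes))
    ranked = sorted(range(num_experts), key=lambda e: hotness[e], reverse=True)
    return set(ranked[:n])


# Four experts in one layer; high precision costs 4 units, low precision 1.
hot = [5.0, 1.0, 9.0, 3.0]
resident_hi = select_high_precision(hot, hi_bytes=4, lo_bytes=1, budget_bytes=10)
smoothed = update_hotness([0.0, 0.0], [10, 0], decay=0.5)
```

With a budget of 10 units, 4 are consumed by the low-precision copies and the remaining 6 cover two promotions, so the two hottest experts are selected.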

End-to-End Throughput Benchmarking of Portable Deterministic CNN-Based Signal Processing Pipelines

2026-02-05T21:51:42Z

This paper presents a benchmarking methodology for evaluating end-to-end performance of deterministic signal-processing pipelines expressed using CNN-compatible primitives. The benchmark targets phased-array workloads such as ultrasound imaging and evaluates complete RF-to-image pipelines under realistic execution conditions. Performance is reported using sustained input throughput (MB/s), effective frame rate (FPS), and, where available, incremental energy per run and peak memory usage. Using this methodology, we benchmark a single deterministic, training-free CNN-based signal-processing pipeline executed unmodified across heterogeneous accelerator platforms, including an NVIDIA RTX 5090 GPU and a Google TPU v5e-1. The results demonstrate how different operator formulations (dynamic indexing, fully CNN-expressed, and sparse-matrix-based) impact performance and portability across architectures. This work is motivated by the need for portable, certifiable signal-processing implementations that avoid hardware-specific refactoring while retaining high performance on modern AI accelerators.