https://arxiv.org/api/sWqz9yVRFMh13zKzpxeGD2thxKQ 2026-04-07T17:55:15Z 27913 285 15 http://arxiv.org/abs/2404.19725v8 CurvFed: Curvature-Aligned Federated Learning for Fairness without Demographics 2026-03-17T23:12:52Z Modern human sensing applications often rely on data distributed across users and devices, where privacy concerns prevent centralized training. Federated Learning (FL) addresses this challenge by enabling collaborative model training without exposing raw data or attributes. However, achieving fairness in such settings remains difficult, as most human sensing datasets lack demographic labels, and FL's privacy guarantees limit the use of sensitive attributes. This paper introduces CurvFed: Curvature Aligned Federated Learning for Fairness without Demographics, a theoretically grounded framework that promotes fairness in FL without requiring any demographic or sensitive attribute information, a concept termed Fairness without Demographics (FWD), by optimizing the underlying loss landscape curvature. Building on the theory that equivalent loss landscape curvature corresponds to consistent model efficacy across sensitive attribute groups, CurvFed regularizes the top eigenvalue of the Fisher Information Matrix (FIM) as an efficient proxy for loss landscape curvature, both within and across clients. This alignment promotes uniform model behavior across diverse bias inducing factors, offering an attribute agnostic route to algorithmic fairness. CurvFed is especially suitable for real world human sensing FL scenarios involving single or multi user edge devices with unknown or multiple bias factors. We validated CurvFed through theoretical and empirical justifications, as well as comprehensive evaluations using three real world datasets and a deployment on a heterogeneous testbed of resource constrained devices. Additionally, we conduct sensitivity analyses on local training data volume, client sampling, communication overhead, resource costs, and runtime performance to demonstrate its feasibility for practical FL edge device deployment. 2024-04-30T17:19:52Z *equal contribution Harshit Sharma Shaily Roy Asif Salekin http://arxiv.org/abs/2603.17168v1 HierarchicalKV: A GPU Hash Table with Cache Semantics for Continuous Online Embedding Storage 2026-03-17T21:59:59Z Traditional GPU hash tables preserve every inserted key -- a dictionary assumption that wastes scarce High Bandwidth Memory (HBM) when embedding tables routinely exceed single-GPU capacity. We challenge this assumption with cache semantics, where policy-driven eviction is a first-class operation. We introduce HierarchicalKV (HKV), the first general-purpose GPU hash table library whose normal full-capacity operating contract is cache-semantic: each full-bucket upsert (update-or-insert) is resolved in place by eviction or admission rejection rather than by rehashing or capacity-induced failure. HKV co-designs four core mechanisms -- cache-line-aligned buckets, in-line score-driven upsert, score-based dynamic dual-bucket selection, and triple-group concurrency -- and uses tiered key-value separation as a scaling enabler beyond HBM. On an NVIDIA H100 NVL GPU, HKV achieves up to 3.9 billion key-value pairs per second (B-KV/s) find throughput, stable across load factors 0.50-1.00 (<5% variation), and delivers 1.4x higher find throughput than WarpCore (the strongest dictionary-semantic GPU baseline at lambda=0.50) and up to 2.6-9.4x over indirection-based GPU baselines. Since its open-source release in October 2022, HKV has been integrated into multiple open-source recommendation frameworks. 2026-03-17T21:59:59Z 15 pages, 12 figures Haidong Rong Jiashu Yao Matthias Langer Shijie Liu Li Fan Dongxin Wang Jia He Jinglin Chen Jiaheng Rang Julian Qian Mengyao Xu Fan Yu Minseok Lee Zehuan Wang Even Oldridge http://arxiv.org/abs/2603.16850v1 Unifying Optimization and Dynamics to Parallelize Sequential Computation: A Guide to Parallel Newton Methods for Breaking Sequential Bottlenecks 2026-03-17T17:55:01Z Massively parallel hardware (GPUs) and long sequence data have made parallel algorithms essential for machine learning at scale. Yet dynamical systems, like recurrent neural networks and Markov chain Monte Carlo, were thought to suffer from sequential bottlenecks. Recent work showed that dynamical systems can in fact be parallelized across the sequence length by reframing their evaluation as a system of nonlinear equations, which can be solved with Newton's method using a parallel associative scan. However, these parallel Newton methods struggled with limitations, primarily inefficiency, instability, and lack of convergence guarantees. This thesis addresses these limitations with methodological and theoretical contributions, drawing particularly from optimization. Methodologically, we develop scalable and stable parallel Newton methods, based on quasi-Newton and trust-region approaches. The quasi-Newton methods are faster and more memory efficient, while the trust-region approaches are significantly more stable. Theoretically, we unify many fixed-point methods into our parallel Newton framework, including Picard and Jacobi iterations. We establish a linear convergence rate for these techniques that depends on the method's approximation accuracy and stability. Moreover, we give a precise condition, rooted in dynamical stability, that characterizes when parallelization provably accelerates a dynamical system and when it cannot. Specifically, the sign of the Largest Lyapunov Exponent of a dynamical system determines whether or not parallel Newton methods converge quickly. In sum, this thesis unlocks scalable and stable methods for parallelizing sequential computation, and provides a firm theoretical basis for when such techniques will and will not work. This thesis also serves as a guide to parallel Newton methods for researchers who want to write the next chapter in this ongoing story. 2026-03-17T17:55:01Z PhD Dissertation; Stanford University Xavier Gonzalez 10.25740/vf943fc9855 http://arxiv.org/abs/2502.20692v3 MonadBFT: Fast, Responsive, Fork-Resistant Streamlined Consensus 2026-03-17T17:21:20Z This paper introduces MonadBFT, a novel Byzantine Fault Tolerant (BFT) consensus protocol that advances both performance and robustness. MonadBFT is implemented as the consensus protocol in the Monad blockchain. As a HotStuff-family protocol, MonadBFT has linear message complexity in the common case and is optimistically responsive, operating as quickly as the network allows. A central feature of MonadBFT is its tail-forking resistance. In pipelined BFT protocols, when a leader goes offline, the previous proposal is abandoned. Malicious leaders can exploit this tail-forking behavior as a form of Maximal Extractable Value (MEV) attack by deliberately discarding their predecessor's block, depriving that proposer of rewards and enabling transaction reordering, censorship or theft. MonadBFT prevents such tail-forking attacks, preserving both fairness and integrity in transaction execution. Another related feature of MonadBFT is its notion of speculative finality, which enables parties to execute ordered transactions after a single round (i.e., a single view), with reverts occurring only in the rare case of provable leader equivocation. This mechanism reduces user-perceived latency. Additionally, we introduce the leader fault isolation property, which ensures that the protocol can quickly recover from a failure. To our knowledge, no prior pipelined, leader-based BFT consensus protocol combines all of these properties in a single design. 2025-02-28T03:50:14Z Mohammad Mussadiq Jalalzai Kushal Babel Jovan Komatovic Tobias Klenze Sourav Das Fatima Elsheimy Mike Setrin John Bergschneider Babak Gilkalaye http://arxiv.org/abs/2603.16812v1 ODIN-Based CPU-GPU Architecture with Replay-Driven Simulation and Emulation 2026-03-17T17:16:41Z Integration of CPU and GPU technologies is a key enabler for modern AI and graphics workloads, combining control-oriented processing with massive parallel compute capability. As systems evolve toward chiplet-based architectures, pre-silicon validation of tightly coupled CPU-GPU subsystems becomes increasingly challenging due to complex validation framework setup, large design scale, high concurrency, non-deterministic execution, and intricate protocol interactions at chiplet boundaries, often resulting in long integration cycles. This paper presents a replay-driven validation methodology developed during the integration of a CPU subsystem, multiple Xe GPU cores, and a configurable Network-on-Chip (NoC) within a foundational SoC building block targeting the ODIN integrated chiplet architecture. By leveraging deterministic waveform capture and replay across both simulation and emulation using a single design database, complex GPU workloads and protocol sequences can be reproduced reliably at the system level. This approach significantly accelerates debug, improves integration confidence, and enables end-to-end system boot and workload execution within a single quarter, demonstrating the effectiveness of replay-based validation as a scalable methodology for chiplet-based systems. 2026-03-17T17:16:41Z Nij Dorairaj Debabrata Chatterjee Hong Wang Hong Jiang Alankar Saxena Altug Koker Thiam Ern Lim Cathrane Teoh Chuan Yin Loo Bishara Shomar Anthony Lester http://arxiv.org/abs/2603.16721v1 Looking for (Genomic) Needles in a Haystack: Sparsity-Driven Search for Identifying Correlated Genetic Mutations in Cancer 2026-03-17T16:09:05Z Cancer typically arises not from a single genetic mutation (i.e., hit) but from multi-hit combinations that accumulate within cells. However, enumerating multi-hit combinations becomes exponentially more expensive computationally as the number of candidate hit gene combinations grow, i.e. on the order of 20,000 choose h, where 20,000 is the number of genes in the human genome and h is the number of hits. To address this challenge, we present an algorithmic framework, called Pruned Depth-First Search (P-DFS) that leverages the high sparsity in tumor mutation data to prune large portions of the search space. Specifically, P-DFS (the main contribution of this paper) - a pruning technique that exploits sparsity to drastically reduce the otherwise exponential h-hit search space for candidate combinations used by Weighted Set Cover - which is grounded in a depth-first search backtracking technique, prunes infeasible gene subsets early, while a weighted set cover formulation systematically scores and selects the most discriminative combinations. By intertwining these ideas with optimized bitwise operations and a scalable distributed algorithm on high-performance computing clusters, our algorithm can achieve approximately 90 - 98% reduction in visited combinations for 4-hits, and roughly a 183x speedup over the exhaustive set cover approach(which is algorithmically NP-complete) measured on 147,456 ranks. In doing so, our method can feasibly handle four-hit and even higher-order gene hits, achieving both speed and resource efficiency. 2026-03-17T16:09:05Z Ritvik Prabhu Emil Vatai Bernard Moussad Emmanuel Jeannot Ramu Anandakrishnan Wu-chun Feng Mohamed Wahib http://arxiv.org/abs/2602.02335v3 Building a Correct-by-Design Lakehouse. Data Contracts, Versioning, and Transactional Pipelines for Humans and Agents 2026-03-17T16:04:54Z Lakehouses are now the default substrate for analytics and AI, but they remain fragile under concurrent, untrusted change: schema mismatches often surface only at runtime, development and production easily diverge, and multi-table pipelines can expose partial results after failure. We present Bauplan, a code-first lakehouse that aims to eliminate a broad class of these failures by construction. Bauplan builds on a storage substrate that already provides atomic single-table snapshot evolution, and adds three pipeline-level correctness mechanisms: typed table contracts to make transformation boundaries checkable, Git-like data versioning to support reproducible collaboration and review, and transactional runs that guarantee atomic publication of an entire pipeline execution. We describe the system design, show how these abstractions fit together into a unified programming model for humans and agents, and report early results from a lightweight Alloy model that both validates key intuitions and exposes subtle counterexamples around transactional branch visibility. Our experience suggests that correctness in the lakehouse is best addressed not by patching failures after the fact, but by restricting the programming model so that many illegal states become unrepresentable. 2026-02-02T16:58:38Z Submission pre-print, data conference Weiming Sheng Jinlang Wang Manuel Barros Aldrin Montana Jacopo Tagliabue Luca Bigon http://arxiv.org/abs/2603.16692v1 Dataflow-Oriented Classification and Performance Analysis of GPU-Accelerated Homomorphic Encryption 2026-03-17T15:49:32Z Fully Homomorphic Encryption (FHE) enables secure computation over encrypted data, but its computational cost remains a major obstacle to practical deployment. To mitigate this overhead, many studies have explored GPU acceleration for the CKKS scheme, which is widely used for approximate arithmetic. In CKKS, CKKS parameters are configured for each workload by balancing multiplicative depth, security requirements, and performance. These parameters significantly affect ciphertext size, thereby determining how the memory footprint fits within the GPU memory hierarchy. Nevertheless, prior studies typically apply their proposed optimization methods uniformly, without considering differences in CKKS parameter configurations. In this work, we demonstrate that the optimal GPU optimization strategy for CKKS depends on the CKKS parameter configuration. We first classify prior optimizations by two aspects of dataflows which affect memory footprint and then conduct both qualitative and quantitative performance analyses. Our analysis shows that even on the same GPU architecture, the optimal strategy varies with CKKS parameters with performance differences of up to 1.98 $\times$ between strategies, and that the criteria for selecting an appropriate strategy differ across GPU architectures. 2026-03-17T15:49:32Z This work has been submitted to the IEEE for possible publication Ai Nozaki Takuya Kojima Hideki Takase Hiroshi Nakamura http://arxiv.org/abs/2603.16624v1 Accelerating the Particle-In-Cell code ECsim with OpenACC 2026-03-17T14:58:51Z The Particle-In-Cell (PIC) method is a computational technique widely used in plasma physics to model plasmas at the kinetic level. In this work, we present our effort to prepare the semi-implicit energy-conserving PIC code ECsim for exascale architectures. To achieve this, we adopted a pragma-based acceleration strategy using OpenACC, which enables high performance while requiring minimal code restructuring. On the pre-exascale Leonardo system, the accelerated code achieves a $5 \times$ speedup and a $3 \times$ reduction in energy consumption compared to the CPU reference code. Performance comparisons across multiple NVIDIA GPU generations show substantial benefits from the GH200 unified memory architecture. Finally, strong and weak scaling tests on Leonardo demonstrate efficiency of $70 \%$ and $78 \%$ up to 64 and 1024 GPUs, respectively. 2026-03-17T14:58:51Z Elisabetta Boella Nitin Shukla Filippo Spiga Mozhgan Kabiri Chimeh Matt Bettencourt Maria Elena Innocenti http://arxiv.org/abs/2603.16514v1 FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism 2026-03-17T13:41:21Z Modern LLM GPU fleets are provisioned for worst-case context lengths that the vast majority of requests never approach, wasting GPU capacity on idle KV-cache slots. We present FleetOpt, a framework that starts from first principles: given a workload's prompt-length CDF and a P99 TTFT target, derive the minimum-cost fleet analytically, then deploy it in practice. The analytical core models each pool as an M/G/c queue and derives that the minimum-cost fleet is a two-pool architecture -- a short-context pool and a long-context pool -- with an optimal boundary B* satisfying an equal marginal GPU cost condition across both pools. The fundamental barrier to achieving B* is the cost cliff: a hard routing step where requests just above B* consume 8x--42x more GPU capacity than requests just below it (depending on the context window ratio), creating a structural disincentive to lower the boundary. Compress-and-Route (C&R) is the implementation mechanism that resolves this barrier. Gateway-layer extractive compression trims borderline requests below B* before the engine ever sees them, converting the hard hardware boundary into a software parameter read from the workload CDF. The two components are unified in the FleetOpt offline planner: given a CDF and SLO, it returns the optimal (n_s*, n_l*, B*, gamma*) in under 1 ms. On three production traces, the combined framework reduces total GPU cost by 6--82% versus a homogeneous fleet, with C&R contributing 1--44 percentage points beyond plain pool routing depending on workload archetype. The analytical model is validated against a discrete-event simulator (inference-fleet-sim) with <= 3% error on predicted GPU utilization across all pools and workloads. 2026-03-17T13:41:21Z Work in progress Huamin Chen Xunzhuo Liu Yuhan Liu Junchen Jiang Bowei He Xue Liu http://arxiv.org/abs/2603.16428v1 An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU 2026-03-17T12:05:17Z Fine-tuning Large Language Models (LLMs) has become essential for domain adaptation, but its memory-intensive property exceeds the capabilities of most GPUs. To address this challenge and democratize LLM fine-tuning, we present SlideFormer, a novel system designed for single-GPU environments. Our innovations are: (1) A lightweight asynchronous engine that treats the GPU as a sliding window and overlaps GPU computation with CPU updates and multi-tier I/O. (2) A highly efficient heterogeneous memory management scheme significantly reduces peak memory usage. (3) Optimized Triton kernels to solve key bottlenecks and integrated advanced I/O. This collaborative design enables fine-tuning of the latest 123B+ models on a single RTX 4090, supporting up to 8x larger batch sizes and 6x larger models. In evaluations, SlideFormer achieves 1.40x to 6.27x higher throughput while roughly halving CPU/GPU memory usage compared to baselines, sustaining >95% peak performance on both NVIDIA and AMD GPUs. 2026-03-17T12:05:17Z 7 pages Ruijia Yang Zeyi Wen http://arxiv.org/abs/2603.16353v1 Biased Compression in Gradient Coding for Distributed Learning 2026-03-17T10:36:02Z Communication bottlenecks and the presence of stragglers pose significant challenges in distributed learning (DL). To deal with these challenges, recent advances leverage unbiased compression functions and gradient coding. However, the significant benefits of biased compression remain largely unexplored. To close this gap, we propose Compressed Gradient Coding with Error Feedback (COCO-EF), a novel DL method that combines gradient coding with biased compression to mitigate straggler effects and reduce communication costs. In each iteration, non-straggler devices encode local gradients from redundantly allocated training data, incorporate prior compression errors, and compress the results using biased compression functions before transmission. The server aggregates these compressed messages from the non-stragglers to approximate the global gradient for model updates. We provide rigorous theoretical convergence guarantees for COCO-EF and validate its superior learning performance over baseline methods through empirical evaluations. As far as we know, we are among the first to rigorously demonstrate that biased compression has substantial benefits in DL, when gradient coding is employed to cope with stragglers. 2026-03-17T10:36:02Z Chengxi Li Ming Xiao Mikael Skoglund http://arxiv.org/abs/2603.28780v1 Byzantine-Robust and Communication-Efficient Distributed Training: Compressive and Cyclic Gradient Coding 2026-03-17T10:22:04Z In this paper, we study the problem of distributed training (DT) under Byzantine attacks with communication constraints. While prior work has developed various robust aggregation rules at the server to enhance robustness to Byzantine attacks, the existing methods suffer from a critical limitation in that the solution error does not diminish when the local gradients sent by different devices vary considerably, as a result of data heterogeneity among the subsets held by different devices. To overcome this limitation, we propose a novel DT method, cyclic gradient coding-based DT (LAD). In LAD, the server allocates the entire training dataset to the devices before training begins. In each iteration, it assigns computational tasks redundantly to the devices using cyclic gradient coding. Each honest device then computes local gradients on a fixed number of data subsets and encodes the local gradients before transmitting to the server. The server aggregates the coded vectors from the honest devices and the potentially incorrect messages from Byzantine devices using a robust aggregation rule. Leveraging the redundancy of computation across devices, the convergence performance of LAD is analytically characterized, demonstrating improved robustness against Byzantine attacks and significantly lower solution error. Furthermore, we extend LAD to a communication-efficient variant, compressive and cyclic gradient coding-based DT (Com-LAD), which further reduces communication overhead under constrained settings. Numerical results validate the effectiveness of the proposed methods in enhancing both Byzantine resilience and communication efficiency. 2026-03-17T10:22:04Z Chengxi Li Youssef Allouah Rachid Guerraoui Mikael Skoglund Ming Xiao http://arxiv.org/abs/2603.13671v2 Grassroots Bonds: A Grassroots Foundation for Market Liquidity 2026-03-17T10:06:06Z Global cryptocurrencies are unbacked and have high transaction cost incurred by global consensus. In contrast, grassroots cryptocurrencies are backed by the goods and services of their issuers -- any person, natural or legal -- and have no transaction cost beyond operating a smartphone. Liquidity in grassroots cryptocurrencies arises from mutual credit via coin exchange among issuers. However, as grassroots coins are redeemable 1-for-1 against any other grassroots coin, the credit-forming exchange must also be 1-for-1, lest prompt redemption after exchange would leave the parties with undue profit or loss. Thus, grassroots coins are incongruent with liquidity through interest-bearing credit. Here we introduce grassroots bonds, which extend grassroots coins with a maturity date, reframing grassroots coins -- cash -- as mature grassroots bonds. Bond redemption generalises coin redemption, allowing the lending of liquid coins in exchange for interest-bearing future-maturity bonds. We show that digital social contracts -- voluntary agreements among persons, specified, fulfilled, and enforced digitally -- can express the full gamut of financial instruments as the voluntary swap of grassroots bonds, including credit lines, loans, sale of debt, forward contracts, options, and escrow-based instruments, and that classical liquidity ratios are applicable just as well to grassroots bonds. Grassroots bonds may thus allow local digital economies to form and grow without initial capital or external credit, harnessing mutual trust within communities into liquidity. The formal specification presented here was used by AI to derive a working implementation of grassroots bonds in GLP, a concurrent logic programming language implemented in Dart for smartphone deployment. The implementation is illustrated by a running multiagent village market scenario, also implemented in GLP by AI. 2026-03-14T00:44:25Z Ehud Shapiro http://arxiv.org/abs/2511.21859v2 Equivalence and Separation between Heard-Of and Asynchronous Message-Passing Models 2026-03-17T09:43:10Z We revisit the relationship between two fundamental models of distributed computation: the asynchronous message-passing model with up to $f$ crash failures ($\operatorname{AMP}_f$) and the Heard-Of model with up to $f$ message omissions ($\operatorname{HO}_f$). We show that for $n > 2f$, the two models are equivalent with respect to the solvability of colorless tasks, and that for colored tasks the equivalence holds only when $f = 1$ (and $n > 2$). The separation for larger $f$ arises from the presence of silenced processes in $\operatorname{HO}_f$, which may lead to incompatible decisions. The proofs proceed through bidirectional simulations between $\operatorname{AMP}_f$ and $\operatorname{HO}_f$ via an intermediate model that captures this notion of silencing. The results extend to randomized protocols against a non-adaptive adversary, indicating that the expressive limits of canonical rounds are structural rather than probabilistic. Together, these results delineate precisely where round-based abstractions capture asynchronous computation, and where they do not. 2025-11-26T19:35:34Z 18 pages; revised arguments in Section 3 and Appendix C, added acknowledgements; accepted at SIROCCO 2026 Hagit Attiya Armando CastaƱeda Dhrubajyoti Ghosh Thomas Nowak