http://arxiv.org/abs/2404.19725v8
CurvFed: Curvature-Aligned Federated Learning for Fairness without Demographics
2026-03-17T23:12:52Z
Modern human sensing applications often rely on data distributed across users and devices, where privacy concerns prevent centralized training. Federated Learning (FL) addresses this challenge by enabling collaborative model training without exposing raw data or attributes. However, achieving fairness in such settings remains difficult, as most human sensing datasets lack demographic labels, and FL's privacy guarantees limit the use of sensitive attributes. This paper introduces CurvFed: Curvature Aligned Federated Learning for Fairness without Demographics, a theoretically grounded framework that promotes fairness in FL without requiring any demographic or sensitive attribute information, a concept termed Fairness without Demographics (FWD), by optimizing the underlying loss landscape curvature. Building on the theory that equivalent loss landscape curvature corresponds to consistent model efficacy across sensitive attribute groups, CurvFed regularizes the top eigenvalue of the Fisher Information Matrix (FIM) as an efficient proxy for loss landscape curvature, both within and across clients. This alignment promotes uniform model behavior across diverse bias inducing factors, offering an attribute agnostic route to algorithmic fairness. CurvFed is especially suitable for real world human sensing FL scenarios involving single or multi user edge devices with unknown or multiple bias factors. We validated CurvFed through theoretical and empirical justifications, as well as comprehensive evaluations using three real world datasets and a deployment on a heterogeneous testbed of resource constrained devices.
Additionally, we conduct sensitivity analyses on local training data volume, client sampling, communication overhead, resource costs, and runtime performance to demonstrate its feasibility for practical FL edge device deployment.
2024-04-30T17:19:52Z
*equal contribution
Harshit Sharma, Shaily Roy, Asif Salekin

http://arxiv.org/abs/2603.17168v1
HierarchicalKV: A GPU Hash Table with Cache Semantics for Continuous Online Embedding Storage
2026-03-17T21:59:59Z
Traditional GPU hash tables preserve every inserted key -- a dictionary assumption that wastes scarce High Bandwidth Memory (HBM) when embedding tables routinely exceed single-GPU capacity. We challenge this assumption with cache semantics, where policy-driven eviction is a first-class operation. We introduce HierarchicalKV (HKV), the first general-purpose GPU hash table library whose normal full-capacity operating contract is cache-semantic: each full-bucket upsert (update-or-insert) is resolved in place by eviction or admission rejection rather than by rehashing or capacity-induced failure. HKV co-designs four core mechanisms -- cache-line-aligned buckets, in-line score-driven upsert, score-based dynamic dual-bucket selection, and triple-group concurrency -- and uses tiered key-value separation as a scaling enabler beyond HBM. On an NVIDIA H100 NVL GPU, HKV achieves up to 3.9 billion key-value pairs per second (B-KV/s) find throughput, stable across load factors 0.50-1.00 (<5% variation), and delivers 1.4x higher find throughput than WarpCore (the strongest dictionary-semantic GPU baseline at lambda=0.50) and up to 2.6-9.4x over indirection-based GPU baselines.
Since its open-source release in October 2022, HKV has been integrated into multiple open-source recommendation frameworks.
2026-03-17T21:59:59Z
15 pages, 12 figures
Haidong Rong, Jiashu Yao, Matthias Langer, Shijie Liu, Li Fan, Dongxin Wang, Jia He, Jinglin Chen, Jiaheng Rang, Julian Qian, Mengyao Xu, Fan Yu, Minseok Lee, Zehuan Wang, Even Oldridge

http://arxiv.org/abs/2603.16850v1
Unifying Optimization and Dynamics to Parallelize Sequential Computation: A Guide to Parallel Newton Methods for Breaking Sequential Bottlenecks
2026-03-17T17:55:01Z
Massively parallel hardware (GPUs) and long sequence data have made parallel algorithms essential for machine learning at scale. Yet dynamical systems, like recurrent neural networks and Markov chain Monte Carlo, were thought to suffer from sequential bottlenecks. Recent work showed that dynamical systems can in fact be parallelized across the sequence length by reframing their evaluation as a system of nonlinear equations, which can be solved with Newton's method using a parallel associative scan. However, these parallel Newton methods struggled with limitations, primarily inefficiency, instability, and lack of convergence guarantees. This thesis addresses these limitations with methodological and theoretical contributions, drawing particularly from optimization. Methodologically, we develop scalable and stable parallel Newton methods, based on quasi-Newton and trust-region approaches. The quasi-Newton methods are faster and more memory efficient, while the trust-region approaches are significantly more stable. Theoretically, we unify many fixed-point methods into our parallel Newton framework, including Picard and Jacobi iterations. We establish a linear convergence rate for these techniques that depends on the method's approximation accuracy and stability. Moreover, we give a precise condition, rooted in dynamical stability, that characterizes when parallelization provably accelerates a dynamical system and when it cannot.
Specifically, the sign of the Largest Lyapunov Exponent of a dynamical system determines whether or not parallel Newton methods converge quickly. In sum, this thesis unlocks scalable and stable methods for parallelizing sequential computation, and provides a firm theoretical basis for when such techniques will and will not work. This thesis also serves as a guide to parallel Newton methods for researchers who want to write the next chapter in this ongoing story.
2026-03-17T17:55:01Z
PhD Dissertation; Stanford University
Xavier Gonzalez
10.25740/vf943fc9855

http://arxiv.org/abs/2502.20692v3
MonadBFT: Fast, Responsive, Fork-Resistant Streamlined Consensus
2026-03-17T17:21:20Z
This paper introduces MonadBFT, a novel Byzantine Fault Tolerant (BFT) consensus protocol that advances both performance and robustness. MonadBFT is implemented as the consensus protocol in the Monad blockchain. As a HotStuff-family protocol, MonadBFT has linear message complexity in the common case and is optimistically responsive, operating as quickly as the network allows. A central feature of MonadBFT is its tail-forking resistance. In pipelined BFT protocols, when a leader goes offline, the previous proposal is abandoned. Malicious leaders can exploit this tail-forking behavior as a form of Maximal Extractable Value (MEV) attack by deliberately discarding their predecessor's block, depriving that proposer of rewards and enabling transaction reordering, censorship or theft. MonadBFT prevents such tail-forking attacks, preserving both fairness and integrity in transaction execution. Another related feature of MonadBFT is its notion of speculative finality, which enables parties to execute ordered transactions after a single round (i.e., a single view), with reverts occurring only in the rare case of provable leader equivocation. This mechanism reduces user-perceived latency.
Additionally, we introduce the leader fault isolation property, which ensures that the protocol can quickly recover from a failure. To our knowledge, no prior pipelined, leader-based BFT consensus protocol combines all of these properties in a single design.
2025-02-28T03:50:14Z
Mohammad Mussadiq Jalalzai, Kushal Babel, Jovan Komatovic, Tobias Klenze, Sourav Das, Fatima Elsheimy, Mike Setrin, John Bergschneider, Babak Gilkalaye

http://arxiv.org/abs/2603.16812v1
ODIN-Based CPU-GPU Architecture with Replay-Driven Simulation and Emulation
2026-03-17T17:16:41Z
Integration of CPU and GPU technologies is a key enabler for modern AI and graphics workloads, combining control-oriented processing with massive parallel compute capability. As systems evolve toward chiplet-based architectures, pre-silicon validation of tightly coupled CPU-GPU subsystems becomes increasingly challenging due to complex validation framework setup, large design scale, high concurrency, non-deterministic execution, and intricate protocol interactions at chiplet boundaries, often resulting in long integration cycles. This paper presents a replay-driven validation methodology developed during the integration of a CPU subsystem, multiple Xe GPU cores, and a configurable Network-on-Chip (NoC) within a foundational SoC building block targeting the ODIN integrated chiplet architecture. By leveraging deterministic waveform capture and replay across both simulation and emulation using a single design database, complex GPU workloads and protocol sequences can be reproduced reliably at the system level.
This approach significantly accelerates debug, improves integration confidence, and enables end-to-end system boot and workload execution within a single quarter, demonstrating the effectiveness of replay-based validation as a scalable methodology for chiplet-based systems.
2026-03-17T17:16:41Z
Nij Dorairaj, Debabrata Chatterjee, Hong Wang, Hong Jiang, Alankar Saxena, Altug Koker, Thiam Ern Lim, Cathrane Teoh, Chuan Yin Loo, Bishara Shomar, Anthony Lester

http://arxiv.org/abs/2603.16721v1
Looking for (Genomic) Needles in a Haystack: Sparsity-Driven Search for Identifying Correlated Genetic Mutations in Cancer
2026-03-17T16:09:05Z
Cancer typically arises not from a single genetic mutation (i.e., hit) but from multi-hit combinations that accumulate within cells. However, enumerating multi-hit combinations becomes exponentially more expensive computationally as the number of candidate hit gene combinations grows, i.e., on the order of 20,000 choose h, where 20,000 is the number of genes in the human genome and h is the number of hits. To address this challenge, we present an algorithmic framework, called Pruned Depth-First Search (P-DFS), that leverages the high sparsity in tumor mutation data to prune large portions of the search space. Specifically, P-DFS, the main contribution of this paper, is a pruning technique that exploits sparsity to drastically reduce the otherwise exponential h-hit search space for candidate combinations used by Weighted Set Cover. Grounded in a depth-first search backtracking technique, it prunes infeasible gene subsets early, while a weighted set cover formulation systematically scores and selects the most discriminative combinations.
By intertwining these ideas with optimized bitwise operations and a scalable distributed algorithm on high-performance computing clusters, our algorithm can achieve an approximately 90-98% reduction in visited combinations for 4-hits, and roughly a 183x speedup over the exhaustive set cover approach (which is algorithmically NP-complete) measured on 147,456 ranks. In doing so, our method can feasibly handle four-hit and even higher-order gene hits, achieving both speed and resource efficiency.
2026-03-17T16:09:05Z
Ritvik Prabhu, Emil Vatai, Bernard Moussad, Emmanuel Jeannot, Ramu Anandakrishnan, Wu-chun Feng, Mohamed Wahib

http://arxiv.org/abs/2602.02335v3
Building a Correct-by-Design Lakehouse. Data Contracts, Versioning, and Transactional Pipelines for Humans and Agents
2026-03-17T16:04:54Z
Lakehouses are now the default substrate for analytics and AI, but they remain fragile under concurrent, untrusted change: schema mismatches often surface only at runtime, development and production easily diverge, and multi-table pipelines can expose partial results after failure. We present Bauplan, a code-first lakehouse that aims to eliminate a broad class of these failures by construction. Bauplan builds on a storage substrate that already provides atomic single-table snapshot evolution, and adds three pipeline-level correctness mechanisms: typed table contracts to make transformation boundaries checkable, Git-like data versioning to support reproducible collaboration and review, and transactional runs that guarantee atomic publication of an entire pipeline execution. We describe the system design, show how these abstractions fit together into a unified programming model for humans and agents, and report early results from a lightweight Alloy model that both validates key intuitions and exposes subtle counterexamples around transactional branch visibility.
Our experience suggests that correctness in the lakehouse is best addressed not by patching failures after the fact, but by restricting the programming model so that many illegal states become unrepresentable.
2026-02-02T16:58:38Z
Submission pre-print, data conference
Weiming Sheng, Jinlang Wang, Manuel Barros, Aldrin Montana, Jacopo Tagliabue, Luca Bigon

http://arxiv.org/abs/2603.16692v1
Dataflow-Oriented Classification and Performance Analysis of GPU-Accelerated Homomorphic Encryption
2026-03-17T15:49:32Z
Fully Homomorphic Encryption (FHE) enables secure computation over encrypted data, but its computational cost remains a major obstacle to practical deployment. To mitigate this overhead, many studies have explored GPU acceleration for the CKKS scheme, which is widely used for approximate arithmetic. CKKS parameters are configured for each workload by balancing multiplicative depth, security requirements, and performance. These parameters significantly affect ciphertext size, thereby determining how the memory footprint fits within the GPU memory hierarchy. Nevertheless, prior studies typically apply their proposed optimization methods uniformly, without considering differences in CKKS parameter configurations. In this work, we demonstrate that the optimal GPU optimization strategy for CKKS depends on the CKKS parameter configuration. We first classify prior optimizations by two aspects of dataflows which affect memory footprint and then conduct both qualitative and quantitative performance analyses.
Our analysis shows that even on the same GPU architecture, the optimal strategy varies with CKKS parameters with performance differences of up to $1.98\times$ between strategies, and that the criteria for selecting an appropriate strategy differ across GPU architectures.
2026-03-17T15:49:32Z
This work has been submitted to the IEEE for possible publication
Ai Nozaki, Takuya Kojima, Hideki Takase, Hiroshi Nakamura

http://arxiv.org/abs/2603.16624v1
Accelerating the Particle-In-Cell code ECsim with OpenACC
2026-03-17T14:58:51Z
The Particle-In-Cell (PIC) method is a computational technique widely used in plasma physics to model plasmas at the kinetic level. In this work, we present our effort to prepare the semi-implicit energy-conserving PIC code ECsim for exascale architectures. To achieve this, we adopted a pragma-based acceleration strategy using OpenACC, which enables high performance while requiring minimal code restructuring. On the pre-exascale Leonardo system, the accelerated code achieves a $5 \times$ speedup and a $3 \times$ reduction in energy consumption compared to the CPU reference code. Performance comparisons across multiple NVIDIA GPU generations show substantial benefits from the GH200 unified memory architecture. Finally, strong and weak scaling tests on Leonardo demonstrate efficiency of $70 \%$ and $78 \%$ up to 64 and 1024 GPUs, respectively.
2026-03-17T14:58:51Z
Elisabetta Boella, Nitin Shukla, Filippo Spiga, Mozhgan Kabiri Chimeh, Matt Bettencourt, Maria Elena Innocenti

http://arxiv.org/abs/2603.16514v1
FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism
2026-03-17T13:41:21Z
Modern LLM GPU fleets are provisioned for worst-case context lengths that the vast majority of requests never approach, wasting GPU capacity on idle KV-cache slots.
We present FleetOpt, a framework that starts from first principles: given a workload's prompt-length CDF and a P99 TTFT target, derive the minimum-cost fleet analytically, then deploy it in practice.
The analytical core models each pool as an M/G/c queue and derives that the minimum-cost fleet is a two-pool architecture -- a short-context pool and a long-context pool -- with an optimal boundary B* satisfying an equal marginal GPU cost condition across both pools. The fundamental barrier to achieving B* is the cost cliff: a hard routing step where requests just above B* consume 8x--42x more GPU capacity than requests just below it (depending on the context window ratio), creating a structural disincentive to lower the boundary.
Compress-and-Route (C&R) is the implementation mechanism that resolves this barrier. Gateway-layer extractive compression trims borderline requests below B* before the engine ever sees them, converting the hard hardware boundary into a software parameter read from the workload CDF. The two components are unified in the FleetOpt offline planner: given a CDF and SLO, it returns the optimal (n_s*, n_l*, B*, gamma*) in under 1 ms.
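The routing step described above can be illustrated with a minimal sketch. This is not the FleetOpt implementation: the abstract does not define gamma, so the code assumes gamma >= 1 is the largest over-length ratio the gateway will compress, meaning prompts in (B, gamma * B] are trimmed to B tokens. The names `compress_and_route` and `RoutingDecision` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    pool: str           # "short" or "long" (hypothetical pool labels)
    target_tokens: int  # prompt length the engine would see after the gateway


def compress_and_route(prompt_tokens: int, boundary: int, gamma: float) -> RoutingDecision:
    """Route one request under an assumed reading of Compress-and-Route.

    Assumption (not from the source): gamma >= 1 bounds how far above the
    boundary a prompt may be and still be compressed down to the boundary.
    """
    if prompt_tokens < 0 or boundary <= 0 or gamma < 1.0:
        raise ValueError("expected prompt_tokens >= 0, boundary > 0, gamma >= 1")
    if prompt_tokens <= boundary:
        # Already fits the short-context pool; no compression needed.
        return RoutingDecision("short", prompt_tokens)
    if prompt_tokens <= gamma * boundary:
        # Borderline request: trim to the boundary and keep it in the short pool.
        return RoutingDecision("short", boundary)
    # Too long to compress under this assumption; send to the long-context pool.
    return RoutingDecision("long", prompt_tokens)
```

With a boundary of 4096 tokens and gamma of 1.5, a 5000-token prompt would be trimmed to 4096 and stay in the short pool, while an 8000-token prompt would go to the long pool. Under this reading the boundary is a plain parameter of the gateway rather than a hardware limit.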
On three production traces, the combined framework reduces total GPU cost by 6--82% versus a homogeneous fleet, with C&R contributing 1--44 percentage points beyond plain pool routing depending on workload archetype. The analytical model is validated against a discrete-event simulator (inference-fleet-sim) with <= 3% error on predicted GPU utilization across all pools and workloads.
2026-03-17T13:41:21Z
Work in progress
Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

http://arxiv.org/abs/2603.16428v1
An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU
2026-03-17T12:05:17Z
Fine-tuning Large Language Models (LLMs) has become essential for domain adaptation, but its memory demands exceed the capacity of most GPUs. To address this challenge and democratize LLM fine-tuning, we present SlideFormer, a novel system designed for single-GPU environments. Our innovations are: (1) A lightweight asynchronous engine that treats the GPU as a sliding window and overlaps GPU computation with CPU updates and multi-tier I/O. (2) A highly efficient heterogeneous memory management scheme that significantly reduces peak memory usage. (3) Optimized Triton kernels to solve key bottlenecks and integrated advanced I/O. This collaborative design enables fine-tuning of the latest 123B+ models on a single RTX 4090, supporting up to 8x larger batch sizes and 6x larger models. In evaluations, SlideFormer achieves 1.40x to 6.27x higher throughput while roughly halving CPU/GPU memory usage compared to baselines, sustaining >95% peak performance on both NVIDIA and AMD GPUs.
2026-03-17T12:05:17Z
7 pages
Ruijia Yang, Zeyi Wen

http://arxiv.org/abs/2603.16353v1
Biased Compression in Gradient Coding for Distributed Learning
2026-03-17T10:36:02Z
Communication bottlenecks and the presence of stragglers pose significant challenges in distributed learning (DL). To deal with these challenges, recent advances leverage unbiased compression functions and gradient coding.
However, the significant benefits of biased compression remain largely unexplored. To close this gap, we propose Compressed Gradient Coding with Error Feedback (COCO-EF), a novel DL method that combines gradient coding with biased compression to mitigate straggler effects and reduce communication costs. In each iteration, non-straggler devices encode local gradients from redundantly allocated training data, incorporate prior compression errors, and compress the results using biased compression functions before transmission. The server aggregates these compressed messages from the non-stragglers to approximate the global gradient for model updates. We provide rigorous theoretical convergence guarantees for COCO-EF and validate its superior learning performance over baseline methods through empirical evaluations. As far as we know, we are among the first to rigorously demonstrate that biased compression has substantial benefits in DL, when gradient coding is employed to cope with stragglers.
2026-03-17T10:36:02Z
Chengxi Li, Ming Xiao, Mikael Skoglund

http://arxiv.org/abs/2603.28780v1
Byzantine-Robust and Communication-Efficient Distributed Training: Compressive and Cyclic Gradient Coding
2026-03-17T10:22:04Z
In this paper, we study the problem of distributed training (DT) under Byzantine attacks with communication constraints. While prior work has developed various robust aggregation rules at the server to enhance robustness to Byzantine attacks, the existing methods suffer from a critical limitation in that the solution error does not diminish when the local gradients sent by different devices vary considerably, as a result of data heterogeneity among the subsets held by different devices. To overcome this limitation, we propose a novel DT method, cyclic gradient coding-based DT (LAD). In LAD, the server allocates the entire training dataset to the devices before training begins.
In each iteration, it assigns computational tasks redundantly to the devices using cyclic gradient coding. Each honest device then computes local gradients on a fixed number of data subsets and encodes the local gradients before transmitting to the server. The server aggregates the coded vectors from the honest devices and the potentially incorrect messages from Byzantine devices using a robust aggregation rule. Leveraging the redundancy of computation across devices, the convergence performance of LAD is analytically characterized, demonstrating improved robustness against Byzantine attacks and significantly lower solution error. Furthermore, we extend LAD to a communication-efficient variant, compressive and cyclic gradient coding-based DT (Com-LAD), which further reduces communication overhead under constrained settings. Numerical results validate the effectiveness of the proposed methods in enhancing both Byzantine resilience and communication efficiency.
2026-03-17T10:22:04Z
Chengxi Li, Youssef Allouah, Rachid Guerraoui, Mikael Skoglund, Ming Xiao

http://arxiv.org/abs/2603.13671v2
Grassroots Bonds: A Grassroots Foundation for Market Liquidity
2026-03-17T10:06:06Z
Global cryptocurrencies are unbacked and have high transaction cost incurred by global consensus. In contrast, grassroots cryptocurrencies are backed by the goods and services of their issuers -- any person, natural or legal -- and have no transaction cost beyond operating a smartphone. Liquidity in grassroots cryptocurrencies arises from mutual credit via coin exchange among issuers. However, as grassroots coins are redeemable 1-for-1 against any other grassroots coin, the credit-forming exchange must also be 1-for-1, lest prompt redemption after exchange would leave the parties with undue profit or loss. Thus, grassroots coins are incongruent with liquidity through interest-bearing credit.
Here we introduce grassroots bonds, which extend grassroots coins with a maturity date, reframing grassroots coins -- cash -- as mature grassroots bonds. Bond redemption generalises coin redemption, allowing the lending of liquid coins in exchange for interest-bearing future-maturity bonds. We show that digital social contracts -- voluntary agreements among persons, specified, fulfilled, and enforced digitally -- can express the full gamut of financial instruments as the voluntary swap of grassroots bonds, including credit lines, loans, sale of debt, forward contracts, options, and escrow-based instruments, and that classical liquidity ratios are applicable just as well to grassroots bonds. Grassroots bonds may thus allow local digital economies to form and grow without initial capital or external credit, harnessing mutual trust within communities into liquidity.
The formal specification presented here was used by AI to derive a working implementation of grassroots bonds in GLP, a concurrent logic programming language implemented in Dart for smartphone deployment. The implementation is illustrated by a running multiagent village market scenario, also implemented in GLP by AI.
2026-03-14T00:44:25Z
Ehud Shapiro

http://arxiv.org/abs/2511.21859v2
Equivalence and Separation between Heard-Of and Asynchronous Message-Passing Models
2026-03-17T09:43:10Z
We revisit the relationship between two fundamental models of distributed computation: the asynchronous message-passing model with up to $f$ crash failures ($\operatorname{AMP}_f$) and the Heard-Of model with up to $f$ message omissions ($\operatorname{HO}_f$). We show that for $n > 2f$, the two models are equivalent with respect to the solvability of colorless tasks, and that for colored tasks the equivalence holds only when $f = 1$ (and $n > 2$). The separation for larger $f$ arises from the presence of silenced processes in $\operatorname{HO}_f$, which may lead to incompatible decisions. The proofs proceed through bidirectional simulations between $\operatorname{AMP}_f$ and $\operatorname{HO}_f$ via an intermediate model that captures this notion of silencing. The results extend to randomized protocols against a non-adaptive adversary, indicating that the expressive limits of canonical rounds are structural rather than probabilistic. Together, these results delineate precisely where round-based abstractions capture asynchronous computation, and where they do not.
2025-11-26T19:35:34Z
18 pages; revised arguments in Section 3 and Appendix C, added acknowledgements; accepted at SIROCCO 2026
Hagit Attiya, Armando Castañeda, Dhrubajyoti Ghosh, Thomas Nowak