GreenGNN: Energy-Aware Windowed Communication Optimization for Distributed GNN Training

2026-06-01T21:41:57Z

Large-scale graph neural network (GNN) training often requires distributed clusters because graph structure and feature tensors no longer fit in a single node's memory. In sampling-based training, each mini-batch expands into a receptive field that spans partitions and triggers thousands of remote feature fetches per epoch. This wastes energy for two main reasons: each small RPC pays a fixed initiation and protocol cost, and GPUs continue drawing substantial baseline power while waiting for remote features. We present GreenGNN, an energy-aware distributed GNN training system that reduces communication energy by exploiting the bursty, short-lived temporal locality of neighbor sampling. GreenGNN groups training into windows of W consecutive mini-batches, stages each window's hot features in a local cache, and merges remote requests from each partition owner into a small number of bulk transfers. This amortizes RPC overhead across many features while preserving an on-demand path for cache misses. Because window size controls the trade-off between communication amortization and hot-set staleness, GreenGNN selects W offline using a discrete-event simulator that replays a deterministic one-epoch access trace with a hybrid energy model. We implement GreenGNN on DGL and evaluate it on a 4-node GPU cluster with benchmark datasets. Across datasets and batch sizes, GreenGNN reduces total system energy by 27--43% relative to baseline while improving end-to-end throughput by up to 3.9x. GPU energy drops by 36--71%, driven by fewer RPC initiations and lower GPU stall time.
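
The request-merging idea can be illustrated with a minimal sketch. It is not GreenGNN's implementation: the modulo partitioning in `owner_of`, the node ids, and the baseline of one request per remote owner per mini-batch are all assumptions made for illustration.

```python
from collections import defaultdict

def plan_window(batches, owner_of, local_rank):
    """Merge the remote node ids needed by a window of mini-batches into
    one deduplicated bulk request per partition owner."""
    bulk = defaultdict(set)
    for batch in batches:
        for node in batch:
            owner = owner_of(node)
            if owner != local_rank:      # local features need no RPC
                bulk[owner].add(node)
    return {owner: sorted(nodes) for owner, nodes in bulk.items()}

def per_batch_rpcs(batches, owner_of, local_rank):
    """Count requests when every mini-batch contacts each remote owner
    separately (the assumed baseline)."""
    return sum(len(plan_window([b], owner_of, local_rank)) for b in batches)

def owner_of(node):
    return node % 4                      # hypothetical 4-way partitioning

window = [[1, 2, 5, 8], [2, 5, 6, 9], [1, 6, 10, 13]]   # W = 3 mini-batches
plan = plan_window(window, owner_of, local_rank=0)
baseline_rpcs = per_batch_rpcs(window, owner_of, local_rank=0)
print(baseline_rpcs, "requests per batch vs", len(plan), "bulk requests")
```

In this toy window, six per-batch requests collapse into two bulk requests, and features shared across mini-batches (nodes 1, 2, 5, 6) are fetched once.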

Angelfish: Leader, DAG, or Anywhere in Between

2026-06-01T21:37:18Z

To maximize performance, many modern blockchain systems rely on eventually-synchronous, Byzantine fault-tolerant (BFT) consensus protocols. Two protocol designs have emerged in this space: protocols that minimize latency using a leader that drives both data dissemination and consensus, and protocols that maximize throughput using a separate, asynchronous data dissemination layer. Recent protocols such as Partially-Synchronous Bullshark and Sailfish combine elements of both approaches by using a DAG to enable parallel data dissemination and a leader that paces DAG formation. This improves latency while achieving state-of-the-art throughput. Yet the latency of leader-based protocols is still better under moderate loads, which are common in practice. We present Angelfish, a hybrid protocol that adapts smoothly across this design space, from leader-based to Sailfish-like DAG-based consensus. Angelfish lets a dynamically adjusted subset of parties use best-effort broadcast to issue lightweight votes instead of reliably broadcasting costlier DAG vertices. This reduces communication, helps lagging nodes catch up, and lowers latency in practice compared to prior DAG-based protocols. Our empirical evaluation shows that Angelfish attains state-of-the-art peak throughput while significantly lowering latency under moderate throughput, delivering the best of both worlds.

Supervised Distributed Computing: Efficiency and Robustness under a Majority of Adversarial Workers

2026-06-01T21:05:20Z

We consider a recently proposed \emph{supervised distributed computing} paradigm \cite{augustine2025supervised} that extends and refines the standard master-worker paradigm for parallel computations. In this paradigm, there is a supervisor, a source, a target, and a collection of workers. The distributed computation is given as an acyclic task graph that is known to the supervisor. The source initially stores the input and the target is supposed to store the output of the computation. The individual tasks of the computation are supposed to be executed by the workers under the guidance of the supervisor. The source, target and supervisor are assumed to be reliable, while a $β$-fraction of the workers might be adversarial, for some $β\in [0,1)$. This covers, for example, the case where a supervisor has to work with untrusted volunteers. In the standard master-worker approach, the master checks whether the workers correctly execute the assigned tasks, creating a severe bottleneck, whereas in the supervised approach, the supervisor outsources this checking to the workers. Prior to this work, supervised solutions were known only for the case that $β$ is a sufficiently small constant. We show that robust and efficient supervised solutions are possible for \emph{any} constant $β<1$, with the expected work for the honest workers close to a \emph{single} execution per task, provided that there is a lightweight verification mechanism that allows honest workers to check the correctness of task outputs. This is significantly better than all robust master-worker and peer-to-peer approaches known so far.

Leader Election via Unique Sink Orientation

2026-06-01T21:03:32Z

A Locally Checkable Labeling (LCL) is a distributed constraint satisfaction problem defined on a bounded-degree graph that relates a finite set of input labels to a finite set of output labels through a finite set of locally checkable constraints. In this work we define labels and local constraints that encode solutions to two classical problems: leader election and spanning tree construction. It is known that leader election cannot be expressed as an LCL in arbitrary graphs using constant-size labels. In fact, it is known that there does not exist a finite set of labels and local constraints for leader election even for the class of rings. On the other hand, there exists a finite set of labels and local constraints characterizing leader election on trees. In this work, we prove that there exists a finite set of labels and local constraints for leader election also in the much larger class of dismantlable graphs. Our labels need one bit per edge or equivalently $O(Δ)$ bits per node (where $Δ$ is the maximum degree in the graph) and are checkable within the graph induced by the 1-neighborhood of each node. To the best of our knowledge, these are the first local labeling results tailored to dismantlable graphs, potentially highlighting structural properties useful for designing labels and constraints for additional LCL problems. Finally, we present a generic transformation that converts any finite set of labels and local constraints into a silent self-stabilizing algorithm by adding only one extra state, assuming a Gouda fair scheduler. This transformation may be of independent interest.

AURA: Action-Gated Memory for Robot Policies at Constant VRAM

2026-06-01T18:38:21Z

The KV-cache is the right memory for datacenters but the wrong memory for robots. Datacenter inference batches many short requests and resets them, amortizing an attention cache across a crowd. Embodied agents instead run one long, non-resetting episode on bandwidth-limited edge hardware, where high-bandwidth memory and flash are scarce, flash has finite write endurance, and memory writes rather than compute can become the binding constraint. AURA-Mem (Action-Utility Recurrent Adaptive Memory) targets this regime. It wraps a frozen vision-language-action backbone with a constant-size recurrent memory and a learned gate that writes only when the current observation would change the next action: memory that knows when to stay silent. Unlike reconstruction-based memory, the gate is trained directly against a closed-loop action-error signal. Its inference state is fixed at 4,224 bytes regardless of horizon, while a KV-cache grows to 6,061 times larger at 100,000 steps. On a controlled synthetic benchmark, AURA-Mem matches the best O(1) baseline in accuracy while using 5.19-6.13 times fewer writes, and up to 9.19 times fewer writes on easier configurations. Budget-matched random and periodic schedules do not recover this gain, isolating the benefit to the action-surprise signal. On a trained closed-loop OpenVLA-OFT 7B panel on LIBERO-Long (n=60 episodes per arm), the gate does not hurt success: AURA-Mem matches the ungated base policy (0.233) and slightly exceeds an always-write KV arm (0.217), while using 7.0 times fewer writes and constant memory. We also instantiate an approximate-information-state value-loss bound as a methodology demonstration; at this scale, the bound is vacuous rather than a guarantee.

MineDraft: A Framework for Batch Parallel Speculative Decoding

2026-06-01T17:56:47Z

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show that MineDraft significantly improves both throughput (by up to 75%) and end-to-end latency (by up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.

IntraShuffler: A Privacy Preserving Framework for Heterogeneous DP Federated Learning

2026-06-01T17:54:10Z

Heterogeneous Differential Privacy (HDP) in Federated Learning (FL) allows clients to select individual privacy budgets ($\varepsilon_i$) according to institutional policies and data sensitivity. In practice, many HDP-FL systems employ $\varepsilon$-aware server aggregation to improve model utility by re-weighting client updates according to their declared privacy budgets. However, gradient updates in FL retain structural patterns induced by non-independent and identically-distributed (non-IID) data, and these additional signals exposed by $\varepsilon$-aware aggregation create new opportunities for inference by an honest-but-curious server. In this work, we first show that a server equipped with gradient denoising and surrogate modeling can mount a \emph{Privacy Inference Attack} that infers distributional attributes of clients and links updates from the same client across training rounds, measured via surrogate inference accuracy and linkage success, under realistic knowledge constraints. The Shuffle-Model has been widely studied as a defense against such inference risks by anonymizing update sources, but it is fundamentally incompatible with HDP-FL $\varepsilon$-aware aggregation. To address this challenge, we propose \textbf{IntraShuffler}, a middleware defense framework designed for HDP-FL systems. IntraShuffler introduces a privacy-aware shuffling mechanism that groups clients into privacy-compatible buckets and performs parameter-level shuffling within each bucket to disrupt persistent gradient structure while preserving $\varepsilon$-aware aggregation. Experiments across four different datasets show that IntraShuffler reduces gradient recoverability by over 60% and decreases surrogate inference accuracy from 0.78 to 0.33 while maintaining comparable model utility across multiple FL aggregation rules.
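
A minimal sketch of parameter-level shuffling within privacy buckets follows. It is an illustration under stated assumptions, not the authors' mechanism: clients are treated as "privacy-compatible" only when they declare the same $\varepsilon$, and the names `intra_bucket_shuffle` and `bucket_sums` are hypothetical. The point it shows is that permuting each coordinate among the clients of one bucket breaks the link between a client and its full update vector while leaving every per-bucket sum, and hence any aggregate that weights buckets by $\varepsilon$, unchanged.

```python
import random
from collections import defaultdict

def intra_bucket_shuffle(updates, epsilons, rng):
    """Permute each parameter coordinate among the clients that share a bucket."""
    buckets = defaultdict(list)
    for cid, eps in epsilons.items():
        buckets[eps].append(cid)         # assumption: bucket = identical epsilon
    shuffled = {cid: list(vec) for cid, vec in updates.items()}
    for members in buckets.values():
        dim = len(updates[members[0]])
        for j in range(dim):
            column = [updates[cid][j] for cid in members]
            rng.shuffle(column)          # independent permutation per coordinate
            for cid, value in zip(members, column):
                shuffled[cid][j] = value
    return shuffled

def bucket_sums(updates, epsilons):
    """Coordinate-wise sum of the updates in each epsilon bucket."""
    sums = {}
    for cid, vec in updates.items():
        acc = sums.setdefault(epsilons[cid], [0.0] * len(vec))
        for j, value in enumerate(vec):
            acc[j] += value
    return sums

# Toy updates with exactly representable values, so sums compare exactly.
updates = {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0],
           "c": [7.0, 8.0, 9.0], "d": [0.5, 1.5, 2.5]}
epsilons = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 8.0}
shuffled = intra_bucket_shuffle(updates, epsilons, random.Random(0))
print(bucket_sums(shuffled, epsilons) == bucket_sums(updates, epsilons))
```

A client alone in its bucket (here `d`) gains nothing from shuffling, which suggests why bucket sizes matter in practice.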

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

2026-06-01T16:04:51Z

Large language models (LLMs) are increasingly integrated into high-performance computing (HPC) workflows, accelerating scientific discovery through diverse uses such as code generation and domain-specific decision-making. Yet how soft errors propagate and affect LLM inference remains largely unexplored. To bridge this gap, we present a comprehensive study of error propagation in LLM inference, enabled by our proposed LLMFI, a configurable and deterministic fault-injection framework. Using LLMFI, we systematically inject faults across three open-weight LLMs and thirteen representative tasks, covering reasoning, multilingual, mathematical, and coding domains. In addition, we conduct fine-grained case studies that reveal critical vulnerability patterns. Overall, our study yields 17 takeaways that advance the understanding of error propagation in LLM inference and introduces four low-overhead directions to improve reliability through software-only modification, offering practical guidance for future error detection and mitigation.

Strategies for Molecular Dynamics using Hybrid Systems: LAMMPS Use Case

2026-06-01T14:34:28Z

The complexity of biomolecular simulations has substantially increased the demand for High-Performance Computing (HPC) infrastructures, particularly in molecular dynamics and coarse-grained modeling. This work presents a systematic performance and scalability analysis of the LAMMPS simulator for coarse-grained biomolecular simulations, using the antimicrobial peptide Tritrpticin (PDB ID: 1D6X) as the experimental workload. Pure MPI and hybrid MPI+OpenMP executions were evaluated in HPC environments comprising up to 8 compute nodes and 1024 simultaneous cores. Metrics of execution time, speedup, parallel efficiency, statistical variability, and internal time decomposition were investigated. Results showed that pure MPI executions deliver excellent performance in single-node environments but suffer scalability degradation in multi-node executions due to communication overhead and inter-process synchronization. Hybrid MPI+OpenMP configurations proved more efficient at large scale, reducing communication costs and better exploiting the NUMA memory hierarchy. The computational breakdown revealed that communication and electrostatic interaction routines accounted for the largest fraction of execution time at the largest pure-MPI scales. These results reinforce that performance of biomolecular HPC applications depends directly on the balance among parallelization granularity, spatial decomposition, and distributed communication costs. Hybrid MPI+OpenMP strategies represent a more sustainable alternative for coarse-grained biomolecular simulations on modern many-core architectures.
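
For reference, the speedup and parallel-efficiency metrics named above are conventionally computed as in the sketch below. The timings are made-up placeholders, not results from the study, and normalizing efficiency against a multi-core baseline run is an assumption.

```python
def speedup(t_base, t_parallel):
    """Ratio of baseline wall time to the wall time of the larger run."""
    return t_base / t_parallel

def parallel_efficiency(t_base, t_parallel, cores_base, cores_parallel):
    """Speedup divided by the factor by which the core count grew."""
    return speedup(t_base, t_parallel) / (cores_parallel / cores_base)

# Hypothetical timings: 800 s on 128 cores, 160 s on 1024 cores.
s = speedup(800.0, 160.0)
e = parallel_efficiency(800.0, 160.0, cores_base=128, cores_parallel=1024)
print(s, e)
```

With these placeholder numbers the run is 5x faster on 8x the cores, an efficiency of 0.625; the gap to 1.0 is the kind of loss that communication and synchronization overhead produce.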

GRANITE: A Byzantine-Resilient Dynamic Gossip Learning Framework

2026-06-01T14:09:03Z

Gossip Learning (GL) is a decentralized learning paradigm where users iteratively exchange and aggregate models with a small set of neighboring peers. Recent approaches rely on dynamic communication graphs built using Random Peer Sampling (RPS) protocols, which have been proven to accelerate convergence. However, we show that these approaches are vulnerable to a dual attack: Byzantine nodes can poison models and manipulate peer sampling to amplify their influence. We address this combination of threats with GRANITE, a framework for robust learning over sparse, dynamic graphs in the presence of Byzantine nodes. GRANITE accumulates knowledge about encountered node identifiers over time and dynamically adjusts local aggregation thresholds based on the estimated Byzantine density in the neighborhood of each node. We demonstrate that under GRANITE, the Byzantine presence in local neighborhoods decays exponentially. We further derive the robustness conditions of the graphs generated by GRANITE. Empirically, our results indicate that GRANITE converges within 5% of non-Byzantine accuracy under 30% Byzantine nodes, offers faster convergence, and operates on graphs with up to 9x lower communication cost.

EES-CND: Collaborative Neural Decision-Making for Drift-Aware Fault-Tolerant Edge-Cloud Service Placement

2026-06-01T13:48:04Z

The edge-cloud paradigm improves service delivery by orchestrating resources across edge nodes and cloud data centres. These environments consist of heterogeneous, interconnected computing nodes that cooperate to deliver continuous services. However, their scale and complexity increase vulnerability to failures from hardware malfunctions, software defects, and dynamic operating conditions. These failures can disrupt system configurations and service execution, leading to reduced reliability, performance degradation, and violations of service-level objectives. Ensuring service execution requires adaptive service placement strategies across edge-cloud resources. This study introduces a fault-tolerant service placement approach (Enhanced Evolution Strategy for Collaborative Neural Decision-making, EES-CND) for edge-cloud environments. The method employs collaborative decision-making, wherein multiple lightweight neural networks jointly infer redeployment strategies during failure events. To address the system dynamics and mitigate performance drift, adaptive models are updated online using an enhanced evolution strategy. Extensive simulations show that EES-CND effectively handles performance drift and significantly outperforms existing methods in service recovery time, response time, and reliability, achieving a 44.8\% reduction in fault-tolerance cost compared to standalone models.

FTHP-MPI: Towards Providing Replication-based Fault Tolerance in a Fault-Intolerant Native MPI Library

2026-06-01T12:23:42Z

Faults in high-performance systems are expected to be very frequent in the current exascale computing era. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher frequency, resulting in excessive overhead that would not be sustainable for many scientific applications. To improve application efficiency in such high-failure environments, replication of MPI processes was proposed. Replication allows for fast recovery from failures by simply dropping the failed processes and using their replicas to continue the regular operation of the application. In this paper, we implement FTHP-MPI (Fault Tolerance and High Performance MPI), a novel fault-tolerant MPI library that augments checkpoint/restart with replication to provide resilience to failures. The novelty of our work is that it provides fault tolerance on top of a native MPI library that does not itself support fault tolerance. This lets application developers achieve fault tolerance at high failure rates while also using the efficient communication protocols in native MPI libraries, which are generally fine-tuned for specific HPC platforms. We have also implemented efficient parallel communication techniques that involve replicas. Our framework deals with the unique challenges of integrating support for checkpointing and partial replication. We conducted experiments with three applications: HPCG, PIC, and CloverLeaf. We show that, for large-scale systems where failure intervals are expected to be within an hour, our replication-based library achieves higher efficiency and performance than checkpoint-based approaches. We also show that, under failure-free conditions, the additional overheads from replication are negligible in our library.

TAPAAL SMC: Statistical Model Checking of Stochastic Timed-Arc Petri Nets

2026-06-01T10:01:22Z

A Timed-Arc Petri net (TAPN) is a timed extension of the classical Petri net model in which tokens have ages and input arcs are associated with time intervals restricting the ages of tokens available for transition firing. A TAPN can also contain place invariants constraining the ages of tokens in places, inhibitor arcs preventing a transition from firing, and transport arcs that preserve token ages upon firing. This set of features allows us to model complex systems, but it also often makes verification problems computationally hard or even undecidable. Moreover, modelling real-life examples often requires additional stochastic aspects to capture the desired behaviour. We propose the first stochastic semantics for TAPNs and design and implement quantitative and qualitative Statistical Model Checking (SMC) algorithms in the model checker TAPAAL. We argue for the semantic choices we made in the stochastic semantics and prove that the semantics is well-behaved. On a number of case studies, we demonstrate the practical applicability of our modelling formalism and its SMC implementation.

Scaling LLM Inference Beyond Amdahl's Limits via Eliminating Non-Scalable Overheads

2026-06-01T08:58:23Z

Deployers of online LLM services usually seek to maximize cluster-wide performance given a fixed number of GPUs. Tensor parallelism (TP) is necessary to fit modern models but scales sub-linearly as the TP degree t grows, due to cross-GPU communication and non-scalable runtime work, as predicted by Amdahl's Law. Conversely, increasing t improves memory efficiency and alleviates KV-cache contention and swapping. We identify and validate an empirical optimal TP degree t_e that balances these effects. We present Albireo, a parallel inference system that raises the attainable t_e by shrinking the non-scalable portion via overlap of scheduling and I/O with compute and sequence-parallel sampling, without changing model architectures. Across models and benchmarks, Albireo achieves up to 1.9x higher throughput, 48% lower latency, 28% higher GPU utilization, and 54% lower energy than vLLM; in production it yields up to 2x higher throughput.
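
The scaling argument rests on Amdahl's Law, which the short sketch below evaluates. The serial fractions (0.20 and 0.05) are illustrative assumptions, not measurements from Albireo or vLLM.

```python
def amdahl_speedup(serial_fraction, t):
    """Amdahl's Law: speedup at parallel degree t when a fixed fraction
    of the work does not scale."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / t)

# Hypothetical non-scalable fractions before and after overlapping
# scheduling and I/O with compute.
for t in (2, 4, 8):
    before = amdahl_speedup(0.20, t)
    after = amdahl_speedup(0.05, t)
    print(f"t={t}: {before:.2f}x -> {after:.2f}x")
```

Because the speedup is bounded by the reciprocal of the serial fraction, shrinking that fraction lifts the ceiling itself, which is why reducing non-scalable work can make a higher TP degree worthwhile.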

Boosting Multimodal Federated Learning via Chained Modality Optimization

2026-06-01T08:07:09Z

Multimodal Federated Learning (MMFL) enables privacy-preserving collaborative learning across decentralized clients with heterogeneous data and modality availability. However, most existing MMFL methods cast multimodal training as a joint optimization problem, overlooking a key bottleneck: modality competition, where dominant modalities suppress weaker ones and lead to suboptimal global models. To address this, we propose FedMChain, a balanced MMFL framework that structures federated multimodal training as a chain of modality-wise phases. This phase-wise design gives each modality a dedicated local optimization window on multimodal clients to mitigate modality competition, and further promotes cross-modal complementarity via an error-compensated regularizer. On the server side, we employ a sparse sign-guided aggregation strategy that leverages directional sign agreement for robust intra-modality aggregation, avoids destructive averaging, and supports less frequent synchronization to reduce communication overhead. Extensive experiments on multimodal benchmarks demonstrate that FedMChain consistently improves predictive performance while requiring less frequent communication than baselines.
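
One way to picture aggregation by directional sign agreement is the sketch below. It is a generic majority-sign rule written for illustration, not FedMChain's actual aggregator: averaging only the updates that agree with the majority sign, and zeroing coordinates with no majority, are both assumptions.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def sign_guided_aggregate(updates):
    """Per coordinate, average only the client values whose sign matches
    the majority sign; emit 0.0 when there is no majority."""
    dim = len(updates[0])
    result = []
    for j in range(dim):
        column = [u[j] for u in updates]
        majority = sign(sum(sign(v) for v in column))
        agreeing = [v for v in column if majority != 0 and sign(v) == majority]
        result.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return result

# Three toy client updates for one modality.
updates = [[1.0, -2.0, 0.5],
           [3.0, 2.0, -0.5],
           [2.0, -4.0, 0.0]]
aggregated = sign_guided_aggregate(updates)
print(aggregated)
```

On the second coordinate a plain mean would give about -1.33 because the dissenting client cancels part of the signal, whereas the sign-guided rule returns -3.0; the third coordinate has no majority and stays at zero, which keeps the aggregate sparse.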