Communication-Efficient Federated Fine-Tuning

2026-05-14T17:58:08Z

Federated Learning (FL) enables the utilization of vast, previously inaccessible data sources. At the same time, pre-trained Language Models (LMs) have taken the world by storm and for good reason. They exhibit remarkable emergent abilities and are readily adapted to downstream tasks. This opens one of the most exciting frontiers in FL: fine-tuning LMs. Yet, a persistent challenge in FL is the frequent, rigid communication of parameters -- a problem magnified by the sheer size of these contemporary models. The FedOpt family of algorithms has become the go-to approach for FL, relying on fixed but arbitrary intervals for model exchanges. Recently, the FDA algorithm prescribed a dynamic approach by monitoring the training progress. However, it introduced a hard-to-calibrate parameter and imposed a rigid synchronization scheme. In this work, we address these limitations by proposing the FDA-Opt family of algorithms -- a unified generalization of both FDA and FedOpt. Our experimental evaluation focuses on fine-tuning LMs on downstream NLP tasks and demonstrates that FDA-Opt outperforms FedOpt even when FDA-Opt is configured with hyper-parameters tuned specifically for FedOpt. In other words, we show that FDA-Opt is a practical, drop-in replacement for FedOpt in modern FL libraries and systems: it requires no additional configuration and delivers superior performance out of the box.

APWA: A Distributed Architecture for Parallelizable Agentic Workflows

2026-05-14T17:40:20Z

Autonomous multi-agent systems based on large language models (LLMs) have demonstrated remarkable abilities in independently solving complex tasks across a wide range of application domains. However, these systems hit critical reasoning, coordination, and computational scaling bottlenecks as the size and complexity of their tasks grow. These limitations hinder multi-agent systems from achieving high-throughput processing for highly parallelizable tasks, despite the availability of parallel computing and reasoning primitives in the underlying LLMs. We introduce the Agent-Parallel Workload Architecture (APWA), a distributed multi-agent system architecture designed for the efficient processing of heavily parallelizable agentic workloads. APWA facilitates parallel execution by decomposing workflows into non-interfering subproblems that can be processed using independent resources without cross-communication. It supports heterogeneous data and parallel processing patterns, and it accommodates tasks from a wide range of domains. In our evaluation, we demonstrate that APWA can dynamically decompose complex queries into parallelizable workflows and scale to larger tasks in settings where prior systems fail completely.

Accelerating State-Vector Quantum Simulation on Integrated GPUs via Cache Locality Optimization: A Cross-Architecture Evaluation

2026-05-14T17:17:02Z

The classical simulation of quantum algorithms is a crucial tool for circuit development, testing, and validation. Although acceleration using GPUs significantly reduces simulation time, most high-performance simulators rely on vendor-specific frameworks that target data-center hardware. To broaden access to quantum simulation, this work proposes a vendor-agnostic approach targeting the integrated GPUs commonly found in consumer-grade laptops. A primary challenge in state-vector simulation is its inherently poor spatial locality, which creates a memory bandwidth bottleneck. Consequently, baseline implementations experience a severe degradation in relative GPU speedup as the number of simulated qubits increases. To address this limitation, we introduce a state partitioning optimization that reorganizes the quantum state vector to maximize the last-level cache locality and minimize costly main memory fetches. We evaluate this strategy using a Quantum Phase Estimation algorithm across diverse architectures from Intel, AMD, and Apple. The experimental results demonstrate that the proposed optimization successfully mitigates performance degradation at larger qubit scales. In particular, for a 28-qubit simulation, the optimization reversed a performance deficit on an Intel Core i5, improving the GPU speedup over the CPU from 0.95x to 1.89x, and increased the Apple M1 Pro speedup from 3.71x to 5.88x. Overall, this approach yields consistent execution time improvements, demonstrating the viability of integrated GPUs for efficient quantum circuit simulation.
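
The locality problem described above can be illustrated with a minimal sketch of how a single-qubit gate updates a state vector. This is the generic textbook update, not the paper's implementation or its state partitioning optimization; the function names and the 16-byte amplitude size (double-precision complex) are assumptions made for illustration.

```python
import math


def apply_single_qubit_gate(state, gate, target):
    """Apply a 2x2 gate to qubit `target` of a state vector, in place.

    Each update touches two amplitudes that sit 2**target entries apart,
    so gates on high-index qubits pair up memory locations far from each other.
    """
    stride = 1 << target
    for base in range(0, len(state), 2 * stride):
        for offset in range(stride):
            i0 = base + offset
            i1 = i0 + stride
            a0, a1 = state[i0], state[i1]
            state[i0] = gate[0][0] * a0 + gate[0][1] * a1
            state[i1] = gate[1][0] * a0 + gate[1][1] * a1
    return state


def pair_distance_bytes(target, bytes_per_amplitude=16):
    """Distance in memory between the two amplitudes updated together."""
    return (1 << target) * bytes_per_amplitude


# Hadamard gate as a plain nested list.
S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]

# 3-qubit register initialised to |000>, Hadamard on the highest qubit.
state = [0.0] * 8
state[0] = 1.0
apply_single_qubit_gate(state, H, target=2)
```

Under these assumptions, a gate on the highest qubit of a 28-qubit register pairs amplitudes `pair_distance_bytes(27)` bytes apart, which is 2 GiB and far larger than any last-level cache. This is the kind of access pattern a locality-oriented reorganization of the state vector has to work around.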

Privacy-preserving Chunk Scheduling in a BitTorrent Implementation of Federated Learning

2026-05-14T15:59:24Z

Traditional federated learning (FL) relies on a central aggregator server, which can create performance bottlenecks and privacy risks. Decentralized mix-and-forward designs remove the server, but repeated local mixing can attenuate global information under heterogeneity and expose peer-to-peer neighborhoods as a privacy attack surface. To preserve FedAvg-style aggregation semantics over updates reconstructable by the round deadline while scaling dissemination, we present FLTorrent, a BitTorrent-based dissemination layer for serverless FL with a short warm-up. Warm-up hardens within-round source unlinkability, a dissemination-layer goal orthogonal to content protections such as DP or secure aggregation, via pre-round obfuscation, randomized lags, and coordination-only non-owner-first scheduling with the tracker off the data path, before switching to vanilla BitTorrent swarming. We upper-bound the per-transfer attribution posterior by the fraction of owner chunks in a sender's eligible cover set, and derive a tighter high-probability bound that improves with early non-owner mass. A simple heuristic, GreedyFastestFirst, attains about 92% of a bandwidth-optimal max-flow upper bound, while warm-up remains a stable about 12% share of a round across 100-500 peers. Under an observation-only local adversary, FLTorrent drives attribution success close to neighborhood-level random guessing for typical nodes, improves with network size, and remains robust under collusion. In LLM-scale dissemination stress tests over 7-10 Gbps access links, FLTorrent adds only about 6-10% round-time overhead relative to BitTorrent-only. Overall, FLTorrent shows that within-round unlinkability and BitTorrent-level efficiency can co-exist with predictable, low overheads at scale.
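
The first attribution bound stated above is a simple ratio, and a small sketch makes it concrete. The representation of a cover set as a list of chunk-owner IDs and the function name are hypothetical; the abstract does not define how eligibility is computed, and the tighter high-probability bound is not reproduced here.

```python
def attribution_posterior_bound(eligible_chunk_owners, sender_id):
    """Fraction of the sender's own chunks in its eligible cover set.

    eligible_chunk_owners: one owner ID per chunk the sender could
    plausibly transmit at this moment (hypothetical representation).
    """
    if not eligible_chunk_owners:
        raise ValueError("cover set must not be empty")
    owned = sum(1 for owner in eligible_chunk_owners if owner == sender_id)
    return owned / len(eligible_chunk_owners)


# A sender "p1" holding 2 of its own chunks and 8 chunks from other peers.
cover_set = ["p1", "p1"] + ["p2"] * 3 + ["p3"] * 5
bound = attribution_posterior_bound(cover_set, "p1")
```

In this toy case the bound is 2/10. It also shows why non-owner-first scheduling helps: the more non-owner chunks a sender holds early, the smaller the ratio becomes.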

Grassroots Federation: Fair Democratic Governance at Scale

2026-05-14T15:56:05Z

We propose a framework for the fair democratic governance of federated digital communities that form and evolve dynamically, where small groups self-govern and larger groups are represented by assemblies selected via sortition. Prior work addressed static fairness conditions; here, we formalize a dynamic setting where federations evolve over time through communities forming, joining, and splitting, in all directions -- bottom-up, top-down, and middle-out -- and adapt the fairness guarantees. The main technical challenge is reconciling integral seat allocations with dynamic, overlapping federations, so that child communities always meet their persistent floors while long-run averages converge to proportional fairness. To overcome this challenge, we introduce a protocol that ensures fair participation and representation both persistently (at all times) and eventually (in the limit after stabilization), extending the static fairness properties to handle structural changes. Prior work shows how grassroots federations can be specified via atomic transactions among assembly members, how Constitutional Consensus can realize these transactions and the democratic processes leading to them, and how Constitutional Governance in Metric Spaces lets a community govern itself and amend its own constitution. Together, these works form a comprehensive design for an egalitarian, fairly governed, large-scale decentralized sovereign digital community platform.

Constitutional Governance in Metric Spaces

2026-05-14T15:12:56Z

Computational social choice and algorithmic decision theory offer rich aggregation theory but no comprehensive process for egalitarian self-governance: aggregation, deliberation, amendment, and consensus are each considered in isolation, with key metric-space aggregators being NP-hard. Here, we propose constitutional governance in metric spaces, integrating these stages into a coherent polynomial-time protocol for constitutional governance. The constitution assigns, per amendable component including itself, a metric space, aggregation rule, and supermajority threshold. Amendments proceed by members voting with their ideal elements, followed by members submitting public proposals carrying supermajority public support under the revealed votes. Public proposals can be sourced from deliberation among members, vote aggregation, or AI mediation. The constitutional rule adopts a supported proposal with positive maximal score, if there is one, else retains the status quo. With Constitutional Consensus, a community can run the constitutional governance protocol on members' personal computing devices (e.g., smartphones), achieving digital sovereignty. We focus on the utility of the generalised median, prove that at majority threshold no misreport weakly dominates sincere voting, and study the compromise gap between best peak and unconstrained optimum. We instantiate the framework to seven canonical settings -- electing officers, setting rates, allocating budgets, ranking priorities, selecting boards, drafting bylaws, and amending the constitution. By unifying metric-space aggregation, reality-aware social choice, supermajority amendment, constitutional consensus, deliberative coalition formation, and AI mediation, this work delivers a comprehensive solution to the constitutional governance of digital communities and organisations.
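
To make the "best peak" idea concrete, here is a minimal sketch of a generalised median restricted to the submitted votes. It assumes that the aggregate cost of a candidate is the sum of its distances to all votes, and it uses an L1 metric on budget vectors as the example space; both choices, and all names, are illustrative assumptions and not the paper's definitions.

```python
def total_distance(candidate, votes, dist):
    """Sum of distances from a candidate element to every vote."""
    return sum(dist(candidate, vote) for vote in votes)


def best_peak(votes, dist):
    """Return the vote (a member's ideal element) with the smallest total distance.

    Searching only among the submitted peaks takes O(n^2) distance
    evaluations, in contrast to optimising over the whole metric space.
    """
    return min(votes, key=lambda candidate: total_distance(candidate, votes, dist))


def l1(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Four members vote on how to split a budget of 100 across three items.
votes = [(50, 30, 20), (40, 40, 20), (60, 20, 20), (40, 30, 30)]
winner = best_peak(votes, l1)
```

Here the total distances are 60, 80, 100, and 80, so `winner` is `(50, 30, 20)`. The difference between this value and the cost of the best element anywhere in the space is one way to read the compromise gap mentioned above.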

Embedded Made Easy -- Rethinking Embedded + Cloud Software Development (WIP)

2026-05-14T14:09:56Z

The process of engineering and deploying applications in the edge/embedded space is massively complicated by the non-homogeneous nature of the software stack and the complexity of diagnostics and debugging. Often, different languages and runtimes are used for different components of the system, forcing designers to make irrevocable decisions about which components run on the periphery and which run in the cloud. Further complications arise when handling and diagnosing failures in the system. Multiple stacks and, often, limited support for debugging complicate the already difficult task of analyzing distributed applications. This paper presents a work-in-progress vision for a unified language and runtime system that allows applications to scale seamlessly across the edge and cloud. Using a single language and runtime, applications can be developed and tested in a single environment, and then deployed to any component of the system -- from resource-limited controllers to large cloud servers. Further, we outline how this retargetable stack can provide integrated diagnostics and debugging tools that allow developers to record and replay distributed events locally for analysis and debugging.

Supervised Distributed Computing: Efficiency and Robustness under a Majority of Adversarial Workers

2026-05-14T12:54:34Z

We consider a recently proposed \emph{supervised distributed computing} paradigm \cite{augustine2025supervised} that extends and refines the standard master-worker paradigm for parallel computations. In this paradigm, there is a supervisor, a source, a target, and a collection of workers. The distributed computation is given as an acyclic task graph that is known to the supervisor. The source initially stores the input and the target is supposed to store the output of the computation. The individual tasks of the computation are supposed to be executed by the workers under the guidance of the supervisor. The source, target and supervisor are assumed to be reliable, while a $β$-fraction of the workers might be adversarial, for some $β\in [0,1)$. This covers, for example, the case where a supervisor has to work with untrusted volunteers. In the standard master-worker approach, the master checks whether the workers correctly execute the assigned tasks, creating a severe bottleneck, whereas in the supervised approach, the supervisor outsources this checking to the workers. Prior to this work, supervised solutions were known only for the case that $β$ is a sufficiently small constant. Assuming a lightweight verification mechanism that allows honest workers to check the correctness of task outputs, we show that robust and efficient supervised solutions are possible for \emph{any} constant $β<1$, with the expected work for the honest workers close to a \emph{single} execution per task. This is significantly better than all robust master-worker and peer-to-peer approaches known so far.

Mat2Boundary: Treating User-Defined Boundary Condition as SpMV for Distributed PDE Solvers on Block-Structured Grids

2026-05-14T12:49:09Z

Boundary-condition (BC) handling is a major source of complexity in PDE solvers on structured and block-structured grids, especially for high-order methods and distributed-memory execution. We present Mat2Boundary, a DSL and compiler for boundary computations that models a broad class of boundary conditions as affine sparse linear operators. This abstraction unifies halo copying, circular and symmetric mappings, zero padding, block-edge synchronization, and user-defined interpolation, while exposing a modular basic sub-matrix interface for declarative composition. To make this representation efficient, Mat2Boundary combines multi-stage programming and polyhedral analysis to generate matrix-free kernels for structured cases, support user-defined sparse matrices for irregular cases, eliminate redundant boundary work, and synthesize reusable communication schedules for distributed execution. Evaluated on two shallow-water equation solvers on cubed-sphere grids and HPCG, Mat2Boundary achieves up to 7.6$\times$ BC-kernel speedup, reduces BC code by over 70%, and scales to 1,344 CPU cores with 72%-88% efficiency.
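
The idea of treating a boundary condition as an affine sparse operator can be sketched on a 1D grid with one ghost cell on each side. The code below is not the Mat2Boundary DSL or its generated kernels; the COO triplet representation and all names are assumptions chosen to show that periodic, symmetric, and zero-padding conditions differ only in the matrix entries and the offset vector.

```python
def apply_boundary(u, entries, offsets=None):
    """Return a copy of u whose ghost cells are set to (A @ u + b).

    entries: (row, col, value) triplets of the sparse matrix A
    offsets: {row: b_row} giving the affine part b
    Rows that appear in neither argument are left unchanged.
    """
    offsets = offsets or {}
    updates = dict(offsets)
    for row, col, value in entries:
        updates[row] = updates.get(row, 0.0) + value * u[col]
    result = list(u)
    for row, new_value in updates.items():
        result[row] = new_value
    return result


# Four interior cells (indices 1..4) with ghost cells at indices 0 and 5.
u = [0.0, 1.0, 2.0, 3.0, 4.0, 0.0]

# Periodic (circular) mapping: each ghost cell copies the opposite interior cell.
periodic = apply_boundary(u, [(0, 4, 1.0), (5, 1, 1.0)])

# Symmetric mapping: each ghost cell mirrors its neighbouring interior cell.
symmetric = apply_boundary(u, [(0, 1, 1.0), (5, 4, 1.0)])

# Zero padding: empty matrix rows and a zero offset for each ghost cell.
zero_padded = apply_boundary(u, [], {0: 0.0, 5: 0.0})
```

A fixed-value condition fits the same shape by putting the value in the offset, for example `apply_boundary(u, [], {0: 7.0})`. An interpolation rule would simply use several weighted entries per ghost row.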

Self-Evolving Distributed Memory Architecture for Scalable AI Systems

2026-05-14T11:25:40Z

Distributed AI systems face critical memory management challenges across computation, communication, and deployment layers. RRAM-based in-memory computing suffers from scalability limitations due to device non-idealities and fixed array sizes. Decentralized AI frameworks struggle with memory efficiency across NAT-constrained networks due to static routing that ignores computational load. Multi-agent deployment systems tightly couple application logic with execution environments, preventing adaptive memory optimization. These challenges stem from a fundamental lack of coordinated memory management across architectural layers. We introduce the Self-Evolving Distributed Memory Architecture for Scalable AI Systems, a three-layer framework that unifies memory management across computation, communication, and deployment. Our approach features (1) memory-guided matrix processing with dynamic partitioning based on device characteristics, (2) memory-aware peer selection considering network topology and computational capacity, and (3) runtime-adaptive deployment optimization through continuous reconfiguration. The framework maintains dual memory systems tracking both long-term performance patterns and short-term workload statistics. Experiments on COCO 2017, ImageNet, and SQuAD show that our method achieves 87.3 percent memory utilization efficiency and 142.5 operations per second compared to Ray Distributed at 72.1 percent and 98.7 operations per second, while reducing communication latency by 30.2 percent to 171.2 milliseconds and improving resource utilization to 82.7 percent. Our contributions include coordinated memory management across three architectural layers, workload-adaptive resource allocation, and a dual-memory architecture enabling dynamic system optimization.

Malleable Molecular Dynamics Simulations with GROMACS and DMR

2026-05-14T10:10:35Z

Static resource allocations in high-performance computing (HPC) lead to inefficiencies for time-varying workloads, causing idle resources, queue delays, and higher node-hour costs. The Dynamic Management of Resources (DMR) middleware enables MPI process malleability in Slurm via a simple API decoupled from scheduler internals. In this work, we integrate DMR into the GROMACS molecular dynamics engine to obtain a malleable variant that can dynamically adapt its MPI process count by combining communication-efficiency-aware reconfiguration with GROMACS' native checkpoint/restart mechanism. We evaluate this design on the MareNostrum 5 supercomputer, comparing dynamic runs against static executions and quantifying reconfiguration overheads, time-to-solution, and node-hour savings for bursty GROMACS workloads.

Multi-objective application placement in fog computing using graph neural network-based reinforcement learning

2026-05-14T10:06:49Z

We propose a framework for multi-objective optimization of application placement in fog computing, based on deep reinforcement learning (DRL). Unlike other optimization techniques, such as integer linear programming or genetic algorithms, a DRL model, once trained, can be applied in real time to similar problem instances. Our model comprises a learning process featuring a graph neural network and two actor-critics, providing a holistic perspective on the priorities of the interconnected services that constitute an application. The learning model incorporates the relationships between services as a crucial factor in placement decisions: services with more dependencies take precedence in location selection. Our experimental investigation involves illustrative cases where we compare our results with baseline strategies and genetic algorithms. We observed a comparable Pareto set with negligible execution times, on the order of milliseconds, in contrast to the hours required by alternative approaches.
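
Since the comparison above is stated in terms of Pareto sets, a short sketch of how a Pareto set is extracted from candidate placements may help. The two objectives used here (latency and cost, both minimised) and the sample values are purely illustrative assumptions; the abstract does not name the objectives.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    All objectives are assumed to be minimised.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_set(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (latency, cost) scores of five candidate placements.
placements = [(10, 5), (8, 7), (12, 4), (9, 9), (11, 6)]
front = pareto_set(placements)
```

In this example `(9, 9)` is dominated by `(8, 7)` and `(11, 6)` by `(10, 5)`, so `front` keeps the other three. Two methods yield comparable Pareto sets when their non-dominated points cover similar trade-offs.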

MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models

2026-05-14T09:06:42Z

Machine unlearning for large language models often faces a privacy dilemma in which strict constraints prohibit sharing either the server's parameters or the client's forget set. To address this dual non-disclosure constraint, we propose MPU, an algorithm-agnostic privacy-preserving Multiple Perturbed Copies Unlearning framework that primarily introduces two server-side modules: Pre-Process for randomized copy generation and Post-Process for update aggregation. In Pre-Process, the server distributes multiple perturbed and reparameterized model instances, allowing the client to execute unlearning locally on its private forget set without accessing the server's exact original parameters. After local unlearning, the server performs Post-Process by inverting the reparameterization and aggregating updates with a harmonic denoising procedure to alleviate the impact of perturbation. Experiments with seven unlearning algorithms show that MPU achieves comparable unlearning performance to noise-free baselines, with most algorithms' average degradation well below 1% up to 10% noise, and can even outperform the noise-free baseline for some algorithms under 1% noise. Code is available at https://github.com/Tristan0318/MPU.

Analysis of wireless network access logs for a hierarchical characterization of user mobility

2026-05-14T08:23:00Z

This paper presents a method that generates a hierarchical user mobility model from the analysis of the data available from Wi-Fi connections. The data obtained from the Wi-Fi infrastructure is defined in terms of the coverage areas of the access points that the users move through. These access points are recursively grouped into different levels of granularity based on their geospatial features. The track of a user is defined as a sequence of Wi-Fi access points, which is enough to simulate user mobility in, for example, fog scenarios. The hierarchical definition of the region under study is proposed to reduce the complexity of the model in large-scale scenarios and to increase the adaptability between scenarios with different geospatial features. The model creation is based on a user profiling method that uses a clustering algorithm, and each user type is defined by a transition matrix between coverage areas and a time length vector for the areas. The method is applied to the case of the campus of the University of the Balearic Islands. From the analysis of the mean square error of the results, we determined that the proposed method obtains good results for the transition matrices, but that the time vector definition should be improved. The results also show lower complexity for the hierarchical model, with one area for each building and three levels, compared with a non-hierarchical model, with only one area and one level for the whole campus.
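
The transition matrix that defines a user type can be estimated from tracks by counting consecutive pairs of areas. The sketch below assumes tracks are already mapped to coverage-area names and omits the clustering and hierarchy steps; the function name and the sample areas are hypothetical.

```python
from collections import defaultdict


def transition_matrix(tracks):
    """Estimate row-normalised transition probabilities between areas.

    tracks: list of sequences of area names visited in order.
    Returns {source_area: {destination_area: probability}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for track in tracks:
        # Each pair of consecutive areas is one observed transition.
        for source, destination in zip(track, track[1:]):
            counts[source][destination] += 1

    matrix = {}
    for source, row in counts.items():
        total = sum(row.values())
        matrix[source] = {dest: n / total for dest, n in row.items()}
    return matrix


# Hypothetical tracks of three users belonging to the same user type.
tracks = [
    ["library", "cafe", "lab"],
    ["library", "lab"],
    ["cafe", "lab", "library"],
]
matrix = transition_matrix(tracks)
```

With these tracks, users leaving `library` go to `cafe` or `lab` with probability 0.5 each. A time length vector could be estimated in the same pass by averaging the dwell time per area, provided the logs carry timestamps.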

DiffPhD: A Unified Differentiable Solver for Projective Heterogeneous Materials in Elastodynamics with Contact-Rich GPU-Acceleration

2026-05-14T08:09:42Z

Differentiable simulation of soft bodies is a foundation for system identification, trajectory optimization, and Real2Sim transfer. Yet, existing methods such as differentiable Projective Dynamics (DiffPD) struggle when faced with heterogeneous materials with extreme stiffness contrasts, hyperelasticity under large deformations, and contact-rich interactions, which are common scenarios in the real world. We present DiffPhD, a unified GPU-accelerated differentiable Projective Dynamics framework for heterogeneous materials that tackles these intertwined challenges simultaneously. Our key insight is a careful integration of: (i) stiffness-aware projective weights to embed heterogeneity into the global system; (ii) trust-region eigenvalue filtering lifted to the backward pass for stable hyperelastic gradients and a type-II Anderson Acceleration scheme with dual-gate convergence to stabilize forward iteration under large stiffness contrasts; and (iii) a unified GPU pipeline that reuses a single sparse factor across forward, backward, and contact computations, with stiffness-amplified Rayleigh damping folded into the same factor for heterogeneity-aware dissipation at zero recurring cost. DiffPhD achieves strict gradient accuracy while delivering up to an order-of-magnitude speedup over prior differentiable solvers on heterogeneous, hyperelastic, contact-rich benchmarks. Crucially, this speedup does not come at the cost of stability: DiffPhD remains convergent on stiffness contrasts up to 100x where prior PD solvers degrade. This unlocks end-to-end gradient-based optimization on regimes previously bottlenecked by either solver fragility or per-iteration cost -- shell-joint composite creatures, soft characters wielding stiff weapons, and soft-gripper robotic manipulation -- all handled within a single forward-backward pass.