PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding

2026-05-14T07:49:27Z

Speculative decoding can significantly accelerate LLM inference, especially given that its cloud-edge collaborative deployment offers cloud workload offloading, offline robustness, and privacy enhancement. However, existing collaborative inference frameworks with speculative decoding are constrained by (i) sequential token generation and communication with low resource utilization, and (ii) inflexible cloud non-autoregressive verification (NAV) triggering that induces premature verification or costly rollbacks. In this paper, we propose PipeSD, an efficient cloud-edge collaborative pipeline inference framework with speculative decoding. PipeSD overlaps token generation and communication by a token-batch pipeline scheduling mechanism optimized by dynamic programming, and improves verification flexibility through a dual-threshold NAV triggering mechanism with a lightweight Bayesian optimization autotuner. We implement PipeSD using llama-cpp-python, PyTorch, and FastAPI, and evaluate it on a real-world cloud-edge testbed with two draft-target model pairs across four scenarios. Results show that PipeSD consistently outperforms state-of-the-art baselines, achieving 1.16x-2.16x speedup and reducing energy consumption by 14.3%-25.3%.

ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios

2026-05-14T06:18:54Z

Speculative Decoding promises to accelerate the inference of Large Language Models, yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we introduce ECHO, a high concurrency-oriented framework integrated into SGLang that reformulates speculative execution as a budgeted scheduling problem. Crucially, ECHO employs sparse confidence gating to manage the batch as a unified super-tree, elastically pivoting budget between depth and width to co-optimize the trade-off between reducing global verification steps and maximizing per-step efficiency. Extensive evaluations across diverse model scales-particularly the industrial-grade Qwen3-235B-demonstrate that ECHO consistently outperforms SOTA methods in both low-load and high-load scenarios, achieving up to 5.35x walltime speedup and delivering over 20% relative speedup gain.

Pythia: Exploiting Workflow Predictability for Efficient Agent-Native LLM Serving

2026-05-14T05:22:30Z

As LLM applications grow more complex, developers are increasingly adopting multi-agent architectures to decompose workflows into specialized, collaborative components, introducing structure that constrains agent behavior and exposes useful semantic predictability. Unlike traditional LLM serving, which operates under highly dynamic and uncertain conditions, this structured topology enables opportunities to reduce runtime uncertainty$\unicode{x2015}$yet existing systems fail to exploit it, treating agentic workloads as generic traffic and incurring significant inefficiencies. Our analysis of production traces from an agent-serving platform and an internal coding assistant reveals key bottlenecks, including low prefix cache hit rates, severe resource contention from long-context requests, and substantial queuing delays due to suboptimal scaling. To address these challenges, we propose Pythia, a multi-agent serving system that captures workflow semantics through a simple interface at the serving layer, unlocking new optimization opportunities and substantially improving throughput and job completion time over state-of-the-art baselines.

Fast Gossip-based Rumor Spreading using Small Messages

2026-05-14T04:55:05Z

We study gossip algorithms for the fundamental rumor spreading problem, where the goal is to disseminate a rumor from a given source node to all nodes in an arbitrary (and unknown) graph. Gossip algorithms allow each node to call only one neighbor per round and are therefore highly message-efficient, with low per-node communication overhead per round. The state of the art present fast gossip algorithms, however they typically leverage large-sized messages. This undermines the light-weight communication advantage of gossip, since even though only one neighbor is contacted per round, the message size can be linear in $n$, the network size. Hence, a fundamental question is whether one can perform fast gossip using small messages. The main contribution of this paper is to answer the above question in the affirmative and present two gossip algorithms that achieve fast rumor spreading using messages of polylog{n} size. Specifically, we present the following algorithms: 1. An algorithm that runs in $O(c \log n / Φ_c)$ rounds for every $c \geq 1$, and $Φ_c$ is the weak conductance. Our bound in terms of weak conductance is essentially optimal. 2. An algorithm that depends on the network diameter (and is independent of the graph's conductance), which runs in $\tilde{O}(D+\sqrt{n})$ rounds with high probability. Our algorithm can be modified to output a minimum spanning tree (MST) in the same number of rounds, which is essentially round-optimal (even for non-gossip algorithms). Our gossip algorithms use graph sketches [Ahn, Guha, McGregor, SODA 2012] in a novel way to overcome communication bottlenecks and achieve small communication overhead with small message sizes.

Semi-Synchronous Exploration in Dynamic Graphs

2026-05-14T04:54:34Z

We study the fundamental problem of graph exploration in dynamic graphs using mobile agents. We consider $1$-interval connected dynamic graphs, where the topology may change arbitrarily from round to round as long as the graph remains connected, and edges are assigned with the dynamic port labeling at each round. The execution follows a semi-synchronous scheduler, under which an adversary may deactivate an arbitrary subset of agents in each round. For a graph with $n$ nodes and $k$ agents, we show that exploration is impossible if the adversary can deactivate at least $ \left\lceil \frac{k}{n-2} \right\rceil - 1$ agents per round, even when agents are equipped with unbounded memory, have global communication and full visibility. This yields an upper bound, implying that exploration is solvable only when the adversary deactivates at most $\left\lceil \frac{k}{n-2} \right\rceil - 2$ agents per round. We further establish that achieving exploration at this threshold requires agents to have both $1$-hop visibility and $1$-hop communication. Finally, we present the exploration algorithm using $k$ agents when the adversary deactivates at most $ \left\lceil \frac{k}{n-2} \right\rceil - 2$ agents, assuming agents are equipped with $1$-hop visibility and global communication, and matches the adversarial deactivation bound implied by the impossibility results.

Hierarchical Transformer Preconditioning for Interactive Physics Simulation

2026-05-14T03:47:09Z

Neural preconditioners for real-time physics simulation offer promising data-driven priors, but they often fail to capture long-range couplings efficiently because they inherit local message passing or sparse-operator access patterns. We introduce the Hierarchical Transformer Preconditioner, a neural preconditioner anchored to a weak-admissibility H-matrix partition. The partition provides a multiscale structural prior (dense diagonal leaves plus coarsening off-diagonal tiles) that enables full-graph approximate-inverse computation with O(N) scaling at fixed block sizes. The network models the inverse through low-rank far-field factors and uses highway connections (axial buffers plus a global summary token) to propagate context across transformer depth. At each PCG iteration, preconditioner application reduces to batched dense GEMMs with regular memory access. The key training contribution is a cosine-Hutchinson probe objective that learns the action of MA on convergence-critical spectral subspaces, optimizing angular alignment of MAz with z rather than forcing eigenvalue clusters to a prescribed location. This removes unnecessary spectral-placement constraints from SAI-style objectives and improves conditioning on irregular spectra. Because both inference and apply are dense, dependency-free tensor programs, the full solve loop is captured as a single CUDA Graph. On stiff multiphase Poisson systems (up to 100:1 density contrast, N = 1,024-16,384), the solver runs from ~143 to ~21 fps. At N = 8,192, it reaches 17.9 ms/frame, with 2.2x speedup over GPU Jacobi, ~28x over GPU IC/DILU (AMGX multicolor_dilu), and 2.7x over neural SPAI retrained per scale on the same benchmark.

Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems

2026-05-14T03:34:38Z

Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high-level code for specific GPU targets. Recent work shows that LLM-based multi-agent systems can effectively perform such tuning, often outperforming existing compilers and eliminating the need for manual kernel development. However, the dynamics of multi-agent systems for this task remain unexplored. In this work, we present a logical framework for comparing multi-agent PyTorch optimization systems. Our evaluation shows that exploit-heavy strategies perform best when paired with error-fixing agents, and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88x speedup over PyTorch Eager (1.85x over torch.compile) on an H100 GPU across diverse tasks in KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch. Code is publicly available at: https://github.com/pike-project/pike

Toward Optimal-Complexity Hash-Based Asynchronous MVBA with Optimal Resilience

2026-05-13T22:36:02Z

Multi-valued validated Byzantine agreement (MVBA), a fundamental primitive of distributed computing, allows $n$ processes to agree on a valid $\ell$-bit value, despite $t$ faulty processes behaving maliciously. Among hash-based solutions for the asynchronous setting with adaptive faults, the state-of-the-art HMVBA protocol achieves optimal $O(n^2)$ message complexity, (near-)optimal $O(n \ell + n^2 λ\log n)$ bit complexity, and optimal $O(1)$ time complexity. However, it only tolerates $t < \frac15 n$ failures. In contrast, the best-known optimally-resilient protocol, SQ, incurs a higher bit complexity of $O(n^2 \ell + n^3 λ)$. This poses a fundamental question: Can a hash-based protocol be designed for the asynchronous setting with adaptive faults that simultaneously achieves optimal complexity and optimal resilience? This paper takes a significant step toward answering this question. Namely, we introduce Reducer, an MVBA protocol that retains HMVBA's optimal complexity while improving its resilience to $t < \frac14 n$. Like HMVBA and SQ, Reducer relies exclusively on collision-resistant hash functions. A key innovation in Reducer's design is its internal use of strong multi-valued Byzantine agreement (SMBA), a new variant of Byzantine agreement we introduce and construct, which ensures that the decided value was proposed by a correct process. To further advance resilience toward the optimal one-third bound, we then propose Reducer++, an MVBA protocol that tolerates up to $t < (\frac13 - ε)n$ adaptive failures, for any fixed constant $ε> 0$. Unlike Reducer, Reducer++ does not rely on SMBA. Instead, it employs a novel approach involving hash functions modeled as random oracles to ensure termination. Reducer++ maintains constant time complexity, quadratic message complexity, and quasi-quadratic bit complexity, with constants dependent on $ε$.

Analog computation with transcriptional networks

2026-05-13T21:07:48Z

Transcriptional networks represent one of the most extensively studied types of systems in synthetic biology. Although the completeness of transcriptional networks for digital logic is well-established, *analog* computation plays a crucial role in biological systems and offers significant potential for synthetic biology applications. While transcriptional circuits typically rely on cooperativity and highly non-linear behavior of transcription factors to regulate *production* of proteins, they are often modeled with simple linear *degradation* terms. In contrast, general analog dynamics require both non-linear positive as well as negative terms, seemingly necessitating control over not just transcriptional (i.e., production) regulation but also the degradation rates of transcription factors. Surprisingly, we prove that controlling transcription factor production (i.e., transcription rate) without explicitly controlling degradation is mathematically complete for analog computation, achieving equivalent capabilities to systems where both production and degradation are programmable. We demonstrate our approach on several examples including oscillatory and chaotic dynamics, analog sorting, memory, PID controller, and analog extremum seeking. Our result provides a systematic methodology for engineering novel analog dynamics using synthetic transcriptional networks without the added complexity of degradation control and informs our understanding of the capabilities of natural transcriptional circuits. We provide a compiler, in the form of a Python package that can take any system of polynomial ODEs and convert it to an equivalent transcriptional network implementing the system *exactly*, under appropriate conditions.

Distributed Statistical Zero-Knowledge Proofs via Sumcheck

2026-05-13T18:27:32Z

We study distributed zero-knowledge proofs, introduced by Bick, Kol, and Oshman (SODA 2022). While distributed interactive proofs have advanced rapidly, general-purpose techniques for distributed zero-knowledge remain limited and mostly problem-specific. We address this gap by introducing distributed statistical zero-knowledge, requiring that each node's view be simulatable within negligible statistical distance, and by lifting the classical Sumcheck protocol (Lund, Fortnow, Karloff, and Nisan, FOCS 1990) into a modular primitive for distributed zero-knowledge proofs. Our main contribution is a distributed zero-knowledge implementation of Sumcheck. Given oracle access to a polynomial F over a finite field $\mathbb{F}$ with N variables, we design a protocol verifying claims of the form $\sum_{x\in\mathbb{F}} F(x)=a$ using $O(N)$ rounds of $O(\log |\mathbb{F}|)$-bit messages, while achieving statistical zero-knowledge and small soundness error. We apply this primitive to two problems. For non-k-colorability, we obtain an $O(n)$-round distributed statistical zero-knowledge proof deciding whether a graph is not k-colorable, for any constant k, using $O(log^{1+o(1)} n)$-bit messages. This is the first nontrivial distributed interactive proof for this problem, even without zero-knowledge guarantees. For Subgraph Counting, we obtain an $O(k \log n)$-round, $O(k \log n)$-bit distributed statistical zero-knowledge proof for counting copies of a given k-node pattern, improving previous distributed interactive proofs while additionally providing statistical zero-knowledge. Finally, we show that additional round compression of Sumcheck is problem-dependent: for non-3-colorability on constant-degree graphs, we prove a lower bound excluding $o(n/\log n)$ rounds under polynomial-time local computation.

EMA: Efficient Model Adaptation for Learning-based Systems

2026-05-13T17:26:06Z

Machine learning (ML) is increasingly applied to optimize system performance in tasks such as resource management and network simulation. Unlike traditional ML tasks (e.g., image classification), networked systems often operate in heterogeneous, long-running, and dynamic environment states, where input conditions (e.g., network loads) and operational objectives can shift over time and across settings. Existing learning-based systems offer little support for adaptation, resulting in costly model training, extensive data collection, degraded system performance, and slow responsiveness. This paper presents EMA, the first model adaptation system supporting learning-based systems to adapt to evolving environments with minimal operational overhead. EMA takes a system-driven, data-centric approach that accommodates diverse system and model designs while addressing two key deployment challenges. First, it reduces expensive model training by introducing state transformers that align the input state of a new environment with previously similar states, allowing models to warm-start adaptation. Second, it addresses the often-overlooked yet costly process of data labeling--collecting ground truth for exploring and training on various system decisions--by prioritizing labeling high-utility data while balancing the tradeoff between training and labeling cost. Evaluations on eight representative learning-based systems show that EMA reduces adaptation costs (e.g., GPU training time) by 14.9-42.4% while improving system performance (e.g., network throughput) by 6.9-31.3%.

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

2026-05-13T16:59:08Z

We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models.

Towards the Next Frontier of LLMs, Training on Private Data: A Cross-Domain Benchmark for Federated Fine-Tuning

2026-05-13T16:20:33Z

The recent success of large language models (LLMs) has been largely driven by vast public datasets. However, the next frontier for LLM development lies beyond public data. Much of the world's most valuable information is private, especially in highly regulated sectors such as healthcare and finance, where data include patient histories or customer communications. Unlocking this data could represent a major leap forward, enabling LLMs with deeper domain expertise and stronger real-world utility. Yet, these data cannot be shared because they are distributed across institutions and constrained by privacy, regulatory, and organizational barriers. Moreover, institutional datasets are typically non-independent and identically distributed (non-IID), differing across sites in population characteristics, data modalities, documentation patterns, and task-specific label distributions. In this paper, we demonstrate a practical approach to unlocking private and distributed institutional data for LLM adaptation through federated collaboration across data silos. Built on the Sherpa.ai Federated Learning platform, our framework enables nodes to jointly fine-tune a shared LLM without exchanging private data. We evaluate this approach through a cross-domain benchmark in healthcare and finance, using four closed-ended question answering and classification datasets: MedQA, MedMCQA, FPB, and FiQA-SA. We compare three parameter-efficient fine-tuning (PEFT) strategies-LoRA, QLoRA, and IA3-across pretrained backbones under non-IID settings reflecting institutional data heterogeneity. Our results show that federated fine-tuning performs close to centralized training and outperforms isolated single-institution learning. From a Green AI perspective, QLoRA and IA3 improve efficiency with limited accuracy degradation, supporting federated PEFT as a viable approach for adapting LLMs where data cannot be shared.

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

2026-05-13T16:12:33Z

LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns KV into an explicit payload crossing network and storage boundaries, making KV a dominant end-to-end bottleneck. Existing KV compression are typically static runtime configurations, despite production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency. We present \emph{KVServe}, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving: KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing $50\times$ offline search overhead; and (3) deploys a Service-Aware Online Controller that combines an analytical latency model with a lightweight bandit to select profiles under constraints and correct offline-to-online mismatch. Integrated into vLLM and evaluated across datasets, models, GPUs and networks, KVServe achieves up to $9.13\times$ JCT speedup in PD-separated serving and up to $32.8\times$ TTFT reduction in KV-disaggregated serving.

DisAgg: Distributed Aggregators for Efficient Secure Aggregation in Federated Learning

2026-05-13T15:56:12Z

Federated learning enables collaborative model training across distributed clients, yet vanilla FL exposes client updates to the central server. Secure-aggregation schemes protect privacy against an honest-but-curious server, but existing approaches often suffer from many communication rounds, heavy public-key operations, or difficulty handling client dropouts. Recent methods like One-Shot Private Aggregation (OPA) cut rounds to a single server interaction per FL iteration, yet they impose substantial cryptographic and computational overhead on both server and clients. We propose a new protocol called DisAgg that leverages a small committee of clients called Aggregators to perform the aggregation itself: each client secret-shares its update vector to Aggregators, which locally compute partial sums and return only aggregated shares for server-side reconstruction. This design eliminates local masking and expensive homomorphic encryption, reducing endpoint computation while preserving privacy against a curious server and a limited fraction of colluding clients. By leveraging optimal trade-offs between communication and computation costs, DisAgg processes 100k-dimensional update vectors from 100k 5G clients with a 4.6x speedup compared to OPA, the previous best protocol.