https://arxiv.org/api/cIW2re+coDFe4tDsHiqrMsIrxCg 2026-06-09T22:26:57Z 28823 45 15 http://arxiv.org/abs/2606.07957v1 Demand-Driven Vulnerability Detection for Cloud Security Posture Management: Removing Human Rule Authoring from the Disclosure-to-Protection Critical Path 2026-06-06T03:26:34Z Cloud Security Posture Management (CSPM) systems detect known vulnerabilities by maintaining a rule set, distributing it to customers, and evaluating it against periodically-collected asset inventories. To our knowledge, in publicly documented architectures the rule set is environment-agnostic and curated centrally by the vendor; updates are batched into release cycles and shipped on a cadence ranging from hours to days depending on detection severity. The disclosure-to-protection window -- from a CVE being published to the customer's system being capable of detecting affected assets -- is therefore bounded by the vendor's release cadence for version-match detections, and by additional human authoring time for richer detections incorporating configuration predicates beyond the affected-software string. We propose an architecture in which the rule set is not vendor-distributed but continuously derived, within the customer's tenant, from the intersection of public catalogue feeds and the live asset graph. A rule comes into existence when a catalogue entry and an applicable asset are simultaneously present, and goes out of existence when either input ceases to support it. Derivation is bidirectional: new catalogue entries and new assets both trigger it. It incorporates the full structured-field content of catalogue entries, not only the affected-software predicate. The live rule set is bounded by environment diversity rather than catalogue breadth. Prior systems incrementally evaluate a static rule set; we incrementally derive the rule set itself. 
We present the threat model, the architecture, formal semantics with an equivalence theorem, complexity analysis, a worked example, and an evaluation methodology. The contribution is the architectural shift and its latency and resource consequences; rule correctness and alert prioritization are out of scope. 2026-06-06T03:26:34Z 13 pages, 3 figures. Preprint. Under review at IEEE Transactions on Cloud Computing Prashant Kumar Pathak http://arxiv.org/abs/2605.26323v3 Totoro$^+$: An Adaptive and Scalable Edge Federated Learning System 2026-06-05T22:00:22Z Federated Learning (FL) is an emerging distributed machine learning (ML) technique that enables in-situ model training and inference on decentralized edge devices. We propose Totoro$^+$, a novel scalable FL system that enables massive FL applications to run simultaneously on edge networks. The key insight is to explore a distributed hash table (DHT)-based peer-to-peer (P2P) model to re-architect the centralized FL system design into a fully decentralized one. In contrast to previous studies where many FL applications shared one centralized parameter server, Totoro$^+$ assigns a dedicated parameter server to each application. Any edge node can act as any application's coordinator, aggregator, client selector, worker (participant device), or any combination of the above, thereby radically improving scalability and adaptivity. Totoro$^+$ introduces three innovations to realize its design: a locality-aware P2P multi-ring structure, a publish/subscribe-based forest abstraction, and a game-theoretic path planning model with a guarantee of an $ε$-approximate Nash equilibrium. 
Real-world experiments on 500 Amazon EC2 servers show that Totoro$^+$ scales gracefully with the number of FL applications and $N$ edge nodes, speeds up the total training time by $1.2\times$ to $14.0\times$, achieves $\mathcal{O}(\log N)$ hops for model dissemination and gradient aggregation with millions of nodes, and adapts efficiently to practical edge networks and churn. 2026-05-25T20:53:05Z Accepted to IEEE Transactions on Parallel and Distributed Systems (TPDS). This version includes the appendix IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 7, pp. 1740-1757, July 2026 Cheng-Wei Ching Xin Chen Taehwan Kim Jian-Jhih Kuo Dilma Da Silva Liting Hu 10.1109/TPDS.2026.3696917 http://arxiv.org/abs/2606.04819v2 The Usefulness Gap in Proof-of-Useful-Work: An Empirical Study of Pearl's cuPOW Protocol 2026-06-05T21:17:13Z Pearl, a Layer-1 blockchain with high-profile AI industry endorsements, markets its Proof-of-Useful-Work (PoUW) protocol as simultaneously securing the network and performing AI inference. We present the first systematic empirical measurement of a deployed PoUW system, finding that Pearl's 24 EH/s network -- representing approximately 320,000 GPU-equivalents consuming an estimated 112 MW -- produces zero useful AI computation. Budget GPU rental prices rose 38% and utilization surged from 57% to 94% following the mining software's public release, displacing legitimate research workloads. 
Our measurements span five dimensions: (1) network composition analysis of 8,012 workers shows all have inference-capable hardware, yet the dominant mining software contains no inference code; (2) the verification protocol accepts random matrices by design, confirmed by 44 pool-accepted shares from our open-source miner across NVIDIA, AMD, CPU, and Apple Silicon hardware; (3) statistical distribution checks are trivially defeated by adversarial Gaussian sampling; (4) mining economics are marginal at current PRL prices ($0.76), with ROI ranging from -1% to +67% depending on GPU tier -- near breakeven for most hardware; and (5) the mining computation is commodity integer arithmetic portable to any hardware platform, offering no vendor lock-in. These findings quantify the verifiability-usefulness tension identified theoretically by Leinweber et al., providing concrete measurements of its magnitude and economic consequences in a deployed system. 2026-06-03T12:42:29Z Abhinaba Basu http://arxiv.org/abs/2606.07846v1 Cost-Aware Speculative Execution for LLM-Agent Workflows: An Integrated Five-Dimension Method 2026-06-05T21:13:47Z LLM-agent workflows chain model calls and tool invocations, and spend most of their wall-clock time waiting on upstream operations before downstream ones can start. Speculative execution can reclaim that idle time by launching a downstream operation with a predicted upstream input, but here each speculation costs real money (per-token billing) and its success probability is hard to estimate and drifts over time. 
This paper presents a method organized around five design decisions: (D1) start a downstream operation before its upstream completes; (D2) price each speculation in real dollars at separate input and output rates; (D3) expose a single operator dial for latency versus cost; (D4) decide via an expected-value rule with a failure-weighted cost term and a preference-adjusted threshold; and (D5) estimate the success probability with a Bayesian Beta-Binomial posterior whose prior is keyed to a dependency-type taxonomy. Variants of these ideas appear in recent work; the combination, with every decision logged in dollars, is what is new. The rule fires only on edges passing an admissibility precondition (side-effect-free, idempotent, or stageable behind a commit barrier), since a wrong speculation is rolled back by re-execution, which refunds tokens but cannot un-send an irreversible side effect. We specify the runtime mechanics, a closed-form result that the rule self-limits as the upstream branching factor grows, a five-stage calibration pipeline (offline replay, shadow, canary, online calibration, drift-triggered kill-switch), and a workload-fit rubric over eight production archetypes. Contrast tables against the four closest published systems (DSP, Speculative Actions v2, Sherlock, B-PASTE) show differentiators on every dimension, and a synthetic validation suite confirms the predicted decision boundary, probability threshold, posterior recovery, and streaming-cancellation behavior. 2026-06-05T21:13:47Z Faisal Fareed http://arxiv.org/abs/2606.07777v1 Large-Scale Regularized Matching on GPU Clusters 2026-06-05T18:41:38Z Production decision systems such as ad allocation or content matching involve millions of users and thousands of items, reducing to large-scale linear programs with sparse block-diagonal structure across users. These LPs are solved repeatedly on recurring cadences over slowly evolving inputs. Three system gaps stand out. 
Scale: production instances routinely exceed the memory capacity of GPU solvers such as cuPDLP and D-PDLP under fixed hardware budgets. Temporal instability: solution variability across runs induces downstream churn and complicates SLAs, yet existing solvers provide no explicit control. Extensibility: CPU-based solvers such as DuaLip-Scala converge slowly and couple problem formulation to fixed schemas, making new constraint families difficult to express. We present a distributed multi-GPU LP solver built natively in PyTorch with systems-algorithm co-design for this structure. It adopts column-sharded parallelism with fused Triton kernels and batched operations to reduce per-iteration overhead. As users grow, only local computation increases, while communication is limited to a reduction of item-level dual variables, yielding near-linear scaling with GPU count at fixed item size. We also adopt ridge-regularized LPs to improve stability, a control absent from existing GPU solvers. A continuation schedule over the regularization parameter balances convergence speed and solution fidelity. Finally, we introduce an operator-centric programming model that replaces DuaLip-Scala's schema-bound interface with composable primitives, enabling new formulations without modifying the solve loop or distributed infrastructure. On synthetic workloads, our system achieves order-of-magnitude wall-clock speedup over DuaLip-Scala, near-linear multi-GPU scaling (3.86x on 4 GPUs), and scales beyond the reach of existing GPU solvers. 2026-06-05T18:41:38Z Aida Rahmattalabi Gregory Dexter Sanjana Garg Qinquan Song Shenyinying Tu Yuan Gao Zhipeng Wang Rahul Mazumder http://arxiv.org/abs/2606.07491v1 Twelve quick tips for designing AI-driven HPC workflows 2026-06-05T17:46:32Z High-performance computing (HPC) clusters remain the backbone of large-scale scientific computation, traditionally executing deterministic, linear pipelines optimised for predictable performance. 
However, the pervasive integration of artificial intelligence (AI) and foundation models into scientific research has introduced a fundamentally new computational paradigm. AI-driven workflows are characteristically iterative, data-driven, and probabilistic, introducing unique challenges regarding data gravity, heterogeneous resource management, and complex workflow orchestration. This guide provides twelve practical tips designed to help researchers design efficient, scalable, and reproducible AI-driven HPC workflows. By addressing critical system-level bottlenecks - such as containerisation for environment portability, strategic deployment of job arrays, explicit feedback loop mechanics, and I/O optimisation for small files - this article offers a framework for transitioning from rigid execution pipelines to adaptive, intelligent computational environments. While these architectural principles are broadly applicable across distributed environments, they are particularly tailored to the resource-intensive throughput demands of modern computational biology. 2026-06-05T17:46:32Z 12 pages, 1 figure. Formatted using the bioRxiv LaTeX preprint style Jamie J. Alnasir http://arxiv.org/abs/2606.01183v2 The World's Fastest Matching Engine Algorithm 2026-06-05T16:06:43Z A single CPU core sustains 32 million order messages per second at sub-microsecond median wire-to-wire latency, up to 11 times faster than the best open-source matching engines on identical hardware. Scaled out, a single 96-core commodity server (~$1,630/month) sustains ~640 million messages per second across 10,000 symbols, over 20 times the provisioned capacity of the U.S. consolidated quote feed. We reach these numbers by attacking the storage layer that sets matching latency. The dominant order-book implementation, linked lists chained through a balanced tree, imposes two costs on every operation: pointer-chased traversal to the insertion point, and root-to-leaf search to locate the target price level. 
Under micro-bursts these costs produce tail-latency spikes that degrade market quality precisely when liquidity is most needed. We present two data-structure contributions that eliminate them. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, with indicators encoding each entry's global priority status. Unlike heaps, which require O(log n) comparisons per operation, the PIN resolves insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. A depth-aware capacity model sizes each PIN so hot entries fit within L1 residency. The second targets a broader inefficiency: balanced search trees search from root to leaf on every insertion and deletion, even when the caller already knows the key's in-order neighbors, which in ordered event streams and electronic trading are available at zero cost. Neighbor-aware insertion and deletion use known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, across red-black, AVL, and B+-tree variants. 2026-05-31T11:51:22Z 20 pages, 5 figures, 7 tables Jake Yoon http://arxiv.org/abs/2602.17834v2 Distributed Triangle and Simplex Enumeration in Hypergraphs 2026-06-05T15:55:35Z In the last decade, subgraph detection and enumeration have emerged as central problems in distributed graph algorithms. This is largely due to the problems' theoretical challenges and practical applications. In this paper, we initiate the systematic study of distributed sub-hypergraph enumeration in hypergraphs. 
To this end, we (1) introduce several computational models for hypergraphs that generalize the CONGEST model for graphs and evaluate their relative computational power, (2) devise algorithms for distributed triangle and simplex enumeration in our computational models and prove their optimality in two such models by showing matching lower bounds, (3) introduce classes of sparse and "everywhere sparse" hypergraphs and describe efficient distributed algorithms for triangle and simplex enumeration in these classes, and (4) describe general techniques that we believe to be useful for designing efficient algorithms in our hypergraph models. 2026-02-19T20:57:26Z Added new results for simplex enumeration. Various improvements for presentation Duncan Adamson Will Rosenbaum Paul G. Spirakis http://arxiv.org/abs/2602.21411v2 General Convex Agreement with Near-Optimal Communication 2026-06-05T14:52:37Z Byzantine Agreement (BA) considers a setting of $n$ parties out of which up to $t$ can be byzantine (malicious), and requires the honest parties to agree on an input subject to a condition called validity: if all honest parties have input $v$, the output agreed upon must be $v$. Convex Agreement (CA) strengthens BA by requiring the output agreed upon to lie in the convex hull of the honest parties' inputs. This validity condition captures aggregation tasks, such as robust learning and sensor fusion, where honest inputs may differ but should still constrain the final decision. Existing protocols for CA over general convexity spaces require at least $O(L \cdot n^2)$ bits of communication for $L$-bit inputs, leaving a gap with BA's $Ω(L \cdot n)$ lower bound. We investigate this gap, and we present deterministic synchronous CA protocols with near-optimal communication complexity in the long-message regime. 
When $L=Ω(n\cdotκ)$, where $κ$ is a security parameter, our protocols use $\mathcal{O}(L\cdot n\log n)$ bits of communication for finite convexity spaces and $\mathcal{O}(L\cdot n^{1+o(1)})$ communication for Euclidean spaces $\mathbb{R}^d$. Our protocols also have asymptotically optimal round complexity $\mathcal{O}(n)$. If an upper bound $L$ on the honest inputs' length in bits is known in advance, we achieve near-optimal resilience $t<n/(ω+\varepsilon)$ for any constant $\varepsilon>0$, where $ω$ is the Helly number of the convexity space. When no such bound is known, we achieve resilience $t<n/(ω+\varepsilon+1)$. As a sample application, we show how our protocols can be used to obtain efficient solutions for parallel instances of BA. Our main technical contribution is the use of extractor graphs to obtain a deterministic assignment of parties to committees, which is robust against adaptive adversaries. 2026-02-24T22:31:20Z Working paper Marc Dufay Diana Ghinea Anton Paramonov http://arxiv.org/abs/2606.07316v1 Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration 2026-06-05T14:35:58Z Byzantine collaboration among large-language-model agents requires a finality-control primitive: given delivered stochastic, structured natural-language proposals, the protocol must decide whether the round supports a commit, what kind of commit, or a typed safe abort. Naive aggregation hides this choice behind a single verdict; classical Byzantine fault tolerance hides it behind byte-identity that LLM proposals do not satisfy. 
We introduce Hierarchical Certified Semantic Commitment (H-CSC), a BFT-inspired protocol that converts embedding-derived finality signals over verdict-conditioned proposal groups into one of three typed outcomes: a semantic_commit (a 2f+1 within-verdict semantic core backs the verdict, emitting a parameter-bound digest over the quantised aggregate), a verdict_commit (strong verdict margin but dispersed semantic rationale, emitting a verdict-level certificate without claiming a semantic aggregate), or an explicit abort with a typed reason. The contribution is typed finality, not raw commit accuracy. On a controlled semantic-poisoning diagnostic (BCS_v1, 120 episodes), H-CSC commits with low angular deviation on BFT-feasible buckets (0.31 to 2.04 degrees) and aborts 100% of beyond-BFT rounds (n<3f+1) as intended. On a real LLM-agent claim-verification benchmark (MVR-50, 50 tasks) under paired static and rushing Byzantine attacks, H-CSC commits 0.90/0.92 with honest-reference-invalid rates of 0.02/0.00, statistically matching a strong certificate-emitting verdict-only baseline. Unlike that baseline, H-CSC also emits an embedding-backed semantic_commit digest on 74%/72% of rounds, supplying typed provenance. A strict-semantic ablation commits only 0.54/0.48, showing the verdict-level fallback is necessary for coverage (+0.36/+0.44) at the same <=0.04 safety floor; a 100-task cross-model check across four LLMs preserves invalid_hmaj within 0.00 to 0.03. 2026-06-05T14:35:58Z 27 pages, 3 figures, 8 tables Haoran Xu Lei Zhang Iadh Ounis Xianbin Wang http://arxiv.org/abs/2606.07248v1 Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends 2026-06-05T13:19:05Z Serial LLM inference backends -- such as Ollama -- process requests one at a time under FCFS admission, causing Head-of-Line Blocking (HOLB) under mixed workloads at high utilisation: short factual queries can be delayed by minutes behind long generation jobs. 
While cloud-scale deployments mitigate HOLB via continuous batching (vLLM, Orca), these solutions require tens of GB of VRAM for concurrent KV-caches -- infeasible for memory-constrained edge and local deployments that rely on serial request dispatch. We present Clairvoyant, a drop-in sidecar proxy for any serial OpenAI-compatible backend (e.g., Ollama, llama.cpp). Clairvoyant predicts response length from 19 lightweight lexical features via an ONNX-exported XGBoost classifier, achieving 0.029 ms per-request latency (four orders of magnitude below typical generation time). Because admission scheduling depends on relative ordering rather than exact prediction, the system optimises ranking fidelity, achieving 62-96% in-distribution and 52-66% cross-distribution accuracy across natural conversation datasets. We find that curated instruction datasets are degenerate training sources for length prediction: GPT-imposed brevity constraints reduce Long-class representation to under 0.02% of examples, making natural conversation logs the only viable training source. End-to-end GPU benchmarks on an RTX 4090 show 70-76% P50 latency reduction for short requests under maximum queue pressure (100 concurrent requests) and 17% under steady-state Poisson arrivals ($ρ=0.74$). Clairvoyant is open-source and requires no modifications to the inference backend. 2026-06-05T13:19:05Z 17 pages, 3 figures, 8 tables. Code: https://github.com/Aravind0403/clairvoyant-scheduler Aravind Sundaresan http://arxiv.org/abs/2605.25645v3 Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines 2026-06-05T11:34:29Z We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU and GPU platforms for large language model adaptation. 
Using LoRA on a Google TPU v5p-8 for training and TPU v6e-8 (Trillium) for inference, we document the full set of code-level adaptations required to port a GPU-native training recipe - built on PyTorch, HuggingFace TRL, and FSDP - to the JAX + Tunix/Qwix stack. These adaptations span mesh configuration, LoRA module naming conventions, sharding annotation corrections, gradient checkpointing, data pipeline restructuring, and a custom Orbax-to-safetensors checkpoint merging procedure. For inference, we detail the vLLM-TPU Docker setup required to serve Gemma 4 on v6e-8 and characterize the resulting latency and throughput profile across a QPS sweep spanning 512 to 16k input tokens. Across both workloads we compare performance and cost against a similarly priced 2xH100 GPU baseline running identical hyperparameters. The TPU completes training 1.61x faster at 2.12x lower cost. For inference, TPU v6e-8 matches GPU at short context (<=2048 tokens) and decisively outperforms at long context: 66% higher throughput and 23.6x faster TTFT at 4096-token inputs (61 ms vs 1,443 ms at QPS=4). Our work closes a critical gap in the open tooling ecosystem and provides practitioners with a recipe for Gemma 4 Dense 31B deployment on TPU infrastructure. 2026-05-25T09:51:59Z Jatin Kishnani Mayank Goel Amit Singh Pulkit Agrawal Sairanjan Mishra http://arxiv.org/abs/2606.07046v1 Predictive Autoscaling in Cloud-Native and Federated Cloud-Edge Computing Environments: A Taxonomy and Future Directions 2026-06-05T08:43:25Z Autoscaling is a key capability in cloud-native systems, where dynamic workloads, heterogeneous environments, and latency-sensitive applications require efficient and adaptive resource management. 
Traditional reactive approaches based on fixed thresholds often respond too late, leading to resource imbalance, performance degradation, and unstable scaling behavior. Recent advances in predictive models, Kubernetes Custom Resource Definitions (CRDs), Monitor-Analyse-Plan-Execute (MAPE) based control loops, and federated learning (FL) have enabled more proactive and autonomous autoscaling strategies. This paper presents a structured review of these developments. It first introduces a taxonomy of autoscaling techniques based on triggers, targets, prediction models, and evaluation metrics. It then examines predictive autoscaling approaches and CRD-based mechanisms, including Kubernetes operators and reconciliation workflows. Further, it analyses autoscaling in federated learning environments, highlighting reactive and proactive strategies alongside privacy-preserving techniques and container-level isolation. The paper also discusses drift-aware and uncertainty-aware autoscaling, incorporating concepts such as the Autoscaling Drift Index (ADI), feedback-driven correction, and stability control for heterogeneous workloads. Finally, it outlines open challenges and future research directions, providing a foundation for next-generation intelligent predictive autoscaling in cloud-edge environments. 2026-06-05T08:43:25Z Bablu Kumar Anshul Verma Rajkumar Buyya http://arxiv.org/abs/2606.07019v1 PCCL: Process Group-Aware Scalable and Generic Collective Algorithm Synthesizer 2026-06-05T08:08:56Z Distributed machine learning has become increasingly important due to the massive scale of large-scale generative models. Both model parameters and data are distributed across many compute devices, which requires frequent collective communications to synchronize activations and parameter updates. Such collective communications have become a major bottleneck. 
While the performance of the collective algorithm depends on the physical network topology, the baseline collective algorithms in collective communication libraries are largely topology-agnostic. Collective algorithm synthesizers address this inefficiency by automatically generating topology-aware collective algorithms. However, prior works have largely overlooked that collective communication typically occurs only among a subset of devices, known as process groups. Additionally, most existing synthesizers are limited in the range of target collective patterns they can generate. We propose PCCL, a scalable and generic framework for synthesizing topology-aware collective algorithms. PCCL is process group-aware and capable of generating near-optimal collective algorithms even when only a subset of devices participates in collective operations. PCCL synthesizes arbitrary collective patterns, including 512-NPU All-to-All synthesis in 11.68 minutes. 2026-06-05T08:08:56Z Contains 11 main pages, 19 figures, three tables, three algorithms William Won Kartik Lakhotia Madhu Kumar Sudarshan Srinivasan Tushar Krishna http://arxiv.org/abs/2606.06996v1 Mission-Level Runtime Assurance Framework for Autonomous Driving 2026-06-05T07:35:10Z This paper studies runtime safety for autonomous driving when high-level driving commands become faulty or unreliable. Unlike conventional runtime-safety approaches that mainly focus on immediate vehicle safety, the proposed framework evaluates both driving safety and whether the vehicle can still successfully complete its mission before a command is executed. The framework extends highway-env with mission-level fault scenarios such as skipping required checkpoints, entering restricted areas, and generating future routes that can no longer complete the mission successfully. A runtime monitoring system is introduced to detect and reject unsafe or mission-infeasible commands before execution. 
For comparison, an adapted Simplex-Drive runtime-safety baseline with learning-based driving control, safety fallback control, and runtime controller switching is implemented using the public Simplex-Drive framework. Experimental results show that platform-level runtime safety alone cannot detect mission-level planning faults, while the proposed framework successfully rejects mission-infeasible commands and improves mission success under randomized fault conditions. 2026-06-05T07:35:10Z Chieh Tsai Salim Hariri