https://arxiv.org/api/cIW2re+coDFe4tDsHiqrMsIrxCg2026-06-09T22:26:57Z288234515http://arxiv.org/abs/2606.07957v1Demand-Driven Vulnerability Detection for Cloud Security Posture Management: Removing Human Rule Authoring from the Disclosure-to-Protection Critical Path2026-06-06T03:26:34ZCloud Security Posture Management (CSPM) systems detect known vulnerabilities by maintaining a rule set, distributing it to customers, and evaluating it against periodically-collected asset inventories. To our knowledge, in publicly documented architectures the rule set is environment-agnostic and curated centrally by the vendor; updates are batched into release cycles and shipped on a cadence ranging from hours to days depending on detection severity. The disclosure-to-protection window -- from a CVE being published to the customer's system being capable of detecting affected assets -- is therefore bounded by the vendor's release cadence for version-match detections, and by additional human authoring time for richer detections incorporating configuration predicates beyond the affected-software string. We propose an architecture in which the rule set is not vendor-distributed but continuously derived, within the customer's tenant, from the intersection of public catalogue feeds and the live asset graph. A rule comes into existence when a catalogue entry and an applicable asset are simultaneously present, and goes out of existence when either input ceases to support it. Derivation is bidirectional: new catalogue entries and new assets both trigger it. It incorporates the full structured-field content of catalogue entries, not only the affected-software predicate. The live rule set is bounded by environment diversity rather than catalogue breadth. Prior systems incrementally evaluate a static rule set; we incrementally derive the rule set itself. We present the threat model, the architecture, formal semantics with an equivalence theorem, complexity analysis, a worked example, and an evaluation methodology. The contribution is the architectural shift and its latency and resource consequences; rule correctness and alert prioritization are out of scope.2026-06-06T03:26:34Z13 pages, 3 figures. Preprint. Under review at IEEE Transactions on Cloud ComputingPrashant Kumar Pathakhttp://arxiv.org/abs/2605.26323v3Totoro$^+$: An Adaptive and Scalable Edge Federated Learning System2026-06-05T22:00:22ZFederated Learning (FL) is an emerging distributed machine learning (ML) technique that enables in-situ model training and inference on decentralized edge devices. We propose Totoro$^+$, a novel scalable FL system that enables massive FL applications to run simultaneously on edge networks. The key insight is to explore a distributed hash table (DHT)-based peer-to-peer (P2P) model to re-architect the centralized FL system design into a fully decentralized one. In contrast to previous studies where many FL applications shared one centralized parameter server, Totoro$^+$ assigns a dedicated parameter server to each application. Any edge node can act as any application's coordinator, aggregator, client selector, worker (participant device), or any combination of the above, thereby radically improving scalability and adaptivity. Totoro$^+$ introduces three innovations to realize its design: a locality-aware P2P multi-ring structure, a publish/subscribe-based forest abstraction, and a game-theoretic path planning model with a guarantee of an $ε$-approximate Nash equilibrium. Real-world experiments on 500 Amazon EC2 servers show that Totoro$^+$ scales gracefully with the number of FL applications and $N$ edge nodes speeds up the total training time by $1.2\times-14.0\times$, achieves $\mathcal{O}(\log N)$ hops for model dissemination and gradient aggregation with millions of nodes, and efficiently adapts to the practical edge networks and churns.2026-05-25T20:53:05ZAccepted to IEEE Transactions on Parallel and Distributed Systems (TPDS). This version includes the appendixIEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 7, pp. 1740-1757, July 2026Cheng-Wei ChingXin ChenTaehwan KimJian-Jhih KuoDilma Da SilvaLiting Hu10.1109/TPDS.2026.3696917http://arxiv.org/abs/2606.04819v2The Usefulness Gap in Proof-of-Useful-Work: An Empirical Study of Pearl's cuPOW Protocol2026-06-05T21:17:13ZPearl, a Layer-1 blockchain with high-profile AI industry endorsements, markets its Proof-of-Useful-Work (PoUW) protocol as simultaneously securing the network and performing AI inference. We present the first systematic empirical measurement of a deployed PoUW system, finding that Pearl's 24 EH/s network -- representing approximately 320,000 GPU-equivalents consuming an estimated 112 MW -- produces zero useful AI computation. Budget GPU rental prices rose 38% and utilization surged from 57% to 94% following the mining software's public release, displacing legitimate research workloads.
Our measurements span five dimensions: (1) network composition analysis of 8,012 workers shows all have inference-capable hardware, yet the dominant mining software contains no inference code; (2) the verification protocol accepts random matrices by design, confirmed by 44 pool-accepted shares from our open-source miner across NVIDIA, AMD, CPU, and Apple Silicon hardware; (3) statistical distribution checks are trivially defeated by adversarial Gaussian sampling; (4) mining economics are marginal at current PRL prices ($0.76), with ROI ranging from -1% to +67% depending on GPU tier -- near breakeven for most hardware; and (5) the mining computation is commodity integer arithmetic portable to any hardware platform, offering no vendor lock-in. These findings quantify the verifiability-usefulness tension identified theoretically by Leinweber et al., providing concrete measurements of its magnitude and economic consequences in a deployed system.2026-06-03T12:42:29ZAbhinaba Basuhttp://arxiv.org/abs/2606.07846v1Cost-Aware Speculative Execution for LLM-Agent Workflows: An Integrated Five-Dimension Method2026-06-05T21:13:47ZLLM-agent workflows chain model calls and tool invocations, and spend most of their wall-clock time waiting on upstream operations before downstream ones can start. Speculative execution can reclaim that idle time by launching a downstream operation with a predicted upstream input, but here each speculation costs real money (per-token billing) and its success probability is hard to estimate and drifts over time. This paper presents a method organized around five design decisions: (D1) start a downstream operation before its upstream completes; (D2) price each speculation in real dollars at separate input and output rates; (D3) expose a single operator dial for latency versus cost; (D4) decide via an expected-value rule with a failure-weighted cost term and a preference-adjusted threshold; and (D5) estimate the success probability with a Bayesian Beta-Binomial posterior whose prior is keyed to a dependency-type taxonomy. Variants of these ideas appear in recent work; the combination, with every decision logged in dollars, is what is new. The rule fires only on edges passing an admissibility precondition (side-effect-free, idempotent, or stageable behind a commit barrier), since a wrong speculation is rolled back by re-execution, which refunds tokens but cannot un-send an irreversible side effect. We specify the runtime mechanics, a closed-form result that the rule self-limits as the upstream branching factor grows, a five-stage calibration pipeline (offline replay, shadow, canary, online calibration, drift-triggered kill-switch), and a workload-fit rubric over eight production archetypes. Contrast tables against the four closest published systems (DSP, Speculative Actions v2, Sherlock, B-PASTE) show differentiators on every dimension, and a synthetic validation suite confirms the predicted decision boundary, probability threshold, posterior recovery, and streaming-cancellation behavior.2026-06-05T21:13:47ZFaisal Fareedhttp://arxiv.org/abs/2606.07777v1Large-Scale Regularized Matching on GPU Clusters2026-06-05T18:41:38ZProduction decision systems such as ad allocation or content matching involve millions of users and thousands of items, reducing to large-scale linear programs with sparse block-diagonal structure across users. These LPs are solved repeatedly on recurring cadences over slowly evolving inputs. Three system gaps stand out. Scale: production instances routinely exceed the memory capacity of GPU solvers such as cuPDLP and D-PDLP under fixed hardware budgets. Temporal instability: solution variability across runs induces downstream churn and complicates SLAs, yet existing solvers provide no explicit control. Extensibility: CPU-based solvers such as DuaLip-Scala converge slowly and couple problem formulation to fixed schemas, making new constraint families difficult to express. We present a distributed multi-GPU LP solver built natively in PyTorch with systems-algorithm co-design for this structure. It adopts column-sharded parallelism with fused Triton kernels and batched operations to reduce per-iteration overhead. As users grow, only local computation increases, while communication is limited to a reduction of item-level dual variables, yielding near-linear scaling with GPU count at fixed item size. We also adopt ridge-regularized LPs to improve stability, a control absent from existing GPU solvers. A continuation schedule over the regularization parameter balances convergence speed and solution fidelity. Finally, we introduce an operator-centric programming model that replaces DuaLip-Scala's schema-bound interface with composable primitives, enabling new formulations without modifying the solve loop or distributed infrastructure. On synthetic workloads, our system achieves order-of-magnitude wall-clock speedup over DuaLip-Scala, near-linear multi-GPU scaling (3.86x on 4 GPUs), and scales beyond the reach of existing GPU solvers.2026-06-05T18:41:38ZAida RahmattalabiGregory DexterSanjana GargQinquan SongShenyinying TuYuan GaoZhipeng WangRahul Mazumderhttp://arxiv.org/abs/2606.07491v1Twelve quick tips for designing AI-driven HPC workflows2026-06-05T17:46:32ZHigh-performance computing (HPC) clusters remain the backbone of large-scale scientific computation, traditionally executing deterministic, linear pipelines optimised for predictable performance. However, the pervasive integration of artificial intelligence (AI) and foundation models into scientific research has introduced a fundamentally new computational paradigm. AI-driven workflows are characteristically iterative, data-driven, and probabilistic, introducing unique challenges regarding data gravity, heterogeneous resource management, and complex workflow orchestration.
This guide provides twelve practical tips designed to help researchers design efficient, scalable, and reproducible AI-driven HPC workflows. By addressing critical system-level bottlenecks - such as containerisation for environment portability, strategic deployment of job arrays, explicit feedback loop mechanics, and I/O optimisation for small files - this article offers a framework for transitioning from rigid execution pipelines to adaptive, intelligent computational environments. While these architectural principles are broadly applicable across distributed environments, they are particularly tailored to the resource-intensive throughput demands of modern computational biology.2026-06-05T17:46:32Z12 pages, 1 figure. Formatted using the bioRxiv LaTeX preprint styleJamie J. Alnasirhttp://arxiv.org/abs/2606.01183v2The World's Fastest Matching Engine Algorithm2026-06-05T16:06:43ZA single CPU core sustains 32 million order messages per second at sub-microsecond median wire-to-wire latency, up to 11 times faster than the best open-source matching engines on identical hardware. Scaled out, a single 96-core commodity server (~$1,630/month) sustains ~640 million messages per second across 10,000 symbols, over 20 times the provisioned capacity of the U.S. consolidated quote feed. We reach these numbers by attacking the storage layer that sets matching latency. The dominant order-book implementation, linked lists chained through a balanced tree, imposes two costs on every operation: pointer-chased traversal to the insertion point, and root-to-leaf search to locate the target price level. Under micro-bursts these costs produce tail-latency spikes that degrade market quality precisely when liquidity is most needed. We present two data-structure contributions that eliminate them. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, with indicators encoding each entry's global priority status. Unlike heaps, which require O(log n) comparisons per operation, the PIN resolves insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. A depth-aware capacity model sizes each PIN so hot entries fit within L1 residency. The second targets a broader inefficiency: balanced search trees search from root to leaf on every insertion and deletion, even when the caller already knows the key's in-order neighbors, which in ordered event streams and electronic trading are available at zero cost. Neighbor-aware insertion and deletion use known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, across red-black, AVL, and B+-tree variants.2026-05-31T11:51:22Z20 pages, 5 figures, 7 tablesJake Yoonhttp://arxiv.org/abs/2602.17834v2Distributed Triangle and Simplex Enumeration in Hypergraphs2026-06-05T15:55:35ZIn the last decade, subgraph detection and enumeration have emerged as central problems in distributed graph algorithms. This is largely due to the problems' theoretical challenges and practical applications. In this paper, we initiate the systematic study of distributed sub-hypergraph enumeration in hypergraphs. To this end, we (1) introduce several computational models for hypergraphs that generalize the CONGEST model for graphs and evaluate their relative computational power, (2) devise algorithms for distributed triangle and simplex enumeration in our computational models and prove their optimality in two such models by showing matching lower bounds, (3) introduce classes of sparse and "everywhere sparse" hypergraphs and describe efficient distributed algorithms for triangle and simplex enumeration in these classes, and (4) describe general techniques that we believe to be useful for designing efficient algorithms in our hypergraph models.2026-02-19T20:57:26ZAdded new results for simplex enumeration. Various improvements for presentationDuncan AdamsonWill RosenbaumPaul G. Spirakishttp://arxiv.org/abs/2602.21411v2General Convex Agreement with Near-Optimal Communication2026-06-05T14:52:37ZByzantine Agreement (BA) considers a setting of $n$ parties out of which up to $t$ can be byzantine (malicious), and requires the honest parties to agree on an input subject to a condition called \emph{validity}: if all honest parties have input $v$, the output agreed upon must be $v$. Convex Agreement (CA) strengthens BA by requiring the output agreed upon to lie in the convex hull of the honest parties' inputs. This validity condition captures aggregation tasks, such as robust learning and sensor fusion, where honest inputs may differ but should still constrain the final decision. Existing protocols for CA over general convexity spaces require at least $O(L \cdot n^2)$ bits of communication for $L$-bit inputs, leaving a gap with BA's $Ω(L \cdot n)$ lower bound.
We investigate this gap, and we present deterministic synchronous CA protocols with near-optimal communication complexity in the long-message regime. When $L=Ω(n\cdotκ)$, where $κ$ is a security parameter, our protocols use $\mathcal{O}(L\cdot n\log n)$ bits of communication for finite convexity spaces and $\mathcal{O}(L\cdot n^{1+o(1)})$ communication for Euclidean spaces $\mathbb{R}^d$. Our protocols also have asymptotically optimal round complexity $\mathcal{O}(n)$. If an upper bound $L$ on the honest inputs' length in bits is known in advance, we achieve near-optimal resilience $t<n/(ω+\varepsilon)$ for any constant $\varepsilon>0$, where $ω$ is the Helly number of the convexity space. When no such bound is known, we achieve resilience $t<n/(ω+\varepsilon+1)$.
As a sample application, we show how our protocols can be used to obtain efficient solutions for parallel instances of BA.
Our main technical contribution is the use of extractor graphs to obtain a deterministic assignment of parties to committees, which is robust against adaptive adversaries.2026-02-24T22:31:20ZWorking paperMarc DufayDiana GhineaAnton Paramonovhttp://arxiv.org/abs/2606.07316v1Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration2026-06-05T14:35:58ZByzantine collaboration among large-language-model agents requires a finality-control primitive: given delivered stochastic, structured natural-language proposals, the protocol must decide whether the round supports a commit, what kind of commit, or a typed safe abort. Naive aggregation hides this choice behind a single verdict; classical Byzantine fault tolerance hides it behind byte-identity that LLM proposals do not satisfy. We introduce Hierarchical Certified Semantic Commitment (H-CSC), a BFT-inspired protocol that converts embedding-derived finality signals over verdict-conditioned proposal groups into one of three typed outcomes: a semantic_commit (a 2f+1 within-verdict semantic core backs the verdict, emitting a parameter-bound digest over the quantised aggregate), a verdict_commit (strong verdict margin but dispersed semantic rationale, emitting a verdict-level certificate without claiming a semantic aggregate), or an explicit abort with a typed reason. The contribution is typed finality, not raw commit accuracy. On a controlled semantic-poisoning diagnostic (BCS_v1, 120 episodes), H-CSC commits with low angular deviation on BFT-feasible buckets (0.31 to 2.04 degrees) and aborts 100% of beyond-BFT rounds (n<3f+1) as intended. On a real LLM-agent claim-verification benchmark (MVR-50, 50 tasks) under paired static and rushing Byzantine attacks, H-CSC commits 0.90/0.92 with honest-reference-invalid rates of 0.02/0.00, statistically matching a strong certificate-emitting verdict-only baseline. Unlike that baseline, H-CSC also emits an embedding-backed semantic_commit digest on 74%/72% of rounds, supplying typed provenance. A strict-semantic ablation commits only 0.54/0.48, showing the verdict-level fallback is necessary for coverage (+0.36/+0.44) at the same <=0.04 safety floor; a 100-task cross-model check across four LLMs preserves invalid_hmaj within 0.00 to 0.03.2026-06-05T14:35:58Z27 pages, 3 figures, 8 tablesHaoran XuLei ZhangIadh OunisXianbin Wanghttp://arxiv.org/abs/2606.07248v1Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends2026-06-05T13:19:05ZSerial LLM inference backends -- such as Ollama -- process requests one at a time under FCFS admission, causing Head-of-Line Blocking (HOLB) under mixed workloads at high utilisation: short factual queries can be delayed by minutes behind long generation jobs. While cloud-scale deployments mitigate HOLB via continuous batching (vLLM, Orca), these solutions require tens of GB of VRAM for concurrent KV-caches -- infeasible for memory-constrained edge and local deployments that rely on serial request dispatch. We present \clairvoyant, a drop-in sidecar proxy for any serial OpenAI-compatible backend (e.g., Ollama, llama.cpp). \clairvoyant predicts response length from 19 lightweight lexical features via an ONNX-exported XGBoost classifier, achieving 0.029\,ms per-request latency (four orders of magnitude below typical generation time). Because admission scheduling depends on relative ordering rather than exact prediction, the system optimises ranking fidelity, achieving 62--96\% in-distribution and 52--66\% cross-distribution accuracy across natural conversation datasets. We find that curated instruction datasets are degenerate training sources for length prediction: GPT-imposed brevity constraints reduce Long-class representation to under 0.02\% of examples, making natural conversation logs the only viable training source. End-to-end GPU benchmarks on an RTX~4090 show 70--76\% P50 latency reduction for short requests under maximum queue pressure (100 concurrent requests) and 17\% under steady-state Poisson arrivals ($ρ=0.74$). \clairvoyant is open-source and requires no modifications to the inference backend.2026-06-05T13:19:05Z17 pages, 3 figures, 8 tables. Code: https://github.com/Aravind0403/clairvoyant-schedulerAravind Sundaresanhttp://arxiv.org/abs/2605.25645v3Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines2026-06-05T11:34:29ZWe present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU and GPU platforms for large language model adaptation. Using LoRA on a Google TPU v5p-8 for training and TPU v6e-8 (Trillium) for inference, we document the full set of code-level adaptations required to port a GPU-native training recipe - built on PyTorch, HuggingFace TRL, and FSDP - to the JAX + Tunix/Qwix stack. These adaptations span mesh configuration, LoRA module naming conventions, sharding annotation corrections, gradient checkpoint, data pipeline restructuring, and a custom Orbax-to-safetensor checkpoint merging procedure. For inference, we detail the vLLM-TPU Docker setup necessary to serve Gemma 4 on v6e-8 and characterize the resulting latency and throughput profile. Compared with a similar-costing 2xH100 GPU baseline under identical hyperparameters, TPU training completes 1.61x faster at 2.12x lower cost. For inference, we cover the vLLM-TPU Docker setup required to serve Gemma 4 on v6e-8 and explain the observed latency and throughput characteristics across a QPS sweep spanning 512 to 16k input tokens. Across both workloads we compare performance and cost against a 2xH100 GPU baseline running identical hyperparameters. The TPU completes training 1.61x faster at 2.12x lower cost. For inference, TPU v6e-8 matches GPU at short context (<=2048 tokens) and decisively outperforms at long context: 66% higher throughput and 23.6x faster TTFT at 4096-token inputs (61 ms vs 1,443 ms at QPS=4). Our work removes a critical gap in the open tooling ecosystem and provides practitioners with a recipe for Gemma 4 Dense 31B deployment on the TPU infrastructure.2026-05-25T09:51:59ZJatin KishnaniMayank GoelAmit SinghPulkit AgrawalSairanjan Mishrahttp://arxiv.org/abs/2606.07046v1Predictive Autoscaling in Cloud-Native and Federated Cloud-Edge Computing Environments: A Taxonomy and Future Directions2026-06-05T08:43:25ZAutoscaling is a key capability in cloud-native systems, where dynamic workloads, heterogeneous environments, and latency-sensitive applications require efficient and adaptive resource management. Traditional reactive approaches based on fixed thresholds often respond too late, leading to resource imbalance, performance degradation, and unstable scaling behavior. Recent advances in predictive models, Kubernetes Custom Resource Definitions (CRDs), Monitor-Analyse-Plan-Execute (MAPE) based control loops, and federated learning (FL) have enabled more proactive and autonomous autoscaling strategies. This paper presents a structured review of these developments. It first introduces a taxonomy of autoscaling techniques based on triggers, targets, prediction models, and evaluation metrics. It then examines predictive autoscaling approaches and CRD-based mechanisms, including Kubernetes operators and reconciliation workflows. Further, it analyses autoscaling in federated learning environments, highlighting reactive and proactive strategies alongside privacy-preserving techniques and container-level isolation. The paper also discusses drift-aware and uncertainty-aware autoscaling, incorporating concepts such as the Autoscaling Drift Index (ADI), feedback-driven correction, and stability control for heterogeneous workloads. Finally, it outlines open challenges and future research directions, providing a foundation for next-generation intelligent predictive autoscaling in cloud-edge environments.2026-06-05T08:43:25ZBablu KumarAnshul VermaRajkumar Buyyahttp://arxiv.org/abs/2606.07019v1PCCL: Process Group-Aware Scalable and Generic Collective Algorithm Synthesizer2026-06-05T08:08:56ZDistributed machine learning has become increasingly important due to the massive scale of large-scale generative models. Both model parameters and data are distributed across many compute devices, which requires frequent collective communications to synchronize activations and parameter updates. Such collective communications have become a major bottleneck. While the performance of the collective algorithm depends on the physical network topology, the baseline collective algorithms in collective communication libraries are largely topology-agnostic. Collective algorithm synthesizers address this inefficiency by automatically generating topology-aware collective algorithms. However, prior works have largely overlooked that collective communication typically occurs only among a subset of devices, known as process groups. Additionally, most existing synthesizers are limited in the range of target collective patterns they can generate. We propose PCCL, a scalable and generic framework for synthesizing topology-aware collective algorithms. PCCL is process group-aware and capable of generating near-optimal collective algorithms even when only a subset of devices participates in collective operations. PCCL synthesizes arbitrary collective patterns, including 512-NPU All-to-All synthesis in 11.68 minutes.2026-06-05T08:08:56ZContains 11 main pages, 19 figures, three tables, three algorithmsWilliam WonKartik LakhotiaMadhu KumarSudarshan SrinivasanTushar Krishnahttp://arxiv.org/abs/2606.06996v1Mission-Level Runtime Assurance Framework for Autonomous Driving2026-06-05T07:35:10ZThis paper studies runtime safety for autonomous driving when high-level driving commands become faulty or unreliable. Unlike conventional runtime-safety approaches that mainly focus on immediate vehicle safety, the proposed framework evaluates both driving safety and whether the vehicle can still successfully complete its mission before a command is executed. The framework extends highway-env with mission-level fault scenarios such as skipping required checkpoints, entering restricted areas, and generating future routes that can no longer complete the mission successfully. A runtime monitoring system is introduced to detect and reject unsafe or mission-infeasible commands before execution. For comparison, an adapted Simplex-Drive runtime-safety baseline with learning-based driving control, safety fallback control, and runtime controller switching is implemented using the public Simplex-Drive framework. Experimental results show that platform-level runtime safety alone cannot detect mission-level planning faults, while the proposed framework successfully rejects mission-infeasible commands and improves mission success under randomized fault conditions.2026-06-05T07:35:10ZChieh TsaiSalim Hariri