http://arxiv.org/abs/2603.05217v1
Scaling Real-Time Traffic Analytics on Edge-Cloud Fabrics for City-Scale Camera Networks
Updated: 2026-03-05T14:30:10Z
Real-time city-scale traffic analytics requires processing hundreds to thousands of CCTV streams under strict latency, bandwidth, and compute limits. We present a scalable AI-driven Intelligent Transportation System (AIITS) designed to address multi-dimensional scaling on an edge-cloud fabric. Our platform transforms live multi-camera video feeds into a dynamic traffic graph through a DNN inferencing pipeline, complemented by real-time nowcasting and short-horizon forecasting using Spatio-Temporal GNNs. On a validation testbed in a Bengaluru neighborhood, we ingest 100+ RTSP feeds from Raspberry Pis, while Jetson Orin edge accelerators perform high-throughput detection and tracking, producing lightweight flow summaries for cloud-based GNN inference. A capacity-aware scheduler orchestrates load-balancing across heterogeneous devices to sustain real-time performance as stream counts increase. To ensure continuous adaptation, we integrate SAM3 foundation-model assisted labeling and Continuous Federated Learning to update DNN detectors on the edge. Experiments show stable ingestion up to 2000 FPS on Jetson Orins, low-latency aggregation, and accurate and scalable ST-GNN forecasts for up to 1000 streams. A planned live demonstration will scale the full pipeline to 1000 streams, showcasing practical, cross-fabric scalability.
Published: 2026-03-05T14:30:10Z
Comment: Accepted at TCSC SCALE Challenge 2026.
To appear in the Proceedings of IEEE/ACM CCGRID Workshops, Sydney, 2026
Authors: Akash Sharma, Pranjal Naman, Roopkatha Banerjee, Priyanshu Pansari, Sankalp Gawali, Mayank Arya, Sharath Chandra, Arun Josephraj, Rakshit Ramesh, Punit Rathore, Anirban Chakraborty, Raghu Krishnapuram, Vijay Kovvali, Yogesh Simmhan

http://arxiv.org/abs/2603.05118v1
Leveraging Structural Knowledge for Solving Election in Anonymous Networks with Shared Randomness
Updated: 2026-03-05T12:44:24Z
We study the classical Election problem in anonymous networks, where solutions can rely on the use of random bits, which may be either shared or unshared among nodes. We provide a complete characterization of the conditions under which a randomized Election algorithm exists, for arbitrary structural knowledge. Our analysis considers both Las Vegas and Monte Carlo randomized algorithms, under the assumptions of shared and unshared randomness. In our setting, random sources are considered shared if the output bits are identical across specific subsets of nodes. The algorithms and impossibility proofs are extensions of those of [5] for the deterministic setting. Our results are a complete generalization of those from [8]. Moreover, as applications, we consider many specific kinds of knowledge: no knowledge, a bound on the size, a bound on the number of nodes sharing a source, the size, or the full topology of the network. For each of them, we show how the general characterizations apply, showing they actually correspond to classes of structural knowledge. We also describe how randomized Election algorithms from the literature fit in this landscape.
We therefore provide a comprehensive picture illustrating how knowledge influences the computability of the Election problem in arbitrary anonymous graphs with shared randomness.
Published: 2026-03-05T12:44:24Z
Comment: Full version of Sirocco'2026
Authors: Jérémie Chalopin, Emmanuel Godard

http://arxiv.org/abs/2603.05087v1
PromptTuner: SLO-Aware Elastic System for LLM Prompt Tuning
Updated: 2026-03-05T11:58:55Z
Prompt tuning has become a prominent strategy for enhancing the performance of Large Language Models (LLMs) on downstream tasks. Many IT enterprises now offer Prompt-Tuning-as-a-Service to fulfill the growing demand for prompt tuning LLMs on downstream tasks. Their primary objective is to satisfy users' Service Level Objectives (SLOs) while reducing resource provisioning costs. Nevertheless, our characterization analysis for existing deep learning resource management systems reveals that they are insufficient to optimize these objectives for LLM prompt tuning workloads.
In this paper, we introduce PromptTuner, an SLO-aware elastic system to optimize LLM prompt tuning. It contains two innovations. (1) We design a Prompt Bank to identify efficient initial prompts to expedite the convergence of prompt tuning. (2) We develop a Workload Scheduler to enable fast resource allocation to reduce SLO violations and resource costs. In our evaluation, PromptTuner reduces SLO violations by 4.0x and 7.9x, and lowers costs by 1.6x and 4.5x, compared to INFless and ElasticFlow respectively.
Published: 2026-03-05T11:58:55Z
Authors: Wei Gao, Peng Sun, Dmitrii Ustiugov, Tianwei Zhang, Yonggang Wen

http://arxiv.org/abs/2602.13046v2
Classification of Local Optimization Problems in Directed Cycles
Updated: 2026-03-05T09:29:48Z
We present a complete classification of the distributed computational complexity of local optimization problems in directed cycles for both the deterministic and the randomized LOCAL model. We show that for any local optimization problem $Π$ (that can be of the form min-sum, max-sum, min-max, or max-min, for any local cost or utility function over some finite alphabet), and for any constant approximation ratio $α$, the task of finding an $α$-approximation of $Π$ in directed cycles has one of the following complexities:
1. $O(1)$ rounds in deterministic LOCAL, $O(1)$ rounds in randomized LOCAL,
2. $Θ(\log^* n)$ rounds in deterministic LOCAL, $O(1)$ rounds in randomized LOCAL,
3. $Θ(\log^* n)$ rounds in deterministic LOCAL, $Θ(\log^* n)$ rounds in randomized LOCAL,
4. $Θ(n)$ rounds in deterministic LOCAL, $Θ(n)$ rounds in randomized LOCAL.
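To give a feel for how slowly the $Θ(\log^* n)$ classes above grow, here is a minimal sketch of the iterated logarithm. The function name `log_star` and the choice of base 2 are illustrative assumptions and do not come from the paper.

```python
import math


def log_star(n: float) -> int:
    """Iterated logarithm (base 2): number of times log2 must be
    applied to n before the result drops to 1 or below."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


if __name__ == "__main__":
    # Even for very large cycle sizes the value stays tiny.
    for size in (2, 16, 65536, 10**18):
        print(size, log_star(size))
```

For every cycle size that could arise in practice the value is a small single-digit number, which is why the gap between $O(1)$ and $Θ(\log^* n)$ matters in theory far more than in measured round counts.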
Moreover, for any given $Π$ and $α$, we can determine the complexity class automatically, with an efficient (centralized, sequential) meta-algorithm, and we can also efficiently synthesize an asymptotically optimal distributed algorithm.
Before this work, similar results were only known for local search problems (e.g., locally checkable labeling problems). The family of local optimization problems is a strict generalization of local search problems, and it contains numerous commonly studied distributed tasks, such as the problems of finding approximations of the maximum independent set, minimum vertex cover, minimum dominating set, and minimum vertex coloring.
Published: 2026-02-13T16:03:14Z
Comment: 26 pages, 2 figures
Authors: Thomas Boudier, Fabian Kuhn, Augusto Modanese, Ronja Stimpert, Jukka Suomela

http://arxiv.org/abs/2603.04235v2
2-Coloring Cycles in One Round
Updated: 2026-03-05T09:04:42Z
We show that there is a one-round randomized distributed algorithm that can 2-color cycles such that the expected fraction of monochromatic edges is less than 0.24118. We also show that a one-round algorithm cannot achieve a fraction less than 0.23879. Before this work, the best upper and lower bounds were 0.25 and 0.2. Our proof was largely discovered and developed by large language models, and both the upper and lower bounds have been formalized in Lean 4.
Published: 2026-03-04T16:18:30Z
Comment: 9 pages, 3 figures
Authors: Maxime Flin, Alesya Raevskaya, Ronja Stimpert, Jukka Suomela, Qingxin Yang

http://arxiv.org/abs/2603.04937v1
FluxSieve: Unifying Streaming and Analytical Data Planes for Scalable Cloud Observability
Updated: 2026-03-05T08:36:59Z
Despite many advances in query optimization, indexing techniques, and data storage, modern data platforms still face difficulties in delivering robust query performance under high concurrency and computationally intensive queries. This challenge is particularly pronounced in large-scale observability platforms handling high-volume, high-velocity data records. For instance, recurrent, expensive filtering queries at query time impose substantial computational and storage overheads in the analytical data plane.
In this paper, we propose FluxSieve, a unified architecture that reconciles traditional pull-based query processing with push-based stream processing by embedding a lightweight in-stream precomputation and filtering layer directly into the data ingestion path. This avoids the complexity and operational burden of running queries in dedicated stream processing frameworks. Concretely, this work (i) introduces a foundational architecture that unifies streaming and analytical data planes via in-stream filtering and record enrichment, (ii) designs a scalable multi-pattern matching mechanism that supports concurrent evaluation and on-the-fly updates of filtering rules with minimal per-record overhead, (iii) demonstrates how to integrate this ingestion-time processing with two open-source analytical systems -- Apache Pinot as a Real-Time Online Analytical Processing (RTOLAP) engine and DuckDB as an embedded analytical database, and (iv) performs a comprehensive experimental evaluation of our approach. Our evaluation across different systems, query types, and performance metrics shows up to orders-of-magnitude improvements in query performance at the cost of negligible additional storage and very low computational overhead.
Published: 2026-03-05T08:36:59Z
Authors: Adriano Vogel, Sören Henning, Otmar Ertl

http://arxiv.org/abs/2603.04826v1
The Semantic Arrow of Time, Part V: The Leibniz Bridge -- Toward a Unified Theory of Semantic Time
Updated: 2026-03-05T05:19:47Z
This is the final paper in the five-part series The Semantic Arrow of Time. Part I identified the FITO category mistake -- treating forward temporal flow as sufficient for establishing meaning. Part II presented the constructive alternative: the OAE link state machine with its mandatory reflecting phase. Part III showed the FITO fallacy operating at industrial scale in RDMA completion semantics. Part IV traced the same pattern through file synchronization, email, human memory, and language model hallucination.
This paper closes the series by constructing the Leibniz Bridge: a unified framework that connects the philosophical foundations (Leibniz's Identity of Indiscernibles, as formalized by Spekkens), the protocol engineering (OAE's bilateral transaction structure), and the physical substrate (indefinite causal order in quantum mechanics). The bridge rests on a single principle: mutual information conservation -- the requirement that every causal exchange preserve the total information accessible to both endpoints, with the direction of time emerging not from axiom but from entropy production when a reversible exchange commits.
We show that this principle dissolves the apparent impossibility of the FLP, Two Generals, and CAP theorems by revealing them as theorems about FITO systems, not about physics. We present the triangle network as the minimal topology for semantic consistency without centralized coordination. We conclude with open questions and a reflection on what distributed computing looks like when the FITO assumption is dropped.
Published: 2026-03-05T05:19:47Z
Comment: 6 figures. Part V of V in "The Semantic Arrow of Time" series
Authors: Paul Borrill

http://arxiv.org/abs/2603.04810v1
The Semantic Arrow of Time, Part IV: Why Transactions Fail
Updated: 2026-03-05T04:54:24Z
This is the fourth of five papers comprising The Semantic Arrow of Time. Parts I-III established that computing's hidden arrow of time is semantic rather than thermodynamic, that bilateral transaction protocols create causal order through a mandatory reflecting phase, and that RDMA's completion semantics implement the FITO category mistake at industrial scale.
This paper traces the consequences of the FITO category mistake beyond the data center, into systems people use every day. We examine three domains where forward-only temporal assumptions destroy meaning: file synchronization, where cloud platforms silently delete user content because last-writer-wins cannot represent distributed causality; email, where timestamp-based ordering produces phantom messages, causality violations, and stuck synchronization; and memory--both human and artificial--where reconstructive processes that operate without transactional guarantees produce systematic semantic corruption.
In each domain, we identify the same structural pattern: a system that commits state changes forward in time without a reflecting phase, and that therefore cannot distinguish between successful semantic integration and mere temporal succession. The pattern is not coincidental. It is the FITO category mistake operating at different scales: bytes in a NIC buffer, files in a cloud, messages in an inbox, engrams in a hippocampus, tokens in a transformer.
We conclude that the semantic arrow of time is violated whenever a system treats the forward flow of information as sufficient evidence of meaning. Part V will show how the Leibniz Bridge provides a unified framework for closing this gap across all five domains.
Published: 2026-03-05T04:54:24Z
Comment: 13 pages, 0 figures. Part IV of V in The Semantic Arrow of Time series
Authors: Paul Borrill

http://arxiv.org/abs/2603.04782v1
Unlocking Python's Cores: Hardware Usage and Energy Implications of Removing the GIL
Updated: 2026-03-05T04:01:30Z
Python's Global Interpreter Lock prevents execution on more than one CPU core at the same time, even when multiple threads are used. However, starting with Python 3.13 an experimental build allows disabling the GIL. While prior work has examined speedup implications of this disabling, the effects on energy consumption and hardware utilization have received less attention. This study measures execution time, CPU utilization, memory usage, and energy consumption using four workload categories: NumPy-based, sequential kernels, threaded numerical workloads, and threaded object workloads, comparing GIL and free-threaded builds of Python 3.14.2.
The results highlight a trade-off. For parallelizable workloads operating on independent data, the free-threaded build reduces execution time by up to 4 times, with a proportional reduction in energy consumption, and effective multi-core utilization, at the cost of an increase in memory usage. In contrast, sequential workloads do not benefit from removing the GIL and instead show a 13-43% increase in energy consumption. Similarly, workloads where threads frequently access and modify the same objects show reduced improvements or even degradation due to lock contention. Across all workloads, energy consumption is proportional to execution time, indicating that disabling the GIL does not significantly affect power consumption, even when CPU utilization increases. When it comes to memory, the no-GIL build shows a general increase, more visible in virtual memory than in physical memory. This increase is primarily attributed to per-object locking, additional thread-safety mechanisms in the runtime, and the adoption of a new memory allocator.
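The kind of "parallelizable workload operating on independent data" described above can be sketched as follows. This is an illustrative micro-benchmark, not the study's benchmark suite: the prime-counting kernel, the chunking scheme, and the function names are assumptions. `sys._is_gil_enabled()` is available from Python 3.13 onward; on older interpreters the sketch assumes the GIL is on.

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def gil_enabled() -> bool:
    # sys._is_gil_enabled() exists on Python 3.13+; older builds always have the GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else check()


def count_primes(lo: int, hi: int) -> int:
    """CPU-bound kernel over the half-open range [lo, hi); shares no state."""
    total = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            total += 1
    return total


def run(num_threads: int, limit: int = 20000) -> tuple[int, float]:
    """Split [0, limit) into independent chunks, one per thread."""
    step = limit // num_threads
    chunks = [
        (i * step, limit if i == num_threads - 1 else (i + 1) * step)
        for i in range(num_threads)
    ]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        counts = list(pool.map(lambda c: count_primes(*c), chunks))
    return sum(counts), time.perf_counter() - start


if __name__ == "__main__":
    print("GIL enabled:", gil_enabled())
    for threads in (1, 4):
        primes, seconds = run(threads)
        print(f"{threads} thread(s): {primes} primes in {seconds:.3f}s")
```

On a standard build the four-thread run should take roughly as long as the single-thread run, since only one thread executes bytecode at a time. On a free-threaded build the chunks can run on separate cores. Measuring energy, as the study does, needs external tooling that this sketch does not cover.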
These findings suggest that Python's no-GIL build is not a universal improvement. Developers should evaluate whether their workload can effectively benefit from parallel execution before adoption.
Published: 2026-03-05T04:01:30Z
Authors: José Daniel Montoya Salazar

http://arxiv.org/abs/2603.04774v1
The Semantic Arrow of Time, Part III: RDMA and the Completion Fallacy
Updated: 2026-03-05T03:45:55Z
This is the third of five papers comprising The Semantic Arrow of Time. Parts I and II identified computing's hidden semantic arrow of time, the FITO category mistake, and presented the constructive alternative: the OAE link state machine with its mandatory reflecting phase.
This paper examines what happens when those principles are violated at industrial scale. Remote Direct Memory Access (RDMA) is the highest-performance data movement technology in production, deployed across Meta's 24,000-GPU clusters, Google's data centers, and Microsoft's Azure infrastructure. We argue that RDMA's completion semantics contain a category mistake: they guarantee placement (data written to a remote NIC buffer) but not commitment (data semantically integrated by the receiving application). We call this the completion fallacy.
We document the fallacy through seven temporal stages of an RDMA Write operation, showing that the gap between completion signal and application semantic satisfaction can be arbitrarily large. We trace consequences through four case studies: Meta's RoCE fabric, Google's 1RMA redesign, Microsoft's DCQCN failures, and SDR-RDMA partial completions.
A comparative analysis shows CXL 3.0, NVLink, and UALink each address parts of the completion fallacy but none eliminates it entirely. Only a protocol architecture with a mandatory reflecting phase can close the gap between delivery and commitment.
Published: 2026-03-05T03:45:55Z
Comment: 9 pages, 0 figures, 1 table. Part III of V in The Semantic Arrow of Time series
Authors: Paul Borrill

http://arxiv.org/abs/2511.12185v3
Combining Serverless and High-Performance Computing Paradigms to support ML Data-Intensive Applications
Updated: 2026-03-05T02:42:59Z
Data is found everywhere, from health and human infrastructure to the surge of sensors and the proliferation of internet-connected devices. To meet this challenge, the data engineering field has expanded significantly in recent years in both research and industry. Traditionally, data engineering, Machine Learning, and AI workloads have been run on large clusters within data center environments, requiring substantial investment in hardware and maintenance. With the rise of the public cloud, it is now possible to run large applications across nodes without owning or maintaining hardware. Serverless functions such as AWS Lambda provide horizontal scaling and precise billing without the hassle of managing traditional cloud infrastructure. However, when processing large datasets, users often rely on external storage options that are significantly slower than direct communication typical of HPC clusters. We introduce Cylon, a high-performance distributed data frame solution that has shown promising results for data processing using Python. We describe how we took inspiration from the FMI library and designed a serverless communicator to tackle communication and performance issues associated with serverless functions.
With our design, we demonstrate that the scaling efficiency of AWS Lambda comes within 6.5% of serverful AWS (EC2) at 64 nodes, based on implementing direct communication via NAT Traversal TCP Hole Punching.
Published: 2025-11-15T12:28:39Z
Comment: 12 pages, 9 figures, 3 tables
Authors: Mills Staylor, Arup Kumar Sarker, Gregor von Laszewski, Geoffrey Fox, Yue Cheng, Judy Fox

http://arxiv.org/abs/2603.04716v1
SLO-Aware Compute Resource Allocation for Prefill-Decode Disaggregated LLM Inference
Updated: 2026-03-05T01:41:09Z
Prefill-Decode (P/D) disaggregation has emerged as a widely adopted optimization strategy for Large Language Model (LLM) inference. However, there currently exists no well-established methodology for determining the optimal number of P/D hardware resources, subject to constraints on total throughput, service level objectives (SLOs), and request characteristics - specifically input and output lengths. To address this gap, we propose a hybrid approach that combines theoretical modeling with empirical benchmarking. First, we present a theoretical model for calculating P/D resource counts, which is based on total throughput requirements, request input and output lengths, as well as prefill and decode throughput. Then, to obtain the actual prefill and decode throughput under SLO constraints, we model the prefill process using M/M/1 queuing theory, deriving the achieved prefill throughput from the benchmarked maximum prefill throughput and Time-To-First-Token (TTFT). For the decode phase, we determine the decode batch sizes that meet Time-Per-Output-Token (TPOT) requirements and obtain the corresponding decode throughput through empirical measurements.
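The sizing logic described above can be sketched with the textbook M/M/1 result that the mean time in system is 1/(μ − λ). This is not the paper's exact model. It assumes the TTFT target constrains the mean time in the prefill queue, that throughputs are measured in tokens per second per instance, and that the decode throughput at a TPOT-compliant batch size has already been measured. All function and parameter names are hypothetical.

```python
import math


def prefill_tps_under_ttft(max_prefill_tps: float, input_len: int, ttft_s: float) -> float:
    """Achievable prefill throughput (tokens/s) of one instance under a mean-TTFT target.

    M/M/1: mean time in system W = 1 / (mu - lam), so lam_max = mu - 1 / W.
    """
    mu = max_prefill_tps / input_len      # service rate in requests/s
    lam = mu - 1.0 / ttft_s               # highest arrival rate meeting the target
    if lam <= 0:
        raise ValueError("TTFT target is unreachable even with an empty queue")
    return lam * input_len


def pd_instance_counts(total_rps: float, input_len: int, output_len: int,
                       prefill_tps: float, decode_tps: float) -> tuple[int, int]:
    """Instances needed so prefill and decode token rates cover the request rate."""
    n_prefill = math.ceil(total_rps * input_len / prefill_tps)
    n_decode = math.ceil(total_rps * output_len / decode_tps)
    return n_prefill, n_decode


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    p_tps = prefill_tps_under_ttft(max_prefill_tps=20000, input_len=2000, ttft_s=0.5)
    print(p_tps)
    print(pd_instance_counts(total_rps=40, input_len=2000, output_len=300,
                             prefill_tps=p_tps, decode_tps=4000))
```

With these illustrative inputs the service rate is 10 requests/s, the TTFT target caps arrivals at 8 requests/s (16000 tokens/s), and the sketch yields 5 prefill and 3 decode instances. A percentile-based TTFT target would need the M/M/1 sojourn-time distribution rather than its mean.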
Our experimental results demonstrate that the proposed method can accurately predict optimal P/D resource allocation in real-world LLM inference scenarios.
Published: 2026-03-05T01:41:09Z
Comment: 10 pages, 3 figures
Authors: Luchang Li, Dongfang Li, Bozhao Gong, Yu Zhang

http://arxiv.org/abs/2603.04621v1
DuaLip-GPU Technical Report
Updated: 2026-03-04T21:30:10Z
Large-scale linear programs (LPs) arise in many decision systems, including ranking, allocation, and matching problems that must be solved repeatedly at massive scale. Prior work such as ECLIPSE and LinkedIn's open-source DuaLip showed that ridge-regularized dual ascent with first-order methods can scale to these settings. However, the original implementation was tightly coupled to a small number of schemas and built on a CPU-centric Scala/Spark stack, limiting extensibility and preventing effective use of modern accelerators.
We present a redesigned solver architecture that decouples problem specification from the optimization engine and targets GPU execution. The system uses an operator-centric programming model in which LP formulations are expressed through composable primitives for dual objective evaluation and blockwise projection operators for decomposable constraint families. This design allows new formulations to be added locally while reusing a shared optimization loop, diagnostics, and distributed infrastructure.
To realize the available parallelism, we develop GPU execution techniques tailored to sparse matching constraints, including constraint-aligned sparse layouts, batched projection kernels, and a distributed design that communicates only dual variables. Further, we improve the underlying ridge-regularized dual ascent method with Jacobi-style row normalization, primal scaling, and a continuation scheme for the regularization parameter.
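As a toy illustration of ridge-regularized dual ascent with a projection operator, the following pure-Python sketch solves min c·x + (γ/2)‖x‖² subject to Ax ≤ b with x in the unit box. It is not the DuaLip-GPU implementation: the box constraint, the fixed step size, and all names are assumptions, and none of the report's improvements (row normalization, primal scaling, continuation on γ) are included.

```python
def clip01(v: float) -> float:
    """Projection of a scalar onto [0, 1] (the box constraint on each x_j)."""
    return min(1.0, max(0.0, v))


def primal_from_dual(c, A, lam, gamma):
    """Minimizer over the box of c.x + (gamma/2)|x|^2 + lam.(Ax - b)."""
    m, n = len(A), len(c)
    return [
        clip01(-(c[j] + sum(A[i][j] * lam[i] for i in range(m))) / gamma)
        for j in range(n)
    ]


def dual_ascent(c, A, b, gamma=0.1, iters=200):
    """Projected gradient ascent on the dual variables lam >= 0."""
    m, n = len(A), len(c)
    # The dual gradient is Lipschitz with constant at most ||A||_F^2 / gamma,
    # so step = gamma / ||A||_F^2 is a safe fixed step size.
    step = gamma / sum(a * a for row in A for a in row)
    lam = [0.0] * m
    for _ in range(iters):
        x = primal_from_dual(c, A, lam, gamma)
        # The gradient of the dual objective is the constraint residual Ax - b.
        resid = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        lam = [max(0.0, lam[i] + step * resid[i]) for i in range(m)]
    return primal_from_dual(c, A, lam, gamma), lam


if __name__ == "__main__":
    # maximize x1 + x2 subject to x1 + x2 <= 1, 0 <= x <= 1
    x, lam = dual_ascent(c=[-1.0, -1.0], A=[[1.0, 1.0]], b=[1.0])
    print(x, lam)
```

On this two-variable problem the ridge term selects the symmetric solution x ≈ (0.5, 0.5). Only the dual vector `lam` carries state between iterations, which is consistent with the report's description of a distributed design that communicates only dual variables.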
On extreme-scale matching workloads, the GPU implementation achieves at least a 10x wall-clock speedup over the prior distributed CPU DuaLip solver under matched stopping criteria, while maintaining convergence guarantees.
Published: 2026-03-04T21:30:10Z
Authors: Gregory Dexter, Aida Rahmattalabi, Sanjana Garg, Qinquan Song, Ruby Tu, Yuan Gao, Yi Zhang, Zhipeng Wang, Rahul Mazumder

http://arxiv.org/abs/2512.22695v2
Modality Inflation: Energy Characterization and Optimization Opportunities for MLLM Inference
Updated: 2026-03-04T20:53:09Z
Multimodal large language models (MLLMs) are built on text-only LLMs by incorporating additional modalities, enabling multimodal understanding and a broader range of applications. However, these additions introduce a previously unexplored energy trade-off across modalities that remains poorly understood, as most prior work focuses on text-only models. In this paper, we examine modality inflation, a key source of inefficiency in which multimodal inputs increase inference workloads through extra encoding stages and expanded token sequences. We provide the first detailed, stage-level analysis of energy consumption in MLLM inference by breaking the pipeline into vision encoding, prefill, and decoding stages. Using four representative MLLMs evaluated on an NVIDIA A100 GPU, we quantify the additional energy required for multimodal inference compared to text-only baselines, observing overheads ranging from 17% to 94% across models for identical inputs. Our results show that energy bottlenecks differ widely across model architectures, stemming either from compute-heavy vision encoders or from the downstream impact of large visual token sequences during prefill. By examining GPU power traces, we further uncover substantial GPU underutilization during multimodal execution and show that input complexity leads to markedly different energy scaling behaviors across models.
Finally, we demonstrate that stage-wise dynamic voltage and frequency scaling (DVFS) is an effective optimization, allowing energy savings with only modest performance impact. Together, these findings offer practical insights and concrete guidance for designing more energy-efficient multimodal LLM serving systems.
Published: 2025-12-27T19:49:21Z
Authors: Mona Moghadampanah, Adib Rezaei Shahmirzadi, Farhana Amin, Dimitrios S. Nikolopoulos

http://arxiv.org/abs/2603.04583v1
Overcoming Latency-bound Limitations of Distributed Graph Algorithms using the HPX Runtime System
Updated: 2026-03-04T20:26:30Z
Graph processing at scale presents many challenges, including the irregular structure of graphs, the latency-bound nature of graph algorithms, and the overhead associated with distributed execution. While existing frameworks such as Spark GraphX and the Parallel Boost Graph Library (PBGL) have introduced abstractions for distributed graph processing, they continue to struggle with inherent issues like load imbalance and synchronization overhead. In this work, we present a distributed library prototype and a distributed implementation of three key graph algorithms - Breadth-First Search (BFS), PageRank, and Triangle Counting - using C++ mechanisms from the NWgraph library and leveraging HPX's distributed containers and asynchronous constructs. These algorithms span the categories of traversal, centrality, and pattern matching, and are selected to represent diverse computational characteristics. We evaluate our HPX-based implementations against GraphX and PBGL, showing that a high-performance runtime such as HPX enables the construction of algorithms that significantly outperform conventional frameworks by exploiting asynchronous execution, latency hiding, and fine-grained parallelism in shared memory. All algorithms in our prototype follow a unified execution model in which local and remote computations are expressed using the same programming abstractions, with asynchrony managed transparently by the runtime.
This design explicitly leverages shared-memory parallelism within each locality while overlapping communication and computation across localities, providing a practical foundation for extending this approach to a broader class of distributed graph algorithms.
Published: 2026-03-04T20:26:30Z
Comment: IEEE-format paper, submitted to GrAPL Workshop at IPDPS conference. 4 authors, 12 pages
Authors: Karame Mohammadiporshokooh, Panagiotis Syskakis, Andrew Lumsdaine, Hartmut Kaiser