http://arxiv.org/abs/2603.04826v1
The Semantic Arrow of Time, Part V: The Leibniz Bridge -- Toward a Unified Theory of Semantic Time
2026-03-05T05:19:47Z
This is the final paper in the five-part series The Semantic Arrow of Time. Part I identified the FITO category mistake -- treating forward temporal flow as sufficient for establishing meaning. Part II presented the constructive alternative: the OAE link state machine with its mandatory reflecting phase. Part III showed the FITO fallacy operating at industrial scale in RDMA completion semantics. Part IV traced the same pattern through file synchronization, email, human memory, and language model hallucination.
This paper closes the series by constructing the Leibniz Bridge: a unified framework that connects the philosophical foundations (Leibniz's Identity of Indiscernibles, as formalized by Spekkens), the protocol engineering (OAE's bilateral transaction structure), and the physical substrate (indefinite causal order in quantum mechanics). The bridge rests on a single principle: mutual information conservation -- the requirement that every causal exchange preserve the total information accessible to both endpoints, with the direction of time emerging not from axiom but from entropy production when a reversible exchange commits.
We show that this principle dissolves the apparent impossibility of the FLP, Two Generals, and CAP theorems by revealing them as theorems about FITO systems, not about physics. We present the triangle network as the minimal topology for semantic consistency without centralized coordination. We conclude with open questions and a reflection on what distributed computing looks like when the FITO assumption is dropped.
2026-03-05T05:19:47Z
6 figures. Part V of V in "The Semantic Arrow of Time" series
Paul Borrill

http://arxiv.org/abs/2603.04810v1
The Semantic Arrow of Time, Part IV: Why Transactions Fail
2026-03-05T04:54:24Z
This is the fourth of five papers comprising The Semantic Arrow of Time. Parts I-III established that computing's hidden arrow of time is semantic rather than thermodynamic, that bilateral transaction protocols create causal order through a mandatory reflecting phase, and that RDMA's completion semantics implement the FITO category mistake at industrial scale.
This paper traces the consequences of the FITO category mistake beyond the data center, into systems people use every day. We examine three domains where forward-only temporal assumptions destroy meaning: file synchronization, where cloud platforms silently delete user content because last-writer-wins cannot represent distributed causality; email, where timestamp-based ordering produces phantom messages, causality violations, and stuck synchronization; and memory--both human and artificial--where reconstructive processes that operate without transactional guarantees produce systematic semantic corruption.
In each domain, we identify the same structural pattern: a system that commits state changes forward in time without a reflecting phase, and that therefore cannot distinguish between successful semantic integration and mere temporal succession. The pattern is not coincidental. It is the FITO category mistake operating at different scales: bytes in a NIC buffer, files in a cloud, messages in an inbox, engrams in a hippocampus, tokens in a transformer.
We conclude that the semantic arrow of time is violated whenever a system treats the forward flow of information as sufficient evidence of meaning. Part V will show how the Leibniz Bridge provides a unified framework for closing this gap across all five domains.
2026-03-05T04:54:24Z
13 pages, 0 figures. Part IV of V in The Semantic Arrow of Time series
Paul Borrill

http://arxiv.org/abs/2603.04782v1
Unlocking Python's Cores: Hardware Usage and Energy Implications of Removing the GIL
2026-03-05T04:01:30Z
Python's Global Interpreter Lock prevents execution on more than one CPU core at the same time, even when multiple threads are used. However, starting with Python 3.13 an experimental build allows disabling the GIL. While prior work has examined speedup implications of this disabling, the effects on energy consumption and hardware utilization have received less attention. This study measures execution time, CPU utilization, memory usage, and energy consumption using four workload categories: NumPy-based, sequential kernels, threaded numerical workloads, and threaded object workloads, comparing GIL and free-threaded builds of Python 3.14.2.
The results highlight a trade-off. For parallelizable workloads operating on independent data, the free-threaded build reduces execution time by up to 4 times, with a proportional reduction in energy consumption, and effective multi-core utilization, at the cost of an increase in memory usage. In contrast, sequential workloads do not benefit from removing the GIL and instead show a 13-43% increase in energy consumption. Similarly, workloads where threads frequently access and modify the same objects show reduced improvements or even degradation due to lock contention. Across all workloads, energy consumption is proportional to execution time, indicating that disabling the GIL does not significantly affect power consumption, even when CPU utilization increases. When it comes to memory, the no-GIL build shows a general increase, more visible in virtual memory than in physical memory. This increase is primarily attributed to per-object locking, additional thread-safety mechanisms in the runtime, and the adoption of a new memory allocator.
These findings suggest that Python's no-GIL build is not a universal improvement. Developers should evaluate whether their workload can effectively benefit from parallel execution before adoption.
2026-03-05T04:01:30Z
José Daniel Montoya Salazar

http://arxiv.org/abs/2603.04774v1
The Semantic Arrow of Time, Part III: RDMA and the Completion Fallacy
2026-03-05T03:45:55Z
This is the third of five papers comprising The Semantic Arrow of Time. Parts I and II identified computing's hidden semantic arrow of time, the FITO category mistake, and presented the constructive alternative: the OAE link state machine with its mandatory reflecting phase.
This paper examines what happens when those principles are violated at industrial scale. Remote Direct Memory Access (RDMA) is the highest-performance data movement technology in production, deployed across Meta's 24,000-GPU clusters, Google's data centers, and Microsoft's Azure infrastructure. We argue that RDMA's completion semantics contain a category mistake: they guarantee placement (data written to a remote NIC buffer) but not commitment (data semantically integrated by the receiving application). We call this the completion fallacy.
We document the fallacy through seven temporal stages of an RDMA Write operation, showing that the gap between completion signal and application semantic satisfaction can be arbitrarily large. We trace consequences through four case studies: Meta's RoCE fabric, Google's 1RMA redesign, Microsoft's DCQCN failures, and SDR-RDMA partial completions.
A comparative analysis shows CXL 3.0, NVLink, and UALink each address parts of the completion fallacy but none eliminates it entirely. Only a protocol architecture with a mandatory reflecting phase can close the gap between delivery and commitment.
2026-03-05T03:45:55Z
9 pages, 0 figures, 1 table. Part III of V in The Semantic Arrow of Time series
Paul Borrill

http://arxiv.org/abs/2511.12185v3
Combining Serverless and High-Performance Computing Paradigms to support ML Data-Intensive Applications
2026-03-05T02:42:59Z
Data is found everywhere, from health and human infrastructure to the surge of sensors and the proliferation of internet-connected devices. To meet this challenge, the data engineering field has expanded significantly in recent years in both research and industry. Traditionally, data engineering, Machine Learning, and AI workloads have been run on large clusters within data center environments, requiring substantial investment in hardware and maintenance. With the rise of the public cloud, it is now possible to run large applications across nodes without owning or maintaining hardware. Serverless functions such as AWS Lambda provide horizontal scaling and precise billing without the hassle of managing traditional cloud infrastructure. However, when processing large datasets, users often rely on external storage options that are significantly slower than direct communication typical of HPC clusters. We introduce Cylon, a high-performance distributed data frame solution that has shown promising results for data processing using Python. We describe how we took inspiration from the FMI library and designed a serverless communicator to tackle communication and performance issues associated with serverless functions.
With our design, we demonstrate that the scaling efficiency of AWS Lambda comes within 6.5% of serverful AWS (EC2) at 64 nodes, based on implementing direct communication via NAT Traversal TCP Hole Punching.
2025-11-15T12:28:39Z
12 pages, 9 figures, 3 tables
Mills Staylor, Arup Kumar Sarker, Gregor von Laszewski, Geoffrey Fox, Yue Cheng, Judy Fox

http://arxiv.org/abs/2603.04716v1
SLO-Aware Compute Resource Allocation for Prefill-Decode Disaggregated LLM Inference
2026-03-05T01:41:09Z
Prefill-Decode (P/D) disaggregation has emerged as a widely adopted optimization strategy for Large Language Model (LLM) inference. However, there currently exists no well-established methodology for determining the optimal number of P/D hardware resources, subject to constraints on total throughput, service level objectives (SLOs), and request characteristics, specifically input and output lengths. To address this gap, we propose a hybrid approach that combines theoretical modeling with empirical benchmarking. First, we present a theoretical model for calculating P/D resource counts, which is based on total throughput requirements, request input and output lengths, as well as prefill and decode throughput. Then, to obtain the actual prefill and decode throughput under SLO constraints, we model the prefill process using M/M/1 queuing theory, deriving the achieved prefill throughput from the benchmarked maximum prefill throughput and Time-To-First-Token (TTFT). For the decode phase, we determine the decode batch sizes that meet Time-Per-Output-Token (TPOT) requirements and obtain the corresponding decode throughput through empirical measurements.
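The M/M/1 step described above can be illustrated with a small sketch. The abstract does not give its formulas, so this is only one plausible reading: it assumes TTFT is identified with the mean sojourn time of an M/M/1 queue, W = 1 / (mu - lambda), where mu is the benchmarked maximum prefill rate of one instance in requests per second. The function names, parameters, and example numbers are all hypothetical and are not taken from the paper.

```python
import math


def mm1_max_arrival_rate(service_rate: float, ttft_target: float) -> float:
    """Largest arrival rate (req/s) whose mean M/M/1 sojourn time meets ttft_target.

    Assumption: TTFT is modelled as the mean sojourn time W = 1 / (mu - lambda),
    so W <= ttft_target is equivalent to lambda <= mu - 1 / ttft_target.
    """
    if service_rate <= 0 or ttft_target <= 0:
        raise ValueError("service_rate and ttft_target must be positive")
    return max(0.0, service_rate - 1.0 / ttft_target)


def instance_counts(total_req_rate: float,
                    prefill_service_rate: float,
                    ttft_target: float,
                    output_len: int,
                    decode_tokens_per_s: float) -> tuple[int, int]:
    """Hypothetical sizing rule: (prefill instances, decode instances).

    decode_tokens_per_s stands in for the empirically measured decode
    throughput of one instance at a batch size that meets the TPOT target.
    """
    achieved_prefill_rate = mm1_max_arrival_rate(prefill_service_rate, ttft_target)
    if achieved_prefill_rate <= 0:
        raise ValueError("TTFT target is unreachable even at zero load")
    n_prefill = math.ceil(total_req_rate / achieved_prefill_rate)
    n_decode = math.ceil(total_req_rate * output_len / decode_tokens_per_s)
    return n_prefill, n_decode


if __name__ == "__main__":
    # Made-up numbers: 40 req/s overall, 10 req/s max prefill rate per instance,
    # 0.5 s TTFT target, 200 output tokens, 2000 decode tokens/s per instance.
    print(instance_counts(40.0, 10.0, 0.5, 200, 2000.0))
```

Under these assumptions a tighter TTFT target lowers the usable prefill rate per instance and so raises the prefill instance count, which is the qualitative trade-off the abstract describes.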
Our experimental results demonstrate that the proposed method can accurately predict optimal P/D resource allocation in real-world LLM inference scenarios.
2026-03-05T01:41:09Z
10 pages, 3 figures
Luchang Li, Dongfang Li, Bozhao Gong, Yu Zhang

http://arxiv.org/abs/2603.04621v1
DuaLip-GPU Technical Report
2026-03-04T21:30:10Z
Large-scale linear programs (LPs) arise in many decision systems, including ranking, allocation, and matching problems that must be solved repeatedly at massive scale. Prior work such as ECLIPSE and LinkedIn's open-source DuaLip showed that ridge-regularized dual ascent with first-order methods can scale to these settings. However, the original implementation was tightly coupled to a small number of schemas and built on a CPU-centric Scala/Spark stack, limiting extensibility and preventing effective use of modern accelerators.
We present a redesigned solver architecture that decouples problem specification from the optimization engine and targets GPU execution. The system uses an operator-centric programming model in which LP formulations are expressed through composable primitives for dual objective evaluation and blockwise projection operators for decomposable constraint families. This design allows new formulations to be added locally while reusing a shared optimization loop, diagnostics, and distributed infrastructure.
To realize the available parallelism, we develop GPU execution techniques tailored to sparse matching constraints, including constraint-aligned sparse layouts, batched projection kernels, and a distributed design that communicates only dual variables. Further, we improve the underlying ridge-regularized dual ascent method with Jacobi-style row normalization, primal scaling, and a continuation scheme for the regularization parameter.
On extreme-scale matching workloads, the GPU implementation achieves at least a 10x wall-clock speedup over the prior distributed CPU DuaLip solver under matched stopping criteria, while maintaining convergence guarantees.
2026-03-04T21:30:10Z
Gregory Dexter, Aida Rahmattalabi, Sanjana Garg, Qinquan Song, Ruby Tu, Yuan Gao, Yi Zhang, Zhipeng Wang, Rahul Mazumder

http://arxiv.org/abs/2512.22695v2
Modality Inflation: Energy Characterization and Optimization Opportunities for MLLM Inference
2026-03-04T20:53:09Z
Multimodal large language models (MLLMs) are built on text-only LLMs by incorporating additional modalities, enabling multimodal understanding and a broader range of applications. However, these additions introduce a previously unexplored energy trade-off across modalities that remains poorly understood, as most prior work focuses on text-only models. In this paper, we examine modality inflation, a key source of inefficiency in which multimodal inputs increase inference workloads through extra encoding stages and expanded token sequences. We provide the first detailed, stage-level analysis of energy consumption in MLLM inference by breaking the pipeline into vision encoding, prefill, and decoding stages. Using four representative MLLMs evaluated on an NVIDIA A100 GPU, we quantify the additional energy required for multimodal inference compared to text-only baselines, observing overheads ranging from 17% to 94% across models for identical inputs. Our results show that energy bottlenecks differ widely across model architectures, stemming either from compute-heavy vision encoders or from the downstream impact of large visual token sequences during prefill. By examining GPU power traces, we further uncover substantial GPU underutilization during multimodal execution and show that input complexity leads to markedly different energy scaling behaviors across models.
Finally, we demonstrate that stage-wise dynamic voltage and frequency scaling (DVFS) is an effective optimization, allowing energy savings with only modest performance impact. Together, these findings offer practical insights and concrete guidance for designing more energy-efficient multimodal LLM serving systems.
2025-12-27T19:49:21Z
Mona Moghadampanah, Adib Rezaei Shahmirzadi, Farhana Amin, Dimitrios S. Nikolopoulos

http://arxiv.org/abs/2603.04583v1
Overcoming Latency-bound Limitations of Distributed Graph Algorithms using the HPX Runtime System
2026-03-04T20:26:30Z
Graph processing at scale presents many challenges, including the irregular structure of graphs, the latency-bound nature of graph algorithms, and the overhead associated with distributed execution. While existing frameworks such as Spark GraphX and the Parallel Boost Graph Library (PBGL) have introduced abstractions for distributed graph processing, they continue to struggle with inherent issues like load imbalance and synchronization overhead. In this work, we present a distributed library prototype and a distributed implementation of three key graph algorithms, namely Breadth-First Search (BFS), PageRank, and Triangle Counting, using C++ mechanisms from the NWgraph library and leveraging HPX's distributed containers and asynchronous constructs. These algorithms span the categories of traversal, centrality, and pattern matching, and are selected to represent diverse computational characteristics. We evaluate our HPX-based implementations against GraphX and PBGL, showing that a high-performance runtime such as HPX enables the construction of algorithms that significantly outperform conventional frameworks by exploiting asynchronous execution, latency hiding, and fine-grained parallelism in shared memory. All algorithms in our prototype follow a unified execution model in which local and remote computations are expressed using the same programming abstractions, with asynchrony managed transparently by the runtime.
This design explicitly leverages shared-memory parallelism within each locality while overlapping communication and computation across localities, providing a practical foundation for extending this approach to a broader class of distributed graph algorithms.
2026-03-04T20:26:30Z
IEEE-format paper, submitted to GrAPL Workshop at IPDPS conference. 4 authors, 12 Pages
Karame Mohammadiporshokooh, Panagiotis Syskakis, Andrew Lumsdaine, Hartmut Kaiser

http://arxiv.org/abs/2603.04377v1
Benchmarking Quantum Computers via Protocols, Comparing IBM's Heron vs IBM's Eagle
2026-03-04T18:41:19Z
As quantum computing hardware rapidly advances, objectively evaluating the capabilities and error rates of new processors remains a critical challenge for the field. A clear and realistic understanding of current quantum performance is essential to guide research priorities and drive meaningful progress. In this work, we apply and extend a protocol-based benchmarking methodology (presented in arXiv:2505.12441) that utilizes well-defined quantumness thresholds. By evaluating performance at the protocol level rather than the gate level, this approach provides a transparent and intuitive assessment of whether specific quantum processors, or isolated sub-chips within them, can demonstrate a practical quantum advantage. To illustrate the utility of this method, we compare two generations of IBM quantum computers: the older Eagle architecture and the newer Heron architecture.
Our findings reveal the genuine operational strengths and limitations of these devices, demonstrating substantial performance improvements in the newer Heron generation.
2026-03-04T18:41:19Z
42 pages, 51 figures
Nitay Mayo, Tal Mor, Yossi Weinstein

http://arxiv.org/abs/2603.04323v1
PTOPOFL: Privacy-Preserving Personalised Federated Learning via Persistent Homology
2026-03-04T17:44:39Z
Federated learning (FL) faces two structural tensions: gradient sharing enables data-reconstruction attacks, while non-IID client distributions degrade aggregation quality. We introduce PTOPOFL, a framework that addresses both challenges simultaneously by replacing gradient communication with topological descriptors derived from persistent homology (PH). Clients transmit only 48-dimensional PH feature vectors (compact shape summaries whose many-to-one structure makes inversion provably ill-posed) rather than model gradients. The server performs topology-guided personalised aggregation: clients are clustered by Wasserstein similarity between their PH diagrams, intra-cluster models are topology-weighted, and clusters are blended with a global consensus. We prove an information-contraction theorem showing that PH descriptors leak strictly less mutual information per sample than gradients under strongly convex loss functions, and we establish linear convergence of the Wasserstein-weighted aggregation scheme with an error floor strictly smaller than FedAvg. Evaluated against FedAvg, FedProx, SCAFFOLD, and pFedMe on a non-IID healthcare scenario (8 hospitals, 2 adversarial) and a pathological benchmark (10 clients), PTOPOFL achieves AUC 0.841 and 0.910 respectively, the highest in both settings, while reducing reconstruction risk by a factor of 4.5 relative to gradient sharing.
Code is publicly available at https://github.com/MorillaLab/TopoFederatedL and data at https://doi.org/10.5281/zenodo.18827595.
2026-03-04T17:44:39Z
22 pages, 6 Figures
Kelly L Vomo-Donfack, Adryel Hoszu, Grégory Ginot, Ian Morilla

http://arxiv.org/abs/2603.04126v1
Efficient Time-Aware Partitioning of Quantum Circuits for Distributed Quantum Computing
2026-03-04T14:43:10Z
To overcome the physical limitations of scaling monolithic quantum computers, distributed quantum computing (DQC) interconnects multiple smaller-scale quantum processing units (QPUs) to form a quantum network. However, this approach introduces a critical challenge, namely the high cost of quantum communication between remote QPUs incurred by quantum state teleportation and quantum gate teleportation. To minimize this communication overhead, DQC compilers must strategically partition quantum circuits by mapping logical qubits to distributed physical QPUs. Static graph partitioning methods are fundamentally ill-equipped for this task as they ignore execution dynamics and underlying network topology, while metaheuristics require substantial computational runtime. In this work, we propose a heuristic based on beam search to solve the circuit partitioning problem. Our time-aware algorithm incrementally constructs a low-cost sequence of qubit assignments across successive time steps to minimize overall communication overhead. The time and space complexities of the proposed algorithm scale quadratically with the number of qubits and linearly with circuit depth, offering a significant computational speedup over common metaheuristics. We demonstrate that our proposed algorithm consistently achieves significantly lower communication costs than static baselines across varying circuit sizes, depths, and network topologies, providing an efficient compilation tool for near-term distributed quantum hardware.
2026-03-04T14:43:10Z
5 pages, 3 figures, conference: accepted at QCNC 2026
Raymond P. H. Wu, Chathu Ranaweera, Sutharshan Rajasegarar, Ria Rushin Joseph, Jinho Choi, Seng W. Loke

http://arxiv.org/abs/2411.19058v4
Carbon-Aware Quality Adaptation for Energy-Intensive Services
2026-03-04T14:09:36Z
The energy demand of modern cloud services, particularly those related to generative AI, is increasing at an unprecedented pace. To date, carbon-aware computing strategies have primarily focused on batch process scheduling or geo-distributed load balancing. However, such approaches are not applicable to services that require constant availability at specific locations due to latency, privacy, data, or infrastructure constraints.
In this paper, we explore how the carbon footprint of energy-intensive services can be reduced by adjusting the fraction of requests served by different service quality tiers. We show that adapting this quality of responses with respect to grid carbon intensity can lead to additional carbon savings beyond resource and energy efficiency. Building on this, we introduce a forecast-based multi-horizon optimization that reaches close-to-optimal carbon savings and is able to automatically adapt service quality for best-effort users to stay within an annual carbon budget. Our approach can reduce the emissions of large-scale LLM services, which we estimate at several tens of thousands of tons of CO2 annually, by up to 10%.
2024-11-28T11:17:30Z
Extended version of our paper published at e-Energy'25. Compared to the published version, we (i) add a time-based vs. utilization-based power attribution perspective together with a proof that both yield equivalent provisioning decisions under mild assumptions and (ii) extend the online approach with an automatic quality adaptation to meet a fixed annual carbon budget
Philipp Wiesner, Dennis Grinwald, Philipp Weiß, Patrick Wilhelm, Ramin Khalili, Odej Kao
10.1145/3679240.3734614

http://arxiv.org/abs/2603.04027v1
Performance Optimization in Stream Processing Systems: Experiment-Driven Configuration Tuning for Kafka Streams
2026-03-04T13:04:03Z
Configuring stream processing systems for efficient performance, especially in cloud-native deployments, is a challenging and largely manual task. We present an experiment-driven approach for automated configuration optimization that combines three phases: Latin Hypercube Sampling for initial exploration, Simulated Annealing for guided stochastic search, and Hill Climbing for local refinement. The workflow is integrated with the cloud-native Theodolite benchmarking framework, enabling automated experiment orchestration on Kubernetes and early termination of underperforming configurations.
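The three phases named above can be sketched on a toy problem. This is a minimal illustration and not the authors' implementation: the objective is a made-up stand-in for a measured throughput, and the step sizes, cooling schedule, and iteration counts are arbitrary assumptions. Early termination and the Theodolite/Kubernetes integration are not modelled.

```python
import math
import random


def latin_hypercube(n_samples, bounds, rng):
    """Draw n_samples points with exactly one point per stratum in each dimension."""
    columns = []
    for lo, hi in bounds:
        strata = list(range(n_samples))
        rng.shuffle(strata)  # independent stratum order per dimension
        columns.append([lo + (s + rng.random()) / n_samples * (hi - lo) for s in strata])
    return [list(point) for point in zip(*columns)]


def perturb(x, bounds, scale, rng):
    """Gaussian step proportional to each parameter's range, clipped to the bounds."""
    return [min(max(v + rng.gauss(0.0, scale * (hi - lo)), lo), hi)
            for v, (lo, hi) in zip(x, bounds)]


def tune(objective, bounds, rng, n_lhs=20, sa_steps=200, hc_steps=100):
    """Maximise objective: LHS exploration, then simulated annealing, then hill climbing."""
    # Phase 1: space-filling exploration; keep the best sampled configuration.
    best = max(latin_hypercube(n_lhs, bounds, rng), key=objective)
    best_score = objective(best)

    # Phase 2: simulated annealing; worse candidates are accepted with a
    # probability that shrinks as the temperature cools linearly towards zero.
    current, current_score = best, best_score
    for step in range(sa_steps):
        temperature = 1.0 - step / sa_steps
        candidate = perturb(current, bounds, 0.1, rng)
        score = objective(candidate)
        if score >= current_score or rng.random() < math.exp((score - current_score) / temperature):
            current, current_score = candidate, score
        if current_score > best_score:
            best, best_score = current, current_score

    # Phase 3: hill climbing; small steps, strict improvements only.
    for _ in range(hc_steps):
        candidate = perturb(best, bounds, 0.01, rng)
        score = objective(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def toy_throughput(config):
    """Hypothetical objective with its maximum (0.0) at (0.3, 0.7)."""
    return -((config[0] - 0.3) ** 2 + (config[1] - 0.7) ** 2)


if __name__ == "__main__":
    print(tune(toy_throughput, [(0.0, 1.0), (0.0, 1.0)], random.Random(0)))
```

In a real tuning run each call to the objective would be a full benchmark experiment, so the evaluation budget per phase matters far more than it does in this toy setting.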
In an experimental evaluation with Kafka Streams and a Kubernetes-based cloud testbed, our approach identifies configurations that improve throughput by up to 23% over the default. The results indicate that Latin Hypercube Sampling with early termination and Simulated Annealing are particularly effective in navigating the configuration space, whereas additional fine-tuning via Hill Climbing yields limited benefits.
2026-03-04T13:04:03Z
Accepted for the 9th Workshop on Hot Topics in Cloud Computing Performance (HotCloudPerf 2026) at ACM/SPEC ICPE 2026
David Chen, Sören Henning, Kassiano Matteussi, Rick Rabiser
10.1145/3777911.3800636

http://arxiv.org/abs/2603.04008v1
Lambdas at the Far Edge: a Tale of Flying Lambdas and Lambdas on Wheels
2026-03-04T12:50:07Z
Aggregate Programming (AP) is a paradigm for programming the collective behaviour of sets of distributed devices, possibly situated at the network far edge, by relying on asynchronous proximity-based interactions. The eXchange Calculus (XC), a recently proposed foundational model for AP, is essentially a typed lambda calculus extended with an operator (the exchange operator) providing an implicit communication mechanism between neighbour devices. This paper provides a gentle introduction to XC and to its implementation as a C++ library, called FCPP. The FCPP library and toolchain have been mainly developed at the Department of Computer Science of the University of Turin, where Stefano Berardi spent most of his academic career conducting outstanding research about logical foundation of computer science and transmitting his passion for research to students and young researchers, often exploiting typed lambda calculi.
An FCPP program is essentially a typed lambda term, and FCPP has been used to write code that has been deployed on devices at the far edge of the network, including rovers and (soon) Uncrewed Aerial Vehicles (UAVs); hence the title of the paper.
2026-03-04T12:50:07Z
In Proceedings LTT 2026, arXiv:2603.02912
EPTCS 441, 2026, pp. 19-45
Giorgio Audrito (Department of Computer Science); Daniele Bortoluzzi (Department of Computer Science); Ferruccio Damiani (Department of Computer Science); Giordano Scarso (Department of Computer Science); Gianluca Torta (Department of Computer Science); Andrea Basso (MITO Technology, Milan, Italy); Monica Cochi (Torino Airport); Lorenzo Gusman (Torino Airport); Lorenzo Comba (Department of Agricultural, Forest and Food Sciences); Paolo Gay (Department of Agricultural, Forest and Food Sciences); Paola Dal Zovo (Concept Engineering Reply, Turin, Italy); Giada Galati (Eurix, Turin, Italy); Francesco Gallo (Eurix, Turin, Italy); Aljaž Grdadolnik (Faculty of Computer and Information Science University of Ljubljana, Ljubljana, Slovenia); Massimo Pescarollo (Department of Economics and Statistics Cognetti de Martiis, University of Turin, Turin, Italy); Paola Pisano (Department of Economics and Statistics, Cognetti de Martiis, University of Turin, Turin, Italy)
10.4204/EPTCS.441.2