Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using $\mathbb{F}_2$

2026-03-06T16:48:24Z

Efficient tensor computation is a cornerstone of modern deep learning (DL) workloads, yet existing approaches struggle to achieve flexible and performant design and implementation of tensor layouts -- mappings between logical tensors and hardware resources. The increasing complexity of DL algorithms and hardware demands a generic and systematic approach to handling tensor layouts. In this work, we introduce Linear Layouts, a novel approach that models tensor layouts using linear algebra over $\mathbb{F}_2$. By representing tensor layouts as binary matrices acting on the bits of the hardware representation, our approach enables a generic layout definition -- as opposed to the classical case-by-case approach -- and allows for generic layout-to-layout conversions, eliminating the quadratic explosion that plagues existing solutions. We integrate linear layouts with Triton and demonstrate their effectiveness in optimizing individual Triton operators as well as kernels written in Triton. We also show that linear layouts reduce engineering effort in the compiler backend while fixing several bugs in Triton's legacy layout system.
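
The idea of a layout as a binary matrix acting on index bits can be illustrated with a minimal Python sketch. It is not Triton's implementation: the function names and the two example layouts (`TRANSPOSE`, `SWIZZLE`) are hypothetical, chosen only to show that applying a layout reduces to XOR-ing matrix columns and that composing two layouts is again a layout.

```python
def apply_layout(columns, index):
    """Apply a linear map over F_2 to the bits of `index`.

    columns[i] is the image (as an integer bit mask) of the index whose
    only set bit is bit i. Addition in F_2 is XOR, so the result is the
    XOR of the columns selected by the set bits of `index`.
    """
    if index < 0 or index >> len(columns):
        raise ValueError("index has more bits than the layout has columns")
    out = 0
    bit = 0
    while index:
        if index & 1:
            out ^= columns[bit]
        index >>= 1
        bit += 1
    return out


def compose(outer, inner):
    """Return the columns of the map index -> outer(inner(index))."""
    return [apply_layout(outer, column) for column in inner]


# Hypothetical 4-bit layouts: the low two bits are a column index and
# the high two bits are a row index.
TRANSPOSE = [0b0100, 0b1000, 0b0001, 0b0010]  # swap row and column bits
SWIZZLE = [0b0001, 0b0010, 0b0101, 0b1010]    # column ^= row, row unchanged

print(bin(apply_layout(TRANSPOSE, 0b0110)))   # 0b1001
print(bin(apply_layout(SWIZZLE, 0b0111)))     # 0b110
print(compose(TRANSPOSE, TRANSPOSE))          # [1, 2, 4, 8], the identity
```

Because every layout in this sketch is a list of columns, a conversion between any two layouts is just another such list, which is the property the abstract credits with avoiding case-by-case conversions.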

A Systematic Evaluation of the Potential of Carbon-Aware Execution for Scientific Workflows

2026-03-06T16:13:43Z

Scientific workflows are critical to scientific data analysis and often involve computationally intensive processing of large datasets on compute clusters. As such, their execution tends to be long-running and resource-intensive, resulting in significant energy consumption and carbon emissions. While carbon-aware computing methods have received considerable attention in general cloud contexts, their application to scientific data analysis workflows remains a critical research gap. Our study addresses this oversight by showing how the delay tolerance, interruptibility, and scalability of scientific workflows can be leveraged for a significantly more sustainable execution model. In this study, we first quantify the problem of carbon emissions associated with running scientific workflows, and then demonstrate the transformative potential of carbon-aware workflow execution. We estimate the carbon footprint of seven real-world Nextflow workflows executed on diverse dedicated cluster and public cloud resources using high-resolution average and marginal grid carbon intensity data from open and commercial data providers. Furthermore, we conduct a systematic evaluation of the impact of carbon-aware temporal shifting and of dynamically pausing and resuming the workflow. Moreover, we investigate the impact of resource scaling at both the workflow and the workflow task level. Finally, we report substantial potential reductions in overall carbon emissions, with temporal shifting capable of decreasing emissions by over 80%, and resource scaling by 67%.
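
Temporal shifting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's methodology: the hourly carbon intensity values are made up, the job is assumed to run uninterrupted at constant power, and `best_start` is a hypothetical helper name.

```python
def best_start(intensity, duration, power_kw=1.0):
    """Find the start hour that minimizes emissions for one contiguous run.

    intensity: hourly grid carbon intensity in gCO2/kWh.
    duration:  job length in whole hours.
    power_kw:  assumed constant power draw of the job in kW.
    Returns (start_hour, emissions_in_grams).
    """
    if duration <= 0 or duration > len(intensity):
        raise ValueError("duration must fit inside the forecast horizon")
    # Sliding-window sum over all feasible start hours.
    window = sum(intensity[:duration])
    best, best_sum = 0, window
    for start in range(1, len(intensity) - duration + 1):
        window += intensity[start + duration - 1] - intensity[start - 1]
        if window < best_sum:
            best, best_sum = start, window
    return best, best_sum * power_kw


# Made-up forecast for six hours, in gCO2/kWh.
forecast = [400, 380, 200, 150, 180, 420]
start, grams = best_start(forecast, duration=2)
immediate = sum(forecast[:2]) * 1.0
print(start, grams)                        # 3 330.0
print(round(1 - grams / immediate, 3))     # 0.577
```

Pausing and resuming would relax the contiguity assumption: the job could then pick the lowest-intensity hours individually rather than one window.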

Comparative Analysis of Cross-Chain Token Standards

2026-03-06T15:37:04Z

Cross-chain token standards enable fungible tokens that exist across multiple blockchains with a unified total supply model. This paper presents a comprehensive comparative analysis of five leading cross-chain token standards and frameworks: the xERC20 standard (implementing ERC-7281), the Omnichain Fungible Token (OFT) standard, the Native Token Transfers (NTT) framework, the Cross-Chain Token (CCT) standard, and the SuperchainERC20 standard (implementing ERC-7802). We examine each standard's distinguishing properties and technical design, including architecture, message-passing mechanisms, interoperability scope, chain compatibility, and security features. Our analysis reveals that while all these standards share the goal of seamless cross-chain fungibility, they differ significantly in implementation approach, trust model, and target ecosystem.

MoEless: Efficient MoE LLM Serving via Serverless Computing

2026-03-06T14:58:16Z

Large Language Models (LLMs) have become a cornerstone of AI, driving progress across diverse domains such as content creation, search and recommendation systems, and AI-assisted workflows. To contain extreme training costs as model scales grow, Mixture-of-Experts (MoE) has become a popular backbone for modern LLMs, which are commonly served in distributed deployments using expert parallelism (EP). However, MoE's sparse activation mechanism leads to severe expert load imbalance, where a few experts become overloaded while others remain idle, resulting in expert stragglers that inflate inference latency and serving cost. Existing expert load balancing solutions assume static resource configurations on serverful infrastructures, limiting expert scalability and elasticity, and resulting in either costly real-time expert swapping or degraded generation quality. We present MoEless, the first serverless MoE serving framework that mitigates expert load imbalance and accelerates inference via serverless experts. MoEless employs lightweight, layer-aware predictors to accurately estimate incoming expert load distributions and proactively identify stragglers. We design optimized expert scaling and placement strategies to maximize function locality, improve GPU utilization, and balance loads across experts and GPUs. MoEless is prototyped on top of Megatron-LM and deployed on an eight-GPU testbed. Experiments with open-source MoE models and real-world workloads show that MoEless reduces inference latency by 43% and inference cost by 84% compared to state-of-the-art solutions.

Edge Intelligence-Driven LegalEdge Contracts for EV Charging Stations: A Federated Learning with Deep Q-Networks Approach

2026-03-06T08:53:39Z

We introduce LegalEdge, an edge intelligence-driven framework that integrates Federated Learning (FL) and Deep Q-Networks (DQN) to optimize electric vehicle (EV) charging infrastructure. LegalEdge contracts are novel smart contracts deployed on the blockchain to manage dynamic pricing and incentive mechanisms transparently and autonomously. By leveraging FL, multiple edge devices such as EV charging stations collaboratively train DQN agents without sharing raw data, preserving user privacy while reducing communication costs. These edge-deployed agents learn optimal charging strategies in real time based on local conditions and global policy updates. LegalEdge ensures low-latency decisions, high contract integrity, and efficient energy allocation. Our experimental results demonstrate significant improvements in learning convergence, transaction speed, and operational transparency, establishing LegalEdge as a scalable, intelligent, and accountable solution for next-generation EV charging networks.

A Hierarchical Sharded Blockchain Balancing Performance and Availability

2026-03-06T07:11:09Z

Blockchain networks offer decentralization, transparency, and immutability for managing critical data but encounter scalability problems as the number of network members and transaction issuers grows. Sharding is considered a promising solution to enhance blockchain scalability. However, most existing blockchain sharding techniques prioritize performance at the cost of availability (e.g., a failure in a few servers holding a shard leads to data unavailability). In this paper, we propose PyloChain, a hierarchical sharded blockchain that balances availability and performance. PyloChain consists of multiple lower-level local chains and one higher-level main chain. Each local chain speculatively executes local transactions to achieve high parallelism across multiple local chains. The main chain leverages a directed-acyclic-graph (DAG)-based mempool to guarantee local block availability and to enable efficient Byzantine Fault Tolerance (BFT) consensus to execute global (or cross-shard) transactions within a collocated sharding. To reduce the number of aborted local transactions, PyloChain applies a simple scheduling technique to handle global transactions in the main chain. PyloChain provides a fine-grained auditing mechanism to mitigate faulty higher-level members by externalizing main chain operations to lower-level local members. We implemented and evaluated PyloChain, demonstrating its performance scalability with 1.49x higher throughput and 2.63x lower latency compared to the state-of-the-art balanced hierarchical sharded blockchain.

Domain-Adaptive Model Merging across Disconnected Modes

2026-03-06T06:42:57Z

Learning across domains is challenging when data cannot be centralized due to privacy or heterogeneity, which limits the ability to train a single comprehensive model. Model merging provides an appealing alternative by consolidating knowledge from multiple specialized models into one, avoiding data sharing and reducing retraining cost. In this work, we present DMM, a data-free model merging framework designed to handle highly divergent models. DMM proceeds in three steps. First, domain-specific models are trained independently. Second, models with high similarity are merged using standard techniques to ensure stability. Third, we synthesize pseudo-data from normalization statistics and distill knowledge from divergent models into the merged model through a lightweight refinement guided by these samples. This approach preserves rare but critical knowledge while maintaining stability. Extensive experiments on unimodal and multimodal benchmarks show that DMM achieves state-of-the-art performance over existing merging methods.

Background and Intellectual Development: Supplementary Material for the Category Mistake Papers

2026-03-06T06:08:23Z

This supplement documents the intellectual trajectory that led to the Category Mistake framework and the Forward-In-Time-Only (FITO) analysis presented in our recent arXiv papers. The ideas crystallized over fifteen years of research, conversation, and engineering practice -- beginning with a 2014 Stanford EE380 lecture on the physics of time in computing, sharpened through a 2016 email exchange with Leslie Lamport following a Papers We Love presentation of his seminal 1978 paper, and matured through the development of Open Atomic Ethernet (OAE). This document traces the concept development from its origins in the physics of entanglement and background-free time, through the recognition that Lamport's "happened-before" relation embeds a category mistake, to the practical engineering consequences documented in "Why iCloud Fails" and "What Distributed Computing Got Wrong." It is intended as archival supplementary material for future arXiv submission.

Knowledge-driven Reasoning for Mobile Agentic AI: Concepts, Approaches, and Directions

2026-03-06T02:28:22Z

Mobile agentic AI is extending autonomous capabilities to resource-constrained platforms such as edge robots and unmanned aerial vehicles (UAVs), where strict size, weight, power, and cost (SWAP-C) constraints and intermittent wireless connectivity limit both on-device computation and cloud access. Existing approaches mostly optimize per-round communication efficiency, yet mobile agents must sustain competence across a stream of tasks. We propose a knowledge-driven reasoning framework that extracts reusable decision structures from past execution, synchronizes them over bandwidth-limited links, and injects them into on-device reasoning to reduce latency, energy, and error accumulation. A DIKW-inspired taxonomy distinguishes raw observations, episode-scoped traces, and persistent cross-task knowledge, and categorizes knowledge into retrieval, structured, procedural, and parametric representations, each with a distinct tradeoff between reasoning speedup and failure risk. A key finding is that knowledge exposure is non-monotonic: too little forces costly trial-and-error replanning, while too much introduces conflicting cues and errors. A UAV case study validates the framework, where a compact knowledge pack synchronized over intermittent backhaul enables a 3B-parameter onboard model to achieve perfect mission reliability with lower reasoning cost than both knowledge-free on-device reasoning and cloud-centric replanning.

StreamWise: Serving Multi-Modal Generation in Real-Time at Scale

2026-03-06T01:22:16Z

Advances in multi-modal generative models are enabling new applications, from storytelling to automated media synthesis. Most current workloads generate simple outputs (e.g., image generation from a prompt) in batch mode, often requiring several seconds even for basic results. Serving real-time multi-modal workflows at scale is costly and complex, requiring efficient coordination of diverse models (each with unique resource needs) across language, audio, image, and video, all under strict latency and resource constraints. We tackle these challenges through the lens of real-time podcast video generation, integrating LLMs, text-to-speech, and video-audio generation. To meet tight SLOs, we design an adaptive, modular serving system, StreamWise, that dynamically manages quality (e.g., resolution, sharpness), model/content parallelism, and resource-aware scheduling. We leverage heterogeneous hardware to maximize responsiveness and efficiency. For example, the system can lower video resolution and allocate more resources to early scenes. We quantify the trade-offs between latency, cost, and quality. The cheapest setup generates a 10-minute podcast video on A100 GPUs in 1.4 hours (8.4x slower than real time) for less than \$25. StreamWise enables high-quality real-time streaming with a sub-second startup delay for under \$45.

Gathering Autonomous Mobile Robots Under the Adversarial Defected View Model

2026-03-06T00:37:22Z

This paper studies the gathering problem for a set of $N \ge 2$ autonomous mobile robots operating in the Euclidean plane under the distributed Look-Compute-Move model. We consider oblivious robots executing under the adversarial defected view model, in which an activated robot may observe only a restricted subset of robots due to adversarial visibility faults. Consequently, the information obtained during each Look phase may be incomplete and dynamically altered. The objective is to guarantee deterministic finite-time gathering at a location not known a priori despite such sensing restrictions. We present two distributed algorithms under distinct scheduling assumptions. In the fully synchronous (FSYNC) model, we prove finite-time gathering in the adversarial (4, 2) defected view setting, resolving a previously open case without requiring additional capabilities or coordinate agreement. In the asynchronous (ASYNC) model, we establish finite-time gathering under the general adversarial (N, K) defected view model, where an activated robot observes at most K of the other $N - 1$ robots for any $1 \le K < N - 1$. Both results hold under non-rigid motion. The proposed algorithm for the ASYNC model assumes agreement in the direction and orientation of one coordinate axis.

First-Order Softmax Weighted Switching Gradient Method for Distributed Stochastic Minimax Optimization with Stochastic Constraints

2026-03-06T00:14:46Z

This paper addresses the distributed stochastic minimax optimization problem subject to stochastic constraints. We propose a novel first-order Softmax-Weighted Switching Gradient method tailored for federated learning. Under full client participation, our algorithm achieves the standard $\mathcal{O}(\epsilon^{-4})$ oracle complexity to satisfy a unified bound $\epsilon$ for both the optimality gap and feasibility tolerance. We extend our theoretical analysis to the practical partial participation regime by quantifying client sampling noise through a stochastic superiority assumption. Furthermore, by relaxing standard boundedness assumptions on the objective functions, we establish a strictly tighter lower bound for the softmax hyperparameter. We provide a unified error decomposition and establish a sharp $\mathcal{O}(\log\frac{1}{\delta})$ high-probability convergence guarantee. Ultimately, our framework demonstrates that a single-loop primal-only switching mechanism provides a stable alternative for optimizing worst-case client performance, effectively bypassing the hyperparameter sensitivity and convergence oscillations often encountered in traditional primal-dual or penalty-based approaches. We verify the efficacy of our algorithm via experiments on Neyman-Pearson (NP) classification and fair classification tasks.

A Lock-Free Work-Stealing Algorithm for Bulk Operations

2026-03-05T23:59:26Z

Work-stealing is a widely used technique for balancing irregular parallel workloads, and most modern runtime systems adopt lock-free work-stealing deques to reduce contention and improve scalability. However, existing algorithms are designed for general-purpose parallel runtimes and often incur overheads that are unnecessary in specialized settings. In this paper, we present a new lock-free work-stealing queue tailored for a master-worker framework used in the parallelization of a mixed-integer programming optimization solver based on decision diagrams. Our design supports native bulk operations, grows without bounds, and assumes at most one owner and one concurrent stealer, thereby eliminating the need for heavy synchronization. We provide an informal sketch of why our queue is linearizable and lock-free under this restricted concurrency model. Benchmarks demonstrate that our implementation achieves constant-latency push performance, remaining stable even as batch size increases, in contrast to existing queues from C++ Taskflow, whose latencies grow sharply with batch size. Pop operations perform comparably across all implementations, while our steal operation maintains nearly flat latency across different steal proportions. We also explore an optimized steal variant that reduces latency by up to 3x in practice. Finally, a pseudo workload based on large-graph exploration confirms that all implementations scale linearly. However, we argue that solver workloads with irregular node processing times would further amplify the advantages of our algorithm.

A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation

2026-03-05T21:46:58Z

Decoupled PPO is a successful reinforcement learning (RL) algorithm for dealing with high data staleness in the asynchronous RL setting. The decoupled loss used in decoupled PPO improves on the learning stability of coupled-loss algorithms (e.g., standard PPO, GRPO) by introducing a proximal policy that decouples the off-policy correction (importance weight) from the policy update constraint (trust region). However, the proximal policy requires an extra forward pass through the model at each training step, creating computational overhead for large language model training. We observe that since the proximal policy only serves as a trust region anchor between the behavior and target policies, we can approximate it through simple interpolation without explicit computation. We call this approach A-3PO (APproximated Proximal Policy Optimization). A-3PO eliminates this overhead, accelerating training by 1.8x while maintaining comparable performance. Code and an off-the-shelf example are contributed to the open-source RL training system AReaL at: https://github.com/inclusionAI/AReaL/blob/v1.0.0.rc1/docs/algorithms/prox_approx.md
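
A rough Python sketch of the interpolation idea follows. The abstract does not state the interpolation rule, so this assumes a log-linear blend of per-token log-probabilities with a hypothetical weight `alpha`; the function names are illustrative and are not taken from AReaL. A real training loop would also stop gradients through the proximal term.

```python
import math


def approx_prox_logp(logp_behav, logp_target, alpha):
    """Assumed rule: blend behavior and target log-probs per token."""
    return [(1 - alpha) * b + alpha * t
            for b, t in zip(logp_behav, logp_target)]


def decoupled_ppo_objective(logp_behav, logp_target, advantages,
                            alpha=0.5, clip_eps=0.2):
    """Decoupled clipped surrogate with an interpolated proximal policy.

    The importance weight pi_prox / pi_behav handles the off-policy
    correction; the ratio pi_target / pi_prox is what gets clipped.
    """
    logp_prox = approx_prox_logp(logp_behav, logp_target, alpha)
    total = 0.0
    for b, p, t, adv in zip(logp_behav, logp_prox, logp_target, advantages):
        weight = math.exp(p - b)   # off-policy correction
        ratio = math.exp(t - p)    # trust-region ratio
        clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
        total += weight * min(ratio * adv, clipped * adv)
    return total / len(advantages)


# With alpha = 0 the proximal policy equals the behavior policy and the
# objective reduces to the standard (coupled) PPO surrogate.
print(decoupled_ppo_objective([-1.0, -1.0], [-1.0, -1.0], [1.0, -2.0]))  # -0.5
```

No extra forward pass appears here because the proximal log-probabilities are derived from quantities the trainer already has, which is the source of the saving the abstract describes.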

Parallelization Strategies for Dense LLM Deployment: Navigating Through Application-Specific Tradeoffs and Bottlenecks

2026-03-05T21:33:24Z

Breakthroughs in the generative AI domain have fueled an explosion of large language model (LLM)-powered applications, whose workloads fundamentally consist of sequences of inferences through transformer architectures. Within this rapidly expanding ecosystem, dense LLMs--those that activate all model parameters for each token generation--form the foundation for advanced expert-based variants. Dense models continue to dominate because of their strong generalization ability, scalability, ease of fine-tuning, and versatility across diverse tasks. In LLM inference systems, performance is mainly characterized by latency, response time, and throughput (i.e., tokens generated per unit of time). Latency and throughput are inherently coupled: optimizing for one often comes at the expense of the other. Moreover, batching strategies and parallelism configurations, which are essential when dense model parameters exceed device memory capacity, can significantly affect both latency and overall system throughput. This paper (i) investigates the workloads of two representative dense LLMs--Llama-3.1-70B and Llama-3.1-405B--focusing in particular on intra-node parallelization schemes, (ii) analyzes how input characteristics, batching, and parallelism strategies influence latency flexibility and the latency-throughput tradeoff, and (iii) identifies key performance bottlenecks that inform design choices for meeting service-level agreements (SLAs) and sustaining inference quality. Our empirical evaluations reveal that Tensor Parallelism (TP) improves the latency objectives while Pipeline Parallelism (PP) is better suited for throughput-oriented applications. We highlight that using them together and tuning the TP and PP degrees provides control over the latency-throughput interplay.