http://arxiv.org/abs/2604.07839v1
A Hardware-Anchored Privacy Middleware for PII Sharing Across Heterogeneous Embedded Consumer Devices
2026-04-09T05:40:55Z
The rapid expansion of the Internet of Things (IoT) and smart home ecosystems has led to a fragmented landscape of user data management across consumer electronics (CE) such as Smart TVs, gaming consoles, and set-top boxes. Current onboarding processes on these devices are characterized by high friction due to manual data entry and opaque data-sharing practices. This paper introduces the User Data Sharing System (UDSS), a platform-agnostic framework designed to facilitate secure, privacy-first PII (Personally Identifiable Information) exchange between device platforms and third-party applications. Our system implements a Contextual Scope Enforcement (CSE) mechanism that programmatically restricts data exposure based on user intent, specifically distinguishing between Sign-In and Sign-Up workflows. Unlike cloud-anchored identity standards such as FIDO2/WebAuthn, UDSS is designed for shared, device-centric CE environments where persistent user-to-device binding cannot be assumed. We further propose a tiered access model that balances developer needs with regulatory compliance (GDPR/CCPA). A proof-of-concept implementation on a reference ARMv8 Linux-based middleware demonstrates that UDSS reduces user onboarding latency by 65% and measurably reduces PII over-exposure risk through protocol-enforced data minimization.
This framework provides a standardized approach to identity management in the heterogeneous CE market.
2026-04-09T05:40:55Z
4 pages, 2 figures, 4 tables
Aditya Sabbineni, Pravin Nagare, Devendra Dahiphale, Preetam Dedu, Willison Lopes

http://arxiv.org/abs/2604.05505v2
Qurator: Scheduling Hybrid Quantum-Classical Workflows Across Heterogeneous Cloud Providers
2026-04-09T02:03:45Z
As quantum computing moves from isolated experiments toward integration with large-scale workflows, the integration of quantum devices into HPC systems has gained much interest. Quantum cloud providers expose shared devices through first-come first-serve queues where a circuit that executes in 3 seconds can spend minutes to an entire day waiting. Minimizing this overhead while maintaining execution fidelity is the central challenge of quantum cloud scheduling, and existing approaches treat the two as separate concerns. We present Qurator, an architecture-agnostic quantum-classical task scheduler that jointly optimizes queue time and circuit fidelity across heterogeneous providers. Qurator models hybrid workloads as dynamic DAGs with explicit quantum semantics, including entanglement dependencies, synchronization barriers, no-cloning constraints, and circuit cutting and merging decisions, all of which render classical scheduling techniques ineffective. Fidelity is estimated through a unified logarithmic success score that reconciles incompatible calibration data from IBM, IonQ, IQM, Rigetti, AQT, and QuEra into a canonical set of gate error, readout fidelity, and decoherence terms. We evaluate Qurator on a simulator driven by four months of real queue data using circuits from the Munich Quantum Toolkit benchmark suite.
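A logarithmic success score built from gate error, readout fidelity, and decoherence terms, as the Qurator abstract describes, could look roughly like the sketch below. This is an illustration under stated assumptions, not Qurator's actual formula: the function name, the parameter names, the independence of errors, and the exponential decoherence model exp(-t/T) are all hypothetical.

```python
import math

def log_success_score(gate_errors, readout_fidelities, duration_us, t_coherence_us):
    """Sum of log-probabilities that every gate, readout, and idle period succeeds.

    Hypothetical model: errors are independent and decoherence is exponential.
    """
    gate_term = sum(math.log1p(-e) for e in gate_errors)         # log(1 - error) per gate
    readout_term = sum(math.log(f) for f in readout_fidelities)  # log(fidelity) per measured qubit
    decoherence_term = -duration_us / t_coherence_us             # log(exp(-t / T))
    return gate_term + readout_term + decoherence_term

# Made-up calibration numbers for a three-gate, two-qubit circuit.
score = log_success_score([0.001, 0.001, 0.01], [0.98, 0.97],
                          duration_us=3.0, t_coherence_us=100.0)
estimated_success = math.exp(score)
```

Working in the log domain turns the product of per-operation success probabilities into a sum, which is one way heterogeneous calibration data could be compared on a single scale.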
Across load conditions from 5 to 35,000 quantum tasks, Qurator stays within 1% of the highest-fidelity baseline at low load while achieving 30-75% queue time reduction at high load, at a fidelity cost bounded by a user-specified target.
2026-04-07T06:58:46Z
Sinan Pehlivanoglu, Ulrik de Muelenaere, Peter Kogge, Amr Sabry

http://arxiv.org/abs/2603.18030v2
Quine: Realizing LLM Agents as Native POSIX Processes
2026-04-09T01:29:42Z
Current LLM agent frameworks often implement isolation, scheduling, and communication at the application layer, even though these mechanisms are already provided by mature operating systems. Instead of introducing another application-layer orchestrator, this paper presents Quine, a runtime architecture and reference implementation that realizes LLM agents as native POSIX processes. The mapping is explicit: identity is PID, interface is standard streams and exit status, state is memory, environment variables, and filesystem, and lifecycle is fork/exec/exit. A single executable implements this model by recursively spawning fresh instances of itself. By grounding the agent abstraction in the OS process model, Quine inherits isolation, composition, and resource control directly from the kernel, while naturally supporting recursive delegation, context renewal via exec, and shell-native composition. The design also exposes where the POSIX process model stops: processes provide a robust substrate for execution, but not a complete runtime model for cognition. In particular, the analysis points toward two immediate extensions beyond process semantics: task-relative worlds and revisable time.
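The process mapping in the Quine abstract (standard streams as the interface, exit status as the outcome) can be illustrated with a small sketch. It is not Quine's implementation: the child program, the `delegate` helper, and the use of the Python interpreter as the spawned executable are all assumptions made for illustration.

```python
import subprocess
import sys
from typing import Tuple

# Hypothetical child "agent": reads a task from stdin, writes a result to stdout,
# and reports success or failure through its exit status.
CHILD_SOURCE = """
import sys
task = sys.stdin.read()
print(task.upper())
sys.exit(0 if task else 1)
"""

def delegate(task: str) -> Tuple[int, str]:
    """Run one task in a fresh child process and return (exit_status, stdout)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=task,            # delivered on the child's standard input
        capture_output=True,   # collect the child's stdout and stderr
        text=True,
    )
    return result.returncode, result.stdout.strip()

status, output = delegate("summarize logs")
```

Each call starts a fresh process with its own PID and memory, so the parent sees nothing of the child except its output streams and exit status.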
A reference implementation of Quine is publicly available on GitHub.
2026-03-08T05:32:46Z
Minor revision clarifying exec semantics
Hao Ke

http://arxiv.org/abs/2604.07609v1
Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
2026-04-08T21:27:47Z
Large Language Model (LLM) inference is rapidly becoming a core datacenter service, yet current serving stacks keep the host CPU on the critical path for orchestration and token-level control. This makes LLM performance sensitive to CPU interference, undermining application colocation and forcing operators to reserve CPU headroom, leaving substantial capacity unutilized.
We introduce Blink, an end-to-end serving architecture that removes the host CPU from the steady-state inference path by redistributing responsibilities across a SmartNIC and a GPU. Blink offloads request handling to the SmartNIC, which delivers inputs directly into GPU memory via RDMA, and replaces host-driven scheduling with a persistent GPU kernel that performs batching, scheduling, and KV-cache management without CPU involvement.
Evaluated against TensorRT-LLM, vLLM, and SGLang, Blink outperforms all baselines even in isolation, reducing pre-saturation P99 TTFT by up to 8.47$\times$ and P99 TPOT by up to 3.40$\times$, improving decode throughput by up to 2.1$\times$, and reducing energy per token by up to 48.6$\%$. Under CPU interference, Blink maintains stable performance, while existing systems degrade by up to two orders of magnitude.
2026-04-08T21:27:47Z
Mohammad Siavashi, Mariano Scazzariello, Gerald Q. Maguire, Dejan Kostić, Marco Chiesa

http://arxiv.org/abs/2604.06970v1
Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale
2026-04-08T11:41:21Z
When output token counts can be predicted at submission time (Gan et al., 2026), client-side scheduling against a black-box LLM API becomes semi-clairvoyant: decisions condition on coarse token priors even though the provider's internals remain hidden. We decompose this boundary problem into three separable concerns: allocation (inter-class share via adaptive DRR), ordering (intra-class sequencing with feasible-set scoring), and overload control (explicit admit/defer/reject on a cost ladder). An information ladder experiment shows that coarse magnitude priors -- not class labels alone -- are the practical threshold for useful client control; removing magnitude inflates short-request P95 by up to $5.8\times$ and degrades deadline satisfaction. Under balanced / high congestion the full stack achieves 100% completion, 100% deadline satisfaction, and useful goodput of $4.2 \pm 1.6$ SLO-meeting requests/s with short P95 within tens of milliseconds of quota-tiered isolation. A predictor-noise sweep confirms graceful degradation under up to 60% multiplicative error. Heavy-dominated regimes separate policies on completion, tail, and interpretable shedding.
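For readers unfamiliar with DRR, the allocation layer's starting point can be sketched as classic deficit round robin over per-class queues. This is the textbook algorithm, not the adaptive variant the abstract refers to. The function name, the class names, and the use of predicted token counts as request cost are assumptions for illustration, and every quantum must be positive for the loop to terminate.

```python
from collections import deque

def drr_schedule(queues, quanta):
    """Classic deficit round robin over per-class FIFO queues.

    queues: dict class_name -> list of (request_id, cost) in arrival order
    quanta: dict class_name -> credit added to that class on each round (> 0)
    Returns request ids in dispatch order.
    """
    pending = {name: deque(items) for name, items in queues.items()}
    deficit = {name: 0 for name in pending}
    order = []
    while any(pending.values()):
        for name, queue in pending.items():
            if not queue:
                deficit[name] = 0  # an idle class does not bank credit
                continue
            deficit[name] += quanta[name]
            # Dispatch from the head while the accumulated credit covers the cost.
            while queue and queue[0][1] <= deficit[name]:
                request_id, cost = queue.popleft()
                deficit[name] -= cost
                order.append(request_id)
            if not queue:
                deficit[name] = 0
    return order

# Hypothetical workload: cost stands in for a predicted output token count.
queues = {
    "short": [("s1", 100), ("s2", 100), ("s3", 100)],
    "long": [("l1", 900), ("l2", 900)],
}
order = drr_schedule(queues, {"short": 300, "long": 300})
```

With equal quanta, the three short requests are dispatched in the first round while each long request has to accumulate credit over three rounds, which is how cost-aware sharing keeps short requests from waiting behind long ones.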
We further compare short-priority allocation (biased toward interactive traffic) with Fair Queuing (round-robin across classes): Fair Queuing achieves +32% short-request P90 improvement over FIFO with only +17% long-request overhead, versus Short-Priority's +27% / +116% trade-off -- demonstrating that the allocation layer accommodates different fairness objectives without changing the remaining stack. We contribute the three-layer client-side decomposition, controlled evaluation of joint metrics across regimes, allocation-policy alternatives, and overload-policy evidence linking cost-ladder shedding to the stated service objective.
2026-04-08T11:41:21Z
10 pages, 8 figures. Code and reproduction artifacts available upon request
Renzhong Yuan, Yijun Zeng, Xiaosong Gao, Linxi Yu, Haochun Liao, Han Wang

http://arxiv.org/abs/2604.05091v1
MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU
2026-04-06T18:43:56Z
We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters.
It also achieves 1.84$\times$ the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with 512k token context on a single GH200.
2026-04-06T18:43:56Z
Zhengqing Yuan, Hanchi Sun, Lichao Sun, Yanfang Ye

http://arxiv.org/abs/2602.04816v3
Horizon-LM: A RAM-Centric Architecture for LLM Training
2026-04-06T17:38:36Z
The rapid growth of large language models (LLMs) has outpaced the evolution of single-GPU hardware, making model scale increasingly constrained by memory capacity rather than computation. While modern training systems extend GPU memory through distributed parallelism and offloading across CPU and storage tiers, they fundamentally retain a GPU-centric execution paradigm in which GPUs host persistent model replicas and full autograd graphs. As a result, scaling large models remains tightly coupled to multi-GPU clusters, complex distributed runtimes, and unpredictable host memory consumption, creating substantial barriers for node-scale post-training workloads such as instruction tuning, alignment, and domain adaptation. We present Horizon-LM, a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. Horizon-LM treats host memory as the authoritative parameter store and uses GPUs solely as transient compute engines through a CPU-master, GPU-template execution model. By eliminating persistent GPU-resident modules and autograd graphs, employing explicit recomputation with manual gradient propagation, and introducing a pipelined double-buffered execution engine, Horizon-LM decouples model scale from GPU count and bounds memory usage to the theoretical parameter footprint. On a single H200 GPU with 1.5\,TB host RAM, Horizon-LM reliably trains models up to 120B parameters. On a standard single A100 machine, Horizon-LM achieves up to 12.2$\times$ higher training throughput than DeepSpeed ZeRO-3 with CPU offloading while preserving numerical correctness.
Across platforms and scales, Horizon-LM sustains high device utilization and predictable memory growth, demonstrating that host memory, not GPU memory, defines the true feasibility boundary for node-scale large-model training.
2026-02-04T18:04:46Z
This paper contained an error in the throughput computation used in the experimental evaluation. Specifically, the TFLOPS calculation omitted the 12HL term in the training FLOPs formula, which led to systematic underestimation of the reported throughput numbers in the experimental results. We are withdrawing this version to correct the evaluation and avoid confusion for readers
Zhengqing Yuan, Lichao Sun, Yanfang Ye

http://arxiv.org/abs/2603.15042v3
Performance Isolation and Semantic Determinism in Efficient GPU Spatial Sharing
2026-04-03T11:51:07Z
Existing GPU spatial sharing systems face a three-way tradeoff: resource utilization, performance isolation, and semantic determinism. Hardware partitioning suffers from hardware under-utilization. Hardware multiplexing fails to avoid performance interference. Recently proposed software-based GPU kernel slicing reshapes floating-point reduction orders, destroying semantic determinism and inducing catastrophic token drift in generative models.
We present CoGPU, a transparent spatial sharing system that resolves this trilemma. CoGPU introduces the \emph{GPU coroutine}, a novel abstraction that enables logical-to-physical resource decoupling. By dynamically mapping immutable virtual contexts to mutable physical resources via lightweight cooperative migration, CoGPU enables extensible, workload-aware scheduling without altering kernel semantics.
Evaluations demonstrate CoGPU simultaneously achieves high utilization, strong isolation, and absolute semantic determinism (guaranteeing zero token mismatch). In multi-tenant co-location, it improves training throughput by up to 79.2\% over temporal sharing and reduces P99 inference tail latency by 15.1\%. Its pluggable architecture supports custom policies; compared to the default policy, a \textsc{TPOT-FIRST} policy further reduces SLO violations by 21.2\% under dynamic traffic.
2026-03-16T09:48:34Z
Zhenyuan Yang, Wenxin Zheng, Mingyu Li, Haibo Chen

http://arxiv.org/abs/2604.02442v1
WIO: Upload-Enabled Computational Storage on CXL SSDs
2026-04-02T18:14:28Z
The widening gap between processor speed and storage latency has made data movement a dominant bottleneck in modern systems. Two lines of storage-layer innovation attempted to close this gap: persistent memory shortened the latency hierarchy, while computational storage devices pushed processing toward the data. Neither has displaced conventional NVMe SSDs at scale, largely due to programming complexity, ecosystem fragmentation, and thermal/power cliffs under sustained load. We argue that storage-side compute should be \emph{reversible}: computation should migrate dynamically between host and device based on runtime conditions. We present WIO, which realizes this principle on CXL SSDs by decomposing I/O-path logic into migratable \emph{storage actors} compiled to WebAssembly. Actors share state through coherent CXL.mem regions; an agility-aware scheduler migrates them via a zero-copy drain-and-switch protocol when thermal or power constraints arise.
Our evaluation on an FPGA-based CXL SSD prototype and two production CSDs shows that WIO turns hard thermal cliffs into elastic trade-offs, achieving up to 2$\times$ throughput improvement and 3.75$\times$ write latency reduction without application modification.
2026-04-02T18:14:28Z
Yiwei Yang, Yanpeng Hu, Yusheng Zheng, Estabon Ramos, Jianchang Su, Andi Quinn, Wei Zhang

http://arxiv.org/abs/2604.01655v1
HACache: Leveraging Read Performance with Cache in a Heterogeneous Array
2026-04-02T05:54:08Z
In cost-sensitive deployments, RAID arrays may combine SSDs with different performance levels. Such heterogeneity arises when aging SSDs degrade yet remain usable, or when failed drives are replaced with new devices of explicitly better performance. While this reduces procurement cost, it creates performance challenges: the traditional striping mechanism distributes requests evenly, but slower SSDs become bottlenecks, leaving faster ones underutilized and limiting overall bandwidth to the slowest drive.
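The bottleneck described in the HACache abstract can be made concrete with a small worked example. The model below is a deliberate simplification and an assumption on our part, not the paper's formalization: every drive serves an equal share of each striped read, so the whole array is paced by the slowest drive. The drive speeds are made-up numbers.

```python
def striped_read_bandwidth(drive_bandwidths):
    """Aggregate bandwidth when each drive serves an equal share of every stripe.

    Simplified model (an assumption): a striped read completes only when the
    slowest drive finishes its share, so every drive is paced at min(bandwidth).
    """
    return len(drive_bandwidths) * min(drive_bandwidths)

def ideal_read_bandwidth(drive_bandwidths):
    """Upper bound if requests could be split in proportion to drive speed."""
    return sum(drive_bandwidths)

# Hypothetical array: three fast SSDs and one aged SSD, in MB/s.
drives_mb_s = [3000, 3000, 3000, 1500]
even = striped_read_bandwidth(drives_mb_s)    # 4 * 1500 = 6000 MB/s
ideal = ideal_read_bandwidth(drives_mb_s)     # 10500 MB/s
unused_fraction = 1 - even / ideal            # share of raw bandwidth left idle
```

Under this model a single half-speed drive leaves roughly 43% of the array's raw read bandwidth unused, which is the gap that diverting requests away from the slow drive aims to recover.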
To address this, we propose HACache (Heterogeneity Adaptive Cache) for read-intensive workloads. HACache introduces high-performance SSDs as read caches to rebalance request distribution. First, we formalize the request diversion problem and solve it formally. Second, to support searching for optimal diversion ratios at runtime, HACache adopts a two-phase request diversion ratio adjustment mechanism. Finally, cache capacity regulation adapts the quota for each backend SSD based on hit rates and request diversion needs. This design maximizes bandwidth utilization. Experiments show HACache improves heterogeneous RAID read performance significantly, with bandwidth gains of about 35\% in typical mixed configurations.
2026-04-02T05:54:08Z
11 pages, 16 figures
Jialin Liu, Liang Shi, Dingcui Yu

http://arxiv.org/abs/2604.01620v1
DAXFS: A Lock-Free Shared Filesystem for CXL Disaggregated Memory
2026-04-02T04:56:20Z
CXL (Compute Express Link) enables multiple hosts to share byte-addressable memory with hardware cache coherence, but no existing filesystem exploits this for lock-free multi-host coordination. We present DaxFS, a Linux filesystem for CXL shared memory that uses cmpxchg atomic operations, which CXL makes coherent across host boundaries, as its sole coordination primitive. A CAS-based hash overlay enables lock-free concurrent writes from multiple hosts without any centralized coordinator. A cooperative shared page cache with a novel multi-host clock eviction algorithm (MH-clock) provides demand-paged caching in shared DAX memory, with fully decentralized victim selection via cmpxchg. We validate multi-host correctness using QEMU-emulated CXL 3.0, where two virtual hosts share a memory region with TCP-forwarded atomics. Under cross-host contention, DaxFS maintains >99% CAS accuracy with no lost updates.
On single-host DRAM-backed DAX, DaxFS exceeds tmpfs throughput across all write workloads, achieving up to 2.68x higher random write throughput with 4 threads and 1.18x higher random read throughput at 64 KB. Preliminary GPU microbenchmarks show that the cmpxchg-based design extends to GPU threads performing page cache operations at PCIe 5.0 bandwidth limits.
2026-04-02T04:56:20Z
Cong Wang, Yiwei Yang, Yusheng Zheng

http://arxiv.org/abs/2604.01441v1
Generative Profiling for Soft Real-Time Systems and its Applications to Resource Allocation
2026-04-01T22:27:27Z
Modern real-time systems require accurate characterization of task timing behavior to ensure predictable performance, particularly on complex hardware architectures. Existing methods, such as worst-case execution time analysis, often fail to capture the fine-grained timing behaviors of a task under varying resource contexts (e.g., an allocation of cache, memory bandwidth, and CPU frequency), which is necessary to achieve efficient resource utilization. In this paper, we introduce a novel generative profiling approach that synthesizes context-dependent, fine-grained timing profiles for real-time tasks, including those for unmeasured resource allocations. Our approach leverages a nonparametric, conditional multi-marginal Schrödinger Bridge (MSB) formulation to generate accurate execution profiles for unseen resource contexts, with maximum likelihood guarantees. We demonstrate the efficiency and effectiveness of our approach through real-world benchmarks, and showcase its practical utility in a representative case study of adaptive multicore resource allocation for real-time systems.
2026-04-01T22:27:27Z
Georgiy A. Bondar, Abigail Eisenklam, Yifan Cai, Robert Gifford, Tushar Sial, Linh Thi Xuan Phan, Abhishek Halder

http://arxiv.org/abs/2603.29052v1
SteelDB: Diagnosing Kernel-Space Bottlenecks in Cloud OLTP Databases
2026-03-30T22:43:33Z
Modern cloud OLTP databases have sought performance primarily through user-space optimization: separating storage and compute layers, or distributing transactions across multiple nodes using consensus algorithms. This paper turns attention to a previously unexplored layer: kernel-space I/O behavior. From an on-premises perspective, where a single server with local storage delivers excellent performance, these elaborate designs seem puzzling. Why do cloud databases require such architectural complexity? We investigate this through a pathological analysis of databases that rely on OS-level I/O control in cloud-specific storage environments. We show that bottlenecks widely attributed to network or storage architectures in fact originate in kernel-space I/O behavior. Based on this diagnosis, we derive treatment principles and realize them as SteelDB, a zero-patch architecture that improves database performance on general-purpose cloud distributed block storage through strategic I/O optimization without requiring kernel or database patches. TPC-C evaluations demonstrate that SteelDB achieves up to 9x performance improvement at no additional cost. Against Amazon Aurora, SteelDB achieved 3.1x higher performance while reducing costs by 58%, leading to a 7.3x improvement in cost efficiency.
While Aurora requires an average of 254 days for major version upgrades due to applying proprietary patches to newly released OSS databases, our zero-patch architecture reduces these software maintenance costs to near zero.
2026-03-30T22:43:33Z
Mitsumasa Kondo

http://arxiv.org/abs/2603.25666v1
Experimental Analysis of FreeRTOS Dependability through Targeted Fault Injection Campaigns
2026-03-26T17:21:16Z
Real-Time Operating Systems (RTOSes) play a crucial role in safety-critical domains, where deterministic and predictable task execution is essential. Yet they are increasingly exposed to ionizing radiation, which can compromise system dependability.
To assess FreeRTOS under such conditions, we introduce KRONOS, a software-based, non-intrusive post-propagation Fault Injection (FI) framework that injects transient and permanent faults into Operating System-visible kernel data structures without specialized hardware or debug interfaces. Using KRONOS, we conduct an extensive FI campaign on core FreeRTOS kernel components, including scheduler-related variables and Task Control Blocks (TCBs), characterizing the impact of kernel-level corruptions on functional correctness, timing behavior, and availability.
The results show that corruption of pointer and key scheduler-related variables frequently leads to crashes, whereas many TCB fields have only a limited impact on system availability.
2026-03-26T17:21:16Z
6 pages; 5 figures; sent to the International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS) 2026
Proceeding of the 2026 IEEE 29th International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS)
Luca Mannella, Stefano Di Carlo, Alessandro Savino
10.1109/DDECS69233.2026.11520990

http://arxiv.org/abs/2603.21466v2
GateANN: I/O-Efficient Filtered Vector Search on SSDs
2026-03-26T10:24:52Z
We present GateANN, an I/O-efficient SSD-based graph ANNS system that supports filtered vector search on an unmodified graph index. Existing SSD-based systems either waste I/O by post-filtering, or require expensive filter-aware index rebuilds. GateANN avoids both by decoupling graph traversal from vector retrieval. Our key insight is that traversing a node requires only its neighbor list and an approximate distance, neither of which needs the full-precision vector on SSD. Based on this, GateANN introduces graph tunneling. It checks each node's filter predicate in memory before issuing I/O and routes through non-matching nodes entirely in memory, preserving graph connectivity without any SSD read for non-matching nodes. Our experimental results show that it reduces SSD reads by up to 10x and improves throughput by up to 7.6x.
2026-03-23T01:03:51Z
Nakyung Lee, Soobin Cho, Jiwoong Park, Gyuyeong Kim
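The idea in the GateANN abstract, checking the filter in memory and fetching a full-precision vector only for matching nodes, can be sketched with a toy best-first graph search. This is a simplified illustration, not GateANN's algorithm: the function and parameter names are hypothetical, `full_vecs` stands in for SSD-resident vectors, and a counter stands in for SSD reads.

```python
import heapq

def filtered_search(query, entry, neighbors, approx_vecs, full_vecs, labels, wanted, k, budget=64):
    """Best-first search that reads full vectors only for filter-matching nodes."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    ssd_reads = 0
    visited = {entry}
    frontier = [(dist(query, approx_vecs[entry]), entry)]  # ordered by in-memory approximate distance
    results = []
    expanded = 0
    while frontier and expanded < budget:
        _, node = heapq.heappop(frontier)
        expanded += 1
        if labels[node] == wanted:
            ssd_reads += 1  # simulated SSD read of the full-precision vector
            results.append((dist(query, full_vecs[node]), node))
        # Matching or not, the node's in-memory neighbor list is still traversed.
        for nxt in neighbors[node]:
            if nxt not in visited:
                visited.add(nxt)
                heapq.heappush(frontier, (dist(query, approx_vecs[nxt]), nxt))
    results.sort()
    return [node for _, node in results[:k]], ssd_reads

# Toy index: five nodes on a line; nodes 1 and 2 do not match the filter.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
vectors = {i: [float(i)] for i in range(5)}
labels = {0: "a", 1: "b", 2: "b", 3: "a", 4: "a"}
top, reads = filtered_search([4.0], 0, neighbors, vectors, vectors, labels, "a", k=2)
```

In this toy run the search passes through nodes 1 and 2 to reach the matching nodes near the query but issues a simulated read only for the three matching nodes, whereas a post-filtering search over the same path would read all five.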