https://arxiv.org/api/XjrapaQd3gNR5s8Eq3apbfPdBo0 2026-03-18T10:10:23Z 7778 0 15 http://arxiv.org/abs/2603.16812v1 ODIN-Based CPU-GPU Architecture with Replay-Driven Simulation and Emulation 2026-03-17T17:16:41Z Integration of CPU and GPU technologies is a key enabler for modern AI and graphics workloads, combining control-oriented processing with massive parallel compute capability. As systems evolve toward chiplet-based architectures, pre-silicon validation of tightly coupled CPU-GPU subsystems becomes increasingly challenging due to complex validation framework setup, large design scale, high concurrency, non-deterministic execution, and intricate protocol interactions at chiplet boundaries, often resulting in long integration cycles. This paper presents a replay-driven validation methodology developed during the integration of a CPU subsystem, multiple Xe GPU cores, and a configurable Network-on-Chip (NoC) within a foundational SoC building block targeting the ODIN integrated chiplet architecture. By leveraging deterministic waveform capture and replay across both simulation and emulation using a single design database, complex GPU workloads and protocol sequences can be reproduced reliably at the system level. This approach significantly accelerates debug, improves integration confidence, and enables end-to-end system boot and workload execution within a single quarter, demonstrating the effectiveness of replay-based validation as a scalable methodology for chiplet-based systems. 
2026-03-17T17:16:41Z Nij Dorairaj Debabrata Chatterjee Hong Wang Hong Jiang Alankar Saxena Altug Koker Thiam Ern Lim Cathrane Teoh Chuan Yin Loo Bishara Shomar Anthony Lester http://arxiv.org/abs/2505.02314v2 NeuroSim V1.5: Improved Software Backbone for Benchmarking Compute-in-Memory Accelerators with Device and Circuit-level Non-idealities 2026-03-17T15:17:58Z The exponential growth of artificial intelligence (AI) applications has exposed the inefficiency of conventional von Neumann architectures, where frequent data transfers between compute units and memory create significant energy and latency bottlenecks. Analog Computing-in-Memory (ACIM) addresses this challenge by performing multiply-accumulate (MAC) operations directly in the memory arrays, substantially reducing data movement. However, designing robust ACIM accelerators requires accurate modeling of device- and circuit-level non-idealities. In this work, we present NeuroSim V1.5, introducing several key advances: (1) seamless integration with TensorRT's post-training quantization flow enabling support for more neural networks including transformers, (2) a flexible noise injection methodology built on pre-characterized statistical models, making it straightforward to incorporate data from SPICE simulations or silicon measurements, (3) expanded device support including emerging non-volatile capacitive memories, and (4) up to 6.5x faster runtime than NeuroSim V1.4 through optimized behavioral simulation. The combination of these capabilities uniquely enables systematic design space exploration across both accuracy and hardware efficiency metrics. Through multiple case studies, we demonstrate optimization of critical design parameters while maintaining network accuracy. By bridging high-fidelity noise modeling with efficient simulation, NeuroSim V1.5 advances the design and validation of next-generation ACIM accelerators. All NeuroSim versions are available open-source at https://github.com/neurosim/NeuroSim. 
2025-05-05T02:07:04Z 15 pages, 9 figures, 6 tables James Read Ming-Yen Lee Wei-Hsing Huang Yuan-Chun Luo Anni Lu Shimeng Yu http://arxiv.org/abs/2603.16565v1 Deep Learning-Driven Black-Box Doherty Power Amplifier with Pixelated Output Combiner and Extended Efficiency Range 2026-03-17T14:23:58Z This article presents a deep learning-driven inverse design methodology for Doherty power amplifiers (PA) with multi-port pixelated output combiner networks. A deep convolutional neural network (CNN) is developed and trained as an electromagnetic (EM) surrogate model to accurately and rapidly predict the S-parameters of pixelated passive networks. By leveraging the CNN-based surrogate model within a blackbox Doherty framework and a genetic algorithm (GA)-based optimizer, we effectively synthesize complex Doherty combiners that enable an extended back-off efficiency range using fully symmetrical devices. As a proof of concept, we designed and fabricated two Doherty PA prototypes incorporating three-port pixelated combiners, implemented with GaN HEMT transistors. In measurements, both prototypes demonstrate a maximum drain efficiency exceeding 74% and deliver an output power surpassing 44.1 dBm at 2.75 GHz. Furthermore, a measured drain efficiency above 52% is maintained at the 9-dB back-off power level for both prototypes at the same frequency. To evaluate linearity and efficiency under realistic signal conditions, both prototypes are tested using a 20-MHz 5G new radio (NR)-like waveform exhibiting a peak-to-average power ratio (PAPR) of 9.0 dB. After applying digital predistortion (DPD), each design achieves an average power added efficiency (PAE) above 51%, while maintaining an adjacent channel leakage ratio (ACLR) better than -60.8 dBc. 
2026-03-17T14:23:58Z Han Zhou Haojie Chang David Widen http://arxiv.org/abs/2603.16543v1 A Pin-Array Structured Climbing Robot for Stable Locomotion on Steep Rocky Terrain 2026-03-17T14:05:42Z Climbing robots face significant challenges when navigating unstructured environments, where reliable attachment to irregular surfaces is critical. We present a novel mobile climbing robot equipped with compliant pin-array structured grippers that passively conform to surface irregularities, ensuring stable ground gripping without the need for complicated sensing or control. Each pin features a vertically split design, combining an elastic element with a metal spine to enable mechanical interlocking with microscale surface features. Statistical modeling and experimental validation indicate that variability in individual pin forces and contact numbers are the primary sources of grasping uncertainty. The robot demonstrated robust and stable locomotion in indoor tests on inclined walls (10-30 degrees) and in outdoor tests on natural rocky terrain. This work highlights that a design emphasizing passive compliance and mechanical redundancy provides a practical and robust solution for real-world climbing robots while minimizing control complexity. 2026-03-17T14:05:42Z Author's version of a manuscript accepted at the 2026 IEEE International Conference on Robotics and Automation (ICRA). (c) IEEE Keita Nagaoka Kentaro Uno Kazuya Yoshida http://arxiv.org/abs/2603.16490v1 ETM2: Empowering Traditional Memory Bandwidth Regulation using ETM 2026-03-17T13:19:15Z The Embedded Trace Macrocell (ETM) is a standard component of Arm's CoreSight architecture, present in a wide range of platforms and primarily designed for tracing and debugging. In this work, we demonstrate that it can be repurposed to implement a novel hardware-assisted memory bandwidth regulator, providing a portable and effective solution to mitigate memory interference in real-time multicore systems. 
ETM2 requires minimal software intervention and bridges the gap between the fine-grained microsecond resolution of MemPol and the portability and reaction time of interrupt-based solutions, such as MemGuard. We assess the effectiveness and portability of our design with an evaluation on a large number of 64-bit Arm boards, and we compare ETM2 with previous works using a setup based on the San Diego Vision Benchmark Suite on the AMD Zynq UltraScale+. Our results show the scalability of the approach and highlight the design trade-offs it enables. ETM2 is effective in enforcing per-core memory bandwidth regulation and unlocks new regulation options that were infeasible under MemGuard and MemPol. 2026-03-17T13:19:15Z Extended version of the paper accepted to appear at IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS) 2026 Alexander Zuepke Ashutosh Pradhan Daniele Ottaviano Andrea Bastoni Marco Caccamo http://arxiv.org/abs/2603.16203v1 A Scalable Open-Source QEC System with Sub-Microsecond Decoding-Feedback Latency 2026-03-17T07:30:35Z Quantum error correction (QEC) is essential for realizing large-scale, fault-tolerant quantum computation, yet its practical implementation remains a major engineering challenge. In particular, QEC demands precise real-time control of a large number of qubits and low-latency, high-throughput and accurate decoding of error syndromes. While most prior work has focused primarily on decoder design, the overall performance of any QEC system depends critically on all its subsystems including control, communication, and decoding, as well as their integration. To address this challenge, we present an open-source, fully integrated QEC system built on RISC-Q, a generator for RISC-V-based quantum control architectures. 
Implemented on RFSoC FPGAs, our system prototype integrates real-time qubit control, a scalable distributed multi-board architecture, and the state-of-the-art hardware QEC decoder within a low-latency, high-throughput decoding pipeline, forming a complete hardware platform ready for deployment with superconducting qubits. Experimental evaluation on a three-board prototype based on AMD ZCU216 RFSoCs demonstrates an end-to-end QEC decoding-feedback latency of 446 ns for a distance-3 surface code, including syndrome aggregation, network communication, syndrome decoding, and error distribution. Extrapolating from measured subsystem performance and state-of-the-art decoder benchmarks, the architecture can achieve sub-microsecond decoding-feedback latency up to a distance-21 surface code ($\sim$881 physical qubits) when scaled to larger hardware configurations. 2026-03-17T07:30:35Z Junyi Liu Yi Lee Yilun Xu Gang Huang Xiaodi Wu http://arxiv.org/abs/2512.18345v2 Theodosian: A Deep Dive into Memory-Hierarchy-Centric FHE Acceleration 2026-03-17T07:24:08Z Fully homomorphic encryption (FHE) enables secure computation on encrypted data, mitigating privacy concerns in cloud and edge environments. However, due to its high compute and memory demands, extensive acceleration research has been pursued across diverse hardware platforms, especially GPUs. In this paper, we perform a microarchitectural analysis of CKKS, a popular FHE scheme, on modern GPUs. Focusing on the memory hierarchy, we demonstrate that dominant kernels remain bound by the on-chip L2 cache despite its high bandwidth, exposing a persistent inner memory wall beyond the conventional off-chip DRAM bottleneck. Further, we reveal that the overall CKKS throughput is constrained by low per-kernel hardware utilization, caused by insufficient intra-kernel parallelism. 
Motivated by these findings, we introduce Theodosian, a set of complementary, memory-aware optimizations that improve cache efficiency and reduce runtime overheads. Theodosian achieves 1.45--1.83x performance improvements over a highly optimized baseline, Cheddar, across representative CKKS workloads. On an RTX 5090, we reduce the bootstrapping latency for 32,768 complex numbers from 22.1ms to 15.2ms, and further to 12.8ms with additional algorithmic optimizations, establishing a new state-of-the-art GPU performance to the best of our knowledge. 2025-12-20T12:18:29Z 12 pages, 8 figures, accepted at ISPASS 2026 Wonseok Choi Hyunah Yu Jongmin Kim Hyesung Ji Jaiyoung Park Jung Ho Ahn http://arxiv.org/abs/2603.15589v1 LEXI: Lossless Exponent Coding for Efficient Inter-Chiplet Communication in Hybrid LLMs 2026-03-16T17:48:30Z Data movement overheads increase the inference latency of state-of-the-art large language models (LLMs). These models commonly use the bfloat16 (BF16) format for stable training. Floating-point standards allocate eight bits to the exponent, but our profiling reveals that exponent streams exhibit fewer than 3 bits of Shannon entropy, indicating high inherent compressibility. To exploit this potential, we propose LEXI, a novel lossless exponent compression scheme based on Huffman coding. LEXI compresses activations and caches on the fly while storing compressed weights for just-in-time decompression near compute, without sacrificing system throughput and model accuracy. The codecs at the ingress and egress ports of network-on-chip routers sustain the maximum link bandwidth via multi-lane LUT decoders, incurring only 0.09 percent area and energy overheads with GF 22 nm technology. LEXI reduces inter-chiplet communication and end-to-end inference latencies by 33-45 percent and 30-35 percent on modern Jamba, Zamba, and Qwen LLMs implemented on a homogeneous chiplet architecture. 
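The entropy argument in the LEXI abstract can be illustrated with a minimal sketch. This is not the LEXI codec: the function names and the sample values are hypothetical, BF16 conversion is approximated by truncating a float32 to its top 16 bits, and the Huffman step only reports code lengths rather than producing a bitstream.

```python
import heapq
import math
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """Return the 8-bit exponent field of x after truncation to bfloat16."""
    # BF16 keeps the top 16 bits of an IEEE-754 float32:
    # 1 sign bit, 8 exponent bits, 7 mantissa bits.
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    bits16 = bits32 >> 16
    return (bits16 >> 7) & 0xFF


def shannon_entropy(symbols) -> float:
    """Empirical Shannon entropy of a symbol stream, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(symbols) -> dict:
    """Map each symbol to the length of its Huffman codeword."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (weight, tiebreak, {symbol: depth}); the tiebreak
    # keeps the dicts from ever being compared.
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


# Hypothetical sample values, chosen only to make the arithmetic easy to follow.
values = [0.5, 0.75, 1.0, 1.5, 1.25, 1.75, 2.0, 3.0]
exponents = [bf16_exponent(v) for v in values]

entropy_bits = shannon_entropy(exponents)
lengths = huffman_code_lengths(exponents)
avg_code_bits = sum(lengths[e] for e in exponents) / len(exponents)

print(f"entropy: {entropy_bits} bits/exponent (raw field: 8 bits)")
print(f"Huffman average: {avg_code_bits} bits/exponent")
```

On this toy stream only three exponent values occur, so both the entropy and the average Huffman codeword length sit well below the 8 bits of the raw field. Real activation or weight tensors would have to be profiled separately to reproduce the figure quoted in the abstract.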
2026-03-16T17:48:30Z 7 pages Miao Sun Alish Kanani Kaushik Shroff Umit Ogras http://arxiv.org/abs/2603.15571v1 Co-Design of Memory-Storage Systems for Workload Awareness with Interpretable Models 2026-03-16T17:33:26Z Solid-state storage architectures based on NAND or emerging memory devices (SSD) are fundamentally architected and optimized for both reliability and performance. Achieving these simultaneous goals requires co-design of memory components with firmware-architected Error Management (EM) algorithms for density- and performance-scaled memory technologies. We describe a Machine Learning (ML) for systems methodology and modeling for co-designing the EM subsystem together with the natural variance inherent to the scaled silicon process of memory components underlying SSD technology. The modeling analyzes NAND memory components and EM algorithms interacting with a comprehensive suite of synthetic (stress-focused and JEDEC) and emulation (YCSB and similar) workloads across Flash Translation abstraction layers, by leveraging a statistically interpretable and intuitively explainable ML algorithm. The generalizable co-design framework evaluates several thousand datacenter SSDs spanning multiple generations of memory and storage technology. Consequently, the modeling framework enables continuous, holistic, data-driven design towards generational architectural advancements. We additionally demonstrate that the framework enables Representation Learning of the EM-workload domain for enhancement of the architectural design-space across a broad spectrum of workloads. 2026-03-16T17:33:26Z 9 pages, 10 figures Jay Sarkar Vamsi Pavan Rayaprolu Abhijeet Bhalerao http://arxiv.org/abs/2603.15530v1 DUET: Disaggregated Hybrid Mamba-Transformer LLMs with Prefill and Decode-Specific Packages 2026-03-16T16:56:01Z Large language models operate in distinct compute-bound prefill followed by memory bandwidth-bound decode phases. 
Hybrid Mamba-Transformer models inherit this asymmetry while adding state space model (SSM) recurrences and element-wise operations that map poorly to matmul-centric accelerators. This mismatch causes performance bottlenecks, showing that a homogeneous architecture cannot satisfy all requirements. We introduce DUET, a disaggregated accelerator that assigns prefill and decode phases to specialized packages. The Prefill package utilizes systolic array chiplets with off-package memory for efficient large matrix multiplications and long-sequence SSMs. The Decode package utilizes vector-unit arrays with high-bandwidth in-package memory to accelerate token-by-token SSM and vector-matrix multiplications. Both architectures are runtime-configurable to support hybrid models with mixed Mamba and attention layers. Evaluations on Nemotron-H-56B, Zamba2-7B, and Llama3-8B across four workloads show that DUET achieves 4x faster time to first token, 1.4x higher throughput, and 1.5x lower time between tokens over the B200 GPU. 2026-03-16T16:56:01Z Paper accepted for publication at the Design Automation Conference (DAC) 2026 conference Alish Kanani Sangwan Lee Han Lyu Jiahao Lin Jaehyun Park Umit Y. Ogras http://arxiv.org/abs/2603.15717v1 GLANCE: Gaze-Led Attention Network for Compressed Edge-inference 2026-03-16T15:52:52Z Real-time object detection in AR/VR systems faces critical computational constraints, requiring sub-10\,ms latency within tight power budgets. Inspired by biological foveal vision, we propose a two-stage pipeline that combines differentiable weightless neural networks for ultra-efficient gaze estimation with attention-guided region-of-interest object detection. Our approach eliminates arithmetic-intensive operations by performing gaze tracking through memory lookups rather than multiply-accumulate computations, achieving an angular error of $8.32^{\circ}$ with only 393 MACs and 2.2 KiB of memory per frame. 
Gaze predictions guide selective object detection on attended regions, reducing computational burden by 40-50\% and energy consumption by 65\%. Deployed on the Arduino Nano 33 BLE, our system achieves 48.1\% mAP on COCO (51.8\% on attended objects) while maintaining sub-10\,ms latency, meeting stringent AR/VR requirements by improving the communication time by $\times 177$. Compared to the global YOLOv12n baseline, which achieves 39.2\%, 63.4\%, and 83.1\% accuracy for small, medium, and large objects, respectively, the ROI-based method yields 51.3\%, 72.1\%, and 88.1\% under the same settings. This work shows that memory-centric architectures with explicit attention modeling offer better efficiency and accuracy for resource-constrained wearable platforms than uniform processing. 2026-03-16T15:52:52Z Neeraj Solanki Hong Ding Sepehr Tabrizchi Ali Shafiee Sarvestani Shaahin Angizi David Z. Pan Arman Roohi http://arxiv.org/abs/2603.15413v1 RESQ: A Unified Framework for REliability- and Security Enhancement of Quantized Deep Neural Networks 2026-03-16T15:22:55Z This work proposes a unified three-stage framework that produces a quantized DNN with balanced fault and attack robustness. The first stage improves attack resilience via fine-tuning that desensitizes feature representations to small input perturbations. The second stage reinforces fault resilience through fault-aware fine-tuning under simulated bit-flip faults. Finally, a lightweight post-training adjustment integrates quantization to enhance efficiency and further mitigate fault sensitivity without degrading attack resilience. Experiments on ResNet18, VGG16, EfficientNet, and Swin-Tiny in CIFAR-10, CIFAR-100, and GTSRB show consistent gains of up to 10.35% in attack resilience and 12.47% in fault resilience, while maintaining competitive accuracy in quantized networks. 
The results also highlight an asymmetric interaction in which improvements in fault resilience generally increase resilience to adversarial attacks, whereas enhanced adversarial resilience does not necessarily lead to higher fault resilience. 2026-03-16T15:22:55Z Ali Soltan Mohammadi Samira Nazari Ali Azarpeyvand Mahdi Taheri Milos Krstic Michael Huebner Christian Herglotz Tara Ghasempouri http://arxiv.org/abs/2603.14988v1 bitSMM: A bit-Serial Matrix Multiplication Accelerator 2026-03-16T08:51:54Z Neural-network (NN) inference is increasingly present on-board spacecraft to reduce downlink bandwidth and enable timely decision making. However, the power and reliability constraints of space missions limit the applicability of many state-of-the-art NN accelerators. This paper presents bitSMM, a bit-serial matrix multiplication accelerator built around a systolic array of bit-serial multiply--accumulate (MAC) units. The design supports runtime-configurable operand precision from 1 to 16 bits and evaluates two MAC variants: a Booth-inspired architecture and a standard binary multiplication with correction architecture. We implement bitSMM in [System]Verilog and evaluate it on an AMD ZCU104 FPGA and through ASIC physical implementation using the asap7 and nangate45 process design kits. On the FPGA, bitSMM achieves up to 19.2~GOPS and 2.973~GOPS/W, and in asap7 it achieves up to 73.22~GOPS, 552~GOPS/mm$^2$, and 40.8~GOPS/W. 2026-03-16T08:51:54Z Accepted at CGRA4HPC 2026 Pedro Antunes Artur Podobas http://arxiv.org/abs/2603.14785v1 SkipOPU: An FPGA-based Overlay Processor for Large Language Models with Dynamically Allocated Computation 2026-03-16T03:35:04Z Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their inference efficiency remains a critical bottleneck due to rapidly growing parameters. 
Recent advances in dynamic computation allocation address this challenge by exploiting the highly uneven contributions of different tokens and layers, enabling selective execution that significantly reduces redundant computation while preserving model accuracy. However, existing hardware platforms and accelerators are primarily optimized for uniform, static execution, limiting their ability to efficiently support such dynamic inference patterns. In this work, we propose SkipOPU, an FPGA-based overlay processor that dynamically allocates computation across tokens and layers with high flexibility through a lightweight routing mechanism. First, we decouple reduction operations from element-wise computation in nonlinear modules and perform reductions incrementally, which enables both stages to be fused with adjacent linear operations (router or matrix multiplication) for effective latency hiding. Second, motivated by asymmetric sensitivity to numerical precision between activation and weight, we design a PE array that efficiently supports float-fixed hybrid execution. A novel DSP overpacking technique is introduced to maximize hardware utilization while minimizing resource overhead. Finally, we develop a proactive on-chip KV history buffer that exploits cross-layer KV invariance of pruned tokens, eliminating irregular HBM accesses during decoding and supplementing off-chip bandwidth through high-locality on-chip reuse. Experimental results demonstrate that SkipOPU on an AMD U280 FPGA outperforms GPU and other FPGA-based accelerators by 1.23x-3.83x in bandwidth efficiency for LLM inference with dynamic computation allocation and can reduce KV storage overhead by up to 25.4% across varying sequence lengths. 
2026-03-16T03:35:04Z 22 pages, 9 figures Zicheng He Anhao Zhao Xiaoyu Shen Chen Wu Lei He http://arxiv.org/abs/2603.14583v1 Machine Learning-Driven Intelligent Memory System Design: From On-Chip Caches to Storage 2026-03-15T20:02:05Z Despite the data-rich environment in which memory systems of modern computing platforms operate, many state-of-the-art architectural policies employed in the memory system rely on static, human-designed heuristics that fail to truly adapt to the workload and system behavior via principled learning methodologies. In this article, we propose a fundamentally different design approach: using lightweight and practical machine learning (ML) methods to enable adaptive, data-driven control throughout the memory hierarchy. We present three ML-guided architectural policies: (1) Pythia, a reinforcement learning-based data prefetcher for on-chip caches, (2) Hermes, a perceptron learning-based off-chip predictor for multi-level cache hierarchies, and (3) Sibyl, a reinforcement learning-based data placement policy for hybrid storage systems. Our evaluation shows that Pythia, Hermes, and Sibyl significantly outperform the best-prior human-designed policies, while incurring modest hardware overheads. Collectively, this article demonstrates that integrating adaptive learning into memory subsystems can lead to intelligent, self-optimizing architectures that unlock performance and efficiency gains beyond what is possible with traditional human-designed approaches. 2026-03-15T20:02:05Z Extended version of the IEEE Micro 2026 article Rahul Bera Rakesh Nadig Onur Mutlu 10.1109/MM.2026.3667076