https://arxiv.org/api/x+CP9N3PjWAC3hNIHgSJZ0vwoaU 2026-03-30T10:42:06Z 5080 240 15 http://arxiv.org/abs/2512.09199v1 LLMs for Analog Circuit Design Continuum (ACDC) 2025-12-09T23:57:28Z

Large Language Models (LLMs) and transformer architectures have shown impressive reasoning and generation capabilities across diverse natural language tasks. However, their reliability and robustness in real-world engineering domains remain largely unexplored, limiting their practical utility in human-centric workflows. In this work, we investigate the applicability and consistency of LLMs for analog circuit design -- a task requiring domain-specific reasoning, adherence to physical constraints, and structured representations -- focusing on AI-assisted design where humans remain in the loop. We study how different data representations influence model behavior and compare smaller models (e.g., T5, GPT-2) with larger foundation models (e.g., Mistral-7B, GPT-oss-20B) under varying training conditions. Our results highlight key reliability challenges, including sensitivity to data format, instability in generated designs, and limited generalization to unseen circuit configurations. These findings provide early evidence on the limits and potential of LLMs as tools to enhance human capabilities in complex engineering tasks, offering insights into designing reliable, deployable foundation models for structured, real-world applications.

2025-12-09T23:57:28Z Yasaman Esfandiari Jocelyn Rego Austin Meyer Jonathan Gallagher Mia Levy http://arxiv.org/abs/2512.08715v1 Multi-domain performance analysis with scores tailored to user preferences 2025-12-09T15:29:53Z

The performance of algorithms, methods, and models tends to depend heavily on the distribution of cases to which they are applied, and this distribution is specific to the application domain. After performing an evaluation in several domains, it is highly informative to compute a (weighted) mean performance and, as shown in this paper, to scrutinize what happens during this averaging. To achieve this goal, we adopt a probabilistic framework and consider a performance as a probability measure (e.g., a normalized confusion matrix for a classification task). The corresponding weighted mean of performances is what is known as the summarization, and it turns out that only some remarkable scores assign to the summarized performance a value equal to a weighted arithmetic mean of the values assigned to the domain-specific performances. These scores include the family of ranking scores, a continuum parameterized by user preferences, and the weights to use in the arithmetic mean depend on those preferences. Based on this, we rigorously define four domains, named easiest, most difficult, preponderant, and bottleneck domains, as functions of user preferences. After establishing the theory in a general setting, regardless of the task, we develop new visual tools for two-class classification.
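
The averaging behavior described in this abstract can be illustrated with a minimal sketch. It assumes two hypothetical domains with made-up normalized confusion matrices and domain weights; none of the numbers come from the paper. Accuracy is linear in the confusion matrix, so the accuracy of the summarized performance equals the weighted mean of the per-domain accuracies with the same weights. Precision is a ratio, so it only matches an arithmetic mean once the weights are adjusted. Precision is used here purely as an example of a ratio score, not as a claim about which scores the paper classifies as ranking scores.

```python
# Hypothetical two-domain example; every number below is illustrative only.
# A performance is a normalized confusion matrix (tn, fp, fn, tp) summing to 1.

def mix(perfs, weights):
    """Weighted mean of performances, taken component by component."""
    return tuple(sum(w * p[i] for p, w in zip(perfs, weights)) for i in range(4))

def accuracy(p):
    tn, fp, fn, tp = p
    return tn + tp

def precision(p):
    tn, fp, fn, tp = p
    return tp / (tp + fp)

domains = [(0.50, 0.10, 0.10, 0.30), (0.20, 0.05, 0.25, 0.50)]
weights = [0.25, 0.75]  # assumed domain weights, summing to 1

summary = mix(domains, weights)

# Linear score: same weights work before and after averaging.
acc_of_summary = accuracy(summary)
mean_of_acc = sum(w * accuracy(p) for p, w in zip(domains, weights))

# Ratio score: the plain weighted mean no longer matches...
prec_of_summary = precision(summary)
mean_of_prec = sum(w * precision(p) for p, w in zip(domains, weights))

# ...but a mean with weights rescaled by each domain's positive-prediction
# rate (tp + fp) does.
adjusted = [w * (p[3] + p[1]) for p, w in zip(domains, weights)]
reweighted_prec = sum(a * precision(p) for p, a in zip(domains, adjusted)) / sum(adjusted)

print(acc_of_summary, mean_of_acc)
print(prec_of_summary, mean_of_prec, reweighted_prec)
```

The point of the sketch is only that the weights used to mix performances and the weights needed to average score values do not coincide in general.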

2025-12-09T15:29:53Z Sébastien Piérard Adrien Deliège Marc Van Droogenbroeck http://arxiv.org/abs/2511.22334v2 Edge Deployment of Small Language Models, a comprehensive comparison of CPU, GPU and NPU backends 2025-12-09T12:19:18Z

Edge computing processes data where it is generated, enabling faster decisions, lower bandwidth usage, and improved privacy. However, edge devices typically operate under strict constraints on processing power, memory, and energy consumption, making them unsuitable for large language models (LLMs). Fortunately, Small Language Models (SLMs) offer lightweight alternatives that bring AI inference to resource-constrained environments by significantly reducing computational cost while remaining suitable for specialization and customization. In this scenario, selecting the hardware platform that best balances performance and efficiency for SLM inference is challenging due to strict resource limitations. To address this issue, this study evaluates the inference performance and energy efficiency of commercial CPUs (Intel and ARM), GPUs (NVIDIA), and NPUs (RaiderChip) for running SLMs. GPUs, the usual platform of choice, are compared against commercial NPUs and recent multi-core CPUs. While NPUs leverage custom hardware designs optimized for computation, modern CPUs increasingly incorporate dedicated features targeting language-model workloads. Using a common execution framework and a suite of state-of-the-art SLMs, we analyze both maximum achievable performance and processing and energy efficiency across commercial solutions available for each platform. The results indicate that specialized backends outperform general-purpose CPUs, with NPUs achieving the highest performance by a wide margin. Bandwidth normalization proves essential for fair cross-architecture comparisons. Although low-power ARM processors deliver competitive results when energy usage is considered, metrics that combine performance and power (such as EDP) again highlight NPUs as the dominant architecture. These findings show that designs optimized for both efficiency and performance offer a clear advantage for edge workloads.
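
As a rough illustration of why a combined metric such as EDP (energy-delay product) can rank platforms differently from energy alone, here is a minimal sketch. The device names and the power and latency figures are hypothetical and are not measurements from the study.

```python
def energy_delay_product(avg_power_w, latency_s):
    """EDP in joule-seconds: energy (power x time) multiplied by delay."""
    energy_j = avg_power_w * latency_s
    return energy_j * latency_s

# Hypothetical devices: (average power in W, latency in s) for the same workload.
devices = {"low_power_cpu": (10.0, 2.0), "accelerator": (60.0, 0.5)}

for name, (power_w, latency_s) in devices.items():
    energy_j = power_w * latency_s
    edp = energy_delay_product(power_w, latency_s)
    print(f"{name}: energy={energy_j:.1f} J, EDP={edp:.1f} J*s")
```

With these invented numbers the low-power device consumes less energy (20 J versus 30 J) yet has the worse EDP (40 J*s versus 15 J*s), because EDP penalizes latency twice.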

2025-11-27T11:11:01Z 8 pages, 9 figures Pablo Prieto Pablo Abad http://arxiv.org/abs/2512.08465v1 High-performance computing enabled contingency analysis for modern power networks 2025-12-09T10:38:49Z

Modern power networks face increasing vulnerability to cascading failures due to high complexity and the growing penetration of intermittent resources, necessitating rigorous security assessment beyond the conventional $N-1$ criterion. Current approaches often struggle to achieve the computational tractability required for exhaustive $N-2$ contingency analysis integrated with complex stability evaluations like small-signal stability. Addressing this computational bottleneck and the limitations of deterministic screening, this paper presents a scalable methodology for the vulnerability assessment of modern power networks, integrating $N-2$ contingency analysis with small-signal stability evaluation. To prioritize critical components, we propose a probabilistic Risk Index ($R_i$) that weights the deterministic severity of a contingency (including optimal power flow divergence, islanding, and oscillatory instability) by the failure frequency of the involved elements based on reliability data. The proposed framework is implemented using High-Performance Computing (HPC) techniques through the PyCOMPSs parallel programming library, orchestrating optimal power flow simulations (VeraGrid) and small-signal analysis (STAMP) to enable the exhaustive exploration of massive contingency sets. The methodology is validated on the IEEE 118-bus test system, processing more than 57,000 scenarios to identify components prone to triggering cascading failures. Results demonstrate that the risk-based approach effectively isolates critical assets that deterministic $N-1$ criteria often overlook. This work establishes a replicable and efficient workflow for probabilistic security assessment, suitable for large-scale networks and capable of supporting operator decision-making in near real-time environments.
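
The idea of weighting severity by failure frequency over an exhaustive set of $N-2$ contingencies can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's formulation: the component names, failure rates, and severity values are hypothetical, the joint frequency of a double outage is assumed to be the product of the individual rates, and the per-component index is assumed to be the sum over the contingencies that involve the component. In the actual workflow the severity would come from the power flow and small-signal simulations rather than a lookup table.

```python
from itertools import combinations
from math import comb

# Hypothetical reliability data (failures per year) for three components.
failure_rate = {"line_1": 0.02, "line_2": 0.05, "gen_1": 0.10}

# Hypothetical severity scores; a stand-in for simulation results.
severity = {
    frozenset({"line_1", "line_2"}): 1.0,
    frozenset({"line_1", "gen_1"}): 0.0,
    frozenset({"line_2", "gen_1"}): 3.0,
}

# Exhaustive N-2 set: every unordered pair of components.
pairs = list(combinations(sorted(failure_rate), 2))

# Assumed aggregation: R_i sums severity x joint frequency over pairs containing i.
risk_index = {name: 0.0 for name in failure_rate}
for a, b in pairs:
    risk = severity[frozenset({a, b})] * failure_rate[a] * failure_rate[b]
    risk_index[a] += risk
    risk_index[b] += risk

ranking = sorted(risk_index, key=risk_index.get, reverse=True)
print(len(pairs), ranking)
```

The number of pairs grows as n(n-1)/2, which is why the exhaustive case calls for parallel execution on realistic networks.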

2025-12-09T10:38:49Z 10 pages, 5 figures, to be submitted to IJEPES Alexandre Gracia-Calvo Francesca Rossi Eduardo Iraola Juan Carlos Olives-Camps Eduardo Prieto-Araujo http://arxiv.org/abs/2509.08207v2 Aurora: Architecting Argonne's First Exascale Supercomputer for Accelerated Scientific Discovery 2025-12-08T18:45:43Z

Aurora is Argonne National Laboratory's pioneering Exascale supercomputer, designed to accelerate scientific discovery with cutting-edge architectural innovations. Key new technologies include the Intel(TM) Xeon(TM) CPU Max Series (code-named Sapphire Rapids) with support for High Bandwidth Memory (HBM), alongside the Intel(TM) Data Center GPU Max Series (code-named Ponte Vecchio) on each compute node. Aurora also integrates the Distributed Asynchronous Object Storage (DAOS), a novel exascale storage solution, and leverages Intel's oneAPI programming environment. This paper presents an in-depth exploration of Aurora's node architecture, the HPE Slingshot interconnect, the supporting software ecosystem, and DAOS. We provide insights into standard benchmark performance and application readiness efforts via Aurora's Early Science Program and the Exascale Computing Project.

2025-09-10T00:30:05Z 40 pages, 10 figures. Submitted to J. Supercomputing William E. Allcock Benjamin S. Allen James Anchell Victor Anisimov Thomas Applencourt Abhishek Bagusetty Ramesh Balakrishnan Riccardo Balin Solomon Bekele Colleen Bertoni Cyrus Blackworth Renzo Bustamante Kevin Canada John Carrier Christopher Chan-nui Lance C. Cheney Taylor Childers Paul Coffman Susan Coghlan Tanima Dey Michael D'Mello Ashok Emani Murali Emani Kyle G. Felker Sam Foreman Olivier Franza Longfei Gao Marta García María Garzarán Balazs Gerofi Yasaman Ghadar Subrata Goswami Neha Gupta Kevin Harms Väinö Hatanpää Brian Holland Carissa Holohan Brian Homerding Khalid Hossain Xue Hu Louise Huot Huda Ibeid Joseph A. Insley Sai Jayanthi Hong Jiang Wei Jiang Xiao-Yong Jin Jeongnim Kim Christopher Knight Panagiotis Kourdis Kalyan Kumaran JaeHyuk Kwack Janghaeng Lee Ti Leggett Ben Lenard Chris Lewis Nevin Liber Johann Lombardi Raymond M. Loy Ye Luo Bethany Lusch Nilakantan Mahadevan Beth Markey Victor A. Mateevitsi Gordon McPheeters Ryan Milner Jerome Mitchell Vitali A. Morozov Servesh Muralidharan Tom Musta Mrigendra Nagar Vikram Narayana Marieme Ngom Anthony-Trung Nguyen Nathan Nichols Aditya Nishtala James C. Osborn Michael E. Papka Scott Parker Saumil S. Patel Julia Piotrowska Adrian C. Pope Sucheta Raghunanda Esteban Rangel Paul M. Rich Katherine M. Riley Silvio Rizzi Kris Rowe Varuni Sastry Adam Scovel Filippo Simini Haritha Siddabathuni Som Patrick Steinbrecher Rick Stevens Xinmin Tian Peter Upton Thomas Uram Archit K. Vasan Álvaro Vázquez-Mayagoitia Kaushik Velusamy Brice Videau Venkatram Vishwanath Brian Whitney Timothy J. Williams Michael Woodacre Sam Zeltner Chuanjun Zhang Gengbin Zheng Huihuo Zheng http://arxiv.org/abs/2512.07622v1 Análisis de rendimiento y eficiencia energética en el cluster Raspberry Pi Cronos 2025-12-08T15:08:09Z

This article presents an evaluation of the computational performance and energy efficiency of the Cronos cluster, composed of Raspberry Pi 4 and Raspberry Pi 3b microcomputers and designed for educational purposes. Experimental tests were performed using the High Performance Linpack (HPL) benchmark, under a resource management environment configured with Slurm and parallel communication via Open MPI. The study focuses on analyzing scalability, stability, and power consumption during the execution of computationally intensive workloads, considering different node configurations. The results show that the cluster achieves a performance of up to 6.91 GFLOPS in homogeneous configurations of 6 Raspberry Pi 4 nodes, and that the use of heterogeneous nodes (including Raspberry Pi 3b) can negatively impact stability and efficiency. Additionally, the total electrical consumption of the system was measured during the runs, allowing for the estimation of the performance-to-consumption ratio (GFLOPS/W) as a comparative metric. This study constitutes a concrete contribution to the design, evaluation, and utilization of low-cost ARM clusters in educational and research contexts.

2025-12-08T15:08:09Z in Spanish language Martha Semken Mariano Vargas Ignacio Tula Giuliana Zorzoli Andrés Rojas Paredes http://arxiv.org/abs/2508.16653v2 H2EAL: Hybrid-Bonding Architecture with Hybrid Sparse Attention for Efficient Long-Context LLM Inference 2025-12-08T13:48:55Z

Large language models (LLMs) have demonstrated remarkable proficiency in a wide range of natural language processing applications. However, the high energy and latency overhead induced by the KV cache limits edge deployment, especially for long contexts. Emerging hybrid bonding (HB) technology has been proposed as a promising alternative to conventional near-memory processing (NMP) architectures, offering improved bandwidth efficiency and lower power consumption while exhibiting characteristics of distributed memory. In this paper, we propose H2EAL, a hybrid bonding-based accelerator with sparse attention algorithm-hardware co-design for efficient LLM inference at the edge. At the algorithm level, we propose a hybrid sparse attention scheme with static and dynamic sparsity for different heads to fully leverage the sparsity with high accuracy. At the hardware level, we co-design the hardware to support hybrid sparse attention and propose memory-compute co-placement to address the distributed memory bottleneck. Since different attention heads exhibit different sparse patterns and the attention structure often mismatches the HB architecture, we further develop a load-balancing scheduler with parallel tiled attention to address workload imbalance and optimize the mapping strategy. Extensive experiments demonstrate that H2EAL achieves a 5.20x to 48.21x speedup and a 6.22x to 73.48x energy efficiency improvement over the baseline HB implementation, with a negligible average accuracy drop of 0.87% on multiple benchmarks.

2025-08-20T03:42:37Z International Conference on Computer-Aided Design (ICCAD) 2025 Zizhuo Fu Xiaotian Guo Wenxuan Zeng Shuzhang Zhong Yadong Zhang Peiyu Chen Runsheng Wang Le Ye Meng Li http://arxiv.org/abs/2512.07449v1 AFarePart: Accuracy-aware Fault-resilient Partitioner for DNN Edge Accelerators 2025-12-08T11:25:11Z

Deep Neural Networks (DNNs) are increasingly deployed across distributed and resource-constrained platforms, such as System-on-Chip (SoC) accelerators and edge-cloud systems. DNNs are often partitioned and executed across heterogeneous processing units to optimize latency and energy. However, the reliability of these partitioned models under hardware faults and communication errors remains a critical yet underexplored topic, especially in safety-critical applications. In this paper, we propose an accuracy-aware, fault-resilient DNN partitioning framework targeting multi-objective optimization using NSGA-II, where accuracy degradation under fault conditions is introduced as a core metric alongside energy and latency. Our framework performs runtime fault injection during optimization and utilizes a feedback loop to prioritize fault-tolerant partitioning. We evaluate our approach on benchmark CNNs including AlexNet, SqueezeNet and ResNet18 on hardware accelerators, and demonstrate up to 27.7% improvement in fault tolerance with minimal increase in performance overhead. Our results highlight the importance of incorporating resilience into DNN partitioning, thereby paving the way for robust AI inference in error-prone environments.
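
To make the multi-objective framing concrete, the sketch below filters a set of candidate partitions down to the non-dominated (Pareto-optimal) ones over three objectives that are all minimized: latency, energy, and accuracy drop under faults. This is only the dominance test that underlies NSGA-II's sorting step, not a full NSGA-II implementation and not the authors' framework. The candidate names and objective values are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Names of candidates that no other candidate dominates."""
    return [
        name
        for name, objs in candidates.items()
        if not any(dominates(other, objs)
                   for other_name, other in candidates.items()
                   if other_name != name)
    ]

# Hypothetical partitions: (latency in ms, energy in mJ, accuracy drop in % under faults).
candidates = {
    "all_on_cpu": (9.0, 5.0, 0.5),
    "all_on_npu": (3.0, 2.0, 4.0),
    "split_a": (4.0, 2.5, 1.0),
    "split_b": (5.0, 3.0, 1.5),
}

front = pareto_front(candidates)
print(sorted(front))
```

Here split_b is dropped because split_a is at least as good on all three objectives, while the fastest and the most fault-tolerant options both survive as trade-offs.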

2025-12-08T11:25:11Z 6 pages, 4 figures, 2 tables Mukta Debnath University of Calcutta, India Krishnendu Guha University College Cork, Ireland Debasri Saha University of Calcutta, India Amlan Chakrabarti University of Calcutta, India Susmita Sur-Kolay Indian Statistical Institute, India http://arxiv.org/abs/2512.07011v1 Block Sparse Flash Attention 2025-12-07T21:20:12Z

Modern large language models increasingly require long contexts for reasoning and multi-document tasks, but attention's quadratic complexity creates a severe computational bottleneck. We present Block-Sparse FlashAttention (BSFA), a drop-in replacement that accelerates long-context inference while preserving model quality. Unlike methods that predict importance before computing scores, BSFA computes exact query-key similarities to select the top-k most important value blocks for each query. By comparing per-block maximum scores against calibrated thresholds, we skip approximately 50% of the computation and memory transfers for pruned blocks. Our training-free approach requires only a one-time threshold calibration on a small dataset to learn the per-layer and per-head attention score distributions. We provide a CUDA kernel implementation that can be used as a drop-in replacement for FlashAttention. On Llama-3.1-8B, BSFA achieves up to 1.10x speedup on real-world reasoning benchmarks and up to 1.24x for needle-in-a-haystack retrieval tasks while maintaining above 99% baseline accuracy, with certain configurations even improving accuracy by focusing on the most relevant content, substantially outperforming existing sparse attention methods. The implementation is available at https://github.com/Danielohayon/Block-Sparse-Flash-Attention
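
The block-skipping idea in this abstract can be sketched for a single query in plain Python. This is an illustrative reading of the abstract, not the released CUDA kernel: the function name, the single absolute threshold, and the fallback used when no block passes are all assumptions of the sketch. Scores are computed exactly for every key, each key block is summarized by its maximum score, and the value rows of blocks below the threshold are never touched.

```python
import math

def block_sparse_attention(q, keys, values, block_size, threshold):
    """Single-query attention that skips value blocks whose max score is below threshold."""
    scale = 1.0 / math.sqrt(len(q))
    # Exact query-key scores for all keys (no importance prediction).
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

    kept = []
    for start in range(0, len(keys), block_size):
        block = list(range(start, min(start + block_size, len(keys))))
        if max(scores[i] for i in block) >= threshold:
            kept.extend(block)
    if not kept:
        # Assumed fallback for this sketch: keep the single best-scoring key.
        kept = [max(range(len(keys)), key=scores.__getitem__)]

    # Numerically stable softmax restricted to the kept keys.
    top = max(scores[i] for i in kept)
    weights = [math.exp(scores[i] - top) for i in kept]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, kept):
        for d, v in enumerate(values[i]):
            out[d] += (w / total) * v
    return out, kept

q = [1.0, 0.0]
keys = [[4.0, 0.0], [3.0, 0.0], [-4.0, 0.0], [-5.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [9.0, 9.0]]
out, kept = block_sparse_attention(q, keys, values, block_size=2, threshold=0.0)
print(kept, out)
```

In this toy input the second block has only negative scores, so its values are skipped and the output is a mixture of the first two value rows only.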

2025-12-07T21:20:12Z 10 pages, 5 figures. Code: https://github.com/Danielohayon/Block-Sparse-Flash-Attention Daniel Ohayon Itay Lamprecht Itay Hubara Israel Cohen Daniel Soudry Noam Elata http://arxiv.org/abs/2310.18149v2 Game of arrivals at a two queue network with heterogeneous customer routes 2025-12-07T04:41:11Z

We consider a queuing network that opens at a specified time, where customers are non-atomic and belong to different classes. Each class has its own route, and as is typical in the literature, the costs are a linear function of waiting and service completion time. We restrict ourselves to a two-class, two-queue network: this simplification is well motivated, as the diversity in solution structure as a function of problem parameters is substantial even in this simple setting (e.g., a specific routing structure involves eight different regimes), suggesting a combinatorial blow-up as the number of queues, routes and customer classes increases. We identify the unique Nash equilibrium customer arrival profile when the customer linear cost preferences are different. This profile is a function of problem parameters including the size of each class, service rates at each queue, and customer cost preferences. When customer cost preferences match, under certain parametric settings, the equilibrium arrival profiles may not be unique and may lie in a convex set. We further make a surprising observation that in some parametric settings, customers in one class may arrive in disjoint intervals. Further, the two classes may arrive in contiguous intervals or in overlapping intervals, and at varying rates within an interval, depending upon the problem parameters.

2023-10-27T13:55:14Z discussions on the connection with non-fluid two queue network arrival games added; full version of a short paper with same title published in IFIP Performance 2025 Agniv Bandyopadhyay Sandeep Juneja http://arxiv.org/abs/2512.06390v1 Web Technologies Security in the AI Era: A Survey of CDN-Enhanced Defenses 2025-12-06T10:42:14Z

The modern web stack, which is dominated by browser-based applications and API-first backends, now operates under an adversarial equilibrium where automated, AI-assisted attacks evolve continuously. Content Delivery Networks (CDNs) and edge computing place programmable defenses closest to users and bots, making them natural enforcement points for machine-learning (ML) driven inspection, throttling, and isolation. This survey synthesizes the landscape of AI-enhanced defenses deployed at the edge: (i) anomaly- and behavior-based Web Application Firewalls (WAFs) within broader Web Application and API Protection (WAAP), (ii) adaptive DDoS detection and mitigation, (iii) bot management that resists human-mimicry, and (iv) API discovery, positive security modeling, and encrypted-traffic anomaly analysis. We add a systematic survey method, a threat taxonomy mapped to edge-observable signals, evaluation metrics, deployment playbooks, and governance guidance. We conclude with a research agenda spanning XAI, adversarial robustness, and autonomous multi-agent defense. Our findings indicate that edge-centric AI measurably improves time-to-detect and time-to-mitigate while reducing data movement and enhancing compliance, yet introduces new risks around model abuse, poisoning, and governance.

2025-12-06T10:42:14Z Accepted at 2025 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob). 7 pages, 5 figures 2025 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob), Bali, Indonesia, 2025, pp. 180-186 Mehrab Hosain Sabbir Alom Shuvo Matthew Ogbe Md Shah Jalal Mazumder Yead Rahman Md Azizul Hakim Anukul Pandey 10.1109/APWiMob67231.2025.11269122 http://arxiv.org/abs/2502.08804v3 Novel Lower Bounds on M/G/k Scheduling 2025-12-05T23:13:56Z

In queueing systems, effective scheduling algorithms are essential for optimizing performance. Optimal scheduling for the M/G/k queue has been explored in the heavy traffic limit, but much remains unknown in the intermediate load regime. In this paper, we give the first framework for proving nontrivial lower bounds on the mean response time of the M/G/k system under arbitrary scheduling policies. Our bounds tighten previous naive lower bounds by more than 60%, yielding significant improvements particularly for moderate loads. Key to our approach is a new variable-speed queue, which more accurately captures the work completion behavior of multiserver systems. To analyze the expected work of this queue, we develop a novel manner of employing the drift method or the BAR approach, by developing test functions via the solutions to a differential equation. We validate our results numerically for systems with up to 5 servers and a range of job size distributions.

2025-02-12T21:39:22Z Ziyuan Wang Izzy Grosof http://arxiv.org/abs/2512.05831v1 Dissecting Embedding Bag Performance in DLRM Inference 2025-12-05T15:54:51Z

As deep learning recommendation models (DLRMs) grow, they must be partitioned across multiple GPUs or nodes of GPUs because of the limit on the total HBM capacity that can be packaged in a GPU. This partitioning adds the communication and synchronization overhead of sending and receiving data across GPUs. We use the NCCL and NVSHMEM libraries to measure the performance of an Embedding Bag kernel implemented on H100 GPUs. We compare its performance across different batch sizes, numbers of tables, table sizes, pooling factors, and embedding dimensions. For a large embedding table that spans multiple GPUs, we project the performance slowdown from distributing an embedding table across multiple GPUs.
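
For readers unfamiliar with the operation being measured, the following is a minimal single-device sketch of an embedding bag with sum pooling, written in plain Python. It is not the authors' GPU kernel and involves no NCCL or NVSHMEM communication. The flat indices plus offsets layout is an assumption borrowed from the convention commonly used for this operation, and the table contents are made up.

```python
def embedding_bag_sum(table, indices, offsets):
    """Sum-pool rows of `table` per bag.

    `indices` is a flat list of row ids; bag b covers
    indices[offsets[b]:offsets[b + 1]] (the last bag runs to the end).
    """
    dim = len(table[0])
    bounds = list(offsets) + [len(indices)]
    pooled_bags = []
    for b in range(len(offsets)):
        pooled = [0.0] * dim
        for row_id in indices[bounds[b]:bounds[b + 1]]:
            row = table[row_id]  # one memory lookup per index
            for d in range(dim):
                pooled[d] += row[d]
        pooled_bags.append(pooled)
    return pooled_bags

# Hypothetical table with 3 rows and embedding dimension 2.
table = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
indices = [0, 2, 1]   # bag 0 pools rows 0 and 2; bag 1 pools row 1
offsets = [0, 2]      # batch size 2; pooling factors 2 and 1

result = embedding_bag_sum(table, indices, offsets)
print(result)
```

The parameters varied in the study map onto this sketch directly: batch size is the number of bags, pooling factor is the number of indices per bag, and embedding dimension is the row length. When the table is sharded across GPUs, some of the row lookups become remote accesses, which is where the communication overhead discussed in the abstract would arise.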

2025-12-05T15:54:51Z Chandrish Ambati Jing Ding Trung Diep http://arxiv.org/abs/2601.19904v1 DABench-LLM: Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators for LLMs 2025-12-04T22:43:14Z

The exponential growth of large language models has outpaced the capabilities of traditional CPU and GPU architectures due to the slowdown of Moore's Law. Dataflow AI accelerators present a promising alternative; however, there remains a lack of in-depth performance analysis and standardized benchmarking methodologies for LLM training. We introduce DABench-LLM, the first benchmarking framework designed for evaluating LLM workloads on dataflow-based accelerators. By combining intra-chip performance profiling and inter-chip scalability analysis, DABench-LLM enables comprehensive evaluation across key metrics such as resource allocation, load balance, and resource efficiency. The framework helps researchers rapidly gain insights into underlying hardware and system behaviors, and provides guidance for performance optimizations. We validate DABench-LLM on three commodity dataflow accelerators, Cerebras WSE-2, SambaNova RDU, and Graphcore IPU. Our framework reveals performance bottlenecks and provides specific optimization strategies, demonstrating its generality and effectiveness across a diverse range of dataflow-based AI hardware platforms.

2025-12-04T22:43:14Z Ziyu Hu Zhiqing Zhong Weijian Zheng Zhijing Ye Xuwei Tan Xueru Zhang Zheng Xie Rajkumar Kettimuthu Xiaodong Yu http://arxiv.org/abs/2512.03914v2 Integrating High Performance In-Memory Data Streaming and In-Situ Visualization in Hybrid MPI+OpenMP PIC MC Simulations Towards Exascale 2025-12-04T10:33:41Z

Efficient simulation of complex plasma dynamics is crucial for advancing fusion energy research. Particle-in-Cell (PIC) Monte Carlo (MC) simulations provide insights into plasma behavior, including turbulence and confinement, which are essential for optimizing fusion reactor performance. Transitioning to exascale simulations introduces significant challenges, with traditional file input/output (I/O) inefficiencies remaining a key bottleneck. This work advances BIT1, an electrostatic PIC MC code, by improving the particle mover with OpenMP task-based parallelism, integrating the openPMD streaming API, and enabling in-memory data streaming with ADIOS2's Sustainable Staging Transport (SST) engine to enhance I/O performance, computational efficiency, and system storage utilization. We employ profiling tools such as gprof, perf, IPM and Darshan, which provide insights into computation, communication, and I/O operations. We implement time-dependent data checkpointing with the openPMD API enabling seamless data movement and in-situ visualization for real-time analysis without interrupting the simulation. We demonstrate improvements in simulation runtime, data accessibility and real-time insights by comparing traditional file I/O with the ADIOS2 BP4 and SST backends. The proposed hybrid BIT1 openPMD SST enhancement introduces a new paradigm for real-time scientific discovery in plasma simulations, enabling faster insights and more efficient use of exascale computing resources.

2025-12-03T15:59:14Z Accepted by The International Journal of High Performance Computing Applications (IJHPCA). Prepared in English and formatted in the SAGE Publications (LaTeX) template; 22 pages, including the main text, references, and figures Jeremy J. Williams Stefan Costea Daniel Medeiros Jordy Trilaksono Pratibha Hegde David Tskhakaya Leon Kos Ales Podolnik Jakub Hromadka Kevin A. Huck Allen D. Malony Frank Jenko Erwin Laure Stefano Markidis 10.1177/10943420251409229