http://arxiv.org/abs/2512.09199v1LLMs for Analog Circuit Design Continuum (ACDC)2025-12-09T23:57:28ZLarge Language Models (LLMs) and transformer architectures have shown impressive reasoning and generation capabilities across diverse natural language tasks. However, their reliability and robustness in real-world engineering domains remain largely unexplored, limiting their practical utility in human-centric workflows. In this work, we investigate the applicability and consistency of LLMs for analog circuit design -- a task requiring domain-specific reasoning, adherence to physical constraints, and structured representations -- focusing on AI-assisted design where humans remain in the loop. We study how different data representations influence model behavior and compare smaller models (e.g., T5, GPT-2) with larger foundation models (e.g., Mistral-7B, GPT-oss-20B) under varying training conditions. Our results highlight key reliability challenges, including sensitivity to data format, instability in generated designs, and limited generalization to unseen circuit configurations. These findings provide early evidence on the limits and potential of LLMs as tools to enhance human capabilities in complex engineering tasks, offering insights into designing reliable, deployable foundation models for structured, real-world applications.2025-12-09T23:57:28ZYasaman EsfandiariJocelyn RegoAustin MeyerJonathan GallagherMia Levyhttp://arxiv.org/abs/2512.08715v1Multi-domain performance analysis with scores tailored to user preferences2025-12-09T15:29:53ZThe performance of algorithms, methods, and models tends to depend heavily on the distribution of cases on which they are applied, this distribution being specific to the applicative domain. 
After performing an evaluation in several domains, it is highly informative to compute a (weighted) mean performance and, as shown in this paper, to scrutinize what happens during this averaging. To achieve this goal, we adopt a probabilistic framework and consider a performance as a probability measure (e.g., a normalized confusion matrix for a classification task). It appears that the corresponding weighted mean is known to be the summarization, and that only some remarkable scores assign to the summarized performance a value equal to a weighted arithmetic mean of the values assigned to the domain-specific performances. These scores include the family of ranking scores, a continuum parameterized by user preferences, and the weights to consider in the arithmetic mean depend on the user preferences. Based on this, we rigorously define four domains, named easiest, most difficult, preponderant, and bottleneck domains, as functions of user preferences. After establishing the theory in a general setting, regardless of the task, we develop new visual tools for two-class classification.2025-12-09T15:29:53ZSébastien PiérardAdrien DeliègeMarc Van Droogenbroeckhttp://arxiv.org/abs/2511.22334v2Edge Deployment of Small Language Models, a comprehensive comparison of CPU, GPU and NPU backends2025-12-09T12:19:18ZEdge computing processes data where it is generated, enabling faster decisions, lower bandwidth usage, and improved privacy. However, edge devices typically operate under strict constraints on processing power, memory, and energy consumption, making them unsuitable for large language models (LLMs). Fortunately, Small Language Models (SLMs) offer lightweight alternatives that bring AI inference to resource-constrained environments by significantly reducing computational cost while remaining suitable for specialization and customization. 
In this scenario, selecting the hardware platform that best balances performance and efficiency for SLM inference is challenging due to strict resource limitations. To address this issue, this study evaluates the inference performance and energy efficiency of commercial CPUs (Intel and ARM), GPUs (NVIDIA), and NPUs (RaiderChip) for running SLMs. GPUs, the usual platform of choice, are compared against commercial NPUs and recent multi-core CPUs. While NPUs leverage custom hardware designs optimized for computation, modern CPUs increasingly incorporate dedicated features targeting language-model workloads. Using a common execution framework and a suite of state-of-the-art SLMs, we analyze both maximum achievable performance and processing and energy efficiency across commercial solutions available for each platform. The results indicate that specialized backends outperform general-purpose CPUs, with NPUs achieving the highest performance by a wide margin. Bandwidth normalization proves essential for fair cross-architecture comparisons. Although low-power ARM processors deliver competitive results when energy usage is considered, metrics that combine performance and power (such as EDP) again highlight NPUs as the dominant architecture. These findings show that designs optimized for both efficiency and performance offer a clear advantage for edge workloads.2025-11-27T11:11:01Z8 pages, 9 figuresPablo PrietoPablo Abadhttp://arxiv.org/abs/2512.08465v1High-performance computing enabled contingency analysis for modern power networks2025-12-09T10:38:49ZModern power networks face increasing vulnerability to cascading failures due to high complexity and the growing penetration of intermittent resources, necessitating rigorous security assessment beyond the conventional $N-1$ criterion. 
Current approaches often struggle to achieve the computational tractability required for exhaustive $N-2$ contingency analysis integrated with complex stability evaluations like small-signal stability. Addressing this computational bottleneck and the limitations of deterministic screening, this paper presents a scalable methodology for the vulnerability assessment of modern power networks, integrating $N-2$ contingency analysis with small-signal stability evaluation. To prioritize critical components, we propose a probabilistic \textbf{Risk Index ($R_i$)} that weights the deterministic \textit{severity} of a contingency (including optimal power flow divergence, islanding, and oscillatory instability) by the \textit{failure frequency} of the involved elements based on reliability data. The proposed framework is implemented using High-Performance Computing (HPC) techniques through the PyCOMPSs parallel programming library, orchestrating optimal power flow simulations (VeraGrid) and small-signal analysis (STAMP) to enable the exhaustive exploration of massive contingency sets. The methodology is validated on the IEEE 118-bus test system, processing more than \num{57000} scenarios to identify components prone to triggering cascading failures. Results demonstrate that the risk-based approach effectively isolates critical assets that deterministic $N-1$ criteria often overlook. 
This work establishes a replicable and efficient workflow for probabilistic security assessment, suitable for large-scale networks and capable of supporting operator decision-making in near real-time environments.2025-12-09T10:38:49Z10 pages, 5 figures, to be submitted to IJEPESAlexandre Gracia-CalvoFrancesca RossiEduardo IraolaJuan Carlos Olives-CampsEduardo Prieto-Araujohttp://arxiv.org/abs/2509.08207v2Aurora: Architecting Argonne's First Exascale Supercomputer for Accelerated Scientific Discovery2025-12-08T18:45:43ZAurora is Argonne National Laboratory's pioneering Exascale supercomputer, designed to accelerate scientific discovery with cutting-edge architectural innovations. Key new technologies include the Intel(TM) Xeon(TM) CPU Max Series (code-named Sapphire Rapids) with support for High Bandwidth Memory (HBM), alongside the Intel(TM) Data Center GPU Max Series (code-named Ponte Vecchio) on each compute node. Aurora also integrates the Distributed Asynchronous Object Storage (DAOS), a novel exascale storage solution, and leverages Intel's oneAPI programming environment. This paper presents an in-depth exploration of Aurora's node architecture, the HPE Slingshot interconnect, the supporting software ecosystem, and DAOS. We provide insights into standard benchmark performance and applications readiness efforts via Aurora's Early Science Program and the Exascale Computing Project.2025-09-10T00:30:05Z40 pages, 10 figures. Submitted to J. SupercomputingWilliam E. AllcockBenjamin S. AllenJames AnchellVictor AnisimovThomas ApplencourtAbhishek BagusettyRamesh BalakrishnanRiccardo BalinSolomon BekeleColleen BertoniCyrus BlackworthRenzo BustamanteKevin CanadaJohn CarrierChristopher Chan-nuiLance C. CheneyTaylor ChildersPaul CoffmanSusan CoghlanTanima DeyMichael D'MelloAshok EmaniMurali EmaniKyle G. 
FelkerSam ForemanOlivier FranzaLongfei GaoMarta GarcíaMaría GarzaránBalazs GerofiYasaman GhadarSubrata GoswamiNeha GuptaKevin HarmsVäinö HatanpääBrian HollandCarissa HolohanBrian HomerdingKhalid HossainXue HuLouise HuotHuda IbeidJoseph A. InsleySai JayanthiHong JiangWei JiangXiao-Yong JinJeongnim KimChristopher KnightPanagiotis KourdisKalyan KumaranJaeHyuk KwackJanghaeng LeeTi LeggettBen LenardChris LewisNevin LiberJohann LombardiRaymond M. LoyYe LuoBethany LuschNilakantan MahadevanBeth MarkeyVictor A. MateevitsiGordon McPheetersRyan MilnerJerome MitchellVitali A. MorozovServesh MuralidharanTom MustaMrigendra NagarVikram NarayanaMarieme NgomAnthony-Trung NguyenNathan NicholsAditya NishtalaJames C. OsbornMichael E. PapkaScott ParkerSaumil S. PatelJulia PiotrowskaAdrian C. PopeSucheta RaghunandaEsteban RangelPaul M. RichKatherine M. RileySilvio RizziKris RoweVaruni SastryAdam ScovelFilippo SiminiHaritha Siddabathuni SomPatrick SteinbrecherRick StevensXinmin TianPeter UptonThomas UramArchit K. VasanÁlvaro Vázquez-MayagoitiaKaushik VelusamyBrice VideauVenkatram VishwanathBrian WhitneyTimothy J. WilliamsMichael WoodacreSam ZeltnerChuanjun ZhangGengbin ZhengHuihuo Zhenghttp://arxiv.org/abs/2512.07622v1Análisis de rendimiento y eficiencia energética en el cluster Raspberry Pi Cronos2025-12-08T15:08:09ZThis article presents an evaluation of the computational performance and energy efficiency of the Cronos cluster, composed of Raspberry Pi4 and 3b microcomputers designed for educational purposes. Experimental tests were performed using the High Performance Linpack (HPL) benchmark, under a resource management environment configured with Slurm and parallel communication via Open MPI. The study focuses on analyzing scalability, stability, and power consumption during the execution of computationally intensive workloads, considering different node configurations. 
The results show that the cluster achieves a performance of up to 6.91 GFLOPS in homogeneous configurations of 6 Raspberry Pi 4 nodes, and that the use of heterogeneous nodes (including Raspberry Pi 3b) can negatively impact stability and efficiency. Additionally, the total electrical consumption of the system was measured during the runs, allowing for the estimation of the performance-to-consumption ratio (GFLOPS/W) as a comparative metric. This study constitutes a concrete contribution to the design, evaluation, and utilization of low-cost ARM clusters in educational and research contexts.2025-12-08T15:08:09Zin Spanish languageMartha SemkenMariano VargasIgnacio TulaGiuliana ZorzoliAndrés Rojas Paredeshttp://arxiv.org/abs/2508.16653v2H2EAL: Hybrid-Bonding Architecture with Hybrid Sparse Attention for Efficient Long-Context LLM Inference2025-12-08T13:48:55ZLarge language models (LLMs) have demonstrated remarkable proficiency in a wide range of natural language processing applications. However, the high energy and latency overhead induced by the KV cache limits the edge deployment, especially for long contexts. Emerging hybrid bonding (HB) technology has been proposed as a promising alternative to conventional near-memory processing (NMP) architectures, offering improved bandwidth efficiency and lower power consumption while exhibiting characteristics of distributed memory. In this paper, we propose H2EAL, a hybrid bonding-based accelerator with sparse attention algorithm-hardware co-design for efficient LLM inference at the edge. At the algorithm level, we propose a hybrid sparse attention scheme with static and dynamic sparsity for different heads to fully leverage the sparsity with high accuracy. At the hardware level, we co-design the hardware to support hybrid sparse attention and propose memory-compute co-placement to address the distributed memory bottleneck. 
Since different attention heads exhibit different sparse patterns and the attention structure often mismatches the HB architecture, we further develop a load-balancing scheduler with parallel tiled attention to address workload imbalance and optimize the mapping strategy. Extensive experiments demonstrate H2EAL achieves 5.20~48.21x speedup and 6.22~73.48x energy efficiency improvement over baseline HB implementation, with a negligible average accuracy drop of 0.87% on multiple benchmarks.2025-08-20T03:42:37ZInternational Conference on Computer-Aided Design (ICCAD) 2025Zizhuo FuXiaotian GuoWenxuan ZengShuzhang ZhongYadong ZhangPeiyu ChenRunsheng WangLe YeMeng Lihttp://arxiv.org/abs/2512.07449v1AFarePart: Accuracy-aware Fault-resilient Partitioner for DNN Edge Accelerators2025-12-08T11:25:11ZDeep Neural Networks (DNNs) are increasingly deployed across distributed and resource-constrained platforms, such as System-on-Chip (SoC) accelerators and edge-cloud systems. DNNs are often partitioned and executed across heterogeneous processing units to optimize latency and energy. However, the reliability of these partitioned models under hardware faults and communication errors remains a critical yet underexplored topic, especially in safety-critical applications. In this paper, we propose an accuracy-aware, fault-resilient DNN partitioning framework targeting multi-objective optimization using NSGA-II, where accuracy degradation under fault conditions is introduced as a core metric alongside energy and latency. Our framework performs runtime fault injection during optimization and utilizes a feedback loop to prioritize fault-tolerant partitioning. We evaluate our approach on benchmark CNNs including AlexNet, SqueezeNet and ResNet18 on hardware accelerators, and demonstrate up to 27.7% improvement in fault tolerance with minimal increase in performance overhead. 
Our results highlight the importance of incorporating resilience into DNN partitioning, thereby paving the way for robust AI inference in error-prone environments.2025-12-08T11:25:11Z6 pages, 4 figures, 2 tablesMukta DebnathUniversity of Calcutta, IndiaKrishnendu GuhaUniversity College Cork, IrelandDebasri SahaUniversity of Calcutta, IndiaAmlan ChakrabartiUniversity of Calcutta, IndiaSusmita Sur-KolayIndian Statistical Institute, Indiahttp://arxiv.org/abs/2512.07011v1Block Sparse Flash Attention2025-12-07T21:20:12ZModern large language models increasingly require long contexts for reasoning and multi-document tasks, but attention's quadratic complexity creates a severe computational bottleneck. We present Block-Sparse FlashAttention (BSFA), a drop-in replacement that accelerates long-context inference while preserving model quality. Unlike methods that predict importance before computing scores, BSFA computes exact query-key similarities to select the top-k most important value blocks for each query. By comparing per-block maximum scores against calibrated thresholds, we skip approximately 50% of the computation and memory transfers for pruned blocks. Our training-free approach requires only a one-time threshold calibration on a small dataset to learn the per-layer and per-head attention score distributions. We provide a CUDA kernel implementation that can be used as a drop-in replacement for FlashAttention. On Llama-3.1-8B, BSFA achieves up to 1.10x speedup on real-world reasoning benchmarks and up to 1.24x for needle-in-a-haystack retrieval tasks while maintaining above 99% baseline accuracy, with certain configurations even improving accuracy by focusing on the most relevant content, substantially outperforming existing sparse attention methods. The implementation is available at https://github.com/Danielohayon/Block-Sparse-Flash-Attention2025-12-07T21:20:12Z10 pages, 5 figures. 
Code: https://github.com/Danielohayon/Block-Sparse-Flash-AttentionDaniel OhayonItay LamprechtItay HubaraIsrael CohenDaniel SoudryNoam Elatahttp://arxiv.org/abs/2310.18149v2Game of arrivals at a two queue network with heterogeneous customer routes2025-12-07T04:41:11ZWe consider a queuing network that opens at a specified time, where customers are non-atomic and belong to different classes. Each class has its own route, and as is typical in the literature, the costs are a linear function of waiting and service completion time. We restrict ourselves to a two class, two queue network: this simplification is well motivated as the diversity in solution structure as a function of problem parameters is substantial even in this simple setting (e.g., a specific routing structure involves eight different regimes), suggesting a combinatorial blow up as the number of queues, routes and customer classes increase. We identify the unique Nash equilibrium customer arrival profile when the customer linear cost preferences are different. This profile is a function of problem parameters including the size of each class, service rates at each queue, and customer cost preferences. When customer cost preferences match, under certain parametric settings, the equilibrium arrival profiles may not be unique and may lie in a convex set. We further make a surprising observation that in some parametric settings, customers in one class may arrive in disjoint intervals. 
Further, the two classes may arrive in contiguous intervals or in overlapping intervals, and at varying rates within an interval, depending upon the problem parameters.2023-10-27T13:55:14Zdiscussions on the connection with non-fluid two queue network arrival games added; full version of a short paper with same title published in IFIP Performance 2025Agniv BandyopadhyaySandeep Junejahttp://arxiv.org/abs/2512.06390v1Web Technologies Security in the AI Era: A Survey of CDN-Enhanced Defenses2025-12-06T10:42:14ZThe modern web stack, which is dominated by browser-based applications and API-first backends, now operates under an adversarial equilibrium where automated, AI-assisted attacks evolve continuously. Content Delivery Networks (CDNs) and edge computing place programmable defenses closest to users and bots, making them natural enforcement points for machine-learning (ML) driven inspection, throttling, and isolation. This survey synthesizes the landscape of AI-enhanced defenses deployed at the edge: (i) anomaly- and behavior-based Web Application Firewalls (WAFs) within broader Web Application and API Protection (WAAP), (ii) adaptive DDoS detection and mitigation, (iii) bot management that resists human-mimicry, and (iv) API discovery, positive security modeling, and encrypted-traffic anomaly analysis. We add a systematic survey method, a threat taxonomy mapped to edge-observable signals, evaluation metrics, deployment playbooks, and governance guidance. We conclude with a research agenda spanning XAI, adversarial robustness, and autonomous multi-agent defense. Our findings indicate that edge-centric AI measurably improves time-to-detect and time-to-mitigate while reducing data movement and enhancing compliance, yet introduces new risks around model abuse, poisoning, and governance.2025-12-06T10:42:14ZAccepted at 2025 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob). 
7 pages, 5 figures2025 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob), Bali, Indonesia, 2025, pp. 180-186Mehrab HosainSabbir Alom ShuvoMatthew OgbeMd Shah Jalal MazumderYead RahmanMd Azizul HakimAnukul Pandey10.1109/APWiMob67231.2025.11269122http://arxiv.org/abs/2502.08804v3Novel Lower Bounds on M/G/k Scheduling2025-12-05T23:13:56ZIn queueing systems, effective scheduling algorithms are essential for optimizing performance. Optimal scheduling for the M/G/k queue has been explored in the heavy traffic limit, but much remains unknown in the intermediate load regime.
In this paper, we give the first framework for proving nontrivial lower bounds on the mean response time of the M/G/k system under arbitrary scheduling policies. Our bounds tighten previous naive lower bounds by more than 60\%, yielding significant improvements particularly for moderate loads. Key to our approach is a new variable-speed queue, which more accurately captures the work completion behavior of multiserver systems. To analyze the expected work of this queue, we develop a novel manner of employing the drift method or the BAR approach, by developing test functions via the solutions to a differential equation.
We validate our results numerically for systems with up to 5 servers and a range of job size distributions.2025-02-12T21:39:22ZZiyuan WangIzzy Grosofhttp://arxiv.org/abs/2512.05831v1Dissecting Embedding Bag Performance in DLRM Inference2025-12-05T15:54:51ZAs the size of DLRMs gets larger, the models must be partitioned across multiple GPUs or nodes of GPUs due to the size limitation of total HBM that can be packaged in a GPU. This partitioning adds communication and synchronization overhead of sending and receiving data across GPUs. We use the NCCL and NVSHMEM libraries to measure the performance of an Embedding Bag kernel implemented on H100 GPUs. We compare its performance across different batch sizes, number of tables, table sizes, pooling factors, and embedding dimensions. For a large embedding table that spans multiple GPUs, we project the performance slowdown from distributing an embedding table across multiple GPUs.2025-12-05T15:54:51ZChandrish AmbatiJing DingTrung Diephttp://arxiv.org/abs/2601.19904v1DABench-LLM: Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators for LLMs2025-12-04T22:43:14ZThe exponential growth of large language models has outpaced the capabilities of traditional CPU and GPU architectures due to the slowdown of Moore's Law. Dataflow AI accelerators present a promising alternative; however, there remains a lack of in-depth performance analysis and standardized benchmarking methodologies for LLM training. We introduce DABench-LLM, the first benchmarking framework designed for evaluating LLM workloads on dataflow-based accelerators. By combining intra-chip performance profiling and inter-chip scalability analysis, DABench-LLM enables comprehensive evaluation across key metrics such as resource allocation, load balance, and resource efficiency. The framework helps researchers rapidly gain insights into underlying hardware and system behaviors, and provides guidance for performance optimizations. 
We validate DABench-LLM on three commodity dataflow accelerators, Cerebras WSE-2, SambaNova RDU, and Graphcore IPU. Our framework reveals performance bottlenecks and provides specific optimization strategies, demonstrating its generality and effectiveness across a diverse range of dataflow-based AI hardware platforms.2025-12-04T22:43:14ZZiyu HuZhiqing ZhongWeijian ZhengZhijing YeXuwei TanXueru ZhangZheng XieRajkumar KettimuthuXiaodong Yuhttp://arxiv.org/abs/2512.03914v2Integrating High Performance In-Memory Data Streaming and In-Situ Visualization in Hybrid MPI+OpenMP PIC MC Simulations Towards Exascale2025-12-04T10:33:41ZEfficient simulation of complex plasma dynamics is crucial for advancing fusion energy research. Particle-in-Cell (PIC) Monte Carlo (MC) simulations provide insights into plasma behavior, including turbulence and confinement, which are essential for optimizing fusion reactor performance. Transitioning to exascale simulations introduces significant challenges, with traditional file input/output (I/O) inefficiencies remaining a key bottleneck. This work advances BIT1, an electrostatic PIC MC code, by improving the particle mover with OpenMP task-based parallelism, integrating the openPMD streaming API, and enabling in-memory data streaming with ADIOS2's Sustainable Staging Transport (SST) engine to enhance I/O performance, computational efficiency, and system storage utilization. We employ profiling tools such as gprof, perf, IPM and Darshan, which provide insights into computation, communication, and I/O operations. We implement time-dependent data checkpointing with the openPMD API enabling seamless data movement and in-situ visualization for real-time analysis without interrupting the simulation. We demonstrate improvements in simulation runtime, data accessibility and real-time insights by comparing traditional file I/O with the ADIOS2 BP4 and SST backends. 
The proposed hybrid BIT1 openPMD SST enhancement introduces a new paradigm for real-time scientific discovery in plasma simulations, enabling faster insights and more efficient use of exascale computing resources.2025-12-03T15:59:14ZAccepted by The International Journal of High Performance Computing Applications (IJHPCA) prepared in English, formatted in SAGE Publications (LaTeX) template and consists of 22 pages, which includes the main text, references, and figuresJeremy J. WilliamsStefan CosteaDaniel MedeirosJordy TrilaksonoPratibha HegdeDavid TskhakayaLeon KosAles PodolnikJakub HromadkaKevin A. HuckAllen D. MalonyFrank JenkoErwin LaureStefano Markidis10.1177/10943420251409229