https://arxiv.org/api/naQJXrQH6vAIqkkWgqRjkRbBRy4 2026-06-10T12:42:30Z 28838 225 15 http://arxiv.org/abs/2605.29868v1 Ciphera: A Decentralised Biometric Identity Framework 2026-05-28T12:52:12Z Centralised biometric identity systems expose users to single points of failure, opaque verification processes, and irreversible biometric compromise. Decentralised Identifiers (DIDs) and Verifiable Credentials (VCs) offer stronger privacy guarantees, yet their integration with biometric authentication and distributed verification remains insufficiently explored. This paper presents Ciphera, a decentralised biometric identity framework combining privacy-preserving facial recognition, multi-node verification, IPFS-based credential metadata storage, and blockchain-anchored revocation. Evaluated across functional, performance, security, and distributed consistency dimensions, Ciphera achieved an 81% functional success rate, with stable enrolment and authentication but measurable revocation propagation delays and occasional audit-log inconsistencies. Performance testing demonstrated sub-second p95 verification latency of approximately 820ms under concurrent multi-node conditions. Security analysis confirmed strong confidentiality and integrity guarantees, though incomplete liveness detection leaves susceptibility to deepfake and replay attacks. The results demonstrate the feasibility of decentralised biometric identity while identifying key engineering challenges for production-grade deployment. 2026-05-28T12:52:12Z Accepted at the CyberAI 2026 Conference, and to be indexed at IEEE-Scopus CyberAI 2026 (https://cyberai-conf.org/) Ankit Kanaiyalal Prajapati Shahzad Memon Mohammed Mahir Rahman Ameer Al-Nemrat http://arxiv.org/abs/2605.29752v1 From Roofline to Ruggedness: Decomposing and Smoothing the GEMM Performance Landscape 2026-05-28T10:50:53Z Adjacent GEMM problems that differ by a single 128-element step in N can show 30% different throughput on the same GPU. 
This pervasive performance ruggedness - invisible to roofline analysis and peak-FLOPs intuition, yet dominant for every non-peak workload - is the subject of this paper. We propose performance ruggedness analysis as an analytical framework complementary to roofline: rather than summarizing GPU performance with a scalar bound, we treat the full multidimensional performance surface as the object of study, decompose its texture into mechanism-attributable components, and separate software-removable contributions from hardware-bound ones. The framing is directly analogous to deep-learning loss landscapes - a continuous quantity (the idealized time 2MNK / compute_throughput_peak) made rugged by interaction with discrete hardware substrates (tiles, sub-groups, cache lines, DRAM channels). We apply the framework to BF16 NN (no transpose) GEMM on Intel Battlemage (Arc B580, sycl-tla) via a 32,768-configuration sweep over (M, N, K) in {128, ..., 4096}^3. The peak is 110.8 TFLOPs at the non-square shape M=3840, N=2048, K=4096 with the default tile size; the initial landscape roughness is 16.8 TFLOPs per 128-step against an ideal of 2.0. A two-stage software stack - (i) best-of-six dynamic tile selection and (ii) a novel dynamic-programming based padding-and-splitting optimizer with O(1) runtime lookup - reduces roughness by 70% and raises mean throughput by 30%. Cross-tile experiments establish that the residual sawtooth period scales exactly with software tile size, ruling out cache set conflicts and attributing the remaining variance to four hardware-bound sources (per-kernel base overhead, wave quantization, DPAS atom geometry and GDDR6 channel-hash interactions). 
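As a rough illustration of the quantities named in this abstract, the sketch below computes the idealized GEMM time 2MNK / compute_throughput_peak and one possible roughness measure. The abstract does not define roughness precisely; the mean absolute throughput difference between adjacent 128-step configurations used here is an assumption, and the function names are hypothetical.

```python
def ideal_time_s(m, n, k, peak_flops):
    """Idealized GEMM time: 2*M*N*K floating-point operations at peak throughput."""
    return 2.0 * m * n * k / peak_flops


def roughness(throughputs):
    """Mean absolute difference between neighbouring samples.

    ASSUMPTION: this is one plausible reading of "TFLOPs per 128-step";
    the paper's exact definition is not given in the abstract.
    """
    diffs = [abs(b - a) for a, b in zip(throughputs, throughputs[1:])]
    return sum(diffs) / len(diffs)


# Peak shape and peak throughput quoted in the abstract.
t = ideal_time_s(3840, 2048, 4096, 110.8e12)

# Hypothetical throughput samples (TFLOPs) along N in 128-element steps.
r = roughness([100.0, 70.0, 100.0])
```

With the quoted peak of 110.8 TFLOPs, the idealized time for the M=3840, N=2048, K=4096 shape comes out to roughly 0.58 ms.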
2026-05-28T10:50:53Z Aditya Chatterjee http://arxiv.org/abs/2605.29740v1 CARM Tool: Cache-Aware Roofline Model Automatic Benchmarking and Application Analysis 2026-05-28T10:35:28Z In recent years, HPC systems, and the CPU architectures at their core, have become increasingly complex, making application development and optimization quite challenging. In this respect, intuitive performance models like the Cache-aware Roofline Model (CARM) offer effective guidance by providing insights into bottlenecks that limit the application's ability to reach the system's maximum performance. To fully exploit the benefits of CARM optimization guidance for application development, automatic tools for cross-architecture model construction and in-depth application characterization are essential. Given the plethora of existing CPU architectures, the current landscape of CARM-enabled tools is either vendor-specific (Intel Advisor), insufficiently developed (ARM), or simply non-existent (AMD, RISC-V). This work intends to close this gap by bringing automatic CARM support to all major CPU architectures and ISAs, i.e., x86 (Intel, AMD), ARM, and RISC-V, by developing assembly microbenchmarks specifically tailored to cover the full performance spectrum of modern CPUs (from scalar to all supported vector ISA extensions) for both computational units and all memory hierarchy levels. Additionally, this work integrates application analysis within the CARM framework using performance counters and dynamic binary instrumentation. Experimental results show that the CARM roofs constructed with the proposed automated framework provide less than a 1% deviation across various tested architectural maximums. 
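For readers unfamiliar with the model, a cache-aware roofline bounds attainable performance at each memory level by min(peak compute, arithmetic intensity x bandwidth of that level). The sketch below evaluates that bound only; the function name and all numbers are hypothetical and are not taken from the CARM Tool or its benchmarks.

```python
def carm_attainable(ai, peak_gflops, bandwidths_gbs):
    """Attainable GFLOP/s per memory level under a cache-aware roofline.

    ai             -- arithmetic intensity in FLOP/byte
    peak_gflops    -- peak compute throughput in GFLOP/s
    bandwidths_gbs -- {level name: bandwidth in GB/s} (hypothetical values)
    """
    return {level: min(peak_gflops, ai * bw) for level, bw in bandwidths_gbs.items()}


# Hypothetical machine: 100 GFLOP/s peak, 400 GB/s L1, 50 GB/s DRAM.
roofs = carm_attainable(0.5, 100.0, {"L1": 400.0, "DRAM": 50.0})
```

In this made-up configuration a kernel with 0.5 FLOP/byte is compute-bound when served from L1 and bandwidth-bound when served from DRAM.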
2026-05-28T10:35:28Z published on IISWC '24 (International Symposium on Workload Characterization) José Morgado Leonel Sousa Aleksandar Ilic 10.1109/IISWC63097.2024.00016 http://arxiv.org/abs/2605.29728v1 PRISM: Processing-In-Memory Sparse MTTKRP for Tensor Decomposition Acceleration 2026-05-28T10:21:45Z Sparse tensors are the most used representation of sparse multidimensional data. Operations that decompose them, selecting their most important features while reducing their dimension, have become prevalent procedures in machine learning. One of the most used tensor decomposition algorithms is the Alternating Least Squares Canonical Polyadic Decomposition (CP-ALS), where the most time-consuming operation is the Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP). This operation is strongly memory-bound, making it hard to implement efficiently on general-purpose processors. This work proposes PRISM, the first approach to tackle this operation using Processing-In-Memory (PIM) technology. We extensively characterize different partitioning strategies, number formats, and kernel optimizations that efficiently adapt this operation to UPMEM PIM, which is further boosted by heterogeneous collaboration with the CPU. The experimental results show that the proposed PIM-based and heterogeneous approaches achieve up to 2.37x and 2.64x speedup compared to state-of-the-art CPU implementations, respectively. However, the UPMEM distributed memory system can significantly hinder performance on certain workloads. Nonetheless, the efficiency of resource consumption for this approach, measured by peak performance fraction usage, is significantly higher than for both CPU and GPU. 
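To make the spMTTKRP operation concrete, the following is a minimal reference sketch for a 3-way sparse tensor in coordinate form: each nonzero (i, j, k, val) adds val * B[j][r] * C[k][r] to output row i. It is a plain CPU illustration of the textbook definition, not the PRISM kernel or its UPMEM partitioning; the function name and data are hypothetical.

```python
def sp_mttkrp_mode0(nonzeros, B, C, num_rows, rank):
    """Mode-0 sparse MTTKRP over COO nonzeros (i, j, k, val).

    B and C are dense factor matrices (lists of rows) with `rank` columns.
    """
    out = [[0.0] * rank for _ in range(num_rows)]
    for i, j, k, val in nonzeros:
        for r in range(rank):
            # Each nonzero reads one row of B and one row of C: little
            # arithmetic per byte moved, hence the memory-bound behaviour.
            out[i][r] += val * B[j][r] * C[k][r]
    return out


# Hypothetical 2x2x2 tensor with two nonzeros and rank-2 factors.
nnz = [(0, 0, 1, 2.0), (1, 1, 0, 3.0)]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[5.0, 6.0], [7.0, 8.0]]
M = sp_mttkrp_mode0(nnz, B, C, num_rows=2, rank=2)
```

The irregular row accesses into B and C driven by the nonzero coordinates are what make the operation hard to run efficiently on general-purpose processors.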
2026-05-28T10:21:45Z published on IISWC '25 (International Symposium on Workload Characterization) Daniel Pacheco Leonel Sousa Aleksandar Ilic 10.1109/IISWC66894.2025.00029 http://arxiv.org/abs/2605.22831v2 Monte Cimone v3: Where RISC-V Stands in High-Performance Computing 2026-05-28T09:57:36Z The Monte Cimone project provides a RISC-V testbed for High-Performance Computing clusters. This paper presents Monte Cimone v3 (MCv3), the third iteration of the Monte Cimone RISC-V HPC cluster, integrating the SOPHGO Sophon SG2044 processor, an evolution of the SG2042 used in MCv2. We characterize MCv3 using HPL and STREAM benchmarks coupled with power measurements, and compare it against two reference platforms: the Intel Xeon Platinum 8480+ (Sapphire Rapids) and the NVIDIA Grace CPU Superchip. Our results show that the SG2044 more than doubles single-core performance and improves scalability compared to the SG2042. MCv3 achieves an energy efficiency of 3.08 GFLOPs/W, a 10x improvement over MCv1 that places it in the range of x86-64 and Arm servers. In raw performance normalized to the SIMD/vector length, MCv3 at its peak-efficiency point (16 cores) achieves 46% of the performance of the Intel Sapphire Rapids server and 91% of that of the NVIDIA Grace CPU Superchip. 2026-04-22T10:59:03Z Extended abstract for RISC-V Summit Europe 2026 Emanuele Venieri Simone Manoni Giacomo Madella Federico Proverbio Federico Ficarelli Luca Benini Andrea Bartolini http://arxiv.org/abs/2605.29664v1 AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training 2026-05-28T09:25:12Z Pipeline parallelism is essential for large-scale model training, but existing asynchronous approaches often degrade convergence due to parameter mismatch between forward and backward passes. We propose Asynchronous Multi-Directional Pipeline parallelism (AMDP) to mitigate this issue while sustaining high utilization. 
AMDP limits the first stage of each pipeline to process at most two minibatches before backpropagation, bounding the number of parameter updates between forward and backward passes. To alleviate the resulting pipeline bubbles, AMDP launches multiple concurrent pipelines and adapts their number according to pipeline depth. In addition, AMDP accumulates gradients across minibatches and applies them in a single update, ensuring that only a bounded number of minibatches experience parameter mismatch, limited to within one optimization step. Experiments on GPT- and BERT-style models demonstrate that AMDP significantly accelerates training while preserving convergence. 2026-05-28T09:25:12Z Accepted by ICML 2026, 9 pages, and 8 figures Ling Chen Houming Wu Wenjie Yu http://arxiv.org/abs/2605.29604v1 TC-MIS: Maximal Independent Set on Tensor-cores 2026-05-28T08:40:56Z Computing a Maximal Independent Set (MIS) of a graph is a fundamental problem with applications in resource allocation, scheduling, and network optimization. Although graphs are inherently unstructured and challenging for GPU parallelism due to irregular memory access and workload imbalance, specialized GPU algorithms have achieved good performance, processing million-vertex graphs in milliseconds. Modern GPUs are equipped with Tensor Cores (TCs), specialized units for matrix operations with 8-16x higher throughput than CUDA Cores (CCs), which are extensively used for ML, DL, and inference tasks but remain largely unexplored for graph algorithms. In this paper, we present TC-MIS, a TC-accelerated algorithm that reformulates key phases of MIS computation as sparse matrix-vector multiplication (SpMV). TC-MIS tiles the graph adjacency matrix and employs Warp Matrix Multiply-Accumulate (WMMA) operations to transform irregular graph traversal into regular, massively parallel computation. 
Our evaluation across TC-enabled microarchitectures (Ampere, Ada Lovelace, Hopper, Blackwell) demonstrates that TC-MIS achieves an average speedup of 2.84x on RTX A5000, 4.84x on L40S, 18.80x on H200 GPUs, and 5.20x on RTX 5080 with a maximum speedup of 44.38x on H200 GPU over state-of-the-art methods, while maintaining solution quality comparable to that obtained by established heuristics that produce near-maximum independent sets. 2026-05-28T08:40:56Z Prajjwal Nijhara Dip Sankar Banerjee http://arxiv.org/abs/2605.29573v1 Design and Implementation of a Serverless MapReduce Framework for Scalable Data Pipelines 2026-05-28T08:20:51Z Modern logistics systems tend to generate continuous streams of data from sources such as GPS, IoT sensors, and logistics management systems. The aggregation, processing, and analysis of data have become vital for monitoring operations, optimizing efficiency, and responding quickly to decision making tasks. In this paper, an event-driven MapReduce framework for real-time data processing in logistics environments is presented. This system runs on Kubernetes with Knative and utilizes Apache Kafka as the backbone for communication between the components. This platform is composed of five loosely coupled services that receive, process, and aggregate the incoming data in real-time. Redis is used to preserve workflow metadata, while an AWS S3 service provides persistent storage for the framework. The design is inspired by the MapReduce programming model. It integrates Function-as-a-Service (FaaS) principles with distributed processing techniques that allow configurable scaling based on the workload demands and the underlying hardware. Experimental evaluation shows that the system can scale effectively as the input data volume increases while supporting scale-to-zero, on-demand processing. 
2026-05-28T08:20:51Z Angelos Dorotheos Chatzopoulos Babis Andreou Kakia Panagidi Stathes Hadjiefthymiades http://arxiv.org/abs/2605.29506v1 Silent Data Corruption Protection through Efficient Task Replication 2026-05-28T07:29:28Z The trend of increasing cluster sizes of supercomputers leads to a growing susceptibility to Silent Data Corruption (SDC) that can invalidate program results. A common strategy for SDC protection is replication, where the computation is repeated, and the correct result is determined as the one that is the same in at least two different computations. Applying replication to Asynchronous Many-Task (AMT) runtimes on clusters is challenging due to dynamic task spawning and work stealing, which complicate the identification of replicated tasks. To address the challenge, this paper introduces a novel replication scheme that detects and corrects SDCs for nested fork-join programs. Briefly stated, our approach replicates the computation and records the task tree. Upon a mismatch in the final result, it traverses the tree top-down to identify all corrupted tasks that could have impacted the final result. Recovery is then performed by recomputing these tasks, while the results of correct child tasks are reused. We demonstrate our implementation within a variant of the Itoyori cluster AMT runtime. Our experimental results suggest that the time to identify and reprocess the affected tasks is negligible. The paper concludes by discussing the adaptability of our scheme to tasks that cooperate through futures. 2026-05-28T07:29:28Z preprint Mia Reitz Claudia Fohry http://arxiv.org/abs/2605.29346v1 Understanding and Reducing Metadata-Driven Host Overheads in Sampling-Based GNN Training 2026-05-28T04:37:24Z Modern deep learning workloads increasingly exhibit dynamic, metadata-driven execution, where runtime-generated information determines memory provisioning and kernel launch decisions. 
In sampling-based graph neural network (GNN) training, this behavior places the CPU on the critical path, introducing persistent host-device orchestration overhead and frequent GPU-CPU synchronization, which dominate end-to-end runtime when GPU computation is small. Existing approaches, including CUDA Graphs and GPU dynamic parallelism, fail to address this problem because the metadata-driven control loop remains host-mediated, and execution structure varies across iterations. We present ZEROGNN, a system that removes the host from the metadata-driven control loop and enables fully GPU-resident execution under dynamic behavior. ZEROGNN keeps runtime metadata on-device, mediates dynamic execution within a fixed launch structure, and provisions a conservative yet tight execution envelope to restore CUDA Graph replayability. Experiments on sampling-based GNN workloads show that ZEROGNN achieves up to 5.28 x end-to-end speedup, near 100% GPU execution fraction, and memory efficiency comparable to ideal metadata-informed allocation, while enabling strong multi-GPU scaling by eliminating host-side bottlenecks. 2026-05-28T04:37:24Z Yidong Gong Saima Afrin Yuchen Ma Guannan Wang Bin Ren Pradeep Kumar http://arxiv.org/abs/2606.06510v1 FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail 2026-05-28T03:40:05Z Conventional HPC dogma holds that native hardware FP64 silicon is the irreducible foundation of scientific computing -- the "holy grail" of double-precision simulation. This paper argues the dogma is wrong: on AI-optimised GPUs of the B300 generation and beyond, abundant FP8 tensor throughput combined with the Chinese Remainder Theorem-based Ozaki Scheme II recovers memory-roof execution at full FP64 accuracy across the canonical HPC kernel spectrum. NVIDIA's Blackwell Ultra (B300) collapses native FP64 to ~1.3 TFLOPS -- a 31x regression from the B200 -- rendering even memory-bound kernels (SpMV, GEMV, stencils) compute-bound. 
We make four contributions. First, a unified analytic model, the Tensor-Memory Equilibrium (TME) model, augmenting the Roofline with a compute multiplier alpha, a bandwidth multiplier beta, and a reconstruction latency gamma. Second, we identify register-level fusion as the mechanism driving beta -> 1, making emulation essentially free behind the memory wall. Third, we project that Ozaki II vaults emulated FP64 from the ~1 TFLOPS native floor to ~500 TFLOPS (B300) and ~400 TFLOPS (Rubin R200), exceeding even B200's native FP64 ceiling by over an order of magnitude in the compute-bound regime while matching the memory roof in the bandwidth-bound regime. Fourth, against an H100 baseline, Ozaki II matches or exceeds H100 on every workload studied, versus the up-to-50x regression that B300 native FP64 imposes. Combined with a companion FFT analysis (Kulisch fixed-point reconstruction on the surviving INT32 pipe) and FP32+Kahan reductions reported in the companion Part(2) paper, every surveyed kernel class on B300 reaches the memory roof at full FP64. The evidence supports the title's claim: FP8, with Ozaki II and Kulisch escape routes, is all one needs for production HPC; native FP64 silicon is no longer the holy grail it has been taken to be. 2026-05-28T03:40:05Z There is a companion Part (2) paper focusing on Ozaki-style FFT Satoshi Matsuoka http://arxiv.org/abs/2511.12025v2 A Quick and Exact Method for Distributed Quantile Computation 2026-05-28T02:02:33Z Quantile computation is a core primitive in large-scale data analytics. In Spark, practitioners typically rely on the Greenwald-Khanna (GK) Sketch, an approximate method. When exact quantiles are required, the default option is an expensive global sort. We present GK Select, an exact Spark algorithm that avoids full-data shuffles and completes in a constant number of actions. 
GK Select leverages GK Sketch to identify a near-target pivot, extracts all values within the error bound around this pivot in each partition in linear time, and then tree-reduces the resulting candidate sets. We show analytically that GK Select matches the executor-side time complexity of GK Sketch while returning the exact quantile. Empirically, GK Select achieves sketch-level latency and outperforms Spark's full sort by approximately 10.5x on 10^9 values across 120 partitions on a 30-core AWS EMR cluster. 2025-11-15T04:26:11Z 10 pages, 2 figures. Draft version for testing and feedback Ivan Cao Jaromir J. Saloni David A. G. Harrison http://arxiv.org/abs/2603.07974v3 ZK-ACE: Identity-Centric Zero-Knowledge Authorization for Post-Quantum Blockchain Systems 2026-05-28T00:14:42Z Post-quantum signature schemes impose kilobyte-scale on-chain artifacts. Verifying them inside ZK circuits merely relocates the cost via expensive lattice arithmetic in prover circuits. We present ZK-ACE (Zero-Knowledge Authorization for Cryptographic Entities), which replaces transaction-carried signature objects with identity-bound ZK statements. Given a deterministic identity derivation primitive (DIDP) as a black box, the prover demonstrates in zero knowledge that an identity consistent with an on-chain commitment authorized the transaction; no signature object is produced or verified on-chain. We provide game-based definitions and reduction-based proofs for authorization soundness, replay resistance, substitution resistance, and cross-domain separation, under knowledge soundness, collision resistance, and DIDP recovery hardness. Structural data accounting shows an order-of-magnitude reduction in per-transaction authorization data versus direct PQC deployment. A reference implementation offers two backends: Circle STARK (341 active rows / 361 AIR constraint expressions, 14.5 ms prove, 1.1 ms verify, approx. 
107 KB proofs, transparent setup, post-quantum-oriented) and Groth16/BN254 (2,155 R1CS constraints, 37.3 ms prove, 128-byte proofs). Both are roughly 500--2,300x smaller than in-circuit PQC signature verification. Under mandatory per-block STARK aggregation, per-transaction consensus-visible data is approx. 160 bytes. 2026-03-09T05:21:44Z 34 pages Jian Sheng Wang http://arxiv.org/abs/2605.29155v1 CA-AC-MPC: CUDA-Accelerated Actor-Critic Model Predictive Control 2026-05-27T22:38:09Z In the literature, actor-critic model predictive control (AC-MPC) integrates MPC with reinforcement learning to enable high-performance control of complex dynamical systems. However, its differentiable MPC layer requires repeatedly solving an optimization problem in both the forward and backward passes, leading to substantial training and inference latency. This paper tackles this bottleneck by introducing a CUDA-accelerated variant that significantly reduces end-to-end execution time while preserving the control performance of the baseline formulation. Simulation results on an agile drone racing task show that our approach achieves state-of-the-art lap times and near-limit dynamic behaviour with markedly reduced training and inference time. 2026-05-27T22:38:09Z Accepted for presentation at the 2026 International Conference on Unmanned Aircraft Systems, ICUAS 2026 Antoonio Buo Vittorio Cammarota Michele Avagnale Pierluigi Arpenti Vincenzo Lippiello Fabio Ruggiero http://arxiv.org/abs/2605.29135v1 Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory 2026-05-27T21:57:36Z Large language models have achieved remarkable capabilities through scaling, and this paper does not challenge that. It instead investigates a different question: once large models already exist, can they become more accessible to environments with substantially smaller hardware resources? The motivation came from deployment concerns rather than architecture research. 
Many organizations operate under hardware, budget, security, or closed-network constraints that limit access to large accelerator clusters, and as models continue to improve, deployment accessibility may matter as much as capability itself. This paper presents Rotary GPU, an exploratory execution approach derived from a previously disclosed rotary-based accelerator residency concept. A public validation was conducted using a Qwen3.6-35B-A3B-class Mixture-of-Experts model executed locally on a consumer laptop with an RTX 4060 Laptop GPU containing 8 GB of VRAM. Under the primary configuration, the system generated 2048 output tokens while maintaining approximately 6.3 GB of VRAM usage and an observed decode throughput of 21.06 tokens per second. The goal is not to replace data-center infrastructure but to explore whether some capabilities of large models can be brought closer to environments where such infrastructure is unavailable. The results should be read as exploratory rather than definitive, but they suggest deployment accessibility deserves continued investigation as these models evolve. 2026-05-27T21:57:36Z 10 pages, 3 figures. Also archived at Zenodo (DOI: 10.5281/zenodo.20406471). Related to Korean Patent Publication KR 10-2026-0070380 Myeong Jun Jo 10.5281/zenodo.20406471