https://arxiv.org/api/naQJXrQH6vAIqkkWgqRjkRbBRy4
2026-06-10T12:42:30Z
http://arxiv.org/abs/2605.29868v1
Ciphera: A Decentralised Biometric Identity Framework
2026-05-28T12:52:12Z
Centralised biometric identity systems expose users to single points of failure, opaque verification processes, and irreversible biometric compromise. Decentralised Identifiers (DIDs) and Verifiable Credentials (VCs) offer stronger privacy guarantees, yet their integration with biometric authentication and distributed verification remains insufficiently explored. This paper presents Ciphera, a decentralised biometric identity framework combining privacy-preserving facial recognition, multi-node verification, IPFS-based credential metadata storage, and blockchain-anchored revocation. Evaluated across functional, performance, security, and distributed consistency dimensions, Ciphera achieved an 81% functional success rate, with stable enrolment and authentication but measurable revocation propagation delays and occasional audit-log inconsistencies. Performance testing demonstrated sub-second p95 verification latency of approximately 820 ms under concurrent multi-node conditions. Security analysis confirmed strong confidentiality and integrity guarantees, though incomplete liveness detection leaves susceptibility to deepfake and replay attacks. The results demonstrate the feasibility of decentralised biometric identity while identifying key engineering challenges for production-grade deployment.
2026-05-28T12:52:12Z
Accepted at the CyberAI 2026 Conference, and to be indexed at IEEE-Scopus
CyberAI 2026 (https://cyberai-conf.org/)
Ankit Kanaiyalal Prajapati
Shahzad Memon
Mohammed Mahir Rahman
Ameer Al-Nemrat
http://arxiv.org/abs/2605.29752v1
From Roofline to Ruggedness: Decomposing and Smoothing the GEMM Performance Landscape
2026-05-28T10:50:53Z
Adjacent GEMM problems that differ by a single 128-element step in N can show 30% different throughput on the same GPU. This pervasive performance ruggedness - invisible to roofline analysis and peak-FLOPs intuition, yet dominant for every non-peak workload - is the subject of this paper.
We propose performance ruggedness analysis as an analytical framework complementary to roofline: rather than summarizing GPU performance with a scalar bound, treat the full multidimensional performance surface as the object of study, decompose its texture into mechanism-attributable components and separate software-removable contributions from hardware-bound ones. The framing is directly analogous to deep-learning loss landscapes - a continuous quantity (the idealized time 2MNK / compute_throughput_peak) made rugged by interaction with discrete hardware substrates (tiles, sub-groups, cache lines, DRAM channels).
We apply the framework to BF16 NN (no transpose) GEMM on Intel Battlemage (Arc B580, sycl-tla) via a 32,768-configuration sweep over (M, N, K) ∈ {128, ..., 4096}^3. The peak is 110.8 TFLOPs at the non-square shape M=3840, N=2048, K=4096 with the default tile size; the initial landscape roughness is 16.8 TFLOPs per 128-step against an ideal of 2.0. A two-stage software stack - (i) best-of-six dynamic tile selection and (ii) a novel dynamic-programming-based padding-and-splitting optimizer with O(1) runtime lookup - reduces roughness by 70% and raises mean throughput by 30%. Cross-tile experiments establish that the residual sawtooth period scales exactly with software tile size, ruling out cache set conflicts and attributing the remaining variance to four hardware-bound sources (per-kernel base overhead, wave quantization, DPAS atom geometry and GDDR6 channel-hash interactions).
2026-05-28T10:50:53Z
Aditya Chatterjee
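The abstract above contrasts the continuous idealized time 2MNK / compute_throughput_peak with the discrete effects of tiles and wave quantization. The following is a minimal sketch of that contrast, not the paper's model: the function names, the 256x256 tile, and the count of 20 compute units are illustrative assumptions, and only tile padding and wave quantization are modelled.

```python
import math


def ideal_time_s(M, N, K, peak_flops):
    """Idealized GEMM time: 2*M*N*K floating-point operations at peak throughput."""
    return 2 * M * N * K / peak_flops


def tile_efficiency(M, N, tile_m, tile_n, units):
    """Fraction of issued work that is useful under a simple tiling model.

    Assumptions (hypothetical, for illustration only):
    - the output is covered by whole tile_m x tile_n tiles, so partial tiles are padded;
    - tiles are scheduled in waves of `units` tiles, so a partly filled last wave
      costs as much as a full one.
    """
    tiles = math.ceil(M / tile_m) * math.ceil(N / tile_n)
    waves = math.ceil(tiles / units)
    padding = (M * N) / (tiles * tile_m * tile_n)  # useful share of the padded output
    occupancy = tiles / (waves * units)            # useful share of the scheduled waves
    return padding * occupancy


if __name__ == "__main__":
    # Stepping N by 128 changes the efficiency in jumps, while the ideal time grows smoothly.
    for n in (3968, 4096, 4224):
        print(n, ideal_time_s(4096, n, 4096, 1e14), tile_efficiency(4096, n, 256, 256, 20))
```

Under these assumptions, a single 128-element step in N can add a partially filled tile column and an extra wave, which is one way a smooth ideal curve turns into a sawtooth.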
http://arxiv.org/abs/2605.29740v1
CARM Tool: Cache-Aware Roofline Model Automatic Benchmarking and Application Analysis
2026-05-28T10:35:28Z
In recent years, HPC systems, and the CPU architectures at their core, have become increasingly complex, making application development and optimization challenging. Intuitive performance models such as the Cache-aware Roofline Model (CARM) offer effective guidance by providing insights into the bottlenecks that limit an application's ability to reach the system's maximum performance. To fully exploit CARM's optimization guidance for application development, automatic tools for cross-architecture model construction and in-depth application characterization are essential. Across the many existing CPU architectures, current CARM-enabled tools are either vendor-specific (Intel Advisor), insufficiently developed (ARM), or non-existent (AMD, RISC-V). This work aims to close that gap by bringing automatic CARM support to all major CPU architectures and ISAs, i.e., x86 (Intel, AMD), ARM, and RISC-V, through assembly microbenchmarks tailored to cover the full performance spectrum of modern CPUs (from scalar to all supported vector ISA extensions) for both computational units and all memory hierarchy levels. Additionally, this work integrates application analysis within the CARM framework using performance counters and dynamic binary instrumentation. Experimental results show that the CARM roofs constructed with the proposed automated framework show less than 1% deviation across the various tested architectural maximums.
2026-05-28T10:35:28Z
Published at IISWC '24 (International Symposium on Workload Characterization)
José Morgado
Leonel Sousa
Aleksandar Ilic
10.1109/IISWC63097.2024.00016
http://arxiv.org/abs/2605.29728v1
PRISM: Processing-In-Memory Sparse MTTKRP for Tensor Decomposition Acceleration
2026-05-28T10:21:45Z
Sparse tensors are the most used representation of sparse multidimensional data. Operations that decompose them, selecting their most important features while reducing their dimension, have become prevalent procedures in machine learning. One of the most used tensor decomposition algorithms is the Alternating Least Squares Canonical Polyadic Decomposition (CP-ALS), where the most time-consuming operation is the Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP). This operation is strongly memory-bound, making it hard to implement efficiently on general-purpose processors. This work proposes PRISM, the first approach to tackle this operation using Processing-In-Memory (PIM) technology. We extensively characterize different partitioning strategies, number formats, and kernel optimizations that efficiently adapt this operation to UPMEM PIM, which is further boosted by heterogeneous collaboration with the CPU. The experimental results show that the proposed PIM-based and heterogeneous approaches achieve up to 2.37x and 2.64x speedup compared to state-of-the-art CPU implementations, respectively. However, the UPMEM distributed memory system can significantly hinder performance on certain workloads. Nonetheless, the efficiency of resource consumption for this approach, measured by peak performance fraction usage, is significantly higher than for both CPU and GPU.
2026-05-28T10:21:45Z
Published at IISWC '25 (International Symposium on Workload Characterization)
Daniel Pacheco
Leonel Sousa
Aleksandar Ilic
10.1109/IISWC66894.2025.00029
http://arxiv.org/abs/2605.22831v2
Monte Cimone v3: Where RISC-V Stands in High-Performance Computing
2026-05-28T09:57:36Z
The Monte Cimone project provides a RISC-V testbed for High-Performance Computing clusters. This paper presents Monte Cimone v3 (MCv3), the third iteration of the Monte Cimone RISC-V HPC cluster, integrating the SOPHGO Sophon SG2044 processor, an evolution of the SG2042 used in MCv2. We characterize MCv3 using HPL and STREAM benchmarks coupled with power measurements, and compare it against two reference platforms: the Intel Xeon Platinum 8480+ (Sapphire Rapids) and the NVIDIA Grace CPU Superchip. Our results show that the SG2044 more than doubles single-core performance and improves scalability compared to the SG2042. MCv3 achieves an energy efficiency of 3.08 GFLOPs/W, a 10x improvement over MCv1 and in the range of x86-64 and Arm servers. In pure performance normalized to the SIMD/vector length, MCv3 at its peak efficiency point (16 cores) achieves 46% of the performance of the Intel Sapphire Rapids server and 91% of that of the NVIDIA Grace CPU Superchip.
2026-04-22T10:59:03Z
Extended abstract for RISC-V Summit Europe 2026
Emanuele Venieri
Simone Manoni
Giacomo Madella
Federico Proverbio
Federico Ficarelli
Luca Benini
Andrea Bartolini
http://arxiv.org/abs/2605.29664v1
AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training
2026-05-28T09:25:12Z
Pipeline parallelism is essential for large-scale model training, but existing asynchronous approaches often degrade convergence due to parameter mismatch between forward and backward passes. We propose Asynchronous Multi-Directional Pipeline parallelism (AMDP) to mitigate this issue while sustaining high utilization. AMDP limits the first stage of each pipeline to process at most two minibatches before backpropagation, bounding the number of parameter updates between forward and backward passes. To alleviate the resulting pipeline bubbles, AMDP launches multiple concurrent pipelines and adapts their number according to pipeline depth. In addition, AMDP accumulates gradients across minibatches and applies them in a single update, ensuring that only a bounded number of minibatches experience parameter mismatch, limited to within one optimization step. Experiments on GPT- and BERT-style models demonstrate that AMDP significantly accelerates training while preserving convergence.
2026-05-28T09:25:12Z
Accepted by ICML 2026, 9 pages, and 8 figures
Ling Chen
Houming Wu
Wenjie Yu
http://arxiv.org/abs/2605.29604v1
TC-MIS: Maximal Independent Set on Tensor-cores
2026-05-28T08:40:56Z
Maximal Independent Set (MIS) in a graph is a fundamental problem with applications in resource allocation, scheduling, and network optimization. Although graphs are inherently unstructured and challenging for GPU parallelism due to irregular memory access and workload imbalance, specialized GPU algorithms have achieved good performance, processing million-vertex graphs in milliseconds. Modern GPUs are equipped with Tensor Cores (TCs), specialized units for matrix operations with 8-16x higher throughput than CUDA Cores (CCs), which are extensively used for ML, DL, and inference tasks but remain largely unexplored for graph algorithms. In this paper, we present TC-MIS, a TC-accelerated algorithm that reformulates key phases of MIS computation as sparse matrix-vector multiplication (SpMV). TC-MIS tiles the graph adjacency matrix and employs Warp Matrix Multiply-Accumulate (WMMA) operations to transform irregular graph traversal into regular, massively parallel computation. Our evaluation across TC-enabled microarchitectures (Ampere, Ada Lovelace, Hopper, Blackwell) demonstrates that TC-MIS achieves an average speedup of 2.84x on RTX A5000, 4.84x on L40S, 18.80x on H200 GPUs, and 5.20x on RTX 5080 with a maximum speedup of 44.38x on H200 GPU over state-of-the-art methods, while maintaining solution quality comparable to that obtained by established heuristics that produce near-maximum independent sets.
2026-05-28T08:40:56Z
Prajjwal Nijhara
Dip Sankar Banerjee
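The TC-MIS abstract above says that key phases of MIS computation are reformulated as sparse matrix-vector multiplication. The sketch below shows how one such phase can be written as a matrix-vector product on a CPU in plain Python; it is a Luby-style randomized MIS, not the paper's algorithm. Which phases TC-MIS actually maps to SpMV, and how it tiles the matrix for WMMA, are not stated in the abstract, so the structure here is an assumption. The adjacency matrix `A` is assumed to be dense, symmetric, 0/1 valued, and zero on the diagonal.

```python
import random


def matvec(A, x):
    """Dense matrix-vector product, standing in for an SpMV kernel."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def maximal_independent_set(A, seed=0):
    """Luby-style MIS; returns a 0/1 indicator list over the vertices."""
    n = len(A)
    rng = random.Random(seed)
    active = [1] * n
    in_set = [0] * n
    while any(active):
        # Distinct random priorities, so two adjacent vertices can never both win.
        prio = rng.sample(range(1, n + 1), n)
        chosen = [0] * n
        for v in range(n):
            if active[v] and all(
                prio[v] > prio[u] for u in range(n) if A[v][u] and active[u]
            ):
                chosen[v] = 1
        # Matrix-vector step: hits[v] > 0 exactly when v has a newly chosen neighbour.
        hits = matvec(A, chosen)
        for v in range(n):
            if chosen[v]:
                in_set[v] = 1
                active[v] = 0
            elif hits[v] > 0:
                active[v] = 0
    return in_set
```

The same product also gives a compact validity check: for a correct result `s`, `matvec(A, s)[v]` is zero for every selected vertex (independence) and positive for every unselected one (maximality).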
http://arxiv.org/abs/2605.29573v1
Design and Implementation of a Serverless MapReduce Framework for Scalable Data Pipelines
2026-05-28T08:20:51Z
Modern logistics systems tend to generate continuous streams of data from sources such as GPS, IoT sensors, and logistics management systems. The aggregation, processing, and analysis of data have become vital for monitoring operations, optimizing efficiency, and responding quickly to decision-making tasks. In this paper, an event-driven MapReduce framework for real-time data processing in logistics environments is presented. This system runs on Kubernetes with Knative and utilizes Apache Kafka as the backbone for communication between the components. This platform is composed of five loosely coupled services that receive, process, and aggregate the incoming data in real-time. Redis is used to preserve workflow metadata, while an AWS S3 service provides persistent storage for the framework. The design is inspired by the MapReduce programming model. It integrates Function-as-a-Service (FaaS) principles with distributed processing techniques that allow configurable scaling based on the workload demands and the underlying hardware. Experimental evaluation shows that the system can scale effectively as the input data volume increases while supporting scale-to-zero, on-demand processing.
2026-05-28T08:20:51Z
Angelos Dorotheos Chatzopoulos
Babis Andreou
Kakia Panagidi
Stathes Hadjiefthymiades
http://arxiv.org/abs/2605.29506v1
Silent Data Corruption Protection through Efficient Task Replication
2026-05-28T07:29:28Z
The trend of increasing cluster sizes of supercomputers leads to a growing susceptibility to Silent Data Corruption (SDC) that can invalidate program results. A common strategy for SDC protection is replication, where the computation is repeated, and the correct result is determined as the one that is the same in at least two different computations. Applying replication to Asynchronous Many-Task (AMT) runtimes on clusters is challenging due to dynamic task spawning and work stealing, which complicate the identification of replicated tasks.
To address the challenge, this paper introduces a novel replication scheme that detects and corrects SDCs for nested fork-join programs. Briefly stated, our approach replicates the computation and records the task tree. Upon a mismatch in the final result, it traverses the tree top-down to identify all corrupted tasks that could have impacted the final result. Recovery is then performed by recomputing these tasks, while the results of correct child tasks are reused.
We demonstrate our implementation within a variant of the Itoyori cluster AMT runtime. Our experimental results suggest that the time to identify and reprocess the affected tasks is negligible. The paper concludes by discussing the adaptability of our scheme to tasks that cooperate through futures.
2026-05-28T07:29:28Z
preprint
Mia Reitz
Claudia Fohry
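The replication scheme described in the abstract above (replicate the computation, record the task tree, traverse it top-down on a mismatch, recompute only the affected tasks while reusing correct child results) can be sketched on a toy fork-join tree. This is a simplified illustration, not the Itoyori-based implementation: the `Task` class, the sum-of-subtree workload, the injected fault, and the assumption that the recomputation itself is fault-free are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    value: int
    children: list = field(default_factory=list)


def run(task, results, fault=None):
    """Execute the tree once, recording each task's result under its name."""
    total = task.value + sum(run(c, results, fault) for c in task.children)
    if task.name == fault:
        total += 1000  # injected silent corruption (illustrative)
    results[task.name] = total
    return total


def suspects(task, a, b):
    """Top-down traversal: descend only through tasks whose two results differ."""
    if a[task.name] == b[task.name]:
        return []
    found = [task.name]
    for child in task.children:
        found += suspects(child, a, b)
    return found


def recover(task, a, b, recomputed):
    """Recompute mismatching tasks; reuse results on which both replicas agree."""
    if a[task.name] == b[task.name]:
        return a[task.name]
    recomputed.append(task.name)
    return task.value + sum(recover(c, a, b, recomputed) for c in task.children)


def demo():
    tree = Task("root", 1, [
        Task("left", 2, [Task("l1", 3), Task("l2", 4)]),
        Task("right", 5, [Task("r1", 6)]),
    ])
    a, b = {}, {}
    run(tree, a)                # replica 1, fault-free
    run(tree, b, fault="left")  # replica 2, corrupted in task "left"
    redone = []
    return suspects(tree, a, b), recover(tree, a, b, redone), redone
```

In this sketch the corruption in `left` also changes the result of `root`, so both appear as suspects, while the agreeing subtrees (`l1`, `l2`, `right`) are never re-executed.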
http://arxiv.org/abs/2605.29346v1
Understanding and Reducing Metadata-Driven Host Overheads in Sampling-Based GNN Training
2026-05-28T04:37:24Z
Modern deep learning workloads increasingly exhibit dynamic, metadata-driven execution, where runtime-generated information determines memory provisioning and kernel launch decisions. In sampling-based graph neural network (GNN) training, this behavior places the CPU on the critical path, introducing persistent host-device orchestration overhead and frequent GPU-CPU synchronization, which dominate end-to-end runtime when GPU computation is small. Existing approaches, including CUDA Graphs and GPU dynamic parallelism, fail to address this problem because the metadata-driven control loop remains host-mediated, and execution structure varies across iterations. We present ZEROGNN, a system that removes the host from the metadata-driven control loop and enables fully GPU-resident execution under dynamic behavior. ZEROGNN keeps runtime metadata on-device, mediates dynamic execution within a fixed launch structure, and provisions a conservative yet tight execution envelope to restore CUDA Graph replayability. Experiments on sampling-based GNN workloads show that ZEROGNN achieves up to 5.28x end-to-end speedup, near 100% GPU execution fraction, and memory efficiency comparable to ideal metadata-informed allocation, while enabling strong multi-GPU scaling by eliminating host-side bottlenecks.
2026-05-28T04:37:24Z
Yidong Gong
Saima Afrin
Yuchen Ma
Guannan Wang
Bin Ren
Pradeep Kumar
http://arxiv.org/abs/2606.06510v1
FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail
2026-05-28T03:40:05Z
Conventional HPC dogma holds that native hardware FP64 silicon is the irreducible foundation of scientific computing -- the "holy grail" of double-precision simulation. This paper argues the dogma is wrong: on AI-optimised GPUs of the B300 generation and beyond, abundant FP8 tensor throughput combined with the Chinese Remainder Theorem-based Ozaki Scheme II recovers memory-roof execution at full FP64 accuracy across the canonical HPC kernel spectrum. NVIDIA's Blackwell Ultra (B300) collapses native FP64 to ~1.3 TFLOPS -- a 31x regression from the B200 -- rendering even memory-bound kernels (SpMV, GEMV, stencils) compute-bound. We make four contributions. First, a unified analytic model, the Tensor-Memory Equilibrium (TME) model, augmenting the Roofline with a compute multiplier alpha, a bandwidth multiplier beta, and a reconstruction latency gamma. Second, we identify register-level fusion as the mechanism driving beta -> 1, making emulation essentially free behind the memory wall. Third, we project that Ozaki II vaults emulated FP64 from the ~1 TFLOPS native floor to ~500 TFLOPS (B300) and ~400 TFLOPS (Rubin R200), exceeding even B200's native FP64 ceiling by over an order of magnitude in the compute-bound regime while matching the memory roof in the bandwidth-bound regime. Fourth, against an H100 baseline, Ozaki II matches or exceeds H100 on every workload studied, versus the up-to-50x regression that B300 native FP64 imposes. Combined with a companion FFT analysis (Kulisch fixed-point reconstruction on the surviving INT32 pipe) and FP32+Kahan reductions reported in the companion Part (2) paper, every surveyed kernel class on B300 reaches the memory roof at full FP64. The evidence supports the title's claim: FP8, with Ozaki II and Kulisch escape routes, is all one needs for production HPC; native FP64 silicon is no longer the holy grail it has been taken to be.
2026-05-28T03:40:05Z
There is a companion Part (2) paper focusing on Ozaki-style FFT
Satoshi Matsuoka
http://arxiv.org/abs/2511.12025v2
A Quick and Exact Method for Distributed Quantile Computation
2026-05-28T02:02:33Z
Quantile computation is a core primitive in large-scale data analytics. In Spark, practitioners typically rely on the Greenwald-Khanna (GK) Sketch, an approximate method. When exact quantiles are required, the default option is an expensive global sort. We present GK Select, an exact Spark algorithm that avoids full-data shuffles and completes in a constant number of actions. GK Select leverages GK Sketch to identify a near-target pivot, extracts all values within the error bound around this pivot in each partition in linear time, and then tree-reduces the resulting candidate sets. We show analytically that GK Select matches the executor-side time complexity of GK Sketch while returning the exact quantile. Empirically, GK Select achieves sketch-level latency and outperforms Spark's full sort by approximately 10.5x on 10^9 values across 120 partitions on a 30-core AWS EMR cluster.
2025-11-15T04:26:11Z
10 pages, 2 figures. Draft version for testing and feedback
Ivan Cao
Jaromir J. Saloni
David A. G. Harrison
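The GK Select abstract above describes finding a near-target pivot from a sketch, extracting the values around it in each partition, and reducing the candidate sets to the exact quantile. Below is a minimal single-process sketch of that idea in plain Python, without Spark. It is not the authors' implementation: a sorted random sample stands in for the GK Sketch, the function names are hypothetical, the quantile is taken as the value at 0-indexed rank `min(n - 1, int(q * n))`, and `heapq` selection is used for the per-partition extraction rather than a strictly linear-time method.

```python
import heapq
import random


def approx_pivot(partitions, q, sample_per_partition=64, seed=0):
    """Stand-in for a GK sketch: estimate the q-quantile from a small random sample."""
    rng = random.Random(seed)
    sample = []
    for part in partitions:
        sample.extend(rng.sample(part, min(sample_per_partition, len(part))))
    sample.sort()
    return sample[min(len(sample) - 1, int(q * len(sample)))]


def exact_quantile(partitions, q):
    """Exact value at rank min(n - 1, int(q * n)) across non-empty input partitions."""
    n = sum(len(p) for p in partitions)
    k = min(n - 1, int(q * n))
    pivot = approx_pivot(partitions, q)

    # Pass 1: per-partition counts relative to the pivot, combined with a sum.
    below = sum(sum(1 for x in p if x < pivot) for p in partitions)
    equal = sum(sum(1 for x in p if x == pivot) for p in partitions)
    if below <= k < below + equal:
        return pivot

    # Pass 2: each partition keeps only the d values nearest the pivot on the
    # side where the target lies; d is the pivot's rank error.
    if k < below:
        d = below - k  # target is the d-th largest value below the pivot
        cands = [heapq.nlargest(d, (x for x in p if x < pivot)) for p in partitions]
        return heapq.nlargest(d, (x for c in cands for x in c))[-1]
    d = k - (below + equal) + 1  # target is the d-th smallest value above the pivot
    cands = [heapq.nsmallest(d, (x for x in p if x > pivot)) for p in partitions]
    return heapq.nsmallest(d, (x for c in cands for x in c))[-1]
```

The result is exact regardless of how good the pivot is; a better pivot only shrinks `d`, and with it the candidate sets that have to be merged.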
http://arxiv.org/abs/2603.07974v3
ZK-ACE: Identity-Centric Zero-Knowledge Authorization for Post-Quantum Blockchain Systems
2026-05-28T00:14:42Z
Post-quantum signature schemes impose kilobyte-scale on-chain artifacts. Verifying them inside ZK circuits merely relocates the cost via expensive lattice arithmetic in prover circuits.
We present ZK-ACE (Zero-Knowledge Authorization for Cryptographic Entities), which replaces transaction-carried signature objects with identity-bound ZK statements. Given a deterministic identity derivation primitive (DIDP) as a black box, the prover demonstrates in zero knowledge that an identity consistent with an on-chain commitment authorized the transaction; no signature object is produced or verified on-chain.
We provide game-based definitions and reduction-based proofs for authorization soundness, replay resistance, substitution resistance, and cross-domain separation, under knowledge soundness, collision resistance, and DIDP recovery hardness. Structural data accounting shows an order-of-magnitude reduction in per-transaction authorization data versus direct PQC deployment. A reference implementation offers two backends: Circle STARK (341 active rows / 361 AIR constraint expressions, 14.5 ms prove, 1.1 ms verify, approx. 107 KB proofs, transparent setup, post-quantum-oriented) and Groth16/BN254 (2,155 R1CS constraints, 37.3 ms prove, 128-byte proofs). Both are roughly 500--2,300x smaller than in-circuit PQC signature verification. Under mandatory per-block STARK aggregation, per-transaction consensus-visible data is approx. 160 bytes.
2026-03-09T05:21:44Z
34 pages
Jian Sheng Wang
http://arxiv.org/abs/2605.29155v1
CA-AC-MPC: CUDA-Accelerated Actor-Critic Model Predictive Control
2026-05-27T22:38:09Z
In the literature, actor-critic model predictive control (AC-MPC) integrates MPC with reinforcement learning to enable high-performance control of complex dynamical systems. However, its differentiable MPC layer requires repeatedly solving an optimization problem in both the forward and backward passes, leading to substantial training and inference latency. This paper tackles this bottleneck by introducing a CUDA-accelerated variant that significantly reduces end-to-end execution time while preserving the control performance of the baseline formulation. Simulation results on an agile drone racing task show that our approach achieves state-of-the-art lap times and near-limit dynamic behaviour with markedly reduced training and inference time.
2026-05-27T22:38:09Z
Accepted for presentation at the 2026 International Conference on Unmanned Aircraft Systems, ICUAS 2026
Antoonio Buo
Vittorio Cammarota
Michele Avagnale
Pierluigi Arpenti
Vincenzo Lippiello
Fabio Ruggiero
http://arxiv.org/abs/2605.29135v1
Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory
2026-05-27T21:57:36Z
Large language models have achieved remarkable capabilities through scaling, and this paper does not challenge that. It instead investigates a different question: once large models already exist, can they become more accessible to environments with substantially smaller hardware resources? The motivation came from deployment concerns rather than architecture research. Many organizations operate under hardware, budget, security, or closed-network constraints that limit access to large accelerator clusters, and as models continue to improve, deployment accessibility may matter as much as capability itself. This paper presents Rotary GPU, an exploratory execution approach derived from a previously disclosed rotary-based accelerator residency concept. A public validation was conducted using a Qwen3.6-35B-A3B-class Mixture-of-Experts model executed locally on a consumer laptop with an RTX 4060 Laptop GPU containing 8 GB of VRAM. Under the primary configuration, the system generated 2048 output tokens while maintaining approximately 6.3 GB of VRAM usage and an observed decode throughput of 21.06 tokens per second. The goal is not to replace data-center infrastructure but to explore whether some capabilities of large models can be brought closer to environments where such infrastructure is unavailable. The results should be read as exploratory rather than definitive, but they suggest deployment accessibility deserves continued investigation as these models evolve.
2026-05-27T21:57:36Z
10 pages, 3 figures. Also archived at Zenodo (DOI: 10.5281/zenodo.20406471). Related to Korean Patent Publication KR 10-2026-0070380
Myeong Jun Jo
10.5281/zenodo.20406471