Feed: https://arxiv.org/api/Z7fictuYFP0xbM92jQoQOOADci4 (updated 2026-06-22T02:38:41Z)

http://arxiv.org/abs/2507.00976v1
Anatomy of High-Performance Column-Pivoted QR Decomposition
Updated: 2025-07-01T17:25:36Z
We introduce an algorithmic framework for performing QR factorization with column pivoting (QRCP) on general matrices. The framework enables the design of practical QRCP algorithms through user-controlled choices for the core subroutines. We provide a comprehensive overview of how to navigate these choices on modern hardware platforms, offering detailed descriptions of alternative methods for both CPUs and GPUs. The practical QRCP algorithms developed within this framework are implemented as part of the open-source RandLAPACK library. Our empirical evaluation demonstrates that, on a dual AMD EPYC 9734 system, the proposed method achieves performance improvements of up to two orders of magnitude over LAPACK's standard QRCP routine and greatly surpasses the performance of the current state-of-the-art randomized QRCP algorithm. Additionally, on an NVIDIA H100 GPU, our method attains approximately 65 percent of the performance of cuSOLVER's unpivoted QR factorization.
Published: 2025-07-01T17:25:36Z
Comments: v1: 33 pages in the body, 7 pages in the appendices, 17 figures
Authors: Maksim Melnichenko, Riley Murray, William Killian, James Demmel, Michael W. Mahoney, Piotr Luszczek, Mark Gates

http://arxiv.org/abs/2503.02134v2
Enabling mixed-precision in spectral element codes
Updated: 2025-07-01T15:23:37Z
Mixed-precision computing has the potential to significantly reduce the cost of exascale computations, but determining when and how to implement it in programs can be challenging. In this article, we propose a methodology for enabling mixed-precision with the help of computer arithmetic tools, the roofline model, and computer arithmetic techniques. As case studies, we consider Nekbone, a mini-application for the Computational Fluid Dynamics (CFD) solver Nek5000, and a modern Neko CFD application.
With the help of the Verificarlo tool and computer arithmetic techniques, we introduce a strategy to address stagnation issues in the preconditioned Conjugate Gradient method in Nekbone and apply these insights to implement a mixed-precision version of Neko. We evaluate the derived mixed-precision versions of these codes by combining metrics in three dimensions: accuracy, time-to-solution, and energy-to-solution. Notably, mixed-precision in Nekbone reduces time-to-solution by roughly 1.62x and energy-to-solution by 2.43x on MareNostrum 5, while in the real-world Neko application, the gain is up to 1.3x in both time and energy, with accuracy that matches double-precision results.
Published: 2025-03-03T23:46:53Z
Comments: arXiv admin note: text overlap with arXiv:2405.11065
Authors: Yanxiang Chen, Pablo de Oliveira Castro, Paolo Bientinesi, Niclas Jansson, Roman Iakymchuk

http://arxiv.org/abs/2411.00442v3
Revealing Floating-Point Accumulation Orders in Software/Hardware Implementations
Updated: 2025-07-01T03:08:49Z
Accumulation-based operations, such as summation and matrix multiplication, are fundamental to numerous computational domains. However, their accumulation orders are often undocumented in existing software and hardware implementations, making it difficult for developers to ensure consistent results across systems. To address this issue, we introduce FPRev, a diagnostic tool designed to reveal the accumulation order in the software and hardware implementations through numerical testing. With FPRev, developers can identify and compare accumulation orders, enabling them to create reproducible software and verify implementation equivalence.
FPRev is a testing-based tool that non-intrusively reveals the accumulation order by analyzing the outputs of the tested implementation for distinct specially designed inputs. Employing FPRev, we showcase the accumulation orders of popular libraries (such as NumPy and PyTorch) on CPUs and GPUs (including GPUs with specialized matrix accelerators such as Tensor Cores). We also validate the efficiency of FPRev through extensive experiments. FPRev exhibits a lower time complexity compared to the basic solution. FPRev is open-sourced at https://github.com/peichenxie/FPRev.
Published: 2024-11-01T08:26:44Z
Comments: Camera-ready for USENIX ATC 2025
Authors: Peichen Xie, Yanjie Gao, Yang Wang, Jilong Xue

http://arxiv.org/abs/2501.04032v2
Efficient Computation of Collatz Sequence Stopping Times: A Novel Algorithmic Approach
Updated: 2025-06-30T19:19:37Z
The Collatz conjecture, which posits that any positive integer will eventually reach 1 through a specific iterative process, is a classic unsolved problem in mathematics. This research focuses on designing an efficient algorithm to compute the stopping time of numbers in the Collatz sequence, achieving significant computational improvements. By leveraging structural patterns in the Collatz tree, the proposed algorithm minimizes redundant operations and optimizes computational steps. Unlike prior methods, it efficiently handles extremely large numbers without requiring advanced techniques such as memoization or parallelization. Experimental evaluations confirm computational efficiency improvements of approximately 28% over state-of-the-art methods.
These findings underscore the algorithm's scalability and robustness, providing a foundation for future large-scale verification of the conjecture and potential applications in computational mathematics.
Published: 2025-01-01T10:52:31Z
Comments: Published in: IEEE Access (Volume: 13), Page(s): 41210 - 41220, Date of Publication: 05 March 2025
Authors: Eyob Solomon Getachew, Beakal Gizachew Assefa
DOI: 10.1109/ACCESS.2025.3548031

http://arxiv.org/abs/2506.23558v1
The Distributed and Unified Numerics Environment (DUNE), Version 2.10
Updated: 2025-06-30T07:06:11Z
Version 2.10 of the Distributed and Unified Numerics Environment DUNE introduces a range of enhancements across its core and extension modules, with a continued emphasis on modern C++ integration and improved usability. This release extends support for C++20 features, particularly concepts, through comprehensive refinements in dune-common and dune-grid, enabling safer and more expressive generic programming paradigms. A notable advancement is the improved support for curved geometries, including new geometry implementations and a more flexible interface. Data structures have been modernized through native support for std::mdspan and std::mdarray, performance improvements in sparse matrices, and tools for visualization of matrix patterns. The build system has been restructured towards a modern CMake workflow, emphasizing target-based configuration and improved automation. Furthermore, new local finite elements have been introduced to broaden numerical capabilities.
The release also brings updates across DUNE extensions, as well as improvements to infrastructure and module-level components.
Published: 2025-06-30T07:06:11Z
Comments: 25 pages, 16 code examples
Authors: Markus Blatt, Samuel Burbulla, Ansgar Burchardt, Andreas Dedner, Christian Engwer, Carsten Gräser, Christoph Grüninger, Robert Klöfkorn, Timo Koch, Santiago Ospina De Los Ríos, Simon Praetorius, Oliver Sander

http://arxiv.org/abs/2506.23416v1
Zero-disparity Distribution Synthesis: Fast Exact Calculation of Chi-Squared Statistic Distribution for Discrete Uniform Histograms
Updated: 2025-06-29T22:22:40Z
Pearson's chi-squared test is widely used to assess the uniformity of discrete histograms, typically relying on a continuous chi-squared distribution to approximate the test statistic, since computing the exact distribution is computationally too costly. While effective in many cases, this approximation allegedly fails when expected bin counts are low or tail probabilities are needed. Here, Zero-disparity Distribution Synthesis is presented, a fast dynamic programming approach for computing the exact distribution, enabling detailed analysis of approximation errors. The results dispel some existing misunderstandings and also reveal subtle, but significant pitfalls in approximation that are only apparent with exact values. The Python source code is available at https://github.com/DiscreteTotalVariation/ChiSquared.
Published: 2025-06-29T22:22:40Z
Comments: 9 pages, 7 figures
Authors: Nikola Banić, Neven Elezović

http://arxiv.org/abs/2506.23388v1
Escher Tile Deformation via Closed-Form Solution
Updated: 2025-06-29T20:03:47Z
We present a real-time deformation method for Escher tiles -- interlocking organic forms that seamlessly tessellate the plane following symmetry rules. We formulate the problem as determining a periodic displacement field. The goal is to deform Escher tiles without introducing gaps or overlaps. The resulting displacement field is obtained in closed form by an analytical solution.
Our method processes tiles of 17 wallpaper groups across various representations such as images and meshes. Rather than treating tiles as mere boundaries, we consider them as textured shapes, ensuring that both the boundary and interior deform simultaneously. To enable fine-grained artistic input, our interactive tool features a user-controllable adaptive fall-off parameter, allowing precise adjustment of locality and supporting deformations with meaningful semantic control. We demonstrate the effectiveness of our method through various examples, including photo editing and shape sculpting, showing its use in applications such as fabrication and animation.
Published: 2025-06-29T20:03:47Z
Comments: SIGGRAPH 2025
Authors: Crane He Chen, Vladimir G. Kim
DOI: 10.1145/3721238.3730681

http://arxiv.org/abs/2506.19431v2
The CompGIT package: a computational tool for Geometric Invariant Theory quotients
Updated: 2025-06-25T12:54:21Z
We describe CompGIT, a SageMath package to describe Geometric Invariant Theory (GIT) quotients of projective space by simple groups. The implementation is based on algorithms described by Gallardo--Martinez-Garcia--Moon--Swinarski. In principle the package is sufficient to describe any GIT quotient of a projective variety by a simple group -- in practice it requires that the user can construct an equivariant embedding of the polarised variety into projective space. The package describes the non-stable and unstable loci up to conjugation by the group, as well as describing the strictly polystable loci. We discuss potential applications of the outputs of CompGIT to algebraic geometry problems, as well as suggesting directions for future developments.
Published: 2025-06-24T08:57:50Z
Comments: 15 pages, 1 figure. Comments are welcome.
Code available in https://github.com/Robbie-H/CompGIT v2: corrected name on arxiv website
Authors: Robert Hanson, Jesus Martinez-Garcia

http://arxiv.org/abs/2506.19751v1
A modular and extensible library for parameterized terrain generation
Updated: 2025-06-24T16:06:55Z
Simulation-driven development of intelligent machines benefits from artificial terrains with controllable, well-defined characteristics. However, most existing tools for terrain generation focus on artist-driven workflows and visual realism, with limited support for parameterization, reproducibility, or scripting. We present a modular, Python-based library for procedural terrain generation that enables users to construct complex, parameterized terrains by chaining together simple modules. The system supports both structured and noise-based terrain elements, and integrates with Blender for rendering and object placement. The framework is designed to support applications such as generating synthetic terrains for training machine learning models or producing ground truth for perception tasks. By using a minimal but extensible set of modules, the system achieves high flexibility while remaining easy to configure and expand. We demonstrate that this enables fine-grained control over features such as slope, roughness, and the number of rocks, as well as extension to additional measures. This makes it well suited for workflows that demand reproducibility, variation, and integration with automated pipelines.
Published: 2025-06-24T16:06:55Z
Authors: Erik Wallin

http://arxiv.org/abs/2506.19175v1
Binsparse: A Specification for Cross-Platform Storage of Sparse Matrices and Tensors
Updated: 2025-06-23T22:33:58Z
Sparse matrices and tensors are ubiquitous throughout multiple subfields of computing. The widespread usage of sparse data has inspired many in-memory and on-disk storage formats, but the only widely adopted storage specifications are the Matrix Market and FROSTT file formats, which both use ASCII text.
Due to the inefficiency of text storage, these files typically have larger file sizes and longer parsing times than binary storage formats, which directly store an in-memory representation to disk. This can be a major bottleneck; since sparse computation is often bandwidth-bound, the cost of loading or storing a matrix to disk often exceeds the cost of performing a sparse computation. While it is common practice for practitioners to develop their own, custom, non-portable binary formats for high-performance sparse matrix storage, there is currently no cross-platform binary sparse matrix storage format. We present Binsparse, a cross-platform binary sparse matrix and tensor format specification. Binsparse is a modular, embeddable format, consisting of a JSON descriptor, which describes the matrix or tensor dimensions, type, and format, and a series of binary arrays, which can be stored in all modern binary containers, such as HDF5, Zarr, or NPZ. We provide several reference implementations of Binsparse spanning 5 languages, 5 frameworks, and 4 binary containers. We evaluate our Binsparse format on every matrix in the SuiteSparse Matrix Collection and a selection of tensors from the FROSTT collection. The Binsparse HDF5 CSR format shows file size reductions of 2.4x on average without compression and 7.5x with compression. We evaluate our parser's read/write performance against a state-of-the-art Matrix Market parser, demonstrating warm cache mean read speedups of 26.5x without compression and 2.6x with compression, and write speedups of 31x without compression and 1.4x with compression.
Published: 2025-06-23T22:33:58Z
Authors: Benjamin Brock, Willow Ahrens, Hameer Abbasi, Timothy A.
Davis, Juni Kim, James Kitchen, Spencer Patty, Isaac Virshup, Erik Welch

http://arxiv.org/abs/2506.17471v1
Code Generation for Near-Roofline Finite Element Actions on GPUs from Symbolic Variational Forms
Updated: 2025-06-20T20:23:42Z
We present a novel parallelization strategy for evaluating Finite Element Method (FEM) variational forms on GPUs, focusing on those that are expressible through the Unified Form Language (UFL) on simplex meshes. We base our approach on code transformations, wherein we construct a space of scheduling candidates and rank them via a heuristic cost model to effectively handle the large diversity of computational workloads that can be expressed in this way. We present a design of a search space to which the cost model is applied, along with an associated pruning strategy to limit the number of configurations that need to be empirically evaluated. The goal of our design is to strike a balance between the device's latency-hiding capabilities and the amount of state space, a key factor in attaining near-roofline performance.
To make our work widely available, we have prototyped our parallelization strategy within the \textsc{Firedrake} framework, a UFL-based FEM solver. We evaluate the performance of our parallelization scheme on two generations of Nvidia GPUs, specifically the Titan V (Volta architecture) and Tesla K40c (Kepler architecture), across a range of operators commonly used in applications, including fluid dynamics, wave propagation, and structural mechanics, in 2D and 3D geometries. Our results demonstrate that our proposed algorithm achieves more than $50\%$ roofline performance in $65\%$ of the test cases on both devices.
Published: 2025-06-20T20:23:42Z
Authors: Kaushik Kulkarni, Andreas Klöckner

http://arxiv.org/abs/2506.16759v1
Adaptive Sketching Based Construction of H2 Matrices on GPUs
Updated: 2025-06-20T05:29:24Z
We develop a novel linear-complexity bottom-up sketching-based algorithm for constructing an $H^2$ matrix, and present its high-performance GPU implementation. The construction algorithm requires both a black-box sketching operator and an entry evaluation function. The novelty of our GPU approach centers around the design and implementation of the above two operations in batched mode on GPU with accommodation for variable-size data structures in a batch. The batch algorithms minimize the number of kernel launches and maximize the GPU throughput. When applied to covariance matrices, volume IE matrices and $H^2$ update operations, our proposed GPU implementation achieves up to $13\times$ speedup over our CPU implementation, and up to $1000\times$ speedup over an existing GPU implementation of the top-down sketching-based algorithm from the H2Opus library. It also achieves a $660\times$ speedup over an existing sketching-based $H$ construction algorithm from the ButterflyPACK library.
Our work represents the first GPU implementation of the class of bottom-up sketching-based $H^2$ construction algorithms.
Published: 2025-06-20T05:29:24Z
Authors: Wajih Halim Boukaram, Yang Liu, Pieter Ghysels, Xiaoye Sherry Li

http://arxiv.org/abs/2506.16341v1
Transformations of Computational Meshes
Updated: 2025-06-19T14:20:15Z
Computational meshes, as a way to partition space, form the basis of much of PDE simulation technology, for instance for the finite element and finite volume discretization methods. In complex simulations, we are often driven to modify an input mesh, for example, to refine, coarsen, extrude, change cell types, or filter it. Mesh manipulation code can be voluminous, error-prone, spread over many special cases, and hard to understand and maintain by subsequent developers. We present a simple, table-driven paradigm for mesh transformation which can execute a large variety of transformations in a performant, parallel manner, along with experiments in the open source library PETSc which can be run by the reader.
Published: 2025-06-19T14:20:15Z
Comments: 12 pages, 8 figures
Authors: Matthew G. Knepley

http://arxiv.org/abs/2405.04944v3
A Sparse Tensor Generator with Efficient Feature Extraction
Updated: 2025-06-19T06:29:18Z
Sparse tensor operations are increasingly important in diverse applications such as social networks, deep learning, diagnosis, crime, and review analysis. However, a major obstacle in sparse tensor research is the lack of large-scale sparse tensor datasets. Another challenge lies in analyzing sparse tensor features, which are essential not only for understanding the nonzero pattern but also for selecting the most suitable storage format, decomposition algorithm, and reordering methods. However, due to the large size of real-world tensors, even extracting these features can be computationally expensive without careful optimization. To address these limitations, we have developed a smart sparse tensor generator that replicates key characteristics of real sparse tensors.
Additionally, we propose efficient methods for extracting a comprehensive set of sparse tensor features. The effectiveness of our generator is validated through the quality of extracted features and the performance of decomposition on the generated tensors. Both the sparse tensor feature extractor and the tensor generator are open source with all the artifacts available at https://github.com/sparcityeu/FeaTensor and https://github.com/sparcityeu/GenTensor, respectively.
Published: 2024-05-08T10:28:20Z
Comments: 22 pages, 4 figures, 7 tables
Journal reference: Frontiers in Applied Mathematics and Statistics 11 (2025)
Authors: Tugba Torun, Ameer Taweel, Didem Unat
DOI: 10.3389/fams.2025.1589033

http://arxiv.org/abs/2501.17737v2
Sparser, Better, Faster, Stronger: Sparsity Detection for Efficient Automatic Differentiation
Updated: 2025-06-11T14:56:28Z
From implicit differentiation to probabilistic modeling, Jacobian and Hessian matrices have many potential use cases in Machine Learning (ML), but they are viewed as computationally prohibitive. Fortunately, these matrices often exhibit sparsity, which can be leveraged to speed up the process of Automatic Differentiation (AD). This paper presents advances in sparsity detection, previously the performance bottleneck of Automatic Sparse Differentiation (ASD). Our implementation of sparsity detection is based on operator overloading, able to detect both local and global sparsity patterns, and supports flexible index set representations. It is fully automatic and requires no modification of user code, making it compatible with existing ML codebases. Most importantly, it is highly performant, unlocking Jacobians and Hessians at scales where they were considered too expensive to compute. On real-world problems from scientific ML, graph neural networks and optimization, we show significant speed-ups of up to three orders of magnitude.
Notably, using our sparsity detection system, ASD outperforms standard AD for one-off computations, without amortization of either sparsity detection or matrix coloring.
Published: 2025-01-29T16:21:54Z
Comments: 33 pages, 6 figures, 6 tables, 3 listings
Authors: Adrian Hill, Guillaume Dalle
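
Illustration for arXiv:2501.17737 (not part of the feed): the abstract above describes sparsity detection based on operator overloading with index sets. The following minimal Python sketch shows the general idea under stated assumptions. It is not the authors' implementation: the names `IndexTracer` and `jacobian_sparsity` are hypothetical, only basic arithmetic is overloaded, and the result is a conservative global pattern (it over-approximates, and value-dependent local sparsity is not handled).

```python
class IndexTracer:
    """Hypothetical stand-in for a number that records which inputs it depends on."""

    def __init__(self, deps=frozenset()):
        self.deps = frozenset(deps)

    @staticmethod
    def _deps_of(value):
        # Plain numbers are constants, so they depend on no input.
        return value.deps if isinstance(value, IndexTracer) else frozenset()

    def _combine(self, other):
        # The result of a binary operation may depend on the inputs of both operands.
        return IndexTracer(self.deps | self._deps_of(other))

    # Overload basic arithmetic, including the reflected forms (e.g. 2.0 * x).
    __add__ = __radd__ = __sub__ = __rsub__ = _combine
    __mul__ = __rmul__ = __truediv__ = __rtruediv__ = _combine

    def __neg__(self):
        return IndexTracer(self.deps)


def jacobian_sparsity(f, n):
    """Return, for each output of f, the sorted input indices it may depend on."""
    inputs = [IndexTracer({j}) for j in range(n)]  # input j depends only on itself
    outputs = f(inputs)                            # user code runs unmodified
    return [sorted(IndexTracer._deps_of(y)) for y in outputs]


def f(x):
    # Example function from R^4 to R^4; the last output is a constant.
    return [x[0] * x[1], x[2] + 1.0, 2.0 * x[0] - x[3], 5.0]


pattern = jacobian_sparsity(f, 4)
print(pattern)  # [[0, 1], [2], [0, 3], []]
```

Each inner list gives the column indices of the possibly nonzero entries in one Jacobian row. Because dependencies are only ever unioned, an expression such as `x[0] - x[0]` would still be reported as depending on input 0.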