http://arxiv.org/abs/2409.18772v1 A method of using RSVD in residual calculation of LowBit GEMM 2024-09-27T14:16:35Z

Advances in hardware technology in recent years have opened up many possibilities for low-precision applications. However, low precision can introduce significant computational errors, which makes maintaining computational accuracy a considerable challenge. We propose the low-rank residuals quantized matrix multiplication (LRQMM) method, which introduces low-rank approximation into residual compensation for dense low-precision quantized matrix multiplication. It can improve accuracy several-fold with only BLAS-2 level extra time overhead. Moreover, LRQMM is a completely data-free quantization method that does not require additional data for pre-training. It operates only on the low-precision GEMM operator, which makes it easy to couple with other methods. In experiments, LRQMM reduces the error of direct quantized matrix multiplication by 1 to 2 orders of magnitude; for larger matrix sizes, the computational speed is reduced by only approximately 20%. In deep learning networks, LRQMM-4bit achieves 61.8% ImageNet Top-1 accuracy with ResNet-50, while the Direct Quant accuracy is only 8.3%.
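
The idea of compensating a quantized product with a low-rank approximation of the quantization residuals can be sketched in NumPy. This is an illustrative sketch only, not the authors' LRQMM implementation: the symmetric per-matrix quantizer, the function names (`quantize`, `rsvd`, `direct_quant_matmul`, `lrq_matmul`), and the choice to correct only the two first-order residual terms are all assumptions made here. How much the correction helps depends on how quickly the singular values of the residuals decay.

```python
import numpy as np

def quantize(M, bits=4):
    """Symmetric per-matrix quantization (an assumed scheme): M ~ scale * Q."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(M).max() / qmax
    Q = np.clip(np.round(M / scale), -qmax, qmax)
    return Q, scale

def rsvd(M, rank, rng, oversample=5):
    """Randomized SVD: returns factors (L, R) with M ~ L @ R of the given rank."""
    omega = rng.standard_normal((M.shape[1], rank + oversample))
    Qb, _ = np.linalg.qr(M @ omega)            # orthonormal basis for the sampled range
    U, s, Vt = np.linalg.svd(Qb.T @ M, full_matrices=False)
    return (Qb @ U[:, :rank]) * s[:rank], Vt[:rank]

def direct_quant_matmul(A, B, bits=4):
    """Baseline: multiply the quantized matrices and rescale."""
    QA, sa = quantize(A, bits)
    QB, sb = quantize(B, bits)
    return (sa * sb) * (QA @ QB)               # stands in for a low-precision integer GEMM

def lrq_matmul(A, B, bits=4, rank=8, seed=0):
    """Quantized product plus low-rank compensation of the residuals."""
    rng = np.random.default_rng(seed)
    QA, sa = quantize(A, bits)
    QB, sb = quantize(B, bits)
    Aq, Bq = sa * QA, sb * QB
    C = (sa * sb) * (QA @ QB)
    # A @ B = Aq @ Bq + RA @ Bq + Aq @ RB + RA @ RB, with RA = A - Aq and RB = B - Bq.
    LA, RA_r = rsvd(A - Aq, rank, rng)
    LB, RB_r = rsvd(B - Bq, rank, rng)
    # Correct the two first-order terms with skinny products; for square
    # n-by-n inputs the extra cost is O(n^2 * rank).
    C += LA @ (RA_r @ Bq) + (Aq @ LB) @ RB_r
    return C
```

With `rank` equal to the full dimension, the only remaining error is the second-order term `RA @ RB`; smaller ranks trade accuracy for less extra work.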

2024-09-27T14:16:35Z Hongyaoxing Gu http://arxiv.org/abs/2409.15926v1 QHyper: an integration library for hybrid quantum-classical optimization 2024-09-24T09:47:28Z

We propose the QHyper library, which is aimed at researchers working on computational experiments with a variety of quantum combinatorial optimization solvers. The library offers a simple and extensible interface for formulating combinatorial optimization problems, selecting and running solvers, and optimizing hyperparameters. The supported solver set includes variational gate-based algorithms, quantum annealers, and classical solutions. The solvers can be combined with provided local and global (hyper)optimizers. The main features of the library are its extensibility on different levels of use as well as a straightforward and flexible experiment configuration format presented in the paper.

2024-09-24T09:47:28Z Tomasz Lamża Justyna Zawalska Kacper Jurek Mariusz Sterzel Katarzyna Rycerz http://arxiv.org/abs/2409.13090v1 Some new techniques to use in serial sparse Cholesky factorization algorithms 2024-09-19T21:20:00Z

We present a new variant of serial right-looking supernodal sparse Cholesky factorization (RL). Our comparison of RL with the multifrontal method confirms that RL is simpler, slightly faster, and requires slightly less storage. The key to the rest of the work in this paper is recent work on reordering columns within supernodes so that the dense off-diagonal blocks in the factor matrix joining pairs of supernodes are fewer and larger. We present a second new variant of serial right-looking supernodal sparse Cholesky factorization (RLB), where this one is specifically designed to exploit fewer and larger off-diagonal blocks in the factor matrix obtained by reordering within supernodes. A key distinction found in RLB is that it uses no floating-point working storage and performs no assembly operations. Our key finding is that RLB is unequivocally faster than its competitors. Indeed, RLB is consistently, but modestly, faster than its competitors whenever Intel's MKL sequential BLAS are used. More importantly, RLB is substantially faster than its competitors whenever Intel's MKL multithreaded BLAS are used. Finally, RLB using the multithreaded BLAS achieves impressive speedups over RLB using the sequential BLAS.
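
For readers unfamiliar with the term, "right-looking" means that as soon as a column of the factor is computed, its update is pushed into the trailing part of the matrix to its right. The sketch below shows this on a small dense matrix. It is a deliberately simplified dense, unblocked analogue, not the supernodal sparse RL or RLB variants described in the abstract; the function name `right_looking_cholesky` is hypothetical.

```python
import numpy as np

def right_looking_cholesky(A):
    """Dense right-looking Cholesky: returns lower-triangular L with L @ L.T == A.

    Assumes A is symmetric positive definite.
    """
    S = np.array(A, dtype=float)   # working copy that accumulates trailing updates
    n = S.shape[0]
    L = np.zeros_like(S)
    for k in range(n):
        # Finish column k of the factor.
        L[k, k] = np.sqrt(S[k, k])
        L[k + 1:, k] = S[k + 1:, k] / L[k, k]
        # Right-looking step: apply the rank-1 update to everything to the right now.
        S[k + 1:, k + 1:] -= np.outer(L[k + 1:, k], L[k + 1:, k])
    return L
```

In a supernodal code the same pattern is applied to blocks of columns, so the trailing updates become dense BLAS calls on off-diagonal blocks.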

2024-09-19T21:20:00Z M. Ozan Karsavuran Esmond G. Ng Barry W. Peyton Jonathan L. Peyton http://arxiv.org/abs/2409.10729v1 OpenACC offloading of the MFC compressible multiphase flow solver on AMD and NVIDIA GPUs 2024-09-16T21:05:45Z

GPUs are the heart of the latest generations of supercomputers. We efficiently accelerate a compressible multiphase flow solver via OpenACC on NVIDIA and AMD Instinct GPUs. Optimization is accomplished by specifying the directive clauses 'gang vector' and 'collapse'. Further speedups of six and ten times are achieved by packing user-defined types into coalesced multidimensional arrays and manual inlining via metaprogramming. Additional optimizations yield seven-times speedup in array packing and thirty-times speedup of select kernels on Frontier. Weak scaling efficiencies of 97% and 95% are observed when scaling to 50% of Summit and 95% of Frontier. Strong scaling efficiencies of 84% and 81% are observed when increasing the device count by a factor of 8 and 16 on V100 and MI250X hardware. The strong scaling efficiency of AMD's MI250X increases to 92% when increasing the device count by a factor of 16 when GPU-aware MPI is used for communication.

2024-09-16T21:05:45Z 11 pages, 9 figures, 6 listings, WACCPD at SC24 SC 24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W 24). IEEE Press, 1923-1933 Benjamin Wilfong Anand Radhakrishnan Henry A. Le Berre Steve Abbott Reuben D. Budiardja Spencer H. Bryngelson 10.1109/SCW63240.2024.00242 http://arxiv.org/abs/2409.09622v1 Computing Arrangements of Hypersurfaces 2024-09-15T06:11:57Z

We present a Julia package HypersurfaceRegions.jl for computing all connected components in the complement of an arrangement of real algebraic hypersurfaces in $\mathbb{R}^n$.

2024-09-15T06:11:57Z 16 pages, 6 figures J. Softw. Alg. Geom. 15 (2025) 11-27 Paul Breiding Bernd Sturmfels Kexin Wang 10.2140/jsag.2025.15.11 http://arxiv.org/abs/2409.09208v1 A Unified Funnel Restoration SQP Algorithm 2024-09-13T21:42:06Z

We consider nonlinearly constrained optimization problems and discuss a generic double-loop framework consisting of four algorithmic ingredients that unifies a broad range of nonlinear optimization solvers. This framework has been implemented in the open-source solver Uno, a Swiss Army knife-like C++ optimization framework that unifies many nonlinearly constrained nonconvex optimization solvers. We illustrate the framework with a sequential quadratic programming (SQP) algorithm that maintains an acceptable upper bound on the constraint violation, called a funnel, that is monotonically decreased to control the feasibility of the iterates. Infeasible quadratic subproblems are handled by a feasibility restoration strategy. Globalization is controlled by a line search or a trust-region method. We prove global convergence of the trust-region funnel SQP method, building on known results from filter methods. We implement the algorithm in Uno, and we provide extensive test results for the trust-region line-search funnel SQP on small CUTEst instances.

2024-09-13T21:42:06Z Submitted to Mathematical Programming David Kiessling Sven Leyffer Charlie Vanaret http://arxiv.org/abs/1907.02088v7 hyppo: A Multivariate Hypothesis Testing Python Package 2024-09-13T01:15:40Z

We introduce hyppo, a unified library for performing multivariate hypothesis testing, including independence, two-sample, and k-sample testing. While many multivariate independence tests have R packages available, the interfaces are inconsistent and most are not available in Python. hyppo includes many state-of-the-art multivariate testing procedures. The package is easy to use and is flexible enough to enable future extensions. The documentation and all releases are available at https://hyppo.neurodata.io.

2019-07-03T18:05:25Z Sambit Panda Satish Palaniappan Junhao Xiong Eric W. Bridgeford Ronak Mehta Cencheng Shen Joshua T. Vogelstein http://arxiv.org/abs/2409.07000v1 Introducing UNIQuE: The Unconventional Noiseless Intermediate Quantum Emulator 2024-09-11T04:24:51Z

We implement the first open-source quantum computing emulator that includes arithmetic operations, the quantum Fourier transform, and quantum phase estimation. The emulator provides significant savings in both temporal and spatial resources compared to simulation, and these computational advantages are verified through comparison to the Intel Quantum Simulator. We also demonstrate how to use the emulator to implement Shor's algorithm and use it to solve a nontrivial factoring problem. This demonstrates that emulation can make quantum computing more accessible than simulation or noisy hardware by allowing researchers to study the behavior of algorithms on large problems in a noiseless environment.
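
The distinction between emulation and simulation can be illustrated with the quantum Fourier transform. A simulator applies the circuit (or an equivalent dense operator) to the state vector, while an emulator may replace the whole operation with a classical shortcut that produces the same amplitudes. The sketch below is an assumption-laden illustration, not UNIQuE's code: it assumes the basis index is the integer stored in the register, uses the convention |j> -> N^(-1/2) * sum_k exp(2*pi*i*j*k/N) |k>, and the function names are hypothetical.

```python
import numpy as np

def qft_dense(state):
    """Simulation-style reference: build the N-by-N QFT matrix and apply it."""
    N = state.shape[0]
    idx = np.arange(N)
    F = np.exp(2j * np.pi * np.outer(idx, idx) / N) / np.sqrt(N)
    return F @ state                      # O(N^2) time and memory

def qft_emulated(state):
    """Emulation-style shortcut: the same map via a fast Fourier transform."""
    # With norm="ortho", NumPy's inverse FFT uses the +2*pi*i sign and a
    # 1/sqrt(N) factor, which matches the QFT convention assumed above.
    return np.fft.ifft(state, norm="ortho")   # O(N log N) time
```

Both functions return the same amplitudes; the second never forms the operator explicitly, which is the kind of saving emulation aims for.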

2024-09-11T04:24:51Z 2025 IEEE International Conference on Collaborative Advances in Software and COmputiNg (CASCON) 66301, 00086 (2025) Reece Robertson Dan Ventura 10.1109/CASCON66301.2025.00086 http://arxiv.org/abs/2409.06085v1 Differentiable programming across the PDE and Machine Learning barrier 2024-09-09T21:36:38Z

The combination of machine learning and physical laws has shown immense potential for solving scientific problems driven by partial differential equations (PDEs) with the promise of fast inference, zero-shot generalisation, and the ability to discover new physics. Examples include the use of fundamental physical laws as inductive bias to machine learning algorithms, also referred to as physics-driven machine learning, and the application of machine learning to represent features not represented in the differential equations such as closures for unresolved spatiotemporal scales. However, the simulation of complex physical systems by coupling advanced numerics for PDEs with state-of-the-art machine learning demands the composition of specialist PDE solving frameworks with industry-standard machine learning tools. Hand-rolling either the PDE solver or the neural net will not cut it. In this work, we introduce a generic differentiable programming abstraction that provides scientists and engineers with a highly productive way of specifying end-to-end differentiable models coupling machine learning and PDE-based components, while relying on code generation for high performance. Our interface automates the coupling of arbitrary PDE-based systems and machine learning models and unlocks new applications that could not hitherto be tackled, while only requiring trivial changes to existing code. Our framework has been adopted in the Firedrake finite-element library and supports the PyTorch and JAX ecosystems, as well as downstream libraries.

2024-09-09T21:36:38Z Nacime Bouziani David A. Ham Ado Farsi http://arxiv.org/abs/2402.13768v5 Democratizing Uncertainty Quantification 2024-09-09T13:32:03Z

Uncertainty Quantification (UQ) is vital to safety-critical model-based analyses, but the widespread adoption of sophisticated UQ methods is limited by technical complexity. In this paper, we introduce UM-Bridge (the UQ and Modeling Bridge), a high-level abstraction and software protocol that facilitates universal interoperability of UQ software with simulation codes. It breaks down the technical complexity of advanced UQ applications and enables separation of concerns between experts. UM-Bridge democratizes UQ by allowing effective interdisciplinary collaboration, accelerating the development of advanced UQ methods, and making it easy to perform UQ analyses from prototype to High Performance Computing (HPC) scale. In addition, we present a library of ready-to-run UQ benchmark problems, all easily accessible through UM-Bridge. These benchmarks support UQ methodology research, enabling reproducible performance comparisons. We demonstrate UM-Bridge with several scientific applications, harnessing HPC resources even using UQ codes not designed with HPC support.

2024-02-21T12:43:41Z Add Benjamin Kent as co-author in accordance with the paper's published version Linus Seelinger Anne Reinarz Mikkel B. Lykkegaard Robert Akers Amal M. A. Alghamdi David Aristoff Wolfgang Bangerth Jean Bénézech Matteo Diez Kurt Frey John D. Jakeman Jakob S. Jørgensen Ki-Tae Kim Benjamin M. Kent Massimiliano Martinelli Matthew Parno Riccardo Pellegrini Noemi Petra Nicolai A. B. Riis Katherine Rosenfeld Andrea Serani Lorenzo Tamellini Umberto Villa Tim J. Dodwell Robert Scheichl http://arxiv.org/abs/2409.00568v2 Welding R and C++: A Tale of Two Programming Languages 2024-09-07T15:39:21Z

This article compares `cpp11armadillo` and `cpp11eigen`, new R packages that integrate the powerful Armadillo and Eigen C++ libraries for linear algebra into the R programming environment. It provides a detailed comparison of Armadillo and Eigen in terms of speed and syntax. The goal of these packages is to simplify part of the process of resolving bottlenecks by using C++ within R; they offer additional ease of integration for users who require high-performance linear algebra operations in their R workflows. The article also discusses the tradeoff between computational efficiency and accessibility.

2024-09-01T00:09:00Z 21 pages, 0 figures, 13 tables Mauricio Vargas Sepulveda 10.1016/j.softx.2025.102087 http://arxiv.org/abs/2409.04789v1 forester: A Tree-Based AutoML Tool in R 2024-09-07T10:39:10Z

The majority of automated machine learning (AutoML) solutions are developed in Python; however, a large percentage of data scientists work with the R language. Unfortunately, there are limited R solutions available. Moreover, their high entry level means they are not accessible to everyone, because they require knowledge of machine learning (ML). To fill this gap, we present the forester package, which offers ease of use regardless of the user's proficiency in the area of machine learning. The forester is an open-source AutoML package implemented in R designed for training high-quality tree-based models on tabular data. It fully supports binary and multiclass classification and regression, and partially supports survival analysis tasks. With just a few functions, the user is capable of detecting issues regarding the data quality, preparing the preprocessing pipeline, training and tuning tree-based models, evaluating the results, and creating the report for further analysis.

2024-09-07T10:39:10Z Hubert Ruczyński Anna Kozak http://arxiv.org/abs/2409.03121v1 QHDOPT: A Software for Nonlinear Optimization with Quantum Hamiltonian Descent 2024-09-04T23:11:25Z

We develop an open-source, end-to-end software (named QHDOPT), which can solve nonlinear optimization problems using the quantum Hamiltonian descent (QHD) algorithm. QHDOPT offers an accessible interface and automatically maps tasks to various supported quantum backends (i.e., quantum hardware machines). These features enable users, even those without prior knowledge or experience in quantum computing, to utilize the power of existing quantum devices for nonlinear and nonconvex optimization tasks. In its intermediate compilation layer, QHDOPT employs SimuQ, an efficient interface for Hamiltonian-oriented programming, to facilitate multiple algorithmic specifications and ensure compatible cross-hardware deployment. The detailed documentation of QHDOPT is available at https://github.com/jiaqileng/QHDOPT.

2024-09-04T23:11:25Z 23 pages, 7 figures. The full repository is available at https://github.com/jiaqileng/QHDOPT Samuel Kushnir Jiaqi Leng Yuxiang Peng Lei Fan Xiaodi Wu http://arxiv.org/abs/2311.02037v2 An Efficient Framework for Global Non-Convex Polynomial Optimization with Algebraic Constraints 2024-09-04T17:25:51Z

We present an efficient framework for solving algebraically-constrained global non-convex polynomial optimization problems over subsets of the hypercube. We prove the existence of an equivalent nonlinear reformulation of such problems that possesses essentially no spurious local minima. Through numerical experiments on previously intractable global constrained polynomial optimization problems in high dimension, we show that polynomial scaling in dimension and degree is achievable when computing the optimal value and location.

2023-11-03T17:10:26Z Mitchell Tong Harris Pierre-David Letourneau Dalton Jones M. Harper Langston http://arxiv.org/abs/2409.01712v1 Toward Capturing Genetic Epistasis From Multivariate Genome-Wide Association Studies Using Mixed-Precision Kernel Ridge Regression 2024-09-03T08:50:42Z

We exploit the widening margin in tensor-core performance between [FP64/FP32/FP16/INT8,FP64/FP32/FP16/FP8/INT8] on NVIDIA [Ampere,Hopper] GPUs to boost the performance of output accuracy-preserving mixed-precision computation of Genome-Wide Association Studies (GWAS) of 305K patients from the UK BioBank, the largest-ever GWAS cohort studied for genetic epistasis using a multivariate approach. Tile-centric adaptive-precision linear algebraic techniques motivated by reducing data motion gain enhanced significance with low-precision GPU arithmetic. At the core of Kernel Ridge Regression (KRR) techniques for GWAS lie compute-bound cubic-complexity matrix operations that inhibit scaling to aspirational dimensions of the population, genotypes, and phenotypes. We accelerate KRR matrix generation by redesigning the computation for Euclidean distances to engage INT8 tensor cores while exploiting symmetry. We accelerate solution of the regularized KRR systems by deploying a new four-precision Cholesky-based solver, which, at 1.805 mixed-precision ExaOp/s on a nearly full Alps system, outperforms the state-of-the-art CPU-only REGENIE GWAS software by five orders of magnitude.
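
One standard way to turn pairwise squared Euclidean distances into a matrix multiplication is the identity ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i . x_j, where the inner products form a Gram matrix that a GEMM can compute. The NumPy sketch below illustrates that identity with integer inputs and a tile loop that visits only the lower triangle and mirrors each tile. It is a CPU illustration under stated assumptions, not the authors' GPU kernel: the function name, the tile size, and the use of 32-bit accumulation as a stand-in for INT8 tensor-core GEMM are all choices made here.

```python
import numpy as np

def sq_distances_int8(X, tile=64):
    """Pairwise squared Euclidean distances between the rows of an int8 matrix.

    Assumes the feature dimension is small enough that 32-bit sums cannot overflow.
    """
    X = np.asarray(X, dtype=np.int8)
    n = X.shape[0]
    X32 = X.astype(np.int32)                 # int8 data, 32-bit accumulation
    norms = np.einsum("ij,ij->i", X32, X32)  # squared row norms
    D = np.empty((n, n), dtype=np.int32)
    for i in range(0, n, tile):
        for j in range(0, i + 1, tile):      # lower-triangular tiles only
            G = X32[i:i + tile] @ X32[j:j + tile].T   # Gram tile (the GEMM part)
            block = norms[i:i + tile, None] + norms[None, j:j + tile] - 2 * G
            D[i:i + tile, j:j + tile] = block
            D[j:j + tile, i:i + tile] = block.T       # mirror by symmetry
    return D
```

Because the result is symmetric, roughly half of the tile products are skipped, and all arithmetic stays in integers, so the distances are exact.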

2024-09-03T08:50:42Z Hatem Ltaief Rabab Alomairy Qinglei Cao Jie Ren Lotfi Slim Thorsten Kurth Benedikt Dorschner Salim Bougouffa Rached Abdelkhalak David E. Keyes