http://arxiv.org/abs/2409.18772v1
A method of using RSVD in residual calculation of LowBit GEMM
2024-09-27T14:16:35Z
The advancements of hardware technology in recent years have brought many possibilities for low-precision applications. However, the use of low precision can introduce significant computational errors, posing a considerable challenge to maintaining computational accuracy.
We propose the low-rank residuals quantized matrix multiplication (LRQMM) method, which introduces low-rank approximation into the residual compensation of dense low-precision quantized matrix multiplication. It can improve accuracy several times over with only BLAS-2 level extra time overhead. Moreover, LRQMM is a completely data-free quantization method that does not require additional data for pre-training. It relies only on a low-precision GEMM operator, which makes it easy to couple with other methods.
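The low-rank residual compensation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the symmetric per-tensor quantizer, the first-order correction, and the function names `quantize`, `rsvd_factors`, and `lowrank_compensated_matmul` are all assumptions made for illustration.

```python
import numpy as np


def quantize(x, bits=4):
    """Symmetric per-tensor quantization (assumed scheme): integer codes and scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    codes = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return codes, scale


def rsvd_factors(r, rank, rng, n_iter=1):
    """Return (U, V) with U @ V ~= r, using a randomized range finder."""
    omega = rng.standard_normal((r.shape[1], rank))
    y = r @ omega
    for _ in range(n_iter):
        # Power iteration sharpens the captured subspace.
        y = r @ (r.T @ y)
    q, _ = np.linalg.qr(y)      # orthonormal basis for the sampled range
    return q, q.T @ r


def lowrank_compensated_matmul(a, b, bits=4, rank=8, seed=0):
    """Return (direct, compensated) estimates of a @ b."""
    rng = np.random.default_rng(seed)
    qa, sa = quantize(a, bits)
    qb, sb = quantize(b, bits)
    a_hat, b_hat = qa * sa, qb * sb

    # Direct quantized product: integer GEMM followed by rescaling.
    direct = (qa @ qb) * (sa * sb)

    # Low-rank factors of the quantization residuals.
    ua, va = rsvd_factors(a - a_hat, rank, rng)
    ub, vb = rsvd_factors(b - b_hat, rank, rng)

    # First-order correction R_A @ B_hat + A_hat @ R_B, evaluated through
    # the thin factors so that the extra cost is O(n^2 * rank).
    correction = ua @ (va @ b_hat) + (a_hat @ ub) @ vb
    return direct, direct + correction
```

Because each correction term is a product with a thin factor of width `rank`, the extra work grows like a handful of matrix-vector products rather than a second full GEMM, which is the kind of overhead the abstract refers to.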
In experiments, LRQMM reduces the error of direct quantized matrix multiplication by 1 to 2 orders of magnitude; for larger matrix sizes, the computational speed is reduced by only approximately 20%. In deep learning networks, LRQMM-4bit achieves 61.8% ImageNet Top-1 accuracy on ResNet-50, while the Direct Quant accuracy is only 8.3%.
2024-09-27T14:16:35Z
Hongyaoxing Gu

http://arxiv.org/abs/2409.15926v1
QHyper: an integration library for hybrid quantum-classical optimization
2024-09-24T09:47:28Z
We propose the QHyper library, which is aimed at researchers working on computational experiments with a variety of quantum combinatorial optimization solvers. The library offers a simple and extensible interface for formulating combinatorial optimization problems, selecting and running solvers, and optimizing hyperparameters. The supported solver set includes variational gate-based algorithms, quantum annealers, and classical solutions. The solvers can be combined with provided local and global (hyper)optimizers. The main features of the library are its extensibility on different levels of use as well as a straightforward and flexible experiment configuration format presented in the paper.
2024-09-24T09:47:28Z
Tomasz Lamża, Justyna Zawalska, Kacper Jurek, Mariusz Sterzel, Katarzyna Rycerz

http://arxiv.org/abs/2409.13090v1
Some new techniques to use in serial sparse Cholesky factorization algorithms
2024-09-19T21:20:00Z
We present a new variant of serial right-looking supernodal sparse Cholesky factorization (RL). Our comparison of RL with the multifrontal method confirms that RL is simpler, slightly faster, and requires slightly less storage. The key to the rest of the work in this paper is recent work on reordering columns within supernodes so that the dense off-diagonal blocks in the factor matrix joining pairs of supernodes are fewer and larger.
We present a second new variant of serial right-looking supernodal sparse Cholesky factorization (RLB), where this one is specifically designed to exploit fewer and larger off-diagonal blocks in the factor matrix obtained by reordering within supernodes. A key distinction found in RLB is that it uses no floating-point working storage and performs no assembly operations. Our key finding is that RLB is unequivocally faster than its competitors. Indeed, RLB is consistently, but modestly, faster than its competitors whenever Intel's MKL sequential BLAS are used. More importantly, RLB is substantially faster than its competitors whenever Intel's MKL multithreaded BLAS are used. Finally, RLB using the multithreaded BLAS achieves impressive speedups over RLB using the sequential BLAS.
2024-09-19T21:20:00Z
M. Ozan Karsavuran, Esmond G. Ng, Barry W. Peyton, Jonathan L. Peyton

http://arxiv.org/abs/2409.10729v1
OpenACC offloading of the MFC compressible multiphase flow solver on AMD and NVIDIA GPUs
2024-09-16T21:05:45Z
GPUs are the heart of the latest generations of supercomputers. We efficiently accelerate a compressible multiphase flow solver via OpenACC on NVIDIA and AMD Instinct GPUs. Optimization is accomplished by specifying the directive clauses 'gang vector' and 'collapse'. Further speedups of six and ten times are achieved by packing user-defined types into coalesced multidimensional arrays and manual inlining via metaprogramming. Additional optimizations yield seven-times speedup in array packing and thirty-times speedup of select kernels on Frontier. Weak scaling efficiencies of 97% and 95% are observed when scaling to 50% of Summit and 95% of Frontier. Strong scaling efficiencies of 84% and 81% are observed when increasing the device count by a factor of 8 and 16 on V100 and MI250X hardware.
The strong scaling efficiency of AMD's MI250X increases to 92% when increasing the device count by a factor of 16 when GPU-aware MPI is used for communication.
2024-09-16T21:05:45Z
11 pages, 9 figures, 6 listings, WACCPD at SC24
SC 24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W 24). IEEE Press, 1923-1933
Benjamin Wilfong, Anand Radhakrishnan, Henry A. Le Berre, Steve Abbott, Reuben D. Budiardja, Spencer H. Bryngelson
10.1109/SCW63240.2024.00242

http://arxiv.org/abs/2409.09622v1
Computing Arrangements of Hypersurfaces
2024-09-15T06:11:57Z
We present a Julia package HypersurfaceRegions.jl for computing all connected components in the complement of an arrangement of real algebraic hypersurfaces in $\mathbb{R}^n$.
2024-09-15T06:11:57Z
16 pages, 6 figures
J. Softw. Alg. Geom. 15 (2025) 11-27
Paul Breiding, Bernd Sturmfels, Kexin Wang
10.2140/jsag.2025.15.11

http://arxiv.org/abs/2409.09208v1
A Unified Funnel Restoration SQP Algorithm
2024-09-13T21:42:06Z
We consider nonlinearly constrained optimization problems and discuss a generic double-loop framework consisting of four algorithmic ingredients that unifies a broad range of nonlinear optimization solvers. This framework has been implemented in the open-source solver Uno, a Swiss Army knife-like C++ optimization framework that unifies many nonlinearly constrained nonconvex optimization solvers. We illustrate the framework with a sequential quadratic programming (SQP) algorithm that maintains an acceptable upper bound on the constraint violation, called a funnel, that is monotonically decreased to control the feasibility of the iterates. Infeasible quadratic subproblems are handled by a feasibility restoration strategy. Globalization is controlled by a line search or a trust-region method. We prove global convergence of the trust-region funnel SQP method, building on known results from filter methods.
We implement the algorithm in Uno, and we provide extensive test results for the trust-region line-search funnel SQP on small CUTEst instances.
2024-09-13T21:42:06Z
Submitted to Mathematical Programming
David Kiessling, Sven Leyffer, Charlie Vanaret

http://arxiv.org/abs/1907.02088v7
hyppo: A Multivariate Hypothesis Testing Python Package
2024-09-13T01:15:40Z
We introduce hyppo, a unified library for performing multivariate hypothesis testing, including independence, two-sample, and k-sample testing. While many multivariate independence tests have R packages available, the interfaces are inconsistent and most are not available in Python. hyppo includes many state-of-the-art multivariate testing procedures. The package is easy to use and is flexible enough to enable future extensions. The documentation and all releases are available at https://hyppo.neurodata.io.
2019-07-03T18:05:25Z
Sambit Panda, Satish Palaniappan, Junhao Xiong, Eric W. Bridgeford, Ronak Mehta, Cencheng Shen, Joshua T. Vogelstein

http://arxiv.org/abs/2409.07000v1
Introducing UNIQuE: The Unconventional Noiseless Intermediate Quantum Emulator
2024-09-11T04:24:51Z
We implement the first open-source quantum computing emulator that includes arithmetic operations, the quantum Fourier transform, and quantum phase estimation. The emulator provides significant savings in both temporal and spatial resources compared to simulation, and these computational advantages are verified through comparison to the Intel Quantum Simulator. We also demonstrate how to use the emulator to implement Shor's algorithm and use it to solve a nontrivial factoring problem.
This demonstrates that emulation can make quantum computing more accessible than simulation or noisy hardware by allowing researchers to study the behavior of algorithms on large problems in a noiseless environment.
2024-09-11T04:24:51Z
2025 IEEE International Conference on Collaborative Advances in Software and COmputiNg (CASCON) 66301, 00086 (2025)
Reece Robertson, Dan Ventura
10.1109/CASCON66301.2025.00086

http://arxiv.org/abs/2409.06085v1
Differentiable programming across the PDE and Machine Learning barrier
2024-09-09T21:36:38Z
The combination of machine learning and physical laws has shown immense potential for solving scientific problems driven by partial differential equations (PDEs) with the promise of fast inference, zero-shot generalisation, and the ability to discover new physics. Examples include the use of fundamental physical laws as inductive bias to machine learning algorithms, also referred to as physics-driven machine learning, and the application of machine learning to represent features not represented in the differential equations such as closures for unresolved spatiotemporal scales. However, the simulation of complex physical systems by coupling advanced numerics for PDEs with state-of-the-art machine learning demands the composition of specialist PDE solving frameworks with industry-standard machine learning tools. Hand-rolling either the PDE solver or the neural net will not cut it. In this work, we introduce a generic differentiable programming abstraction that provides scientists and engineers with a highly productive way of specifying end-to-end differentiable models coupling machine learning and PDE-based components, while relying on code generation for high performance. Our interface automates the coupling of arbitrary PDE-based systems and machine learning models and unlocks new applications that could not hitherto be tackled, while only requiring trivial changes to existing code.
Our framework has been adopted in the Firedrake finite-element library and supports the PyTorch and JAX ecosystems, as well as downstream libraries.
2024-09-09T21:36:38Z
Nacime Bouziani, David A. Ham, Ado Farsi

http://arxiv.org/abs/2402.13768v5
Democratizing Uncertainty Quantification
2024-09-09T13:32:03Z
Uncertainty Quantification (UQ) is vital to safety-critical model-based analyses, but the widespread adoption of sophisticated UQ methods is limited by technical complexity. In this paper, we introduce UM-Bridge (the UQ and Modeling Bridge), a high-level abstraction and software protocol that facilitates universal interoperability of UQ software with simulation codes. It breaks down the technical complexity of advanced UQ applications and enables separation of concerns between experts. UM-Bridge democratizes UQ by allowing effective interdisciplinary collaboration, accelerating the development of advanced UQ methods, and making it easy to perform UQ analyses from prototype to High Performance Computing (HPC) scale.
In addition, we present a library of ready-to-run UQ benchmark problems, all easily accessible through UM-Bridge. These benchmarks support UQ methodology research, enabling reproducible performance comparisons. We demonstrate UM-Bridge with several scientific applications, harnessing HPC resources even using UQ codes not designed with HPC support.
2024-02-21T12:43:41Z
Add Benjamin Kent as co-author in accordance with the paper's published version
Linus Seelinger, Anne Reinarz, Mikkel B. Lykkegaard, Robert Akers, Amal M. A. Alghamdi, David Aristoff, Wolfgang Bangerth, Jean Bénézech, Matteo Diez, Kurt Frey, John D. Jakeman, Jakob S. Jørgensen, Ki-Tae Kim, Benjamin M. Kent, Massimiliano Martinelli, Matthew Parno, Riccardo Pellegrini, Noemi Petra, Nicolai A. B. Riis, Katherine Rosenfeld, Andrea Serani, Lorenzo Tamellini, Umberto Villa, Tim J. Dodwell, Robert Scheichl

http://arxiv.org/abs/2409.00568v2
Welding R and C++: A Tale of Two Programming Languages
2024-09-07T15:39:21Z
This article compares `cpp11armadillo` and `cpp11eigen`, new R packages that integrate the powerful Armadillo and Eigen C++ libraries for linear algebra into the R programming environment. It provides a detailed comparison of the speed and syntax of Armadillo and Eigen. The goal of these packages is to simplify part of the process of resolving bottlenecks by using C++ within R; they offer additional ease of integration for users who require high-performance linear algebra operations in their R workflows. This document aims to discuss the tradeoff between computational efficiency and accessibility.
2024-09-01T00:09:00Z
21 pages, 0 figures, 13 tables
Mauricio Vargas Sepulveda
10.1016/j.softx.2025.102087

http://arxiv.org/abs/2409.04789v1
forester: A Tree-Based AutoML Tool in R
2024-09-07T10:39:10Z
The majority of automated machine learning (AutoML) solutions are developed in Python; however, a large percentage of data scientists are associated with the R language. Unfortunately, there are limited R solutions available.
Moreover, their high entry level means they are not accessible to everyone, because they require knowledge of machine learning (ML). To fill this gap, we present the forester package, which offers ease of use regardless of the user's proficiency in the area of machine learning.
The forester is an open-source AutoML package implemented in R designed for training high-quality tree-based models on tabular data. It fully supports binary and multiclass classification and regression, and partially supports survival analysis tasks. With just a few functions, the user is capable of detecting issues regarding the data quality, preparing the preprocessing pipeline, training and tuning tree-based models, evaluating the results, and creating the report for further analysis.
2024-09-07T10:39:10Z
Hubert Ruczyński, Anna Kozak

http://arxiv.org/abs/2409.03121v1
QHDOPT: A Software for Nonlinear Optimization with Quantum Hamiltonian Descent
2024-09-04T23:11:25Z
We develop an open-source, end-to-end software package (named QHDOPT), which can solve nonlinear optimization problems using the quantum Hamiltonian descent (QHD) algorithm. QHDOPT offers an accessible interface and automatically maps tasks to various supported quantum backends (i.e., quantum hardware machines). These features enable users, even those without prior knowledge or experience in quantum computing, to utilize the power of existing quantum devices for nonlinear and nonconvex optimization tasks. In its intermediate compilation layer, QHDOPT employs SimuQ, an efficient interface for Hamiltonian-oriented programming, to facilitate multiple algorithmic specifications and ensure compatible cross-hardware deployment. The detailed documentation of QHDOPT is available at https://github.com/jiaqileng/QHDOPT.
2024-09-04T23:11:25Z
23 pages, 7 figures. The full repository is available at https://github.com/jiaqileng/QHDOPT
Samuel Kushnir, Jiaqi Leng, Yuxiang Peng, Lei Fan, Xiaodi Wu

http://arxiv.org/abs/2311.02037v2
An Efficient Framework for Global Non-Convex Polynomial Optimization with Algebraic Constraints
2024-09-04T17:25:51Z
We present an efficient framework for solving algebraically-constrained global non-convex polynomial optimization problems over subsets of the hypercube.
We prove the existence of an equivalent nonlinear reformulation of such problems that possesses essentially no spurious local minima. Through numerical experiments on previously intractable global constrained polynomial optimization problems in high dimension, we show that polynomial scaling in dimension and degree is achievable when computing the optimal value and location.
2023-11-03T17:10:26Z
Mitchell Tong Harris, Pierre-David Letourneau, Dalton Jones, M. Harper Langston

http://arxiv.org/abs/2409.01712v1
Toward Capturing Genetic Epistasis From Multivariate Genome-Wide Association Studies Using Mixed-Precision Kernel Ridge Regression
2024-09-03T08:50:42Z
We exploit the widening margin in tensor-core performance between [FP64/FP32/FP16/INT8,FP64/FP32/FP16/FP8/INT8] on NVIDIA [Ampere,Hopper] GPUs to boost the performance of output accuracy-preserving mixed-precision computation of Genome-Wide Association Studies (GWAS) of 305K patients from the UK BioBank, the largest-ever GWAS cohort studied for genetic epistasis using a multivariate approach. Tile-centric adaptive-precision linear algebraic techniques motivated by reducing data motion gain enhanced significance with low-precision GPU arithmetic. At the core of Kernel Ridge Regression (KRR) techniques for GWAS lie compute-bound cubic-complexity matrix operations that inhibit scaling to aspirational dimensions of the population, genotypes, and phenotypes. We accelerate KRR matrix generation by redesigning the computation for Euclidean distances to engage INT8 tensor cores while exploiting symmetry. We accelerate solution of the regularized KRR systems by deploying a new four-precision Cholesky-based solver, which, at 1.805 mixed-precision ExaOp/s on a nearly full Alps system, outperforms the state-of-the-art CPU-only REGENIE GWAS software by five orders of magnitude.
2024-09-03T08:50:42Z
Hatem Ltaief, Rabab Alomairy, Qinglei Cao, Jie Ren, Lotfi Slim, Thorsten Kurth, Benedikt Dorschner, Salim Bougouffa, Rached Abdelkhalak, David E. Keyes
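
The idea in the last abstract of routing Euclidean distances through an integer matrix product can be sketched as follows. This is a CPU illustration in NumPy, not the authors' tensor-core code: it assumes genotypes encoded as small integers (0, 1, 2), emulates INT8 inputs with 32-bit accumulation, and uses the hypothetical function names `squared_distances_int8` and `gaussian_kernel`.

```python
import numpy as np


def squared_distances_int8(x):
    """Pairwise squared Euclidean distances of the rows of an int8 matrix.

    Uses ||xi - xj||^2 = ||xi||^2 + ||xj||^2 - 2 * xi.xj, so the only
    heavy operation is one integer Gram product.
    """
    x32 = x.astype(np.int32)    # int8 inputs, 32-bit accumulation (emulated)
    gram = x32 @ x32.T          # symmetric, so one triangle would suffice
    sq_norms = np.diag(gram)
    return sq_norms[:, None] + sq_norms[None, :] - 2 * gram


def gaussian_kernel(x, gamma):
    """Gaussian kernel matrix built from the integer distance matrix."""
    return np.exp(-gamma * squared_distances_int8(x).astype(np.float64))
```

Since every intermediate value is an integer, the distance matrix is exact and floating point enters only when the kernel function is applied. The sketch forms the full Gram matrix for brevity; an implementation that exploits symmetry would compute only one triangle.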