http://arxiv.org/abs/2406.08646v2 PETSc/TAO Developments for GPU-Based Early Exascale Systems 2024-11-14T19:49:25Z

The Portable Extensible Toolkit for Scientific Computation (PETSc) library provides scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization via the Toolkit for Advanced Optimization (TAO). PETSc is used in dozens of scientific fields and is an important building block for many simulation codes. During the U.S. Department of Energy's Exascale Computing Project, the PETSc team has made substantial efforts to enable efficient utilization of the massive fine-grain parallelism present within exascale compute nodes and to enable performance portability across exascale architectures. We recap some of the challenges that designers of numerical libraries face in such an endeavor, and then discuss the many developments we have made, which include the addition of new GPU backends, features supporting efficient on-device matrix assembly, better support for asynchronicity and GPU kernel concurrency, and new communication infrastructure. We evaluate the performance of these developments on some pre-exascale systems as well as the early exascale systems Frontier and Aurora, using compute kernel, communication layer, solver, and mini-application benchmark studies, and then close with a few observations drawn from our experiences on the tension between portable performance and other goals of numerical libraries.

2024-06-12T21:11:46Z 17 pages Richard Tran Mills Mark Adams Satish Balay Jed Brown Jacob Faibussowitsch Toby Isaac Matthew Knepley Todd Munson Hansol Suh Stefano Zampini Hong Zhang Junchao Zhang http://arxiv.org/abs/2411.06631v1 SequentialSamplingModels.jl: Simulating and Evaluating Cognitive Models of Response Times in Julia 2024-11-10T23:46:37Z

Sequential sampling models (SSMs) are a widely used framework describing decision-making as a stochastic, dynamic process of evidence accumulation. The popularity of SSMs across cognitive science has driven the development of various software packages that lower the barrier for simulating, estimating, and comparing existing SSMs. Here, we present a software tool, SequentialSamplingModels.jl (SSM.jl), designed to make SSM simulations more accessible to Julia users, and to integrate with the Julia ecosystem. We demonstrate the basic use of SSM.jl for simulation, plotting, and Bayesian inference.
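
As a rough illustration of the evidence-accumulation idea (written in Python rather than Julia, and not using the SSM.jl API), the sketch below simulates single trials of a two-boundary drift diffusion process with an Euler-Maruyama step. The function name `simulate_ddm`, its parameter names, and all default values are assumptions made for this example only.

```python
import math
import random


def simulate_ddm(drift, threshold, start_frac=0.5, noise=1.0,
                 dt=1e-3, non_decision=0.3, rng=None, max_time=10.0):
    """Simulate one trial of a two-boundary diffusion process (illustrative only).

    Evidence starts at start_frac * threshold and evolves until it leaves the
    open interval (0, threshold) or max_time is reached.
    Returns (choice, response_time); choice is 1 (upper bound), 0 (lower bound),
    or None if no bound was reached within max_time.
    """
    rng = rng or random.Random()
    x = start_frac * threshold
    t = 0.0
    while 0.0 < x < threshold and t < max_time:
        # Euler-Maruyama update: deterministic drift plus scaled Gaussian noise.
        x += drift * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
    if x >= threshold:
        choice = 1
    elif x <= 0.0:
        choice = 0
    else:
        choice = None  # timed out without a decision
    return choice, t + non_decision


if __name__ == "__main__":
    rng = random.Random(42)
    trials = [simulate_ddm(drift=2.0, threshold=1.0, rng=rng) for _ in range(5)]
    for choice, rt in trials:
        print(choice, round(rt, 3))
```

A positive drift biases the process toward the upper bound, so most simulated choices should be 1; the response-time distribution then follows from the first-passage times of the simulated paths.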

2024-11-10T23:46:37Z Proceedings of the JuliaCon Conferences, 7(78):186, 2025 Kianté Fernandez Dominique Makowski Christopher Fisher 10.21105/jcon.00186 http://arxiv.org/abs/2411.03851v1 On a probabilistic global optimizer derived from the Walker slice sampling 2024-11-06T11:40:43Z

This article presents a zeroth-order probabilistic global optimization algorithm -- SwiftNav -- for (not necessarily convex) functions over a compact domain. A discretization procedure is deployed on the compact domain, starting with a small step-size $h > 0$ and subsequently adaptively refining it in the course of a simulated annealing routine utilizing the Walker slice and the Gibbs sampler, in order to identify a set of global optimizers up to good precision. SwiftNav is parallelizable, which helps with scalability as the dimension of decision variables increases. Several numerical experiments are included here to demonstrate the effectiveness and accuracy of SwiftNav in high-dimensional benchmark optimization problems.

2024-11-06T11:40:43Z 18 pages, 16 figures Aditya Gupta Souvik Das Debasish Chatterjee http://arxiv.org/abs/2411.03501v1 The Python LevelSet Toolbox (LevelSetPy) 2024-11-05T20:31:23Z

This paper describes open-source scientific contributions in Python surrounding the numerical solution of hyperbolic Hamilton-Jacobi (HJ) partial differential equations, namely: their implicit representation on co-dimension-one surfaces; dynamics evolution with level sets; spatial derivatives; total variation diminishing Runge-Kutta integration schemes; and their applications to the theory of reachable sets. These methods are increasingly finding applications in research domains such as reinforcement learning, robotics, control engineering, and automation. We describe the library components, illustrate usage with an example, and provide comparisons with existing implementations. This GPU-accelerated package allows for easy portability to many modern libraries for the numerical analysis of the HJ equations. We also provide a CPU implementation in Python that is significantly faster than existing alternatives.

2024-11-05T20:31:23Z The 63rd IEEE Conference on Decision and Control, Milan, 2024 Lekan Molu http://arxiv.org/abs/2412.16161v1 Antiassociative algebra in R: introducing the evitaicossa package 2024-10-31T16:31:26Z

In this short article I introduce the evitaicossa package which provides functionality for antiassociative algebras in the R programming language; it is available on CRAN at https://CRAN.R-project.org/package=evitaicossa.

2024-10-31T16:31:26Z 6 pages Robin K. S. Hankin http://arxiv.org/abs/2401.05868v2 Efficient N-to-M Checkpointing Algorithm for Finite Element Simulations 2024-10-30T23:57:13Z

In this work, we introduce a new algorithm for N-to-M checkpointing in finite element simulations. This new algorithm allows efficient saving/loading of functions representing physical quantities associated with the mesh representing the physical domain. Specifically, the algorithm allows for using different numbers of parallel processes for saving and loading, allowing for restarting and post-processing on the process count appropriate to the given phase of the simulation and other conditions. For demonstration, we implemented this algorithm in PETSc, the Portable, Extensible Toolkit for Scientific Computation, and added a convenient high-level interface into Firedrake, a system for solving partial differential equations using finite element methods. We evaluated our new implementation by saving and loading data involving 8.2 billion finite element degrees of freedom using 8,192 parallel processes on ARCHER2, the UK National Supercomputing Service.
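
The core idea of saving on N processes and loading on M can be sketched in a few lines of Python if one assumes that every degree of freedom carries a global index that is independent of the process count. This is only a toy model: the actual algorithm works on distributed meshes inside PETSc and Firedrake, and the helper names below (`block_partition`, `save_checkpoint`, `load_checkpoint`) are hypothetical and not part of either library.

```python
def block_partition(n_global, n_procs):
    """Split range(n_global) into contiguous [start, end) ranges, one per rank."""
    base, extra = divmod(n_global, n_procs)
    ranges, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def save_checkpoint(local_chunks, ranges):
    """Each 'rank' writes its local values keyed by global index."""
    checkpoint = {}
    for chunk, (start, _end) in zip(local_chunks, ranges):
        for offset, value in enumerate(chunk):
            checkpoint[start + offset] = value
    return checkpoint


def load_checkpoint(checkpoint, ranges):
    """Each 'rank' reads only the global indices it owns under the new partition."""
    return [[checkpoint[g] for g in range(start, end)] for start, end in ranges]


if __name__ == "__main__":
    values = [float(i) ** 2 for i in range(10)]
    save_ranges = block_partition(len(values), 3)   # N = 3 writers
    chunks = [values[s:e] for s, e in save_ranges]
    ckpt = save_checkpoint(chunks, save_ranges)
    loaded = load_checkpoint(ckpt, block_partition(len(values), 4))  # M = 4 readers
    print(loaded)
```

Because the stored data is addressed by global index rather than by writer rank, the reader partition can differ freely from the writer partition.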

2024-01-11T12:20:50Z author accepted manuscript SIAM SISC 46(6):B830-B859 (2024) David A. Ham Vaclav Hapla Matthew G. Knepley Lawrence Mitchell Koki Sagiyama 10.1137/23M1613724 http://arxiv.org/abs/2410.22652v1 Development of a Python-Based Software for Calculating the Jones Polynomial: Insights into the Behavior of Polymers and Biopolymers 2024-10-30T02:41:42Z

This thesis details a Python-based software designed to calculate the Jones polynomial, a vital mathematical tool from Knot Theory used for characterizing the topological and geometrical complexity of curves in $ \mathbb{R}^3 $, which is essential in understanding physical systems of filaments, including the behavior of polymers and biopolymers. The Jones polynomial serves as a topological invariant capable of distinguishing between different knot structures. This capability is fundamental to characterizing the architecture of molecular chains, such as proteins and DNA. Traditional computational methods for deriving the Jones polynomial have been limited by closure-schemes and high execution costs, which can be impractical for complex structures like those that appear in real life. This software implements methods that significantly reduce calculation times, allowing for more efficient and practical applications in the study of biological polymers. It utilizes a divide-and-conquer approach combined with parallel computing and applies recursive Reidemeister moves to optimize the computation, transitioning from an exponential to a near-linear runtime for specific configurations. This thesis provides an overview of the software's functions, detailed performance evaluations using protein structures as test cases, and a discussion of the implications for future research and potential algorithmic improvements.

2024-10-30T02:41:42Z Caleb Musfeldt http://arxiv.org/abs/2411.00819v1 A Bellman-Ford algorithm for the path-length-weighted distance in graphs 2024-10-28T15:31:34Z

Consider a finite directed graph without cycles in which the arrows are weighted. We present an algorithm for the computation of a new distance, called path-length-weighted distance, which has proven useful for graph analysis in the context of fraud detection. The idea is that the new distance explicitly takes into account the size of the paths in the calculations. Thus, although our algorithm is based on arguments similar to those at work for the Bellman-Ford and Dijkstra methods, it is in fact essentially different. We lay out the appropriate framework for its computation, showing the constraints and requirements for its use, along with some illustrative examples.
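
For orientation, the classical Bellman-Ford relaxation that the abstract contrasts against can be sketched as follows. This is the textbook shortest-path method, not the path-length-weighted distance of the paper, whose definition is not given here. The hop counter is added purely as bookkeeping to show where path size could enter; the function name and edge format are assumptions of this example.

```python
import math


def bellman_ford(n_vertices, edges, source):
    """Classical Bellman-Ford on a weighted directed graph.

    edges is a list of (u, v, weight) arcs. Returns (dist, hops), where
    hops[v] is the number of arcs on the shortest path found to v.
    Assumes no negative cycles (true for any acyclic graph).
    """
    dist = [math.inf] * n_vertices
    hops = [0] * n_vertices
    dist[source] = 0.0
    for _ in range(n_vertices - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                # Relax arc (u, v): a shorter route to v goes through u.
                dist[v] = dist[u] + w
                hops[v] = hops[u] + 1
                changed = True
        if not changed:
            break  # early exit once no arc can be relaxed
    return dist, hops


if __name__ == "__main__":
    arcs = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 5.0), (2, 3, 1.0)]
    print(bellman_ford(4, arcs, source=0))
```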

2024-10-28T15:31:34Z 20 pages, 10 figures R. Arnau J. M. Calabuig L. M. García Raffi E. A. Sánchez Pérez S. Sanjuan 10.3390/math12162590 http://arxiv.org/abs/2405.07819v2 Local Adjoints for Simultaneous Preaccumulations with Shared Inputs 2024-10-27T19:11:54Z

In shared-memory parallel automatic differentiation, inputs that are shared among simultaneous thread-local preaccumulations lead to data races if Jacobians are accumulated with a single, shared vector of adjoint variables. In this work, we discuss the benefits and tradeoffs of re-enabling such preaccumulations by a transition to suitable local adjoints. We propose different vector- and map-based approaches for storing local adjoint variables and analyze them with respect to memory consumption, memory allocation, and adjoint variable access times in the context of simultaneous preaccumulations in multiple threads. We implement the approaches in CoDiPack and benchmark them in parallel discrete adjoint computations in the multiphysics simulation suite SU2.
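
A minimal Python sketch of the map-based variant, under simplifying assumptions: the recorded tape is a list of `(lhs_id, [(rhs_id, partial), ...])` statements, and the reverse sweep of a preaccumulation stores adjoints in a dictionary private to the sweep instead of a shared adjoint vector. The tape layout and the name `preaccumulate` are illustrative assumptions and do not reflect CoDiPack's data structures.

```python
from collections import defaultdict


def preaccumulate(tape, inputs, outputs):
    """Accumulate the local Jacobian d(outputs)/d(inputs) from a small tape.

    Adjoints live in a dictionary created inside this call, so concurrent
    preaccumulations that share inputs never write to common adjoint storage.
    """
    jacobian = {}
    for out in outputs:
        local_adj = defaultdict(float)  # local adjoints for this sweep only
        local_adj[out] = 1.0            # seed the output
        for lhs, partials in reversed(tape):
            adj = local_adj.pop(lhs, 0.0)
            if adj == 0.0:
                continue
            for rhs, partial in partials:
                local_adj[rhs] += adj * partial
        for inp in inputs:
            jacobian[(out, inp)] = local_adj.get(inp, 0.0)
    return jacobian


if __name__ == "__main__":
    # Identifiers 0 and 1 are inputs x0 = 3, x1 = 4.
    # v2 = x0 * x1, v3 = v2 + x0, recorded with their partial derivatives.
    tape = [
        (2, [(0, 4.0), (1, 3.0)]),
        (3, [(2, 1.0), (0, 1.0)]),
    ]
    print(preaccumulate(tape, inputs=[0, 1], outputs=[3]))
```

The trade-off discussed in the abstract shows up even here: the dictionary avoids races and only stores touched identifiers, at the price of hashing on every adjoint access.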

2024-05-13T15:01:18Z 12 pages, 5 figures. Updated and extended all parts of the paper Johannes Blühdorn Nicolas R. Gauger 10.1137/1.9781611979039.13 http://arxiv.org/abs/2312.08006v2 Performance of linear solvers in tensor-train format on current multicore architectures 2024-10-24T14:02:01Z

Tensor networks are a class of algorithms aimed at reducing the computational complexity of high-dimensional problems. They are used in an increasing number of applications, from quantum simulations to machine learning. Exploiting data parallelism in these algorithms is key to using modern hardware. However, there are several ways to map required tensor operations onto linear algebra routines ("building blocks"). Optimizing this mapping impacts the numerical behavior, so computational and numerical aspects must be considered hand-in-hand. In this paper we discuss the performance of solvers for low-rank linear systems in the tensor-train format (also known as matrix-product states). We consider three popular algorithms: TT-GMRES, MALS, and AMEn. We illustrate their computational complexity based on the example of discretizing a simple high-dimensional PDE in, e.g., $50^{10}$ grid points. This shows that the projection to smaller sub-problems for MALS and AMEn reduces the number of floating-point operations by orders of magnitude. We suggest optimizations regarding orthogonalization steps, singular value decompositions, and tensor contractions. In addition, we propose a generic preconditioner based on a TT-rank-1 approximation of the linear operator. Overall, we obtain roughly a 5x speedup over the reference algorithm for the fastest method (AMEn) on a current multicore CPU.
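
To make the size of the $50^{10}$ example concrete, the following Python sketch compares the number of entries in the full tensor with the storage of a tensor-train representation, and evaluates a single entry from TT cores. The uniform TT rank of 10 is an assumption chosen for illustration, and the function names are hypothetical.

```python
def tt_storage(n, d, r):
    """Stored entries of a TT tensor with d modes of size n and uniform rank r.

    The first and last cores have n * r entries each; the d - 2 interior
    cores have r * n * r entries each.
    """
    if d == 1:
        return n
    return 2 * n * r + (d - 2) * n * r * r


def tt_entry(cores, index):
    """Evaluate one tensor entry as a product of core slices.

    cores[k][i] is the (r_{k-1} x r_k) matrix of core k for mode index i,
    stored as a list of rows, with r_0 = r_d = 1.
    """
    row = [1.0]
    for core, i in zip(cores, index):
        mat = core[i]
        row = [sum(row[a] * mat[a][b] for a in range(len(row)))
               for b in range(len(mat[0]))]
    return row[0]


if __name__ == "__main__":
    print("full tensor entries:", 50 ** 10)
    print("TT entries (assumed rank 10):", tt_storage(50, 10, 10))
    # Rank-1 example: the outer product of x = (1, 2) and y = (3, 4, 5).
    cores = [[[[1.0]], [[2.0]]], [[[3.0]], [[4.0]], [[5.0]]]]
    print(tt_entry(cores, (1, 2)))
```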

2023-12-13T09:28:09Z 28 pages, 8 figures, submitted to IJHPCA Melven Röhrig-Zöllner Manuel Joey Becklas Jonas Thies Achim Basermann http://arxiv.org/abs/2410.15963v1 An Efficient Local Optimizer-Tracking Solver for Differential-Algebraic Equations with Optimization Criteria 2024-10-21T12:48:12Z

A sequential solver for differential-algebraic equations with embedded optimization criteria (DAEOs) was developed to take advantage of the theoretical work done by Deussen et al. Solvers of this type separate the optimization problem from the differential equation and solve each individually. The new solver relies on the reduction of a DAEO to a sequence of differential inclusions separated by jump events. These jump events occur when the global solution to the optimization problem jumps to a new value. Without explicit treatment, these events reduce the order of convergence of the integration step to one. The solver implements a "local optimizer tracking" procedure to detect and correct these jump events. Because local optimizer tracking is much less expensive than running a deterministic global optimizer at every time step, it preserves the order of convergence of the integrator component without sacrificing performance. The newly developed solver produces correct solutions to DAEOs and runs much faster than sequential DAEO solvers that rely only on global optimization.

2024-10-21T12:48:12Z 8 pages, 5 figures Alexander Fleming Jens Deussen Uwe Naumann http://arxiv.org/abs/2410.12942v1 modOpt: A modular development environment and library for optimization algorithms 2024-10-16T18:30:23Z

Recent advances in computing hardware and modeling software have given rise to new applications for numerical optimization. These new applications occasionally uncover bottlenecks in existing optimization algorithms and necessitate further specialization of the algorithms. However, such specialization requires expert knowledge of the underlying mathematical theory and the software implementation of existing algorithms. To address this challenge, we present modOpt, an open-source software framework that facilitates the construction of optimization algorithms from modules. The modular environment provided by modOpt enables developers to tailor an existing algorithm for a new application by only altering the relevant modules. modOpt is designed as a platform to support students and beginner developers in quickly learning and developing their own algorithms. With that aim, the entirety of the framework is written in Python, and it is well-documented, well-tested, and hosted open-source on GitHub. Several additional features are embedded into the framework to assist both beginner and advanced developers. In addition to providing stock modules, the framework also includes fully transparent implementations of pedagogical optimization algorithms in Python. To facilitate testing and benchmarking of new algorithms, the framework features built-in visualization and recording capabilities, interfaces to modeling frameworks such as OpenMDAO and CSDL, interfaces to general-purpose optimization algorithms such as SNOPT and SLSQP, an interface to the CUTEst test problem set, etc. In this paper, we present the underlying software architecture of modOpt, review its various features, discuss several educational and performance-oriented algorithms within modOpt, and present numerical studies illustrating its unique benefits.

2024-10-16T18:30:23Z 37 pages with 13 figures. For associated code, see https://github.com/LSDOlab/modopt Anugrah Jo Joshy John T. Hwang http://arxiv.org/abs/2410.12614v1 Mixed-precision finite element kernels and assembly: Rounding error analysis and hardware acceleration 2024-10-16T14:32:10Z

In this paper we develop the first fine-grained rounding error analysis of finite element (FE) cell kernels and assembly. The theory includes mixed-precision implementations and accounts for hardware acceleration via matrix multiplication units, thus providing theoretical guidance for designing reduced- and mixed-precision FE algorithms on CPUs and GPUs. Guided by this analysis, we introduce hardware-accelerated mixed-precision implementation strategies which are provably robust to low-precision computations. Indeed, these algorithms are accurate to the lower-precision unit roundoff with an error constant that is independent of the conditioning of FE basis function evaluations, the ill-posedness of the cell, the polynomial degree, and the number of quadrature nodes. Consequently, we present the first AMX-accelerated FE kernel implementations on Intel Sapphire Rapids CPUs. Numerical experiments demonstrate that the proposed mixed- (single/half-) precision algorithms are up to 60 times faster than their double precision equivalent while being orders of magnitude more accurate than their fully half-precision counterparts.
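
The gap between mixed-precision and fully half-precision arithmetic can be seen in a small software emulation. The Python sketch below rounds values to IEEE half and single precision with the standard `struct` module and compares a dot product accumulated in half precision with one that multiplies half-precision inputs but accumulates in single precision, which is the pattern matrix units typically use. It is an emulation under these assumptions only; it is not an FE kernel and involves no AMX instructions, and the function names are hypothetical.

```python
import struct


def to_f16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_f32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_half(a, b):
    """Inputs, products, and accumulator all rounded to half precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f16(acc + to_f16(to_f16(x) * to_f16(y)))
    return acc


def dot_mixed(a, b):
    """Half-precision inputs, accumulator rounded to single precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f16(x) * to_f16(y))
    return acc


def dot_reference(a, b):
    """Same half-precision inputs, accumulated in double precision."""
    return sum(to_f16(x) * to_f16(y) for x, y in zip(a, b))


if __name__ == "__main__":
    a = [0.1] * 10000
    b = [0.1] * 10000
    ref = dot_reference(a, b)
    print("error, all half:", abs(dot_half(a, b) - ref))
    print("error, mixed   :", abs(dot_mixed(a, b) - ref))
```

In this example the all-half sum stops growing once each term is smaller than half the spacing between adjacent half-precision numbers near the running sum, while the single-precision accumulator keeps absorbing the small products.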

2024-10-16T14:32:10Z Keywords: Mixed precision, finite element method, finite element kernel and assembly, rounding error analysis, hardware acceleration, matrix units, Intel AMX M. Croci G. N. Wells http://arxiv.org/abs/2407.15973v3 Mixed Precision Block-Jacobi Preconditioner: Algorithms, Performance Evaluation and Feature Analysis 2024-10-15T13:17:13Z

In this paper, we propose two mixed precision algorithms for the Block-Jacobi preconditioner (BJAC): a fixed low precision strategy and an adaptive precision strategy. We evaluate the performance improvement of the proposed mixed precision BJAC preconditioners combined with the preconditioned conjugate gradient (PCG) algorithm using problems including diffusion equations and radiation hydrodynamics equations. Numerical results show that, compared to the uniform high precision PCG algorithm, the mixed precision preconditioners can achieve speedups of 1.3x to 1.8x without sacrificing accuracy. Furthermore, we observe the phenomenon of convergence delay in some test cases for the mixed precision preconditioners, and analyse the matrix features associated with the convergence delay behavior.
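
A fixed low-precision strategy can be imitated in plain Python: the inverted diagonal blocks are stored rounded to single precision, while the conjugate gradient iteration itself runs in double precision. The sketch below is a toy under stated assumptions (2x2 blocks, an even matrix dimension, a dense 1D Laplacian test matrix, float32 emulated via `struct`) and is not the authors' implementation; all names are hypothetical.

```python
import math
import struct


def to_f32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def block_inverses(A, low_precision):
    """Invert the 2x2 diagonal blocks of A; optionally store them in float32."""
    blocks = []
    for s in range(0, len(A), 2):
        a, b, c, d = A[s][s], A[s][s + 1], A[s + 1][s], A[s + 1][s + 1]
        det = a * d - b * c
        inv = [[d / det, -b / det], [-c / det, a / det]]
        if low_precision:
            inv = [[to_f32(x) for x in row] for row in inv]
        blocks.append(inv)
    return blocks


def apply_preconditioner(blocks, r):
    """z = M^{-1} r, where M is the block diagonal of A."""
    z = []
    for k, inv in enumerate(blocks):
        r0, r1 = r[2 * k], r[2 * k + 1]
        z.append(inv[0][0] * r0 + inv[0][1] * r1)
        z.append(inv[1][0] * r0 + inv[1][1] * r1)
    return z


def pcg(A, b, blocks, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradients; returns (solution, iterations)."""
    x = [0.0] * len(b)
    r = list(b)
    z = apply_preconditioner(blocks, r)
    p = list(z)
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b))
    for iteration in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, iteration
        z = apply_preconditioner(blocks, r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


def laplacian(n):
    """Dense tridiagonal (-1, 2, -1) matrix, symmetric positive definite."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    A, b = laplacian(8), [1.0] * 8
    for low in (False, True):
        x, its = pcg(A, b, block_inverses(A, low))
        print("float32 blocks" if low else "float64 blocks", "iterations:", its)
```

Comparing the iteration counts of the two runs is a simple way to look for the convergence delay mentioned in the abstract, although a problem this small and well conditioned may show none.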

2024-07-22T18:35:05Z Ningxi Tian Silu Huang Xiaowen Xu http://arxiv.org/abs/2404.10143v2 Computing with Hypergeometric-Type Terms 2024-10-14T23:01:11Z

Take a multiplicative monoid of sequences in which the multiplication is given by Hadamard product. The set of linear combinations of interleaving monoid elements then yields a ring. For hypergeometric sequences, the resulting ring is a subring of the ring of holonomic sequences. We present two algorithms in this setting: one for computing holonomic recurrence equations from hypergeometric-type normal forms and the other for finding products of hypergeometric-type terms. These are newly implemented commands in our Maple package HyperTypeSeq, available at https://github.com/T3gu1a/HyperTypeSeq, which we also describe.

2024-04-15T21:24:18Z Mainly correcting a miscopy of the explicit formula that the code outputs for the sequence at https://oeis.org/A212579 (see equation (3)). This is the version considered for ISSAC'24 software presentation Bertrand Teguia Tabuguia