http://arxiv.org/abs/2502.16015v1
On the computation of the cumulative distribution function of the Normal Inverse Gaussian distribution
2025-02-22T00:12:53Z
In this paper, we obtain various series and asymptotic expansions involving the modified Bessel function of the second kind for the normal inverse Gaussian cumulative distribution function. The new expansions accelerate computations, complementing the numerical integration methods implemented in statistical software packages. We also provide a detailed description of the algorithm and its corresponding implementation in C++. The performance and accuracy of the algorithm are extensively tested and benchmarked with open-source implementations, offering superior accuracy and speed-ups of a factor from 5 to 60.
2025-02-22T00:12:53Z
Guillermo Navas-Palencia

http://arxiv.org/abs/2502.13090v1
tn4ml: Tensor Network Training and Customization for Machine Learning
2025-02-18T17:57:29Z
Tensor Networks have emerged as a prominent alternative to neural networks for addressing Machine Learning challenges in foundational sciences, paving the way for their applications to real-life problems. This paper introduces tn4ml, a novel library designed to seamlessly integrate Tensor Networks into optimization pipelines for Machine Learning tasks. Inspired by existing Machine Learning frameworks, the library offers a user-friendly structure with modules for data embedding, objective function definition, and model training using diverse optimization strategies. We demonstrate its versatility through two examples: supervised learning on tabular data and unsupervised learning on an image dataset.
Additionally, we analyze how customizing the parts of the Machine Learning pipeline for Tensor Networks influences performance metrics.
2025-02-18T17:57:29Z
Ema Puljak, Sergio Sanchez-Ramirez, Sergi Masot-Llima, Jofre Vallès-Muns, Artur Garcia-Saez, Maurizio Pierini

http://arxiv.org/abs/2502.10831v1
A Novel SIMD-Optimized Implementation for Fast and Memory-Efficient Trigonometric Computation
2025-02-15T15:18:48Z
This paper proposes a novel set of trigonometric implementations which are 5x faster than the inbuilt C++ functions. The proposed implementation is also highly memory efficient, requiring no precomputations of any kind. Benchmark comparisons are done versus inbuilt functions and an optimized Taylor implementation. Further, device usage estimates are also obtained, showing significant hardware usage reduction compared to inbuilt functions. This improvement could be particularly useful for low-end FPGAs or other resource-constrained devices.
2025-02-15T15:18:48Z
Nikhil Dev Goyal, Parth Arora

http://arxiv.org/abs/2502.08382v1
Assembly of FETI dual operator using CUDA
2025-02-12T13:18:19Z
FETI is a numerical method used to solve engineering problems. It builds on the ideas of domain decomposition, which makes it highly scalable and capable of efficiently utilizing whole supercomputers. One of the most time-consuming parts of the FETI solver is the application of the dual operator F in every iteration of the solver.
It is traditionally performed on the CPU using an implicit approach of applying the individual sparse matrices that form F right-to-left. Another approach is to apply the dual operator explicitly, which primarily involves a simple dense matrix-vector multiplication and can be efficiently performed on the GPU. However, this requires additional preprocessing on the CPU where the dense matrix is assembled, which makes the explicit approach beneficial only after hundreds of iterations are performed.
In this paper, we use the GPU to accelerate the assembly process as well. This significantly shortens the preprocessing time, thus decreasing the number of solver iterations needed to make the explicit approach beneficial.
With a proper configuration, we only need a few tens of iterations to achieve speedup relative to the implicit CPU approach. Compared to the CPU-only explicit approach, we achieved up to 10x speedup for the preprocessing and 25x for the application.
2025-02-12T13:18:19Z
10 pages, 12 figures, submitted for review to PDSEC 2025 workshop, part of IPDPS 2025 conference
2025 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2025, pp. 365-374
Jakub Homola (IT4Innovations, VSB - Technical University of Ostrava), Radim Vavřík (IT4Innovations, VSB - Technical University of Ostrava), Ondřej Meca (IT4Innovations, VSB - Technical University of Ostrava), Tomáš Brzobohatý (IT4Innovations, VSB - Technical University of Ostrava), Lubomír Říha (IT4Innovations, VSB - Technical University of Ostrava)
doi:10.1109/IPDPSW66978.2025.00062

http://arxiv.org/abs/2502.08677v1
RMCDA: The comprehensive R library for applying multi-criteria decision analysis methods
2025-02-12T05:43:30Z
Multi-Criteria Decision Making (MCDM) is a branch of operations research used in a variety of domains from health care to engineering to facilitate decision-making among multiple options based on specific criteria. Several R packages have been developed for the application of traditional MCDM approaches. However, as the discipline has advanced, many new approaches have emerged, necessitating the development of innovative and comprehensive tools to enhance the accessibility of these methodologies. Here, we introduce RMCDA, a comprehensive and universal R package that offers access to a variety of established MCDM approaches (e.g., AHP, TOPSIS, PROMETHEE, and VIKOR), along with newer techniques such as Stratified MCDM (SMCDM) and the Stratified Best-Worst Method (SBWM).
Our open source software intends to broaden the practical use of these methods through supplementary visualization tools and straightforward installation.
2025-02-12T05:43:30Z
18 pages, 9 figures
Annice Najafi, Shokoufeh Mirzaei

http://arxiv.org/abs/2502.07935v1
$\texttt{PrecisionLauricella}$: package for numerical computation of Lauricella functions depending on a parameter
2025-02-11T20:28:20Z
We introduce the $\texttt{PrecisionLauricella}$ package, a computational tool developed in Wolfram Mathematica for high-precision numerical evaluations of Lauricella functions with indices linearly dependent on a parameter, $\varepsilon$. The package leverages a method based on analytical continuation via Frobenius generalized power series, providing an efficient and accurate alternative to conventional approaches relying on multi-dimensional series expansions or Mellin--Barnes representations. This one-dimensional approach is particularly advantageous for high-precision calculations and facilitates further optimization through $\varepsilon$-dependent reconstruction from evaluations at specific numerical values, enabling efficient parallelization. The underlying mathematical framework for this method has been detailed in our previous work, while the current paper focuses on the design, implementation, and practical applications of the $\texttt{PrecisionLauricella}$ package.
2025-02-11T20:28:20Z
21 pages, 6 figures. arXiv admin note: text overlap with arXiv:2502.03276
M. A. Bezuglov, B. A. Kniehl, A. I. Onishchenko, O. L. Veretin

http://arxiv.org/abs/2502.07117v1
Choroidal image analysis for OCT image sequences with applications in systemic health
2025-02-10T23:14:09Z
The choroid, a highly vascular layer behind the retina, is an extension of the central nervous system and has parallels with the renal cortex, with blood flow far exceeding that of the brain and kidney. Thus, there has been growing interest in choroidal blood flow reflecting physiological status of systemic disease.
Optical coherence tomography (OCT) enables high-resolution imaging of the choroid, but conventional analysis methods remain manual or semi-automatic, limiting reproducibility, standardisation and clinical utility. In this thesis, I develop several new methods to analyse the choroid in OCT image sequences, with each successive method improving on its predecessors. I first develop two semi-automatic approaches for choroid region (Gaussian Process Edge Tracing, GPET) and vessel (Multi-scale Median Cut Quantisation, MMCQ) analysis, which improve on manual approaches but remain user-dependent. To address this, I introduce DeepGPET, a deep learning-based region segmentation method which improves on execution time, reproducibility, and end-user accessibility, but lacks choroid vessel analysis and automatic feature measurement. Improving on this, I develop Choroidalyzer, a deep learning-based pipeline to segment the choroidal space and vessels and generate fully automatic, clinically meaningful and reproducible choroidal features. I provide rigorous evaluation of these four approaches and consider their potential clinical value in three applications in systemic health: OCTANE, assessing choroidal changes in renal transplant recipients and donors; PREVENT, exploring choroidal associations with Alzheimer's risk factors at mid-life; D-RISCii, assessing choroidal variation and feasibility of OCT in critical care. In short, this thesis contributes many open-source tools for standardised choroidal measurement and highlights the choroid's potential as a biomarker in systemic health.
2025-02-10T23:14:09Z
PhD thesis toward a doctorate degree at the University of Edinburgh. PhD funded by the Medical Research Council (grant MR/N013166/1). Reviewed and examined by Dr. Roly Megaw (internal) and Prof. Pearse Keane (external) in December 2024 and ratified in the same month by the university.
Official record found here: https://era.ed.ac.uk/handle/1842/42956
Jamie Burke
doi:10.7488/era/5507

http://arxiv.org/abs/2502.03402v2
Tensor Evolution: A Framework for Fast Evaluation of Tensor Computations using Recurrences
2025-02-07T16:07:14Z
This paper introduces a new mathematical framework for analysis and optimization of tensor expressions within an enclosing loop. Tensors are multi-dimensional arrays of values. They are common in high performance computing (HPC) and machine learning domains. Our framework extends Scalar Evolution - an important optimization pass implemented in both LLVM and GCC - to tensors. Scalar Evolution (SCEV) relies on the theory of 'Chain of Recurrences' for its mathematical underpinnings. We use the same theory for Tensor Evolution (TeV). While some concepts from SCEV map easily to TeV (e.g., element-wise operations), tensors introduce new operations such as concatenation, slicing, broadcast, reduction, and reshape, which have no equivalent in scalars and SCEV. Not all computations are amenable to TeV analysis, but it can play a part in the optimization and analysis parts of ML and HPC compilers. Also, for many mathematical/compiler ideas, applications may go beyond what was initially envisioned, once others build on them and take them further. We hope for a similar trajectory for the tensor-evolution concept.
2025-02-05T17:43:17Z
Javed Absar, Samarth Narang, Muthu Baskaran

http://arxiv.org/abs/2502.03439v1
Linearized Optimal Transport pyLOT Library: A Toolkit for Machine Learning on Point Clouds
2025-02-05T18:34:38Z
The pyLOT library offers a Python implementation of linearized optimal transport (LOT) techniques and methods to use in downstream tasks. The pipeline embeds probability distributions into a Hilbert space via the Optimal Transport maps from a fixed reference distribution, and this linearization allows downstream tasks to be completed using off-the-shelf (linear) machine learning algorithms.
We provide a case study of performing ML on 3D scans of lemur teeth, where the original questions of classification, clustering, dimension reduction, and data generation reduce to simple linear operations performed on the LOT embedded representations.
2025-02-05T18:34:38Z
Jun Linwu, Varun Khurana, Nicholas Karris, Alexander Cloninger

http://arxiv.org/abs/2404.16730v2
Finch: Sparse and Structured Tensor Programming with Control Flow
2025-01-28T20:16:11Z
From FORTRAN to NumPy, tensors have revolutionized how we express computation. However, tensors in these, and almost all prominent systems, can only handle dense rectilinear integer grids. Real world tensors often contain underlying structure, such as sparsity, runs of repeated values, or symmetry. Support for structured data is fragmented and incomplete. Existing frameworks limit the tensor structures and program control flow they support to better simplify the problem.
In this work, we propose a new programming language, Finch, which supports both flexible control flow and diverse data structures. Finch facilitates a programming model which resolves the challenges of computing over structured tensors by combining control flow and data structures into a common representation where they can be co-optimized. Finch automatically specializes control flow to data so that performance engineers can focus on experimenting with many algorithms. Finch supports a familiar programming language of loops, statements, ifs, breaks, etc., over a wide variety of tensor structures, such as sparsity, run-length-encoding, symmetry, triangles, padding, or blocks. Finch reliably utilizes the key properties of structure, such as structural zeros, repeated values, or clustered non-zeros. We show that this leads to dramatic speedups in operations such as SpMV and SpGEMM, image processing, and graph analytics.
2024-04-25T16:41:12Z
Willow Ahrens, Teodoro Fields Collin, Radha Patel, Kyle Deeds, Changwan Hong, Saman Amarasinghe

http://arxiv.org/abs/2501.15856v1
rcpptimer: Rcpp Tic-Toc Timer with OpenMP Support
2025-01-27T08:32:41Z
Efficient code writing is both a critical and challenging task, especially with the growing demand for computationally intensive algorithms in statistical and machine-learning applications. Despite the availability of significant computational power today, the need for optimized algorithm implementations remains crucial. Many R users rely on Rcpp to write performant code in C++, but writing and benchmarking C++ code presents its own difficulties. While R's benchmarking tools are insufficient for measuring the execution times of C++ code segments, C++'s native profiling tools often come with a steep learning curve. The rcpptimer package bridges this gap by offering a simple and efficient solution for timing C++ code within the Rcpp ecosystem.
This novel package introduces a user-friendly tic-toc class that supports overlapping and nested timers and OpenMP parallelism, providing nanosecond-level time resolution. Results, including summary statistics, are seamlessly passed back to R without requiring users to write any C++ code. This paper contextualizes the rcpptimer package within the broader ecosystem of R and C++ profiling tools, explains the motivation behind its development, and offers a comprehensive overview of its implementation. Supplementary to this paper, we provide multiple vignettes that thoroughly explain this package's usage.
2025-01-27T08:32:41Z
Jonathan Berrisch

http://arxiv.org/abs/2406.09266v2
SySTeC: A Symmetric Sparse Tensor Compiler
2025-01-23T19:00:18Z
Symmetric and sparse tensors arise naturally in many domains including linear algebra, statistics, physics, chemistry, and graph theory. Symmetric tensors are equal to their transposes, so in the $n$-dimensional case we can save up to a factor of $n!$ by avoiding redundant operations. Sparse tensors, on the other hand, are mostly zero, and we can save asymptotically by processing only nonzeros. Unfortunately, specializing for both symmetry and sparsity at the same time is uniquely challenging. Optimizing for symmetry requires consideration of $n!$ transpositions of a triangular kernel, which can be complex and error prone. Considering multiple transposed iteration orders and triangular loop bounds also complicates iteration through intricate sparse tensor formats. Additionally, since each combination of symmetry and sparse tensor formats requires a specialized implementation, this leads to a combinatorial number of cases. A compiler is needed, but existing compilers cannot take advantage of both symmetry and sparsity within the same kernel. In this paper, we describe the first compiler which can automatically generate symmetry-aware code for sparse or structured tensor kernels.
We introduce a taxonomy for symmetry in tensor kernels, and show how to target each kind of symmetry. Our implementation demonstrates significant speedups ranging from 1.36x for SSYMV to 30.4x for a 5-dimensional MTTKRP over the non-symmetric state of the art.
2024-06-13T16:06:29Z
Radha Patel, Willow Ahrens, Saman Amarasinghe

http://arxiv.org/abs/2406.14933v2
ASTERIX: Module for modelling the water flow on vegetated hillslopes
2025-01-23T10:04:29Z
The paper presents open-source software for numerical integration of an extended Saint-Venant model used as a mathematical tool to simulate the water flow from laboratory up to large-scale spatial domains applying physically-based principles of fluid mechanics. Many in-situ observations have shown that vegetation plays a key role in controlling the hydrological flux at catchment scale. In case of heavy rains, the infiltration and interception processes cease quickly, the remaining rainfall gives rise to the Hortonian overland flow, and the flash flood is thus initiated. In this context, we also address the following problem: how do the gradient of soil surface and the vegetation influence the water dynamics in the Hortonian flow? The mathematical model and ASTERIX were kept as simple as possible in order to be accessible to a wide range of stakeholders interested in understanding the complex processes behind the water flow on hillslopes covered by plants.
2024-06-21T07:41:46Z
Environmental Modelling & Software 186 (2025)
Stelian Ion, Dorin Marinescu, Stefan-Gicu Cruceanu
doi:10.1016/j.envsoft.2025.106336

http://arxiv.org/abs/2501.12349v1
General Field Evaluation in High-Order Meshes on GPUs
2025-01-21T18:27:19Z
Robust and scalable function evaluation at any arbitrary point in the finite/spectral element mesh is required for querying the partial differential equation solution at points of interest, comparison of solution between different meshes, and Lagrangian particle tracking.
This is a challenging problem, particularly for high-order unstructured meshes partitioned in parallel with MPI, as it requires identifying the element that overlaps a given point and computing the corresponding reference space coordinates. We present a robust and efficient technique for general field evaluation in large-scale high-order meshes with quadrilaterals and hexahedra. In the proposed method, a combination of globally partitioned and processor-local maps is used to first determine a list of candidate MPI ranks, and then locally candidate elements that could contain a given point. Next, element-wise bounding boxes further reduce the list of candidate elements. Finally, Newton's method with trust region is used to determine the overlapping element and corresponding reference space coordinates. Since GPU-based architectures have become popular for accelerating computational analyses using meshes with tensor-product elements, specialized kernels have been developed to utilize the proposed methodology on GPUs. The method is also extended to enable general field evaluation on surface meshes. The paper concludes by demonstrating the use of the proposed method in various applications ranging from mesh-to-mesh transfer during r-adaptivity to Lagrangian particle tracking.
2025-01-21T18:27:19Z
52 pages, 17 figures, 1 table
Ketan Mittal, Aditya Parik, Som Dutta, Paul Fischer, Tzanio Kolev, James Lottes

http://arxiv.org/abs/2401.01921v2
The Cytnx Library for Tensor Networks
2025-01-20T14:30:01Z
We introduce a tensor network library designed for classical and quantum physics simulations called Cytnx (pronounced as sci-tens). This library provides an almost identical interface and syntax for both C++ and Python, allowing users to effortlessly switch between the two languages. Aiming at a quick learning process for new users of tensor network algorithms, the interfaces resemble the popular Python scientific libraries like NumPy, SciPy, and PyTorch.
Not only can multiple global Abelian symmetries be easily defined and implemented, but Cytnx also provides a new tool called Network that allows users to store large tensor networks and perform tensor network contractions in an optimal order automatically. With the integration of cuQuantum, tensor calculations can also be executed efficiently on GPUs. We present benchmark results for tensor operations on both devices, CPU and GPU. We also discuss features and higher-level interfaces to be added in the future.
2024-01-03T14:59:50Z
SciPost Phys. Codebases 53 (2025)
Kai-Hsin Wu, Chang-Teng Lin, Ke Hsu, Hao-Ti Hung, Manuel Schneider, Chia-Min Chung, Ying-Jer Kao, Pochung Chen
doi:10.21468/SciPostPhysCodeb.53