http://arxiv.org/abs/2304.06145v2
R-Shiny Applications for Local Clustering to be Included in the growclusters for R Package
Updated: 2024-04-29T18:22:17Z
growclusters for R is a package that estimates a partition structure for multivariate data. It does this by implementing a hierarchical version of k-means clustering that accounts for possible known dependencies in a collection of datasets, where each set draws its cluster means from a single, global partition. Each component data set in the collection corresponds to a known group in the data. This paper focuses on R Shiny applications that implement the clustering methodology and simulate data sets with known group structures. These Shiny applications implement novel ways of visualizing the results of the clustering. These visualizations include scatterplots of individual data sets in the context of the entire collection and cluster distributions versus component (or sub-domain) datasets. Data obtained from a collection of 2000-2013 articles from the Bureau of Labor Statistics (BLS) Monthly Labor Review (MLR) will be used to illustrate the R-Shiny applications. Here, the known grouping in the collection is the year of publication.
Published: 2023-04-12T20:03:44Z
Comment: 17 pages, 10 figures, paper presented at 2023 Joint Statistical Meetings
Authors: Randall Powers, Wendy Martinez, Terrance Savitsky

http://arxiv.org/abs/2402.02301v2
MATLAB Simulator of Level-Index Arithmetic
Updated: 2024-04-26T16:09:10Z
Level-index arithmetic appeared in the 1980s. One of its principal purposes is to abolish the issues caused by underflows and overflows in floating point. However, level-index arithmetic does not expand the set of numbers but spaces out the numbers of large magnitude even more than floating-point representations to move the infinities further away from zero: gaps between numbers on both ends of the range become very large.
We revisit level index by presenting a custom precision simulator in MATLAB. This toolbox is useful for exploring performance of level-index arithmetic in research projects, such as using 8-bit and 16-bit representations in machine learning algorithms where narrow bit-width is desired but overflow/underflow of floating-point representations causes difficulties.
Published: 2024-02-03T23:49:08Z
Authors: Mantas Mikaitis

http://arxiv.org/abs/2404.14973v1
Symbolic Integration Algorithm Selection with Machine Learning: LSTMs vs Tree LSTMs
Updated: 2024-04-23T12:27:20Z
Computer Algebra Systems (e.g. Maple) are used in research, education, and industrial settings. One of their key functionalities is symbolic integration, where there are many sub-algorithms to choose from that can affect the form of the output integral, and the runtime. Choosing the right sub-algorithm for a given problem is challenging: we hypothesise that Machine Learning can guide this sub-algorithm choice. A key consideration of our methodology is how to represent the mathematics to the ML model: we hypothesise that a representation which encodes the tree structure of mathematical expressions would be well suited. We trained both an LSTM and a TreeLSTM model for sub-algorithm prediction and compared them to Maple's existing approach. Our TreeLSTM performs much better than the LSTM, highlighting the benefit of using an informed representation of mathematical expressions. It is able to produce better outputs than Maple's current state-of-the-art meta-algorithm, giving a strong basis for further research.
Published: 2024-04-23T12:27:20Z
Authors: Rashid Barket, Matthew England, Jürgen Gerhard

http://arxiv.org/abs/2404.13216v1
Robustness and Accuracy in Pipelined Bi-Conjugate Gradient Stabilized Method: A Comparative Study
Updated: 2024-04-19T23:49:04Z
In this article, we propose an accuracy-assuring technique for finding a solution for unsymmetric linear systems.
Such problems are related to different areas such as image processing, computer vision, and computational fluid dynamics. Parallel implementation of Krylov subspace methods speeds up finding approximate solutions for linear systems. In this context, the refined approach in pipelined BiCGStab enhances scalability on distributed memory machines, yielding substantial speed improvements compared to the standard BiCGStab method. However, it's worth noting that the pipelined BiCGStab algorithm sacrifices some accuracy, which is stabilized with the residual replacement technique. This paper aims to address this issue by employing the ExBLAS-based reproducible approach. We validate the idea on a set of matrices from the SuiteSparse Matrix Collection.
Published: 2024-04-19T23:49:04Z
Authors: Mykhailo Havdiak, Jose I. Aliaga, Roman Iakymchuk

http://arxiv.org/abs/2404.12797v1
Conversion of Boolean and Integer FlatZinc Builtins to Quadratic or Linear Integer Problems
Updated: 2024-04-19T11:24:03Z
Constraint satisfaction or optimisation models -- even if they are formulated in high-level modelling languages -- need to be reduced into an equivalent format before they can be solved by the use of Quantum Computing. In this paper we show how Boolean and integer FlatZinc builtins over finite-domain integer variables can be equivalently reformulated as linear equations, linear inequalities or binary products of those variables, i.e. as finite-domain quadratic integer programs. Those quadratic integer programs can be further transformed into equivalent Quadratic Unconstrained Binary Optimisation problem models, i.e. a general format for optimisation problems to be solved on Quantum Computers, especially on Quantum Annealers.
Published: 2024-04-19T11:24:03Z
Authors: Armin Wolf

http://arxiv.org/abs/2310.08339v2
TTK is Getting MPI-Ready
Updated: 2024-04-15T09:51:15Z
This system paper documents the technical foundations for the extension of the Topology ToolKit (TTK) to distributed-memory parallelism with the Message Passing Interface (MPI).
While several recent papers introduced topology-based approaches for distributed-memory environments, these were reporting experiments obtained with tailored, mono-algorithm implementations. In contrast, we describe in this paper a versatile approach (supporting both triangulated domains and regular grids) for the support of topological analysis pipelines, i.e. a sequence of topological algorithms interacting together. While developing this extension, we faced several algorithmic and software engineering challenges, which we document in this paper. We describe an MPI extension of TTK's data structure for triangulation representation and traversal, a central component to the global performance and generality of TTK's topological implementations. We also introduce an intermediate interface between TTK and MPI, both at the global pipeline level, and at the fine-grain algorithmic level. We provide a taxonomy for the distributed-memory topological algorithms supported by TTK, depending on their communication needs and provide examples of hybrid MPI+thread parallelizations. Performance analyses show that parallel efficiencies range from 20% to 80% (depending on the algorithms), and that the MPI-specific preconditioning introduced by our framework induces a negligible computation time overhead. We illustrate the new distributed-memory capabilities of TTK with an example of advanced analysis pipeline, combining multiple algorithms, run on the largest publicly available dataset we have found (120 billion vertices) on a cluster with 64 nodes (for a total of 1536 cores). 
Finally, we provide a roadmap for the completion of TTK's MPI extension, along with generic recommendations for each algorithm communication category.
Published: 2023-10-12T13:57:32Z
Comment: 18 pages, 13 figures
Authors: Eve Le Guillou, Michael Will, Pierre Guillou, Jonas Lukasczyk, Pierre Fortin, Christoph Garth, Julien Tierny

http://arxiv.org/abs/2404.09276v1
Algorithm xxx: Faster Randomized SVD with Dynamic Shifts
Updated: 2024-04-14T14:58:02Z
Aiming to provide a faster and convenient truncated SVD algorithm for large sparse matrices from real applications (i.e. for computing a few of the largest singular values and the corresponding singular vectors), a dynamically shifted power iteration technique is applied to improve the accuracy of the randomized SVD method. This results in a dynamic shifts based randomized SVD (dashSVD) algorithm, which also incorporates techniques for handling sparse matrices. An accuracy-control mechanism is included in the dashSVD algorithm to approximately monitor the per vector error bound of computed singular vectors with negligible overhead. Experiments on real-world data validate that the dashSVD algorithm largely improves the accuracy of the randomized SVD algorithm or attains the same accuracy with fewer passes over the matrix, and provides an efficient accuracy-control mechanism to the randomized SVD computation, while demonstrating the advantages on runtime and parallel efficiency. A bound of the approximation error of the randomized SVD with the shifted power iteration is also proved.
Published: 2024-04-14T14:58:02Z
Comment: 26 pages, accepted by ACM Transactions on Mathematical Software
Authors: Xu Feng, Wenjian Yu, Yuyang Xie, Jie Tang

http://arxiv.org/abs/2404.07293v1
sCWatter: Open source coupled wave scattering simulation for spectroscopy and microscopy
Updated: 2024-04-10T18:40:58Z
Several emerging microscopy imaging methods rely on complex interactions between the incident light and the sample. These include interferometry, spectroscopy, and nonlinear optics.
Reconstructing a sample from the measured scattered field relies on fast and accurate optical models. Fast approaches like ray tracing and the Born approximation have limitations when working with high numerical apertures. This paper presents sCWatter, an open-source tool that utilizes coupled wave theory (CWT) to simulate and visualize the 3D electric field scattered by complex samples. The sample refractive index is specified on a volumetric grid, while the incident field is provided as a 2D image orthogonal to the optical path. We introduce connection equations between layers that significantly reduce the dimensionality of the CW linear system, enabling efficient parallel processing on consumer hardware. Further optimizations using Intel MKL and CUDA significantly accelerate both field simulation and visualization.
Published: 2024-04-10T18:40:58Z
Authors: Ruijiao Sun, Rohith Reddy, David Mayerich

http://arxiv.org/abs/2404.07183v1
Massively Parallel Computation of Similarity Matrices from Piecewise Constant Invariants
Updated: 2024-04-10T17:35:36Z
We present a computational framework for piecewise constant functions (PCFs) and use this for several types of computations that are useful in statistics, e.g., averages, similarity matrices, and so on. We give a linear-time, allocation-free algorithm for working with pairs of PCFs at machine precision. From this, we derive algorithms for computing reductions of several PCFs. The algorithms have been implemented in a highly scalable fashion for parallel execution on CPU and, in some cases, (multi-)GPU, and are provided in a Python package. In addition, we provide support for multidimensional arrays of PCFs and vectorized operations on these. As a stress test, we have computed a distance matrix from 500,000 PCFs using 8 GPUs.
Published: 2024-04-10T17:35:36Z
Comment: 23 pages
Authors: Björn H. Wehlin

http://arxiv.org/abs/2404.06241v1
Confirmable Workflows in OSCAR
Updated: 2024-04-09T12:08:24Z
We discuss what is special about the reproducibility of workflows in computer algebra. It is emphasized how the programming language Julia and the new computer algebra system OSCAR support such reproducibility, and how users can benefit in their own work.
Published: 2024-04-09T12:08:24Z
Comment: 15 pages
Authors: Michael Joswig, Lars Kastner, Benjamin Lorenz

http://arxiv.org/abs/2404.05303v1
SARIS: Accelerating Stencil Computations on Energy-Efficient RISC-V Compute Clusters with Indirect Stream Registers
Updated: 2024-04-08T08:46:40Z
Stencil codes are performance-critical in many compute-intensive applications, but suffer from significant address calculation and irregular memory access overheads. This work presents SARIS, a general and highly flexible methodology for stencil acceleration using register-mapped indirect streams. We demonstrate SARIS for various stencil codes on an eight-core RISC-V compute cluster with indirect stream registers, achieving significant speedups of 2.72x, near-ideal FPU utilizations of 81%, and energy efficiency improvements of 1.58x over an RV32G baseline on average. Scaling out to a 256-core manycore system, we estimate an average FPU utilization of 64%, an average speedup of 2.14x, and up to 15% higher fractions of peak compute than a leading GPU code generator.
Published: 2024-04-08T08:46:40Z
Comment: 6 pages, 5 figures, 2 tables. Accepted at DAC 2024
Authors: Paul Scheffler, Luca Colagrande, Luca Benini

http://arxiv.org/abs/2405.01562v1
Discrete Event Simulation: It's Easy with SimPy!
Updated: 2024-04-03T06:03:09Z
This paper introduces the practicalities and benefits of using SimPy, a discrete event simulation (DES) module written in Python, for modeling and simulating complex systems. Through a step-by-step exploration of the classical Dining Philosophers Problem, we demonstrate how SimPy enables the efficient construction of discrete event models, emphasizing system states, transitions, and event handling.
We extend the scenario to introduce resources, such as chopsticks, to model contention and deadlock conditions, and showcase SimPy's capabilities in managing these scenarios. Furthermore, we explore the integration of SimPy with other Python libraries for statistical analysis, showcasing how simulation results inform system design and optimization. The versatility of SimPy is further highlighted through additional modeling scenarios, including resource constraints and customer service interactions, providing insights into the process of building, debugging, simulating, and optimizing models for a wide range of applications. This paper aims to make DES accessible to practitioners and researchers alike, emphasizing the ease with which complex simulations can be constructed, analyzed, and visualized using SimPy and the broader Python ecosystem.
Published: 2024-04-03T06:03:09Z
Comment: 19 pages; 5 figures; first published in PragPub in 2018
Authors: Dmitry Zinoviev

http://arxiv.org/abs/2404.00387v1
Inexactness and Correction of Floating-Point Reciprocal, Division and Square Root
Updated: 2024-03-30T15:02:03Z
Floating-point arithmetic performance determines the overall performance of important applications, from graphics to AI. Meeting the IEEE-754 specification for floating-point requires that final results of addition, subtraction, multiplication, division, and square root are correctly rounded based on the user-selected rounding mode. A frustrating fact for implementers is that naive rounding methods will not produce correctly rounded results even when intermediate results with greater accuracy and precision are available. In contrast, our novel algorithm can correct approximations of reciprocal, division and square root, even ones with slightly lower than target precision. In this paper, we present a family of algorithms that can both increase the accuracy (and potentially the precision) of an estimate and correctly round it according to all binary IEEE-754 rounding modes.
We explain how it may be efficiently implemented in hardware, and for completeness, we present proofs that it is not necessary to include equality tests associated with round-to-nearest-even mode for reciprocal, division and square root functions, because it is impossible for input(s) in a given precision to have exact answers exactly midway between representable floating-point numbers in that precision. In fact, our simpler proofs are sometimes stronger.
Published: 2024-03-30T15:02:03Z
Authors: Lucas M. Dutton, Christopher Kumar Anand, Robert Enenkel, Silvia Melitta Müller

http://arxiv.org/abs/2403.18030v1
EinExprs: Contraction Paths of Tensor Networks as Symbolic Expressions
Updated: 2024-03-26T18:38:00Z
Tensor Networks are graph representations of summation expressions in which vertices represent tensors and edges represent tensor indices or vector spaces. In this work, we present EinExprs.jl, a Julia package for contraction path optimization that offers state-of-the-art optimizers. We propose a representation of the contraction path of a Tensor Network based on symbolic expressions. Using this package, the user may choose among a collection of different methods such as Greedy algorithms, or an approach based on the hypergraph partitioning problem. We benchmark this library with examples obtained from the simulation of Random Quantum Circuits (RQC), a well-known example where Tensor Networks provide state-of-the-art methods.
Published: 2024-03-26T18:38:00Z
Comment: 4 pages, 5 figures, submitted to JuliaCon Proceedings 2023
Authors: Sergio Sanchez-Ramirez, Jofre Vallès-Muns, Artur Garcia-Saez

http://arxiv.org/abs/2403.15632v1
FlowFPX: Nimble Tools for Debugging Floating-Point Exceptions
Updated: 2024-03-22T22:02:36Z
Reliable numerical computations are central to scientific computing, but the floating-point arithmetic that enables large-scale models is error-prone. Numeric exceptions are a common occurrence and can propagate through code, leading to flawed results.
This paper presents FlowFPX, a toolkit for systematically debugging floating-point exceptions by recording their flow, coalescing exception contexts, and fuzzing in select locations. These tools help scientists discover when exceptions happen and track down their origin, smoothing the way to a reliable codebase.
Published: 2024-03-22T22:02:36Z
Comment: Presented at JuliaCon 2023; to appear in JuliaCon proceedings
Authors: Taylor Allred, Xinyi Li, Ashton Wiersdorf, Ben Greenman, Ganesh Gopalakrishnan