https://arxiv.org/api/3Cf61xgyu/vFoN67W8QFqQbk8SU 2026-06-21T20:05:19Z 2664 210 15

http://arxiv.org/abs/2511.01813v2
Disciplined Biconvex Programming
2025-11-10T13:20:19Z
We introduce disciplined biconvex programming (DBCP), a modeling framework for specifying and solving biconvex optimization problems. Biconvex optimization problems arise in various applications, including machine learning, signal processing, computational science, and control. Solving a biconvex optimization problem in practice usually resolves to heuristic methods based on alternate convex search (ACS), which iteratively optimizes over one block of variables while keeping the other fixed, so that the resulting subproblems are convex and can be efficiently solved. However, designing and implementing an ACS solver for a specific biconvex optimization problem usually requires significant effort from the user, which can be tedious and error-prone. DBCP extends the principles of disciplined convex programming to biconvex problems, allowing users to specify biconvex optimization problems in a natural way based on a small number of syntax rules. The resulting problem can then be automatically split and transformed into convex subproblems, for which a customized ACS solver is then generated and applied. DBCP allows users to quickly experiment with different biconvex problem formulations, without expertise in convex optimization. We implement DBCP into the open source Python package dbcp, as an extension to the famous domain specific language CVXPY for convex optimization.
2025-11-03T18:20:03Z
Hao Zhu Joschka Boedecker

http://arxiv.org/abs/2511.04611v1
evomap: A Toolbox for Dynamic Mapping in Python
2025-11-06T18:02:58Z
This paper presents evomap, a Python package for dynamic mapping. Mapping methods are widely used across disciplines to visualize relationships among objects as spatial representations, or maps.
However, most existing statistical software supports only static mapping, which captures objects' relationships at a single point in time and lacks tools to analyze how these relationships evolve. evomap fills this gap by implementing the dynamic mapping framework EvoMap, originally proposed by Matthe, Ringel, and Skiera (2023), which adapts traditional static mapping methods for dynamic analyses. The package supports multiple mapping techniques, including variants of Multidimensional Scaling (MDS), Sammon Mapping, and t-distributed Stochastic Neighbor Embedding (t-SNE). It also includes utilities for data preprocessing, exploration, and result evaluation, offering a comprehensive toolkit for dynamic mapping applications. This paper outlines the foundations of static and dynamic mapping, describes the architecture and functionality of evomap, and illustrates its application through an extensive usage example.
2025-11-06T18:02:58Z
Accepted for publication by the Journal of Statistical Software
Maximilian Matthe

http://arxiv.org/abs/2511.02655v1
Implementing Multi-GPU Scientific Computing Miniapps Across Performance Portable Frameworks
2025-11-04T15:26:58Z
Scientific computing in the exascale era demands increased computational power to solve complex problems across various domains. With the rise of heterogeneous computing architectures, the need for vendor-agnostic, performance portability frameworks has been highlighted. Libraries like Kokkos have become essential for enabling high-performance computing applications to execute efficiently across different hardware platforms with minimal code changes. In this direction, this paper presents preliminary time-to-solution results for two representative scientific computing applications: an N-body simulation and a structured grid simulation. Both applications used a distributed memory approach and hardware acceleration through four performance portability frameworks: Kokkos, OpenMP, RAJA, and OCCA.
Experiments conducted on a single node of the Polaris supercomputer using four NVIDIA A100 GPUs revealed significant performance variability among frameworks. OCCA demonstrated faster execution times for small-scale validation problems, likely due to JIT compilation; however, its lack of optimized reduction algorithms may limit scalability for larger simulations when using its out-of-the-box API. OpenMP performed poorly in the structured grid simulation, most likely due to inefficiencies in inter-node data synchronization and communication. These findings highlight the need for further optimization to maximize each framework's capabilities. Future work will focus on enhancing reduction algorithms, data communication, and memory management, as well as performing scalability studies and a comprehensive statistical analysis to evaluate and compare framework performance.
2025-11-04T15:26:58Z
Johansell Villalobos Josef Ruzicka Silvio Rizzi
10.1109/CONCAPAN63470.2024.10933900

http://arxiv.org/abs/2511.20661v1
Evaluation of complex-valued error-like functions by the exponentially-convergent trapezoidal rule
2025-11-03T11:45:15Z
The exponentially convergent trapezoidal rule is applied to a suitable integral representation of the Faddeeva function to derive a simple formula for its evaluation. I describe its properties, strategies for maximising its efficiency, and its coupling with other evaluation methods (asymptotic expansions and Maclaurin series). From knowledge of the values of the Faddeeva function, all other complex-valued error-like functions such as $\rm erf$ and $\rm erfc$ can be easily obtained. The resulting algorithm has been implemented in a publicly-available C/C++ library named $\texttt{erflike}$ in IEEE double precision arithmetic, and tested against more widespread evaluation methods based on Taylor series and continued fractions, as provided by the widely used Faddeeva package.
It is found that the algorithm presented here and its implementation achieve better accuracy and a more regular behaviour of the relative error over vast regions of the complex plane. In terms of speed of evaluation the $\texttt{erflike}$ library also outperforms the Faddeeva package for complex-valued arguments, although not for real-valued ones.
2025-11-03T11:45:15Z
Federico Maria Guercilena

http://arxiv.org/abs/2504.07042v3
Towards a Higher Roofline for Matrix-Vector Multiplication in Matrix-Free HOSFEM
2025-11-03T03:12:05Z
Modern GPGPUs provide massive arithmetic throughput, yet many scientific kernels remain limited by memory bandwidth. In particular, repeatedly loading precomputed auxiliary data wastes abundant compute resources while stressing the memory hierarchy. A promising strategy is to replace memory traffic with inexpensive recomputation, thereby alleviating bandwidth pressure and enabling applications to better exploit heterogeneous compute units. Guided by this strategy, we optimize the high-order/spectral finite element method (HOSFEM), a widely used approach for solving PDEs. Its performance is largely determined by AxLocal, a matrix-free kernel for element-local matrix-vector multiplications. In AxLocal, geometric factors dominate memory accesses while contributing minimally to computation, creating a bandwidth bottleneck that caps the performance roofline. To address this challenge, we propose the first practical, low-overhead on-the-fly recomputation of geometric factors for trilinear and parallelepiped elements. This reformulation reduces data movement and raises the achievable roofline, revealing untapped optimization potential for tensor contractions. With hardware-aware techniques including loop unrolling, Tensor Core acceleration, and constant memory utilization, the optimized kernels reach 85%-100% of the roofline efficiency.
Compared with state-of-the-art implementations in the Nek series, they deliver speedups of 1.74x-4.10x on NVIDIA A100 and 1.99x-3.78x on Hygon K100, leading to a 1.12x-1.40x improvement in the full HOSFEM benchmark. These results demonstrate that combining algorithmic reformulation with hardware-specific tuning can remove long-standing bottlenecks and fully exploit the performance potential of large-scale high-order simulations.
2025-04-09T17:00:05Z
37 pages, 11 figures, 5 tables
Zijian Cao Qiao Sun Tiangong Zhang Huiyuan Li

http://arxiv.org/abs/2510.15649v2
A simple LAD-LASSO coordinate descent algorithm for interactive browser-based GPU applications
2025-10-27T11:31:39Z
Simultaneous variable selection and robust data fitting are important aspects of many mathematical modelling projects, and a wide array of optimisation tools and techniques exist to support them. When the intention is to embed this capability in run-time interactive decision support tools running hundreds of such modelling tasks simultaneously on a GPU, the choices of implementation approach are more limited. Recently, simple and fast Coordinate Descent algorithms have been proposed which can implement the LASSO approach to variable selection in conjunction with ordinary least squares (OLS) data fitting. However, extending this to use the more robust Least Absolute Deviation (LAD) data fitting has been hampered by the multiple axis-wise local minima that occur in the objective function for this LAD-LASSO approach. This paper suggests that these multiple axis-wise local minima form a locus which is monotonic in all the axes and that this locus has a convex objective function. This allows the locus to be searched using a ternary chop algorithm that uses Coordinate Descent to identify multiple local minima (points on this locus) as required to find the global minimum. The resulting algorithm is very simple, making it practical to implement as a single thread on a GPU.
This opens up the possibility of running many hundreds of such threads in parallel using coarse parallelisation [2]. These are early results in a wider project to explore the use of combinatorial subsets of data in interactive mathematical modelling support frameworks.
2025-10-17T13:39:57Z
For a browser-based interactive demonstration or to view the source code, see https://steve--w.github.io/XIDEPages/LAD_LASSODemo.html. This will open a simple IDE in design mode. Press "Run Mode" to see the demonstration or navigate to the "Code" tab to see the Python source code.
Stephen Michael Wright

http://arxiv.org/abs/2504.21780v2
MAGNET: an open-source library for mesh agglomeration by Graph Neural Networks
2025-10-24T14:18:52Z
We introduce MAGNET, an open-source Python library designed for mesh agglomeration in both two and three dimensions, based on employing Graph Neural Networks (GNN). MAGNET serves as a comprehensive solution for training a variety of GNN models, integrating deep learning and other advanced algorithms such as METIS and k-means to facilitate mesh agglomeration and quality metric computation. The library is introduced through its code structure and primary features. The GNN framework adopts a graph bisection methodology that capitalizes on connectivity and geometric mesh information via SAGE convolutional layers, in line with the methodology proposed by Antonietti et al. (2024). Additionally, the proposed MAGNET library incorporates reinforcement learning to enhance the accuracy and robustness of the model for predicting coarse partitions within a multilevel framework. A detailed tutorial is provided to guide the user through the process of mesh agglomeration and the training of a GNN bisection model. We present several examples of mesh agglomeration conducted by MAGNET, demonstrating the library's applicability across various scenarios.
Furthermore, the performance of the newly introduced models is contrasted with that of METIS and k-means, illustrating that the proposed GNN models are competitive regarding partition quality and computational efficiency. Finally, we exhibit the versatility of MAGNET's interface through its integration with Lymph, an open-source library implementing discontinuous Galerkin methods on polytopal grids for the numerical discretization of multiphysics differential problems.
2025-04-30T16:33:22Z
Paola F. Antonietti Matteo Caldana Ilario Mazzieri Andrea Re Fraschini

http://arxiv.org/abs/2310.19214v2
Factor Fitting, Rank Allocation, and Partitioning in Multilevel Low Rank Matrices
2025-10-23T21:18:06Z
We consider multilevel low rank (MLR) matrices, defined as a row and column permutation of a sum of matrices, each one a block diagonal refinement of the previous one, with all blocks low rank given in factored form. MLR matrices extend low rank matrices but share many of their properties, such as the total storage required and complexity of matrix-vector multiplication. We address three problems that arise in fitting a given matrix by an MLR matrix in the Frobenius norm. The first problem is factor fitting, where we adjust the factors of the MLR matrix. The second is rank allocation, where we choose the ranks of the blocks in each level, subject to the total rank having a given value, which preserves the total storage needed for the MLR matrix. The final problem is to choose the hierarchical partition of rows and columns, along with the ranks and factors. This paper is accompanied by an open source package that implements the proposed methods.
2023-10-30T00:52:17Z
Tetiana Parshakova Trevor Hastie Eric Darve Stephen Boyd

http://arxiv.org/abs/2510.20929v1
Pty-Chi: A PyTorch-based modern ptychographic data analysis package
2025-10-23T18:40:20Z
Ptychography has become an indispensable tool for high-resolution, non-destructive imaging using coherent light sources.
The processing of ptychographic data critically depends on robust, efficient, and flexible computational reconstruction software. We introduce Pty-Chi, an open-source ptychographic reconstruction package built on PyTorch that unifies state-of-the-art analytical algorithms with automatic differentiation methods. Pty-Chi provides a comprehensive suite of reconstruction algorithms while supporting advanced experimental parameter corrections such as orthogonal probe relaxation and multislice modeling. Leveraging PyTorch as the computational backend ensures vendor-agnostic GPU acceleration, multi-device parallelization, and seamless access to modern optimizers. An object-oriented, modular design makes Pty-Chi highly extendable, enabling researchers to prototype new imaging models, integrate machine learning approaches, or build entirely new workflows on top of its core components. We demonstrate Pty-Chi's capabilities through challenging case studies that involve limited coherence, low overlap, and unstable illumination during scanning, which highlight its accuracy, versatility, and extensibility. With community-driven development and open contribution, Pty-Chi offers a modern, maintainable platform for advancing computational ptychography and for enabling innovative imaging algorithms at synchrotron facilities and beyond.
2025-10-23T18:40:20Z
Ming Du Hanna Ruth Steven Henke Yi Jiang Viktor Nikitin Ashish Tripathi Junjing Deng Jeffrey Klug Peco Myint Tao Zhou Nicholas Schwarz Mathew Cherukara Alec Sandy Stefan Vogt

http://arxiv.org/abs/2510.20184v1
A Unified and Scalable Method for Optimization over Graphs of Convex Sets
2025-10-23T04:08:39Z
A Graph of Convex Sets (GCS) is a graph in which vertices are associated with convex programs and edges couple pairs of programs through additional convex costs and constraints.
Any optimization problem over an ordinary weighted graph (e.g., the shortest-path, the traveling-salesman, and the minimum-spanning-tree problems) can be naturally generalized to a GCS, yielding a new class of problems at the interface of combinatorial and convex optimization with numerous applications. In this paper, we introduce a unified method for solving any such problem. Starting from an integer linear program that models an optimization problem over a weighted graph, our method automatically produces an efficient mixed-integer convex formulation of the corresponding GCS problem. This formulation is based on homogenization (perspective) transformations, and the resulting program is solved to global optimality using off-the-shelf branch-and-bound solvers. We implement this framework in GCSOPT, an open-source and easy-to-use Python library designed for fast prototyping. We illustrate the versatility and scalability of our approach through multiple numerical examples and comparisons.
2025-10-23T04:08:39Z
Tobia Marcucci

http://arxiv.org/abs/2510.19999v1
Enhanced Cyclic Coordinate Descent Methods for Elastic Net Penalized Linear Models
2025-10-22T20:01:25Z
We present a novel enhanced cyclic coordinate descent (ECCD) framework for solving generalized linear models with elastic net constraints that reduces training time in comparison to existing state-of-the-art methods. We redesign the CD method by performing a Taylor expansion around the current iterate to avoid nonlinear operations arising in the gradient computation. By introducing this approximation, we are able to unroll the vector recurrences occurring in the CD method and reformulate the resulting computations into more efficient batched computations. We show empirically that the recurrence can be unrolled by a tunable integer parameter, $s$, such that $s > 1$ yields performance improvements without affecting convergence, whereas $s = 1$ yields the original CD method.
A key advantage of ECCD is that it avoids the convergence delay and numerical instability exhibited by block coordinate descent. Finally, we implement our proposed method in C++ using Eigen to accelerate linear algebra computations. Comparison of our method against existing state-of-the-art solvers shows consistent performance improvements of $3\times$ on average for the regularization path variant on diverse benchmark datasets. Our implementation is available at https://github.com/Yixiao-Wang-Stats/ECCD.
2025-10-22T20:01:25Z
Equal contribution: Yixiao Wang and Zishan Shao. Correspondence: yw676@duke.edu
Yixiao Wang Zishan Shao Ting Jiang Aditya Devarakonda

http://arxiv.org/abs/2509.20020v3
The Syntax and Semantics of einsum
2025-10-20T13:29:52Z
In 2011, einsum was introduced to NumPy as a practical and convenient notation for tensor expressions in machine learning, quantum circuit simulation, and other fields. It has since been implemented in additional Python frameworks such as PyTorch and TensorFlow, as well as in other programming languages such as Julia. Despite its practical success, the einsum notation still lacks a solid theoretical basis, and is not unified across the different frameworks, limiting opportunities for formal reasoning and systematic optimization. In this work, we discuss the terminology of tensor expressions and provide a formal definition of the einsum language. Based on this definition, we formalize and prove important equivalence rules for tensor expressions and highlight their relevance in practical applications.
2025-09-24T11:36:02Z
21 pages, 1 figure. Includes formal definitions, proofs of algebraic properties, and nesting/denesting rules for the einsum notation
Maurice Wenig Paul G.
Rump Mark Blacher Joachim Giesen

http://arxiv.org/abs/2510.16625v1
QRTlib: A Library for Fast Quantum Real Transforms
2025-10-18T19:45:31Z
Real-valued transforms such as the discrete cosine, sine, and Hartley transforms play a central role in classical computing, complementing the Fourier transform in applications from signal and image processing to data compression. However, their quantum counterparts have not evolved in parallel, and no unified framework exists for implementing them efficiently on quantum hardware. This article addresses this gap by introducing QRTlib, a library for fast and practical implementations of quantum real transforms, including the quantum Hartley, cosine, and sine transforms of various types. We develop new algorithms and circuit optimizations that make these transforms efficient and suitable for near-term devices. In particular, we present a quantum Hartley transform based on the linear combination of unitaries (LCU) technique, achieving a fourfold reduction in circuit size compared to prior methods, and an improved quantum sine transform of Type I that removes large multi-controlled operations. We also introduce circuit-level optimizations, including two's-complement and or-tree constructions. QRTlib provides the first complete implementations of these quantum real transforms in Qiskit.
2025-10-18T19:45:31Z
Armin Ahmadkhaniha Lu Chen Jake Doliskani Zhifu Sun

http://arxiv.org/abs/2510.16284v1
Communication-Efficient and Memory-Aware Parallel Bootstrapping using MPI
2025-10-18T00:47:47Z
Bootstrapping is a powerful statistical resampling technique for estimating the sampling distribution of an estimator. However, its computational cost becomes prohibitive for large datasets or a high number of resamples. This paper presents a theoretical analysis and design of parallel bootstrapping algorithms using the Message Passing Interface (MPI).
We address two key challenges: high communication overhead and memory constraints in distributed environments. We propose two novel strategies: 1) Local Statistic Aggregation, which drastically reduces communication by transmitting sufficient statistics instead of full resampled datasets, and 2) Synchronized Pseudo-Random Number Generation, which enables distributed resampling when the entire dataset cannot be stored on a single process. We develop analytical models for communication and computation complexity, comparing our methods against naive baseline approaches. Our analysis demonstrates that the proposed methods offer significant reductions in communication volume and memory usage, facilitating scalable parallel bootstrapping on large-scale systems.
2025-10-18T00:47:47Z
6 pages
Di Zhang

http://arxiv.org/abs/2510.14049v2
CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations
2025-10-17T15:05:53Z
Causal Representation Learning (CRL) aims to uncover the data-generating process and identify the underlying causal variables and relations, whose evaluation remains inherently challenging due to the requirement of known ground-truth causal variables and causal structure. Existing evaluations often rely on either simplistic synthetic datasets or downstream performance on real-world tasks, generally suffering a dilemma between realism and evaluative precision. In this paper, we introduce a new benchmark for CRL using high-fidelity simulated visual data that retains both realistic visual complexity and, more importantly, access to ground-truth causal generating processes. The dataset comprises around 200 thousand images and 3 million video frames across 24 sub-scenes in four domains: static image generation, dynamic physical simulations, robotic manipulations, and traffic situation analysis.
These scenarios range from static to dynamic settings, simple to complex structures, and single to multi-agent interactions, offering a comprehensive testbed that hopefully bridges the gap between rigorous evaluation and real-world applicability. In addition, we provide flexible access to the underlying causal structures, allowing users to modify or configure them to align with the required assumptions in CRL, such as available domain labels, temporal dependencies, or intervention histories. Leveraging this benchmark, we evaluated representative CRL methods across diverse paradigms and offered empirical insights to assist practitioners and newcomers in choosing or extending appropriate CRL frameworks to properly address specific types of real problems that can benefit from the CRL perspective. Project page: https://causal-verse.github.io/. Dataset: https://huggingface.co/CausalVerse.
2025-10-15T19:39:22Z
Guangyi Chen Yunlong Deng Peiyuan Zhu Yan Li Yifan Shen Zijian Li Kun Zhang