http://arxiv.org/abs/2311.09922v4 Fast multiplication by two's complement addition of numbers represented as a set of polynomial radix 2 indexes, stored as an integer list for massively parallel computation 2025-11-27T22:43:30Z

We demonstrate a multiplication method based on numbers represented as a set of polynomial radix 2 indices stored as an integer list. The 'polynomial integer index multiplication' method is a set of algorithms implemented in Python code. We demonstrate the method to be faster than both the Number Theoretic Transform (NTT) and Karatsuba for multiplication within a certain bit range; both were also implemented in Python code for comparison with the polynomial radix 2 integer method. We demonstrate that it is possible to express any integer or real number as a list of integer indices, representing a finite series in base two. This integer index representation of a number can then be stored and distributed across multiple CPUs / GPUs. We show that addition and multiplication can be applied as two's complement additions operating on the index integer representations and can be fully distributed across a given CPU / GPU architecture. We demonstrate fully distributed arithmetic operations such that the 'polynomial integer index multiplication' method overcomes the current limitation of parallel multiplication methods, i.e., the need to share common core memory and common disk for the calculation of results and intermediate results.
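
To make the index representation in the abstract above concrete, here is a minimal Python sketch. It is not the authors' algorithm; the function names (`to_indices`, `multiply_indices`, `normalize`, `from_indices`) are hypothetical, and it covers non-negative integers only. It relies on the identity $2^i \cdot 2^j = 2^{i+j}$, so a product reduces to pairwise additions of indices followed by a carry step ($2^k + 2^k = 2^{k+1}$).

```python
from collections import Counter

def to_indices(n):
    """Return the exponents of the set bits of a non-negative integer n."""
    return [i for i in range(n.bit_length()) if (n >> i) & 1]

def from_indices(indices):
    """Rebuild the integer from a list of (possibly repeated) exponents."""
    return sum(1 << i for i in indices)

def multiply_indices(a_idx, b_idx):
    """Pairwise index additions; each pair is independent of the others,
    so the work could in principle be split across workers."""
    return [i + j for i in a_idx for j in b_idx]

def normalize(indices):
    """Resolve duplicate exponents by carrying: two copies of k become one k+1."""
    counts = Counter(indices)
    result = []
    while counts:
        k = min(counts)
        c = counts.pop(k)
        if c & 1:
            result.append(k)
        if c >> 1:
            counts[k + 1] += c >> 1
    return result

product = normalize(multiply_indices(to_indices(13), to_indices(11)))
print(product, from_indices(product))
```

The pairwise additions need no shared state; only the carry step in `normalize` combines results, which is where a distributed version would need a reduction.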

2023-11-16T14:21:13Z This paper has been withdrawn after an error was identified in a key proof. The revision requires substantial re-derivation and may replace the main theorem. A corrected version may be posted once the results are verified. We therefore require additional time to rework the argument. Mark Stocks http://arxiv.org/abs/2512.00093v1 Making the RANMAR pseudorandom number generator in LAMMPS up to four times faster, with an implementation of jump-ahead 2025-11-26T15:47:58Z

Massively parallel molecular simulations require pseudorandom number streams that are provably non-overlapping and reproducible across thousands of compute units in parallel computing environments. In the widely used LAMMPS package, the standard RANMAR generator lacks a mathematically exact mechanism to jump ahead; distinct seeds are typically assigned instead, which does not ensure disjoint streams. We introduce a mathematically exact jump-ahead extension for RANMAR in LAMMPS. In practice, a single random sequence can be partitioned into consecutive, non-overlapping blocks of length $J$, with one block assigned to each compute unit under formal non-overlap guarantees. In our approach, we develop an algebraic reformulation that enables efficient jump-ahead even for very large $J$ by casting state advancement into polynomial computations over finite residue rings while keeping memory small. We implement the extension in C++ using the Number Theory Library (NTL) and integrate it into LAMMPS without altering user workflows. Beyond enabling exact partitioning, converting the 24-bit floating-point recurrence to an equivalent 24-bit integer recurrence accelerates generation itself: across diverse CPUs, generation is approximately two to four times faster than the floating-point baseline. Computing very large jumps (e.g., $J \approx 2^{120}$) remains practical.
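
The jump-ahead idea in the abstract above can be illustrated with a short Python sketch. It is not the paper's NTL implementation. It assumes the commonly described form of RANMAR's main component, a subtractive lagged Fibonacci recurrence $x_n = x_{n-97} - x_{n-33} \bmod 2^{24}$, and ignores the generator's second (arithmetic-sequence) component. The function names are hypothetical. The sketch computes $t^J$ modulo the characteristic polynomial $t^r + t^{r-s} - 1$ by square-and-multiply, then applies the resulting coefficients to the current state.

```python
def step(window, r, s, m):
    """Advance the window [x_n, ..., x_{n+r-1}] by one step of
    x_k = x_{k-r} - x_{k-s} (mod m)."""
    return window[1:] + [(window[0] - window[r - s]) % m]

def poly_mul_mod(a, b, r, s, m):
    """Multiply two polynomials of degree < r, reducing coefficients mod m
    and the result mod t^r + t^(r-s) - 1 (i.e. t^k = t^(k-r) - t^(k-s))."""
    prod = [0] * (2 * r - 1)
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                prod[i + j] = (prod[i + j] + ai * bj) % m
    for k in range(2 * r - 2, r - 1, -1):
        c = prod[k]
        if c:
            prod[k - r] = (prod[k - r] + c) % m
            prod[k - s] = (prod[k - s] - c) % m
    return prod[:r]

def jump_polynomial(J, r, s, m):
    """Coefficients of t^J reduced modulo the characteristic polynomial."""
    result = [1] + [0] * (r - 1)   # the polynomial 1
    base = [0, 1] + [0] * (r - 2)  # the polynomial t
    while J:
        if J & 1:
            result = poly_mul_mod(result, base, r, s, m)
        base = poly_mul_mod(base, base, r, s, m)
        J >>= 1
    return result

def jump(window, J, r, s, m):
    """Return the window J steps ahead without generating the steps in between."""
    coeffs = jump_polynomial(J, r, s, m)
    ext, w = list(window), list(window)
    for _ in range(r - 1):          # extend to x_n .. x_{n+2r-2}
        w = step(w, r, s, m)
        ext.append(w[-1])
    return [sum(c * ext[k + i] for i, c in enumerate(coeffs)) % m
            for k in range(r)]

R, S, M = 97, 33, 2 ** 24
state = [(i * i * 7919 + 12345) % M for i in range(R)]  # arbitrary demo state
far_state = jump(state, 2 ** 120, R, S, M)
```

The cost grows with the bit length of $J$ rather than with $J$ itself, which is why a jump of about $2^{120}$ steps stays practical in this formulation.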

2025-11-26T15:47:58Z Hiroshi Haramoto Kosuke Suzuki http://arxiv.org/abs/2511.20783v1 A derivative-free trust-region approach for Low Order-Value Optimization problems 2025-11-25T19:19:34Z

The Low Order-Value Optimization (LOVO) problem involves minimizing the minimum among a finite number of function values within a feasible set. LOVO has several practical applications, including robust parameter estimation, protein alignment, and portfolio optimization. In this work, we are interested in the constrained nonlinear optimization LOVO problem of minimizing the minimum among a finite number of function values subject to a nonempty closed convex set, where each function is a black box and continuously differentiable but its derivatives are not available. We develop the first derivative-free trust-region algorithm for constrained LOVO problems with convergence to weakly critical points. Under suitable conditions, we establish the global convergence of the algorithm and analyze its worst-case iteration complexity. An initial open-source implementation using only linear interpolation models is developed. Extensive numerical experiments and comparison with existing alternatives show the properties and the efficiency of the proposed approach when solving LOVO problems.
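
As a small illustration of the LOVO objective described above, the following Python sketch evaluates $f_{\min}(x) = \min_i f_i(x)$ and reports which function attains the minimum. The helper name `lovo_objective` and the two example functions are hypothetical and are not taken from the paper.

```python
def lovo_objective(functions, x):
    """Return (min_i f_i(x), index of a function attaining the minimum)."""
    values = [f(x) for f in functions]
    i_min = min(range(len(values)), key=values.__getitem__)
    return values[i_min], i_min

# Two smooth black-box functions; their pointwise minimum is nonsmooth
# wherever the active function changes.
fs = [lambda x: (x - 1.0) ** 2, lambda x: (x + 2.0) ** 2 + 0.5]

print(lovo_objective(fs, 0.0))   # first function is active
print(lovo_objective(fs, -2.0))  # second function is active
```

Each $f_i$ is smooth, but the active index switches as $x$ moves, so $f_{\min}$ has kinks. This is one reason methods for LOVO work with the currently active function rather than treating $f_{\min}$ as a single smooth objective.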

2025-11-25T19:19:34Z Anderson E. Schwertner Francisco N. C. Sobral http://arxiv.org/abs/2511.20198v1 Compilation of Generalized Matrix Chains with Symbolic Sizes 2025-11-25T11:23:49Z

Generalized Matrix Chains (GMCs) are products of matrices where each matrix carries features (e.g., general, symmetric, triangular, positive-definite) and is optionally transposed and/or inverted. GMCs are commonly evaluated via sequences of calls to BLAS and LAPACK kernels. When matrix sizes are known, one can craft a sequence of kernel calls to evaluate a GMC that minimizes some cost, e.g., the number of floating-point operations (FLOPs). Even in these circumstances, high-level languages and libraries, upon which users usually rely, typically perform a suboptimal mapping of the input GMC onto a sequence of kernels. In this work, we go one step beyond and consider matrix sizes to be symbolic (unknown); this changes the nature of the problem since no single sequence of kernel calls is optimal for all possible combinations of matrix sizes. We design and evaluate a code generator for GMCs with symbolic sizes that relies on multi-versioning. At compile-time, when the GMC is known but the sizes are not, code is generated for a few carefully selected sequences of kernel calls. At run-time, when sizes become known, the best generated variant for the matrix sizes at hand is selected and executed. The code generator uses new theoretical results that guarantee that the cost is within a constant factor from optimal for all matrix sizes and an empirical tuning component that further tightens the gap to optimality in practice. In experiments, we found that the increase above optimal in both FLOPs and execution time of the generated code was less than 15\% for 95\% of the tested chains.
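
The claim above that no single kernel sequence is optimal for all sizes can be seen in the smallest case, a plain three-matrix chain. The Python sketch below is not the paper's code generator. It uses the standard estimate of $2mkn$ FLOPs for multiplying an $m \times k$ matrix by a $k \times n$ matrix, and the names `VARIANTS` and `select_variant` are hypothetical. It mimics multi-versioning: both variants exist ahead of time, and one is chosen once the sizes are known.

```python
def flops_left_first(m, k, n, p):
    """FLOPs for (A B) C with A: m x k, B: k x n, C: n x p."""
    return 2 * m * k * n + 2 * m * n * p

def flops_right_first(m, k, n, p):
    """FLOPs for A (B C) with the same shapes."""
    return 2 * k * n * p + 2 * m * k * p

# "Compile time": the candidate variants are fixed before sizes are known.
VARIANTS = {"(AB)C": flops_left_first, "A(BC)": flops_right_first}

def select_variant(m, k, n, p):
    """"Run time": pick the cheapest pre-generated variant for these sizes."""
    return min(VARIANTS, key=lambda name: VARIANTS[name](m, k, n, p))

print(select_variant(1000, 10, 1000, 10))
print(select_variant(10, 1000, 10, 1000))
```

For the two size combinations shown, the cheaper parenthesization flips and the costs differ by a factor of 100.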

2025-11-25T11:23:49Z 15 pages, 6 figures Proceedings of 2026 IEEE/ACM International Symposium on Code Generation and Optimization, Sydney, Australia, 31st January-4th February, 2026 Francisco López Lars Karlsson Paolo Bientinesi http://arxiv.org/abs/2511.16174v1 Pipelined Dense Symmetric Eigenvalue Decomposition on Multi-GPU Architectures 2025-11-20T09:26:34Z

Large symmetric eigenvalue problems are commonly observed in many disciplines such as Chemistry and Physics, and several libraries including cuSOLVERMp, MAGMA and ELPA support computing large eigenvalue decompositions on multi-GPU or multi-CPU-GPU hybrid architectures. However, these libraries do not deliver satisfactory performance: all of them utilize only around 1.5\% of the peak multi-GPU performance. In this paper, we propose a pipelined two-stage eigenvalue decomposition algorithm with substantial optimizations, in place of the conventional sequential algorithm. On an 8$\times$A100 platform, our implementation surpasses state-of-the-art cuSOLVERMp and MAGMA baselines, delivering mean speedups of 5.74$\times$ and 6.59$\times$, with better strong and weak scalability.

2025-11-20T09:26:34Z 11 pages, 16 figures. Our manuscript was submitted to the PPoPP'26 conference but was not accepted. The reviewers acknowledged it as a complete and solid piece of work; however, they noted that it lacks sufficient ablation studies Hansheng Wang Ruiyi Zhan Dajun Huang Xingchen Liu Qiao Li Hancong Duan Dingwen Tao Guangming Tan Shaoshuai Zhang http://arxiv.org/abs/2507.02164v2 Hardware-Accelerated Algorithm for Complex Function Roots Density Graph Plotting 2025-11-19T20:23:55Z

Solving and visualizing the potential roots of complex functions is essential in both theoretical and applied domains, yet often computationally intensive. We present a hardware-accelerated algorithm for complex function roots density graph plotting by approximating functions with polynomials and solving their roots using single-shift QR iteration. By leveraging the Hessenberg structure of companion matrices and optimizing QR decomposition with Givens rotations, we design a pipelined FPGA architecture capable of processing a large number of polynomials with high throughput. Our implementation achieves up to 65x higher energy efficiency than CPU-based approaches, although it trails modern GPUs in performance. Compared with state-of-the-art QR decomposition solutions, our design specifically optimizes QR decomposition for complex-valued Hessenberg matrices up to size 6x6, exhibiting a moderate throughput of 16.5M QR decompositions per second, while prior works have predominantly focused on 4x4 general matrices.

2025-07-02T21:42:39Z Ruibai Tang Chengbin Quan 10.1109/TCAD.2025.3639508 http://arxiv.org/abs/2509.02202v2 DEViaN-LM: An R Package for Detecting Abnormal Values in the Gaussian Linear Model 2025-11-19T09:48:36Z

DEViaN-LM is an R package for detecting values that are poorly explained by a Gaussian linear model. The procedure is based on the maximum of the absolute value of the studentized residuals, a statistic whose distribution does not depend on the parameters of the model. This approach makes it possible to generalize several procedures used to detect abnormal values during longitudinal monitoring of certain biological markers. In this article, we describe the method used, and we show how to implement it on different real datasets.

2025-09-02T11:15:44Z Geoffroy Berthelot IRMES - URP\_7329, RELAIS Guillaume Saulière IRMES - URP\_7329 Jérôme Dedecker MAP5 - UMR 8145 http://arxiv.org/abs/2511.14966v1 A Graph-Based, Distributed Memory, Modeling Abstraction for Optimization 2025-11-18T23:11:10Z

We present a general, flexible modeling abstraction for building and working with distributed optimization problems called a RemoteOptiGraph. This abstraction extends the OptiGraph model in Plasmo.jl, where optimization problems are represented as hypergraphs with nodes that define modular subproblems (variables, constraints, and objectives) and edges that encode algebraic linking constraints between nodes. The RemoteOptiGraph allows OptiGraphs to be utilized in distributed memory environments through InterWorkerEdges, which manage linking constraints that span workers. This abstraction offers a unified approach for modeling optimization problems on distributed memory systems (avoiding bespoke modeling approaches), and provides a basis for developing general-purpose meta-algorithms that can exploit distributed memory structure such as Benders or Lagrangian decompositions. We implement this abstraction in the open-source package Plasmo.jl, and we illustrate how it can be used by solving a mixed integer capacity expansion model for the western United States containing over 12 million variables and constraints. The RemoteOptiGraph abstraction together with Benders decomposition performs 7.5 times faster than solving the same problem without decomposition.

2025-11-18T23:11:10Z 32 pages, 7 Figures David L. Cole Jordan Jalving Jonah Langlieb Jesse D. Jenkins http://arxiv.org/abs/2511.13963v1 Hessians in Birkhoff-Theoretic Trajectory Optimization 2025-11-17T22:40:51Z

This paper derives various Hessians associated with Birkhoff-theoretic methods for trajectory optimization. According to a theorem proved in this paper, approximately 80% of the eigenvalues are contained in the narrow interval [-2, 4] for all Birkhoff-discretized optimal control problems. A preliminary analysis of computational complexity is also presented with further discussions on the grand challenge of solving a million point trajectory optimization problem.

2025-11-17T22:40:51Z This paper appeared as an Engineering Note in the J. Guid. Control & Dynamics Journal of Guidance Control and Dynamics, Vol. 48, No. 9, September 2025, 2105--2112 I. M. Ross 10.2514/1.G008778 http://arxiv.org/abs/2511.13808v1 Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora 2025-11-17T17:46:23Z

A classifier using byte n-grams as features is the only approach we have found fast enough to meet requirements in size (sub 2 MB), speed (multiple GB/s), and latency (sub 10 ms) for deployment in numerous malware detection scenarios. However, we've consistently found that 6-8 grams achieve the best accuracy on our production deployments but have been unable to deploy regularly updated models due to the high cost of finding the top-k most frequent n-grams over terabytes of executable programs. Because the Zipfian distribution models the distribution of n-grams well, we exploit its properties to develop a new top-k n-gram extractor that is up to $35\times$ faster than the previous best alternative. Using our new Zipf-Gramming algorithm, we are able to scale up our production training set and obtain up to 30\% improvement in AUC at detecting new malware. We show theoretically and empirically that our approach will select the top-k items with little error, and we describe the interplay between theory and engineering required to achieve these results.
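
For orientation, the task described above (finding the top-k most frequent byte n-grams) can be written as a naive exact baseline in Python. This is not the Zipf-Gramming algorithm, whose details are not given in the abstract. It is the brute-force counting approach whose memory and time cost on terabyte-scale corpora motivates a faster extractor, and the function name `top_k_ngrams` is hypothetical.

```python
from collections import Counter

def top_k_ngrams(blobs, n, k):
    """Count every byte n-gram across all blobs exactly and return the
    k most frequent as (ngram, count) pairs."""
    counts = Counter()
    for data in blobs:
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts.most_common(k)

corpus = [b"abababab", b"ababcd"]
print(top_k_ngrams(corpus, n=2, k=2))
```

The counter holds one entry per distinct n-gram, so for 6-8 grams over a large corpus its size grows far beyond what fits in memory; exploiting the skewed, Zipf-like frequency distribution is one way to avoid tracking the long tail.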

2025-11-17T17:46:23Z Published in CIKM 2025 In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (2025) Edward Raff Ryan R. Curtin Derek Everett Robert J. Joyce James Holt 10.1145/3746252.3761551 http://arxiv.org/abs/2511.13262v1 Case study of a differentiable heterogeneous multiphysics solver for a nuclear fusion application 2025-11-17T11:23:16Z

This work presents a case study of a heterogeneous multiphysics solver from the nuclear fusion domain. At the macroscopic scale, an auto-differentiable ODE solver in JAX computes the evolution of the pulsed power circuit and bulk plasma parameters for a compressing Z Pinch. The ODE solver requires a closure for the impedance of the plasma load obtained via root-finding at every timestep, which we solve efficiently using gradient-based Newton iteration. However, incorporating non-differentiable production-grade plasma solvers like Gkeyll (a C/CUDA plasma simulation suite) into a gradient-based workflow is non-trivial. The "Tesseract" software addresses this challenge by providing a multi-physics differentiable abstraction layer made fully compatible with JAX (through the `tesseract_jax` adapter). This architecture ensures end-to-end differentiability while allowing seamless interchange between high-fidelity solvers (Gkeyll), neural surrogates, and analytical approximations for rapid, progressive prototyping.

2025-11-17T11:23:16Z Jack B. Coughlin Archis Joglekar Jonathan Brodrick Alexander Lavin http://arxiv.org/abs/2511.12705v1 An Innovative Algorithm For Robust, Interactive, Piecewise-Linear Data Exploration 2025-11-16T17:44:56Z

Many mathematical modelling tasks (such as in Economics and Finance) are informed by data that is "found" rather than being the result of carefully designed experiments. This often results in data series that are short, noisy, multidimensional and contaminated with outliers, regime shifts, and confounding, uninformative or co-linear variables. We present a generalization of the Theil-Sen algorithm to reflect modes (rather than the median) in the parameter space distribution (of partial fits to the data). This can provide a robust piecewise-linear fit to the data while also allowing for extensions that include elements of cluster analysis, regularization and cross-validation in a unified (distribution free) approach that can: 1. Exploit piecewise linearity to reduce the need to pre-specify the form of the underlying data generating process. 2. Detect non-homogeneity (e.g. regime shifts, multiple data generating processes etc.) in the data using an innovative non-parametric (Hamming-Distance/Affinity-Matrix) cluster analysis technique. 3. Enable dimension reduction and resistance to the effects of multi-co-linearity by including LASSO regularization as an integral part of the algorithm. 4. Estimate measures of accuracy, such as standard errors, bias, and confidence intervals, without needing to rely on traditional distributional assumptions. Taken together, these extensions to the traditional Theil-Sen algorithm simplify the traditional process of parameter fitting by providing a single-stage analysis controlled by a multidimensional search of Scale/Parsimony/Precision hyper-parameters. These are early days in this research and the main limitation in this approach is that it assumes that compute power is infinite and compute time is small enough to allow interactive use.
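
As background for the generalization described above, the classical Theil-Sen estimator (the median-based starting point, not the authors' mode-based extension) can be sketched in a few lines of Python. The function name `theil_sen` and the sample data are illustrative assumptions.

```python
from itertools import combinations
from statistics import median

def theil_sen(xs, ys):
    """Classical Theil-Sen line fit: the slope is the median of all pairwise
    slopes, the intercept is the median of the residual offsets."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
              if x2 != x1]
    slope = median(slopes)
    intercept = median([y - slope * x for x, y in zip(xs, ys)])
    return slope, intercept

# y = 2x + 1 with one gross outlier at x = 3.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [1, 3, 5, 100, 9, 11, 13]
print(theil_sen(xs, ys))
```

The set of pairwise slopes is the "parameter space distribution of partial fits" in its simplest form. Taking its median gives robustness to outliers, whereas looking at its modes, as the abstract proposes, could in principle separate several coexisting linear regimes.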

2025-11-16T17:44:56Z For a browser based interactive demonstration or to view the source code See https://steve--w.github.io/XIDEPages/ExtendedThielSenDemo.html This will open a simple IDE in design mode. Press "Run Mode" to see the demonstration or navigate to the "Code" tab to see the Python source code Stephen Wright Colin Paterson http://arxiv.org/abs/2511.07737v1 TurboSAT: Gradient-Guided Boolean Satisfiability Accelerated on GPU-CPU Hybrid System 2025-11-11T01:41:40Z

While accelerated computing has transformed many domains of computing, its impact on logical reasoning, specifically Boolean satisfiability (SAT), remains limited. State-of-the-art SAT solvers rely heavily on inherently sequential conflict-driven search algorithms that offer powerful heuristics but limit the amount of parallelism that could otherwise enable significantly more scalable SAT solving. Inspired by neural network training, we formulate the SAT problem as a binarized matrix-matrix multiplication layer that could be optimized using a differentiable objective function. Enabled by this encoding, we combine the strengths of parallel differentiable optimization and sequential search to accelerate SAT on a hybrid GPU-CPU system. In this system, the GPUs leverage parallel differentiable solving to rapidly evaluate SAT clauses and use gradients to stochastically explore the solution space and optimize variable assignments. Promising partial assignments generated by the GPUs are post-processed on many CPU threads which exploit conflict-driven sequential search to further traverse the solution subspaces and identify complete assignments. Prototyped on an NVIDIA DGX GB200 node, our hybrid solver achieves runtime speedups of over 200x in the best cases when compared to a state-of-the-art CPU-based solver on public satisfiable benchmark problems from the SAT Competition.

2025-11-11T01:41:40Z 7 pages, 5 equations, 5 figures, 1 table Steve Dai Cunxi Yu Kalyan Krishnamani Brucek Khailany http://arxiv.org/abs/2511.07728v1 A New Initial Approximation Bound in the Durand Kerner Algorithm for Finding Polynomial Zeros 2025-11-11T01:16:50Z

The Durand-Kerner algorithm is a widely used iterative technique for simultaneously finding all the roots of a polynomial. However, its convergence heavily depends on the choice of initial approximations. This paper introduces two novel approaches for determining the initial values, New bound 1 and the lambda maximal bound, aimed at improving the stability and convergence speed of the algorithm. Theoretical analysis and numerical experiments were conducted to evaluate the effectiveness of these bounds. The lambda maximal bound consistently ensures that all the roots lie within the complex circle, leading to faster and more stable convergence. Comparative results demonstrate that while New bound 1 guarantees convergence, it yields excessively large radii.
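
To show where the initial radius enters the method described above, here is a minimal Python sketch of the Durand-Kerner iteration. It does not implement the paper's New bound 1 or lambda maximal bound, which are not specified in the abstract. As a stand-in it places the initial guesses on a circle whose radius is the classical Cauchy bound $1 + \max_i |a_i|$ for the monic polynomial. The function name and the angular offset of 0.4 are illustrative choices.

```python
import cmath

def durand_kerner(coeffs, radius=None, tol=1e-12, max_iter=500):
    """Approximate all roots of a polynomial (coefficients highest degree
    first) by simultaneous Durand-Kerner updates."""
    n = len(coeffs) - 1
    a = [c / coeffs[0] for c in coeffs]           # make the polynomial monic
    if radius is None:
        radius = 1 + max(abs(c) for c in a[1:])   # Cauchy bound (assumed stand-in)
    # Initial guesses spread on a circle; the offset breaks conjugate symmetry.
    roots = [radius * cmath.exp(1j * (2 * cmath.pi * k / n + 0.4))
             for k in range(n)]

    def p(z):
        acc = 0
        for c in a:                               # Horner evaluation
            acc = acc * z + c
        return acc

    for _ in range(max_iter):
        new_roots = []
        for i, z in enumerate(roots):
            denom = 1
            for j, w in enumerate(roots):
                if j != i:
                    denom *= z - w
            new_roots.append(z - p(z) / denom)
        delta = max(abs(u - v) for u, v in zip(new_roots, roots))
        roots = new_roots
        if delta < tol:
            break
    return roots

# x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3)
print(sorted(durand_kerner([1, -6, 11, -6]), key=lambda z: z.real))
```

Passing a smaller `radius` that still encloses all roots typically shortens the initial phase in which the guesses contract towards the roots, which is the effect a tighter initial bound aims for.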

2025-11-11T01:16:50Z 18 pages, 6 figures, ICompac 2025 B. A. Sanjoyo M. Yunus N. Hidayat http://arxiv.org/abs/2511.07616v1 A waveform iteration implementation for black-box multi-rate higher-order coupling 2025-11-10T20:38:29Z

Many multiphysics simulations involve processes evolving on disparate time scales, posing a challenge for efficient coupling. A naive approach that synchronizes all processes using the smallest time scale wastes computational resources on slower processes and typically achieves only linear convergence in time. Waveform iteration is a promising numerical technique that enables higher-order, multi-rate coupling while treating coupled components as black boxes. However, applying this approach to PDE-based coupled simulations is nontrivial. In this paper, we integrate waveform iteration into the black-box coupling library preCICE with minimal modifications to its API. We detail how this extension interacts with key preCICE features, including data mapping for non-matching meshes, quasi-Newton acceleration for strongly coupled problems, and parallel peer-to-peer communication. We then show that waveform iteration significantly reduces numerical errors -- often by orders of magnitude. This advancement greatly enhances preCICE, benefiting its extensive user community.

2025-11-10T20:38:29Z 26 pages, 16 Figures, 1 Table, 2 Code Listings Submitted to SIAM SISC The manuscript summarizes key parts of the dissertation of Benjamin Rodenberg; Flexible and robust time stepping for partitioned multiphysics Benjamin Rodenberg Benjamin Uekermann