http://arxiv.org/abs/2311.09922v4
Fast multiplication by two's complement addition of numbers represented as a set of polynomial radix 2 indexes, stored as an integer list for massively parallel computation
2025-11-27T22:43:30Z
We demonstrate a multiplication method based on numbers represented as a set of polynomial radix 2 indices stored as an integer list. The 'polynomial integer index multiplication' method is a set of algorithms implemented in Python code. We demonstrate the method to be faster than both the Number Theoretic Transform (NTT) and Karatsuba for multiplication within a certain bit range; both are also implemented in Python code for comparison with the polynomial radix 2 integer method. We demonstrate that it is possible to express any integer or real number as a list of integer indices, representing a finite series in base two. The finite series of integer index representation of a number can then be stored and distributed across multiple CPUs / GPUs. We show that operations of addition and multiplication can be applied as two's complement additions operating on the index integer representations and can be fully distributed across a given CPU / GPU architecture. We demonstrate fully distributed arithmetic operations such that the 'polynomial integer index multiplication' method overcomes the current limitation of parallel multiplication methods, i.e., the need to share common core memory and common disk for the calculation of results and intermediate results.
2023-11-16T14:21:13Z
This paper has been withdrawn after an error was identified in a key proof. The revision requires substantial re-derivation and may replace the main theorem. A corrected version may be posted once the results are verified.
Therefore, we require additional time to rework the argument.
Mark Stocks

http://arxiv.org/abs/2512.00093v1
Making the RANMAR pseudorandom number generator in LAMMPS up to four times faster, with an implementation of jump-ahead
2025-11-26T15:47:58Z
Massively parallel molecular simulations require pseudorandom number streams that are provably non-overlapping and reproducible across thousands of compute units in parallel computing environments. In the widely used LAMMPS package, the standard RANMAR generator lacks a mathematically exact mechanism to jump ahead; distinct seeds are typically assigned instead, which does not ensure disjoint streams. We introduce a mathematically exact jump-ahead extension for RANMAR in LAMMPS. In practice, a single random sequence can be partitioned into consecutive, non-overlapping blocks of length $J$, with one block assigned to each compute unit under formal non-overlap guarantees. In our approach, we develop an algebraic reformulation that enables efficient jump-ahead even for very large $J$ by casting state advancement into polynomial computations over finite residue rings while keeping memory small. We implement the extension in C++ using the Number Theory Library (NTL) and integrate it into LAMMPS without altering user workflows. Beyond enabling exact partitioning, converting the 24-bit floating-point recurrence to an equivalent 24-bit integer recurrence accelerates generation itself: across diverse CPUs, generation is approximately two to four times faster than the floating-point baseline. Computing very large jumps (e.g., $J \approx 2^{120}$) remains practical.
2025-11-26T15:47:58Z
Hiroshi Haramoto, Kosuke Suzuki

http://arxiv.org/abs/2511.20783v1
A derivative-free trust-region approach for Low Order-Value Optimization problems
2025-11-25T19:19:34Z
The Low Order-Value Optimization (LOVO) problem involves minimizing the minimum among a finite number of function values within a feasible set.
LOVO has several practical applications, such as robust parameter estimation, protein alignment, and portfolio optimization. In this work, we are interested in the constrained nonlinear optimization LOVO problem of minimizing the minimum between a finite number of function values subject to a nonempty closed convex set, where each function is a black box and continuously differentiable, but the derivatives are not available. We develop the first derivative-free trust-region algorithm for constrained LOVO problems with convergence to weakly critical points. Under suitable conditions, we establish the global convergence of the algorithm and analyze its worst-case iteration complexity. An initial open-source implementation using only linear interpolation models is developed. Extensive numerical experiments and comparison with existing alternatives show the properties and the efficiency of the proposed approach when solving LOVO problems.
2025-11-25T19:19:34Z
Anderson E. Schwertner, Francisco N. C. Sobral

http://arxiv.org/abs/2511.20198v1
Compilation of Generalized Matrix Chains with Symbolic Sizes
2025-11-25T11:23:49Z
Generalized Matrix Chains (GMCs) are products of matrices where each matrix carries features (e.g., general, symmetric, triangular, positive-definite) and is optionally transposed and/or inverted. GMCs are commonly evaluated via sequences of calls to BLAS and LAPACK kernels. When matrix sizes are known, one can craft a sequence of kernel calls to evaluate a GMC that minimizes some cost, e.g., the number of floating-point operations (FLOPs). Even in these circumstances, high-level languages and libraries, upon which users usually rely, typically perform a suboptimal mapping of the input GMC onto a sequence of kernels. In this work, we go one step beyond and consider matrix sizes to be symbolic (unknown); this changes the nature of the problem since no single sequence of kernel calls is optimal for all possible combinations of matrix sizes.
We design and evaluate a code generator for GMCs with symbolic sizes that relies on multi-versioning. At compile-time, when the GMC is known but the sizes are not, code is generated for a few carefully selected sequences of kernel calls. At run-time, when sizes become known, the best generated variant for the matrix sizes at hand is selected and executed. The code generator uses new theoretical results that guarantee that the cost is within a constant factor from optimal for all matrix sizes and an empirical tuning component that further tightens the gap to optimality in practice. In experiments, we found that the increase above optimal in both FLOPs and execution time of the generated code was less than 15\% for 95\% of the tested chains.
2025-11-25T11:23:49Z
15 pages, 6 figures
Proceedings of 2026 IEEE/ACM International Symposium on Code Generation and Optimization, Sydney, Australia, 31st January-4th February, 2026
Francisco López, Lars Karlsson, Paolo Bientinesi

http://arxiv.org/abs/2511.16174v1
Pipelined Dense Symmetric Eigenvalue Decomposition on Multi-GPU Architectures
2025-11-20T09:26:34Z
Large symmetric eigenvalue problems are commonly observed in many disciplines such as Chemistry and Physics, and several libraries including cuSOLVERMp, MAGMA and ELPA support computing large eigenvalue decompositions on multi-GPU or multi-CPU-GPU hybrid architectures. However, these libraries do not deliver satisfactory performance: all of them utilize only around 1.5\% of the peak multi-GPU performance. In this paper, we propose a pipelined two-stage eigenvalue decomposition algorithm, with substantial optimizations, in place of the conventional sequential algorithm. On an 8$\times$A100 platform, our implementation surpasses state-of-the-art cuSOLVERMp and MAGMA baselines, delivering mean speedups of 5.74$\times$ and 6.59$\times$, with better strong and weak scalability.
2025-11-20T09:26:34Z
11 pages, 16 figures.
Our manuscript was submitted to the PPoPP'26 conference but was not accepted. The reviewers acknowledged it as a complete and solid piece of work; however, they noted that it lacks sufficient ablation studies.
Hansheng Wang, Ruiyi Zhan, Dajun Huang, Xingchen Liu, Qiao Li, Hancong Duan, Dingwen Tao, Guangming Tan, Shaoshuai Zhang

http://arxiv.org/abs/2507.02164v2
Hardware-Accelerated Algorithm for Complex Function Roots Density Graph Plotting
2025-11-19T20:23:55Z
Solving and visualizing the potential roots of complex functions is essential in both theoretical and applied domains, yet often computationally intensive. We present a hardware-accelerated algorithm for complex function roots density graph plotting by approximating functions with polynomials and solving their roots using single-shift QR iteration. By leveraging the Hessenberg structure of companion matrices and optimizing QR decomposition with Givens rotations, we design a pipelined FPGA architecture capable of processing a large number of polynomials with high throughput. Our implementation achieves up to 65x higher energy efficiency than CPU-based approaches, although it trails modern GPUs in performance. Compared with state-of-the-art QR decomposition solutions, our design specifically optimizes QR decomposition for complex-valued Hessenberg matrices up to size 6x6, exhibiting a moderate throughput of 16.5M QR decompositions per second, while prior works have predominantly focused on 4x4 general matrices.
2025-07-02T21:42:39Z
Ruibai Tang, Chengbin Quan
10.1109/TCAD.2025.3639508

http://arxiv.org/abs/2509.02202v2
DEViaN-LM: An R Package for Detecting Abnormal Values in the Gaussian Linear Model
2025-11-19T09:48:36Z
DEViaN-LM is an R package for detecting values that are poorly explained by a Gaussian linear model. The procedure is based on the maximum of the absolute value of the studentized residuals, a statistic that is free of the parameters of the model.
This approach makes it possible to generalize several procedures used to detect abnormal values during longitudinal monitoring of certain biological markers. In this article, we describe the method used, and we show how to implement it on different real datasets.
2025-09-02T11:15:44Z
Geoffroy Berthelot (IRMES - URP\_7329, RELAIS), Guillaume Saulière (IRMES - URP\_7329), Jérôme Dedecker (MAP5 - UMR 8145)

http://arxiv.org/abs/2511.14966v1
A Graph-Based, Distributed Memory, Modeling Abstraction for Optimization
2025-11-18T23:11:10Z
We present a general, flexible modeling abstraction for building and working with distributed optimization problems called a RemoteOptiGraph. This abstraction extends the OptiGraph model in Plasmo$.$jl, where optimization problems are represented as hypergraphs with nodes that define modular subproblems (variables, constraints, and objectives) and edges that encode algebraic linking constraints between nodes. The RemoteOptiGraph allows OptiGraphs to be utilized in distributed memory environments through InterWorkerEdges, which manage linking constraints that span workers. This abstraction offers a unified approach for modeling optimization problems on distributed memory systems (avoiding bespoke modeling approaches), and provides a basis for developing general-purpose meta-algorithms that can exploit distributed memory structure such as Benders or Lagrangian decompositions. We implement this abstraction in the open-source package Plasmo$.$jl, and we illustrate how it can be used by solving a mixed integer capacity expansion model for the western United States containing over 12 million variables and constraints. The RemoteOptiGraph abstraction together with Benders decomposition performs 7.5 times faster than solving the same problem without decomposition.
2025-11-18T23:11:10Z
32 pages, 7 Figures
David L. Cole, Jordan Jalving, Jonah Langlieb, Jesse D. 
Jenkins

http://arxiv.org/abs/2511.13963v1
Hessians in Birkhoff-Theoretic Trajectory Optimization
2025-11-17T22:40:51Z
This paper derives various Hessians associated with Birkhoff-theoretic methods for trajectory optimization. According to a theorem proved in this paper, approximately 80% of the eigenvalues are contained in the narrow interval [-2, 4] for all Birkhoff-discretized optimal control problems. A preliminary analysis of computational complexity is also presented with further discussions on the grand challenge of solving a million point trajectory optimization problem.
2025-11-17T22:40:51Z
This paper appeared as an Engineering Note in the J. Guid. Control & Dynamics
Journal of Guidance Control and Dynamics, Vol. 48, No. 9, September 2025, 2105--2112
I. M. Ross
10.2514/1.G008778

http://arxiv.org/abs/2511.13808v1
Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora
2025-11-17T17:46:23Z
A classifier using byte n-grams as features is the only approach we have found fast enough to meet requirements in size (sub 2 MB), speed (multiple GB/s), and latency (sub 10 ms) for deployment in numerous malware detection scenarios. However, we've consistently found that 6-8 grams achieve the best accuracy on our production deployments but have been unable to deploy regularly updated models due to the high cost of finding the top-k most frequent n-grams over terabytes of executable programs. Because the Zipfian distribution well models the distribution of n-grams, we exploit its properties to develop a new top-k n-gram extractor that is up to $35\times$ faster than the previous best alternative. Using our new Zipf-Gramming algorithm, we are able to scale up our production training set and obtain up to 30\% improvement in AUC at detecting new malware.
We show theoretically and empirically that our approach will select the top-k items with little error, and we describe the interplay between theory and engineering required to achieve these results.
2025-11-17T17:46:23Z
Published in CIKM 2025
In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (2025)
Edward Raff, Ryan R. Curtin, Derek Everett, Robert J. Joyce, James Holt
10.1145/3746252.3761551

http://arxiv.org/abs/2511.13262v1
Case study of a differentiable heterogeneous multiphysics solver for a nuclear fusion application
2025-11-17T11:23:16Z
This work presents a case study of a heterogeneous multiphysics solver from the nuclear fusion domain. At the macroscopic scale, an auto-differentiable ODE solver in JAX computes the evolution of the pulsed power circuit and bulk plasma parameters for a compressing Z Pinch. The ODE solver requires a closure for the impedance of the plasma load obtained via root-finding at every timestep, which we solve efficiently using gradient-based Newton iteration. However, incorporating non-differentiable production-grade plasma solvers like Gkeyll (a C/CUDA plasma simulation suite) into a gradient-based workflow is non-trivial. The "Tesseract" software addresses this challenge by providing a multi-physics differentiable abstraction layer made fully compatible with JAX (through the `tesseract_jax` adapter). This architecture ensures end-to-end differentiability while allowing seamless interchange between high-fidelity solvers (Gkeyll), neural surrogates, and analytical approximations for rapid, progressive prototyping.
2025-11-17T11:23:16Z
Jack B. Coughlin, Archis Joglekar, Jonathan Brodrick, Alexander Lavin

http://arxiv.org/abs/2511.12705v1
An Innovative Algorithm For Robust, Interactive, Piecewise-Linear Data Exploration
2025-11-16T17:44:56Z
Many mathematical modelling tasks (such as in Economics and Finance) are informed by data that is "found" rather than being the result of carefully designed experiments.
This often results in data series that are short, noisy, multidimensional and contaminated with outliers, regime shifts, and confounding, uninformative or co-linear variables.
We present a generalization of the Theil-Sen algorithm to reflect modes (rather than the median) in the parameter space distribution (of partial fits to the data). This can provide a robust piecewise-linear fit to the data while also allowing for extensions that include elements of cluster analysis, regularization and cross-validation in a unified (distribution free) approach that can:
1. Exploit piecewise linearity to reduce the need to pre-specify the form of the underlying data generating process.
2. Detect non-homogeneity (e.g. regime shifts, multiple data generating processes etc.) in the data using an innovative non-parametric (Hamming-Distance/Affinity-Matrix) cluster analysis technique.
3. Enable dimension reduction and resistance to the effects of multi-co-linearity by including LASSO regularization as an integral part of the algorithm.
4. Estimate measures of accuracy, such as standard errors, bias, and confidence intervals, without needing to rely on traditional distributional assumptions.
Taken together, these extensions to the traditional Theil-Sen algorithm simplify the traditional process of parameter fitting by providing a single-stage analysis controlled by a multidimensional search of Scale/Parsimony/Precision hyper-parameters. These are early days in this research and the main limitation in this approach is that it assumes that compute power is infinite and compute time is small enough to allow interactive use.
2025-11-16T17:44:56Z
For a browser based interactive demonstration or to view the source code, see https://steve--w.github.io/XIDEPages/ExtendedThielSenDemo.html This will open a simple IDE in design mode. Press "Run Mode" to see the demonstration or navigate to the "Code" tab to see the Python source code.
Stephen Wright, Colin Paterson

http://arxiv.org/abs/2511.07737v1
TurboSAT: Gradient-Guided Boolean Satisfiability Accelerated on GPU-CPU Hybrid System
2025-11-11T01:41:40Z
While accelerated computing has transformed many domains of computing, its impact on logical reasoning, specifically Boolean satisfiability (SAT), remains limited. State-of-the-art SAT solvers rely heavily on inherently sequential conflict-driven search algorithms that offer powerful heuristics but limit the amount of parallelism that could otherwise enable significantly more scalable SAT solving. Inspired by neural network training, we formulate the SAT problem as a binarized matrix-matrix multiplication layer that could be optimized using a differentiable objective function. Enabled by this encoding, we combine the strengths of parallel differentiable optimization and sequential search to accelerate SAT on a hybrid GPU-CPU system. In this system, the GPUs leverage parallel differentiable solving to rapidly evaluate SAT clauses and use gradients to stochastically explore the solution space and optimize variable assignments.
Promising partial assignments generated by the GPUs are post-processed on many CPU threads which exploit conflict-driven sequential search to further traverse the solution subspaces and identify complete assignments. Prototyping the hybrid solver on an NVIDIA DGX GB200 node, our solver achieves runtime speedups up to over 200x when compared to a state-of-the-art CPU-based solver on public satisfiable benchmark problems from the SAT Competition.
2025-11-11T01:41:40Z
7 pages, 5 equations, 5 figures, 1 table
Steve Dai, Cunxi Yu, Kalyan Krishnamani, Brucek Khailany

http://arxiv.org/abs/2511.07728v1
A New Initial Approximation Bound in the Durand Kerner Algorithm for Finding Polynomial Zeros
2025-11-11T01:16:50Z
The Durand-Kerner algorithm is a widely used iterative technique for simultaneously finding all the roots of a polynomial. However, its convergence heavily depends on the choice of initial approximations. This paper introduces two novel approaches for determining the initial values: New bound 1 and the lambda maximal bound, aimed at improving the stability and convergence speed of the algorithm. Theoretical analysis and numerical experiments were conducted to evaluate the effectiveness of these bounds. The lambda maximal bound consistently ensures that all the roots lie within the complex circle, leading to faster and more stable convergence. Comparative results demonstrate that while New bound 1 guarantees convergence, it yields excessively large radii.
2025-11-11T01:16:50Z
18 pages, 6 figures, ICompac 2025
B. A. Sanjoyo, M. Yunus, N. Hidayat

http://arxiv.org/abs/2511.07616v1
A waveform iteration implementation for black-box multi-rate higher-order coupling
2025-11-10T20:38:29Z
Many multiphysics simulations involve processes evolving on disparate time scales, posing a challenge for efficient coupling.
A naive approach that synchronizes all processes using the smallest time scale wastes computational resources on slower processes and typically achieves only linear convergence in time. Waveform iteration is a promising numerical technique that enables higher-order, multi-rate coupling while treating coupled components as black boxes. However, applying this approach to PDE-based coupled simulations is nontrivial.
In this paper, we integrate waveform iteration into the black-box coupling library preCICE with minimal modifications to its API. We detail how this extension interacts with key preCICE features, including data mapping for non-matching meshes, quasi-Newton acceleration for strongly coupled problems, and parallel peer-to-peer communication. We then showcase that waveform iteration significantly reduces numerical errors -- often by orders of magnitude. This advancement greatly enhances preCICE, benefiting its extensive user community.
2025-11-10T20:38:29Z
26 pages, 16 Figures, 1 Table, 2 Code Listings. Submitted to SIAM SISC. The manuscript summarizes key parts of the dissertation of Benjamin Rodenberg: Flexible and robust time stepping for partitioned multiphysics
Benjamin Rodenberg, Benjamin Uekermann