http://arxiv.org/abs/2504.01212v1
Cooper: A Library for Constrained Optimization in Deep Learning
2025-04-01T21:52:53Z
Cooper is an open-source package for solving constrained optimization problems involving deep learning models. Cooper implements several Lagrangian-based first-order update schemes, making it easy to combine constrained optimization algorithms with high-level features of PyTorch such as automatic differentiation, and specialized deep learning architectures and optimizers. Although Cooper is specifically designed for deep learning applications where gradients are estimated based on mini-batches, it is suitable for general non-convex continuous constrained optimization. Cooper's source code is available at https://github.com/cooper-org/cooper.
2025-04-01T21:52:53Z
Jose Gallego-Posada, Juan Ramirez, Meraj Hashemizadeh, Simon Lacoste-Julien

http://arxiv.org/abs/2503.21078v1
Sub-ODEs Simplify Taylor Series Algorithms for Ordinary Differential Equations
2025-03-27T01:35:32Z
A Taylor method for solving an ordinary differential equation initial-value problem $\dot x = f(t,x)$, $x(t_0) = x_0$, computes the Taylor series (TS) of the solution at the current point, truncated to some order, and then advances to the next point by summing the TS with a suitable step size.
A standard ODE method (e.g. Runge-Kutta) treats function $f$ as a black box, but a Taylor solver requires $f$ to be preprocessed into a code-list of elementary operations that it interprets as operations on (truncated) TS.
The trade-off for this extra work includes arbitrary order, typically enabling much larger step sizes.
For a standard function, such as $\exp$, this means evaluating $v(t)=\exp(u(t))$, where $u(t),v(t)$ are TS.
The sub-ODE method applies the ODE $d v/d u=v$, obeyed by $v=\exp(u)$, to in-line this operation as $\dot v=v\dot u$.
This gives economy of implementation: each function that satisfies a simple ODE goes into the "Taylor library" with a few lines of code--not needing a separate recurrence relation, which is the typical approach.
Mathematically, however, the use of sub-ODEs generally transforms the original ODE into a differential-algebraic system, making it nontrivial to ensure a sound system of recurrences for Taylor coefficients.
We prove that, regardless of how many sub-ODEs are incorporated into $f$, this approach guarantees a sound system.
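As a small illustration of the identity $\dot v = v\dot u$ mentioned above, the following Python sketch computes the truncated TS coefficients of $v=\exp(u)$ from those of $u$ by matching powers of $t$ on both sides, which gives $k\,v_k = \sum_{j=1}^{k} j\,u_j\,v_{k-j}$ with $v_0=\exp(u_0)$. This is not the authors' Matlab solver and does not reproduce how their code-list interpreter handles sub-ODEs; the function name `exp_taylor` and the list-of-coefficients representation are assumptions made only for this example.

```python
import math


def exp_taylor(u):
    """Truncated Taylor coefficients of v(t) = exp(u(t)).

    `u` holds the coefficients u_0, u_1, ..., u_{n-1} of u(t) = sum u_k t^k.
    The coefficients of v follow from the ODE v' = v * u' (hypothetical helper,
    written for illustration only).
    """
    n = len(u)
    v = [0.0] * n
    v[0] = math.exp(u[0])  # constant term: v(0) = exp(u(0))
    for k in range(1, n):
        # Coefficient of t^(k-1) in v' = v * u':
        #   k * v_k = sum_{j=1..k} j * u_j * v_{k-j}
        v[k] = sum(j * u[j] * v[k - j] for j in range(1, k + 1)) / k
    return v


# With u(t) = t the result is the Taylor series of exp(t): 1, 1, 1/2, 1/6, 1/24.
coeffs = exp_taylor([0.0, 1.0, 0.0, 0.0, 0.0])
```

Each coefficient $v_k$ depends only on lower-order coefficients of $v$ and on $u_1,\dots,u_k$, so the series can be built order by order.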
We introduce our sub-ODE-based Matlab ODE solver and show that its performance compares favorably with solvers from the Matlab ODE suite.
2025-03-27T01:35:32Z
25 pages
Nedialko S. Nedialkov, John D. Pryce

http://arxiv.org/abs/2502.17513v2
Int2Int: a framework for mathematics with transformers
2025-03-24T19:11:58Z
This paper documents Int2Int, an open source code base for using transformers on problems of mathematical research, with a focus on number theory and other problems involving integers. Int2Int is a complete PyTorch implementation of a transformer architecture, together with training and evaluation loops, and classes and functions to represent, generate and decode common mathematical objects. Ancillary code for data preparation, and Jupyter Notebooks for visualizing experimental results are also provided. This document presents the main features of Int2Int, serves as its user manual, and provides guidelines on how to extend it. Int2Int is released under the MIT licence, at https://github.com/f-charton/Int2Int.
2025-02-22T13:43:28Z
François Charton

http://arxiv.org/abs/2503.16753v1
EarlyStopping: Implicit Regularization for Iterative Learning Procedures in Python
2025-03-20T23:53:01Z
Iterative learning procedures are ubiquitous in machine learning and modern statistics.
Regularisation is typically required to prevent inflating the expected loss of a procedure in
later iterations via the propagation of noise inherent in the data.
Significant emphasis has been placed on achieving this regularisation implicitly by stopping
procedures early.
The EarlyStopping-package provides a toolbox of (in-sample) sequential early stopping rules for
several well-known iterative estimation procedures, such as truncated SVD, Landweber (gradient
descent), conjugate gradient descent, L2-boosting and regression trees.
One of the central features of the package is that the algorithms allow the specification of the
true data-generating process and keep track of relevant theoretical quantities.
In this paper, we detail the principles governing the implementation of the EarlyStopping-package and provide
a survey of recent foundational advances in the theoretical literature.
We demonstrate how to use the EarlyStopping-package to explore core features of implicit regularisation
and replicate results from the literature.
2025-03-20T23:53:01Z
Eric Ziebell, Ratmir Miftachov, Bernhard Stankewitz, Laura Hucker

http://arxiv.org/abs/2009.12009v2
AMReX: Block-Structured Adaptive Mesh Refinement for Multiphysics Applications
2025-03-19T23:59:02Z
Block-structured adaptive mesh refinement (AMR) provides the basis for the temporal and spatial discretization strategy for a number of ECP applications in the areas of accelerator design, additive manufacturing, astrophysics, combustion, cosmology, multiphase flow, and wind plant modelling. AMReX is a software framework that provides a unified infrastructure with the functionality needed for these and other AMR applications to be able to effectively and efficiently utilize machines from laptops to exascale architectures. AMR reduces the computational cost and memory footprint compared to a uniform mesh while preserving accurate descriptions of different physical processes in complex multi-physics algorithms. AMReX supports algorithms that solve systems of partial differential equations (PDEs) in simple or complex geometries, and those that use particles and/or particle-mesh operations to represent component physical processes. In this paper, we will discuss the core elements of the AMReX framework such as data containers and iterators as well as several specialized operations to meet the needs of the application projects. In addition we will highlight the strategy that the AMReX team is pursuing to achieve highly performant code across a range of accelerator-based architectures for a variety of different applications.
2020-09-25T02:59:30Z
16 pages, 9 figures, published in IJHPCA
The International Journal of High Performance Computing Applications. 2021;35(6):508-526
Weiqun Zhang, Andrew Myers, Kevin Gott, Ann Almgren, John Bell
10.1177/10943420211022811

http://arxiv.org/abs/2503.13795v1
BurTorch: Revisiting Training from First Principles by Coupling Autodiff, Math Optimization, and Systems
2025-03-18T00:52:12Z
In this work, we introduce BurTorch, a compact high-performance framework designed to optimize Deep Learning (DL) training on single-node workstations through an exceptionally efficient CPU-based backpropagation (Rumelhart et al., 1986; Linnainmaa, 1970) implementation. Although modern DL frameworks rely on compiler-like optimizations internally, BurTorch takes a different path. It adopts a minimalist design and demonstrates that, in these circumstances, classical compiled programming languages can play a significant role in DL research. By eliminating the overhead of large frameworks and making efficient implementation choices, BurTorch achieves orders-of-magnitude improvements in performance and memory efficiency when computing $\nabla f(x)$ on a CPU. BurTorch features a compact codebase designed to achieve two key goals simultaneously. First, it provides a user experience similar to script-based programming environments. Second, it dramatically minimizes runtime overheads. In large DL frameworks, the primary source of memory overhead for relatively small computation graphs $f(x)$ is due to feature-heavy implementations. We benchmarked BurTorch against widely used DL frameworks in their execution modes: JAX (Bradbury et al., 2018), PyTorch (Paszke et al., 2019), TensorFlow (Abadi et al., 2016); and several standalone libraries: Autograd (Maclaurin et al., 2015), Micrograd (Karpathy, 2020), Apple MLX (Hannun et al., 2023). For small compute graphs, BurTorch outperforms best-practice solutions by up to $\times 2000$ in runtime and reduces memory consumption by up to $\times 3500$.
For a miniaturized GPT-3 model (Brown et al., 2020), BurTorch achieves up to a $\times 20$ speedup and reduces memory up to $\times 80$ compared to PyTorch.
2025-03-18T00:52:12Z
46 pages, 7 figures, 19 tables
Konstantin Burlachenko, Peter Richtárik

http://arxiv.org/abs/2503.11355v1
TypedMatrices.jl: An Extensible and Type-Based Matrix Collection for Julia
2025-03-14T12:44:11Z
TypedMatrices.jl is a Julia package to organize test matrices. By default, the package comes with a number of built-in matrices and interfaces to help users select test cases based on their properties. The package is designed to be extensible, allowing users to define their own matrix types. We discuss the design and implementation of the package and demonstrate its usage with a number of examples.
2025-03-14T12:44:11Z
Anzhi Zhang, Massimiliano Fasi

http://arxiv.org/abs/2503.10451v1
The Willing Kingdon Clifford Algebra Library
2025-03-13T15:16:57Z
Kingdon is an open-source Python package designed to seamlessly integrate Geometric Algebra (GA) into existing workflows. Unlike previous GA libraries, kingdon is input-type-agnostic, and hence supports GAs over e.g. PyTorch tensors, NumPy arrays, or SymPy symbolic expressions, to name but a few. Despite this refusal to specialize, it delivers high performance by symbolically optimizing operators and leveraging input sparsity for Just-In-Time compiled expressions. Additionally, its visualization capabilities in Jupyter notebooks using ganja align with the rapid prototyping workflow common to scientific research.
2025-03-13T15:16:57Z
13 pages, 4 figures
Martin Roelfs

http://arxiv.org/abs/2403.09408v3
A computer algebra package for bivariate asymptotics with explicit error terms
2025-03-12T10:23:43Z
Making use of a newly developed package in the computer mathematics system SageMath, we show how to perform a full asymptotic analysis of certain types of sums that occur frequently in combinatorics, including explicit error bounds.
We present two applications of the general approach to illustrate its use: the first concerns a classical problem due to Ramanujan, while the second one concerns a question of Bóna and DeJonge on 132-avoiding permutations with a unique longest increasing subsequence that can be translated into an inequality for a certain binomial sum.
2024-03-14T14:01:19Z
Full version of extended abstract presented at the 35th International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms (AofA 2024), published in proceedings at https://doi.org/10.4230/LIPIcs.AofA.2024.19
Benjamin Hackl, Stephan Wagner

http://arxiv.org/abs/2503.06010v1
InfoFusion Controller: Informed TRRT Star with Mutual Information based on Fusion of Pure Pursuit and MPC for Enhanced Path Planning
2025-03-08T02:12:42Z
In this paper, we propose the InfoFusion Controller, an advanced path planning algorithm that integrates both global and local planning strategies to enhance autonomous driving in complex urban environments. The global planner utilizes the informed Theta-Rapidly-exploring Random Tree Star (Informed-TRRT*) algorithm to generate an optimal reference path, while the local planner combines Model Predictive Control (MPC) and Pure Pursuit algorithms. Mutual Information (MI) is employed to fuse the outputs of the MPC and Pure Pursuit controllers, effectively balancing their strengths and compensating for their weaknesses. The proposed method addresses the challenges of navigating in dynamic environments with unpredictable obstacles by reducing uncertainty in local path planning and improving dynamic obstacle avoidance capabilities. Experimental results demonstrate that the InfoFusion Controller outperforms traditional methods in terms of safety, stability, and efficiency across various scenarios, including complex maps generated using SLAM techniques.
The code for the InfoFusion Controller is available at https://github.com/DrawingProcess/InfoFusionController.
2025-03-08T02:12:42Z
Seongjun Choi, Youngbum Kim, Nam Woo Kim, Mansun Shin, Byunggi Chae, Sungjin Lee

http://arxiv.org/abs/2404.02218v2
A shared compilation stack for distributed-memory parallelism in stencil DSLs
2025-03-07T17:35:44Z
Domain Specific Languages (DSLs) increase programmer productivity and provide high performance. Their targeted abstractions allow scientists to express problems at a high level, providing rich details that optimizing compilers can exploit to target current- and next-generation supercomputers. The convenience and performance of DSLs come with significant development and maintenance costs. The siloed design of DSL compilers and the resulting inability to benefit from shared infrastructure cause uncertainties around longevity and the adoption of DSLs at scale. By tailoring the broadly-adopted MLIR compiler framework to HPC, we bring the same synergies that the machine learning community already exploits across their DSLs (e.g. TensorFlow, PyTorch) to the finite-difference stencil HPC community. We introduce new HPC-specific abstractions for message passing targeting distributed stencil computations. We demonstrate the sharing of common components across three distinct HPC stencil-DSL compilers: Devito, PSyclone, and the Open Earth Compiler, showing that our framework generates high-performance executables based upon a shared compiler ecosystem.
2024-04-02T18:11:55Z
Fix some bibtex links, journal ref
In ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 38-56 (2024)
George Bisbas, Anton Lydike, Emilien Bauer, Nick Brown, Mathieu Fehr, Lawrence Mitchell, Gabriel Rodriguez-Canal, Maurice Jamieson, Paul H. J. Kelly, Michel Steuwer, Tobias Grosser
10.1145/3620666.3651344

http://arxiv.org/abs/2503.03897v1
Endpoint-Explicit Differential Dynamic Programming via Exact Resolution
2025-03-05T20:55:16Z
We introduce a novel method for handling endpoint constraints in constrained differential dynamic programming (DDP). Unlike existing approaches, our method guarantees quadratic convergence and is exact, effectively managing rank deficiencies in both endpoint and stagewise equality constraints. It is applicable to both forward and inverse dynamics formulations, making it particularly well-suited for model predictive control (MPC) applications and for accelerating optimal control (OC) solvers. We demonstrate the efficacy of our approach across a broad range of robotics problems and provide a user-friendly open-source implementation within CROCODDYL.
2025-03-05T20:55:16Z
7 pages, IEEE ICRA paper
IEEE International Conference on Robotics and Automation, 2025
Maria Parilli, Sergi Martinez, Carlos Mastalli

http://arxiv.org/abs/2503.03016v1
QCLAB: A Matlab Toolbox for Quantum Computing
2025-03-04T21:25:46Z
We introduce QCLAB, an object-oriented MATLAB toolbox for constructing, representing, and simulating quantum circuits. Designed with an emphasis on numerical stability, efficiency, and performance, QCLAB provides a reliable platform for prototyping and testing quantum algorithms. For advanced performance needs, QCLAB++ serves as a complementary C++ package optimized for GPU-accelerated quantum circuit simulations. Together, QCLAB and QCLAB++ form a comprehensive toolkit, balancing the simplicity of MATLAB scripting with the computational power of GPU acceleration.
This paper serves as an introduction to the package and its features along with a hands-on tutorial that invites researchers to explore its capabilities right away.
2025-03-04T21:25:46Z
12 pages
Sophia Keip, Daan Camps, Roel Van Beeumen

http://arxiv.org/abs/2503.00588v1
An Improved NSGA-II with local search for multi-objective energy-efficient flowshop scheduling problem
2025-03-01T18:56:06Z
There is increasing concern about reducing energy consumption in manufacturing and other industries. Energy consumption in manufacturing industries is directly related to efficient schedules. The contributions of this paper are: i) a mathematical model of the permutation flowshop scheduling problem (PFLSP) that considers the energy consumed by each machine in the system; ii) an improved non-dominated sorting genetic algorithm, combined with the Taguchi method and further incorporating local search (NSGA-II_LS), for the multi-objective PFLSP model; iii) solutions to 90 benchmark problems of Taillard (1993) for the minimisation of flowtime (FT) and energy consumption (EC). The performance of the proposed NSGA-II_LS algorithm is evaluated on benchmark problems selected from the published literature (Li et al., 2018). The proposed algorithm performed better on both objectives, i.e., FT and EC minimization, in 5 out of 9 cases. On the FT objective alone it performed better in 8 out of 9 cases, and on the EC objective in 5 out of 9 cases. Overall, the proposed algorithm achieved 47% and 15.44% average improvement in FT and EC minimization, respectively, on the benchmark problems. From the results of the 90 benchmark problems, it is observed that the average difference in FT and EC between two solutions decreases as the problem size increases from 5 machines to 10 machines, with an exception in one case. Further, the performance of the proposed algorithm improves as the problem size increases in both jobs and machines.
These results can act as standard solutions for further research.
2025-03-01T18:56:06Z
34 pages
Vigneshwar Pesaru, Venkataramanaiah Saddikuti

http://arxiv.org/abs/2202.01085v4
Giga-scale Kernel Matrix Vector Multiplication on GPU
2025-02-24T00:50:28Z
Kernel matrix-vector multiplication (KMVM) is a foundational operation in machine learning and scientific computing. However, as KMVM tends to scale quadratically in both memory and time, applications are often limited by these computational constraints. In this paper, we propose a novel approximation procedure coined \textit{Faster-Fast and Free Memory Method} (F$^3$M) to address these scaling issues of KMVM for tall~($10^8\sim 10^9$) and skinny~($D\leq7$) data. Extensive experiments demonstrate that F$^3$M has empirical \emph{linear time and memory} complexity with a relative error of order $10^{-3}$ and can compute a full KMVM for a billion points \emph{in under a minute} on a high-end GPU, leading to a significant speed-up in comparison to existing CPU methods. We demonstrate the utility of our procedure by applying it as a drop-in for the state-of-the-art GPU-based linear solver FALKON, \emph{improving speed 1.5-5.5 times} at the cost of $<1\%$ drop in accuracy. We further demonstrate competitive results on \emph{Gaussian Process regression} coupled with significant speedups on a variety of real-world datasets.
2022-02-02T15:28:15Z
Robert Hu, Siu Lun Chau, Dino Sejdinovic, Joan Alexis Glaunès