MIRGE: An Array-Based Computational Framework for Scientific Computing

2025-12-18T21:57:15Z

MIRGE is a computational approach for scientific computing based on NumPy-like array computation, but using lazy evaluation to recast computation as data-flow graphs, where nodes represent immutable, multi-dimensional arrays. Evaluation of an array expression is deferred until its value is needed, at which point a pipeline is invoked that transforms high-level array expressions into lower-level intermediate representations (IR) and finally into executable code, through a multi-stage process. Domain-specific transformations, such as metadata-driven optimizations, GPU-parallelization strategies, and loop fusion techniques, improve performance and memory efficiency. MIRGE employs "array contexts" to abstract the interface between array expressions and heterogeneous execution environments (for example, lazy evaluation via OpenCL, or eager evaluation via NumPy or CuPy). The framework thus enables performance portability as well as separation of concerns between application logic, low-level implementation, and optimizations. By enabling scientific expressivity while facilitating performance tuning, MIRGE offers a robust, extensible platform for both computational research and scientific application development. This paper provides an overview of MIRGE. We further describe an application of MIRGE called MIRGE-Com, for supersonic combusting flows in a discontinuous Galerkin finite-element setting. We demonstrate its capabilities as a solver and highlight its performance characteristics on large-scale GPU hardware.
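
The lazy-evaluation idea in this abstract can be illustrated with a toy sketch. It is not MIRGE's API: the names `Node`, `placeholder`, and `evaluate` are hypothetical, and "code generation" is replaced here by direct interpretation with NumPy. The point is only that arithmetic builds an immutable data-flow graph, and nothing is computed until a value is requested.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True, eq=False)
class Node:
    """Immutable node of a data-flow graph (hypothetical stand-in for a lazy array)."""
    op: str
    args: tuple = ()
    value: object = None  # only set for "data" leaves

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))


def placeholder(array):
    """Wrap concrete data as a leaf of the graph."""
    return Node("data", value=np.asarray(array, dtype=float))


def evaluate(node, cache=None):
    """Walk the graph only when a value is needed; shared nodes are computed once."""
    cache = {} if cache is None else cache
    key = id(node)
    if key in cache:
        return cache[key]
    if node.op == "data":
        result = node.value
    else:
        lhs, rhs = (evaluate(arg, cache) for arg in node.args)
        result = lhs + rhs if node.op == "add" else lhs * rhs
    cache[key] = result
    return result


x = placeholder([1.0, 2.0, 3.0])
y = placeholder([10.0, 20.0, 30.0])
expr = (x + y) * x        # builds a graph; no arithmetic has happened yet
result = evaluate(expr)   # evaluation is triggered here
print(expr.op, result)
```

A real pipeline would transform the graph (for example, fusing the add and the multiply into one loop) before generating code; this sketch interprets it node by node.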

tensorflow-riemopt: A Library for Optimization on Riemannian Manifolds

2025-12-17T18:08:09Z

This paper presents tensorflow-riemopt, a Python library for geometric machine learning in TensorFlow. The library provides efficient implementations of neural network layers with manifold-constrained parameters, geometric operations on Riemannian manifolds, and stochastic optimization algorithms for non-Euclidean spaces. Designed for integration with TensorFlow Extended, it supports both research prototyping and production deployment of machine learning pipelines. The code and documentation are distributed under the MIT license and available at https://github.com/master/tensorflow-riemopt

A Unified Framework for Automated Assembly Sequence and Production Line Planning using Graph-based Optimization

2025-12-15T11:32:45Z

This paper presents PyCAALP (Python-based Computer-Aided Assembly Line Planning), a framework for automated Assembly Sequence Planning (ASP) and Production Line Planning (PLP), employing a graph-based approach to model components and joints within production modules. The framework integrates kinematic boundary conditions, such as potential part collisions, to guarantee the feasibility of automated assembly planning. The developed algorithm computes all feasible production sequences, integrating modules for detecting spatial relationships and formulating geometric constraints. The algorithm incorporates additional attributes, including handling feasibility, tolerance matching, and joint compatibility, to manage the high combinatorial complexity inherent in assembly sequence generation. Heuristics, such as Single-Piece Flow assembly and geometrical constraint enforcement, are used to further refine the solution space, facilitating more efficient planning for complex assemblies. The PLP stage is formulated as a Mixed-Integer Program (MIP), balancing the total times of a fixed number of manufacturing stations. While some complexity reduction techniques may sacrifice optimality, they significantly reduce the MIP's computational time. Furthermore, the framework enables customization of engineering constraints and supports a flexible trade-off between ASP and PLP. The open-source nature of the framework, available at https://github.com/TUM-utg/PyCAALP, promotes further collaboration and adoption in both industrial and production research applications.

Comparison of SeDuMi and SDPT3 Solvers for Stability of Continuous-time Linear System

2025-12-15T11:08:07Z

SeDuMi and SDPT3 are two solvers for Semidefinite Programming (SDP) and Linear Matrix Inequality (LMI) problems. This paper compares their computational performance on stability problems for continuous-time linear systems. The comparison mainly focuses on computational times and memory requirements for different scales of problems. To implement and compare the two solvers on a set of well-posed problems, we employ YALMIP, a widely used toolbox for modeling and optimization in MATLAB. The primary goal of this study is to provide an empirical assessment of the relative computational efficiency of SeDuMi and SDPT3 under varying problem conditions. Our evaluation indicates that SDPT3 performs much better in large-scale, high-precision calculations.

Fast and accurate computation of classical Gaussian quadratures

2025-12-11T22:56:39Z

Algorithms for computing the classical Gaussian quadrature rules (Gauss--Jacobi, Gauss--Laguerre, and Gauss--Hermite) are presented, based on globally convergent fourth-order iterative methods combined with asymptotic approximations, which are applied in complementary regions of the parameter space. This approach yields methods that improve upon existing algorithms in speed, accuracy, and computational range. The MATLAB algorithm for Gauss--Jacobi is faster than previous methods and lifts the upper restrictions on the parameters imposed by those methods ($α,β\le 5$); for example, for degrees up to $10^6$ all nodes and weights can be computed within the underflow limit for $-1<α,β\le 30$, and the computable range of parameters is much larger for smaller degrees, limited only by intrinsic overflow/underflow constraints. For the particular case of Gauss--Legendre quadrature ($α=β=0$), a specific asymptotic approach is considered, which yields the most efficient MATLAB implementation available so far. The Gauss--Laguerre and Gauss--Hermite algorithms incorporate subsampling, and scaling is also available in order to extend the computational range. Gauss--Radau and Gauss--Lobatto variants are also considered, along with the computation of the associated barycentric weights. Additionally, arbitrary-precision algorithms (in Maple) are offered for the symmetric cases (Gauss--Gegenbauer and Gauss--Hermite), which can be used to compute thousands of nodes with hundreds of digits in a matter of seconds.
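
As a point of reference for what "computing a Gaussian quadrature rule by iteration" means, here is a minimal sketch for the Gauss--Legendre case. It uses plain second-order Newton iteration with a textbook cosine initial guess, not the fourth-order methods or asymptotic approximations of the paper, and the function name `gauss_legendre_newton` is made up for this illustration. NumPy's `leggauss` is used only as a cross-check.

```python
import numpy as np


def gauss_legendre_newton(n, tol=1e-15, max_iter=50):
    """Nodes and weights of the n-point Gauss-Legendre rule via Newton iteration."""
    i = np.arange(1, n + 1)
    x = np.cos(np.pi * (i - 0.25) / (n + 0.5))  # classical initial guesses
    for _ in range(max_iter):
        # Three-term recurrence: k P_k = (2k-1) x P_{k-1} - (k-1) P_{k-2}
        p_prev, p = np.ones_like(x), x.copy()
        for k in range(2, n + 1):
            p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
        # Derivative from (x^2 - 1) P_n'(x) = n (x P_n(x) - P_{n-1}(x))
        dp = n * (x * p - p_prev) / (x**2 - 1.0)
        dx = p / dp
        x -= dx
        if np.max(np.abs(dx)) < tol:
            break
    w = 2.0 / ((1.0 - x**2) * dp**2)
    return x[::-1], w[::-1]  # ascending node order


nodes, weights = gauss_legendre_newton(20)
ref_nodes, ref_weights = np.polynomial.legendre.leggauss(20)
print(np.max(np.abs(nodes - ref_nodes)), np.max(np.abs(weights - ref_weights)))

# An n-point rule is exact for polynomials of degree up to 2n - 1:
integral_x8 = np.sum(weights * nodes**8)  # exact value is 2/9
```

This simple approach costs O(n) work per node, so O(n^2) overall, which is one reason asymptotic methods become attractive for very large degrees.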

A Hybrid Residue Floating Numerical Architecture for High Precision Arithmetic on FPGAs

2025-12-09T22:09:50Z

Floating-point arithmetic remains expensive on FPGA platforms due to wide datapaths and normalization logic, motivating alternative representations that preserve dynamic range at lower cost. This work introduces the Hybrid Residue Floating Numerical Architecture (HRFNA), a unified arithmetic system that combines carry-free residue channels with a lightweight floating-point scaling factor. We develop the full mathematical framework, derive bounded-error normalization rules, and present FPGA-optimized microarchitectures for modular multiplication, exponent management, and hybrid reconstruction. HRFNA is implemented on a Xilinx ZCU104, with Vitis simulation, RTL synthesis, and on-chip ILA traces confirming cycle-accurate correctness. The architecture achieves a throughput improvement of over 2.1 times and a 38-52 percent LUT reduction compared to IEEE 754 single-precision baselines while maintaining numerical stability across long iterative sequences. These results demonstrate that HRFNA offers an efficient and scalable alternative to floating-point computation on modern FPGA devices.
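
The "carry-free residue channels" mentioned above refer to residue number system (RNS) arithmetic. The sketch below shows only the generic RNS idea in software: each channel multiplies independently modulo its own modulus, and the integer is recovered with the Chinese Remainder Theorem. The moduli and function names are illustrative assumptions, and the floating-point scaling factor and normalization rules of HRFNA are not modeled.

```python
from math import prod

# Illustrative, pairwise coprime moduli (an assumption, not taken from the paper).
MODULI = (251, 253, 255, 256)


def to_rns(x, moduli=MODULI):
    """Split an integer into independent residue channels."""
    return tuple(x % m for m in moduli)


def rns_mul(a, b, moduli=MODULI):
    """Channel-wise product: no carries propagate between channels."""
    return tuple((ra * rb) % m for ra, rb, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Reconstruct the integer with the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # modular inverse (Python 3.8+)
    return total % M


a, b = 12345, 6789
product = from_rns(rns_mul(to_rns(a), to_rns(b)))
print(product, a * b)
```

Reconstruction is only unique while the true result stays below the product of the moduli, which is the kind of range limitation a separate scaling factor is meant to address.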

An abstraction for solving multi-domain problems using finite element methods

2025-12-05T20:57:16Z

We introduce a new abstraction for the representation and solution of multi-domain problems using finite element methods. This is an advance over previous work in that it achieves a single higher-level abstraction that represents multi-domain problems in the mixed variational problem formalism. We implemented our new abstraction in UFL and Firedrake, and validated our implementations by solving a quad-triangle mixed-cell-type problem, a hex-quad mixed-cell-type problem, and a fluid-structure interaction benchmark problem.

ORTHOCUB: integral and differential cubature rules by orthogonal moments

2025-12-05T10:23:11Z

We discuss a numerical package, named ORTHOCUB, for the computation of linear functionals of both integral and differential type on multivariate polynomial spaces. The weighted sums corresponding to such integral and differential cubatures are implemented via orthogonal polynomial moments and auxiliary near-minimal algebraic cubature in a bounding box, with no conditioning issue since no matrix inversion or factorization is needed. The whole computational process indeed reduces to moment computation and dense matrix-vector products of relatively small size. The Matlab and Python codes are freely available, to be used as building blocks for integral and differential problems.

Compiler-supported reduced precision and AoS-SoA transformations for heterogeneous hardware

2025-12-05T08:19:43Z

This study evaluates AoS-to-SoA transformations over reduced-precision data layouts for a particle simulation code on several GPU platforms. We hypothesize that SoA fits particularly well to SIMT, while AoS is the preferred storage format for many Lagrangian codes. Reduced precision (below IEEE accuracy) is an established tool to address bandwidth constraints, although it remains unclear whether AoS and precision conversions should execute on a CPU or be deployed to a GPU if the compute kernel itself must run on an accelerator. On modern superchips where CPUs and GPUs share (logically) one data space, it is also unclear whether it is advantageous to stream data to the accelerator prior to the calculation, or whether we should let the accelerator transform data on demand, i.e., work in place logically. We therefore introduce compiler annotations to facilitate such conversions and to give the programmer the option to orchestrate the conversions in combination with GPU offloading. For some of our compute kernels of interest, Nvidia's G200 platforms yield a speedup of around 2.6 while AMD's MI300A exhibits more robust performance yet profits less. We assume that our compiler-based techniques are applicable to a wide variety of Lagrangian codes and beyond.
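
To make the two layouts concrete, the following sketch converts an array of structures (AoS) into a structure of arrays (SoA) with NumPy, narrowing the precision at the same time. It only illustrates the data-layout concept: the paper works with compiler annotations, whereas the field names, the helper `aos_to_soa`, and the use of `float32` as a stand-in for a reduced-precision format are assumptions made here.

```python
import numpy as np

# AoS: one record per particle, fields interleaved in memory.
particle_dtype = np.dtype([("x", np.float64), ("v", np.float64), ("m", np.float64)])
aos = np.zeros(4, dtype=particle_dtype)
aos["x"] = [0.0, 1.0, 2.0, 3.0]
aos["v"] = [0.5, 0.5, -0.5, -0.5]
aos["m"] = 1.0


def aos_to_soa(records, dtype=np.float32):
    """Copy each field into its own contiguous array, converting precision on the way."""
    return {name: np.ascontiguousarray(records[name], dtype=dtype)
            for name in records.dtype.names}


soa = aos_to_soa(aos)

# In the AoS view, consecutive x values are 24 bytes apart (three float64 fields);
# in the SoA copy they are adjacent 4-byte values.
print(aos["x"].strides, soa["x"].strides)
print(aos.nbytes, sum(a.nbytes for a in soa.values()))
```

The unit-stride access of the SoA arrays is what lets neighboring SIMT threads read neighboring addresses, and the narrower type halves the bytes moved per value in this example.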

OpenSQP: A Reconfigurable Open-Source SQP Algorithm in Python for Nonlinear Optimization

2025-12-05T03:18:21Z

Sequential quadratic programming (SQP) methods have been remarkably successful in solving a broad range of nonlinear optimization problems. These methods iteratively construct and solve quadratic programming (QP) subproblems to compute directions that converge to a local minimum. While numerous open-source and commercial SQP algorithms are available, their implementations lack the transparency and modularity necessary to adapt and fine-tune them for specific applications or to swap out different modules to create a new optimizer. To address this gap, we present OpenSQP, a modular and reconfigurable SQP algorithm implemented in Python that achieves robust performance comparable to leading algorithms. We implement OpenSQP in a manner that allows users to easily modify or replace components such as merit functions, line search procedures, Hessian approximations, and QP solvers. This flexibility enables the creation of tailored variants of the algorithm for specific needs. To demonstrate reliability, we present numerical results using the standard configuration of OpenSQP that employs a smooth augmented Lagrangian merit function for the line search and a quasi-Newton BFGS method for approximating the Hessians. We benchmark this configuration on a comprehensive set of problems from the CUTEst test suite. The results demonstrate performance that is competitive with proven nonlinear optimization algorithms such as SLSQP, SNOPT, and IPOPT.
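
One of the swappable components named above is the quasi-Newton Hessian approximation. As a rough illustration, here is the textbook BFGS update in NumPy. This is not OpenSQP's code: the function name is hypothetical, safeguards such as damping are omitted, and the update assumes the curvature condition y·s > 0 holds. In an SQP setting, s is the step in the variables and y is the change in the gradient of the Lagrangian.

```python
import numpy as np


def bfgs_update(B, s, y):
    """Textbook BFGS update of a Hessian approximation B (assumes y @ s > 0)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)


# Illustrative data: for a quadratic with Hessian H, the gradient change is y = H s.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
s = np.array([1.0, -0.5])
y = H @ s

B = np.eye(2)               # initial approximation
B_new = bfgs_update(B, s, y)

# The update enforces the secant equation B_new @ s == y and keeps B symmetric.
print(B_new @ s, y)
```

Because the update only needs `s` and `y`, a modular design can replace it with another approximation without touching the line search or the QP solver.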

Near Real-time Adaptive Isotropic and Anisotropic Image-to-mesh Conversion for Numerical Simulations Involving Cerebral Aneurysms

2025-12-04T18:15:51Z

Presented are two techniques that are designed to help streamline the discretization of complex vascular geometries within the numerical modeling process. The first method integrates multiple software tools into a single pipeline which can generate adaptive anisotropic meshes from segmented medical images. The pipeline is shown to satisfy quality, fidelity, smoothness, and robustness requirements while providing near real-time performance for medical image-to-mesh conversion. The second method approximates a user-defined sizing function to generate adaptive isotropic meshes of good quality and fidelity in real-time. Tested with two brain aneurysm cases and utilizing up to 96 CPU cores within a single, multicore node on Purdue University's Anvil supercomputer, the parallel adaptive anisotropic meshing method utilizes a hierarchical load balancing model (designed for large, cc-NUMA shared memory architectures) and contains an optimized local reconnection operation that performs three times faster than its original implementation from previous studies. The adaptive isotropic method is shown to generate a mesh of up to approximately 50 million elements in less than a minute while the adaptive anisotropic method is shown to generate approximately the same number of elements in about 5 minutes.

Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems

2025-12-03T22:07:32Z

Block-tridiagonal systems are prevalent in state estimation and optimal control, and solving these systems is often the computational bottleneck. Improving the underlying solvers therefore has a direct impact on the real-time performance of estimators and controllers. We present a GPU-based implementation for the factorization and solution of block-tridiagonal symmetric positive definite (SPD) linear systems. Our method employs a recursive Schur-complement reduction, transforming the original system into a hierarchy of smaller, independent systems that can be solved in parallel using batched BLAS/LAPACK routines. Performance benchmarks with our cross-platform (NVIDIA and AMD) implementation, BlockDSS, show substantial speed-ups over state-of-the-art CPU direct solvers, including CHOLMOD and HSL MA57, while remaining competitive with NVIDIA cuDSS. At the same time, the current implementation still invokes batched routines sequentially at each recursion level, and high efficiency requires block sizes large enough to amortize kernel launch overhead.
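
The recursive Schur-complement reduction described above can be sketched in NumPy as an odd-even block elimination. This is an illustrative reconstruction under stated assumptions, not BlockDSS itself: the function names are hypothetical, NumPy's stacked `np.linalg.solve` (LU-based) stands in for batched Cholesky kernels on a GPU, and the recursion falls back to a dense solve whenever the number of blocks is even or one. The convention is that `D[i]` are the diagonal blocks and `E[i]` is the block at position (i+1, i), with its transpose at (i, i+1).

```python
import numpy as np


def assemble_dense(D, E):
    """Dense matrix from diagonal blocks D[i] and sub-diagonal blocks E[i] = A[i+1, i]."""
    N, b, _ = D.shape
    A = np.zeros((N * b, N * b))
    for i in range(N):
        A[i*b:(i+1)*b, i*b:(i+1)*b] = D[i]
    for i in range(N - 1):
        A[(i+1)*b:(i+2)*b, i*b:(i+1)*b] = E[i]
        A[i*b:(i+1)*b, (i+1)*b:(i+2)*b] = E[i].T
    return A


def solve_block_tridiag(D, E, f):
    """Solve A x = f by eliminating odd-indexed blocks, recursing on the even ones."""
    N, b, _ = D.shape
    if N == 1 or N % 2 == 0:  # base case of this sketch: plain dense solve
        return np.linalg.solve(assemble_dense(D, E), f.ravel()).reshape(N, b)
    odd, even = np.arange(1, N, 2), np.arange(0, N, 2)
    L = E[odd - 1]                 # A[i, i-1] for each odd i
    R = E[odd].swapaxes(1, 2)      # A[i, i+1] for each odd i
    # Independent solves with all odd diagonal blocks at once (the "batched" step).
    XL = np.linalg.solve(D[odd], L)
    XR = np.linalg.solve(D[odd], R)
    Xf = np.linalg.solve(D[odd], f[odd][..., None])
    Lt, Rt = L.swapaxes(1, 2), R.swapaxes(1, 2)
    # Schur complement: a smaller block-tridiagonal system on the even blocks.
    Dr = D[even]                   # fancy indexing already returns a copy
    Dr[:-1] -= Lt @ XL
    Dr[1:] -= Rt @ XR
    Er = -(Rt @ XL)
    fr = f[even]
    fr[:-1] -= (Lt @ Xf)[..., 0]
    fr[1:] -= (Rt @ Xf)[..., 0]
    x = np.empty_like(f)
    x[even] = solve_block_tridiag(Dr, Er, fr)
    # Back-substitute the eliminated odd blocks, again independently of each other.
    x[odd] = (Xf - XL @ x[odd - 1][..., None] - XR @ x[odd + 1][..., None])[..., 0]
    return x


rng = np.random.default_rng(0)
N, b = 9, 3
G = rng.standard_normal((N, b, b))
D = G @ G.swapaxes(1, 2) + 4.0 * np.eye(b)   # SPD diagonal blocks
E = 0.1 * rng.standard_normal((N - 1, b, b))
f = rng.standard_normal((N, b))

x = solve_block_tridiag(D, E, f)
x_ref = np.linalg.solve(assemble_dense(D, E), f.ravel()).reshape(N, b)
print(np.max(np.abs(x - x_ref)))
```

Each recursion level roughly halves the number of blocks, and all solves within a level are independent, which is what makes the batched formulation a natural fit for GPUs.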

Maestro: Intelligent Execution for Quantum Circuit Simulation

2025-12-03T19:39:51Z

Quantum circuit simulation remains essential for developing and validating quantum algorithms, especially as current quantum hardware is limited in scale and quality. However, the growing diversity of simulation methods and software tools creates a high barrier to selecting the most suitable backend for a given circuit. We introduce Maestro, a unified interface for quantum circuit simulation that integrates multiple simulation paradigms - state vector, MPS, tensor network, stabilizer, GPU-accelerated, and p-block methods - under a single API. Maestro includes a predictive runtime model that automatically selects the optimal simulator based on circuit structure and available hardware, and applies backend-specific optimizations such as multiprocessing, GPU execution, and improved sampling. Benchmarks across heterogeneous workloads demonstrate that Maestro outperforms individual simulators in both single-circuit and large batched settings, particularly in high-performance computing environments. Maestro provides a scalable, extensible platform for quantum algorithm research, hybrid quantum-classical workflows, and emerging distributed quantum computing architectures.

On the Challenges of Energy-Efficiency Analysis in HPC Systems: Evaluating Synthetic Benchmarks and Gromacs

2025-12-03T11:40:27Z

This paper discusses the challenges encountered when analyzing the energy efficiency of synthetic benchmarks and the Gromacs package on the Fritz and Alex HPC clusters. Experiments were conducted using MPI parallelism on full sockets of Intel Ice Lake and Sapphire Rapids CPUs, as well as Nvidia A40 and A100 GPUs. The metrics and measurements obtained with the Likwid and Nvidia profiling tools are presented, along with the results. The challenges and pitfalls encountered during experimentation and analysis are revealed and discussed. Best practices for future energy efficiency analysis studies are suggested.

Virtual Parameter Sharpening: Dynamic Low-Rank Perturbations for Inference-Time Reasoning Enhancement

2025-12-02T16:54:22Z

I introduce Virtual Parameter Sharpening (VPS), an inference-time technique that augments frozen transformer linear layers with dynamic, activation-conditioned low-rank perturbations. Unlike parameter-efficient fine-tuning methods such as LoRA, which learn static low-rank adapters, VPS constructs its perturbation factors on the fly from batch activation statistics and optional gradient signals, enabling test-time adaptation without persistent parameter updates. The perturbation takes the form Delta W = gamma * W^T V U^T W, where selector matrices U and V are constructed via sparse activation-guided selection or Sylvester-coupled regression. We provide a theoretical analysis of the perturbation's spectral properties and describe an adaptive policy system that modulates perturbation magnitude based on activation energy and token-level entropy. This system incorporates multi-objective verification with iterative refinement for tasks with ground-truth supervision. We present the complete algorithmic framework, analyze its mathematical foundations, and discuss the mechanisms by which activation-conditioned computation may enhance reasoning capabilities in large language models. Implementation and experimental code are available at https://github.com/Saba-Kublashvili/vps-virtual-parameter-synthesis .