Non-Allocating Static Nonlinear Solvers for GPU Kernels: Speed and Efficiency

by Linearization Technology, March 27th, 2025

Too Long; Didn't Read

Explore non-allocating static solvers for GPU kernels. Speed up nonlinear equation solving with NonlinearSolve.jl's optimized GPU algorithms.

Abstract and 1. Introduction

2. Mathematical Description and 2.1. Numerical Algorithms for Nonlinear Equations

2.2. Globalization Strategies

2.3. Sensitivity Analysis

2.4. Matrix Coloring & Sparse Automatic Differentiation

3. Special Capabilities

3.1. Composable Building Blocks

3.2. Smart PolyAlgorithm Defaults

3.3. Non-Allocating Static Algorithms inside GPU Kernels

3.4. Automatic Sparsity Exploitation

3.5. Generalized Jacobian-Free Nonlinear Solvers using Krylov Methods

4. Results and 4.1. Robustness on 23 Test Problems

4.2. Initializing the Doyle-Fuller-Newman (DFN) Battery Model

4.3. Large Ill-Conditioned Nonlinear Brusselator System

5. Conclusion and References

3.3. Non-Allocating Static Algorithms inside GPU Kernels

NonlinearSolve.jl comes bundled with SimpleNonlinearSolve.jl, which provides specialized non-allocating solvers for extremely efficient solving of very small nonlinear systems on GPUs. These solvers implement algorithms like Newton-Raphson and Trust-Region as static, non-allocating routines that operate directly on StaticArrays of fixed size, avoiding the overhead of allocations and dynamic dispatch. This makes them ideal for embedding inside GPU kernels using KernelAbstractions.jl [55] to solve many independent small nonlinear systems in parallel across GPU threads. In the following example, we solve the generalized Rosenbrock problem [Equation (2.12)] for 1024 different initial conditions on CPU, AMD ROCm GPUs and NVIDIA CUDA GPUs using the same code.
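
A minimal sketch of such a batched solve is given here as an illustration; it is not the paper's own listing. It assumes that `using NonlinearSolve` makes `SimpleNewtonRaphson` available, and that the generalized Rosenbrock residual has the form F_1 = 1 - x_1, F_i = 10 (x_i - x_{i-1}^2) for i > 1. The names `rosenbrock`, `batched_solve_kernel!` and `batched_solve` are hypothetical helpers, and the system size of 4 is an arbitrary choice. The sketch runs on the KernelAbstractions.jl CPU backend. Targeting a GPU would presumably only require moving `u0s` into the vendor's device array type and passing the matching backend.

```julia
using NonlinearSolve, StaticArrays, KernelAbstractions

# Assumed form of the generalized Rosenbrock residual:
# F_1 = 1 - x_1 and F_i = 10 * (x_i - x_{i-1}^2) for i > 1.
function rosenbrock(x::SVector{N}, p) where {N}
    return SVector{N}(ntuple(i -> i == 1 ? 1 - x[1] : 10 * (x[i] - x[i - 1]^2), Val(N)))
end

# One work-item solves one small system; all state lives in fixed-size SVectors.
@kernel function batched_solve_kernel!(result, @Const(u0s), alg)
    i = @index(Global)
    prob = NonlinearProblem{false}(rosenbrock, u0s[i], nothing)  # out-of-place problem
    sol = solve(prob, alg)
    @inbounds result[i] = sol.u
end

# Hypothetical driver: allocate the output on the backend, launch, and wait.
function batched_solve(u0s, alg; backend = CPU())
    result = KernelAbstractions.allocate(backend, eltype(u0s), length(u0s))
    kernel! = batched_solve_kernel!(backend, 256)  # workgroup size chosen arbitrarily
    kernel!(result, u0s, alg; ndrange = length(u0s))
    KernelAbstractions.synchronize(backend)
    return result
end

# 1024 random initial conditions for a 4-variable system.
u0s = [rand(SVector{4, Float32}) for _ in 1:1024]
sols = batched_solve(u0s, SimpleNewtonRaphson())
```

Under the assumed residual, every system has its root at the vector of ones, so each entry of `sols` should be close to `ones(SVector{4, Float32})` regardless of the starting point.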



The simpler solvers significantly outperform the more general solvers in NonlinearSolve.jl for small static problems [Figure 6]. Their high performance enables applications such as massively parallel global optimization [56] and parameter estimation, where solving many small independent nonlinear systems on the GPU is advantageous. SimpleNonlinearSolve.jl provides a portable, vendor-agnostic implementation that can target different GPU architectures, such as CUDA and ROCm, with the same code.
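
For a single small system, the static solvers and the general solvers are called through the same `solve` interface, so swapping one for the other is a one-line change. The sketch below assumes `SimpleNewtonRaphson`, `SimpleTrustRegion` and `NewtonRaphson` are all exported by `using NonlinearSolve`. The test system `f` is a hypothetical example chosen for its known root, not one taken from the paper. No timings are shown; the sketch only illustrates the interchangeable API.

```julia
using NonlinearSolve, StaticArrays

# Hypothetical two-variable test system: u_i^2 - p = 0, with roots at ±sqrt(p).
f(u, p) = u .* u .- p

u0 = SVector(1.0, 1.0)                      # fixed-size, stack-allocated initial guess
prob = NonlinearProblem{false}(f, u0, 2.0)  # out-of-place problem with p = 2

sol_newton  = solve(prob, SimpleNewtonRaphson())  # static Newton-Raphson
sol_trust   = solve(prob, SimpleTrustRegion())    # static trust-region
sol_general = solve(prob, NewtonRaphson())        # general-purpose solver
```

Starting from `(1.0, 1.0)`, all three solves are expected to reach the positive root, with each component close to `sqrt(2)`.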


This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) AVIK PAL, CSAIL MIT, Cambridge, MA;

(2) FLEMMING HOLTORF;

(3) AXEL LARSSON;

(4) TORKEL LOMAN;

(5) UTKARSH;

(6) FRANK SCHÄFER;

(7) QINGYU QU;

(8) ALAN EDELMAN;

(9) CHRIS RACKAUCKAS, CSAIL MIT, Cambridge, MA.