
High Performance Computing & GPU-Accelerated Scientific Software

Robert Carson
Working at the intersection of HPC systems and solid mechanics. Builder of open-source scientific software in Rust and C++. Landscape photographer.

Overview

For roughly forty years, scientific codes got faster by waiting for the next hardware generation. That era ended. Peak floating-point throughput stopped scaling the way it used to, and the compute density now available in leadership-class machines comes almost entirely from GPU accelerators — thousands of streaming multiprocessors per node, operating under a fundamentally different programming model than the CPU clusters that most production scientific software was written for.

The US Exascale Computing Project (ECP), organized around the deployment of Frontier (ORNL) and El Capitan (LLNL), recognized that reaching exascale performance required rebuilding large portions of the scientific software stack — not just porting existing codes, but rethinking algorithms, data structures, and software interfaces to exploit the arithmetic intensity and parallelism that GPUs demand. A second constraint made this harder: the three major GPU vendors (NVIDIA, AMD, Intel) each had distinct hardware designs and programming models, and the community needed solutions that worked across all of them without separate codebases.

My contributions to this effort span four layers of the scientific computing stack. At the bottom is foundational dense linear algebra (MAGMA). Above that, high-order finite element algorithms and assembly (MFEM). Above that, domain-specific simulation codes for materials applications (ExaConstit, ExaCMech). And at the top, scientific workflow systems for managing large ensemble computations at full-machine scale (RADICAL-EnTK). Each layer presents distinct technical challenges that are largely independent of the specific scientific application — performance portability, arithmetic intensity, batched small-system solvers, distributed task management — and the solutions generalize across domains.

The Exascale Algorithmic Challenge

Making code fast on a modern GPU means simultaneously satisfying constraints that CPU-optimized code typically violates. Computations need massive fine-grained parallelism. Memory access patterns need to be regular enough to exploit coalescing. And the ratio of floating-point operations to memory accesses, the arithmetic intensity, needs to be high enough that the memory system isn't the bottleneck.
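That last constraint is usually made quantitative with the standard roofline model, which caps attainable throughput at the lesser of peak compute and bandwidth times arithmetic intensity:

```latex
% Roofline bound. I: arithmetic intensity (FLOPs per byte moved),
% B: memory bandwidth, P_peak: peak floating-point throughput.
P_{\text{attainable}} = \min\!\left(P_{\text{peak}},\; I \cdot B\right),
\qquad
I = \frac{\text{floating-point operations}}{\text{bytes moved to/from memory}}
```

A kernel with low I sits on the bandwidth-limited slope of the roofline no matter how much compute the hardware offers.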

High-order finite element methods are attractive for GPU computing precisely because they improve arithmetic intensity: higher polynomial degree means more floating-point work per degree of freedom and more operations per byte of data moved from memory. The partial assembly approach — applying FE operators matrix-free rather than forming and storing the assembled stiffness matrix — further reduces memory traffic at the cost of additional compute, a trade-off that is favorable on GPU architectures. These are the algorithmic ideas underlying the codes described here.
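As a rough illustration of the idea (a generic matrix-free pattern, not MFEM's actual internals), the operator action is computed element by element from precomputed quadrature-point data, and the assembled matrix never exists:

```cpp
#include <algorithm>
#include <vector>

// Generic matrix-free "partial assembly" sketch (illustrative only, not
// MFEM's internals). Instead of storing the assembled matrix A, we keep
// only per-quadrature-point data D and compute y = A*x on the fly.
struct Element {
    std::vector<int> dofs;     // global DOF indices (the gather/scatter map)
    std::vector<double> D;     // precomputed quadrature-point data (size nq)
};

// B is the (nq x nd) basis-evaluation matrix, row-major, shared by all
// elements; applying B, a pointwise scaling by D, then B^T realizes the
// element operator without ever forming it.
void ApplyPartialAssembly(const std::vector<Element>& elems,
                          const std::vector<double>& B, int nq, int nd,
                          const std::vector<double>& x, std::vector<double>& y)
{
    std::fill(y.begin(), y.end(), 0.0);
    std::vector<double> xe(nd), qv(nq);
    for (const auto& e : elems) {
        for (int i = 0; i < nd; ++i) xe[i] = x[e.dofs[i]];    // gather
        for (int q = 0; q < nq; ++q) {                        // apply B, then D
            double s = 0.0;
            for (int i = 0; i < nd; ++i) s += B[q * nd + i] * xe[i];
            qv[q] = e.D[q] * s;
        }
        for (int i = 0; i < nd; ++i) {                        // apply B^T, scatter
            double s = 0.0;
            for (int q = 0; q < nq; ++q) s += B[q * nd + i] * qv[q];
            y[e.dofs[i]] += s;
        }
    }
}
```

The only stored per-element data is D, sized by the number of quadrature points; the basis matrix B is shared across all elements, which is where the memory-traffic savings come from.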

Performance portability across vendors requires abstraction. RAJA and Kokkos present unified programming interfaces that generate efficient code for NVIDIA, AMD, and Intel GPU backends from a single source. Designing algorithms around these portability layers rather than around any specific vendor’s hardware was a central engineering challenge throughout.
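A minimal single-source sketch of the pattern with RAJA (the policy names are real RAJA execution policies; the build-time selection shown here is simplified relative to what production codes do):

```cpp
#include <RAJA/RAJA.hpp>

// Single-source kernel: the loop body is written once and retargeted by
// swapping the execution policy at build time.
#if defined(RAJA_ENABLE_CUDA)
using exec_policy = RAJA::cuda_exec<256>;        // NVIDIA GPU, 256-thread blocks
#elif defined(RAJA_ENABLE_HIP)
using exec_policy = RAJA::hip_exec<256>;         // AMD GPU
#elif defined(RAJA_ENABLE_OPENMP)
using exec_policy = RAJA::omp_parallel_for_exec; // CPU threads
#else
using exec_policy = RAJA::seq_exec;              // plain sequential CPU
#endif

void daxpy(int n, double a, const double* x, double* y)
{
    RAJA::forall<exec_policy>(RAJA::RangeSegment(0, n),
        [=] RAJA_HOST_DEVICE (int i) { y[i] += a * x[i]; });
}
```

The kernel body never changes; only the policy alias does.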


Papers & Software

MAGMA: Enabling Exascale Performance with Accelerated BLAS and LAPACK for Diverse GPU Architectures

Dense linear algebra — matrix factorizations, linear system solves, eigenvalue computations — is foundational infrastructure for scientific computing across virtually every domain. On CPUs, highly tuned vendor libraries provide near-peak performance. On GPUs, achieving good performance requires algorithms designed specifically for their memory hierarchy and execution model, and the algorithms that work well on one vendor’s hardware may need substantial modification for another’s.

MAGMA (Matrix Algebra on GPU and Multicore Architectures) addresses this with approximately 750 numerical routines in four floating-point precisions. This paper describes MAGMA's adaptation to the diverse GPU landscape of DOE's pre-exascale and exascale systems: NVIDIA V100 on Summit, AMD MI250X on Frontier, and Intel Ponte Vecchio on Aurora. The interesting computer science here is in the hybrid CPU-GPU algorithm designs: many linear algebra operations are memory-bandwidth limited on GPU but not on CPU, or require irregular data access that is expensive on GPU. MAGMA addresses this by partitioning work between CPU and GPU based on which handles each phase efficiently, synchronizing through managed or pinned memory.
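For flavor, here is a minimal sketch of MAGMA's calling pattern, a hybrid LU factorization of a matrix resident in GPU memory (MAGMA v2 API; error handling omitted, and the matrix fill is a placeholder):

```cpp
#include <magma_v2.h>

// Minimal MAGMA calling-pattern sketch: hybrid CPU-GPU LU factorization of
// a matrix resident in GPU memory. Error handling omitted for brevity.
int main()
{
    magma_init();
    magma_queue_t queue;
    magma_queue_create(0 /* device */, &queue);

    magma_int_t n = 4096, lda = n, ldda = magma_roundup(n, 32);
    double *hA;           // host matrix (column-major)
    magmaDouble_ptr dA;   // device matrix
    magma_dmalloc_cpu(&hA, lda * n);
    magma_dmalloc(&dA, ldda * n);
    for (magma_int_t i = 0; i < lda * n; ++i)           // placeholder fill:
        hA[i] = (i % (n + 1) == 0) ? 2.0 * n : 1.0;     // diagonally dominant

    magma_dsetmatrix(n, n, hA, lda, dA, ldda, queue);   // host -> device

    magma_int_t *ipiv, info;
    magma_imalloc_cpu(&ipiv, n);
    // Hybrid algorithm: panel factorizations run on the CPU while the GPU
    // performs the large trailing-matrix updates.
    magma_dgetrf_gpu(n, n, dA, ldda, ipiv, &info);

    magma_free(dA);
    magma_free_cpu(hA);
    magma_free_cpu(ipiv);
    magma_queue_destroy(queue);
    magma_finalize();
    return (int)info;
}
```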

The downstream impact is broad — MAGMA is integrated into ExaConstit, HiOP, MARBL, STRUMPACK, and many other ECP applications. Performance improvements in a well-designed foundational library flow upward to all downstream users simultaneously.

Citation: A. Abdelfattah, N. Beams, R. Carson, P. Ghysels, T. Kolev, T. Stitt, A. Vargas, S. Tomov, J. Dongarra. International Journal of High Performance Computing Applications 38(5), pp. 468–490, 2024. doi:10.1177/10943420241261960


GPU Algorithms for Efficient Exascale Discretizations

This paper, from the Center for Efficient Exascale Discretizations (CEED) collaboration within ECP, covers GPU algorithm development across the CEED software ecosystem: libCEED, MFEM, MAGMA, libParanumal, and Nek. The core theme is that high-order FE methods achieve significantly better arithmetic intensity than their low-order counterparts on GPU hardware — and that this makes them particularly well-suited to the compute-heavy, bandwidth-constrained environment of modern accelerator architectures.
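The operation-count argument behind this is standard for tensor-product elements: with sum-factorized operator application at polynomial degree p in d dimensions, per-element data scales like p^d while work scales faster, so arithmetic intensity grows with order:

```latex
\text{data per element} = O(p^d), \qquad
\text{naive matvec} = O(p^{2d}), \qquad
\text{sum-factorized} = O(d\,p^{\,d+1})
\;\Longrightarrow\;
\text{arithmetic intensity} \sim \frac{d\,p^{\,d+1}}{p^{d}} = d\,p
```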

Performance results on both NVIDIA and AMD GPU systems at leadership-class facilities demonstrate that carefully designed high-order FE implementations can achieve substantial fractions of theoretical peak GPU performance. This establishes the algorithmic foundation that ExaConstit is built on, through its dependence on MFEM for FE assembly and MAGMA for the inner linear algebra.

Citation: A. Abdelfattah, V. Barra, N. Beams, R. Bleile, J. Brown, …, R. Carson, …, K. Weiss. Parallel Computing 108, 102841, 2021. doi:10.1016/j.parco.2021.102841 | arXiv:2109.05072


ExaConstit [Open-Source Software]
#

ExaConstit is LLNL’s open-source finite element code for simulating how polycrystalline metals deform at the grain scale—answering questions like “what are the bulk mechanical properties of this alloy microstructure?” at a fidelity that experiments alone can’t easily reach. The core numerical approach is an implicit, velocity-based updated Lagrangian finite element method: rather than tracking displacements directly, the code solves for the velocity field at each time step and integrates forward, which gives better numerical behavior at the large deformations typical in metal forming and plasticity problems. Getting that to run efficiently across CPUs, GPUs, and large MPI clusters—without rewriting the physics for each platform—is where the interesting engineering lives.

The first design challenge is simply getting people running. The dependency graph for a GPU-capable HPC code is significant: MPI, a finite element library (MFEM), a performance portability layer (RAJA), memory management (Umpire), a crystal plasticity library (ExaCMech), and several linear algebra packages. Rather than documenting all of this and hoping users figure it out, ExaConstit ships platform-specific install scripts—CPU, CUDA, and HIP/ROCm variants—that encode the full build sequence and known-good dependency versions for each backend. Simulation configuration itself is handled through human-readable TOML files that control everything from mesh parameters and solver tolerances to material model selection and output options, so users never need to touch source code for routine use.
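To make that concrete, here is a purely hypothetical configuration sketch in the spirit of what's described above (the section and key names are illustrative inventions, not ExaConstit's actual TOML schema):

```toml
# Hypothetical input sketch -- section and key names are illustrative only,
# not ExaConstit's real TOML schema.
[mesh]
file = "microstructure_rve.mesh"
serial_refinement = 1

[time]
dt = 0.1
t_final = 10.0

[solver]
backend = "hip"            # runtime backend selector: cpu | openmp | cuda | hip
newton_rel_tol = 1.0e-6
krylov_max_iters = 200

[material]
model = "exacmech"         # or a UMAT shared library
properties_file = "cu_props.toml"
```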

The GPU portability story is built around RAJA, LLNL’s performance portability framework. The key idea is that compute kernels—the hot loops over mesh nodes or quadrature points—are written once against RAJA’s abstraction layer and compiled to CPU sequential, OpenMP, CUDA, or HIP targets purely through build-time configuration and a runtime backend selector. MFEM handles the equally important challenge of keeping finite element operator data structures resident on the device and minimizing host-device transfers, supporting multiple assembly strategies—from fully matrix-free “partial assembly” (ideal for very high polynomial orders) to “element assembly” (which only forms individual element matrices rather than a global system, hitting a practical sweet spot for GPU memory bandwidth on typical problems). The result is that the same physics code runs on a laptop CPU or an AMD MI300A GPU cluster with no algorithmic changes—just a different build flag and a one-line config switch.
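In MFEM terms, backend selection and assembly level are each roughly one line. This is real MFEM 4.x API, though heavily simplified relative to ExaConstit's actual setup (a diffusion operator stands in for the mechanics operator):

```cpp
#include <mfem.hpp>
using namespace mfem;

// Backend selection and assembly level in MFEM (real MFEM 4.x API, heavily
// simplified; a diffusion operator stands in for the mechanics operator).
void setup(Mesh& mesh)
{
    Device device("hip");   // or "cpu", "omp", "cuda" -- one-line switch

    H1_FECollection fec(3 /* polynomial order */, mesh.Dimension());
    FiniteElementSpace fes(&mesh, &fec);

    BilinearForm a(&fes);
    a.AddDomainIntegrator(new DiffusionIntegrator);
    a.SetAssemblyLevel(AssemblyLevel::PARTIAL);  // matrix-free: store only
    a.Assemble();                                // quadrature-point data
}
```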

The FEM framework is organized around a clean separation between three concerns that are often tangled together in legacy codes: the time-stepping and nonlinear solve logic, the finite element assembly machinery, and the material model. The nonlinear mechanics problem at each time step is solved via Newton-Raphson, which requires assembling both a residual (internal forces) and a tangent stiffness (Jacobian) from contributions at every quadrature point in the mesh. The key design decision here is that the assembly layer communicates with the material model through a well-defined interface: it hands off the local deformation state at each quadrature point and gets back stress and a material tangent stiffness, with no knowledge of what the material model actually does internally. This separation means the FEM machinery, the solver infrastructure, and the constitutive models can all evolve independently—a new solver strategy doesn’t require touching material model code, and a new material model doesn’t require understanding anything about FEM assembly.
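A sketch of what such an interface contract can look like (the names and state layout here are hypothetical, not ExaConstit's actual classes):

```cpp
// Hypothetical shape of the quadrature-point interface described above
// (names and state layout are illustrative, not ExaConstit's actual API).
struct QPointState {
    double def_rate[6];   // symmetric velocity-gradient part (Voigt notation)
    double spin[3];       // skew part
    double dt;            // time increment
};

class MaterialModel {
public:
    virtual ~MaterialModel() = default;
    // The assembly layer hands in local kinematics at one quadrature point
    // and receives back the stress and the consistent tangent needed to
    // build the Newton-Raphson residual and Jacobian -- nothing more.
    virtual void Eval(const QPointState& in,
                      double stress[6],          // Cauchy stress (Voigt)
                      double tangent[6][6]) = 0; // d(stress)/d(strain inc.)
};
```

The assembly loop calls Eval once per quadrature point and never looks past this boundary.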

That material model interface is where user extensibility lives. ExaConstit ships with first-class support for ExaCMech, LLNL's GPU-native crystal plasticity library covering a range of physically based models for FCC, BCC, and HCP metals. But it also supports the UMAT interface—the standard constitutive model calling convention originally from Abaqus—meaning any model written to that spec (in C, C++, or Fortran) can be dropped in, either compiled directly or loaded at runtime as a shared library without recompilation. Multi-material simulations, where different mesh regions use different models, are handled through a composite design pattern that presents a single unified model interface to the rest of the code regardless of how many underlying models are active. On the output side, the postprocessing system supports full field output to VisIt and ParaView, alongside a volume-averaging system that computes per-region aggregate quantities—stresses, strains, lattice strains across all crystal symmetry groups—formatted for downstream Python analysis workflows.

Repository: github.com/LLNL/ExaConstit
Citation: R.A. Carson, S.R. Wopschall, J.A. Bramwell. ExaConstit [Computer Software], 2019. doi:10.11578/dc.20191024.2

Assembly formulation references:
A.K. Gupta, B. Mohraz. “A method of computing numerically integrated stiffness matrices.” International Journal for Numerical Methods in Engineering 5(1), pp. 83–89, 1972. doi:10.1002/nme.1620050108
A.K. Gupta. “Efficient numerical integration of element stiffness matrices.” International Journal for Numerical Methods in Engineering 19(9), pp. 1410–1413, 1983. doi:10.1002/nme.1620190910


ExaCMech [Open-Source Software]

ExaCMech is an open-source LLNL library for simulating how crystalline materials — metals like copper or steel — deform under extreme conditions. Running these simulations at scale means evaluating complex material behavior at millions of points simultaneously, which demands careful thought about how the software is structured. At its outermost layer, ExaCMech presents a simple, stable interface to host simulation codes: tell me your material parameters, hand me your data, tell me where you want to run. Underneath that interface, the library uses RAJA and CHAI — two portability layers also developed at LLNL — to dispatch work to whatever hardware is available, whether that’s a serial CPU, an OpenMP thread pool, or a GPU via CUDA or HIP. Crucially, none of the physics code has to change when the target hardware changes. The same model implementation compiles and runs correctly everywhere, with the portability concern handled entirely by the abstraction layer.

The most counter-intuitive performance decision in ExaCMech is its approach to GPU execution. The conventional wisdom in GPU programming is to decompose work into many small, focused kernels — one for each logical step of a computation — so each kernel is simple and easy to reason about. ExaCMech does the opposite: the entire material update for each point, including an iterative nonlinear solve that may take dozens of internal steps to converge, is executed as a single fused unit of work per GPU thread. The reason comes down to memory. The intermediate values generated during that iterative solve — trial stresses, slip rates, partial derivatives — are only ever needed locally, at that one material point. If the computation were split across multiple kernel launches, all of that intermediate state would have to be written out to global GPU memory and read back in, which is expensive. By keeping everything fused into one kernel, those values live in registers and fast local cache for the lifetime of the solve, never touching slow memory. This “fat kernel” / fused-kernel pattern has become increasingly common in HPC GPU codes over the past few years, particularly where kernel launch overhead and global-memory round trips would otherwise dominate runtime.
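A toy sketch of the fused pattern (illustrative only; a scalar Newton solve stands in for ExaCMech's real multi-variable crystal-plasticity system):

```cpp
#include <RAJA/RAJA.hpp>
#include <math.h>

#if defined(RAJA_ENABLE_CUDA)
using mat_policy = RAJA::cuda_exec<256>;
#else
using mat_policy = RAJA::seq_exec;
#endif

// Fused per-point material update: the entire nonlinear solve runs inside
// one kernel, so its intermediate state never leaves registers/local cache.
void material_update(int npts, const double* target, double* stress)
{
    RAJA::forall<mat_policy>(RAJA::RangeSegment(0, npts),
        [=] RAJA_HOST_DEVICE (int p) {
            double s = stress[p];        // thread-local for the whole solve
            double r = 1.0;
            // Newton iteration on a scalar stand-in residual f(s).
            for (int it = 0; it < 50 && fabs(r) > 1.0e-12; ++it) {
                r = s + 0.1 * s * s * s - target[p];   // residual f(s)
                double jac = 1.0 + 0.3 * s * s;        // f'(s)
                s -= r / jac;
            }
            stress[p] = s;   // only the converged value touches global memory
        });
}
```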

The third key design decision governs how material models themselves are built and extended. Crystal plasticity models are naturally decomposable into a handful of independent physics concerns: the geometry of a crystal’s slip systems, the kinetic law governing how fast slip occurs, the thermoelastic response, and the equation of state. ExaCMech encodes this decomposition using C++ templates, assembling a complete model by combining one concrete implementation of each component. This pays off in two ways. First, the compiler knows the exact structure of every model at build time — array sizes, loop counts, system dimensions — so it can optimize aggressively in ways that aren’t possible with runtime polymorphism. Second, it dramatically lowers the cost of introducing a new material model. A developer who wants to implement a new hardening law, for example, only needs to supply the rate equations and their derivatives with respect to stress and internal state. The surrounding machinery — the nonlinear solver, the orientation update, the tangent stiffness computation needed by finite element codes — is already in place and picks up the new model automatically. The library even has a Python/JAX companion that lets developers prototype new model forms using automatic differentiation, skipping the need to derive those derivative terms by hand during early-stage exploration.
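A compressed sketch of the composition idea (class and member names are illustrative, not ExaCMech's; the equation-of-state component is omitted for brevity):

```cpp
#include <cstddef>

// Compile-time model composition sketch (names are illustrative, not
// ExaCMech's actual classes).
struct SlipGeomFCC {
    static constexpr std::size_t nslip = 12;  // 12 FCC slip systems
};

struct KineticsPowerLaw {
    // Slip rate and its stress derivative on one slip system.
    static double rate(double tau, double g)  { double x = tau / g; return x * x * x; }
    static double drate(double tau, double g) { return 3.0 * tau * tau / (g * g * g); }
};

struct ThermoElastCubic { /* cubic elastic constants, stress update, ... */ };

template <class SlipGeom, class Kinetics, class ThermoElast>
class EvpModel {
public:
    // Known at compile time: arrays and loop bounds are statically sized,
    // which lets the compiler optimize far more aggressively than runtime
    // polymorphism would allow.
    static constexpr std::size_t nslip = SlipGeom::nslip;

    void slip_rates(const double tau[nslip], double g, double rates[nslip]) const {
        for (std::size_t i = 0; i < nslip; ++i) rates[i] = Kinetics::rate(tau[i], g);
    }
};

// A complete FCC power-law model is just a type alias:
using ModelFCC = EvpModel<SlipGeomFCC, KineticsPowerLaw, ThermoElastCubic>;
```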

Taken together, these three decisions — a clean portability abstraction, aggressive kernel fusion, and template-based model composition — reflect a consistent design philosophy: push complexity to the places where it can be managed once, and keep the interfaces that people actually extend as narrow as possible. For anyone building physics simulation libraries targeting modern heterogeneous HPC platforms, ExaCMech is a compelling example of how those tradeoffs can be made to work in practice.

Repository: github.com/LLNL/ExaCMech
Citation: N.R. Barton, R.A. Carson, S.R. Wopschall. ExaCMech [Computer Software], 2018. doi:10.11578/dc.20190809.2


Scaling on Frontier: Uncertainty Quantification Workflow Applications Using ExaWorks to Enable Full System Utilization

Co-first author

Fast individual simulations are necessary but not sufficient for large-scale science. The problems that actually motivate exascale computing — uncertainty quantification, ensemble-based inference, parameter studies — require coordinating thousands of heterogeneous simulation tasks as part of a larger workflow. Managing this introduces distributed systems challenges that have no analog in single-simulation HPC: heterogeneous resource requirements across pipeline stages, task size and runtime variability, hardware failures that must be recovered from gracefully without restarting the whole pipeline, and scheduling efficiency measured in terms of fraction of the full machine actually utilized.

This paper describes the ExaAM UQ pipeline implemented using RADICAL-EnTK (Ensemble Toolkit) from the ECP ExaWorks SDK, demonstrated on up to 8,000 compute nodes of Frontier. The pipeline connects three codes with fundamentally different resource profiles — AdditiveFOAM (CPU-only), ExaCA (CPU-GPU), and ExaConstit (GPU-intensive) — and Stage 3 alone required orchestrating 7,875 independent tasks simultaneously. EnTK maintained 90% resource utilization across the full machine with 85 seconds of management overhead against roughly 8,000 seconds of total simulation time (about 1% overhead), and handled all 10 task failures automatically without human intervention. The scheduling throughput of 269 tasks/s and launch rate of 51 tasks/s demonstrate that workflow middleware can scale to full-machine exascale runs without becoming a bottleneck.

Citation: M. Titov*, R. Carson*, M. Rolchigo, J. Coleman, J. Belak, M. Bement, D. Laney, M. Turilli, S. Jha. (*co-first authors) Proceedings of WORKS '23, SC'23 Workshops, 2023. doi:10.1145/3624062.3626283


Enhancements Supporting IC Usage of PEM Libraries on Next-Generation Platforms

Porting a production scientific software ecosystem to a new GPU architecture is a substantial effort that rarely gets the recognition it deserves as a research contribution. This technical report documents the multi-year effort to port the Physics and Engineering Models (PEM) library suite — including ExaConstit and ExaCMech — from NVIDIA-only GPU support to AMD GPUs. Performance-portable abstractions handle basic control-flow portability, but achieving performance parity requires profiling to identify architecture-specific bottlenecks, tuning launch configurations for different memory hierarchy sizes, and occasionally choosing different algorithms for the two architectures. The report records this process systematically, preserving the performance engineering decisions and their outcomes.

Citation: D.F. Richards, B.S. Ryujin, N. Barton, …, R. Carson, et al. LLNL-TR-823775, 2021. doi:10.2172/1798430


Connections to Other Work

The software described here is the enabling infrastructure for the additive manufacturing UQ work — without the ~15× GPU speedup in ExaConstit and the workflow scaling demonstrated on Frontier, the ensemble sizes needed for statistical convergence wouldn’t be feasible. MAGMA and the CEED algorithms are general-purpose tools used across many ECP applications well beyond materials simulation. ExaConstit and ExaCMech are open source and available for community use, including for crystal plasticity applications in other domains.