What question did this study set out to answer?

The research aims to enhance computational efficiency in explicit large-deformation finite element analysis using a GPU-resident pipeline.

June 5, 2026 · Open Access

Design and Computational Efficiency of a GPU-Resident Integrated Execution Pipeline for Explicit Large-Deformation Finite Element Analysis

Key Points

Developed a GPU-resident execution pipeline for explicit large-deformation finite element analysis in which every stage of the timestep operates on arrays kept in device memory.
Conducted tests on a hemisphere compression benchmark across six mesh resolutions (83 K–1.89 M elements).
Utilized a shared-memory brute-force proximity search on the device for contact handling.
Achieved per-step speedups of 99× to 138× over a single CPU core, depending on problem size.
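Demonstrated that the GPU solver is roughly 94× faster than LS-DYNA in SMP mode on its best 8-core configuration for the same problem.
Identified single-GPU execution, FP32 arithmetic, and rigid-body contact search without a BVH broad phase as the remaining limitations and targets for improvement.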
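In explicit finite element schemes, the stable step size is bounded by the time a stress wave needs to cross the shortest element edge, which is presumably why a minimum edge-length reduction belongs in every step. The sketch below illustrates that bound with `dt = safety * L_min / c`. The function name `stable_time_step`, the 1D bar wave speed `c = sqrt(E / rho)`, and the safety factor of 0.9 are assumptions made for illustration; the summary does not state the paper's own formula.

```python
import math

def stable_time_step(nodes, edges, youngs_modulus, density, safety=0.9):
    """Return (dt, min_edge) from a CFL-style bound dt = safety * L_min / c.

    Hypothetical helper: the wave speed uses the 1D bar formula
    c = sqrt(E / rho), an assumption chosen for simplicity.
    """
    wave_speed = math.sqrt(youngs_modulus / density)
    # Minimum reduction over all edge lengths (a parallel reduction on a GPU).
    min_edge = min(math.dist(nodes[a], nodes[b]) for a, b in edges)
    return safety * min_edge / wave_speed, min_edge

# Toy mesh: three nodes, three edges; the shortest edge has length 0.5.
nodes = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.5, 0.0)]
edges = [(0, 1), (1, 2), (0, 2)]
dt, min_edge = stable_time_step(nodes, edges, youngs_modulus=4.0, density=1.0)
```

Because large deformation changes edge lengths from step to step, the minimum has to be recomputed every step. Keeping that reduction on the device means only a single scalar, rather than the mesh, would need to be visible to the host.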

Abstract

We describe a GPU-resident execution pipeline for explicit large-deformation finite element analysis in which every stage of the timestep—internal force evaluation, contact processing, nodal update, time integration, and minimum edge-length reduction—operates on arrays that remain in device memory, so per-step bulk transfers across PCIe are avoided. Contact is handled on the device through a shared-memory brute-force proximity search with warp-ballot stream compaction. We exercise the solver on a hemisphere compression benchmark at six mesh resolutions (83 K–1.89 M elements). On an NVIDIA L40, per-step speedups over a single CPU core range from about 99× to 138×, increasing with problem size and approaching a plateau near 137× for the largest meshes (above roughly 1 M elements); the contact-enabled configuration adds a net ON/OFF overhead of +13% to +21% to the step time. Against LS-DYNA running in SMP mode on the same problem, the proposed solver is roughly 94× faster than the best 8-core configuration, a margin consistent with the multicore saturation observed in the SMP measurements. The remaining limitations—single-GPU execution, FP32 arithmetic, and rigid-body contact search without a BVH broad phase—are identified as specific targets for multi-GPU, mixed-precision, and scalable-contact extensions.
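The contact stage described above pairs a brute-force proximity test with warp-ballot stream compaction. As a rough illustration of the compaction idea only, the following CPU-side Python sketch emulates 32-lane warps: each lane tests one node, the per-lane results are packed into a bit mask (the role `__ballot_sync` plays in CUDA), and each passing lane derives its output slot from a popcount of the lower mask bits. All names (`ballot`, `compact_close_nodes`, `rigid_points`) are hypothetical, the shared-memory tiling is not modelled, and this is not the authors' kernel.

```python
WARP_SIZE = 32

def ballot(predicates):
    """Pack one boolean per lane into a bit mask (emulates a warp ballot)."""
    mask = 0
    for lane, flag in enumerate(predicates):
        if flag:
            mask |= 1 << lane
    return mask

def popcount(x):
    """Count set bits (the role of __popc on a GPU)."""
    return bin(x).count("1")

def compact_close_nodes(nodes, rigid_points, radius):
    """Brute-force proximity test, then ballot-based stream compaction."""
    r2 = radius * radius
    out = []  # compacted list of candidate node indices
    for base in range(0, len(nodes), WARP_SIZE):
        warp = nodes[base:base + WARP_SIZE]
        # Each lane checks its node against every rigid point (brute force).
        flags = [
            any(sum((a - b) ** 2 for a, b in zip(p, q)) <= r2
                for q in rigid_points)
            for p in warp
        ]
        mask = ballot(flags)
        # On a GPU, one lane per warp would reserve this range atomically.
        warp_offset = len(out)
        out.extend([None] * popcount(mask))
        for lane, flag in enumerate(flags):
            if flag:
                # Rank among passing lanes = popcount of lower mask bits.
                rank = popcount(mask & ((1 << lane) - 1))
                out[warp_offset + rank] = base + lane
    return out

# Toy data: 40 nodes on a line, two rigid points, search radius 1.0.
nodes = [(float(i), 0.0, 0.0) for i in range(40)]
rigid_points = [(3.0, 0.0, 0.0), (35.0, 0.0, 0.0)]
hits = compact_close_nodes(nodes, rigid_points, radius=1.0)
```

The point of the ballot-and-popcount pattern is that a warp needs only one reservation in the output buffer rather than one per passing thread, so the candidate list stays dense without a separate scan pass.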

Cite This Study

Kim et al. studied this question.

synapsesocial.com/papers/6a2269a2763171746d5484b7 https://doi.org/10.3390/jmmp10060197
