We describe a GPU-resident execution pipeline for explicit large-deformation finite element analysis in which every stage of the timestep (internal force evaluation, contact processing, nodal update, time integration, and minimum edge-length reduction) operates on arrays that remain in device memory, so per-step bulk transfers across PCIe are avoided. Contact is handled on the device through a shared-memory brute-force proximity search with warp-ballot stream compaction. We exercise the solver on a hemisphere compression benchmark at six mesh resolutions (83 K–1.89 M elements). On an NVIDIA L40, per-step speedups over a single CPU core range from about 99× to 138×, increasing with problem size and approaching a plateau near 137× for the largest meshes (above roughly 1 M elements); enabling contact increases the step time by 13% to 21% relative to the contact-disabled configuration. Against LS-DYNA running in SMP mode on the same problem, the proposed solver is roughly 94× faster than the best 8-core configuration, a margin consistent with the multicore saturation observed in the SMP measurements. The remaining limitations (single-GPU execution, FP32 arithmetic, and rigid-body contact search without a BVH broad phase) are identified as specific targets for multi-GPU, mixed-precision, and scalable-contact extensions.
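To make the warp-ballot stream compaction step concrete, the following Python sketch emulates its logic on the CPU. It is an illustration under stated assumptions, not the solver's CUDA kernel: the function names (`ballot`, `warp_compact`, `compact_pairs`) and the distance-threshold predicate are hypothetical, and the 32-lane loop stands in for what a GPU would do in parallel with `__ballot_sync` and `__popc`.

```python
WARP_SIZE = 32  # lanes per warp on current NVIDIA GPUs


def ballot(predicates):
    """Emulate a warp ballot: pack one boolean per lane into a bit mask."""
    mask = 0
    for lane, flag in enumerate(predicates):
        if flag:
            mask |= 1 << lane
    return mask


def warp_compact(candidates, keep):
    """Compact the kept items of one warp, preserving lane order.

    Each lane finds its output slot by counting the set bits of the
    ballot mask that belong to lower-numbered lanes.
    """
    assert len(candidates) == len(keep) <= WARP_SIZE
    mask = ballot(keep)
    out = [None] * bin(mask).count("1")
    for lane, item in enumerate(candidates):
        if (mask >> lane) & 1:
            lower_lanes = mask & ((1 << lane) - 1)
            out[bin(lower_lanes).count("1")] = item
    return out


def compact_pairs(pairs, distances, radius):
    """Filter candidate pairs warp by warp, keeping those within radius.

    The predicate (distance <= radius) is an assumed stand-in for the
    proximity test. On a GPU, the append below would typically be one
    atomic counter increment per warp to reserve output space.
    """
    result = []
    for start in range(0, len(pairs), WARP_SIZE):
        chunk = pairs[start:start + WARP_SIZE]
        keep = [d <= radius for d in distances[start:start + WARP_SIZE]]
        result.extend(warp_compact(chunk, keep))
    return result
```

The appeal of this pattern is that the output slot of each surviving item comes from a single mask and a bit count, so lanes within a warp need no per-item atomics and the compacted list keeps its input order.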
Kim et al. studied this question.