June 1, 2024

Thorough Characterization and Analysis of Large Transformer Model Training At-Scale

Abstract

Large transformer models have recently achieved great success across various domains. As the number of model parameters grows, training a large transformer model today typically involves model sharding, data parallelism, and model parallelism. The throughput of large-scale model training therefore depends heavily on network bandwidth, since combining model sharding with multiple parallelism strategies incurs various costs. However, prior characterizations of transformer models on high-bandwidth DGX machines, which use TFLOPS as a metric, may not reflect the performance of a system with lower bandwidth. Furthermore, data and model parallelism exhibit significantly different training profiles across system bandwidths at scale and thus warrant a thorough study.
