What question did this study set out to answer?

This research addresses the training instabilities and uncertain fine-tuning quality that have held back large language models built on sparse expert (Mixture-of-Experts) designs.

February 17, 2022 · Open Access

ST-MoE: Designing Stable and Transferable Sparse Expert Models

Key Points

Develops a Stable and Transferable Mixture-of-Experts model (ST-MoE-32B) with 269B parameters.
Its computational cost is comparable to that of a 32B dense encoder-decoder Transformer.
Evaluated across diverse natural language tasks, including reasoning, summarization, and closed book question answering.
The sparse model achieves state-of-the-art performance on SuperGLUE and the ARC datasets (ARC Easy, ARC Challenge).
Transfer learning is effective across multiple tasks, such as XSum (summarization) and WebQA (closed book question answering).
Keeps compute comparable to the much smaller dense model while reaching state-of-the-art results on adversarially constructed benchmarks such as Winogrande and ANLI R3.

Abstract

Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).
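
The gap between 269B parameters and the compute of a 32B dense model follows from sparse routing: each token passes through only a few experts, so the weights used per token are a small fraction of the weights stored. The sketch below illustrates that arithmetic for a single expert layer. The function names, the layer sizes, and the top-2 routing choice are assumptions made for illustration; they are not the ST-MoE-32B configuration.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in one feed-forward block: input and output projections, biases ignored."""
    return 2 * d_model * d_ff


def moe_layer_params(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Return (total, active_per_token) weight counts for one sparse expert layer."""
    expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts  # linear router: one logit per expert
    total = num_experts * expert + router   # everything that must be stored
    active = top_k * expert + router        # what a single token actually uses
    return total, active


# Hypothetical sizes, chosen only to show the ratio; not taken from the paper.
dense = ffn_params(d_model=1024, d_ff=4096)
total, active = moe_layer_params(d_model=1024, d_ff=4096, num_experts=32, top_k=2)

print(f"dense FFN weights:    {dense:,}")
print(f"MoE stored weights:   {total:,}")
print(f"MoE weights per token: {active:,}")
```

With these assumed sizes the expert layer stores roughly 32 times the weights of the dense block, while each token touches only about twice as many, which is the sense in which a sparse model can be far larger than a dense model of similar computational cost.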

Authors

Barret Zoph

University of Southern California

Irwan Bello

Google (United States)

Sameer Kumar

Delhi Technological University
