Large Language Models (LLMs) achieve state-of-the-art results in natural language processing, but their growing size and compute requirements make them difficult to scale, run efficiently, and deploy. Conventional dense transformer architectures activate every parameter for every input during inference, which increases power consumption, latency, and cost. The central challenge is to scale LLMs without compromising performance, interpretability, or generalizability. Existing compression and parameter-sharing methods offer modest gains but fail to balance efficiency and accuracy in large-scale deployments. Sparse Mixture-of-Experts Transformers (SMoE-T) are a novel paradigm for large-scale language modeling that uses conditional computation to improve efficiency. In SMoE-T, sparse gating activates only a selected subset of expert modules for each input, which reduces computation without compromising model expressiveness. The approach also allows experts to specialize in particular languages or subject areas, improving efficiency and flexibility. On common NLP datasets for language modeling, machine translation, and question answering, SMoE-T achieves higher accuracy than dense transformers while reducing FLOPs and inference latency by 60%. SMoE-T also scales well, which makes it straightforward to train and deploy in distributed systems. Overall, SMoE-T enables resource-efficient and scalable LLM training and deployment, supporting long-term, widely available generative AI solutions.
Taher M. Ghazal (Thu,) studied this question.
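To make the sparse gating idea in the abstract concrete, the following is a minimal sketch of top-k expert routing for a single input vector. It assumes a linear gate followed by a softmax over the k highest-scoring experts, which is a common mixture-of-experts formulation; the abstract does not specify the exact gating function used in SMoE-T. All names (`top_k_gate`, `smoe_layer`, `make_expert`) and the toy scaling experts are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(gate_logits, k):
    """Pick the k experts with the highest gate scores.

    Returns their indices (best first) and softmax weights that are
    renormalized over the selected experts only.
    """
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_logits[i] for i in top])
    return top, weights


def make_expert(scale):
    """Hypothetical stand-in for an expert network: scales its input."""
    return lambda x: [scale * v for v in x]


def smoe_layer(x, gate_matrix, experts, k=2):
    """Assumed sparse MoE layer: run only the k selected experts on x."""
    # One gate logit per expert, computed as a dot product with x.
    logits = [sum(w * v for w, v in zip(row, x)) for row in gate_matrix]
    chosen, weights = top_k_gate(logits, k)
    out = [0.0] * len(x)
    for idx, weight in zip(chosen, weights):
        y = experts[idx](x)  # experts that were not selected are never called
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, chosen


# Usage: 8 toy experts, only 2 of which run for this input.
random.seed(0)
num_experts, dim = 8, 4
experts = [make_expert(e + 1) for e in range(num_experts)]
gate_matrix = [[random.gauss(0, 1) for _ in range(dim)]
               for _ in range(num_experts)]
x = [0.5, -1.0, 0.25, 2.0]
y, chosen = smoe_layer(x, gate_matrix, experts, k=2)
print("experts used:", chosen)
print("output:", y)
```

In this sketch the per-input expert cost grows with k rather than with the total number of experts, which is the source of the compute savings that conditional computation aims for. A practical system would also need batching, load balancing across experts, and a trainable gate, none of which are shown here.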