Efficient Continual Pre-training for Building Domain Specific Large Language Models

Abstract

Large language models (LLMs) have demonstrated remarkable open-domain capabilities. LLMs tailored for a domain are typically trained entirely on a domain corpus to excel at domain-specific tasks. In this work, we explore an alternative strategy: continual pre-training as a means of developing domain-specific LLMs from an existing open-domain LLM. We introduce FinPythia-6.9B, developed through domain-adaptive continual pre-training on the financial domain. Continually pre-trained FinPythia shows consistent improvements on financial tasks over the original foundation model. We further explore simple but effective data selection strategies for continual pre-training. Our data selection strategies outperform vanilla continual pre-training with just 10% of the corpus size and cost, without any degradation on standard open-domain tasks. Our work proposes an alternative solution for building domain-specific LLMs cost-effectively.
