Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.

Research question
* The selection of in-domain data usually resorts to human intuition.
* Can we use simple similarity measures to nominate in-domain data?

[Figure labels: BERT, Target Labeled Data, In-domain data]

Standard approach
* Pretrain BERT from scratch on generic data
* Fine-tune BERT on target labeled data

Domain-specific BERT
* Pretrain BERT from scratch on generic data
* Continue pretraining BERT on domain-specific corpus (in-domain data)
* Fine-tune BERT on target labeled data
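The research question asks whether a simple similarity measure can nominate in-domain pretraining data. The text here does not name the measures the authors used, so the following is only a minimal sketch under an assumption: it scores each candidate pretraining corpus by the Jensen-Shannon divergence between its unigram distribution and that of the target task text, and ranks the candidates from most to least similar. The function names, the whitespace tokenisation, and the toy corpora are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def unigram_distribution(texts):
    """Relative frequencies of lowercased, whitespace-split tokens."""
    counts = Counter(token for text in texts for token in text.lower().split())
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    def kl_to_mixture(a, b):
        # KL(a || m) where m = (a + b) / 2; tokens absent from a add nothing.
        return sum(
            pa * math.log2(pa / (0.5 * pa + 0.5 * b.get(token, 0.0)))
            for token, pa in a.items()
        )

    return 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)


def rank_candidates(target_texts, candidates):
    """Return (name, divergence) pairs, most similar candidate first."""
    target = unigram_distribution(target_texts)
    scores = {
        name: js_divergence(target, unigram_distribution(texts))
        for name, texts in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Toy data (hypothetical): a social-media target task and two candidate corpora.
    target = ["lol omg this is gr8", "omg lol"]
    candidates = {
        "tweets": ["omg lol gr8"],
        "papers": ["we propose a novel method"],
    }
    for name, score in rank_candidates(target, candidates):
        print(f"{name}: {score:.3f}")
```

A lower divergence suggests that a corpus is a closer match for the continued-pretraining step of the domain-specific BERT recipe. Whether such a ranking predicts downstream gains is the empirical question the paper investigates, and this sketch makes no claim about it.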
Dai et al. studied this question.