Key points are not available for this paper at this time.
Pre-trained language models have achieved great success in a wide variety of natural language processing (NLP) tasks, while the superior performance comes with high demand in computational resources, which hinders the application in low-latency information retrieval (IR) systems. To address the problem, we present TwinBERT model, which has two improvements: 1) represent query and document separately using twin-structured encoders and 2) each encoder is a highly compressed BERT-like model with less than one third of the parameters. The former allows document embeddings to be pre-computed offline and cached in memory, which is different from BERT, where the two input sentences are concatenated and encoded together. The change saves large amount of computation time, however, it is still not sufficient for real-time retrieval considering the complexity of BERT model itself. To further reduce computational cost, a compressed multi-layer transformer encoder is proposed with special training strategies as a substitution of the original complex BERT encoder. Lastly, two versions of TwinBERT are developed to combine the query and keyword embeddings for retrieval and relevance tasks correspondingly. Both of them have met the real-time latency requirement and achieve close or on-par performance to BERT-Base model.
Lu et al. (Mon,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: