A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and show that these models still achieve high accuracy after fine-tuning on many downstream tasks -- including on tasks specifically designed to be challenging for models that ignore word order. Our models perform surprisingly well according to some parametric syntactic probes, indicating possible deficiencies in how we test representations for syntactic information. Overall, our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
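As a rough illustration of the kind of perturbation described above, the following sketch permutes the words of a sentence at random while keeping the same multiset of words, so co-occurrence within the sentence is preserved but order is destroyed. The function name `shuffle_word_order`, the whitespace tokenization, and the `seed` parameter are assumptions made for this example; the abstract does not specify the exact shuffling procedure used in the paper.

```python
import random
from typing import Optional


def shuffle_word_order(sentence: str, seed: Optional[int] = None) -> str:
    """Return the sentence with its whitespace-separated words randomly permuted.

    Hypothetical helper for illustration only: it assumes plain whitespace
    tokenization and a uniform random permutation of all words.
    """
    words = sentence.split()
    rng = random.Random(seed)  # local generator so the global RNG state is untouched
    rng.shuffle(words)         # in-place uniform permutation of the word list
    return " ".join(words)


if __name__ == "__main__":
    original = "the cat sat on the mat"
    print(shuffle_word_order(original, seed=0))
```

The shuffled sentence contains exactly the same words as the original, which is what lets a model trained on such data still pick up distributional (co-occurrence) information without access to word order.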
Sinha et al. studied this question.