Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use, as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enables us to use extra parallelism to fit very large datasets about twice as fast. Crucially, we can still speed up training even after we have already reached the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent. Two neural networks trained on disjoint subsets of the data can share knowledge by encouraging each model to agree with the predictions the other model would have made. These predictions can come from a stale version of the other model, so they can be safely computed using weights that only rarely get transmitted. Our second claim is that online distillation is a cost-effective way to make the exact predictions of a model dramatically more reproducible. We support our claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the largest to-date dataset used for neural language modeling, containing 6 × 10^11 tokens and based on the Common Crawl repository of web data.
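The mechanism described in the abstract (two models on disjoint data, each nudged toward the other's stale predictions) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the toy 1-D logistic models, the names `codistill_step`, `train_pair`, and `make_shard`, the `distill_weight` and `exchange_every` parameters, and the synthetic data. Each step minimizes the usual cross-entropy on the labels plus a cross-entropy term against the peer's prediction, which is treated as a constant.

```python
import math
import random


def predict(params, x):
    """Probability of the positive class under a 1-D logistic model."""
    w, b = params
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def codistill_step(params, stale_peer, batch, lr=0.1, distill_weight=1.0):
    """One SGD step on: CE(label, p) + distill_weight * CE(q, p).

    p is this model's prediction; q is the stale peer's prediction and is
    treated as a constant. The gradient of each cross-entropy term with
    respect to the logit is (p - target).
    """
    w, b = params
    grad_w = grad_b = 0.0
    for x, y in batch:
        p = predict(params, x)
        q = predict(stale_peer, x)
        g = (p - y) + distill_weight * (p - q)
        grad_w += g * x
        grad_b += g
    n = len(batch)
    return (w - lr * grad_w / n, b - lr * grad_b / n)


def make_shard(seed, size=64):
    """Synthetic shard (assumption): label is 1 when x > 0, else 0."""
    rng = random.Random(seed)
    xs = [rng.uniform(-2.0, 2.0) for _ in range(size)]
    return [(x, 1.0 if x > 0 else 0.0) for x in xs]


def train_pair(shard_a, shard_b, steps=200, exchange_every=50, seed=0):
    """Train two models on disjoint shards, exchanging weights only rarely."""
    rng = random.Random(seed)
    model_a, model_b = (0.0, 0.0), (0.0, 0.0)
    stale_a, stale_b = model_a, model_b
    for step in range(steps):
        # Each model only sees a periodically refreshed copy of its peer.
        if step % exchange_every == 0:
            stale_a, stale_b = model_a, model_b
        model_a = codistill_step(model_a, stale_b, rng.sample(shard_a, 8))
        model_b = codistill_step(model_b, stale_a, rng.sample(shard_b, 8))
    return model_a, model_b
```

Calling `train_pair(make_shard(1), make_shard(2))` returns the two parameter tuples. Setting `distill_weight=0.0` in `codistill_step` recovers plain SGD on the labels, which makes it easy to compare the two regimes on the same shards.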
Rohan Anil
Gabriel Pereyra
Alexandre Passos
Anil et al. studied this question.
www.synapsesocial.com/papers/69f3b97e8121f29bd60dbd45 — DOI: https://doi.org/10.48550/arxiv.1804.03235