While contrastive approaches to self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point (positive pairs) and maximizing the distance between views of different data points (negative pairs), recent non-contrastive SSL methods (e.g., BYOL) show remarkable performance without negative pairs, using a learnable predictor and a stop-gradient operation. A fundamental question arises: why do these methods not collapse into trivial representations? We answer this question via a simple theoretical study and propose a novel approach, DirectPred, that sets the linear predictor directly based on statistics of its inputs, without gradient training. On ImageNet, it performs comparably with more complex two-layer non-linear predictors that use BatchNorm and outperforms a linear predictor by 2.5% in 300-epoch training (and 5% in 60-epoch training). DirectPred is motivated by our theoretical study of the nonlinear learning dynamics of non-contrastive SSL in simple networks. Our study yields conceptual insights into how non-contrastive methods learn, how they avoid representational collapse, and how multiple factors, like predictor networks, stop-gradients, exponential moving averages, and weight decay, all come into play. Our simple theory recapitulates the results of real-world ablation studies on both STL-10 and ImageNet. Code is available at https://github.com/facebookresearch/luckmatters/tree/master/ssl.
Tian et al. studied this question.
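The abstract describes DirectPred only as setting the linear predictor from statistics of its inputs, without gradient training. The following is a minimal NumPy sketch of one way such a rule could look, not the authors' implementation. It assumes the statistic is a running (exponential moving average) correlation matrix of the predictor inputs, and that the predictor shares its eigenvectors while its eigenvalues are the square roots of the normalized correlation eigenvalues plus a small constant. The function names `update_correlation` and `direct_predictor`, and the values of `rho` and `eps`, are hypothetical choices made for illustration.

```python
import numpy as np


def update_correlation(F, f_batch, rho=0.3):
    """Exponential moving average of the predictor-input correlation matrix.

    F:       current (d, d) running estimate
    f_batch: (n, d) batch of predictor inputs
    rho:     EMA momentum (hypothetical value)
    """
    batch_corr = f_batch.T @ f_batch / f_batch.shape[0]
    return rho * F + (1.0 - rho) * batch_corr


def direct_predictor(F, eps=0.1):
    """Build a symmetric linear predictor from the eigendecomposition of F.

    Assumed rule: keep the eigenvectors of F and use
    sqrt(s_j / max_j s_j) + eps as the predictor's eigenvalues.
    """
    s, U = np.linalg.eigh(F)       # F is symmetric, so eigh applies
    s = np.clip(s, 0.0, None)      # guard against tiny negative eigenvalues
    p = np.sqrt(s / s.max()) + eps
    return (U * p) @ U.T           # U diag(p) U^T


# Toy usage: inputs whose variance differs strongly across dimensions.
rng = np.random.default_rng(0)
d = 4
f_batch = rng.normal(size=(256, d)) * np.array([2.0, 1.0, 0.5, 0.1])
F = update_correlation(np.zeros((d, d)), f_batch, rho=0.0)
W_p = direct_predictor(F)
```

In a training loop, `update_correlation` would be called on each batch of predictor inputs and `W_p` would replace the predictor weights directly, so no gradient flows into the predictor.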