What question did this study set out to answer?

This paper introduces a three-level framework of generalization in deep learning, arguing that the major transitions in the field are transitions in what models learn to generalize.

June 13, 2026 · Open Access

Three Levels of Generalization in Deep Learning

Key Points

Proposes a framework categorizing generalization into three levels: supervised cross-sample, self-supervised representation, and autoregressive and predictive high-dimensional relational generalization.
Argues that the distinction between BERT and GPT is not merely architectural: BERT makes representations reusable, while GPT makes prediction interactive.
Identifies Level 1 as learning task-specific mappings from labeled examples, Level 2 as learning reusable internal representations, and Level 3 as learning predictive structure over broad symbolic or multimodal streams.
Suggests that future advances in deep learning will depend not only on scale but on discovering richer generalization objects and more powerful interfaces for using learned structure.

Abstract

Deep learning is usually described through architectures, benchmarks, datasets, and scale: convolutional networks, Transformers, ImageNet, BERT, GPT, larger models, larger corpora, and larger compute. This paper argues that such descriptions miss a deeper organizing principle. The major transitions in deep learning are transitions in what models learn to generalize. This paper proposes a three-level framework of generalization in deep learning. Level 1 is supervised cross-sample generalization: models learn task-specific mappings from labeled examples and apply them to unseen samples. Level 2 is self-supervised representation generalization: models learn reusable internal representations from data-derived supervision and transfer them across downstream tasks. Level 3 is autoregressive and predictive high-dimensional relational generalization: models learn predictive structure over broad symbolic or multimodal streams, and the predictive process itself becomes a capability interface. The distinction between BERT and GPT is therefore not merely architectural. BERT makes representations reusable; GPT makes prediction interactive. This difference marks a transition from representation transfer to predictive interaction. The proposed framework explains deep learning progress as an expansion of the object of generalization: from task mappings, to reusable representations, to predictive relational structures. It also suggests that future advances will depend not only on scale, but on discovering richer generalization objects and more powerful interfaces through which learned structure can be used.
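As a rough illustration of the three levels, the sketch below contrasts where the training targets come from in each case: human labels, tokens masked out of the data, and the next token of a stream. It is a minimal sketch under stated assumptions, not the paper's method. The function names, the `[MASK]` placeholder, and the 15% mask rate are hypothetical choices made for this example only.

```python
import random


def supervised_pairs(samples):
    """Level 1: each input comes with an externally provided label."""
    return [(text, label) for text, label in samples]


def masked_targets(tokens, mask_rate=0.15, seed=0, mask="[MASK]"):
    """Level 2 (BERT-style): hide some tokens; the hidden tokens are the targets.

    Supervision is derived from the data itself, with no human labels.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append(mask)
            targets[i] = tok  # remember what was hidden at position i
        else:
            corrupted.append(tok)
    return corrupted, targets


def next_token_targets(tokens):
    """Level 3 (GPT-style): every prefix is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    sentence = "models learn predictive structure over streams".split()
    print(supervised_pairs([("a great film", "positive")]))
    print(masked_targets(sentence, mask_rate=0.3))
    print(next_token_targets(sentence)[:2])
```

In this toy form, the Level 3 objective yields a target at every position of the stream, and the same prefix-to-next-token mapping can be applied repeatedly at inference time, which is one way to read the abstract's claim that the predictive process itself becomes an interface.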
