Feature representation learning focuses on extracting and encoding meaningful information from raw data, such as images or text. These encoded representations form the foundation for a wide range of downstream tasks, including classification, regression, clustering, and generation. In recent years, the field has witnessed rapid progress, largely driven by advances in self-supervised learning and multimodal modeling. By leveraging large-scale, readily available datasets and designing pre-training tasks that do not rely on human annotation, researchers have been able to learn highly effective and transferable feature representations. Despite these advancements, several key challenges persist: (1) designing pre-training objectives that capture the inherent structure and semantics of data, (2) ensuring that learned features generalize across diverse downstream tasks, and (3) learning effectively from noisy, weakly aligned data. This dissertation addresses these challenges by investigating feature representation learning across multiple modalities, including human skeleton data, visual data (images and videos), and text. It covers the design of pre-training tasks, the formulation of representation structures, and the mitigation of noise in large-scale data. To this end, the dissertation presents: (1) a hierarchical encoder combined with a pretext-based self-supervised framework for modeling structured skeleton data; (2) a codebook-based representation strategy that improves alignment between image and text modalities and addresses semantic mismatches during pre-training; and (3) an automatic data filtering system that leverages language models and multi-pathway alignment to filter noisy supervision in video-language data.
Together, these contributions offer a unified perspective on learning robust, transferable, and interpretable feature representations across diverse domains and challenges.
Yuxiao Chen studied this question.