February 5, 2026 | Open Access

Towards robust and generalizable feature representation learning

Key Points

The research aims to improve feature representation learning by addressing the challenges of generalization and noise across different data types.
Investigated feature representation challenges in human skeleton, visual, and text data.
Developed a hierarchical encoder within a self-supervised framework for skeleton data.
Created a codebook-based strategy for aligning image and text modalities during pre-training.
Implemented an automatic data filtering system for reducing noise in video-language data.
Proposed methods enhance alignment between image and text representations.
Presented a novel approach for modeling structured data effectively.
Demonstrated improved robustness in feature representations across multiple modalities.

Abstract

Feature representation learning focuses on extracting and encoding meaningful information from raw data, such as images or text. These encoded representations form the foundation for a wide range of downstream tasks, such as classification, regression, clustering, and generation. In recent years, the field has witnessed rapid progress, largely driven by advances in self-supervised learning and multimodal modeling. By leveraging large-scale, readily available datasets and designing pre-training tasks that do not rely on human annotation, researchers have been able to learn highly effective and transferable feature representations. Despite these advances, several key challenges persist: (1) designing pre-training objectives that capture the inherent structure and semantics of data, (2) ensuring that learned features generalize across diverse downstream tasks, and (3) learning effectively from noisy, weakly aligned data. This dissertation addresses these challenges by investigating feature representation learning across multiple modalities, including human skeleton data, visual data (images and videos), and text. It covers the design of pre-training tasks, the formulation of representation structures, and the mitigation of noise in large-scale data. To this end, the dissertation presents: (1) a hierarchical encoder combined with a pretext-based self-supervised framework for modeling structured skeleton data; (2) a codebook-based representation strategy that improves alignment between image and text modalities and addresses semantic mismatches during pre-training; and (3) an automatic data filtering system that leverages language models and multi-pathway alignment to filter noisy supervision in video-language data.
Together, these contributions offer a unified perspective on learning robust, transferable, and interpretable feature representations across diverse domains and challenges.
