What question did this study set out to answer?

This research aims to develop an unsupervised learning model that can effectively disentangle factors of variation in natural videos.

July 21, 2020 | Open Access

Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding

Key Points

Introduced SlowVAE, a model employing a sparse prior on temporal observations.
Provided proof of identifiability for the model.
Tested the model on benchmark datasets, where it reliably learned disentangled representations.
Showed improved disentangled representations compared to the current state-of-the-art on several benchmarks.
Achieved reliable performance on video datasets with natural dynamics, such as Natural Sprites and KITTI Masks.
Demonstrated ability to learn without prior assumptions on the number of changing factors.

Abstract

We construct an unsupervised learning model that achieves nonlinear disentanglement of underlying factors of variation in naturalistic videos. Previous work suggests that representations can be disentangled if all but a few factors in the environment stay constant at any point in time. As a result, algorithms proposed for this problem have only been tested on carefully constructed datasets with this exact property, leaving it unclear whether they transfer to natural scenes. Here we provide evidence that objects in natural movies undergo transitions that are typically small in magnitude with occasional large jumps, which is characteristic of a temporally sparse distribution. We leverage this finding and present SlowVAE, a model for representation learning that uses a sparse prior on temporally adjacent observations to disentangle generative factors without any assumptions on the number of changing factors. We provide a proof of identifiability and show that the model reliably learns disentangled representations on several benchmark datasets, often surpassing the current state-of-the-art. We additionally demonstrate transferability towards video datasets with natural dynamics, Natural Sprites and KITTI Masks, which we contribute as benchmarks for guiding disentanglement research towards more natural data domains.
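
To make "typically small with occasional large jumps" concrete, here is a minimal sketch that compares the tail behavior of sparse and non-sparse transitions with matched variance. It is only an illustration, not the paper's implementation: the Laplace distribution as the sparse stand-in, the function names, the sample size, and the scale are all assumptions chosen for this example. A heavy-tailed (sparse) transition distribution should show clearly positive excess kurtosis, while Gaussian transitions should sit near zero.

```python
import math
import random


def excess_kurtosis(xs):
    """Sample excess kurtosis: fourth standardized moment minus 3."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0


def simulate(n=20000, scale=0.1, seed=0):
    """Return (sparse, dense) excess kurtosis of simulated frame-to-frame changes.

    Hypothetical setup: `scale` is the Laplace scale of the sparse transitions;
    the Gaussian transitions are given the same variance (2 * scale**2).
    """
    rng = random.Random(seed)
    rate = 1.0 / scale
    # The difference of two i.i.d. exponentials is Laplace(0, scale):
    # mostly small changes, with occasional large jumps.
    sparse = [rng.expovariate(rate) - rng.expovariate(rate) for _ in range(n)]
    # Gaussian transitions with matched variance, for comparison.
    sigma = math.sqrt(2.0) * scale
    dense = [rng.gauss(0.0, sigma) for _ in range(n)]
    return excess_kurtosis(sparse), excess_kurtosis(dense)


if __name__ == "__main__":
    k_sparse, k_dense = simulate()
    print(f"excess kurtosis, Laplace transitions:  {k_sparse:.2f}")
    print(f"excess kurtosis, Gaussian transitions: {k_dense:.2f}")
```

The same statistic could be computed on measured frame-to-frame changes of an object's position or scale in a video to check whether its transitions look sparse in this sense.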
