We construct an unsupervised learning model that achieves nonlinear disentanglement of underlying factors of variation in naturalistic videos. Previous work suggests that representations can be disentangled if all but a few factors in the environment stay constant at any point in time. As a result, algorithms proposed for this problem have only been tested on carefully constructed datasets with this exact property, leaving it unclear whether they transfer to natural scenes. Here we provide evidence that objects in natural movies undergo transitions that are typically small in magnitude with occasional large jumps, which is characteristic of a temporally sparse distribution. We leverage this finding and present SlowVAE, a model for representation learning that uses a sparse prior on temporally adjacent observations to disentangle generative factors without any assumptions on the number of changing factors. We provide a proof of identifiability and show that the model reliably learns disentangled representations on several benchmark datasets, often surpassing the current state-of-the-art. We additionally demonstrate transferability towards video datasets with natural dynamics, Natural Sprites and KITTI Masks, which we contribute as benchmarks for guiding disentanglement research towards more natural data domains.
Klindt et al. studied this question.