Partially supervised Compositional Zero-Shot Learning (pCZSL) aims to recognize new compositions of states and objects when, for every image in the training set, only the state annotation or the object annotation is available. In pCZSL, the features of a state vary depending on the object in the composition (e.g., the features of the state ripe differ between ripe banana and ripe apple). Understanding how features vary across object scales is a further key challenge. In the proposed architecture, a Swin Transformer-based Hierarchical Feature Extractor (HFE) captures the large range of semantic interactions between state and object features. The Discriminative Context Aggregation module uses features from the intermediate layers of the HFE to understand the features of objects at their corresponding scales. To leverage the partially labeled data in pCZSL, we pass strongly and weakly augmented versions of the input image to the proposed architecture. The predicted class probabilities for the strongly and weakly augmented images are encouraged to be similar by minimizing a distribution alignment loss. This loss incorporates a class-specific re-weighting approach to alleviate the effect of data imbalance in pCZSL. Extensive experiments on three benchmark datasets demonstrate the superiority of the proposed approach.
Panda et al. (Thu,) studied this question.
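The abstract describes the distribution alignment loss only in words, so the following is a minimal sketch of one way such a loss could look, not the authors' formulation. It assumes the loss is a class-weighted KL divergence from the weak-view prediction (treated as a fixed target) to the strong-view prediction, and that the class-specific weights are inverse class frequencies normalized to a mean of 1. The function names `log_softmax`, `class_weights`, and `alignment_loss` are hypothetical and chosen for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def class_weights(class_counts):
    """Inverse-frequency weights normalized to mean 1.

    Assumption: the abstract does not specify the re-weighting scheme;
    inverse frequency is one common choice for imbalanced data.
    """
    inv = [1.0 / max(c, 1) for c in class_counts]
    mean_inv = sum(inv) / len(inv)
    return [w / mean_inv for w in inv]


def alignment_loss(weak_logits, strong_logits, weights):
    """Class-weighted KL divergence between the two augmented views.

    Assumption: the weak-view distribution acts as the target and the
    strong-view distribution is pulled toward it.
    """
    log_p_weak = log_softmax(weak_logits)
    log_p_strong = log_softmax(strong_logits)
    return sum(
        w * math.exp(lw) * (lw - ls)
        for w, lw, ls in zip(weights, log_p_weak, log_p_strong)
    )


if __name__ == "__main__":
    # Hypothetical counts for three state classes; the rare class gets the largest weight.
    weights = class_weights([100, 20, 5])
    loss = alignment_loss([2.0, 0.5, -1.0], [1.2, 0.9, -0.3], weights)
    print(weights, loss)
```

The loss is exactly zero when both views produce the same logits. Note that with non-uniform weights the weighted sum is no longer guaranteed to be non-negative, which is one reason a real implementation might prefer a weighted cross-entropy or normalize the re-weighted target distribution instead.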