Image-text multimodal classification via cross-attention contextual transformer with modality-collaborative learning