Siamese Vision Transformers are Scalable Audio-visual Learners | Synapse