How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models | Synapse