What matters when building vision-language models? | Synapse