Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders