Efficiently Gluing Pre-trained Language and Vision Models for Image Captioning | Synapse