Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning | Synapse