Natural-Language-Driven Multimodal Representation Learning for Audio-Visual Scene-Aware Dialog System | Synapse