Conversational Speech Recognition by Learning Audio-Textual Cross-Modal Contextual Representation