VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing