Driving Referring Video Object Segmentation with Vision-Language Pre-trained Models | Synapse