Deconfounded Multimodal Learning for Spatio-temporal Video Grounding | Synapse