Human-Centric Spatio-Temporal Video Grounding With Visual Transformers