Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking