Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models | Synapse