Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation | Synapse