The Contrastive Language-Image Pre-training (CLIP) model uses contrastive learning to align image and text representations, and fine-tuning CLIP with federated learning can extend it to specialized professional domains. However, federated CLIP fine-tuning faces two key challenges: insufficient alignment of fine-grained semantics between the vision and text modalities, and poor adaptability to non-independent and identically distributed (non-IID) data. This paper proposes the Optimal Transport Dual Prompt Personalization (OTDPP) framework, which injects prompt parameters into the deep layers of both the visual and text encoders, achieves fine-grained cross-modal alignment through optimal transport, and introduces a dual prompt tuning mechanism. The mechanism splits the prompt parameters into a shared global part, which the server aggregates, and a private local part, which each client keeps, so that clients can adapt to their own data without updating the large backbone encoders. Extensive experiments show that, compared with classic prompt tuning baselines, OTDPP reduces computational and communication overhead, retains client-specific features, and improves model adaptability and performance.
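The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the transport cost, the marginals, or the aggregation rule, so the sketch assumes a cosine-distance cost, uniform marginals, an entropy-regularized Sinkhorn solver, and a FedAvg-style weighted mean for the global prompts. All function names (`sinkhorn`, `ot_similarity`, `aggregate_global`) are hypothetical.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan with uniform marginals (assumed here)."""
    m, n = cost.shape
    a = np.full(m, 1.0 / m)          # mass on each visual token
    b = np.full(n, 1.0 / n)          # mass on each text prompt
    K = np.exp(-cost / eps)          # Gibbs kernel
    u, v = np.ones(m), np.ones(n)
    for _ in range(n_iters):         # alternate scaling of rows and columns
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def ot_similarity(patch_feats, prompt_feats, eps=0.1):
    """Fine-grained image-text score: cosine similarity weighted by the OT plan."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = prompt_feats / np.linalg.norm(prompt_feats, axis=1, keepdims=True)
    sim = p @ t.T                    # (num_patches, num_prompts)
    plan = sinkhorn(1.0 - sim, eps)  # cosine distance as transport cost
    return float((plan * sim).sum())

def aggregate_global(client_prompts, client_sizes):
    """Server step: weighted mean of the shared global prompts only.

    Local prompts never leave the clients, so they do not appear here.
    """
    w = np.asarray(client_sizes, dtype=float)
    w /= w.sum()
    return sum(wi * p for wi, p in zip(w, client_prompts))

# Toy usage with random features (6 visual patches, 4 text prompts, dim 8).
rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 8))
prompts = rng.normal(size=(4, 8))
score = ot_similarity(patches, prompts)

# Two clients upload global prompts; each keeps its local prompts private.
global_prompt = aggregate_global(
    [np.ones((2, 3)), 3.0 * np.ones((2, 3))], client_sizes=[1, 3]
)
```

In this sketch only the global prompt tensors are communicated, which is why prompt-based federated tuning is cheaper than exchanging backbone weights; how OTDPP combines the global and local prompts during inference is not stated in the abstract and is left out.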
Shi et al. (Fri,) studied this question.