Background/Objectives: Vision–language models such as BiomedCLIP are increasingly investigated for their diagnostic potential in medical imaging. Although these foundation models show promise in general radiographic interpretation, their application in pediatric domains, particularly for subtle postoperative findings like esophageal strictures, remains underexplored. This study aimed to evaluate the diagnostic performance of BiomedCLIP in classifying pediatric esophageal radiographs into three clinically relevant categories: presence of contrast agent, full esophageal visibility, and presence of esophageal stricture. Methods: We retrospectively analyzed 143 pediatric esophageal X-rays collected between 2021 and 2025. Each image was annotated by two pediatric radiology experts and categorized according to esophageal visibility, contrast presence, and stricture occurrence. BiomedCLIP was used in a zero-shot classification setup without fine-tuning. Model predictions were converted into binary outcomes and assessed against the ground truth using a comprehensive suite of 27 performance metrics, including accuracy, sensitivity, specificity, F1-score, AUC, and calibration analyses. Results: BiomedCLIP achieved high precision (88.7%) and a favorable AUC (0.854) in detecting contrast agent presence, though specificity remained low (20%), leading to a high false-positive rate. The model correctly identified all cases of non-visible esophagus, but was untestable in predicting full visibility due to the absence of positive cases. Critically, its performance in detecting esophageal strictures was poor, with accuracy at 24%, sensitivity at 44%, specificity at 18%, and AUC of 0.26. Statistical overlap between contrast and stricture predictions indicated a lack of semantic differentiation within the model’s latent space. Conclusions: BiomedCLIP shows potential in detecting high-salience features such as contrast but fails to reliably identify esophageal strictures. Limitations include class imbalance, absence of fine-tuning, and architectural constraints in recognizing subtle morphologic abnormalities. These findings emphasize the need for domain-specific adaptation of foundation models before clinical implementation in pediatric radiology.
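To make the reported quantities concrete, the following is a minimal sketch of how binary predictions can be scored against ground-truth labels. It is not the authors' evaluation code: the function name `binary_metrics` and the toy label lists are hypothetical, and only a handful of the 27 metrics are shown. It also illustrates two points from the results: the false-positive rate equals one minus specificity (so a specificity of 20% implies a false-positive rate of 80%), and sensitivity is undefined when a category contains no positive cases.

```python
import math


def binary_metrics(y_true, y_pred):
    """Score binary predictions (0/1) against binary ground truth (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # A metric is undefined (NaN) when its denominator is zero,
        # e.g. sensitivity when there are no positive cases at all.
        return num / den if den else math.nan

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "false_positive_rate": 1.0 - specificity,
    }


# Toy labels for illustration only; these are NOT data from the study.
toy_truth = [1, 1, 1, 1, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0]
toy_scores = binary_metrics(toy_truth, toy_pred)

# A category with no positive cases: sensitivity cannot be estimated.
no_positive_scores = binary_metrics([0, 0, 0], [0, 0, 0])

if __name__ == "__main__":
    for name, value in toy_scores.items():
        print(f"{name}: {value:.3f}")
```

Threshold-free measures such as AUC and the calibration analyses require the model's continuous similarity scores rather than the binarized outputs, so they are outside the scope of this sketch.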
Fabijan et al. (Mon,) studied this question.