The incidence of thyroid nodules is relatively high. Clinicians typically distinguish benign from malignant nodules using ultrasound images, but this approach carries a risk of misdiagnosis, which can have serious consequences for patients. Improving diagnostic accuracy through computer-aided diagnosis (CAD) is therefore crucial. In this study, we propose a novel feature fusion network, ResNet-ViT, based on ResNet18 and ViT-l-16, to predict whether thyroid nodules are benign or malignant. The model uses the conv layer, layer1, and layer2 of ResNet18 to extract local features, and ViT-l-16 without the class token to extract global features. A convolutional block then fuses the local and global features. We applied the ResNet-ViT model to the DDTI and TN5000 datasets and compared it with eight other popular methods: ResNet18, ResNet50, DenseNet121, AlexNet, ViT-l-16, Cross-ViT, Hybrid, and EfficientViT. Under 5-fold cross-validation, the predictive performance of ResNet-ViT was superior to that of the other models. In addition, we used the MCB algorithm to fuse the image features extracted by ResNet-ViT with clinical features, constructing a ResNet-ViT multimodal model. Experimental results showed that the predictive performance of the ResNet-ViT multimodal model was significantly improved and that it outperformed the eight other models under the same conditions. Our study indicates that the ResNet-ViT multimodal model effectively captures both image and clinical features while exhibiting a degree of stability. Furthermore, comparative experiments on datasets containing varying amounts of surrounding tissue showed that including some surrounding tissue helps distinguish benign from malignant nodules.
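To make the image and clinical fusion step concrete, the following is a minimal sketch that assumes "MCB" denotes multimodal compact bilinear pooling; the abstract names only "the MCB algorithm", so this reading is an assumption. Under that assumption, each feature vector is projected with a count sketch and the two sketches are combined by circular convolution, which equals a count sketch of their outer product. The function names, the output dimension, and the example feature vectors are hypothetical and are not taken from the study.

```python
import random


def make_sketch_params(in_dim, out_dim, seed):
    """Draw a fixed random bucket index and sign for each input dimension."""
    rng = random.Random(seed)
    buckets = [rng.randrange(out_dim) for _ in range(in_dim)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(in_dim)]
    return buckets, signs


def count_sketch(x, buckets, signs, out_dim):
    """Project x to out_dim dimensions by adding each signed entry to its bucket."""
    y = [0.0] * out_dim
    for value, b, s in zip(x, buckets, signs):
        y[b] += s * value
    return y


def circular_convolution(a, b):
    """Circular convolution of two equal-length vectors (direct O(d^2) form)."""
    d = len(a)
    out = [0.0] * d
    for i, ai in enumerate(a):
        if ai == 0.0:
            continue  # empty buckets contribute nothing
        for j, bj in enumerate(b):
            out[(i + j) % d] += ai * bj
    return out


def mcb_fuse(image_feat, clinical_feat, out_dim=16, seed=0):
    """Fuse two feature vectors (hypothetical helper, assumed MCB-style pooling)."""
    bi, si = make_sketch_params(len(image_feat), out_dim, seed)
    bc, sc = make_sketch_params(len(clinical_feat), out_dim, seed + 1)
    return circular_convolution(
        count_sketch(image_feat, bi, si, out_dim),
        count_sketch(clinical_feat, bc, sc, out_dim),
    )


# Hypothetical inputs: a short image embedding and two clinical variables.
image_features = [0.5, -1.0, 2.0, 0.25]
clinical_features = [1.0, 3.0]
fused = mcb_fuse(image_features, clinical_features, out_dim=8, seed=42)
print(len(fused))
```

Practical implementations usually compute the circular convolution with a fast Fourier transform and learn a classifier on the fused vector; the direct form is used here only to keep the sketch dependency-free.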
Zhou et al. studied this question.