Objective. Multimodal medical image registration is widely used in clinical diagnosis and underpins many downstream medical analysis tasks. However, intensity and appearance differences between modalities make registration challenging. Existing methods either rely on modality-independent feature descriptors, which are sensitive to noise, or attempt to bridge the modality gap inside the registration network, which typically leads to translation inaccuracies and misaligned anatomical information.
Approach. We propose an unsupervised approach that uses a Feature Perceptual Contrast Learning Network (FP-net) to learn descriptors that bridge modality differences while accurately capturing common details. We unify the feature representation of anatomical information under homogeneous and heterogeneous intensity distributions through local sampling-based feature perceptual contrast learning and image reconstruction learning. The trained FP-net is then frozen and used to drive an unsupervised registration framework that requires no ground-truth deformation fields.
Main results. We evaluated our method on two public benchmarks: the BraTS 2021 dataset for brain T2-T1 and T1-T1ce registration, and the Learn2Reg 2021 dataset for the more challenging abdominal CT-MR registration. Passing multimodal image pairs with shape differences through the fixed FP-net yields optimization gradients that update the registration network. In quantitative evaluations, our method outperformed state-of-the-art baselines. Specifically, it achieved a Dice Similarity Coefficient (DSC) of 76.3\% and 77.7\% on the tumor-bearing T2-T1 and T1-T1ce tasks, respectively, and a DSC of 50.1\% on the abdominal CT-MR task, indicating improved structural alignment.
Significance. Our method shifts the burden of bridging the modality gap away from the registration network, enabling standard U-Net architectures to achieve state-of-the-art deformable registration. This provides a robust, accurate, and easily deployable unsupervised solution for complex clinical multimodal image analysis.
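For readers unfamiliar with the metric reported above, the Dice Similarity Coefficient between two binary segmentation masks A and B is 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch of that standard definition only, not the evaluation code used in this work. The function name `dice_coefficient`, the flattened-list mask representation, and the toy masks are assumptions made for illustration.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice Similarity Coefficient between two flattened binary masks.

    Hypothetical helper for illustration: masks are given as equal-length
    sequences of 0/1 (or False/True) voxel labels.
    """
    a = [bool(v) for v in mask_a]
    b = [bool(v) for v in mask_b]
    if len(a) != len(b):
        raise ValueError("masks must have the same number of voxels")

    # Voxels labelled as foreground in both masks.
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)

    # Convention assumed here: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy example: a warped moving-image label versus the fixed-image label.
warped_label = [0, 1, 1, 1, 0, 0]
fixed_label = [0, 0, 1, 1, 1, 0]
dsc = dice_coefficient(warped_label, fixed_label)
print(f"DSC = {dsc:.3f}")
```

In a registration setting, the moving image's segmentation would be warped with the predicted deformation field and compared against the fixed image's segmentation. A DSC of 1 means perfect overlap and 0 means none.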
Lin et al. (Thu,) studied this question.