Self-Supervised Learning (SSL) has demonstrated promising results in 3D medical image analysis. However, the lack of high-level semantics in pre-training still heavily hinders the performance of downstream tasks. We observe that 3D medical images contain relatively consistent contextual position information, i.e., consistent geometric relations between different organs, which offers a potential way to learn consistent semantic representations in pre-training. In this paper, we propose a simple-yet-effective Volume Contrast (VoCo) framework to leverage these contextual position priors for pre-training. Specifically, we first generate a group of base crops from different regions while enforcing feature discrepancy among them, and we use them as class assignments for the different regions. Then, we randomly crop sub-volumes and predict which class each one belongs to (i.e., which region it is located in) by contrasting its similarity to the different base crops, which can be seen as predicting the contextual positions of the sub-volumes. Through this pretext task, VoCo implicitly encodes the contextual position priors into model representations without the guidance of annotations, enabling us to effectively improve the performance of downstream tasks that require high-level semantics. Extensive experimental results on six downstream tasks demonstrate the superior effectiveness of VoCo. Code will be available at https://github.com/Luffy03/VoCo.
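The pretext task described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the crops are 2D rather than 3D for brevity, the position targets are taken to be the fraction of the random crop that overlaps each base region, the loss is a simple absolute difference between clipped cosine similarities and those targets, and the encoder is a hypothetical fixed random projection standing in for a real network. All function names here are hypothetical.

```python
import numpy as np

def base_crop_boxes(size, grid):
    """Split a size x size plane into grid x grid non-overlapping boxes (y, x, side)."""
    step = size // grid
    return [(r * step, c * step, step) for r in range(grid) for c in range(grid)]

def overlap_labels(crop, boxes):
    """Assumed target: fraction of the random crop's area inside each base box."""
    y, x, s = crop
    labels = []
    for by, bx, bs in boxes:
        dy = max(0, min(y + s, by + bs) - max(y, by))
        dx = max(0, min(x + s, bx + bs) - max(x, bx))
        labels.append(dy * dx / (s * s))
    return np.array(labels)

def cosine_sim(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return a @ b.T

def position_loss(query_feat, base_feats, labels):
    """Assumed loss: mean absolute gap between clipped similarities and targets."""
    pred = np.clip(cosine_sim(query_feat[None, :], base_feats)[0], 0.0, 1.0)
    return float(np.mean(np.abs(pred - labels)))

def discrepancy_loss(base_feats):
    """Penalize similarity between different base crops (feature discrepancy)."""
    sim = np.clip(cosine_sim(base_feats, base_feats), 0.0, 1.0)
    off_diag = sim[~np.eye(len(base_feats), dtype=bool)]
    return float(np.mean(off_diag))

# Toy usage: a random "image" and a random-projection encoder (both hypothetical).
rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64))
boxes = base_crop_boxes(64, 4)            # 16 base regions of side 16
proj = rng.normal(size=(16 * 16, 32))     # stand-in for a learned encoder

def encode(box):
    y, x, s = box
    return image[y:y + s, x:x + s].reshape(-1) @ proj

base_feats = np.stack([encode(b) for b in boxes])
crop = (8, 8, 16)                         # random crop straddling four base regions
labels = overlap_labels(crop, boxes)
loss = position_loss(encode(crop), base_feats, labels) + discrepancy_loss(base_feats)
```

In this sketch the crop at `(8, 8, 16)` covers a quarter of each of four neighbouring base regions, so its target assigns equal weight to those four classes and zero to the rest. In an actual training loop the two loss terms would be minimized with respect to the encoder's parameters.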
Wu et al. (Sun,) studied this question.