Abstract. Vision Transformers (ViTs) have shown remarkable performance across a range of computer vision tasks, but fine-tuning them for dense prediction tasks such as semantic segmentation remains computationally intensive. This work proposes a novel dual-task application of the LyCORIS Low-Rank Adaptation for Convolutions (LyCORIS LoCon) framework, which introduces learnable low-rank convolutional modules into pre-trained ViTs. The method is applied to Depth Anything V2 (DAV2), augmenting its decoder to support two outputs, monocular depth estimation and binary human semantic segmentation, without disrupting its original capabilities. By injecting only 150K trainable parameters, the approach significantly reduces adaptation cost while achieving segmentation performance comparable to state-of-the-art models such as SAM, MaskFormer, and SegFormer. Extensive experiments on filtered COCO and ImageNet subsets show that Conv-LoRA enhances task-specific learning with minimal computational overhead. The method achieves an mAP of 89.69% and an mIoU of 79.17% for human segmentation, competitive with state-of-the-art models such as Mask2Former, while preserving the depth prediction accuracy of the base model.
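To give a sense of why a low-rank convolutional module adds so few trainable parameters, the following is a minimal sketch that compares weight counts. It assumes the common LoCon-style factorization, in which a k x k convolution down to `rank` channels is followed by a 1 x 1 convolution up to the output channels. The function names, layer sizes, and rank are hypothetical choices for illustration and are not taken from the paper.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a dense k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def locon_params(c_in, c_out, k, rank):
    """Weight count of a low-rank pair: a k x k conv down to `rank`
    channels, then a 1 x 1 conv up to c_out channels (bias ignored)."""
    down = c_in * rank * k * k
    up = rank * c_out
    return down + up


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k, rank = 256, 256, 3, 4
full = conv_params(c_in, c_out, k)
low = locon_params(c_in, c_out, k, rank)
print(full, low, round(full / low, 1))  # prints: 589824 10240 57.6
```

Under these assumed sizes, the low-rank pair has roughly 58 times fewer weights than the dense convolution it adapts, which is the kind of saving that keeps the total trainable budget small.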
Srinivasan et al. (Sat,) studied this question.