What question did this study set out to answer?

Can a low-rank adaptation method make fine-tuning Vision Transformers for semantic segmentation more efficient?

June 15, 2026 | Open Access

Convolutional low-rank adaptation for efficient semantic segmentation in vision transformers

Key Points

Aimed to make semantic segmentation in Vision Transformers more efficient through low-rank adaptation.
Introduced learnable low-rank convolutional modules into pre-trained Vision Transformers using the LyCORIS framework.
Applied the method to Depth Anything V2, supporting monocular depth estimation and binary human semantic segmentation.
Conducted extensive experiments on filtered COCO and ImageNet subsets.
Achieved an mAP of 89.69% and an mIoU of 79.17% for human segmentation.
Maintained depth prediction accuracy while adapting with only 150K trainable parameters.
Demonstrated competitive performance with state-of-the-art models like Mask2Former.
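
As an illustration of the mIoU figure quoted above, the following minimal sketch computes intersection-over-union for a binary person/background labelling. It is not the paper's evaluation code: the function names `iou` and `mean_iou`, the flat label lists, and the choice to average over the two classes are all assumptions made for the example, and the study may average differently (for instance per image).

```python
def iou(pred, target, cls):
    """IoU for one class over two flat label sequences of equal length."""
    pairs = list(zip(pred, target))
    inter = sum(1 for p, t in pairs if p == cls and t == cls)
    union = sum(1 for p, t in pairs if p == cls or t == cls)
    # A class absent from both prediction and target has no defined IoU.
    return inter / union if union else float("nan")


def mean_iou(pred, target, classes=(0, 1)):
    """Average IoU over classes, skipping classes with undefined IoU."""
    scores = [iou(pred, target, c) for c in classes]
    valid = [s for s in scores if s == s]  # NaN is the only value != itself
    return sum(valid) / len(valid) if valid else float("nan")


# Toy labels (hypothetical): 1 = person, 0 = background.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]

person_iou = iou(pred, target, cls=1)
miou = mean_iou(pred, target)
```

In this toy case the person class has 2 overlapping pixels out of 4 in the union, so its IoU is 0.5.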

Abstract

Vision Transformers (ViTs) have shown remarkable performance across various computer vision tasks, but fine-tuning them for dense prediction tasks such as semantic segmentation remains computationally intensive. This work proposes a novel dual-task architectural application of the LyCORIS Low-Rank Adaptation for Convolutions (LyCORIS LoCon) framework, which introduces learnable low-rank convolutional modules into pre-trained ViTs. The method is applied to Depth Anything V2 (DAV2), augmenting its decoder to support dual-task outputs: monocular depth estimation and binary human semantic segmentation, without disrupting its original capabilities. By injecting only 150K trainable parameters, the approach significantly reduces the adaptation cost while achieving segmentation performance comparable to state-of-the-art models like SAM, MaskFormer, and SegFormer. Extensive experiments on filtered COCO and ImageNet subsets show that Conv-LoRA enhances task-specific learning with minimal computational overhead. The method achieves an mAP of 89.69% and an mIoU of 79.17% for human segmentation, performing competitively alongside state-of-the-art models like Mask2Former, while preserving the depth prediction accuracy of the base model.
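
To make the parameter savings of a low-rank convolutional module concrete, the sketch below counts weights for a full convolution and for a low-rank replacement. It assumes a common factorization in which a k x k convolution into `rank` channels is followed by a 1 x 1 convolution back to the output width. The layer sizes, the rank, and the function names are hypothetical and are not taken from the paper, whose exact layer configuration is not given here.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def low_rank_conv_params(c_in, c_out, k, rank):
    """Weight count of an assumed low-rank pair: k x k down, then 1 x 1 up."""
    down = c_in * rank * k * k  # k x k conv: c_in -> rank channels
    up = rank * c_out           # 1 x 1 conv: rank -> c_out channels
    return down + up


# Hypothetical decoder layer: 256 -> 256 channels, 3 x 3 kernel, rank 4.
full = conv_params(256, 256, 3)
adapter = low_rank_conv_params(256, 256, 3, rank=4)
ratio = adapter / full
```

Under these assumed sizes the adapter trains 10,240 weights against 589,824 in the frozen layer, under 2% of the original. Summed over several such layers, budgets of this order are how a total in the range of 150K trainable parameters becomes plausible.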
