RPViT: Vision Transformer Based on Region Proposal

Abstract

Vision Transformers continually borrow characteristics of convolutional neural networks to address their shortcomings in translation invariance and scale invariance. However, dividing the image with a simple grid often destroys the position and scale features of the image at the very start of the network. In this paper, we propose a vision transformer based on region proposal, which obtains these inductive biases in a simple way. Specifically, RPViT achieves locality and scale invariance by extracting local regions with a traditional region proposal algorithm and rescaling objects of different scales to a common scale with bilinear interpolation. In addition, to enable the network to fully utilize and encode diverse candidate objects, a multi-class token approach based on orthogonalization is proposed and applied. Experiments on ImageNet demonstrate that RPViT outperforms baseline transformers and related work.
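The rescaling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the region proposal algorithm, so the boxes are taken as given input, and the helper names (bilinear_resize, regions_to_patches), the box layout, the grayscale nested-list image, and the corner-aligned sampling convention are all assumptions made for illustration.

```python
from typing import List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values (assumption)
Box = Tuple[int, int, int, int]  # hypothetical layout: (top, left, bottom, right), end-exclusive


def bilinear_resize(patch: Image, out_h: int, out_w: int) -> Image:
    """Resize a 2D patch to (out_h, out_w) using bilinear interpolation."""
    in_h, in_w = len(patch), len(patch[0])
    out: Image = []
    for i in range(out_h):
        # Map the output row to a fractional input row (corner-aligned).
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = patch[y0][x0] * (1 - wx) + patch[y0][x1] * wx
            bottom = patch[y1][x0] * (1 - wx) + patch[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def regions_to_patches(image: Image, boxes: List[Box], size: int) -> List[Image]:
    """Crop each proposed region and rescale it to a common size x size patch."""
    patches = []
    for top, left, bottom, right in boxes:
        crop = [r[left:right] for r in image[top:bottom]]
        patches.append(bilinear_resize(crop, size, size))
    return patches


if __name__ == "__main__":
    # Toy 4x4 image and two hand-written boxes of different scales.
    img = [[float(4 * i + j) for j in range(4)] for i in range(4)]
    patches = regions_to_patches(img, [(0, 0, 2, 2), (0, 0, 4, 4)], size=3)
    for p in patches:
        print(p)
```

Both regions come out as 3x3 patches regardless of their original size, which is the sense in which regions of different scales are brought to the same scale before being fed to the transformer as tokens.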
