March 20, 2024Open Access

Retina Vision Transformer (RetinaViT): Introducing Scaled Patches into Vision Transformers

Puntos clave

Los puntos clave no están disponibles para este artículo en este momento.

Resumen

Humans see low and high spatial frequency components at the same time, and combine the information from both to form a visual scene. Drawing on this neuroscientific inspiration, we propose an altered Vision Transformer architecture where patches from scaled down versions of the input image are added to the input of the first Transformer Encoder layer. We name this model Retina Vision Transformer (RetinaViT) due to its inspiration from the human visual system. Our experiments show that when trained on the ImageNet-1K dataset with a moderate configuration, RetinaViT achieves a 3.3% performance improvement over the original ViT. We hypothesize that this improvement can be attributed to the inclusion of low spatial frequency components in the input, which improves the ability to capture structural features, and to select and forward important features to deeper layers. RetinaViT thereby opens doors to further investigations into vertical pathways and attention patterns.

Leer artículo completoexternamente

Preguntar a la IA

Me gusta

Guardar

Ver artículo completo

Cite This Study

Shu et al. (Wed,) studied this question.

synapsesocial.com/papers/68e733cdb6db6435876ada47 https://doi.org/https://doi.org/10.48550/arxiv.2403.13677

Also Consider

Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context:

Preguntar a la IA

Me gusta

Guardar

Ver artículo completo