What does this research mean for the field?

A novel hierarchical Vision Mixture of Experts (VMoE) framework that uses cosine similarity distillation improves classification accuracy in complex urban visual scenes by an average of 4.3% over the ViT-Tiny baseline, at lower computational cost than existing baselines. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This research aims to enhance the evaluation methods for urban physical examinations in diverse aging environments by introducing a new dataset and framework.

June 3, 2026 · Open Access

Cosine Similarity Distillation Vision Mixture-of-Experts for Intelligent Housing-Dimensional Urban Physical Examinations

Key points

Developed the Housing-dimensiOnal visUal inSpection imagE Dataset (HOUSED) with a hierarchical labeling scheme.
Proposed a Vision Mixture of Experts (VMoE) framework utilizing a CS-Soft routing mechanism for expert assignment.
Implemented a composite loss function using Supervised Contrastive Loss and Focal Loss to handle data imbalance and accelerate convergence.
Achieved an average accuracy improvement of 4.3% over the ViT-Tiny baseline and 1.81% over the best-performing VMoE baseline across the evaluated datasets.
Delivered these gains at lower computational cost than the baselines.
Validated generalizability and competitive performance across various mixed public vision datasets.
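
The composite loss named in the key points can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `weight` factor, the temperature, and the use of cosine similarity between embeddings are assumptions made for illustration. Only the general forms of focal loss and supervised contrastive loss are standard.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, target, gamma=2.0):
    """Multi-class focal loss for one sample: -(1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)  # small epsilon guards against zero vectors

def supcon_loss(embeddings, parent_labels, temperature=0.1):
    """Supervised contrastive loss over a batch.

    Positives are the other samples sharing the anchor's parent category
    (assumed here to be how the parent-level constraint is applied).
    """
    n = len(embeddings)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and parent_labels[j] == parent_labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        sims = {j: cosine(embeddings[i], embeddings[j]) / temperature
                for j in range(n) if j != i}
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0

def composite_loss(logits_batch, sub_labels, embeddings, parent_labels,
                   weight=0.5, gamma=2.0):
    """Focal loss on subcategory logits plus a weighted parent-level SupCon term."""
    focal = sum(focal_loss(l, t, gamma)
                for l, t in zip(logits_batch, sub_labels)) / len(sub_labels)
    return focal + weight * supcon_loss(embeddings, parent_labels)
```

With `gamma=0` the focal term reduces to ordinary cross-entropy. Larger values of `gamma` shrink the contribution of samples the model already classifies confidently, which is the hard-sample emphasis the summary refers to.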

Abstract

Intelligent housing-dimensional urban physical examination requires evaluating complex visual scenes in aging communities. Existing methods and datasets are insufficient for these heterogeneous tasks and severe class imbalances. To address this, we introduce the Housing-dimensiOnal visUal inSpection imagE Dataset (HOUSED) with a hierarchical labeling scheme, and propose a hierarchical Vision Mixture of Experts (VMoE) framework. At its core, the proposed CS-DisVMoE module utilizes a CS-Soft routing mechanism to capture spatial feature correlations, optimizing expert assignment and reducing inference overhead. Additionally, a FENNEL-based non-linear graph partitioning mechanism converts pre-trained dense weights into semantically coherent expert initializations, accelerating convergence while preserving localized visual clustering. To address the hierarchical labels, we design a composite loss function: a Supervised Contrastive Loss acts as a parent-category soft constraint to accelerate convergence, while Focal Loss mitigates data imbalance and handles fine-grained subcategory classification via hard sample mining. Across evaluated datasets, the full proposed framework improves accuracy by an average of 4.3% over the ViT-Tiny baseline and 1.81% over the best-performing VMoE baseline. Furthermore, it achieves these improvements with lower computational costs. Further tests on mixed public vision datasets verify its generalizability and competitive performance for complex-scene applications.
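
The abstract does not spell out how CS-Soft routing is computed, so the following is only a generic sketch of soft expert routing driven by cosine similarity. The expert key vectors, the temperature, and the names `cosine_soft_route` and `mix_experts` are hypothetical, and the sketch does not model the spatial feature correlations or the distillation step described in the paper.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)  # small epsilon guards against zero vectors

def cosine_soft_route(token, expert_keys, temperature=0.1):
    """Return softmax routing weights from token-to-expert cosine similarity.

    Each expert is represented by an assumed key vector; similarity scores
    are divided by the temperature before the softmax.
    """
    scores = [cosine(token, key) / temperature for key in expert_keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_experts(token, expert_keys, experts, temperature=0.1):
    """Soft mixture: every expert runs and outputs are blended by routing weight."""
    weights = cosine_soft_route(token, expert_keys, temperature)
    outputs = [expert(token) for expert in experts]
    dim = len(outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, outputs))
            for d in range(dim)]

# Usage: three toy experts, each a plain function on a feature vector.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [0.0 for _ in x],
    lambda x: [-v for v in x],
]
mixed = mix_experts([1.0, 0.0], keys, experts)
```

Because cosine similarity ignores vector magnitude, the routing weights depend only on the direction of the token feature. That is one plausible reason to prefer it over a raw dot-product router, though the paper's own motivation may differ.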

Cite This Study

Zhao et al. studied this question.

synapsesocial.com/papers/6a1fc47adee9eb8c0dce5fc9 https://doi.org/10.3390/s26113473
