Intelligent urban physical examination in the housing dimension requires evaluating complex visual scenes in aging communities. Existing methods and datasets handle the heterogeneous tasks and severe class imbalance of this setting poorly. To address this, we introduce the Housing-dimensiOnal visUal inSpection imagE Dataset (HOUSED), which uses a hierarchical labeling scheme, and propose a hierarchical Vision Mixture of Experts (VMoE) framework. At its core, the proposed CS-DisVMoE module uses a CS-Soft routing mechanism to capture spatial feature correlations, which improves expert assignment and reduces inference overhead. In addition, a FENNEL-based non-linear graph partitioning mechanism converts pre-trained dense weights into semantically coherent expert initializations, accelerating convergence while preserving localized visual clustering. To exploit the hierarchical labels, we design a composite loss function: a Supervised Contrastive Loss acts as a soft constraint at the parent-category level to accelerate convergence, while a Focal Loss handles fine-grained subcategory classification and mitigates data imbalance through hard sample mining. Across the evaluated datasets, the full framework improves accuracy by an average of 4.3% over the ViT-Tiny baseline and 1.81% over the best-performing VMoE baseline, at lower computational cost. Further tests on mixed public vision datasets verify its generalizability and competitive performance in complex-scene applications.
Zhao et al. (Sun,) studied this question.
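The composite loss described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the additive combination, the weight `lam`, the focusing parameter `gamma`, the `temperature`, and all function names are hypothetical. The sketch uses the standard multi-class focal loss on subcategory logits and the standard supervised contrastive loss with positives defined by shared parent category.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def focal_loss(logits, target, gamma=2.0):
    # Standard focal loss: -(1 - p_t)^gamma * log(p_t).
    # gamma = 0 reduces to ordinary cross-entropy.
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def supcon_loss(embeddings, parent_labels, temperature=0.1):
    # Supervised contrastive loss; positives share the anchor's parent label.
    normed = []
    for v in embeddings:
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        normed.append([x / n for x in v])
    total, anchors = 0.0, 0
    for i, zi in enumerate(normed):
        others = [j for j in range(len(normed)) if j != i]
        positives = [j for j in others if parent_labels[j] == parent_labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        sims = {j: sum(a * b for a, b in zip(zi, normed[j])) / temperature
                for j in others}
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

def composite_loss(batch_logits, sub_labels, embeddings, parent_labels,
                   lam=0.5, gamma=2.0, temperature=0.1):
    # Assumed combination: mean focal loss on subcategories plus a weighted
    # contrastive term on parent categories. The weighting is hypothetical.
    focal = sum(focal_loss(l, t, gamma)
                for l, t in zip(batch_logits, sub_labels)) / len(sub_labels)
    return focal + lam * supcon_loss(embeddings, parent_labels, temperature)
```

In this sketch, well-classified samples are down-weighted by the `(1 - p_t)^gamma` factor, so hard samples dominate the subcategory term, while the contrastive term only pulls together embeddings that share a parent category and leaves the subcategory structure to the classifier.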