Urban building recognition plays a central role in applications such as urban mapping, heritage documentation, autonomous navigation, and smart city monitoring. Although recent advances have been driven mainly by deep learning, classical visual pipelines remain an attractive alternative when datasets are limited, interpretability is required, or computational resources are constrained. This study presents a systematic evaluation of a Bag-of-Features (BoF) representation combined with a Support Vector Machine (SVM) classifier for urban building recognition on the Sheffield Building Image Dataset (SBID). The experimental protocol includes dataset balancing, a reproducible training–testing split, and an extensive investigation of visual vocabulary sizes ranging from 100 to 3000 visual words. The results indicate that increasing the vocabulary size generally improves recognition performance up to a saturation point, with the best trade-off between accuracy and cost achieved at 2000 visual words. Under this configuration, the proposed approach achieved an overall accuracy of 97.5% while keeping the average inference time below 25 ms per image, demonstrating competitive performance at low computational cost. A detailed analysis based on confusion matrices and per-class metrics (accuracy, precision, recall, and F1-score) shows that most building categories were recognized with high reliability, while misclassifications were concentrated mainly among visually similar façade types. These findings indicate that BoF representations, when properly tuned, remain highly effective for structured urban recognition tasks. Moreover, the results are consistent with those commonly reported in the literature for the same dataset and problem domain, which supports the robustness of the proposed pipeline.
Overall, the results highlight the continued relevance of classical computer vision methods in contexts where transparency, reproducibility, and efficiency are essential. Future work will investigate hybrid strategies that combine BoF representations with deep convolutional descriptors, as well as more robust evaluation protocols, aiming to improve generalization across different building datasets and urban environments.
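For readers unfamiliar with the BoF representation evaluated above, the encoding step can be illustrated with a minimal sketch. It assumes hard assignment of each local descriptor to its nearest visual word under Euclidean distance, followed by L1 normalization of the word counts; the abstract does not state which descriptor, assignment rule, or normalization was used, so these choices, the function names `nearest_word` and `bof_histogram`, and the toy three-word vocabulary are illustrative assumptions rather than the evaluated implementation. In the reported configuration the vocabulary would contain 2000 words, and the resulting histogram would be the feature vector passed to the SVM.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_word(descriptor: Vector, vocabulary: Sequence[Vector]) -> int:
    """Return the index of the visual word closest to the descriptor.

    Closeness is measured by squared Euclidean distance (an assumption;
    the source does not specify the assignment rule).
    """
    best_index, best_dist = 0, math.inf
    for index, word in enumerate(vocabulary):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bof_histogram(descriptors: Sequence[Vector],
                  vocabulary: Sequence[Vector]) -> List[float]:
    """Encode one image as an L1-normalized histogram of visual words."""
    counts = [0.0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1.0
    total = sum(counts)
    # An image with no descriptors keeps an all-zero histogram.
    return [c / total for c in counts] if total else counts


# Hypothetical toy data: three 2-D "visual words" and four local descriptors.
toy_vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
toy_descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.0, 6.0)]

histogram = bof_histogram(toy_descriptors, toy_vocabulary)
print(histogram)
```

In this toy case one descriptor falls on the first word, two on the second, and one on the third, so the histogram has one bin per visual word regardless of how many descriptors the image produced. This fixed length is what allows images with different numbers of local features to be compared by a standard classifier.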
Silva et al. (Fri,) studied this question.