February 2, 2026 | Open Access

Recognition of Urban Buildings in Challenging Images Using Bag of Features and SVM

Key Points

The study evaluates a classical visual pipeline for urban building recognition.
The pipeline combines a Bag-of-Features (BoF) representation with a Support Vector Machine (SVM) classifier.
Experiments were conducted on the Sheffield Building Image Dataset (SBID).
The protocol included dataset balancing and a reproducible training–testing split.
Visual vocabulary sizes from 100 to 3000 visual words were investigated.
The best trade-off, 2000 visual words, achieved an overall accuracy of 97.5%.
Average inference time stayed below 25 ms per image.
Most building categories were recognized with high reliability; misclassifications were concentrated among visually similar façade types.
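
To make the Bag-of-Features idea concrete, the following is a minimal sketch of the encoding step only: each local descriptor is assigned to its nearest visual word, and the image becomes a normalized histogram of word counts. The function names `nearest_word` and `encode_bof`, the toy 2-D descriptors, and the three-word vocabulary are illustrative assumptions. The summary does not state which descriptor, clustering method, or SVM settings the authors used, so none of those are reproduced here.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Return the index of the visual word closest to the descriptor (Euclidean distance)."""
    best_index, best_dist = 0, math.inf
    for index, word in enumerate(vocabulary):
        dist = math.dist(descriptor, word)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def encode_bof(descriptors, vocabulary):
    """Encode one image as an L1-normalized histogram over the visual vocabulary."""
    histogram = [0.0] * len(vocabulary)
    for descriptor in descriptors:
        histogram[nearest_word(descriptor, vocabulary)] += 1.0
    total = sum(histogram)
    if total == 0:
        # No descriptors were extracted: return the all-zero histogram.
        return histogram
    return [count / total for count in histogram]


# Hypothetical toy data: a 3-word vocabulary and 4 local descriptors from one image.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
histogram = encode_bof(descriptors, vocabulary)
```

In a full pipeline, the vocabulary would be learned by clustering descriptors from the training images (for example, 2000 cluster centers for the best configuration reported above), and the resulting histograms would be the feature vectors passed to the SVM.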

Abstract

Urban building recognition plays a central role in applications such as urban mapping, heritage documentation, autonomous navigation, and smart city monitoring. Although recent advances have been driven mainly by deep learning approaches, classical visual pipelines remain an attractive alternative in scenarios where datasets are limited, interpretability is required, and computational resources are constrained. In this study, a systematic evaluation of a Bag-of-Features (BoF) representation combined with a Support Vector Machine (SVM) classifier is presented for urban building recognition using the Sheffield Building Image Dataset (SBID). The experimental protocol includes dataset balancing, a reproducible training–testing split, and an extensive investigation of visual vocabulary sizes ranging from 100 to 3000 visual words. The results indicate that increasing the vocabulary size generally improves recognition performance up to a saturation point, with the best trade-off achieved using 2000 visual words. Under this configuration, the proposed approach achieved an overall accuracy of 97.5% while maintaining an average inference time below 25 ms per image, demonstrating competitive performance with low computational cost. A detailed analysis based on confusion matrices and per-class metrics (accuracy, precision, recall, and F1-score) shows that most building categories were recognized with high reliability, while misclassifications were mainly concentrated among visually similar façade types. These findings confirm that BoF representations, when properly tuned, remain highly effective for structured urban recognition tasks. Moreover, the obtained results are consistent with those commonly reported in the literature for the same dataset and problem domain, reinforcing the robustness of the proposed pipeline. 
Overall, the results highlight the continued relevance of classical computer vision methods in contexts where transparency, reproducibility, and efficiency are essential. Future work will investigate hybrid strategies that combine BoF representations with deep convolutional descriptors, as well as more robust evaluation protocols, aiming to improve generalization across different building datasets and urban environments.
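
The per-class metrics named in the abstract (accuracy, precision, recall, and F1-score) can all be derived from a confusion matrix. The sketch below assumes a one-vs-rest definition for each class, which is a common convention but is not confirmed by the abstract. The function names and the 3-class matrix are hypothetical and are not the paper's results.

```python
def per_class_metrics(confusion):
    """Compute one-vs-rest metrics for each class.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                         # true class k, predicted otherwise
        fp = sum(confusion[i][k] for i in range(n)) - tp    # predicted k, true class differs
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        accuracy = (tp + tn) / total if total else 0.0
        metrics.append({"accuracy": accuracy, "precision": precision,
                        "recall": recall, "f1": f1})
    return metrics


def overall_accuracy(confusion):
    """Fraction of all samples that lie on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total if total else 0.0


# Hypothetical confusion matrix: classes 0 and 1 are confused with each other,
# class 2 is always recognized correctly.
toy_confusion = [
    [9, 1, 0],
    [2, 8, 0],
    [0, 0, 10],
]
toy_metrics = per_class_metrics(toy_confusion)
toy_overall = overall_accuracy(toy_confusion)
```

In this toy matrix the errors fall entirely between classes 0 and 1, which mirrors the kind of pattern the abstract describes for visually similar façade types: overall accuracy stays high while precision and recall drop only for the confused pair.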
