Accurate dietary assessment increasingly depends on automated food recognition systems that operate reliably in real-world environments. While most vision-based models perform well on single-item datasets, their performance degrades significantly in complex multi-dish settings. This is particularly evident in Indian thalis, which contain overlapping food items with diverse textures and high visual variability. These challenges make large-scale multi-dish annotation expensive and limit the practical deployment of such systems. To address this gap, we propose a novel two-stage framework that recognizes multi-dish food images using only single-item training data. The pipeline performs class-agnostic segmentation with the Segment Anything Model (SAM), followed by classification with an SE-DenseNet121 network whose hyperparameters are tuned with Optuna. The model is trained exclusively on single-item annotated images and generalizes to multi-item thali images at inference time through a segmentation-classification mapping strategy. Because the segmentation stage is zero-shot, no multi-dish ground-truth annotations are needed, and the annotation complexity is reduced from O(N × M) to O(N). The proposed system achieves an accuracy of 97.48% on single-item food image classification and demonstrates strong applicability to multi-dish Indian thali images through region-wise inference on segmented food items. Furthermore, the framework is computationally efficient, achieving 2× faster inference with a latency of 1.58 ms while using roughly 70% fewer parameters than transformer-based baselines. It operates at low computational cost (2.90 GFLOPs), with significantly fewer parameters (8.06M compared to 26.69–86.77M), and delivers higher throughput (633.32 samples/s). These results demonstrate that the proposed method provides a scalable and practical solution for real-time dietary assessment applications.
Garisa et al. studied this question.
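To make the segmentation-classification mapping described in the abstract concrete, the following is a minimal sketch of region-wise inference, not the authors' implementation. It assumes a segmenter that returns class-agnostic binary masks and a classifier trained only on single-item images that returns a label with a confidence. All names here (`recognize_dishes`, `RegionPrediction`, `toy_segment`, `toy_classify`, `min_area`, `min_confidence`) are hypothetical, and the toy stubs merely stand in for SAM and the SE-DenseNet121 network so that the example runs with the standard library alone.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]    # toy grayscale image: rows of pixel intensities
Mask = List[List[bool]]    # binary mask with the same shape as the image
BBox = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


@dataclass
class RegionPrediction:
    bbox: BBox
    label: str
    confidence: float


def mask_bbox(mask: Mask) -> BBox:
    """Tight bounding box around the True pixels of a non-empty mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    return min(rows), min(cols), max(rows) + 1, max(cols) + 1


def crop(image: Image, bbox: BBox) -> Image:
    top, left, bottom, right = bbox
    return [row[left:right] for row in image[top:bottom]]


def recognize_dishes(
    image: Image,
    segment: Callable[[Image], List[Mask]],
    classify: Callable[[Image], Tuple[str, float]],
    min_area: int = 1,            # assumed filter for tiny spurious regions
    min_confidence: float = 0.0,  # assumed filter for uncertain regions
) -> List[RegionPrediction]:
    """Stage 1: class-agnostic masks. Stage 2: classify each cropped region."""
    predictions = []
    for mask in segment(image):
        area = sum(v for row in mask for v in row)
        if area < min_area:
            continue
        bbox = mask_bbox(mask)
        label, confidence = classify(crop(image, bbox))
        if confidence >= min_confidence:
            predictions.append(RegionPrediction(bbox, label, confidence))
    return predictions


# --- Toy stand-ins (assumptions, for illustration only) ---
def toy_segment(image: Image) -> List[Mask]:
    """Pretend segmenter: one mask per distinct non-zero intensity."""
    values = sorted({p for row in image for p in row if p > 0})
    return [[[p == v for p in row] for row in image] for v in values]


def toy_classify(region: Image) -> Tuple[str, float]:
    """Pretend single-item classifier: thresholds the mean intensity."""
    pixels = [p for row in region for p in row]
    mean = sum(pixels) / len(pixels)
    return ("rice", 0.9) if mean > 100 else ("dal", 0.8)


TOY_THALI: Image = [
    [200, 200, 0, 0, 50, 50],
    [200, 200, 0, 0, 50, 50],
    [0, 0, 0, 0, 0, 0],
]

if __name__ == "__main__":
    for pred in recognize_dishes(TOY_THALI, toy_segment, toy_classify):
        print(pred)
```

The design point this sketch illustrates is that the classifier only ever sees one cropped region at a time, so it can be trained on single-item images alone. In a real system, `segment` would wrap a SAM mask generator and `classify` would wrap the trained network; those wrappers are left out here because their details are not given in the abstract.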