What question did this study set out to answer?

The study aims to develop a system for recognizing multi-dish food images using single-item training data.

March 13, 2026 | Open Access

Single-item training for multi-dish recognition: a class-agnostic framework for Indian food platters

Key Points

The study aims to develop a system for recognizing multi-dish food images using single-item training data.
Proposed a two-stage framework for multi-dish food recognition.
Utilized class-agnostic segmentation with the Segment Anything Model (SAM).
Employed an SE-DenseNet121 network for classification, optimized via Optuna-based hyperparameter tuning.
Trained exclusively on single-item annotated images, generalizing to multi-item thalis at inference time.
Implemented zero-shot segmentation to avoid multi-dish ground-truth annotations.
Achieved 97.48% accuracy in single-item food image classification.
Demonstrated efficient multi-dish recognition through region-wise inference.
Reduced annotation complexity from O(N × M) to O(N).
Achieved 2× faster inference at 1.58 ms latency with fewer parameters than transformer-based baselines.
Delivered lower computational cost (2.90 GFLOPs) and higher throughput (633.32 samples/s).
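
The two-stage, region-wise inference listed in the key points can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_segment` stands in for SAM's class-agnostic mask generation, `toy_classify` stands in for the SE-DenseNet121 classifier, and all function names, the confidence threshold, and the dish labels are hypothetical. Only the control flow (segment, crop each region, classify each crop independently) reflects the described approach.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Mask = List[List[bool]]          # same shape as the image; True marks the region
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def mask_to_box(mask: Mask) -> Box:
    """Tight bounding box around the True pixels of a mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        raise ValueError("empty mask")
    return min(rows), min(cols), max(rows) + 1, max(cols) + 1


def crop_region(image: Image, mask: Mask, fill: int = 0) -> Image:
    """Crop the image to the mask's box, blanking pixels outside the mask."""
    top, left, bottom, right = mask_to_box(mask)
    return [
        [image[r][c] if mask[r][c] else fill for c in range(left, right)]
        for r in range(top, bottom)
    ]


def recognize_platter(
    image: Image,
    segment: Callable[[Image], List[Mask]],
    classify: Callable[[Image], Tuple[str, float]],
    min_confidence: float = 0.5,  # assumed threshold, not from the paper
) -> List[Tuple[str, float, Box]]:
    """Stage 1: class-agnostic masks. Stage 2: classify each crop on its own."""
    results = []
    for mask in segment(image):
        label, confidence = classify(crop_region(image, mask))
        if confidence >= min_confidence:
            results.append((label, confidence, mask_to_box(mask)))
    return results


# --- Toy stand-ins (hypothetical; a real system would call SAM and a CNN) ---
def toy_segment(image: Image) -> List[Mask]:
    """One mask per distinct non-zero pixel value, in ascending value order."""
    values = sorted({v for row in image for v in row if v > 0})
    return [[[v == target for v in row] for row in image] for target in values]


def toy_classify(crop: Image) -> Tuple[str, float]:
    """Label a crop by mean intensity; labels and scores are placeholders."""
    pixels = [v for row in crop for v in row]
    mean = sum(pixels) / len(pixels)
    return ("rice", 0.9) if mean > 100 else ("dal", 0.8)


platter = [
    [200, 200, 0, 0, 0, 0],
    [200, 200, 0, 0, 0, 0],
    [0, 0, 0, 0, 50, 50],
    [0, 0, 0, 0, 50, 50],
]
results = recognize_platter(platter, toy_segment, toy_classify)
print(results)
```

Because the classifier only ever sees one cropped region at a time, it can be trained on single-item images alone; the segmenter needs no food-specific labels in this arrangement.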

Abstract

Accurate dietary assessment increasingly depends on automated food recognition systems that operate effectively in real-world environments. While most vision-based models perform well on single-item datasets, their performance degrades significantly in complex multi-dish settings. This is particularly evident in Indian thalis, which contain overlapping food items with diverse textures and high visual variability. These challenges make large-scale multi-dish annotation expensive and limit practical deployment of such systems. To address this gap, we propose a novel two-stage framework that enables recognition of multi-dish food images using only single-item training data. The pipeline applies class-agnostic segmentation using the Segment Anything Model (SAM), followed by classification with an SE-DenseNet121 network optimized via Optuna-based hyperparameter tuning. The model is trained exclusively on single-item annotated images and generalizes to multi-item thali images at inference time through a segmentation-classification mapping strategy. This zero-shot segmentation approach eliminates the need for multi-dish ground-truth annotations, reducing the annotation complexity from O(N × M) to O(N). The proposed system achieves an accuracy of 97.48% on single-item food image classification and demonstrates strong applicability to multi-dish Indian thali images through region-wise inference on segmented food items. Furthermore, the framework is computationally efficient, achieving 2× faster inference with a latency of 1.58 ms while using roughly 70% fewer parameters than the smallest transformer-based baseline (8.06M compared to 26.69–86.77M). It also operates with low computational cost (2.90 GFLOPs) and delivers higher throughput (633.32 samples/s). These results demonstrate that the proposed method provides a scalable and practical solution for real-time dietary assessment applications.
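
The efficiency figures quoted in the abstract can be cross-checked with simple arithmetic. The sketch below uses only the reported numbers; the variable names are illustrative, and treating per-sample time as the reciprocal of throughput is an assumption, since the abstract does not state how latency and throughput were measured.

```python
# Figures as reported in the abstract.
params_m = 8.06                     # proposed model, millions of parameters
baseline_params_m = (26.69, 86.77)  # smallest and largest transformer baselines
throughput = 633.32                 # samples per second


def reduction_pct(ours: float, baseline: float) -> float:
    """Percentage of parameters saved relative to a baseline."""
    return 100.0 * (1.0 - ours / baseline)


# Parameter savings against each end of the baseline range.
reductions = [round(reduction_pct(params_m, b), 1) for b in baseline_params_m]

# Assumed relation: time per sample = 1 / throughput (converted to ms).
ms_per_sample = round(1000.0 / throughput, 2)

print(reductions, ms_per_sample)
```

Under these assumptions, 8.06M parameters is about 30% of the smallest baseline's count (roughly 70% fewer) and about 9% of the largest (roughly 91% fewer), and the implied time per sample matches the reported 1.58 ms latency.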
