September 10, 2025 · Open Access

FFFNet: A Food Feature Fusion Model with Self-Supervised Clustering for Food Image Recognition

Key Points

The Food Feature Fusion Network (FFFNet) outperforms existing approaches on three food image recognition benchmarks.
In evaluations, FFFNet achieved Top-1/Top-5 accuracies of 65.31%/88.94% on the ISIA Food-500 dataset.
The model uses a multi-head cross-attention mechanism to fuse the local detail captured by Convolutional Neural Networks with the global context modeled by Vision Transformers.
Self-supervised clustering optimizes the feature space distribution, improving intra-class compactness and inter-class separability.
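
To make the fusion idea in the key points concrete, here is a minimal NumPy sketch of multi-head cross-attention between two token sets. It is not the authors' implementation: the choice of Vision Transformer tokens as queries and CNN feature-map tokens as keys/values, the token counts, the feature width, the random projection weights, and the residual connection are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(queries, context, w_q, w_k, w_v, w_o, num_heads):
    """Attend from `queries` (n_q, d) to `context` (n_c, d).

    Returns the fused tokens (n_q, d) and the attention weights
    (num_heads, n_q, n_c).
    """
    n_q, d = queries.shape
    n_c = context.shape[0]
    head_dim = d // num_heads

    # Project, then split the feature axis into heads: (heads, n, head_dim).
    q = (queries @ w_q).reshape(n_q, num_heads, head_dim).transpose(1, 0, 2)
    k = (context @ w_k).reshape(n_c, num_heads, head_dim).transpose(1, 0, 2)
    v = (context @ w_v).reshape(n_c, num_heads, head_dim).transpose(1, 0, 2)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    heads = weights @ v

    # Concatenate the heads and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(n_q, d)
    return merged @ w_o, weights


# Hypothetical shapes: 197 ViT tokens and a 7x7 CNN feature map, both width 64.
rng = np.random.default_rng(0)
d, num_heads = 64, 4
vit_tokens = rng.normal(size=(197, d))
cnn_tokens = rng.normal(size=(49, d))
w_q, w_k, w_v, w_o = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4))

attended, attn = multi_head_cross_attention(
    vit_tokens, cnn_tokens, w_q, w_k, w_v, w_o, num_heads
)
fused = vit_tokens + attended  # residual connection (an assumption)
```

Each row of `attn` sums to 1, so every global token receives a convex combination of the local CNN features. A trained model would learn the projection matrices rather than sample them at random.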

Abstract

With the growing emphasis on healthy eating and nutrition management in modern society, food image recognition has become increasingly important. However, it faces challenges such as large intra-class differences and high inter-class similarities. To tackle these issues, we present a Food Feature Fusion Network (FFFNet), which leverages a multi-head cross-attention mechanism to integrate the local detail-capturing capability of Convolutional Neural Networks with the global modeling capacity of Vision Transformers. This enables the model to capture key discriminative features when addressing such challenging food recognition tasks. FFFNet also introduces self-supervised clustering, generating pseudo-labels from the feature space distribution and employing a clustering objective derived from Kullback–Leibler divergence to optimize the feature space distribution. By maximizing similarity between features and their corresponding cluster centers, and minimizing similarity with non-corresponding centers, it promotes intra-class compactness and inter-class separability, thereby addressing the core challenges. We evaluated FFFNet across the ISIA Food-500, ETHZ Food-101, and UEC Food256 datasets, attaining Top-1/Top-5 accuracies of 65.31%/88.94%, 89.98%/98.37%, and 80.91%/94.92%, respectively, outperforming existing approaches.
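
The abstract does not give the exact form of the clustering objective, so the following NumPy sketch only illustrates one common way to build a KL-divergence clustering loss with pseudo-labels. The cosine-similarity softmax for soft assignments, the `temperature` value, the squared-and-renormalized target distribution (borrowed from Deep Embedded Clustering), and all shapes are assumptions, not details taken from FFFNet.

```python
import numpy as np


def soft_assignments(features, centers, temperature=0.1):
    """Soft cluster assignments Q from cosine similarity to each center."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    logits = f @ c.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    q = np.exp(logits)
    return q / q.sum(axis=1, keepdims=True)


def target_distribution(q):
    """Sharpened targets P: square Q, normalize by cluster frequency, renormalize."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)


def kl_clustering_loss(p, q, eps=1e-12):
    """Mean KL(P || Q) over samples."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))) / q.shape[0])


# Hypothetical data: 32 feature vectors of width 16 and 5 cluster centers.
rng = np.random.default_rng(0)
features = rng.normal(size=(32, 16))
centers = rng.normal(size=(5, 16))

q = soft_assignments(features, centers)
p = target_distribution(q)
pseudo_labels = q.argmax(axis=1)  # hard pseudo-label per sample
loss = kl_clustering_loss(p, q)
```

Minimizing this loss with respect to the features and centers pulls each feature toward the center it already favors and away from the others, which is the intra-class compactness and inter-class separability effect the abstract describes.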
