Self-reported dietary assessment has long been a limiting factor in advancing precision nutrition, owing to unrecorded eating episodes, recall bias and portion-size estimation errors (1). We developed a system that combines customised wearable cameras with reasoning-enabled large vision–language models (LVLMs) in a fully automated pipeline for scalable, objective dietary assessment. Beyond reducing user burden, passive capture can record brief or opportunistic eating episodes that are typically missed, while LVLMs improve food identification across heterogeneous contexts. However, the feasibility, privacy safeguards and quantitative performance of such systems remain underexplored. This study aims to evaluate the performance of the LVLM-enabled passive system in real-world deployments, focusing on its ability to accurately capture and analyse dietary intake. A feasibility study was conducted at two centres, Hammersmith Hospital and the University of Reading (2), where thirty UK participants wore customised cameras side-mounted on glasses (STM32 microcontroller; 128-GB SD card; rechargeable) throughout waking hours whilst consuming two highly controlled, standardised diets, one compliant with UK healthy eating guidelines and one not. Each diet was consumed over four study days, during which participants remained in the facility and consumed meals provided by the study team. Model outputs were benchmarked against a dietitian-verified reference menu and the weights of food portions consumed. A preprocessing pipeline was first applied to blur faces and screens in captured images for privacy protection.
The LVLM-based pipeline then performed three tasks: (i) extracting eating episodes; (ii) recognising food items across heterogeneous settings; and (iii) context-aware portion-size estimation, using cues from containers, utensils and hands to mitigate monocular visual scale ambiguity (3). Eating sessions lacking captured images were excluded from subsequent analyses. Passive capture from the 30 participants (Hammersmith, n=15; Reading, n=15) over eight study days yielded 2.08 million raw images at Hammersmith and 2.15 million at Reading. After privacy filtering and removal of redundant frames, 0.49% and 0.46% of images were retained at Hammersmith and Reading, respectively. Overall, food-item recall from passively captured imagery was 82% (95% CI 81–84%). Portion-size estimation showed a mean absolute error of 44.7 g (95% CI 42.2–47.3 g) for food items and 70.4 mL (95% CI 67.4–73.4 mL) for beverages against weighed consumed portions. This feasibility study provides foundational evidence for LVLM-enabled, passive, camera-based dietary monitoring and supports progression to real-world deployment. The results also support further multi-site validation, inclusion of metrics beyond recall (e.g., energy and macronutrient assessment), and assessment of performance across settings (home vs out-of-home) and subgroups, in order to capture nutrient intake at population level and enhance precision nutrition approaches.
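As a rough illustration of the quantities reported above, the following Python sketch computes the approximate number of retained images per site, food-item recall against a reference menu, and the mean absolute error of portion estimates with a percentile bootstrap confidence interval. It is a minimal sketch under stated assumptions: the abstract does not say how the confidence intervals were derived, so the bootstrap is an assumed choice, and all function names and the example food items and weights are hypothetical rather than study data.

```python
import random
from statistics import mean


def retained_images(raw_count: int, retained_pct: float) -> int:
    """Approximate number of images kept after privacy filtering and de-duplication."""
    return round(raw_count * retained_pct / 100)


def item_recall(reference: set, detected: set) -> float:
    """Fraction of reference-menu items that were also identified by the pipeline."""
    if not reference:
        raise ValueError("reference menu must not be empty")
    return len(reference & detected) / len(reference)


def absolute_errors(estimated, weighed):
    """Per-item absolute difference between estimated and weighed portions."""
    if len(estimated) != len(weighed) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    return [abs(e - w) for e, w in zip(estimated, weighed)]


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (an assumed method, not the study's)."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Retained-image counts implied by the rounded figures in the abstract.
hammersmith_kept = retained_images(2_080_000, 0.49)  # about 10,192 images
reading_kept = retained_images(2_150_000, 0.46)      # about 9,890 images

# Hypothetical menu and detections, for illustration only.
menu = {"porridge", "apple", "chicken curry", "rice"}
found = {"porridge", "apple", "rice", "naan"}
recall = item_recall(menu, found)

# Hypothetical portion estimates versus weighed portions, in grams.
errors = absolute_errors([120, 250, 80, 310], [150, 230, 95, 280])
mae = mean(errors)
ci_low, ci_high = bootstrap_ci(errors)
```

Because the raw image totals are given to three significant figures, the retained counts are approximations rather than exact values.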
Lo et al. (Fri,) studied this question.