Collecting large, diverse, and well-labeled datasets remains a persistent bottleneck in agricultural computer vision. This study explores whether video-based synthetic data, generated with the diffusion-transformer model Sora, can address this scarcity for strawberry leaf disease classification. A synthetic dataset of 1,467 images was curated by extracting frames from generated videos, with structured text prompts and reference images used to capture temporal variations in lighting and leaf morphology. This data was used to train six lightweight deep learning architectures (DenseNet-121, EfficientNet-B0, MobileNetV3-Small, ResNet-18, ShuffleNetV2, and Vision Transformer (ViT)-Tiny) using a feature extraction strategy. The models were evaluated on a held-out test set of 618 real-world images to assess synthetic-to-real generalization. ResNet-18 achieved the highest nominal performance, with accuracy, precision, recall, and F1-score all reaching 98.71%. Five-fold stratified cross-validation further confirmed the approach’s stability, with an average accuracy of 98.9%. Notably, McNemar’s test revealed no statistically significant performance difference (p > 0.05) between ResNet-18 and the considerably lighter MobileNetV3-Small. These findings indicate that video-derived synthetic data can effectively bridge the domain gap, enabling the training of robust, resource-efficient models suitable for deployment on edge devices in precision agriculture.
Adnan Miski studied this question.
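The paired comparison mentioned in the abstract (McNemar’s test on two classifiers evaluated on the same test images) can be illustrated with a minimal sketch. The code below implements the exact (binomial) two-sided form of the test using only the Python standard library; the abstract does not state which variant the study used, so this choice is an assumption. The per-image correctness vectors are hypothetical: the first model is given 8 errors out of 618 images, which is consistent with an accuracy of 98.71%, while the second model's error count and the overlap between the two are invented purely for illustration and are not results from the study.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test for two classifiers on the same samples.

    correct_a, correct_b: sequences of booleans, True where the respective
    model classified that sample correctly.
    Returns (b, c, p_value), where b counts samples only model A got right
    and c counts samples only model B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be evaluated on the same samples")

    # Only discordant pairs carry information about a difference.
    b = sum(1 for a, m in zip(correct_a, correct_b) if a and not m)
    c = sum(1 for a, m in zip(correct_a, correct_b) if not a and m)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree

    # Under the null hypothesis, b ~ Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical per-image outcomes on a 618-image test set (NOT study data).
n_test = 618
correct_a = [True] * n_test  # stand-in for the larger model
correct_b = [True] * n_test  # stand-in for the lighter model
for i in range(0, 8):        # model A misclassifies images 0..7 (8 errors)
    correct_a[i] = False
for i in range(3, 15):       # model B misclassifies images 3..14 (12 errors)
    correct_b[i] = False

b, c, p_value = mcnemar_exact(correct_a, correct_b)
print(f"A-only correct: {b}, B-only correct: {c}, p = {p_value:.5f}")
```

With these invented outcomes the two models disagree on 10 images (7 versus 3), and the resulting p-value is well above 0.05, so the test would not reject the hypothesis that the two models have the same error rate. Note that images both models get wrong or both get right do not affect the result.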