To improve grape target perception and picking-point positioning for intelligent harvesting robots, this study develops a vision-based method for orchard grape detection and harvesting-point localization. The method addresses missed detections, insufficient recognition accuracy, and poor peduncle segmentation caused by illumination variation, occlusion, and interference from branches and leaves in complex orchard scenes. For grape cluster and peduncle detection, a lightweight YOLOv7-derived model, termed YOLO-FES, was established. In this model, FasterNet and SCConv were introduced to refine the backbone and neck structures, and the EMA mechanism was incorporated, lowering parameter complexity and computational cost while improving detection performance. To associate each suspended grape cluster with its peduncle and extract that peduncle, the GJK algorithm was combined with nearest-neighbor rectangular discrimination, and an improved YOLACT-based peduncle segmentation network, named M-YOLACT, was constructed. Integrating the MLCA mechanism and the Mish activation function enabled accurate peduncle segmentation. In addition, a stereo depth camera was used to obtain the two-dimensional picking point and to recover its corresponding three-dimensional spatial coordinates. Experimental results showed that the mAP@0.5 of YOLO-FES for grape clusters and peduncles reached 95.37%. For grape peduncle segmentation, the mAP@0.5 values of the bounding boxes and masks produced by M-YOLACT reached 95.73% and 94.36%, respectively. The proposed method achieved an overall harvesting success rate of 89.2%, with an average time of 11 s per harvesting operation. By integrating deep-learning-based detection and segmentation with binocular-vision localization, this study provides a practical technical solution and a useful reference for the visual system design of grape-harvesting robots.
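The recovery of three-dimensional coordinates from a two-dimensional picking point can be illustrated with a minimal sketch. It assumes a standard pinhole camera model and a depth value already aligned to the color image; the abstract does not state which model or camera SDK the authors used. The names `Intrinsics` and `pixel_to_camera_xyz`, and all numeric values, are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Tuple


class Intrinsics(NamedTuple):
    """Pinhole camera intrinsics in pixels (hypothetical container)."""
    fx: float  # focal length along the image x axis
    fy: float  # focal length along the image y axis
    cx: float  # principal point, x coordinate
    cy: float  # principal point, y coordinate


def pixel_to_camera_xyz(
    u: float, v: float, depth_m: float, k: Intrinsics
) -> Tuple[float, float, float]:
    """Back-project pixel (u, v) with depth Z into camera-frame (X, Y, Z).

    Uses the pinhole relations X = (u - cx) * Z / fx and
    Y = (v - cy) * Z / fy. Lens distortion is ignored (assumption).
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive; the pixel may have no valid depth")
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


# Made-up intrinsics and picking point, for illustration only.
k = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
point = pixel_to_camera_xyz(u=380.0, v=210.0, depth_m=0.5, k=k)
print(point)
```

The result is expressed in the camera frame; a harvesting robot would still need a separate hand-eye calibration transform to express the point in the manipulator's base frame, which is outside this sketch.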
Lin et al. (Thu,) studied this question.