Foreign Object Debris (FOD) on airport pavements poses a serious threat to aviation safety, making accurate detection and interpretable scene understanding crucial for operational risk management. This paper presents an integrated multi-modal framework that combines an enhanced YOLOv7-X detector, a cascaded YOLO-SAM segmentation module, and a structured prompt engineering mechanism to generate detailed semantic descriptions of detected FOD. Detection performance is improved through the integration of Coordinate Attention, Spatial–Depth Conversion (SPD-Conv), and a Gaussian Similarity IoU (GSIoU) loss, leading to a 3.9% gain in mAP@0.5 for small objects with only a 1.7% increase in inference latency. The YOLO-SAM cascade leverages high-quality masks to guide structured prompt generation, which incorporates spatial encoding, material attributes, and operational risk cues, resulting in a substantial improvement in description accuracy from 76.0% to 91.3%. Extensive experiments on a dataset of 12,000 real airport images demonstrate competitive detection and segmentation performance compared to recent CNN- and transformer-based baselines while achieving robust semantic generalization in challenging scenarios, such as complete darkness, low-light, high-glare nighttime conditions, and rainy weather. A runtime breakdown shows that the enhanced YOLOv7-X requires 40.2 ms per image, SAM segmentation takes 142.5 ms, structured prompt construction adds 23.5 ms, and BLIP-2 description generation requires 178.6 ms, resulting in an end-to-end latency of 384.8 ms per image. Although this does not meet strict real-time video requirements, it is suitable for semi-real-time or edge-assisted asynchronous deployment, where detection robustness and semantic interpretability are prioritized over ultra-low latency. The proposed framework offers a practical, deployable solution for airport FOD monitoring, combining high-precision detection with context-aware description generation to support intelligent runway inspection and maintenance decision-making.
Cheng et al. (Mon,) studied this question.