This study presents a comprehensive, practice-oriented exploration of the full pipeline of deep learning-based image processing. We implement and compare image generation, captioning, segmentation, editing, and inpainting using state-of-the-art models and techniques, including Stable Diffusion, LoRA, ControlNet, InstructPix2Pix, CLIP, BLIP-2, SAM, and Mask2Former. Experiments are conducted in Python, and interactive web interfaces are built with Gradio and Streamlit for real-time user engagement. A mathematical analysis of core mechanisms, such as self-attention, optimization, and loss functions, is provided to support theoretical understanding. Model performance is assessed quantitatively with evaluation metrics such as BLEU, METEOR, and IoU. The study highlights the educational value of integrating theory with hands-on practice and proposes a project-based learning model suitable for higher education. It also discusses interdisciplinary applications, including human-centered AI, creative industries, and interactive systems design. The results demonstrate that combining different models produces synergistic effects in complex tasks, offering insights into building integrated AI systems. Future research directions include optimization for real-time applications, personalization of generation models, and the development of unified multimodal AI platforms. This work contributes to fostering creative problem-solving skills and to advancing human-centered AI education and research.
Choi et al. (Fri,) studied this question.
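As an illustration of the IoU metric named in the abstract, the following minimal sketch computes Intersection over Union for two binary segmentation masks. It is not the study's evaluation code: the function name `mask_iou`, the nested-list mask representation, and the convention that two empty masks score 1.0 are all assumptions made for this example.

```python
def mask_iou(pred, target):
    """Intersection over Union for two binary masks.

    Masks are equal-shaped nested lists of 0/1 values (an assumed,
    dependency-free representation chosen for this sketch).
    """
    if len(pred) != len(target) or any(
        len(row_p) != len(row_t) for row_p, row_t in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            p, t = bool(p), bool(t)
            intersection += p and t  # pixel set in both masks
            union += p or t          # pixel set in at least one mask

    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


# Hypothetical 2x3 masks: 2 overlapping pixels, 4 pixels in the union.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 0.5
```

In practice, masks produced by models such as SAM or Mask2Former would typically be array-valued, and the same ratio would be computed with vectorized operations; the nested-list form here only keeps the sketch self-contained.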