This comment re-examines the released dataset and codebase for 'Do Multimodal Large Language Models Understand Welding?', which evaluates GPT-4o and LLaVA-1.6 on weld acceptability across RV/Marine, Aeronautical, and Farming contexts and proposes WeldPrompt. The audit identifies two threats to validity. First, the construct is unstable: about a quarter of Aeronautical rejections are driven by an undisclosed process-classification rule rather than by visible-defect evidence, and Farming labels frequently invoke an imagined end use rather than apply a different inspection threshold. Second, the image pipeline caps inputs at 512 × 512 pixels and recompresses them as JPEG before inference. The reported gaps between models, datasets, and contexts are therefore consistent with these confounds, and no claim about MLLM welding capability is needed to account for them.
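To give a sense of scale for the second threat, the following minimal sketch estimates how much spatial detail survives a 512 × 512 cap. It assumes an aspect-ratio-preserving downscale that never upscales, which is an assumption and may not match the released pipeline. The function names and the 4032 × 3024 source resolution are hypothetical, chosen only for illustration, and JPEG recompression loss is not modeled.

```python
def capped_size(width, height, cap=512):
    """Size after shrinking so neither side exceeds `cap`.

    Assumption: aspect ratio is preserved and small images are not upscaled.
    """
    scale = min(1.0, cap / width, cap / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def pixel_retention(width, height, cap=512):
    """Fraction of the original pixel count that remains after the cap."""
    new_w, new_h = capped_size(width, height, cap)
    return (new_w * new_h) / (width * height)


def feature_px_after(feature_px, width, height, cap=512):
    """Approximate linear extent, in pixels, of a feature after the cap."""
    scale = min(1.0, cap / width, cap / height)
    return feature_px * scale


if __name__ == "__main__":
    # Hypothetical 4032 x 3024 photograph; not a figure from the dataset.
    w, h = 4032, 3024
    print(capped_size(w, h))
    print(f"{pixel_retention(w, h):.4f}")
    print(f"{feature_px_after(20, w, h):.2f}")
```

Under these assumptions, a feature spanning 20 pixels in the hypothetical original would span under 3 pixels after the cap, which illustrates why fine visible-defect evidence could plausibly be lost before a model sees the image.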
algburi (Sun,) studied this question.