What question did this study set out to answer?

This comment critiques the dataset and evaluation methods used to assess the welding capability of multimodal large language models.

May 26, 2026 · Open Access

A comment on 'Do Multimodal Large Language Models Understand Welding?'

Key Points

Audit of the released dataset and codebase used to evaluate GPT-4o and LLaVA-1.6.
Identification of threats to validity related to construct stability and image processing.
Analysis of discrepancies in model performance across different contexts.
About a quarter of Aeronautical rejections stem from an undisclosed process-classification rule, not visible-defect evidence.
Farming labels frequently invoke an imagined end-use rather than a different inspection threshold.
The image pipeline caps inputs at 512 x 512 and JPEG-recompresses them before inference, which may contribute to the observed performance gaps.

Abstract

This comment re-examines the released dataset and codebase for 'Do Multimodal Large Language Models Understand Welding?', which evaluates GPT-4o and LLaVA-1.6 on weld acceptability across RV/Marine, Aeronautical, and Farming contexts and proposes WeldPrompt. The audit identifies two threats to validity. First, the construct is unstable: about a quarter of Aeronautical rejections are driven by an undisclosed process-classification rule rather than visible-defect evidence, and Farming labels frequently invoke imagined end-use rather than a different inspection threshold. Second, the image pipeline caps inputs at 512 x 512 and JPEG-recompresses them before inference. The reported gaps between models, datasets, and contexts are therefore consistent with these confounds before any claim about MLLM welding capability needs to be invoked.
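To give a rough sense of how much detail a 512 x 512 cap can discard, the sketch below computes the output size and the fraction of pixels retained. It assumes the cap preserves aspect ratio and leaves smaller images untouched. The abstract does not state how the pipeline resizes, so this is an illustration rather than the study's code. The function names and the 4032 x 3024 source size are hypothetical.

```python
def capped_size(width: int, height: int, cap: int = 512) -> tuple:
    """Return (width, height) after shrinking so the longer side is at most `cap`.

    Assumption: aspect ratio is preserved and images already within the cap
    are left unchanged. The audited pipeline may resize differently.
    """
    scale = min(1.0, cap / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def retained_pixel_fraction(width: int, height: int, cap: int = 512) -> float:
    """Fraction of the original pixel count that survives the cap."""
    new_w, new_h = capped_size(width, height, cap)
    return (new_w * new_h) / (width * height)


if __name__ == "__main__":
    # Hypothetical source photo size; not a figure taken from the dataset.
    w, h = 4032, 3024
    print(capped_size(w, h))
    print(f"{retained_pixel_fraction(w, h):.4f}")
```

Under these assumptions, a 4032 x 3024 photo keeps under 2% of its pixels, and that is before any loss from JPEG recompression. This makes it plausible that fine weld-surface detail is unavailable to either model.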
