Abstract
We present a domain-tailored verification framework for evaluating the scientific quality of AI-generated synthesis protocols, moving beyond generic NLP benchmarks that fail to capture chemistry-specific requirements. Our approach combines two quantitative metrics: a framework score that assesses the logical coherence of the synthesis pathway, and a weighted detail score that measures the precision of reported experimental parameters.

Scientific Contribution
This work establishes a benchmark for automated protocol generation and quantifies the gap between conceptual feasibility and parametric exactness in LLM outputs. We use a carefully curated SAC dataset as a testbed to fine-tune mainstream open-source LLMs. The benchmark can be generalized to material synthesis protocols.
Aobo Zhang (Sun,) studied this question.