Abstract: This study evaluates how well six state-of-the-art Large Language Models (LLMs), namely Perplexity AI, Claude Sonnet 4.5, Gemini 2.5 Pro, ChatGPT (GPT-5), DeepSeek-V3.2-Exp, and Llama-4-Maverick, generate production-quality Python code with comprehensive unit tests.
Medlen et al. studied this question.