What question did this study set out to answer?

This evaluation aims to assess the performance of several LLMs in generating unit tests for Python code.

March 24, 2026 | Open Access

Leveraging Large Language Models (LLM) for Python Unit Test

Key Points

Evaluated six advanced Large Language Models (LLMs) for their code generation capabilities.
Tested each model’s ability to produce production-quality Python code.
Analyzed the comprehensiveness of unit tests generated alongside the code.
The evaluated LLMs differed in their ability to generate Python code.
Certain LLMs outperformed others in generating comprehensive unit tests.
Quality of code and tests varied, indicating a need for careful model selection.

Abstract

This study evaluates the ability of six state-of-the-art Large Language Models (LLMs), namely Perplexity AI, Claude Sonnet 4.5, Gemini 2.5 Pro, ChatGPT (GPT-5), DeepSeek-V3.2-Exp, and Llama-4-Maverick, to generate production-quality Python code with comprehensive unit tests.
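
To make "comprehensive unit tests" concrete, the following is a minimal illustrative sketch and is not taken from the study. The function `parse_percentage` and the test class are hypothetical names invented for this example. It assumes that comprehensiveness means covering typical inputs, boundary values, and error paths, which may differ from the criteria the study actually applied.

```python
import unittest


def parse_percentage(text: str) -> float:
    """Hypothetical function under test: convert a string like '42%' to 0.42."""
    cleaned = text.strip()
    if not cleaned.endswith("%"):
        raise ValueError(f"missing percent sign: {text!r}")
    # float() raises ValueError on non-numeric input, which the tests rely on.
    return float(cleaned[:-1]) / 100.0


class ParsePercentageTests(unittest.TestCase):
    """Illustrative suite covering typical, boundary, and error cases."""

    def test_typical_value(self):
        self.assertAlmostEqual(parse_percentage("42%"), 0.42)

    def test_surrounding_whitespace(self):
        self.assertAlmostEqual(parse_percentage("  7.5% "), 0.075)

    def test_zero_boundary(self):
        self.assertEqual(parse_percentage("0%"), 0.0)

    def test_missing_percent_sign_raises(self):
        with self.assertRaises(ValueError):
            parse_percentage("42")

    def test_non_numeric_raises(self):
        with self.assertRaises(ValueError):
            parse_percentage("abc%")
```

Saved as a module, the suite can be run with `python -m unittest`. A test set that contained only `test_typical_value` would still pass, but it would leave the whitespace handling and both error paths unchecked, which is the kind of gap a comprehensiveness analysis would likely look for.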
