What question did this study set out to answer?

The aim is to evaluate the effectiveness of Large Language Models in generating security tests that can detect code vulnerabilities.

May 8, 2026 · Open Access

Mars-2030/secmutbench1: v 1.0

Key Points

Evaluates how effectively Large Language Models (LLMs) generate security tests that detect code vulnerabilities.
Uses mutation testing to assess test quality, measured by the ability to detect (kill) security-relevant mutants.
Generates security tests from secure source code and a Common Weakness Enumeration (CWE) category.
Classifies mutant kills in three layers (semantic, incidental, crash).
Reports that LLM-generated tests detected vulnerabilities across a wide range of mutants.
Reports a significant detection rate across the mapped CWE categories.
Outlines the contribution of each security mutation operator to detection.

Abstract

SecMutBench: Evaluating LLM-Generated Security Tests via Mutation-Based Vulnerability Detection

SecMutBench evaluates whether Large Language Models (LLMs) can generate effective security tests that detect vulnerabilities in code. Unlike existing benchmarks that assess secure code generation, SecMutBench focuses on security test generation evaluated through mutation testing.

Given secure source code and a CWE category, an LLM generates security tests. These tests are then evaluated by running them against mutants, which are code variants with injected vulnerabilities. Tests that detect (kill) mutants demonstrate genuine security awareness.

Key Features

Security-Focused: 339 samples mapped to 30 Common Weakness Enumeration (CWE) vulnerability types
Mutation Testing Evaluation: Test quality measured by ability to kill security-relevant mutants
25 Security Mutation Operators: Custom operators that inject realistic vulnerability patterns with multiple variants
Pre-generated Mutants: 1,869 deterministic mutants (avg 5.5/sample) for reproducible evaluation
Kill Classification: Three-layer classification (semantic, incidental, crash) with mock-state observability
Multi-Modal Evaluation: Combines mutation testing with LLM-as-judge metrics
CWE-Strict Operator Mapping: Operators only fire on samples matching their target CWEs (no cross-contamination)
Multi-Source Dataset: Samples from SecMutBench originals, CWEval, SecurityEval, and LLM-generated semantic variations
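
The evaluation loop described above can be illustrated with a minimal sketch. It is not SecMutBench's code or API: every name here (`resolve_secure`, `mutant_no_check`, `mutant_prefix_only`, `make_test`, `mutation_score`) is hypothetical, the mutants are written by hand rather than produced by the benchmark's operators, and CWE-22 (path traversal) is chosen only as a familiar weakness, not because the record says it is among the 30 mapped CWEs. The three-layer kill classification is not modeled; a mutant simply counts as killed when a test that passes on the secure code fails on the mutant.

```python
import posixpath
from typing import Callable, List

BASE_DIR = "/srv/files"  # hypothetical upload root
Resolver = Callable[[str], str]


def resolve_secure(filename: str) -> str:
    """Secure reference: reject any path that escapes BASE_DIR."""
    full = posixpath.normpath(posixpath.join(BASE_DIR, filename))
    if not full.startswith(BASE_DIR + "/"):
        raise ValueError("path traversal detected")
    return full


def mutant_no_check(filename: str) -> str:
    """Mutant 1: the containment check is removed entirely."""
    return posixpath.normpath(posixpath.join(BASE_DIR, filename))


def mutant_prefix_only(filename: str) -> str:
    """Mutant 2: the check drops the trailing '/', so sibling
    directories such as /srv/files_evil are accepted."""
    full = posixpath.normpath(posixpath.join(BASE_DIR, filename))
    if not full.startswith(BASE_DIR):
        raise ValueError("path traversal detected")
    return full


def make_test(payloads: List[str]) -> Callable[[Resolver], None]:
    """Build a security test that expects every payload to be rejected."""
    def test(resolve: Resolver) -> None:
        # Benign input must keep working, otherwise the test is useless.
        assert resolve("report.txt") == "/srv/files/report.txt"
        for payload in payloads:
            try:
                resolve(payload)
            except ValueError:
                continue  # rejected, as a secure implementation should
            raise AssertionError(f"accepted malicious path {payload!r}")
    return test


def passes(test: Callable[[Resolver], None], impl: Resolver) -> bool:
    """Return True if the test raises no AssertionError on impl."""
    try:
        test(impl)
        return True
    except AssertionError:
        return False


def mutation_score(test, original: Resolver, mutants: List[Resolver]) -> float:
    """Fraction of mutants killed by a test that passes on the original."""
    if not passes(test, original):
        raise RuntimeError("test must pass on the secure original")
    killed = sum(1 for m in mutants if not passes(test, m))
    return killed / len(mutants)


mutants = [mutant_no_check, mutant_prefix_only]
basic = make_test(["../../etc/passwd"])
thorough = make_test(["../../etc/passwd", "../files_evil/x"])
print(mutation_score(basic, resolve_secure, mutants))
print(mutation_score(thorough, resolve_secure, mutants))
```

In this toy setup the single-payload test kills only the mutant with no check, while the two-payload test also kills the prefix-only mutant. That difference is the kind of signal a mutation score is meant to expose: a test can pass on secure code and still miss a realistic vulnerability variant.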
