SecMutBench: Evaluating LLM-Generated Security Tests via Mutation-Based Vulnerability Detection

SecMutBench evaluates whether Large Language Models (LLMs) can generate effective security tests that detect vulnerabilities in code. Unlike existing benchmarks that assess secure code generation, SecMutBench focuses on security test generation evaluated through mutation testing.

Given secure source code and a CWE category, an LLM generates security tests. These tests are then run against mutants: code variants with injected vulnerabilities. Tests that detect (kill) mutants demonstrate genuine security awareness.

Key Features

- Security-Focused: 339 samples mapped to 30 Common Weakness Enumeration (CWE) vulnerability types
- Mutation Testing Evaluation: Test quality measured by ability to kill security-relevant mutants
- 25 Security Mutation Operators: Custom operators that inject realistic vulnerability patterns with multiple variants
- Pre-generated Mutants: 1,869 deterministic mutants (avg 5.5/sample) for reproducible evaluation
- Kill Classification: Three-layer classification (semantic, incidental, crash) with mock-state observability
- Multi-Modal Evaluation: Combines mutation testing with LLM-as-judge metrics
- CWE-Strict Operator Mapping: Operators only fire on samples matching their target CWEs (no cross-contamination)
- Multi-Source Dataset: Samples from SecMutBench originals, CWEval, SecurityEval, and LLM-generated semantic variations

Almutairi et al. studied this question.
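
The mutation-based evaluation described above can be illustrated with a small sketch. This is not SecMutBench's actual harness or API: every name here (`find_user_secure`, `find_user_mutant`, `security_test`, `weak_test`, `mutation_score`) is hypothetical, and the SQL injection mutant (CWE-89) is a hand-written stand-in for what a mutation operator might produce. The sketch also collapses the benchmark's three kill classes into a single "killed" outcome.

```python
import sqlite3
from typing import Callable, List

# Hypothetical type: a lookup function under test.
Lookup = Callable[[sqlite3.Connection, str], list]


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database for each test run."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a1"), ("bob", "b2")],
    )
    return conn


def find_user_secure(conn: sqlite3.Connection, name: str) -> list:
    """Secure original: the user input is bound as a query parameter."""
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


def find_user_mutant(conn: sqlite3.Connection, name: str) -> list:
    """Mutant: the input is concatenated into the SQL string (CWE-89)."""
    return conn.execute(
        "SELECT name FROM users WHERE name = '" + name + "'"
    ).fetchall()


def security_test(find_user: Lookup) -> bool:
    """Pass (True) only if an injection payload matches no rows."""
    rows = find_user(make_db(), "x' OR '1'='1")
    return rows == []


def weak_test(find_user: Lookup) -> bool:
    """Functional check only; it never probes the injection path."""
    return find_user(make_db(), "alice") == [("alice",)]


def mutation_score(
    test: Callable[[Lookup], bool], original: Lookup, mutants: List[Lookup]
) -> float:
    """Fraction of mutants on which the test fails (is 'killed')."""
    if not test(original):
        raise ValueError("test must pass on the secure original")
    killed = 0
    for mutant in mutants:
        try:
            passed = test(mutant)
        except Exception:
            passed = False  # a crash also counts as a kill in this sketch
        if not passed:
            killed += 1
    return killed / len(mutants)


if __name__ == "__main__":
    mutants = [find_user_mutant]
    print(mutation_score(security_test, find_user_secure, mutants))  # 1.0
    print(mutation_score(weak_test, find_user_secure, mutants))      # 0.0
```

The security-aware test kills the mutant because the payload returns every row once the query is built by concatenation, while the purely functional test passes on both versions and scores zero. That gap is what a mutation score is meant to expose.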