SecMutBench: Evaluating LLM-Generated Security Tests via Mutation-Based Vulnerability Detection

SecMutBench evaluates whether Large Language Models (LLMs) can generate effective security tests that detect vulnerabilities in code. Unlike existing benchmarks that assess secure code generation, SecMutBench focuses on security test generation evaluated through mutation testing.

Given secure source code and a CWE category, an LLM generates security tests. These tests are then evaluated by running them against mutants: code variants with injected vulnerabilities. Tests that detect (kill) mutants demonstrate genuine security awareness.

Key Features

Security-Focused: 339 samples mapped to 30 Common Weakness Enumeration (CWE) vulnerability types
Mutation Testing Evaluation: Test quality measured by ability to kill security-relevant mutants
25 Security Mutation Operators: Custom operators that inject realistic vulnerability patterns with multiple variants
Pre-generated Mutants: 1,869 deterministic mutants (avg 5.5/sample) for reproducible evaluation
Kill Classification: Three-layer classification (semantic, incidental, crash) with mock-state observability
Multi-Modal Evaluation: Combines mutation testing with LLM-as-judge metrics
CWE-Strict Operator Mapping: Operators only fire on samples matching their target CWEs (no cross-contamination)
Multi-Source Dataset: Samples from SecMutBench originals, CWEval, SecurityEval, and LLM-generated semantic variations
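
The mutation-based evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not SecMutBench's actual harness or API: the function names, the base directory, and the path-traversal (CWE-22) scenario are all assumptions chosen for illustration. The idea is that a security test must pass on the secure original, and a mutant counts as killed when the same test fails on it. The three-layer kill classification mentioned above is not modeled here.

```python
import posixpath
from typing import Callable, Sequence

BASE_DIR = "/srv/files"  # hypothetical base directory, for illustration only

Resolver = Callable[[str], str]


def resolve_secure(name: str) -> str:
    """Secure original: reject any path that escapes BASE_DIR."""
    path = posixpath.normpath(posixpath.join(BASE_DIR, name))
    if not path.startswith(BASE_DIR + "/"):
        raise ValueError("path traversal blocked")
    return path


def resolve_mutant_no_check(name: str) -> str:
    """Hypothetical mutant: the containment check is removed entirely."""
    return posixpath.normpath(posixpath.join(BASE_DIR, name))


def resolve_mutant_weak_check(name: str) -> str:
    """Hypothetical mutant: prefix check without the trailing separator,
    so a sibling such as /srv/files_evil would be accepted."""
    path = posixpath.normpath(posixpath.join(BASE_DIR, name))
    if not path.startswith(BASE_DIR):
        raise ValueError("path traversal blocked")
    return path


def security_test(resolve: Resolver) -> bool:
    """Stand-in for an LLM-generated test: True if traversal is rejected."""
    try:
        resolve("../../etc/passwd")
    except ValueError:
        return True
    return False


def mutation_score(test: Callable[[Resolver], bool],
                   original: Resolver,
                   mutants: Sequence[Resolver]) -> float:
    """Fraction of mutants on which the test fails (i.e. mutants killed)."""
    if not test(original):
        raise RuntimeError("test must pass on the secure original")
    killed = sum(1 for mutant in mutants if not test(mutant))
    return killed / len(mutants)


score = mutation_score(
    security_test,
    resolve_secure,
    [resolve_mutant_no_check, resolve_mutant_weak_check],
)
print(score)
```

In this sketch the test kills the first mutant but not the second, because its single payload never exercises the weakened prefix check. A surviving mutant of this kind is the signal that mutation testing is meant to expose: the test looks security-aware but misses a realistic vulnerability variant.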
Almutairi et al. (Wed,) studied this question.
synapsesocial.com/papers/69fd7e79bfa21ec5bbf06adf — DOI: https://doi.org/10.5281/zenodo.20045642
Mariam Almutairi
Virginia Tech
Chang‐Tien Lu
Virginia Tech
Virginia Tech