Fuzzing has gained widespread adoption for discovering security vulnerabilities, yet questions remain about its effectiveness in detecting subtle behavioral faults. This paper presents an empirical evaluation at the intersection of fuzzing and mutation testing, examining how well fuzz targets perform when evaluated through mutation analysis. We conducted a systematic study using Bitcoin Core as our subject system, analyzing 10 fuzz targets across various modules and evaluating their ability to detect 726 generated mutants. Our methodology involved executing the fuzz targets with existing seed corpora and measuring mutation scores both with and without assertion statements, in order to understand the role of explicit oracles in fault detection. Contrary to previous studies suggesting that fuzzing has limited effectiveness in mutation testing, several fuzz targets achieved high mutation scores, with two targets reaching 100% mutant detection rates. We identified three key design patterns that significantly enhance mutant detection: (1) round-trip testing approaches that verify data integrity through serialization-deserialization cycles, (2) mathematical oracles that implement exact behavioral verification through redundant calculations, and (3) metamorphic relations that validate expected relationships between inputs and outputs. Our analysis demonstrates a positive correlation between assertion density in fuzz targets and mutation scores, with assertion removal causing substantial drops in detection rates across all targets. The study contributes empirical evidence that well-designed fuzz targets can detect subtle behavioral faults beyond traditional crash-based vulnerabilities. Our results suggest that incorporating explicit oracles, metamorphic properties, and round-trip verification mechanisms into fuzz target design significantly improves mutation testing performance. These findings have practical implications for improving fuzzing methodologies and developing more comprehensive automated testing strategies for safety-critical software systems.
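To make the three design patterns concrete, the following is a minimal sketch of a libFuzzer-style target that combines them. It is not taken from Bitcoin Core: `EncodeHex`, `DecodeHex`, and `SaturatingAdd` are hypothetical toy functions standing in for the code under test, and only the `LLVMFuzzerTestOneInput` entry point follows the standard libFuzzer convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical code under test: a toy hex codec (illustration only).
std::string EncodeHex(const std::vector<uint8_t>& bytes)
{
    static constexpr char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (uint8_t b : bytes) {
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0f]);
    }
    return out;
}

std::optional<std::vector<uint8_t>> DecodeHex(const std::string& hex)
{
    if (hex.size() % 2 != 0) return std::nullopt;
    auto nibble = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1; // not a lowercase hex digit
    };
    std::vector<uint8_t> out;
    for (size_t i = 0; i < hex.size(); i += 2) {
        const int hi = nibble(hex[i]);
        const int lo = nibble(hex[i + 1]);
        if (hi < 0 || lo < 0) return std::nullopt;
        out.push_back(static_cast<uint8_t>((hi << 4) | lo));
    }
    return out;
}

// Hypothetical code under test: addition that clamps at UINT32_MAX.
uint32_t SaturatingAdd(uint32_t a, uint32_t b)
{
    const uint32_t sum = a + b; // unsigned arithmetic wraps on overflow
    return sum < a ? UINT32_MAX : sum;
}

// libFuzzer-style entry point combining the three oracle patterns.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size)
{
    const std::vector<uint8_t> input(data, data + size);

    // (1) Round-trip: decoding the encoding must reproduce the input.
    const auto decoded = DecodeHex(EncodeHex(input));
    assert(decoded.has_value() && *decoded == input);

    if (size >= 8) {
        // Build two 32-bit operands from the first eight input bytes.
        uint32_t a = 0, b = 0;
        for (int i = 0; i < 4; ++i) {
            a = (a << 8) | data[i];
            b = (b << 8) | data[4 + i];
        }

        // (2) Mathematical oracle: recompute the result in 64-bit arithmetic.
        const uint64_t wide = uint64_t{a} + uint64_t{b};
        const uint32_t expected =
            wide > UINT32_MAX ? UINT32_MAX : static_cast<uint32_t>(wide);
        assert(SaturatingAdd(a, b) == expected);

        // (3) Metamorphic relation: swapping operands must not change the result.
        assert(SaturatingAdd(a, b) == SaturatingAdd(b, a));
    }
    return 0;
}
```

Without the three `assert` calls, this target could only flag a mutant that crashes or triggers a sanitizer report. With them, a mutant that, for example, flips `sum < a` to `sum > a` in `SaturatingAdd` is caught by the redundant 64-bit calculation. This is the kind of effect the with-and-without-assertions comparison is designed to measure.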
Garcia et al. (Wed,) studied this question.