Audio forgery has emerged as a significant security and forensic challenge, driven by rapid advances in generative artificial intelligence and the widespread availability of audio editing tools, which enable the creation of highly realistic manipulated speech with minimal technical expertise. Existing approaches predominantly operate at the file level, providing only coarse binary decisions without identifying when or where manipulation occurs. This study addresses fine-grained temporal localization of manipulations through a unified frame-level framework. We introduce a controlled forgery generation framework derived from the TIMIT speech corpus, applying atomic, localized manipulations under strict temporal constraints and producing precise frame-level annotations across diverse manipulation types. Building on this dataset, we propose a transform-agnostic, localization-driven detection approach based on temporal inconsistency modeling, enabling unified analysis across heterogeneous manipulations at frame-level resolution. To analyze forensic evidence, we present an evidence-stratified modeling paradigm comparing three complementary strategies: a handcrafted anomaly-based method, a deep localization model leveraging pretrained wav2vec 2.0 representations, and a hybrid approach combining both through confidence-aware fusion and temporal consistency reinforcement. A systematic experimental analysis evaluates the effects of representation adaptation, hybrid fusion, and manipulation type on detection and localization performance. Results show that handcrafted features are insufficient for reliable frame-level localization, while task-adapted wav2vec 2.0 achieves strong and consistent performance. The hybrid approach does not consistently improve frame-level accuracy but yields substantial gains in segment-level localization by enforcing temporal coherence. Per-transform analysis confirms robust performance across most manipulations, with deletion-based operations remaining the most challenging.
Moallim et al. studied this question.
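The distinction between frame-level accuracy and segment-level localization can be illustrated with a small sketch. It is not the authors' pipeline: the median filter, the 0.5 threshold, the window size, and the function names `median_smooth` and `frames_to_segments` are assumptions chosen for illustration, and the scores are made-up values. It shows how enforcing temporal coherence on per-frame forgery scores can merge fragmented detections into one segment while changing only a few frame decisions.

```python
from statistics import median


def median_smooth(scores, window=3):
    """Sliding median over per-frame scores (window shrinks at the edges).

    Illustrative stand-in for a temporal consistency step; assumed, not
    taken from the paper.
    """
    half = window // 2
    return [
        median(scores[max(0, i - half): i + half + 1])
        for i in range(len(scores))
    ]


def frames_to_segments(scores, threshold=0.5):
    """Group consecutive frames scoring >= threshold into (start, end) pairs.

    Indices are frame positions; `end` is exclusive.
    """
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a manipulated run begins
        elif score < threshold and start is not None:
            segments.append((start, i))  # the run ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(scores)))  # run extends to the last frame
    return segments


# Hypothetical per-frame forgery scores: one isolated false alarm at frame 2
# and a one-frame dropout (frame 8) inside a manipulated region (frames 6-11).
frame_scores = [0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.8, 0.9, 0.2, 0.9, 0.8, 0.9, 0.1, 0.1]

raw_segments = frames_to_segments(frame_scores)
smoothed_segments = frames_to_segments(median_smooth(frame_scores, window=3))

print(raw_segments)       # [(2, 3), (6, 8), (9, 12)]
print(smoothed_segments)  # [(6, 12)]
```

In this toy case only two of fourteen frame decisions change, yet the predicted segments go from three fragments to a single contiguous region. This is one plausible reading of why a coherence step can help segment-level metrics without a matching gain in frame-level accuracy.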