What question did this study set out to answer?

This study aims to improve audio forgery detection by localizing manipulations at the frame level.

May 9, 2026 | Open Access

Frame-Level Audio Forgery Localization Using Handcrafted and Neural Features

Key Points

This study aims to improve audio forgery detection by localizing manipulations at the frame level.
Introduced a controlled forgery generation framework using the TIMIT speech corpus.
Proposed a detection approach using temporal inconsistency modeling for frame-level analysis.
Evaluated three strategies: handcrafted anomaly-based, deep localization with wav2vec 2.0, and a hybrid method.
Handcrafted features were insufficient for reliable frame-level localization.
Task-adapted wav2vec 2.0 achieved strong and consistent performance for manipulation detection.
The hybrid approach improved segment-level localization while showing inconsistent frame-level accuracy.
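
To make "frame-level" concrete, the sketch below turns one edited span into per-frame labels. It is only an illustration: the 16 kHz rate matches TIMIT, but the 20 ms hop, the "any overlap" labeling rule, and the function name `frame_labels` are assumptions rather than settings taken from the paper.

```python
def frame_labels(num_samples, edit_start, edit_end, sample_rate=16000, hop_ms=20):
    """Return one 0/1 label per frame; 1 means the frame overlaps the edited span.

    edit_start and edit_end are sample indices (end exclusive). The hop size and
    the overlap rule are illustrative assumptions, not the paper's configuration.
    """
    hop = sample_rate * hop_ms // 1000      # samples per frame (320 at 16 kHz)
    num_frames = -(-num_samples // hop)     # ceiling division
    labels = []
    for i in range(num_frames):
        frame_start = i * hop
        frame_end = min(frame_start + hop, num_samples)
        overlaps = frame_start < edit_end and edit_start < frame_end
        labels.append(1 if overlaps else 0)
    return labels


# A 1-second clip with a hypothetical manipulation between 0.25 s and 0.40 s.
labels = frame_labels(16000, 4000, 6400)
print(len(labels), sum(labels))  # 50 frames in total, 8 of them marked as manipulated
```

A file-level detector would collapse these 50 labels into a single yes/no decision; a frame-level localizer has to predict each one.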

Abstract

Audio forgery has emerged as a significant security and forensic challenge, driven by rapid advances in generative artificial intelligence and the widespread availability of audio editing tools, which enable the creation of highly realistic manipulated speech with minimal technical expertise. Existing approaches predominantly operate at the file level, providing only coarse binary decisions without identifying when or where manipulation occurs. This study addresses fine-grained temporal localization through a unified frame-level localization framework. We introduce a controlled forgery generation framework derived from the TIMIT speech corpus, applying atomic, localized manipulations under strict temporal constraints and producing precise frame-level annotations across diverse manipulation types. Building on this dataset, we then propose a transform-agnostic localization-driven detection approach using temporal inconsistency modeling, enabling unified analysis across heterogeneous manipulations at frame-level resolution. To analyze forensic evidence, we present an evidence-stratified modeling paradigm comparing three complementary strategies: a handcrafted anomaly-based method, a deep localization model leveraging pretrained wav2vec 2.0 representations, and a hybrid approach combining both through confidence-aware fusion and temporal consistency reinforcement. A systematic experimental analysis evaluates the effects of representation adaptation, hybrid fusion, and manipulation type on detection and localization performance. Results show that handcrafted features are insufficient for reliable frame-level localization, while task-adapted wav2vec 2.0 achieves strong and consistent performance. The hybrid approach does not consistently improve frame-level accuracy but yields substantial gains in segment-level localization by enforcing temporal coherence. Per-transform analysis confirms robust performance across most manipulations, with deletion-based operations remaining the most challenging.
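
The gap between frame-level and segment-level results can be illustrated with a toy example. The sketch below uses a simple median filter as a stand-in for a temporal coherence step; the paper's actual fusion and reinforcement procedure is not described here, so the window size, the 0.5 threshold, the scores, and the helper names `smooth` and `segments` are all assumptions.

```python
from statistics import median


def smooth(scores, window=3):
    """Median-filter per-frame scores so isolated single-frame flips are removed."""
    half = window // 2
    return [median(scores[max(0, i - half): i + half + 1]) for i in range(len(scores))]


def segments(scores, threshold=0.5):
    """Convert per-frame scores into (start_frame, end_frame_exclusive) segments."""
    out, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                      # a manipulated run begins
        elif s < threshold and start is not None:
            out.append((start, i))         # the run ends before frame i
            start = None
    if start is not None:
        out.append((start, len(scores)))   # run reaches the end of the clip
    return out


# Hypothetical scores for a clip whose true manipulated span is frames 2..5.
raw = [0.1, 0.1, 0.9, 0.4, 0.9, 0.9, 0.1, 0.1]

print(segments(raw))          # [(2, 3), (4, 6)]  one dip splits the span in two
print(segments(smooth(raw)))  # [(3, 6)]          one segment, but frame 2 is now missed
```

In this toy case the raw and smoothed predictions each get exactly one frame wrong, so frame-level accuracy is unchanged, yet only the smoothed output yields a single contiguous segment. That is one plausible way for a coherence step to help segment-level localization without improving frame-level accuracy.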

Moallim et al. studied this question.

synapsesocial.com/papers/69fed021b9154b0b82877334 https://doi.org/10.3390/signals7030042
