What question did this study set out to answer?

This study asks how much the choice of ASR backend changes the turn-taking metrics that Full-Duplex-Bench v1.0 reports for Gemini 3.1 Flash Live Preview.

May 22, 2026 (Open Access)

Reproducing Full-Duplex-Bench: Evaluation of Gemini 3.1 Flash Live on Turn-Taking and an Analysis of ASR Backend Sensitivity

Key Points

Reproduced the Full-Duplex-Bench v1.0 evaluation pipeline for Gemini 3.1 Flash Live Preview using alternative ASR backends.
Performed assessments on four tasks: backchannel, pause handling, smooth turn-taking, and user interruption.
Analyzed discrepancies between two ASR backends: whisper-large-v3 via MLX and the AssemblyAI REST API.
Turn-Over Rate (TOR) and response latency vary substantially between ASR backends, while JSD and backchannel frequency stay the same.
In pause handling, the same audio yields a TOR of 0.856 with MLX Whisper-v3 and 0.111 with AssemblyAI, so the gap reflects the transcription backend rather than the model.
Benchmarks that use TOR should pin and report their ASR backend to avoid skewed results.

Abstract

We reproduced the Full-Duplex-Bench v1.0 evaluation pipeline (Lin et al., 2025) for Gemini 3.1 Flash Live Preview and looked at how much the choice of ASR backend changes the downstream metrics. The original benchmark uses nvidia/parakeet-tdt-0.6b-v2 via NeMo (CUDA-only) for word-level transcription, but since we're on Apple Silicon (M3 Max, no CUDA), we replaced it with two alternatives: whisper-large-v3 via MLX and the AssemblyAI REST API. We ran all four v1.0 tasks: backchannel, pause handling, smooth turn-taking, and user interruption. What we found is that JSD and backchannel frequency stay the same no matter which ASR you use, but Turn-Over Rate (TOR) and response latency move around a lot. Neither backend wins across the board. AssemblyAI is closer to the paper on pause handling TOR, but MLX Whisper-v3 is closer on backchannel TOR, smooth turn-taking, and latency. The gap in pause handling is not small either: 0.856 (MLX) vs 0.111 (AssemblyAI) on the same audio. We think any benchmark that uses TOR needs to pin its ASR backend and report it, because the transcription system alone can shift scores by close to an order of magnitude. Code is at https://github.com/ifeanyidike/full-duplex-bench-repro.
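
To illustrate why TOR can be sensitive to the transcription backend, here is a minimal sketch of a TOR-style computation over word-level transcripts. It is not the benchmark's or the repository's code. The `Word` type, the function names, and the backchannel thresholds (`max_words`, `max_secs`) are assumptions made for illustration, and the transcripts are toy data rather than results from the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the evaluation window
    end: float


def is_takeover(words: List[Word], max_words: int = 2, max_secs: float = 1.0) -> bool:
    """Decide whether a transcribed model response counts as taking the turn.

    Assumption: silence and short backchannel-like responses do not count.
    The thresholds are hypothetical, not taken from the benchmark.
    """
    if not words:
        return False  # the ASR found no speech, so no takeover
    duration = words[-1].end - words[0].start
    is_backchannel = len(words) <= max_words and duration < max_secs
    return not is_backchannel


def turn_over_rate(samples: List[List[Word]]) -> float:
    """Fraction of samples in which the response is judged a takeover."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(is_takeover(words) for words in samples) / len(samples)


# Toy transcripts of the SAME two responses from two hypothetical ASR backends.
backend_a = [
    [Word("uh", 0.2, 0.4), Word("huh", 0.4, 0.6),
     Word("so", 0.9, 1.1), Word("anyway", 1.1, 1.5)],
    [Word("yeah", 0.3, 0.6)],
]
backend_b = [
    [Word("uh-huh", 0.2, 0.6)],  # this backend drops the trailing words
    [Word("yeah", 0.3, 0.6)],
]

tor_a = turn_over_rate(backend_a)
tor_b = turn_over_rate(backend_b)
print(f"TOR backend A: {tor_a:.3f}, TOR backend B: {tor_b:.3f}")
```

Because the takeover decision depends only on which words the ASR emits and on their timestamps, two backends that segment or drop words differently can produce different TOR values for identical audio.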

Authors

Ifeanyi Dike
