We reproduced the Full-Duplex-Bench v1.0 evaluation pipeline (Lin et al., 2025) for Gemini 3.1 Flash Live Preview and measured how much the choice of ASR backend changes the downstream metrics. The original benchmark uses nvidia/parakeet-tdt-0.6b-v2 via NeMo (CUDA-only) for word-level transcription. Because we ran on Apple Silicon (M3 Max, no CUDA), we replaced it with two alternatives: whisper-large-v3 via MLX and the AssemblyAI REST API. We ran all four v1.0 tasks: backchannel, pause handling, smooth turn-taking, and user interruption. JSD and backchannel frequency stay the same regardless of ASR backend, but Turn-Over Rate (TOR) and response latency vary substantially. Neither backend wins across the board: AssemblyAI is closer to the paper on pause handling TOR, while MLX Whisper-v3 is closer on backchannel TOR, smooth turn-taking, and latency. The gap in pause handling is large: 0.856 (MLX) vs 0.111 (AssemblyAI) on the same audio, a difference of nearly 8x. We think any benchmark that uses TOR needs to pin its ASR backend and report it, because the transcription system alone can shift scores by close to an order of magnitude. Code is at https://github.com/ifeanyidike/full-duplex-bench-repro.
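To illustrate why a transcript-derived metric such as TOR can be this sensitive to the ASR backend, here is a minimal sketch. It assumes a simplified, hypothetical rule (a clip counts as a turn takeover when the transcript of the model's output has at least `min_words` words); this is not the benchmark's actual scoring code, and the function names, the threshold, and the toy transcripts are all invented for illustration. Only the two pause-handling TOR values at the end come from the abstract.

```python
from typing import List

def is_takeover(words: List[str], min_words: int = 3) -> bool:
    # Hypothetical rule: enough transcribed words means the model took the turn.
    return len(words) >= min_words

def turn_over_rate(transcripts: List[List[str]], min_words: int = 3) -> float:
    # Fraction of clips that count as a takeover under the rule above.
    if not transcripts:
        raise ValueError("need at least one transcript")
    takeovers = sum(is_takeover(w, min_words) for w in transcripts)
    return takeovers / len(transcripts)

# Toy transcripts of the SAME four clips from two imaginary ASR backends.
# Backend A emits words on faint or ambiguous audio; backend B returns less.
backend_a = [
    ["sure", "go", "ahead"],
    ["i", "think", "so", "too"],
    [],
    ["let", "me", "check"],
]
backend_b = [
    [],
    ["so"],
    [],
    ["let", "me", "check"],
]

tor_a = turn_over_rate(backend_a)
tor_b = turn_over_rate(backend_b)
print(f"toy TOR, backend A: {tor_a:.2f}")
print(f"toy TOR, backend B: {tor_b:.2f}")

# Ratio of the pause-handling TOR values reported in the abstract.
reported_ratio = 0.856 / 0.111
print(f"reported MLX / AssemblyAI ratio: {reported_ratio:.1f}x")
```

Under this assumed rule, the only thing that differs between the two toy runs is the transcript, which is the kind of dependence that motivates pinning and reporting the ASR backend.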
Ifeanyi Dike
synapsesocial.com/papers/6a0ff412d674f7c03778d118 — DOI: https://doi.org/10.5281/zenodo.20304821