What question did this study set out to answer?

This research aims to evaluate and compare the transcription capabilities of leading ASR tools for voice answers in web surveys.

March 25, 2026

Transcribing and Coding Voice Answers Obtained in Web Surveys: Comparing Three Leading Automatic Speech Recognition Tools

Key Points

Compared three ASR tools: Google Cloud Speech-to-Text API, OpenAI Whisper, and Vosk.
Collected voice answers to an open-ended question on nursing home transparency, administered in an opt-in online panel in Spain.
Evaluated the efficiency and quality of coding the transcriptions with human coders and GPT-4o.
Google Cloud sometimes fails to provide a transcription at all.
OpenAI Whisper produces hallucinations (false transcriptions).
Vosk has clarity issues and high rates of incorrect words.
Human coding and LLM-based coding differ significantly from each other.

Abstract

With the rise of smartphone use in web surveys, voice or oral answers have become a promising methodology for collecting rich data. Voice answers present both opportunities and challenges. This study addresses two of these challenges: labor-intensive manual transcription and coding of responses. We compare the transcription performance of three leading Automatic Speech Recognition (ASR) tools (Google Cloud Speech-to-Text API, OpenAI Whisper, and Vosk) using voice answers collected from an open-ended question on nursing home transparency that was administered in an opt-in online panel in Spain. Additionally, we evaluate the efficiency and quality of coding these transcriptions using human coders and GPT-4o, a Large Language Model (LLM) developed by OpenAI. We found that each of the ASR tools has distinct merits and limits. Google sometimes fails to provide transcriptions, Whisper produces hallucinations (false transcriptions), and Vosk has clarity issues and high rates of incorrect words. Human and LLM-based coding also differ significantly. Thus, we recommend using several ASR tools for voice answer transcription and implementing human as well as LLM-based coding, as the latter offers additional information at minimal added cost.

Cite This Study

Revilla et al. studied this question.

synapsesocial.com/papers/69c37b93b34aaaeb1a67e1b0 https://doi.org/10.1093/jssam/smaf028