Abstract

With the rise of smartphone use in web surveys, voice or oral answers have become a promising methodology for collecting rich data. Voice answers present both opportunities and challenges. This study addresses two of these challenges: labor-intensive manual transcription and coding of responses. We compare the transcription performance of three leading Automatic Speech Recognition (ASR) tools (Google Cloud Speech-to-Text API, OpenAI Whisper, and Vosk) using voice answers collected from an open-ended question on nursing home transparency that was administered in an opt-in online panel in Spain. Additionally, we evaluate the efficiency and quality of coding these transcriptions using human coders and GPT-4o, a Large Language Model (LLM) developed by OpenAI. We found that each of the ASR tools has distinct merits and limits. Google sometimes fails to provide transcriptions, Whisper produces hallucinations (false transcriptions), and Vosk has clarity issues and high rates of incorrect words. Human and LLM-based coding also differ significantly. Thus, we recommend using several ASR tools for voice answer transcription and implementing human as well as LLM-based coding, as the latter offers additional information at minimal added cost.
Revilla et al. (Mon,) studied this question.