The rapid growth of long-form spoken content, including meetings, lectures, interviews, classroom discussions, and online consultations, has created a strong need for automatic systems that can generate concise, faithful, and accessible summaries from speech [1]. Existing speech summarization systems commonly depend on cascaded pipelines, in which speech is first converted into text by automatic speech recognition (ASR) and then summarized by text-based models. Although effective, these systems suffer from ASR error propagation, loss of speaker-level information, multilingual degradation, and text-only output limitations. In addition, large language model-based summarizers may generate hallucinated or unsupported information, which reduces the reliability of summaries in high-value domains such as education, healthcare, legal documentation, and business communication [2]. To address these issues, this paper proposes a hallucination-aware multilingual speech-to-speech summarization framework using reinforcement-aligned large speech language models. Inspired by recent multilingual speech-to-speech modeling and preference-alignment methods, the framework applies Direct Preference Optimization or reinforcement-style optimization to improve factual consistency and reduce unsupported summary claims [3, 4]. The factuality verification module compares generated summaries with source speech/transcript evidence to identify contradictions, missing context, and speaker-attribution errors [5]. The proposed system is evaluated using ROUGE, BERTScore, WER, CER, hallucination rate, factual consistency score, speaker attribution accuracy, multilingual adequacy, latency, and mean opinion score.
This study contributes a unified framework for faithful, multilingual, and speech-output-oriented summarization of long-form spoken content.
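The abstract names Direct Preference Optimization as one candidate for the alignment step but does not give the objective. As a minimal sketch, the standard per-pair DPO loss can be written as below. This is the generic textbook formulation, not necessarily the authors' training setup. The function names, the choice of `beta = 0.1`, and the pairing of a faithful summary ("chosen") against a hallucinated one ("rejected") are assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls deviation from the reference model
) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a whole summary under
    the trainable policy or the frozen reference model. Here "chosen" is
    assumed to be the faithful summary and "rejected" the hallucinated one.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # The loss falls as the policy favors the chosen summary more strongly
    # than the reference model does.
    return -log_sigmoid(chosen_reward - rejected_reward)


# Policy identical to the reference: the margin is zero, so the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy shifted toward the faithful summary: the loss drops below log(2).
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Under this formulation, the loss on a pair decreases as the policy's preference for the faithful summary over the unsupported one grows relative to the reference model's. That is the mechanism by which preference alignment could reduce unsupported claims.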
Dinesh@Dhanabalan et al. (Fri,) studied this question.