The quadratic memory complexity of self-attention has generally restricted Transformer-based models to utterance-based speech processing, preventing them from leveraging long-form context. A common solution has been to formulate long-form speech processing as a streaming problem that uses only limited prior context. We propose a new and simple paradigm, encoding entire documents at once, which has been unexplored in Automatic Speech Recognition (ASR) and Speech Translation (ST) due to its technical infeasibility. We exploit developments in efficient attention mechanisms, such as Flash Attention, and show that Transformer-based models can be easily adapted to document-level processing. We also experiment with methods that address the quadratic complexity of attention by replacing it with simpler alternatives. As a result, our models can handle up to 30 minutes of speech during both training and testing. We evaluate our models on ASR, ST, and Speech Summarization (SSUM) using How2, TEDLIUM3, and SLUE-TED. With document-level context, our ASR models achieve 33.3% and 6.5% relative improvements in WER on How2 and TEDLIUM3, respectively, over prior work. Finally, we use our findings to propose a new attention-free self-supervised model, LongHuBERT, capable of handling long inputs. In doing so, we achieve state-of-the-art performance on SLUE-TED SSUM, outperforming the cascaded systems that have dominated the benchmark.
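To give a rough sense of why quadratic memory makes document-level encoding hard, the sketch below estimates the size of a single fully materialized attention score matrix. The 50 Hz encoder frame rate and 2-byte (half-precision) scores are illustrative assumptions, not figures taken from this work, and `attention_matrix_bytes` is a hypothetical helper named here only for the example.

```python
def attention_matrix_bytes(duration_s, frame_rate_hz=50.0, bytes_per_score=2):
    """Bytes needed to store one n x n attention score matrix.

    Assumptions (illustrative only): the encoder emits `frame_rate_hz`
    frames per second and each score is stored in `bytes_per_score` bytes.
    """
    n_frames = int(duration_s * frame_rate_hz)
    return n_frames * n_frames * bytes_per_score


if __name__ == "__main__":
    # Compare a short utterance with a 30-minute document.
    for label, seconds in [("10 s utterance", 10), ("30 min document", 30 * 60)]:
        gib = attention_matrix_bytes(seconds) / 2**30
        print(f"{label}: {gib:.4f} GiB per head per layer")
```

Under these assumptions, a 30-minute input is 180 times longer than a 10-second utterance but needs 180² = 32,400 times as much memory for the score matrix. Memory-efficient kernels such as Flash Attention avoid storing this matrix in full, which is what makes the whole-document setting practical to consider.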