Los puntos clave no están disponibles para este artículo en este momento.
At ATR Spoken Language Translation Research Laboratories, we are building a broad-coverage bilingual corpus to study corpus-based speech translation technologies for the real world. There are three important points to consider in designing and constructing a corpus for future speech translation research. The first is to have a variety of speech samples, with a wide range of pronunciations and speakers. The second is to have data for a variety of situations. The third is to have a variety of expressions. This paper reports our trials and discusses the methodology. First, we introduce a bilingual travel conversation (TC) corpus of spoken languages and a broad-coverage bilingual basic expression (BE) corpus. TC and BE are designed to be complementary. TC is a collection of transcriptions of bilingual spoken dialogues, while BE is a collection of Japanese sentences and their English translations. Whereas TC covers a small domain, BE covers a wide variety of domains. We compare the characteristics of vocabulary and expressions between these two corpora and suggest that we need a much greater variety of expressions. One promising approach might be to collect paraphrases representing various different expressions generated by many people for similar concepts. 1.
Takezawa et al. (Wed,) studied this question.