Standard statistical language modeling techniques suffer from sparse-data problems in tasks where large amounts of domain-specific text are not available. In this paper, we focus on improving the estimation of domain-dependent n-gram models by the selective use of out-of-domain text data. Previous approaches for estimating language models from multi-domain data have not accounted for the characteristic variations of style and content across domains. In contrast, this work aims at differentially weighting subsets of the out-of-domain data according to style and/or content similarity to the given task, where “style” is represented by part-of-speech statistics and “content” by the particular choice of vocabulary items. In addition to n-gram estimation, the differential weights can be used for lexicon design. Recognition experiments are based on the Switchboard corpus of spontaneous conversations, with out-of-domain text drawn from the Wall Street Journal and Broadcast News corpora. The similarity weighting approach gives a 3–5% reduction in word error rate over a domain-specific n-gram language model, providing some of the largest language modeling gains reported for the Switchboard task in recent years.
Iyer et al. studied this question.
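The idea of differentially weighting out-of-domain subsets can be illustrated with a small sketch. The abstract does not give the paper's actual similarity measure or estimation formula, so everything below is an assumption made for illustration: content similarity is approximated by the cosine similarity of unigram count vectors, each out-of-domain subset's bigram counts are scaled by that similarity before being added to the in-domain counts, and probabilities are plain relative frequencies with no smoothing. The function names (`cosine_similarity`, `weighted_bigram_model`) are hypothetical.

```python
import math
from collections import Counter


def unigram_counts(sentences):
    """Count word occurrences over a list of whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts


def bigram_counts(sentences):
    """Count bigrams, padding each sentence with <s> and </s> markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


def cosine_similarity(a, b):
    """Cosine similarity of two count vectors (an assumed stand-in for
    the content-similarity measure, which the abstract does not specify)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_bigram_model(in_domain, out_subsets):
    """Pool in-domain bigram counts (weight 1) with out-of-domain counts
    scaled by each subset's similarity to the in-domain vocabulary."""
    target = unigram_counts(in_domain)
    pooled = Counter()
    for bigram, count in bigram_counts(in_domain).items():
        pooled[bigram] += float(count)

    weights = []
    for subset in out_subsets:
        weight = cosine_similarity(target, unigram_counts(subset))
        weights.append(weight)
        if weight <= 0:
            continue  # a dissimilar subset contributes nothing
        for bigram, count in bigram_counts(subset).items():
            pooled[bigram] += weight * count

    # Relative-frequency estimate P(w | h); unsmoothed for brevity.
    history_totals = Counter()
    for (history, _), count in pooled.items():
        history_totals[history] += count
    probs = {bg: c / history_totals[bg[0]] for bg, c in pooled.items()}
    return probs, weights


if __name__ == "__main__":
    conversational = ["yeah i think so", "i think so too"]
    subsets = [["i think you know"], ["stocks fell sharply today"]]
    probs, weights = weighted_bigram_model(conversational, subsets)
    print(weights)
    print(probs[("i", "think")])
```

In this toy setup the subset that shares vocabulary with the conversational text receives a positive weight, while the financial-news subset receives zero and is left out of the pooled counts. A practical system would add smoothing and could use part-of-speech statistics for a separate style weight, as the abstract describes.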