Several recent investigations have hypothesized that syllable-sized segments may be more appropriate units than phoneme-sized segments for use in continuous speech recognition systems. The significant acoustic information in these segments may be represented by a variety of parameters: some are obtained by linear predictive techniques, e.g., linear prediction coefficients (LPC), reflection coefficients (RC), and cepstral coefficients (LPCC); others are obtained by spectral techniques, e.g., linear-frequency cepstral coefficients (LFCC) and mel-frequency cepstral coefficients (MFCC). This study compared the performance of these parameters in a word identification test. Two male speakers produced 57 sentences in each of two sessions, and 676 tokens of 52 CVC words in a variety of syntactic positions were manually segmented. For each speaker, a symmetric dynamic warping technique was used for time registration of half of the data to form composite templates and for comparisons of the remaining data against the templates. The local distance functions were the Euclidean metric for the RC, LPCC, LFCC, and MFCC, and the Itakura metric for the LPC. The best parameter for word identification was the MFCC (96.5% and 95.0% for the two speakers), followed by the LFCC (94.7% and 87.6%), LPCC (91.7% and 86.4%), LPC (85.2% and 84.3%), and RC (83.1% and 77.3%). We suggest that the superior performance of the mel-frequency cepstral coefficients is due to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum. Research supported by NSF Grant MSC-76-81034 and the USAECOM Contract MDA 904-77-C-0157.
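The template comparison described in the abstract can be illustrated with a minimal sketch of dynamic time warping using a Euclidean local distance. The abstract does not specify the exact warping constraints, so the symmetric step weighting used here (diagonal steps weighted 2, horizontal and vertical steps weighted 1, total normalized by the sum of the two lengths) is an assumption, as are the function names `euclidean`, `dtw_distance`, and `identify`, and the `templates` dictionary. The sketch does not cover the construction of composite templates or the Itakura metric used for the LPC.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b, local=euclidean):
    """Normalized DTW distance between two sequences of feature vectors.

    Assumed symmetric form: a diagonal step costs 2 * d, a horizontal or
    vertical step costs d, and the total is divided by len(seq_a) + len(seq_b).
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the best accumulated cost aligning the first i frames
    # of seq_a with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = min(
                cost[i - 1][j] + d,          # advance in seq_a only
                cost[i][j - 1] + d,          # advance in seq_b only
                cost[i - 1][j - 1] + 2 * d,  # advance in both
            )
    return cost[n][m] / (n + m)


def identify(token, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda word: dtw_distance(token, templates[word]))


# Hypothetical one-dimensional "feature" sequences, for illustration only.
templates = {
    "bat": [[0.0], [1.0], [2.0]],
    "bit": [[5.0], [6.0], [7.0]],
}
token = [[0.0], [0.0], [1.0], [2.0]]
best = identify(token, templates)
```

In a real system each frame would be a vector of coefficients (for example, MFCC) rather than a single number, and the token would be assigned to whichever word template yields the smallest normalized distance.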
Davis et al. studied this question.