January 1, 1997

Comparing representations in Chinese information retrieval

Abstract

Three representation methods are empirically investigated for Chinese information retrieval: 1-gram (single character), bigram (two contiguous overlapping characters), and short-word indexing based on a simple segmentation of the text. The retrieval collection is the approximately 170 MB TREC-5 Chinese corpus of news articles, together with 28 queries that are long and richly worded. Evaluation shows that 1-gram indexing is good but not sufficiently competitive, while bigram indexing works surprisingly well. Bigram indexing leads to a large index term space, three times that of short-word indexing, but is as good as short-word indexing in precision, and about 5% better in relevant documents retrieved. The best average non-interpolated precision is about 0.45, 17% better than 1-gram indexing and quite high for a mainly statistical approach.

1. Introduction

While information retrieval (IR) in English has over thirty years of history, IR in Chinese is relatively recent. It is well-known that written Chi...
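
As an illustration of the 1-gram and bigram representations named in the abstract, the following is a minimal sketch of how index terms might be generated from a string of Chinese text. The function name `char_ngrams` and the sample string are hypothetical and not taken from the paper; the sketch only drops whitespace and does not reproduce the paper's handling of punctuation, non-Chinese tokens, or its short-word segmentation.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return overlapping character n-grams of `text`.

    Assumption: whitespace is dropped before forming n-grams; punctuation
    and sentence boundaries are NOT treated specially in this sketch.
    """
    chars = [c for c in text if not c.isspace()]
    # Slide a window of width n one character at a time, so consecutive
    # n-grams overlap by n - 1 characters.
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


sample = "信息检索"  # hypothetical sample: "information retrieval"

unigrams = char_ngrams(sample, 1)  # 1-gram (single-character) index terms
bigrams = char_ngrams(sample, 2)   # overlapping bigram index terms

print(unigrams)
print(bigrams)
```

For a run of k characters this yields k unigrams and k - 1 bigrams, but the bigram vocabulary draws on pairs of characters rather than single characters, which is consistent with the much larger index term space the abstract reports for bigram indexing.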
