November 22, 2002

A corpus for the evaluation of lossless compression algorithms

Abstract

Many authors have used the Calgary corpus of texts to report empirical results for lossless compression algorithms. The corpus was collected in 1987 but was not published until 1990. Recent advances in compression algorithms have yielded only relatively small improvements as measured on the Calgary corpus. There is a concern that algorithms are being fine-tuned to this corpus, and that small improvements measured in this way may not carry over to other files. Furthermore, the corpus is almost ten years old, and over this period the kinds of files being compressed have changed, particularly with the development of the Internet and the rapid growth of high-capacity secondary storage for personal computers. We explore these issues and develop a principled technique for collecting a corpus of test data for compression methods. Using this technique we develop a corpus, called the Canterbury corpus, and report the performance of a collection of compression methods on it.
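To make the idea of reporting compression performance over a corpus concrete, here is a minimal sketch in Python. It is not the authors' procedure. The sample data, the choice of standard-library compressors, the bits-per-byte metric, and the unweighted mean across files are all assumptions made for illustration, and the names `SAMPLE_FILES`, `COMPRESSORS`, `bits_per_byte`, and `evaluate` are hypothetical.

```python
import bz2
import lzma
import zlib

# Hypothetical stand-in for corpus files: name -> raw bytes.
# A real run would read the corpus files from disk instead.
SAMPLE_FILES = {
    "text": b"the quick brown fox jumps over the lazy dog. " * 200,
    "binary": bytes(range(256)) * 40,
}

# Standard-library compressors, used here only as illustrative methods.
COMPRESSORS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def bits_per_byte(original: bytes, compressed: bytes) -> float:
    """Compressed size in bits divided by original size in bytes."""
    return 8 * len(compressed) / len(original)


def evaluate(files, compressors):
    """Return {method: {file: bits per byte, ..., "mean": unweighted mean}}."""
    results = {}
    for method, compress in compressors.items():
        per_file = {
            name: bits_per_byte(data, compress(data))
            for name, data in files.items()
        }
        # Unweighted mean: every file counts equally, whatever its size.
        per_file["mean"] = sum(per_file.values()) / len(files)
        results[method] = per_file
    return results


if __name__ == "__main__":
    for method, scores in evaluate(SAMPLE_FILES, COMPRESSORS).items():
        print(method, {name: round(value, 3) for name, value in scores.items()})
```

Averaging per-file ratios rather than pooling all bytes keeps one large file from dominating the score, which is one reason the composition of the corpus matters so much for the conclusions drawn from it.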

Cite This Study

Arnold et al. studied this question.

synapsesocial.com/papers/6a1c0f224ebd09f3dfa963fc https://doi.org/10.1109/dcc.1997.582019
