January 1, 1994 · Open Access

CLAWS4


Abstract

The main purpose of this paper is to describe the CLAWS4 general-purpose grammatical tagger, used for the tagging of the 100-million-word British National Corpus, of which c.70 million words have been tagged at the time of writing (April 1994). We will emphasise the goals of (a) general-purpose adaptability, (b) incorporation of linguistic knowledge to improve quality and consistency, and (c) accuracy, measured consistently and in a linguistically informed way. The British National Corpus (BNC) consists of c.100 million words of English written texts and spoken transcriptions, sampled from a comprehensive range of text types. The BNC includes 10 million words of spoken language, c.45% of which is impromptu conversation (see Crowdy, forthcoming). It also includes an immense variety of written texts, including unpublished materials. The grammatical tagging of the corpus has therefore required the 'super-robustness' of a tagger which can adapt well to virtually all kinds of text. The tagger also has had to be versatile in dealing with different tagsets (sets of grammatical category labels; see 3 below) and accepting text in varied input formats. For the purposes of the BNC, the tagger has been required both to accept and to output text in a corpus-oriented TEI-conformant mark-up definition known as CDIF (Corpus Document Interchange Format), but within this format many variant formats (affecting, for example, segmentation into words and sentences) can be readily accepted. In addition, CLAWS al-


Cite This Study

Leech et al. (1994) studied this question.

synapsesocial.com/papers/6a1bcdc34d9126f09c5ed4ab https://doi.org/10.3115/991886.991996
