Overcoming the Lack of Parallel Data in Sentence Compression

Abstract

A major challenge in supervised sentence compression is making use of rich feature representations because of very scarce parallel data. We address this problem and present a method to automatically build a compression corpus with hundreds of thousands of instances on which deletion-based algorithms can be trained. In our corpus, the syntactic trees of the compressions are subtrees of their uncompressed counterparts, and hence supervised systems which require a structural alignment between the input and output can be successfully trained. We also extend an existing unsupervised compression method with a learning module. The new system uses structured prediction to learn from lexical, syntactic and other features. An evaluation with human raters shows that the presented data harvesting method indeed produces a parallel corpus of high quality. Also, the supervised system trained on this corpus gets high scores both from human raters and in an automatic evaluation setting, significantly outperforming a strong baseline.
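
To make the subtree property concrete, the sketch below checks whether a deletion-based compression keeps a connected subtree of the source sentence's dependency tree. It is only an illustration under stated assumptions: the abstract does not specify the tree representation, so the head-index encoding (with -1 for the root), the function name is_connected_subtree, and the toy sentence with its hand-written parse are all hypothetical and not taken from the paper.

```python
from typing import Sequence, Set


def is_connected_subtree(heads: Sequence[int], kept: Set[int]) -> bool:
    """Return True if the kept token indices form one connected subtree.

    heads[i] is the index of token i's head, or -1 for the sentence root.
    This encoding is an assumption made for illustration only.
    """
    if not kept:
        return False
    # In a tree, a node set is connected exactly when one kept node has its
    # head outside the set (or is the root); all others attach to kept heads.
    tops = [i for i in kept if heads[i] == -1 or heads[i] not in kept]
    return len(tops) == 1


def compress(tokens: Sequence[str], kept: Set[int]) -> str:
    """Deletion-based compression: keep the selected tokens in source order."""
    return " ".join(tokens[i] for i in sorted(kept))


# Toy sentence with a hand-written (hypothetical) dependency parse.
tokens = ["The", "quick", "fox", "jumped", "over", "the", "fence", "yesterday"]
heads = [2, 2, 3, -1, 3, 6, 4, 3]

good = {0, 2, 3, 4, 5, 6}  # drops "quick" and "yesterday"
bad = {2, 3, 6}            # keeps "fence" without its head "over"

print(compress(tokens, good), is_connected_subtree(heads, good))
print(compress(tokens, bad), is_connected_subtree(heads, bad))
```

A pair that passes such a check gives a supervised system a direct structural alignment between input and output, because every retained word attaches to a retained head.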
