What question did this study set out to answer?

To develop a model that converts non-canonical tweet language into canonical form using edit scripts and neural embeddings.

January 1, 2014 | Open Access

Normalizing tweets with edit scripts and recurrent neural embeddings

Key Points

Developed a text normalization model using labeled and unlabeled data.
Incorporated character-level neural embeddings generated by a Simple Recurrent Network.
Analyzed performance on an English tweet normalization dataset.
Substantially lowered word error rates by enriching the feature set with text embeddings, improving on the state of the art.
Achieved improvements with minimal training data.
Did not require lexical resources for effectiveness.

Abstract

Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and other non-canonical language. These features are problematic for standard language analysis tools, and it can be desirable to convert them to canonical form. We propose a novel text normalization model based on learning edit operations from labeled data while incorporating features induced from unlabeled data via character-level neural text embeddings. The text embeddings are generated using a Simple Recurrent Network. We find that enriching the feature set with text embeddings substantially lowers word error rates on an English tweet normalization dataset. Our model improves on the state of the art with little training data and without any lexical resources.
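To make the idea of an edit script concrete, the following is a minimal sketch in Python. It is not the paper's implementation: it assumes an edit script is simply a list of character-span edits derived from a standard sequence alignment (here Python's `difflib.SequenceMatcher`), and the function names `edit_script` and `apply_script` are hypothetical. In the paper the edit operations are predicted by a model learned from labeled data; here they are only read off a known noisy/canonical pair, which is roughly how training targets could be obtained.

```python
from difflib import SequenceMatcher


def edit_script(source, target):
    """Return (tag, start, end, text) edits that turn source into target.

    Hypothetical helper for illustration only; the alignment comes from
    difflib, not from the method described in the paper.
    """
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    script = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged spans need no edit
        # For "delete" the replacement text is empty; for "insert" i1 == i2.
        script.append((tag, i1, i2, target[j1:j2]))
    return script


def apply_script(source, script):
    """Apply edits from right to left so earlier offsets stay valid."""
    result = source
    for _tag, start, end, text in reversed(script):
        result = result[:start] + text + result[end:]
    return result


noisy, canonical = "c u 2moro", "see you tomorrow"
script = edit_script(noisy, canonical)
for edit in script:
    print(edit)
print(apply_script(noisy, script))
```

Applying the script to the noisy string reproduces the canonical string by construction. A learned normalizer would have to predict such edits for unseen tweets, which is where the character-level embeddings described in the abstract come in as additional features.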
