January 1, 2015 · Open Access

Unsupervised Text Normalization Using Distributed Representations of Words and Phrases

Abstract

Text normalization techniques that use rule-based normalization or string similarity based on static dictionaries are typically unable to capture domain-specific abbreviations (custy, cx → customer) and shorthands (5ever, 7ever → forever) used in informal texts. In this work, we exploit the property that noisy and canonical forms of a particular word share similar context in a large noisy text collection (millions or billions of social media feeds from Twitter, Facebook, etc.). We learn distributed representations of words to capture the notion of contextual similarity and subsequently learn normalization lexicons from these representations in a completely unsupervised manner. We experiment with linear and non-linear distributed representations obtained from log-linear models and neural networks, respectively. We apply our framework for normalizing customer care notes and Twitter. We also extend our approach to learn phrase normalization lexicons (g2g → got to go) by training distributed representations over compound words. Our approach outperforms Microsoft Word, Aspell and a manually compiled urban dictionary from the Web and achieves state-of-the-art results on a publicly available Twitter dataset.
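The core idea in the abstract, that a noisy form and its canonical form end up close together in embedding space, can be illustrated with a minimal sketch. It is not the paper's implementation. The vectors below are hand-written toy values rather than learned representations, and the names `EMBEDDINGS`, `CANONICAL`, `build_lexicon`, `normalize` and the `min_similarity` threshold are all hypothetical choices made for this example.

```python
import math

# Hypothetical toy vectors, hand-written for illustration only. In practice
# they would be learned from a large noisy text collection.
EMBEDDINGS = {
    "customer": [0.90, 0.10, 0.05],
    "custy":    [0.88, 0.12, 0.07],
    "cx":       [0.85, 0.15, 0.02],
    "forever":  [0.10, 0.92, 0.08],
    "5ever":    [0.12, 0.90, 0.10],
    "goodbye":  [0.05, 0.10, 0.95],
}

# Assumed set of canonical (dictionary) word forms.
CANONICAL = {"customer", "forever", "goodbye"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_lexicon(embeddings, canonical, min_similarity=0.95):
    """Map each non-canonical word to its most similar canonical word.

    A pair is kept only if its cosine similarity reaches min_similarity
    (an assumed cutoff, not a value from the paper).
    """
    lexicon = {}
    for word, vec in embeddings.items():
        if word in canonical:
            continue
        best = max(canonical, key=lambda c: cosine(vec, embeddings[c]))
        if cosine(vec, embeddings[best]) >= min_similarity:
            lexicon[word] = best
    return lexicon


def normalize(text, lexicon):
    """Replace each whitespace-separated token found in the lexicon."""
    return " ".join(lexicon.get(tok, tok) for tok in text.split())


lexicon = build_lexicon(EMBEDDINGS, CANONICAL)
print(normalize("cx called 5ever", lexicon))
```

No labeled noisy/canonical pairs are used here: the lexicon comes only from vector similarity and a list of canonical forms, which is the sense in which such a procedure is unsupervised. A practical system would likely also need a lexical similarity check to filter out words that are contextually close but unrelated.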

Cite This Study

Vivek Kumar Rangarajan Sridhar studied this question.

synapsesocial.com/papers/6a0edfac8a6cf2089022a1f6 https://doi.org/10.3115/v1/w15-1502
