November 23, 2016 (Open Access)

A corpus for mining drug-related knowledge from Twitter chatter: Language models and their utilities

Abstract

In this data article, we present to the data science, natural language processing, and public health communities an unlabeled corpus and a set of language models. We collected the data from Twitter using drug names as keywords, including their common misspelled forms. Using this data, which is rich in drug-related chatter, we developed language models to aid the development of data mining tools and methods in this domain. We generated several models that capture (i) distributed word representations and (ii) probabilities of n-gram sequences. The data set we are releasing consists of 267,215 Twitter posts made during the four-month period from November 2014 to February 2015. The posts mention over 250 drug-related keywords. The language models encapsulate semantic and sequential properties of the texts.
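
As a rough illustration of what "probabilities of n-gram sequences" means, the sketch below estimates bigram probabilities by maximum likelihood from a handful of made-up posts. It is not the authors' pipeline: the function name `bigram_probabilities`, the whitespace tokenization, the `<s>` and `</s>` boundary markers, and the toy posts are all assumptions chosen for illustration, and a production model would normally add smoothing.

```python
from collections import Counter


def bigram_probabilities(posts):
    """Return maximum-likelihood estimates of P(w2 | w1) as a dict.

    Hypothetical helper for illustration only: tokenization is a plain
    lowercase whitespace split, and no smoothing is applied.
    """
    context_counts = Counter()
    bigram_counts = Counter()
    for post in posts:
        # Boundary markers (an assumption) let the model score post starts and ends.
        tokens = ["<s>"] + post.lower().split() + ["</s>"]
        # Every token except the last one serves as a context word.
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {
        (w1, w2): count / context_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }


# Made-up posts, NOT taken from the released corpus.
toy_posts = [
    "took adderall before class",
    "took xanax before bed",
    "adderall before class again",
]

probs = bigram_probabilities(toy_posts)
print(probs[("took", "adderall")])
print(probs[("before", "class")])
```

Each value is the count of a word pair divided by the count of its first word, so the probabilities for any given context word sum to one. Distributed word representations, the other model family mentioned above, are learned differently and are not covered by this sketch.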

Authors

Abeed Sarker

Georgia Institute of Technology

Graciela Gonzalez‐Hernandez

Ministry of Health and Social Welfare

Journals

Data in Brief

Institutions

University of Pennsylvania
