March 7, 2019

PySpark and RDKit: Moving towards Big Data in Cheminformatics

Abstract

The authors present an implementation of the cheminformatics toolkit RDKit in a distributed computing environment, Apache Hadoop. Together with the Apache Spark analytics engine, accessed through PySpark, resources from scalable commodity hardware can be employed for cheminformatics calculations and query operations, requiring only basic knowledge of Python programming and an understanding of resilient distributed datasets (RDDs). Three use cases of cheminformatics computing in Spark on the Hadoop cluster are presented: querying substructures, calculating fingerprint similarity, and calculating molecular descriptors. The source code for the PySpark-RDKit implementation is provided. The use cases showed that Spark scales reasonably well, to a degree that depends on the use case, and can be a suitable choice for datasets too big to be processed on current low-end workstations.
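To make the fingerprint-similarity use case concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the Tanimoto coefficient is assumed as the similarity measure, the fingerprints are hand-made sets of on-bit indices rather than RDKit output, and the names `tanimoto`, `score_partition`, `similarity_search`, and the `cmpd-*` identifiers are hypothetical. The per-partition function is written as a generator so that it has the shape a Spark per-partition operation expects, while a plain loop stands in for the cluster.

```python
from typing import FrozenSet, Iterable, Iterator, List, Tuple

Fingerprint = FrozenSet[int]  # indices of the bits that are set (assumed encoding)


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def score_partition(
    records: Iterable[Tuple[str, Fingerprint]],
    query: Fingerprint,
    threshold: float,
) -> Iterator[Tuple[str, float]]:
    """Score one partition of (id, fingerprint) records against the query.

    Yields only the records whose similarity reaches the threshold.
    """
    for mol_id, fp in records:
        score = tanimoto(fp, query)
        if score >= threshold:
            yield mol_id, score


def similarity_search(
    partitions: Iterable[Iterable[Tuple[str, Fingerprint]]],
    query: Fingerprint,
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Local stand-in for the distributed job: score each partition, then merge."""
    hits: List[Tuple[str, float]] = []
    for part in partitions:
        hits.extend(score_partition(part, query, threshold))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data: two "partitions" of hypothetical compounds with made-up fingerprints.
partitions = [
    [("cmpd-1", frozenset({1, 2, 3, 4})), ("cmpd-2", frozenset({7, 8}))],
    [("cmpd-3", frozenset({1, 2, 3})), ("cmpd-4", frozenset({2, 3, 9, 10}))],
]
query = frozenset({1, 2, 3, 4})
hits = similarity_search(partitions, query, threshold=0.5)
print(hits)
```

On a cluster, the same generator could plausibly be handed to PySpark as `rdd.mapPartitions(lambda part: score_partition(part, query, 0.5))`, with the fingerprints computed by RDKit inside the workers; how the paper's code actually organizes this step is not stated in the abstract.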

Cite this study

Lovrić, M., Molero, J. M., & Kern, R. (2019). PySpark and RDKit: Moving towards Big Data in Cheminformatics. Molecular Informatics.

synapsesocial.com/papers/69d9209bea2783c07da3c354 — DOI: https://doi.org/10.1002/minf.201800082

Also consider

Five closely related papers, for comparative context:

Large-scale virtual screening on public cloud resources with Apache Spark · 2017 · 21 citations
ChEMBL: a large-scale bioactivity database for drug discovery · 2011 · 4,432 citations
PubChem Substance and Compound databases · 2015 · 5,451 citations
Mordred: a molecular descriptor calculator · 2018 · 1,525 citations
Multi‐Server Approach for High‐Throughput Molecular Descriptors Calculation based on Multi‐Linear Algebraic Maps · 2014 · 14 citations

Authors

Mario Lovrić

University of Copenhagen

José Manuel Molero

Know Center Research GmbH (Austria)

Roman Kern

Graz University of Technology

Journals

Molecular Informatics

Institutions

Know Center Research GmbH (Austria)

Children's Hospital Srebrnjak
