June 29, 1999

Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets

Abstract

One of the most commonly used clustering algorithms within the worldwide pharmaceutical industry is Jarvis−Patrick's (J−P) (Jarvis, R. A. IEEE Trans. Comput. 1973, C-22, 1025−1034). The implementation of J−P under Daylight software, using Daylight's fingerprints and the Tanimoto similarity index, can deal with sets of 100 k molecules in a matter of a few hours. However, the J−P clustering algorithm has several associated problems which make it difficult to cluster large data sets in a consistent and timely manner. The clusters produced are greatly dependent on the choice of the two parameters needed to run J−P clustering, such that this method tends to produce clusters which are either very large and heterogeneous or homogeneous but too small. In any case, J−P always requires time-consuming manual tuning. This paper describes an algorithm which will identify dense clusters where similarity within each cluster reflects the Tanimoto value used for the clustering, and, more importantly, where the cluster centroid will be at least similar, at the given Tanimoto value, to every other molecule within the cluster in a consistent and automated manner. The similarity term used throughout this paper reflects the overall similarity between two given molecules, as defined by Daylight's fingerprints and the Tanimoto similarity index.
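
To make the cluster property described above concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes fingerprints are represented as sets of "on" bit positions (a stand-in for Daylight's fingerprints, which are not used here), and it assumes one plausible greedy procedure: pick the molecule with the most neighbours at the chosen Tanimoto threshold as a centroid, then assign its still-unassigned neighbours to it. The names `Fingerprint`, `tanimoto`, and `centroid_clusters` are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, List, Sequence

# Hypothetical stand-in for a real fingerprint: the indices of its "on" bits.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto index: shared on-bits divided by on-bits set in either one."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(a & b) / union


def centroid_clusters(
    fps: Sequence[Fingerprint], threshold: float
) -> List[List[int]]:
    """Greedy centroid-based clustering (an illustrative assumption).

    Returns lists of indices into `fps`; the first index in each list is
    the centroid, and every other member has Tanimoto >= threshold to it.
    """
    n = len(fps)
    # Neighbours of i: every other molecule at least `threshold`-similar to i.
    neighbours = [
        [j for j in range(n)
         if j != i and tanimoto(fps[i], fps[j]) >= threshold]
        for i in range(n)
    ]
    # Try candidate centroids from most to fewest neighbours (ties by index).
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    assigned = [False] * n
    clusters: List[List[int]] = []
    for i in order:
        if assigned[i]:
            continue
        # Centroid first, then its neighbours that are still unassigned.
        members = [i] + [j for j in neighbours[i] if not assigned[j]]
        for j in members:
            assigned[j] = True
        clusters.append(members)
    return clusters


if __name__ == "__main__":
    toy = [
        frozenset({1, 2, 3, 4}),
        frozenset({1, 2, 3, 5}),
        frozenset({1, 2, 3, 4, 5}),
        frozenset({10, 11}),
        frozenset({10, 11, 12}),
        frozenset({20}),
    ]
    for cluster in centroid_clusters(toy, threshold=0.7):
        print(cluster)
```

In this sketch the only tunable parameter is the Tanimoto threshold, and molecules with no sufficiently similar neighbour end up as single-member clusters. The pairwise neighbour search is quadratic in the number of molecules, so a data set of the size mentioned above would need a more efficient neighbour computation than this toy version.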
