June 12, 2006 · Open Access

DivergentSet, a Tool for Picking Non-redundant Sequences from Large Sequence Collections

Abstract

DivergentSet addresses the important but so far neglected bioinformatics task of choosing a representative set of sequences from a larger collection. We found that using a phylogenetic tree to guide the construction of divergent sets of sequences can be up to 2 orders of magnitude faster than the naive method of using a full distance matrix. By providing a user-friendly interface (available online) that integrates the tasks of finding additional sequences, building and refining the divergent set, producing random divergent sets from the same sequences, and exporting identifiers, this software facilitates a wide range of bioinformatics analyses including finding significant motifs and covariations. As an example application of DivergentSet, we demonstrate that the motifs identified by the motif-finding package MEME (Motif Elicitation by Maximum Entropy) are highly unstable with respect to the specific choice of sequences. This instability suggests that the types of sensitivity analysis enabled by DivergentSet may be widely useful for identifying the motifs of biological significance.

The abbreviations used are: MEME, Motif Elicitation by Maximum Entropy; BLAST, Basic Local Alignment Search Tool; KEGG, Kyoto Encyclopedia of Genes and Genomes; LysRS, lysyl-tRNA synthetase; PSI-BLAST, Position-Specific Iterated Basic Local Alignment Search Tool; OTU, operational taxonomic unit; UPGMA, unweighted pair group method with arithmetic mean; CPU, central processing unit; PBS, portable batch system.

The problem of picking a representative non-redundant set of sequences in a convenient manner is critical for many bioinformatics analyses. Many sequence analysis methods assume that protein or nucleic acid sequences have had sufficient time to reach equilibrium such that unimportant residues or associations have mutated away, and only functionally important sites remain intact. These methods include identifying functional motifs (1 Bailey T.L. Elkan C. Altman R. Brutlag D. Karp P. Searls D. ISMB-94: Proceedings, Second International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, 1994: 28-36) and detecting correlated evolution in functionally related residues (2 Lockless S. Ranganathan R. Evolutionarily conserved pathways of energetic connectivity in protein families. Science. 1999; 286: 295-299). Sequences that resemble each other primarily because of shared ancestry rather than because of functional constraints must be excluded from these analyses. The restriction on sequence identity arises because methods that detect similar patterns against a random background model cannot determine whether sequences are similar only because they are conserved from sequences that have had insufficient time to evolve in different directions. This effect is especially important when using shared motifs to define superfamilies (3 Copley S. Dhillon J. Lateral gene transfer and parallel evolution in the history of glutathione biosynthesis genes. Genome Biol. 2002; 3 (Research0025); 4 Copley S. Novak W. Babbitt P. Divergence of function in the thioredoxin fold suprafamily: evidence for evolution of peroxiredoxins from a thioredoxin-like ancestor. Biochemistry. 2004; 43: 13981-13995).

Phylogenetic analyses of large numbers of sequences can also be extraordinarily time-consuming even with efficient algorithms (5 Tamura K. Nei M. Kumar S. Prospects for inferring very large phylogenies by using the neighbor-joining method. Proc. Natl. Acad. Sci. U. S. A. 2004; 101: 11030-11035). Choosing a smaller but representative set of sequences often provides the same phylogenetic tree far more efficiently (6 Poe S. Sensitivity of phylogeny estimation to taxonomic sampling. Syst. Biol. 1998; 47: 18-31; 7 Rosenberg M. Kumar S. Incomplete taxon sampling is not a problem for phylogenetic inference. Proc. Natl. Acad. Sci. U. S. A. 2001; 98: 10751-10756).

Because picking a divergent set manually is laborious, little attention has been paid to the reproducibility of programs that rely on divergent sets. How much does the taxon sampling or the precise choice of sequences from the same protein or RNA families affect the apparent functional motifs or relationships? In this study, we demonstrate one use of DivergentSet by comparing the motifs found by the popular motif-finding program MEME (1 Bailey T.L. Elkan C. Altman R. Brutlag D. Karp P.
Searls D. ISMB-94: Proceedings, Second International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, 1994: 28-36) using different sets of divergent sequences from the same initial alignment.

Divergent sets of sequences have typically been chosen manually. In this procedure, the distance between each pair of sequences is calculated. Any sequence that is too similar to a sequence already in the set is discarded. Taxonomic information can also be used, e.g. by taking one sequence from each genus. However, both methods based on taxonomic annotations and existing methods based on sequence similarity have substantial drawbacks. Thus, an automated method to choose sets of sequences based on sequence similarity is highly desirable.

To our knowledge, no fully automated system for choosing a divergent set based on sequence distances exists. Several programs, including BLASTCLUST, part of the BLAST package (8 Altschul S. Gish W. Miller W. Myers E. Lipman D. Basic local alignment search tool. J. Mol. Biol. 1990; 215: 403-410); nrdb90.pl, a script for removing nearly identical protein sequences from a C. from large protein sequence 1998; PubMed Scopus Google Scholar); and P. J. a program for operational taxonomic and PubMed Scopus Google Scholar) and a program to of 2001; PubMed Scopus Google programs for choosing taxonomic by sequence can to in choosing divergent sets by identifying of related sequences. However, and for analysis and not the task of choosing representative sequences from each This task must be by a time-consuming that can an many in part because is often to include sequences from The to include sequences manually is especially representative are in the M. J. M. The a for Mol. Biol. PubMed Scopus Google Scholar). 
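The manual procedure described earlier in this section (calculate the distance between each pair of sequences, then discard any sequence that is too similar to one already in the set) can be sketched in a few lines of Python. This is an illustrative sketch only and not the DivergentSet implementation: the function names `percent_identity` and `greedy_divergent_set`, the decision to skip gapped alignment columns, and the use of the compared columns as the denominator are all assumptions made for the example.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Fraction of identical residues over alignment columns where
    neither sequence has a gap (assumed definition, for illustration)."""
    matches = 0
    compared = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns containing a gap
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def greedy_divergent_set(aligned: Dict[str, str], max_identity: float) -> List[str]:
    """Visit sequences in input order and keep each one only if it is
    no more than max_identity identical to every sequence already kept."""
    chosen: List[str] = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[c]) <= max_identity for c in chosen):
            chosen.append(name)
    return chosen


if __name__ == "__main__":
    # Hypothetical toy alignment: "b" differs from "a" at one position.
    toy = {"a": "ACDEFGHIKL", "b": "ACDEFGHIKV", "c": "MNPQRSTVWY"}
    print(greedy_divergent_set(toy, max_identity=0.8))
```

In the worst case every new sequence is compared with every sequence already kept, which is the all-against-all cost that a full distance matrix implies. The result also depends on the order in which sequences are visited.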
Choosing sequences based on is especially because a or a a very different of sequence in different of and little sequence are especially The in and is identical the sequence but these are in in from the can by more than sequences from not an automated system for choosing a divergent set of sequences of the divergent for analysis because the are to be the We identified for the system. The system different sets of divergent sequences from an existing set of or protein additional related sequences a sequence or set of sequences to the set is such P. J. a program for operational taxonomic and PubMed Scopus Google Scholar) and a program to of 2001; PubMed Scopus Google Scholar) are for picking from a set of sequences to be and are not to this divergent sets of sequences and additional sets of divergent sequences using a phylogenetic tree a the divergent set in the of a phylogeny is a existing and even with of sequences. This is important for sensitivity analyses can the set of divergent the sequences that are in the divergent set to that sequences or sequences with are in the divergent This the to sequences that or specific in annotations and to on the tree the sequences that and numbers and full sequences in both random divergent sets sets chosen random from sets in no sequence be too similar to an existing and divergent sets divergent sets that the of DivergentSet (available is a that these DivergentSet an to use interface with a To the of the we and methods of distances between sequences. This critical because algorithms for sequences of time We also methods of additional of the sequences in the initial The that DivergentSet is in This with a or sequence and a divergent set of sequences that can be used for analyses such The with a or set of sequences. sequences are by BLAST (8Altschul S. Gish W. Miller W. Myers E. Lipman D. Basic local alignment search tool.J. Mol. Biol. 1990; 215: 403-410Crossref PubMed Scopus (71456) Google S. A. J. Miller W. Lipman D. 
BLAST and a of protein search PubMed Scopus Google or S. Babbitt P. more from sequence similarity 1999; PubMed Scopus Google Scholar). tree the sequences is used to choose a divergent To that the set is of sequences in the set are and sequences are discarded. The divergent set can be or sequences and used for many different tasks including and the set can be used for of or divergent set can be chosen from the same initial set of sequences. each we different we the of these different and the methods in the DivergentSet We different methods to distances between sequences. These methods on of sequences S. C. method to the search for in the acid sequence of Mol. Biol. PubMed Scopus Google of the of between Google Scholar) of in the sequences of different and the in R. sequence alignment with and 2004; PubMed Scopus Google Scholar). the method provides the sequence the other to the time on sequence of the of between Google Scholar) are used to The are by the of that the have in The between and is a and are the sets of of in the and a very method of comparing because they not that the sequences be The time to of the of the rather than the of the R. sequence alignment with and 2004; PubMed Scopus Google Scholar) also an and method that does not alignment is used to the initial guide tree for This method is also faster than building a tree using distances from We the methods and by comparing the tree distances from these with the distances the distances between sequences using the tree P. R. Scholar) from each of distance for the tree or the neighbor-joining tree Nei M. The neighbor-joining a method for phylogenetic Biol. Google Scholar). We methods for additional sequences. we one of the S. Babbitt P. more from sequence similarity 1999; PubMed Scopus Google the of many BLAST by sequences that in a of BLAST from sequences in the We used a BLAST of and a of or These can be in the we S. A. J. Miller W. Lipman D. 
BLAST and a of protein search PubMed Scopus Google a method that the sensitivity of BLAST by building up a of conserved on each sequence for using an of We these methods on of sequences and on of protein families that To the of we methods that the sequences in an by a phylogenetic The tree to sequences from the analysis the time that be to with other sequences. the tree with the the and the the the of each are in an from to The tree can also divergent sets. The method that the phylogenetic tree provides a of sequence evolution and that the of evolution are in In this sequences are in tree with the sequence we each sequence a sequence more divergent than the identity is We in of is the of or residues that are the same in the alignment of both sequences by the of the sequences that in the tree are to be divergent is from and are not This tree method the of However, also many by sequences from families of they are primarily because of in between sequences. are when sequences that in the tree the divergent sequence in the tree are more similar to the sequence than is the divergent These can when the tree is or when the of evolution are in different of the or sequences that are to be in the set but are not with this or with of the we are because sequence that is to be from the tree is with each sequence already in the Sequences that are the identity are The we each sequence only with the sequences already chosen for the divergent set than with This method is similar to the tree method that the with sequence not when the divergent sequence is but sequences that have not already been from the tree have been This method no because sequence that in the divergent set has been with each other sequence in the divergent provides large time when sequences are in the divergent set but more when the identity is and sequences must be with other sequences. 
of this method is that is to because the each the tree determine other to be and cannot be the of the This method is the efficient of the methods we when on a we used a method that we In this we use the tree to a divergent set and this set in this set be divergent because the tree method is This method also no and the of also has the that This arises because in the set must be and of can be This method is the efficient of the methods we when in parallel on a time for the rather than We to these methods using a phylogenetic tree rather than a distance for the tree the sequences that are related to one Thus, when comparing each sequence with the sequences that are already in the divergent set, that the sequence to be are in the the of we found that the divergent set in the of a phylogenetic tree the much to than the full distance especially when the of sequences is large tree with is much for to than the distance with In the tree a of the time so the of using the of important when choosing divergent sets is whether the application a random divergent set or a divergent random divergent set that the of sequences in the divergent set is can be important for phylogenetic divergent set many sequences from the set can be important when the of is We both random and divergent sets using a To a random divergent set, we the of the each of the tree using the random M. a 1998; Scopus Google Scholar) in the for each sequence with the sequence the sequence found by the in each from the we the sequence with each other sequence that is in the already sequences. Any sequence that is too similar to the sequence is random set can be by the The of each set of sequences is and sequence be when is an choice between of sequences. To sets of sequences, we a fully sequence is to each other In this the the sequences, and the the identity We the with of that are the identity are the identity This a set of divergent sequences. 
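Random divergent sets of the kind discussed above can be approximated by visiting the sequences in a seeded random order and keeping a sequence only if it is not too similar to any sequence already chosen. The sketch below is a simplified stand-in, not the method used by DivergentSet: it shuffles a flat list of names instead of randomizing branch order in a phylogenetic tree, and the names `random_divergent_set`, `lookup_identity`, `NAMES`, and `PAIRWISE` are hypothetical, as are the identity values.

```python
import random
from typing import Callable, Dict, FrozenSet, List, Sequence

# Hypothetical precomputed pairwise identities for three sequences.
NAMES: List[str] = ["a", "b", "c"]
PAIRWISE: Dict[FrozenSet[str], float] = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.1,
    frozenset(("b", "c")): 0.1,
}


def lookup_identity(x: str, y: str) -> float:
    """Return the stored identity for an unordered pair of names."""
    return PAIRWISE[frozenset((x, y))]


def random_divergent_set(
    names: Sequence[str],
    identity: Callable[[str, str], float],
    max_identity: float,
    seed: int,
) -> List[str]:
    """Shuffle the names with a seeded generator, then keep each name
    only if it is at most max_identity identical to all kept names."""
    rng = random.Random(seed)  # recording the seed makes the set reproducible
    order = list(names)
    rng.shuffle(order)
    chosen: List[str] = []
    for name in order:
        if all(identity(name, c) <= max_identity for c in chosen):
            chosen.append(name)
    return chosen


if __name__ == "__main__":
    # Different seeds may pick "a" or "b", but never both.
    for seed in range(3):
        print(seed, sorted(random_divergent_set(NAMES, lookup_identity, 0.8, seed)))
```

Storing the seed alongside each set is one way to let a particular random set be regenerated later, which matters when many sets are compared in a sensitivity analysis.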
is often important to specific sequences in the divergent Sequences may be because are or because the in they are found is Because the in each sequence in the tree whether be in to other sequences in the same we can sequences to be by each of these sequences to the in of in the tree We sequences by whether they are in a of to or whether they for These include the of can be by the and of each such that the of a the in the is from the to the such that that has a is This method random sampling from the of divergent sets that include many of the sequences a pair of sequences are similar to one only one sequence from that pair be in the divergent To the of the different we used one from S. K. W. M. Kyoto Encyclopedia of Genes and 1999; PubMed Scopus Google Scholar) on and one We these sets to a of The set, of the non-redundant sequences that a search in for the of both and sequences are not to one and of the same M. S. A. D. U. W. W. C. D. lysyl-tRNA to PubMed Scopus Google The set, of sequences each with sequences using a random also the model R. the of and Scopus Google Scholar). sequence from a sequence with acid evolve a sequence in the range This set to and for in the divergent sets. are sequences that are in the set because they are too similar to one of the other sequences in the are sequences that excluded from the divergent and because the protein families for the biological sequences in are not This to whether BLAST to are finding or are sequences by This sequence set is of the of and the of finding in a taxonomic group the of the in and on and on a The are the we in for and the BLAST, PSI-BLAST, and MEME programs for we used the The interface and is on our The is a of the used by DivergentSet to divergent sets of sequences. a of identifiers, or sequences. sequences from the sequence using the for search for additional related sequences. 
each a BLAST or and to the to in parallel on a the BLAST or are the of and of time have been on the the BLAST or have the using the S. Babbitt P. more from sequence similarity 1999; PubMed Scopus Google Scholar). to a tree with the sequences and to the to in the tree has been a set of divergent sequences using the tree a guide the are the of and of time have been on the set of divergent sequences from the of sequences in the this an of and up the of such that no more than of of sequences to these to the and in parallel on the of sequences are the of and of time have been on the of sequences have been the set of divergent sequences using the set or tree method a of the phylogenetic in sequences, and in sequences with the has a sequences have these divergent sequences. the of this search in the to the to the of the search by manually or sequences from the divergent set or by the BLAST This to nearly for the BLAST and the analysis significant of time is distance from of sequence This has a time of both in the of sequences and in the of the sequences. 
The of time the distance for the large numbers of sequences that are in We methods of the time the time each and the of the of the different distance these use the distances to We the and distance in the of the of we to the of the by a of by in using of rather than are for the and using both much faster than even using our a method to the of As the time in to the of the of for this time from to sequences to to sequences, for and the with sequences and Thus, using methods such to the of the can large time but the methods used by even The similar for both and sequences, that the highly conserved in the biological sequences not to the time and of the different set picking methods we in this are of in a different random divergent set The time for using or because sequences must be The method can be 2 orders of magnitude faster than when sequences are more divergent than the but to a of of when the is and more sequences are in the divergent set and must be with one The method 2 orders of magnitude faster than even our The of by using the phylogenetic tree a The on this are because a set picking method the same of whether sequences are using the or the tree the of by 2 orders of magnitude the to to the method the of by more than an of magnitude of in a of time from to in a substantial that in for the methods used to the methods in the tree to the method used to the Because building the tree little of the these methods in a very similar The of sequences found is similar using each method that too or too many are sequences on the and on whether are or the tree is This of the method is to the between the and the identity from the between with of 2 or 3 with that a function be the different protein families very different not The large of to the function for each protein in with the the distance that be by using The for each method the sequence identity in the set is for methods using the and methods for not are identical to for In other methods sequences that are far in the 
similarity by be for motif-finding The that using is not an using the method does not The that a highly with of to the tree of the methods that the tree and using and the tree can be the is and is faster than the alignment for of sequences. We the of the methods for finding additional sequences and on both the and the the set we to and because we the of each sequence in each in the sequences and we only the of sequences and the a BLAST are for an of to a much larger of the sequences a and typically to to on a set of sequences to for Because of the time the BLAST the same time to of similarity sequences from families the of algorithms with a of of the when sequences from the same used but only of the when the same of sequences chosen from different by using sequences from The of sequences in the set that and for The one method with a when used with a of The for this method the set, many more sequences and far sequences than finding sequences when sequences used and on more sequences The of on these is to the of The of is that can by S. Babbitt P. more from sequence similarity 1999; PubMed Scopus Google Scholar). the we used, the and sequences more than one of the initial sequences. with e.g. can in much sensitivity with little of not We the the same for analyses in the so that the be when used with the same BLAST we the effect of the set choice on the of the motifs by MEME (1Bailey T.L. Elkan C. Altman R. Brutlag D. Karp P. Searls D. ISMB-94: Proceedings, Second International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA1994: 28-36Google Scholar). 
this we used the MEME that the of motifs to and a set to from motifs many motifs are to reach the the set, we random divergent sets of sequences from to identity We found that the motifs that MEME identified from the same set of initial sequences on divergent set of sequences we The of the motifs The in use a range of similarity to whether each of the of motifs the in of MEME using chosen divergent sets. We each with each other using the We the for each the in the other that had the alignment when with that are for and for and very similar motifs with no more than the of motifs by the of of in the same in both this for for sequence The when the for similarity when or and when or However, these the motifs and the motifs that between the typically not have been significant by the using the very little to the In the similar and are the same in of using the to than a in a and that the are not primarily to between similar of of the with different numbers of on the and This of the of using many different divergent sets for finding and for using only motifs and of motifs that are We the analysis on set, in sequences rather than but the identical not In to the analyses and we on the distance and the We based on of and and the of the for of 2 and 3 similar to with of We also an that each of the sequences to the sequences in the set rather than sequences from the tree very similar to we methods of building the tree with very similar to in we of the and alignment but our typically than the and no in not We found large in both and in the methods we for finding additional sequences, distances between sequences, and choosing the divergent We the of these methods and PSI-BLAST, and in the DivergentSet provides identical to the method but is to very to are for choosing divergent sets. not the identity between of sequences. 
a in of both excluded of sequences similar than the and sequences more similar than the In sequences that are very similar often in for to the of nearly identical sequences in the Because the of choosing divergent sets is to such sequences, we that they for this be to a function to and distances for a random of of sequences from a set, the in the is to be so large that the and remain The to additional sequences is when the sequences from the same sequence and in many sequences are used DivergentSet be used to in such analyses by a set of sequences for that are rather than that are to the families of sequences that are similar to be used for the of sequences found many sequences in a much The can also be by the no time but may affect We the methods in this analysis to a using the same BLAST but we that the of is in using very e.g. to very using a of that are than S. Babbitt P. more from sequence similarity 1999; PubMed Scopus Google Scholar). The divergent set chosen can the motifs than of the motifs the same between different chosen divergent sets. These the of using many different divergent sets from the same sequences for analyses such a of the biological of motifs that are identified from a divergent set of also suggests that the of motif-finding algorithms may be by using divergent sets chosen from the same larger set of sequences. 
This to the analysis using different divergent sets the for an efficient method for set each divergent set of to analyses by using many different sets is We a both choosing the divergent set manually and the of the full distance and choosing sequences the identity The of choosing divergent sets by has been taking for an with the to the has of analyses to for the full distance and automated this to a of time a set of By taking of a phylogenetic tree from the and using time-consuming we to this time to a By the and BLAST we to the time to a or from the The to random divergent sets with the to additional sequences from the in rather than our to the of a wide range of bioinformatics analyses to the choice of divergent DivergentSet also the of methods for from many of these analyses. We and for providing the set and and of the for We also and other of the for on the
