August 29, 2011 · Open Access

iProphet: Multi-level Integrative Analysis of Shotgun Proteomic Data Improves Peptide and Protein Identification Rates and Error Estimates

Abstract

The combination of tandem mass spectrometry and sequence database searching is the method of choice for the identification of peptides and the mapping of proteomes. Over the last several years, the volume of data generated in proteomic studies has increased dramatically, which challenges the computational approaches previously developed for these data. Furthermore, a multitude of search engines have been developed that identify different, overlapping subsets of the sample peptides from a particular set of tandem mass spectrometry spectra. We present iProphet, the new addition to the Trans-Proteomic Pipeline, a widely used open-source suite of proteomic data analysis tools. Applied in tandem with PeptideProphet, it provides a more accurate representation of the multilevel nature of shotgun proteomic data. iProphet combines the evidence from multiple identifications of the same peptide sequences across different spectra, experiments, precursor ion charge states, and modified states. It also allows accurate and effective integration of the results from multiple database search engines applied to the same data. The use of iProphet in the Trans-Proteomic Pipeline increases the number of correctly identified peptides at a constant false discovery rate as compared with both PeptideProphet and another state-of-the-art tool, Percolator. As the main outcome, iProphet permits the calculation of accurate posterior probabilities and false discovery rate estimates at the level of sequence-identical peptide identifications, which in turn leads to more accurate probability estimates at the protein level. Fully integrated with the Trans-Proteomic Pipeline, it supports all commonly used MS instruments, search engines, and computer platforms.
The performance of iProphet is demonstrated on two publicly available data sets: data from a human whole cell lysate proteome profiling experiment representative of typical proteomic data sets, and from a set of Streptococcus pyogenes experiments more representative of organism-specific composite data sets.
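The abstract's link between posterior probabilities and false discovery rate estimates can be illustrated with a small sketch. It uses the standard relationship in which each accepted identification with posterior probability p contributes an expected 1 - p false identifications. This is an illustration under that assumption, not the Trans-Proteomic Pipeline's code; the function name `estimated_fdr` and the example probabilities are hypothetical.

```python
from typing import Sequence, Tuple


def estimated_fdr(probabilities: Sequence[float], threshold: float) -> Tuple[float, int]:
    """Estimate the FDR among identifications accepted at a probability threshold.

    Assumption (not taken from the source text): the posterior probabilities
    are well calibrated, so an identification with probability p is expected
    to be false with probability 1 - p.
    """
    # Keep only the identifications that pass the probability cutoff.
    accepted = [p for p in probabilities if p >= threshold]
    if not accepted:
        # Nothing accepted: report zero FDR over zero identifications.
        return 0.0, 0
    # Expected number of false identifications among the accepted ones.
    expected_false = sum(1.0 - p for p in accepted)
    return expected_false / len(accepted), len(accepted)


# Hypothetical posterior probabilities for five peptide identifications.
example_probabilities = [0.99, 0.95, 0.90, 0.50, 0.10]
fdr, accepted_count = estimated_fdr(example_probabilities, threshold=0.90)
print(f"accepted={accepted_count}, estimated FDR={fdr:.3f}")
```

The same calculation applies whether the probabilities describe spectrum matches, distinct peptide sequences, or proteins. What changes between levels is which set of probabilities is supplied.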
A combination of protein digestion, liquid chromatography and tandem mass spectrometry (LC-MS/MS)¹, often referred to as shotgun proteomics, has become a robust and powerful proteomics technology. Protein samples are digested into peptides, typically using trypsin. The resulting peptides are then separated and subjected to mass spectrometric (MS) analysis, whereby a subset of the available precursor ions is sampled by the MS instrument, isolated and further fragmented in the gas phase to generate fragment ion spectra. From these spectra, the peptides and then the proteins present in the sample are identified.

¹ The abbreviations used are: LC-MS/MS, liquid chromatography-tandem MS; PSM, peptide to spectrum matches; FDR, false discovery rate; TPP, Trans-Proteomic Pipeline; FFE, free-flow electrophoresis; OGE, off-gel electrophoresis; NSS, number of sibling searches; NRS, number of replicate spectra; NSE, number of sibling experiments; NSI, number of sibling ions; NSM, number of sibling modifications; EM, expectation maximization.
liquid chromatography-tandem MS peptide to spectrum false discovery rate Pipeline number of sibling number of replicate number of sibling experiments number of sibling ions number of sibling maximization. The volume of data generated in proteomic experiments has been the has been by the in several of proteomics sample and and more mass by and The resulting in the number and of data has computational tools that data from of experiments and in a and analysis and tools for tandem mass spectrometry in the to and false peptide to spectrum by database search engines for the of proteomic data sets. by of false discovery in the data sets. a applied it and to and data The for in of and Protein Protein and of proteomic data generated by tandem mass has been in and tools in of shotgun proteomic data. the of new and tandem MS database search as as data for and posterior peptide and protein probabilities in by and mass spectrometry in a is also on of proteomic data analysis, tools for and data and data as in and of proteomic data generated by tandem mass in computational mass for the in for of to these the of the computational tools PeptideProphet to the of peptide identifications by and database of database search and A for by tandem mass which and more analysis of proteomic data. tools the of the widely used Pipeline A of the the same the last a in the of data As a data are in multiple the of overlapping data sets. 
is the in across multiple of shotgun proteomic for of protein A Protein and in of human protein from the human the of of as A of the and and proteome of of the proteome sample and the mass spectrometric of PeptideProphet and have been to accurate estimates in the of to data sets, several in these tools performance with data set Protein for by Furthermore, is a in the analysis of data using a combination of multiple search engines, with the to the number and of peptide and protein has become with the of the of and of the results of multiple the of search the of and to data present a computational method and iProphet, to these the two of and protein identifications, iProphet to more the nature of shotgun proteomic data by as peptide precursor ions as precursor identical and peptide sequences as sequence identical to from different database search tools are integrated the same As the main outcome, iProphet permits the calculation of more accurate posterior probabilities and estimates at the level of peptide probability is used for all sequence identical which is for more accurate estimates at the protein level. The of PeptideProphet allows the data the of a of data from different search We the performance of iProphet on two publicly available data sets: data from a human whole cell lysate proteome profiling experiments representative of typical proteomic data sets, and from a set of Streptococcus pyogenes experiments more representative of organism-specific composite data sets. publicly available data used in The from a on of human by proteomics and to as human data set in the main The MS data from the data in for the the the lysate separated using of digested with and using a ion mass use a replicate of the whole cell lysate experiment It a set of MS in The data set is a set of samples from a Streptococcus pyogenes protein of the human to as pyogenes data set in the The data are available in the data as data set into experiments a of and spectra. 
of the experiments from peptide samples by and on a of the experiments from samples on of the experiments from peptide samples by and on The two experiments from samples separated by a and on MS data to the A representation of mass spectrometry data and to proteomics with search engines then with PeptideProphet, iProphet, and in that from the human data set the pyogenes data of the MS data subsets by The PeptideProphet results in with using iProphet as The with using the data set and the subset of the pyogenes data The protein sequence database used to search from the data set of the human protein sequences on The protein sequence database used to search the data set the pyogenes protein sequences from database and human human protein sequences to for a of human protein in of the pyogenes sequences in both by the peptides of the the into two sets, with of the used by PeptideProphet for and by the iProphet The set used to the performance of the computational The same applied and the performance of Percolator. The two of from by peptide it that identical peptides that from different in identical peptides in the protein The of the sequences in the database then to of search engines to on in the different search engines used in protein identification by searching sequence using mass spectrometry to of with in a Protein A method for the to protein sequences with tandem mass for and database using the search of modified peptides from tandem mass accurate tandem mass peptide identification by and mass spectrometry search A set of applied for all search engines, which as a precursor mass of mass for of the pyogenes data mass for and peptides in which on computer of search all search engines, PeptideProphet on the data in the of peptide identifications in proteomics using the database search and set of the sequences to used in to the of the The PeptideProphet accurate mass of peptide identifications in mass used for the and the mass for the data. 
The number of and mass of PeptideProphet for the in the which in a for peptides mass and number of using these in PeptideProphet The PeptideProphet peptide and identification for mass spectrometry proteomics applied to all and experiments and the applied to experiments The iProphet in to PeptideProphet The probability are on the number of sibling replicate sibling experiments sibling ions peptide precursor and sibling The identifications on the of multiple search engines for the same set of spectra. are the of the search engines are more probabilities from the search engines as by the PeptideProphet analysis on search The probabilities are by the probabilities of that on the peptide sequence and by the number of on the the of for is The is as The the that in a typical data multiple probability identifications of the same precursor ion the of that precursor ion correctly the of to probabilities and to the same peptide ion that all are As a for precursor ions that are commonly identified with probabilities a for precursor ions that are commonly identified with probabilities is for precursor ions that are identified from spectrum method of to the probabilities of precursor ions identified with a is from data set and data sets. 
The of for is to The is to the is a that is used to multiple identifications of the same precursor ion across different experiments the that precursor ions that are in multiple experiments and to the same peptide sequence are more to As a for precursor ions that are commonly identified with probabilities across different a for precursor ions that are commonly identified with probabilities across different is for precursor ions that are identified from experiment The of for is to The is by the The peptides that are identified by with different The is as The peptides that are identified with different mass The is as As sequence database two of The set of used at the PeptideProphet and also in and available for to in the main in the of and to the set of for data the probability used to the of all to false and the then applied to the at a probability and are the number of and the number of a probability and is the number of false The number of is as The same is used to estimates at the peptide sequence level and at the protein level. The same analysis applied on that by that used for the peptide and protein of the posterior The using posterior probabilities and for a probability is as to the of peptide identifications by and database The number of a probability is by the are modified to at the level of peptide sequences and the by iProphet, it is to the PeptideProphet and to the analysis of shotgun proteomic data. 
The at two and protein identification PeptideProphet as all from the experiment the for spectrum It then the to a of and from the data using of as the as the database search also on of the peptides number of mass PeptideProphet then the posterior probability for as using and are the probabilities of a and and and are probabilities of a and in the data set the of and The probabilities and the the and are from the data PeptideProphet in the of the whole of and it into in the data set that identify the same peptide The main of PeptideProphet, is the posterior probability of the analysis at the protein level. is by which as the of and posterior probabilities from and to the probability that a particular protein is present in the The protein probability is as the probability that at to the protein is protein probabilities using to the same peptide ion are by a are as evidence for protein in peptide precursor the to the same peptide sequence different precursor ion charge on with to different peptide The the nature of peptide to protein the that peptides, more the to to a number of identified by multiple It to for and of protein level probabilities by the probabilities by PeptideProphet, to for the protein number of sibling peptides, as in peptide precursor ions that have sibling ions to the same and that have of which are protein iProphet the by more the nature of shotgun proteomic peptide precursor ions as as all peptides and peptide sequence The level and of the multilevel are the the of multiple to the same peptide precursor ion is into by a new the number of replicate and using it in a to the in The iProphet further it to multiple search engines by level of the search iProphet to the PeptideProphet of using the for of The identifications on the of multiple search engines for the same set of of The the that in a typical data multiple probability identifications of the same precursor ion the of that precursor ion correctly the of to probabilities and to the same 
peptide ion that all are of The is used to multiple identifications of the same peptide precursor ion across different the that precursor ions that are in multiple experiments and to the same peptide sequence are more to experiments of the same of different of the same sample different samples from the same a as as peptides in samples for particular to and to samples of that It is to the to experiments, by different experiment to experiment of The peptides that are identified by two more peptide precursor ions of different is on the that peptide are often identified in more charge identifications of the same peptide are to in multiple charge of The peptides that are identified with different mass is on the that identifications the same peptide with two different mass in both are to from of by mass spectrometry in a that are as in search of the is as a of two the and with PeptideProphet iProphet the to all of the new in the PeptideProphet probabilities using are used as of to the and for of the new by The probability of as is using is the PeptideProphet and and the probability and of these the is as the of the and for the the probability of on the PeptideProphet probability and the and by the of the iProphet by the of the and in the The are using the of peptide identifications in proteomics using the database search and The all have that all are the in which the are applied the of iProphet to the PeptideProphet results probabilities The main of iProphet is the identification probability at the peptide sequence as the probability of all to that the into is the set of peptide sequences and further of the peptide probabilities for the number of sibling peptides, the of modified compared with the to peptide probabilities are then used to the protein probability as in the of peptides peptides sequence is present in multiple in the protein sequence peptides across the protein and protein as previously A for by tandem mass The to with peptides with the for is in in 
iProphet, it to the analysis, from to protein a The performance of iProphet using a human data set representative of data generated in a typical experiment The data set with different search engines and and the from search using PeptideProphet then applied to PeptideProphet results for search search as as to all search engines The analysis with addition of iProphet applied to The results compared at peptide and in of the of probabilities to and identifications as as The use of iProphet in a of correctly identified at of on the search The results of the analysis on search results are in for search The the number of as a of with the of in the of the at the level for the search engines in particular data the performance the protein the use of iProphet in the the results for and for search As all search engines with iProphet and of the of correctly identified search performance at rate of A also at the protein level the of posterior probabilities by the estimates with the estimates at the peptide and protein The using PeptideProphet and probabilities for a search are in for search the for which it PeptideProphet accurate probability as by a the and estimates is by a the the iProphet probabilities at the level to is the at the level of peptide sequences and at the protein the estimates on PeptideProphet probabilities become accurate and and for search the PeptideProphet probability all the same peptide sequence the probability at that resulting in iProphet more accurate probabilities by the and probability at the peptide sequence level it into all that into the identification of a peptide at the protein as in also all the protein probabilities with the of iProphet are more accurate the of different in The number of as a of using PeptideProphet iProphet and using iProphet with a The of the number of sibling and The the and as and The and the iProphet The the of the the of the is on the by the the the probability of a in that and the probability in the of the pyogenes 
data iProphet allows accurate probabilities the results of multiple different database search tools The of the search results the probability for spectrum across all search as by results from multiple search a of the probabilities as compared with at the level iProphet at the level multiple search engines are and the probabilities and estimates accurate at the peptide sequence level and the protein level and The same for the human data set are with the pyogenes data data representative of composite organism-specific data sets, challenges the performance of the and further the of data the in the number of at of from to and for and the of search is the of it is to that the of search engines on the number of protein identifications at a different in data set in the human data set has Furthermore, a in the across data generated on different that it to search search for the analysis of a particular data the results from all search engines with iProphet and of the of correctly identified search performance the probabilities using iProphet in the more accurate by the both for the search results for peptide and protein and for all search engines the analysis for both data that the use of iProphet in the analysis the number of results at a more it the of the the of iProphet to the in performance at the iProphet applied to the pyogenes data set with iProphet, of and at a is using a search in the number of by in iProphet, with to the PeptideProphet The results of iProphet all iProphet are for the as data the number of sibling ions the The and by iProphet in data set are in is a and in of with on the from to a peptide identification on spectrum a peptide precursor ion is more to the same peptide also identified with probability from a charge spectrum of and more it is also identified from a spectrum as the identifications of peptides from precursor ions of different charge are as the and with all to the same peptide sequence a the protein probability As a of leads to more 
accurate protein the of and by iProphet in the pyogenes data all search engines It that the is also It also that the of different in iProphet on the used to generate the data. the a in experiments on of peptides as experiments for of the in a data experiments, peptides in multiple modified and a probability both are identified with The performance of PeptideProphet and iProphet compared with that of for peptide identification from shotgun proteomics data as a state-of-the-art a to and a on the and PeptideProphet and A of computational and rate for peptide and protein identification in shotgun and of for in the search engines used in the to search data from all search engines with iProphet a in both data a more PeptideProphet, and on search results the data set The performance of iProphet and in the of iProphet at by the that is to for PeptideProphet and iProphet the The of using peptides spectrum the PeptideProphet previously as and of for in and demonstrated a in number of in the is for in the it is to the number of identifications at the peptide protein The analysis using the subset of the pyogenes data set PeptideProphet in is of the of PeptideProphet to use mass of the MS on of in the As a iProphet both and PeptideProphet in that As in the data to identify more in by the peptide iProphet has been and as of the A to use the has been A of the and of the is available at The PeptideProphet to search of data set iProphet then as more that have been with PeptideProphet and PeptideProphet iProphet applied to PeptideProphet results for search used to the results of multiple and multiple The of iProphet is another that the for spectrum in the iProphet the probabilities it to the resulting The and for are in the as and also the for the data all data by iProphet use all and are the for multiple sibling is a search is and as a it is in analysis of the iProphet also by the The from iProphet is further using a modified of which a with the protein level of the iProphet is in 
PeptideProphet and and is available of the it is a of the it on all and It with from all as from to the A proteomics analysis of the are all the and tools of the with iProphet The has for of the search engines protein identification by searching sequence using mass spectrometry to of with in a Protein A method for the to protein sequences with tandem mass a to identify peptides sequence database searching using tandem mass and and of a searching method for peptide identification from for tandem mass spectrometry data of modified peptides from tandem mass accurate tandem mass peptide identification by and mass spectrometry search iProphet with the of choice at The iProphet as as all is and The is available in a publicly on of the to the to have to iProphet to new and data sets. new the of the is and is available on and are available at the The main of PeptideProphet and and the new iProphet tool is the analysis of and integration of peptide and protein identifications in proteomic data sets. the last several have that it to the in with to approaches on A of computational and rate for peptide and protein identification in shotgun search engines, and the search into A method for the of mass protein identifications using and as for The used to the as the probability in the generated from with the mass spectrometry search A probability for protein identification and using tandem mass data and protein sequence of the of the for database search using A method for the of mass protein identifications using of the which in the same of as is on the of probabilities and of tandem mass A The of these the search is that are different a of of the across different data sets. and are the analysis of multiple spectra. 
is to more for of of database search as the false discovery rate a and powerful to multiple and posterior peptide and protein probabilities A of computational and rate for peptide and protein identification in shotgun discovery and in mass of for the of peptide probabilities and false discovery of the same shotgun proteomics, the for into two The search for increased in protein identifications by mass that are a database and sequences and that to peptide sequences and false to sequences the same search for increased in protein identifications by mass The main of is and the of the same the a probability for a more analysis using a analysis of a in PeptideProphet and tools the posterior probability for as the for and false probabilities are of peptide identifications in mass to another rate the referred to as peptide and used to for subset of with probability a also as to the protein level analysis which the calculation of protein level probabilities and of PeptideProphet is that it is to the peptides of mass peptide number of from the of the and are on with the for of these the of into the further increases the and in PeptideProphet tools have and the of PeptideProphet on the of peptide identifications in mass into the for of the in the of data the of a of The with of the of peptide identifications in proteomics using the database search and which it to PeptideProphet to a number of database search compared with the a number of as several of the tools applied to data have become in As on A for by tandem mass of the challenges is the of the from to peptide sequence to protein level. The in to and it in the of to data sets. of the main of the iProphet is to accurate probabilities at the level of peptide sequences with the PeptideProphet which are accurate at the level. in leads to more accurate probabilities and estimates at the protein level. 
of the in iProphet to the multilevel of shotgun proteomic data also the and false the number of of also provides into the nature of the sample and different and with peptides and of for of shotgun proteomic data The protein to by and is in the of data of and the of the of probabilities is on the of and the estimates on the of the database two the and the database a search a database A for of database for protein identification using data for false and false discovery in shotgun a A of computational and rate for peptide and protein identification in shotgun Furthermore, of the database all of false identifications false of sequence discovery and in mass the is in that iProphet the of peptide and estimates compared with the a the and The analysis using two data demonstrated that the use of iProphet provides increased number of identifications at the same and more accurate probabilities at all has been on data as a of the with the human of peptide sequences by mass a for for proteomics to data from and experiments, a data analysis and present the results as a of all peptides and in the publicly available data. The iProphet has become a of the and to further the and of the database analysis and tools for tandem mass spectrometry in The multilevel allows integration of multiple database search A is to the of and a number of available open-source database search by results from multiple search peptide identification by search in proteome studies by analysis of false discovery for multiple search for peptide identifications from tandem mass it has been widely used in of the from multiple search engines into a has been and in of different generated by search the of iProphet, these multiple search results into the of different search engines to a number of identifications at a constant as a of the commonly used it the analysis available to a number of We and and iProphet and to with

Cite This Study

Shteynberg et al. (August 29, 2011) studied this question.

synapsesocial.com/papers/6a0df5f5cecdf5fb20baa8d0 https://doi.org/10.1074/mcp.m111.007690