February 2, 2012 | Open Access

Analysis of High Accuracy, Quantitative Proteomics Data in the MaxQB Database

Christoph Schaab (Evotec, Germany), Tamar Geiger (Weizmann Institute of Science), Gabriele Stoehr (Digital Proteomics, United States)

Abstract

MS-based proteomics generates rapidly increasing amounts of precise and quantitative information. Analysis of individual proteomic experiments has made great strides, but the crucial ability to compare and store information across different proteome measurements still presents many challenges. For example, it has been difficult to avoid contamination of databases with low quality peptide identifications, to control for the inflation in false positive identifications when combining data sets, and to integrate quantitative data. Although, for example, the contamination with low quality identifications has been addressed by joint analysis of deposited raw data in some public repositories, we reasoned that there should be a role for a database specifically designed for high resolution and quantitative data. Here we describe a novel database termed MaxQB that stores and displays collections of large proteomics projects and allows joint analysis and comparison. We demonstrate the analysis tools of MaxQB using proteome data of 11 different human cell lines and 28 mouse tissues. The database-wide false discovery rate is controlled by adjusting the project specific cutoff scores for the combined data sets. The 11 cell line proteomes together identify proteins expressed from more than half of all human genes. For each protein of interest, expression levels estimated by label-free quantification can be visualized across the cell lines. Similarly, the expression rank order and estimated amount of each protein within each proteome are plotted. We used MaxQB to calculate the signal reproducibility of the detected peptides for the same proteins across different proteomes. Spearman rank correlation between peptide intensity and detection probability of identified proteins was greater than 0.8 for 64% of the proteome, whereas a minority of proteins have negative correlation. 
This information can be used to pinpoint false protein identifications, independently of peptide database scores. The information contained in MaxQB, including high resolution fragment spectra, is accessible to the community via a user-friendly web interface at http://www.biochem.mpg.de/maxqb.

Bottom-up proteomics consists of the MS analysis of enzymatically digested proteomes. During the last few years, measurements have increasingly been performed in a high resolution, quantitative format (1 Mallick P. Kuster B. Proteomics: A pragmatic perspective. Nat. Biotechnol. 2010; 28: 695-709, 2 Cox J. Mann M. Quantitative, high-resolution proteomics for data-driven systems biology. Annu. Rev. Biochem. 2011; 80: 273-299, 3 Domon B. Aebersold R. Options and considerations when selecting a quantitative proteomics strategy. Nat. Biotechnol. 2010; 28: 710-721). Each proteomic experiment typically generates large amounts of raw MS and MS/MS data, which should be made available with each experiment (4 Olsen J.V. Mann M. Effective representation and storage of mass spectrometry-based proteomic data sets for the scientific community. Sci. Signal. 2011; 4: pe7). Computational proteomics is then used to extract high confidence peptide and protein identifications and relative ratios between conditions, as well as to distill biological implications from the data (5 Taylor C.F. Paton N.W. Garwood K.L. Kirby P.D. Stead D.A. Yin Z. Deutsch E.W. Selway L. Walker J. Riba-Garcia I. Mohammed S. Deery M.J. Howard J.A. Dunkley T. Aebersold R. Kell D.B. Lilley K.S. Roepstorff P. Yates 3rd, J.R. Brass A. Brown A.J. Cash P. Gaskell S.J. Hubbard S.J. Oliver S.G. A systematic approach to modeling, capturing, and disseminating proteomics experimental data. Nat. Biotechnol. 2003; 21: 247-254, 6 Kumar C. Mann M. Bioinformatics analysis of mass spectrometry-based proteomics data sets. FEBS Lett. 2009; 583: 1703-1712, 7 Deutsch E.W. Mendoza L. Shteynberg D. Farrah T. Lam H. Tasman N. Sun Z. Nilsson E. Pratt B. Prazen B. Eng J.K. Martin D.B. Nesvizhskii A.I. Aebersold R. A guided tour of the Trans-Proteomic Pipeline. Proteomics. 2010; 10: 1150-1159, 8 Schaab C. Analysis of phosphoproteomics data. Methods Mol. Biol. 2011; 696: 41-57).

Apart from the analysis of individual projects, several repositories for proteomic experiments have been developed, each with different purposes in mind. The Global Proteome Machine (9 Craig R. Cortens J.P. Beavis R.C. Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. 2004; 3: 1234-1242) and PeptideAtlas (10 Desiere F. Deutsch E.W. King N.L. Nesvizhskii A.I. Mallick P. Eng J. Chen S. Eddes J. Loevenich S.N. Aebersold R. The PeptideAtlas project. Nucleic Acids Res. 2006; 34: D655-D658, 11 Deutsch E.W. Lam H. Aebersold R. PeptideAtlas: A resource for target selection for emerging targeted proteomics workflows. EMBO Reports. 2008; 9: 429-434) are two of the earliest such collections, with the primary goal of providing a collection of peptide identifications. These collections can, for example, be mined for the design of multiple reaction monitoring experiments in targeted proteomics (12 Hüttenhain R. Malmström J. Picotti P. Aebersold R. Perspectives of targeted mass spectrometry for protein biomarker verification. Curr. Opin. Chem. Biol. 2009; 13: 518-525). In contrast, TRANCHE (proteomecommons.org/tranche) is a repository for the raw mass spectrometric data (13 Hill J.A. Smith B.E. Papoulias P.G. Andrews P.C. ProteomeCommons.org collaborative annotation and project management resource integrated with the Tranche repository. J. Proteome Res. 2010; 9: 2809-2811). PRoteomics IDEntifications database (PRIDE) is a large effort at the European Bioinformatics Institute, which has collected peptide and protein identification data from more than 10,000 experiments (14 Côté R. Reisinger F. Martens L. Barsnes H. Vizcaino J.A. Hermjakob H. The Ontology Lookup Service: Bigger and better. Nucleic Acids Res. 2010; 38: W155-W160, 15 Martens L. Hermjakob H. Jones P. Adamski M. Taylor C. States D. Gevaert K. Vandekerckhove J. Apweiler R. PRIDE: The proteomics identifications database. Proteomics. 2005; 5: 3537-3545). PRIDE, PeptideAtlas, and TRANCHE are also part of the ProteomeXchange consortium, whose objective is to provide a single point of submission for MS-based proteomics data (www.proteomexchange.org). Many dedicated databases for specific organelles or organisms also exist (see for example Refs. 16 Ahmad Y. Boisvert F.M. Gregor P. Cobley A. Lamond A.I. NOPdb: Nucleolar Proteome Database: 2008 update. Nucleic Acids Res. 2009; 37: D181-D184 and 17 Gnad F. Oroshi M. Birney E. Mann M. MAPU 2.0: High-accuracy proteomes mapped to genomes. Nucleic Acids Res. 2009; 37: D902-D906).

Most of these databases accept data from heterogeneous sources, which presents a challenge in analysis. For instance, data acquired with different proteomics technologies, different computational pipelines and different quantification strategies may be combined in the database.
Although these problems have been addressed to some degree by open standards and joint analysis of deposited raw data, we reasoned that there should be a role for a database designed for homogeneous, quantitative, high resolution data, which nevertheless covers a large part of diverse proteomes. Here we describe the construction of the MaxQB database, which is meant to address the above challenges, allow novel types of analyses, and serve as a public resource via a versatile web interface. We illustrate MaxQB with deep proteome data generated in an accompanying paper (18 Geiger T. Wehner A. Schaab C. Cox J. Mann M. Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins. Mol. Cell. Proteomics. 2012; 10.1074/mcp.M111.014050). In that study, the proteomes of 11 widely used cell lines were mapped in depth with high resolution MS and MS/MS data. We describe analysis and visualization tools of MaxQB, a solution to the problem of inflated false positive protein identifications, and examine the reproducibility of peptide intensity rank order for each protein in different proteomes.

MaxQB is structured as a classical three-tiered application consisting of data, application logic, and presentation. The data is a database by are it is in to the database to database management systems the and open source database The application is in and using the web application The web application a web the is of generated and human proteome databases were to MaxQB to a protein and were to a single protein of is by For example, the for the human protein to the the protein and the database have The were to a using the which the of The of the the were from P. D. K. S. Chen Y. P. S. S. L. M. T. N. A. D. S. R. F. E. P. I. B. B. D. M. M. D. S. J. A.J. S. A. J. Birney E. F. I. R. J. Hubbard A. J. Acids Res. 
2011; PubMed Scopus Google and the of proteins in different organisms were from database T. K. T. S. and tools for Acids Res. 2010; 38: PubMed Scopus Google Scholar). MaxQB as a repository for experiments performed in it an increasing of deep proteome experiments of and cell types and in the The data are from a proteome experiment of 11 cell lines in the accompanying paper (18Geiger T. Wehner A. Schaab C. Cox J. Mann M. Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins.Mol. Cell. Proteomics. 2012; 10.1074/mcp.M111.014050Abstract Full Text Full Text PDF Scopus (579) Google Scholar). and cell lines were in conditions, and to the J.R. A. N. Mann M. for proteome 2009; PubMed Scopus Google Scholar) and by peptide were by a in J.V. J. E. E. P. Taylor D. M. M. A. Mann M. S. A with high Cell. Proteomics. 2009; Full Text Full Text PDF PubMed Scopus Google Scholar). Each proteome of in Analysis of the was performed in J. Mann M. high peptide identification mass and protein Biotechnol. 2008; PubMed Scopus Google Scholar) using the J. N. A. J.V. Mann M. A the Proteome Res. 2011; 10: PubMed Scopus Google Scholar). The that are and are accessible in MaxQB are data with the between the of identifications for cell lines (see and the correlation analysis (see is data with For T. Wehner A. Schaab C. Cox J. Mann M. Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins.Mol. Cell. Proteomics. 2012; 10.1074/mcp.M111.014050Abstract Full Text Full Text PDF Scopus (579) Google MaxQB and the of the line proteome can be at of correlation For each protein with two or more peptides the Spearman correlation between the of the peptides and the detection probability were and of proteins with high correlation and low correlation In to the cell line data, we also data from a proteome experiment of 28 mouse (18Geiger T. Wehner A. Schaab C. Cox J. Mann M. 
Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins.Mol. Cell. Proteomics. 2012; 10.1074/mcp.M111.014050Abstract Full Text Full Text PDF Scopus (579) Google Scholar). 28 were from and in The were and with a used by in cell discovery was performed with by peptide by The peptide were by a in Each proteome of in Analysis of the was performed in J. Mann M. high peptide identification mass and protein Biotechnol. 2008; PubMed Scopus Google Scholar) using the J. N. A. J.V. Mann M. A the Proteome Res. 2011; 10: PubMed Scopus Google Scholar). by in cell false discovery rate MaxQB as a repository and analysis for high resolution MS-based proteomics it stores protein and peptide identifications together with the high or low resolution fragment and quantitative such as ratios or label-free of data, MaxQB is integrated with J. Mann M. high peptide identification mass and protein Biotechnol. 2008; PubMed Scopus Google Scholar) the of data the of is to the data to the database. In the data is by a web the data can be the interface of In the is to such as the project experiment and of the data are in a database an database management The can and the data a web interface. the data can be or web from visualization and data analysis tools the for analysis in or demonstrate the and of MaxQB, data from the proteome of 11 cell lines in the accompanying paper (18Geiger T. Wehner A. Schaab C. Cox J. Mann M. Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins.Mol. Cell. Proteomics. 2012; 10.1074/mcp.M111.014050Abstract Full Text Full Text PDF Scopus (579) Google Scholar) were to the database. The combined data were the database using A problem of proteomics experiments is the of protein between experiments that were different databases or different of the same database J. C. Hermjakob H. Vizcaino J.A. 
and The of the protein database the storage of proteomics Cell. Proteomics. 2011; Full Text Full Text PDF PubMed Scopus Google Scholar). we to problem by a protein that the of protein databases to a protein For the human these databases are and In database that to and are mapped to a protein (see for more The protein human with in and (see This was the for analysis. these we the of peptides and by MS between and are such and of are between two or more of identified proteins with and peptides in with in the human in 11 cell peptides from identified peptides from identified Open in a analysis of cell line identified each cell line a of and analysis of all 11 cell line proteomes together identified proteins and In of the identified proteins generated of which were identified in the cell line data at a false discovery rate of For each of these the database the database identification the individual for the peptide and the fragment proteins by more than half of all human and a large of all peptides are identified in the database. Apart from the line MaxQB a of large experiments human proteomes. these experiments together for proteins by 64% of all human and of This that it to for a large part of the proteome from data a of diverse proteomes in which all human proteins are illustrate of MaxQB, we describe with diverse types of that can be addressed by novel database. a we that the is in of a specific protein to expression across the different cell lines and across mouse tissues. The can the database by specifically by and source database. The can be combined by and using the can be used is with the In example, the for all human that have a with The of the the In the in the databases and with the protein expression a the protein expression across the 11 human cell lines. of by more than of between and by label-free quantification in Cox J. H. B. M. J. S. M. H. M. Mann M. 
proteomics reveals in 2010; Full Text Full Text PDF PubMed Scopus Google Scholar) In to expression of the same protein between MaxQB can also expression within of the with all proteins in that the expression of the protein is estimated by the of peptide of the proteome to each in The B. D. N. J. J. Chen M. Global quantification of expression 2011; PubMed Scopus Google Scholar) is and can also be used to protein In selection of the proteome a the expression of the protein of with all proteins in cell This reveals that is the expressed proteins in these the The for the protein the of identified peptides the of and across the 11 cell lines and biological the in digested peptides with between and and the as from and at the Acids Res. 2011; PubMed Scopus Google Scholar) are of The are two The are in digested peptides with between and peptides are by label-free across the 11 cell lines with The may also be in the expression of in The database of organisms T. K. T. S. and tools for Acids Res. 2010; 38: PubMed Scopus Google Scholar). MaxQB information to allow the to to the proteomes of For example, two proteins in and the mouse the information the mouse an example of MaxQB can integrate data from the expression of in 28 mouse was identified in and several projects to identify all proteins specific have been the of the Proteome P. Aebersold R. A. A. K. L. J. Deutsch E.W. B. F. D. S. M. S. M. T. The Proteome and Cell. Proteomics. 2011; Full Text Full Text PDF PubMed Scopus Google Scholar). In a we many proteins have been identified for a and there are with low identification MaxQB all human mouse or and allows the to of for analysis In the of for example, in the of protein and the protein identifications in the cell line proteomes in of the the can to the of proteins in the of the as well as the peptide information. 
of all are with high confidence protein identification and the across the to be A of proteomics repositories is the selection of peptides for targeted such as multiple reaction In the a is in an multiple reaction monitoring for the cell protein and by for all peptides that are for have an identification than and have in the for proteins can be combined by The peptides these The the peptide and displays the fragment for the identification for peptide The can the of together with the and and as a for multiple reaction monitoring A of using MaxQB for is the that database high resolution that are by the which to that be in B. Mohammed S. A.J. A between and peptide Proteome Res. 2011; 10: PubMed Scopus Google Scholar). is a common in MS-based proteomics to control the of false identifications by a combined and database and then adjusting the cutoff to a such that the of identified is to a false discovery rate for confidence in protein identifications by mass 4: PubMed Scopus Google Scholar, A.I. A of computational and rate for peptide and protein identification in Proteomics. 2010; PubMed Scopus Google Scholar). Although database is a to control in single projects, an challenge when combining the of many different experiments the same In the of identifications the same proteome is (see for example the false identifications are of each and to an inflation of false identifications. A but different problem when combining the scores from multiple for the same data D. Deutsch E.W. Lam H. Eng J.K. Sun Z. Tasman N. Mendoza L. Aebersold R. Nesvizhskii A.I. analysis of proteomic data peptide and protein identification and Cell. Proteomics. 2011; Full Text Full Text PDF PubMed Scopus Google Scholar, T. H. C. Nesvizhskii A.I. A approach for peptide identifications from multiple database Proteome Res. 2011; 10: PubMed Scopus Google Scholar). we the of identifications from multiple proteomes that have been with the same Although has to been with experimental data. 
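
The target-decoy strategy referred to above, in which a score cutoff is adjusted until the share of decoy hits reaches a desired false discovery rate, can be illustrated with a minimal Python sketch. It is not MaxQB's or MaxQuant's implementation: the function names, the `(score, is_decoy)` input format, the decoys-over-targets FDR estimate, and the 1% default are all assumptions made for illustration.

```python
from typing import List, Tuple

Hit = Tuple[float, bool]  # (identification score, is_decoy); higher score = better


def q_values(hits: List[Hit]) -> List[Tuple[float, bool, float]]:
    """Return (score, is_decoy, q_value) tuples sorted by descending score.

    Assumption: FDR at a threshold is estimated as decoys / targets among
    all hits scoring at or above that threshold.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    fdrs = []
    decoys = targets = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # The q-value of a hit is the smallest FDR obtainable at its score
    # or at any more permissive (lower) score threshold.
    result = []
    running_min = float("inf")
    for (score, is_decoy), fdr in zip(reversed(ranked), reversed(fdrs)):
        running_min = min(running_min, fdr)
        result.append((score, is_decoy, running_min))
    result.reverse()
    return result


def accepted_targets(hits: List[Hit], max_fdr: float = 0.01) -> List[float]:
    """Scores of target (non-decoy) hits whose q-value is within max_fdr."""
    return [s for s, is_decoy, q in q_values(hits) if not is_decoy and q <= max_fdr]


if __name__ == "__main__":
    # Hypothetical toy data: four target hits and one decoy hit.
    toy = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, False)]
    print(accepted_targets(toy, max_fdr=0.01))
    print(accepted_targets(toy, max_fdr=0.25))
```

Under the same assumptions, hit lists from several projects could be concatenated and passed to `q_values` once, so that the cutoff is derived from the pooled targets and decoys rather than from each project separately.
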
We the of problem by the of data sets to a database of all data together as For the raw data of the proteome experiment of 11 cell lines in the accompanying paper (18Geiger T. Wehner A. Schaab C. Cox J. Mann M. Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins.Mol. Cell. Proteomics. 2012; 10.1074/mcp.M111.014050Abstract Full Text Full Text PDF Scopus (579) Google Scholar) were sets, each consisting of or cell lines and biological These sets were by with a for protein and peptide identification of Each in protein identifications and these sets were to a single database, the of identifications by whereas the of by with the in the individual sets. The be of the the more proteomics data sets are to a database, the the inflation of false identifications. the data are available for or for each data be For these we to the by adjusting the cutoff to a more database-wide The is performed such that the between the of database and the of identifications a protein and peptide is to the MaxQB by a for each peptide and each protein This for a protein is the between the of and the of identifications with scores or to the of that protein the peptide is The identifications with the of are when the is data from the database than data from a single This is all projects in MaxQB are with the same and the same of identification of identified proteins raw are in Open in a can be in the of the identified peptides are between different proteomes but the cell line with the expression also the of identified a few peptides are identified in all and and these peptides are also the with the label-free intensity as by the These to between the probability of peptide identification and peptide For each protein at two we the Spearman rank correlation between the of label-free peptide and the of experiments in which the peptide was The of the correlation for each protein a of proteins with high correlation A of 64% of the proteins 
Spearman rank of more than the example of a protein with a high correlation the peptides detected many are also the most and A few proteins have or negative and were in For example, has a correlation of between peptide and peptide detection the peptides detected in the a high the peptide at was detected in of and and it was also the peptide detected for protein in these cell lines. We that peptide is a false identification in the cell which is by a high analysis that rank order of identified peptides for each protein are high to pinpoint false protein identifications, independently of peptide database scores. a of protein and peptide identifications from high quality data, such as in MaxQB, be used to protein identification in proteomics public databases such to the quality of data sets. peptide rank correlation of the data to data sets is may problems with the data. We have MaxQB, a resource for high resolution and quantitative MS-based proteomics data. MaxQB a of proteome which allows types of that are difficult to in many public the of MaxQB have been using deep proteome measurements of 11 different cell lines. 
These data more than half of the human proteome, and for these proteins can expression across cell lines as well as estimated expression levels within each of The expression data may be for example, to a cell line or that the protein of We to and more diverse data sets in the an example, the expression levels of an protein in 28 different mouse can be visualized in Although these may to of the proteome, a of proteins may be expressed in available sources, we that the large of proteins and peptides typically in proteomics experiments be in repositories, the peptide information can be mined for targeted proteomics in MaxQB has the of a of experiments that are controlled for false discovery increasingly proteome measurements be within We that the data in MaxQB be with these data to the that is difficult in data repositories that between data at different of Although MaxQB proteome data of human cell lines and a of mouse we that proteomes of cell types and be in the MaxQB can serve as a repository for more data, for example, proteome with or data of these data have in common that high resolution identifications and are with a of We to an submission of the experiments in MaxQB to PRIDE, that MaxQB data are also available in the databases that are part of
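
The per-protein check discussed earlier, a Spearman rank correlation between peptide intensity and how often each peptide was detected across proteomes, can be sketched in plain Python as follows. This is an illustrative sketch only: the input layout (protein mapped to a list of `(intensity, n_experiments_detected)` pairs), the function names, and the toy values are assumptions, and the source's exact procedure is not recoverable from the text.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def peptide_rank_correlations(
    peptides: Dict[str, List[Tuple[float, int]]]
) -> Dict[str, float]:
    """Correlate peptide intensity with detection count for each protein.

    Proteins with fewer than two peptides are skipped.
    """
    result = {}
    for protein, rows in peptides.items():
        if len(rows) < 2:
            continue
        intensities = [r[0] for r in rows]
        counts = [float(r[1]) for r in rows]
        result[protein] = spearman(intensities, counts)
    return result


if __name__ == "__main__":
    # Hypothetical data: (label-free intensity, number of proteomes with a detection)
    toy = {
        "PROT_A": [(1e8, 11), (5e7, 9), (1e6, 2)],  # intense peptides seen most often
        "PROT_B": [(1e8, 1), (1e6, 10)],            # inverted pattern, worth a closer look
        "PROT_C": [(1e7, 3)],                       # single peptide, skipped
    }
    for name, rho in peptide_rank_correlations(toy).items():
        print(name, round(rho, 2))
```

Proteins with a low or negative correlation would then be candidates for manual inspection, in line with the idea of flagging suspect identifications independently of database search scores.
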
