June 29, 2013 | Open Access

Tools (Viewer, Library and Validator) that Facilitate Use of the Peptide and Protein Identification Standard Format, Termed mzIdentML

Abstract

The Proteomics Standards Initiative has recently released the mzIdentML data standard for representing peptide and protein identification results, for example, created by a search engine. When a new standard format is produced, it is important that software tools are available that make it straightforward for laboratory scientists to use it routinely and for bioinformaticians to embed support in their own tools. Here we report the release of several open-source Java-based software packages based on mzIdentML: ProteoIDViewer, mzidLibrary, and mzidValidator. The ProteoIDViewer is a desktop application allowing users to visualize mzIdentML-formatted results originating from any appropriate identification software; it supports visualization of all the features of the mzIdentML format. The mzidLibrary is a software library containing routines for importing data from external search engines, post-processing identification data (such as false discovery rate calculations), combining results from multiple search engines, performing protein inference, setting identification thresholds, and exporting results from mzIdentML to plain text files. The mzidValidator is able to process files and report warnings or errors if files are not correctly formatted or contain some semantic error. We anticipate that these developments will simplify adoption of the new standard in proteomics laboratories and the integration of mzIdentML into other software tools. All three tools are freely available in the public domain.

Abbreviations used: API, application programming interface; CSV, comma-separated value; CV, controlled vocabulary; emPAI, exponentially modified Protein Abundance Index; FDR, false discovery rate; iPRG, Proteome Informatics Research Group; MIAPE, Minimum Information about a Proteomics Experiment; MS, mass spectrometry; PDH, protein detection hypothesis; PSI, Proteomics Standards Initiative; PSM, peptide spectrum match; RG, Research Group; XML, Extensible Markup Language.

The Proteomics Standards Initiative (PSI) recently released the mzIdentML standard data format (stable version 1.1) for reporting peptide and protein identifications in order to improve capabilities for data sharing and make it simpler for bioinformatics groups to focus development on a single, comprehensive file format (1; Jones A.R., Eisenacher M., Mayer G., Kohlbacher O., Siepen J., Hubbard S., Selley J., Searle B., Shofstahl J., Seymour S., Julian R., Binz P.-A., Deutsch E.W., Hermjakob H., Reisinger F., Griss J., Vizcaino J.A., Chambers M., Pizarro A., Creasy D. The mzIdentML data standard for mass spectrometry-based proteomics results. Mol. Cell. Proteomics. 2012; 11 (M111.014381)). The format is represented in XML and is formally defined by the combination of the XML Schema Definition and a separate mapping file describing where controlled vocabulary (CV) terms must be used within the format. A core part of the standard captures lists of peptide-spectrum matches (PSMs) with associated scores or measures, described by CV terms. Each PSM should be linked to the spectrum that was searched in a separate file, such as represented in the PSI's mzML standard (2; Martens L., Chambers M., Sturm M., Kessner D., Levander F., Shofstahl J., Tang W.H., Römpp A., Neumann S., Pizarro A.D., Montecchi-Palazzi L., Tasman N., Coleman M., Reisinger F., Souda P., Hermjakob H., Binz P.-A., Deutsch E.W. mzML—a community standard for mass spectrometry data. Mol. Cell. Proteomics. 2011; 10 (R110.000133)).
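
As an illustration of the PSM structure outlined above, the following minimal Java sketch lists PSMs and their CV-described scores from a small XML fragment. It uses only the DOM parser shipped with the JDK rather than any of the tools reported here. The fragment is a hand-written, simplified excerpt that is not schema-valid on its own; its element and attribute names are intended to follow the mzIdentML 1.1 schema and should be checked against it. The class name, identifiers, score values, and the CV accession shown are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class; not part of ProteoIDViewer, mzidLibrary, or mzidValidator.
final class PsmListingSketch {

    // Simplified, illustrative fragment (assumption): a real file wraps this list
    // in further elements and declares the mzIdentML namespace.
    static final String FRAGMENT =
          "<SpectrumIdentificationList id=\"SIL_1\">"
        + "<SpectrumIdentificationResult id=\"SIR_1\" spectrumID=\"index=12\" spectraData_ref=\"SD_1\">"
        + "<SpectrumIdentificationItem id=\"SII_1_1\" rank=\"1\" chargeState=\"2\""
        + " peptide_ref=\"PEP_1\" passThreshold=\"true\">"
        + "<cvParam cvRef=\"PSI-MS\" accession=\"MS:1001171\" name=\"Mascot:score\" value=\"45.2\"/>"
        + "</SpectrumIdentificationItem>"
        + "</SpectrumIdentificationResult>"
        + "<SpectrumIdentificationResult id=\"SIR_2\" spectrumID=\"index=40\" spectraData_ref=\"SD_1\">"
        + "<SpectrumIdentificationItem id=\"SII_2_1\" rank=\"1\" chargeState=\"3\""
        + " peptide_ref=\"PEP_2\" passThreshold=\"false\">"
        + "<cvParam cvRef=\"PSI-MS\" accession=\"MS:1001171\" name=\"Mascot:score\" value=\"12.7\"/>"
        + "</SpectrumIdentificationItem>"
        + "</SpectrumIdentificationResult>"
        + "</SpectrumIdentificationList>";

    // Returns one tab-separated row per PSM: spectrum ID, peptide reference, rank, then each score.
    static List<String> listPsms(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML fragment", e);
        }
        List<String> rows = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("SpectrumIdentificationItem");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            // The enclosing result element carries the link to the searched spectrum.
            Element result = (Element) item.getParentNode();
            StringBuilder row = new StringBuilder();
            row.append(result.getAttribute("spectrumID")).append('\t')
               .append(item.getAttribute("peptide_ref")).append('\t')
               .append("rank=").append(item.getAttribute("rank"));
            // Scores and other measures are attached to the PSM as CV terms.
            NodeList params = item.getElementsByTagName("cvParam");
            for (int j = 0; j < params.getLength(); j++) {
                Element param = (Element) params.item(j);
                row.append('\t').append(param.getAttribute("name"))
                   .append('=').append(param.getAttribute("value"));
            }
            rows.add(row.toString());
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : listPsms(FRAGMENT)) {
            System.out.println(row);
        }
    }
}
```

A DOM parse is used here only because it keeps the example short; for full-size result files a streaming or indexed reader would usually be preferable.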
The PSM captures the modifications identified, again using CV terms sourced from Unimod (3; Creasy D.M., Cottrell J.S. Unimod: protein modifications for mass spectrometry. Proteomics. 2004; 4: 1534-1536) or the PSI-MOD ontology (4; Montecchi-Palazzi L., Beavis R., Binz P.-A., Chalkley R.J., Cottrell J., Creasy D., Shofstahl J., Seymour S.L., Garavelli J.S. The PSI-MOD community standard for representation of protein modification data. Nat. Biotechnol. 2008; 26: 864-866). In an mzIdentML file, each PSM is linked to all protein sequences from which the peptide could have been derived (using the enzyme specified in the search). The mzIdentML standard also has a separate section capturing protein identification results in a two-level hierarchy. The top level comprises groups of proteins representing a putatively detected “isoform”, with each group containing a list of individual proteins (entries in the database searched to be specific) for which there is ambiguity regarding which of those entities was actually identified, typically because of the existence of shared peptides (i.e. the well-known protein inference problem (5; Nesvizhskii A.I., Aebersold R. Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics. 2005; 4: 1419-1440)). Note that in this context an “isoform” is simply a group of accessions from the source database and could be the result of database errors (duplications, sequencing errors) as well as related biological entities. The format also contains structures for describing the search parameters in a standard way, sourcing CV terms from the PSI-MS CV for enzyme descriptors, score thresholds, and so on (6; Mayer G., Montecchi-Palazzi L., Ovelleiro D., Jones A.R., Binz P.-A., Deutsch E.W., Chambers M., Kallhardt M., Levander F., Shofstahl J., Orchard S., Antonio Vizcaíno J., Hermjakob H., Stephan C., Meyer H.E., Eisenacher M. The HUPO proteomics standards initiative—mass spectrometry controlled vocabulary. Database (Oxford). 2013; 2013 (10.1093/database/bat009)).

The format contains some the of the source and tools are on users and there is support for mzIdentML from version Creasy D.M. Cottrell J.S. protein identification by using mass spectrometry PubMed Scopus Google a from the J. A. M. J. mass spectrometry data PubMed Scopus Google a for proteomic PubMed Scopus Google B. C. C. M. A. G. software for peptide sequencing by mass PubMed Scopus Google Chambers mass peptide identification by PubMed Scopus Google and the E.W. L. D. H. Tasman N. B. B. A.I. Aebersold R. A of the PubMed Scopus Google Scholar) D. Chambers M. R. D. P. source software for proteomics tools 2008; PubMed Scopus Google Scholar). In there is the for the and of mzIdentML in M. A. C. A. R. N. O. A. Kohlbacher O. open-source software for mass 2008; PubMed Scopus Google Scholar) and file format for within in the support other groups have a application programming interface for and mzIdentML F. R. F. D. Hermjakob H. Antonio Vizcaíno J. Jones A.R. a interface to the mzIdentML standard for peptide and protein identification 2012; PubMed Scopus Google Scholar).
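
The two-level protein hierarchy described in the introduction (groups at the top level, each holding the individual database entries that cannot be told apart) can be pictured with a short Java sketch. It places accessions into the same group only when they are supported by exactly the same set of peptides. This is a deliberately naive rule chosen for illustration and is not the protein inference routine of the mzidLibrary; the class name, method name, and example accessions are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of ambiguity groups; not the mzidLibrary algorithm.
final class AmbiguityGroupSketch {

    // Groups protein accessions whose supporting peptide sets are identical.
    // Accessions are visited in sorted order so the output is deterministic.
    static List<List<String>> groupByIdenticalPeptides(Map<String, Set<String>> peptidesByProtein) {
        Map<Set<String>, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : new TreeMap<>(peptidesByProtein).entrySet()) {
            // Copy the peptide set so the map key cannot be mutated by the caller.
            Set<String> key = new TreeSet<>(entry.getValue());
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(entry.getKey());
        }
        return new ArrayList<>(groups.values());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> evidence = new TreeMap<>();
        evidence.put("PROT_A", new TreeSet<>(List.of("PEPTIDEK", "SAMPLER")));
        evidence.put("PROT_B", new TreeSet<>(List.of("PEPTIDEK", "SAMPLER")));
        evidence.put("PROT_C", new TreeSet<>(List.of("UNIQUEK")));
        // PROT_A and PROT_B share all evidence, so they fall into one group.
        System.out.println(groupByIdenticalPeptides(evidence));
    }
}
```

Real inference schemes also have to deal with proteins whose peptides are a subset of another protein's peptides, which this sketch leaves as separate groups.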
The of mzIdentML is on the We are a of three open-source tools for in we have a interface for the the ProteoIDViewer, on users and we have created the mzidLibrary, a of routines that be into a for or post-processing tools for the of the search J.A. L. M. D.M. mass spectrometry search 2004; PubMed Scopus Google Scholar) or comma-separated value and R. Beavis proteins with mass 2004; PubMed Scopus Google Scholar) XML into In to a the mzidLibrary has a interface allowing the routines to be by users in a straightforward such as the format are also in the we have the a interface for that files are correctly formatted and use appropriate CV terms is as a it also is users with files from a has also been by the this is to files because of the for to the All the tools we have are released with a and the is in Google allowing and to be and We the of the by using available data created by the Informatics Research of the of All three tools mzidLibrary, and are on top of F. R. F. D. Hermjakob H. Antonio Vizcaíno J. Jones A.R. a interface to the mzIdentML standard for peptide and protein identification 2012; PubMed Scopus Google which file and files into and The an the in which the of in the file are allowing to the file the to files into The ProteoIDViewer was using in was to process files using and or data into with appropriate created data represented on on top of external for database results using and an spectrum based on an open-source library by group H. M. N. A. L. an open-source library for 2011; PubMed Scopus Google Scholar). The mzidLibrary contains a of routines for file based on or such as and the associated for combining multiple search results to A.R. Siepen J.A. Hubbard in by of false discovery for multiple search PubMed Scopus Google the exponentially modified Protein Abundance Index for based on J. M. modified protein for of protein in proteomics by the of peptides Cell. Proteomics. 
2005; 4: Full Text Full Text PDF PubMed Scopus Google and file format on top of open-source by the H. S. A. L. an open-source library to and data from search PubMed Scopus Google Scholar) and the M. H. L. A. an open-source library to and search PubMed Scopus Google which the XML format and XML was created to the the and to within In a new was created in the mzidLibrary for because there is a associated with search results in the and the results files are and to The created an file containing the search for a mzIdentML file format of the parameters file is described in the mzidLibrary available from the as these are not in the results The mzidLibrary also contains routines for setting identification in the file and data about protein sequences from the file that was the mzidLibrary contains new software for performing protein inference, for which the is described for the The the that all the parameters be PSM score of the CV to be used for a protein multiple search results have been the results must be in a list in mzIdentML with a by CV a should be on the PSM which will be to a score (i.e. to an or false discovery rate score into a with or all in the file should be in mzIdentML protein database are by all the if that have a mapping to the protein is created for each by the PSM scores to the is for all and is used for to proteins within a A peptide is that be to protein and is to the protein with the the in J. M. peptide identification mass and protein Biotechnol. 
2008; 26: PubMed Scopus Google The order of is by the of protein protein that has any or has been peptides is as a putatively detected and a new protein group in peptides are defined as those that be to a protein database allowing for ambiguity that typically be by will be in or proteins with to the protein are in the protein that is by or proteins is in the group with the protein with which it the In the of a the protein is into the group that has the to the Here we the use of the tools within the mzidLibrary, in the using the data created by the and to to well groups could protein identification on a standard data peptides derived from the of proteins from with was not the focus of the was and peptides into The on a to mass available in The files by and using a of search and software for performing protein was a to the results “isoform” and a group of putatively detected sharing some peptides in with for individual is the core of biological containing a group of database for example, as a result of or group is in the source database that was was to results into the Research the of within each and all the each contain or each contains and by and in these by group or on their the available from the of In the by the the the of identifications from and is of database accessions within these is not are for any of the are as false We searched the data format using version version J.A. L. M. D.M. mass spectrometry search 2004; PubMed Scopus Google and D. 
Beavis A for the of mass spectrometry-based protein identifications using PubMed Scopus Google Scholar) with the parameters the on and on and of protein on and of the database by from the of in the mzidLibrary, which the using the described the use of the mzidLibrary and the we a of routines using a file to the described files searched in and from in mzIdentML format (using an of the as mzIdentML version is for from and and to mzIdentML using the and the three results mzIdentML files into a mzIdentML file and using the The mzidLibrary contains a for protein sequences and protein from a file and these into protein database in in search not this level of We created an file containing the for database and this for these into the file the for each protein to simplify we used the to the in the mzIdentML file to for any in a in with a of this A.R. Siepen J.A. Hubbard in by of false discovery for multiple search PubMed Scopus Google which identifications from search with the The was with the the used for is the of the which is used for and The mzIdentML file from the was to proteins protein using and with the results in of results by and proteins with protein by the use of a to the of accessions to each We also the on the results using the to of as used for multiple search PSI-MS CV to for and and post-processing as described The mzIdentML also in the ProteoIDViewer and with the mzidValidator to that and The mzidValidator was as a an of the L. S. Reisinger F. B. Jones A.R. L. Hermjakob H. The semantic a to of proteomics PubMed Scopus Google Scholar). 
the Information about a Proteomics Informatics a new mapping file was created containing and CV terms to be in of the file in the standard mzIdentML mapping semantic and mapping files are available in the for users to some in their mzIdentML files by The ProteoIDViewer was for use by laboratory and bioinformatics support is not in order for a to The be simply and with The ProteoIDViewer and of and the search and supports the use of multiple search The several for data all protein groups identified, as well as individual proteins within those groups and to the peptide for each identifications for each peptide all peptides identified, as well as the protein for each the of and proteins identified, such as if a database search has been and all protein within the the search parameters for the peptide and protein identification The ProteoIDViewer a mzIdentML file containing the identification results and an file containing the that searched and format are The is also able to other identification file (using and XML files. 
files are to mzIdentML format and are in the The is the that contains all protein groups and the individual proteins within those In it any about the protein sequences from the source such as the protein and protein these are in mzIdentML from the source database these are all The Protein a containing all the that used to the of a The is the which contains the list of all that searched and all When the on a a separate the spectrum a of the by the the by the search will be on top of the spectrum these are in the mzIdentML We that this will be for to reporting such as those of and which that protein identifications based on a PSM be by a of the The is which is to the into the data a list of all peptides identified, by The contains all database protein sequences that are in the mzIdentML contains the the the and the protein these are in the The of the data within the file, such as the of for which a peptide identification is the of the and the defined in the file, the of protein and so The also to routines in the mzidLibrary for the of false and and the if a database search has been identifications are using the in the file or using a in the in the mzIdentML software not correctly The three score false and The is which the parameters used within the peptide identification and protein identification The ProteoIDViewer has been with file of to and it mzIdentML and files is the for and mzIdentML The mzidLibrary has a of that be used to mzIdentML files so that mzIdentML support into their own We also have created a so the software be used by scientists the library contains a of routines that be in as in The library also contains a of or that not part of the public We also that other use the software to new routines to the The in the release is described routines within the mzidLibrary release All routines must the and files and have an for the using the routines have an for or about the protein sequences and from a file and into an mzIdentML of to the from the and A.R. 
Siepen J.A. Hubbard in by of false discovery for multiple search PubMed Scopus Google Scholar) a database search has been for the is not in the of to of CV for score in file to use for are to and from or three search to a A.R. Siepen J.A. Hubbard in by of false discovery for multiple search PubMed Scopus Google for of to of files and for the of search the on or to the it that the is for or to be used for setting the of CV for score in the file to use for are to format to for results in format to for of CV for score in file to use for the file is not from of the file containing search XML format to for from file or from lists for capturing to from file file list file from mzIdentML format to of to PSM protein proteins with of all protein or protein based on of to the from the protein inference from the with of CV for score in file to use for are to the score should be in a new the list of and the protein list within mzIdentML have the to identifications to have a and those that have not the The of or proteins the the of and the for of the files to of results using tools. protein inference in protein results, the PSM not typically a PSM The the appropriate PSM or protein setting the based on the parameters the also associated that the in order to the file The setting of a on the PSM list should be used with because if a protein list is also in the file, to the PSM list will not be in the proteins in the protein list (i.e. 
not a new protein inference of the of mzIdentML is that bioinformatics should be able to focus on new development file format and are search that this support of this in we have software for the of these search into packages lists of not protein inference, so the also PSM lists and not protein lists in as described be to protein the and the format an created file containing search In order for the from the format to be the search must have been in with the and search parameters in search the search parameters into the file that are for to The the XML of to In order for this to be the also must have the parameters performing the again so that the be correctly in the mzIdentML In order to support users to their results into or we a for the XML into a some bioinformatics to results from mzIdentML to the XML not in or The software on the a list of a list of protein groups all a list of all proteins and a list of protein groups the group protein The for based on a to for the of peptides in a protein J.A. L. M. D.M. mass spectrometry search 2004; PubMed Scopus Google Scholar). value is as where is the of for a protein (i.e. 
the of peptides identified, allowing for and and is the of peptide sequences in a that could be by the in terms of the and are the the source as as the is for a protein from all of that have a modified and peptide is and any peptide that be to multiple proteins is for the is by the and from in the and for each protein in the an in is and the of peptides is with the and this in the mzidLibrary, the results for the data by are in from as a (using an version of the to and as by and by The the for example, the value for the top is in the and from We that results to the example, and for as the for The a based on protein mass and to in for an for peptides as there are peptides in in this of which have the to be this this that is the value by In some the and also in the because we those with the for In this is based on the as the be using the The protein inference problem in proteomics because the from a peptide identification to the source protein is in peptides could be to and software is to the mapping and a of proteins to groups where ambiguity or where a protein contains the peptide identifications as protein a A of software is available for performing protein inference, in software or source W.H. and tools for and protein identifications from shotgun PubMed Scopus Google A.I. A. Aebersold R. A for proteins by mass PubMed Scopus Google B. Chambers and PubMed Scopus Google peptide data for PubMed Scopus Google Creasy D.M. Cottrell J.S. of shotgun proteomics data.Mol. Cell. Proteomics. 2011; 10 Full Text Full Text PDF PubMed Scopus Google Scholar). 
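
For readers unfamiliar with the emPAI value discussed above, it is commonly defined as emPAI = 10^(N_observed / N_observable) - 1, where N_observed is the number of peptides observed for a protein and N_observable is the number of peptides that could in principle be observed for it. The following Java sketch shows only this arithmetic. How the two counts are obtained (for example, how modified forms or shared peptides are treated) is not covered, and the class and method names are assumptions rather than the mzidLibrary interface.

```java
// Hypothetical helper showing the emPAI arithmetic only; not the mzidLibrary routine.
final class EmpaiSketch {

    // emPAI = 10^(observed / observable) - 1
    static double empai(int observedPeptides, int observablePeptides) {
        if (observablePeptides <= 0) {
            throw new IllegalArgumentException("observablePeptides must be positive");
        }
        if (observedPeptides < 0) {
            throw new IllegalArgumentException("observedPeptides must not be negative");
        }
        // Protein Abundance Index: fraction of observable peptides that were seen.
        double pai = (double) observedPeptides / observablePeptides;
        return Math.pow(10.0, pai) - 1.0;
    }

    public static void main(String[] args) {
        // Example counts are made up for illustration.
        System.out.printf("emPAI for 3 of 12 observable peptides: %.3f%n", empai(3, 12));
    }
}
```

A protein with no observed peptides therefore has an emPAI of zero, and the value grows exponentially as a larger share of the observable peptides is detected.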
there is or support in these tools for a we have a protein inference within allowing lists of in mzIdentML to be into protein for example, a search with or which not support protein In order to the we the results from and and the for and false the parameters in the protein identifications and false identifications We have these results those by the group and in the The results in that the is performing protein inference all are false should also be that this is an as it is not to the of the and the protein inference In the the results with and which be to with data that on an the of the inference we the the from in and false The results with the results from a recently protein from that used the data and search parameters Creasy D.M. Cottrell J.S. of shotgun proteomics data.Mol. Cell. Proteomics. 2011; 10 Full Text Full Text PDF PubMed Scopus Google for which the and false for in The of results, all and is available in the mzIdentML The has the of able to from any search result that be to allowing it to be used with software or search development to results for that it is able to protein inference false or reporting protein with In order to the of the in mzIdentML a new has been as an of the semantic L. S. Reisinger F. B. Jones A.R. L. Hermjakob H. The semantic a to of proteomics PubMed Scopus Google Scholar). The was as a not for the XML also to regarding the use of CV example, it could be used to that the terms in the and that are used in the of a In the the of of that in the file in a Here we report the development of the of the for the mzidValidator The mzidValidator the to an mzIdentML file and standard or The semantic files are the and contain the CV terms has been a of are to the about any errors detected in the The is used to the of the files with the Binz Julian Jones A.R. R. Aebersold R. Deutsch E.W. A. M. M. L. P. Seymour S.L. Souda P. A. J. Hermjakob H. The about a proteomics Biotechnol. PubMed Scopus Google Scholar) for the Informatics R. 
Beavis Creasy D. D.M. Julian Seymour S.L. for reporting the use of mass spectrometry in Biotechnol. 2008; 26: PubMed Scopus (60) Google Scholar). mzIdentML Informatics new CV terms to the PSI-MS new defined in a new mapping file, and some We have a of mzIdentML that semantic or on the all is mzIdentML be in The version of the mzidValidator be from the group standards by are represented in XML format. XML is by because it is an standard there are tools available for XML and the of the format be formally defined with an XML Schema Definition XML are not straightforward for to with and be by users tools. We have the ProteoIDViewer to users and with files in mzIdentML format. The has the of able to with data from any search of exporting data into the mzIdentML format. there is a for or format from any search the ProteoIDViewer has the to as a for peptide and protein identification The data in a proteomics a search has the core of We have routines several post-processing to the mzidLibrary, so that search that not these capabilities be used to lists of proteins for We will to new routines to the and we other to as in order to with an vocabulary of the standards use an mapping to that terms are within files. a that files are is not simply a of a standard XML Schema available in a of tools. We have the mzidValidator to as are mzIdentML to that their files are and to users that files are if files that not in tools. We anticipate that tools will make it simpler for bioinformatics groups to mzIdentML support into their tools and improve adoption of the new The tools should also proteomics in from a of software peptide and protein and of which is to We the for the of data for these are for We also and group for tools in the source which we have in tools. with files
