- Jul 2018
europepmc.org
On 2015 Feb 09, Peter Csermely commented:
Authors of the paper are happy to highlight the available additional information regarding the localization tree, manual mapping and revision, as well as the updates of the ComPPI database http://comppi.linkgroup.hu.
Localization tree: ComPPI uses the GeneOntology (GO) cellular component terms to standardize the differing subcellular localization nomenclature of the source databases. The localization tree was needed for the unambiguous classification of minor localizations into the major subcellular compartments. Mappings of the original subcellular localization names to the relevant GO cellular component terms for all input sources are available in our GitHub repository https://github.com/erenon/ComPPI/wiki/LOC-databases, while the whole localization tree and the mapping of minor localizations to major cellular components are accessible via our Help page http://comppi.linkgroup.hu/help/subcell_locs#structure, and were included in the Supplement of the paper http://nar.oxfordjournals.org/content/43/D1/D485/suppl/DC1 as Figure S2 and Table S4.
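To make the idea of a localization tree concrete, here is a minimal sketch, assuming a hypothetical child-to-parent table. The `PARENT` entries and the function name `major_compartment` are illustrative only and are not taken from the actual ComPPI tree. Because every term has exactly one parent, walking upward can only end in one major compartment.

```python
# The six major compartments named in the discussion of the paper.
MAJOR = {"Cytosol", "Nucleus", "Mitochondrion",
         "Secretory-Pathway", "Membrane", "Extracellular"}

# Illustrative child -> parent entries (hypothetical, not the ComPPI mapping).
PARENT = {
    "nucleolus": "nuclear lumen",
    "nuclear lumen": "Nucleus",
    "mitochondrial matrix": "Mitochondrion",
}

def major_compartment(term):
    """Walk up the single-parent tree until a major compartment is reached."""
    start, seen = term, set()
    while term not in MAJOR:
        if term in seen or term not in PARENT:
            raise KeyError(f"no path to a major compartment for {start!r}")
        seen.add(term)
        term = PARENT[term]
    return term
```

With a single parent per term the lookup is unambiguous by construction, which appears to be the property the tree was built to guarantee.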
Manual mapping of proteins: As described in our published paper (Figure 1), we manually built a mapping table to map 30% of the proteins in ComPPI when 1.) the original gene ID (e.g. HGNC gene symbol, EnsemblGeneId etc.) was translated to more than one UniProt accession (~20% of proteins total; e.g. the "HTT" gene symbol could be mapped to both the P31645 and P42858 UniProt accessions, and our manual choice of P42858 was based on the name description "Huntingtin" found in the OrganelleDB source dataset); or 2.) the original gene ID had no automatic translation to a UniProt ID (~10% of proteins total; e.g. the "p53AIP1" gene name in the OrganelleDB source dataset was manually mapped to the Q9HCN2 UniProt accession, which has the official gene ID "TP53AIP1"). We note that the process was not entirely manual, as we used online ID cross-reference tools (such as PICR http://www.ebi.ac.uk/Tools/picr/) and the UniProt ID mapping http://www.uniprot.org/uploadlists/ as gold-standard validation of gene ID to UniProt mappings.
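The two cases that required manual work can be separated mechanically before any curation begins. The following is a minimal sketch, assuming the automatic translation step yields a dictionary from each source ID to its candidate UniProt accessions. The function name `classify_mappings`, the input shape, and the `GENE_X`/`P00000` entry are hypothetical placeholders; only the HTT and p53AIP1 entries follow the examples given above.

```python
def classify_mappings(candidates):
    """Split source IDs by how many UniProt accessions they translate to."""
    unique, ambiguous, unmapped = {}, {}, []
    for source_id, accessions in candidates.items():
        accs = sorted(set(accessions))
        if len(accs) == 1:
            unique[source_id] = accs[0]      # safe to map automatically
        elif accs:
            ambiguous[source_id] = accs      # case 1: needs a manual choice
        else:
            unmapped.append(source_id)       # case 2: needs a manual lookup
    return unique, ambiguous, unmapped

candidates = {
    "HTT": ["P31645", "P42858"],   # ambiguous example from the comment
    "p53AIP1": [],                 # no automatic translation
    "GENE_X": ["P00000"],          # hypothetical placeholder entry
}
unique, ambiguous, unmapped = classify_mappings(candidates)
```

Only the `ambiguous` and `unmapped` groups would then go to a curator, and reporting their sizes gives the kind of percentages quoted above.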
Manual revision: Besides the construction of the localization tree and the ID mapping, our manual revision process included 1.) manual testing of the interfaces connecting the source databases to the ComPPI dataset during the database building process, including checks for false entries (e.g. non-existing UniProt accessions) and data content errors (e.g. minor subcellular localizations without GO terms) for 200 randomly chosen proteins, and 2.) testing of the integrated dataset by six experts for false entries, data inconsistency and protein name mapping errors. Additionally, two of the six experts tested 200 randomly chosen proteins for exact matches between the entries in the source databases and those in the ComPPI dataset (this manual revision process is described in Figure 1 of the paper and in Supplementary Table S3). In Supplementary Table S3 of the publication we showed how many proteins, subcellular localizations and interactions from the source databases were included in the ComPPI dataset. The amount of data 'missing' in ComPPI compared to the sources is due to mapping differences between namespaces (e.g. the integrated CCSB dataset contained 3,881 interactions, while only 3,733 interactions were included in the ComPPI dataset due to the removal of non-protein coding genes), data inconsistencies, or protein name mapping errors.
Database updates: The manual steps, such as the inclusion of new GO terms in the localization tree or the mapping of missing IDs, can and will be included in all future ComPPI updates. In these updates we will document our manual curation process in more detail. We plan to release ComPPI 2.0 before May 2016. The new version will include updates of the source databases (such as the 2014 update of the CCSB dataset), a revision of our mapping system based on the latest UniProt release, the inclusion of new input sources, such as LocDB, as well as an upgraded network visualization and extended export options (such as SIF or PSI-MI).
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
On 2015 Jan 27, Robert M Flight commented:
Commended because the issues they raise are valid. It makes sense that protein-protein interactions (PPIs) should be localization specific, and that many false PPIs may be removed by considering localization information.
However, the solution implemented by ComPPI is not an ideal solution. This is largely due to the "manual" curation steps required in its construction. These manual steps are described below, followed by a comment on why this is such a problem for this type of resource.
Localization tree: the authors state that 1600 of the ~3600 Gene Ontology (GO) cellular component (CC) terms were used to "manually" construct a localization tree whereby each GO term has a single path to one of the six defined compartments of Cytosol, Nucleus, Mitochondrion, Secretory-Pathway, Membrane, and Extracellular. There is no justification for the requirement of a tree with single paths to root compartments beyond the multi-parent nature of the directed-acyclic-graph of the normal GO. The description in the paper implies that the generation of the tree from the GO terms was done fully by hand. No description is provided of how terms are assigned to compartments, or of whether any attempts at automated assignment were made.
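The difference between a multi-parent DAG and a single-path tree can be shown with a short sketch. It assumes a hypothetical `PARENTS` table in which a term may list several parents; the edges are illustrative and are not copied from the GO or from ComPPI. A term that reaches more than one root is exactly the case in which some rule, manual or automated, has to pick a single compartment.

```python
ROOTS = {"Cytosol", "Nucleus", "Mitochondrion",
         "Secretory-Pathway", "Membrane", "Extracellular"}

# Illustrative multi-parent edges (hypothetical, not actual GO relations).
PARENTS = {
    "mitochondrial outer membrane": ["Mitochondrion", "Membrane"],
    "mitochondrial matrix": ["Mitochondrion"],
}

def reachable_roots(term):
    """Collect every root compartment reachable by following parent edges."""
    stack, seen, roots = [term], set(), set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in ROOTS:
            roots.add(node)
        else:
            stack.extend(PARENTS.get(node, []))
    return roots
```

Terms for which `reachable_roots` returns exactly one compartment could in principle be assigned automatically, leaving only the multi-root terms for manual decisions.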
Manual mapping of proteins: The authors state that 30% of protein IDs were un-mappable to UniProt IDs using databases, and had to be mapped manually. Although protein-protein ID mapping across databases is known to be a difficult problem (I will point to one possible solution: http://bioinformatics.louisville.edu/abid/), this seems like a rather large percentage to map by hand.
Manual revision of database: 200 entries were checked by six experts, and based on their findings, revisions to the database were made. No description of the type of checks is provided, nor are numbers given for the rates of true and false positives and negatives, or for what the corrective efforts involved.
Each of these manual steps is a possible point of introducing bias that is hard to quantify and replicate, and in fact makes it highly unlikely that the database will be updated at regular intervals in the future or expanded to other model organisms beyond yeast, C. elegans, Drosophila and human. This severely limits the reliability and future utility of the database in my opinion.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.