28 Matching Annotations
  1. Mar 2019
    1. Code and code comment

      Although I agree these are going to be the two most commonly used elements, how necessary is it that a data scientist knows how to "code"? Could one be a data scientist using Excel, for example?

  2. Nov 2018
    1. but also unfair criticism that was put forth to support a particular view.

      This is just it. A critique of our work was published, and we were neither notified nor given the chance to respond. And the authors lied about whether we had made our data available.

    2. Our strong vision to provide a premier preprint service tailored to chemists has resulted in this already robust support. We chose Figshare as its service provider to deliver a modern interface and ability to both host and interactively display data natively within the browser. Our authors and readers have made good use of these features by uploading crystal structures, computational files, videos, and more that can be processed and manipulated without the need for specialized software. ChemRxiv accepts all data types from authors—removing the limitations imposed by PDF and Word— providing a richer, more valuable reading experience for users. Since launch, we have added a number of new features, including a “Follow” feature, which allows readers to create notifications and RSS feeds based on precise search criteria, and an interactive citation-formatting tool. Our automated scans for plagiarism, viruses and malware, and improvements to the curation tools allow triage before posting to be quick, in fewer than two business days, and often in less than one day! Several new features will be available with the next release, including an interactive search widget to the “Funding” field. All of this, plus positive user feedback and the establishment of our global three-society governance, means that we are moving ChemRxiv from the beta stage to a full-service resource

      This is pretty neat!

  3. Oct 2018
    1. conservative threshold

      The e-value is missing here. I know it is given in the Results, but I think it should also be included in the Methods.

    2. range of different peptide

      Should the range be stated here? It is in the Results, but it seems relevant to the Methods.

    3. later verified

      missing "be", should read "later be verified"

    4. difficult characterize

      missing "to", should be "difficult to characterize"

    5. controlling

      The use of "controlling" here seems redundant.

    6. Ab

      This did not render correctly from whatever source was used to generate the PDF.

    7. RAA1(Hydrophobicity)

      In the paper, the acronym RAA is used for reduced amino acid alphabet, while in the software [RED is used](https://github.com/biodataganache/SIEVE-Ub/blob/master/KmerFeatures.py#L173). I can't see a good reason for having two different naming systems when both are three-letter acronyms. In addition, the numbering here is 1-5, whereas the software uses 0-4. Given the tight link between this paper and the software, one naming and numbering system should be used in both.
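
      Even a small shared lookup would remove the ambiguity. Here is a sketch in R, where the pairing of RAA1-RAA5 with RED 0-4 is purely my assumption from reading the two schemes in order, not something stated by the authors:

      ```r
      ## Hypothetical correspondence between the paper's RAA numbering (1-5) and
      ## the software's RED numbering (0-4); the order-based pairing is my guess.
      raa_to_red <- c(RAA1 = "RED0", RAA2 = "RED1", RAA3 = "RED2",
                      RAA4 = "RED3", RAA5 = "RED4")
      raa_to_red["RAA1"]  # presumably the hydrophobicity alphabet in KmerFeatures.py
      ```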

    8. FW places all the samples belonging to a particular cluster in a single test set, while the classifier is trained using the remaining data.

      This is talking about a "cluster of related proteins", where the related proteins were decided by the BLASTP similarity cutoff. I think changing it to read "particular cluster of related proteins" would help understanding here a lot. I didn't get it until I looked at the code, loaded the data, and ran it.
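
      To make that reading concrete, here is a minimal sketch of the family-wise scheme as I understand it, holding out each BLASTP-derived cluster of related proteins in turn. The e1071 call, the 0/1 labels, and the variable names are my assumptions, not the authors' code:

      ```r
      ## Family-wise (FW) cross-validation sketch: every protein in one cluster of
      ## related proteins goes into the test set together; the SVM is trained on
      ## the rest. Assumes a numeric feature matrix, 0/1 labels, and a cluster id
      ## per row.
      library(e1071)

      fw_cv <- function(features, labels, clusters) {
        scores <- numeric(length(labels))
        for (cl in unique(clusters)) {
          test  <- clusters == cl   # all members of one related-protein cluster
          model <- svm(features[!test, , drop = FALSE], factor(labels[!test]),
                       kernel = "linear", probability = TRUE)
          pred  <- predict(model, features[test, , drop = FALSE], probability = TRUE)
          scores[test] <- attr(pred, "probabilities")[, "1"]
        }
        scores   # out-of-cluster scores, ready for pROC::roc(labels, scores)
      }
      ```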

    9. WP_012732629.1 | 0.50 | 0.82 | Corynebacterium kroppenstedtii | 360 | hypothetical protein

      As noted here, I can't recreate this result using the code and the model provided.

      There are two possible issues with my trying to recreate this result:

      1. I used the wrong genome / proteome. This is possible because the full list of genomes downloaded from PATRIC is not provided.
      2. I used the wrong model. The model discussed in the publication has 100 features, while the one provided in the data has 6493. As I indicated elsewhere, trying to do the RFE procedure I get something that bears no resemblance to Figure 2, so I used the model with 6493 features for my test.
    10. PATRIC database

      Should the complete list of proteins and their SIEVE and SIEVEUb scores be provided as supplemental data?

    11. We found that top-scoring peptides from our model matched the second zinc finger sequences for several RING/U-box E3 ligases including the LubX protein and the herpesvirus ICP0 protein

      The data to support that top-scoring peptides match second zinc-finger sequences isn't provided, unless it would be part of the data for the retained features and their locations in the peptides (see this other comment). Having the locations of the zinc fingers in conjunction with the feature locations would really help support this result.

    12. locations

      I could not find the list of retained features, along with their locations, anywhere. If it is in the GitHub repo (or another data repo), it would be very helpful to identify the full path to the file.

    13. 100

      I tried to recreate the RFE procedure and Figure 2, as noted here.

      Running the code listed there, I got a figure that bears no resemblance to Figure 2.
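
      For reference, this is roughly how I interpreted the halving procedure described in the Methods when trying to reproduce it (a sketch using a linear-kernel e1071 SVM; the authors' actual scripts may differ, and the feature matrix is assumed to have named columns):

      ```r
      ## Recursive feature elimination by halving: train on all features, rank
      ## them by the magnitude of the linear SVM weight, drop the lower half, and
      ## repeat until one feature remains, recording the retained set each round.
      library(e1071)

      rfe_halving <- function(x, y) {
        y      <- factor(y)
        keep   <- colnames(x)     # assumes one named column per k-mer feature
        rounds <- list()
        repeat {
          fit <- svm(x[, keep, drop = FALSE], y, kernel = "linear")
          w   <- drop(t(fit$coefs) %*% fit$SV)   # one weight per retained feature
          rounds[[length(rounds) + 1]] <- keep
          if (length(keep) == 1) break
          keep <- names(sort(abs(w), decreasing = TRUE))[seq_len(floor(length(keep) / 2))]
        }
        rounds   # performance (e.g. AUC) can then be computed per round, as in Figure 2
      }
      ```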

    14. different peptide lengths and peptides

      Should the actual range of k-mer lengths used be indicated here? The code and plots show 3-20.
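
      For context, this is the kind of enumeration I believe the Methods is describing, over the 3-20 range seen in the code (a toy sketch; the real extraction lives in KmerFeatures.py and also handles the reduced alphabets):

      ```r
      ## Enumerate every k-mer of length 3 through 20 from a protein sequence.
      kmers_in_range <- function(seq, kmin = 3, kmax = 20) {
        out <- character(0)
        for (k in kmin:kmax) {
          if (nchar(seq) < k) next
          starts <- seq_len(nchar(seq) - k + 1)
          out <- c(out, substring(seq, starts, starts + k - 1))
        }
        out
      }

      length(kmers_in_range("MKVLLITGAGRGIGRAIALAMAKEG"))  # k-mer count for a toy sequence
      ```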

    15. 171 genomes that are listed as human pathogens and are representative reference genomes from PATRIC [45]. This set comprises 480,562 protein sequences excluding all of the proteins used in the training set above.

      Is there a list of which genomes were downloaded? PATRIC has different versions for many organisms, so having the list of genomes seems pretty important.

      I'm also curious just how big a file with all 480K protein sequences would be, as a way to provide the set of data that the manuscript actually used.

    16. We identified a set of 168 confirmed bacterial or viral E3 ubiquitin ligase effectors from the UniProt database [30, 31]. Negative examples were 235 other bacterial effectors identified from literature [8, 20, 24, 27, 30-44]. We include details on the dataset as Supplemental Data

      What exactly were the search criteria used to identify these from UniProt and the literature? The results seem to imply Pfam models, but it is not explicitly stated here or anywhere else that I can see.

      Also, the file that contains the list is not explicitly identified here. I think it is data/FamiliesConservative.txt; is that correct?

    17. To identify a minimal set of features that are important for classification of E3 ubiquitin ligases from other effectors we used recursive feature elimination, a standard machine learning approach [8]. Briefly, a model is trained on all features, then weights for each feature are used to discard 50% of the features with the lowest impact on model performance. The remaining features are then used in another model training round in which this process is repeated until all the features have been eliminated. The training performance results from the RFE on the RAA1-K14 model are shown in Figure 2. We chose to keep 100 features in our final analysis given that this provided good training performance (AUC >0.9), but retained a small portion of the initial features (3%). These features are provided as Supplemental Data along with their locations in each of the positive and negative examples in our analysis set.

      If one loads the provided file in data/SIEVEUbModel.Rdata, it seems to have 6394 features in the model, not 100.
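
      This is how I arrived at that count (a sketch; the name of the object inside the .Rdata file, and the assumption that it is an e1071 svm object whose support-vector matrix has one column per feature, are mine):

      ```r
      ## Load the distributed model and count the features it was trained on.
      loaded <- load("data/SIEVEUbModel.Rdata")  # returns the name(s) of the loaded object(s)
      model  <- get(loaded[1])
      ncol(model$SV)            # number of features in the fitted SVM
      head(colnames(model$SV))  # first few feature names
      ```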

    18. https://github.com/biodataganache/SIEVE-Ub.

      It is great that the code is provided, but this isn't really a proper data repository, as it could be deleted at the whim of the owner.

      Assuming the code is deposited in a proper data repository, a reference to that repository should also be provided.

    19. Python script

      Which version of Python will these scripts work with, and which version of Python was used for this work?

      The scripts depend on Biopython; it should be cited, and the version used for this work should be indicated.

    20. e1071 R library in our implementation. The area under the curve (AUC) and receiver-operator characteristic curve (ROC) calculation was performed using the R library pROC.

      Which versions of R, e1071, and pROC were used? Full references to the software packages should also be provided.
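
      Capturing these takes only a few standard base-R calls; something like the following in the analysis scripts (nothing specific to the authors' code) would settle it:

      ```r
      ## Record the computational environment and generate formal package references.
      R.version.string        # exact R release used
      packageVersion("e1071")
      packageVersion("pROC")
      citation("e1071")       # full citation text for the manuscript
      citation("pROC")
      ```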

  4. Sep 2018
    1. mean of the log-counts is not generally the same as the log-mean count [1]

      Is this something different from the fact that taking the mean one way gives the geometric mean? I will have to look at the reference.
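
      A quick numeric check with made-up counts of why the two quantities differ; the mean of the logs is the log of the geometric mean, which is not the log of the arithmetic mean:

      ```r
      x <- c(1, 10, 100, 1000)   # toy counts
      mean(log(x))               # ~3.45: mean of the log-counts (log of the geometric mean)
      log(mean(x))               # ~5.63: log of the mean count -- not the same
      exp(mean(log(x)))          # ~31.6: the geometric mean itself
      ```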

    2. suboptimal variance stabilization and arbitrariness in the choice of pseudo-count.

      Would like to see a reference for this statement.

    2. non-zero differences in expression and artificial population structure in simulations.

      huh, didn't know that.

  5. Sep 2017
    1. A survey shows that metal ions are modeled in ∼40% of all macromolecular structures deposited in the Protein Data Bank (PDB), yet the identification and accurate modeling of metals still pose significant challenges (Zheng et al., 2014). The development of any tools for systematic analysis based on the protein structures in the PDB should take into account that these structural data are not error-free. Failure to consider this may result in inaccurate conclusions, as happened in a recent study of zinc coordination patterns (Yao et al., 2015) that were shown to violate/ignore chemical and crystallographic knowledge (Raczynska et al., 2016).

      Here, Yao et al. are essentially called idiots.

    2. Both structural and catalytic zinc sites are usually tetrahedral, although trigonal bipyramid cases exist, especially at catalytic sites in a less stable transition state (Yao et al., 2015).

      And here, in contrast, our paper is cited as evidence that a particular phenomenon actually occurs.