930 Matching Annotations
  1. Sep 2021
    1. This article is a preprint and has not been certified by peer review. Authors: Sherry Miller (1, 2), Teresa D. Shippy (1), Blessy Tamayo (3), Prashant S Hosmani (4), Mirella Flores-Gonzalez (4), Lukas A Mueller (4), Wayne B Hunter (5), Susan J Brown (1), Tom D’elia (3), Surya Saha (4, 6). Affiliations: 1 Division of Biology, Kansas State University, Manhattan, KS 66506; 2 Allen County Community College, Burlingame, KS 66413; 3 Indian River State College, Fort Pierce, FL 34981; 4 Boyce Thompson Institute, Ithaca, NY 14853; 5 USDA-ARS, U.S. Horticultural Research Laboratory, Fort Pierce, FL 34945; 6 Animal and Comparative Biomedical Sciences, University of Arizona, Tucson, AZ 85721. For correspondence: ss2489@cornell.edu

      This work has been peer reviewed in GigaByte (https://doi.org/10.46471/gigabyte.23), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Hailin Liu. Is there sufficient data validation and statistical analyses of data quality? No. No validation work, such as a qRT-PCR experiment, is reported in the manuscript.

      Is the validation suitable for this type of data? No. More validation work should be added rather than relying on the RNA-seq data from the public database.

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Any Additional Overall Comments to the Author: Formatting errors should be corrected, including the tables and the text alignment. The introduction and methods seem too brief for readers. More of the biological meaning should be explained in the manuscript. A basic assessment of the genome used should be added.

      Recommendation: Major Revision

      Reviewer 2. Mary Ann Tuli. Is the language of sufficient quality? Yes. The manuscript reads very well.

      Are all data available and do they match the descriptions in the paper? No. Additional Comments: 1) Line 149. "Multiple alignments of the predicted D. citri proteins and their insect homologs were performed using MUSCLE." We need the output of MUSCLE (FASTA).

      2) Line 151. Phylogenetic trees were constructed (figures 1 and 4) using full-length protein sequences in MEGA7 or MEGAX. We need the files underlying the phylogenetic tree (newick). Please indicate which version of MEGA was used for each tree.

      3) Line 152. Gene expression levels were obtained from the Citrus greening Expression Network and visualized using Excel and the pheatmap package in R. Please can you provide a file of the raw data used to produce the heatmap (figure 2) and the Expression levels of UAP2 in male and female tissues (figure 5).
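
      As a rough illustration of the kind of file requested in comment 1), an aligned FASTA file holds one gap-padded sequence per record, and all records have the same length. The sketch below uses only the Python standard library; the record names and residues are invented placeholders rather than data from the manuscript, and the helper names `parse_fasta` and `is_valid_alignment` are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record_id: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record id is the first whitespace-delimited token after '>'.
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[current] += line
    return records


def is_valid_alignment(records):
    """An alignment is consistent if all gap-padded sequences share one length."""
    lengths = {len(seq) for seq in records.values()}
    return len(records) > 0 and len(lengths) == 1


# Placeholder records (hypothetical names and residues, for illustration only).
EXAMPLE = """>seq_A
MKT-LLV
>seq_B
MKTALLV
"""

alignment = parse_fasta(EXAMPLE)
print(is_valid_alignment(alignment))
```

      A check of this kind only confirms that a deposited alignment is internally consistent; it says nothing about alignment quality.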

      Are the data and metadata consistent with relevant minimum information or reporting standards? Yes. Nomenclature standards have been met.

      All cited INSDC accession numbers are publicly available.

      Is the data acquisition clear, complete and methodologically sound? Yes. The curation workflow used for community annotation is available via protocols.io; nonetheless, the manuscript includes a comprehensive summary, which is appropriate.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. See "Are all data available and do they match the descriptions in the paper?" above. Once the additional files are made available I believe reproduction will be possible.

      Is there sufficient data validation and statistical analyses of data quality? Yes

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Any Additional Overall Comments to the Author: 1) Line 147. The Apollo version should be included in the other D. citri manuscripts.

      2) Citation [26] MUSCLE. https://www.ebi.ac.uk/Tools/msa/muscle/.

      • the website suggests users of MUSCLE cite DOI:10.1093/nar/gkz268

      Some of my comments/recommendations are pertinent to the other D. citri manuscripts currently under review.

    1. This article is a preprint and has not been certified by peer review. Authors: Chad Vosburg (1, 2), Max Reynolds (1), Rita Noel (1), Teresa Shippy (3), Prashant S Hosmani (4), Mirella Flores-Gonzalez (4), Lukas A Mueller (4), Wayne B Hunter (5), Susan J Brown (3), Tom D’Elia (1), Surya Saha (4, 6). Affiliations: 1 Indian River State College, Fort Pierce, FL 34981; 2 Department of Plant Pathology and Environmental Microbiology, The Pennsylvania State University, University Park, PA 16802; 3 KSU Bioinformatics Center, Division of Biology, Kansas State University, Manhattan, KS; 4 Boyce Thompson Institute, Ithaca, NY 14853; 5 USDA-ARS, U.S. Horticultural Research Laboratory, Fort Pierce, FL 34945; 6 Animal and Comparative Biomedical Sciences, University of Arizona, Tucson, AZ 85721. For correspondence: ss2489@cornell.edu

      This work has been peer reviewed in GigaByte (https://doi.org/10.46471/gigabyte.21), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Xingtan Zhang and Dongna Ma

      The manuscript by Vosburg et al. systematically analyzes the characteristics of the Wnt signaling genes in Diaphorina citri, focusing on evolutionary history, expression patterns and potential function. Finally, the authors also performed manual annotation of the Wnt signaling pathway. This work adds an important resource for the study of the evolutionary history of D. citri and of Wnt signaling in this important hemipteran vector. The writing is acceptable. Even so, I have some suggestions which may improve this manuscript.

      1. In the methods, the authors describe the process of identifying Wnt genes, but the abstract describes it as curation. I am unsure whether the Wnt signaling genes in D. citri were identified by the authors, or whether the authors only further analyzed results already identified by others.
      2. The paper only covers identification of the Wnt genes, their evolution, and expression analysis using RNA-seq. It is recommended to also look at chromosomal localization and mode of origin (e.g., tandem repeats).
      3. The Wnt signaling genes studied by the authors in this hemipteran vector could be further verified by qPCR and then compared, for discussion, with the expression and function of related genes published for other insects.

      Major Revision.

      Reviewer 2. Mary Ann Tuli. Is the language of sufficient quality? Yes. The manuscript reads very well.

      Are all data available and do they match the descriptions in the paper? No. Additional Comments: 1) Line 176. "High scoring MCOT models were then searched on the NCBI protein database...." We need the list of Wnt pathway genes with high scoring MCOT models.

      2) Line 178. "The high scoring MCOT models that had promising NCBI search results were used to search the D. citri assembled genome." We need the list of high scoring MCOT models which had promising NCBI search results.

      3) Line 179. "Genome regions of high sequence identity to the query sequence were investigated within JBrowse" We need the list of models with high sequence identity with the assembled genome.

      4) Line 184. "MUSCLE multiple sequence alignments of the D. citri gene model sequences and orthologous sequences were created through MEGA7" We need the output of MUSCLE (FASTA). We need the files underlying the phylogenetic tree (newick).

      5) I note that MEGA7 has been used. I wonder why the newer release (MEGAX, March '21) was not used. Furthermore, the annotation protocol (dx.doi.org/10.17504/protocols.io.bniimcce) suggests using Mega7 or MegaX.

      Instructions on how to upload these files are given under "Any Additional Overall Comments to the Author".

      Are the data and metadata consistent with relevant minimum information or reporting standards? Yes. Nomenclature standards have been met. All cited INSDC accession numbers are publicly available.

      Is the data acquisition clear, complete and methodologically sound? Yes. The curation workflow used for community annotation is available via protocols.io; nonetheless, the manuscript includes a comprehensive summary, which is appropriate.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. See "Are all data available and do they match the descriptions in the paper?" above. Once the additional files are made available I believe reproduction will be possible.

      Is there sufficient data validation and statistical analyses of data quality? Yes

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Any Additional Overall Comments to the Author: Some of my comments/recommendations are pertinent to the other D. citri manuscripts currently under review.

      Minor Revision.

    1. This article is a preprint and has not been certified by peer review.

      This work has been peer reviewed in GigaByte (https://doi.org/10.46471/gigabyte.20), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Feng Cheng. Crissy and co-authors annotated yellow genes in the genome of Diaphorina citri, the vector of Huanglongbing disease in citrus plants. The result is useful for closely related research areas, and here I have some comments for the authors to improve the manuscript.

      1. The sections of introduction and background can be merged into one introduction section.

      2. Many sentences in the results section can be moved to the methods section.

      3. The methods section should be rewritten and re-organized as each analysis per paragraph.

      4. Some domain analysis and figures may be helpful for illustrating the evolution of important yellow genes in different insect species.

      Reviewer 2. Mary Ann Tuli. Is the language of sufficient quality? Yes. Line 18: 'in-planta' should be 'in planta' (in italics).

      Are all data available and do they match the descriptions in the paper? No. Additional Comments: 1) Line 224. "The MCOT protein sequences were used to search the D. citri genomes" We need the list of MCOT protein sequences that were used.

      2) Line 228. "A neighbor-joining phylogenetic tree of D. citri yellow protein sequences along with was created in MEGA version 7 using the MUSCLE multiple sequence alignment" a) Along with what? There are some words missing. b) We need the output of MUSCLE (FASTA). c) We need the files underlying the phylogenetic tree (newick).

      3) I note that MEGA7 has been used. I wonder why the newer release (MEGAX, March '21) was not used. Furthermore, the annotation protocol (dx.doi.org/10.17504/protocols.io.bniimcce) suggests using Mega7 or MegaX.

      4) Line 233. "Comparative expression levels of yellow proteins throughout different life stages (egg, nymph, and adult) in Candidatus Liberibacter asiaticus (Clas) exposed vs. healthy D. citri insects was determined using RNA-seq data and the Citrus Greening Expression Network (http://cgen.citrusgreening.org)." Results are presented in Fig 3(a) and Fig 3(b). We need the raw data underlying these figures.

      Are the data and metadata consistent with relevant minimum information or reporting standards? Yes. Nomenclature standards have been met. All cited INSDC accession numbers are publicly available.

      Is the data acquisition clear, complete and methodologically sound? Yes. The curation workflow used for community annotation is available via protocols.io; nonetheless, the manuscript includes a comprehensive summary, which is appropriate.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. See "Are all data available and do they match the descriptions in the paper?" above. Once the additional files are made available I believe reproduction will be possible.

      Is there sufficient data validation and statistical analyses of data quality? Yes

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Any Additional Overall Comments to the Author: Citation [39] is not complete. It should be MCOT protein database.

      Some of my comments/recommendations are pertinent to the other D. citri manuscripts currently under review.

    1. Abstract

      This work has been peer reviewed in GigaByte, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Aaron Shafer Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. I would include all flags for assemblies even if default; unclear how the 10x + Illumina data were integrated (if at all) - see comments below.

      Is there sufficient data validation and statistical analyses of data quality? Yes. I suppose BUSCO and gene number are a form of validation.

      Is there sufficient information for others to reuse this dataset or integrate it with other data? No. See comment below; while the short-read data is great, I would likely reassemble the genomic resource for a variety of reasons outlined in Additional Comments.

      Any Additional Overall Comments to the Author: The paper is well written, and I have no comments about the content; well done here. My main concern lies with the genome resources, and in this case I would likely use the raw data rather than the assemblies provided. I offer my rationale and suggestions:

      My lab was heavily pushed by a colleague towards the use of Meraculous in our short-read assembly of mammal genomes (https://jgi.doe.gov/data-and-tools/meraculous/); this is because it is really designed for short-read assemblies of big genomes (i.e. no addition of mate-pair) AND it performs very well in the Assemblathon metrics (https://academic.oup.com/gigascience/article/2/1/2047-217X-2-10/2656129): notably, in Figures 16-18 you start to see clear differences between Meraculous and, say, SOAPdenovo. Thus, for just the Illumina data, I would very much like to see a more appropriate assembly explored, as stats like N50 and number of scaffolds will likely improve considerably with the appropriate methods.
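
      For readers less familiar with the N50 statistic mentioned here: it is the scaffold length L such that scaffolds of length L or longer together contain at least half of the total assembly length. A minimal sketch in Python follows; the function name `n50` and the example lengths are illustrative assumptions, not values from the manuscript.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length of
    the scaffold at which the running total first reaches half of the
    assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no scaffold lengths given")


# Illustrative lengths only (hypothetical, not from any real assembly).
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))
```

      Because N50 depends only on the length distribution, a fragmented assembly with many short scaffolds yields a low value even when the total assembled length looks reasonable.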

      Likewise, it’s very unclear in the methods how M. r. arvicoloides was assembled: I see SUPERNOVA for the 10X data (great), and probably SOAPdenovo for the Illumina data (see above). But how were they combined? This sequencing strategy is really designed for a hybrid assembly (see for example DBG2OLC, https://github.com/yechengxi/DBG2OLC); this is appropriate for 10X data and really does work! But there are others.

      Note that M. agrestis, which has an identical sequencing strategy to M. r. arvicoloides, has only ~3% of the total scaffolds: follow whatever they did! And I will say, while the authors state their genome is on par with other Microtus and this appears true by Table 3, only M. agrestis currently has an assembly that I think is at current standards. The level of fragmentation and low BUSCO scores really support revisiting the assembly as suggested, as I think the current .fasta will be of limited utility in a population or comparative genomics study.

      The gene number is pretty high for a mammal and I worry that’s due to fragmentation. It would be reasonable to only annotate scaffolds >10 kb or 50 kb, but then there’s not much of a genome left. Ideally the bulk of your genome (>>90%) would fall on these scaffolds. There is really no sense in annotating your small fragments (have you tested for contamination? Note that NCBI will do this before allowing the assembly to be deposited, so I suggest it).
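
      The fraction of an assembly that falls on scaffolds above a length cutoff can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name `fraction_in_long_scaffolds` is hypothetical, the lengths are invented for illustration, and the 10 kb cutoff is simply the lower of the two thresholds suggested in the comment above.

```python
def fraction_in_long_scaffolds(lengths, min_length):
    """Fraction of total assembly length carried by scaffolds longer than min_length."""
    total = sum(lengths)
    if total == 0:
        raise ValueError("assembly has zero total length")
    kept = sum(length for length in lengths if length > min_length)
    return kept / total


# Hypothetical scaffold lengths in base pairs (illustration only).
scaffolds = [50_000, 20_000, 20_000, 5_000, 5_000]
share = fraction_in_long_scaffolds(scaffolds, min_length=10_000)
print(f"{share:.0%} of the assembly is on scaffolds > 10 kb")
```

      A value well below the >>90% mentioned above would indicate that most of the sequence sits on short fragments.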

      You also align your data to the mt genome; this is different from assembling it. You could assemble it (e.g. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1927-y), and it might be interesting to see if there are any differences.

      I wish I could be more positive; an assembly with a tool like Meraculous would take a week or so, and so would the hybrid approach, but it would be worth it based on my experience with these data.

      Recommendation: Major Revision

      Reviewer 2. Joana Damas.

      Any Additional Overall Comments to the Author: The genomes presented in this work will be an extremely valuable tool for Microtus-related research. The manuscript is very clear and easy to follow. I have, however, a couple of comments that I hope will further improve it.

      (1) Line 123: I believe more details on the measures used for the selection of the best M. r. macropus assembly are needed. Even though the contiguity of the Discovar genome assembly is higher than the ones generated with SOAPdenovo, the BUSCO score is relatively low (54.5% versus 84% in M. r. arvicoloides, e.g.). Were the BUSCO scores for the other assemblies even lower? Is the Discovar assembly size closer to the estimated genome size?

      (2) Line 131/251: Was there any genome structure verification step for the M. montanus genome assembly? For instance, which percentage of the Illumina reads could be mapped back to the finished genome assembly?

      (3) Line 131/251: Was there a reason not to use a published reference-guided assembly method (e.g. RaGOO and those listed therein) for the assembly of the M. montanus genome? These could maybe further improve the assembly or help identify misassemblies.

      (4) Line 180: the large difference between the BUSCO scores for the M. richardsoni subspecies makes me believe that the completeness of the genomes is quite different, that the fraction of the genome within repeats might be underrepresented in M. r. macropus, and that the subspecies values might be closer than noted here. It is, however, difficult for non-experts such as myself to discern phylogenetic relatedness from Fig. 1 for the other species. It would be helpful to have a phylogeny next to the graph showing species relationships.

      (5) Please verify Tables 1 and 2. The statistics presented for M. r. macropus do not match for N50 and longest scaffold size.

      Recommendation: Minor Revision

    1. This article is a preprint and has not been certified by peer review. Authors: Jaclyn Smith (1), Yao Shi (1), Michael Benedikt (1), Milos Nikolic (2). Affiliations: 1 University of Oxford; 2 University of Edinburgh. For correspondence: jaclyn.smith@cs.ox.ac.uk

      This work has been peer reviewed in GigaScience, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: JianJiong Gao

      In this manuscript, the authors introduced a tool named TraNCE for distributed processing and multimodal data analysis. While the topic and tool are interesting, the writing can be improved. The current manuscript reads more like a technical manual than a scientific paper.

      For example, in the background, the discussion on data modeling in the contexts of multi-omics analysis and distributed systems is extensive, but the writing can be better organized. The examples are helpful, but they are very technical and can be hard to follow. It would be good if the main challenges can be summarized on a high level. It might also be useful to have an example analysis use case to lead the technical discussion on data modeling.

      It is also unclear who the targeted users of the tool are and why distributed computing is needed. For example, in applications 1 and 2, it is unclear why distributed computing is necessary.

      Reviewer 2. Umberto Ferraro Petrillo. First review:

      The authors propose a new framework, called TraNCE, for automating the design of distributed analysis pipelines over complex biomedical data types. They focus on the problem of unrolling references between different datasets (which can be very large), assuming that these datasets contain complex data types consisting of structured objects containing collections of other objects. By using TraNCE, it is possible to formulate queries over collections of nested data using a very high-level declarative language. These queries are then translated by TraNCE into Apache Spark applications able to implement those queries in an efficient and scalable way. Apart from a quick description of the TraNCE framework and of the declarative language it supports, the paper also includes a vast collection of examples of multi-omics analyses conducted using TraNCE on real-world data. I found the contribution proposed by this paper to be very timely. Indeed, public multi-omics databases are flourishing, but their huge volumes make their analysis difficult and very expensive if not approached with the right methodologies. Distributed analysis frameworks like Spark can help, but they are often not easy to master, especially for those without deep distributed programming skills. So TraNCE looks like a much-needed contribution on this topic.

      However, I have some remarks. The high-level querying language supported by TraNCE is not original because, as far as I understand, it was presented in a previous paper [1] (which was written by almost the same authors and is correctly referenced in this submission). Even the TraNCE framework is not completely original, because its name appears as the name of the project containing the code presented in [1]. Finally, at least one of the experiments presented in [1] seems to have been run on the same Hadoop installation used for the experiments presented in the current submission, and involved the same datasets from the International Cancer Genome Consortium. So I am a bit confused about what is original in this new submission and what has been borrowed from [1]. My advice is to definitely clarify this point.

      Another issue that I think should be addressed is the scalability of the proposed framework. The authors state that the framework supports scalable processing of complex datatypes; however, no evidence is provided for this claim. The several different experiments that are reported seem to focus more on the expressiveness of the proposed language, while no experiment is provided on the scalability of the generated code when run on a computing architecture of increasing size. I think we may agree that using Spark does not mean that your code is scalable, nor do I think it is enough to say that the scalability of TraNCE has been proved in [1]. So I would suggest elaborating on this as well. To be honest, I am a bit skeptical about the practical performance of the standard compilation route. I think that when applied to very large datasets it is likely to return huge RDDs that could require very long processing times. The shredded compilation route, instead, looks much cleverer to me. Could you elaborate further on this difference, especially in light of the results of your experiments? I also disagree with your decision not to describe how data skew is dealt with in your framework. It is indeed one of the main causes of bad performance in many distributed applications, so it would be interesting to know how you managed this problem in your particular case. On the bright side, I really appreciated the flexibility of the proposed framework, as witnessed by the vast number of examples provided, as well as its positive implications for the analysis of multi-omics databases.
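
      To make the data skew concern concrete, the sketch below measures how unevenly records spread across partitions when they are assigned by a hash of their key. It is plain Python, not TraNCE or Spark code, and it does not describe how either system partitions data; the function names, the CRC32-based assignment, and the gene-symbol keys are all assumptions chosen for illustration.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under hash partitioning."""
    sizes = Counter()
    for key in keys:
        # CRC32 is used here only because it is deterministic across runs.
        sizes[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return [sizes[i] for i in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition size divided by the mean; 1.0 means perfectly even."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# Hypothetical workload in which a single key dominates the dataset.
keys = ["TP53"] * 100
sizes = partition_sizes(keys, num_partitions=4)
print(sizes, skew_ratio(sizes))
```

      When one key dominates, all of its records fall into a single partition, so the slowest task handles several times the average load regardless of how many workers are added.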

      Finally, the English of the manuscript is very good and I have not been able to find any typos so far.

      [1] Jaclyn Smith, Michael Benedikt, Milos Nikolic, and Amir Shaikhha. 2020. Scalable querying of nested data. Proc. VLDB Endow. 14, 3 (November 2020), 445-457.

      Re-review: I appreciated the robust revision done by the authors and think the paper is now ready to be published

  2. Aug 2021
    1. ABSTRACT

      This work has been peer reviewed in GigaByte (https://doi.org/10.46471/gigabyte.16), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Inge Seim. Is the language of sufficient quality? No. The authors need to polish their English further. This is particularly obvious in the Abstract and is likely to result in an unwarranted lower readership of the work.

      Are all data available and do they match the descriptions in the paper? Yes. I want to commend the authors for sharing data and associated code.

      Is there sufficient data validation and statistical analyses of data quality? Not my area of expertise.

      Any Additional Overall Comments to the Author
      • R2 should be R^2 (that is, please superscript the '2').
      • The sentence 'Further comparison between sequencing platforms would be useful for for exploration using as similar amplification conditions as possible. This data being provided as one such benchmark' at the end of Results is vague and needs to be rewritten.
      • You need to state more clearly that you do not recommend combining MGI and Illumina data sets for metabarcoding, unlike e.g. BGISEQ-500 and Illumina RNA-seq/short-insert WGS data, which can be readily combined.

      Recommendation: Minor Revision

      Reviewer 2. Petr Baldrian. Are all data available and do they match the descriptions in the paper? No. I was not able to locate the items listed as references (26) and (27). Due to this, I was not able to fully evaluate the paper.

      Are the data and metadata consistent with relevant minimum information or reporting standards? No. I was not able to locate the data, see above.

      Is the data acquisition clear, complete and methodologically sound? No. Details on sampling (mode of sampling, area sampled, depth sampled, sample size, sample handling) are missing. Information on the number of repeated DNA extractions and the size of the sample used for extraction is missing. Protocols of amplification and barcoding are referenced as (27), but I was not able to locate this reference. These details have to be provided in the text for both types of sequencers.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes. For fungal ITS, the ITS region should be extracted before annotation.

      Is there sufficient data validation and statistical analyses of data quality? No. The authors do not report how they deal with sequences of fungi that produce amplicons longer than 350 bases, which cannot be joined as paired ends in the 2x200 base runs. Even the MiSeq 2x250 runs miss some fungal taxa (though not very many), and here the situation is worse still. For the length distribution of fungal ITS, please consult the UNITE database.
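
      The arithmetic behind the 350-base limit can be sketched as follows: two paired reads of length R can only be merged if they overlap by some minimum number of bases, so the longest mergeable amplicon is 2R minus that overlap. The 50-base minimum overlap used below is an assumption chosen because it reproduces the 350-base figure for 2x200 runs; actual merging tools and settings may differ, and the function names are hypothetical.

```python
def max_mergeable_length(read_length, min_overlap):
    """Longest amplicon whose forward and reverse reads still overlap enough to merge."""
    if not 0 <= min_overlap <= read_length:
        raise ValueError("min_overlap must be between 0 and read_length")
    return 2 * read_length - min_overlap


def can_merge(amplicon_length, read_length, min_overlap):
    """True if an amplicon of this length can be reconstructed from one read pair."""
    return amplicon_length <= max_mergeable_length(read_length, min_overlap)


MIN_OVERLAP = 50  # assumed value, not taken from the manuscript

for read_length in (200, 250):
    limit = max_mergeable_length(read_length, MIN_OVERLAP)
    print(f"2x{read_length}: amplicons up to {limit} bases can be merged")
```

      Under this assumption, a 400-base ITS amplicon would be lost in a 2x200 run but retained in a 2x250 run, which is the kind of length-dependent dropout the comment above asks the authors to address.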

      Is the validation suitable for this type of data? No. There should be additional validations, including the analysis of those OTUs that are abundant in one setup but missing in another one (if any).

      Is there sufficient information for others to reuse this dataset or integrate it with other data? No. The metadata, supposedly in reference (26), are impossible to locate.

      Any Additional Overall Comments to the Author: I believe that this is a very good attempt to test the novel platform with fungal metabarcoding. If all required information is provided, I believe that this can be both an interesting paper and a valuable dataset.

      Recommendation: Reject (Unsound or Unusable)

      Reviewer 2. Re-review. I have now carefully read the revised version of this manuscript and I am happy with the changes that the authors implemented in response to my comments and the comments of the other reviewer. The paper is now much clearer, especially in the methodological section, and the limitations of the use of the novel sequencing platforms/formats are sufficiently discussed.

      Minor comments that should be made in the present paper:

      L58: change "bacteria" to "bacterial".
      L65-66: the last part of this long sentence is difficult to comprehend and should be rephrased. I suggest dividing the long sentence into two.
      L68-69: change "produces" to "produced".
      L84: delete "in".
      L98: please explain the abbreviation "ONT", likely "Oxford Nanopore Technologies".
      L162: the detail of the amplification methods should be expanded, at least stating the primer pairs (names and sequences) used and the targeted molecular markers; from the text it appears as if ITS2 was the marker selected, yet lines 361 and 366 discuss length differences in ITS1.
      L246: replace "common fungi several species" with "common fungal species".
      L248-251: the misclassification of fungal taxa was not due to bad performance of the sequencing platform; it was because of the low variability of the ITS2 marker. I suggest changing the text to state that genus-level assignment was reached for these taxa since multiple species had the same ITS2 sequence.
      L264-265: the main reason is that PCR bias (preferential PCR amplification of certain templates) skews the representation of taxa if the DNA is mixed prior to amplification.
      L331-346: this section is unclear; it should be specified which primers (primer names and sequences) with what barcodes were used for each condition; if different primer pairs were used for different sequencing platforms, it is unclear what the use of this comparison is. This should either be clarified and explained, or this whole section may be removed.
      L381: delete "so".
      L387-392: I suggest that this part is either removed or that it is clearly described why the authors are sure that PCR replicates are not necessary (which is against all present recommendations). While the increasing fidelity of polymerases may be a fact, the main problem with parallel PCR is not errors (due to low fidelity) but random effects where primers align to templates with random frequencies. This statistical effect is impossible to handle by increasing polymerase fidelity, while it is easily handled by PCR replication.
      L424-426: this statement is rather obvious; I suggest deleting it.

    1. Now published in Gigabyte doi: 10.46471/gigabyte.14 Tianlin Pei (1, 2), Mengxiao Yan (1), Jie Liu (1), Mengying Cui (1), Yumin Fang (1), Binjie Ge (1), Jun Yang (1, 2). Affiliations: (1) Shanghai Key Laboratory of Plant Functional Genomics and Resources, Shanghai Chenshan Botanical Garden, Shanghai Chenshan Plant Science Research Center, Chinese Academy of Sciences, Shanghai, 201602, China; (2) State Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Shanghai Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, 200032, China. For correspondence: yangjun@csnbgsh.cn zhaoqing@cemps.ac.cn

      Reviewer 1. C Robin Buell. Is the language of sufficient quality? No. The manuscript could be improved with a round of editing for grammar.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. The sequencing, assembly and annotation methods need more details.

      Any Additional Overall Comments to the Author: This manuscript describes the sequencing, assembly, annotation, and analysis of the Tripterygium wilfordii genome. T. wilfordii is a medicinal plant that has long been used in traditional medicine due to its production of alkaloids and triterpenoids; the focus of this study was to identify cytochrome P450s involved in biosynthesis of the triterpenoid celastrol.

      Based on the genome assembly metrics, the authors generated a robust representation of the genome sequence. Improvements in the analyses of the genome and in the manuscript would greatly strengthen confidence in the assembly. The authors should provide these metrics and additional information to the manuscript:

      More details on the error correction of the assembly. Based on the methods, both nanopore and Illumina WGS reads were used, however, this is not explicit nor are any metrics of the error correction provided.

      Specifically, it is not discussed how the nanopore reads were assembled. A company is cited for the genome assembly. Information on which assembly software was used must be provided.

      Every software program used, its version, and the parameters used should be provided in the methods. This is often missing.

      The quality of the genome should be confirmed by aligning both the whole genome shotgun reads and the mRNA-seq data to the assembly. Specific metrics that should be provided include: the total number and percentage of reads that mapped, and the number of read pairs that mapped in the correct orientation.
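      The mapping metrics requested above can be derived from the FLAG field of each alignment record. The following is a minimal sketch, not the authors' pipeline: it assumes the FLAG integers have already been extracted from a SAM/BAM file, and it uses the "properly paired" bit as a stand-in for "mapped in the correct orientation", which depends on how the aligner sets that bit. The function name `mapping_metrics` is hypothetical.

```python
# SAM FLAG bits as defined in the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mapping_metrics(flags):
    """Summarise mapping from an iterable of SAM FLAG integers (one per record)."""
    total = mapped = proper = 0
    for flag in flags:
        # Count each read once: skip secondary and supplementary records.
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
            # The aligner sets PROPER_PAIR when both mates align as expected.
            if flag & PAIRED and flag & PROPER_PAIR:
                proper += 1

    def pct(n):
        return round(100.0 * n / total, 2) if total else 0.0

    return {
        "total_reads": total,
        "mapped_reads": mapped,
        "pct_mapped": pct(mapped),
        "properly_paired_reads": proper,
        "pct_properly_paired": pct(proper),
    }


# Toy input (hypothetical): two properly paired mates, one improperly paired
# read, one unmapped read, and one secondary alignment that is ignored.
example = mapping_metrics([99, 147, 65, 133, 355])
```

      Reporting both the counts and the percentages, for the WGS and the RNA-seq libraries separately, makes it easier to judge assembly completeness and correctness.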

      No details on read quality assessment or trimming are provided.

      The CEGMA results should be omitted; this program has been deprecated.

      Line 337: The DNA was sheared, not interrupted, into fragments.

      Line 343: More details on the library preparation and sequencing for the nanopore reads are needed.

      Do the authors know the genome size of the species based on flow cytometry? Do you know the number of chromosomes that this species has? This should be stated and discussed in the context of the assembly size and the number of pseudochromosomes.

      The genome wide identification of the CYP450 candidates was difficult to follow. This section should be revised so that it is clear how the authors identified their candidate genes. Potentially adding a supplemental figure would be helpful. I found the coexpression pattern extremely difficult to follow. I would not call coexpression patterns coexpression profiles. Specifically I did not understand the sentence on line 202 “However, no….”. Essentially this is just sub-functionalization at the expression level, not that there are two independent pathways.

      The evolution section should be expanded. How divergent is T. wilfordii from P. trichocarpa and R. communis?

      Table 1: Index should be replaced with metric

      Figure S1: What k-mer size was used in the analysis? Figure S5: It is unclear what is on the x or y axis. Expand the figure legend.

      The manuscript should be proofed for grammar as there are numerous sentences that need editing.

      Recommendation Major Revision

    2. Tripterygium wilfordii

      Reviewer 2. Xupo Ding. Is the language of sufficient quality? The language of about one third of the paragraphs is of sufficient quality.

      Comments: This manuscript provides a reference genome assembly of T. wilfordii using a combined sequencing strategy (Nanopore, Bionano, Illumina, HiSeq, and PacBio), and the functions of two CYP450 genes were identified with enzyme assays in vivo and in vitro. This research also provides valuable information to aid the conservation of resources and to help reveal the evolution of Celastrales and the key genes involved in celastrol biosynthesis. However, the text should be substantially improved.

      1. I suggest removing the comma in the title.

      2. Nothing in biology makes sense except in the light of evolution (T. Dobzhansky), yet the abstract does not present vital results from the manuscript, such as gene numbers, repeat percentage, and the comparative evolutionary analysis. The contribution of the T. wilfordii genome is not limited to celastrol biosynthesis (lines 38-39); it also provides valuable information to aid the conservation of resources and to help reveal the evolution of Celastrales and the key genes involved in celastrol biosynthesis.

      3. Nanopore is not an appropriate keyword; equivalent platforms (Illumina, Bionano, PacBio and Hi-C) were also used in the manuscript.

      4. Mentioning legendary tales (lines 59-61) in a scientific paper might be controversial.

      5. Lines 61-63 are written colloquially. Please consider replacing them with: "Extracts of T. wilfordii bark have been used as a pesticide in China since ancient times, as first recorded in the Illustrated Catalogues of Plants published in 1848."

      6. Line 103-104 is not coherent with the above sentence.

      7. Line 112: is the proportion of N bases really 0%?

      8. Lines 117-118: both results indicated that the presented genome is relatively complete. This is uncommon and definitely worth discussing. Even if it is credible, this sentence might belong in the Discussion section.

      9. Line 145: the full name should be given at first mention.

      10. Lines 150-155: Copia and Gypsy are missing.

      11. Were the gene families containing TwCYP712K1 and TwCYP712K2 expanded or contracted in the CAFE analysis?

      12. WGCNA might provide much more reliable evidence for the candidates TwCYP712K1 and TwCYP712K2, even though Pearson's correlation coefficient is a simplified version of WGCNA.

      13. The full peaks should be presented in Figures 5A and 5B. Uploading the NMR and MS data as an additional file would enhance the credibility of the enzyme functions.

      14. Lines 269-272: the evolutionary analysis in Figure 2B indicates that T. wilfordii originated earlier than P. trichocarpa and R. communis. Does this suggest that the functions of TwCYP712K1 and TwCYP712K2 were fused during the evolution of Malpighiales and Celastrales in Figure 6? If the authors insist that these two P450s came from a common ancestor, a syntenic analysis of TwCYP712K1 and TwCYP712K2 between T. wilfordii and A. trichopoda, O. sativa or V. vinifera might make this more credible.

      15. The Latin names in all figures should contain the complete species name; for example, T.wil should be replaced with T. wilfordii.

      16. Line 322: "transcriptom" should be "transcriptome".

      17. Line 330: please add the longitude and latitude.

      18. Please revise the English throughout, except for lines 327-509 and 526-599. Lines 327-509 might come from the final report of the sequencing project.

      19. Line 606. LAST might be BLAST?

      20. I noticed that a T. wilfordii genome was published in Nature Communications in Feb. 2020. I therefore suggest adding some comparison to that assembly or to triptolide synthesis, and citing that paper. Mentioning this will look fair and will also highlight the distinctive celastrol synthesis presented here.

      Major Revision
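      Comment 12 above refers to Pearson's correlation coefficient as the basis of the coexpression screen. The following is a minimal sketch of how such a screen could rank candidate genes against a known pathway ("bait") gene. It is an illustration under stated assumptions, not the authors' method: the function names and the expression values in the example are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length expression vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def rank_candidates(bait, candidates):
    """Sort candidate genes by their correlation with the bait gene, highest first."""
    scored = [(name, pearson(bait, values)) for name, values in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical expression values across four tissues (not data from the paper).
bait_expression = [1.0, 4.0, 9.0, 2.0]
candidate_expression = {
    "candidate_A": [2.0, 8.1, 17.5, 4.2],
    "candidate_B": [9.0, 3.0, 1.0, 7.0],
}
ranking = rank_candidates(bait_expression, candidate_expression)
```

      A network approach such as WGCNA builds on pairwise correlations like these but groups genes into modules, which is why the reviewer regards it as stronger evidence.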

    1. Now published in Gigabyte doi: 10.46471/gigabyte.13

      This work has been peer reviewed in GigaByte, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Yoonjoo Choi. Is the language of sufficient quality? Yes. There are some minor typos. Perhaps this is not an issue on other systems or viewers, but no "fi" characters appear on my computer (Mac OS Preview), e.g. "affinity" -> "a inity", "artificial" -> "arti cial".

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. The purpose of this software is clearly stated and it will be very useful for researchers in relevant research fields.

      Yes. The author recommends running this package on Linux machines, although it is written in Python. It would be great if non-Linux users could run TEPITOPE and BasicMHC1 (for a quick epitope screen). I pip-installed it on both Ubuntu and Mac OS (just to see whether I could run TEPITOPE and BasicMHC1). The installation on Ubuntu was very easy and it ran fine. The Mac OS installation failed, though perhaps not because of epitopepredict (brew-installed Python 3.9.0).

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages? Yes. (Definitely not mandatory, but) it would be great if this package also provided a wrapper for the IEDB tools.

      Recommendation: Minor Revisions.

      Reviewer 2. Jayaraman Valadi. Is the language of sufficient quality? Yes. There are a lot of spelling mistakes, which must be corrected before acceptance.

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. This is clearly explained in the manuscript.

      Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code? Yes. The source code is available on Github and it works as expected.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? No. The software depends on a number of external software packages. Their installation needs to be explained clearly in the manuscript.

      Is the documentation provided clear and user friendly? Yes. Overall the documentation is good. The docstrings need minor improvements to make them more comprehensive.

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? Yes. This is well explained in manuscript.

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages? Yes. Adding a note on comparing the performance of different methods would be useful.

      Additional Comments: The software developed is a Python wrapper for a number of available epitope prediction methods. The unified architecture gives users easy access to all methods and lets them compare the results of each method. Some of these methods/models have to be installed manually before the user can access them through the Python wrapper. A new model trained by the authors has also been added; users can use this prediction model without having to install any additional dependencies.

      Salient features:
      • The software supports visual comparison of predictions from different predictive models.
      • Users can select a target protein for epitope scanning.
      • Users can predict putative MHC-I and MHC-II epitopes using various predictive models through the Python wrapper.
      • Selection of the best predictions is possible.
      • The software highlights the positions of putative epitopes on the target protein sequence.

      Overall the manuscript and software are quite comprehensive and can be accepted after minor revisions.

      Recommendation: Minor Revisions

    1. doi: 10.1093/gigascience/giaa146

      Reviewer 2. Mile Šikić. Reviewer Comments to Author: In their paper, Murigneux et al. compare three long-read sequencing technologies applied to the de novo assembly of a plant genome, Macadamia jansenii. They generated sequencing data using Pacific Biosciences (Sequel I), Oxford Nanopore Technologies (PromethION), and BGI (single-tube Long Fragment Read) technologies. The sequenced data were assembled using a range of state-of-the-art long-read assemblers and the hybrid MaSuRCA assembler. Although the paper is easy to follow, and this kind of analysis is more than welcome, I have several major and minor concerns.

      Major concerns:

      1) The authors use 780 Mbp as the estimated size of the genome. Yet, this is not supported by data. In the chapter "Genome size estimation", they present the genome size estimation using k-mer counting, but these sizes are 650 Mbp or less.

      2) Since the real size of the genome is unknown, it would be worthwhile if the authors provided analyses such as those enabled by KAT (Mapleson et al., 2017), which compares the k-mer spectrum of the assembly to the k-mer spectrum of the reads (preferably Illumina). To check for misassembled contigs, the authors might also align the larger contigs obtained using different tools to compare the similarity among them (e.g., using tools such as Gepard or similar).

      3) The authors compare assemblies with an "Illumina assembly", but it is not clear what that means and why they consider this a valid comparison.

      4) Although they started the ONT data analysis with four tools, they perform further analysis with just two tools (Flye and Canu). In addition, for PacBio data, they use three tools (Redbean, Flye and Canu). It is not clear why the authors chose these tools. Canu and Flye have larger N50, larger total length, and the longest contigs. However, this does not take into account possible misassemblies. Assemblers might have problems with uncollapsed haplotypes, which can result in assemblies larger than expected. In their recent manuscript, Guiglielmoni et al. (https://doi.org/10.1101/2020.03.16.993428) showed that Canu is prone to uncollapsed haplotypes. Also, the present manuscript shows that, using PacBio data, Canu produces much longer assemblies than other tools (1.2 Gbp). Therefore, a longer total assembly size cannot guarantee a better genome. Furthermore, on ONT data Raven has the second-best initial BUSCO score (before polishing), and its assembled genome consists of the smallest number of contigs. Therefore, I deem that the full analysis needs to be performed using all tools for both Nanopore and PacBio data.

      5) It would be of interest to a broad community if the authors added the computational costs to the total cost per genome for each sequencing technology. They might compare their machines with AWS or other specified cloud configurations. Besides, it is not clear which types of machines they used. Information from the supplementary materials such as GPU, large memory, and HPC is not descriptive enough.

      Minor comments:

      1) The authors use the published reference genome of Macadamia integrifolia v2 for comparison. It would be interesting if they could provide information about the sequencing technology used for this assembly.

      2) The authors mention that the newer generation of PacBio sequencing technology (Sequel II) provides higher accuracy and lower costs. It would also be worth mentioning the newer generations of assembly tools, such as Canu 2.0, Raven v1.1.5 or Flye version 2.7.1. It is worth considering Racon for polishing with Illumina reads too. Yet, this is not a requirement, because the authors already use state-of-the-art tools.
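      The contiguity and size arguments in major concern 4 rest on two simple statistics: N50 and total assembly length relative to the expected genome size. The following is a minimal sketch, assuming the contig lengths have already been read from the assembly; the function names are hypothetical, and a size ratio well above 1 is only a hint of uncollapsed haplotypes, not proof.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L contain at least half of all bases."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def size_ratio(contig_lengths, expected_genome_size):
    """Total assembly length divided by the expected haploid genome size."""
    if expected_genome_size <= 0:
        raise ValueError("expected genome size must be positive")
    return sum(contig_lengths) / expected_genome_size


# Toy contig lengths (hypothetical): half of the 20 bases are reached at length 5.
toy_n50 = n50([2, 3, 4, 5, 6])
```

      Because N50 rises whenever more sequence ends up in long contigs, it should be read together with the size ratio and a completeness measure such as BUSCO, as the reviewer argues.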

    2. Now published in GigaScience

      Review 1. Cecile Monat. Reviewer Comments to Author:

      Introduction part:
      • It would be nice to state the genome size and to indicate the reference genome that has already been sequenced and assembled for Macadamia, to provide context for readers who are not familiar with Macadamia.
      Methods part:
      • ONT library preparation and sequencing part:
        • What was the reason for using both MinION and PromethION, and not only PromethION?
        • For what reason didn't you use the same version of MinKNOW to assemble the MinION (MinKNOW (v1.15.4)) and PromethION (MinKNOW (v3.1.23)) data?
      • Assembly of genomes part:
        • Is there a reason for doing 4 iterations of Racon? And not 3 or 5?
        • Maybe you should specify that Racon is used as an error-correction module and Medaka to create the consensus sequence.
        • "Hybrid assembly was generated with MaSuRCA v3.3.3 (MaSuRCA, RRID:SCR_010691) [32] using the Illumina and the ONT or PacBio reads and using Flye v2.5 to perform the final assembly of corrected mega-reads": this sentence is not very clear to me. Does it mean that you first used the ONT/PacBio data plus Illumina data in MaSuRCA to generate what they call "super-reads", and then used Flye on these data to get the final assemblies?
        • As I understand it, stLFR is similar to 10x Genomics; why not compare data from that technology too?
      • Assembly comparison part:
        • "We compared the assemblies with the published reference genome of Macadamia integrifolia v2 (Genbank accession: GCA_900631585.1)." First, I think it is important to add the reference paper. Secondly, I cannot see where you compared your assemblies with the published one. To me, you compared all your assemblies with each other, but I cannot find any other assembly.
        • When you say "Illumina assembly", do you refer to the Macadamia integrifolia assembly? If so, please clarify this in the rest of the paper, and add the data for this reference genome to your figures.
      Results part:
      • ONT genome assembly part:
        • Is there any interest in combining MinION and PromethION data? Are there any advantages to combining them?
        • "The genome completeness was slightly better after two iterations of NextPolish (95.5%) than after two iterations of Pilon (95.2%) (Sup Table 1)." Here I would specify that this is the case for the Flye assembly, but surprisingly (at least for me?) after two iterations of NextPolish on the Canu assembly, the results were slightly worse than with one iteration. So, depending on the assembler you use, the number of iterations needed might be different.
        • "As an estimation of the base accuracy, we computed the number of mismatches and indels as compared to the Illumina assembly." Here I am not sure which assembly you refer to when you use the term "Illumina assembly". Do you refer to the Macadamia integrifolia assembly or to the MaSuRCA hybrid assembly? If you refer to the latter, I would suggest using the term "hybrid assembly" instead of "Illumina assembly", which might be confusing.
        • Why not apply the Pilon and NextPolish steps to the ONT+Illumina (MaSuRCA) assembly, since they are tools dedicated to polishing with long and short reads?
      • PacBio genome assembly part:
        • Why did you use FALCON as the assembler for PacBio but not for ONT? If I am correct, it is not built exclusively to work on PacBio data but is suitable for all long-read technologies.
        • "Two subsets of reads corresponding to 4 SMRT cells and equivalent to a 43× and 39× coverage were assembled using Flye." Why choose Flye for this analysis? I'm also wondering if this part is necessary since, afterward, you do the ONT-equivalent coverage, which is more interesting for the comparison of the technologies.
        • Comment on the structure: for this paragraph, I would prefer to have first the results with the same assemblers as for the ONT data, then an explanation of why you chose to also perform a test with FALCON, and then the FALCON results.
      • stLFR genome assembly part:
        • Supernova might have been used on PacBio data as well; why not?
        • Why not try to complement the PacBio data with stLFR as you did with ONT? Are there any incompatibilities?
      Discussion part:
      • "The amount of sequencing data produced by each platform corresponds to approximately 84× (PacBio Sequel), 32× (ONT) and 96× (BGI stLFR) coverage of the macadamia genome" I would have put this information in the Results part, but that is only my preference.
      • "For both ONT and PacBio data, the highest assembly contiguity was obtained with a long-read only assembler as compared to an hybrid assembler incorporating both the short and long reads." I would suggest using the term "long-read polished" instead of "long-read only", since the assembly with the best contiguity integrates the Illumina data for the polishing.
      Tables and figures:
      • Table 2:
        • For this table, if I understood properly, you have chosen the best assembly for each technology. If I am right, then please state this in the title of the table.
      • Figure 1:
        • If I understood properly, and when you write "Base accuracy of assemblies as compared to Illumina assembly" you refer to the Macadamia integrifolia assembly, then I would add the Macadamia integrifolia assembly to this figure, and maybe put a dotted line at its level for each category (indels and mismatches) so that it is easier for the reader to compare against it.
      • Figure 2:
        • Here I would put all the assemblies you had in Figure 1
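      The coverage figures quoted in the Discussion comment above (84×, 32× and 96×) depend directly on the genome size that is assumed. The following is a minimal sketch of that calculation; the function name and the read total are hypothetical, and the two genome sizes are used only to show how sensitive the result is to the assumption.

```python
def coverage_depth(total_bases, genome_size):
    """Average sequencing depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size


# Hypothetical read total, evaluated under two assumed genome sizes (in bp).
total_bases = 25_000_000_000
for assumed_size in (780_000_000, 650_000_000):
    depth = coverage_depth(total_bases, assumed_size)
    print(f"assumed genome size {assumed_size:,} bp -> {depth:.1f}x coverage")
```

      Stating the genome size used alongside each coverage value lets readers recompute the depth if the size estimate is later revised.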
    1. Now published in Gigabyte doi: 10.46471/gigabyte.11 Bruno C. Genevcius and Tatiana T. Torres, Department of Genetics and Evolutionary Biology, University of São Paulo, São Paulo, SP, Brazil. For correspondence: bgenevcius@gmail.com

      This work has been peer reviewed in GigaByte, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. Peter Thorpe. Are all data available and do they match the descriptions in the paper?

      No. They are submitted but still private. These need to be released.

      Final Comments: SRA datasets need to be released.

      Recommendation: Minor Revision

      Reviewer 2. Guillem Ylla.

      Is the language of sufficient quality? While the text is mostly clear, I detected a few spelling mistakes (listed below) and there might be more that escaped my attention. I would recommend the authors to exhaustively check the MS. Line 53: “Stink bug” missing “bug”. Lines 39,58,69, and figures: Mixed usage of “Chinavia impicticornis” and “C. impicticornis”. After first appearance of the full name, authors should be consistent whether they keep using the full name or the abbreviation, but not mixing both.

      Are all data available and do they match the descriptions in the paper? No. The authors report multiple accession numbers from NCBI, including a BioProject ID. But they are not open and I was unable to check whether the data match the descriptions in the paper. The TSA accession seems not to have been created yet, and the MS displays a placeholder (GIVF00000000) in its place.

      Are the data and metadata consistent with relevant minimum information or reporting standards? No. Missing items from the checklist. 1) "Any perl/python scripts created for analysis process ". In Line 94 “using a custom Perl script [16]”, the authors provide citation but not the code. 2) "Full (not summary) BUSCO results output files (text) ".

      Is the data acquisition clear, complete and methodologically sound? Yes. The end of the fifth nymphal instar dataset was obtained at “seven days after molting from fourth to fifth instar”. Could the authors specify how many days the 5th nymphal instar lasts, to give a better idea of how much longer the 5th nymphal stage continues?

      Could the authors briefly describe the rationale behind choosing the 5th nymphal instar instead of other nymphal stages? They explain why nymphal stages were used instead of adults, but not why the 5th nymphal instar.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. I would appreciate it if the authors could share the code/commands for removing redundant reads and performing the assembly as supplementary materials or on GitHub (recommended).

      In the abstract, the authors describe 38,478 transcripts, of which 12,665 had GO terms assigned. It is not clear where this number comes from. Line 120 mentions that “ 39,478 had successful matches in the NCBI”. Is there a typo in one of these two numbers (38,478 vs 39,478)? However, the MS says “we only kept contigs that matched to Arthropod species”, and this number is reported to be 33,871. I urge the authors to better explain the steps they followed and clarify where all these numbers come from.

      Is there sufficient data validation and statistical analyses of data quality? Yes. Using the whole insect body often includes contaminant RNAs from the gut microbiome, endosymbionts, viruses, and other microbiological specimens from the cuticle and the environment. Since the authors do not filter out reads from possible contaminants before the assembly, I would appreciate it if they could perform a BUSCO analysis using the prokaryote database before and after the selection based on similarity to databases. This would allow estimating the number of contaminants in the original assembly and whether they were successfully discarded by the selection.

      Lines 126-127 are not clear. There are 12,665 contigs that have 5,087 GO terms. I deduce that there are 12,665 contigs that have at least 1 GO term, and that they contain 5,087 distinct GO terms. Could the authors make this clearer in the text?
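      The two quantities the reviewer separates here (contigs with at least one GO term versus distinct GO terms) can be counted independently. The following is a minimal sketch, assuming the annotation is available as a mapping from contig ID to GO identifiers; the function name, contig IDs and the assignment of terms to contigs are hypothetical.

```python
def summarize_go(annotations):
    """Count annotated contigs and distinct GO terms in a contig -> GO IDs mapping."""
    annotated_contigs = sum(1 for terms in annotations.values() if terms)
    distinct_terms = set()
    for terms in annotations.values():
        distinct_terms.update(terms)
    return {
        "annotated_contigs": annotated_contigs,
        "distinct_go_terms": len(distinct_terms),
    }


# Toy annotation table: three contigs carry a GO term, but only two
# different terms occur overall; one contig has no annotation.
toy_annotations = {
    "contig_1": ["GO:0008150", "GO:0003674"],
    "contig_2": ["GO:0008150"],
    "contig_3": ["GO:0003674"],
    "contig_4": [],
}
summary = summarize_go(toy_annotations)
```

      Reporting both numbers with explicit labels removes the ambiguity the reviewer points out.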

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes.

      1- I don’t think that a dataset consisting of 2 time points (early and late) of the same stage (nymph 5) can be considered a “developmental transcriptome”. I would urge the authors to change the terminology and title.

      2- In the abstract, the authors claim that this is the “ first genome-scale study with”. Since the study is only transcriptomic, I find it misleading to define it as a “genome-scale study”.

      3- In Table 1 and line 117, the authors claim that they generated the highest amount of RNA-seq reads for pentatomids to date. However, for Halyomorpha halys there are multiple available RNA-seq datasets that are not mentioned, which taken together I suspect would exceed the data generated for C. impicticornis. I would suggest toning down the statement on line 117.

      4- Additionally, there are at least 3 available genomes for pentatomid species. I think that this information should at least be mentioned in the introduction.

      5- In line 61, could the authors define “almost nonexistent”, how many are there?


      Recommendation: Minor Revision

    1. Background

      This work has been peer reviewed in GigaByte, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Chao Bian

      1. Is the language of sufficient quality? No

      2. Are all data available and do they match the descriptions in the paper? Yes

      3. Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide Yes

      4. Is the data acquisition clear, complete and methodologically sound? Yes

      5. Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes

      6. Is there sufficient data validation and statistical analyses of data quality? Yes

      7. Is the validation suitable for this type of data? Yes

      8. Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

    2. Abstract

      This work has been peer reviewed in GigaByte, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Mile Šikić

      1. Is the language of sufficient quality? Yes

      2. Are all data available and do they match the descriptions in the paper? Yes

      3. Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples (http://gigadb.org/site/guide) Yes

      4. Is the data acquisition clear, complete and methodologically sound? Yes

      5. Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes

      6. Is there sufficient data validation and statistical analyses of data quality? Yes

      7. Is the validation suitable for this type of data? Yes

      8. Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Additional Comments: In their update to the previous study on the comparison of long read technologies for sequencing and assembly of plant genomes, Sharma et al. presented a follow-up analysis using a newer generation of base callers for nanopore reads and PacBio HiFi reads. I argue that this study is an important update, but it is not suitable for publication in the current form.

      My major comments are the following:

      1. It is not clear which version of the base caller the authors used in assemblies related to Table 1 and Table 3.
      2. For phased assemblies, it is important to provide information about the size of alternative contigs.
      3. In Table 1, it would be great to have results for methods that do not phase the assembly (e.g. Flye).
      4. There is no explanation of why the authors use IPA instead of other HiFi assemblers, e.g. hifiasm, which in my experience performs better than IPA.
      5. A sentence related to Table 3, “The quality of the assemblies was more contiguous with less data in each of these cases when HiFi reads were used instead of the earlier continuous long reads (Table 3).” is not clear. Following Table 3, assemblies achieved using long reads have similar or longer N50 and higher BUSCO score. Also, it is not clear which assembler was used for long reads.
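
      Comment 5 turns on how N50 is read, so a small worked example may help. The sketch below is illustrative only: the function name `n50` and the contig lengths are hypothetical and are not taken from the manuscript or its Table 3.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input: no contigs, so no N50


# Hypothetical contig lengths (arbitrary units), for illustration only.
example_contigs = [100, 80, 60, 40, 20]
print(n50(example_contigs))
```

      A larger N50 indicates a more contiguous assembly, which is why similar or longer N50 values for the long-read assemblies sit awkwardly with the quoted sentence.
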
    1. Abstract

      Reviewer 1. Wei Zhao

      Are all data available and do they match the descriptions in the paper? No

      The BioProject PRJNA667278 is currently not accessible.

      Is there sufficient data validation and statistical analyses of data quality? No

      The size of the final genome assembly is significantly larger than the estimated size, which is indicative of redundancy. I would suggest removing the potential haplotype redundancy further. I would also suggest a k-mer analysis to validate the genome size. For a chromosomal assembly, the ratio of properly paired reads is lower than expected.
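
      As a rough illustration of the suggested k-mer check, genome size can be approximated by dividing the total number of k-mers by the depth of the main coverage peak. The sketch below assumes a k-mer histogram is already available as a mapping from depth to number of distinct k-mers; the function name, the `min_depth` cutoff for error k-mers and the toy histogram are all hypothetical. In a highly heterozygous genome the tallest peak may be the heterozygous one, so the peak should be inspected before the estimate is trusted.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram: dict mapping k-mer depth -> number of distinct k-mers.
    min_depth: depths below this are treated as sequencing errors
               and ignored (an assumed, data-dependent cutoff).
    Returns (estimated_size, peak_depth).
    """
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Depth at which the most distinct k-mers occur (main peak).
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences, excluding the low-depth error tail.
    total_kmers = sum(d * c for d, c in solid.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram for illustration only (not real data).
toy = {1: 1_000_000, 2: 200_000, 18: 3000, 19: 4000,
       20: 5000, 21: 4000, 22: 3000}
size, peak = estimate_genome_size(toy)
print(size, peak)
```

      Comparing such an estimate with the final assembly length gives a quick indication of whether haplotype redundancy remains.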

      Additional comments annotated on the paper have been provided to the author.

      Major Revision

    2. Now published in Gigabyte doi: 10.46471/gigabyte.10

      Reviewer 2. Ramil Mauleon Are all data available and do they match the descriptions in the paper? No Additional Comments Bioproject PRJNA667278 in NCBI appears to be still embargoed, a reviewer link would be helpful.

      Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide No Additional Comments Sample provenance / passport information is lacking for the Cannbio-2 material. Outright mention of the source of RNAseq +TSA info in the methods would be helpful. Same comment as above for Genbank bioproject.

      Is the data acquisition clear, complete and methodologically sound? No Additional Comments It's mostly clear from the DNA extraction, PacBio sequencing and primary assembly. The anchoring of the assembled contigs into pseudochromosomes using another published genome lacks detail and only broadly mentions the software used (RaGOO). This is a very critical step that will distinguish whether the Cannbio-2 assembly is an improvement vs the mentioned genome assemblies (esp. cs10, PK); it's a circular argument if the genome assembly is ascertained against existing assemblies from other cannabis accessions and declared improved. As noted by the authors, there are differences (rather than inconsistencies) between the compared published genomes, and these may be inherent in each genome; any analyses on an assembly based on these would cause ascertainment bias.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No Additional Comments The previous comment regarding anchoring of contigs to an existing genome applies to this as well. Regarding genome annotation, is there any basis for the choice of annotation method, i.e. the annotator software (Augustus), the consensus builder (EVM), and PASA? MAKER (MAKER-P) and BRAKER are available pipelines, both being reported as good for plants, and GeneMark is a prediction software suite that excels in plant genome annotation. Regarding evidence for annotation, it appears that transcript de novo assemblies were used, but the RNAseq data were not incorporated in the prediction step. No orthologous protein databases appear to have been used as hints for gene prediction. These are just observations/suggestions to further improve the annotation quickly. In general, the annotation steps would benefit from a bit more detail for reproducibility, but I would say the annotation, if done at the contig level, would be very solid.

      Is there sufficient data validation and statistical analyses of data quality? No Additional Comments On the assembly itself, since there was no mention of the method for anchoring contigs into chromosomes, there is no information on how scaffolds are spaced along the genome. Is it padding by a fixed number of Ns? Are all assembled contigs anchored or are there unanchored ones? Again on the point of anchoring and ordering of contigs, ideally evidence from the same sequenced material would be the best to use (for example, a genetic linkage map with sequence-based markers). Plant genomes are notorious for rearrangements (inversions, insertions, translocations, tandem repeats etc.) even within species, and this appears to be the weakest evidence in this paper (how the contigs were anchored into chromosomes). Regarding gene annotation, you can run BUSCO on the predicted genes and report those results as well. Again, results will reflect the outcome of the annotation method used. For BUSCO in general, I'd be cautious in comparing results across published genomes; it would be more informative during optimization of the assembly methodology or when testing different assembly methods (checking whether you are improving the assembly of the same underlying dataset). On this same topic, are the unmapped contigs from other assemblies used? The same question applies to the assembly done by the authors.

      Is the validation suitable for this type of data? No Additional Comments Mostly yes for the primary genome assembly. The pseudochromosome assembly analysis data validation is not convincing. If done at the contig level, the genome annotation would be solid.

      Is there sufficient information for others to reuse this dataset or integrate it with other data? No Additional Comments Recapping, the biomaterial information and the information on pseudochromosome assembly are missing, and explicit mention of GenBank IDs for the transcript assembly and RNAseq data used in annotation (instead of being in the reference) would improve re-use and integration. On the chromosome nomenclature, I don't understand why the authors don't mention the ongoing nomenclature being used by the community as reported in the NCBI cs10 RefSeq release.

      Any Additional Overall Comments to the Author I believe reporting on results based on the main evidences generated by the authors (in this current work and the previous one on transcriptome) would make this a stronger data release, i.e. contig/scaffold assemblies, the annotation of that based on your own RNAseq data . On a related note, have you tried using your short-reads data during assembly? Could your assembly have been improved if you used the Illumina data during assembly itself (hybrid assembly, scaffolding)? Cannabis genomes are known to be highly heterozygous, a report of this would be easy to conduct from your assembly vs your reads dataset especially the short-reads and would be an important finding.

      Recommendation Major Revision

    1. Bone mass loss

      Reviewer 1. Levi Waldron Wang et al. present a shotgun metagenomics cross-sectional study of fecal specimens from 361 elderly women with the primary objective of identifying correlations between bone mass density and microbial taxa. The methods are reasonable and I have no major concerns about this manuscript, only some moderate suggestions to improve reporting and discussion.

      For items answered “Yes” it would help to provide line numbers in the manuscript, as done for some but not all checklist items.

      3.0 Participants:

      It’s stated that “Fecal samples of 361 post-menopause women were randomly collected at the People’s Hospital of Shenzhen” – I suspect the correct word here is “arbitrarily” rather than “randomly”, unless a random number generator was used to select a random sample of all eligible patients. Some statement of how the women were recruited and how representative they are of all patients at the hospital is warranted. E.g. were they recruited from emergency room, a cancer ward, all outpatients, all admitted patients, etc? See also later comment about generalizability.

      4.9 Batch Effects:

      This is left “NA” – can the authors at least comment (in the manuscript) on the potential for batch effects affecting cases and controls differently – ie were they all prepared together or in separate libraries, and were they sequenced in the same runs or completely separated?

      8.0 Reproducible research:

      I appreciate that data have been posted at EBI and CNGB. Could the authors also comment on whether the metadata essential to the analysis are also provided, and that these can be linked to the sequence data? Although I’m glad to hear that “Others could reproduce the reported analysis from clean reads by the declared software and parameters” I do think that the code to reproduce the analysis should also be reported.

      8.1 Raw data access

      The checklist states “no raw reads for ethical” but the manuscript states “The sequencing reads from each sequencing library have been deposited at EBI with the accession number: PRJNA530339 and the China National Genebank (CNGB), accession number CNP0000398.” so there is a disconnect. Assuming human sequence reads are removed from the data, I’m not convinced of ethical reasons not to post microbial sequence reads, but it seems the authors have posted the microbial sequence reads.

      10.1 – 10.5 Taxonomy, differential abundance, other analysis, other data types, and other statistical analysis are all blank. Some should be “N/A” but others just seem to be overlooked.

      13.2 Generalizability: I think this is an important element to include in the discussion. How typical are your volunteers of all women that age?

      Minor:

      “Making these data potentially useful in studying the role the gut microbiota might play in bone mass loss and offering exploration into the bone mass loss process.” -> These data are potentially useful in studying the role the gut microbiota might play in bone mass loss and in exploring the bone mass loss process.

      The manuscript is well written, but there are a few other places that would benefit from some copy editing.

    2. Abstract

      Reviewer 2. Christopher Hunter Is the language of sufficient quality?

      Yes.

      Is the data all available and does it match the descriptions in the paper?

      No.

      Most of the data are provided as supplemental files in bioRxiv, but in Excel rather than CSV. These data files will need to be curated into a GigaDB dataset.

      Is the data and metadata consistent with relevant minimum information or reporting standards?

      Yes.

      Is the data acquisition clear, complete and methodologically sound?

      No.

      Comment. The consent by the patients to openly share all metadata is not clearly stated. Simply saying the study was approved by the bioethics review board does not mean consent was given to share the data, just that the institute consented to the study being done.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No.

      Comments: Maybe to someone with a good understanding of statistics there is sufficient detail; this is an area that a statistician should look at. For me, the descriptions of the analysis and the methods do not give anywhere near enough detail for me to either understand what was done or replicate it. The concept of "Gut metabolic modules" is not defined here, with just a reference to another paper; a brief explanation of what is meant by the term here would be useful.

      Is there sufficient data validation and statistical analyses of data quality?

      Yes.

      Comments. The sequences were filtered for human contaminants and adapter seq, also low quality reads were removed.

      Is the validation suitable for this type of data?

      No.

      Comments: The metadata is extensive but there are some basic points that are missing; collection date, antibiotic use, relatedness of samples/patients. Other less important details are also missing, like why and how this cohort was selected.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Yes

      Any Additional Overall Comments to the Author

      Yes

      • I am concerned about the open sharing of patient metadata without evidence that it was consented to prior to sharing.

      • A lot of metadata is collected and provided in the supplemental tables (which is great for reuse) but there are no explanations of what the values are. While some headers are self-explanatory, others are less so, e.g. what is CROSSL(pg/ml)? Or "Side crops"?

      • How were the various conditions diagnosed?

      • I see no indication of antibiotic usage in the cohort.

      • Are all the samples from different individuals? Was each sample a single bowel movement?

      • There is no background given as to how this cohort was selected or why.

      • There is no discussion of the bone mass density of a "normal" cohort. Does this cohort represent a normal cohort or is it already biased toward low or high density? Simply describing the cohort with respect to normal (T of -1 or above), low (-1 to -2.5) or osteoporosis (< -2.5) would be a help. I cannot see the T-scores included in the sTab1a file; are they computed from the L1-L4(z) values given?

      • There are a number of NA values in the table of sample metadata, but there is no explanation as to how these samples were handled in the analysis.

      • In general I feel that there is a lot of poorly described statistical analysis included that is not required as part of a data note. The focus should be on describing the data and ensuring the data and metadata are well explained.
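
      The three-way description of the cohort suggested in these comments could be produced with a rule as simple as the one sketched below. The thresholds are those quoted in the review (normal at -1 or above, low between -1 and -2.5, osteoporosis below -2.5). The function name, the handling of missing values as a separate "missing" category, and the example scores are assumptions for illustration and do not come from the manuscript or its supplementary tables.

```python
from collections import Counter


def classify_t_score(t_score):
    """Assign a bone-density category from a T-score.

    None stands in for an NA entry and is reported as "missing"
    rather than silently dropped (an assumption of this sketch).
    """
    if t_score is None:
        return "missing"
    if t_score >= -1.0:
        return "normal"
    if t_score >= -2.5:
        return "low"
    return "osteoporosis"


# Hypothetical T-scores, not taken from the study.
example_scores = [0.3, -1.0, -1.8, -2.5, -3.1, None]
summary = Counter(classify_t_score(t) for t in example_scores)
print(dict(summary))
```

      Reporting the resulting counts, including the number of missing values, would show at a glance whether the cohort is skewed toward low or high bone density.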
    1. Now published in Gigabyte doi: 10.46471/gigabyte.9

      Reviewer 2. Levi Waldron Chen et al. use 16S amplicon metagenomic sequencing to investigate urinary bacterial communities and their correlation to lifestyle and clinical factors, and reproductive tract (cervix, uterine cavity, vagina) microbiota in a cross-sectional study of 147 Chinese women of reproductive age. This is an important but challenging study, because of the threat of microbial contamination in low microbial biomass specimens such as the upper reproductive tract and urine.

      Checklist item 4.0

      The laboratory/center where laboratory work was done is not actually stated in lines 121-133.

      Negative controls and contamination

      Negative controls were generated for the 10 women undergoing surgery as sterile saline collected through the urine catheter. I assume this was done after the catheter was used for urine collection, but this should be stated.

      No negative controls were used for the self-collected urine specimens. However it seems likely that mid-stream self-collection would be more prone to contamination than catheter sampling by a doctor during surgery. Some possibilities for negative controls in this setting exist, such as including a sample of sterile saline with the self-collection kit and asking participants to fill another vial with it immediately following urine collection. The lack of negative controls for self-collected specimens should be stated as a limitation.

      The authors identify the risk of contamination from vulvovaginal region (lines 192-193) but not of cross-contamination. Discussion of the risk of cross-contamination during collection and subsequent processing, steps to mitigate and identify it, and comparison of results to bacterial taxa identified as common contaminants (e.g. Eisenhofer et al, PMID 30497919), is warranted.

      Comparability of urine sampling methods

      Since no specimens were collected by both self-collection and catheter sampling during surgery, there is no way to directly assess the accuracy of self-collection using catheter as a “gold standard”. This should be stated as a limitation.

      I could not find an analysis comparing the microbial composition of the catheter-collected and self-collected specimens. Some analysis comparing the two could help address the quality of self-collected specimens lacking negative controls.

      Discussion

      The authors do not include overall interpretation or limitations in the Discussion, saying under checklist items 12.0, 12.1, 13.0 “The discussion was suggested to focus on the potential uses according to the article format.” I think the editors should clarify to authors where these key discussion points belong. I think no article is complete without some discussion of limitations; see above for limitations noted of this study.

      Checklist item 13.2 Generalizability

      Authors state “The generalizability of the study is to women of reproductive age, and is shown in line 236-237” but on these lines I see description of statistical methods. This does deserve some discussion though, because the sample includes only women who underwent hysteroscopy and/or laparoscopy for conditions without infections, and has a number of exclusion criteria. This cannot be a representative cross-section of all women of reproductive age, so some discussion of how this sample may be different or similar to the population of all women of reproductive age is warranted. If the authors claim this sample should be generalizable to all women of reproductive age, that should be stated along with the intentional restrictions of the sampling and rationale of why these criteria are not expected to have any impact on the microbiota sampled.

      Clustering of patients

      Lines 212-213: cutting a hierarchical clustering into discrete groups can be done for any dataset, and without some analysis such as Prediction Strength (Tibshirani and Walther, J. Comput. Graph. Stat. 14, 511–528 (2005)) or another measure of cluster validation, this isn’t evidence of distinct patient groups and that should be stated clearly. It is OK to use the grouping to discuss general trends as long as care is made not to imply these are distinct patient subsets without further analysis. I am cautious about this because distinct subsets are intuitively appealing to many readers and the existence of distinct subsets can be harder to correct than to claim.

      Minor

      Line 241 “As the large-scale cohort” -> As a large-scale cohort

    2. Abstract

      Reviewer 1. Christopher Hunter Is the language of sufficient quality?

      Yes.

      Is the data all available and does it match the descriptions in the paper?

      No.

      Comment: Line 96-97: "In this study, a total of 147 reproductive age women (age 22-48) were recruited by Peking University Shenzhen Hospital (Supplementary Table 1)." But Sup. Table 1 has only 137 samples. Revise the text to explain that only 137 samples were used for the main analysis, with the 10 extra for validation.

      Line 103-104: "None of the subjects received any hormone treatments, antibiotics or vaginal medications within a month of sampling." Sup. Table 1 has a column for "Antibiotic use True/False", and 41 samples have "T". This needs explaining. It's possible the spreadsheet "True" is referring to a longer time period, but that's not explained anywhere.

      Line 110-112: "The samples from an additional 10 women were collected for validation purposes by a doctor during the surgery in July 2017." Where are these metadata? They are not included in Sup. Table 1.

      The data presented and discussed in "additional-findings.docx" are not included in the data files (yet). These should either be removed (as not included in the main article), or the methods should be expanded (to include negative control details) and this material added to the main text.

      Is the data and metadata consistent with relevant minimum information or reporting standards?

      Yes.

      Comment. The supplemental tables need some better legends/descriptions to help readers understand what data is in them.

      Is the data acquisition clear, complete and methodologically sound?

      Yes.

      Comment. The wet and bioinformatics methods could benefit from being included in protocols.io

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes

      Is there sufficient data validation and statistical analyses of data quality?

      Yes

      Is the validation suitable for this type of data?

      Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Yes

      Any Additional Overall Comments to the Author

      Yes

      The figures appear to be mixed up: what is displayed as Figure 1 in the manuscript appears to relate to the legend given for Figure 2, Figure 2 relates to the legend of Figure 3, and Figure 3 relates to the legend of Figure 1.

      Line 69: Chen et al., no citation number link provided.

      Line 74: Thomas-White et al. (2018), no citation number link provided.

      Line 79: Gottschick et al. (2017), no citation number link provided.

      Line 246-248: "The initial results here indicate a close link between the urinary microbiota with the general and diseased physiological conditions,... " As this study is looking at "healthy" individuals, I do not believe there is sufficient evidence to back up this statement about the "diseased" physiological conditions.

      Line 274-275: "The sequences of bacterial isolates have been deposited in the European Nucleotide Archive with the accession number PRJEB36743". This accession is not public, so I am unable to see what's included here.

      If available, we would like to see the real-time PCR data from the experiments made available in Real-Time PCR Data Markup Language (RDML).

      The additional cohort of 10 women is almost a different study: it didn't have the same 16S rRNA amplicon sequencing done, and was only a validation that some live bacteria can be cultured from urine in a small number of cases (3/10). If it is to be included, Table S5 should be updated to include the specific INSDC accessions for the submitted sequences. (The title of Table S5 in the file currently says Table 1.)

    1. Now published in Gigabyte doi: 10.46471/gigabyte.8

      Reviewer 1: Wei Zhao (Data Release Checklist)

      Is the language of sufficient quality? Yes

      Are all data available and do they match the descriptions in the paper? Yes

      Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide Yes

      Is the data acquisition clear, complete and methodologically sound? Yes

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes Additional Comments See attached PDF file

      Is there sufficient data validation and statistical analyses of data quality? No Additional Comments Check and filter potential contamination of the raw assembly.

      Is the validation suitable for this type of data? Yes Additional Comments But maybe no, see attached PDF

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Additional comments annotated on the paper and shared with the authors.

      Recommendation Major Revision

      Reviewer 2: Daniel Lang (Data Release Checklist)

      Is the language of sufficient quality? Yes

      Are all data available and do they match the descriptions in the paper? Yes

      Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide Yes

      Is the data acquisition clear, complete and methodologically sound? Yes

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes

      Is there sufficient data validation and statistical analyses of data quality? Yes Additional Comments There is an exceptionally high number of scaffolds for 10x, a bad BUSCO score and a discrepancy between the k-mer estimate and the FCM and assembly sizes that is unusual. That would have been worthy of discussion.

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Recommendation Accept

    1. Now published in Gigabyte doi: 10.46471/gigabyte.7

      This work has been peer reviewed in GigaByte, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Qiye Li Since I am unable to access the data submitted to NCBI or GigaDB, I cannot judge this issue currently. Please make sure that the gene annotation, repeat annotation, transcriptome assembly, gene expression matrix, and genetic variant data have been uploaded somewhere in addition to the raw reads and genome assembly.

      While the bioinformatic tools used in all the steps are indicated clearly, the parameters for many tools are not defined.

      What is the gap ratio (i.e. % of unclosed gaps or Ns) of the genome assembly? As far as I know, the raw Supernova assembly may have a high proportion of gaps, although the scaffold N50 is pretty good. Additional gap-closing steps (e.g. using GapCloser, RRID:SCR_015026) would improve the completeness of the assembly.
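
      The gap ratio asked about here is simply the fraction of assembly bases that are N. A minimal sketch, assuming the scaffold sequences have already been read into Python strings (the function name and the toy sequences are hypothetical):

```python
def gap_ratio(sequences):
    """Return the fraction of bases that are N (gap) across sequences.

    Case-insensitive, so soft-masked lowercase 'n' is also counted.
    """
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        return 0.0
    gaps = sum(seq.upper().count("N") for seq in sequences)
    return gaps / total


# Toy scaffolds for illustration only.
scaffolds = ["ACGTNNNNAC", "ggnnTT"]
print(f"{gap_ratio(scaffolds):.1%} of bases are gaps")
```

      Reporting this number before and after any gap-closing step would show how much the step actually recovered.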

      BUSCO analysis is suitable for assessing the completeness of the protein-coding gene space of the genome assembly. But a good BUSCO score does not necessarily mean good assembly completeness. Another conventional way to demonstrate the completeness of the assembly is to show the metrics of DNA read mapping, such as the overall mapping rate, % in proper pair, % of covered bases, etc.

      How complete is the gene set generated by the Fgenesh++ pipeline? I suggest that the authors provide a BUSCO score for the Fgenesh++ gene set as they did for the transcriptome assembly.

      Methods related to Alzheimer’s Genes Analysis: The methods used to identify the Alzheimer’s disease (AD) related human genes in antechinus seem to be flawed, as the authors only performed unidirectional searches for homologs in the antechinus gene set. I think the authors should identify bona fide orthologs of these AD-related genes in antechinus. The conventional way to determine orthologs between two species is based on a reciprocal best hit (RBH) strategy (i.e. RBHs between the human and antechinus gene sets).
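
      The reciprocal best hit idea can be sketched as follows, assuming similarity searches have already been run in both directions and reduced to (query, subject, score) tuples. The function names, gene identifiers and scores below are hypothetical; real analyses usually also apply e-value and coverage filters, which are omitted here.

```python
def best_hits(hits):
    """Map each query to its single highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items()
                  if best_ba.get(b) == a)


# Hypothetical search results (query, subject, bit score).
human_vs_antechinus = [
    ("APP", "ant1", 950.0), ("APP", "ant2", 400.0),
    ("PSEN1", "ant3", 800.0), ("APOE", "ant2", 300.0),
]
antechinus_vs_human = [
    ("ant1", "APP", 940.0), ("ant2", "APP", 410.0),
    ("ant3", "PSEN1", 790.0),
]
print(reciprocal_best_hits(human_vs_antechinus, antechinus_vs_human))
```

      In this toy case the third human gene has a hit in the forward search only, so a unidirectional search would report it while the reciprocal criterion would not.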

      Reviewer 2: Walter Wolfsberger PRJNA664282 accession number is not found on NCBI. Is it scheduled to be released with the publication?

      Appropriate tools were used for appropriate analyses. The Y chromosome identification approach seems sound.

      The bioinformatic approaches the authors used are sound, with the right tools applied to each analysis.

      The preprint is well worded and easy to understand and follow. It provides a good amount of context that justifies the extra analyses done in the publication. The assembly quality is adequate, with a relatively low N50 but good completeness scores, given that mammalian genomes have higher levels of low-complexity/repetitive content. The metrics presented adhere to the scope of GigaByte, and the data themselves are valuable to the scientific community.

    1. Genome sequencing

      Reviewer 2: Mahul Chakraborty

      Reviewer Comments to Author: In "Two high-quality de novo genomes from single ethanol-preserved specimens of tiny metazoans (Collembola)." Schneider et al. described de novo genome assemblies of two tiny field-collected Collembolan specimens. The authors collected high-quality genomic DNA from the specimens following a Pacific Biosciences recommended protocol for ultra-low input libraries, amplified them, and generated adequate sequence coverage to generate contiguous assemblies. This is a significant step forward in generating de novo genome assemblies from small amounts of tissues and cells and therefore will be a useful guide not only for people who are studying whole organisms but also for people who are studying variation between cell or tissue types within an individual. I have some minor comments:

      "They were preserved in 96% ethanol, kept at ambient-temperature for one day until they would be stored at -20°C for 1.5 months, until DNA extraction." - Was the preservation at -20°C a deliberate step to see the effect of this treatment on sequencing or just a conscious choice for specimen preservation?

      The specific conditions used (e.g. the time and speed of centrifugation) for the g-Tube shearing need to be added to the Methods.

      "Circularity was validated manually, and nucleotide bases were called with a 75% threshold Consensus.?" - Please clarify what the 75% threshold consensus is.

      "We then performed another estimation of the genome size by dividing the number of mapped nucleotides by mode of the coverage distribution" - Why was this done? Did the authors suspect the Genomescope estimate to be incorrect?

      "We compared our new genomes sequenced to previous Collembola assemblies that were generated with long read and sometimes additional short read data." - This statement needs citations for the previous Collembola assemblies.

      The authors used blastn and megablast to search for the beta-lactam synthesis genes in the new assembly. Tblastx might be more appropriate.

      "For D. tigrina a total of 20,22 Gb HiFi data (Q>=20) was generated," - Do you mean 20.22?

      "For S. aquaticus a total of Gb HiFi data (Q>=20) was generated" - The number before Gb is missing.

      The authors report only one assembly from hifiasm, which I presume is the primary assembly. Given that the authors assembled diploid individuals, I am curious whether hifiasm assembled the alternate haplotype sequences.

      "The insect genomes have higher BUSCO scores (96.5 and 99.6%), but lower contiguity (Table 2, Fig. 3)." - This statement is incorrect. A number of insect genomes are more contiguous than the assemblies presented here, including Drosophila melanogaster (PMID: 31653862) and several other Drosophila species, Anopheles stephensi (DOI:10.1101/2020.05.24.113019), and Anopheles albimanus (PMID: 32883756).
    2. ABSTRACT

      Reviewer 1. Arong Luo

      Reviewer Comments to Author: First, I'd like to commend the authors on attempting to sequence whole genomes of tiny metazoans, which account for a large part of biodiversity in nature and yet are difficult to sequence. Second, I am impressed by their ethanol-preserved specimens, which make genome sequencing more applicable and attractive in practice. We must admit that sometimes we cannot use fresh specimens directly for genome sequencing. Thus, I think this manuscript is really of scientific significance for specific fields such as insects. I found that the focal part of their sequencing protocol is the "whole genome amplification-based Ultra-Low DNA Input Workflow for SMRT Sequencing (PacBio)" throughout the text, which of course is very complex. So, I suggest the authors provide a flowchart showing the critical or main steps of their workflow, so that readers can easily understand it and refer to it in future projects.

      Finer points:

      Line 35: I suggest providing specific/important information for the 'novel' protocol herein.

      Line 119-120: Are the specimens later used for DNA extraction also morphologically identified?

      Line 130-131: Was the DNA extract selected randomly or based on certain measurements?

      Line 393: Delete the dot '.'

    1. Gigabyte doi: 10.46471/gigabyte.6

      Reviewer 2. Yunyun Lv

      Do you understand and agree to our policy of having open and named reviews, and having your review included with the published papers. (If no, please inform the editor that you cannot review this manuscript.) Yes

      Is the language of sufficient quality? Yes

      Are all data available and do they match the descriptions in the paper? No

      Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide No

      Is the data acquisition clear, complete and methodologically sound? Yes

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes

      Is there sufficient data validation and statistical analyses of data quality? Yes

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? No

      Any Additional Overall Comments to the Author This study presents a chromosome-level genome assembly of the common dragonet. The Hi-C method was applied to generate the high-quality genomic assembly. The result is valuable for further genomic analysis. However, some basic questions should be addressed in the article to give clearer insight.

      Line 35 findings section: The total number of annotated genes and their quality should be evaluated and presented in the findings section. Lines 73-75: This sentence contains much speculation. I feel it should be removed, or it should just mention the sympatry of their living location. Line 220: The section mainly describes the method of gene annotation; however, the corresponding result is absent. These results are important for performing various comparative genomic analyses. Thus, a detailed description of the gene annotation result should be included in the revision. Line 238: Availability of supporting data; I searched for the project accession number in the NCBI database, but found no result. Thus, the supporting data are currently unavailable.

      Line 33, typo: “syngnatiforms” should be “Syngnathiformes”

      Recommendation Major Revision

    2. Now published

      Reviewer 1. Chao Bian

      Do you understand and agree to our policy of having open and named reviews, and having your review included with the published papers. (If no, please inform the editor that you cannot review this manuscript.)

      Yes

      Is the language of sufficient quality? Yes

      Are all data available and do they match the descriptions in the paper? Yes

      Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide Yes

      Is the data acquisition clear, complete and methodologically sound? Yes

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes

      Is there sufficient data validation and statistical analyses of data quality? Yes

      Is the validation suitable for this type of data? Yes

      Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

      Any Additional Overall Comments to the Author?

      This paper, entitled ‘Chromosome-level genome assembly of a benthic associated Syngnathiformes species: the common dragonet, Callionymus lyra’, provides a reference genome of the common dragonet with high contig and scaffold N50 values. Genome size estimation and gene and repeat annotation were also performed in this study. The analysis approaches, such as genome assembly and annotation, are solid and well performed.

      However, the gene annotation included no homology-based annotation. In addition, why did the authors not use HISAT or TopHat to map the RNA reads onto the genome to predict gene structure? I rarely see transcriptome-based annotation that relies on a Trinity assembly.

      In addition, I still think that the first published genome should include at least one analysis that illuminates the molecular mechanism behind a distinctive trait of this species. Presenting only an assembly and some genes will greatly reduce the impact of, and interest in, this fascinating fish species.

      Some minor mistakes should be corrected: The number of decimal places should be consistent throughout the paper. Line 41, 538 Mbp should be 538.0 Mbp. Line 45, 27.66% should be 27.7%. Line 76, change “suggest” to “suggests”. Line 83 and line 94, for “see [9]” and “by [10]”, the author’s name should be indicated in the text, like “see XX’s study [9]”. Line 104, “tissue” should be “tissues”. Line 120 and line 131, change ‘562’ to ‘562.0’, and change ‘645’ to ‘645.0’. Line 156, “explains” should be “explain”.

      Recommendation

      Major Revision

    1. Now published in Gigabyte doi: 10.46471/gigabyte.2 Qiye Li (1, 2), Qunfei Guo (1, 3), Yang Zhou (1), Huishuang Tan (1, 4), Terry Bertozzi (5, 6), Yuanzhen Zhu (1, 7), Ji Li (2, 8), Stephen Donnellan (5), Guojie Zhang (2, 8, 9, 10). Affiliations: 1 BGI-Shenzhen, Shenzhen 518083, China; 2 State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; 3 College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; 4 Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 611731, China; 5 South Australian Museum, North Terrace, Adelaide 5000, Australia; 6 School of Biological Sciences, University of Adelaide, North Terrace, Adelaide 5005, Australia; 7 School of Basic Medicine, Qingdao University, Qingdao 266071, China; 8 China National Genebank, BGI-Shenzhen, Shenzhen 518120, China; 9 Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, 650223, Kunming, China; 10 Section for Ecology and Evolution, Department of Biology, University of Copenhagen, DK-2100 Copenhagen, Denmark. For correspondence: guojie.zhang@bio.ku.dk

      This work has been peer reviewed in GigaByte, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Review 1. Walter Wolfsberger

      Is the language of sufficient quality? Yes.

      Is the data all available and does it match the descriptions in the paper? Yes.

      Is the data and metadata consistent with relevant minimum information or reporting standards?

      Comment: The accession number for GigaDB provided in the paper does not yield any results in the GigaDB search. Using the species name works though.

      Is the data acquisition clear, complete and methodologically sound?

      Comment: Although the paper makes clear that a significant portion of the data was discarded during the early QC step, it gives no reason for this and does not describe the problem that was encountered. In total, the research group produced 396 Gb of raw sequence (211 Gb from short-insert and 185 Gb from long-insert libraries), of which only 180 Gb (130 Gb short-insert and 55 Gb long-insert, the latter never mentioned) were used later for the assembly. In a FastQC analysis of a single library I encountered extreme levels of sequence duplication, which might indicate that the libraries were not diverse or that there was a PCR artifact (such as overamplification) that might have led to this low-quality initial data. The parameters for the SOAPnuke tool used in early QC are not defined.
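The duplication concern raised above can be checked roughly without any external tool. The following is a minimal sketch, assuming an uncompressed four-line-per-record FASTQ input; the function names and the toy reads are hypothetical, and the calculation (share of reads that exactly repeat another read) is a simplification, not FastQC's own duplication estimate.

```python
from collections import Counter
import io


def read_sequences(fastq_handle):
    """Yield the sequence line (second line) of each 4-line FASTQ record."""
    for index, line in enumerate(fastq_handle):
        if index % 4 == 1:
            yield line.strip()


def duplication_rate(sequences):
    """Return the fraction of reads that are exact copies of another read."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every distinct sequence counts once; the remainder are duplicates.
    return 1.0 - len(counts) / total


# Toy input (hypothetical reads): 4 reads, 2 distinct sequences.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTGA\n+\nIIII\n"
    "@r4\nACGT\n+\nIIII\n"
)
rate = duplication_rate(read_sequences(example))
print(f"duplication rate: {rate:.2f}")
```

For a real library, the same functions could be applied to an open file handle in place of the in-memory example. Exact-match counting holds every distinct read in memory, so subsampling may be needed for large files.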

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Is there sufficient data validation and statistical analyses of data quality? Yes.

      Is the validation suitable for this type of data?

      Comments: The assembly followed a logical order, with appropriate tools used at every step.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Comment: Although the resulting assembly was of moderate quality (highly fragmented, but with a good BUSCO score), a randomly picked library showed a very high sequencing duplication rate, which indicates that there might be problems for future data reuse. Addressing these issues, or at least acknowledging them, would benefit the whole report and the dataset.

      Additional Comments:

      I don't think physical coverage is widely used in genome assembly at present; given the nature of mate-pair reads, it inflates this statistic. I would put the resulting assembly statistics in a table, including all of the metrics (N50, number of contigs, number of scaffolds, average contig length, etc.) and adding the BUSCO score, as the current formatting is not readable.
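As an illustration of the kind of summary table suggested above, here is a minimal sketch that computes N50 and a few basic statistics from a list of sequence lengths. The function names and the example lengths are hypothetical and are not taken from the manuscript under review.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def assembly_stats(lengths):
    """Collect basic assembly metrics into a dictionary."""
    count = len(lengths)
    total = sum(lengths)
    return {
        "count": count,
        "total_bp": total,
        "mean_bp": total / count if count else 0.0,
        "longest_bp": max(lengths) if count else 0,
        "n50_bp": n50(lengths),
    }


# Hypothetical contig lengths, for illustration only.
contigs = [100, 80, 60, 40, 20]
stats = assembly_stats(contigs)
for name, value in stats.items():
    print(f"{name:<12}{value}")
```

Running the same function on contig lengths and on scaffold lengths gives the two columns of such a table; a BUSCO score would be added from the BUSCO report itself.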

      Review 2. Nandita Mullapudi

      Is the language of sufficient quality? Yes.

      Is the data all available and does it match the descriptions in the paper? Yes.

      Is the data and metadata consistent with relevant minimum information or reporting standards?

      Comment: I am unaware of defined reporting standards for assembly reports, however, all sample preparation, data generation and analysis methods have been described in adequate amount of detail.

      Is the data acquisition clear, complete and methodologically sound? Yes.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Comment: The following additional details would help to enable reproduction: (1) Parameters used for data pre-processing with SOAPnuke, as well as related adapter sequences etc. These would be necessary to reproduce the data clean-up step. (2) Memory, processor and time details of the computational resources used for assembly. (3) Was the Platanus assembly attempted using different parameters, and how were the parameters reported in the paper arrived at? (4) For gene prediction, several vertebrate sequences were used, but the details/source of these reference sequences are missing.

      Is there sufficient data validation and statistical analyses of data quality?

      Comments: 1) One approach to validating an assembly would be to use more than one assembly tool and compare the results. (This may or may not be within the scope of this study.) 2) With respect to the validation performed by mapping back paired end reads to the assembly, there is no discussion of the ~14% of paired end reads that did not map back in the expected orientation. Would tools like REAPR (https://www.sanger.ac.uk/science/tools/reapr) or SEQuel (https://bix.ucsd.edu/SEQuel/man.html) be appropriate to address this? (given the high level of heterozygosity in L. d. dumerilii as reported here).
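The read-pair validation discussed above can be summarised from an alignment file. The sketch below assumes plain-text SAM records and uses the standard SAM flag bits (0x1 paired, 0x2 properly paired, 0x100 secondary, 0x800 supplementary). The function name and the example records are hypothetical. "Properly paired" is whatever the aligner defines it to be, so the result may not match how the roughly 14% figure in the manuscript was derived.

```python
def proper_pair_fraction(sam_lines):
    """Return the fraction of paired primary records flagged as properly paired."""
    paired = 0
    proper = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not a valid alignment record
        flag = int(fields[1])
        if flag & (0x100 | 0x800):
            continue  # skip secondary and supplementary alignments
        if not flag & 0x1:
            continue  # skip reads that are not part of a pair
        paired += 1
        if flag & 0x2:
            proper += 1
    return proper / paired if paired else 0.0


# Hypothetical records; only the first two columns matter for this sketch.
records = [
    "@HD\tVN:1.6",
    "r1\t99\tchr1\t100",   # paired, proper pair
    "r1\t147\tchr1\t300",  # paired, proper pair
    "r2\t65\tchr1\t500",   # paired, not proper
    "r2\t129\tchr2\t900",  # paired, not proper
    "r1\t355\tchr3\t10",   # secondary alignment, ignored
]
fraction = proper_pair_fraction(records)
print(f"properly paired: {fraction:.2%}")
```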

      Is the validation suitable for this type of data? Yes.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Comments: It may also be helpful to make available the set of cleaned reads, to enable reproduction of the assembly pipeline.

    1. Now published in GigaScience doi: 10.1093/gigascience/giab045 Florian Heyl (1). Affiliation: 1 Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Georges-Köhler-Allee 106, 79110 Germany. For correspondence: heylf@informatik.uni-freiburg.de, backofen@informatik.uni-freiburg.de

      This work has been peer reviewed in GigaScience, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1. (Eric Van Nostrand) http://dx.doi.org/10.5524/REVIEW.102771 Reviewer 2. (Nejc Haberman) http://dx.doi.org/10.5524/REVIEW.102769 Reviewer 3. (William Lai) http://dx.doi.org/10.5524/REVIEW.102770

  3. Jun 2021
  4. gigabytejournal.com
  5. May 2021
  6. Apr 2021
  7. Mar 2021
  8. Feb 2021
  9. Jan 2021
  10. Dec 2020
    1. The tutorials from the Galaxy Training Network along with the frequent training workshops hosted by the Galaxy community provide a means for users to learn, publish, and teach single-cell RNA-sequencing analysis.

      See the write-up by the Earlham Institute for more on how this training is progressing

  11. Oct 2020
  12. Aug 2020
  13. Jul 2020
  14. Jun 2020
  15. May 2020
  16. rvhost-alpha.rivervalleytechnologies.com
    1. Katherine James, Natural History Museum, Department of Life Sciences, Cromwell Road, London SW7 5BD, UK; Emma Betteridge, Wellcome Sanger Institute, Cambridge CB10 1SA, UK

      This Q&A features some discussion of her contribution to this project

  17. Apr 2020
    1. Timothy P L Smith, US Meat Animal Research Center, US Department of Agriculture, State Spur 18D, Clay Center, NE 68933, USA. Correspondence address: Timothy P. L. Smith, US Meat Animal Research Center, US Department of Agriculture, Clay Center, NE 68933, USA. E-mail: tim.smith2@usda.gov http://orcid.org/0000-0003-1611-6828

      See the Q&A with Benjamin Rosen and Timothy Smith in GigaBlog for more insight http://gigasciencejournal.com/blog/dna-day-2020-cattle-reference-genome/

  18. Mar 2020
    1. Table S1. Online tools for TALEN and CRISPR/Cas9. Collected online tools for TALEN and CRISPR/Cas9 are presented in this table. Updates can be accessed on GitHub [107]. Table S2. Commercial services for TALEN and CRISPR/Cas9. Collected commercial services for TALEN and CRISPR/Cas9 are presented in this table. Updates can be accessed on GitHub [107]. Table S3. Representative applications of genome editing. A summary of the representative applications in different organisms.

      Given that new methods, kits, and services continue to be rapidly developed and updated, we set up an editable version on a GitHub wiki, and readers are encouraged to update it. See https://github.com/gigascience/paper-chen2014/wiki

  19. Feb 2020
  20. Oct 2019