  1. Mar 2023
    1. computational

      Reviewer name: Alessio Milanese (revision 1)

      Many thanks to the authors for their detailed responses to my comments. The edits have improved the manuscript and I have only a few minor comments.

      COMMENT 1: In Figure 4b I can see that "Tenericutes" and "Planctomycetes" are both in orange, meaning that they both have been measured only by mOTUs. But in the main text I read "mOTUs failed to detect the Tenericutes group, while MetaPhlAn failed to detect Planctomycetes", which is wrong.

      COMMENT 2: I would improve the figure legends. In particular, the description of 4b is the same as in 2a, 3a and 1: "The size of the discs represents the total amount of relative abundance at the corresponding clade in the ground truth, or the tool prediction if that clade is not in the ground truth. If the tool predictions agree, a disc is colored half orange and half teal. The proportion of teal to orange changes with respect to the disagreement in the prediction of that clade's relative abundance between the two tools being compared. Highlighted blue text represents clades where the difference between the relative abundances of the prediction and ground truth exceeds 30%". I would suggest having this description only for figure 1, and then a shorter description for the following figures.

      COMMENT 3: The second color is described sometimes as "green" and sometimes as "teal". For clarity, I would suggest using just one of the two.

    2. Metagenomic

      Reviewer name: Francesco Asnicar

      The manuscript by Sarwal et al. presents a novel tool, named TAMPA, for the standardized visualization of metagenomic taxonomic profiles; it also enables a more general assessment of the performance of taxonomic profiling tools by providing an extensive set of different metrics. It would be interesting to see (if possible) the comparison of three (or more) taxonomic profiles at the same time. The evaluations shown are always binary, but in a real-case scenario where a user would like to evaluate 3 or 4 different taxonomic profiling tools on their community, it would be great to be able to do so. Beyond evaluating the agreement between two (or more) taxonomic profiling tools, it is not clear how TAMPA can drive improvement on biologically relevant questions. Although it is clear, as the authors state in the introduction, that different taxonomic profilers (with different parameter settings) can produce very different taxonomic representations, to support this statement it will be important to show at least one case where TAMPA suggests a different taxonomic interpretation of a microbial community that is also biologically relevant. Figures in general appear to be of low quality and stretched; please consider improving them, as they are the main point of TAMPA.

    1. identify

      Reviewer name: Raul Guantes (Revision 1)

      In the revised version and the response letter, the authors have clarified all the questions and addressed the comments raised in my previous report, and I think the manuscript is now suitable for publication.

    2. techniques

      Reviewer name: De-Shuang Huang (Revision 1)

      I think the paper can be accepted.

    3. entities

      Reviewer name: Thomas Schlitt

      The manuscript "contrast subgraphs allow comparing homogeneous and heterogeneous networks derived from omics data" introduces and illustrates the application of contrast subgraph analysis to gene expression, protein expression and protein-protein interaction data. The method can be applied to weighted networks. The authors give a good description of the method and the context of other available methods. The authors apply the contrast subgraph analysis to three different omics data sets; overall these analyses are not very detailed and do not yield surprising results, but they provide a nice illustration of the potential usefulness of contrast subgraph analysis in the context of omics data. In my opinion this is really where the merit of the paper is: to promote and make accessible the method to a wider audience of researchers in the field of bioinformatics/molecular biology. The authors have also applied their method to brain imaging derived networks, but that work is not part of this publication.

      The contrast subgraph analysis is particularly interesting for data that is collected under different conditions but for the same set of nodes (i.e. genes, proteins, ...), i.e. where the nodes present do not change (much), but their interaction strengths differ between conditions. It remains to be seen where this method can deliver unique value that is not achievable by other means, but the approach is very intuitive. Its rationale can be readily understood, reducing the temptation to use it as a "black box" without critically questioning the results, as might be the case for more complex methods. One of the downsides of the presented approach is that it does not provide any measures of confidence in the results; while there is a parameter alpha that allows some tuning, little information is given on how to choose a suitable value for this parameter (which obviously depends on the data).

      Another issue that may receive too little attention is how to derive graph representations from experimental omics data in the first place. Usually these methods do not yield yes/no answers; rather, we obtain a matrix of pairwise measurements (e.g. correlation of coexpression), and to obtain a graph a threshold is applied to these numbers to decide whether an edge is present or not. Various methods have been proposed to choose thresholds, but in the end, moving from a full matrix to a graph representation means losing some information. It would be interesting to see a deeper analysis of how much this thresholding influences the outcomes of the proposed method; this question is obviously linked to obtaining some confidence information on the results. Overall, the method described here is very interesting, it shares downsides with other graph-based methods (thresholding), the biological examples given are brief but illustrative for the use of the method, and the manuscript is well readable. The manuscript stimulates the reader to add this method to their own toolbox and to apply it to interesting data sets to see if it yields results that were not obvious from other approaches.

      Minor comments:

      - Figure captions, especially 1-3: please provide more information in the figure captions to make the figures "readable" on their own, without a need for the reader to refer back to the text. The captions for Figs 1-3 are almost identical, yet very different data is shown - a clear indication that important information is missing in the captions, such as what the underlying data is. Please explain all terms used in the figure in its caption: here, what is "GeneRatio"? In Figs A/B, what is the x-axis showing for the violin plots?

      - Figure 3c and the paragraph on protein vs mRNA coexpression (p. 2-5): are the differences really that striking? In 3C the box plots do not look that different; the super low p-values are probably due to the very large number of data points, but I am not sure they are really that meaningful here (effect size?).

      - Figure 4 is too small; nodes are barely visible and colours cannot be distinguished.

      - Algorithm 1 and its description in the text: I would probably move the description of the algorithm from the text to a "figure caption" for the algorithm box, to make it easier for the reader to find the definitions of the terms.
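      The reviewer's thresholding concern can be illustrated with a small sketch (the synthetic data and function names here are invented for illustration, not taken from the manuscript): moving from a full correlation matrix to a graph means picking a cutoff, and the resulting edge set changes sharply with that choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "omics" matrix: 50 samples x 20 genes, sharing one latent factor
# so that genuine co-expression exists alongside noise.
latent = rng.normal(size=(50, 1))
data = 0.6 * latent + rng.normal(size=(50, 20))

corr = np.corrcoef(data, rowvar=False)  # 20 x 20 pairwise correlations

def edge_count(corr, threshold):
    """Edges kept when |correlation| >= threshold (undirected, no self-loops)."""
    adj = np.abs(corr) >= threshold
    np.fill_diagonal(adj, False)
    return int(adj.sum()) // 2

for t in (0.1, 0.3, 0.5):
    print(f"threshold {t}: {edge_count(corr, t)} edges")
```

      Repeating such a sweep and checking how stable the contrast subgraphs remain across cutoffs would be one concrete way to address the confidence question the reviewer raises.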

    4. Biological

      Reviewer name: Raul Guantes

      In this manuscript the authors apply the method of contrast subgraphs (developed among others by some of the authors), which identifies salient structural differences between two networks with the same nodes, to several biological co-expression and PPI networks. This method adds to the extensive toolkit of network analyses that have been used in the last two decades to extract useful biological information from omics data. In particular, the authors identify subgraphs containing maximum differences in connectivity between two networks, and basically use functional annotations to assign biological meaning to these differences. Of note, contrast subgraphs is not the only method that provides 'node identity awareness' when comparing networks. For instance, identification of network modules or community partitions are common methods to identify groups of nodes that highlight potentially relevant structural differences between two networks, and these have been applied to many biological and other types of networks. I find the manuscript well motivated and clearly written in general, but lacking detailed information on part of the Methods. The discussion connecting their findings on structural differences between networks to potential biological functions is also a bit vague and could be worked out in more detail. I feel that the paper is potentially acceptable in GigaScience after a revision to provide more details on the methods and on their findings. Here are my comments:

      Methods:

      1.- Coexpression networks for luminal and basal cancer subtypes:

      1a.- The authors don't give enough information about the data they are using to build these networks. How many samples/points are they using to calculate correlations? Do they correspond to different patients, or to expression dynamics after some treatment? Is there any preprocessing of the data (e.g. differential expression with respect to healthy tissue), or do they just take all quantified transcripts and proteins with minimal filtering (they only specify that they filter out genes with FPKM < 1 in more than 50 samples in the transcriptomic data)? How many nodes and links do the final coexpression networks have?

      1b.- To determine links between genes/proteins they calculate Spearman rho and transform it to (0.5(1+rho))^12 to give a 'signed' network. But since the Spearman correlation ranges between +1 and -1, this transformed quantity lies between 0 and 1, so I don't see the sign. Moreover, why the exponent 12 in the transformation? Please clarify, because I don't know whether in the end they are analyzing weighted, unweighted or signed networks, or whether they somehow 'keep track' of the sign of rho. They spend some space in Methods discussing the extension of the contrast subgraph method to signed networks, but I don't know if they finally apply it, since coexpression networks built in this way and PPI networks are not signed.

      1c.- Do they keep all links or use some cutoff on rho by magnitude/significance? Presumably yes, because otherwise the final network would be a clique and unmanageable, but they don't give any information on that. Again, what is the final size (nodes/links) of the coexpression networks?

      1d.- As for coexpression networks based on relative abundance data such as those from transcriptomic/proteomic experiments, it is well known that correlations may be misleading due to the possibly large number of spurious correlations (see for instance Lovell et al., PLoS Computational Biology 11(3) (2015) e1004075). The use of correlations requires some justification, and at least an acknowledgement of the potential pitfalls of this measure.

      1e.- How many nodes/links are in the first contrast subgraphs shown in Figures 1-2? Is the degree calculated within the whole network or just within the extracted subgraph?

      1f.- Page 4, last paragraph before the 'Protein vs mRNA coexpression in breast cancer' section: 'the results obtained with the two independent breast cancer cohorts show good agreement, with the top differential subgraphs significantly overlapping for both the basal-like and the luminal-A subtypes (Fisher test p < 2.2 · 10^-16)'. I guess the overlap is in terms of functional annotations; how are this overlap and the corresponding statistical test calculated?

      2.- Protein versus mRNA coexpression:

      2a.- Please provide again information about the number of samples, how the 'subset of breast cancer patients included in the TCGA' is chosen, and whether transcriptome and proteome are quantified under the same conditions (relevant if one is to directly compare both networks). Provide also details about the number of links/nodes of each subnetwork and the corresponding subgraph. Since transcriptomic data are usually provided in FPKM and proteomic data in counts (sum of normalized intensities of each ion channel), are the data further normalized to facilitate their comparison?

      3.- PPI networks:

      3a.- Since they are going to compare PPIs across different 'contexts', a brief explanation of the tissue origin and peculiarities of the three cell lines investigated is in order.

      3b.- Please provide details about the number of proteins/interactions in the contrast subgraphs obtained from the comparisons of the three cell lines. Since these subgraphs are going to be compared to RNA expression data from a different dataset, please specify whether these data are obtained from the same cell lines. Why are PPI data compared only to upregulated genes (and not to both up- and down-regulated ones)? Also, concerning the criterion for 'upregulation' (logFC > 1), is this log base 2? How do they quantify the overlap between proteins in the PPI and upregulated genes? They just state that they 'did indeed significantly overlap the corresponding up-regulated genes'. How large is the overlap and what does 'significantly' mean?

      3c.- The discussion of the results shown in Figure 4 is not clear to me. First, the authors state 'We thus analyzed in more depth the first contrast subgraphs obtained from the comparison of the HEK293T PPI network with those obtained from the other two cell lines'. Does this mean that they analyze four subgraphs (2 for HEK vs. HUVEC and 2 for HEK vs. Jurkat)? When they say that the 'top contrast subgraphs were identical', do they mean that the four subgraphs contained exactly the same nodes? Also, in the main text Figure 4 seems to contain the subnetwork of these subgraphs with only the nodes annotated as 'ribosome biogenesis' and 'signal transduction through p53', with the links being the PPIs. But in the caption to Figure 4 they state that 'green edges join proteins involved in the two biological processes' (probably a subset of the PPIs). Please clarify. Why do they show only the comparison between HEK and HUVEC, and not between HEK and Jurkat, if the same nodes are present?

      Interpretation of results:

      1.- Coexpression networks in two cancer subtypes: they find that the subgraph with the stronger connections in the basal subtype is enriched in 'immune response' and the subgraph denser in the luminal subtype is enriched in categories related to microenvironment regulation. If they identify clearly enriched genes, they should discuss in more depth their known roles in connection with these two functions in their biological context. This would enrich and support their findings. It is tempting to speculate that, since the basal type is less aggressive, cancer cells are challenged by the immune system of the organism but, once they develop mechanisms to evade the immune system (becoming more aggressive as in the luminal subtype), they are committed to manipulating their microenvironment to proliferate. Is there any evidence for this in these subtypes of cells?

      2.- Comparison of transcriptomic and proteomic networks: From their analyses in Figure 3 they claim in the Discussion that 'adaptive immune system genes are more connected at the transcriptional level, while innate immune systems are more connected at the proteomic level'. This is a rather vague statement based on the functional enrichment analysis. First, they should identify and discuss in more detail the genes/proteins responsible for this enrichment, to see whether their documented function supports these speculations (and since the data they use are from breast cancer, I don't know how general this observation is or whether it is specific to this type of tumor). Moreover, caution should be exerted when interpreting these coexpression networks: the most connected transcripts are not necessarily those that are being simultaneously translated. Also, since apparently the network is not signed, the abundances of connected transcripts may be anticorrelated. Finally, Figure 3 is not clear: which panel corresponds to the transcriptomic subgraph and which one to the proteomic one? This should be specified in the caption or with titles in the panels.

      Minor comments:

      - The distinction between 'heterogeneous' and 'homogeneous' networks in the Introduction is a bit confusing, as they classify mRNA and protein coexpression networks as 'heterogeneous'. Why is that? Is it because they are built from many different samples/individuals or time-course data?

      - Although I have nothing against how the authors display differences between the first contrast subgraphs in panels A-B of Figures 1 and 2, it may be more eye-catching to display these differences as usual boxplots or violin plots, perhaps with a test for significant differences between the means of both degree distributions.
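      Regarding the transformation questioned in comment 1b: assuming it is the standard WGCNA-style signed soft threshold (an assumption; the manuscript's exact formula should be checked), the intended expression is likely (0.5(1+rho))^12, with the whole term raised to the power. A minimal sketch shows why the resulting weights lie in [0, 1] while still distinguishing positive from negative correlations:

```python
def signed_adjacency(rho, beta=12):
    """Map a correlation rho in [-1, 1] to an edge weight in [0, 1].

    The whole term 0.5 * (1 + rho) is raised to the power beta, so strongly
    negative correlations map to ~0 and strongly positive ones to ~1; the
    'sign' survives only as the ordering of weights, not as negative values.
    """
    return (0.5 * (1.0 + rho)) ** beta

for rho in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(rho, signed_adjacency(rho))
```

      This matches the reviewer's observation: a network built this way is weighted but not signed in the strict sense, since anticorrelated pairs simply receive near-zero weights.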

    5. Abstract

      This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giad010), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: De-Shuang Huang

      The authors proposed an algorithm based on contrast subgraphs to characterize biological networks, so as to analyze the specificity and conservation between different samples. It is interesting, and I think there are some problems that need to be clarified.

      1. Sub-graphs are generated by dividing the whole graph in a certain way, and the similarity and difference of the samples are described by the comparison between the sub-graphs. The authors should discuss the advantages of the proposed approach in a non-heuristic way compared with the previous methods. Besides that, I wonder why the subgraphs need to be non-overlapping.

      2. For TCGA or other databases, I think the authors should state the details of the samples, such as the number of samples, sequencing technology, batch effects, etc. In addition, the authors should describe the relationship between the subgraphs and GO modules to explain the results and draw some biological conclusions.

      3. The authors performed a similar analysis on protein networks, compared the results with RNA-seq, and drew some conclusions. I'm a little confused about whether the GO enrichment analysis of the proteomics data maps protein IDs to gene IDs. If so, the authors could easily combine the transcript co-expression and protein co-expression networks through ID-to-ID mapping, and I look forward to the results of such an analysis.

      4. I would like to know how the proposed method handles heterogeneous graphs: by treating heterogeneous graphs as homogeneous graphs to generate subgraphs? I didn't figure out which dataset corresponds to the heterogeneous graph scenario.

      5. In addition to the elaboration of results such as degree and density differences between subgraphs, I would like to see the relationships between these results and the biological problems.

      6. The authors may consider citing the following articles on networks in molecular biology:

      Barabasi AL, Oltvai ZN. Network biology: understanding the cell's functional organization. Nature Reviews Genetics, 2004, 5(2): 101-113.

      Zhang Q, He Y, Wang S, Chen Z, Guo Z, Cui Z, et al., Huang DS. Base-resolution prediction of transcription factor binding signals by a deep learning framework. PLoS Computational Biology, 2022, 18(3): e1009941.

      Hu JX, Thomas CE, Brunak S. Network biology concepts in complex disease comorbidities. Nature Reviews Genetics, 2016, 17(10): 615-629.

      Guo ZH, You ZH, Wang YB, Huang DS, Yi HC, Chen ZH. Bioentity2vec: attribute- and behavior-driven representation for predicting multi-type relationships between bioentities. GigaScience, 2020, 9(6): giaa032.

      Guo ZH, You ZH, Huang DS, Yi HC, Zheng K, Chen ZH, Wang YB. MeSHHeading2vec: a new method for representing MeSH headings as vectors based on a graph embedding algorithm. Briefings in Bioinformatics, 2021, 22(2): 2085-2095.
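      The ID-to-ID mapping asked about in point 3 can be sketched in a few lines; all identifiers, mappings, and edge weights below are invented purely for illustration and do not come from the manuscript.

```python
# Hypothetical mapping from protein IDs to gene IDs.
protein_to_gene = {"P001": "TP53", "P002": "BRCA1", "P003": "MYC"}

# Co-expression edges in each layer, as (node_a, node_b) -> weight.
protein_edges = {("P001", "P002"): 0.8, ("P002", "P003"): 0.4}
rna_edges = {("TP53", "BRCA1"): 0.6, ("TP53", "MYC"): 0.7}

def normalize(edges, mapping=None):
    """Optionally re-key nodes via `mapping`, then sort each pair so both
    networks live in one canonical node space."""
    out = {}
    for (a, b), w in edges.items():
        if mapping is not None:
            a, b = mapping[a], mapping[b]
        out[tuple(sorted((a, b)))] = w
    return out

protein_net = normalize(protein_edges, protein_to_gene)
rna_net = normalize(rna_edges)

# Edges measured in both layers can now be compared directly.
shared = {e: (protein_net[e], rna_net[e]) for e in protein_net if e in rna_net}
print(shared)
```

      Once both layers share gene-level node identities, comparing edge weights (or running contrast subgraphs across the two layers) becomes a straightforward join, which is presumably the analysis the reviewer would like to see.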

    1. Conclusions

      Reviewer name: Alban Gaignard (Report on revision 1)

      The reading of the revised paper would have been easier by providing updates in a different color, but thank you for taking into account the comments and remarks, and clearly answering the raised issues. I also appreciated the extension of the discussion. However, I still have some concerns regarding the proposed approach. The proposed platform targets both workflow sharing and testing. It is explicitly stated in the abstract: "the validation and test are based on the requirements we defined for a workflow being reusable with confidence". It is clear in the paper that tests are realized through the GitHub CI infrastructure, possibly delegated to a WES workflow execution engine. I inspected Figure 3 as well as the wf_params.json and wf_params.yml provided on the demo website, but this does not seem to be enough to answer questions such as: how are tests specified? How can a user inspect what has been done during the testing process? What is evaluated by the system to assess that a test was successful? I tried to understand what was done during the testing process, but the test logs are not available anymore (Add workflow: human-reseq: fastqSE2bam · ddbj/workflow-registry@19b7516 · GitHub). Regarding the findability of the workflows, in line with the FAIR principles, the discussion mentions a possible solution which would consist in hosting and curating metadata in another database. To tackle workflow discoverability between multiple systems accessible on the web, we could expect the Yevis registry to expose semantic annotations, leveraging Schema.org (or any other controlled vocabulary) for instance. This would also make sense since EDAM ontology classes are referred to in the Yevis metadata file (https://ddbj.github.io/workflow-registry-browser/#/workflows/65bc3bd4-81d1-4f2a8886-1fbe19011d81/versions/1.0.0).

    2. Background

      This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giad006), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Kyle Hernandez

      Suetake et al. designed and developed a system to publish, validate, and test public workflows, utilizing existing standards and integrating with modern CI/CD tools. Their design wasn't myopic: they relied heavily on their own experiences, work from GA4GH, and interaction with the large workflow development communities. They were inspired by the important work from Goble et al. that applies the FAIR standards to workflows. As someone with a long history of workflow engine development, workflow development, and workflow reusability/sharing experience, I greatly appreciate this work. There are still unsolved problems, like guidelines on how to approach writing tests for workflows, for example, but their system is one level above this and focuses on ways to automate the validation, testing, reviewing/governance, and publishing into a repository to greatly reduce unexpected errors for users. I looked through the source code of their Rust-based client, which was extremely readable and developed to industry-level standards. I followed the README to set up my own repositories, configure the keys, and deploy the services successfully on the first walk-through. That speaks to the level of skill, testing, and effort in developing this system and is great news for users interested in using it. At some level it can seem like a "proof of concept", but it is one that is also usable in production with some caveats. The concept is important, and implementing it will hopefully inspire more folks to care about this side of workflow "provenance" and reproducibility. There are so many tools out there for CI/CD that are often poorly utilized by academia, and I appreciate the authors showing how powerful they can be in this space. The current manuscript is fine and will be of great interest to a wide-ranging set of readers. I only have some non-binding suggestions/thoughts that could improve the paper for readers:

      1. Based on your survey of existing systems, could you possibly make a figure or table that showcases the features supported/not supported by these different systems, including yours?

      2. Thoughts on security/cost safeguards? Perhaps beyond the scope, but it does seem like a governing group needs to define some limits on the testing resources and be able to enforce them. If I am a bad actor and programmatically open up 1000 PRs of expensive jobs, I'm not sure what would happen. Actions and artifact storage aren't necessarily free after some limit.

      3. What is the flow for simply updating to a new version of an existing workflow? (Perhaps this could be in your docs, not necessarily this manuscript.)

      4. CWL is an example of a workflow language that developers can extend to create custom "hints" or "requirements". For example, Seven Bridges does this in Cavatica, where a user can define AWS spot instance configs, etc. WDL has properties to configure GCP images. It seems like in these cases, tests should only be defined to work when running "locally" (not with some scheduler/specific cloud environment). But the authors do mention that tests will first run locally in the user's environment, so that does kind of get around this.

      5. For the "findable" part of FAIR, how possible is it to have "tags" of a sort associated with a workflow record so things can be more findable? I imagine when there is a large repository of many workflows, being able to easily narrow down to the specific domain of interest could be helpful.


    4. analysis

      Reviewer name: Samuel Lampa

      The Yevis manuscript makes a good case for the need to be able to easily set up self-hosted workflow registries, and the work is a laudable effort. From the manuscript, the implementation decisions seem to be done in a very thoughtful way, using standardized APIs and formats where applicable (Such as WES). The manuscript itself is very well written, with a good structure, close to flawless language (see minor comment below) and clear descriptions and figures.

      Main concern

      I have one major gripe, though, blocking acceptance: the choice to only support GitHub for hosting. There is a growing problem in the research world that more and more research depends on the single commercial actor GitHub, for seemingly no other reason than convenience. Although GitHub to date can be said to have been a somewhat trustworthy player, there is no guarantee for the future, and ultimately this leaves a lot of research in an unhealthy dependency on this single platform. As a small note, one recent change is the proposed removal of the promise not to track its users (see https://github.com/github/site-policy/pull/582). Such a central infrastructure component for research as a workflow registry carries an enormous responsibility here, as it may greatly influence the choices of researchers in the years to come, by encouraging whatever is "easier" or more convenient to do with the tools and infrastructure available. With this in mind, I find it unacceptable for a workflow registry supporting open science and open source work to only support one commercial provider. The authors mention that technically they are able to support any vendor, and also on-premise setups, which sounds excellent. I ask the authors to kindly implement this functionality. Especially the ability to run on-premises registries is key to encouraging research to stay free and independent from commercial concerns.

      Minor concerns

      1. I think the manuscript is missing a citation to this key workflow review, as a recent overview of the bioinformatics workflows field, for example together with the current citation [6] in the manuscript: Wratten L, Wilm A, Göke J (2021). Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nature Methods, 18(10), 1161-1168. https://www.nature.com/articles/s41592-021-01254-9
      2. Although it might not have been the intention of the authors, the following sentence sounds unnecessarily subjective and appraising, without data to back it up (rather, this would be something for the users to evaluate):

        "The Yevis system is a great solution for research communities that aim to share their workflows and wish to establish their own registry as described." I would rather expect wording similar to: "The Yevis system provides a [well-needed] solution for ...", which I think might be closer to what the authors intended as well. Wishing the authors the best of luck with this promising work!

    1. The orb-web

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad002), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Jonathan Coddington

      This paper presents the first uloborid spider genome--and it is a chromosome-level assembly. Genomes of this family are important because the orb web is supposedly independently and convergently evolved in this group. Although my expertise is not in the technology and informatics of genome sequencing, it appears to be well done.

      Figure 1: "A. geniculate" -- spelling. "N. clavipes" should be "T. clavipes". Table S1: "Number of Componenet Sequences" -- typo. Text: "single exon We found a" -- typo; "can be ascribed by" -- "can be inferred by"?; "an Araneid orb-weaver" -- "araneid" is usually not capitalized; "♂X1X2/♀X1X1X2X2.[48]" should be "♂X1X2/♀X1X1X2X2 [48]". You might want to be careful about citing Purcell & Pruitt; see https://purcelllab.ucr.edu/blog6.html and other questions about Pruitt's work.

      Re methods, it would be of interest to know what the HMW DNA fragment sizes were (expressed as kb or Mb), although TapeStations are not very accurate. For people who collect spiders with the intent to yield HMW DNA, such data are important. Data are scarce, so any facts are significant.

      Any homologs of the Pyriform spidroin (PySp) in Acanthoscurria? Piriform silk attachment points are a synapomorphy of araneomorph or "true" spiders. Liphistiomorph and mygalomorph spiders do not (cannot?) make point attachments, and the inability to make point attachments either to substrate or silk-silk point attachments probably constrains/ed the evolution of web architectures in non-araneomorph spiders. Therefore finding homologs to PySp spidroins in non-araneomorph spiders is of great interest to explain araneomorph web architecture diversity.

      Likewise, tubuliform spidroin (TuSp) is probably a synapomorphy of entelegyne spiders, with derived female genitalia--a "flow-though" sperm management system. Eggsacs occur widely in non-entelegyne spiders, so it is a mystery why entelegynes have specialized spigots, glands, and spidroins for the same purpose. Indeed, the particular function of tubuliform silk is not clear. Any thoughts on this? E.g.

      It is good to see attention paid to the mitochondrial genome, as many whole genome studies ignore it. In spiders, early work claimed that tRNA's appeared to be peculiar. Masta and Boore. 2004. The Complete Mitochondrial Genome Sequence of the Spider Habronattus oregonensis Reveals Rearranged and Extremely Truncated tRNAs. Molecular Biology and Evolution, Volume 21, Issue 5, May 2004, Pages 893-902. Any comments on U. diversus tRNAs from that point of view?

      Finally, any comments on evidence for or against the convergent evolution of the orb web? Homology between the pseudoflagelliform and flagelliform spidroins would be pertinent. The intro does raise expectations that some of the macro / larger evolutionary questions will be addressed in the paper, but many, see above, are only cursory or not too much. Perhaps include a sentence in intro acknowledging this, but saying that this paper intends to present the genome and address sex chromosomes, but other topics? For example the sections on some of the spidroins do not extensively discuss comparisons with other spider genomes.

      Reviewer 2: Hui Xiang

      In this study, the authors generated extensive genome-sequencing and RNA-seq data and produced, through a rather complicated merging approach, a genome assembly of a spider with a novel phylogenetic position. The genome undoubtedly adds a novel and important resource for a deeper understanding of spider evolution. However, there are still severe issues that need to be addressed.

      1. There are huge amounts of sequencing data from different samples. However, I do not think that merging different assemblies yields a good final genome. Given the high heterozygosity, Illumina and ONT data from different individuals are quite difficult to use for assembling a clean genome. As shown in Table 2, the HiFi-only assembly is not obviously inferior to the merged one, and it is much better at avoiding redundancy. I strongly suggest that the authors adopt the genome assembly built from the HiFi data of one individual instead of merging two sets of assemblies. The Illumina and Nanopore assemblies may still be helpful for fully deciphering the silk proteins.

      2. The proportion of repeats is affected by assembly quality. The highly heterozygous genome assembly is a complicated merge of diverse batches of data, so the real quality might not be as good as the authors describe, and the quality of the repeat annotation is especially hard to evaluate. Hence the statements on genome size (lines 193-200) are not convincing.

      3. On the assembly of the RNA-seq data: the authors obtained huge amounts of data, but additional data do not help recover novel transcripts once the data are saturated. More importantly, assembly of short reads is of limited use for obtaining long transcripts.

      4. On whole-genome duplication: the authors did not provide solid evidence that a WGD occurred in the U. diversus genome; they only demonstrated two Hox clusters. The synteny analysis was quite confusing and does not help confirm the WGD. They need to provide more solid genome-wide evidence, or otherwise substantially downplay these statements.

      5. The identification of the sex chromosomes is still vague, and the statements are not well organized or convincing. "While 8 of the 10 pseudochromosomes had a median read depth of 40 ± 2, pseudochromosomes 3 and 10 were outliers, with read depths of 36 and 33, respectively." This difference in sequencing depth is not convincing on its own. As far as I know, the authors sequenced both female and male samples, so why did they not directly compare the depth of the two putative sex chromosomes between the sexes to provide stronger evidence?

      Other points: 1. The survey of chromosome-level spider genomes is incomplete; as far as I know, there is a chromosome-level black widow genome, which the authors need to add. 2. The authors need to release the sequences of the spidroins they identified and described.

      Reviewer 3: Zhisheng Zhang, Ph.D

      The manuscript GIGA-D-22-00169 presents a chromosome-level genome of the cribellate orb-weaving spider Uloborus diversus. The assembly reinforces evidence of an ancient arachnid genome duplication and identifies complete open reading frames for every class of spidroin gene. And the authors identified the two X chromosomes for U. diversus and identify candidate sex-determining genes.

      The methods are well suited to the aims of the study and clearly described, and the manuscript is well written.

      Minor comments:

      1. In Figure 1B, I noticed that the estimated divergence times within the Araneae are given; I think a reference should be added, or the estimation method described in detail.

      2. There is something wrong with the table format, such as Table1, 2, 5 and Table 6.

      3. Line 70: "chromosome- scale" changes to "chromosome-scale".

      4. Lines 147-148: line-break error.

      5. Line 458: "[48]" in the wrong location.

      6. Line 511-512: In the genome of the spider Uloborus diversus, on which chromosomes are the genes "sex lethal (sxl)" and "doublesex (dsx)" located?

      7. Line 515-516: "The 534 shared sex-linked genes in these three species, 14 are predicted to be DNA/RNA-binding" -- do these sex-linked genes differ at the RNA level between males and females?

      8. Line 685: "Dovetail Chicago and Dovetail Hi-C Sequencing" should be bold.

      9. Line 764: in "We then used the Trinity assembler43 v.2.12.0", the number 43 may be redundant.

      10. Some software tools lack RRID numbers, such as "BRAKER2" (line 223), "NOVOplasty" (line 245), "tRNAscan-SE" (line 790), "RepeatModeler" (line 773), "RepeatMasker" (line 774), "EMBOSS" (line 797), and so on.

      11. Lines 780 "using the BRAKER 2 pipeline" changes to "using the BRAKER2 pipeline".

      12. Lines 950: "Literature Cited" changes to "Reference".

      13. Lines 952-953: incorrect citation. The World Spider Catalog is an online resource; the version and the date on which the data were accessed should be added, and the author name should be changed to World Spider Catalog.

    1. Background Malignant Pleural Mesothelioma (MPM)

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac128), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Saurabh V Laddha

      The authors did a fantastic job of integrating MPM multi-omics datasets and created an integrative and interactive map for users to explore these datasets. MPM is a rare and understudied cancer type, so such resources are very useful for moving the field forward at the molecular level. The comprehensive data are well presented, the manuscript is well written in its explanation of the complex genomics datasets, and all the figures are well explained and clear.

      Minor points:

      - The authors mention that an evaluation of tumor purity was done by pathological review. Did the authors also use molecular (genomic) data to estimate tumor purity, and if so, how well did the two agree? This is a very important factor for interpreting the genomic results, as the data were sequenced at 30X.

      - Along the same lines, RNA-seq can also be used to estimate tumor purity, and a clear picture of tumor purity would be very helpful for users.

      - It is not entirely clear from the methods section whether the same MPM samples were sequenced at the DNA, RNA, and DNA-methylation levels. A brief explanation or table would make this easy for users to understand.

      - The recent WHO classification divides MPM into three histopathological types. Did the authors perform any unsupervised analysis of these comprehensive data to probe MPM heterogeneity or to replicate the WHO classification, or did they recover the WHO subtypes from the molecular data? A brief analysis of, or comment on, histological versus molecular classification would certainly move the MPM research field forward, as researchers have found substantial differences between the two and the field is moving toward molecular-based classification in the clinic.

      Reviewer 2: Jeremy Warner

      In this paper, the authors describe a new public resource for the molecular characterization of malignant pleural mesothelioma (MPM), which they describe as the most comprehensive to date. They perform WGS, transcriptome, and methylation arrays for 120 patients with MPM sourced through the MESOMICS project and integrate this dataset with an additional several hundred patients from previously published datasets.

      Although I cannot independently verify their claim that this is the largest and most comprehensive dataset for MPM, it is quite impressive and expansive. The pipeline utilized is well described and the results at all stages are transparently shared for prospective users of this dataset.

      The description of the methods to identify and remove germline variants is interesting, although the length somewhat detracts from the main goal of the paper in describing an MPM resource. Perhaps, this part could be condensed with the technical details presented in supplement. This comment pertains to both the Point Mutations and Structural Variants sections.

      Additional moderate concerns:

      There are insufficient details provided on the clinical and epidemiological parameters. Indirectly, it would appear that sex, age class, and smoking status are the clinical parameters - but what are the age classes? Is smoking status binary ever/never, or more involved? There ought to be a data dictionary provided as a supplemental table which describes each clinical/epidemiological variable, along with the possible values that the variable can take on. It should additionally be explained why other important clinical parameters are not available - most importantly, the presence of accompanying pulmonary comorbidity such as chronic obstructive pulmonary disease (COPD) and the existence of conditions that might preclude the use of standard systemic therapies, such as renal disease precluding the use of platinum agents.

      Context: I would like to see more here about the role of asbestos in the etiology, including what might be known about the pathophysiology of asbestos fibers at the molecular level. Also, there is nothing here about the evolution of treatment for MPM; the latest "state-of-the-art" regimens (platinum doublet + bevacizumab [MAPS; NCT00651456] and dual checkpoint inhibition [Checkmate 743; NCT02899299]) have reported median survival in the 18-month range, which is distinctly better than the median survivals quoted by the authors. Finally, I would like to see one or more direct references to the clinical trials where molecular heterogeneity has "fueled the implementation of drug trials for more tailored MPM treatments".

      Data Description: All specimens in the MESOMICS study are said to be collected from surgically resected MPM; this also appears to be the case for the integrated multi-omic studies from Bueno et al. and Hmeljak et al. and this should be explicitly indicated. Somewhere, it should also be explicitly discussed that this is an important limitation in the future utility of this data - surgical specimens are convenience samples and while they do provide important information, they lack treatment exposure. Given that many if not most patients with MPM will survive to 2nd or 3rd line systemic therapy, and that 1st line is fairly standardized, a knowledge of induced mutations is going to be essential to the ultimate goal of precision medicine.

      Minor concerns:

      The labels in the figures (e.g., Figure 2 - "Unmapped..too.short") are human-readable but could still be presented in a more friendly fashion. All acronyms should be defined.

      Reviewer 3: Mary Ann Tuli

      I have been asked to review the process of accessing the controlled data cited in this study to ensure that the process is clear and smooth. The study is available from the European Genome-phenome Archive (EGA) under accession number EGAS00001004812 (https://ega-archive.org/studies/EGAS00001004812). The paper is clear about how to obtain the DAA.

      The study has three datasets.

      I can confirm that the author was very prompt in his response to me requesting the DAC, in providing the DAA and in responding to the queries I had when completing the DAA. The completed DAA was sent to the EGA by the author on 29-Jul, and EGA responded within 3 working days, stating access had been granted. This is an excellent response time, so I conclude that the process of obtaining the DAA and the EGA making the data available to the user is very good.

      Today (1-Sep) I have attempted to gain access to the data via EGA. I was easily able to login to my EGA account and see that the datasets are available to me to download. Users need to download data using the EGA download client - pyEGA3. EGA provides a video on how to install the client, but I hit a problem and require technical support.

      I emailed the EGA help desk but have not had a response yet. I was quite surprised to receive a response from the author and have learnt that EGA include the owner of the study in RT tickets so they see any communication. I commend the author for his prompt response to my ticket (though it didn't solve my problem).

      I cannot hold on to this review for any longer, and I am not yet in a position to comment on the nature of the data held within this study.

      I do have concerns that the process of accessing controlled data held in the EGA is not straightforward. Users need to watch a 12-minute video to learn how to install the download client (and may need to install programs on their computer). There is a FAQ, but it is very technical. This is not an issue for the author to resolve, though.

      I understand the author has some minor revisions to make, so hopefully I should have a response from the EGA help desk before a final decision needs to be made (?).

    1. Background

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac126), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Shiping Liu

      How to model the statistical distribution of gene expression is a basic question in single-cell sequencing data mining. Dharmaratne and colleagues looked in detail at the distribution of every gene. Using generalized linear models (GLMs), the authors present a new program, scShapes, which matches each gene to one of four distributional shapes: Poisson, Negative Binomial (NB), Zero-Inflated Poisson (ZIP), and Zero-Inflated Negative Binomial (ZINB). As the authors show in this manuscript, not all genes fit a single distribution, whether NB or Poisson, and some genes are actually best fit by the zero-inflated models because of the high drop-out rate of modern single-cell sequencing, e.g., 3' tag sequencing. It has become popular to employ GLMs in single-cell data mining recently, to both praise and criticism, so fitting a specific model to each individual gene is a good step forward. The downside is the computing cost, especially as the number of cells sequenced in current research reaches into the millions, and datasets are expected to grow even bigger. This poses a major obstacle to applying the method presented here: how can the calculation with the mixed model, or scShapes, be sped up? The authors also applied scShapes to several datasets, including the metformin, human T-cell, and PBMC data. They found potential genes whose distribution shape changed but which were not easily identified by other methods, demonstrating that scShapes can identify subtle changes in gene expression.
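      As an illustration of the per-gene model selection described above, the toy sketch below fits Poisson, NB, and ZIP models to one gene's counts with simple moment estimators and picks the shape by AIC. This is only a sketch of the idea, not scShapes itself (which fits full GLMs with covariates and uses likelihood-ratio tests), and all function names here are hypothetical.

```python
import numpy as np
from scipy import stats

def poisson_fit(x):
    # Poisson MLE: rate equals the sample mean
    lam = x.mean()
    return stats.poisson.logpmf(x, lam).sum(), 1

def nb_fit(x):
    m, v = x.mean(), x.var(ddof=1)
    if v <= m:
        return -np.inf, 2   # no overdispersion: NB offers nothing over Poisson
    r = m * m / (v - m)     # method-of-moments estimates of (r, p)
    p = r / (r + m)
    return stats.nbinom.logpmf(x, r, p).sum(), 2

def zip_fit(x):
    # coarse grid over the zero-inflation weight pi, moment-matching the
    # Poisson component so that (1 - pi) * lam equals the sample mean
    best = -np.inf
    for pi in np.linspace(0.01, 0.90, 90):
        lam = x.mean() / (1 - pi)
        pmf = (1 - pi) * stats.poisson.pmf(x, lam)
        pmf = np.where(x == 0, pmf + pi, pmf)
        best = max(best, np.log(pmf).sum())
    return best, 2

def best_shape(x):
    # pick the shape with the lowest AIC = 2k - 2*log-likelihood
    x = np.asarray(x)
    fits = {"Poisson": poisson_fit(x), "NB": nb_fit(x), "ZIP": zip_fit(x)}
    aic = {name: 2 * k - 2 * ll for name, (ll, k) in fits.items()}
    return min(aic, key=aic.get)
```

      On a gene with a genuine excess of zeros, the ZIP shape wins despite the AIC penalty for its extra parameter, which is the kind of drop-out-driven signal the zero-inflated models are meant to capture.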

      Major points: (1) We do not see any details about the metformin dataset: the sequencing depth and quality, the number of genes/UMIs per cell, and so on. This makes it hard to evaluate the quality and reliability of the results generated by scShapes. If this dataset belongs to another manuscript and cannot be presented at the same time, I suggest the authors use an alternative dataset, as many published single-cell datasets could be used in this study.

      (2) Even though the authors take cell type into account in the GLM, I wonder, for a specific gene, whether the distribution shape changes across cell types. If so, the problem becomes more complex: the distribution shape would need to be modeled for each gene in every cell type separately.

      (3) In identifying differential gene expression with scShapes, the authors did not consider the influence of differing cell numbers, or of the proportions of cells in the different cell types. A possible way to evaluate or eliminate this bias is to downsample from a large dataset, rather than just simulating 2k-5k total cells from the PBMC data, and thereby assess the influence of both the total cell number and the cell-type proportions.
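      The downsampling suggested above can be sketched as follows: subsample cell indices to a chosen total size and cell-type composition, so that the influence of total cell number and of cell-type proportions can be varied independently. The helper name and interface are hypothetical, not part of scShapes.

```python
import numpy as np

def downsample_cells(labels, n_total, proportions, rng):
    """Subsample cell indices to a target total size and cell-type composition."""
    labels = np.asarray(labels)
    chosen = []
    for cell_type, frac in proportions.items():
        pool = np.flatnonzero(labels == cell_type)   # indices of this cell type
        k = int(round(frac * n_total))               # cells to draw for it
        chosen.append(rng.choice(pool, size=k, replace=False))
    return np.concatenate(chosen)
```

      Sweeping `n_total` and `proportions` over a grid would then show how much each factor biases the differential-distribution calls.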

      (4) The authors should present a comparison of the computational cost of the different methods: accuracy, time, and memory consumption for different numbers of cells. I suggest the authors use a much larger dataset, because current single-cell studies may include millions of cells, and the ability to process big data is very important for the method to become widely used.

      Minor points: (1) No figure legends for Fig.2 c and d.

      (2) It is unclear whether 30% of all genes undergo a shape change, or only 30% of the genes remaining after the filtering pipeline. Please clarify this detail.

      Reviewer 2: Yuchen Yang

      In this manuscript, the authors present a novel statistical framework, scShapes, which uses a GLM approach to identify differential distributions in genes across scRNA-seq data from different conditions. scShapes quantifies gene-specific cell-to-cell variability by testing for differences in the expression distribution, and was shown to identify biologically relevant switches in gene distribution shape between conditions. However, several concerns still need to be addressed.

      1. In this study, the authors compared scShapes to scDD and edgeR. However, besides these two, there are many other methods for calling DEGs from scRNA-seq data. Wang et al. (2019) systematically evaluated the performance of eight methods specifically designed for scRNA-seq data (SCDE, MAST, scDD, D3E, Monocle2, SINCERA, DEsingle, and SigEMD) and two methods for bulk RNA-seq (edgeR and DESeq2). It would therefore be worthwhile to also compare scShapes to other methods, such as SigEMD, DEsingle, and DESeq2, which were reported to perform better than scDD or edgeR.

      2. When scShapes was compared to scDD, the authors mainly focused on distribution shifting. However, it would be helpful to users to present a Venn diagram showing the numbers of genes detected by both scShapes and scDD, and the genes specifically identified by each tool. In addition, the authors showed functional enrichment results for the DEGs identified by scShapes; it would also be worthwhile to perform enrichment analysis for the genes detected by both tools, and for those specific to scShapes or scDD.
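      The shared and tool-specific gene sets behind such a Venn diagram reduce to simple set operations, as in this minimal sketch (the helper name and gene symbols are illustrative only):

```python
def overlap_summary(genes_a, genes_b):
    """Gene sets for a two-way Venn diagram: shared and tool-specific calls."""
    a, b = set(genes_a), set(genes_b)
    return {"both": a & b, "only_a": a - b, "only_b": b - a}
```

      The three resulting sets are exactly the inputs needed for the per-category enrichment analyses suggested above.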

      3. Since scShapes detects differential gene distributions between conditions, it would be good to show users how to interpret significant results biologically. For example, the authors mention that RXRA is differentially distributed between Old and Young and between Old and Treated; what does this result mean? Can this differential distribution be associated with differential expression?

      4. In the Discussion, the authors mention that scRATE is another tool that can model droplet-based scRNA-seq data. It would be clearer to discuss why the authors developed their own algorithm rather than using scRATE to model the distributions.

      5. In the Introduction, the authors discuss zero counts in scRNA-seq data and present evidence in the Results. Since 2020, several publications have also focused on this issue, such as Svensson (2020) and Cao (2021); these discussions should be included in the manuscript.

    1. Motivation

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac125), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Ruibang Luo

      In this paper, the authors propose xAtlas, an open-source NGS variant caller. xAtlas is a fast and lightweight caller with performance comparable to the other benchmarked callers. The benchmark comparison on multiple popular short-read platforms (Illumina HiSeq X and NovaSeq) demonstrates xAtlas's capacity to identify small variants rapidly with desirable performance. Although xAtlas is limited in calling multi-allelic variants, its high sensitivity (~99.75% recall on ~60x benchmarking datasets) and desirable runtime (<2 hours) enable it to rapidly filter candidates and serve as an important quality-control step for further analysis.

      The authors presented a detailed explanation of xAtlas's workflow, design decisions and have done complete experiments in benchmarking, while there are still some points the authors need to discuss further listed as follow:

      The authors reported the performance in multiple coverages of the HG001 sample and the benchmarking result of HG002-4 samples by measuring the concordance with the GIAB truth set (v3.3.2). I noticed that GIAB had updated the GIAB truth sets from v3.3.2 to v4.2.1 for the Ashkenazi trio. The updated version included more difficult regions like segmental duplications and the Major Histocompatibility Complex (MHC) to identify previously unknown clinically relevant variants. Therefore, it would be helpful if the author could give a performance evaluation using the updated truth sets to give a more comprehensive performance of the proposed caller.

      In the Methods section, the authors state the three main stages of the xAtlas variant-calling process: read preprocessing, candidate identification, and candidate evaluation. In the candidate-evaluation stage, hand-crafted features (base quality, coverage, reference and alternative allele support, etc.) are fed into a logistic regression model to classify true variants and reference calls. But in Figure 1, the main workflow of xAtlas, only model scoring is shown, and the evaluation details are not visible. It would be useful if the authors could enrich Figure 1 with more detail to ensure consistency with the Methods and facilitate reader understanding.

      In Figure 2, the authors report the performance comparison of xAtlas against other variant callers on the HG001 dataset. I noticed that the x-axis is F1-score while the y-axis is true positives per second. Plotting these two unrelated metrics against each other might confuse readers; we suggest the authors make separate comparisons for the two metrics (for instance, precision-recall curves for accuracy measurement, and a runtime comparison of the variant callers for speed benchmarking).
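      For reference, the F1-score on that axis is the harmonic mean of precision and recall, computed directly from call counts as in this small sketch (the function name is illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """Standard variant-calling accuracy metrics from call counts:
    tp = true positives, fp = false positives, fn = false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

      Sweeping a score threshold over the calls and re-computing these values at each cutoff yields the precision-recall curve suggested above.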

      Zheng, Zhenxian on behalf of the primary reviewer

      Reviewer 2: Jorge Duitama

      The manuscript describes a variant caller called xAtlas, which uses a logistic regression model to call SNPs after building an alignment and pileup of the reads. The manuscript is clear. The software is built with the aim of being faster than other solutions. However, I have some concerns relative to the method and the manuscript.

      1. Unfortunately, the biggest issue with this work is that the gain in speed comes with an important sacrifice in accuracy, especially for indels. I ran xAtlas on two different benchmark datasets, and the accuracy, especially for indels and other complex regions, was about 20% lower than that of other solutions. Although the difference was smaller, xAtlas is also less accurate than other software tools for SNV calling. It is well known that even a simple SNV caller can achieve high sensitivity and specificity (see results from https://doi.org/10.1101/gr.107524.110). However, several SNV errors can be generated by incorrect alignment of reads around indels and other complex regions. For that reason, most of the work on variant detection focuses on mechanisms for indel realignment or de novo mini-assembly to increase the accuracy of both SNV and indel detection; the Strelka paper is a great example of this (https://doi.org/10.1038/s41592-018-0051-x). The manuscript does not mention whether any procedure has been implemented to realign reads or otherwise increase indel-calling accuracy. This is critical if xAtlas is meant to be used in clinical settings.

      2. The manuscript looks outdated in terms of evaluation datasets, metrics, and available tools. Since high values of standard precision and sensitivity are easy to achieve with simple SNV callers, metrics such as false positives per million base pairs (FPPM), proposed by the developers of the synthetic-diploid benchmark dataset, should be used for a clearer assessment of the accuracy of the different methods (https://doi.org/10.1038/s41592-018-0054-7). Regarding benchmark experiments, Syndip should also be used for benchmarking. To actually support the claim that xAtlas is reliable across heterogeneous datasets (as stated in the title), further datasets should be tested, as has been done for software tools such as NGSEP (https://doi.org/10.1093/bioinformatics/btz275). In terms of tools, both DeepVariant and NGSEP should be included in the comparisons.
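      The FPPM metric mentioned above simply normalizes the false-positive count by the size of the benchmarked regions, which makes error rates comparable across datasets of different sizes; a minimal sketch (function name illustrative):

```python
def fppm(false_positives, region_size_bp):
    """False positives per million base pairs of the benchmarked regions."""
    return false_positives * 1_000_000 / region_size_bp
```

      Unlike raw precision, this normalization does not saturate near 1.0, so small differences between already-accurate callers remain visible.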

      3. Regarding the metrics proposed by the authors, I do not think it is good practice to merge results on accuracy and efficiency, given that the accuracy in this case is lower than that of other solutions, which is an important issue for clinical settings. The supplementary table should also report sensitivity and precision for indels, not only for SNVs.

      4. The SNV-calling method, and particularly the genotyping procedure, should be described in much more detail. The manuscript describes the general pileup process, then mentions some general filters for read alignments, and then mentions that logistic regression is applied. However, it is not clear which data are used for this regression, or in general how allele counts and quality scores are taken into account. A much deeper description of the logistic regression model should be included in the manuscript.

      5. There are better methods than PCA to show clustering of the 1000 Genomes samples. A structure analysis is more suitable for population genomics data and shows the different subpopulations more clearly.

      6. Finally, regarding the software: genotype calls produced by xAtlas should include a genotype quality (GQ) FORMAT field to assess genotyping accuracy. For single-sample analysis the QUAL value can be used (although this is not entirely correct), but for population VCFs the GQ field is very important as a per-datapoint measure of genotyping quality. Regarding population VCF files, it is also not clear, either from the in-line help or from the GitHub site, how they should be constructed.

    1. The tiger

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac112), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Jong Hwa Bhak

      This manuscript presents assemblies of Bengal tigers. It is a great improvement over the two previous tiger genome assemblies, and the assembly quality is unprecedented (exceeding perhaps any feline genome in terms of contiguity).

      This represented a ~50x improvement in genome contiguity (see materials and methods). PanTigT.MC.v2

      What was the most important factor in this big jump of improvement in length?

      the overall contiguity was better than the domestic cat reference genome

      The quality comparison section is informative.

      We identified the "repetitive elements" in the genome by combining both

      ==> repeat elements is better.

      How close are the two genomes (MC & SI)?

      This reviewer finds it a great contribution to existing feline genome assemblies. The authors have done all the usual QC and constructed really high quality assemblies.

      Reviewer 2: Gang Li

      The submitted manuscript 'Near-chromosomal de novo assembly of Bengal tiger genome reveals genetic hallmarks of apex-predation' assembles a high-quality, near-chromosome-level reference genome of the Bengal tiger, which will be of great significance for the conservation and rejuvenation of tigers and of other endangered felids. I have some comments on this manuscript:

      1. Considering that the assembly used Hi-C technology to resolve chromosome structure, the figure of the Hi-C results needs to be presented. The assembly of the sex chromosomes always attracts attention, especially the tiger Y chromosome; more detailed information needs to be provided, such as the Y-chromosome genes conserved relative to other mammals, and whether any tiger-specific Y-linked genes were observed.

      2. In this work, the authors used four zoo-bred individuals with known pedigree to compute the ROH-based inbreeding index, intending to evaluate assembly quality. But I do not find any information about these four individuals; I assume they are Bengal tigers. If so, the issue is that the quantity of ROH is determined not only by reference quality but also by the divergence between the resequencing data and the reference genome. That is to say, if the resequencing data and the reference genome are from the same tiger subspecies (Bengal tiger), the quantity of ROH is expected to be higher than in a cross-subspecies comparison, which makes this an inappropriate way to evaluate assembly quality.

      3. I have some advice about the calibration of divergence times. Using other species with a closer phylogenetic relationship might be better, given their similar substitution rates and generation times, for instance other species of Panthera.

      4. The format of the references section needs to be rechecked.

    1. Japanese eels

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac120), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Christiaan Henkel

      This paper describes a new chromosome-level assembly of the Japanese eel, which could finally supersede the various more fragmented assemblies. The assembly process is perhaps overly complex (many data sources and assembly steps, suppl. figure 3), but the result in general appears to be of high quality, as demonstrated by BUSCO (twice) and alignment to a closely related genome (Anguilla anguilla, suppl. figure 4). Figures 1 and 2, however, contain some inconsistencies:

      Figure 1: track B (nanopore coverage) shows a clear bimodal signal, with large blocks of high (double) coverage. These appear possibly correlated with areas low in gene content (track E). Are these possibly collapsed duplicate regions? That would have a strong effect on the analyses of genome duplication. Do other somewhat comparable data sources, for example PacBio CLR, show this feature?

      Figure 2, right panel: the new A. japonica assembly appears to have many unclustered genes (brown), similar to the fragmented draft assembly of A. rostrata and unlike the other included chromosome-level assemblies. This appears to be related to the annotation process? Or are there other problems that preclude orthology assignment for these genes? And how does A. rostrata get its gain of 11756 genes in this analysis? (By the way, line 323 has genus Anguilla as +919/-531, the figure +919/-631).

      Some other questions and comments I would like the authors to address:

      The discussion of previous and current eel sequencing efforts in the Introduction is not complete. For example, I miss the assemblies by Kai et al (2014) and Nakamura et al (2017) of the Japanese eel genome. In addition, the Introduction and Discussion (lines 415-417) present the current assembly as the first chromosome-scale Anguilla genome, which is not the case. At least two high-quality assemblies of Anguilla anguilla (European eel) are available, and should be acknowledged: one is by the Vertebrate Genome Project, and this assembly is even used in the manuscript for comparative purposes (line 199). The other has been described in a preprint (Parey et al 2022). Some of the mentioned papers include similar analyses (mostly on evolution after genome duplication and ancestral genome reconstruction, see figure 5).

      Kai et al (2014) A ddRAD-based genetic map and its integration with the genome assembly of Japanese eel (Anguilla japonica) provides insights into genome evolution after the teleost-specific genome duplication. BMC Genomics 15, 233. https://doi.org/10.1186/1471-2164-15-233
      Nakamura et al (2017) Rhodopsin gene copies in Japanese eel originated in a teleost-specific genome duplication. Zoological Lett 3, 18. https://doi.org/10.1186/s40851-017-0079-2
      Parey et al (2022) Genome structures resolve the early diversification of teleost fishes. BioRxiv https://doi.org/10.1101/2022.04.07.487469

      The different statistics listed for each alternative assembly in the Introduction make comparisons difficult.

      The statement in line 79, that eels as the most basal teleost group are 'close' to non-teleosts, is incorrect. They are just as close to non-teleosts as any other teleost. (The rest of the sentence, up to line 82, could use rephrasing).

      The statement in line 307 that 'Japanese eels are phylogenetically closer to American than European eels' contradicts the phylogeny presented (fig. 2), or is this based on some additional analysis (a density plot not shown), or even on figure 2 right panel (see comment earlier)? Even if they are incrementally 'closer' by some metric, I would not interpret this a phylogenetic distance, given the inferred divergence dates. In any case, the American eel assembly is still highly fragmented, and not the best basis for inferences which otherwise rely on chromosome-scale assemblies.

      Similarly, the statements on divergence between teleost groups in lines 495-500 need rephrasing. Anguilla species did not diverge from Megalops etc.

      Figure 2 & lines 205-213/310-313: These divergence times are calibrated using a few intervals taken from TimeTree.org (red dots). I wonder how reliable this is, as I get quite different intervals when checking now: for Anguilla-Megalops it is 162.2-197.3 (the paper has 179.3-219.3). Also TimeTree appears to have arowana (Scleropages) as the most basal branch among the teleosts, the paper has a combined Osteoglossomorpha(arowana)/Elopomorpha(eels) branch. Has the phylogenetic tree topology been inferred or imposed? Why have the specific calibration points been chosen? The early branching among teleosts (see line 310-312) is somewhat controversial, see the preprint by Parey et al.

      Line 346-348: This uses the eel genome size (~1 Gbp) and the further (4R) duplicated salmon genome (3 Gbp) to argue against such a further genome duplication in eels. Although I agree that the eel 4R probably did not occur, comparing genome sizes presents no evidence in this case. Genome size changes by other processes as well, and more dramatically (e.g. transposon proliferation). In addition, salmon and eel are not closely related, at all. Compare this to the genomes of the (much more closely related) common carp and zebrafish, both ~1.5 Gbp: the carp genome, but not zebrafish, has experienced an additional duplication, but the zebrafish genome contains a higher transposon density.

      The second argument against 4R (lines 352-356, figure 4b) also does not really work. With 8 Hox clusters, the eel genome appears duplicated with respect to the gar (4 clusters), and not quadruplicated. However, with 8 clusters and 70+ genes, eels actually have more than all established 3R teleost genomes (max. 7 clusters, 42-50 genes). So the question is then whether these 8 clusters form nice 3R WGD ohnolog pairs, or if some clusters have been lost (as in nearly all other teleosts) and re-duplicated. The former hypothesis is consistent with the high level of retained WGD genes (line 369), the latter with the inferred high level of local duplication (line 363). The observation of duplicate eel Hox clusters goes back to the initial European eel genome assembly (Henkel et al 2012), but there the draft status precluded confident assignment to 3R for some clusters.

      The eel olfactory receptors have previously been identified using an assembled transcriptome (Churcher et al. 2015, not cited). How do the analyses of line 214-229/324-333/420-434/figure 3 compare?

      Churcher et al (2015) Deep sequencing of the olfactory epithelium reveals specific chemosensory receptors are expressed at sexual maturity in the European eel Anguilla anguilla. Molecular Ecology 24, 822-834. https://doi.org/10.1111/mec.13065

      Lines 460-467 state eels have retained duplicates of immune genes, which have been under positive selection. So how does this translate to a (very recent) negative effect on eel fitness (line 460-462)?

      The discussion of line 482-502 on chromosome numbers invokes ecological explanations (freshwater vs. marine habitats, 482-489), but subsequently does not translate this to the low Anguilla chromosome numbers. As these ecological factors are highly applicable to Anguillidae, this connection should be explored here - including their evolutionary history (e.g. Inoue et al, 2010, Deep-ocean origin of the freshwater eels. Biology Letters 6, https://doi.org/10.1098/rsbl.2009.0989)

      In this discussion: how do the numbers of line 482/3 (modal 2n 54/48 chromosomes in fish) correspond to those of line 492 (peak chromosome number n = 24/25 in extant teleosts)?

      The supplementary figures/tables lack legends (just mentions in the main text).

      Line 109: which ONT flowcell, kit, and basecaller versions have been used? In the M&M, please list software versions.

      Reviewer 2: Zhong Li This manuscript by Wang et al., titled "A Chromosome-level Assembly of the Japanese Eel Genome, Insights into Gene Duplication and Chromosomal Reorganization", provides a high-quality genome assembly of the Japanese eel, an economically important fish. The authors have used four kinds of sequencing technologies and assembling strategies, and provided a well-annotated genome. This genome provides useful information for the genome organization and evolution of this species, among other fields.

      Overall, the manuscript is sufficiently descriptive and easy to follow. I have three major concerns:

      1. The genome annotation relies on the transcriptome, but no detailed information is given in the method section. 2. The analyses do not include command lines or software versions and thus are not easily repeatable. A document that includes this information is highly recommended to be included as a supplementary file. 3. The genome assembly seems not to have been released in the NCBI database (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA852364). Besides, the gene models (nucleotide, protein, and GFF files) should also be made available and included in the Data Availability section when the manuscript is accepted.


      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac119), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Dominik Heider

      The paper is well written, and the objectives are clear. The study is a very nice application of CGR in bioinformatics and shows the excellent performance of CGR-encoded data in combination with deep learning. I have a few things that should be addressed in a minor revision:

      1) Some very important studies have not been addressed in the related work part, e.g., in Touati et al. (pubmed:32645523) and Sengupta et al. (pubmed:32953249), the authors compared SARS-CoV2 with other coronaviruses based on CGR, or we (pubmed:34613360) used CGR in combination with deep learning for resistance predictions in E. coli.

      2) To me, it is unclear how accuracy was used in the model. Is it one class (i.e., clade) versus all others? If yes, accuracy might be misleading because of the high class imbalance. In such high class imbalances, MCC has been shown to be more suitable.
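      The reviewer's imbalance concern can be made concrete with a toy confusion matrix (illustrative numbers only, not taken from the manuscript under review): a classifier that always predicts the majority class scores high accuracy, while MCC correctly collapses to zero.

```python
import math

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient; 0 by convention when a marginal is empty
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Degenerate classifier on a 95:5 imbalanced set: always predict the majority class.
tp, tn, fp, fn = 0, 95, 0, 5
print(accuracy(tp, tn, fp, fn))  # 0.95 despite learning nothing
print(mcc(tp, tn, fp, fn))       # 0.0
```

      A genuinely informative classifier (e.g. tp=4, tn=90, fp=5, fn=1 on the same data) yields an MCC well above zero, which is why MCC separates the two cases where accuracy does not.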

      3) "The undersampled dataset was randomly split into train...". Why did you undersample? To balance the data, which would make sense to use accuracy as a metric but discard a lot of valuable data. What about oversampling?

      4) Comparison with other tools: I wonder whether the good performance of your model is the result of deep learning or the CGR encoding. Please also provide the results for another ML model (besides SVM, e.g., random forests) to compare to, e.g., Covidex.

      Reviewer 2: Riccardo Rizzo

      The authors propose a classification experiment based on Frequency Chaos Game Representation and deep learning. They used the outstanding performance of a ResNet network as an image classification tool and the FCGR method, which represents a genome sequence as an image.

      The work seems good, although some major points should be clarified.

      First, whether the performance index values came from a 5-fold validation procedure (5 because they said the split was 80-10-10) or a one-shot experiment is unclear.

      Second, the part that involves the frequent k-mers and the SVM should be better explained. The authors should clarify what the meaning of this comparison is.

      Another point to clarify is the quality of the sequences used; the authors worked on complete sequences, but, as far as I know, in the real world virus sequences are noisy data, and authors should discuss this point.

      Minor points:

      • Authors said that a sequence is a string $s \in \{A, C, G, T, N\}^*$, so they should explain the procedure used in Definition 2, where only 4 symbols seem to be used. If they discard the N, or consider 4 k-mers (considering that N means "any symbol"), they should say it clearly.
      • Figure 1 and 2 report two different quantities but say the same thing; maybe one of them can be omitted.
      • Authors should add some details about the training time of the network.
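      On the alphabet point above: how 'N' is handled determines which k-mers are counted in the FCGR image. A minimal sketch of the standard FCGR construction, assuming (as one plausible reading of Definition 2, not confirmed by the manuscript) that k-mers containing 'N' are simply discarded; the corner assignment is one common convention:

```python
def fcgr(seq, k):
    """Frequency Chaos Game Representation: count k-mers into a 2^k x 2^k grid.

    A, C, G, T map to the four quadrants; k-mers containing 'N' (or any
    other symbol) are dropped rather than counted.
    """
    X = {'A': 0, 'C': 0, 'G': 1, 'T': 1}  # horizontal bit per nucleotide
    Y = {'A': 0, 'C': 1, 'G': 1, 'T': 0}  # vertical bit per nucleotide
    n = 2 ** k
    grid = [[0] * n for _ in range(n)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(c not in X for c in kmer):
            continue  # discard ambiguous k-mers
        x = y = 0
        for c in kmer:  # successive bits refine the quadrant, CGR-style
            x = (x << 1) | X[c]
            y = (y << 1) | Y[c]
        grid[y][x] += 1
    return grid

# 'ACGNT' with k=2 yields k-mers AC, CG, GN, NT; the last two are dropped.
counts = fcgr('ACGNT', 2)
print(sum(map(sum, counts)))  # 2
```

      The alternative reading (N as "any symbol") would instead distribute each ambiguous k-mer over all matching cells, which changes the counts; that is exactly why the definition should be stated explicitly.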

      A final suggestion: probably it will be interesting to use the same deep network with transfer learning (the whole network or just the first sections) to evaluate the gain with ad-hoc training and the different training time.


      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac076) and has published the reviews under the same license.

      Reviewer 1 Satoshi Hiraoka

      In this manuscript, the authors developed a new tool, DeePVP, for predicting Phage Virion Proteins (PVPs) using a deep learning approach. The purpose of this study is meaningful. As the authors described in the Introduction section, it is currently difficult to annotate the functions of viral genes precisely because of their huge sequence diversity and the existence of many unknown functions, and there is still much room to improve the performance of in silico annotation of phage genes, including PVPs. Although I'm not an expert in machine learning, the newly proposed method based on deep learning seems to be appropriate. The proposed tool showed clear outperformance compared with the other previously proposed tools, and thus the tool might be valuable for further deep analysis of many viral genomes. Indeed, the authors conducted two case studies using real phage genomes and reported novel findings that may provide insight into the genomics of the phages. Overall, the manuscript is well written, and I feel the tool has good potential to contribute to the wide field of viral genomics. Unfortunately, I have concerns, including about the source code openness. Also, I have some suggestions that would increase the clarity and impact of this manuscript if addressed.

      Major: I did not find the DeePVP source code on the GitHub page. Is the tool not open source? I strongly recommend the authors disclose all scripts of the tool for further validation and secondary usage by other scientists. Or, at least, clearly state why the source code needs to be held private. Also, I was much confused by the GitHub page because the uploaded files are not well structured. Scripts and data used for performance evaluation were included in a 'data.zip' file, which should be renamed to something more appropriate. The 'Source code' button on the Releases page strangely links to the 'Supporting_data.zip' file, which contained only the installation manual and not source code files. The authors should prepare the GitHub page appropriately: for example, upload all source code to the 'main' branch rather than include it in a zip file, and the 'source code' file in Releases should contain actual source code files rather than a manual PDF. According to the Material and methods section, 1) using the deep learning approach and 2) using the large dataset retrieved from PhANNs as the training dataset are two of the important improvements over the other studies in the PVP identification task. Someone may suspect that the better performance of DeePVP was mostly contributed by the larger training dataset rather than the classification method used. Is there a possibility that the previously proposed tools (especially the tools other than PhANNs), re-trained on the large PhANNs dataset, could reach better performance than DeePVP? The naming of 'Reliability index' (L249) is inaccurate. The score does not support the prediction's 'reliability' (i.e., whether the predicted genes are truly PVPs or not) but just reflects the fact that the gene is predicted as a PVP by many tools, without considering whether that is correct or incorrect. The sentence 'A higher n indicates that this protein is predicted as PVP by more tools at the same time, and therefore, the prediction may be more reliable.' in L252 is not logical. I do not fully agree with the discussion that the tool will facilitate viral host prediction, as mentioned in L294-302. It is very natural that if phages are phylogenetically close and possess similar genomic structures, including PVP-enriched regions, they will infect the same microbial lineage as a host. However, this has not been evaluated systematically across wide phage lineages. In general, almost all phage-host relations are unknown in nature, except for a few specific viruses such as E. coli phages. Further detailed studies are needed on whether, and to what degree, the conservation of PVP-enriched regions could be a good feature for predicting phage-host relationships. I think phage-host prediction is beyond the scope of this tool, and thus the analysis could be deleted from this manuscript or just briefly mentioned in the Discussion section as a future perspective.

      Minor: The URL of the GitHub page would be better placed at the end of the Abstract or inside the main text, in addition to the 'Availability of supporting source code and requirements' section. This will make it easy for many readers to access the homepage and use the tool. Figs 2 and 3: I think it is better to change the labels of the x-axis to 0 kb, 20 kb, 40 kb, ..., 180 kb. This will make it easier to understand that the horizontal bar represents the viral genome.

      Re-review:

      I read the revised manuscript and acknowledge that the authors made efforts to take the reviewers' comments into account. My previous points have been addressed and I feel the manuscript has improved. I think the term 'incomplete proteins' in L391-396 should be rephrased as 'partial genes', because here we should consider protein-encoding genes (or protein sequences), not the proteins themselves, and the word 'incomplete' is a bit ambiguous.


      Reviewer 2. Deyvid Amgarten

      The manuscript presents DeePVP, a new tool for PVP annotation of a phage genome. The tool implements two separate modules: the main module aims to discriminate PVPs from non-PVPs within a phage genome, while the extended module of DeePVP can further classify predicted PVPs into the ten major classes of PVPs. Compared with the present state-of-the-art tools, the main module of DeePVP performs better, with a 9.05% higher F1-score in the PVP identification task. Moreover, the overall accuracy of the extended module of DeePVP in the PVP classification task is approximately 3.72% higher than that of PhANNs, a known tool in the area. Overall, the manuscript is well written and clear, and I could not identify any serious methodological inconsistency. I was not sure whether to consider the performance metrics shown as significant improvements or not, since PhANNs already does a similar job in that regard, and it is better for some types of PVPs, for example. But I would rather leave this judgment to readers and other researchers in the area. Specifically, I enjoyed the discussion about how one-hot encoded features may be more suitable for prediction than k-mer-based ones and, by consequence, how convolutional networks may present an advantage over simple multilayer perceptron networks. This manuscript brings an important contribution to the phage genomics and machine learning fields. I am certain that DeePVP will be helpful to many researchers.

      I have a major question about the composition of the dataset used to train the main module: among the PVP proteins, do the authors know if only the ten types of PVP are present? There is a rapid mention of key words used to assemble the PhANNs dataset in the discussion (line 340), but that is not clear to me. This will help me understand the following: Line 124: the CNN in the extended module has an output softmax layer, which outputs likelihood scores for 10 types of virion proteins. I wonder if only proteins from these 10 types were included in the datasets used to train the CNNs. I mean, is it possible that a different type of virion protein is predicted by the main module as a PVP? And if so, how would the extended module classify this protein, since it is a PVP but none of the ten types? Minors: Line 121: by default, a protein with a PVP score higher than 0.5 is regarded as a PVP. How was this cutoff chosen? Was this part of the k-fold cross-validation process? Line 157 and other places in the manuscript: I would suggest the authors not use sentences like "F1-score is 9.05% much higher than that of PhANNs", for the obvious reason that 9% may not seem such a great difference as to warrant the adverb "much". The same goes for "much better" and variations. About the comparisons between DeePVP and PhANNs: did the authors make sure that instances of the test set were not used to train the PhANNs model being used? Line 221: what do the authors mean by "more authentic prediction"? Looking at the GitHub repository, I found it rather unusual that the authors chose to upload only a PDF with instructions on how to use and install the tool. It is very detailed, which I appreciate. The virtual machine and Docker containers are also nice resources to help less experienced users. However, I noticed that the GitHub repository has no clear mention of the source code of the tool. I found it via a mention in the Availability of supporting data, where the authors created a release with the datasets and the scripts. Again, very unusual, but I suppose the authors chose this approach due to GitHub limitations on large files. Table 2: I would like to ask the authors what might be the reason for such low performance metrics for some types of PVP (for example, minor capsid)? Figure 5 states: "Host genus composition of the subject sequences", but there is a "Myoviridae" category, which is a family of phages, not anything related to bacterial hosts. Please verify why this is in the figure.

      Re-review:

      Thank you for the authors' responses. Most of my concerns were addressed. I have to say, though, that the GitHub page is not quite up to the standards for a bioinformatics tool yet. I appreciate the source code upload, but I noticed that not a single line of comments was present in the code I checked. The README file is also not very clarifying. I do not consider this an impediment to publication (since there is detailed info in GigaScience DB), but perhaps this may hinder usage of the authors' tool. Most users will only look at the GitHub repository. I suggest some improvements, in case the authors judge that my comment makes some sense. Below I list three examples just to give the authors an idea:

      https://github.com/fenderglass/Flye https://github.com/LaboratorioBioinformatica/MARVEL https://github.com/vrmarcelino/CCMetagen

      One last concern was about the authors' response to the Myoviridae mistake in figure 5. The authors stated that the genus of a phage's host is in its name (as in, for example, Escherichia phage XX). But this is a dangerous assumption, since many phage names fall outside this rule. For example, there are many phages named Enterobacteria phage XXX (for instance NC_054905.1), meaning that they infect some enterobacteria. Again, Enterobacteria is not a genus. Phage nomenclature can be a mess sometimes; be careful.


      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac068) and has published the reviews under the same license.

      Reviewer 1 Tomas Sigvard Klingström,

      As a researcher who may occasionally use long-read sequencing techniques for projects, it is immensely helpful to get an insight into the experience accumulated through work related to the Vertebrate Genomes Project (VGP). My personal research interest on the subject is more on understanding why and how DNA fragments during DNA extraction. Due to my work in that area, I have one key question regarding the interpretation of the data presented in figure 2, and then a number of suggestions for minor edits. The answer on how to interpret figure 2 may require some minor edits, but the article is, regardless of this, a welcome addition to what we know about good practices for DNA extraction generating ultra-high-molecular-weight DNA. It should also be noted that the DOI link to Data Dryad seems broken, and I have therefore not looked at the supplementary material. In figure 2 the size distribution of DNA fragments is visualized from the different experiments. Most of the fragment distributions look like I would have expected based on the work we did in the article cited as no. 25 in the reference list. However, the muscle tissue from rats and the blood samples from the mouse and the frog indicate that there may be a misinterpretation in the article regarding the actual size distribution of fragments, which needs to be looked into. Starting with the mouse plots, and especially the muscle one: there must either have been a physical shearing event that drastically reduced the size of the DNA (using the terminology from ref 25, this would mean that physical shearing generated a characteristic fragment length of approximately 300-400 kb), or the lack of a sharp slope on the rightmost side of the ridgeline plot is due to the way the image was processed. All other animals got a peak on the rightmost side of the ridgeline plot, and the agarose plug should, based on the referenced methods paper [7], generate megabase-sized fragments which far exceed the size of the scale used in figure 2. I would presume these larger fragments would get stuck in or near the well, which makes it easy to accidentally cut them out when doing the image analysis step, which may explain their absence in the mouse samples. This leads me to the conclusion that the article is well designed to capture the impact of chemical shearing caused by different preservation methods, but would benefit from evaluating whether figure 2 properly covers the actual size distribution of fragments or only covers the portion of DNA fragments small enough to actually form bands on the PFGE gel, with a substantial part of the DNA stuck in or near the well. The frog plot is a good example of how this may influence our interpretation of the ridgeline plots. If the extraction method generates high-quality DNA concentrated in the 300-400 kb range, then there must be something very special about the frog DNA from blood, as there is a continuous increase in brightness all the way to the edge of the image. This implies that the sample contains a high amount of much larger DNA fragments than the other samples. I find this rather unlikely, and if I saw this in my own data I would assume that we had a lot of very large DNA fragments that are out of scale for the gel electrophoresis, but that in the case of the frog blood samples many of these fragments have been chemically sheared, creating the "smeared" pattern we see in figure 2.

      Minor edits and comments:

      - The Dryad DOI doesn't work for me.
      - Figure 1: the meaning of x3 and x2 for the turtle should be described in the caption.
      - Figure 2: having the scale indicator (48.5, 145.5, etc.) at the top as well as the bottom of each column would make it quicker to estimate the distribution of samples.
      - The article completely omits Nanopore sequencing; is there a specific reason why the lessons here are not applicable to ONT?
      - There is a very interesting paragraph starting with "The ambient temperature of the intended collecting locality should be a major consideration in planning field collections for high-quality samples. Here we test a limited number of samples at 37°C to". Even if the results were very poor, information about the failed conditions would be appreciated. What tissues/animals did you use, did you do any preservation at all for the samples, and did you measure the fragment length distribution anyway? Simply put, even if the DNA was useless for long-read sequencing, it is an interesting data point for the dynamics of DNA degradation and a valuable lesson for planning sampling in warm climates.

      Re-review:

      All questions and comments made in my first review are now resolved. I understand the thought process behind the first cropping of figure 2 but appreciate the 2nd version, as it makes it easier for researchers with a limited understanding of the experiment to interpret the data.


      Reviewer 2. Elena Hilario

      I am glad to have been selected as a reviewer for the manuscript "Benchmarking ultra-high molecular weight DNA preservation methods for long-read and long-range sequencing" by Dahn and colleagues. The manuscript reports a detailed guide on the effect of preservation methods on the quality of DNA extracted from a wide range of animal tissues. Although the work focuses only on vertebrates, it is a great foundation for similar studies on plants, invertebrates, and fungi, for example. Although the effectiveness of the tissue/preservative combinations was only tested with the preparation of long-range libraries, it would have been useful to select one or two cases for long-read sequencing (PacBio or Oxford Nanopore) to explore the impact of the different QC parameters measured in this study.

      Minor comments and corrections are included in the uploaded file.


      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac075) and has published the reviews under the same license.

      Reviewer 1. Nikos Karaiskos

      Reviewer Comments to Author: In this article the authors developed Stardust, a computational method that can be used for spatially-informed clustering by combining transcriptional profiles and spatial information. As spatial sequencing technologies gain popularity, it is important to develop tools that can efficiently process and analyse such datasets. Stardust is a new method that goes in this direction. It is particularly appealing to make use of the spatial information and relationships to cluster gene expression in these datasets. Overall the quality of data used is high and the manuscript is clearly written. The algorithm behind Stardust is simple and consists of an interpolation between spatial and transcriptional distance matrices. A single parameter called space weight controls the contribution of the spatial distance matrix. The authors benchmark Stardust against other recently developed tools in five different spatial transcriptomics datasets by using two measures. Stardust therefore holds the potential of being a useful method that can be applied in different datasets.

      Before recommending the manuscript for publication, however, the authors should thoroughly address the following points:

      1. What is the rationale behind modelling the contributions as a linear sum of the spatial and transcriptional distance matrices? In particular, why did the authors not consider non-linear relationships as well? As cells neighboring in space often share similar transcriptional profiles (see for instance Nitzan et al., 2019 for this line of reasoning and several examples therein), I would expect product terms to be even more informative.
      2. The authors demonstrate Stardust's performance only on datasets obtained with the 10X Visium platform. How does Stardust perform on higher-resolution methods, such as Slide-seq, Seq-Scope, etc.? As ST methods will improve in resolution in the future, it is critical to be able to analyze such datasets as well. An important question here concerns scalability: how well does Stardust scale with the number of cells/spots?
      3. In Fig. 1b conclusions are drawn based on the CSS for different space weights, but only for a clustering parameter = 0.8. What happens for other clustering values? And can the authors comment on why the different space weight values do not perform consistently across the datasets (i.e., 0.5 is better for HBC2 but 0.75 for MK)?
      4. The authors compared Stardust with four other tools. The conclusion is that Stardust outperforms all other methods (and performs equivalently with BayesSpace). All of these methods, however, rely on choosing specific values for a number of parameters. Did the authors optimize these values when they benchmarked these methods against Stardust?
      5. I was able to successfully install Stardust and run it. The resulting clusters in the Seurat object, however, were all NAs. The authors should make an effort to better document how Stardust runs, including the input structure that the tool expects and potential issues that might arise.

      Re-review: The authors have successfully addressed all raised points. The introduction of Stardust*, in particular, is a valuable enhancement of the method. Therefore, I recommend the manuscript for publication.

    2. Spatial

      Reviewer 2. Quan Nguyen

      Reviewer Comments to Author: This work presents a new clustering method, Stardust, that has the potential to improve the stability of clustering results against parameter changes. Stardust can assess the contribution of spatial information to the clustering result relative to gene expression information. Stardust appears to perform better than other methods on the two metrics used in this paper, stability and coefficient of variation. The essence of the method is the use of a spatial transcriptomics (ST) distance matrix formed as a simple linear combination of the physical distance (S) and transcriptional distance (T) matrices. A weight factor is applied to the S matrix to control and evaluate the contribution of the spatial information. The effort to evaluate multiple parameters and compare with several recent methods across a number of public spatial datasets is a highlight of the work. The authors also made the code available.
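      The distance construction the review summarizes (a weighted linear combination of a spatial and a transcriptional distance matrix) can be sketched generically as below. This is an illustrative sketch, not Stardust's actual API; the min-max scaling of both matrices is an assumption made here so that the space weight is comparable across the two terms.

      ```python
      import numpy as np

      def combined_distance(S, T, space_weight=0.5):
          """Linear combination of a spatial distance matrix S and a
          transcriptional distance matrix T, controlled by space_weight.
          Both matrices are min-max scaled (an assumption for this sketch)."""
          S = (S - S.min()) / (S.max() - S.min())
          T = (T - T.min()) / (T.max() - T.min())
          return space_weight * S + (1 - space_weight) * T

      # Toy example with three spots
      S = np.array([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]])
      T = np.array([[0., 4., 8.], [4., 0., 4.], [8., 4., 0.]])
      D = combined_distance(S, T, space_weight=0.25)
      ```

      Sweeping `space_weight` over {0, 0.25, 0.5, 0.75, 1} reproduces the kind of grid the review discusses: 0 ignores space entirely, 1 ignores transcription entirely.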

      Major comments:

      - The concept of combining spatial location and gene expression is not new and has been applied in most spatial clustering methods. It is not clear what the new additions to currently available methods are, apart from a feature to weigh the contribution of the spatial component to the clustering results.
      - The approach to assessing the contribution of spatial information, by varying the weight factor from 0 to 1, is rather simple, because the contribution can be nonlinear and vary between spots/cells (e.g. spatial distance becomes more important for spots/cells that are nearer to each other; some genes are more spatially variable than others; applying one weight factor to all genes and all spots would miss these sources of variation).
      - The five weight factors 0, 0.25, 0.50, 0.75, and 1 were used. However, this range of parameters provides too few data points to assess the impact of the spatial factor. As seen in the figures, the five data points do not strongly suggest a point where the spatial contribution is maximal/minimal, given the large fluctuation of values on the y-axis.
      - Although two performance metrics are used (stability and variation), an additional metric is needed for how well the clustering results represent the biological ground-truth cell-type composition or tissue architecture (for example, by comparing to pathological annotation). Consequently, it is unclear whether the Stardust results are closer to the biological ground truth or not.
      - Stardust was tested on multiple 10x Visium datasets, but other types of spatial transcriptomics data, such as seqFISH, Slide-seq, and MERFISH, are also common. An extended assessment of potential applications to other technologies would be useful.

      Minor comments:

      - The paragraphs and figure legends in the Results section are repetitive.
      - The Results section is descriptive, and there is no Discussion section.

      Re-review:

      The authors have improved the initial manuscript markedly. There are a couple of important points regarding comparisons between Stardust* and Stardust that need to be addressed: 1) In which cases does Stardust* improve over Stardust? It seems the results would depend on the biological system (i.e., tissue type). The authors suggest both versions produce comparable results, but given the major change in the formula (replacing a constant weight with variable weights given by gene expression values min-max normalised to [0,1]), there are likely differences between Stardust* and Stardust. For example, certain genes will have higher weight than others, making the effects of the weights variable among genes. For this example, the authors may assess highly abundant genes vs. low-abundance genes. 2) In cases where spatial distances are important, Stardust* could be less accurate than a Stardust version with a high space weight. How does Stardust* handle cases where spatial distance is as important as gene expression?

    1. Survival

      Reviewer 2. Animesh Acharjee

      SurvBenchmark: comprehensive benchmarking study of survival analysis methods using both omics data and clinical data.

      The authors compared many survival analysis methods and created a benchmarking framework called SurvBenchmark. This is one of the most extensive studies of survival analysis and will be useful to the translational community. I have a few suggestions to improve the quality of the manuscript.

      1. Figure 1: LASSO, EN and Ridge are regularization methods. So, I would suggest including a new classification category such as "regularization" or "penalization methods" and taking those out of the non-parametric models. Obviously, this also needs to be reflected in the methodology section and discussion.
      2. Data sets: please provide a table of the six clinical and ten omics datasets with the number of samples, features, and a reference link.
      3. Discussion section: How should the choice of method be made? What criteria should be used? I understand that one size does not fit all, but some clear guidance would be very useful. Sample-size-related aspects also need more discussion. In omics research the number of samples is really limited, and deep learning-based survival analysis is not feasible, as the authors mention in lines 328-331. So the question arises: when should we use deep learning-based methods and when should we not?

      Reviewer 3. Xiangqian Guo: Accept

    2. Abstract

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac071) and has published the reviews under the same license.

      Reviewer 1. Moritz Herrmann

      First review: Summary:

      The authors conducted a benchmark study of survival prediction methods. The design of the study is reasonable in principle. The authors base their study on a comprehensive set of methods and performance evaluation criteria. In addition to standard statistical methods such as the CoxPH model and its variants, several machine learning methods including deep learning methods were used. In particular, the intention to conduct a benchmark study based on a large, diverse set of datasets is welcome. There is indeed a need for general, large-scale survival prediction benchmark studies. However, I have serious concerns about the quality of the study, and there are several points that need clarification and/or improvement.

      Major issues:

      1. The method comparison does not seem fair: As far as I can tell from the description of the methods, the method comparison is not fair and/or not informative. In particular, given the information provided in Supp-Table-3 and the code provided in the Github repository, hyperparameter tuning has not been conducted for some methods. For example, Supp-Table-3 indicates that the parameters 'stepnumber' and 'penaltynumber' of the CoxBoost method are set to 10 and 100, respectively. Similarly, only two versions of RSF with fixed ntree (100 and 1000) and mtry (10, 20) values are used. Also, the deep learning methods appear not to be extensively tuned. On the other hand, telling from the code, methods such as the Cox model variants (implemented via glmnet) and MTLR have been tuned at least a little. Please explain clearly and in detail how the hyperparameters have been specified and how hyperparameter tuning has been conducted for the different methods. If, in fact, not all methods have been tuned, this is a serious issue and the experiments need to be rerun under a sound and fair tuning regime.

      2. Description of the study design: Related to the first point, the description of the study design needs to be improved in general, as it does not allow one to assess the conducted experiments in detail. A few examples, which require clarification:

      • as already mentioned, the method configurations and implementations are not described sufficiently. It is unclear how exactly the hyperparameter settings were obtained, how tuning was applied, and why only for some methods.
      • concerning the methods Cox(GA), MTLR(GA), COXBOOST(GA), MTLR(DE), COXBOOST(DE): have the feature selection approaches been applied to the complete datasets or only to the training sets?
      • Supp-Table-3 lists two implementations of the Lasso, Ridge and Elastic Net Cox methods (via penalized and glmnet); yet, Figure 2 in the main manuscript lists only one version. Which implementations have been used and are reported in Figure 2?
      • l. 221: it is stated that "the raw Brier score" has been calculated. At which time point(s), and why at this/these time point(s)?
      • Supp-Table-2: it is stated that "some methods are not fully successful for all datasets", but only DNNSurv is further examined. Is it just DNNSurv, or have other methods also failed in some iterations? Moreover, what has been done about the failing iterations? Have the missing values been imputed? Are the failing iterations ignored?

      I recommend that section 3 be comprehensively revised and expanded, in particular regarding the methods' implementations, how hyperparameters are obtained/how tuning has been conducted, the aggregation of performance results, and the handling of failing iterations. Moreover, I suggest providing summary tables of the methods and datasets in the main manuscript rather than in the supplement.

      3. Reliability of the presented results: In other studies [BRSB20, SCS+20, HPH+20], differences in (mean) model prediction performance have been reported to be small (while variation across datasets can be large). This can also be seen in Figure 3 of the main manuscript. Please include more analyses of the variability of prediction performances and also include a comparison to a baseline method such as the Kaplan-Meier estimate. Most importantly, if some methods have been tuned while others have not, the reported results are not reliable. For example, the untuned methods are likely to be ill-specified for the given datasets and may thus yield sub-optimal prediction performance. Moreover, if internal hyperparameter tuning is conducted for some methods, for example via cv.glmnet for the Cox model variants, and not for others, the computation times are also not comparable.

      4. Clarity of language, structure and scope: I believe that the quality of the written English is not up to the standard of a scientific publication and consider language editing necessary (yet, it has to be taken into account that I am not a native speaker). Unlike related studies [BWSR21, SCS+20, e.g.], the paper lacks clarity and/or coherence. Although clarity and coherence can be improved with language editing, there are also imprecise descriptions in section 2 that may additionally require editing from a technical perspective. For example:

      • l. 76-78: The way censoring is described is not coherent, e.g.: "the class label '0' (referring to a 'no-event') does not mean an event class labelled as '0'". Furthermore, it is not true that "the event-outcome is 'unknown'". The event is known, but the exact event time is not observed for censored observations.

      • The authors aim to provide a comprehensive benchmarking study of survival analysis methods. However, they do not, for example, provide significance tests for performance differences nor critical-difference plots (it should be noted that the number of datasets included may not provide enough power to do so). This is in stark contrast to the work of Sonabend [Son21].

      I suggest revising section 2 using more precise terminology and clearly describing the scope of the study, e.g., what type of censoring is being studied, whether time-dependent variable and effects are of interest, etc. I think this is very important, especially since the authors aim to provide "practical guidelines for translational scientists and clinicians" (l. 32) who may not be familiar with the specifics of survival analysis.

      Minor issues

      • l. 43: Include references for specific examples
      • l. 60: The cited reference probably is not correct
      • l. 266: "MTLR-based approaches perform significantly better". Was a statistical test performed to determine significant differences in performance? If yes, indicate which test was performed. If not, do not use the term "significant" as this may be misunderstood as statistical significance.
      • Briefly explain what the difference is between data sets GE1 to GE6.
      • It has been shown that omics data alone may not be very useful [VDBSB19]. Please explain why only omics variables are used for the respective datasets.
      • Figure 1: Consider changing the caption to 'An overview of survival methods used in this study' as there are survival methods that are not covered. Moreover, consider referencing Wang et al [WLR19] as Figure 1a resembles Figure 3 presented therein.
      • Figure 2: Please add more meaningful legends (e.g., title of legend; change numbers to Yes, No, etc.).
      • Figure 2 a & b: What do the dendrograms relate to?
      • Figure 2 d: The c-index is not a proper scoring rule [BKG19] (and only measures discrimination), better use the integrated Brier score (at best, at different evaluation time points) as it is a proper scoring rule and measures discrimination as well as calibration.
      • Figure 3: At which time point is the Brier score evaluated and why at that time point? Consider using the integrated Brier score instead.
      • This is rather subjective, but I find the use of the term "framework", especially that the study contributes by "the development of a benchmarking framework" (l. 60), irritating. For example, a general machine learning framework for survival analysis was developed by Bender et al. [BRSB20], while general computational benchmarking frameworks in R are provided, e.g., by mlr3 [LBR+19] or tidymodels [KW20]. The present study conducts a benchmark experiment with specific design choices, but in my opinion it does not develop a new benchmarking framework. Thus, I would suggest not using the term "framework" but better "benchmark design" or "study design".
      • In addition, the authors speak of a "customizable weighting framework" (l. 241), but never revisit this weighting scheme in relation to the results and/or provide practical guidance for it. Please explain w.r.t. the results how this scheme can and should be applied in practice.
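      The integrated Brier score the reviewer recommends can be illustrated with a minimal sketch: given time-dependent Brier scores at several evaluation time points, the IBS is their time average over the evaluation window (here approximated with the trapezoidal rule). Function name and toy values are illustrative, not tied to any particular survival package.

      ```python
      import numpy as np

      def integrated_brier_score(times, brier_scores):
          """Average the time-dependent Brier score over the window
          [times[0], times[-1]] using the trapezoidal rule."""
          t = np.asarray(times, dtype=float)
          bs = np.asarray(brier_scores, dtype=float)
          areas = (bs[1:] + bs[:-1]) / 2 * np.diff(t)  # trapezoid areas
          return areas.sum() / (t[-1] - t[0])

      # Toy Brier scores evaluated at four time points
      ibs = integrated_brier_score([1.0, 2.0, 3.0, 4.0],
                                   [0.10, 0.15, 0.20, 0.15])
      ```

      Unlike a Brier score reported at a single, arbitrarily chosen time point, this summary reflects calibration and discrimination across the whole evaluation window.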

      The references need to be revised. A few examples:
      - l. 355 & 358: This seems to be the same reference.
      - l. 384: Title missing
      - l. 394: Year missing
      - l. 409: Year missing
      - l. 438: BioRxiv identifier missing
      - l. 441: ArXiv identifier missing
      - l. 445: Journal & Year missing

      Typos:
      - l. 66: . This
      - l. 89: missing comma after the formula
      - l. 93: missing whitespace
      - l. 107: therefore, (no comma)
      - l. 121: where for each, (no comma)
      - l. 170: examineS
      - l. 174: therefore, (no comma)
      - l. 195: as part of A multi-omics study; whitespace in the wrong position; the sentence does not appear correct
      - l. 323: comes WITH a

      Data and code availability

      Data and code availability is acceptable. Yet, the ANZDATA and UNOS_kidney data are not freely available and require approval and/or request. Moreover, for better reproducibility and accessibility, the experiments could be implemented with a general purpose benchmarking framework like mlr3 or tidymodels.

      References

      [BKG19] Paul Blanche, Michael W Kattan, and Thomas A Gerds. The c-index is not proper for the evaluation of t-year predicted risks. Biostatistics, 20(2):347-357, 2019.
      [BRSB20] Andreas Bender, David Rügamer, Fabian Scheipl, and Bernd Bischl. A general machine learning framework for survival analysis. arXiv preprint arXiv:2006.15442, 2020.
      [BWSR21] Andrea Bommert, Thomas Welchowski, Matthias Schmid, and Jörg Rahnenführer. Benchmark of filter methods for feature selection in high-dimensional gene expression survival data. Briefings in Bioinformatics, 2021. bbab354.
      [HPH+20] Moritz Herrmann, Philipp Probst, Roman Hornung, Vindi Jurinovic, and Anne-Laure Boulesteix. Large-scale benchmark study of survival prediction methods using multi-omics data. Briefings in Bioinformatics, 22(3), 2020. bbaa167.
      [KW20] M Kuhn and H Wickham. Tidymodels: Easily install and load the 'tidymodels' packages. R package version 0.1.0, 2020.
      [LBR+19] Michel Lang, Martin Binder, Jakob Richter, et al. mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software, 4(44):1903, 2019.
      [SCS+20] Annette Spooner, Emily Chen, Arcot Sowmya, Perminder Sachdev, Nicole A Kochan, Julian Trollor, and Henry Brodaty. A comparison of machine learning methods for survival analysis of high-dimensional clinical data for dementia prediction. Scientific Reports, 10(1):1-10, 2020.
      [Son21] Raphael Edward Benjamin Sonabend. A theoretical and methodological framework for machine learning in survival analysis: Enabling transparent and accessible predictive modelling on right-censored time-to-event data. PhD thesis, UCL (University College London), 2021.
      [VDBSB19] Alexander Volkmann, Riccardo De Bin, Willi Sauerbrei, and Anne-Laure Boulesteix. A plea for taking all available clinical information into account when assessing the predictive value of omics data. BMC Medical Research Methodology, 19(1):1-15, 2019.
      [WLR19] Ping Wang, Yan Li, and Chandan K Reddy. Machine learning for survival analysis: A survey. ACM Computing Surveys (CSUR), 51(6):1-36, 2019.

      Re-review:

      Many thanks for the very careful revision of the manuscript. Most of my concerns have been thoroughly addressed. I have only a few remarks left.

      Regarding 1. Fair comparison and parameter selection The altered study design appears much better suited to this end. Thank you very much for the effort, in particular the additional results regarding the two tuning approaches. Although I think a single simple tuning regime would be feasible here, using the default settings is reasonable and very well justified. I agree that this is much closer to what is likely to take place in practice. However, it should be more clearly emphasized that better performance may be achievable if tuning is performed.

      Regarding 2. Description Thanks, all concerns properly addressed. No more comments.

      Regarding 3. Reliability I am aware that Figure 2c provides information to this end. I think additional boxplots which aggregate the methods' performance (e.g. for unoc and bs) over all runs and datasets would provide valuable additional information. For example, from Figure 2c one can tell that MTLR variants obtain overall higher ranks based on mean prediction performance than the deep learning methods. However, it says nothing about how large the differences in mean performance are.

      Kaplan-Meier-Estimate (KM) I'm not quite sure I understood the authors' answer correctly. The KM does not use variable information to produce an estimate of the survival function, and I think that is why it would be interesting to include it. This would shed light on how valuable the variables are in the different data sets.
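      The reviewer's point is that the Kaplan-Meier estimator is a covariate-free baseline. A minimal sketch of the standard product-limit estimator (illustrative data; events coded 1, censoring 0):

      ```python
      import numpy as np

      def kaplan_meier(times, events):
          """Product-limit (Kaplan-Meier) survival estimate.
          Uses only observed times and event indicators -- no covariates,
          which is what makes it a useful no-information baseline."""
          times = np.asarray(times, dtype=float)
          events = np.asarray(events, dtype=int)
          uniq = np.unique(times)
          surv, s = [], 1.0
          for t in uniq:
              d = np.sum((times == t) & (events == 1))  # events at time t
              n = np.sum(times >= t)                    # at risk just before t
              if d > 0:
                  s *= 1.0 - d / n
              surv.append(s)
          return uniq, np.array(surv)

      # Toy data: five subjects, two censored
      t, e = [2, 3, 3, 5, 8], [1, 1, 0, 1, 0]
      ts, S = kaplan_meier(t, e)
      ```

      Comparing every model against this curve would show how much predictive value the variables add in each dataset.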

      Regarding 4. Scope and clarity Thanks, all concerns properly addressed. No more comments.

      Minor points:

      • Since the authors decided to change 'framework' to 'design', note that in Figure 1b it still says 'framework'
      • l.51 & l.54/55 appear to be redundant
      • Figure 2 a and b:
      • Please elaborate more on how similarity (reflected in the dendrograms) is defined.
      • Why is the IBS more similar to Bregg's and GH C-Index than to the Brier Score?
      • Why is the IBS not feasible for so many methods, in particular Lasso_Cox, Ridge_Cox, and CoxBoost?
    1. Abstract

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac073) and has published the reviews under the same license.

      Reviewer 1. Siyuan Ma

      Reviewer Comments to Author: In Kang, Chong, and Ning, the authors present Meta-Prism 2, a microbial community analysis framework, which calculates sample-sample dissimilarities and queries microbial profiles similar to those of user-provided targets. Meta-Prism 2 adopts efficient algorithms to achieve the time and memory efficiency required for modern microbiome "big data" application scenarios. The authors evaluated Meta-Prism 2's performance, both in terms of separating different biomes' microbial profiles and time/memory usage, on a variety of real-world studies. I find the application target of Meta-Prism appealing: achieving efficient dissimilarity profiling is increasingly relevant for modern microbiome applications. However, I'm afraid the manuscript appears to be in a poor state, with insufficient details for crucial methods and results components. Some display items are either missing or mis-referenced. As such, I cannot recommend its acceptance unless major improvements are made. My comments are detailed below.

      Major 1. The authors claim that from its previous iteration, the biggest improvements are: (1) removal of redundant nodes in 1-against-N sample comparisons. (2) functionality for similarity matrix calculation (3) exhaustive search among all available samples.

      a. (1) seems the most crucial for the method's improved efficiency. However, the details on why these nodes can be eliminated, and how dissimilarity calculation is achieved post-elimination, are not sufficient. The caption for Figure 1C and the relevant Methods text (lines 173-188) should be expanded, to at least explain i) why it is valid to calculate (dis)similarity post-elimination based on aggregation, and ii) how aggregation is achieved for the target samples. b. I may not have understood the authors on (2), but this improvement seems trivial? Is it simply that Meta-Prism 2 has a new function to calculate all pairwise dissimilarities on a collection of microbial profiles? c. For (3), it should be made clearer that Meta-Prism 1 does not do this. I needed to read the authors' previous paper to understand the comment about better flexibility on customized datasets. I assume that this improvement is enabled because Meta-Prism 2 is vastly faster than version 1? If so, it might be helpful to point this out explicitly.

      2. I am lost on the accuracy evaluation results for predicting different biomes (Figure 2). a. How are biomes predicted for each microbial sample? b. What is the varying classification threshold that generates different sensitivities and specificities? c. Does "cross-validation" refer to e.g. selection of tuning parameters during model training, or to evaluating model performance? d. What are the "Fecal", "Human", and "Combined" biomes for the Feast cohort? Such details were not provided in Shenhav et al.

      Moderate 1. I understand that this was previously published, but could the authors comment on the intuitions behind their dissimilarity measure, and how it compares to similar measures such as the weighted UniFrac? a. Does Meta-Storm and Meta-Prism share the same similarity definition? If so, why would they differ in terms of prediction accuracies? 2. There seems to be some mis-referencing on the panels of Figure 1. a. Panel B was not explained at all in the figure caption. b. Line 185 references Figure 1E, which does not exist.
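      For readers unfamiliar with the comparison the reviewer raises, a raw weighted-UniFrac-style dissimilarity can be sketched as follows. This is a toy illustration of the general branch-weighted idea, not Meta-Prism's actual measure; the tree encoding and names are invented for the example.

      ```python
      def weighted_unifrac(tree_branches, sample_a, sample_b):
          """Raw weighted UniFrac: each branch length is weighted by the
          difference in relative abundance of the taxa below that branch.
          `tree_branches` is a list of (branch_length, leaves_below) pairs;
          samples map leaf name -> relative abundance."""
          dist = 0.0
          for length, leaves in tree_branches:
              pa = sum(sample_a.get(leaf, 0.0) for leaf in leaves)
              pb = sum(sample_b.get(leaf, 0.0) for leaf in leaves)
              dist += length * abs(pa - pb)
          return dist

      # Toy 3-leaf tree: an internal branch joins x and y; z hangs off the root
      branches = [
          (1.0, {"x"}), (1.0, {"y"}), (2.0, {"z"}),
          (0.5, {"x", "y"}),  # internal branch above x and y
      ]
      a = {"x": 0.5, "y": 0.5, "z": 0.0}
      b = {"x": 0.0, "y": 0.0, "z": 1.0}
      d = weighted_unifrac(branches, a, b)
      ```

      An intuition-level comparison of Meta-Prism's similarity against this classic phylogeny-weighted scheme is exactly what the reviewer is asking the authors to provide.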

      Minor 1. The Meta-Prism 1 publication was referenced with duplicates (#16 and 24) 2. There are minor language issues throughout the manuscript, but they do not affect understanding of the material. Examples: a. Line 94: analysis -> analyze b. Line 193: We also obtained a dataset that consists of ...

      Re-review:

      I find most of my questions addressed. My only remaining issue is still that the three biomes from FEAST (Fecal, Human, and Mixed) are still not clearly defined. The only definition I could find is line 206-208 "We also obtained a dataset that consists of 10,270 samples belonging to three biomes: Fecal, Human, and Mixed, which have been used in the FEAST study, defined as the FEAST dataset". Are "Fecal" simply stool samples, and "Human" samples biopsies from the human gut? What is "Mixed"? As a main utility of Meta-Prism is source tracking, it is important for the reader to understand what these biomes are, to understand the resolution of the source tracking results. If this can be resolved, I'll be happy to recommend the manuscript's acceptance.

      Reviewer 2. Yoann Dufresne

      In this article the authors present Meta-Prism 2, a software tool to compute distances between metagenomic samples and also query a specific sample against a pool of samples. They call a "sample" a precomputed file with abundances of multiple taxa. In the article they first succinctly present multiple aspects of the underlying algorithms. Then they provide an extensive analysis of the precision, RAM and time consumption of the software. Finally, they show 3 applications of Meta-Prism 2.

      I will start by saying that the execution time of the tool looks very good compared to all the other tools. But I have multiple concerns about these numbers.
      - First, I like to reproduce the results of a paper before approving it, but I had a few problems doing so.
        * The tool does not compile as it is on the git. I had to modify a line of code to compile it. This is nothing very bad, but authors of tools should make sure that their main code branch always compiles. See the end of the review for the bug and fix.
        * The analyses are done using samples from MGnify. I found the related OTU tsv files linked in the supplementary but no explanation of how to transform such files into the pdata files that the software processes.
        * The only way to directly reproduce the results is to trust the pdata files present on the authors' GitHub. I would like to make my own experiments and compare the time to transform OTU files into pdata with the actual run time of MP2.
      - The authors evaluated the accuracy of their method (which is nice) but did not give access to the scripts that were used for that. I would like to see the code and try to reproduce the figure by myself on my own data.
      - The 2nd and 3rd applications are explained in plain text, but there is no related script, nor any table or graphic, to reproduce or explain the results. The only way for me to evaluate this part is to trust the word of the authors. I would like the authors to show me clear and indisputable evidence.

      For the methods part it is similar. We have hints about what the authors did, but not a full explanation:
      - For the similarity function, I would like to know where it comes from. The cited papers [14] and [24] do not help with the comprehension of the formula. If the function is from another paper, I ask the authors to add a clear reference (paper + section in the paper); if not, I would like the authors to explain in detail why this particular function, how they constructed it, and how it behaves.
      - The authors refer multiple times to a "sparse format" applied to disk & cache but never define what they mean by that. I would like to see in this section which exact data structure is used.
      - In the Fast 1-N sample comparison, the authors write about "current methods" but without citing them. I would like the authors to refer to precise methods/software, succinctly describe them, and then compare their method on top of that. Also in this part, the authors point at figure 1E, which is not present in the manuscript.
      - Figure 1 is not fully understandable without further details in the text. For example, what is Figure 1C4?

      I want to point out that the paper is not correctly balanced in terms of content. 1.5 pages of execution-time analysis is too much compared to the 2 pages of methods and less than 1 page of real-data applications.

      Finally, the authors are presenting a software tool but are not following development standards. They should provide unit and functional tests for their software. I also strongly recommend that they set up continuous integration for the git repository. With such a tool, the compilation problem would not exist.

      To conclude, I think that the authors engineered the software very well but did not present it the right way. I suggest the authors rewrite the paper with strong improvements to the "Methods" and "Real data application" sections. Also, to provide long-term useful software, they have to add guarantees to the code, such as tests and CI.

      For all these reasons, I recommend rejecting this paper.

      --- Bug & Fix ---

      make
      mkdir -p build
      g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/loader.o src/loader.cpp
      g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/newickParser.o src/newickParser.cpp
      g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/simCalc.o src/simCalc.cpp
      g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/structure.o src/structure.cpp
      g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/main.o src/main.cpp
      src/main.cpp: In function 'int main(int, const char**)':
      src/main.cpp:128:31: error: 'class std::ios_base' has no member named 'clear'
        128 |         buf.ios_base::clear();
            |                       ^~~~~
      make: *** [makefile:7: build/main.o] Error 1

      To fix the bug: src/main.cpp:128 => buf.ios.clear();

    1. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.77) and has published the reviews under the same license.

      Reviewer 1. Cicely Macnamara

      The manuscript entitled "PhysiCOOL: A generalized framework for model Calibration and Optimization Of modeLing projects" is succinctly written; its purpose is clear and the software created is simple yet effective. I think improvements could be made to the documentation, allowing a non-expert user to make use of this valuable tool. I also have a few minor comments below. Otherwise, I am happy to recommend the publication of this paper.

      Minor comments: (1) Could the authors clarify in the paper (where it says PhysiCOOL has partial support for PhysiCell v1.10.3 and higher) whether it is the authors' intention to keep this tool up to date with newer releases of PhysiCell? (2) For the multilevel parameter sweep, the authors suggest that the number of levels and the grid parameters can be defined by the user. Do the authors have any suggestions for picking an appropriate number of levels, for example, or could future development include some form of dynamic choice of the number of levels, e.g. stopping when a certain degree of accuracy is reached?
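      The multilevel parameter sweep under discussion can be sketched generically: evaluate a coarse grid, centre a finer grid on the best point, and repeat for a fixed number of levels. This is an illustrative sketch, not PhysiCOOL's actual API; names, the shrink factor, and the fixed-level stopping rule are assumptions made here.

      ```python
      import numpy as np

      def multilevel_sweep(objective, bounds, points_per_dim=5, levels=3):
          """Repeatedly evaluate a grid and shrink it around the best point.
          `bounds` is a list of (low, high) pairs, one per parameter."""
          best = None
          for _ in range(levels):
              axes = [np.linspace(lo, hi, points_per_dim) for lo, hi in bounds]
              grid = np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, len(bounds))
              scores = [objective(p) for p in grid]
              best = grid[int(np.argmin(scores))]
              # halve each bound's width, centred on the current best point
              bounds = [(max(lo, b - (hi - lo) / 4), min(hi, b + (hi - lo) / 4))
                        for (lo, hi), b in zip(bounds, best)]
          return best

      # Toy objective with optimum at (0.3, 0.7)
      f = lambda p: (p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2
      best = multilevel_sweep(f, [(0.0, 1.0), (0.0, 1.0)])
      ```

      The reviewer's suggestion would amount to replacing the fixed `levels` count with a convergence test, e.g. stopping once the grid width falls below a target tolerance.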

      Reviewer 2. Daniel Roy Bergman

      This is a very nice addition to the PhysiCell ecosystem. Methods for parameterizing agent-based models are critical, and the ability to do so without expensive computing resources, i.e. HPC, will aid many researchers.

      Comments:
      1) "Furthermore, experimental data could..., they can be used..." — this feels like a run-on sentence. It is unclear who/what "they" is.
      2) "bespoke HPC workflows..." — Is this referencing DAPT and the PhysiCell-EMEWS workflow? If so, how does PhysiCOOL differ from these?
      3) Is PhysiCOOL defining this multilevel sweep approach to parameter estimation? Or is this already established? If the former, please emphasize. If the latter, are there citations?
      4) Please emphasize that the "Simple model of logistic growth" is not done with PhysiCell.
      5) I needed a Python version < 3.11.0 to install physicool

      Major revisions: 1) Please check on the issue I had with the motility example and it not generating output files.

      Minor revisions:
      1) "As for many several computational modelling frameworks..." — consider rewording. I would suggest "As with many computational modeling frameworks".
      2) "...namely an Extensible..."
      3) "...can be employed to randomly sample points within..."
      4) Please change the notation in Table 2 so that the "* point" columns report the values as coordinates ( , ) rather than like intervals [ , ].

      Re-review: The authors addressed all my concerns and I have no further reservations in recommending this manuscript for publication.

    1. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.76) and has published the reviews under the same license.

      Reviewer 1. Sven Winter

      I am really sorry, and I do not want to sound mean, but this manuscript needs major improvements in structure, writing, and data validation. It violates so many standard practices of scientific writing. I have never seen anybody cite a full title of a previous manuscript. There is absolutely no need for that. The annotation is labeled as an improved annotation, but its results are only listed in the abstract, and it is not mentioned how it was generated anywhere other than the data availability section. That the genome is tagged under RefSeq by NCBI is absolutely unnecessary information in the abstract; this is just a label and says little about quality.

      I would urge the authors to restructure the manuscript. Start with a short description of the species and why the species and its genome are important as an introduction, then focus on a detailed data description with methods and basic results such as assembly statistics (importantly not just scaffold N50 but also at the contig level!), Busco, Merqury completeness and error rate, genome size estimate, annotation (repeat and gene), etc. There is really no need for 30 pages of useless supplementary tables (please also make sure that next time you sort the files during the submission so that the pdf does not start with 30 pages of tables). The data cannot support any information about gene loss, as so much of the assembly is not properly anchored into chromosomes. I would also try to improve the Hi-C contact map figure. There is really no need for the blue and green boxes and the assembly label at the x-axis. I may have overlooked it due to the writing style, but I would like to see mentioned how much of the assembly is in the chromosome-scale scaffolds and how much is unplaced. I like the improved assembly, it just needs a much better presentation in the form of a well-structured manuscript, and unfortunately, in its current form, it clearly is not well-structured.
There are plenty of other data notes available as templates. I personally would always opt for a more traditional manuscript structure (Introduction, Methods, combined Results and Discussion), but that is my personal preference. I hope my comments are helpful, and I am looking forward to seeing a revised version in the future.

      Re-review:

      Thank you for the improvement of the manuscript. It is now easier to follow and includes more information as before. It was a bit difficult to see the changes as they were not highlighted and the lines are not numbered. Despite that, I have only a few minor comments that should be addressed easily so that the manuscript will be ready for publication soon. Line numbers in the comments refer to lines of the specific paragraph/section.

      DNA and RNA extraction: L7: such as? If you listed all tissues, please remove "such as"; if you sequenced RNA from more tissues, please add them.

      Sequencing and Assembly: L5: 159 bp is an uncommon read length. Was this just a typo, or how did that come to be? L10: remove "the" before juicer; otherwise, it sounds like an actual fruit juicer instead of a bioinformatics tool ;-). Same for 3D-DNA in the line below. Please make it more clear in the text if you sequenced the RNA for each tissue separately or in one library. L11-12: I am not convinced that not allowing for correction was the right approach. Did you test how the results would look with corrections enabled?

      Assembly Statistics and Quast Results: Quast calculates assembly statistics so I am not sure why the header needs to include both. L5: Please avoid using "better" but instead rephrase so that it is clear that the NG50 is 1.75x larger than the previous assembly's. "Better" is not clear.

      Busco and Merqury results: I would not claim that Busco says the genome is 95% complete, as Busco only tries to find genes that are supposedly orthologous in Actinopterygii. So I would rather say Busco suggests a high completeness as it finds 95% of the orthologs. Also, all genes in the Busco dataset are supposed to be single-copy orthologs; therefore, I would not say that 93% are conserved single-copy orthologs, as the remaining duplicated or fragmented genes could just be assembly errors. Please also state the Merqury QV value, and I would suggest stating the error rate in %. I still find the discussion about missing Busco genes strange, as since Busco 4 or 5 the datasets all got much larger and the Busco completeness values went down in most assemblies, even in well-studied taxa such as mammals. With recent datasets, it is very unlikely to get much more than 95-97%. In my opinion, it is rather a sign of too large and incorrect Busco datasets than evidence for missing orthologs. I would at least add that point to the discussion.

      Table 1: Please follow standard practice in scientific writing and add thousands separators to the numbers in all tables (main text and supplementary), e.g., 28444102 → 28,444,102. Otherwise, they are difficult to read.
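      For authors applying this suggestion programmatically, Python's format specification mini-language handles the thousands separator directly. A minimal illustration (the function name is ours, not from the manuscript):

```python
# Format large assembly statistics with thousands separators,
# as the reviewer suggests for readability (e.g. 28444102 -> 28,444,102).
def with_separators(n: int) -> str:
    """Return the integer rendered with comma separators."""
    return f"{n:,}"

print(with_separators(28444102))  # 28,444,102
```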

      Annotation Results: L3: 20,101 coding genes, 18,616 genes … Please check throughout the whole manuscript for consistent style.

      Data Availability: L2: Annotation report release 100. What does "100" stand for? Also, "at here" does not sound correct; please remove "at". L4: Table S2 does not show the scaffold identifiers. L5: please state the complete BioProject accession, not just the numerical part.

      Supplementary data: Please change numbers in all tables to standard format e.g., 21,671,036

      Reviewer 2. Yue Song

      (1) Please state clearly how much CCS Hi-Fi data was produced by sequencing and how much Hi-C data was ultimately used for chromosome assembly after filtering, not just the number of reads. (2) Please state clearly the genome size estimated from the Hi-Fi data.
      

      (3) What is the process for "correct primary assembly misassembles"? Please describe it in detail. (4) In Table 1, I noticed that the difference between the new and previous genome of S. scovelli is more than 100 Mb (about 25% of the size of the new assembly). Moreover, most genome sizes of Syngnathus species range from 280-340 Mb, so I think some explanation of these extra sequences is necessary. (5) Please provide more detailed parameters and procedures for the genome assembly and gene annotation. (6) Did the previous version have any assembly errors that were corrected in this new one? If so, please state this.

  2. Feb 2023
    1. Abstract

      This work has been peer reviewed in GigaScience ( see https://doi.org/10.1093/gigascience/giac093 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Dadi Gao

      Summary: The authors developed a de novo assembly method, BrumiR, for small RNA sequencing data based on the de Bruijn graph algorithm. This tool displayed a relatively high sensitivity in finding miRNAs and helped the authors discover a novel miRNA in A. thaliana roots.

      Major comments:

      Have the authors compared performance with different seed lengths? Even if the minimal miRNA length is 18 nt in miRBase 21, seed=18 might not necessarily lead to the best AUC or F-score (this might also be related to Comment 4).

      The authors need to benchmark BrumiR with more existing tools (e.g. those ML-based methods), and to include more genome-free methods (e.g. MiRNAgFree).

      It is also interesting to know whether de novo methods for mRNA assembly would be useful on the miRNA side. It would be great if the authors were able to compare the performance of BrumiR2reference (without filtering for RFAM) with Trinity in genome-guided mode, by tweaking its seed length to be the same as BrumiR's.

      The tool's sensitivity is promising across animal and plant datasets. However, the average precision is quite low; an average precision of 0.3 means a false discovery rate of 0.7. This is not an acceptable value for a tool designed to discover novel miRNAs. Is there any parameter the authors could tweak towards better performance? For example, is a seed length of 18 nt too short to start with? Are there any other sequence features the authors should take into account to boost performance? Or maybe some post-assembly filtering approaches might be sufficient and helpful.
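      The arithmetic behind this point (precision, recall, F-score, and the precision-FDR relationship) can be made explicit in a small sketch; the counts below are illustrative, not taken from the benchmark:

```python
# With a precision of 0.3, the false discovery rate is 1 - 0.3 = 0.7,
# which is the relationship the comment above relies on.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_score(p, r):
    # F1: harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

p = precision(30, 70)   # 0.3 -> FDR = 1 - p = 0.7
r = recall(30, 10)      # 0.75
print(round(1 - p, 2), round(f_score(p, r), 3))  # 0.7 0.429
```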

      Wet-lab validation (e.g. Luciferase assay) for the identified novel miRs will leverage the real-life usefulness of BrumiR. This is extremely important, as the tool showed a high false discovery rate.

      Minor comments:

      MiRNA maturation involves RNA editing. Can the authors comment on how this would be handled and captured by BrumiR? It seems that the authors allow mismatches when clustering the potential miRNAs via the edlib library. It is interesting to know whether or not, or to what extent, edlib would help in including RNA-edited candidates in the final result.
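      To illustrate this point, here is a toy sketch of edit-distance-based clustering. BrumiR itself uses the edlib library; the plain Levenshtein implementation and the sequences below are only illustrative stand-ins:

```python
# Illustrative sketch (not BrumiR's actual code): clustering candidate
# miRNA sequences by edit distance, so that variants differing by a few
# edits (e.g. RNA-edited copies) fall into the same cluster.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cluster(seqs, max_dist=2):
    """Greedy clustering: join a sequence to the first cluster whose
    representative is within max_dist edits; otherwise start a new one."""
    clusters = []
    for s in seqs:
        for c in clusters:
            if edit_distance(s, c[0]) <= max_dist:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

reads = ["UGAGGUAGUAGGUUGUAUAGUU",   # let-7-like sequence
         "UGAGGUAGUAGGUUGUAUGGUU",   # one substitution ("edited" copy)
         "CCCUGAGACCCUAACUUGUGA"]    # unrelated sequence
print([len(c) for c in cluster(reads)])  # [2, 1]
```

With a tolerance of two edits, the substituted variant lands in the same cluster as its unedited counterpart, which is the behavior the question about RNA-edited candidates probes.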

    2. AbstractMicroRNAs

      This work has been peer reviewed in GigaScience ( see https://doi.org/10.1093/gigascience/giac093 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Marc Friedlander

      The authors here present BrumiR, a de Bruijn-based method to discover miRNAs independently of a reference genome. Today most miRNA discovery and annotation is done by mapping sequenced RNAs to readily available reference genomes and analyzing the mapping profiles. However, there are some uses cases where the genome-free approach is needed (particularly for species that have no reference genome or where the genomes have missing parts); therefore BrumiR could potentially be useful for the community. However, the comparison to existing tools needs to be done in a more careful way.

      Major comments:

      RFAM filtering is not really part of the prediction step; it is rather a filtering step. Therefore, to make a fair comparison with mirnovo (the other genome-free tool), BrumiR should additionally be run without RFAM filtering, and mirnovo should additionally be run using the exact same RFAM filtering.

      it appears that 16-mers from miRBase miRNAs were specifically excluded from the RFAM catalog used for the filtering, which is reasonable. However, the miRNAs from the exact benchmarked species should not be included in the used miRBase 16-mer catalog, to avoid circular reasoning.

      miRDeep2 software should ideally not be run with default options - this is particular important since the miRDeep2 performance in this manuscript appears lower than what is reported in other studies (e.g. Friedlander et al. 2012). First, reference mature miRNAs from a related and well-annotated species should be included to support the prediction. Second, a score cut-off should be used that gives a decent signal-to-noise ratio according to the miRDeep2 output overview table (for instance 5:1). Third, all read pre-processing and genome mapping should be performed with the mapper.pl script which is part of the miRDeep2 package.

      it appears that only miRNA-derived sequences were included in the simulated data. In fact, real small RNA-seq data typically contains fragments from other known types of RNA and also sequences from unannotated parts from the genome. Therefore, the authors should use simulated data that also includes samples from RFAM and randomly sampled sequences from the reference genome (for instance 10% of each). Overall, the use of simulated sequence data could be put a bit in the background in this study, since real small RNA-seq data is in fact readily available these days and typically has a structure that is not easy to simulate. Further, there is little reason not to use real data, since the miRNAs in miRBase tend to be reasonably well curated for most species and therefore can function well as a gold standard for benchmarking.

      precision of BrumiR is in some cases lower than 0.2, for instance for one mouse dataset. From this dataset ~3000 mouse miRNAs are reported - the majority of which are not in miRBase and can reasonably be presumed to be false positives. The authors should comment on why this particular dataset appears to produce so many false positives for BrumiR - could this have to do with the prevalence of piRNAs that the software cannot easily discern from miRNAs? Also, the authors should reflect on what kinds of use cases could tolerate these thousands of false positives. Would this be for generating candidates for downstream high-throughput validation?

      the authors should either benchmark BrumiR against the genome-free methods miReader and MirPlex, or explain why this comparison is not relevant.

      Minor comments:

      the brief introduction to miRNA biology should be carefully edited by an expert in the field. Currently, very old reviews are being cited (e.g. Bartel 2004), and some of the other references appear to be a bit spurious (e.g. why focus on plant host-pathogen interactions out of the hundreds of established functions of miRNAs?). The excellent review of Dave Bartel from 2018 contains references to numerous milestone studies that the introduction could build on.

      the authors write on page 2 that genome-based methods struggle with a high rate of false positive prediction, citing [9]. However, this is a mis-reference, since the reference [9] states that methods that rely on only the genome and do not leverage on small RNA-seq data have high false positive rates.

    3. AbstractMicroRNAs (miRNAs)

      This work has been peer reviewed in GigaScience ( see https://doi.org/10.1093/gigascience/giac093 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Ernesto Picardi

      The manuscript by Moraga et al. describes BrumiR, a software devoted to the de novo identification of miRNAs from deep sequencing experiments of the RNA fraction at low molecular weight. In contrast with existing tools, BrumiR is based on de Bruijn graphs, generated directly from raw fastq reads. The performances on simulated and real sequencing data, in terms of precision, recall and FScore, are very good. In addition, the tool is ultra-fast, enabling the analysis of huge amount of data. I have tried to use BrumiR but I always got a GLIB error. I have tested the script on different Linux and Mac computers but I was not able to fix the GLIB error. It seems that a very recent version of the GLIB library is required. So, unfortunately, I didn't have the possibility to test the program and look at the outputs.

      Major concerns:

      I was not able to run the program and thus cannot provide a proper review. In my opinion, the GitHub page should take this into account by stating the minimal software and hardware requirements to run BrumiR. The authors could also include a copy of the output files (by the way, there is a typo in the description of the second output file).

      Since the tool is able to identify novel miRNAs and also look at known ones, the authors could provide an output file including the read count per miRNA. In addition, since the tool is expected to be ultra-fast (not checked … see above), differential gene expression analysis could also be implemented.

      I also suggest implementing a graphical output: a summary in a decorated HTML page.

      Using BrumiR, the authors analyze miRNAs in Arabidopsis during development, discovering three novel miRNAs. Although bioinformatic evidence indicates that they could be real miRNAs, experimental validation is required. Indeed, these miRNAs have been detected by BrumiR only. I think that this validation could be done easily because the authors performed the sRNA-seq experiments themselves. In my opinion, this experiment could really improve the manuscript and assess the high performance of BrumiR.

    1. Background

      This work has been peer reviewed in GigaScience ( see https://doi.org/10.1093/gigascience/giac097 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Giulia De Riso

      In this study, a workflow is presented to generate classification models from DNA methylation data. Methods to deal with harmonization and missing data imputation are presented and the benefit of adopting them for classification tasks is tested on case-control datasets of schizophrenia and Parkinson disease. The authors support this workflow with source code. Although mostly based on already known methodologies, the present study may help orient studies aimed at building and applying DNA methylation based models. However, some major concerns can be raised:

      Majors: In different points of the manuscript, the authors refer to their approach as a pipeline. Indeed, this approach should be composed of sequential modules, in which the output of a module becomes the input of the next one. Although the modules are clearly distinguishable, their organization in the pipeline is less straightforward (also considering that modules can be adopted both to build a model and to use it on new data). The authors could consider drawing a scheme of the pipeline, or adopting a different term to refer to the presented approach. From the model performance perspective, the ML models perform poorly for schizophrenia. The authors point to inner characteristics of the disease as a possible reason for this. However, this point should be discussed more thoroughly in the Discussion section.

      Besides this, the impact of the smaller number of samples included in the training set and the higher proportion of imputed features compared to Parkinson disease on the classification accuracy should be discussed. In addition, since the authors provided the code, is there a way to select samples to include in training/test sets based on random choice (classical 70-30% splitting) instead of source dataset? "For machine learning models, we used only those CpG sites that have the same distribution of methylation levels in different datasets in the control group (methylation levels in the case group typically have greater variability because of disease heterogeneity).": is this filtering performed only on the datasets included in the training set, or also on the test set? It seems the former, but the authors should clearly state this point. Accuracy with weighted averaging should be defined with a formula in the methods section. Regarding the ML models, the authors chose different types of decision-tree ensembles, along with a deep learning one. They should contextualize this choice (why different models from the same family?).
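      The classical random 70-30% split asked about above can be sketched with the standard library alone (names here are illustrative, not from the authors' code):

```python
# Minimal sketch of a classical random 70/30 train/test split,
# as an alternative to splitting samples by source dataset.
import random

def train_test_split(samples, test_fraction=0.3, seed=42):
    rng = random.Random(seed)        # fixed seed for reproducibility
    shuffled = samples[:]            # do not mutate the caller's list
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 70 30
```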

      In addition, ML models built on DNA methylation are often based on elastic net or Support-Vector Machines, which are not accounted for in this work. The authors should comment on this aspect in limitations, and state whether the code they provided for their approach could be customized to adopt different models from the ones they presented.

      Regarding the Imputation Method column in Table 2, the meaning is not clear. Are the different imputation methods described in the Imputation of missing values section paired with the ML models presented in Table 2? If yes, some of the methods (like KNN) are missing. In the harmonization section, models for case-control classification are trained on different numbers and sets of CpGs. To assess the effect of harmonization alone, the number of CpGs should instead be fixed. This is especially critical for schizophrenia, where the number of features for the non-harmonized data is 35,145 whereas that for harmonized data is 110,137. Dimensionality reduction section: are the models from imputed and non-imputed data trained only on harmonized data? And how are the sets of 50,911 CpG sites for Parkinson and 110,137 CpG sites for schizophrenia selected?

      Imputation of missing values section: it is not clear on which CpGs and on which samples imputation is performed. Also, it is not clear whether the imputation has been tested on the best-performing model.

      Minors: Page 1, line 2: "DNA methylation is associated with epigenetic modification". DNA methylation is an epigenetic mark itself. Do the authors mean histone marks?

      Page 1, from line 7: "DNA methylation consists of binding a methyl group to cytosine in the cytosineguanine dinucleotides (CpG sites). Hypermethylation of CpG sites near the gene promoter is known to repress transcription, while hypermethylation in the gene body appears to have an opposite, also less pronounced effect.": references should be added

      Page 2, from line 2 : "Current epigenome-wide association studies (EWAS) test DNAm associations with human phenotypes, health conditions and diseases.": references should be added

      Page 3: "In most cases, an increase in dimensionality does not provide significant benefits, since lower dimensionality data may contain more relevant information". This point could be presented in a reverse way (higher dimensionality data may contain redundant information), introducing the collinearity issue. In addition, this issue could be introduced before the missing values and imputation section.

      Page 3: references for "Modern machine-l earning-based artificial intelligence systems are powerful and promising tools" could be more specific for the field of epigenetics and DNA methylation.

    2. Abstract

      This work has been peer reviewed in GigaScience ( see https://doi.org/10.1093/gigascience/giac097 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Liang Yu

      Comments to Author: The paper by Kalyakulina et al. described disease classification from whole blood DNA methylation. The authors propose a comprehensive approach that combines DNA methylation datasets to classify controls and patients. The solution includes data harmonization, construction of machine learning classification models, dimensionality reduction of models, imputation of missing values, and explanation of model predictions by explainable artificial intelligence algorithms. For Parkinson's disease and schizophrenia, the authors also demonstrate that a method for classifying healthy individuals and patients with various disorders based on whole blood DNA methylation data is an efficient and comprehensive approach.

      Overall, the manuscript is well organized. I have some suggestions for the authors to improve their work:

      1. The manuscript has constructed different models for the prediction study of CpG sites for different types of data. It is suggested to add a flowchart of the whole model construction process to the manuscript so that readers can understand the study more clearly.

      2. In Figure 4, the authors only show the top 10 important features and mark the highest accuracy and number of features with black lines in the figure. It is recommended to show the relevant data (optimal accuracy and number of features) in the figure. For the three subplots included in the figure, please label them separately, e.g., A, B, and C.

      3. Regarding model performance evaluation: the authors should provide standard deviations of the obtained values.

      4. In this manuscript, the authors used graphs to present the results; a table summarizing the performance results of the models would be more intuitive.

      5. I did not find how the authors optimized the hyper-parameters; usually grid search is used.

      6. The authors do not adequately address how their method outperforms existing methods in the discussion section.

      7. The "Dimensionality reduction" section: I think this section is more appropriately called "feature selection"; it describes a sequential forward search method: first sort the features according to their importance values, then add or remove features from a candidate subset while evaluating the criterion.
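      The sequential forward search described in point 7 can be sketched as follows (a hedged, toy illustration; the scoring function is a stand-in for cross-validated accuracy, and all names are hypothetical):

```python
# Sequential forward selection: rank features by importance, then grow the
# candidate subset one feature at a time, keeping an addition only when the
# evaluation criterion improves.
def forward_selection(features, importance, score, min_gain=0.0):
    ranked = sorted(features, key=lambda f: importance[f], reverse=True)
    selected, best = [], float("-inf")
    for f in ranked:
        s = score(selected + [f])
        if s > best + min_gain:
            selected.append(f)
            best = s
    return selected, best

# Toy example: each feature contributes its "signal"; one is redundant.
signal = {"cg01": 0.4, "cg02": 0.3, "cg03": 0.0, "cg04": 0.1}
importance = {"cg01": 10, "cg02": 8, "cg03": 5, "cg04": 2}
subset, acc = forward_selection(signal, importance,
                                score=lambda fs: sum(signal[f] for f in fs))
print(subset, round(acc, 2))  # ['cg01', 'cg02', 'cg04'] 0.8
```

The redundant feature (`cg03`) is evaluated but never kept, which is the behavior that distinguishes this scheme from simply taking the top-k most important features.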

    1. AbstractRecent studies

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac094 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Milad Miladi

      In this work, Tombacz et al. provide a Nanopore RNA sequencing dataset of SARS-CoV-2 infected cells at several time points and with several sequencing setups. Both direct RNA-seq and cDNA-seq techniques have been utilized, and multiplex barcoded sequencing has been done for combining the samples. The dataset can be helpful to the community, such as for future transcriptomic studies of SARS-CoV-2, especially for studying the infection and expression dynamics. The text is well written and easy to follow. I find this work valuable; however, I can see several limitations in the analysis and representation of the results.

      Notably, the figures and tables representing statistical and biological insights of the data points are underworked, lack clarity, and provide limited information about the experiment. Further visualizations, analysis, and data processing could help to reveal the value and insights from this sequencing experiment.

      Comments: The presentation of read coverage and lengths in Figs 1 & 2 is elementary, unpolished, and non-informative. Better annotation and labeling in Fig. 1 would be needed. Stacking so many violin plots in Fig 2 does not provide any valuable information and may only mislead. What are the messages of these figures? What do the authors expect the readers to catch from them? As noted, stacking many similar figures does not add further information. The authors may want to consider alternative representations and aggregation of the information, besides or replacing the current plots. For example, in Fig. 2, scatter/line plots for the median & 25/75% percentile ranges, with an aggregation of the three replicates in one x-axis position, could help identify potential trends over the time points.

      It is better to start the paper by presenting the current Fig. 3 as the first one. This figure is the core of the contributions and methodologies, and the current Figs 1 & 2 are logical follow-ups of this step.

      There is a very limited description in the Figure Legends. The reader should be able to understand essential elements of the figures merely based on the Figure and its legend.

      This study does not provide much notable biological insight without demultiplexing the reads of each experimental condition into genomic and subgenomic subsets. Distinguishing the genomic and subgenomic reads and analyzing their relative ratio is essential in this temporal study. Due to the transcription process of coronaviruses, the genomic and subgenomic reads have very different characteristics, such as length distribution and cellular presence. Genomic and subgenomic reads can be reliably identified by their coverage and splicing profiles, for sufficiently long reads. It is essential that the authors further process the data by categorizing the genomic/subgenomic reads and provide statistics such as read length for each category. It would also be interesting to observe the ratio of genomic vs. subgenomic reads. This is an indicative metric of the infection state of the sample. An active infection has a higher subgenomic share, while, e.g., a very early infection stage is expected to have a larger portion of genomic reads.
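      A toy sketch of the genomic/subgenomic classification described above: subgenomic coronavirus reads carry the 5' leader fused, via a large junction, to a 3'-proximal body segment, while genomic reads align contiguously. The coordinates and thresholds below are illustrative assumptions, not the authors' pipeline:

```python
# Classify long reads as genomic or subgenomic from their alignment blocks
# (toy model; real pipelines inspect leader-body junctions in BAM records).
LEADER_END = 75          # approximate end of the 5' leader (assumption)
MIN_JUNCTION_GAP = 1000  # minimum gap treated as a leader-body junction

def is_subgenomic(blocks):
    """blocks: list of (start, end) aligned segments, sorted by start."""
    starts_in_leader = blocks[0][0] < LEADER_END
    has_junction = any(nxt[0] - cur[1] >= MIN_JUNCTION_GAP
                       for cur, nxt in zip(blocks, blocks[1:]))
    return starts_in_leader and has_junction

reads = [
    [(0, 29000)],                 # contiguous alignment: genomic
    [(0, 70), (25380, 29870)],    # leader + 3'-proximal body: subgenomic
    [(0, 65), (28250, 29870)],    # leader + 3'-proximal body: subgenomic
]
sub = sum(is_subgenomic(r) for r in reads)
print(f"subgenomic:genomic = {sub}:{len(reads) - sub}")  # 2:1
```

The same per-read labels, aggregated per time point, would yield the subgenomic-to-genomic ratio the review proposes as an infection-state metric.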

      Page-3: "[..] the nested set of subgenomic RNAs (sgRNAs) mapping to the 3'-third of the viral genome". Is 3'-third a typo? Otherwise, the text is not understandable.

      Page-4: " because after a couple of hours, the virus can initiate a new infection cycle within the noninfected cells." More context and elaboration by citing some references can help to support the authors' claim. A gradual infection of non-infected cells can be assumed. However, "a couple of hours" and "initiate a new infection cycle" need further support in a scientific manuscript. The infection process is fairly gradual, but the wording here infers a sudden transition to infecting other cells only at a particular time point.

      Page-4: "[..]undergo alterations non-infected cells during the propagation therefore, we cannot decide whether the transcriptional changes in infected are due to the effect of the virus or to the time factor of culturing." This can be strong support for why this experiment has been done and for the value of this dataset. I would suggest mentioning this in the abstract to highlight the motivation.

      Page-4: "based studies have revealed a hidden transcriptional complexity in viruses [13,14]" Besides Kim et. al, the first DRS experiments of coronaviruses have not been cited (doi.org/10.1101/gr.247064.118, doi.org/10.1101/2020.07.18.204362, doi.org/10.1101/2020.03.05.976167)

      Table-1: dcDNA is quite an uncommon term. In general, here and elsewhere in the text, insisting on a direct cDNA is a bit misleading. A "direct" cDNA sequencing is still an indirect sequencing of RNA molecules!

      Figs S2 and S3: Please also report the ratio of virus to host reads.

    2. Abstract

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac094 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows: Reviewer name: George Taiaroa

      The authors provide a potentially useful dataset relating to transcripts from cultured SARS-CoV-2 material in a commonly used cell line (Vero). Relevant sequence data are publicly available and descriptions of the preparation of these data are for the most part detailed and adequate, although detail is lacking at times.

      Although the authors state that this dataset overcomes the limitations of available transcriptomic datasets, I do not believe this to be an accurate statement; based on comparable published work in this cell line, transcriptional activity is expected to peak at approximately one day post infection (Chang et al. 2021, Transcriptional and epi-transcriptional dynamics of SARS-CoV-2 during cellular infection), with the 96 hour period of infection described likely representing overlapping cellular infections of different stages.

      Secondly, many in the field have moved to use more appropriate cell lines in place of the Vero African Monkey kidney cell line, to better reflect changes in transcription during the course of infection in human and/or lung epithelial cells (See Finkel et al. 2020, The coding capacity of SARS-CoV-2). Lastly, the study would ideally be performed with a publicly available SARS-CoV-2 strain, as has been the case for earlier studies of this nature to allow for reproducibility and extension of the work presented by others.

      That said, the data are publicly available and could be of use. Primary comments: I think that a statement detailing the ethics approval for this work would be essential, given that the materials used were collected posthumously from a patient. Similarly, were these studies performed under appropriate containment, given classifications of SARS-CoV-2 at the time of the study? I do not know what the authors mean in reference to a 'mixed time point sample' for the one direct RNA sample in this study; could this please be clarified? Secondary comments: I believe the authors may over-simplify discontinuous extension of minus strands in saying that

      'The gRNA and the sgRNAs have common 3'-termini since the RdRP synthesizes the positive sense RNAs from this end of the genome'. Each of the 5' and 3' sequence of gRNAs/sgRNAs are shared through this process of replication. 'Infections are typically carried out using fresh, rapidly growing cells, and fresh cultures are also used as mock-infected cells. However, gene expression profiles may undergo alterations non-infected cells during the propagation therefore, we cannot decide whether the transcriptional changes in infected are due to the effect of the virus or to the time factor of culturing. This phenomenon is practically never tested in the experiments.' I do not follow what these sentences are referring to. 'Altogether, we generated almost 64 million long-reads, from which more than 1.8 million reads mapped to the SARS-CoV-2 and almost 48 million to the host reference genome, respectively (Table 1).

      The obtained read count resulted in a very high coverage across the viral genome (Figure 1). Detailed data on the read counts, quality of reads including read lengths (Figure 2), insertions, deletions, as well as mismatches are summarized Supplementary Tables.' Could this perhaps be more appropriately placed in the data analysis section, rather than background?

    1. AbstractRecent technological

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac088), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Kamil S. Jaron

      Assembling a genome using short reads quite often causes a mixed bag of scaffolds representing uncollapsed haplotypes, collapsed haplotypes (i.e. the desired haploid genome representation) and collapsed duplicates. While there is individual software for collapsing uncollapsed haplotypes (e.g. HaploMerger2, or Redundans), there are no established workflows or standards for quality control of finished assemblies. Naranjo-Ortiz et al. describe a pipeline attempting to make one.

      The Karyon pipeline is a workflow for assembling haploid reference genomes while evaluating the ploidy levels of all scaffolds, using GATK for variant calling and nQuire, a statistical method for estimating ploidy from allelic coverage support. I appreciated that the pipeline promotes some good habits - such as comparing k-mer spectra with the genome assembly (by KAT) or treatment of contamination (using Blobtools). Nearly all components of the pipeline are established tools, but the authors also propose karyon plots - diagnostic plots for quality control of assemblies.
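      As an aside on that k-mer habit: the spectrum comparison that KAT automates boils down to histogramming k-mer multiplicities in the reads and comparing them against the assembly's k-mer content. A minimal illustrative sketch (a toy with made-up data, not KAT's implementation; the function name is mine):

      ```python
      from collections import Counter

      def kmer_spectrum(seqs, k=21):
          """Histogram of k-mer multiplicities: spectrum[m] = number of distinct
          k-mers observed exactly m times. Comparing the read spectrum with the
          assembly's k-mer content is what exposes uncollapsed haplotypes
          (k-mers present twice in reads but once in the assembly) and
          collapsed repeats."""
          counts = Counter()
          for seq in seqs:
              for i in range(len(seq) - k + 1):
                  counts[seq[i:i + k]] += 1
          return Counter(counts.values())

      # Three identical toy "reads": every distinct 21-mer is seen 3 times.
      reads = ["ACGTACGTACGTACGTACGTACGT"] * 3
      spec = kmer_spectrum(reads)
      ```

      On real data one would also canonicalize each k-mer with its reverse complement before counting; that detail is omitted here for brevity.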

      The most interesting and novel one I have seen is a plot of SNP density vs coverage. Such a plot might be helpful in identifying various changes to ploidy levels specific to a subset of chromosomes, as the authors demonstrated on the example of several fungal genomes (Mucorales). I attempted to run the pipeline and ran into several technical issues. The authors helped me overcome the major ones (documented here: https://github.com/Gabaldonlab/karyon/issues/1) and I managed to generate a karyon plot for the genome of a male hexapod with an X0 sex determination system. I did that because we know the karyotype well, and I suspected the X chromosome would nicely pop up in the karyon plot.

      To my surprise, although I know the scaffold coverages are very much bi-modal, I got only a single peak of coverages in the karyon plot, oddly somewhere in between the expected haploid and diploid coverages. I think it is possible I have messed up something, but I would like the authors to demonstrate the tool on a known genome with a known karyotype. I would propose to use a male of a species with an XY or X0 sex determination system. Although it's not aneuploidy sensu stricto, it is most likely the most common within-genome ploidy variation among metazoans. I would also propose that the authors improve the modularity of the pipeline. On my request the authors added a lightweight installation for users interested in the diagnostic plots after the assembly step, but the inputs are expected in a specific, but undocumented format, which makes modular use rather hard. At least the documentation of the formats should improve, but in general I think it could be made friendlier to folks interested only in some smaller bits (I am happy to provide the authors with the data I used).

      Although I quite enjoyed reading the manuscript and the manual afterwards, I do think there is a lot of space for improvement. One major point is that there is no formal description of the only truly innovative bit of this pipeline - the karyon plots. There is a nice philosophical overview, but the karyon plots are not explained in particular, which makes reading the showcase study much harder. Perhaps a scheme showing the plot and annotating what is expected where would help. Furthermore, the authors did a likelihood analysis of ploidy using nQuire, but they did not talk about it at all in the result section. I wonder: what is the fraction of the assembly the analysis found most likely to be aneuploid for the subset of strains suspected to be aneuploids? Is a 1,000 bp sliding window big enough to carry enough signal to produce reliable assignments? In my experience, windows of this size are hard to assign ploidy to, but I usually do such analyses using coverage, not SNP supports.

      However, I would like to applaud the authors for the fungal showcases; I do think they are a nice piece of genomics work, investigating and considering both biological and technical aspects appropriately. Finally, a smaller comment is that the introduction could be a bit more to the point. Some of the sections felt a bit out of place, perhaps even unnecessary (see minor comments below). More specific and minor comments are listed below. Kamil S. Jaron

      Minor manuscript comments: I gave this manuscript a lot of thought, so I would like to share with you what I have figured out. However, I recognise that these writing comments listed below are largely a matter of personal preference. I hope they will be useful for you, but it is nothing I would like to insist on as a reviewer. l56: An unnecessary book citation. It's not a primary source for that statement and if a reference was made as "further reading", perhaps better to cite a recent review available online rather than a book. l65 - 66: Is the "lower error rate" still a true statement? I don't think it is; error rates of HiFi reads are similar or even lower compared to short reads (though I do agree there is still plenty of use for short reads). l68 - 72: I don't think you really need this confusing statement "which are mainly influenced by the number of different k-mers"; the problems of short read assembly are well explained below. However, I actually did not understand why the whole paragraph l76 - 88 was important. I would expect an introduction to cover the approaches people have used till now to overcome problems of ploidy and heterozygosity in assemblies. l176 - 177: "Ploidy can be easily estimated with cytogenetic techniques" - I don't think this statement is universally true. There are many groups where cytogenetics is extremely hard (like notoriously difficult nematodes) or species that don't cultivate in the lab. For those it's much easier to do NGS analysis. You actually contradict this "easily" right in the next sentence. l191: the first author of nQuire is not Weib, but Weiß. The same typo is in the reference list. l222 - 223 and l69 - 70 both explain what a k-mer is. l266 - 267: This statement and the list do not contain references to the publications sequencing the original genomes. I am not sure, but when possible, it is good to credit the original authors for the sequencing efforts. l302: REF instead of a reference. l303: What is an "important fraction"?
      l304: How can you make such a conclusion? Did you try to remove the contamination and redo the assembly step? Did the assembly improve? Not sure if it's so important for the manuscript, but I would tone down this statement ("could be caused by" sounds more appropriate). l310: "B9738 is haploid" - are you talking about the genome or the assembly? How could you tell the difference between a homozygous diploid and a haploid genome? If there is a biological reason why a homozygous diploid is unlikely, it should be mentioned. l342: How does fig 7 show 3% heterozygosity? How was the heterozygosity measured? Also, the karyon plot actually shows that the majority of the genome is extremely homozygous and all heterozygosity is in windows with spuriously high coverage. What do you think is the haploid / diploid sequencing coverage in this case? l343 - 345: I don't think these statements are appropriately justified. The analysis presented did not convincingly show the genome is triploid or heterozygous diploid. l350: I think citing SRA is rather unnecessary. l358: what "model"? How could one reproduce the analysis / where could the model be found? l378 - 379: Does Karyon analyse ploidy variation "during" the assembly process? Although the process is integrated in a streamlined pipeline, there are loads of approaches to detect karyotype changes in assemblies, from nQuire, which is used by Karyon, through all the sex-chromosome analyses, such as https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002078.

      Method/manual comments:

      Scaffold length plots have no label on the x axis. As the plots are called distributions, I would expect frequency or probability on the y axis and the scaffold length on the x. Furthermore, plotting of my own data resulted in a linear plot with a very overscaled y-axis. The "Scaffold versus coverage" plot does not have axis labels either. I would also call it scaffold length vs coverage instead. I also found the position of the illustrating picture in the manual a bit confusing (it probably should be before the header of the next plot).

      Variation vs. coverage is the main plot. It does look like a useful visualisation idea. Do I understand right that it's just numbers of SNPs vs coverage? I am confused, as I thought the SNP calling is done on the reference individual, yet in the description you talk about homozygous variants too; what are those? Mismapped reads? Misassembled references?

      I also wonder about "3. Diffuse cloud across both X and Y axes." I would naturally imagine that collapsed paralogs would have a similar pattern to the plot that was shown as an example - a smear towards both higher coverage and SNP density. I guess this is a more general comment: would you expect any different signature for collapsed paralogs versus higher ploidy levels? Should not paralogy be more explicitly considered as a factor?
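      If the karyon plot is indeed SNP count per window against window coverage, the core computation behind it can be sketched in a few lines (a hypothetical illustration with synthetic data, not Karyon's actual code; the function name is mine):

      ```python
      def window_stats(coverage, snp_positions, window=1000):
          """Bin per-base coverage and SNP positions into fixed-size windows.

          Returns parallel lists of (mean coverage, SNP count) per window --
          the two axes of a variation-vs-coverage ("karyon"-style) scatter.
          """
          n_windows = (len(coverage) + window - 1) // window
          mean_cov = []
          for i in range(n_windows):
              chunk = coverage[i * window:(i + 1) * window]
              mean_cov.append(sum(chunk) / len(chunk))
          snp_counts = [0] * n_windows
          for pos in snp_positions:
              snp_counts[pos // window] += 1
          return mean_cov, snp_counts

      # Synthetic 10 kb scaffold: uniform 30x coverage, with heterozygous
      # SNPs only in the second half -- those windows would stand out on
      # the SNP-density axis while staying at the same coverage.
      cov = [30.0] * 10_000
      snps = list(range(5_000, 10_000, 100))
      mc, sc = window_stats(cov, snps)
      ```

      A diploid region with an extra (or missing) chromosome copy would instead shift along the coverage axis, which is what makes the two-axis view informative.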

    2. Recent tec

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac088), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows: Reviewer name: Michael F. Seidl

      The technical note 'Karyon: a computational framework for the diagnosis of hybrids, aneuploids, and other non-standard architectures in genome assemblies' by Naranjo-Ortiz and colleagues reports on the development and application of the Karyon framework. Karyon is a python-based toolkit that utilizes several software tools developed by the authors and/or others, with the overall aim to assess sequencing data and genome assemblies for potential assembly artefacts caused by a plethora of different features intrinsic to the analyzed species/strain. Karyon is publicly available from github and as a docker image.

      Genome assemblies are nowadays important tools to develop novel biological hypotheses. However, genome assemblies are often not ideal, i.e., they are highly fragmented and/or incomplete, which can significantly hamper their full exploitation. The genome assembly quality is impacted by different biological factors that can be, at least partially, discovered directly from the raw sequencing data and from the genome assembly (e.g., allele frequency, k-mer profiles, coverage depth, etc.). There are already plenty of established computational tools available to perform these types of analyses (to name a few: KAT, GenomeScope, nQuire).

      Karyon will ease these analyses by providing a single computational framework that combines different and complex software tools and generates diagnostic figures to support biological interpretation. Karyon thus represents a valuable contribution to the scientific community. The Karyon toolkit is built around established software tools, and the overall methodology is sound and suitable to assess genome qualities. The interpretation of the results of Karyon is left to the user, which still necessitates expert knowledge to correctly interpret the signals.

      While examples are provided in the manual, the level of experience required will likely hamper the full exploitation of the pipeline by non-expert users. Furthermore, it can be anticipated that expert users already employ the separate software tools to study genome complexities, and thus might not be in full need of Karyon. Obviously, this is inherent to the problem at hand and cannot be easily addressed by the authors. However, I would like to encourage the authors to further improve the manual and the examples to guide the data interpretation, with the aim of making this software accessible to as many researchers as possible.

      I nevertheless also have some comments related to the data presented in the manuscript that the authors need to address. First, the introduction finishes by asserting that different biological factors are expected to impact published genome assemblies. Furthermore, the manuscript mentions that the quality of fungal genomes is often sub-optimal. However, no evidence for these statements is provided. To strengthen this point and to further highlight the urgency of methods to discover and ultimately address these problems, the authors need to provide a more systematic analysis of publicly available genome assemblies for the occurrence of compromised assemblies. For example, a random subset of genome sequences for different eukaryotic phyla and/or classes, and a more systematic sampling throughout the fungi, would

      i) significantly substantiate the manuscript's message and

      ii) confirm the applicability of the authors' framework to most eukaryotes and not only to specific fungal groups (Mucorales).

      Second, the table mentions the diagnosis derived from Karyon but simply states 'unknown' for most entries. Based on the manuscript it seems that these are supposedly haploid with very little heterozygosity (L279), but table 1 nevertheless reports for most species/strains strikingly different genome size estimates between the original and the Karyon-derived genome assemblies (Karyon is consistently smaller). The authors need to explain in much more depth the nature of these differences for the reported genomes. For instance, it could be that publicly deposited assemblies have been generated by a combination of different sequencing libraries and technologies that are not fully exploited by Karyon. Third, one additional measure often applied to assess genome quality is genome completeness, as for instance assayed by BUSCO. Karyon should include a strategy such as BUSCO to

      i) assess the occurrence of marker genes in the genome assemblies and

      ii) the duplication level of these genes as this might reveal un-collapsed alleles etc. Especially the latter is important to interpret genome size differences between original and Karyon-derived genome assemblies.

      Further detailed comments and suggestions to improve the manuscript: L21: could the authors please specify what 'groups' they refer to? L22: there seems to be an extra space. L59: could the authors please specify what they mean by a 'poor assembly'. What is poor in terms of genome assembly? Contiguity or completeness, or unresolved haplotypes, or …, or a combination thereof? L63-: the authors only once refer explicitly to Fig 1 in this section. The manuscript would be clearer if they would refer to specific panels as they describe the factors impacting genome assembly quality. L66: could the authors please further substantiate their notion that most genome assemblies publicly available are built from short-read sequencing data. This information should be readily available at NCBI and/or GOLD.

      L119: the manuscript mentions pan-genomics, but the relevance of aneuploidy in these studies is not explained. The manuscript should provide a brief explanation of the importance of aneuploidy (or any form of ploidy shift) for pan-genomics. L147: 'From' -> 'from' L148: 'Symbiotic' -> 'symbiotic' L232: the reference to nQuire should read Weiß et al. 2018. L302: the reference to blobtools is missing. L349: To initiate the pipeline, was a single sequencing library or a combination of multiple libraries used? Table 1: The table formatting, at least in the combined pdf, seems to be broken.

    3. Abstract

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac088), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Zhong Wang

      In this work, Naranjo-Ortiz et al. presented a software pipeline that is capable of de novo genome assembly, variant calling, and generating diagnostic plots. Applying this software to 35 publicly available, highly fragmented fungal genome assemblies revealed prevalent inconsistencies between the sequencing data and the assembly. I really appreciate the authors' effort to make their software, Karyon, easy to use by providing multiple ways to install it and a detailed software manual. I especially like the detailed explanation of how to use the diagnostic plots to infer the "nonstandard genome architectures".

      The manuscript is clearly written and very easy to follow. I have the following general comments:

      1. It wasn't clear to me what the relationship between the raw sequencing data and the assembly was -- did they belong to the same isolate? If so, then the inconsistencies may reflect assembly errors in the fungal genome assembly. Have the authors ruled out this possibility? The fact that these genomes are highly fragmented suggests they likely contain many errors. If they were from different isolates, then I agree with the authors that the diagnostic plots could be examined carefully to detect structural variations. For that, have the authors used any alternative method to validate at least some of their findings? To establish the validity of their approach, it would be more convincing to obtain the same findings using independent approaches, including experimental ones.

      2. Given the raw WGS reads and the assembled genome, another software tool, QUAST (http://quast.sourceforge.net/), automatically detects assembly errors and structural variations. It would be interesting to see a comparison between the findings via Karyon and via QUAST.

      3. This is an optional suggestion, as I realize it may not be easy to implement. The biggest limitation of Karyon is that it does not automatically detect these unusual genome organizations. It may be possible to do so by comparing the de novo assemblies produced by Karyon to the reference genomes. At least such possibilities should be discussed.

    1. Background

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac083), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Kevin Peterson

      Pati et al. report the expression profiles of miRNAs and a vast array of other, non-miRNA sequences, across 196 cell types based on thousands of publicly available data sets. Although this will be an outstanding contribution to the miRNA field, I am recommending rejection for now with the strong encouragement to resubmit once the authors have addressed my concerns. To be frank, the authors have two diametrically opposed research agendas here:

      1) What are the expression profiles of bona fide miRNAs (as determined by MirGeneDB); and

      2) What else might be expressed in human cell types that could be of interest to small RNA workers and clinicians? Because of these opposed goals, the paper is not only confusing to read and process, but it gives fodder to the numerous paper-mill products that continue to identify non-miRNAs as diagnostic, prognostic, and even mechanistic indicators into virtually every human malady under the sun. Let me try and highlight why the use of only MirGeneDB (MGDB) would be highly useful for this paper.

      1) miRBase (MB) is not consistent in its identification of both arms of a bona fide hairpin, resulting in the authors not reporting star reads for highly expressed miRNAs such as mir-206 and mir-184.

      Further, there are numerous examples where the authors do in fact report a "mature" versus "star" read without both arms annotated in MB with some included in MGDB (e.g., Mir-944, Mir-3909) and others not (e.g., mir-3615) raising the question of how these data were annotated.

      2) The authors write that, "the majority (46%) of the reads are mature miRNAs." But MB makes no attempt to distinguish mature from star arms. Hence, if they are annotating to MB, they cannot distinguish between these two processing products. This is not only confusing, but also very unfortunate as one cannot get a sense of the expression of evolutionary intended gene products versus processing products.

      3) The authors report on the use of 5p versus 3p strand dominance, but have no examples of "codominant" miRNAs (Fig. 1C) when, in fact, there are numerous examples in their data including Mir-324, Mir-300, Mir-339, Mir-361 etc., with some switching arms depending on the variant. All of this is available at MGDB; none at MB. MB also does not allow the identification of loop or offset reads separate from the arm reads, preventing the authors from accurately reporting the amount of reads derived from the "hairpin" versus the arms (and how the authors reported this in Fig. 1B is not at all clear given that these sequences are not annotated as such at MB).

      4) The authors bias their genic origins of small RNA reads by filtering first using MB, and then identifying the remaining reads as arising from other sources including tRNAs, rRNAs, mRNAs etc. However, numerous "miRNAs" in MB arise from these genic sources, including mir-484 (mRNA) and mir-3648 (rRNA). So if I understand the authors' pipeline, these sequences are mistakenly included in the "mature miRNA" column.

      5) The use of MGDB would allow the user to see the saturation of mature reads across the different cell types in Fig. 1E, and, if mature were distinguished from star, then one could also see the (near-)saturation of star reads as well. As it stands, their plot simply highlights the non-genic nature of much of MB. Further, because MGDB identifies the age of each miRNA, if the authors were interested, they could also test a long-standing pattern: that evolutionarily older miRNAs are expressed at higher levels than younger miRNAs relative to specific cell types.

      6) The authors report the expression profiles of bona fide miRNAs in Figs 3 and 5, but report the expression profiles of non-miRNAs in Fig. 4. These include mir-3150b, mir-4298, mir-569, mir-934, mir-302f, and mir-663b. None of these supposed miRNAs have the requisite reads for miRNA annotation, and all but mir-3150b fail a structural examination as well. In fact, MGDB has no reads (which includes numerous data sets from the Halushka lab) for mir-302f, mir-4298, and mir-569, and only a few reads from one "arm" for mir-663b and mir-3150, highlighting the need to examine these supposed reads in detail. The inclusion of obvious non-miRNAs here is confusing and needlessly undermines the authors' study and conclusions. So, my strong recommendation is to potentially write two papers. The first (this one here) focuses only on the expression of miRNAs, emphasizing really interesting results (like what they report in Fig. 5), and providing to the miRNA field a robust cell-type expression profile for humans. This would eliminate the need for read/rpm cutoffs, as they are simply reporting the read profiles for what is in MGDB. This would not hamper their attempts to include these data at UCSC, as MGDB includes links to both MB as well as UCSC; and indeed, why report "miRNA" read data to a genome browser for well over a thousand non-miRNAs? This simply will lend credence to all of these non-miRNAs that already clutter the literature. A second paper could focus on potentially interesting or relevant small RNAs that show interesting patterns of expression in normal and/or diseased tissues, highlighting the structural and expression profiles of these genic elements, and possibly trying to identify what they might be (including potential false negatives in MGDB).
As Corey and colleagues (2022, NAR) recently stressed, we as a field must focus on mechanism as the identification of a "biomarker" in and of itself is of no real value if we don't understand what it is or where it comes from.

      Minor comments:

      1) The seed sequence is 7-8 nt in length, not 6 nt.

      2) miRNA reads - both mature and star - have a mean length of 22 nt, and no miRNA is less than 20 nt long (5p: median = 22, mean = 22.56, SD = 0.94, range = 20-27; 3p: median = 22, mean = 22.11, SD = 0.57, range = 20-26. All data from MGDB.).

      3) It's misleading to write that miRNAs "block protein translation." Please rewrite.

      4) I don't believe our understanding of the expression profile of miRNAs is hampered by the numerical naming scheme. MB's nomenclature system obscures the evolution of miRNAs by erecting both paraphyletic (e.g., MIR-8, which includes mir-141) and polyphyletic groups. Why would distinct monophyletic families like MIR-142, MIR-143 and MIR-144 create confusion regarding their expression?

      5) The use of the term "leading strand" is confusing given its clear association with DNA replication (and not a term I've heard of associated with miRNAs).

      6) Please give cut-offs for things like "infrequent", "frequent" etc.

      7) I was surprised at the lack of co-expression for Burge's co-targeting miRNAs, especially in the brain. I think it would be worthwhile to examine more carefully these miRNAs and discuss in a bit of detail why they don't appear together in Fig. 2A.

      8) Fig. 6 should be moved to the supplemental figures as this is not readable and of no real value.

      9) The authors might want to reference Lu et al. (2005) for Mir-1 expression in the colon as this is one of the obvious down-regulated miRNA in diseased colon tissues.

    2. Abstract

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac083), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Ian MacRae

      In this study, Patil and co-workers have combined the largest set of publicly available small RNA-seq datasets to provide a comprehensive analysis of cell-type-specific miRNA distributions. Moreover, the authors made their results easily accessible to the public via Bioconductor and UCSC genome browser. This deeply curated resource is a valuable asset to biomedical research and will help researchers better understand and utilize the otherwise overwhelming number of small RNA-seq datasets currently available.

      Here are some minor points for the authors to address:

      In the background section, the first sentence reads, "microRNAs (miRNAs) are short, ~18-21 bp, critical regulatory elements that block protein translation". Mature miRNA is single-stranded, so it would be more appropriate to use 'nt' (nucleotides) instead of 'bp' (base pairs) to describe miRNA length. Additionally, many mature miRNAs have a length of 22 or 23 nt. Finally, "block protein translation" is not quite right, as mammalian miRNAs are believed to primarily function by promoting the degradation of targeted mRNAs.

      2. In Fig. 1C, is the "co-dominant" category bar missing? The sum of the 5p and 3p bars is not equal to 100%.

      In Fig. 1D and 1E, the y-axis label "Unique miRNA count" is misleading/confusing. Would a more appropriate label be "Unique miRNA species"?

      In the "DESeq2 VST provided superior normalization" section, the authors mentioned that "An HTML interactive UMAP with cell type information is available in the GitHub repository (https://github.com/mhalushka/miOme2/UMAP/Figures)." However, the provided link is not accessible.

    3. Abstract

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac083), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1 Reviewer name: Qinghua Cui

      In this paper, the authors reported a curated human cellular microRNAome based on 196 primary cell types. This could be a valuable resource. The following comments could improve this study.

      1. Euclidean distance may not be a good metric for clustering analysis. I wonder what the results would look like when using other metrics, e.g., Spearman's correlation.
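      To make the suggestion concrete: a Spearman-based distance depends only on ranks, so monotone but non-linear relationships between expression profiles score as perfectly associated, unlike Euclidean distance. A self-contained sketch with toy data (not the authors' pipeline; function names are mine):

      ```python
      def rankdata(values):
          """Average ranks (1-based), handling ties."""
          order = sorted(range(len(values)), key=lambda i: values[i])
          ranks = [0.0] * len(values)
          i = 0
          while i < len(order):
              j = i
              while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                  j += 1
              avg = (i + j) / 2 + 1  # mean rank across the tie group
              for k in range(i, j + 1):
                  ranks[order[k]] = avg
              i = j + 1
          return ranks

      def spearman(x, y):
          """Spearman's rho = Pearson correlation of the ranks."""
          rx, ry = rankdata(x), rankdata(y)
          n = len(x)
          mx, my = sum(rx) / n, sum(ry) / n
          cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
          vx = sum((a - mx) ** 2 for a in rx) ** 0.5
          vy = sum((b - my) ** 2 for b in ry) ** 0.5
          return cov / (vx * vy)

      # Two toy expression profiles: monotonically related but not linear,
      # so Spearman sees a perfect association where Euclidean would not.
      a = [1, 2, 3, 4, 5]
      b = [1, 4, 9, 16, 25]
      dist = 1 - spearman(a, b)  # 0.0: identical ranking
      ```

      In practice one would compute `1 - spearman` for every pair of samples to build the distance matrix fed to hierarchical clustering.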

      2. More analyses are suggested, such as identification of cell-specific miRNAs, functional set analysis, etc.

    1. This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac080 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 3: Reviewer name: Roberto Pilu

      The manuscript "Association Mapping Across a Multitude of Traits Collected in Diverse Environments in Maize" by Ravi V. Mural et al. reported the application of high-density genetic marker data from two partially overlapping maize association panels, comprising 1,014 unique genotypes grown in seven US states, allowing the identification of 2,154 suggestive marker-trait associations and 697 confident associations, and suggesting possible applications to the study of gene function, the pleiotropic effects of natural genetic variants, and genotype-by-environment interaction.

      The background data are well documented; the experimental data are convincing, clearly presented and well discussed. The paper is suitable for publication in GigaScience in its present form.

    2. This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac080 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Reviewer name: Yingjie Xiao.

      The authors described a study integrating multiple published datasets for reanalysis. They combined previous community panel data and newly collected data in the present study, finally assembling 1,014 accessions with 18M SNP markers and 162 traits measured in different environments. They used a resampling-based GWAS method to reanalyze this assembled dataset, and identified 2,154 suggestive associations and 697 confident associations. They found that some genetic loci were pleiotropic across multiple traits.

      As the authors mentioned, I acknowledge their efforts in collecting and assembling different sources of previously published datasets, which should be useful for the maize community. However, regarding the manuscript per se, I feel the paper does not sufficiently establish the novelty and significance of the reported findings. Previous studies had limitations in population size, diversity, trait dimensions and environments, so the authors could have presented several novel results enabled by the merged dataset; in this study, the authors seemed to be trying to present this, but it could be improved further.

      It is hard for me to see what novel findings were made possible specifically by the merged large dataset. On the other hand, using this assembled dataset, I am not very clear on the scientific questions that the authors want to address. In a technical sense, I wonder how the authors dealt with batch effects when merging phenotype datasets from different environments. Phenotypes from different accessions collected in different environments are not directly comparable; it is hard to figure out whether a phenotypic difference is caused by genotype, environment, or their interaction.

      The introduction section lacks a proper review of the project background and of related progress, publications and findings.

    3. This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac080 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Reviewer name: Yu Li

      Reviewer Comments to Author: Mural et al. reported a large-scale association analysis based on publicly published genotype and phenotype datasets and a meta-GWAS. This study provides a good example for mining community association panel data and further identifying candidate genes, pleiotropic loci and G x E. Actually, metaanalysis of GWAS has been used in humans and animals. However, I have some major concerns as follows.

      1. This study only used three association panels (MAP, SAM, and WiDiv), as I know, some publicly available genotype and phenotype could be obtained for other association panels, for example the association panel including 368 inbred lines (Li et al., 2013, Nat Genetics, 45(1):43-50. doi: 10.1038/ng.2484), which was used widely in GWAS studies in maize. Can other association panels be integrated into this research, which would provide a rich genetic resource for maize research groups.

      2. For association analysis, a total of 1014 unique inbred lines and 162 distinct traits from different association panels were used, but these traits were not measured for each of 1041 inbreds. For example, cellular-related traits were mainly measured in the SAM association panel. Hence, association analysis for cellular-related traits were conducted in SAM or 1014 inbreds. If 1014 inbreds were used to perform association analysis for cellular-related traits, how did you analyze the phenotype data? Please describe the method of phenotype data analysis in the Method section.

      3. Authors used RMIP values to identify significant association signals, please add more details about the RMIP method. What advantages of the resampling-based genome-wide association strategy over other methods?

      4. Although some important functional genes could be identified, were some new candidate genes obtained in this study functionally verified by the mutants or overexpression experiments.

      5. The authors identified pleiotropic loci based on categories of phenotypes associated with the same peak. For example, the phenotypes associated with the pleiotropic peak on chromosome 8 from 134,706,389 to 134,759,977 bp belongs to Flowering Time, Root and Vegetative categories, thus the locus was associated with different traits. Do you have any ideas on pleiotropic genes based on the results?

    1. at the same time

      Reviewer name: Ruben Dries (revision 1)

      The authors responded adequately to my original concerns and have adjusted their manuscript accordingly. I have no further questions or comments. Level of Interest Please indicate how interesting you found the manuscript: Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). 
I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

    2. State

      Reviewer name: Ruben Dries

      In this article, the authors created a modular and scalable pipeline to process raw sequencing data from spatially resolved transcriptomic technologies. In contrast to other popular genomics technologies, such as (single-cell) RNA sequencing, there are virtually no existing public tools that allow users to quickly and efficiently process the raw spatial transcriptomic sequencing data that are generated through Illumina sequencing. This is largely due to the fact that each spatial transcriptomic workflow creates its own unique spatially barcoded reads and thus typically requires technology-specific tools or scripts to extract both the barcode and gene expression information. Here the authors created Spacemake which consists of multiple modules that are tied together using the popular workflow management system Snakemake. The innovative part of Spacemake comes from the creation of specific 'sample variables', such as the barcode-flavor, run-mode and puck, which allows them to create a flexible pipeline that in theory can be adapted to any type of spatial array-based sequencing technology. The authors use well-established tools for downstream quality control and data processing and provide useful additional modules to assess or improve spatial data quality. Finally, Spacemake is also directly linked to Squidpy for downstream analysis and creates a web-based report, which could certainly help to lower initial spatial data analysis barriers. Overall, the presentation of the tool and the methods used in the pipeline as described in their contents are comprehensive and the user manual is easy to understand. We appreciate the efforts to provide this tool to the spatial transcriptomics community and to make it open-source and flexible. However, we do have some suggestions and concerns regarding the manuscript and/or use of this tool. Major comments: 1. 
We managed to install the spacemake software on the linux based server but failed to install it on a MacOS machine due to the compatibility issue with bcl2fastq2. Unfortunately, we also ran into an issue on our linux server, which happened during one of the reading steps from "/dev/stdin" in the middle of the spacemake workflow. More specifically we encountered the following error: Job error: Job 7, TagReadWithGeneFunction Error message: [E::idx_find_and_load] Could not retrieve index file for '/dev/stdin' Even with the help of our IT team we were unable to resolve this issue. To help troubleshoot it might be helpful if the authors can provide exact commands for the examples provided in the manuscript and show what should be expected output of each job in the snakemake pipeline. As a result we were unable to re-run any of the provided examples, which severely limited our reviewing options. 2. A major drawback of Spacemake is that it currently does not offer solutions for the integration of imaging information, which is typically an essential step in any spatial sequencing workflow. The authors do note this shortcoming in their discussion and as a potential solution they argue that Spacemake can be used with another tool called Optocoder, which is currently being developed in their lab. However no information can be found anywhere. There is no biorxiv or github page available based on our search results and as such we were unable to test or assess this solution. At minimum the authors should provide general guidelines on how users could potentially integrate images together with the created spatial downstream results. Minor comments: 1. The figure labels and legends are not always clear. More specifically it's sometimes hard to figure out which samples are being used for each figure or panel. This could be simply resolved by writing more informative legends that specifically state which sample was used to create each figure panel. 
According to the text Seq-Scope was used to generate figure 3, however in the legend of figure 3 it says Slide-seq … 2. Overall, the figures are pretty and informative, however I would suggest starting with a general overview figure that highlights the spacemake pipeline and it's innovative framework. Given the goal and content of the manuscript this seems to be appropriate as a main figure. 3. In order to initialize a spacemake project, the dropseq tools that are required by Spacemake lack any introduction. Please provide a brief introduction and a link to the associated github page to improve this step. 4. In order to configure the spacemake project by adding a sample species, the pipeline does not allow compressed versions of genome files. This could be simply fixed and allows the user to directly link to their, typically compressed, genome files. 5. More information is needed about the R1 R2 arguments in the add sample function. For example, SeqScope has two separate libraries to get sequenced. Where each round of libraries should be loaded is not immediately clear from the tutorial the authors provided. 6. The downsampling and NovoSparc modules together might create an opportunity to identify the relative error that is introduced when NovoSparc is used to enhance spatial expression patterns. Although this might be outside the scope of this paper. 7. As mentioned in the Major comments section we were unable to successfully run an example script, but it would be of great interest to the large spatial community if this pipeline can easily be used with other downstream analysis tools, such as Giotto, Seurat, Bioconductor (spatialExperiment class), etc. Level of Interest Please indicate how interesting you found the manuscript: Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. 
Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

    3. Spatial

      Reviewer name: Qianqian Song (revision 1)

      The revised version mostly addressed my concerns. Hopefully this tool can be widely used with the emerging spatial transcriptomics data. Level of Interest Please indicate how interesting you found the manuscript: Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests. I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). 
I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

    4. Abstract

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Qianqian Song This manuscript proposed a python-based framework named spacemake, to process and analyze spatial transcriptomics datasets. It offers functionalities including sample merging, saturation analysis and analysis of long-reads as separate modules, etc. Overall, this tool holds promises for spatial analysis, though this manuscript lacks details and explanations of methods and results. Specifically, I have some concerns regarding this manuscript. 1) As shown in table 1, it is noticeable that spacemake doesn't include H&E integration, which is kind of necessary in spatial data. I would recommend the authors at least discuss the potential functionality in including H&E images. 2) From the legend of Fig 2B, I didn't find the plot with Shannon entropy, please double check. 3) I don't understand the meaning of fig 2D. The authors should explain how they calculate the Shannon entropy and string compression length of the sequenced barcodes, as well as how they define the expected theoretical distributions. More details are needed here. Though the authors mentioned related information/details would be in methods (last line in QC section), I didn't find any in methods. 4) In Fig 4 A, the authors show the mapped scRNA-seq of mouse cortical layers. I think a complement spatial plot with annotations is necessary, as there is a gap between Fig 4A and Fig 4B. 5) Fig 5C lack the annotations of different colors. 6) In page 16, the authors cited a manuscript in preparation, which is not good. I suggest remove the citation. 7) Supplementary Fig 1 would be better if put as fig 1, thus it would show the overall flow & functionality of spacemake. 8) Based on Supplementary Fig 1, the authors should add a section illustrating how they annotate the spatial data and the involved gene markers. 9) The paragraph "Spacemake can readily merge resequenced samples" lacks detailed explanation and results. 
10) Though spackemake claims it is fast in processing data, well, Supplementary Fig 5 doesn't fully support that. Meanwhile, the authors should explain what the different colors represent. 11) In Supplementary Fig 2, the authors show very high correlation between spacemake and spaceranger, especially the exon intron and exon sub-figures. It looks like the correlations is close to 1. I suggest the authors double check the results and give explanations on their correlation analysis. Level of Interest Please indicate how interesting you found the manuscript: Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests I agree to the open peer review policy of the journal. 
I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

    1. compute

      Reviewer name: Aleksandra Pakowska (revision 2)

      Thank you for the feedback and for including more analyses. Figure S 5 is hard to read (it is unclear where the loops are), in Figure S 6, HiCExplorer looks in fact worse than HiCCUPS. Both tools have issues at noisy loci but seem to be calling the most relevant interactions. The authors decided not to address the issue of pixel merging and its impact on the analysis which might have perhaps helped to understand the discrepancies between tools. Given that almost half of the loops detected by HiCExplorer are not detected by HiCCUPS, it would be interesting to check what these loops connect - convergent CTCF sites, cis regulatory elements to each other? This point could be addressed either in this or in another study. Level of Interest Please indicate how interesting you found the manuscript: Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. 
I declare that I have no competing interests I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

    2. are

      Reviewer name: Feng Yue (revision 1)

      My main concern for the revised manuscript is the additional benchmarking the authors performed with Fit-Hi-C and Peackachu. Since Fit-Hi-C is one of the first algorithms for Hi-C loop prediction (published in 2014) and Peakachu is the only method that uses the supervised machine learning approach for such purpose, I suggested that these two software should be recognized. If the authors can perform a fair benchmarking and find out where the differences come from, the results would be really interesting. The authors decided to test the aforementioned methods during the revision. Unfortunately, I believe there were some errors during the testing. For Peakachu: 1. Most importantly, the authors used the wrong form of normalized Hi-C files for Peakachu. Peakachu model was trained and should be used with ICE-normalized Hi-C matrix. However, based on page 8 in the supplementary file, the input file is gm12878_KR.cool. The data range for ICE and KR normalization is very different, and therefore, the model trained in ICE file will not work with KR format and the prediction will wrong. Therefore, all the following evaluations and descriptions for the Peakachu prediction are not accurate and needs to be revised (such as Fig. 4, Table S1 ...). 2. In the response letter, there is another misunderstanding about merging. Because Fit-Hi-C predicted too many contacts, the authors of Peakachu merged "the top 140,000 interactions into 14,876 loops (Fig. 3a, b), with the same pooling algorithm used by Peakachu." The reason is that if multiple continuous bins on a Hi-C map are all predicted as loops, the merging/filtering step will use the bin with the most significant P-value as the chromatin loops (local minimal). As the authors noted, Fit-Hi-C by default will generate "significant contacts in the 100,000-ends." Therefore, this merging/filtering step is necessary if we want to compare the loops predicted by each method. 
This is also what the author did in this manuscript as well - I am quoting their own writing here, "This filtering step is necessary to address the candidate peak value as a singular outlier within the neighborhood." Therefore, I do not understand the authors are "irritated" by such approach. 3. The authors of Peakach have released their prediction in 56 Hi-C datasets on their 3D Genome Browser website (http://3dgenome.fsm.northwestern.edu/publications.html), including the ones used in this manuscript. The authors used models trained at different sequencing depths for different datasets. Therefore, I would suggest the authors use this dataset for a fair evaluation. Regarding Fit-Hi-C, what are the number of peaks the before and after filtering? The author also needs to provide the loop locations so that reviewers can evaluate their claim independently. This information is critical. This manuscript might be helpful for the authors to evaluate Fit-Hi-C (Arya Kaul et al. Nature Protocol 2020). Finally, the authors need to provide all the predicted chromatin loops in the cell lines as well as loops predicted by other software used in this manuscript as supplementary materials (loops in Supplementary Table 1). Level of Interest Please indicate how interesting you found the manuscript: Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? 
 Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

    3. Chromatin

      Reviewer name: Feng Yue

      This paper provided a loop detection method using continuous negative binomial function combined with donut approach. To test the performance of this method, the authors used in-situ Hi-C data by Rao 2014 in GM12878, K562, IMR90, HUVEC, KBM7, NHEK and HMEC cell lines. This method showed comparable results with HiCCUPS and cooltools and better outputs than HOMER and chromosight. The significant advantage is the utilization of modern computational resources. The following are my comments: 1. The author claimed the advantages in utilizing computational resources. The authors need to clarify how their algorithm contributes to this advantage. 2. It will be helpful for the users to know the performance of the software at various sequencing depths, which can be achieved by down-sampling the high resolution datasets. 3. The authors need to compare (or at least discuss) Fit-Hi-C and Peakchachu. A table showing the strength and limitation of each method will be helpful. To be honest, I don't think any method is clearly better than the other. They are just different approaches. 4. It is better to use other types of orthogonal data like HiChIP, ChIA-PET to evaluate the loops called by these methods. There are H3K27ac HiChIP, SMC1 HiChIP, CTCF ChIA-PET and RAD21 ChIA-PET data in GM12878. 5. Just a minor suggestion. There are a lot of tables in the manuscript, which makes it hard for the readers to compare. It might be better to use figures instead. Level of Interest Please indicate how interesting you found the manuscript: Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. 
Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests. I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

    4. Abstract

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Borbala Mifsud

      Wolff et al. present the python version of HiCExplorer for loop detection. The algorithm is included in the Galaxy HiCExplorer webserver (Wolff et al. 2020), although the publication about the webserver did not describe the algorithm in detail. HiCExplorer uses the same donut approach as HiCCUPS (Rao et al. 2014) with a few notable differences. HiCExplorer selects candidate peaks based on the significance of the distance-corrected observed/expected ratio using a negative binomial model, and compares the peak's enrichment to its neighbourhood's using a Wilcoxon rank-sum test. The method is appropriate for chromatin loop identification and it performs similarly to existing methods both in terms of computational requirements and specificity of the detected loops. However, the manuscript in its current format does not describe the method adequately, and the comparison with the other methods is limited and inconsistent. It would be good to describe each step of the method (filtering based on distance, candidate selection based on negative binomial test, additional filtering options, local enrichment testing using different neighbourhoods in a Wilcoxon rank-sum test). The graphical representation currently included for the algorithm is not informative for most of these steps. For the scientific community, it would be more informative if this method's performance would be further analyzed. Even though it is mentioned that the loop detection greatly depends on the initial parameters, the results do not show how the parameters influence it. The comparison of HiCExplorer with other existing methods is inconsistent. Finally, the text would need heavy editing for language, clarity and minor spelling mistakes. Specific comments: The background does not clearly lay out the motivation behind designing this algorithm. There are similar existing methods that are fast. Why is it expected to detect chromatin loops better? 
This is not a 3D genomics specialized journal, therefore the text should introduce Hi-C and its challenges clearly. For example, the notion that genome properties and ligations affect Hi-C data analysis is mentioned in the methods section without further elaboration. It would be hard for readers to understand why the authors are normalizing for ligation events in their algorithm. The background introduces a few methods that are not aimed at detecting chromatin loops (e.g. GOTHiC) or not designed for Hi-C (e.g. cLoops) and are also not used in the comparison. It would be more useful to describe the algorithms of those methods that are comparable to HiCExplorer in terms of their goal and design. Figure 1, which represents the steps of the algorithm, does not make it clear what happens at each step; some of the arrows seem to point to random pixels, e.g. in panel C. More elaboration on the use of the three different expected value calculation methods would be needed. Which one is more appropriate for a mammalian vs. an insect Hi-C? Does it depend on the genome size, the sequencing depth or the sparsity of the data? The negative binomial distribution models the read counts in most high-throughput sequencing experiments well, but the rationale given for choosing it is not appropriate. Also, citing a Stack Exchange discussion for the methods is not suitable. The numbers in most tables could be better appreciated if they were represented in a figure. What was the reason to increase the distance only to 8Mb instead of using the full genome as comparison, especially given that some of the compared methods only work on the full genome? The bottom left neighbourhood in HiCCUPS is assessed, because they only use the upper triangle in the Hi-C matrix, and the bottom left neighbourhood represents the shorter interactions. In Figure 2, the detected interactions are indicated on the bottom triangle, which is counterintuitive. 
Fig 2A is showing the same data as Fig 2A in the Galaxy HiCExplorer publication (Wolff et al 2020), but the detected loops indicated are different. What is the reason for that? The difference between the proportion of CTCF-bound loops for the different methods is probably not significant. It should be tested. Level of Interest Please indicate how interesting you found the manuscript: Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests. I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. 
I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published

    1. Results

      Reviewer name: Lutz Brusch (revision 1)

The revised version of the manuscript "ChemChaste: Simulating spatially inhomogenous biochemical reaction-diffusion systems for modelling cell-environment feedbacks" addresses all my previous comments and I would also like to thank the authors for their in-depth response.
I declare that I have no competing interests. I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

    2. Motivation

      *Reviewer name: Lutz Brusch*

The manuscript no. GIGA-D-21-00383, entitled "ChemChaste: Simulating spatially inhomogenous biochemical reaction-diffusion systems for modelling cell-environment feedbacks" addresses the important technical challenge of hybrid discrete-continuous models. The presented extension of the widely used Chaste software library, termed ChemChaste, now supports simulations of reaction-diffusion dynamics in a 2-dimensional environment bi-directionally coupled to motile and chemically active but point-like cells. Specifically, ChemChaste supports arbitrarily many spatial domains within the system, each with individual uniform diffusion coefficients. It supports arbitrarily many coupled reaction-diffusion equations and coupling via membrane reactions and transport reactions between bulk molecular species and intracellular species. Cells are coarsely represented as points on a cell-mesh that is distinct from the FE-mesh for solving the reaction-diffusion dynamics. The user interface is established through a tree of many small text and csv files that are human-readable. All these extensions to Chaste are valuable and their presentation is important for the large user base and beyond. The manuscript is clearly structured and well written. The source code is openly available under the permissive BSD 3-clause license at the provided GitHub link (https://github.com/OSS-Lab/ChemChaste) and includes all models, parameters and data as used in the present manuscript. As the motivation and title focus on "...modelling cell-environment feedbacks", then also the implications and limitations of the coarse cell representation in ChemChaste must be clearly stated, see comments below. Major comments:


      1. Coarse spatial cell representation: Cells are represented by their node position in the cell-mesh and interact with the environment through a single node at the same position in the FE-mesh. Can this formalism properly account for transport reaction fluxes in strongly heterogeneous environments where the FE-mesh needs many nodes with differing field values in a spatial area equivalent to the size of a single cell (with the cell node inside this area)? For example, how does this formalism evaluate the uptake from an exponential concentration gradient (as is common for diffusion and degradation around a localized source). For such a field, the local concentration value at any single position is always smaller than the average over any symmetric interval around it. Hence a transport reaction flux calculated with the single concentration value at the cell center will systematically underestimate the flux that would result from averaging over the area equivalent to the size of the cell. Moreover, such systematic errors also occur for linear concentration gradients and can get amplified when transport or membrane reactions are nonlinear with for instance high Hill coefficient. For comparison, with a spatially more explicit cell representation with many paired cell-nodes and field-nodes, one could directly sum the flux contributions from these paired field-nodes. But with the single cell-node here, usability seems limited to weak gradients at the scale of cell size. Alternatively, can a spatial kernel or stencil function be used to average or sum over field values in the spatial area equivalent to the size of a cell?
      2. Conservation of mass for transport: In biology, the number of molecules per time taken from the environment in a transport reaction has to equal the number of molecules per time added to the cell, and vice versa. So mass needs to be conserved, not concentration, whereas ChemChaste seems to add and subtract the concentration flux in the different spatial compartments (cf. page 7 of SI.S1.4). For example, if the FE-mesh needs to use multiple nodes in a spatial area equivalent to the size of a single cell (hence Ve<Vc) but the transport reaction only relates the concentration value at one of these nodes to the cell-node, then mass is not conserved and results will be wrong. One option may be to attach volume attributes to nodes in both meshes. A node i in the cell-mesh would store the current cell volume Vc_i and a node j in the FE-mesh would store that node's share of the volume in the environment Ve_j (doubling the number of nodes in the FE-mesh would on average halve each node's volume Ve_j). Then secretion of molecules with intracellular concentration u at rate k would reduce the intracellular concentration by a flux of molecule number per time and per volume, i.e. k*u*Vc/Vc=k*u, and increase the concentration at the environment node with flux k*u*Vc/Ve which in general is and must be different from the intracellular concentration flux k*u. Likewise, if the FE-mesh is coarse (hence Ve>Vc) then the transport flux must get diluted like k*u*Vc/Ve < k*u. The factor Vc/Ve does not appear to be implemented and the equations on page 7 of SI.S1.4 omit this factor, limiting the usability to the special case Vc=Ve. This implies that the construction of the FE-mesh has to match the cell-mesh wherever cells are positioned and in their neighborhood. This limitation and the required construction of the FE-mesh must be described.
      3. Scaling of fluxes with cell surface area: In biology, membrane reactions and transport reactions occur at the molecular scale and yield a characteristic flux density per membrane area. The total flux per cell is then the integral of the flux density over the cell surface. Hence cells with larger surface area must be able to exchange more molecules with the environment. Since differently shaped cells will have different surface to volume ratios, it appears necessary to attach not only a cell volume Vc_i to each node i of the cell-mesh but also a surface area value Ac_i. The transport reaction fluxes from item 2. above then become k'*Ac*u*Vc/Vc = k'*Ac*u and k'*Ac*u*Vc/Ve, respectively, with a new rate constant k' with units [1/(area*time)]. The same argument applies to membrane reactions. Only if all cells have the same, constant surface area does Ac not need to be attached to nodes, and k may be used instead of k'*Ac.
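The volume- and area-scaling argument of comments 2 and 3 can be sketched as a toy update rule for one secretion-type transport reaction between a cell node and its environment node, written so that molecule number (not concentration) is conserved. All names and values here are illustrative, not part of ChemChaste's API.

```python
# Mass-conserving secretion step: Vc is the cell volume, Ve the environment
# node's volume share, Ac the cell surface area, and k_area the rate constant
# per unit membrane area (units 1/(area*time)).
def secrete(u_cell, u_env, k_area, Ac, Vc, Ve, dt):
    J = k_area * Ac * u_cell   # molecules per unit time leaving the cell
    u_cell -= dt * J / Vc      # intracellular concentration change
    u_env += dt * J / Ve       # environment-node concentration change
    return u_cell, u_env
```

Under this update the total mass Vc*u_cell + Ve*u_env is invariant for any Vc and Ve, whereas exchanging bare concentration fluxes (dropping the Vc/Ve factor) conserves mass only in the special case Vc = Ve.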
      4. User interface and model format: To improve Interoperability according to FAIR,
      5. please explore and comment how the files that are required for model definition in ChemChaste can or cannot be packaged in a COMBINE archive [Bergmann et al. (2014). COMBINE archive and OMEX format: one file to share all information to reproduce a modeling project. BMC Bioinformatics 15:369. https://doi.org/10.1186/s12859-014-0369-z].
      6. please compare ChemChaste's declaration of the reaction-diffusion model in the environment to that of the SBML Level 3 Spatial Processes Package (SBML-spatial) [https://synonym.caltech.edu/documents/specifications/level-3/version-1/spatial/].
      7. please compare ChemChaste's declaration of the reactions to that of the Antimony model format as used in the Tellurium framework [Smith et al. (2009). Antimony: a modular model definition language. Bioinformatics 25:2452. https://doi.org/10.1093/bioinformatics/btp401].
      8. please discuss the necessary steps to convert model files available in SBML-spatial or Antimony to ChemChaste and vice versa.
      9. Numerical accuracy of the 3-fold operator splitting scheme for cell-environment coupling: As shown in Fig.1b, the three operators 1 (Cell dynamics), 2 (Environment dynamics), 3 (Cellular fluxes) are applied sequentially for a coupled cell-environment model. How is the numerical error controlled for this 3-fold operator splitting scheme? How are time steps chosen or adapted internally?
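As a point of reference for the question above, the first-order error of plain sequential (Lie) splitting can be demonstrated on a small linear toy problem with non-commuting operators. This illustrates the generic numerical issue only, not ChemChaste's internals; the matrices are invented for the demonstration.

```python
# Lie (sequential) operator splitting for du/dt = (A + B) u with non-commuting
# A and B. Each sub-operator is advanced exactly via the matrix exponential,
# so the remaining error is purely the splitting error, first order in dt.
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0, 0.0], [1.0, 0.0]])  # A @ B != B @ A

def splitting_error(dt, T=1.0):
    u = np.array([1.0, 0.5])
    for _ in range(int(round(T / dt))):
        u = expm(B * dt) @ (expm(A * dt) @ u)  # operator 1, then operator 2
    exact = expm((A + B) * T) @ np.array([1.0, 0.5])
    return np.linalg.norm(u - exact)
```

Halving dt roughly halves the error; a symmetric (Strang) arrangement of the same sub-steps would improve this to second order, which is why the ordering and error control of the 3-fold scheme in Fig.1b matters.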
      10. Model equations for test case with cell-environment coupling: In SI, Figure S10.c (and file CellA/Srn.txt in the code repository) apparently all 5 reactions are defined as reversible with "<->" and each has a nonzero kr=1.0 but only two of these reactions are reversible in the reaction scheme in main Fig.4a. Probably the file in the repo and SI is wrong (as the reverse generation of Precursor directly from Biomass and Enzyme is not physiological) and possibly the simulation results in Fig.4b may change after correction of the file CellA/Srn.txt.
      11. Findability of repository: To improve Findability of ChemChaste according to FAIR, the code repo should be integrated with or referenced from the core project at https://github.com/Chaste/ . This integration should also facilitate future code maintenance and usability in a sustainable manner. Minor comments:

      1. Further tests may be easily implemented for the Schnakenberg model which was qualitatively simulated but not quantitatively compared to an analytical prediction (main text, lines 368-375). One (rough) quantitative comparison could be achieved for the dominant mode of the Fourier-transformed simulated pattern (Fig.3b; or some other measure of the spatial period of the pattern) versus the critical mode of the diffusion-driven instability (|k_cr|^2 = 1/(2D_U) * dR_U/dU + 1/(2D_V) * dR_V/dV). In addition, the instability threshold from eq. (25) in SI.S6 (page 27) can be tested in simulations along a one-parameter scan across the instability and the temporal oscillation period in Fig.3a can be (roughly) compared to the predicted period from the imaginary part of the eigenvalues of the steady state or computed by means of numerical continuation in AUTO (http://indy.cs.concordia.ca/auto).
      2. Main text, lines 460-463: "Thus...lead to a spatial segregation of the two cell types." This behavior may be subject to the slow or lacking active motility of the cells. Now, cell division alone seems to generate compact clones of the same cell type instead of emergent spatial segregation. Maybe comment if/how ChemChaste handles random walks of cells or even chemotaxis of cells towards ES. Then the interesting question of emergent spatial segregation can be studied with ChemChaste.
      3. Please clarify if/how ChemChaste allows to incorporate transport reactions directly between neighboring cells (like auxin or calcium transport in tissues)?
      4. Where are the membrane reactions involving a cell and the environment included in Fig.1b: in steps 1./2. or in step 3.? That is interesting for the numerical operator splitting scheme and may be added to the caption.
      5. In addition to item 7. above (which should ensure future usability), the reproducibility of the current model results as presented in this manuscript should be ensured by archiving the current software version from the ChemChaste code repo at Zenodo or a similar service and the DOI of that archive should be given in the manuscript. In addition, that archived code shall be given a version number on GitHub and that version number shall also be given in the manuscript. Figure improvements:

      • Figure 2.b may have axes flipped or may have an unfortunate color scale with too little contrast for convergence scores between 0.4 and 0.5 to show the gradual change of score at the horizontal row with dt=0.1 (which is apparently used in Fig. 2.c and shows a change of accuracy there). Please check and improve the correspondence between panels b) and c) such that the data from panel c) helps to get a feeling for the L2 score changes in panel b).
      • Figure 2.b: How can we understand the loss of convergence if the time step is reduced (say from 0.006 to 0.0002) at any fixed dx? From other solvers, one is used to finer time steps improving convergence, while this plot shows dark (high L2 score) areas on both sides of the light (low L2 score) areas at intermediate values of dt.
      • Figure 2.c: The color code is not suited for so many curves. Either include line style or reduce the number of curves (preferred). It must become clear which curve belongs to which dx. The green curve with dx=0.8 seems to be hidden?
      • Figure 3.a: The figure caption should explain the source of variation between nodes (e.g. by pointing to the noise terms in eqs. 13,14) and the color code for the two bands (dark and light) around each curve (1-sigma and 2-sigma or 1-sigma and min/max ?).
      • Figure 4b: These two panels could be given more space. Suggestion: re-arrange part a) horizontally and then put both diagrams of b) at the bottom, left and right.
      • Figure 5: The caption wrongly announces "and t=100" which is not shown. Also the words "towards the" in the first line seem to be linked to t=100. Text corrections:

      • main text, line 61. The sentence "...centred on the role chemical coupling." seems to miss the preposition "of".
      • main text, line 71. The phrase "cellular network reaction size" appears misleading, when it shall refer to "the size of the cellular reaction network".
      • main text, lines 280, 284, 286: Since the subsections of the Results section are not numbered here, then the text pointers "(Section )" can be omitted.
      • main text, one line below eq.(7): "reaction rate constants parameters" can drop the word "parameters"
      • main text, lines 450 and 451: "a...concentrations" should be either singular or plural
      • SI.S1, page 1, line 5 above eq. (1): text "exchange chemical concentrations" should read "exchange molecules" and, correspondingly, "controlling the chemical concentrations passing between the bulk and the cell" should read "controlling the flux of molecules between the bulk and the cell".
      • SI.S1, page 2, line 2: "asssociated" has an "s" too much
      • SI.S1, page 5, at the end of Fig.S1's caption: $k-p$ should be $k_p$
      • SI.S2.2.1, page 14, eq. (11) has capital U_0 and V_0 as initial values while the sentence above has small u_0, v_0. These should be the same symbols.
      • SI.S6, page 26, 1 line below eq. (19): "is a spatial case" should be "is a special case"
      I declare that I have no competing interests. I agree to the open peer review policy of the journal. 
I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
    3. Abstract


      Reviewer name: Cheryl Sershen

It would be nice to include the GitHub link for Chaste. I was able to use the software and reproduce the results presented in the paper. The software is easy to use and install. A broader discussion of what would be necessary to expand ChemChaste to three dimensions is needed. In a follow-up paper, comparisons to actual experimental results would be useful and would encourage users to consider this software. Only proximity to the analytical solutions was presented here.
I declare that I have no competing interests. I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

    1. report

      Reviewer name: Yang Zhou (revision 1)

The authors have resolved most of my comments. However, I am still confused about the gap in the Pilon step from the information in Table 1. In the table, I could read that the assembly length of "Flye + Pilon" is 2,383,228,608 bp, and the ungapped length is 2,383,226,373 bp, so the gap length is 2,383,228,608 - 2,383,226,373 = 2,235 bp. Because in the "Flye" version the assembly length is equal to the ungapped length, this means that gaps are introduced after Pilon correction.
I declare that I have no competing interests. I agree to the open peer review policy of the journal. 
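The gap figure quoted in this comment follows directly from the two Table 1 numbers:

```python
# Gap length implied by Table 1's "Flye + Pilon" row:
# total assembly length minus ungapped length.
assembly_length = 2_383_228_608
ungapped_length = 2_383_226_373
gap_length = assembly_length - ungapped_length
print(gap_length)  # 2235
```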
I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

    2. Findings

      *Reviewer name: Yang Zhou*

The authors have resolved most of my comments. However, I am still confused about the gap in the Pilon step from the information in Table 1. In the table, I could read that the assembly length of "Flye + Pilon" is 2,383,228,608 bp, and the ungapped length is 2,383,226,373 bp, so the gap length is 2,383,228,608 - 2,383,226,373 = 2,235 bp. Because in the "Flye" version the assembly length is equal to the ungapped length, this means that gaps are introduced after Pilon correction.
I declare that I have no competing interests. I agree to the open peer review policy of the journal. 
I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

    3. Syrian

      Reviewer name: Derek Bickhart (revision 2)

The authors have addressed all of my remaining concerns. I declare that I have no competing interests. I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

    4. Background

      Reviewer name: Derek Bickhart (revision 1)

Summary: In this revision, the authors have addressed most of my major concerns with the manuscript. More details must be provided in two sections of the manuscript based on new details provided by the authors. However, these concerns could feasibly be addressed in revision.

Line 124: While the authors have provided an explanation for the sequencing of different target fragment length library preparations, I do not see any results that suggest that one particular preparation was more efficient than the others. This is particularly important given the prevalence of four experimental runs of varying dataset sizes that were uploaded to the cited Biosample accession on SRA. Currently, the metadata provided for that Biosample and its associated experiments is lacking, and one cannot easily distinguish which experiment resulted from different target length preparations. A discursive analysis is not required here, but a statement that provides limited data supporting the authors' preference for library prep is necessary.

Line 301: I believe that the authors misinterpreted the comment on this section in my last review. I requested the proportion of sequence identity differences between assemblies due to INDELs, not assembly gaps. Residual INDELs are still a major problem in polished assemblies that may impact gene annotation.

Figure 1 caption: Given the new k-mer genome size estimation analysis provided by the authors, it does not make sense to use the total length of the MesAur1.0 assembly here. I believe that the authors should choose a genome size estimate that seems most reasonable (from the two options provided) and then use that as the basis for NG50 comparisons. Otherwise, are they conceding that the MesAur1.0 assembly size is the full length of the Syrian Hamster sequence-accessible genome?

    5. Abstract

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Derek Bickhart

Summary: In this manuscript, Harris et al. detail the methods they used to create a new reference genome for the Syrian hamster, which is an important model for respiratory disease pathogens. They used several different sequencing technologies to generate the contigs and scaffolds for their new assembly, and achieved a relatively continuous end product. The analysis is suitable for the "genome report" style format (with one omission detailed below in my comments); however, the manuscript suffers from some awkward phrasing and grammar errors in the results and methods. I list my comments below in the relative order in which I encountered them in the manuscript. Since the authors did not provide line numbers in their submission, I provide my comments as a block listing of questions/suggestions/critiques.

Section titled "oxford nanopore long-read sequencing": The description of the shearing is awkward. I recommend revising the first sentence to state that the genomic DNA isolates were sheared to three lengths (without providing these lengths in the sentence). In subsequent sentences, provide the lengths in situ with the methods used to prepare them. Also, it is unclear why three different fragment lengths were used here for oxford nanopore sequencing. Given that these fragment lengths are relatively similar in size (e.g. not disparate lengths similar to recent ultra-long nanopore read preps of >100kb), it would be very helpful to the reader if justification was given for this approach.

Section titled "Genome assembly": This entire paragraph is awkwardly phrased with numerous past- or present-tense changes. Additionally, the reference to the Pilon polisher needs to be cited, and details need to be provided on what settings were used for Pilon polishing (it is often recommended to correct only indels and to omit gap-filling) and how many iterations of polishing were used. Details are missing on how BioNano optical maps were generated, and what DNA was used as input in the process. Also, what software was used to compare BioNano optical maps, and with what settings? Finally, it appears that the RNA-seq data used by NCBI for annotation was used in another study. Citation to that study would be required so that the reader is aware that the data resulted from different individuals other than the reference individual sequenced in this analysis.

Section titled "Assembly Comparisons": What is the expected c-value of the Syrian Hamster genome? Also, what is the karyotype count? Are any of the chromosomes metacentric or acrocentric? Were any satellite regions identified and annotated in this assembly? Finally, I would have preferred that assembly comparisons be conducted with feature response curves, such as those produced by the program "FRC_align", as this provides a useful metric to assess assembly "correctness" by length.

Section titled "Transcript and protein alignments and annotation comparisons": How many INDELs were identified in the alignments of RNA-seq transcripts to the BCM_Maur_2.0 assembly? Was this count different from those discovered in the short read assembly?

Section titled "Interferon type 1 alpha gene cluster": Were there any gaps that spanned the gene cluster or flanked it?

    1. Findings

      Reviewer name: Boas Pucker (revised version)

The authors further improved the quality of this manuscript and responded to all my comments. My concerns were addressed and several comments were solved by extensive analyses (e.g. #7). Although some opportunities for further investigations were left for future studies, I still believe that this work is very important for the community. The quality of this Ensete glaucum assembly appears very high. I would like to congratulate the authors on this excellent work and recommend its publication in GigaScience.

    2. Background

      Reviewer name: Ning Jiang

In this study, the authors described the generation of a high-quality reference genome of Ensete glaucum, which is one of the most cold-hardy species in the Musaceae. It is also well known for its drought tolerance. The authors compared the expansion and contraction of gene families and the composition of repeats among related species. The genome assembly, analysis, and annotation are certainly useful for comparative genomic studies as well as future breeding practice. Everything seems to make sense to me. Certainly, the results are descriptive, but this is more than sufficient for a data note.

    3. Abstract

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Boas Pucker

Wang et al. generated a chromosome-scale genome sequence assembly of Ensete glaucum based on ONT long reads. This is a valuable resource for comparison against various Musaceae species. This assembly will certainly help to identify genes underlying agronomic traits in Musaceae. Important data sets are already well integrated into the banana genome hub and available to the community. The authors harnessed this highly contiguous assembly for analyses of synteny against Musa acuminata and for the investigation of repeats/TEs. Overall, the quality of this work is high and the manuscript is well written. I am not sure why this submission is classified as a data note, because it could also pass as a research article. I noticed a few issues and provided some specific comments that might be helpful to further improve the quality of this work:

1) There are many numbers in the abstract. I would recommend reducing this to the most important ones. For example, the BUSCO results could be removed.

2) There is only one short paragraph about existing genome sequences. I would recommend extending this and mentioning the banana genome hub as the central community resource.

3) Please indicate if the coverage estimations are based on the haploid or diploid genome size (Table 1).

4) Please provide additional details about the BUSCO results (C, S, D, F, M) in line 114 and/or in Table 2.

5) I find the sentence in line 120/121 confusing when reading for the first time. This suggests to me that more sequence was anchored than present in the initial assembly. The sentence is correct, but it might be better to present the total assembly size first and to describe the anchored proportion in a separate sentence.

6) It would be helpful to clearly distinguish between the genome (DNA) and the genome sequence (the assembly). That would make it easier to understand the discussion of differences between both (e.g. collapsed repeats).

7) Genome size estimation is always tricky. I would recommend running several tools and providing the estimated range (findGSE, gce, MGSE, GenomeScope, ….). It is also important to run the k-mer-based approaches with different k-mer sizes. Apparently, GenomeScope was used for the heterozygosity analysis, but not for the genome size estimation. That is surprising.

8) Statistics about the pseudochromosomes in Table 2 could be removed. For example, it is not necessary to say that the L50 number of 9 chromosomes is 5.

9) Please explain the difference in BUSCO results between predicted genes and BUSCO run in genome mode. Which genes are missing in the annotation? Table S3 suggests that the automatic BUSCO annotation (genome mode) is superior to the annotation generated in this study (analyzed in transcriptome mode).

10) Some statements about the CENs and telomeres would be interesting. These could give a good impression of the assembly results. Estimating their copy numbers could help to explain the difference between assembly size and estimated genome size.

11) Are there any genetic markers that could be used to check the assembly accuracy?

12) In my opinion, the section "Gene distribution and whole-genome duplication analysis" could be removed. Genes are never equally distributed across a genome and repeats/TEs are usually clustered around the centromeres. Therefore, this part does not add any novel insights. The second paragraph comes to the conclusion that all Musaceae share the same WGDs. This seems obvious to me. Was there a different expectation?

13) Orthogroup identification could be complemented with a synteny analysis. A comparison to Musa acuminata (https://doi.org/10.1038/s42003-021-02559-3) could help to check the accuracy of the orthogroups.

14) The statement "Genes with Ka/Ks > 1 were under positive selection (Supplementary Table S6)." does not fit well with the rest of this paragraph. Given that there are >35k genes, some would show values >1 by chance. Some statistical test would be needed to find out which genes are actually under positive selection. What is the conclusion from the identification of such genes? Any enrichment of particular functions?

15) The statement about the sugar transporters is interesting. This would be a good chance to connect these comparative genomics results with the transcriptome analyses.

16) Transcription factor families are mentioned, but not discussed. It is not surprising that MYBs are the largest TF gene family. However, it would be interesting to know if there are any striking differences compared to M. acuminata (https://doi.org/10.1371/journal.pone.0239275). Some MYBs like the anthocyanin regulators respond to sugar treatments. Is there a connection to the large number of sugar transporters? Any duplications/deletions compared to M. acuminata? This could be another opportunity to better connect different aspects of this study.

17) It is interesting to read that head-to-head and tail-to-tail repeats appeared collapsed. Previous studies identified that these arrangements of repeats are associated with low local read quality (e.g. https://doi.org/10.1093/nar/gkaa206, https://doi.org/10.1186/s12864-021-07877-8). I would not expect that both strands of the DNA molecules are sequenced. The authors might want to check this and provide additional explanation.

18) I am surprised that TEs were the most abundant class of repeats. Could this be caused by treating all the different TEs as one group? CENs should appear with a much higher copy number than individual TEs or TE families.

19) The centromeric patterns could be compared to the situation in Arabidopsis thaliana: https://www.science.org/doi/10.1126/science.abi7489.

20) Are SSRs less frequent around the centromeres and on the NOR chromosome arm, or is this just a lack of detection in these regions?

21) Why is AG/CT more abundant than other SSRs? This could be compared to other species.

22) References for the length of 45S rDNA in other species are missing.

23) How many 45S rDNA copies can be inferred from the ONT reads? The coverage is way higher, thus this estimation should be more reliable.

24) The NOR chromosome arm is depleted of protein-encoding genes, but there should be plenty of rRNA genes. Please specify this in the sentence.

25) The synteny section is lengthy. The statements in context of previous studies are good, but removing some purely descriptive parts might make it more interesting. The corresponding figures show everything and could stand on their own.

26) What is the value of genotyping-by-sequencing if not combined with GWAS?

27) Which ONT flow cell type? Which Guppy version?

28) It does not become clear how the Hi-C library was prepared (line 562). What is the improvement? Please explain this here.

29) Please add the detailed parameters of the assembly and polishing.

30) BWA reference is missing. Why was BWA not used for the mapping of the Hi-C reads?

31) The statement in line 592/593 suggests that Hi-C was used for validation. However, it was also used for correction in the previous step. Anyway, this result should be moved from the method to the result section.

32) Trinity assembly and PASA steps lack details.

33) Parameters of STAR mapping and gene prediction steps are missing.

34) There is some discrepancy concerning the Musa acuminata genome assembly versions. It seems that v2 is used in some cases and v4 in others. Please check this.

35) Please make the customized script available via github (line 732) if this is different from the one mentioned in line 737.

36) Are the TE results consistent if a different 2Gb subset of the illumina data is analyzed?

37) How were the centromere positions determined? I think that I have missed that in the method section. It must be connected to the CEN repeats, but the precise approach could be explained in more detail.

38) The read data sets are not released, thus I cannot check if all raw data sets were included. It would be particularly important to have the FAST5 files of the ONT data to study base modifications in the future.

39) The link to the banana genome hub appears to be broken in the data availability statement. The data sets on the genome hub look fine.

40) The terms "core" and "pseudo-core" in Fig. 3 are not frequently used in the literature. These genes seem to have different degrees of dispensability and might be conditionally dispensable (https://pubmed.ncbi.nlm.nih.gov/24548794/; https://doi.org/10.1186/s13007-021-00718-5).

41) There seems to be some variation in the genome size estimation. I would recommend presenting the results of multiple k-mer sizes (e.g. 17-25). The distribution of the resulting values might help to estimate the true genome size. JellyFish (k=17): 563Mb; findGSE (k=21): 589Mb; GenomeScope (k=21): 489Mb (this is smaller than the actual assembly size).

42) The presented sugar transporters are not among the top enriched GO terms (S2). Therefore, I am afraid that this analysis is not very informative. Could it be that the "enriched" GOs are just a "random" set?

43) Why is E. glaucum not presented as S5C? A direct comparison would make more sense.

44) S10: I would recommend identifying the precise break points. Next, it would be good to validate the accuracy of the assembly by finding individual reads that actually support the situation in E. glaucum. This would help to exclude an assembly artifact as the reason for the difference.

45) It might be better to use a three-letter abbreviation of the species ("Egl" instead of "Eg") in the gene IDs to avoid ambiguities in future genome sequencing projects.

46) The method section states that short DNA fragments below 12kb were removed. S11 suggests that two libraries were sequenced: one with depletion of the short fragments and one without it. Please check this. Generally, I would recommend trying a different gDNA extraction protocol and using SRE instead of BluePippin.

47) The north of eg06 looks suspicious in the Hi-C analysis (S12). There is also no substantial synteny with any of the Musa chromosomes (S8). Could this be an indication that there are errors in the assembly?

48) Table S1: What is the point in showing that all contigs are larger than 1, 2, and 5kb?

49) 445 bHLHs in M. acuminata is almost twice the number of bHLHs detected in E. glaucum. Some other TF families also show this large difference, but other families show almost equal numbers. It could be interesting to further investigate this. The HB-KNOX value of M. acuminata is missing.

Minor comments:
line 70/71: Some countries are named multiple times. Please change this.
line 113: chromosomes > pseudochromosomes
line 273/274: Please check this sentence.
line 428: Please rephrase "translated proteins"; SynVisio should only be named in the method section.
line 436: "protein-coding genomes"?
line 464: "second (right)" … should be replaced by north/south or q/p nomenclature. This also affects some following sentences.
line 625: "Musa acuminata" is a species name
line 639: blast > BLAST
line 731: of of > of
line 811: RNA-sequencing > RNA-seq (I have not seen a section about RNA sequencing)
S10: "E glaucum" > "E. glaucum"

    1. ML

      Reviewer name: Gael Varoquaux (revision 1)

I would like to thank the authors for the work done on their manuscript, in particular adding the experiments that enable linking to sparse-recovery theory. In my opinion, the manuscript brings a lot of value to the application community and is pretty much complete. A few details come to my mind that could help its message be most accurate. Because of my suggestions, the authors have used an l1 penalty in the SVC. This worked well in terms of prediction. However, it is not the default. I think that the authors should stress this and be precise on the penalty each time they mention the SVC. In addition, I think that there would be value in performing an additional experiment with an l2 penalty (which is the default) to stress the importance of the l1 penalty. The message should stress that the penalty (l1 vs l2) is important, but less so the loss (log reg vs SVC). As a minor detail, I would invert the color scale of one of the plots in Figures S12 and S13, to stress the parallel between the two. Finally, I think that it is important to stress in the conclusion that all the results build on the fact that the predictive information is sparse (maybe putting this in words more familiar to the application community). Methods Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? Choose an item. Conclusions Are the conclusions adequately supported by the data shown? Choose an item. Reporting Standards Does the manuscript adhere to the journal's guidelines on minimum standards of reporting? Choose an item. Statistics Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used? Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item.

    2. Results

      Reviewer name: Filippo Castiglione

      The article "Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification by Kanduri1 et al. describes the construction of suitable reference benchmarks data-sets to guide new AIRR ML classification methods. The article is interesting and potentially useful in defining benchmark data sets and criteria for constructing specialized AIRR benchmark datasets for the community of researcher interested in AIRR. The authors following previous indications about model reproducibility and availability also provide a docker container which include all data and procedures to reproduce the study. The article is sufficiently well written although at time a bit full of details which perhaps could be synthesised further (this has already been done in pictures and tables). I don't have major concerns. Only a couple of notes. Would be good to have a figure or diagram showing an example of bags containing receptors and associated witnesses. It could illuminate the reader not familiar with Multiple instanvd learning. Would be good to have line commands for the generation of data sets (in the case, for instance, of use of Olga). I understand these are inside the docker container but the reader that is not interested in the whole container might find useful to have access to pieces of the pipeline so to use this or that tool (being it in immuneML, in Olga, etc.). Curiosity: why have the authors used Olga and not the mate Igor? Why is the performance metric in model training the accuracy and not, for instance, the F1-score? Any particular reason? Methods Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? Choose an item. Conclusions Are the conclusions adequately supported by the data shown? Choose an item. Reporting Standards Does the manuscript adhere to the journal’s guidelines on minimum standards of reporting? Choose an item. Choose an item. 
I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/).

    3. Background

      Reviewer name: Enkelejda Miho

General opinion: approved with minor changes.

Comments: The manuscript profiles machine learning methods for immune state label prediction from AIRR T-cell receptor datasets to establish the baseline performance of such methods across a diverse set of challenges. Simulated datasets with variable properties are used to provide a large number of benchmarking datasets with known immune state signals while reflecting the natural complexity of experimental datasets. Their results provide insights on the current limits posed by basic dataset properties to baseline ML models and establish a frontier of improvement for AIRR ML research. The manuscript is understandable and well structured in its approach to comparisons as well as in its conclusions. The graphics are clear and consistent and support the manuscript. Very interesting insight into the importance of single variable parameters such as sample size or witness rate on the overall accuracy. The advantage of the results to the scientific community is that they offer an evaluation of classical ML methods, provide large and specialized AIRR benchmark datasets, and allow further development and benchmarking of more sophisticated ML methods. The manuscript is overall well written and we endorse it with minor changes: In the paragraph "Impact of noise on classification performance" (page 14), the sentence "but enriched above a baseline in positive class examples" should be corrected to "but being enriched above a baseline in positive class examples". In the paragraph "Machine learning models" (Methods section, page 21), "lasso" should be corrected to "Lasso". In the same paragraph, " '- ' " should be corrected to "'-'" and "𝑋jdenoting" to "𝑋j denoting".
In the Discussion, the sentence "which aligns with the observations that that the majority of the possible contacts between TCR and peptide" should be corrected to "which aligns with the observations that the majority of the possible contacts between TCR and peptide". Keep comparisons like "size>500" and "size > 500" consistent. Check for missing whitespace, as in the description of Figure 1(b): "…(5 x 105 % of sequence…", and likewise in cases like "≈90%" vs "≈ 90 %" and "n=60" vs "n = 60".
Declaration of competing interests: Enkelejda Miho owns shares in aiNET GmbH.

    4. Abstract

      This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer name: Gael Varoquaux

The manuscript by Kanduri et al benchmarks baseline machine-learning methods on simulated sequencing data of adaptive immune receptors to predict immune states of individuals by detecting antigen-specific signatures. Given that there is a volume of publications using a wide variety of different machine learning techniques with the promise of clinical diagnostics on such data, the goal of the study is to set baseline expectations. From an application standpoint, I believe that the study is well motivated and useful to the community. From a signal processing standpoint, many aspects of the study are trivial consequences of the simulation choices: sparse estimators are good for prediction when the signal is generated from sparse coefficients. Though I do not know this application community well, it seems to me that the manuscript is valuable because it casts this knowledge in a specific application setting; however, it should discuss a bit more the fundamental statistical reasons that underlie the empirical findings. I give below some major and minor comments to help make the study more solid. 1. Plausibility of the simulations: The validity of the findings relies crucially on the simulations, in particular the hypotheses of extreme sparsity. These hypotheses need to be discussed in more detail, with references to back them. The amount of sparsity, as detailed in Table 1, is huge, which strongly favors sparse models. 2. Another baseline, natural given the sparsity: I do realize that the goal of this study is not to do an exhaustive comparison of all machine learning methods -- an impossible task -- however, for someone knowledgeable about sparse signal processing, the study begs the question of whether univariate tests on appropriate k-mers can be enough, an avenue suggested by the authors on page 7. This option should be studied empirically, as it would provide important practical methods. 3.
Link to sparse model theory: A vast variety of theoretical results state that a sparse model will be successful for n proportional to s log(p), where n here would be the number of samples in the minority class and s would be the number of non-zero coefficients. A good summary of these results can be found in the book "Statistical Learning with Sparsity: The Lasso and Generalizations" (T. Hastie, R. Tibshirani, M. Wainwright, 2019). It would be interesting to see how these theoretical scalings match the results, for instance those in Figure 3. 4. Accuracy and class imbalance: It seems to me that in parts of the manuscript (Fig 4a for instance) accuracy is compared across different scenarios with varying class imbalance. However, accuracy is not comparable when class imbalance varies: for instance with 90% positive class, a classifier that always chooses the positive label will have .9 accuracy. In this light, I don't understand Fig 4a, in which even for large class imbalance accuracy goes to .5. In addition, the typical good practice is to use a metric for which decisions at chance are not affected by class imbalance, such as the area under the ROC curve. 5. Comparison with SVC: The manuscript mentions that a Support Vector Classifier is also benchmarked; however, it does not give details on which specific SVC is used. A crucial point is the kernel used: with a linear kernel, the SVC is a linear model, while with another kernel (RBF kernel, for instance), the SVC is a much more complex model and is not expected to behave well in large p, small n problems. Also, I suspect that the SVC is used with l2 regularization. A linear SVC with l1 regularization would likely have similar performance to the l1-penalized logistic regression, as it is a model of the same nature.
These details should be added; ideally, if the model benchmarked is not a linear SVC, a linear SVC should be benchmarked to give a baseline (though the default l2 regularization can be used, to stick to common practices). 6. Wording in the conclusion: The conclusion starts with "To help the scientific community in avoiding futile efforts of developing...". The word "futile" is too strong and the phrasing will not encourage healthy scientific discussion. I try to sign my reviews as much as possible. Gaël Varoquaux

I have no competing interests.
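The reviewer's point about accuracy under class imbalance (comment 4) can be illustrated with a small stdlib-only sketch: with 90% positive labels, a classifier that always predicts "positive" reaches 0.9 accuracy while its ROC AUC stays at chance. The data are synthetic, and the AUC implementation uses the Mann-Whitney formulation.

```python
# Demonstration: accuracy is inflated by class imbalance, ROC AUC is not.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative (ties count 0.5) -- the Mann-Whitney formulation."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1] * 90 + [0] * 10   # 90% class imbalance
y_pred = [1] * 100             # trivial majority-class classifier
scores = [1.0] * 100           # constant scores -> all ties

print(accuracy(y_true, y_pred))  # 0.9
print(roc_auc(y_true, scores))   # 0.5
```

This is why comparing raw accuracies across scenarios with different class balance, as the reviewer notes for Fig 4a, can mislead.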

    1. Functional

      Reviewer 3: Chris Armit

This Data Note describes an Open CC0 neuroimaging dataset of 15 subjects (young adults) who underwent simultaneous BOLD-fMRI and FDG-fPET imaging. FDG-fPET ([18F]-fluorodeoxyglucose positron emission tomography) measures glucose uptake in the human brain, whereas BOLD-fMRI (blood oxygenation level dependent functional magnetic resonance imaging) captures the cerebrovascular haemodynamic response. FDG-PET data was acquired using three different radiotracer administration protocols - bolus, constant infusion, and 50% bolus + 50% infusion - and each administration protocol was applied to 5 subjects. BOLD-fMRI and FDG-PET was acquired while participants viewed a checkerboard stimulation, which was used to trigger dynamic changes in brain glucose metabolism.

This neuroimaging dataset allows researchers to explore the complexity of energetic dynamics in the brain using multimodal imaging data analysis. In addition, this neuroimaging dataset includes structural MRI data for each subject, including T1 and T2 FLAIR, enabling neuroanatomical correlations to be explored. The neuroimaging data are available from OpenNeuro [http://doi.org/10.18112/openneuro.ds003397.v1.1.1] and the authors are to be commended for ascribing a CC0 Public Domain Dedication to this dataset. Importantly, the authors highlight that consent was obtained from participants to release de-identified data. I downloaded a small number of image files from this dataset and I confirm that the de-identified NIfTI (Neuroimaging Informatics Technology Initiative) format files can be opened using Fiji / ImageJ.

      This neuroimaging dataset has immense reuse potential and I recommend this Data Note for publication in GigaScience.
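The reviewer's check that the downloaded files are valid NIfTI images can be sketched programmatically. A minimal NIfTI-1 sanity check, assuming only the published header layout (sizeof_hdr = 348 in the first 4 bytes, magic string at bytes 344-347); real analyses should use a library such as nibabel:

```python
# Minimal NIfTI-1 sanity check: the first 4 bytes hold sizeof_hdr (348)
# and bytes 344-347 hold the magic string (b"n+1\0" for a single-file
# .nii, b"ni1\0" for an .img/.hdr pair). Sketch only, not a full parser.
import struct

def looks_like_nifti1(header_bytes):
    if len(header_bytes) < 348:
        return False
    (sizeof_hdr,) = struct.unpack("<i", header_bytes[:4])
    magic = header_bytes[344:348]
    return sizeof_hdr == 348 and magic in (b"n+1\x00", b"ni1\x00")

# Synthetic 348-byte header for demonstration (not a real scan).
fake = bytearray(348)
fake[:4] = struct.pack("<i", 348)
fake[344:348] = b"n+1\x00"
print(looks_like_nifti1(bytes(fake)))  # True
```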

    2. Background

      Reviewer 2: Nicolas Costes

Jamadar et al. present a database of limited size, but of a rarity which amply justifies its interest. This is a combined dynamic FDG-PET (fPET) and fMRI study performed in three groups of 5 subjects, for whom 3 different modes of FDG administration were used: bolus, infusion and bolus + infusion. The statistical analysis resulting from this study is also of limited scope due to the low residual degrees of freedom of the design, but nevertheless makes it possible to confirm the expected characteristics of the shape of the PET kinetics; it confirms the superiority of the bolus + infusion protocol, which ensures maximum sensitivity for highlighting the neural circuits involved in the visual flickering task performed during acquisition. The interest of the study lies in the free provision of the whole dataset, which can be used, as is argued, as a demonstrator for the development of methods for correcting, processing and analyzing data. A multivariate analysis combining PET and fMRI, taking advantage of the simultaneous recording, is not carried out: a simple voxel-to-voxel GLM analysis makes it possible to expose notable differences between the 3 methods of administration of FDG. However, the provision of the data opens the field for future exploitation. The fact that raw data before PET reconstruction are provided is relatively new and opens up the possibility of extending the field of their exploitation to correction and reconstruction methods. Respecting the BIDS description format as much as possible is also a plus. These data are of undeniable interest to the community, and therefore the description of their content and the exhaustive provision of all the demographic and physical parameters of their acquisition deserve publication. The following remarks should be considered before publication.

p7: In "[18F]-FDG", the 18 should be in superscript. p9: Raw PET data are in the original format exported from the Siemens console: is there a distinction between list-mode files exceeding 4 GB, as is the case on the Siemens console? In which format will the raw data be provided? Results: Figure 2A: Please specify whether the plasma curves are corrected for 18F radioactivity decay at the time of injection. Figure 3: Which correction was applied for Zcorr? FWE? FDR? Figure 4: How exactly is "percent final change" computed: is it an average over the active periods compared to the rest period? Is it computed from the beta regressor or directly on the signal change? In the latter case, over which interval? Figure 5: As the average across all protocols is provided in Fig. 3D to serve as a reference, could you also provide the average across… References: Please review the references: check for incomplete references (2, 8, 21 for example) and uniformity of format, and provide DOIs, as is already done for the majority of them.
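The reviewer's question about how "percent final change" is computed (Figure 4) admits at least two answers: from the GLM beta regressor or directly from the signal. A minimal stdlib sketch of the direct-signal variant, averaging active (task) frames against the mean of rest frames; the numbers are synthetic and whether the paper uses this definition is exactly what the reviewer asks:

```python
# One common definition of percent signal change: mean over task frames
# relative to the mean over rest (baseline) frames. Frame indices and
# signal values below are made up for illustration.
def percent_signal_change(signal, active_idx, rest_idx):
    active_mean = sum(signal[i] for i in active_idx) / len(active_idx)
    rest_mean = sum(signal[i] for i in rest_idx) / len(rest_idx)
    return 100.0 * (active_mean - rest_mean) / rest_mean

signal = [100, 100, 103, 103, 100, 103]  # arbitrary units per frame
rest = [0, 1, 4]                         # rest-period frame indices
active = [2, 3, 5]                       # task-period frame indices
print(percent_signal_change(signal, active, rest))  # 3.0
```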

    3. Abstract

      Reviewer 1: Antoine Verger

Review on "Data Note: Monash DaCRA fPET-fMRI: A DAtaset for Comparison of Radiotracer Administration for high temporal resolution functional FDG-PET". This article is an important contribution in its field. This study presents an open access dataset, Monash DaCRA fPET-fMRI, which contrasts three radiotracer administration protocols for FDG-fPET: bolus, constant infusion and hybrid bolus/infusion. The Monash DaCRA fPET-fMRI dataset is the only publicly available dataset that allows comparison of radiotracer administration protocols for fPET-fMRI. While the provided dataset is useful for the scientific community, the validation part needs further explanation.

Comments: - It is a shame that this dataset is not also available for resting-state fPET-fMRI images. Indeed, most studies are also performed at rest (connectivity in neurodegenerative disorders, for example) and would need such controls. Please discuss the opportunity to provide such databases. - Was the administered FDG dose the same for all patients or adapted to body weight? Please detail. - The authors should discuss the gender variability across the 3 groups. Metabolism and radiotracer uptake depend on gender. The authors should at least include this covariate in their group analyses. - Of course, raw data are available. I have nonetheless one question: what is the interest of using PSF reconstruction followed by a Gaussian filter on the reconstructed images? Why use PSF on dynamic (noisy) PET images? Please, can the authors justify the 16-s frame duration used for reconstruction of their images? Was this justified by any optimization? - The authors further applied a filter of FWHM 12 mm after having previously reconstructed their images with a Gaussian filter. They should choose one of these two filters; if not, the smoothing of the PET images is too strong. - For the validation set at the group level, is the PET intensity normalization based on proportional scaling? It is particularly important to understand how the authors obtained the grey matter mean signal. - How was the grey matter mean signal obtained? From a grey matter MRI mask? - Could the authors elaborate on how to access open-access reconstruction algorithms? Particularly as the images have been obtained with a Siemens Biograph. They mention STIR and SIRF: please elaborate: is this usable by anyone who has no access to a Siemens reconstruction algorithm? Is a specific PSF reconstruction for Siemens implemented? - "there has not yet is not yet agreement in the best way to manage": please rephrase. - Figure 1: Please include the conventional MRI sequences at the beginning of the acquisition.
- Figure 2: Please provide units for signal intensity. It would also be helpful to provide elements to distinguish the tasks from the rest periods. - Figure 2: Is the grey matter signal obtained for all of the grey matter or only for the occipital cortex? The authors should discuss the higher variability observed between patients for the bolus-based methods. Is it linked to the different sex ratios between the protocols? Please also discuss why one patient in the infusion protocol has a truncated time-activity curve. - Figure 3: The authors should explain the variability of the fMRI patterns in the GLM although the same protocol was performed. Is there an influence of the coupled glycolytic metabolism? - Figure 3: How do the authors explain the absence of correlation with the task in the infusion protocol? (This was not observed in the 3 phases of the protocols for infusion in Figure 5.) - Figure 4: Please define how the increase in signal percentage was calculated. How was the grey matter normalized at the group level? Proportional scaling can be a source of false-positive abnormalities. - Figure 5: Can the authors display the changes in connectivity of the occipital area between the 3 phases for each protocol? (By adding a supplemental part at the bottom of the figure.)
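The proportional scaling the reviewer asks about can be sketched in a few lines: each voxel is divided by the grey-matter mean and rescaled to a common grand mean, so that global between-subject differences in FDG uptake are removed (and, as the reviewer warns, regional effects can leak into the scaling factor). Values, mask, and the grand-mean constant below are synthetic.

```python
# Sketch of proportional scaling (SPM-style grand-mean scaling): divide
# every voxel by the subject's grey-matter mean, then rescale to a
# common target mean. All values here are illustrative.
GRAND_MEAN = 50.0  # arbitrary common target mean

def proportional_scale(voxels, grey_matter_mask):
    gm_values = [v for v, in_gm in zip(voxels, grey_matter_mask) if in_gm]
    gm_mean = sum(gm_values) / len(gm_values)
    return [v * GRAND_MEAN / gm_mean for v in voxels]

voxels = [80.0, 120.0, 100.0, 60.0]
mask = [True, True, True, False]  # last voxel outside grey matter
print(proportional_scale(voxels, mask))  # [40.0, 60.0, 50.0, 30.0]
```

Because the scaling factor is itself an average of the data, a genuine regional increase inflates the grey-matter mean and can create apparent decreases elsewhere, which is the "false-positive abnormalities" concern raised above.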

  3. Jan 2023
    1. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.75) and has published the reviews under the same license. These are as follows.

      Reviewer 1. Ned Peel

      Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code?

      Scripts have been made publicly available on GitHub (https://www.github.com/phiweger/adaptive) under an OSI-approved BSD-3-Clause license.

      As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?

      No.

      Is the code executable?

      Unable to test

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages?

      Not applicable.

      Additional Comments: Sent authors accompanying file with comments

      https://gigabyte-review.rivervalleytechnologies.com/journal/gx/download-files?YXJ0aWNsZT0zNjkmZmlsZT0xMzcmdHlwZT1nZW5lcmljJnZpZXc9ZmFsc2U~

      Reviewer 2. Julian Sommer

      Is the code executable?

The code used for the analysis of the data has been published on the corresponding GitHub page, although a link on that page for downloading data from a public database did not work at the time of testing (resource deleted). Also, while most parts of the code are executable, the data and figures generated by the code do not reproduce the figures in the publication.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

Yes. The code placed in the GitHub repository can mostly be executed, but requires basic knowledge of coding in the programming languages used. However, for the data presented in this work, I do not see the need for more detailed instructions.

      Is the documentation provided clear and user friendly?

      Only partly

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?

      Only partly. However, I do not see the need for further instructions.

      Is test data available, either included with the submission or openly available via cited third party sources (e.g. accession numbers, data DOIs)?

      The data is available from the stated accession numbers, but an additional data link on the github page does not work and might be necessary to test the complete code.

      Additional Comments:

The study compared three methods of Oxford Nanopore-based long-read sequencing for detection of antibiotic-resistant bacterial pathogens. The authors used cultivation-based detection of carbapenem-resistant bacteria from a rectal swab and subsequent single-isolate sequencing. This technique was compared to an adaptive sequencing approach using a database of antibiotic resistance genes for adaptive sequence enrichment during the sequencing run. The underlying technology is a unique approach, made possible by the Oxford Nanopore real-time sequencing technology, and is of great interest for future applications in clinical microbiology diagnostics. Therefore, this study is of great importance for this field in general. As an additional method, the authors performed metagenome sequencing of the rectal swab without culture, which is a completely different technique with unique advantages and drawbacks compared to culture-based sequencing methods. This study is important for the development of real-time sequencing and adaptive sequencing for the detection of antibiotic resistance genes and, in future, potentially other genes. It focuses on the adaptive sequencing approach, analysing in detail the factors influencing the performance of this new approach. The number of experiments is limited, as stated by the authors, but the data are nevertheless valuable for future projects. For further improvement, I have some suggestions for the manuscript. 1. The comparison of the three methods is quite complex and is one of the main goals of this paper, illustrating that low-cost sequencing devices (Flongle) can be used for detection of antibiotic resistance genes applying adaptive sequencing. Therefore, the description of this comparison and Figure 1C are essential for understanding the data of this method comparison. However, Figure 1C is hard to read and the represented data are not easily accessible.
To clarify, I suggest including additional information. Do "set size" and "intersection size" describe absolute numbers of detected antibiotic resistance genes? This information could be included. To provide an additional connection to the legend of Figure 1C, the absolute numbers of detected genes could be included in the text, supplementing the already-stated relative detection numbers (lines 51-54, 137-142). Since this figure part is essential for understanding, a larger version of this representation would be nice. 2. Figure 2 is essential for the interpretation of the presented data on variables influencing adaptive sequencing performance. a. Figure 2A is not easily accessible; in fact, I am not sure what information about the data is represented in this part of the figure (data throughput?). The figure legend does not explain what is shown. I suggest clarification or, if applicable, deletion of this subfigure, for increased readability of figure 1B-D. b. Figure 2D: The meaning of the "log median read length" is not explained in the text or the figure legend and should be clarified. c. Figure 2E: Same as for Figure 2D. In line 119, the absolute read length (3 kb) is stated, but this number is not visualised in this figure. I suggest adding additional information to the text, to make the representation of the data in the figure easily discoverable. 3. Discussion: In my opinion, the discussion part has some potential for improvement. a. Lines 158-162: The authors argue that selective cultivation and subsequent adaptive sequencing for antibiotic resistance genes lead to rapid results, important for public health responses. Metagenomic sequencing, on the other hand, needs at least equal time and is not cost-effective. However, might the combination of metagenomic sequencing without culture and adaptive sequencing decrease the turnaround time even more without significantly higher costs?
Although experiments on this are not within the scope of this study, the authors could discuss this for future applications. b. Line 165: "[…] reads were detected for all resistance genes known to be present […]". This result does not match the results stated in line 141 ("57.9 % of the resistance genes found") and line 184 ("nearly two-thirds of all resistance genes"). This should be clarified, or the corresponding data should be referenced in the discussion for readability. c. Line 169: Since the identity between sequencing reads and hits in the database is important for detection and the overall performance of the adaptive sequencing approach, I suggest discussing whether future improvements in sequencing accuracy (basecalling algorithms, pore design) might influence the performance of this approach, as only briefly mentioned in line 190. d. Line 190, "variable sequencing yield of this new flow cell type": This aspect is only introduced in the conclusion and should be mentioned and discussed beforehand.

Minor comments: 1. Figure 1 description: "[…] carrying nine plasmids and four carbapenemases genes […]". In line 12, the Raoultella isolate is described as carrying three carbapenemases. The OXA-1 beta-lactamase pictured in Figure 1A is not a carbapenemase; the correct number should be three carbapenemases. 2. Line 67: Flongle flow cells were introduced in 2019. I suggest deleting "recently introduced". 3. Line 210: The link is not correct. 4. Line 244, "community standards": It would be nice to add a reference. 5. Line 255: A reference is missing. 6. Line 283: This step
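The "set size" and "intersection size" quantities the reviewer asks about (Figure 1C, an UpSet-style plot) can be sketched with plain Python sets: set size is the number of resistance genes each method detected, and an exclusive intersection counts genes detected by exactly a given combination of methods. The method names and gene sets below are illustrative, not the paper's data.

```python
# Sketch of UpSet-plot quantities for three detection methods.
# Gene content here is made up for illustration.
detected = {
    "isolate":    {"blaNDM-1", "blaOXA-48", "blaVIM-1", "aac(6')-Ib"},
    "adaptive":   {"blaNDM-1", "blaOXA-48", "aac(6')-Ib"},
    "metagenome": {"blaNDM-1", "blaVIM-1"},
}

# "Set size": total genes detected per method.
set_sizes = {method: len(genes) for method, genes in detected.items()}

def exclusive_intersection(methods):
    """Genes found by every method in `methods` and by no other method."""
    inside = set.intersection(*(detected[m] for m in methods))
    others = [genes for m, genes in detected.items() if m not in methods]
    outside = set().union(*others)
    return inside - outside

print(set_sizes["isolate"])  # 4
print(sorted(exclusive_intersection(("isolate", "adaptive"))))
```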

  4. Nov 2022
    1. sequencing

      Reviewer 3. Murukarthick Jayakodi

Aury et al. have assembled the French bread wheat cv. Renan using Oxford Nanopore long-read technology, an optical map and Hi-C. They achieved a decent contig N50 of 2.2 Mb and constructed pseudomolecules with a reference-guided approach. The assembly was corrected with the Hi-C map. They annotated ~84% of repeats and projected gene models from the previously assembled Chinese Spring reference genome. The assembly quality was validated with standard approaches. The Renan assembly showed good collinearity with existing short-read wheat assemblies and pinpointed some large (>1 Mb) inversions. There is potential to catalogue structural variants, i.e. large INDELs; however, many false positives are expected when long- and short-read assemblies are compared. Nevertheless, they compared a complex tandem-repeat region. They used appropriate tools for assembly and downstream analysis. This is an additional, improved genome resource for the wheat community.

    2. The

      Reviewer 2. Gabriel Keeble-Gagnere

      The authors report on a new assembly of a French wheat variety, Renan, using Oxford Nanopore sequencing technology combined with short read polishing, Bionano optical maps and Hi-C to validate chromosome-level ordering after anchoring to IWGSC RefSeq v2.1. This is the first study I know of to use Oxford Nanopore to assemble a complete wheat genome, and the results demonstrate that this technology (together with short read polishing, Bionano, Hi-C, etc) can be successfully applied to such a complex genome. Evidence is presented to support the quality of the assembly, but it is mostly at the global statistics level (eg: contig N50, total size of gaps) or macro-scale (whole chromosome dotplots). One detailed comparison between Renan and Chinese Spring of a biologically important region is presented. The assembly is clearly of a high standard and is a valuable addition to the growing set of wheat varieties assembled to chromosome-scale. However, given the high quality of the IWGSC RefSeq v2.1 assembly (Zhu et al. (2021)), the claim that this assembly "achieves higher resolution for research and breeding" is quite strong and needs to be supported by more evidence. Given what is presented here, a more accurate statement might be "achieves higher contiguity and local completeness". The high contig N50 of 2.2Mb is highlighted but I feel that more work is needed to demonstrate that the sequence is free of artefacts. The authors show in Figure 2 that this assembly has the lowest (though only slightly) complete BUSCO score out of the wheat genomes they compare with. Is it possible that some regions cause problems for the Oxford Nanopore technology and are either fragmented or completely absent from the assembly? Bionano maps were used but no evidence is presented to show the level of agreement with the assembled sequence and Bionano maps, as is done in Zhu et al. (2021).

      In summary I think there are two key things to address: 1) More evidence supporting that the assembly is locally accurate, especially validation with alignment to Bionano maps; 2) Some results presented to relate this assembly to the existing chromosome-scale assemblies of wheat genomes.

      To address these points, I think the following would greatly enhance the paper:

      a) Using any method (eg: the method in Brinton et al. (2020)), identify identical-by-state haplotypes between Renan and Chinese Spring and the chromosome-scale assemblies from Walkowiak et al. (2020). This analysis would essentially produce a table which would be valuable supplementary data. A figure similar to Figure 3 (b) from Walkowiak et al. (2020) for a single chromosome, showing the regions of the existing wheat genomes sharing haplotypes with Renan would help place this genome into context.
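The windowed identity-by-state comparison suggested above can be sketched roughly as follows. This is a toy illustration, not the Brinton et al. (2020) implementation; the window size, the 0.1% discordance threshold, and the input format (pre-computed variant and discordant positions) are all placeholder assumptions.

```python
# Illustrative sketch of windowed identity-by-state (IBS) classification:
# a window is called a shared haplotype when the fraction of discordant
# variant calls between two assemblies stays below a small threshold.

def ibs_windows(variant_positions, discordant_positions, chrom_len,
                window=1_000_000, max_discordance=0.001):
    """Return a list of (start, end, label) tuples, one per window.

    variant_positions: positions where either assembly has a variant call
    discordant_positions: subset of positions where the two calls differ
    """
    discordant = set(discordant_positions)
    windows = []
    for start in range(0, chrom_len, window):
        end = min(start + window, chrom_len)
        in_win = [p for p in variant_positions if start <= p < end]
        if not in_win:
            windows.append((start, end, "no_data"))
            continue
        rate = sum(1 for p in in_win if p in discordant) / len(in_win)
        label = "shared_haplotype" if rate <= max_discordance else "divergent"
        windows.append((start, end, label))
    return windows
```

Applied genome-wide between Renan and each assembly from Walkowiak et al. (2020), the resulting window labels would form the kind of supplementary table proposed above.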

      b) This then defines large regions of the Renan assembly that can be directly compared at the base level to other assemblies. Select 2 or 3 examples to show how the Renan sequence compares to the equivalent region in other assemblies, and show the Bionano validation of Renan sequence together with presence of genes and gaps in each assembly being compared. Since the sequences being compared here should be the same (based on the previous step above), the genes from the Renan annotation can be mapped across and directly compared. This would provide direct evidence for the higher quality assembly being claimed. Figure 5 is a good comparison of a biologically important region, but it is unclear if the region in Chinese Spring and Renan is the same haplotype or not. This needs to be clarified at the start of this section. If the same, then the comparison is of two regions expected to be basically identical (and could be one of the examples used in the proposed comparison analysis above); if different, then that needs to frame the discussion since the region in Chinese Spring could theoretically contain different genes or more repeats, for example.

      Centromeres are not mentioned, though it is known to be a particularly difficult region in wheat genome assemblies. How do the centromeres look in this assembly and how do they compare to previous wheat assemblies? Do the Bionano maps validate the assembly in the centromere region? The analysis in point a) above would identify centromeres in common with other assemblies. Likewise, the distal ends of chromosome arms, including the telomere sequences, are known to cause problems for Hi-C ordering and orientation. Again, the Bionano alignments demonstrating correct ordering would be particularly valuable.

      Figure 2 should be a supplementary figure.

    3. Abstract

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac034), and the journal has published the reviews under the same license. These reviews were as follows.

      Reviewer 1. Sean Walkowiak

      First review: Comment 1: The authors could more clearly and accurately present and discuss sequencing and assembly approaches, including the advantages and limitations of the ONT assembly presented here

      While the standards of 'quality' for assemblies are evolving, there are standard sets of 'science-based' criteria for considering the quality of a genome, such as the 14 criteria listed in the manuscript here: https://www.nature.com/articles/s41586-021-03451-0#Tab1. Many of these criteria are ambitious, particularly for wheat due to its size and complexity, and many criteria are not met using previous assembly approaches, or the approaches used in this study. It is true that CS and 10+ Wheat Genomes do not use long reads; however, these assemblies are valuable and have been rigorously validated using 10X Genomics, Hi-C, and long read data. They also perform well for TE content, BUSCO (as outlined by Tables 1 and 2 and Fig 3 in this manuscript), and they were actually used in this MS as a reference for guiding the ONT assembly. I would also expect that they have a better base pair accuracy than the assembly presented here. I therefore suggest that the authors revise their statement "these assemblies have been produced using short-read technologies and are therefore not up to the quality standard of current genome assemblies". If the authors wish to discuss assembly quality, which I recommend they should, I suggest focusing on the advantages and limitations of each technology and assembly approach in a constructive way, perhaps with a stronger focus on the value of the ONT resource developed here. In regards to base pair accuracy, ONT is at a disadvantage to short reads or to PacBio. This is particularly true in the context of HiFi reads, which have increased accuracy over ONT and Illumina and have greater lengths than Illumina, but PacBio and HiFi are not discussed. This is not to say that PacBio is superior in every way; the reads from ONT are longer, and these hold significant value.
As an example of differences between PacBio and ONT that might provide useful context to describe the differences between ONT and PacBio approaches, please see: https://pubmed.ncbi.nlm.nih.gov/33319909/; for differences between short read (TRITEX) and PacBio, please see https://www.nature.com/articles/s41586-020-2947-8. All of these approaches are valuable but have both advantages and limitations, with ONT also having many clear advantages and disadvantages. But these need to be clearly communicated and supported, either through the results of this study or through the literature. For example, in the discussion, the authors state that "ONT devices HAVE a real advantage over other long-read technologies". There is only one other long read sequencing technology, so if you are saying that ONT HAS a 'real advantage' over PacBio based on read length, this is valid, but it can be stated more explicitly and with examples of the read lengths from this study and the literature. It is then stated that the "error rate is drastically reduced for nanopore"; again, this is valuable and a great advancement in regards to ONT, but it would be wise not to dismiss that this error rate is still higher than PacBio HiFi, which again can be stated explicitly with support from the literature. While both of these concepts are important, after they are stated, they are not actually discussed or framed to highlight the work from this study. The true advantage of ONT, even over PacBio HiFi, is that the long reads can resolve more complex regions that span greater distances, which are abundant in wheat (see reference from above). The authors are presenting an exciting and valuable resource with this genome assembly, and this assembly has advantages due to the application of ONT, for the reasons mentioned above regarding long complex regions, but these are not fully highlighted and the authors do not take full advantage of what this assembly has to offer.
I think the authors should provide additional context and support related to the value and drawbacks of their ONT assembly. The advantages are discussed superficially at the gene level through a couple of examples (Fig 5), though none of these examples are supported with any significant biological data or validation analysis. There are many interesting features of genomes that are captured by ONT that are not captured well by short reads or PacBio, and it is unfortunate that these are not explored in any significant depth in the manuscript.

      Comment 2: Some of the 'highlighted features' in the manuscript could be better selected/executed

      This comment relates to the previous comment on having little detail on what the ONT genome is uniquely capable of providing over other approaches. Instead, the authors focus on some anomalies in the D genome as well as differences in the nanopore software for base calling. It is unclear to me what the objective is of the report on the D genome. I suspect that this may be due to differences in repeat content between D and the other subgenomes, or an artifact of the tools and analyses used. Page 6, Figures S1 and S2, may be a consequence of poor read filtering for reads that align ambiguously - i.e., perhaps reads from A and B may cross-map at a greater likelihood than those from D due to differences/similarities in repeat content between subgenomes. Once reads are aligned, the alignments should be properly filtered using standard 'best practices for NGS' - I do not see that any filtering or analysis of cross-mapping was performed, but I may have missed it. Once the alignments are filtered, read coverage dips and peaks can then be assessed statistically using tools such as CNVnator and cn.mops, which are designed specifically for comparative read depth analysis since depth may not be normally distributed, rather than arbitrarily looking at 2 times the median. There may be differences between genes and intergenic regions in terms of mapping accuracy, so it may be ideal to interrogate read depth for those separately. The increased number of gaps is also interesting, and I wonder if this could be due to the read accuracy of ONT and read mapping and assembly biases when having similar subgenomes.
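As a rough illustration of the kind of statistically grounded depth screen suggested above (dedicated tools such as cn.mops and CNVnator implement far more sophisticated models; the MAD-based robust z-score, the cutoff of 3, and the assumption that per-window depths from filtered alignments are already available are all illustrative choices, not the paper's method):

```python
# Sketch: flag windows whose mean read depth deviates from the
# chromosome-wide median by a robust z-score, instead of using a
# fixed "2x the median" cutoff.
from statistics import median

def flag_depth_windows(window_depths, z_cut=3.0):
    """window_depths: mean mapped-read depth per fixed-size window
    (e.g. computed from alignments pre-filtered for mapping quality).
    Returns indices of windows with |robust z| >= z_cut."""
    med = median(window_depths)
    # Median absolute deviation; guard against a zero MAD.
    mad = median(abs(d - med) for d in window_depths) or 1.0
    flagged = []
    for i, d in enumerate(window_depths):
        z = 0.6745 * (d - med) / mad  # 0.6745 scales MAD to ~1 sigma
        if abs(z) >= z_cut:
            flagged.append(i)
    return flagged
```

For example, a run of ordinary-depth windows with one doubled window flags only the outlier, whereas a blanket 2x-median rule ignores how tight or noisy the background depth distribution actually is.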

      Nevertheless, the results and discussion on the D genome are interesting but distracting and likely reflect that the authors should take more time to explore their data and its biases before presenting this information. In summary, I believe that additional work is needed to bring value to the read depth and D genome analysis should the authors choose to include this in the manuscript. While I agree that it would be useful to communicate that a significant gain was observed when basecalling using the more accurate basecaller, the emphasis on this is disproportionate to its value in the manuscript. The observation of a better assembly when using reads from a more advanced basecaller is not something new. As for the error rate of ONT between organisms (yeast and wheat), with a sample size of 2, I do not think that this is worth presenting or discussing in any detail. While this may just be an artifact of the DNA quality itself from two experiments, I suspect that this may be a valid result from the manuscript and due to sequencing repeats, which are more abundant in wheat, in combination with how these basecallers self-train to be more accurate. While this is certainly valid, it is not novel or interesting. This result comparing species was not tested with sufficient scientific rigor/evidence, it distracts from the central result of the manuscript, and it just reaffirms something that we already know about the basecalling software, the challenges of sequencing homopolymers, and the importance of getting accurate reads using the more advanced basecalling methods.

      Comment 3: Why Renan? This comment relates to the other two comments on the selected areas of focus. The biological story, which was on gliadins, was of some value and highlighted some of the advantages of an ONT assembly, but this was not supported by any significant biological data. Renan is a well-known cultivar with abundant genomic data, mapping populations, trait data for diseases, etc. It is unfortunate that the authors could not use the genome to dig deeper to more thoroughly demonstrate the value of this assembly specifically in the context of ONT and genomics of wheat or the biology of wheat and Renan, specifically. With abundant QTL data available specifically for Renan, these could have easily been anchored to the assembly to highlight novel transcripts from the RNAseq from this study, just as an example. Even the comparisons of the Renan assembly to other available assemblies was mostly superficial and did not highlight in significant detail the value of having an ONT assembly or the value of having data specifically for Renan. While a detailed 'biological story' may be beyond the scope of this manuscript, there was minimal effort to highlight the value of the assembly, and this comment is more of a larger reflection that more could have been done to highlight the value of the genome to support the author's vague claims that the genome "will benefit the wheat community and help breeding programs".

      Minor Comments: The absence of numbered lines made it difficult to provide more detailed feedback, but there are minor items throughout, so I suggest numbering the lines and also giving the manuscript a thorough review. I appreciate that the authors present and suggest methods for future assembly of complex genomes using ONT; however, the abstract states 'we also provide the methodological standards to generate high-quality assemblies of complex genomes', and I would argue that the standards used for ONT assembly are known and are not established here. I would also suggest caution when stating that the methods here should be considered the 'standard', for the reasons indicated in Comment 1 regarding other approaches used to assemble complex genomes, such as PacBio/HiFi, and the lack of a proper investigation/discussion/comparison of assembly quality.

      Page 2: last line - what is the abbreviation ca.? Table 1: BUSCO is presented twice with different values. Tables 1 and 2 use different versions of RefSeq; I would stick to one version. It is unclear to me what trend or result the authors are trying to present in Figure 1, which I would say is common for circos plots. Presenting data 'for the sake of presenting it' is not terribly valuable, and I would encourage the authors to use the figures to present a trend or result that is impactful. In addition, the data that is presented is not presented clearly, and is cryptic. The roman numerals in the figure caption for Figure 1 are not actually in the figure. The caption also indicates that the dots indicate lower and higher values, but not of what - perhaps density of gaps? The color scales are not presented for each track. Two of the color scale palettes also look similar.

      Page 6: 62% of exons were identical, which means 38% had SNPs, so the authors argue that SNPs are therefore rare at 38% of exons? I do not think that 38% of exons having SNPs is rare; this would mean that over a third of exons have SNPs, so it is in fact common. Perhaps this statistic is misleading and the focus should instead be on the 0.7% divergence. How does this value compare with other within-species comparisons of gene content, and could this be an artifact of ONT accuracy? This question relates to a general comment that the authors could do better at bringing relevant comparisons or parallels in from the literature throughout the manuscript to bring value to any findings or insights they are presenting, particularly in the context of other ONT assemblies.

      Page 7, capitalize the T for technology, it is part of the name of the company and is a proper noun. This is repeated elsewhere.

      Page 7: 'on wheat'? This statement could be written more clearly. The way the text is worded, it sounds like the basis for selecting the SmartDenovo assembly was the number of unknown bases, when I suspect it was actually a multitude of factors (BUSCO, gene or TE content, assembly stats, etc.). While I do not question the selection of the assembly, I do suggest a clearer presentation of the information. I appreciate that the authors presented the data from multiple assemblers; one of the concerns with ONT is that the read accuracy is low and this may lead to issues in the assembly of complex polyploids with similar subgenomes. I suspect that, based on this study, it is clear that this is a valid concern for some assemblers, but may have been overcome in others. Though none of this is explored or discussed. Again, is there any information in the literature contrasting assemblers that could provide insights into what you observed?

      Searches at 90% identity and coverage for genes and TEs are not strict, especially with genomes that have highly identical subgenomes. If you reduce your thresholds enough, all features will map to your genome.

      The choice of language is often subjective or not representative of the results. For example, the 'extremely' similar TE content between Renan and CS. Why not say it is similar and actually report a value or a % difference? This would be more concise and informative than using vague and overzealous language. Page 8: short reads (dash or no dash?). The font sizes in Figure 2 are very small.

      The RNAseq is not really presented at all, except in the Materials and Methods. I thought the genes were ab initio predicted until I saw RNAseq in the Materials and Methods. I suggest at least making a note of the RNAseq in the results and/or discussion, since this additional effort brings added value to the annotations and the manuscript. The discussion says de novo annotations, but I suggest explicitly stating that RNAseq was performed.

      Figure 3 C and D do not have horizontal axis labels; the top should be labelled as subgenome, the bottom as chromosome, and the vertical axis (not the top) should be labelled as number of gaps and chromosome length. The same comment applies to the labelling of the vertical axis for panels A and B; the horizontal axis should be labelled as genome assemblies, which are reflected in the palette/legend. Note that many of the colours in this palette are similar and difficult to differentiate; it may actually take less space to label the bars with each wheat line to make it less cryptic.

      How were the dotplots in Figure 4 generated? Perhaps I missed it in the Materials and Methods. Also, none of the axes have labels or units, etc.

      Much of the text in Figure 5 is too small and illegible.

      Page 10: The discussion is superficial and vague and should provide an accurate and pragmatic discussion of the results in the context of the literature. For example, the manuscript boasts a 'higher resolution'... but of what? Perhaps 'complex repetitive regions'? To reiterate my previous comment on the lack of literature support throughout the manuscript - were these 'higher resolutions' of complex repetitive regions comparable to what was observed in the literature when ONT was applied to other systems? Again, these advantages of ONT and the assembly could be more thoroughly discussed.

      Re-review:

      The revised manuscript addresses the major concerns/comments. The assembly and its report are an exciting new resource for the wheat community. I only have one very minor comment below:

      When writing variety names in text and figures, it is important to be exact because there are many varieties with similar names internationally. CDC Stanley, not "Stanley"; CDC Landmark, not "Landmark"; "LongReach Lancer", not "Lancer", not "LongRead Lancer" - typo on line 308. I suggest performing a thorough check throughout.

    1. ABSTRACT

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.73), and the journal has published the reviews under the same license. There is also a Spanish language version of this preprint available in SciELO preprints:

      Description

      The peer reviews are as follows.

      Reviewer 1. Joseph Lopez

      This is a review for the manuscript "The first complete mitochondrial genome of Diadema antillarum (Diadematoida, Diadematidae)", Majeske et al., DRR-202205-01. This is a very interesting topic. The methods and results are clearly explained. The original figures are very good and descriptive. The authors have competently analyzed the data and written a succinct manuscript. Marine biologists understand the legacy and impact of the Diadema epidemic from the 1980s. Therefore, it is important to help bring this species back from the brink, if not to dominance, in the Caribbean. This could possibly happen with more systematic and molecular genomic characterizations such as this study. Was this project part of a larger project to sequence the whole Diadema genome? If so, the authors could state this and not be penalized. Due to the large number of mtDNA molecules, assembling the mitochondrial genome is commonly done in whole genome projects. Having the mtDNA properly assembled is now a great asset for conservation and population genetics.

      Reviewer 2. Remi N. Ketchum

      Are all data available and do they match the descriptions in the paper?

      Yes. The GitHub is up to date but I cannot yet access the NCBI databases although numbers are provided (likely submitted but not publicly available).

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes. I would suggest that the authors also make their alignments available to the public.

      For additional comments see: https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvMzQzL1Jldmlld19HaWdhQnl0ZS5kb2N4

      Reviewer 3. Andreas Kroh

      Are all data available and do they match the descriptions in the paper?

      No. The data was not provided together with the manuscript, so I am unable to check this. The manuscript, however, states that the data will be deposited in GenBank

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      No. Locality details missing, Voucher specimen number missing, Repository institution for voucher specimen not identified.

      Is the data acquisition clear, complete and methodologically sound?

      No. See details below.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. See details below.

      Is there sufficient data validation and statistical analyses of data quality?

      No. Unclear - some detail is missing in the methods section to allow judgement - see details below.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      No. See above and below - voucher specimen number is missing, some methodological information is missing, references to original papers providing sequences used in the analysis are missing, etc. - see details below.

      Additional comments.

      The manuscript by Majeske et al. on the mitogenome of Diadema antillarum is an interesting contribution to the phylogeny of Echinoidea. There are, however, a number of issues which should be addressed in a revised version, in my opinion.
      1) Please provide coordinates for the sampling site (and a locality name) instead of a general region.
      2) Please provide the repository number and institution where the voucher specimen has been deposited.
      3) Did you verify the identification and make sure that this is D. antillarum rather than D. africanum (which allegedly has repopulated some D. antillarum habitats in the Caribbean and GoM)? For a morphological comparison see: Rodríguez, A., Hernández, J. C., Clemente, S. & Coppard, S. E. 2013. A new species of Diadema (Echinodermata: Echinoidea: Diadematidae) from the eastern Atlantic Ocean and a neotype designation of Diadema antillarum (Philippi, 1845). Zootaxa 3636, 144-170.
      4) Please report the insert size that was targeted during library prep (typically either 350 bp or 550 bp for the kit mentioned).
      5) Explain why the S. purpuratus mitogenome was used to map the reads rather than one of the diadematid mitogenomes.
      6) Please explain why the custom assembly pipeline was used rather than one of the well-established assemblers like SPAdes, ABySS, Velvet, etc.
      7) Please provide a coverage graph.
      8) The position of the non-coding region is given in bp, but without information on which feature is considered as zero in a linearized version of the circular sequence, the position is useless.
      9) Please explain what exactly was used for the analysis - the full nucleotide sequence including non-coding regions, just the CDS of the protein-coding genes, or something else?
      10) Please add references to the original papers that published the sequences you use in the tree.
      11) Please explain the choice of the model used in the analysis - was some ModelTest run?
      12) Please provide the fasta file together with a revised version to allow checking the quality of the annotation etc.
      13) Fig. 1: please provide some information on the photo shown - is this the specimen that was sampled? Add this info and the locality in the caption.
      14) Fig. 2: add the accession numbers in the tree and highlight the new sequence.
      15) Please see additional minor comments in the annotated version, which is attached.
      Summing up, I recommend acceptance after major revision. Kind regards, Andreas Kroh, NHM Vienna, 10/7/2022

      See following file. https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvMzQzL2d4LURSLTE2NTM2ODM4NTRfQUsucGRm

      Re-review: The revised manuscript of Majeske et al. is much improved in comparison to the initial submission. Some of the questions raised in the previous review, however, remain open, and other new aspects have appeared. Open issues:
      2) Please provide the repository number and institution where the voucher specimen has been deposited --> this issue has not been addressed in the revised version; it is unclear if a voucher specimen has been deposited or not, where it is stored and which inventory number it has; if the specimen has not been retained, this is unfortunate, but not a huge issue - it still needs to be clearly/openly stated.
      3) Did you verify the identification and make sure that this is D. antillarum rather than D. africanum (which allegedly has repopulated some D. antillarum habitats in the Caribbean and GoM)? --> this issue too has not been addressed; at the very least I would expect a statement that the authors were aware of this second Atlantic Diadema species and how they made sure they really had D. antillarum.
      7) Please provide a coverage graph --> the coverage graph is mentioned in the text, but not provided in the paper.
      9) Please explain what exactly was used for the analysis - the full nucleotide sequence including non-coding regions, just the CDS of the protein-coding genes, or something else? --> this is still unclearly formulated in the paper - I assume the whole mitogenome sequence was used, but the wording is very ambiguous; this needs to be very clearly stated in the material and methods section.
      11) Please explain the choice of the model used in the analysis - was some ModelTest run? --> this information is still lacking.

      New issues:
      A) The description of the assembly process is still rather unclear - this needs to be better explained. For example, was any kind of preprocessing (read trimming etc.) done? Which parameters were chosen for the various programs employed? How did the two-stage read extraction process really work? The wording in the manuscript is very unclear regarding this aspect.
      B) The raw data need to be deposited in the GenBank Short Read Archive (SRA); in the GitHub repository only the extracted mitochondrial reads are available - this is insufficient to repeat the assembly process and analyses carried out in the present manuscript.
      C) The fasta file included in the GitHub repository has 23 positions that are redundant (overlapping with the start of the sequence) - they need to be removed before submission.
      D) There is some inconsistency on the length of the mitogenome: the text says 15,708, the figure says 15,707 - the latter, judging from the files in your GitHub repository, is correct --> please make sure the information given is consistent.
      E) No information is given on the reason for choosing the particular evolutionary model that has been used in the phylogenetic analysis.
      F) The phylogenetic analysis has been done by NJ methods, which are fast but can be subject to a lot of problems; it would be better to use Maximum Likelihood (or Bayesian) methods.
      G) The authors have made an important discovery in relation to the mitogenome deposited as "Echinothrix diadema" in GenBank. Rather than speculate on the reasons it is the sister of D. antillarum in their analysis, the authors should simply test which of their hypotheses (AT-bias vs. misidentification) is correct. All the tools that are needed are already available in GenBank! There is an extensive dataset of three mitochondrial markers (12S, ATP6, ATP8; https://www.ncbi.nlm.nih.gov/popset/?term=MW329515 etc.) available for Echinothrix, which includes hundreds of sequences and encompasses material from the complete geographical range of the genus (Coppard et al. 2021, https://www.nature.com/articles/s41598-021-95872-0). In addition, there are 16S sequences available for D. savignyi, the suspected candidate of the misidentification. I have downloaded these sequences and run preliminary analyses with a subset of the sequences. These clearly show that the "E. diadema" mitogenome has nothing to do with true E. diadema and that it is a Diadema. While the data basis for Diadema is less extensive than for Echinothrix, there are 16S sequences of D. savignyi (GenBank PopSet: 673458050) that are identical to part of the 16S sequence of the alleged "E. diadema" mitogenome. Thus I am convinced that the second hypothesis (misidentification) of the authors is correct. This is an important finding that should be discussed in depth in the manuscript. I am including the alignments and trees that I made in the attachment - similar analyses and trees should be included in the manuscript. Link to download the attachments: https://we.tl/t-y7ypbnZYPQ
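The identity check described in point G boils down to comparing the query fragment against reference sequences by pairwise distance and asking which reference it sits closest to. A minimal sketch (with made-up toy sequences, not the real GenBank entries, and uncorrected p-distance rather than a model-corrected distance):

```python
# Sketch: uncorrected p-distance between aligned sequences, and a
# nearest-reference lookup of the kind used to test a suspected
# misidentification.

def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length
    sequences (gaps and Ns count as differences for simplicity)."""
    assert len(a) == len(b), "sequences must be aligned to equal length"
    return sum(x != y for x, y in zip(a, b)) / len(a)

def closest_reference(query, references):
    """references: dict mapping name -> aligned sequence.
    Returns the (name, distance) pair with the smallest p-distance."""
    return min(((name, p_distance(query, seq))
                for name, seq in references.items()),
               key=lambda t: t[1])
```

A query identical to one reference and divergent from the other resolves unambiguously, which mirrors the reviewer's finding that the alleged "E. diadema" 16S matches D. savignyi exactly.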

      Summing up, I recommend acceptance after major revision. Kind regards Andreas Kroh, NHM Vienna, 11/9/2022

    1. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.72), and the journal has published the reviews under the same license. These are as follows.

      Reviewer 1. Jeffrey West

      This is a very nice & useful extension to PhysiCell, in order to model PK/PD dynamics in agent-based simulations. Overall, the description of the software is good and easy to follow, but I offer a few suggestions for clarity:

      1. In "Statement of Need" -- the phrase "how much gets to the cells and what they then do to the cells" is vague and casual -- maybe use standard terms like drug exposure & response to describe PK/PD relationships
      2. Final sentence in "Statement of Need" that says "Substrates can target any cell type with PD dynamics" -- can you elaborate? Does this indicate that every cell type can have unique PD dynamics?
      3. In "Implementation" the authors refer to Figures 2A and 2B, but Figure 2 only has one panel -- perhaps this should be Figure 1A/B?
      4. In "Pharmacodynamics" -- "the list of PK substrates and the list of PD substrates need not have any relationship" -- this is slightly confusing. I assume that every substrate can have associated PK dynamics without having a PD dynamic, but is the opposite true? If so, what is the drug dispersal/decay rate?
      5. Finally, the discussion section is focused mainly on future steps. I think it would be helpful for the discussion to focus more on current advantages and functionality. This is the publication record for this software, and as is often the case, future steps may be subject to change.

      Reviewer 2. Boris Aguilar

      Is the code executable?

      This code cannot be in an executable form, as it is an extension to PhysiCell.

      Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

      I am not familiar with running PhysiCell

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages?

      The authors claim this is the first time a PK/PD module has been added to PhysiCell.

      • I think there is a mistake in the figure reference in the Installation section; it should be Figure 1.
      • Reference to PhysiBoSS missing
  5. Oct 2022
    1. biomonitoring

      Reviewer 4. Christina Lynggaard

      This manuscript assesses the variation in arthropod communities in three ecoregions in Canada. The study is well done, and the sampling was very thorough with a big sampling effort. I only have minor comments. Specifically, I consider that the aim can be focused on the ecoregions instead of the feasibility of the method, as this has already been shown. In addition, it would be nice to have more details in certain sections in the data analyses and in the results. I have addressed these comments below.
      - I am not sure why the title is "Message in a bottle".
      - Line 65: Could you specify which indicator species have been targeted? Or cite studies that target those species?
      - Line 96: Based on the limitations of the ecoregions, it is not clear why ecoregions are an obvious candidate.
      - In line 104 it seems that your aim is to demonstrate how feasible it is to use metabarcoding for large-scale monitoring and that you use the ecoregions to prove that. However, showing the feasibility of this method for large-scale studies has already been done (e.g. Svenningsen et al. 2021, Detecting flying insects using car nets and DNA metabarcoding; Bush et al. 2020, DNA metabarcoding reveals metacommunity dynamics in a threatened boreal wetland wilderness). I suggest keeping it focused on the need to apply this method in different ecoregions.
      - In the Data description section, you mention that you examined phylogenetic diversity, but in the Analyses section you only vaguely mention it. The phylogenetic diversity findings are discussed later on, but it is difficult to follow the discussion when the results were not presented previously. In addition, the authors use the phylogenetic diversity findings to support the idea of a structure in the ecoregions, so I suggest placing more emphasis on this in the results section.
      - Line 189: I agree that the higher number of BINs could be due to eDNA, but couldn't another reason be that the BINs were oversplit during data analysis?
      - Lines 215-217: Has this been found previously in other studies using Malaise traps? If so, please reference those findings.
      - Line 222: This is a brief discussion about temporal turnover. However, these results are not presented previously, or at least not clearly enough.
      - Lines 266-267: Yes, you showed compositional shifts using metabarcoding in bulk arthropod samples, but the way this sentence is structured it sounds like you are the first to show this. Compositional shifts in arthropods have been shown previously in other studies using metabarcoding.
      - Line 321: Did you have negative PCR controls? In line 326 you mention negative controls, but I assume you refer to the extraction negative controls.
      - Line 340: It is not clear why you queried the data against a bacterial library.
      - Line 348: What was the reason for choosing "at least three reads"? The same applies to line 350, where you cluster sequences with a minimum of 5 reads per cluster.
      - Line 357: If you see tag switching in your negative controls, that means that most likely you have it in the rest of the data. How did you ensure that the rest of the data did not have it? You may have tag switching in sequences not found in the negative controls but found in your samples.
      - Line 369: As you used the Bray-Curtis index on this metabarcoding data, did you convert your data to presence/absence? It is known that for metabarcoding data the use of read numbers for community analysis is not adequate (see Nichols et al. 2018, "Minimizing polymerase biases in metabarcoding").
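      The presence/absence point in the last comment can be made concrete with a small sketch (not the study's pipeline; the OTU counts are invented): Bray-Curtis on raw read counts is dominated by the most-sequenced OTUs, while on presence/absence data it reduces to the Sørensen dissimilarity.

```python
# Minimal sketch: Bray-Curtis computed on raw read counts vs. on
# presence/absence, illustrating why the two can differ strongly
# for metabarcoding data.

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    total = sum(x) + sum(y)
    return 1.0 - 2.0 * shared / total

# Two hypothetical samples over the same five OTUs (read counts).
sample_a = [1000, 10, 5, 0, 1]
sample_b = [10, 1000, 5, 1, 0]

# Raw read counts: the result is dominated by the two abundant OTUs.
bc_counts = bray_curtis(sample_a, sample_b)

# Presence/absence transform: every detected OTU counts equally.
pa_a = [1 if n > 0 else 0 for n in sample_a]
pa_b = [1 if n > 0 else 0 for n in sample_b]
bc_pa = bray_curtis(pa_a, pa_b)  # equals Sørensen dissimilarity

print(round(bc_counts, 3), round(bc_pa, 3))  # 0.975 0.25
```

The two samples share three of their detected OTUs, yet the count-based index calls them almost completely dissimilar, which is the bias the reviewer is warning about.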

    2. Traditional

      Reviewer 3. Kingsly Beng

      Steinke et al. used DNA metabarcoding of Malaise trap samples from 52 protected areas spanning three Canadian ecoregions to assess the spatial patterns of arthropod biodiversity. The research question is relevant and interesting, the study is well designed, the data collected are comprehensive, and the manuscript is well written and easy to follow. I enjoyed reading it and would like to thank the authors for such a great contribution. My main concern is that the temporal aspect of the study was not explored even though it was mentioned as part of the research objective.
      Specific comments
      L60-62: These reductions are not only for abundance but also for diversity, at least based on the fourth reference cited here. I would therefore include "diversity" or "richness" in this statement.
      L63 & L105: The authors use biosurveillance in some places in the text and bio-surveillance in others. Isn't it better to stick to the same spelling all through, at least for consistency?
      L132: I am a bit confused here. Are these "Analyses" or "Results"? The whole subsection from L133-L176 reads like results to me.
      L329: "of" omitted! Five samples were available from each of the other 22 sites...
      L332-334: The first "following" in this sentence can be either omitted or that part of the sentence completed using "manufacturer's instructions".
      L345-346: "Reads were trimmed 30 bp from their 5' terminus with a set trim length of 450 bp". Perhaps this needs more clarification. The amplified length was 463 bp; trimming 30 bp gives 433 bp. How then can the set trim length be 450 bp?
      L348-349: What was the criterion for using "at least three reads matched an OTU in the reference database"? I mean, why not at least two or at least four reads? If this was arbitrary, please clarify.
      L349-350: Same question as above: why use "a minimum of five reads per cluster"? It would be nice to indicate if any benchmarking was applied a priori or if this was set arbitrarily.
      L346-349: Since the authors were mostly interested in arthropods, were reads that matched sequences from bacteria (SYS-CRLBACTERIA), chordates (SYS-CRLCHORDATA) and non-arthropod invertebrates (SYS CRLNONARTHINVERT) discarded or retained? This should be mentioned here, and estimates of the number of reads, BINs or OTUs matching each of these categories should be provided.
      L149-153: These are interesting results. It would be nice to present them graphically, at least in the supplementary. The aim of the study was "to assess spatial and temporal variation in species richness and diversity in arthropod communities from 52 protected areas spanning three Canadian ecoregions", but the temporal aspect of the study was not fully explored. Although it is stated that "trap catches were harvested every second week from early May through September", this information has not been used in the analysis. Should the aim of the study be redefined and restricted to just spatial patterns then?
      L152-153: Without any table or figure to support these results, why not provide the actual number, proportion or percentage of BINs for each arthropod order in the text?
      L157-158: Please add some symbols (e.g. asterisks *, **, *** or letters a, b, c) to Figure 3b to represent significant differences. Looking at the present figure without referring to the text does not tell the reader if the differences are significant. Besides, the authors only report a single p value (p < 0.003), which probably means at least one of the groups is different from the others, but fail to report the pairwise multiple comparison tests that tell the reader which pairs or groups (e.g. ECF vs EGL, ECF vs SGL, EGL vs SGL) are significantly different.
      L159: Are the patterns similar if you control for the total number of sites per ecoregion? For example, taking 12 sites per ecoregion and resampling them 100 or 1000 times, similar to the approach used for beta diversity. It could be that one site is driving this pattern, as shown in Figure 2b and reported in L141: "...with more than a third (9,301) found at only one site (Figure 2b)".
      L164-166: Please provide the full PERMANOVA results in a table in the text or supplementary and reference it here. It is not clear what "decreased site elevation (R2 = 0.035, P = 0.03)" means.
      L168-171: Do these patterns change or remain the same if the same number of sites per ecoregion is used? This needs to be tested given that one site (probably from ECF or EGL?) is disproportionately species-rich and SGL has the lowest number of sites.
      L173-176: What about levels of turnover across time? Were there any temporal trends in alpha and beta diversity? Was the temporal aspect dropped from the study objective, and why?
      L221-223: Same question as above: were temporal changes in species composition considered? Which results, tables or figures point to this, or how did the authors arrive at these statements?
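      The equal-effort resampling suggested for L159 and L168-171 could look roughly like this (an illustrative sketch only; the site-by-BIN sets are invented, and the real analysis would use the study's 52 sites):

```python
# Hedged sketch of the reviewer's suggestion: draw the same number of
# sites from each ecoregion many times and compare pooled BIN richness
# at equal sampling effort.
import random

def resampled_richness(sites, n_sites, n_iter=1000, seed=42):
    """Mean pooled BIN richness over random draws of n_sites sites.

    sites: list of sets, each set holding the BINs detected at one site.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_iter):
        draw = rng.sample(sites, n_sites)
        pooled = set().union(*draw)  # union of BINs across drawn sites
        totals.append(len(pooled))
    return sum(totals) / len(totals)

# Hypothetical ecoregions with unequal numbers of sites.
ecoregion_a = [{"b1", "b2", "b3"}, {"b2", "b4"}, {"b5"}, {"b1", "b6"}]
ecoregion_b = [{"b1"}, {"b1", "b2"}, {"b2"}]

# Compare both at the effort of the smaller region (3 sites each).
rich_a = resampled_richness(ecoregion_a, 3)
rich_b = resampled_richness(ecoregion_b, 3)
print(rich_a > rich_b)  # region A stays richer at equal effort
```

Because every draw uses the same number of sites, a richness difference that survives the resampling cannot be an artifact of unequal site counts alone.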

    3. Background

      Reviewer 2. Shanlin Liu

      Steinke et al. used a metabarcoding method to investigate the species compositions of 410 insect bulk samples collected in 3 ecoregions. The manuscript is well written and all the materials and methods were clearly described. I think the manuscript should be accepted for publication after addressing several minor issues as follows:
      1. Line 126: as Ion Torrent is not widely used nowadays, the authors may add some words regarding its sequencing length, error rate, throughput, etc.
      2. Please unify the format of Chao 1 (or Chao-1).
      3. A rarefaction curve for each sample may be needed to check whether the species diversity is well represented by its raw reads.
      4. Lines 187-191: This BIN number inflation may also boil down to sequence errors introduced during PCR amplification or sequencing.
      5. Please pay attention to the citation format. For example, in line 202, reference #40 should follow the first author's name.
      6. Lines 226-227: please add some words to better explain the speculation of "passively transported by wind".
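      The rarefaction check suggested in point 3 can be sketched analytically (Hurlbert's expected richness at a given subsample depth); the count vector below is invented, and real input would be the reads per BIN in one sample:

```python
# Hedged sketch of an analytical rarefaction curve, usable to check
# whether a sample's observed diversity has plateaued at its read depth.
from math import comb

def rarefaction(counts, depths):
    """Expected number of taxa observed at each subsample depth."""
    n_total = sum(counts)
    curve = []
    for n in depths:
        # For each taxon, probability it appears in a random subsample
        # of n reads is 1 - C(N - N_i, n) / C(N, n).
        expected = sum(
            1 - comb(n_total - c, n) / comb(n_total, n) for c in counts
        )
        curve.append(expected)
    return curve

counts = [500, 300, 150, 40, 9, 1]   # illustrative reads per BIN
curve = rarefaction(counts, [1, 10, 100, 1000])
# Expected richness rises toward the observed 6 BINs as depth grows;
# a flat tail suggests the sample is adequately sequenced.
print([round(v, 2) for v in curve])
```

Plotting such a curve per sample (depth on x, expected richness on y) is exactly the per-sample saturation check the reviewer is asking for.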

    4. Abstract

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac040, and has published the reviews under the same license. These are as follows.

      Reviewer 1. Camila Duarte Ritter

      The manuscript is very well written and a great contribution to the field. However, some analytical aspects need to be better described. Also, it would be great if the authors provided their R script in the supplementary material. Below are my comments.
      Line 166: R2 = 0.035 is very low; this needs to be better considered.
      Lines 168-171: Was the alpha diversity comparison based just on visual inspection, or was any test made?
      Lines 173-176: Was there any test of significance? It needs to be reported.
      Lines 213-219: It is a nice discussion about local versus regional diversity, but very speculative; it needs at least some citations to support it.
      Lines 357-358: It reduces background contamination; you can never remove all of it.
      Lines 365-367: How were the distances controlled; was there any analysis of spatial correlation?
      Lines 367-370: Was the NMDS run with abundance or presence/absence data? If it was abundance, was any correction applied?
      Lines 374-376: How did the authors check the quality of the tree, as it was made with a very short fragment? Did the blackbox tool set all parameters of the model?
      Line 382: Was any correction applied to the BINs table? Rarefaction, Shannon entropy? This is very necessary for metabarcoding data. Also, why just BIN richness? Other diversity measures could be included, such as Shannon or Fisher diversity in phyloseq, or the effective number of BINs with entropart.
      Figure 1 needs a reference to Canada to better understand where the region is.
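      The extra diversity measures suggested for line 382, Shannon entropy and the effective number of BINs (the exponential of Shannon entropy, i.e. the Hill number of order 1), can be sketched as follows; the counts are illustrative, not from the study:

```python
# Hedged sketch of Shannon entropy and the effective number of BINs,
# the measures the reviewer suggests alongside plain BIN richness.
from math import log, exp

def shannon(counts):
    """Shannon entropy (natural log) of a count vector."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * log(p) for p in props)

def effective_bins(counts):
    """Effective number of equally common BINs, exp(H)."""
    return exp(shannon(counts))

even = [100, 100, 100, 100]   # perfectly even community
skewed = [970, 10, 10, 10]    # same richness, one dominant BIN

print(round(effective_bins(even), 2))    # 4.0: all 4 BINs count fully
print(round(effective_bins(skewed), 2))  # well below 4
```

Both vectors have a richness of 4, but only the effective number reveals that the skewed community behaves like barely more than one BIN, which is why richness alone can mislead.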

      Re-review:

      The study is very well designed and written, with good and clear results. The authors have considered all my previous comments; just some additional minor comments are below.
      Lines 118-119: species (BIN) richness is a measure of alpha diversity, and change in community composition a measure of beta diversity.
      Lines 112-122: Malaise traps collect some random local non-flying insects. While discussing that they represent the local population is ok, I miss the part about the random sampling and that the lack of such insects in the samples does not exactly mean the non-presence of these insects.
      Lines 243-246: The sentence "Although current metabarcoding protocols cannot estimate the abundance of each species" is not completely right. Currently many metabarcoding studies estimate abundance/biomass of species; some discussion of this is necessary. Some examples (among several others):

      Elbrecht, V., & Leese, F. (2015). Can DNA-based ecosystem assessments quantify species abundance? Testing primer bias and biomass sequence relationships with an innovative metabarcoding protocol. PloS One, 10(7), e0130324.
      Thomas, A. C., Deagle, B. E., Eveson, J. P., Harsch, C. H., & Trites, A. W. (2016). Quantitative DNA metabarcoding: improved estimates of species proportional biomass using correction factors derived from control material. Molecular Ecology Resources, 16(3), 714-726.
      Di Muri, C., Lawson Handley, L., Bean, C. W., Li, J., Peirson, G., Sellers, G. S., ... & Hänfling, B. (2020). Read counts from environmental DNA (eDNA) metabarcoding reflect fish abundance and biomass in drained ponds. Metabarcoding and Metagenomics, 4, 97-112.
      Ershova, E. A., Wangensteen, O. S., Descoteaux, R., Barth-Jensen, C., & Præbel, K. (2021). Metabarcoding as a quantitative tool for estimating biodiversity and relative biomass of marine zooplankton. ICES Journal of Marine Science, 78(9), 3342-3355.

      For the figures comparing the ecoregions, as there are just three I would recommend a colorblind-safe palette; orange, yellow and green is not nice.

    1. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.71), and has published the reviews under the same license. These are as follows.

      Reviewer 1. John Hamilton

      Are all data available and do they match the descriptions in the paper?

      Yes. Downloaded and checked from the Gigabyte FTP site

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      Yes. I was unable to check sequence data deposited in the SRA.

      Is the data acquisition clear, complete and methodologically sound?

      Yes, but summary tables are missing for the Illumina WGS and RNA-Seq sequencing in the manuscript.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. Some parts of the manuscript are very good in this respect and some parts (esp. annotation) are missing parameters and core details.

      Is there sufficient data validation and statistical analyses of data quality?

      No. Especially the analysis of the quality/completeness of the genome annotation.

      Is the validation suitable for this type of data?

      Yes. Where it is not missing, it is suitable.

      Additional Comments:

      In this manuscript, Canales et al. present the long-read based assembly and annotation of the genome of fever tree (Cinchona pubescens), well known as the source of quinine alkaloids traditionally used to treat malaria. This will be a genome of interest and a welcome resource for the community. I enjoyed reading this manuscript about this interesting species and I have several comments:
      1. There is no summary table for the Illumina WGS and the three RNA-Seq libraries. This should be added.
      2. Since you have Illumina WGS short reads, it would be informative to add a GenomeScope k-mer plot (http://qb.cshl.edu/genomescope/) as an additional estimate of genome size and heterozygosity to section 1.3.
      3. The BUSCO metrics for the assembly are lower than expected. I believe this is due to the lack of sufficient genome polishing. Refer to the Solanum pennellii genome paper (https://doi.org/10.1105/tpc.17.00521), where they used a similar assembly strategy and discuss the need for adequate polishing (see “Prior to Polishing, Genome Error Rate Is Substantial”).
      4. Section 1.6 – It is noted that PASA describes transcript evidence as ESTs, which is a legacy from the time it was developed, but then the RNA-Seq transcript assemblies are also described as ESTs later in the section, which is incorrect and confusing.
      5. There is no assessment of the annotation, just a statement of the number of CDSs predicted. This is an issue as the number of CDSs is far higher than reported in related species. There is no discussion of repeat masking the genome assembly, so I am assuming AUGUSTUS was run on the unmasked assembly with no downstream filtering or refinement. Doing this increases the number of TE-related gene models and annotation artifacts. As this is a data note/data release there should really be, at a minimum:
      a. A table summarizing the annotation in the manuscript
      b. An analysis to identify models with evidence support
      c. BUSCO results for the annotation
      

      Re-review: I’ve read the author’s responses to all the reviewer comments and read the updated manuscript and I am satisfied with the changes made.

      Reviewer 2. Bing Bing Liu

      I was very pleased to read your article on the fever tree genome, and I think it is a very valuable foundational work. The assembled genome recovered ~85% (903 M or 904 M, Table 1) of the estimated genome size (1.1 Gb/1C) with an N50 = 2,802,128 bp; 72,305 CDSs were annotated and 83% (or 87.6%, line 207) of BUSCOs were recovered, but there is a lack of clarity around these statistics in the study. It is also necessary to provide the repeat annotations, functional annotations and non-coding RNA annotations. Besides, the BUSCOs recovered are no more than 90%; you should give your explanation.
      

      Minor comments
      Lines 34-43: you should add the plastid genome results here.
      Line 38: check the genome size. Maybe 904 M?
      Lines 41 and 207: you gave two different percentages of BUSCOs, please check.
      Lines 144, 145 and 149: the numbers of reads and bases do not correspond, please check, as the read length is 150 bp.
      Line 172: I have doubts about the overall mapping rate (7.34%).
      Line 198: you should add a description of the genome produced by RACON.
      Lines 204 and 223: why do you use different versions of the BUSCO software?
      Lines 206-207: why did you not give the result of the mapping ratio?
      Line 243: you should provide the BUSCO result of the proteins (CDSs).
      Lines 257 and 280: why do you use different versions of MAFFT?
      Line 289: check the sentence ‘had a BS of 100%’.
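      The read/base consistency check behind the comment on lines 144, 145 and 149 is simple arithmetic; the numbers below are placeholders, not the manuscript's values:

```python
# Tiny sketch: for fixed-length 150 bp reads, the reported base count
# should equal read count x 150. Values here are invented placeholders.
READ_LENGTH = 150

def expected_bases(n_reads, read_length=READ_LENGTH):
    return n_reads * read_length

reported_reads = 20_000_000      # placeholder
reported_bases = 3_000_000_000   # placeholder
print(expected_bases(reported_reads) == reported_bases)  # True
```

If the reported values fail this check, either the read length, the read count, or the base count in the text needs correcting.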

    1. interactive form created by the code and frictionless data package presented alongside this work [40: Dong R, Cameron D, Bedo J. Supporting data for “svaRetro and svaNUMT: modular packages for annotating retrotransposed transcripts and nuclear integration of mitochondrial DNA in genome sequencing data”. GigaScience Database, 2022; http://dx.doi.org/10.5524/102318].
    2. The remainder unreported events either had unmapped insSeqs, or undetected bps. In the online version of this paper this is presented in an interactive form created by the code and frictionless data package presented alongside this work

      See more in GigaBlog on how these were created: http://gigasciencejournal.com/blog/frictionless-data-interactive-figures/

    1. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.70), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Surajit Bhattacharya.

      The authors of the manuscript have tried to address a significant problem in genomics studies, i.e. annotation of non-coding elements of genomes. The authors have built two R tools to capture two non-coding regulatory elements, retroposed transcripts and nuclear mitochondrial integrations (NUMTs). The authors have illustrated the efficiency of the tools with examples using 2 datasets, and also benchmarked the tools against other available tools. Although the authors have performed validations, there seem to be some points that still need to be clearly elucidated.

      Minor Points:
      1. On line 125, "BEDPE and Pairs [28]" should be written as "BEDPE [28] and pairs".
      2. Although the authors benchmark the two tools, can they briefly compare the time taken to run their tools against the tools they are benchmarking with? For example, compare the time between svaRetro and GRIPper, and between svaNUMT and dinumt.
      3. It's not a question, but more of a comment. Is it possible to verify some of the novel variants identified by svaRetro and svaNUMT using PCR or any other method? This could strengthen the point that svaRetro and svaNUMT are better than the other tools.

      Reviewer 2. Gargi Dayama

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?

      Yes. Although additional clarification on features of svaRetro can be helpful

      Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?

      Yes. Additionally, it might be useful to state in the description on GitHub the R version required to install the tool (it doesn’t work with versions older than 4.1).

      Have any claims of performance been sufficiently tested and compared to other commonly-used packages?

      No. 1) The authors also need to benchmark their tools against the other previously developed tools that they used for comparison (dinumt and GRIPper) using the simulated data. 2) The authors state they found calls that were not found by the other tool. This needs to be further tested to show the results were true positives. In fact, there is no test done to look at the false positives. Therefore, a test on their entire results for false positives/true positives is essential.
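      The false-positive/true-positive test asked for here could be sketched as matching called sites to simulated truth within a positional tolerance; the coordinates below are invented, and real input would come from the simulation and call-set VCFs:

```python
# Hedged sketch of benchmarking structural-variant calls against a
# simulated truth set: a call is a true positive if it lands within
# `tol` bp of an unmatched truth event on the same chromosome.

def benchmark(truth, calls, tol=100):
    """truth/calls: lists of (chrom, pos). Returns (tp, fp, fn, prec, rec)."""
    unmatched = list(truth)
    tp = 0
    for chrom, pos in calls:
        hit = next(
            (t for t in unmatched if t[0] == chrom and abs(t[1] - pos) <= tol),
            None,
        )
        if hit is not None:
            unmatched.remove(hit)  # each truth event matched at most once
            tp += 1
    fp = len(calls) - tp
    fn = len(unmatched)
    precision = tp / len(calls) if calls else 0.0
    recall = tp / len(truth) if truth else 0.0
    return tp, fp, fn, precision, recall

truth = [("chr1", 10_000), ("chr1", 55_000), ("chr2", 7_000)]
calls = [("chr1", 10_030), ("chr2", 6_950), ("chr3", 1_000)]
print(benchmark(truth, calls))  # 2 TP, 1 FP, 1 FN
```

Running this over the full simulated call sets of each tool would give the directly comparable precision/recall figures the reviewer is requesting.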

      Is automated testing used or are there manual steps described so that the functionality of the software can be verified?

      Yes. But there is a discrepancy for svaNUMT. The following command on GitHub “NUMT <- svaNUMT::numtDetect(gr, numtS, genomeMT, max_ins_dist = 20)” doesn’t work. Instead this worked: “NUMT <- svaNUMT::numtDetect(gr, max_ins_dist = 20)”

      Additional comments sent in an annotated file to the author.

      Re-review: I feel the authors have addressed my comments. I just have one small comment about their statement in the conclusion section, lines 359-360. They made the statement that “svaRetro and svaNUMT demonstrated good performance on simulation and human cell line datasets similar to - or in some instances outperforming - other methods without re-analysis of alignment and the use of specialized detectors”. While this statement might be all right for simulated data, based on their results in lines 309-319 on cell lines, svaNUMT seems to have almost a 50% false positive annotation rate (although with low confidence). I feel this should be addressed as a caveat in the conclusion and a bit more clearly as false positives in the results. Other than that, I do not have any additional comments.

      Reviewer 3. Raniere Gaia Costa da Silva.

      See the CODECHECK Certificate of independent execution https://doi.org/10.5281/zenodo.7084333

      See more in GigaBlog: http://gigasciencejournal.com/blog/frictionless-data-interactive-figures/

    1. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.69), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Dr. Liyi Zhang

      ‘Honeycrisp’ is known for its exceptionally crisp and juicy texture and is a source of interesting genetic diversity in apple breeding programs worldwide. In addition, high-quality genomes are required for us to understand the genetic characteristics of a core cultivar. This study presents a fully phased, chromosome-level high-quality apple genome with higher contiguity and completeness than previously sequenced apple genomes, and also reveals 121 ‘Honeycrisp’-specific orthogroups with a large data set, which provide a toolbox for apple genetic research and breeding.

      The paper is well written and the data is convincing. So, I recommend publishing this paper ASAP.

      Reviewer 2. Luca Bianco

      Are all data available and do they match the descriptions in the paper?

      I could not access the BioProject data or see the results files (i.e. fasta, gff, ...), but I am confident they will be available once the paper is accepted.

      Is there sufficient data validation and statistical analyses of data quality?

      The only exception is what I mentioned regarding the haplotype separation (see general comments below).

      General comments: This paper describes the genome sequence of Honeycrisp, an important apple cultivar, produced with the latest sequencing technologies and assembled into phased chromosomes. In my opinion, the manuscript is well written, very interesting and certainly worth publication. There are only a few points that I would like to see addressed:

      1) How can you be sure that the two haplomes are a good representation of each chromosome and not a mix of the two haplotypes? In other words, have you checked that the whole sequence of each chromosome represents one phase only? It would be great if you could provide some data (e.g. SNPs,...) to support this and discuss the results obtained in this regard.

      2) Some additional stats regarding the obtained sequence could be added to Table 2 and/or Table 5 (e.g. number of Ns in the genome, how many telomeres were assembled in each chromosome, if not all telomeres were identified)

      3) The gene family analysis among the different apple genomes is quite interesting but rather superficial. It would be nice to dig deeper into the function of the orthogroups that are unique to Honeycrisp, describe what pathways they are involved in and so on...

  6. Sep 2022
    1. Background

      Reviewer 2. Haris Zafeiropoulos

      I appreciated the opportunity to review your manuscript. Tourmaline aims at facilitating an easy-to-follow architecture for tracking input and output file names, parameters, and commands of QIIME2 runs to enhance meta-analyses. If I am not mistaken, this is the corner-stone of this study, so my review is based on that. Running Tourmaline is straightforward and its documentation is exceptional. The video tutorial and the GitHub wiki (https://github.com/aomlomics/tourmaline/wiki) allow non-experienced users to start working on their analysis, and the containerized version of the tool allows an easy-to-go installation in multiple operating systems without extra effort. The extra visual components provide insight in a nice way, and the report returned can provide added value on the runs. However, even if I do share the authors' interest in usability and interoperability, and the tool could have a great impact in the community indeed, Tourmaline currently lacks any substantial features to be considered as a stand-alone software tool. In addition, there are several issues that I believe need to be addressed (see the following list).
      Major issues
      Major Issue #1: The authors claim that "this lack of automation and standardization [in tracking input and output file names, parameters, and commands in QIIME2] is inefficient and creates barriers to meta-analysis and sharing of results". Therefore, what Tourmaline, and thus the manuscript, needs to demonstrate is that meta-analyses are now feasible to a greater extent, thanks to the Tourmaline wrapper.
      Major Issue #2: Assuming that enhancing meta-analyses is the main contribution of Tourmaline, it is fundamental to consider the minimum information about a marker gene sequence (MIMARKS) standard of the Genomic Standards Consortium (GSC). Rather than just mentioning MIMARKS, Tourmaline needs to explore ways to exploit such standards, e.g. by adding MIMARKS columns in the config file.
      Major Issue #3: As QIIME2 has been developed on the basis of a plugin architecture, it would be highly recommended that such an application could be provided as a plugin too, joining the corresponding QIIME2 library (https://library.qiime2.org/plugins/).
      Major Issue #4: With respect to the structure of the manuscript, it is my belief that there are sections that should be omitted. Tutorials and "how to" sections are extremely valuable, but it would be better to provide them either as supplementary material or through repositories, e.g. GitHub wiki, GitHub pages, etc., rather than in the main manuscript. The wiki page on Tourmaline's GitHub repository is rather informative. An alternative might be merging the "Overview" along with the "Snakefile", "Config file", "Input files" and "Run the workflow" sections, to describe "The Tourmaline workflow" architecture in a less verbose way, highlighting the role of the "Snakefile" and the "config.yml" files and the architecture that binds them together. The "Documentation", "Installation" and "Cloning" subsections could/should be omitted too.
      Major Issue #5: The test dataset does not allow the validation of Tourmaline in meta-analyses. It is important to have a testbed dataset to demonstrate "how to run", but a use case of an actual meta-analysis is required to demonstrate how different analyses can be combined in the framework of Tourmaline and provide further insight beyond that of the initial ones.
      Major Issue #7: No license has been included in the "Availability of supporting source code" section. On the Tourmaline GitHub repo a license (https://github.com/aomlomics/tourmaline#license) is mentioned, yet GigaScience asks for an appropriate Open Source Initiative compliant license (https://opensource.org/licenses/category). In addition, I tried to find whether the QIIME2 license is mentioned in a Tourmaline Docker container and I could not; if I am not mistaken, that is required by the QIIME2 license (https://github.com/qiime2/qiime2/blob/master/LICENSE). My apologies again in case of any misapprehension.
      Major Issue #8: Parameter optimization is indeed one of the greatest challenges in metabarcoding bioinformatics analyses. However, it is not clear to me how, by keeping the exact same names in your output files, you will be able to compare the results of the different runs.
      Major Issue #9: I realise that the authors provide Figures 2 and 3 in a complementary way, presenting the visual component returned after each step. However, having a figure with 16 screenshots makes it hard for the reader to realize what is coming from QIIME2 and what from Tourmaline, but most importantly it does not highlight the added value that Tourmaline provides to such an analysis. It is my belief that Figure 2 could remain as is, while Figure 3 should focus on the output components that are not provided by QIIME2 routines, but by Tourmaline functionalities. In the case of a meta-analysis, this figure should highlight all the added value that using QIIME2 through the Tourmaline wrapper would provide.
      Minor issues
      Minor Issue #1: Please rephrase the Findings section in the abstract, so that it is clear that Tourmaline invokes QIIME2 routines to implement taxonomy assignment, perform analysis, etc. It is required to state clearly what QIIME2 does and what the extra features of Tourmaline are throughout the manuscript.
      Minor Issue #2: The conclusion you mention in the abstract is not in line with the scope of Tourmaline that was described earlier. Tourmaline does not accelerate the performance of QIIME2 routines. Its aim, as mentioned earlier, is to enhance meta-analysis and sharing of results.
      Minor Issue #3: Terms such as "meta-analysis", "reproducibility" and "metadata" could be added.
      Minor Issue #4: In the line "Information gained... resource management" it would be nice to add references for the value of the method in each of the various fields mentioned.
      Minor Issue #5: Usually, (shotgun) metagenome analyses are used to measure diversity in microbiomes, meaning the functional, genomic diversity; the term microbiome has been widely used as the collection of genomes from all microbial taxa present in a sample. It would be better to rephrase this as "popular method of measuring taxonomic/microbial diversity of host microbiome or in environmental samples".
      Minor Issue #6: As an overall comment, long sentences make the manuscript hard to read. In this case: "PCR primers have been used to generate amplicons of the bacterial 16S rRNA gene in studies of human and animal microbiota [..] among others." should be split.
      Minor Issue #7: "other environmental surveys": please explain.
      Minor Issue #8: It is not clear to me how the study of Prodan et al. (2020) is related to the standardization of amplicon data analysis.
      Minor Issue #9: The authors highlight that the standard directory structure enhances data exploration and parameter optimization. A use case to demonstrate this main feature of Tourmaline would be of high value.
      Minor Issue #10: "The purpose of performing amplicon sequencing or metabarcoding is to reveal patterns of diversity in biological systems." That is not the only case; please rephrase.
      Sincerely, Haris Zafeiropoulos

      Re-review:

      Most of my initial comments have been addressed to some extent. However, it is my belief that there is a major contradiction in this manuscript. As described in the introduction, Tourmaline is supposed to address challenges that make meta-analysis hard for metabarcoding studies: "; this lack of automation and standardization is inefficient and creates barriers to meta-analysis and sharing of results". As the authors highlight in their response, there are 5 points that set Tourmaline apart from other amplicon workflows. However, only one of them ("3- Snakemake features") is related (to some extent) to the challenge described. The rest are exceptional ways to make it easier for users to run an analysis, but they have no direct link to enabling or supporting meta-analysis. Therefore, it is my belief that the Introduction section should be revised to better present the actual highlights of Tourmaline, or further features (some of them described in my initial review) need to be added to support meta-analysis.

      Other Issues

      Even though the authors recognize the impact of metadata standards, they do not mention anything in their manuscript about them and their potential. I was not able to figure out how they "have made the metadata that comes with Tourmaline fully MIMARKS-compliant." If this software is focused on meta-analysis, I would strongly suggest investing more effort in describing how these could benefit the community and the Tourmaline users.

      In the parallelization section that was added, it is fundamental to mention that this is possible thanks to the QIIME2 implementation. Snakemake works as an interface allowing Tourmaline to support the options of QIIME2. If QIIME2 had no option for running on multiple threads, then Tourmaline would not inherit such a feature. The same applies to the merging step in the meta-analysis case. All QIIME2 commands used need to be clearly marked as QIIME2 commands performed within the Tourmaline workflow; otherwise it can be thought that the feature was developed by the Tourmaline team.

    2. Abstract

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac066), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Anna Heintz-Buschart

      Reviewer Comments to Author: Thompson et al. present a workflow for amplicon sequencing analysis, wrapping commands from the commonly used QIIME2 package in the commonly used workflow manager Snakemake. The manuscript is clearly structured and contains figures of appropriate quality. Building and testing user-friendly workflows that facilitate the use of existing software is an important task for the research community. The chosen existing software (namely Snakemake as workflow manager and QIIME2's calls to DADA2 and deblur, as well as QIIME2's visualisations) is trusted and often used by the community. Next to the manuscript, I have inspected the GitHub/wiki page of the proposed workflow and tested it on the provided test data as well as an independent data set. I have run into some issues, which I will put forward below, together with some comments on the content of and omissions from the manuscript.

      Manuscript: 1) Overall, the manuscript reads more like a manual or tutorial than a methods description. The point-by-point description of the outputs may be a bit lengthy. 2) The manuscript is missing information on runtimes and hardware requirements. This is in particular a pity because the workflow does not make use of parallelisation of the called tools. It might be pretty slow on large datasets? 3) There is also no justification for the choices of the analyses that are done and the defaults that were chosen. 4) In the introduction, other published amplicon sequencing workflows are cited and dismissed as not all well documented. Other than maybe not being so well documented, there are differences in scope between these workflows and the one described here. It would be very helpful for readers to be informed on how the described workflow is set apart from those workflows, and also from QIIME2 (all of the images in Figure 3, for example, are QIIME2's visualisation work and not part of the workflow's report). Finally, tagseq, which also wraps QIIME2 commands in Snakemake, is not mentioned. From my point of view, the workflow still requires the user to do quite a lot of data setup in the command-line environment and requires knowledge of QIIME2, while resolving relatively little by wrapping the commands in Snakemake (also see my next point). Clearly, it would be helpful to discuss what existing problem the workflow overcomes that the others (and QIIME2) don't. 5) In the same paragraph of the introduction, it is mentioned that the workflow might evolve with QIIME2. However, it makes use of only a small part of QIIME2's options/commands - is there a plan to widen the scope? How will continuous support be done? Is there a plan to integrate the workflow into the QIIME2 software ecosystem? Similarly, the workflow is not using Snakemake to its full potential: e.g. it requires several manual installation steps instead of making use of Snakemake's conda integration, and it doesn't make use of Snakemake's reporting ability, which might be interesting together with QIIME2's data provenance. So, is there a plan to improve this? Also for developing it towards better usability on (cloud) cluster infrastructures?

      Test runs: What worked: Overall, I could install the software on a Linux machine by following the description on the GitHub webpage. The test run worked as expected. The data was accessible and I could visualise it using the QIIME2 online viewer. I could run the workflow on an unrelated dataset.

      What could be improved:

      a) The setup of the input was a bit annoying, because the names and paths to the inputs and outputs need to be set in various places. The fact that existing and non-existing inputs have to be defined in the config confused me at first. The error messages that ensued from not doing this right were uninformative (these cases could be caught by the Snakefile with or without the help of a schema). b) While the workflow is very well documented, the settings for the individual denoising / taxonomy steps are not. The links to the QIIME2 documentation don't point to the current version.

      What didn't work: Running a small dataset - the workflow expects to be able to do statistical tests with groups and replicates. However, only a late step checks whether the dataset is suitable, so there's a failure after considerable running time, which is annoying. While this kind of analysis may be the most common application, it's not the only one. It would be good if those parts of the workflow that require certain dataset structures could be switched off.

      minor: i) As a very irregular user of QIIME2, I find the QIIME2 jargon difficult to understand (e.g. artefacts and artefact equivalents, manifest, and the QIIME2 names of the DADA2 and deblur steps, Emperor plot...). It would be better if these were defined (and maybe not all discussed in detail). ii) Personally, I would like to have a primer-removal step in the workflow. But that's a design decision that can be discussed. iii) "the fungal internal transcribed spacer (ITS) of the rRNA gene (Abarenkov et al. 2010)" - the internal transcribed spacer is not within an rRNA gene; the different ITS regions are found between rRNA genes.

    1. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.68), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Zhizhi Wang

      Is there sufficient data validation and statistical analyses of data quality? No.

      Comments and Suggestions for Authors: In this manuscript, Schultz et al. report the sex-biased transcriptomes of E. suzannae, whose reproductive system is manipulated by the bacterial endosymbiont Cardinium. Beyond the de novo assembly, this paper also aimed to annotate sex determination genes and venom proteins to better understand the biology of E. suzannae. However, there are several issues with this manuscript:

      General comments: It seems that the theme of the background is the bacterial endosymbiont and its function in host reproduction, while the data annotation is not so relevant. Importantly, there is also no further information, e.g. gene lists, about the annotated sex determination genes and venom proteins in the manuscript. The sex determination systems of hymenopteran insects are diverse and complex, while the mechanisms of two species, Apis mellifera and Nasonia vitripennis, are well characterized. The authors should also point out whether there are sex determination homologs shared between E. suzannae and these two model species, such as csd and wom. For the annotation of venom proteins, the authors should note that predicted venom proteins are not reliable without venom gland transcriptome or venom proteome data.

      As stated, the transcriptome data have been published elsewhere focusing on the expression profile of Cardinium, it would be interesting to show potential endosymbiont response genes or pathways in E. suzannae.

      My other concern is that the number of E. suzannae coding sequences is twice that of E. formosa, which leads me to doubt the purity of the assembled transcriptome. The authors use a mapping-and-removal approach to filter contaminant reads from several endosymbionts of E. suzannae and its host Bemisia tabaci, but one cannot exclude the possibility that other foreign contaminants are present in the raw data. Instead, foreign contaminants can be detected, and optionally removed, using short-read taxonomic classification software.
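      The mapping-and-removal filtering discussed here can be illustrated with a minimal sketch. Note this is a hypothetical example, not the authors' actual pipeline: the record format (read ID, contaminant target, percent identity) and the 94% threshold (taken from the methods as quoted in the second review below) are assumptions for illustration.

      ```python
      # Hypothetical sketch of a mapping-and-removal contaminant filter:
      # reads whose best alignment to a contaminant genome meets or exceeds
      # an identity threshold are discarded; everything else is kept.
      # Record format (read_id, target, percent_identity) is assumed.

      def filter_contaminants(alignments, read_ids, threshold=94.0):
          """Return read IDs with no contaminant hit at/above `threshold` % identity."""
          contaminated = {read for read, _target, ident in alignments
                          if ident >= threshold}
          return [r for r in read_ids if r not in contaminated]

      reads = ["r1", "r2", "r3", "r4"]
      hits = [("r1", "Cardinium", 99.2),   # strong contaminant hit -> drop
              ("r2", "Cardinium", 80.0),   # weak hit, below threshold -> keep
              ("r3", "B_tabaci", 95.5)]    # host hit above threshold -> drop
      print(filter_contaminants(hits, reads))  # -> ['r2', 'r4']
      ```

      A short-read taxonomic classifier, as the reviewer suggests, would replace the alignment step entirely, assigning each read to a taxon and letting unwanted lineages be dropped without a curated list of contaminant genomes.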

      Minor:
      Line 41: their use in biological control of...
      Line 50: B. tabaci
      Line 51: …or other aphelinid parasitoids…
      Line 54: ... they can directly damage plants by feeding…
      Line 84: ..lifestyle...
      Line 122: …mapped to the male…
      Line 187: ….the public version….

      Major Revisions.

      Re-review: The authors have addressed most of my concerns, so I would like to recommend the paper for publication.

      Reviewer 2. Shaoli Wang

      Is there sufficient data validation and statistical analyses of data quality?

      Schultz et al. used previously available RNA-seq data from E. suzannae cultures as the data source. They filtered possible sequence reads of symbionts and the host whitefly with a mapping-and-removal approach to eliminate known contaminants, assembled the remaining reads into unigenes as the E. suzannae transcriptome, and annotated these unigenes. I have no more constructive comments on this paper, just some suggestions below to improve this MS, even though the authors wrote the MS very carefully. 1. Line 33: it would be better to add some words about other results from the annotation and transcriptome comparison parts after the sentence "Benchmarking Single-Copy Orthologs (BUSCO) results indicate both assemblies are highly complete". 2. Line 123: "Reads that did not map to any of these bacterial genomes with greater than 94% identity……" - was gradient screening used to arrive at this 94% identity? If yes, add some words to explain why this value was selected. The same question applies to the 97% in line 124. 3. Line 194: summarize a table showing annotation results for genes related to sex determination and venom proteins. 4. Line 246: "Quality control and data validation" should be moved in front of the annotation part. 5. The formats of the references cited in the MS differ and should be revised according to the requirements of the journal.

      Even though the data presented are not very informative, there is no any obvious flaw throughout the MS.

    1. ABSTRACT

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.67), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Alison Gould

      Is there sufficient data validation and statistical analyses of data quality? Not my area of expertise.

      Other comments:

      This was a very clear and well-written manuscript presenting a whole-genome assembly for the giant trevally. This will serve as an important resource for future researchers interested in this and other closely related species of fish. I only have a few minor suggestions but overall found the paper to be of high quality. - It would be helpful to include the estimated genome size and BUSCO score in the abstract. - Include the species name on the x-axes of each column in Fig 5. - Several of the tables (Table 2 and Table 6, for example) don't seem necessary in the main text, as they are not really discussed in the paper and could be included as supporting material.

      Reviewer 2. Yue Song

      Are all data available and do they match the descriptions in the paper?

      Yes. The description of the data in the article is generally correct, but there are some inconsistencies. E.g., in line 186, the authors used single-copy orthologs from the Actinopterygii set of OrthoDB (v10) to assess assembly completeness, but used the Vertebrata set for the comparison with other fish genomes. All the other species are also fish genomes, so why not use the same database (e.g. Actinopterygii)?

      Are the data and metadata consistent with relevant minimum information or reporting standards?

      Yes. I think it would be best to provide relevant information about the protein-coding genes.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. (1) It would be better to provide detailed software parameters, and the description of how contigs were assembled into scaffolds is not clear enough. (2) The method of how single-copy orthologs were identified is not clearly described.

      Is there sufficient data validation and statistical analyses of data quality?

      No. I don't think it's enough to rely only on single-copy orthologs and/or synteny blocks to assess genome quality; maybe it would be better to add some other measures, e.g. read mapping?

      Other comments:

      (1) In line 223, the authors only provide the number of scaffolds in the final assembly version, but how many chromosomes were assembled, and what proportion of scaffolds or contigs were placed into chromosomes? This information is not found in the MS. Note that a genome for this genus is already published in NCBI, but only at the contig level; if using the Hi-C data could provide a chromosome-level one, I think it would be more useful. (2) In line 252, I noticed there was no mention of gene sets, especially protein-coding genes; how many coding genes are there in this genome? (3) In Figure 4, many cross-linking intensities are not obvious, which may be related to the sequencing depth of the Hi-C data; I can't figure out from this diagram how many chromosomes there are in the final assembly. (4) Some minor bugs: in the figure captions, for figure 6 the authors used the ray-finned fish set, right? I think there is a mistake here, because in line 186 the authors mentioned the Vertebrata set.

      Re-review: The author has responded to the corresponding questions, recommended accepting the manuscript.

  7. Aug 2022
    1. Background

      Reviewer 2. Alexandre R. Paschoal

      The authors present "Standardized genome-wide function prediction enables comparative functional genomics: a new application area for Gene Ontologies in plants". It seems to be an application of the GOMAP pipeline to 14 species, which sounds interesting. However, it lacks polish; that is why I list some suggestions to help the authors improve it.

      Major:

      1) The state of the art in this topic is not clear (it is not detailed in the introduction, which is a serious gap in this work), including similar or identical tools/methods for the same purpose. Please keep this in mind and compare against tools from the literature that address the same issue. I am not an expert in the GO topic, but as far as I know there is Blast2GO, and I found others: https://www.mdpi.com/1999-5903/13/7/172 https://academic.oup.com/nar/article/49/D1/D394/6027812?login=true https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13285 https://david.ncifcrf.gov/ Blast2GO etc.

      1.2) Please compare against these tools, or those that make sense; if not, explain why not. Please clarify and make this clear. PS: Rewrite the introduction to address these points.

      2) The idea is to do a large-scale analysis of several plant species (here, 14) using the GOMAP pipeline, is it?

      2.1) Please make clear in the abstract and introduction how many contributions there are, and what each of them is.

      2.2) We now have more than 70 plant species in Ensembl Plants, for example. Why not use as many of them as possible for a real large-scale analysis?

      3) If I understood correctly, the GitHub repository (https://github.com/Dill-PICL/GOMAP-Paper-2019.1) and https://dill-picl.org/projects/gomap/gomap-datasets/ contain all the data and results from this report, is that right?

      3.1) Both (and mainly GitHub) are far from user-friendly. It seems the authors put the information there and that is it. Please be clearer about what this information is, how and why it is there, and how to use it. Also, are all commands used clear (maybe provide a manual on what you have done from a technical perspective)?

      3.2) Is there any visualization table, I mean an easy output produced by this analysis? If I want to use these data for my new genome, etc. - a real case - how do I use them? How do I compare? Where? Sequences? GO terms, etc.? Please clarify this. PS: Imagine a biologist who wants to use your approach.

      4) It is not clear to me why Ensembl Plants was not also included in this report's analysis, only Phytozome and Gramene. Please include and compare all these databases.

      5) The authors mention that they will make the final results available in Zenodo after this revision. Please make all data, FASTA files, trees, etc. available.

      Minor:
      - How often do you expect to update this tool? Please make this point clear.
      - Could you clarify all the differences between your work and Zhu et al.'s work?
      - Did you expect to have significance (bootstrap) values on the tree figures?
      - On pages 5/6 there are zero blank lines in section D, and there are "??" references for figures and citations; please correct this issue.

    2. Abstract

      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.65), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Leonore Reiser

      Reviewer Comments to Author: The authors present a detailed assessment and creative analysis of computationally predicted functional annotations for 18 plant genomes. First they applied their GOMAP pipeline to annotate the genomes, and compared those outputs against a 'Gold Standard' of Gramene annotations (minus those inferred from Electronic Annotation) and electronically inferred annotations from Gramene or Phytozome. They then used the GOMAP annotation set in an interesting way to perform a sort of phylogenetic reconstruction.

      First, I applaud the authors for presenting a manuscript that is a paragon of data FAIRness. The data is findable, accessible, well annotated with metadata and certainly looks reusable (what a pleasure to have the option to download as a CSV.) Bravo! Brava!

      The idea of recapitulating phylogenetic relationships based on GO annotations is an interesting one, and while the authors do a good job of addressing some of the caveats and limitations of their analysis, I do wonder if there are other things they may want to consider. For example, many plant annotations are based on Arabidopsis experimental annotations, which means that some aspects of plant biology that are unique to specific clades may not be well represented in the ontology, because those processes have not been annotated or the terms may not even exist in the ontologies yet. Also, at least for Arabidopsis, many of the included annotations come from PAINT, which is a phylogeny-based annotation method (the IBA annotations), so transferring IBA annotations from Arabidopsis to other plant species might add a certain phylogenetic flavor to the GOMAP results.
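      The reconstruction being discussed here, recovering phylogeny from annotation sets, can be illustrated with a toy distance built from shared GO terms. This is only a sketch of the general idea; the species names and term sets are invented, and the paper's actual method may differ.

      ```python
      # Toy sketch: pairwise Jaccard distance between species' GO term sets,
      # the kind of matrix a distance-based tree method could consume.
      # Species names and term sets are invented for illustration.
      terms = {
          "speciesA": {"GO:1", "GO:2", "GO:3"},
          "speciesB": {"GO:2", "GO:3", "GO:4"},
          "speciesC": {"GO:5"},
      }

      def jaccard_distance(a, b):
          """1 - |intersection| / |union|; 0 for two empty sets."""
          union = a | b
          return 1.0 - len(a & b) / len(union) if union else 0.0

      dist = {(x, y): jaccard_distance(terms[x], terms[y])
              for x in terms for y in terms if x < y}
      print(dist[("speciesA", "speciesB")])  # -> 0.5
      ```

      The reviewer's point about shared Arabidopsis-derived (IBA) annotations applies directly: any term transferred to many species at once shrinks these pairwise distances and could mimic a phylogenetic signal.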

      Specific comments on the text. 1. Please clarify what the sets of terms used were and what is meant by ancestors. The MS states that granular terms were mapped to higher-order terms used for comparison - how were those ancestor terms selected? Is there a list of these common (S) terms that were used to generate the trees available somewhere? If so, that subset should be made available (or maybe it is, but I could not tell). I think this selection of terms for use in the analysis is really important, but I could not find any data for this - if the data are available, it is not obvious.

      2. Annotations with modifiers that were removed: can you clarify what is meant by that? Are those 'NOT' annotations?

      3. One expects a high level of granularity for manually curated gene functions (that is, very specific terms); how are annotations harmonized across the different prediction methods used for GOMAP, since presumably some of the methods employed provide less specificity in their annotations?

      4. For the comparison, was there any manual inspection of the presence or absence of terms? Was there any correspondence with anything known biologically? That is, for certain term character states, were any unexpected or inconsistent with known biology?

      5. The phylogenetic analysis seems to factor in all 3 GO aspects; have the authors compared results using just a single aspect (process, function, or component)? Process is notoriously noisy, and its annotations can be subject to a lot of interpretation. It is also probably the most incomplete data set.
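      The mapping of granular terms to ancestors asked about in comment 1 above is conventionally the transitive closure over the GO graph's is_a edges (an annotation to a term implies all of its ancestors). A toy sketch, with invented term IDs rather than the authors' actual term set:

      ```python
      # Toy sketch of ancestor propagation over is_a edges in a GO-like DAG.
      # Term IDs and edges are invented for illustration.
      IS_A = {
          "GO:leaf":  ["GO:mid"],
          "GO:mid":   ["GO:root"],
          "GO:other": ["GO:root"],
          "GO:root":  [],
      }

      def ancestors(term, graph=IS_A):
          """All ancestors of `term`, including itself (reflexive closure)."""
          seen, stack = set(), [term]
          while stack:
              t = stack.pop()
              if t not in seen:
                  seen.add(t)
                  stack.extend(graph.get(t, []))
          return seen

      # An annotation to the granular term implies all its ancestors:
      print(sorted(ancestors("GO:leaf")))  # -> ['GO:leaf', 'GO:mid', 'GO:root']
      ```

      The reviewer's question is which slice of this closure (the common "S" terms) was retained for comparison, since that choice determines how comparable the species' annotation sets are.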

      Specific comments on the figures. 1. Panel b: "Gramene -IEA" is confusing here in the figure and when described elsewhere. I suggest that in the figure, and in the text, less confusing nomenclature be used, such as "Gramene (IEA only)" and "Gramene (no IEA)" for the gold standard. I read "Gramene-IEA" as Gramene minus IEA annotations, not Gramene's IEA annotations only.

      2. Supplementary Figure S1: I wonder if there is a more effective way to visualize these data. I think there is a lot of interesting information here, but it is hard to follow, especially the third graph. Another improvement to readability would be to make the text font darker (I am not sure why it is light grey).

    1. Abstract

      Reviewer 3. Mary Ann Tuli

      This manuscript describes the reannotation of the Heterorhabditis bacteriophora, an entomopathogenic nematode widely used to control insect pests in horticulture. A previous study was reported to encode an unusually high proportion of unique proteins and a paucity of secreted proteins compared to other related nematodes. This study asked whether these unusual characteristics were biological or methodological in origin.

      The work was carried out in the spirit of data improvement, rather than a rebuttal, and while it is not a genome paper as such, it does reanalyse a genome using new data and different tools. It is very suited to the GigaScience philosophy and readership due to the repeatable side and open access component.

      I have checked that the Methods described and the Resources used meet the minimum standards reporting check list. I note that data has been submitted to the publicly available repositories (SRA and INSDC) but that the data is not yet available, thus it cannot be reviewed at the moment.

      I have looked at the files in https://github.com/DRL/mclean2017. There are 9 supplementary files of annotation, analyses, and annotation pipelines, which look thorough and complete. The repository also includes splice-site files. The manuscript states that all custom scripts developed for this manuscript are available in this repository, but I see only a single script in the /analysis folder. Is this right?

      The gene prediction and protein orthology analyses and discussion were thorough and fully explained, as well as future work (expanded transcriptome and comparative data work) described.

      My recommendation is that this manuscript be published as a research article.

      I have some minor typos and suggestions which are probably more pertinent for a copy editor to spot but include them here since I noted them down.

      105 BUSCO; see below). Another unusual feature of the H. bacteriophora gene set was the -> 105 BUSCO; see Table 2). Another unusual feature of the H. bacteriophora gene set was the

      107 Most nematode (and other metazoan) genomes have low proportions of non-canonical introns (less than 1%), [Reference needed]

      137 from the new Illumina data and sequence similarity from the NCBI nucleotide database (nt) -> 137 from the new Illumina data and sequence similarity from the NCBI nucleotide (nt) database

      371 The assembly scaffolds were aligned to the NCBI nucleotide (nt) database, -> 371 The assembly scaffolds were aligned to the NCBI nt database,

      397 version of the assembly. Hard masking was for known Nematoda repeats from the -> 397 version of the assembly. The assembly was hard-masked for known Nematoda repeats from the….?

      [Hard masked / hard-masked Soft masked / soft-masked check for consistent use]

      406 bacteriophora annotation was identified from the general feature format file, and then-> 406 bacteriophora annotation was identified from the general feature format (GFF) file, and then

      407 selected from the protein FASTA files. The general feature format file (GFF) for -> 407 selected from the protein FASTA files. The GFF file for

      415 from the general feature format file as exon features -> 415 from the GFF file as exon features

      423 bacteriophora. Intronic features were added to GFF3 [Explain what GFF3 is]

      [Check consistent use of GFF (line 415) / GFF file / GFF format (744, 749) Should be GFF file]

      424 gff3 -sort -tidy -retainids -fixregionboundaries -addintrons') and and splice sites were -> 424 gff3 -sort -tidy -retainids -fixregionboundaries -addintrons') and splice sites were

      445 the 23 Clade V nematodes were downloaded from WBPS8 (available at: 446 http://parasite.wormbase.org/index.html) [Suggest link to ftp://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS8/)

      358 Parasite (WBPS8) [34]. [This is the first mention of WormBase Parasite so should include the home page rather than in line 446]

      478 using MAFFT v7.267 (RRID:SCR_011811) [50], and the alignments trimmed with NOISY [Reference needed for NOISY.]

      480 v8.1.20 (RRID:SCR_006086) [51] with a PROTGAMMAGTR [Reference needed for PROTGAMMAGTR]

    1. ABSTRACT

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.66), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Linzhou Li

      Are the data and metadata consistent with relevant minimum information or reporting standards? No. Geographic location (country and/or sea, region, latitude and longitude) is missing, as well as environmental context.

      Is there sufficient data validation and statistical analyses of data quality? No. The genome size and gene number of Dendrobium hybrid cultivar ‘Emma White’ differ greatly from the published Dendrobium genomes (e.g. Zhang et. al Scientific Reports 2016, Zhang et. al Horticulture Research 2021, Han et. al Genome Biology and Evolution 2020...). Specifically, the authors assembled a smaller genome and predicted a larger number of genes compared with the previous study. Therefore, I strongly suspect that the assembled genome is incomplete and fragmented, resulting in more fragmental genes.

      Is the validation suitable for this type of data? No. There's not enough raw data (~24Gb) to assemble a 600Mb (or ~1.2Gb from the previous study) genome. I highly recommend the authors get more raw data and do a genome survey.

      Additional Comments: The complete BUSCOs account for only 16.6%, which is quite low. The authors explain that the large loss of BUSCOs is due to the mutant genome having many specific sequences, but these genes are very conserved in plants and should not be easily mutated.

      Reviewer 2. Stephanie Chen

      Is the language of sufficient quality? No. Most of the manuscript is written to a sufficient quality, but certain parts require revision to improve readability. Please see detailed comments on the Word document.

      Are all data available and do they match the descriptions in the paper? No. The SRA link is coming up as a permission error, but I assume it will be released once the paper is available. There is no information on where to access the annotation file.

      Is the data acquisition clear, complete and methodologically sound? No. The contiguity (635,396 contigs, N50 of 1,620 bp) and completeness (16.60%) of the genome are quite low, and this may limit its downstream uses. It would be good to incorporate some long reads or increased sequencing coverage to improve the genome. A number of chromosome-level Dendrobium genomes are available (e.g. D. chrysotoxum and D. huoshanense), and scaffolding against these could be attempted to improve the assembly. Scaffolding from existing assemblies may be a good option if generating more sequencing reads is not feasible.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. Some details on the DNA extraction and library preparation steps are missing. In the methods section, there are also missing details for multiple programs in terms of the version and parameters (e.g. BUSCO version and database used, QUAST version, AUGUSTUS version, details on adapter removal and trimming). It is mentioned 'similarity score and description of each gene was filtered out using in-house pipeline'. The script and details of the pipeline are not provided; please add a reference or details in the manuscript e.g. link to GitHub repository.

      Is there sufficient data validation and statistical analyses of data quality? No. The reporting and interpretation of BUSCO results ('BUSCO version 5.2.2 analysis reveals 913 (56.57%) single-copy orthologs doesn’t match with any data bases indicates the unique and possible uncharacterized sequences in mutant genome of Dendrobium hybrid cultivar') needs to be revisited. There needs to be additional validation of the gene annotation (e.g. BUSCO, comparison with existing Dendrobium annotations) and also some validation of the genome size (e.g. GenomeScope and comparison with reported flow cytometry measures).
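      The BUSCO percentages whose reporting is questioned here come directly from BUSCO's short-summary results line. As a minimal illustration, a parser for that line format; the summary string below is a made-up example in BUSCO v5's standard notation, not the actual assembly's result.

      ```python
      import re

      # Minimal sketch: parse a BUSCO v5 short-summary results line of the
      # form "C:16.6%[S:16.0%,D:0.6%],F:10.2%,M:73.2%,n:1614" into a dict.
      # The example string is illustrative, not the reported result.

      def parse_busco_line(line):
          fields = dict(re.findall(r"([CSDFM]):([\d.]+)%", line))
          n = re.search(r"n:(\d+)", line)
          out = {k: float(v) for k, v in fields.items()}
          out["n"] = int(n.group(1)) if n else None
          return out

      summary = "C:16.6%[S:16.0%,D:0.6%],F:10.2%,M:73.2%,n:1614"
      res = parse_busco_line(summary)
      print(res["C"], res["M"], res["n"])  # -> 16.6 73.2 1614
      ```

      Reporting the complete (C), fragmented (F), and missing (M) fractions together, rather than the single-copy count alone, is what the reviewer is asking the authors to do.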

      Is the validation suitable for this type of data? Yes. The type of validation in the manuscript (BUSCO) is suitable to assess genome completeness, but reporting and discussion of the results needs to be revised. Some additional validation is also needed (see box above).

      Additional Comments: In this manuscript, the authors provide a draft genome of a gamma-ray-induced mutant of a Dendrobium hybrid cultivar using Illumina sequencing that will assist with future breeding efforts and studies. However, I am not convinced of the genome's usefulness in its current form. Some methods need to be described in more detail to be reproducible. Revisions will also help improve the readability of the manuscript. As page and line numbers are not provided in the manuscript, please find additional comments added directly to the attached manuscript file.

      https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvMzA2L1Jldmlld19TQ185Njc2XzA1MDIyMl9HaWdhYnl0ZV9HYW1tYSBXR1MgZGF0YW5vdGUgKDEpLmRvY3g~

      Re-review: Thank you to the authors for addressing the previous comments on the manuscript. I generally find the revisions satisfactory, although I have some follow-up comments. The addition of details on the genetic origin of the Dendrobium ‘Emma White’ hybrid cultivar and the requested details on bioinformatic tool versions/parameters have strengthened the manuscript. The authors have not followed up on the suggestion to improve the genome via scaffolding, but provide an explanation that existing chromosome-level assemblies/sequencing data of Dendrobium species are not suitable as they are not related to the hybrid cultivar the authors studied, implying that they are highly diverged and scaffolding would not meaningfully improve the genome. Given this information, I think the Dendrobium ‘Emma White’ hybrid cultivar genome can still be useful for orchid breeding efforts despite low contiguity and completeness.

      However, I do not agree with the authors' point of: “Third, we used low coverage genome analysis with short reads of gamma mutant Dendrobium hybrid cultivar, as it was the first case study and obtained SRA, genome assembly and TSA accessions from NCBI. The genome assemblies of Dendrobium species from earlier studies used both long reads and short reads in their study. Construction of scaffolding from such database species using our contigs may be skewed and shall give unreliable data based on above points mentioned. Hence, I opinioned that suggestion given by Reviwer 2 on scaffolding suggestion may not be correct point.” Even if different sequencing technologies were used in comparison to the Emma White genome, the availability of a contiguous, closely related reference genome would still be useful for reference-guided scaffolding of the draft genome as well as for comparative analyses.

      Lines 107-109: Reorder the sentence to make the order of the steps clear, i.e. adapter removal and quality filtering before assembly with MaSuRCA.
Also, on the MaSuRCA GitHub (https://github.com/alekseyzimin/masurca), it says “Avoid using third party tools to pre-process the Illumina data before providing it to MaSuRCA, unless you are absolutely sure you know exactly what the preprocessing tool does. Do not do any trimming, cleaning or error correction. This will likely deteriorate the assembly.” Did the authors find that the pre-processing meaningfully improved the quality of the assembly, compared to if the raw reads were input straight into the assembler? Please justify the preprocessing of reads.

      I suggest rewording lines 137-139, “BUSCO version 5.2.2 analysis reveals 913 (56.57%) single-copy orthologs doesn’t match with any data bases indicates the impact from evolutionary development of hybrid cultivars and influence of gamma radiation. It is because, the genome of ‘Emma White’ hybrid cultivar of Dendrobium derived from five unique different species is complex genome and continuously hybridized repeatedly 11 times over a period of 68 years with selection process for economic trait improvement”, to make the explanation clearer and also to include the number and/or percentage of complete BUSCOs. This was flagged in the previous comments, but not fully resolved, and would benefit from revisiting the interpretation of the BUSCO results. There are a large number of missing BUSCOs in your assembly, likely related to low contiguity (as well as the radiation, which is mentioned). Can you discuss if/how this may be a limitation for using this genome in further studies? You suggest that the BUSCOs are not found in the assembly due to many rounds of trait selection and radiation. It is possible that some of the BUSCOs are indeed missing from the particular plant sequenced, but how can you be certain that this is due to the breeding history and radiation applied, as implied in the text, and not low genome contiguity?
Some papers which characterised gamma irradiation-induced mutations in plants (DOIs: 10.1093/jrr/rraa059, 10.1186/s12864-019-6182-3, 10.1534/g3.119.400555) indicate that it is unlikely that as many as 913 BUSCO genes have been affected. Even with stronger doses of radiation than used on the orchid, the number of mutations/genes affected is much lower.

      The genus name needs to be consistently italicised throughout the manuscript.

      Re-re-review: Thank you to the authors for addressing the previous comments on the manuscript. The authors have followed up on the suggestion to scaffold the genome by using the published Dendrobium huoshanense genome to scaffold their draft genome with RagTag. This is an appropriate tool to use and has improved the contiguity of the draft assembly, which is good to see. In the methods, the version of RagTag is missing, as are the parameters used to run the program. Please also specify which RagTag utilities were used (correct, scaffold, patch and/or merge).

      The authors have added genome statistics for two other orchids and the scaffolded assembly in Table 1, but have not added BUSCO results for their scaffolded assembly in Table 2. Also, can the authors comment on whether the low BUSCO values may be related to the fragmented assembly, as brought up in the previous round of review? It will be interesting to see if BUSCO has improved with the scaffolding. BUSCO results for the other two species, D. catenatum and D. huoshanense, would also be a good point of comparison, and this is relatively simple and quick to add. The authors could consider concatenating Table 1 and 2 in this case.

      The draft assembly has improved, and the authors should report numbers for the final version of the assembly presented in the paper (i.e. the scaffolded assembly) in terms of the analyses they have run. In the results and discussion section, it appears some of the statistics (e.g. 96,529 genes, 216,232 SSRs) still refer to the first draft assembly.

      The authors have clarified that raw reads were used as input into MaSuRCA (line 111) and have now included the necessary detail on the input and parameterisation of the program.

      Line 157-159: “Taxonomical analysis of mutant Dendrobium at raw sequence data also revealed limited synteny with its closest Dendrobium catenatum species at below 9% and genetically heterogeneous with outcrossing nature”.
Details of how this analysis was done are missing from the methods. It may be more appropriate to perform the synteny analysis at the genome level and compare the published D. catenatum genome with the scaffolded Dendrobium hybrid genome.

      Editor's comment: Additional Editorial Board assessment and feedback was received during the review process.

    1. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.65), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Takeshi Takeuchi

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. 1) How many high-quality reads/nucleotides were retained after filtering and applied to the Falcon assembler? The authors also need to describe the parameters for Falcon. 2) How did the authors manage the duplicated contigs from different haplotypes in the assembly? 3) In Table 2, stats for "scaffolds" are shown, but there is no description of the scaffolding process.

      Is there sufficient data validation and statistical analyses of data quality? No. 4) In Table 3, the authors should not compare transcriptomes (refs 32, 55, and 34) and gene models (this study). Did the authors in fact produce a transcriptome assembly from the RNA-seq data? If so, please describe the method for the transcriptome assembly. 5) Results of BLAST2GO and InterProScan were not described.

      Additional Comments: The number of gene models (64,636) is much higher than those of other Porites species (30,000-40,000). The number of exons per gene is considerably lower than others. These results indicate that the gene models are fragmented, possibly due to insufficient gene model prediction. This issue needs to be discussed. In the Abstract, the genome size "667 Gbp" should be "667 Mbp." In Table 2 and the main text, the assembly size is 678Mbp. Which is correct?

      Re-review: I appreciate the authors’ effort to address all referee comments. I believe the data will be valuable for the research community.

      Reviewer 2. Jong Bhak

      Additional Comments: Porites astreoides is an important coral species and this reviewer thinks all the major reference construction parameters have shown a high quality assembly. Predicted gene number, 64,636, is a bit too high. This needs to be checked and improved. (This number has been fluctuating. Not critical, though)

  8. Jul 2022
    1. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.64), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Peter Mulhair

      Is the language of sufficient quality?

      This manuscript is clear and concise. However, there are some issues with consistency in species names used throughout the manuscript. First, on line 99 Eubasilissa regina should be italicised. Secondly, I would recommend after the initial use of the full names of the species (Plodia interpunctella and Eubasilissa regina) that these be referred to as P. interpunctella and E. regina in the rest of the text. There is inconsistent use of full species names, shortened species names and genus name alone which may cause confusion. Please read through and correct these inconsistencies throughout the manuscript text.

      Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide

      No. Missing items from the metadata checklist include (1) Coding gene annotations (GFF), Coding gene nucleotide sequences and Coding gene translated sequences (fasta) and (2) Full (not summary) BUSCO results output files (text).

      Is the data acquisition clear, complete and methodologically sound?

      Yes. Is there a specific reason why fifth instar larvae were used for RNA sequencing of silk glands of P. interpunctella? If this stage is biologically important then it may be worth stating why this specific stage is used.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes. However, the code used for Heavy fibroin gene annotation could be made publicly available to enable reproducibility of this analysis (using other species for example or to annotate other repeat rich genes). This could be uploaded to the rest of the relevant code at https://github.com/AshlynPowell/silk-gene-visualization

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Yes. One point worth making is that on Line 161 you state that "The assembly for E. regina is the most contiguous Trichoptera genome assembly to date.". However, there are currently 3 chromosome level assemblies available for Trichoptera on NCBI. I would recommend removing this statement, or changing it by also pointing to these other genomes available.

      Other comments:

      This work was carried out to a very high quality and I am particularly happy to see more high quality genomic and transcriptomic data for these groups of insects. I also think that annotation of the Heavy fibroin genes is of particular importance and relevance to researchers interested in silk evolution and evolution and annotation of repeat rich proteins.

      Recommendation: Minor revision

      Reviewer 2. Reuben W Nowell

      Are all data available and do they match the descriptions in the paper?

      No. I wasn't able to access the data with the FTP link provided.

      Additional comments:

      A very nice piece of work, I have only a few minor comments:
      
      • Line 140: "with the k-mer length set to 1" - do you mean 21?
      • Line 164: great that you provide a link to the GenomeScope html but I recommend to add these kmer plots as additional supplemental figures, they are extremely useful. Just a screenshot of the GenomeScope plot would be fine.
      • Line 164: in relation to the kmer distributions, in fact both plots look a little bit multimodal to me... especially the Eubasilissa, with peaks at 1n (20x), 2n (40x) and 4n (~80x) coverage. This might indicate tetraploidy, which might explain the large increase in genome span and gene number for this species too. You could run OrthoFinder and look at the distribution of OG membership size, for diploid assemblies it peaks at 2, but you might find a peak at 2 and 4 for Eubasilissa if it is tetraploid.
      • Line 167: how many contaminant contigs were identified, and where did they come from?
      • Line 168: the coverage for both species is roughly the same, but the species with the much larger genome is the more contiguous one - any ideas why this is the case?
      • Line 184: maybe this is a silly question, but how do you know they are full-length? Based on the B. mori BAC sequence?
      • Line 192: a unit for molecular weight, Da?
      • Line 224: would be useful to know how many genes are in the Insecta core BUSCO db (i.e., where the 95% comes from).
      • Line 233: is there a possibility that RepeatModeler has also classified the repeat-rich fibroin genes as 'repeats', and so these are masked in the assemblies?
      • Line 243: this is a huge difference in gene number! Why? Is the E. regina assembly actually a diploid assembly? Or ploidy > 2? [See above comment on kmer plots].
      • Line 265: "insects have generally been neglected with respect to genome sequencing efforts" - quite a bold statement and I'm not sure I agree, there has been a huge focus on lepidopteran genomics and much of the early sequencing from initiatives such as Darwin Tree of Life have been on insects (also i5k).
      • Line 457: Table 2: any idea why the P. interpunctella HiFi assembly is ~60 Mb shorter than the two Illumina assemblies?
      • Line 475: Figures 2 and 3: these are nice figures but I don't quite follow what the two coloured panels on the left are showing, specifically, why are there two panels? A bit more clarification in the legend needed perhaps.
      • Line 476: N and C capitalised

      Recommendation: Minor revision

      Reviewer 3. Martin Pippel

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Yes (partly): To make the study fully reproducible the authors need to upload the PacBio HiFi data (e.g. to NCBI). Otherwise the genome assemblies cannot be reproduced with the available raw data in GenBank.

      Any additional comments:

      The manuscript entitled “Long-read HiFi sequencing correctly assembles repetitive heavy fibroin silk genes in new moth and caddisfly genomes” from Kawahara et al. describes the de novo assembly and gene annotation of two silk-producing insect species Plodia interpunctella and Eubasilissa regina. The manuscript is well structured and written. Sequencing data, assemblies and genome annotations are publicly available and can be reused by the scientific community. Both contig assemblies show a very high contiguity and good BUSCO scores. Indeed, several from the 118 P. interpunctella and 53 E. regina contigs show telomere repeat sequence at both ends indicating that those represent full chromosomes. Furthermore, the authors showed that even long repetitive genes such as silk fibroin genes were gapless assembled. I consider the manuscript as a valuable contribution for the scientific community and do only have some minor comments and suggestions:
      

      line 129: which CCS version was used?
      line 140: k-mer length was set to 1? Not 21?
      line 148: Typo: obd10 reference endopterygota. In order to make the BUSCO scores better comparable to other recent Lepidoptera assemblies it would be better to provide the BUSCO scores for P. interpunctella based on the lepidoptera lineage.
      line 158: CCS data should be added to GenBank as well. Usually the raw data (subreads.bam) is lossily converted into fastq files from NCBI, which makes it impossible to reproduce the consensus step with pbCCS, or even the assembly.
      line 159: Both read coverages are quite high, and the heterozygosity rates of 0.7 (Eubasilissa) and 0.36 (Plodia) are high as well. I was wondering if the alternate assemblies were also of a decent quality and if those are published as well?
      line 265: As of today, there are at least 3 other HiFi assemblies available: (GCA_917563855.2, GCA_929108145.1, GCA_917880885.1)
      line 457: Table 2 states that E. regina was assembled into 53 contigs. However the assembly available at NCBI, GCA_022840565.1, has 123 contigs!?

  9. Jun 2022
    1. Background

      Reviewer 2: David Reshef

      General comments --- This manuscript introduces an open-source implementation of two measures of dependence, MICe and TICe, which together provide a combination of both statistical power and equitability for identifying associations in large data sets. The implementation provided by the authors is a valuable contribution to the community that allows for the easy computation of these measures of dependence, and I'd recommend its acceptance after the authors make the minor edits listed below.

      Minor Comments --- A few minor comments that the authors should be made aware of (but that I didn't want to be public given how minor they are):

      1) There are a few small typos to correct (e.g. "coniugate" on Pg. 1, line 31; "expenses" on Pg. 2, line 15).
      2) I would suggest the authors soften the language around the fact that "an implementation of these two measures and of a statistical procedure to test the significance of each association is still missing." The authors who developed MICe and TICe are simply waiting to post their implementation of MICe and TICe at www.exploredata.net along with the official publication of the most recent paper analyzing these measures in the Annals of Applied Statistics (https://www.e-publications.org/ims/submission/AOAS/user/submissionFile/29563?confirm=583655c8). That said, the implementation in this manuscript submitted to GigaScience is still a valuable contribution as it is open-source (the implementation AOAS will post is not) and provides a more comprehensive procedure to test for significance.
      3) On Pg. 1, line 31, "which coniugate computational efficiency with good bias/variance properties" isn't quite accurate. I'd change this to "which combine computational efficiency with superior bias/variance properties".
      4) On Pg. 2, line 5, "has been shown to satisfy the equitability requirement" should be changed to "has been shown to have good equitability" to reflect the fact that equitability is not a binary property, but a continuous one that a measure of dependence can have more or less of.
      5) On Pg. 2, line 6: MIC doesn't actually suffer from lack of power, and this fact has been corrected in the literature, so I would recommend using softer language. It was shown in ref. 12, cited by the authors, that the original perceived bad power of MIC was due to incorrect parameter settings by those who drew that conclusion. When used with appropriate parameters for independence testing, MIC has decent, but not state-of-the-art, power. What is accurate, however, is that MICe and TICe improved upon the power of MIC, and that TICe has state-of-the-art power.
      6) On Pg. 2, second column, line 23, regarding the sentence beginning with "With regards to the number of permutations..." (and elsewhere): the number of permutations necessary to perform for any given analysis scales with the number of tests one must correct for (i.e. the number of variable pairs for which a measure of dependence was computed), as the FDR accuracy is inversely proportional to the number of permutations used to compute it, so I'd be careful about saying that a specific number is generally enough for data of any dimensionality.

    2. Abstract

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giy032 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: Simone Romano

      In this paper the authors describe and analyse a series of tools to find complex associations in large omics data sets. At the core of these tools lies the measure of association Maximal Information Coefficient (MIC), which recently received a lot of interest in the data mining community. Other than presenting the first publicly available implementation of MIC to date, the authors make available the code for a complete pipeline to identify statistically significant associations between the features in a data set. This involves:

      - Computing the Total Information Coefficient (TIC) for each pair of features
      - Computing their p-values using a permutation test with Monte Carlo simulations
      - Selecting the significant pairs using statistical correction for multiple hypotheses
      - Ranking the statistically significant associations according to MIC

      Moreover, the authors analyse the results of their pipeline on synthetic and real data sets. I commend the authors for providing the community with a well-tested implementation of MIC (and its more recent version MIC_e) in various programming languages including C, Matlab, and Python. I also really appreciate publishing a full pipeline to identify associations between features written in Python, which is probably the most popular language in the data science community. Moreover, the paper is well written and the analyses about the effectiveness of these tools are convincing. The paper should be accepted for publication in the GigaScience journal. There has been so much discussion about the merit of MIC in the past years since its publication in 2011. I am honestly impressed by the MIC authors' efforts to shed light on the theoretical and empirical properties of MIC.
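The screen-then-rank procedure summarized in the review (score every variable pair, compute permutation p-values, apply an FDR correction, then rank the survivors) can be sketched in plain Python. This is an illustrative sketch only, not the paper's implementation: the dependence score below is an absolute Pearson correlation standing in for TICe (and for MICe in the ranking step), Benjamini-Hochberg stands in for Storey's FDR procedure, and all function names are hypothetical.

```python
import random
from itertools import combinations

def dep_score(x, y):
    # Stand-in dependence measure: absolute Pearson correlation.
    # The actual pipeline uses TICe for screening and MICe for ranking;
    # a plain correlation keeps this sketch self-contained.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    den = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return abs(sum(a * b for a, b in zip(dx, dy)) / den)

def perm_pvalue(x, y, n_perm=200, rng=None):
    # Monte Carlo permutation test: shuffle y to break any association
    # and count how often the null score reaches the observed score.
    rng = rng or random.Random(0)
    obs = dep_score(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        hits += dep_score(x, y_perm) >= obs
    return (hits + 1) / (n_perm + 1)

def screen_and_rank(data, alpha=0.05, n_perm=200):
    """Screen all variable pairs for significance, then rank survivors.

    data: dict mapping variable name -> list of sample values.
    Uses Benjamini-Hochberg as a stand-in for Storey's FDR procedure."""
    pairs = [(a, b, perm_pvalue(data[a], data[b], n_perm))
             for a, b in combinations(sorted(data), 2)]
    m = len(pairs)
    by_p = sorted(pairs, key=lambda t: t[2])
    cutoff = 0
    for i, (_, _, p) in enumerate(by_p, 1):
        if p <= alpha * i / m:
            cutoff = i  # largest rank passing the BH threshold
    significant = by_p[:cutoff]
    # Rank the significant pairs by the dependence score (MIC's role).
    return sorted(((a, b, dep_score(data[a], data[b]))
                   for a, b, _ in significant), key=lambda t: -t[2])
```

With, say, one strongly associated pair and one noise variable, only the associated pair should survive screening and appear at the top of the ranking. Note also the reviewer's last point above: since TICe has more power than MICe, pairs significant under a MICe test would typically be a subset of those passing the TICe screen.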

      Their effort recently found venue in prestigious journals such as the Proceedings of the National Academy of Sciences (PNAS) in 2014, the Journal of Machine Learning Research (JMLR) in 2016, and the Annals of Applied Statistics (AOAS) in 2017. The main criticism of MIC has been its similarity to one of the many estimators of mutual information. Even though MIC exploits mutual information, MIC has been shown to not be the same as estimating mutual information [Measuring dependence powerfully and equitably by Reshef et al. in JMLR 2016]. Nonetheless, what strikes me the most is that in many empirical studies no estimator of mutual information has the same performance as MIC in terms of equitability. Equitability being a very intuitive property, I do understand why researchers and data mining practitioners value MIC.

      I have only one concern about the methodology of screening associations with TIC and ranking only the selected ones with MIC. Possibly, if we are interested just in equitability, MIC should be the only association measure to be employed in the analysis. However, given that TIC has been shown to have more power than MIC [An Empirical Study of the Maximal and Total Information Coefficients and Leading Measures of Dependence by Reshef et al. in AOAS 2017], I guess that the associations that MIC would deem as significant would be a subset of the significant associations for TIC.

      Minor comments:

      - It would be great to describe Storey's method to control the FDR in the paper to make it self-contained; it would also be great to briefly describe the procedure to control the FWER.
      - A table describing the difference between the data sets SD1 and SD2 would be informative. Possibly a line describing the Madelon semi-synthetic data sets would be useful too.
      - The authors discuss a great insight on MIC when they say that "associations between informative/redundant and redundant/redundant variables were significant also for a lower number of samples". It would be nice to have a visual example of these types of associations.
      - Figure 4b: I guess discussing a decreasing FN is the same as discussing increasing power. Changing the FN plot into a power plot would make the paper more coherent, e.g. as in Figure 2a.
      - "coniugate" in the abstract -> conjugate. Maybe better to reformulate this sentence as it is not very clear.

      Simone Romano

    1. Image

      Reviewer 2: Christian Fournier

      Reviewer Comments to Author, version 1: The authors investigate the ability of deriving plant biomass (both fresh and dry mass) from 2D image-based features acquired with visible, fluorescent and NIR multi-view imaging systems operating on an automated high throughput phenotyping platform. In a first part, several multivariate statistical models are compared for their ability at predicting biomass for two treatments within a single experiment, on three independent datasets, detailed results being presented for one experiment. One of the best model, the random forest, is then further investigated for its capacity at making prediction across experiments, being trained on one experiment at a time or on one treatment of one experiment at a time. Finally, the relative importance of individual image-based traits in the prediction of either fresh or dry weight is presented for two treatments of one dataset.

      Models and methods for model evaluation are clearly presented, and the overall quality of the text and Figure makes the paper easy to follow. The inclusion of other than visible images, the objective selection of image-based traits, the comparison of models and the use of 3 independent datasets clearly distinguish this paper from previous publications on the same subject. It provides the reader very valuable information on the current prediction capacity of the approach, together with a consistent methodology for analyzing other related practices.

      However, I have two major concerns on the current version of this manuscript.

      First, I think that some conclusions highlighted in the abstract or in the text are not completely in line with (or at least sufficiently tempered by) what is demonstrated in the text or shown in the figures. In the abstract (lines 19-20), it is highlighted that 'The results proved that plant biomass can be accurately predicted from image-based parameters using a random forest model'. To me this conclusion is clearly supported by the data in the case of within-experiment predictions, but not fully in the case of the cross-experiment test (i.e. quite the opposite of what is stressed in line 21). My impression, given the results presented in Figure 5, is that in one case out of two, a model trained on one experiment alone could not accurately (or at least not with the same accuracy) predict the biomass, despite a repeated protocol. This result is per se very interesting, as it demonstrates an important limitation of the approach. It can however not be summarized by what is written in lines 19-21, 201-202, 209-210 or 253-257. On another occasion (line 148 and line 248), I found the conclusion ('the RF model largely outperformed other models') a bit exaggerated, as, in Figure 3, depending on the criteria, the RF model performs very similarly to the MARS model, for example.

      Second, I did not manage to test the models, nor to reproduce the analysis with the provided data and source code. Concerning the data, image traits are provided for all experiments, but the manual Dry Weight measurements are missing. Concerning the code, the provided R script does not match the provided dataset, making it difficult to test. More importantly, the model code runs with errors at runtime ('not defined' errors). I also think, but this is only a suggestion, that, in addition to the raw image files, providing binary masks of the plants, which are of high importance for all the traits analyzed here, could improve the re-use of this nice dataset.

      Other minor points or comments for specific parts of the text are provided below:

      Line 72-74: I think this sentence would better be placed in the Potential application section.
      Line 85: Do you mean that some image traits are more sensitive to physiological traits? I do not see why Fig 1B is illustrative for this point.
      Line 98: In the context of phenotyping, it might also be useful to add Spearman rank correlation to the assessment.
      Line 108: Fig 1B is only a heatmap image. Maybe a list of traits should be provided, or a reference to the supplementary data should be added here.
      Line 117: Figure 2B is poorly informative as traits are not identified. This figure is also not commented on in the text; I suggest removing it.
      Line 144: I would find it useful to make perfectly clear here that all the models were trained on the control + stress plants, to avoid any confusion with the 'cross treatment test' later on (Figure 6).
      Line 146-151: I found the analysis a bit confusing as, in the details, the ranking of the different methods varies, and I do not clearly see why RF 'largely outperforms' other methods (especially MARS).
      Line 152-155: The comparison with the widely used 'single feature' method is very interesting. Could you consider adding its score/line on the R2 and RMSRE?
      Line 178: Maybe it is also worth noting in the text that geometric + color traits take 13 out of 15 (FW) and 15 out of 15 (DW) first places, as these two types of data are widely available among phenotyping platforms and yet not so often used in biomass predictions.
      Line 201-211: The text seems to me a bit too optimistic regarding the cross-experiment predictions. Exp3 clearly shows a non-conservation of the relationship obtained in Exp1 or 2, and a clear loss of predictive power compared to within-experiment training.
      Line 281: typo: sophisticated.
      Line 349: Could you give an idea of the amount of such filled missing values?
      Line 400: The formulation is a bit strange, as it sounds like a conclusion already.
      Line 426: DW data are missing.
      Line 535: The legend of figure 5 does not really apply to these figures. A complete legend should be added.

      Re-review:

      I thank the authors for the work done on the new manuscript and on Github, that address most of the concerns I raised in my first review.

      The pipeline published on GitHub now works nicely and allows the different analyses to be reproduced. I only had to install two packages manually (earth and e1071). They could easily be added to the list of dependencies in the R script to completely automatize the installation. The authors also clarified their analysis of the comparison of models, and the overstatement concerning the RF model has been corrected.

      I however still think that the abstract should be amended to better match the conclusions of the cross experiment test. The author acknowledged, in their response and in the text (line 226) that one cross experiment test leads to a loss of predictive accuracy.

      It seems also obvious from Figure 5, and this should probably be added to the text, that this loss of accuracy is not linked to a greater random dispersion of the points, but to a systematic model bias. I agree with the authors that this may be due to some changes in the experimental conditions. My point is that these changes are not completely captured by the model, even with the inclusion of non-structural traits. I therefore still think that there is some overstatement/ambiguity in the abstract, in particular in the sentence 'The high prediction accuracy based on this model, in particular the cross experiment performance, will contribute to relieve the phenotyping bottleneck in biomass measurement in breeding applications'. This may however be easily fixed.

    2. Abstract

      This paper has been published under an Open Access CC-BY 4.0 license in the journal GigaScience, which includes Open Peer Reviews published under the same license. These are as follows:

      Reviewer 1: Malia Gehan

      Reviewer Comments to Author, version 1: Image datasets are available and are a valuable community resource. The code is available, which is great. While I definitely appreciate the authors' work, I don't think the data support some of the statements throughout the paper, especially when it comes to the wording regarding MLR vs other models, unless further clarification can be provided (Figure 3). In some of the conditions (stress for example) MLR looks better than the other models. The inclusion of color, NIR, and Fluor traits into models is interesting.

      Lines 14-15: I think this statement needs to be qualified by saying that it is a challenge to find a predictive biomass model across experiments, not that it is a challenge to find a biomass model 'in the context of high-throughput phenotyping', which is vague and I don't think accurate without further clarification considering the number of previous papers that model biomass from images with high correlation to ground truth measurements.

      Lines 34 to 40: lacking in citations of literature. Introduction in general needs improvement in terms of the previous literature that it cites.

      The second paragraph of the intro is a very limited short review of the literature; there are a number of papers that model biomass using ht-phenotyping that are not represented, including Yang et al. 2014 (Nature Communications), Montes et al. 2011 (Field Crops Research), and Fahlgren et al. 2015 (Molecular Plant), to name a few.

      Line 45: "On the other hand, to produce reliable assessments, suitable model types needs to be established and model construction requires integration of many components such as efficient mathematical analysis and representative data." Very vague.

      Line 58: Please clarify this statement: "Another concern is that the number of traits used in these studies were quite limited and perhaps not representative enough. Therefore, a more effective and powerful model is needed to overcome these limitations and to allow better utilization of the image-based plant features which are obtained from non-invasive phenotyping approaches." Not sure what this means exactly, very vague considering that the papers mentioned do have models of biomass that are not 'perfect' but do have high heritability and correlation with ground truth measurements.

      I think the authors need to adjust the justification of their research to stress that there needs to be biomass models that can be used across experiments/environment/treatments, which they do say, but needs to be stated more clearly. In general, many of the justification statements, which are pointed out in points 3 and 4 above are obscure to the point that they lose meaning.

      Line 146: "Although the performance of these models was roughly similar, RF, SVR and MARS methods had better performance than the MLR method for prediction of both FW (Fig. 3B) and DW (Fig. 3D), implying a nonlinear relationship between image-based phenotypic profiles and biomass output." This doesn't seem accurate; it looks like MLR has just as good predictive power in many of the situations presented. I don't think you can say that MLR and the others are roughly similar and then say that this implies a nonlinear relationship. Can this conclusion be clarified? It seems like there are only small differences between the models.

      Regardless of whether or not random forest is the 'best' model, the data doesn't seem to support the statement that the RF model 'largely' outperformed the other models. This only seems accurate under the control condition, can this be clarified?
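One way to substantiate (or refute) a claim that one model "largely outperforms" another is a paired bootstrap on per-plant squared errors. A hedged stdlib sketch follows; the biomass values and model predictions are made up for illustration.

```python
import random

def sq_errors(y_true, y_pred):
    """Per-sample squared errors."""
    return [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

def bootstrap_mean_diff(err_a, err_b, n_boot=2000, seed=0):
    """95% bootstrap CI for mean(err_a - err_b); negative => model A better."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(err_a, err_b)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

# Hypothetical per-plant biomass ground truth and two models' predictions
truth = [10, 12, 8, 15, 11, 9, 14, 13]
rf = [9.8, 12.3, 8.1, 14.6, 11.2, 9.3, 13.7, 12.8]
mlr = [9.1, 13.0, 7.2, 16.1, 10.1, 8.2, 15.0, 12.0]
lo, hi = bootstrap_mean_diff(sq_errors(truth, rf), sq_errors(truth, mlr))
print(lo, hi)  # if the interval excludes 0, the difference is more than noise
```

If the confidence interval for the error difference straddles zero, wording like "largely outperformed" is not supported by the data, which is the reviewer's point.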

      Line 238: "Although previous attempts have been made to estimate plant biomass from image data, most of these studies consider only a single image-based feature or very few features in their models which are often linear-based, ignoring the fact that the phenotypic components underlying biomass accumulation are presumably complex. Accurately predicting biomass from image data requires efficient mathematical models as well as representative image-derived features." I disagree with the authors on this point, if biomass can be modeled with a few features with high correlation why does it matter if they presume that it is complex? Their more complex models were still decreased in R2 with environmental differences and between experiments and I don't find the data suggesting that RF model outperforming other models (particularly MLR) convincing without further clarification.

      Re-review: Chen et al, appear to have addressed each reviewer comment, below are some minor language changes for the revised sections.

      Minor changes (language changes):
      1. Line 47: remove "some other traits"; it seems unnecessary.
      2. Line 64: change "they" to "Buesmeyer et al. 2013", and change "make it a question" to "question".
      3. Line 73: change "besides" to "Further".
      4. Line 75: change to "due to a lack of datasets for assessment".

    1. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.62), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Kapeel Chougule

      Are all data available and do they match the descriptions in the paper?

      No. I typed the BioProject ID provided under supporting data and materials (PRJNA795201) but could not see any information or data. Either the authors have put it on hold or the data have not been submitted. Please request the authors to release the data.

      Additional Comments: The manuscript titled "A chromosome-level genome assembly and annotation of a maize elite breeding line Dan340" provides a good overview of the genome assembly construction and structural annotation of the maize elite breeding line Dan340. The authors have presented correct methods in the construction of the genome assembly and annotations. Although the paper provides elaborate methods for genome construction, the manuscript fails to demonstrate the value of the genome as a resource. More specifically, the authors describe this line as elite, with desirable characters such as disease resistance, lodging resistance and so on. The focus of the manuscript is mostly on methods, without significant examples to demonstrate the value of the resource. The authors could characterize some disease resistance genes, or genes affected by structural variations in the Dan340 line, and compare them to other maize lines.
      Major Comments:
      1) I typed the BioProject ID provided under supporting data and materials (PRJNA795201) but could not see any information or data. Either the authors have put it on hold or the data have not been submitted. Please request the authors to release the data.
      2) The authors built a repeat library using ab initio and homology-based methods and masked 66.09% of the Dan340 genome; compared to other reference lines, especially B73 from Hufford et al. (https://www.science.org/action/downloadSupplement?doi=10.1126%2Fscience.abg5289&file=science.abg5289_hufford_sm.pdf, Table S5), this is significantly lower, i.e. the B73 genome is 85% masked.
      3) To assess the quality of the genome assembly the authors use the LAI index. The LAI reported for B73 in Hufford et al. (same as above, Table S2) is 27.84, whereas the authors report a B73 LAI of 16.79, which is incorrect. Is the B73 version used for comparison v4 or v5? Can the authors provide a pairwise comparison of genome sequence using a dot plot between the Dan340 line and other maize lines, to visualize assembly artifacts like inversions, deletions or gaps in the assembly?
      4) The authors use transcription data from six tissues (stem, endosperm, embryo, bract, silk, ear and tip) for alignment. There is no mention in the manuscript of how these were generated. Are they also submitted to NCBI?
      Minor comments:
      1) Line 6: rephrase the sentence; what do the authors mean by "There are more than 50 Maize hybrid breeds derived from Dan340 since 2000"?
      2) Table 2: use full genus and species names in column 1.

      Reviewer 2. Georg Haberer

      Are all data available and do they match the descriptions in the paper?

      No. I could not check the data availability; the provided project number was not found at the NCBI sequence archive. The authors should ensure that both raw genomic and RNAseq reads are uploaded there. Also, the final genome sequence and gene annotations should be available to the community.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. The description of how structural variants were detected is very fuzzy. For example, L283: "Then, SAMtools v0.1.1 (Li et al., 2009) was used to assign the structural variations from the bam file to each chromosome or scaffold." Samtools per se does not call any SVs; it is unclear how this assignment was actually performed. Did they filter SVs here? Did they use the full samtools+mpileup+vcftools pipeline? Also, did they really use v0.1.1? This is an extremely outdated version (the current version is 1.15, probably dozens of updates later).

      Additional Comments:

      The authors report sequencing and assembly of Dan340, a highly important founder line for maize breeding in China. They complement the genome sequence with gene and repeat annotations and a preliminary study of structural variants between their line and three others. The obtained genome sequence is of excellent quality and is evaluated by several independent statistics (BUSCO, CEGMA, LAI). Repeat and gene predictions seem to be done by state-of-the-art methods, and the reported numbers and proportions are similar to previous reports and comparative analyses in maize. In summary, the manuscript provides a highly valuable genomic resource for maize biologists and breeders and complements the increasing number of maize pan-genomes with a major Chinese germplasm. I have only a few comments for the authors (see my points above about data availability and methods to call SVs); in addition:
      - Table 2: they provide here only abbreviations of species; they have to spell out these abbreviations in the table legend, for example: Hvu: Hordeum vulgare. Also it may be difficult for readers to understand to which species/lines 'ZmaL' and 'Zma' point.
      - L31/L67: "… and so on." is not an appropriate closure of a sentence in scientific texts.
      - L95-103: this part can be left out; the authors do not have to, and should not, describe or even justify the CCS technology here. Just mention what has been done.
      

      Recommendation: Minor Revision.

      Reviewer 2. Xupo Ding

      1. The gene numbers and repeat percentage should be presented in the abstract.
      2. The potential functions of the three secondary metabolite processes mentioned in the abstract might be inferred. Would this show specifics of Dan340, such as a relation to disease resistance or other traits?
      3. Line 79-82: this sentence is the same as the conclusion in the abstract; please extend it.
      4. The depth or data size of the CCS and Hi-C data should be added, to match the Illumina data description.
      5. Line 128-131: before the assembly assessment, its quality should not yet be asserted. Please consider rephrasing, e.g.: "The assembly was performed in a stepwise fashion with PacBio HiFi reads, Illumina short reads and Hi-C technology."
      6. Line 211-213: insert a description of comparative LTR data for Dan340, B73, Mo17 and SK.
      7. Line 241: describe how many genes, or what percentage of protein-coding genes, were supported by RNA-seq.
      8. Fig. 4A: the line names of the four maize lines might lie outside the Venn diagram.
      9. Fig. 4B and Fig. 4C are not cited in the data description.
      10. Line 264-270: insert a pathway description for the top five or ten entries in the GO list and infer the function of the three special secondary metabolite pathways in Dan340.
      11. In the conclusion, the contributions should be discussed in more depth, and at least not exactly the same as the details in the abstract and Line 79-82.

      Recommendation: Minor Revision

    1. We present LT1

      Reviewer 2. Professor.Gong Zhang

      Is the language of sufficient quality? No.

      Are all data available and do they match the descriptions in the paper?

      No. Hi-C data was not deposited.

      Is the data acquisition clear, complete and methodologically sound?

      No. The quality of the nanopore sequencing datasets was not evaluated. The error correction using short-read sequencing was not clear. It seems not necessary to use Hi-C data for the assembly.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. Error correction was not clear.

      Is there sufficient data validation and statistical analyses of data quality? No.

      Is the validation suitable for this type of data?

      No. No validation of the variants was performed. The authors used multiple SNV detection algorithms and got quite different results. They should experimentally validate which one is better.
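A first step toward deciding which SNV caller to validate experimentally is quantifying the callers' concordance. A minimal sketch follows, intersecting two call sets keyed by (chromosome, position, ref, alt) parsed from VCF-like rows; the variants shown are invented.

```python
def load_calls(lines):
    """Parse minimal VCF-like lines into a set of (chrom, pos, ref, alt) keys."""
    calls = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        calls.add((chrom, int(pos), ref, alt))
    return calls

# Hypothetical outputs from two SNV callers
caller_a = load_calls([
    "chr1\t100\t.\tA\tG",
    "chr1\t250\t.\tC\tT",
    "chr2\t40\t.\tG\tA",
])
caller_b = load_calls([
    "chr1\t100\t.\tA\tG",
    "chr2\t40\t.\tG\tA",
    "chr2\t99\t.\tT\tC",
])
shared = caller_a & caller_b
jaccard = len(shared) / len(caller_a | caller_b)
print(len(shared), round(jaccard, 2))  # → 2 0.5
```

Caller-specific calls (the symmetric difference) are the natural candidates for the experimental validation the reviewer asks for, since the shared calls are already cross-supported.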

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      No. It is difficult to reuse it. There's little annotation done.

      Additional Comments: I don't understand why the authors chose to sequence a woman. As a reference of a certain ethnic, complete chromosomes are needed, which means a man (XY) is necessary.

    2. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.51), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Dr.Giulio Formenti

      First review: Language: A few minor typos to correct, highlighted in the revised manuscript

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes, but please revise as per my comments

      Additional Comments: see comments here https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvMjg3L0xUMV9NU19HaWdhQnl0ZV8yMDIxMTEyM19IU19HRi5kb2N4

      Decision: Minor Revision.

      Re-review:

      I am happy with the changes and I think the article is worth publishing in GigaByte. However, I think one main point needs further clarity. Since this is mostly about a new dataset and assembly, the authors should make it very clear to the reader what they did. I think the title is still misleading in this respect. In it the authors refer to an "assembly with short and long reads combined with Hi-C data". This is not how one would generally refer to such an assembly in the community, as it reads as if a short-read-based assembly was complemented with long reads (gap filling?) and Hi-C reads (phasing?). I suggest rephrasing as "an ONT long-read-based assembly scaffolded with Hi-C data and polished with short reads". The confusion/ambiguity about this is further reinforced in the text. I think the authors should make an extra effort when reading the text to make sure the genome assembly terminology is consistent with the state of the art and therefore very clear to the reader. For instance, in the abstract the authors say that the assembly was constructed using 57x ultra-long nanopore reads. I think this is incorrect. Ultra-long nanopore reads are usually defined as reads >100 kbp. I don't think the authors filtered their dataset for ultra-long reads, and this should be corrected. Indeed, it would be interesting to know what fraction of ultra-long reads is available in their 57x dataset.

  10. May 2022
    1. Now published in Gigabyte doi: 10.46471/gigabyte.49. Teresa Shippy (KSU Bioinformatics Center, Division of Biology, Kansas State University, Manhattan, KS), Prashant S Hosmani (Boyce Thompson Institute, Ithaca, NY 14853), Mirella Flores-Gonzalez (Boyce Thompson Institute, Ithaca, NY 14853), Lukas A Mueller (Boyce Thompson Institute, Ithaca, NY 14853), Wayne B Hunter (USDA-ARS, U.S. Horticultural Research Laboratory, Fort Pierce, FL 34945), Susan J Brown (KSU Bioinformatics Center, Division of Biology, Kansas State University, Manhattan, KS), Tom Delia (Indian River State College, Fort Pierce, FL 34981), Surya Saha (Boyce Thompson Institute). For correspondence: ss2489@cornell.edu

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.49), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Hailin Liu

      Is the data acquisition clear, complete and methodologically sound? No.

      Additional Comments:

      Minor revision please:
      1. This manuscript needs to be reorganized, as the methods, results and discussion are somewhat mixed.
      2. Line 125: were these data newly generated? How much data was used should also be presented.
      3. How do you make sure that the hox genes you found are complete and exact? Was there any validation?

      Recommendation: Minor Revision

      Reviewer 2. Mary Ann Tuli

      Are all data available and do they match the descriptions in the paper?

      Yes. The author states, "Reciprocal BLAST was used to confirm orthologs for all D. citri genes", and has explained (through the pre-review process) that these were performed manually on the NCBI website over a period of months by different authors and thus cannot be easily reproduced. I think it could be made more clear that this is in line with manual curation, and the accession numbers are all provided in the paper.
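The reciprocal BLAST confirmation described here follows reciprocal-best-hit logic, which can be sketched over tabular (BLAST outfmt 6-style) hits. The gene names and bitscores below are invented for illustration.

```python
def best_hits(pairs):
    """pairs: (query, subject, bitscore) rows; keep the top-scoring subject per query."""
    best = {}
    for q, s, score in pairs:
        if q not in best or score > best[q][1]:
            best[q] = (s, score)
    return {q: s for q, (s, _score) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep pairs where each gene is the other's best hit."""
    fwd, rev = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)

# Invented D. citri vs. D. melanogaster hit tables
dcitri_vs_dmel = [("Dc_per", "Dm_per", 310.0), ("Dc_per", "Dm_tim", 95.0),
                  ("Dc_cry2", "Dm_cry", 210.0)]
dmel_vs_dcitri = [("Dm_per", "Dc_per", 305.0), ("Dm_cry", "Dc_cry1", 230.0),
                  ("Dm_cry", "Dc_cry2", 190.0)]
print(reciprocal_best_hits(dcitri_vs_dmel, dmel_vs_dcitri))  # → [('Dc_per', 'Dm_per')]
```

Note how Dc_cry2 is excluded: its best hit Dm_cry points back to a different gene, exactly the asymmetry reciprocal BLAST is designed to catch during manual curation.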

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      Yes. See above comment regarding reciprocal BLAST.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Yes. It does meet reuse criteria, but will be more reusable once the data is available from the Citrus Greening website.

      Recommendation: Minor Revision

    1. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.48), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Mary Ann Tuli

      Are all data available and do they match the descriptions in the paper?

      No. The author states, "The gene models will also be part of an updated OGS version 3 for D. citri". I am wondering when this updated version will be available. In addition the author states, "the data is also available through NCBI (BioProject: PRJNA29447)". It would be good to include the GenBank accession numbers of the 27 D. citri genes.

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      Yes. It does meet reuse criteria, but will be more reusable once the data is available from the Citrus Greening website.

      Reviewer 2. Ruihan Li

      Are all data available and do they match the descriptions in the paper?

      No. No new sequencing data were generated in this paper. The author only modified the previous gene sets. Citations for the raw data behind the evidence supporting annotation mentioned in Table 2 should also be given.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. The method section of this paper seems too brief and should be supplemented in detail.

      Additional Comments:

      The author of "Annotation of Putative Circadian Rhythm-Associated Genes in Diaphorina citri (Hemiptera: Liviidae)" manually annotated 27 genes related to circadian rhythm in the genome of Diaphorina citri. This study summarized the circadian model in D. citri, which may help people to protect citrus from citrus greening disease. But I have some comments that need to be taken into consideration.
      General comments: The method section of this paper seems too brief and should be supplemented in detail for these studied rhythm genes. No new sequencing data were generated in this paper; the author only modified the previous gene sets. Citations for the raw data behind the evidence supporting annotation mentioned in Table 2 should also be given.
      Specific comments:
      - Page 2: "…will allow future molecular therapeutics…" — molecular therapeutics for what? The object should be specified.
      - Page 3: The author only says "Based on the critical importance of the genes identified" in the Introduction, but does not introduce the circadian physiological habits of this insect, such as whether it stays in a tree all the time. Another problem is that the gene functions in Table 1 seem to be mostly related to development, reproduction and even death, but not directly related to circadian rhythms. The relationship between these genes and rhythms needs to be explained.
      - Table 1: Details of these references should be placed at the end of the article instead of in the table. Functional descriptions of these genes should be supplemented.
      - Page 6: "…, but also makes the D. melanogaster model different from non-dipterans due to their possession of cry2." The D. melanogaster model and the Drosophila model are mentioned several times in this paragraph. Do they express the same meaning? Please use the same expression to avoid misunderstanding.
      - Table 2: "X" and "space" should be explained in the table notes. In addition, the analysis methods for "MCOT, ISO-SEQ, RNA-Seq and Ortholog" in the "Evidence Supporting Annotation" column are not described in this article. Are these results collected from an existing online database or generated from new analysis done in this study? The Methods section is surprisingly simple; the necessary steps should be described clearly.
      - Figure 2: If the author quotes someone else's picture, please cite the sources.
      - Figure 3 is meaningless and should be removed.
      - Page 12: How were "genome assembly errors" discovered? Did the author compare transcriptome data with genome data? Please describe the method clearly. RNA-Seq experimental analysis is mentioned in the conclusion, but there is no relevant experiment in this paper. Why include this sentence in the conclusion?
      
    1. numbat

      Reviewer 2. Xu Wang

      Are all data available and do they match the descriptions in the paper? No. PRJNA786364 cannot be found at NCBI. The numbat assembly can be accessed at AWS: https://threatenedspecies.s3.ap-southeast-2.amazonaws.com/index.html#Myrecobius_fasciatus/.

      Is the data acquisition clear, complete and methodologically sound? Yes. In this manuscript, Peel et al. sequenced and assembled a draft genome of Myrmecobius fasciatus, a termitivorous marsupial species known as the numbat. A total of 94% of sequencing reads could be mapped to the draft assembly, and 96% of RNA-seq reads from three tissues could be aligned. This is the first genome assembly for this species and provides the necessary genome resource for molecular study of the interesting characteristics of this species.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. One major concern is the completeness of this assembly. The BUSCO analysis revealed a completeness score of 83%, missing roughly one-fifth of mammalian BUSCO genes. The missing gene models in this assembly may affect the inference of gene gain/loss and gene family expansion. The authors should discuss this, so the audience can be aware of it when using this assembly for comparative genomic analysis. Please find the specific comments below:
      1. To determine whether this assembly is incomplete or not, a genome size estimation analysis should be performed based on k-mers or other approaches.
      2. How many chromosomes are there in the numbat genome? Could the authors describe the karyotype, if it has been characterized in the previous literature?
      3. Line 73, 82.8% complete BUSCOs: could the authors break this down into single-copy and multiple-copy BUSCOs?
      4. Line 95: 10x Genomics does not support genomes larger than the human genome. How much input DNA did the authors use for the 10x library prep?
      5. Line 190, Table 1: please add the Monodelphis and tammar wallaby genome assembly statistics, as these are the first two assembled marsupial genomes.
      6. Line 193: the numbat repeat content (47.6%) was compared to antechinus (44.8%) and koala (47.5%). I suggest that the authors add the two species in Table 2, to check if any specific classes of repetitive elements are enriched in the numbat.
      7. Did the authors assess the level of potential contamination from microbial species?
      8. I suggest the authors take the top 20 largest scaffolds and perform a syntenic analysis against the tammar wallaby or koala genome.
      9. Line 295, vomeronasal receptors: marsupials have a large number of olfactory receptors. Have the authors checked the total number of olfactory receptor genes in the numbat, and how does it compare to other marsupial species?
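The k-mer-based genome size estimate requested in comment 1 divides the total k-mer count by the modal k-mer depth. A toy stdlib sketch follows; real analyses would use tools such as Jellyfish and GenomeScope on the raw reads, and the sequences here are invented.

```python
from collections import Counter

def genome_size_estimate(reads, k):
    """Estimate genome size as total k-mers / modal k-mer depth."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    depth_hist = Counter(counts.values())          # depth -> number of distinct k-mers
    peak_depth = max(depth_hist, key=depth_hist.get)  # modal coverage depth
    total_kmers = sum(counts.values())
    return total_kmers // peak_depth

# Toy "genome" sequenced at exactly 3x: three identical error-free reads
genome = "ACGTACGGTTCAGGCATT"
reads = [genome, genome, genome]
print(genome_size_estimate(reads, k=5))  # → 14, i.e. len(genome) - k + 1
```

A large gap between this estimate and the assembly span would support the reviewer's suspicion of incompleteness, while agreement would suggest the missing BUSCOs reflect annotation rather than assembly gaps.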

      Is there sufficient information for others to reuse this dataset or integrate it with other data?

      No. PRJNA786364 cannot be found at NCBI. Please release the dataset after acceptance.

      Recommendations: Minor revision.

    2. Abstract

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.47), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Charles Feigin (reviewer), Elise Ireland (assisted in proofreading)

      Are all data available and do they match the descriptions in the paper? I selected "yes" to the question "Are all data available and do they match the descriptions in the paper?" as the genome, annotation, transcriptome etc are currently available through the author's institutional/consortium AWS link. However, so far as I can tell the BioProject number they provide is not yet public. This would contain the same data, but a release in the public repository is a more secure way ensure data is permanently accessible to the community.

      Is there sufficient detail in the methods and data-processing steps to allow reproduction? No.

      I have requested in my comments for a few program version numbers, parameters, sequence accession numbers to be included and suggested a few points of clarification on the methods. This should be very straightforward for the authors to address.

      Additional files: https://gigabyte-review.rivervalleytechnologies.com/download-api-file?ZmlsZV9wYXRoPXVwbG9hZHMvZ3gvRFIvMjkwL051bWJhdCBHZW5vbWUgUmV2aWV3ZXIgQ29tbWVudHMgLSBDWUYgKDIpLmRvY3g~

      Recommendation: Minor Revisions

    1. Chitinases

      Reviewer 2. Hai-Zhong Yu.

      Additional Comments. The manuscript presented by Shippy et al. describes the chitinase family genes in Diaphorina citri. Chitin is widely distributed in nature and serves a variety of functions. In insects, chitin is a major structural component of the cuticle and peritrophic membrane and plays an important role in molting; thus, chitin metabolism-related genes can serve as desirable targets for pest control. As described in the background, chitinases play an important role in digesting the polysaccharide polymer chitin. In the current study, the authors identified and annotated 12 chitinase family genes from D. citri and performed phylogenetic analysis. Additionally, the structural domains and expression patterns of D. citri chitinase genes were analyzed. In general, the manuscript can provide some useful information for D. citri control. This manuscript can be accepted after the following questions are addressed:
      1. According to Table 1, 12 chitinases were identified, including CHT3, CHT5-7, CHT10-1, CHT10-2, CHT11, IDGF1-3, ENGase and CHTPE. However, CHT1-2, CHT4 and CHT8-9 seem to be missing. Please give a proper explanation.
      2. I suggest that the authors verify the expression levels of these chitinase genes by qPCR or Western blot.

    2. Abstract

      This work has been published in GigaByte under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.46), and has published the reviews under the same license. These are as follows.

      Reviewer 1. Mary Ann Tuli

      Are all data available and do they match the descriptions in the paper?

      As with the other manuscripts, OGS v3 is mentioned, but this is not yet available from the CGEN. The data underlying Table 4 and Fig 3 are available.

      Additional Comments: This manuscript is a comprehensive description of the manual curation of the chitinase family genes, with clear aims and methodology.

  11. Apr 2022
    1. ABSTRACT

      This has been published in GigaByte Journal under a CC-BY Open Access license (see https://doi.org/10.46471/gigabyte.44). The open peer reviews have been unpublished under the same license and are as follows:

      Reviewer 1. Changxu Tian. In this paper, a high-quality genome of the roundjaw bonefish was successfully constructed, and the population structure of Albula glossodonta, previously uninvestigated for any bonefish species, was examined with high-resolution genomic data. It serves as a valuable resource for future genomic studies of bonefishes to facilitate their management and conservation. The authors have presented the data in a meaningful way. I recommend the manuscript for publication once the following minor concerns are addressed:
      1. In Tissue Collection and Preservation, why not use the same individual for all DNA sequencing, rather than using the heart tissue of another individual for long-read sequencing and Hi-C sequencing?
      2. In the Illumina RNA read error correction, why were the original reads used without filtering?
      3. In the discussion section, it is suggested to add a discussion of the genomic results for this species.

      Reviewer 2. Shengyong Xu. In the present study, the authors report the genome assembly of the bonefish Albula glossodonta, as well as population genomic analyses using ddRAD-seq. These genomic data should be useful for the management and conservation of this species. Some comments follow:
      1. The authors should show line numbers in their manuscript.
      2. In the Abstract and Results, the authors should provide fundamental genomic information such as genome size, heterozygosity ratio and repeat ratio, so readers can have a better understanding of the Albula glossodonta genome.
      3. Also, the authors should provide information on the final genome assembly of this fish species, i.e. the total length of the genome assembly and the number and N50 of scaffolds, among others.
      4. What is the meaning of NG50, LG50 and auNG in the manuscript? And what is the difference between NG50 and N50? The authors should explain why these statistics are used in the description of the genome assembly.
      5. With an annotated genome assembly as reference, I suggest the identified SNPs be annotated using SnpEff or ANNOVAR.
      6. A population genomic approach can uncover population divergence at a fine spatial scale. In this manuscript, relatively high levels of genetic differentiation were detected between Mauritius and the other three groups based on the neutral SNP dataset, suggesting possible local adaptation in the Mauritius population. I suggest the authors further analyze population structure using the outlier dataset to reveal the influence of local adaptation on population differentiation.
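The outlier analysis suggested in the last comment typically ranks SNPs by FST. A minimal sketch follows, using the simple Wright formulation FST = (HT − HS)/HT on invented allele frequencies; real studies would use a proper estimator such as Weir-Cockerham, e.g. as implemented in VCFtools.

```python
def fst(p1, p2):
    """Wright's FST for one biallelic SNP from two populations'
    alternate-allele frequencies: (HT - HS) / HT."""
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)                       # pooled expected heterozygosity
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # mean within-population
    return 0.0 if ht == 0 else (ht - hs) / ht

# Invented allele frequencies: e.g. Mauritius vs. the pooled other sites
snps = {"snp1": (0.9, 0.1), "snp2": (0.5, 0.45), "snp3": (0.8, 0.3)}
ranked = sorted(snps, key=lambda s: fst(*snps[s]), reverse=True)
print(ranked[0], round(fst(*snps[ranked[0]]), 3))  # → snp1 0.64
```

SNPs in the upper tail of this ranking are the outlier candidates whose geographic pattern would speak to local adaptation rather than neutral divergence.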

    1. This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giac007), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 3: Mudra Hegde

      Summary:

In this manuscript, Poudel et al. present a software tool, GuideMaker, to rapidly design sgRNAs targeting non-model genomes. Various input parameters, such as the PAM motif, guide length, and length of the seed region for off-target searching, can be tuned to design a panel of sgRNAs for pooled screening projects. The tool also helps pick control sgRNAs to include in the sgRNA pool. To benchmark the computational performance of their tool, the authors used GuideMaker to design sgRNAs targeting E. coli, P. aeruginosa, Aspergillus fumigatus, and Arabidopsis thaliana. They also compared GuideMaker to the existing design tool CHOPCHOP and reported that the targets identified by GuideMaker were mostly similar to those identified by CHOPCHOP. This tool can be used as a stand-alone web application, as command-line software, or in the CyVerse Discovery Environment.

      Overall, the tool is very well documented and easy to use. In the current version of the manuscript, GuideMaker does not show a clear improvement over the state-of-the-art design tool, CHOPCHOP. The authors do not implement any existing on-target scoring methods to determine the targeting efficacy of the picked sgRNAs. This can lead to picking guides that are highly specific but not effective enough.

      Major points:

      1. Implementing on-target scoring methods, at least for the Cas enzymes that have on-target efficacy information, can help improve the process of picking sgRNAs. This tool will probably be used more often with standard Cas enzymes and it will be useful to have on-target efficacy scores attached to the guide RNAs.

      2. The authors do a thorough analysis of the computational performance of GuideMaker with various genomes and Cas enzymes but including a comparison of the computational performance of GuideMaker vs. CHOPCHOP will strengthen the manuscript.

      3. The authors define the PAM sequence of SaCas9 to be NGRRT whereas the canonical PAM sequence of SaCas9 is NNGRRT. This should be modified throughout the manuscript and analyses involving SaCas9 should be redone.

      4. A good addition to the tool would be to output a file with all the sequences that were designed targeting the region of interest with the specific PAM sequence. This gives the user a sense of the universe from which the final guides were picked.

      5. Another useful input parameter would be to specify a target region that the user wants to focus on such as letting the user input genomic coordinates or a gene name or locus tag. For example, CRISPy by Blin et al., 2016 takes a GenBank file as input and allows the user to input features specific to the uploaded genome.
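The distinction raised in point 3 matters in practice: PAM motifs are written in IUPAC ambiguity codes, and a one-letter change (NGRRT vs. the canonical NNGRRT) changes which genomic sites qualify. A minimal, hypothetical sketch (not GuideMaker's actual implementation) of IUPAC-to-regex PAM matching:

```python
import re

# Hypothetical illustration (not GuideMaker's code): expand IUPAC
# ambiguity codes into regex character classes so a PAM motif such as
# SaCas9's canonical NNGRRT can be matched directly against sequence.

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purine
    "Y": "[CT]",    # pyrimidine
    "N": "[ACGT]",  # any base
}

def pam_regex(pam):
    """Translate an IUPAC PAM motif into a plain regex string."""
    return "".join(IUPAC[base] for base in pam)

# Check candidate 6-mers against the canonical SaCas9 PAM:
for site in ["CAGGAT", "TTGAAT", "CAGCAT"]:
    ok = bool(re.fullmatch(pam_regex("NNGRRT"), site))
    print(site, ok)
```

Note that any interior NGRRT match extends to an NNGRRT match one base upstream (the extra N accepts any base), so correcting the motif mainly changes how guides are anchored relative to the PAM rather than the raw site count.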

      Minor points:

      1. "CyVerse" is misspelled as "CyCVerse" in multiple places in the manuscript.

      2. Reference Figure 2 in Line 92.

      3. Line 154: "Ratios between tools were calculated by dividing the number of gRNA identified.."

4. In Supplementary Figure 3, "wit haVX2" should be "with AVX2".

      5. GitHub link in Line 336 does not work.

      6. Line 225-226: "GuideMaker also creates off-target gRNAs for use as negative controls in highthroughput experiments." "Off-target gRNAs" is misleading in this context.

    2. This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giac007), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Wen Wang

The authors developed a software tool, GuideMaker, for designing CRISPR-Cas guide RNA pools in non-model genomes. Three bacterial genomes, a fungal genome, and a plant genome were used in performance benchmarking, which shows that the software supports the design of gRNAs for non-standard Cas enzymes in non-model organisms at the genome scale. However, the advantages of this software are not well estimated or presented compared to other tools like CHOPCHOP. Also, the software was mainly evaluated on three bacterial genomes, one fungal genome, and the Arabidopsis genome. There are no tests on non-model plant or animal genomes. Therefore, the claim of "non-model genomes" in the title is exaggerated. I list more problems as follows.

      Major comments:

1. The authors did not compare the computational resources and performance (running time, memory) with existing software like CHOPCHOP. Also, the authors need to compare the score rankings with CHOPCHOP to present the relative power of GuideMaker. Are there any score rankings concerning efficiency or off-target possibilities for the designed guide RNAs?

2. It would be better to add support for GFF-formatted annotation input files, since many non-model species do not have GenBank annotations.

3. The authors mentioned that GuideMaker can design gRNAs for any small to medium-sized genome (up to about 500 megabases). The largest genome used in the article was that of Arabidopsis thaliana (114.1 Mb), which is well below the stated limit of about 500 megabases. We could not find any description of whether the authors had investigated larger genomes. Therefore, a detailed analysis or discussion of this problem is needed.

4. The authors stated that GuideMaker is designed for CRISPR-Cas guide RNA pools in non-model genomes. Arabidopsis thaliana is a model organism, and a test on a non-model plant genome would be highly valuable.

5. It is also stated that GuideMaker can design gRNAs for any PAM sequence from any Cas system, but the results for SaCas and StCas were described in only one sentence.

6. The source of the genomes is missing in the manuscript. In particular, some species have multiple genome versions in the same database. Therefore, to make the results reproducible, the specific website and version number for each species are needed.

      Minor comments: There are many typos. I give some examples here.

1. Line 11, "bacterias" should be "bacteria".

2. Line 38, delete ", including non-model organisms"; prokaryotic and eukaryotic organisms already include non-model organisms.

      3. Line 111, "candidates guides" should be "candidate guides".

4. Line 154, "gRNA identify with GuideMaker" should be "gRNA identified with GuideMaker".

      5. Line 195, "The second way GuideMaker reduces…" should be "The second way that GuideMaker reduces…".

      6. Line 204, "and", no need for italics.

      7. Line 207, "gRNA's" should be "gRNAs".

      8. Lines 209-210, "we anticipate performance will…" should be "we anticipate that performance will…".

      9. Figure. 1. It seems that the font size of the description of Control gRNAs is inconsistent with others, please check.

      10. Line 22,55,98,159,175,187,219 and 247, "Guidemaker" should be "GuideMaker".

      11. Line 262, "CAS" should be "Cas".

      12. Supplementary Figure 4. Grammar mistake in sentence "the different number of logical cores with or without AVX2 settings are available". It should be "the different number of logical cores with or without AVX2 settings is available".

    3. This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giac007), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Kornel Labun

In this study, "GuideMaker: Software to design CRISPR-Cas guide RNA pools in non-model genomes", Poudel et al. provide software for sgRNA design, focusing on genome-wide screens. The tool uses an original strategy of finding off-targets with Hierarchical Navigable Small World graphs, aiming to provide fast running times for the all-vs-all comparison. Additional novelty is introduced with proximity filters towards features of interest, and filters for restriction sites inside the guide RNA. What's more, the tool creates control guide RNAs, which are mandatory for pooled screens. I applaud the selection of the license, as all versions of GuideMaker are available under a Creative Commons CC0 1.0 Universal Public Domain Dedication. Below I list some of my comments and suggestions.

Comments & suggestions:

      1. I tested the website and the tool, not finding any bugs and errors. Website is well made, congratulations!

2. Name of the tool: GuideMaker is not self-explanatory for what it is specialized for, which is pooled design. In the future, consider naming your tools more distinctly, as I am afraid that currently the tool will be buried under hundreds of other GuideSomething tools.

3. The authors also claim to support Cas13 (page 3, line 65), but don't mention anything more specific about it. I mention that because design for RNA is vastly different from design for DNA, and it should be explained how the tool designs for RNA.

4. From my understanding, the tool offers highly discriminatory settings for the off-target search to quickly resolve the all-vs-all comparison problem; however, the authors ignore that CRISPR off-targets are defined not by the Hamming distance but by the Levenshtein distance. This was proven already by many studies, e.g. Tsai et al. 2015. I recommend that the authors embrace this issue in the paper and explain why their design may be suitable, and for what kind of studies it would be alright to use the Hamming distance instead of the Levenshtein distance, rather than ignoring the problem.

5. The study could gain prominence by showing a couple of figures and describing how the grid-optimization parameters were selected. This would be especially important for everyone who wants to use this tool for non-bacterial genomes (page 6, lines 128-131). Although a script for optimization is included, it would be good to see what the tradeoffs are.

6. I believe that Figure 4 and all other AVX2 vs. non-AVX2 comparisons are not interesting enough to include multiple times. AVX2 improvements are nice, but the tool is already plenty fast, and a running time of 250 vs. 220 seconds does not matter for normal users. Similarly, the number of cores does not seem to influence tool speed above 8 cores, and one figure should be enough to explain that. The tool claims very fast running times but does not compare to the running times of other similar tools for the design of pooled screens; such a comparison could highlight its superiority.

7. CHOPCHOP is a general tool for guide design, while here it is used as a pooled screen tool due to its configurability. Additionally, CHOPCHOP also supports all PAMs and all species in its python version, available here: https://bitbucket.org/valenlab/chopchop/src/master/; the website supports only some genomes due to the slow process of index building for bowtie.

8. Comparisons to CHOPCHOP focus on the guides found, but I don't understand why the consensus ratio between the tools should matter. What is more important is whether GuideMaker does indeed not filter out any guides that are preferable for each gene (e.g. by CHOPCHOP ranking) and whether its Hamming-based filter is good enough not to cause significant unknown off-target effects (Levenshtein-distance off-targets not found by the Hamming-distance filter). All it takes is one bulge and the Hamming distance will become large, while the Levenshtein distance can even be as low as 1.

9. It is not clear to me why the tool can't be used with large genomes; filtering on the 11 bp seed and Hamming distance should be plenty fast even for very large genomes. Could it be that the tool should support other input, not only the GenBank file format?
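To make the Hamming vs. Levenshtein concern above concrete, here is a small self-contained sketch (illustrative only, taken from neither tool) showing how a single inserted base (a DNA "bulge") frame-shifts every downstream position, inflating the Hamming distance while the Levenshtein distance stays small:

```python
# Illustrative sketch (neither GuideMaker's nor CHOPCHOP's code): a single
# inserted base frame-shifts a position-wise comparison, so the Hamming
# distance explodes while the Levenshtein distance stays small.

def hamming(a, b):
    """Mismatch count between equal-length strings (substitutions only)."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (allows indels)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # (mis)match
        prev = cur
    return prev[-1]

guide = "GACGTTACGGATCCAGTTAC"
# The same locus with one extra base inserted after position 2,
# truncated back to 20 nt:
offtarget = "GAACGTTACGGATCCAGTTA"

print(hamming(guide, offtarget))      # prints 14
print(levenshtein(guide, offtarget))  # prints 2
```

A Hamming-based seed filter would score this site as hopelessly distant (14 mismatches) even though a single insertion suffices to align it, which is exactly the failure mode the reviewer describes.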

    1. This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giac028), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: C Robin Buell

      This manuscript describes the sequencing, assembly, annotation and analysis of a cassava genome. The cassava genome has already been published but this manuscript describes the genome of a heterozygous cultivar rather than the slightly inbred cultivar published previously. The authors performed the assembly using a number of assembler programs and benchmarked each assembly. Not surprisingly, they found that hifiasm worked the best with HiFi reads. The authors then did annotation of the genome and performed a set of analyses including allele specific expression and pan-genome analyses.

The manuscript and its genome will be of use to a range of users in the genomics field. I do feel that the manuscript is exceedingly long and reads more like a dissertation than a research article. A significant portion of the text could be deleted without impacting the take-home messages of the manuscript. For example, the analysis of allele-specific expression, alternative splice form expression, and the pan-genome is extremely limited in depth and breadth. If these remain in the manuscript, the authors should perform more extended analyses, including examining a wider range of tissues and genomes, as there are extensive genomic resources available for cassava. It would be nice to tie this complete, phased assembly to the diversity analyses done previously with cassava that revealed the basis of genetic load.

De novo annotation of the assembly was not performed. Instead, the authors projected the reference annotation onto their assembly and then did alignments with transcript data derived from Iso-Seq. The authors are misinterpreting the pseudogenes. As shown earlier by Gan et al. (2011) with Arabidopsis, projecting a reference annotation onto other genome assemblies fails to capture alternative splice forms, and thus predictions of pseudogenes from projected annotation are grossly inaccurate. De novo annotation using cognate transcript evidence should be performed to ensure artifacts are not introduced into the annotation. This would also allow the authors to more deeply investigate the dysfunctional/deleterious alleles that are present in cassava, a vegetatively propagated crop.

    2. This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giac028), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Zehong Ding

In this manuscript, Qi et al. assembled two chromosome-scale haploid genomes of African cassava TME204, validated the structural and phasing accuracy of the haplotigs with BACs and a high-density genetic map, revealed extensive chromosome rearrangements and abundant intra-genomic and inter-genomic divergent sequences, analyzed allele-specific expression patterns in different tissues, and built a cassava pan-genome, demonstrating its importance in downstream omics analysis. Overall, this work is of crucial importance and should be sufficient to publish in GigaScience. However, I found that this manuscript lacks basic logical consistency and some analyses have major flaws. Please see the details below:

1) According to Supplementary Table 10, there were at least 9 different tissues in the TME204 Illumina RNA-seq data. However, when performing the analysis of 'Tissue specific differentially expressed transcripts' (Line 393), why did the authors compare only leaf and stem while ignoring the remaining tissues? This is illogical.

2) Two cassava haplotypes (H1 and H2) were constructed in this study. In Table 4 and Supplementary Figure 9, why did the authors perform the comparison 'TME204 H1 vs. AM560' but not mention the comparison 'TME204 H2 vs. AM560' at all? Similarly, in Fig. 8 and Fig. 10c, the analysis was performed for 'TME204 H1' but not for 'TME204 H2'.

3) In Fig. 7C, ASE should be a comparison of expression levels between H1 and H2, so why are the legends still H1 (red bar) and H2 (blue bar)? I cannot understand this. Also, Fig. 7D is very difficult to understand. E.g., what is the meaning of the labels (e.g., "leaf_H1" and "Stem_H1; Leaf_H1") on the x-axis? Logically, there are "stem_H1; leaf_H1", "stem_H1; leaf_H2", and "stem_H2; leaf_H2"; then where is "stem_H2; leaf_H1"?

4) Fig. 6d, Lines 110-111: "The transcriptome comparison between TME204 leaf and stem tissues identified gene loci with associated transcripts that were differentially regulated in one haplotype only." This statement is not supported, because a comparison between leaf and stem alone cannot establish that the transcripts were differentially regulated in one haplotype only. Thus, the sentences in Lines 407-408 also need to be revised.

      Other suggestions to the authors:

• Fig. 6a: what is the meaning of Het_Uniq, Het_Dup, Hom_Uniq, and Hom_Dup?

• Fig. 6d: what is the meaning of the legend bar? Log2(leaf/stem) or log2(stem/leaf)?

• Ref. 30 cannot be cited because it is still under preparation.

• In the 'Conclusions' section, the statement "The haplotype-resolved genome allows the first systematic view of the heterozygous diploid genome organization in cassava." is inaccurate, because two haplotypes of a heterozygous cassava genome have already been published in Hu et al. (2021, Molecular Plant, 10.1016/j.molp.2021.04.009).

      • The title is also suggested to be changed because it is not attractive.

• The citations of 'Figure 10b' (Line 497) and 'Figure 10c' (Line 502) are wrong.

  12. Mar 2022
    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac017), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 3: James J Cai

This manuscript introduces a k-NN based feature selection method, Triku, as one key step to secure informative features in analyzing single-cell RNA sequencing datasets. The authors argue that most current feature selection methods are biased toward highly expressed genes rather than the actual gene markers defining the cell populations. Instead, they focus on the local signature of gene expression for each gene and compute how each deviates from its null distribution. A gene list ranked by this deviation is derived after the median correction. The authors use the Silhouette coefficient to validate their conclusion of better modularity in comparison to other methods. Additionally, the randomness and robustness of the method are well discussed. In general, this article is well-organized and well-written. The examples of artificial and benchmark datasets showing certain aspects of improvement compared to current methods are illustrative. Triku will be a valuable contribution to the single-cell analysis field. The reviewer has some minor comments to help improve the manuscript further:

1. The authors compare Triku to many other widely used benchmark methods but exclude Seurat. Although the Seurat method is adopted in Scanpy, as they state in "FS methods", the default flavor in Scanpy is "seurat" rather than "seurat_v3", the default feature selection method in the latest version of Seurat. It might be good to make this clear. Also, sctransform, another alternative yet popular method from Seurat, is not on the comparison list.

2. The evidence for "we observed that in certain datasets the Wasserstein distances tend to slightly increase with the mean expression of the genes" should be shown to motivate the necessity of further correction. Also, the reason why the median correction outperforms other correction methods is left unexplained. For example, Seurat, which also uses a binning-based correction, uses the mean to control the strong relationship between variability and average expression.

      3. Since the authors integrate into the pipeline the k-NN module, which is considered computationally expensive, it would be great to evaluate the time complexity/running speed compared with other methods.

      4. Triku assumes that the local transcriptomic similarity is more likely to define cell types. Apart from clustering, which might be better-quality after Triku, it would be interesting to show any potential effects to other popular downstream analyses in the single cell field, such as trajectory inference, given that Triku is subject to locality.

      5. Triku builds k-NN graph on UMAP all the way around. To validate the robustness of Triku, one could also discuss alternative low embedding methods like t-SNE in the section of "robustness".

      6. Since Triku is likely to identify locally over-expressed genes, it would be interesting to see the overlap between features selected by Triku and the differential expressed genes, if the setting is possible to arrange to make the two comparable.

      7. In the section of previous work, some claims were made without references. For instance, "Early methods for FS in scRNA-seq data were based on the idea that genes whose expression show a greater dispersion across the dataset are the ones that best capture the biological structure of the dataset". Another example of relevant references missing is https://pubmed.ncbi.nlm.nih.gov/31861624/.

      8. Fig. S2 does not show exact gene names. For artificial data, why those four genes are representative is left unexplained.

      9. The authors classify reference 11, the dropout-based method as "a new generation". As far as I know, the benchmark M3Drop was published in 2018.
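As background to point 2 on the median correction: for two equal-size 1-D empirical samples, the Wasserstein-1 distance reduces to the mean absolute difference between their sorted values. A minimal sketch (an assumed simplification for illustration, not Triku's actual implementation):

```python
# Assumed simplification (not Triku's code): the 1-D Wasserstein-1 distance
# between equal-size empirical samples equals the mean absolute difference
# of their sorted values -- the per-gene quantity the correction rescales.

def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size 1-D empirical samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# e.g. observed per-neighborhood summaries for a gene vs. a null draw:
observed = [0, 0, 0, 5, 9, 10]
null = [1, 2, 3, 4, 5, 6]
print(wasserstein_1d(observed, null))  # prints 2.5
```

Because this raw distance is on the scale of the expression values themselves, it is plausible that it drifts with mean expression, which is why showing the evidence for the bin-wise (median) correction, as requested above, would strengthen the paper.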

      This Reviewer's comments were prepared with assistance from my graduate student Yongjian Yang.

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac017), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Rhonda Bacher

      The manuscript presents a new method, Triku, for feature selection in single-cell RNA-seq data. Feature selection is performed upstream of tasks such as clustering and differential expression to reduce the effect of genes with noisy expression. Triku uses a KNN approach to identify features that are unexpected within cells that are transcriptionally close. Overall, the manuscript is well-written, presented clearly, and is a promising new method for feature selection. The figures are also very nice.

      Major:

1. In Figure 4, it is not obvious why different methods would rank so differently between the two datasets. What methods did those papers originally use for feature selection (if available)? Does that partially explain the differences?

      2. Figure 6, the left-most plot does not belong? It is not described in the legend.

      3. It would be helpful to note somewhere which category of methods the others belong to (i.e. variance based or distribution based).

      4. Some additional results and discussion on the number of genes selected. 250-500 is quite low and may explain the poor overlap between genes selected. In my experience with commonly used methods from the scran or Seurat package a more typical number of genes selected is around 2,000. What are the typical numbers used/recommended for the other methods compared to here? Does the performance difference remain when expanded to the top 2,000 genes? And is the performance better for Triku on 250 compared to 2,000?

      5. In methods, "By default, the number of features is the one automatically selected by triku." These values should be put into the supplement to get a better idea of how many genes are being selected by default.

      Minor:

      1. In Figure 3, I would label the top and bottom as A and B, I initially misread the legend as top 250 and bottom 500 genes.

      2. What are the approximate run times a user can expect for this method?

    3. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac017), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Christoph Ziegenhain

      In the present manuscript, Ascension and colleagues introduce a new feature selection method for scRNA-seq data to increase the relevance of selected genes for downstream analysis such as clustering. I am happy to see that the tool is deposited as an open-source package that seems easy to install and plugs in seamlessly into the commonly used scanpy workflow / AnnData data structure. The documentation is sufficient to get users started. While the method has merit as a smarter approach to feature selection, the manuscript would benefit from some additional work in terms of both text and analysis.

      Major points

1) While Triku's strategy is introduced as superior to preexisting methods, it seems that the strong improvements (at least for the NMI summary statistic) on synthetic data become rather incremental in the real-world datasets of Mereu et al. and Deng et al. The authors should discuss reasons for this difference. In light of the small differences and the fact that performance is only measured with abstract summary scores, it would be more convincing if the authors presented concrete cases where the application of Triku yields a difference in clustering or downstream analysis of biological relevance. The currently presented Gene Ontology / gene set enrichment analyses are too diffuse and do not give the reader a feeling for the impact Triku could make on their analysis.

      2) Comparison to other FS methods: Currently, the most widely used method would probably be Seurat's FindVariableFeatures. It would be good to run the presented example data also via Seurat and include it in all comparisons (eg. Fig. 3-6).

3) Precision of text: There are quite a few statements throughout the text that seem slightly inaccurate, and the authors should work in their revision on precision and on guiding the reader through the background and performed work with a bit more clarity. Example: the discussion of observed zeroes in UMI data being well described by the Poisson or NB distributions was not first realized by Svensson et al. but had been described several years before; compare Vieth et al., 2017, Bioinformatics and Chen et al., 2018, Genome Biology.

4) One of the main assumptions of Triku is that important genes get "switched on", i.e. change their state from essentially not expressed to a relatively high expression level. I am wondering if the authors can comment on the performance of Triku in cases where the main difference between cells is a gradual change in already expressed genes, and whether such a difference might get lost/masked by the selection performed by Triku.

      Minor points

5) What is the rationale for selecting the % of zero expression as the descriptive statistic within the k-NN neighborhood? If a gene occurs in fewer cells but with higher expression, its dispersion would be higher too. This needs to be justified more precisely, and ideally the authors would add a version of Triku that works on dispersion (to show possible differences).

      6) Three main types of feature selection methods are introduced but not defined/explained further (p. 2)

      7) Since Triku performs more calculations/steps than existing methods for FS, the runtime is presumably higher. The authors should compare and comment on runtime.
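The alternative raised in point 5 can be made concrete with a toy comparison of the two per-neighborhood summaries (an assumed simplification for illustration, not Triku's actual statistic):

```python
import statistics

# Toy illustration of point 5 (assumed simplification, not Triku's code):
# summarize a gene within a neighborhood of cells by (a) its fraction of
# zero counts and (b) its dispersion (variance / mean).

def zero_fraction(counts):
    return sum(c == 0 for c in counts) / len(counts)

def dispersion(counts):
    m = statistics.mean(counts)
    return statistics.variance(counts) / m if m else 0.0

# Gene A: "switched on" in a few cells at a high level.
gene_a = [0] * 90 + [10] * 10
# Gene B: lowly but broadly expressed.
gene_b = [1] * 50 + [0, 2] * 25

print(zero_fraction(gene_a), zero_fraction(gene_b))  # 0.9 0.25
print(dispersion(gene_a) > dispersion(gene_b))       # True
```

As the toy shows, both summaries flag the "switched-on" gene here, which is exactly why a dispersion-based ablation of Triku would be informative: it would reveal whether the zero-fraction statistic adds anything beyond what dispersion already captures.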

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac011), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Gregg Thomas

      This paper presents 17 new insect genomes from the order of caddisflies (Trichoptera). The authors combine these genomes with 9 previously sequenced genomes to analyze genome size evolution across the order. They find that genome size tends to correlate with evolution of repeat elements, specifically expansion of transposable elements (TEs). Interestingly, the authors also notice that TE expansions also correlate with gene copy-number (or gene fragment copy-number), even of highly conserved genes used to assess genome completeness. Overall, I find this paper very well written and easy to follow. The genomic resources and analyses presented provide novel new resources and findings for insects in the order Trichoptera, with potential implications beyond. I have only minor suggestions before publication, outlined below.

      1. Regarding the TE and BUSCO gene fragment associations, while I think this is a really interesting analysis, I found the underlying models a bit difficult to understand. Line 236 reads, "To test whether repetitive fragments were due to TE insertions near or in the BUSCO genes or, conversely, due to the proliferation of 'true' BUSCO protein-coding gene fragments…" Is the idea that a BUSCO gene has been duplicated itself and then one copy is either fragmented by a TE insertion or hitch-hikes with a TE (as mentioned on line 501)? Or are these fragments only of BUSCO genes that didn't match a full BUSCO gene at all, but the fragments that did match had unexpectedly high coverage? I guess I'm just confused as to whether a gene duplication needs to precede the TE insertions/hitch-hiking, which is subsequently pseudogenized either prior to or because of the TE activity, or if these are gene losses. I understand how the TE could inflate the coverage of these fragments, but I guess I'm still not clear on how these fragments arise in the first place. Any clarification would be helpful! Also, if the case is that these are fragments of BUSCO genes that have no full matches in the genome, how might assembly contiguity or quality be affecting these matches?

      2. One thing that I noticed throughout the figures is that branch B1, leading to A. sexmaculata, the branch leading to clade A, and the branch leading to clade B (as labeled in Figures 1 and 2) appear to form a polytomy. I don't find this mentioned in the text and am wondering why this relationship remains unresolved with these data. I don't think this has any bearing on the results, since all analyses are done on the tips of the tree, but I think readers looking at these trees will want to know what is going on at that node.

3. The authors use custom scripts for their BUSCO-TE correlation analysis and provide a link to a Box folder on line 514. I would request that these scripts be put somewhere more stable and accessible (e.g., GitHub). Not only was I asked to log in when clicking the link, but after I had done so, the link didn't seem to exist.

      Minor/editorial points

1. Would the authors be able to report concordance factors for the species tree? I think this should be easy enough with IQ-TREE, and it is something I ask everyone to do. This may also help answer my question about the polytomy.

2. The authors do a good job of mentioning and citing programs used throughout the manuscript but seem to skip this in the Assembly section (starting on line 398). "First, we applied a long-read assembly method…" Which one? Same for "de novo hybrid assembly approaches." I see that assembly is covered in detail in the Supplement, but I think naming the main programs used (wtdbg2 and MaSuRCA) should be in the main text.

      3. Line 281-282: I think some of the brackets and parentheses here are mismatched or un-closed.

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac011), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Julie Blommaert

      Summary of the paper and overall impression

In their paper, "Genome size evolution in the diverse insect order Trichoptera", Heckenhauer et al. report 14-fold variation in genome size in caddisflies. The authors find evidence for increases in transposable elements associated with larger genomes, and report that in caddisflies living in less stable environments, some genes are replicated in association with transposable elements. Overall, this paper represents a comprehensive collection of data; however, I have some concerns about the reporting of the methods, some of the analyses, and the conclusions. To support some of the conclusions, namely that WGD or large-scale duplications do not play a role in caddisfly genomes, I believe the authors could perform additional analyses. Further, I was left confused by the descriptions of the methods, especially around the replicated BUSCO gene analyses. Please see my comments below.

      Main comments:

1. The authors report that their gene-age distribution analyses do not support the hypothesis of a WGD, but given previous suggestions that WGDs are important in these species, the authors should conduct additional analyses (e.g. Smudgeplot, minor allele frequency distributions in single-copy genes) to rule out this possibility. While it can indeed be difficult to strike a balance between absence of evidence and evidence of absence, more effort should go into resolving the matter of WGD in caddisflies. Some of the GenomeScope peaks, and some of the coverage peaks from the backmap approach, seem to at least hint at large-scale duplications or variations in copy number. Further analyses should also consider whether assembled gene copies may be collapsed duplicates.

2. I admit I am confused by the terminology around the TE-associated BUSCO genes. Are these cases where BUSCO has reported a high number of duplicates, or where BUSCO-annotated regions have high coverage? Two things need to be clarified here: what made them stick out in the first place (coverage? duplications?), and what they really are (TEs that BUSCO mistook for BUSCOs? fragments of real BUSCOs attached to TEs?).

      Minor comments:

1. Lines 53-57: "Genome size can vary widely among closely and distantly related species, though our knowledge is still scarce for many non-model groups. This is especially true for insects, which account for much of the earth's species diversity. To date 1,345 insect genome size estimates have been published, representing less than 0.15% of all described insect species." While I appreciate the authors' point that relatively little data is available about genome size and that only a small proportion of non-model insects are in the Animal Genome Size database, this is the case for all groups, and insects actually represent the largest group of invertebrates in the AGSDb. However, this does not mean insects, or chironomids, are a poor system to study this in, so the authors could reframe this first sentence to justify the study system with something more than highlighting how understudied this is in insects.

      2. Line 76: correct to "In insects, the KNOWN ranges of genomic repeat proportion are…"

3. Lines 89-91: Why are species-rich groups a better system for studying RE evolution and environmental interactions than e.g. populations, species complexes, recently diverged species, or groups in the process of speciation?

      4. Lines 113-115: The data description does not, in my opinion, need to justify the species selection since this is done in the intro

5. Genome size estimates: sequencing-based estimates can also be impacted by GC content, especially in libraries produced using PCR; this may be a useful point regarding the differences between FCM and sequencing-based estimates.

6. RepeatModeler versions are inconsistent: Line 463 says v2, but the earlier text says v1.

7. Lines 468-469: What did you use to merge the RepeatMasker .out files?

8. All read-based analyses: were these run on decontaminated read libraries? If so, please briefly clarify this in the main manuscript. (Genome size with GenomeScope: lines 444-448; RepeatExplorer: lines 471-479.)

      9. Why only use dnaPipeTE for repeat divergences and not also abundances? Does dnaPipeTE agree with RepeatExplorer?

      10. Line 495: What is meant by "BUSCO genes showed regions of unexpected high copy number…"? Are these genes reported by BUSCO as duplicated or is this referring to increased coverage?

      11. Lines 506-507: "We used copy number profiles to identify BUSCO genes with repetitive sequences based on coverage profiles" The meaning of this is unclear. The reported copy number from BUSCO? Coverage of mapped reads?

12. Table 1: please report the full BUSCO summary (e.g. C:39.7%[S:39.2%,D:0.5%],F:35.8%,M:24.5%,n:2442) for each species; lumping complete and fragmented together is unnecessary, and readers are usually interested enough in the full complement of BUSCOs that it should not be in the supplements but in the main paper.

13. Coverages from the backmap method can and should be compared to GenomeScope kcov estimates (while correcting for k-mer size; see here for a brief explanation: https://www.biostars.org/p/221672/); this will validate both approaches and offer further evidence when considering polyploidy.

14. In the supplementary note about TAGC plots, Figures S31, S36, S38, S44, S45, S46, and S47 don't list contaminant exclusion criteria. If contaminants weren't removed this needs to be stated, and in some cases, especially those where there are different "blobs" (e.g. S47), justified.

      15. Supplementary note 9: Figure reference is wrong?

      16. Supplementary note 10: Can coverage comparisons using average BUSCO coverage be re-run using corrected kcov estimates? This would validate the BUSCO coverage approach.

      17. Supp Data 1: Coverage estimates would be more accurate if based on FCM measurements and total sequenced bp (before and after decontamination) and can also be compared to corrected kcov estimates

18. Limnephilus lunatus has too low coverage to get a reliable GenomeScope estimate.
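The k-mer-size correction referred to in comment 13 is, as I understand the linked explanation, the standard conversion from k-mer coverage to per-base read coverage; a minimal sketch in Python (the function name and example values are illustrative, not from the paper):

```python
def kmer_cov_to_read_cov(kcov, read_len, k):
    """Convert a k-mer coverage estimate (e.g. GenomeScope's kcov) to
    per-base read coverage: a read of length L contributes L - k + 1
    k-mers, so C = kcov * L / (L - k + 1)."""
    return kcov * read_len / (read_len - k + 1)

# e.g. a kcov of 25x from a k=21 GenomeScope model on 150 bp reads
print(round(kmer_cov_to_read_cov(25, 150, 21), 1))  # 28.8
```

Comparing this corrected value against the mapped-read coverage from backmap is what would validate both approaches.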

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac006), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Surya Saha

The publication describes a useful tool to quickly survey a range of QC metrics for genomes available in NCBI. The a3cat toolkit can be used to set up as well as update the assessment results for public or private assemblies for a user-defined taxon. Overall, the website and the workflow on GitLab are a useful resource for the genomics community to ask a number of comparative genomics questions. I enjoyed reading this manuscript and only have minor comments. I would like to bring some more use cases to the attention of the authors that can enrich the discussion.

The authors have already presented nuggets from mining the data, but here are a few thoughts on how the value of the results reported here can be further improved. Given an assembly from an insect with an approximate taxonomic classification based on morphology or genetic markers, can the a3cat results be used to figure out the best reference genome, or a set of closely related genomes, for comparative analysis of the gene space? One idea could be to use the overlap of lineage-specific BUSCO genes found in the new genome with BUSCO genes present in other assemblies to identify related genomes.

The discussion covers results filtered by level (contig, scaffold, chromosome) or type (haploid, principal or alternate pseudohaplotype). It might be worthwhile to further segment the results based on the input raw data (e.g. short reads, short reads + mate pairs, long reads) to explore whether the contiguity of the assembly and the completeness and duplication of the gene space are impacted by the proportion of indels in the raw reads, irrespective of read length. There are a number of other relevant variables, like assembly algorithm and parameters, but those can lead to very sparse data. The authors talk about the proportion of repeat content in larger genomes. This might be a valuable resource to add to the a3cat results, as initiatives like Ag100Pest and DToL are producing high-quality insect genomes of >1-2 Gbp with large numbers of repeats that are going to be better assembled than ever before with high-fidelity long reads. Adding the results of a widely used de novo repeat identification tool like RepeatModeler, based on the Dfam database, would provide a consistent measure of repeat content across all analyzed genomes and add to the value of this toolkit. In case some of this information is already available in NCBI, it can be pulled using the API, avoiding the need for this massive compute job.

This next issue is related to BUSCO but affects the results and conclusions of the a3cat tool. Is it possible that some of the BUSCO marker genes (from OrthoDB v9 or v10) are based on short-read assemblies with minor errors in gene models? When run on recent assemblies based on high-fidelity long reads with the correctly assembled gene model, BUSCO might report the marker as missing or fragmented. I understand this is outside the scope of this paper, but if this is possible, it should be mentioned as a potential pitfall.

      A common problem with bioinformatics resources is the lack of a sustainability plan. I know this is difficult to pin down for the mid or long term in the face of unpredictable funding but I would like to encourage the authors to present a plan to manage and update the web resource if at all possible. For future work, it might be a good idea to consider the extension of the a3cat toolkit to include other metrics beyond the current contiguity and gene space completeness measures. Mash or ANI distances are becoming computationally tractable for large data sets. I have already mentioned the repeat content issue. Long range similarity measures based on Hi-C data or nucleotide composition based on kmer analysis might be other items to ponder.

      Minor revisions

Since the logic and applicability of this work are so straightforward, some of the text can be shortened to reduce duplication. For example, on Pg 4 this paragraph can be shortened: "Using their Complete Proteome…. for selected groups of species from their field of interest." In the same paragraph, I see "(i) aid project design, particularly in the context of comparative genomics analyses; (ii) simplify comparisons of the quality of their own data with that of existing assemblies; and (iii) provide a means to survey accumulating genomics resources of interest to their ongoing research projects." Can the difference between (i) and (iii) be clearly explained?

      Typographical errors

On Pg 8, the abbreviation CoL needs an explanation.

      On Pg 12, can the term span be elaborated?

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac006), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Stephen Richards

We are now entering a period with rapidly increasing numbers of arthropod genome assemblies. Quality has vastly improved because of new high-quality long-read technologies, but can still be uneven.

      Comparative genomics requires at least some effort to ensure the datasets are comparable. Here the authors have produced a nice tool to help find sequenced arthropod genomes and compare their quality.

They use their previous experience with BUSCO to measure quality, and overall I expect I will be using this resource quite a lot.

      I also expect a lot of people will use this resource to identify high quality assemblies for comparative analysis.

One possible addition that would be useful is completeness plots: things like the number of orders with a representative, families, etc., partly to show progress and partly so missing taxa can be easily identified.

      The manuscript is well written, but more importantly the data and methods are easily accessed, and everything is well written up.

The tool and website do what they say on the tin, and I can't really see any reason not to publish rapidly.

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac002), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 3: Yaomin Xu

The authors presented a web tool, NETMAGE, that produces an interactive network-based visualization of disease cross-phenotype relationships based on PheWAS summary statistics. NETMAGE provides search functions for various attributes and allows selecting nodes to view related phenotypes, associated SNPs, and various network statistics. As a use case, the authors used NETMAGE to construct a network from UK Biobank (UKBB) PheWAS summary statistics. The purpose of the tool, as claimed by the authors, is to provide a holistic, network-based view for an intuitive understanding of the relationships between disease phenotypes and to help analyze their shared genetic etiology.

      Major comments:

A DDN based on true genetic associations is useful for understanding complex disease comorbidities and their shared genetic etiology (pleiotropy), and an interactive web tool to explore such complex networked information could be highly useful for the proposed purposes of this tool. However, EHR/biobank PheWAS associations are statistical in nature and commonly have small effect sizes. The reported genetic associations are often not well understood at the mechanistic level, and many are spurious. Although certain positive findings can be observed in the disease network generated by NETMAGE, the general usability of the current implementation is of concern for the proposed novel applications in drug design and personalized medicine, which require the genetic associations to best represent the underlying true causal mechanism. Further work is needed to verify the genetic associations reported from PheWAS to minimize the impact of spurious associations. Network edges based on SNPs without considering linkage disequilibrium (LD) between SNPs are misleading and could miss a significant portion of associations that should link diseases if the LD correlations were considered. When constructing the network using NETMAGE, the LD correlation between SNPs should be taken into account.

For the reported DDN and its statistics to be relevant to true disease-disease relationships, the quality of disease diagnosis using Phecodes should be considered. Phecodes are based on ICD codes, which are known to be noisy; the accuracy of ICD codes can be as low as 50%. Ignoring this limitation and treating disease diagnoses from Phecodes as gold standards, or as precise and accurate, may result in irrelevant and misleading findings.

Phecodes are hierarchical. For example, parent codes are three digits (008), and each additional digit after the decimal point indicates a subset of the ICD codes of the parent code (008.5 and 008.52). So a code of 008.52 implies 008.5 and also 008. What is the impact of this hierarchy on the NETMAGE network and the inferences to be made based on the network?

      Minor comments:

      On Page 9, you said "Out of the 2189 edges for which phi correlations could be calculated, 1811 (82.73%) appeared in the DDN. This behavior suggests that our genetic associations identified by our PheWAS results serve as a reasonable approximation of disease co-occurrences".

This is expected because both the phi correlation and the PheWAS analyses were performed on the same dataset. If a pair of diseases highly co-occur in the dataset, you would expect a strong correlation in their genetic associations analyzed on that same dataset. However, it may not be generalizable that the genetic associations from PheWAS are a reasonable approximation to disease co-occurrences. The disease-SNP relationships from the PheWAS analysis are bipartite. Even though NETMAGE focuses on the projected disease-disease network, the information about how specific SNPs link to their corresponding disease pairs is important. For example, in your UKBB-based network (https://hdpm.biomedinfolab.com/ddn/ukbb), when a specific disease is selected, a subgraph of the selected disease and the other diseases linked to it is shown, but only a lump of SNPs, without links to their specific disease pairs, is provided. This is not helpful. Also, annotating those SNPs with their genetic context could be very useful for users to quickly grasp the nature of the genetic associations in the subgraph.
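As a point of reference, the phi correlation discussed above is computed from a 2x2 disease co-occurrence table; a minimal sketch, with made-up patient counts for illustration:

```python
from math import sqrt

def phi_coefficient(n11, n10, n01, n00):
    """Phi correlation for a disease pair from a 2x2 co-occurrence table:
    n11 = patients with both diseases, n10/n01 = one disease only,
    n00 = neither disease."""
    num = n11 * n00 - n10 * n01
    den = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return num / den if den else 0.0

# e.g. 30 patients with both diseases, 20 and 10 with one, 940 with neither
print(round(phi_coefficient(30, 20, 10, 940), 3))  # 0.656
```

This is the quantity being compared against the shared-SNP edges of the DDN, which is why agreement on the same cohort is unsurprising.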

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac002), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Dongjun Chung

In this paper, the authors developed the humaN disease phenotype Map Generator (NETMAGE), a web-based tool that produces interactive disease-disease network visualizations based on PheWAS summary statistics. The tool proposed in this manuscript has important implications and utility for biological and clinical studies. The manuscript is also overall well written and clearly describes NETMAGE. However, there are still some aspects I hope the authors will address. I provide my comments in detail below.

      Major comments:

1. I tried the web interface Human-Disease Phenotype Map (https://hdpm.biomedinfolab.com), which utilizes NETMAGE. I found that sometimes it takes some time for the network to appear. While the network is loading, only the gray empty space with the side panel is shown. I recommend the authors show a progress bar while the network loads, especially when it is first loaded, so that users do not think their web browser is frozen.

2. In the Search bar, it is not always trivial to guess what to enter, especially for Phenotype Name, Associated SNPs, and Category. Auto-completion features for these fields would significantly improve users' convenience.

3. The meaning of edges is somewhat unclear to me. Are the existence and weights of edges purely based on the number of shared SNPs, or are they based on any statistical methods?

4. When the weights of edges are calculated, are the marginal counts taken into account? The same number of shared SNPs can have different meanings depending on whether the disease to which an edge is connected has a small or a large number of associated SNPs. How is this factor considered?

5. The network generated by the Human-Disease Phenotype Map (https://hdpm.biomedinfolab.com) is usually huge and complex, with a large number of edges. As a result, it is often not straightforward to understand the generated network. This is partially because the network layout is static, i.e., the locations of nodes remain the same regardless of which subnetworks are chosen. If the network layout were optimized for each subnetwork, it would be much easier for users to understand the network architecture. Given this, I recommend the authors consider updating the network layout interactively when a subnetwork is selected.

      6. When a subnetwork is chosen, the "Information Pane" appears. In this pane, it might be helpful for users if the authors provide some quick help link for each network score, e.g., how to interpret PageRank scores, etc.

7. In the "Information Pane", a long list of SNPs is provided for "Associated SNPs", but it is not easy to use this list. I recommend the authors make it downloadable as a table so that users can perform downstream analyses. In addition, it would significantly improve users' convenience if clicking each SNP ID brought the user to the relevant database, e.g. dbSNP. In this way, users could easily check where it is located in terms of chromosome, gene, and exon/intron/promoter/intergenic context. Alternatively, the authors could consider using a quick information table (SNP ID, gene name, exon/intron/promoter/intergenic) instead of simply providing a list.
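One way to address the marginal-count concern in comment 4 would be to normalize the shared-SNP count by the union of the two diseases' associated SNP sets (the Jaccard index); a minimal sketch, with hypothetical SNP IDs (a suggestion only, not the tool's actual weighting):

```python
def jaccard_edge_weight(snps_a, snps_b):
    """Edge weight between two diseases as the Jaccard index of their
    associated SNP sets, so the same shared count weighs less when a
    disease has many associated SNPs."""
    a, b = set(snps_a), set(snps_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# two shared SNPs out of 3 total vs. two shared out of 100 total
print(round(jaccard_edge_weight({"rs1", "rs2", "rs3"}, {"rs2", "rs3"}), 2))          # 0.67
print(round(jaccard_edge_weight({f"rs{i}" for i in range(100)}, {"rs1", "rs2"}), 2))  # 0.02
```

Any such normalization would make edge weights comparable across diseases with very different numbers of associated SNPs.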

    3. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac002), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Sarah Gagliano Taliun

      Sriram et al. introduce an open-source web-based tool NETMAGE to produce interactive disease-disease network (DDN) visualizations of biobank-level phenome-wide association summary statistics. The concept is interesting and relevant, but my major concern is regarding the interpretability of the DDN for researchers and clinicians to draw insights intuitively.

      Comments on the manuscript:

Generally well written, with a logical flow. There are some minor errors (e.g. "an SNP" rather than "a SNP"), and some headers could be improved for readability (e.g. "Testing" is vague; this section really only touches upon run time).

      Figure 1- Displaying a single Manhattan plot for "PheWAS Summary Statistics" is not very intuitive. It makes me think of a single GWAS rather than a phenome-wide set of GWAS run on a Biobank. Perhaps revise the image.

      Is the disease-disease network only applicable to case/control studies? Could there be an extension to quantitative traits, and if so, would that be pertinent for discoveries?

The authors refer to "SNPs" throughout to denote genetic variation. If the summary statistics contain other types of variation (e.g. indels), are those associations still used? If so, I would suggest using a more generic term for the genetic variation.

      The discussion seems underdeveloped. Discussion of limitations rather than only future work would be helpful.

Case study-- The authors could improve the interpretability/discussion of the UKB PheWAS example. This is one of my largest concerns, because the authors state that the tool can help researchers and clinicians gain insight into the underlying genetic architecture of disease complications; however, the case study part of the manuscript is quite technical and could be challenging to interpret for someone without network experience, e.g. Table 2.

      Additionally, more details should be provided on the underlying summary statistics used (e.g. some details can be found on the About page of the HRC-imputed UKB PheWeb page: https://pheweb.org/UKB-SAIGE/about).

The authors list additional filtering that they performed on the summary statistics, but it appears that some details are missing. For instance, how many traits remain after the case-count filtering is applied? Also, what is used as a reference for the LD pruning in PLINK?

Run time-- I am wondering why Table 3 (run time for subsets of the UKBB data) ends at 1,000 phenotypes. It would be interesting to see a run time close to the case example (e.g. possibly adding a column for the total number of phenotypes used in the UKBB DDN). Additionally, this section gives the impression that run time depends only on the number of phenotypes; I would assume that run time also depends on the number of variants tested.

      Comments on the online tool:

      It is nice that on each page the authors have allowed users to download a pdf of the image and also the data behind the image (e.g. edge-map, node-map, etc.). The zoom-in feature for the visualization is also useful, as is the short video tutorial.

      I think that the search bar would be more user-friendly if suggestions automatically came up when the user begins to type. Additionally, displaying the list of "associated SNPs" in a (sortable and/or searchable) table (with some annotations, such as chr, position, closest gene, consequence, rather than just rsID) could be a neater and more informative way to show these data, rather than simply as it appears currently as a list in the "information pane".

My comment on interpretability for researchers and clinicians comes up again: I am not sure how useful/interpretable some of the search categories are for users to intuitively draw insights; for instance, number of triangles, PageRank, etc. I think the authors should really focus on intuitiveness for the target audience so that the tool can have more impact.

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac010), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Juan Alzate

The present paper, entitled "Fully resolved assembly of Cryptosporidium parvum", shows the results of genomic sequencing of the protozoan parasite C. parvum using both 2nd- (NovaSeq) and 3rd- (ONT) generation NGS technologies. Additionally, the authors assembled the C. parvum genome and compared their results with the previous C. parvum IOWA II reference. The authors also undertook some QC analyses to validate the chromosome models.

The paper is interesting because there is a need for a fully resolved Cryptosporidium genome. The sequencing by itself is not much of an achievement; the authors applied commercially available platforms. In the assembly process, they also used already-known assemblers and mapping tools. I think BUSCO does not deliver the detailed results expected here. Perhaps a more comprehensive analysis, including all the single-copy genes present in C. parvum, could help to better support the quality of the genome.

One additional recommendation is that the authors present a detailed analysis of single-nucleotide variants (SNVs). These data can be extracted from the same BAM files the authors already generated for the structural variant analysis. This analysis is particularly important because it can show the readers how clonal the C. parvum strain used is.

I don't know if this is possible, but can you compare your genome model with the one published at bioRxiv, DOI: 10.1101/2021.01.29.428682?

Please make the raw read data public (NovaSeq and ONT raw reads).

Please explain in more detail in the Methods section how you found and analyzed the structural variants.

I don't understand why the genome size was estimated. Could you explain?

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac010), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Matthew Knox

Overall, Menon et al. present a significant contribution to the field with this work. Their fully resolved assembly of Cryptosporidium parvum is, to my knowledge, the first to utilize long-read sequencing for whole-genome sequencing in this group of protozoan parasites, and as such it provides validation of previously published work while also improving on current reference standards and providing a robust, well-described analysis pipeline for future studies.

In my view, there are only a couple of issues with the paper that should be addressed. The first is a discussion of recent work using metabarcoding (e.g. DOI 10.1016/j.meegid.2012.08.017, DOI 10.1016/j.ijpara.2017.03.003), which demonstrates mixed infections in clinical samples of patients infected with Cryptosporidium that were missed with consensus Sanger sequencing. In some cases, mixtures of subtype families can be found, though dominance of a single subtype with a few closely related variants is more common and more likely in the current paper. Nonetheless, this may have implications for sequencing, since purity of the "culture" cannot be guaranteed and results from the lack of reliable in vitro culture methods for Cryptosporidium.

The second issue I have is with the section on comparative genomics. Strictly speaking, calling this a comparative genomics analysis is not correct, since the authors do not compare genomes with genomes. Instead, it is based on comparison with a small subset of Sanger-generated sequences and does not add much to the paper in my view. If it is to be included, the text should be rephrased to better reflect the analyses, and the identity (species, subtype, subtype family) of the sequences downloaded from GenBank should be presented in more detail. Also, it is unclear what criteria were used to select these sequences from among the many hundreds available for C. parvum, and this should be stated too.

In addition to the significant comments above, I detected a few inconsistencies and typographical errors in the submission and have included minor comments (sticky notes) in the attached PDF document. I hope the authors find this helpful in improving the manuscript.

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac005), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Peter Horvatovich

The article GIGA-D-21-00223, entitled "Democratizing Data-Independent Acquisition Proteomics Analysis on Public Cloud Infrastructures Via The Galaxy Framework", describes a targeted DIA LC-MS/MS processing workflow implemented in the Galaxy framework. The paper describes the tools integrated in the Galaxy environment and the workflow steps to process DIA LC-MS/MS data using a targeted spectral library approach. The authors used a HEK cell lysate spiked with E. coli digest at various ratios and generated DIA LC-MS/MS data from these samples on an Orbitrap QE+ with MS1 scans and 24 50%-overlapping DIA windows between 400 and 1000 m/z, in 4 replicates per condition. The implemented workflow comprises library generation from DDA data with MaxQuant, library cleaning, analysis of the DIA data with OpenSWATH, and statistical analysis using the MSstats package in R. The authors present identification and quantification of proteins in the example data (differential analysis, volcano plot, CV plot).

The article is of potential interest to the proteomics community, as it serves to promote the use of complex DIA data processing workflows through the Galaxy web interface, which would otherwise require considerable programming skill and time for the user to establish. However, the authors should address some major and minor issues before I can suggest the article be accepted.

      Major concerns:

1. The tools and the DIA processing workflows are implemented on Galaxy Europe, which uses an amount of resources (disk space, CPU, RAM) unknown to me. The authors should describe the limitations of using this online Galaxy server (maximum upload size, CPU time, whether there is any cost to use the service, RAM limits for the tools, etc.).

2. Some users do not want to use cloud-based services and public Galaxy servers, but would wish to process their data (e.g. clinical samples from humans) on their own closed local computational infrastructure. For these users the authors should provide a tutorial on how to install Galaxy (just refer to the Galaxy installation documentation) and how to get the tools from the Galaxy ToolShed and run the pipeline. Some users may already have a Galaxy server, and getting additional tools may interfere with it; therefore, I would strongly suggest creating a Docker image in which a single Galaxy instance is installed with all necessary tools, and including the raw data and settings, in order to provide a clean workflow that is sure to work.

3. I would also like to see data on the actual runtime for the example dataset, focusing especially on the FDR calculation, as the authors mention that subsampling of the data is required for this.

4. I would also present peptide-level results, as protein quantities are obtained after protein inference from multiple peptides, while the instrument measures peptides.

5. The CV distribution of proteins in Figure 4a should be compared to results from other datasets, as it shows a multimodal and broad distribution that seems to be independent of the spiking levels. This indicates some artifacts in the data.

6. The data are only subjected to retention-time alignment using iRT peptides; no normalization is applied. The authors should check the individual distributions of peptides and proteins in each replicate with box plots/violin plots and, if necessary, apply normalization to avoid "upregulated" human proteins. It would also be useful to color the dots in the volcano plot according to species (human/E. coli). The authors refer to displacement effects, which are not explained in the text (maybe ion suppression?).

      7. Please provide the distribution of missing values for each replicate, as DIA should produce data with a low percentage of missing (zero) values.
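A per-replicate missing-value summary of the kind requested here takes only a few lines; the following sketch uses an invented intensity matrix (proteins as rows, replicates as columns) purely for illustration:

```python
# Sketch: fraction of missing intensities per replicate.
# The matrix below is invented; rows are proteins, columns are replicates.
matrix = [
    [1.2e6, 9.8e5, None],   # protein A
    [3.4e5, None,  2.9e5],  # protein B
    [None,  5.1e5, 4.7e5],  # protein C
    [8.8e5, 7.6e5, 8.1e5],  # protein D
]
replicates = ["rep1", "rep2", "rep3"]

def missing_fraction(matrix, col):
    """Fraction of rows where this replicate has no measured value."""
    column = [row[col] for row in matrix]
    missing = sum(1 for v in column if v is None or v == 0)
    return missing / len(column)

fractions = {name: missing_fraction(matrix, i)
             for i, name in enumerate(replicates)}
print(fractions)  # each replicate is missing 1 of 4 values -> 0.25
```

Reporting these fractions per replicate (e.g. as a bar plot) would directly address the low-missingness claim for DIA.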

      Minors:

      1. All figures and plots look like low-resolution bitmaps. Please provide high-resolution figures, preferably as vector graphics.

      2. Figure 2B, please restrict R2 numbers to 4 decimals.

      3. Page 15, please explain what the contrast matrix is.

      4. Page 15, I would replace "time consumption" to "required execution time"

      5. The authors mention in several places (e.g. page 19 and the legend of Table 2) that they have "developed tools" for DIA analysis. This is not accurate: they did not develop the original tools, but rather integrated them into the Galaxy environment in this study. Please correct this.

      6. In figure 3 and supplementary figures 1-4 "Blot" is written, which I guess should be "Plot".

      7. Page 21, Unix is mentioned as the operating system, which I suspect is not correct; presumably Linux is used. Please provide the distribution and version number.

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac005), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Paul Stewart

      Fahrner et al have produced a very nice manuscript and corresponding pipeline. They describe a collection of DIA tools in the Galaxy framework for reproducible and version-controlled data processing. These DIA tools are an excellent addition to the growing number of proteomics-centric tools already available in Galaxy. The reviewer could find no major revisions needed and therefore only requests a few minor revisions before this is ready for publication:

      Please include page numbers in the revised manuscript to make referencing the text easier.

      Page 6

      OpenSwath and PyProphet are cited and are also used in the manuscript. Please cite one or two alternatives.

      Please consider citing a tool each time it is used in a new paragraph (e.g. MSstats).

      There is heavy reliance on conjunctive adverbs (However, ...; Thus, ...) on this page and throughout the manuscript. These can make passages a bit hard to read. Please consider rephrasing.

      Page 7

      Why "so-called histories"? Aren't they simply "Histories"?

      Page 14

      'To decrease the analysis time of the semi-supervised learning, the merged OSW results can be first subsampled using the PyProphet subsample tool and subsequently scored using the PyProphet score tool. '

      The reviewer is not familiar with this approach. Can you please give additional justification (maybe under methods?) or provide a citation that this is a reasonable approach?

      Page 15

      Please check your reference software and/or work with the journal to ensure that the web addresses are linked properly. For example, the reviewer tried copying the link "https://training.galaxyproject.org/training- %20material/topics/proteomics/tutorials/DIA_lib_OSW/tutorial.html" but a "%20" (or a space) is inserted into the URL after "training-", so the link as it appears did not work until this was removed. A less technically savvy reader may think the links are broken and will not be able to access the materials.

      Page 16

      'We identified and quantified between 25.000 to 27.000 peptides ...'

      Please be consistent with number formatting (25000 vs 25.000). Other values in the tables did not use this formatting. Please check with the journal editor for the convention.

      Figures

      Please be consistent with the axis labels. Some are uppercase and some are lowercase.

      Figure 2B

      Please round R2 to 2 or 3 decimals.

      Figure 3

      Please change the red-green color scheme to a more color-blind-friendly one (e.g. red/blue).

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab093), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Elisabetta Manduchi

      This manuscript presents a workflow for gene-gene epistasis detection which leverages functional annotation resources such as Biofilter (to reduce the search space) and FUMA (to map SNPs to genes) and investigates the results obtained via different SNP-gene mapping criteria (positional, eQTL, Chromatin contacts, and some combinations of these). Moreover, these results are compared with those obtained via a 'standard' analysis where no filtering is applied to pair selection and positional SNPgene mapping is used. Due to the challenges presented by GWAIS, leveraging functional genomics to focus the search is a valid strategy which has been employed in other recent works in the field. This is a nice work and the paper is generally well written, with sufficiently detailed methodological information. Below are some comments and questions.

      1. As indicated in recent GWAS works aimed at 'solving GWAS loci' (i.e. determining the genes affected by significant SNPs), it is not always the case that the gene affected by a SNP is the one positionally closest to the SNP. Indeed, a SNP may not only affect a gene when it resides in its coding or promoter regions, but also when it resides in a far-away enhancer. This is why epigenetic information such as chromatin loops (referred to by the authors as 'Chromatin') can be useful for SNP-gene mapping. In the presence of chromatin contact or eQTL information, typically one would use the derived mapping to augment the positional mapping, which is always available. That is, if one had chromatin contact data, they would use positional + Chromatin to map SNPs to genes. If one had eQTL data, they would use positional + eQTL. If one had both, they would use positional + eQTL + Chromatin. From a biological interpretability perspective, there is no reason to exclude the positional information. For example, a SNP in the promoter of a gene could interact with a SNP in a distal enhancer of another gene, affecting a specific trait. In view of this, the statement (lines 326-328) "Since the main objective of this protocol is to increase the biological interpretability of epistasis findings, we have excluded other combinations that mix functional and non-functional information (Positional + eQTL and Positional + Chromatin)" is not quite valid, as positional information is also functional. On the other hand, using eQTL only, Chromatin only, or eQTL + Chromatin, albeit interesting in terms of looking at how this type of reduction in the search space affects results, does not quite reflect a biologically guided approach.

      2. I wonder on whether the authors have considered filtering also by markers of relevant chromatin states. Information about open chromatin and other epigenetic marks could help further filtering SNPs, both in enhancers and promoters. This would be particularly useful for SNPs mapped via chromatin contacts, which are likely to contain many irrelevant signals.

      3. The eQTL and chromatin contact data used in this work were from all available tissues. Typically, GWAS related functional filtering is done using data from tissues relevant to the trait under investigation, when available. For IBD, it may help to restrict to intestinal tissues, immune cells (like T- cells, macrophages, dendritic cells), and possibly also nervous system cells (which, at least according to some, could also be among the potential 'culprit' IBD tissues).

      4. To adjust for population structure the authors regressed out the first 7 PCs from the phenotype. Given that the PCs are confounders, it would be good to discuss the impact of doing this as opposed to also regress the confounders out of the SNPs, i.e. testing the response residuals vs the SNP residuals. In the same spirit, it would be good to discuss the impact of the PC-SNP association on the p-value and type-I error results obtained by permuting the response residuals.
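The difference between residualizing the phenotype only and residualizing both phenotype and genotype can be made concrete with a small simulation. This is a generic sketch of the Frisch-Waugh-Lovell residual-on-residual idea with invented data, not the authors' actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
pc = rng.normal(size=n)                        # one confounding PC
snp = rng.binomial(2, 0.3, size=n) + 0.5 * pc  # genotype correlated with the PC
y = 0.4 * snp + 0.8 * pc + rng.normal(size=n)  # phenotype

def residuals(v, covar):
    """OLS residuals of v after regressing on covar (with intercept)."""
    X = np.column_stack([np.ones(n), covar])
    beta, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ beta

# Full model: y ~ intercept + snp + pc
X_full = np.column_stack([np.ones(n), snp, pc])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# FWL: residualize BOTH y and snp on the PC, then simple regression.
y_res, snp_res = residuals(y, pc), residuals(snp, pc)
beta_fwl = (snp_res @ y_res) / (snp_res @ snp_res)

# Residualizing y alone and regressing on the raw SNP does NOT recover the
# full-model coefficient (the estimate is attenuated).
beta_y_only = (snp @ y_res) / (snp @ snp)
print(beta_full[1], beta_fwl, beta_y_only)
```

By the Frisch-Waugh-Lovell theorem, `beta_fwl` coincides exactly with the SNP coefficient from the full model, while `beta_y_only` does not; this is the distinction the comment asks the authors to discuss.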

      5. Section 2.1 is somewhat too concise and may be unclear to the reader. Later in the Discussion (lines 229-239) the authors explain how their procedure corrects for multiple testing at the SNP-model level without additional corrections for multiple testing at the gene-model level (this is also implicitly described in Fig 7CD), and yet keeps type I error under control. However, it would be beneficial, for ease of reading, to expand section 2.1 (via text and/or a figure) so as to clarify, at the onset, where and when multiple testing corrections are applied.

      6. In the absence of a replication data set, the authors assess the robustness of the gene pair results via 10 repetitions of the workflow using 80% of the discovery data set. It would be useful to include some discussion of how their results could be further assessed in other GWAS data sets (e.g. from UK biobank, etc.), in view of the fact that it is typically hard to reproduce epistasis findings, at least at the SNP level. Certainly one could first check whether the discovered SNP-SNP interactions are reproduced and limiting the analyses to those pairs would require a less severe multiple testing correction. But another approach may be to start with the discovered gene pairs and then analyze all pairs of SNPs mapping to these genes (not necessarily those discovered in this study), etc. Do the authors plan future follow-up studies on this?

      7. In section 2.7 the results of pathway analyses on 3 (eQTL, Positional, and Standard) of the 5 networks presented in Figure 3 are provided. What about the other 2?

      8. For these two points I defer to the editor:

      (i) The format of the manuscript is close to but does not exactly match the specifications at https://academic.oup.com/gigascience//pages/research. I do not know how strict these specifications are and I have no objections to the current format.

      (ii) Data availability is not discussed (as per Data and materials in https://academic.oup.com/gigascience/pages/instructions_to_authors). I imagine that the IIBDGC only makes publicly available the summary statistics. This is, however, common in the GWAS field.

      1. Some minor notes follow:

      (i) In the Author Summary the 'ATPM' acronym is used for the first time without explanation.

      (ii) In section 4.2 it would be helpful to re-iterate that the SNP-gene mapping for the Standard analysis was genomic proximity (this is only mentioned briefly at line 206).

      (iii) Typo at line 168 "the same than" should be "the same as".

      (iv) It should be specified which of the MSigDB collections was used. Later in this section gene sets are referred to as 'pathways', but there is more than one pathway collection in MSigDB.

      (v) In the formula at line 397 doesn't "tested gene sets" refer to "tested gene neighborhoods"? If so, it would be better to use the latter for clarity.

      (vi) There appear to be some typos in the caption for Supplementary Figure 1: "we computed three linear models using the different residuals as response variable and SNP interactions as dependent variables". I guess should be "SNP interactions as independent variables". Also, weren't the two individual SNPs also included as independent variables in these models?

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab093), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Shing Wan Choi

      Here, the authors present a pipeline for the analysis of epistasis effects in GWAS data (GWAIS). This is traditionally a difficult problem due to the large search space involved, which makes the analysis computationally intensive and subject to a heavy multiple-testing burden; a new method that restricts the search space can definitely help make GWAIS more feasible and reproducible. I have some questions after reading the current paper; please excuse me if the information is already presented within it:

      1. I am not sure what the Standard model comprised of. According to the methodology section, the Standard analyzed all SNP pairs without prior filtering, does that mean all 14,501,130,150 SNP pairs (C(170301, 2)) were tested? Or was it not all SNP pairs were considered?

      2. When presenting the number of SNPs linked to each gene based on different criteria (e.g. position, eQTL or chromatin contact), wouldn't the gene size be a major predictor of the number of SNPs linked? This would most likely be the case for positional mapping, right?

      3. I am curious to see whether restricting the eQTL and chromatin information to disease-specific tissues would improve the performance of the current model.

      4. Very little information is provided about the PRS analysis. Which genome-wide association summary statistics were used? Did the authors perform high-resolution scoring? With the shrinkage/thresholding done in a typical PRS analysis, some SNPs' effects might be excluded or shrunk away from the PRS model; would that affect the interpretation of the PRS covariate analysis? E.g. maybe the SNPs not included in the PRS model were those unaffected? (With PRSice, the --print-snp option can be used to obtain the list of SNPs included in the model.)

      5. It seems that Biofilter provides a SNP-SNP interaction prediction model; how does that compare to what was presented here?

      6. In Figure 3, the results from eQTL + Chromatin and Positional + eQTL + Chromatin are identical. Taken together, it seems that the positional mapping does not contribute to the results at all, which is a bit surprising. Is there any explanation for this? Could it be due to the mapping of the Immunochip array, or a characteristic of IBD?

      7. Given the sample size of the current data, a HWE threshold of 0.001 seems rather stringent. Would the results improve if a less stringent threshold were used (e.g. 1e-6)?
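To make the threshold comparison concrete, here is a minimal sketch of the standard 1-d.f. chi-square HWE test. The genotype counts are invented for illustration, and real QC tools such as PLINK use an exact test rather than this asymptotic approximation:

```python
import math

def hwe_chi2_p(n_aa, n_ab, n_bb):
    """Asymptotic 1-d.f. chi-square test of Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # allele frequency of A
    q = 1 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # survival function of a chi-square with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2))

# Invented counts with a modest heterozygote deficit: this variant is
# excluded by a p < 0.001 filter but retained by a p < 1e-6 filter.
pval = hwe_chi2_p(1200, 1700, 800)
print(pval)
```

With large samples even mild deviations from HWE reach small p-values, which is why a stringent cutoff like 0.001 can remove many variants that a 1e-6 cutoff would keep.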

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac001), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 3: Hirak Sarkar

      Producing single-cell count matrix from the raw barcoded read sequences consists of several contributing steps such as whitelisting, correcting cell barcodes, resolving multi-mapped reads, etc. Each step can potentially introduce variability in the resulting count matrix depending on the specific algorithm adapted by the tool used. Bruning et al. attempted to disentangle these effects using the most popular scRNA-seq quantification tools such as Cell Ranger 5, STARsolo, Kallisto, and Alevin. The manuscript is well-written and would add considerable value to the broad single-cell research community. I have a few concerns about the current draft of the manuscript that can be addressed in a revision.

      • The scina tool is used to construct an "artificial ground truth": the consensus of two or more mappers is used to arrive at this reference annotation. In my opinion, the consensus can lead to a biased reference, especially since STARsolo and Cell Ranger 5 follow very similar pipelines; it is expected, by design, that those tools will have highly overlapping results.

      I suggest that simulated datasets generated from pre-defined clusters might be more appropriate for an unbiased evaluation (the recent paper from Kaminow et al., https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1.full, has similar simulations). Having said that, the current consensus-based analysis should, in my opinion, give a reasonable reference for most of the cells, but a more principled simulation is required to identify the extreme cases where each of the tools might show variable assignments.

      -The Sankey plots (Supp Figure 5) and the heatmaps (Supp Figure 6) represent the mutual agreement between the different tools. As the scina clusters are used as ground truth, a more direct quantitative measure such as precision/recall would be more helpful.

      To be more specific, the resolution parameter of FindClusters could be tuned (now set to 0.12/0.15) to produce the same number of clusters as present in the ground truth. Each predicted cluster can then be assigned to a ground-truth cluster greedily. The misassigned cells can be further categorized as false positives or false negatives.
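The greedy assignment described above can be sketched in a few lines; the cluster labels below are toy values for illustration, not the manuscript's data:

```python
from collections import Counter

# Toy labels: ground-truth cluster per cell vs. a tool's predicted cluster.
truth = ["T1", "T1", "T1", "T2", "T2", "T2", "T3", "T3"]
pred  = ["A",  "A",  "B",  "B",  "B",  "B",  "C",  "C"]

# Greedily map each predicted cluster to the ground-truth cluster with
# which it shares the most cells (largest overlaps claimed first).
overlap = Counter(zip(pred, truth))
assignment = {}
for (p, t), _count in overlap.most_common():
    if p not in assignment:
        assignment[p] = t

# Cells whose predicted cluster maps to their true cluster agree with the
# ground truth; the remaining cells can be split into false positives and
# false negatives per cluster.
agreement = sum(assignment[p] == t for p, t in zip(pred, truth)) / len(truth)
print(assignment, agreement)
```

Here cluster B is claimed by T2 (its largest overlap), so the one B cell that truly belongs to T1 counts against the agreement; per-cluster precision/recall follows directly from the same mapping.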

      • The variability of the different tools on the three real datasets is worth exploring in depth. For example, quoting from the paper, "Alevin detected more cells with less genes per cell in the PBMC and Endothelial dataset. However, it detected less cells with more genes per cell in the Cardiac dataset." It would be interesting to understand the origin of these variations and what the authors hypothesize; e.g., apart from mapping/alignment, there are additional steps in the quantification pipeline that could potentially lead to variation in the detected cells and the respective gene counts. The tools can also have underlying algorithmic biases that are worth exploring.

      • "We could show that Alevin often detects unique barcodes, which were not identified by the other tools. These barcodes had very low UMI content and were not listed in the 10X whitelist.", the alevin -- whitelist option (https://salmon.readthedocs.io/en/develop/alevin.html#whitelist) enables use of any external filtered whitelist while running alevin. I wonder if using this option would change the behavior mentioned in the manuscript.

      • The manuscript raises the important question of multi-mapped reads across cell types; it would be interesting to quantify the percentage of reads that are discarded as multi-mapped by the different tools (those which discard them). If that percentage is substantial, then resolving such ambiguous reads through an EM-like algorithm might be promising.

      Plots and Figures

      -Intersection Plots

      The minor differences on the y-axis of the intersection plots (Fig. 4, Supp. Fig. 3, etc.) are not pronounced (a log scale might help).

      -Overview Figure

      The manuscript correctly points out how different intermediate steps contribute to the overall variance in the downstream results. An overview figure with a flow chart of a typical scRNA-seq quantification pipeline would be beneficial.

      Minor Concerns

      There is a spelling mistake in the abstract celtype -> cell-type

      Possible incomplete sentence : "The recommended annotation from 10X, which only contains genes with the biotypes protein coding and long non-coding, might lead to an overestimation of mitochondrial gene expression respectively the absence of other gene types."

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac001), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Serghei Mangul

      1 -- The abstract contains confusing terminology; for example, "became available" can be replaced by "developed".

      2 -- Likewise, "also analyzed several data sets" can be replaced by "benchmarked" to clearly indicate that this refers to benchmarking rather than analysis. Some terminology needs to be explained; for example, "whitelisting" should be defined.

      3 -- Kallisto is not an alignment tool in the proper sense, as it does not report the position of the read but only the transcript of origin; instead, this is pseudoalignment. "Alignment" needs to be defined, or the word "pseudoalignment" used.

      4 -- How was the ground truth or gold standard defined? Is the assumption of the paper that the tool with the highest number of mapped reads performs the best? This needs to be explained in the introduction.

      5 -- In general, read alignment is an artificial rather than a biological problem, so a molecular gold standard cannot be defined. See for example https://www.nature.com/articles/s41467-019-09406-4. It would be helpful to explain this upfront when talking about the gold standard, and to cite this work.

      6 -- It is unclear how the tools were selected. What was the reasoning for selecting only 4 tools, and how do the authors know that those tools are commonly used? For a complete list of RNA-seq alignment tools, the authors can refer to https://arxiv.org/abs/2003.00110. A reasonable selection criterion would be to take the tools that are available, for example, in Bioconda, which makes installing them easy. However, randomly selecting tools is not acceptable. For example, why was Salmon not included while Kallisto was?

      7 -- The language of the paper needs to be improved; for example, in the background section the word "great" is used, which could be replaced by more appropriate scientific wording.

      8 -- More explanation needs to be provided for Cell Ranger. Is it essentially a wrapper around STAR? Does it involve any novel algorithms or software development?

      9 -- The authors need to explain why they chose only 10x Genomics among the available single-cell platforms.

      10 -- The annotations may indeed influence the alignment when they are provided to the alignment tools. Is every alignment tool able to take custom annotations? The paper lacks a figure showing which annotation performs best for a given dataset.

      11 -- Datasets and reference genomes section: gold standard datasets are not reported. It is not clear whether the paper includes such a dataset. If it is missing, how are the authors able to say which read alignment tool performs the best?

      12 -- The paper contains a single human sample. Is there any particular reason for that? The paper would benefit from having multiple human samples, as was done for the mouse. Did the authors perform a systematic search to identify as many single-cell samples as possible? If not, that would be desirable.

      13 -- Was the 10x human data only available on the 10x website, and not on SRA or GEO?

      14 -- The paper provides a GitHub link with the datasets and the code used for this analysis. Does the GitHub repository also contain the BAM files? If not, those need to be uploaded. Additionally, are the code and the summary data behind the figures provided?

      15 -- The beginning of the results section would benefit from a short description of the datasets. For example: How many samples were there in total? What was the read length of each sample? What was the number of reads per sample? If these varied, providing the mean and variance would be helpful.

      16 -- In general, the figures need to be improved in terms of visualization; it is very hard to understand what they are trying to convey. For example, Figure 2 is nearly impossible to understand, and its purpose is also unclear. The same holds for Figure 3: it is a very busy figure, and what it is trying to convey is hard to tell.

      17 -- Figure 4 is also very hard to understand; perhaps a log scale would improve it. What is the x-axis, for example? These details are unclear, and in general the figures need to be improved.

      18 -- In general, the figures need to be visually understandable and more effective.

    3. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac001), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Bo Li

      Single-cell RNA-seq has revolutionized our abilities of investigating cell heterogeneity in complex tissue. Generating a high-quality gene count matrix is a critical first step for single-cell RNA-seq data analysis. Thus, a detailed comparison and benchmarking of available gene-count matrix generation tools, such as the work described in this manuscript, is a pressing need and has the potential to benefit the general community.

      Although this work has great potential, the benchmarking efforts described in the manuscript are not comprehensive enough to justify its publication in GigaScience unless the authors address my following major and minor concerns.

      Major concerns:

      1) The authors should discuss related benchmarking efforts and the differences between previous work and this manuscript in the Background section instead of the Discussion section. For example, Du et al. 2020 G3: Genes, Genomes, Genetics and Booeshaghi & Pachter bioRxiv 2021 should be mentioned and discussed in the Background section. In addition, the STARsolo manuscript (https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1), which contains a comprehensive comparison of CellRanger, STARsolo, Alevin and Kallisto-BUStools, should be cited and discussed. Zakeri et al. 2021 bioRxiv (https://www.biorxiv.org/content/10.1101/2021.02.10.430656v1) should also be included and discussed in the Background section.

      2) Benchmark with the latest versions of the software. The choice of Cell Ranger, STARsolo, Alevin and Kallisto-BUStools is good because they are four major gene count matrix generation tools. However, I urge the authors to also include CellRanger v6 and Alevin-fry (Alevin_sketch/Alevin_partialdecoy/Alevin_full-decoy, see the STARsolo manuscript), which are currently lacking, in their benchmarking efforts. The authors may also consider adding STARsolo_sparseSA to the benchmark. Since single-cell RNA-seq tool development is a fast-evolving field, benchmarking the up-to-date versions of tools is critical for a benchmarking paper.

      3) Conclusions. The authors summarized the observed differences between tools based on the benchmarking results. This is good, but to be truly helpful for end-users I recommend that the authors emphasize their recommendations more clearly in the discussion/results section. For example, do the authors recommend one tool over the others under certain circumstances? If so, which tool, under which circumstances, and why? I like Figure 5 a lot and hope the authors can summarize this figure better in the manuscript.

      4) This manuscript concluded that differential expression (DEG) results showed no major differences among the alignment tools (Figure 4). However, the STARsolo manuscript suggested DEG results are strongly influenced by quantification tools (Sec. 2.6, Figure 5). Please explain this discrepancy.

      5) This manuscript suggested simulated data is not as helpful as real data. However, the STARsolo manuscript reported drastic differences between tools using simulated data. Please comment on this discrepancy.

      6) I have big concerns regarding the filtered vs. unfiltered annotation comparison, in particular for pseudogenes: we know that many of them are barely or lowly transcribed. As a result, many of these pseudogenes would not be captured by the single-cell RNA-seq protocol. At the same time, because these pseudogenes share sequence similarities with functional genes, they cause trouble for read mapping. This is one of the main reasons for using a carefully filtered annotation. Actually, whether and how to filter the annotation is under active debate in the big cell atlas consortia such as the Human Cell Atlas. Thus, I would be super careful about describing results comparing filtered vs. unfiltered annotations. For example, in Suppl. Figure 8D, there are 6 mitochondrial genes that have 100% sequence similarity to their corresponding pseudogenes. It is impossible to distinguish whether a read comes from a gene or a pseudogene for these 6 genes, and it is also not necessary --- the transcribed RNA should be exactly the same. Thus, I encourage the authors to remove these pseudogenes from the annotation, and I suspect the mouse data results would then look similar to the human data in Suppl. Figure 8A.

      7) The endothelial dataset was only run on CellRanger 3 because the UMI sequence is one base shorter. Could the authors augment the UMI sequence with one constant base and run this dataset through CellRanger 4/5/6?

      8) I think it is more appropriate to call the tools benchmarked as "gene count matrix generation tools" instead of "alignment tools".

      Minor concerns:

      1) The Suppl Table 2 mentioned in the main text corresponds to Suppl. Table 3 in the attachment. In addition, there is no reference to Suppl Table 2.

      2) Suppl Table 3 PBMC, why do I see endothelial cell markers in PBMC dataset?

      3) Suppl Figure 7 is never referenced in the main text.

      4) Suppl Figure 8D is never referenced in the main text.

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab101), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Filippos Bantis

      The authors used imaging tools with three types of phenotypic descriptors (dimensions, shape, colour indices) and side- or top-camera views in order to determine non-destructive parameters of seven diverse species (Arabidopsis thaliana, Brachypodium distachyon, Euphorbia peplus, Ocimum basilicum, Oryza sativa, Solanum lycopersicum, and Setaria viridis) growing under different Red/Blue gradients (from 100% Blue to 100% Red). The results are important since they are non-destructive and provide a good basis for the selection of light treatments for specific plants in controlled environment agriculture. The introduction is informative and sufficiently describes the scope of the research. I like the way the authors describe/display the results: relatively few words (compared to the volume of the obtained measurements) but beautifully built figures which provide all the necessary information. However, I would expect more discussion at the end of the description of each set of parameters, as well as possible comparisons with the literature, even if it is rather scarce. For example, on PDF page 11, subsection "Patterns of change over time", the results are barely discussed. Moreover, the review process would be facilitated if the manuscript had line numbering.

      Specific comments are following:

      • In the title, LED should be written with capital letters, not Led

      • Keywords must not be included in the title. Please remove or substitute LED and light quality.

      Introduction

      • PDF page 4, L3. Controlled environment agriculture must be abbreviated the first time it is written in the text. The same applies to other terms such as RGB.

      • PDF page 5, L23. "Large-scale crops" is more appropriate term.

      • I agree with the active voice in the objectives' part of the introduction. However, you should refrain from beginning most sentences with "we".

      Results

      • PDF page 8. I suggest that the "Data description" subsection be moved to the "Methods" section.

      • This section should be renamed "Results and Discussion", since there is also discussion within the results.

      Methods

      • PDF page 14. How many cabinets were used? How many treatments and plants were placed in each cabinet? Apart from figure 1 depiction, you should also describe the experimental design in order for the reader (and me as well) to fully understand it.

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab101), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Yujin Park

      The manuscript presents the result of an experiment investigating the impact of red:blue ratio of light gradient on plant phenotypic traits in seven plant species. The subject of the manuscript is very innovative and interesting, but there are parts of the materials and methods that are less clear. Specific comments:

      • In this study, plant phenotypic traits were evaluated using an imaging platform. Plant biomass (fresh and dry weights of shoot and root) is one of the most important plant growth parameters. Are there any suggestions that plant biomass can be predicted from the plant phenotypic traits quantified by the imaging platform?

      • Growth conditions:

      • Does the irradiance of 130-150 µE·m⁻²·s⁻¹ indicate the PPFD (400-700 nm)? How was it measured?

      • Please be consistent with the unit for photon flux density throughout the manuscript. µEinsteins were interchangeably used along with µmol·m⁻²·s⁻¹ in the past, but the Einstein is not an SI unit. Thus, please use µmol·m⁻²·s⁻¹ when you quantify the photon flux density. Also, please revise the µmoles·m⁻²·s⁻¹ in Fig. 1 to µmol·m⁻²·s⁻¹.

      • Could you provide the spectral distribution data for white light, red LED, and blue LED used in this study?

      • For the concentration of the slow-release fertilizer, do you mean grams per liter? If so, please correct it to 6 g·L⁻¹.

      • What were the growing conditions (air temperature, relative humidity, photoperiod, etc.) during the treatment of the red:blue gradient?

      • Did you keep the control plants under white light continuously? Then, did you make sure that the control plants and treatment plants are grown under the same growing condition except for the light quality treatment?

      • It is not clear whether the experiment was replicated. The experimental unit is the physical entity which can be assigned, at random, to a treatment. Here the experimental unit was the experimental plot under each light gradient treatment. A single plant should be treated as an observational unit. So, without replications, the data is less reliable.

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab099), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 3: idoia ochoa

      The authors present a novel tool for the compression of collections of bacterial genomes. The authors present sound results that demonstrate the performance gain of their tool, MBGC, with respect to the state-of-the-art. As such, I do not have concerns about the method itself. My main concerns are with respect the description of the tool, and how the results are presented. Next I list some of my suggestions (in no particular order):

      Main Paper:

      • Analysis section: Before naming MBGC, specify that it is the proposed tool.

      • Analysis section: Reference for HRCM. Mention here also that other tools such as iDoComp, GDC2, etc. are discussed in the Supplementary (this way the reader knows more tools were analyzed or at least tried on the data).

      • Analysis section: The paragraph "Our experiments with MBGC show that... " is a little misleading, since it seems that the tool has the capacity to compress a collection and just extract a single genome from it. This becomes clear later in the text when it is discussed how the tool could be used to speed up the download of a collection of genomes from a repository. So maybe explain that in more detail here, or mention that it could be used to compress a bunch of genomes prior to download. And then point to the part of the text where this is discussed in more detail.

      • Analysis section: The results talk about the "stronger MBGC mode", the "MBGC max", but in the tables it reads "MBGC default" or "MBGC -c 3". I assume "MBGC -c 3" refers to "MBGC max", but it is not stated anywhere. Maybe better to call them "MBGC default" and "MBGC max".

      • Analysis section: Although the method is explained later in the text, it would be a good idea to give a sense of the difference between the default and max modes of the tool. Or some hints on the trade-off between the two. Also, the parameter "-c 3" is never explained.

      • Analysis section: Figures: it is difficult to see the trade-off between relative size and relative time; can you use colored lines, such that the same color refers to the same set of genomes? Also, in the caption, explain whether we want small or high relative size and time. It may be clear, but better to clearly state it.

      • Analysis section: There is a sentence that says "all figures w.r.t. the default mode of MBGC". It would be good also to state that in the caption, so that the reader knows which mode of the tool is being used to generate the presented results, and whether the input files are gzipped or not. For example, for the following paragraph that starts with Fig. 1, it is not clear if the files are gzipped or not.

      • Analysis section: First time GDC2 is mentioned, the first thing that comes to mind is why it was not used for the bacterial experiments. See my previous point on having a couple of sentences about the other tools that were considered, and why they are not included in the main tables/figures.

      • Methods:

      -- Here I am really missing a diagram explaining the main steps of the tool. It seems the paper has been rewritten slightly to fit the format of the journal and some things are not in the correct order. For example, it says the key ideas are already sketched, but I do not think that is true.

      -- (offset, length) I assume refers to the position in the REF where the match begins and the length of the match, but again, not really explained. A diagram would help. Also, when it is time to compress the pairs, are the offsets delta-encoded, or encoded as they are with a general compressor?

      -- How are the produced tokens (offset, length, literals, etc.) finally encoded?

      -- First time parameter "k" is mentioned: what is its default value? Also, how can you do a left extension and "swallow" the previous match? Is it because the previous match could have been at another position? Otherwise, if it was in that position it would have already been extended to the right, correct? I mean, it would have generated a longer match.

      -- The "skip margin" idea is not well explained. Not sure why the next position after a match is decreased by m. Please explain better or use a diagram with an example.

      -- When you mention 1/192, maybe already state that this is controlled by the parameter u; otherwise, when you mention the different parameters, it is difficult to relate them to the explanation of the algorithm.
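      As an aside on the (offset, length) tokens and delta encoding discussed in the points above, the following is a minimal illustrative sketch in Python — my own toy code, not MBGC's implementation. The `min_len` parameter is only loosely analogous to a minimum match length like the paper's "k", and the greedy scan is deliberately naive.

```python
# Toy sketch (NOT MBGC's code): emit LZ-style (offset, length) match
# tokens against a reference, then delta-encode the match offsets.

def greedy_lz_matches(ref: bytes, target: bytes, min_len: int = 4):
    """For each position in `target`, greedily find the longest match
    in `ref`; emit ("match", offset, length) tokens of at least
    `min_len` bytes, and ("literal", byte) tokens otherwise."""
    tokens = []
    i = 0
    while i < len(target):
        best_off, best_len = -1, 0
        for off in range(len(ref)):
            l = 0
            while (off + l < len(ref) and i + l < len(target)
                   and ref[off + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_off, best_len = off, l
        if best_len >= min_len:
            tokens.append(("match", best_off, best_len))
            i += best_len
        else:
            tokens.append(("literal", target[i]))
            i += 1
    return tokens

def delta_encode_offsets(tokens):
    """Replace each match offset with its difference from the previous
    match offset; small deltas are cheaper for a general compressor."""
    out, prev = [], 0
    for t in tokens:
        if t[0] == "match":
            out.append(("match", t[1] - prev, t[2]))
            prev = t[1]
        else:
            out.append(t)
    return out

toks = greedy_lz_matches(b"ACGTACGTTT", b"ACGTTTAC")
# -> [("match", 4, 6), ("literal", 65), ("literal", 67)]
```

      The usage line shows how `ACGTTTAC` is covered by one 6-byte match into the reference plus two literals; a real compressor would then entropy-code the token streams.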

      Availability of supp...

      -- from from (typo)

      Tables

      -- Specify the number of genomes in each collection.

      -- change MBGC -c 3 to MBGC max or something similar (see my previous comment: the -c flag is not explained!).

      Supplementary Material

      -- move table 1 after the text for ease of reading

      -- Not clear if the tool has random access or not. It is discussed what percentage of time (w.r.t. decompressing the whole collection, I believe) it would take to decompress one of the first genomes vs one of the last ones. This should be better explained. For example, if we decompress the last genome of the collection we will employ 100% of the time, right? Given that previous genomes are part of REF (potentially). Please explain better and discuss this point in the analysis part, not only in the supplementary. This seems like an important aspect of the algorithm.

      -- I assume this is not possible, but it should be discussed as well: can you add a genome to an already compressed collection? This, together with the random access capabilities, will highlight better the main possible uses of the tool.

      -- Section 4.3: here HT is used, and then HT is introduced in the next paragraph. Please revise the whole text and make sure everything is in the right order.

      -- parameter m, please explain better.

      -- add colors to figures; it will be easier to read them.

      Overall, as I mentioned before, I believe the tool offers significant improvements with respect to the competitors for bacterial genomes, and performs well on non-bacterial genomes as well. What should be improved for publication is the description of the method, since at the end of the day it is the main contribution, and how the text is presented.

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab099), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Jinyan Li

      This paper proposed a compression algorithm to compress sets of bacterial genome sequences. The motivation is based on the reason that the existing algorithms and tools are targeted at large, e.g., mammalian, genomes, and their performance on bacteria strains is unknown. The key idea of the proposed method is to detect characteristic features from both the direct and reverse-complemented copies of the reference genome via LZ-matching. The compression ratio is high and the compression speed is fast. Specifically, on a collection of 168,311 bacterial genomes (587 GB in file size), the algorithm achieved a compression ratio around the factor of 1260. The author claimed that the performance is much better than the existing algorithms. Overall, the quality of the paper is quite good.

      I have two suggestions for the author to improve the manuscript:

      1/ This sentence is not clear to me: "we focus on the compression of bacterial genomes, for which existing genome collection compressors are not appropriate from algorithmic or technical reasons." More clarification is needed.

      2/ In my own experience, GDC2 has better performance on virus genome collections than HRCM. I strongly suggest that the author add the performance of GDC2 on the bacterial genome collections.

    3. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab099), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Diogo Pratas

      This article presents a new compressor that uses both direct and reverse-complemented LZ-matches with multi-threaded and cache optimizations.

      Generally, the reported results of this tool are exciting, and once confirmed, they have good applicability in the bioinformatics community.

      However, I could not reproduce the results due to a lack of instructions, the benchmark is not representative of the state of the art, and there are also several associated questions. The comments are specified below.

      Regarding the experiments:

      1. The experiments could not be reproduced. Unfortunately, the instructions and documentation are not clear (see my attempts below).

      2. The benchmarking is missing several well-known tools (for example, naf, geco3, Deliminate, MFCompress, Leon, ...). See, for example:

      Kryukov, Kirill, et al. "Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences." Bioinformatics 35.19 (2019): 3826-3828.

      Silva, Milton, et al. "Efficient DNA sequence compression with neural networks." GigaScience 9.11 (2020): giaa119.

      Yao, Haichang, et al. "Parallel compression for large collections of genomes." Concurrency and Computation: Practice and Experience (2021): e6339.

      To access more compressors, please see the following benchmark (that is already cited in the article): https://academic.oup.com/gigascience/article/9/7/giaa072/5867695

      Regarding the manuscript:

      1. The State-of-the-art in genomic data compression (or at least in collections of genomes) is brief and does not offer a consistent and diverse description of the already developed tools.

      2. "By the compression ratio we mean the ratio between the original input size and the compressed size. If, for example, the ratio improves from 1000 to 1500, e.g., due to changing some parameters of the compressor, we can say that the compression ratio improves 1.5 times (or by 50%)." This sentence seems a little confusing (at least for me). Please, rephrase.

      3. "The performance of the specialized genome compressor, HRCM [7], is only mediocre, and we refrained from running it on the whole collection, as the compression would take about a week." The purposes of a data compressor can be very different: use on machines with less RAM, compression-based analysis, long-term storage, research purposes, among others. Qualifying HRCM without putting it into context seems disparaging.
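      The compression-ratio arithmetic the reviewer quotes in point 2 can be sanity-checked with a quick sketch (my own illustrative numbers, not values from the paper):

```python
# Illustrative check: ratio = original size / compressed size, so an
# improvement from a ratio of 1000 to 1500 is a 1.5x (50%) gain.

def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    return original_bytes / compressed_bytes

before = compression_ratio(1_500_000, 1500)  # ratio of 1000
after = compression_ratio(1_500_000, 1000)   # ratio of 1500
improvement = after / before                 # 1.5
```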

      Regarding the tool and documentation:

      1. Although I have downloaded and compiled the tool, I had to dedicate some minutes to a "libdeflate" default-version issue. The majority of the bioinformatics community uses conda. In order to minimize installation issues for users, please provide a conda installation for the proposed tool. Also, libdeflate can already be retrieved with conda. Then, with the instructions for the installation of mbgc, please add this line to the mbgc repository:

      conda install -c bioconda libdeflate

      Notice that this "conda" part is a suggestion that will facilitate the usage of mbgc by the bioinformatics community.

      2. Running ./mbgc gives the output:

      ./mbgc: For compression expected 2 arguments after options (found 0)

      try './mbgc -?' for more information

      If the menu appeared by default (no arguments besides the program's name), it would be much more helpful.

      3. The program should have a version flag to report the version of the program (besides the version in the menu). This feature is essential for integrations/implementations (e.g., conda) and to differentiate eventual new versions of the mbgc software.

      4. Please provide a running example in the help menu (with tiny existing sequences from the repository).

      5. Is this characteristic of mbgc a strict property: "decompresses DNA streams 80 bases per line"? This characteristic may create differences between original files and uncompressed files. Perhaps having the possibility of a custom line size would be a valuable feature, at least for data compression scientists to assess and compare with other compressors, mainly because it makes the decompressor not completely lossless (although in practice, there is minimal information required to maintain the whole lossless property). Nevertheless, if the program decompresses FASTA data with a single line size (for DNA bases) of 80 bases, this should also be mentioned in the article (besides what already exists in the repository).

      6. The first impression was that "sequencesListFile" entries are the IDs of the bacterial genomes; then I found out that they are the URL suffixes for the FASTA repository. Then I started to wonder whether mbgc could directly accept the FASTA file containing the collection of genomes. How can the user provide the FASTA file directly? This feature would greatly simplify the usage of mbgc. Rationale: most reconstruction pipelines output multi-FASTA sequences in a single file, so this feature has direct applicability. Please add more information about this in the help print and in the README. A higher goal would be to have stdin and stdout in compression/decompression as an option, and POSIX-style arguments (Program Argument Syntax Conventions). These features are important for building bioinformatics pipelines and performing analyses (especially since the tool seems to be ultra-fast).

      7. Tables 1, 2, 3, and 4 (and the additional table in the supplementary material) state that "Compress / decompress times (as "ctime" / "dtime") are given in seconds," but no unit is provided in the caption for cmemory and dmemory. Are these values in gigabytes?

      8. The README should provide a small example for testing purposes with the files already available in the repository or by efetch download (see below).

      9. The reproducibility is hard to follow:

      I had to search for the following procedure to test the software:

      wget https://github.com/kowallus/mbgc/releases/download/v1.1/tested_samples_lists.7z

      7z e tested_samples_lists.7z

      After the cere download, also tar -vzxf cere_assemblies.tgz

      Then, I realized that it was missing the sequences, and by the NCBI interface, I lost track. I gave up after a few segmentation faults/combinations without understanding if the program or the settings generated the issue.

      Please provide supplementary material and README with the complete instructions to reproduce the experiments (the exact commands).

      Also, this simple way to download a multi-FASTA file with Escherichia coli sequences may be helpful:

      conda install -y -c conda-forge -c bioconda -c defaults entrez-direct

      esearch -db nucleotide -query "Escherichia coli" | efetch -format fasta > Escherichia.mfa
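      The fixed 80-bases-per-line decompression format raised earlier in this review suggests a simple workaround readers could apply themselves. The following is a minimal illustrative sketch (my own code, unrelated to mbgc) of re-wrapping the sequence lines of a FASTA string to an arbitrary width, e.g. to make byte-for-byte comparisons with other compressors possible:

```python
# Toy sketch: re-wrap FASTA sequence lines to a chosen width,
# leaving ">" header lines untouched.

def rewrap_fasta(text: str, width: int = 80) -> str:
    out = []
    seq = []
    def flush():
        # Emit the accumulated sequence in chunks of `width` bases.
        joined = "".join(seq)
        out.extend(joined[i:i + width] for i in range(0, len(joined), width))
        seq.clear()
    for line in text.splitlines():
        if line.startswith(">"):
            flush()
            out.append(line)
        else:
            seq.append(line.strip())
    flush()
    return "\n".join(out)

# Example: re-wrap an 8+4 base record to 5 bases per line.
wrapped = rewrap_fasta(">s\nACGTACGT\nACGT", width=5)
# -> ">s\nACGTA\nCGTAC\nGT"
```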

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab092), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Luiz Gadelha

      This manuscript presents a tool called ExTaxsI for management and plotting of molecular and taxonomic data from NCBI. Information can be persisted in a local database, as well as FASTA-formatted sequences, which can be used to display the information as scatter or sunburst pie plots, and maps. The tool uses the Entrez API from NCBI to retrieve data. It also uses the ETE toolkit to manage taxonomic data. Three use cases were presented to demonstrate ExTaxsI: geospatial distribution and gene data of the Atlantic cod and the Gadiformes order, and exploration of biodiversity data related to the SARS-CoV-2 pandemic.

      Using ExTaxsI from the command line apparently produces consistent and correct outputs. However, ExTaxsI functionality seems to be available only through this command-line interface. This considerably limits the applicability of the tool, since many researchers usually incorporate these routines programmatically into their scripts. It would be more useful if ExTaxsI functions were additionally provided through a library that could be imported in Python scripts. This would enable more use cases and lead to wider applicability. Some issues in a previous submission of this manuscript were corrected. A more detailed comparison with related tools is included, and the installation instructions for the tool now work correctly. The documentation was also significantly improved.

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab092), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Iddo Friedberg

      The authors have markedly improved the software in terms of usability and documentation. The manuscript could still use some language editing.

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab097), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Shanlin Liu

      The authors presented us with an improved genome for Glanville fritillary butterfly. However, there are several issues that need to be addressed before its acceptance.

      Major:

      What the current manuscript lacks most is a comparison between the improved genome assembly and its former version. Although the authors showed us an improved N50, I failed to find explanations for several critical differences. For example, (1) the authors stated that ca. 90 Mb of additional assembly sequence was achieved, but no further information is available for those new sequences; are they redundancies or fragments missed in version 1? (2) The improved genome has fewer predicted genes than its former version, decreasing from ca. 16,000 genes to ~14,000 genes, which is contradictory to the aforementioned longer genome assembly. (3) The former genome version observed unevenly distributed repeat elements across chromosomes, while this improved one does not, which also needs explanation.

      Another important issue of the present manuscript is the confusion introduced by the varied genome assembly sizes. Firstly, the authors did not provide this critical information, which can be estimated using several well-known methods, such as the C-value based on flow cytometry, or estimations based on k-mer frequency information. Secondly, the authors first mentioned that they sampled individuals with low heterozygosity, but later FALCON generated an assembly almost twice the size of the final genome. The authors may want to add extra analysis or words to clarify the genome size uncertainty. Related to the above concern, Haplomerge seems an important step in obtaining the final version of the assembly, and, if I understand it correctly, the authors did not use a standardized analysis pipeline; please consider including a schematic plot of your procedure to help readers better understand your steps and the principle behind them.

      In addition, many methods are vaguely described; the authors should provide details for them to make sure the analyses are repeatable. E.g., on page 6, the authors wrote: "This cut-off was experimentally found to give the best contiguity for the assembly, while minimizing (within a small margin of error) the percentage of possibly erroneous contigs", but I failed to find any details of their experiments. On the same page, the authors checked putative chimeric contigs manually, saying the error regions have low coverage or repeat regions; the authors should give demonstrative examples and statistics for the different kinds of errors. Meanwhile, when they say the error regions were split, the authors should give details about how they determined the split positions, since what they found are error regions instead of error bases. Also, on page 7, the authors stated "The contigs orders and orientations were manually fixed when needed"; please list the different situations that meet your criteria. The authors may want to explain why they chose the 1,232 genes for manual annotation. Random?

      Minor:

      Remove "(e.g. Kahilainen et al. unpubl.)"; it provides no useful information.

      Table 1. The N (%) of the version 2 genome is zero? The scaffolding step does not introduce any Ns? I doubt that.

      Page 5, please give the location information instead of a citation.

      Page 7, please clarify the assembly version for raw read mapping, is it the one generated by FALCON with a genome size ~ 700 MB?

      Page 9, "the first two step (bath A1 and bath A2)", please provide biological explanations.

      The Marey map needs a citation and a brief explanation at its first mention.

      "In M. cinxia the repeats are placed in single chromosomes whereas in H. melpomene they are present in all chromosomes. " How does it help to show the power of long read assembly? Need explanation.

      Page 10, how does Velvet apply a kmer size of 99 bp when you only have a read length as long as 85 bp?

      Table 2 title: species names should be in italics.

      Please give a full name for BUSCO in its first appearance.

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab097), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Annabel Charlotte Whibley

      In this manuscript, Blande, Smolander and colleagues report an improved chromosome-level genome assembly of the important ecological model lepidopteran species Melitaea cinxia. The manuscript would benefit from further language review by a native English speaker to improve readability, but the intentions of the authors are nevertheless clearly articulated throughout, the workflow is logical, and the assembly quality is a clear improvement on the earlier draft release.

      I would suggest revisiting the title to better reflect the work; as it stands it is a little underwhelming. One suggestion would be "Improved chromosome-level genome assembly of the Glanville fritillary butterfly (Melitaea cinxia) integrating PacBio long reads and a high-density linkage map". I would ideally also like to see more discussion of the more unusual aspects of this project; for example, long-read assemblies are commonplace now, but the linkage map approach (and the extent to which there was manual curation of potential chimeric scaffolds) is less frequently employed these days, and often superscaffolding and error correction are undertaken with Hi-C methods only. Similarly, the extensive manual curation of gene annotations and the impact that this had on the models is likely of more general interest (e.g. how many gene models were corrected? what types of errors were encountered?). In particular, some mention of the specific challenges of this project (e.g. the need to combine multiple individuals to obtain sufficient quantities of gDNA) might be interesting for the readership.

      The absence of line numbers is a little cumbersome for reviewing purposes; below I'll refer to specific parts of the text by page number (as printed on the PDF document), paragraph, and line (within the paragraph). 3-1-6: suggest changing "…. and included both laboratory and natural environmental conditions" to "…and have included…"

      3-2-1: change "The first M. cinxia genome was released in 2014" to "The first M. cinxia draft genome" or "The first M. cinxia genome assembly"

      Table1: reporting both GC and AT % is unnecessary. There are some discrepancies between the statistics reported for the chromosomal assembly in the Ahola et al (2014) paper vs this table. This may simply be due to different methods for assessing summary statistics (e.g. whether or not gaps are included by default), but warrants investigation/clarification. For example, the largest scaffold reported in the Ahola et al (2014) paper is 14,178,551bp. The description of the generation of a chromosomal build for the previous version indicates >280Mb were assigned to chromosomes, whereas the total assembly size in this table is reported to be only 251Mb.

      6-2-2: What are the units for the cut-off (read length?)? If available, the data exploring the impact of different cut-offs on the assembly error rate could be of interest to others assembling genomes de novo. 6-2-6: As a specific example of a more general comment on number reporting, perhaps state 24.4 Gb instead of 24,409,505,551 bp? I am not sure that the precision is always necessary and scaling/rounding can help readability.

      6-2-10: Are the alternative contigs extracted by default by the FALCON pipeline? Are there any adjustments that need to be made for an input of >1 individual, for example?

      7-2-2: The raw data for the linkage map crosses, and also the RNAseq data for the transcriptome studies (on ) is described as "unpublished", but I believe public sequence accessions are also being released with this manuscript. Is there additional information that would need to be disclosed for this information to be utilised by others or is the intention to highlight that the data will also be presented in upcoming publications?

      7-2-6 "Part of" should be "Some of"

      7-3-3: Specify "relative humidity" instead of RH. Discuss why different approaches used for different RNAseq experiments.

      8-1-5: Sequencing was "performed" rather than "made". Can you specify which HiSeq model and which sequencing library kit (or at the very least whether it was PCR-free)?

      9-2-6: Presumably "de novo transcripts" refers to both transcriptomes 1 and 2, in which case I think it would be helpful to state this here. I assume the different analysis approaches for datasets 1 and 2 reflect different histories of the two datasets but it would be interesting to see some assessment of the relative performances of these approaches.

      13-2-4: I think that http://butterflygenome.org would be sufficient for the URL here.

      14-1-4: Are there any flow cytometry (or other) estimates of genome size that can be used to set alongside the v1 and v2 assembly sizes?

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab078), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Stephen Nayfach

      In their manuscript, Ortuno et al. develop a procedure for imputing missing genotypes of SARS-CoV-2. Missing genotypes can arise from fragmented whole genome assemblies, targeted sequencing (e.g. spike protein), or incomplete genotype panels. I really like this idea and thought the paper was conducted quite carefully. I was impressed by the high level of precision across all experiments. I have a few minor comments, questions, and suggestions below:

      Major comments:

      My understanding is that only SNPs are imputed by the program. Is this correct? If this is the case, can the authors comment on the frequency of other types of variants in the SARS-CoV-2 genome? How common are small indels, large indels, or rearrangements?

      Can the authors include code for building their reference panel? This would enable the same pipeline to be applied to updated SARS-CoV-2 references or to other kinds of viruses entirely. For example, metagenomic DNA sequencing often yields partial viral genomes, and it would be great to use this same pipeline to impute these genomes (where sufficient references exist).

      I noticed that several of the PANGOLIN lineages seem especially hard to impute. Can the authors comment on why this might be the case?

      Regarding the PANGOLIN lineages, how do these correspond to specific variants of interest (e.g. the delta variant)? Is this information provided to users? A visual could really help here, showing the phylogenetic relationships between PANGOLIN lineages and how they relate to variants of interest.

      The authors indicate that missing regions of partial genome assemblies must be indicated by Ns. This seems like an artificial constraint that may be a pain point for users. Can the authors modify their program to detect missing regions from FASTA files and automatically fill these regions with Ns prior to imputation?

      Minor comments:

      For the installation options, please provide an alternative to Docker. Would it be feasible to add an installation option using conda?

      In their methods, could the authors clearly define true positives, true negatives, false positives, and false negatives in the context of their validation experiments? Related to this point, I noticed that the precision is consistently high in the validation experiments, but recall can be quite low. I assume this means that the program will not impute a genotype where there is insufficient evidence, leaving it as an "N". In this case, users should have high confidence in all imputed genotypes. Is this correct?

      All the figures in the manuscript were of low resolution and difficult to read.

      The authors should use a consistent tense (present or past) throughout the manuscript. In some places future tense was even used: "Once we have validated the robustness of our imputation against different missing regions scenarios, the validation will focus on the imputation of variants"
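      The precision/recall behaviour the reviewer describes (high precision, lower recall, with uncalled positions left as "N") can be made concrete with a minimal sketch. The function name and the per-site string encoding are my own assumptions for illustration, not the paper's actual definitions:

```python
# Toy sketch: precision/recall for imputed genotypes against a
# held-out truth. Positions left as "N" are not counted as calls,
# so they lower recall but cannot lower precision.

def imputation_precision_recall(truth: str, imputed: str):
    assert len(truth) == len(imputed)
    called = [(t, p) for t, p in zip(truth, imputed) if p != "N"]
    correct = sum(1 for t, p in called if t == p)
    precision = correct / len(called) if called else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

# Two of four sites imputed, both correctly: precision 1.0, recall 0.5.
p, r = imputation_precision_recall("ACGT", "ACNN")
```

      Under this reading, a tool that abstains ("N") whenever evidence is weak keeps precision near 1.0 at the cost of recall, which matches the pattern reported in the validation experiments.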

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab078), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Siyang Liu

      The authors have introduced an imputation pipeline that integrates the software tools Minimac3, Minimac4 and PANGOLIN to impute variants in the missing regions of SARS-CoV-2 sequencing data. The accuracy of the imputation for genotyping assay kits is around 0.9. The idea is interesting and may be helpful in a few limited scenarios. However, given the high mutation rate of SARS-CoV-2, and since most studies can generate high-quality SARS-CoV-2 (reference-based) genome assemblies, I don't think the method will be widely used in SARS-CoV-2 studies. In addition, it lacks a bit of genuine creativity in terms of the mathematics behind the method. I think the authors' study may be more suitable for a journal like Bioinformatics.

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab080), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102906

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102907

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102908

      Reviewer 4: http://dx.doi.org/10.5524/REVIEW.102909

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab081), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102910

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102911

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102912

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab079), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102903

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102904

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102905

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab077), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102900

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102901

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102902

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/gix107), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102986

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102988

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.100893

      Reviewer 4: http://dx.doi.org/10.5524/REVIEW.102987

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1186/s13742-016-0150-5), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102985

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz096), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102982

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102983

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102984

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz088), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102978

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102979

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102980

      Reviewer 4: http://dx.doi.org/10.5524/REVIEW.102981

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz144), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102961

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102962

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102963

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz135), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102964

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102965

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz138), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102959

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102960

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz143), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102956

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102957

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102958

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz145), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102945

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102946

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102947

      Reviewer 4: http://dx.doi.org/10.5524/REVIEW.102948

      Reviewer 5: http://dx.doi.org/10.5524/REVIEW.102949

    1. Long

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz125), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102935

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102936

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz150), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102942

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102943

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102944

    1. Abstract

      This paper has been published in GigaByte as part of the Asian citrus psyllid community annotation series. https://doi.org/10.46471/GIGABYTE_SERIES_0001.

      The CC-BY 4.0 peer reviews are as follows:

      Reviewer 1. Mary Ann Tuli

      Are all data available and do they match the descriptions in the paper?

      Yes. As with the other manuscripts, OGS v3 is mentioned, but this is not yet available from the CGEN. The data underlying Fig 4 and Fig 5 are available.

      This manuscript is a comprehensive description of the manual curation of the ubiquitin proteasome pathway genes, with clear aims and methodology.

    2. Ubiquitination

      Reviewer 2. Subhas Hajeri

      The manuscript is well written. Even though the authors could not find a major impact of CLas infection on the annotated subset of ubiquitin-proteasome genes, the negative data are equally important for further understanding the pathways and developing better RNAi targets.

      I would like to recommend acceptance of the manuscript as is.

    1. sequencing

      This paper has been published by GigaScience ( https://doi.org/10.1093/gigascience/giab100) and the peer-reviews have been shared under a CC-BY 4.0 license. These are as follows.

      Reviewer 1. Edward Rice

      In this manuscript, the authors present a sophisticated method for closing gaps in assemblies, built around the knowledge that gaps usually occur in repetitive regions. They test their software against similar software with more realistic scenarios than previous studies, through the use of gaps from real assemblies of genomes that have other assemblies with fewer gaps, rather than randomly generated gaps. These tests convincingly demonstrate that this software is more sensitive and accurate than existing gap closers.

      Given this increase in performance over existing software and the novelty of the methods, I recommend this manuscript for publication with some changes. I do have some concerns about the usability and maintainability of the software it describes, noted below, but most of the alternate options have similar issues, and the methodological advancements present in the manuscript merit publication.

      1. The introduction seems to imply that the primary use of this software is for closing gaps in short-read assemblies where high-coverage long reads are not available due to cost. Although I do not have a statistic to back this up, it is my sense from recent genome assembly papers that long-read de novo assembly is much more the norm these days than short-read assembly. In my personal experience I have found that gap closing can sometimes greatly improve long-read assemblies as well, especially CLR assemblies of highly repetitive genomes. I recommend rewriting the introduction somewhat to make it clear that usage of this software is not limited to short-read assemblies, as these are becoming rarer and rarer.

      2. I have some concerns about the maintainability of this code base, considering its size (>40k lines), language (D, which is not a common language in bioinformatics), and sparsity of comments in the code. Further, the use of non-standard dependencies and file formats may make it difficult to adapt the software to future advances in sequencing technology; for example, this package uses daligner to perform alignment, and so far as I can tell, daligner does not produce output in SAM format, so it may be difficult to switch to using another aligner in the future as the types of long reads available change. The fact that many of the dependencies are not maintained on bioconda is also concerning. The presence of integration tests is helpful. I apologize that this is probably not a particularly helpful comment as it's far too late to change any of these things, but still wanted to point them out.

      3. I also have concerns about usability. The availability of a docker file and snakemake workflow for running this software and the thorough and mostly comprehensible documentation alleviate these concerns to some degree, but it still takes a significant amount of work to configure it for a specific cluster. The example run did not work out of the box without fixing some errors (see minor edits). To test on my own assembly, I had to edit one JSON file to choose the parameters for dentist itself, which required reading about the two ways to specify two required coverage parameters; one yaml file to configure the workflow options; and one yaml file to make snakemake work with my cluster. In addition, not all clusters have singularity, so the lack of a conda package may be a problem for some potential users. The singularity image and snakemake workflow make its usability far better than PBJelly, which required actually editing the source code to make it work on my cluster with conda-installable versions of its dependencies, but it is still much worse than TGS-GapCloser, which only takes a single conda command to install with all dependencies and a single command to run, and no editing of configuration files.

      Minor comments:

      Abstract:
      - "Here, we developed" -> "Here, we present"
      - "Highly-accurate" — no hyphen
      - "Short read assemblies" -> "short-read assemblies" (this occurs in several other places too throughout the manuscript)
      - Replace "right loci" with "correct loci"

      Introduction:
      - Page 3: "High contiguity, completeness, and accuracy... is fundamental" — change "is" to "are"
      - Page 3: avoid parentheses inside other parentheses
      - Page 3: I'm not sure I've ever heard of GenomicConsensus being used for gap closing, and cannot find any reference to it being used for this purpose with a quick scan of documentation. It must be capable of doing this, though, as you tested it alongside other gap closers. Could you explain this in the manuscript?

      Results:
      - Page 4: replace "right loci" with "correct loci"
      - Page 4: say a little more about what makes DENTIST's "state-of-the-art" consensus module better than or different from existing consensus callers
      - Page 5: "real life" to "real-life"
      - Page 5: "high quality" to "high-quality"

      Discussion:
      - Page 9: "long read data" -> "long-read data"

      Methods:
      - Page 11: "genomic regions, where the number" — remove comma
      - Page 12: "a common conflict are" to "a common conflict is"
      - Page 12: "less than three reads" to "fewer than three reads"
      - Page 14: "'copied' gaps from short read assembly" to "copied gaps from the short-read assembly"
      - Page 14: remove quotation marks around "disassembled"

      Software:
      - The "small example" does not work out of the box as "dentist_v1.0.2.sif" is hard-coded into snakemake.yml but the image distributed with the example is v2.0.0.
      - The "read-coverage" and "ploidy" options are listed as required (unless you're using "min-coverage-reads" and "max-coverage-reads"), but they are not among the "important options" listed in the README under the "How to choose DENTIST parameters" subheading.
      - In the more extensive list of command-line options, the description of the "read-coverage" option is "this is used to provide good default values for -max-coverage-reads or -min-coverage-reads; both options are mutually exclusive." This tells the user how it is used by the program but gives the reader no explanation of how it should be chosen, which is important as it is one of the required options.
      - The use of comments in dentist.json, by putting double slashes in front of attribute strings, is confusing and also not supported by the JSON specification. dentist.json would be better in YAML format because: a) YAML supports comments; b) YAML is easier to read by humans; c) YAML is used for the other two configuration files necessary to run the pipeline, so for consistency purposes it's best to have them all in the same format.
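      For illustration, the contrast the reviewer draws could look like the sketch below (the option names here are hypothetical examples, not necessarily DENTIST's actual keys). JSON has no comment syntax, so a "comment" must be smuggled in as an extra data key, which any strict parser will treat as real content:

      ```json
      {
        "// note": "double-slash keys are parsed as ordinary data, not as comments",
        "read-coverage": 25,
        "ploidy": 2
      }
      ```

      whereas YAML supports genuine comments that parsers simply discard:

      ```yaml
      # A real comment: ignored by every YAML parser.
      read-coverage: 25  # inline comments are valid too
      ploidy: 2
      ```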

      Re-review: The authors have thoroughly and satisfactorily addressed all of my comments and the comments of the other reviewers. After testing the latest version, I can confidently say ease of use is much improved, as it took me less than five minutes to go from zero to successfully starting a run of the example. I am therefore happy to recommend this manuscript for publication in its current form.

    2. reads

      Reviewer 2. Leena Salmela

      Overview: The paper presents a new tool called DENTIST for closing gaps in short-read assemblies using PacBio CLR data. Although new assemblies are nowadays most often done with PacBio HiFi data, resulting in contiguous and accurate assemblies, closing the gaps of an existing short-read assembly with long-read data is a cost-effective and therefore attractive alternative for species for which short-read assemblies are already available. The new tool is shown to be more accurate than previous tools and of comparable sensitivity.

      Suggestions for revision:

      1) The authors should clearly indicate in the Introduction that their tool is tested on PacBio CLR reads. It would also be good to specify in the abstract that the reads were CLR reads and not HiFi reads.

      2) In the Discussion, the authors recommend "polishing" the final gap-closed assembly with Illumina reads. It would be interesting to see how much this improves the accuracy of gap closing. I would assume that the improvement on the gap sequences would be smaller than on other regions of the assembly, because the gap sequences typically cover repetitive regions.

      3) Last paragraph of section "Closing the gaps", page 14: DENTIST has three modes. Here it is indicated that the third mode (only use scaffolding information for conflict resolution and freely scaffold the contigs using long reads) would be the best mode for contig-only assemblies. It seems to me that the second mode would also be appropriate for this, as it also closes gaps between scaffolds (or contigs, in case of lack of scaffold information). Is this so?

    3. allow

      Reviewer 3. Ian Korf.

      The paper by Ludwig et al demonstrates that DENTIST offers a substantial improvement in closing genomic assembly gaps. The paper is well written with a clear and concise style. I liked the way they approached the experiments with a combination of simulated and real data for both the assemblies and reads. Specifically, I applaud how they generated gaps where they actually happen. The figures are generally effective. The only exception to this is Figure 4 with the black background and inconsistent ordering of competing software. In addition to winning the bake-off against other software, they did a very useful analysis of read depth (figure 6) and resources used (table 2). These help future users plan their projects. From a code perspective, I like that they have put their code on github. I don't think they need to have the supplemental file of command line parameters, as anyone who wants to use the software is going to go to the github anyway, which has a much more comprehensive explanation of usage.

  13. Feb 2022
    1. Abstract

      A version of this preprint has been published in the Open Access journal GigaByte (see paper https://doi.org/10.46471/gigabyte.42), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      Reviewer 1. Jose De Vega

      I think this long-read assembly is a great improvement over the previous short-read version available to the community to date. The assembly metrics are good, the dataset is public, and there is good quality control all through the process. The manuscript is well written and the protocols are well explained. The data is public and the new assembly is of interest to the community.

      However, I think the assembly has limited interest for the research and breeding community without a gene annotation, which is not part of the manuscript. Since the authors have the data (e.g. Iso-Seq) and expertise, I do not understand why it has not been included in the first place.

    2. Red clover

      Reviewer 2. Jianghua Chen

      Red clover is one of the most important forage crops in the world. Its gametophytic self-incompatibility, resulting in inherently high heterozygosity, is a major challenge to obtaining a high-quality genome sequence using traditional short-read based genome assemblies. The authors, Bickhart et al., used a long-read based assembly method to obtain a high-quality genome, which reduces the number of contigs by more than 500-fold, improves the per-base quality, and brings the genome size to 413.5 Mb, matching well with the predicted genome size. This assembly accurately represents the seven main linkage groups, and it will help scientists to understand the origin of the condensed tannin pathway in leaf forages and to facilitate gene discovery and the application of biotechnology to increase nutritional value.

      I strongly support the editor to accept this manuscript to be published.

    1. ABSTRACT

      Reviewer 2. Cory Hirsh

      This manuscript describes the generation of a time-series dataset of conventional and hyperspectral images of commonly known and important maize lines. The authors describe the methods of data collection and how it is useful, especially in conjunction with other already available datasets for the same lines. The authors begin to analyze the dataset generated, focusing on biomass measures and determining heritability. The authors conclude that they believe it is important and necessary to combine controlled environment data with field data to tackle problems facing crop production. I do have several comments about the manuscript in its current form:

      1. My main concern about the manuscript is the amount of data used in the article. The manuscript was submitted as a 'Data Note', but it is not obvious that this data is exceptional, rare, or novel, as it was collected nearly 2 years ago. One criterion for reviewing this type of article is dataset size. The authors are claiming a dataset size of ~500 GB, but this includes data (thermal infrared and fluorescence images) that was not mentioned in the manuscript except that it was collected. I applaud the authors for the willingness to be so open with their data, but I'm not convinced that one month's worth of images for 32 genotypes is enough for publication.

      2. The manuscripts main point is not to get into conclusions based on their image analysis, but I would have liked to have seen more strenuous ground truthing. The manual measurements were made only at the very last time point. These really should encompass the variation of plants throughout development. How can we determine if the measured traits are accurate at day 9 for example? Nothing can be done for true manual measurements, but digital manual measurements could be made and correlated with image analysis extracted values.

      3. Broad-sense heritability needs to be corrected throughout the manuscript.

      Re-review:


      Comments: I want to clarify my first review of this manuscript. It was not my intention to make it seem as if the dataset generated for this manuscript is not important, large, or useful for the broader maize and plant phenotyping community. This dataset could be very useful for some research groups, including the corresponding author's group. I totally agree with the authors' response to the question about the age of the dataset, namely that the cycle time from data collection to publication in plant phenomics is generally long. The authors give numerous examples to back up this point. I'm not disputing this, but the authors should also note the amount of downstream analyses and new biological findings that are in these manuscripts as well. The importance of the presented dataset, as outlined by the authors, is its ability to link with other already available datasets, which isn't shown in the manuscript. This paper is a data release paper with a valuable, controlled, and well documented dataset. The real value of the dataset will be shown in subsequent publications that begin to combine the multiple datasets available for these maize lines (field phenotyping, genotyping, controlled environment phenotyping).

    1. Abstract

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/gix103), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      Reviewer 1. Andrew Adey

      In the manuscript by De Maere and Darling, the authors describe their computational simulator for HiC and 3C sequencing that models the 3D arrangement of chromatin and how that arrangement is conveyed via proximity ligation methods. Overall the manuscript is long and does not clearly describe the main goals of the simulator. The detail is appreciated, but not when it obfuscates the main goal of the manuscript. Also, the figures could be condensed so that there are fewer figures with more panels. That being said, I do believe the simulator that the authors have developed is very sophisticated and appears to work well, with a few exceptions. The major issue is the packaging of the method into a more concise and clear text. Below are some more specific comments:

      My first thought is regarding where this simulator will be particularly useful. The authors mention it is primarily for software tool development, that the cost of generating HiC/3C data is very high, and that many of the existing datasets are sparse. However, there are many existing datasets that are extremely rich and deep that would seem more appropriate. While I am not convinced of the utility for software development when abundant real data is publicly available, I do agree that having means to simulate sequence read data may have other valuable applications - primarily in exploring power in deconvolving metagenomic samples.

      For the eukaryotic simulated data there is a clear stretch of signal that is perpendicular to the diagonal, as is typically observed for circular genomes, though this would not be expected for linear chromosomes (e.g. Figure 7). Does the simulator assume all chromosomes are circular? This is odd and needs to be addressed. Also on Figure 7, the authors are highlighting that there is a greater inter-chromosomal signal when compared to real data - is that a good thing? I can see that it may be desirable if the goal is to generate signal under the assumption that there is no chromatin organization in the genome, and thus be used as a background model. I can see this as a potential use, but it should be more clearly stated.

      The authors describe the ability to simulate TADs - however, it is not clearly described how the TADs are decided upon. Can users specify where TADs should be located (e.g. if they have a callset of TADs and want to create data simulating them that they can then alter - e.g. change one TAD and see how it affects signal nearby, so they can know what to expect for an experiment where they may be altering TAD-forming loci)? Or are they only created randomly (which seems the case given page 8, line 212)? This could also be more clearly described by stating broadly what is done and then going into the methods of how that is accomplished.

      Figure 2 is an extremely simple and small diagram - could it not just be added into Figure 1? It seems a bit excessive to stand as its own figure. This goes for several other figures. Figure 8 - there is no description for panels c and d. I assume c is real and d is simulated. The strong perpendicular band midway through the chromosome is observed, which is discouraging, as I have commented on for Figure 7.

      Re-review: The major issues I had with the manuscript previously were that it was too long and may have limited interest. The authors have addressed the first point. For the second, I believe that the interest is broad enough to warrant publication.

    2. Background

      Reviewer 2. Ming Hu

      In this paper, the authors developed a software package Sim3C to simulate Hi-C data and other 3C-based data. This work addresses a very important research question, and has the potential to become a useful computational tool in genomics research. However, the authors need to provide more explanations and technical details to further improve the current manuscript.

      Here are my specific comments:

      Major comments:

      1. Figure 3. It is better to plot Figure 3 in log scale for both the x-axis and the y-axis. In log scale, the slope of the contact probability has a direct biophysical interpretation, as described in the first Hi-C paper (Lieberman-Aiden et al., Science, 2009). I am very curious to see how the biophysics model contributes to the data generation mechanism.

      2. In the Rao et al., Cell, 2014 paper, they identified chromatin loops anchored by CTCF motifs. In Sim3C, the authors considered the 1D genomic distance effect and hierarchical TAD structures. It would be great if Sim3C could also take chromatin loops into consideration.

      3. Hi-C data can help to detect allele-specific chromatin interactions. Is Sim3C able to simulate allele-specific proximity ligation data?

      4. It is very important to rigorously evaluate data reproducibility. Using Sim3C, if users simulate Hi-C data multiple times with different random seeds, would the reproducibility between two simulated datasets be comparable to the reproducibility between two real biological replicates?

      5. The authors showed simulated contact matrices of bacteria (Figure 6) and budding yeast (Figure 7). They also need to simulate both human and mouse genome-wide contact matrices and compare the simulated contact matrices with real data.

      Minor comments:

      1. Please replace all 'HiC' by 'Hi-C'.

      2. Page 6, line 116, "sciHiC" should be "scHi-C".

    1. Abstract

      A version of this preprint has been published in the Open Access journal GigaByte (see paper https://doi.org/10.46471/gigabyte.41), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      Reviewer 1. Mary Ann Tuli

      Are all data available and do they match the descriptions in the paper? No. The paper states "The D. citri genome assembly (v3), OGS (v3) and transcriptomes are accessible on the Citrusgreening.org portal" I believe v2 is available, not v3 yet.

      Additional Comments The paper states "The gene models will be part of an updated official gene set (OGS) for D. citri that will be submitted to NCBI." Until these models are available in NCBI their reuse is limited.

      Recommendation: Minor Revision

    2. Citrus greening disease

      Reviewer 2. Xinyu Li

      In the paper entitled “Annotation of glycolysis, gluconeogenesis, and trehaloneogenesis pathways provide insight into carbohydrate metabolism in the Asian citrus psyllid”, the authors conducted a high-quality annotation of genes involved in glycolysis, gluconeogenesis, and trehaloneogenesis in the Diaphorina citri genome, which provides a basis for developing gene-targeting therapeutics for this important pest species.

      The MS is well-written, and the analyses are clear and proper. I found some minor concerns that should be addressed.

      In the first paragraph of Page 10, the authors use a cross symbol and an asterisk in the sentence “The number of genes identified in glycolysis….from NCBI, OrthoDB, and Flybase.”. However, the cross symbol and the asterisk are used without any explanation or citation. I suggest citing the Appendix the authors refer to, or adding an explanation to make this clearer.

      In the Conclusion, on Page 15, the authors state “Expression analysis of the genes annotated in the carbohydrate metabolism pathways identified differences related to life stage, sex and tissue.”. However, what these differences are is not mentioned here. I think it would be better to summarize the key/predominant differences in gene expression in the carbohydrate metabolism pathways.

      In addition, it is interesting that the expression of genes related to carbohydrate metabolism differs between the sexes in the Asian citrus psyllid. Is this common in insects, or does it exist only in some specific groups?

    1. Abstract

      Reviewer 2. Bruno Fosso

      The paper by Bremges et al. describes CAMITAX, a workflow designed for the taxonomic classification of microbial genomes obtained from the application of NGS-based methodologies, such as single-cell sequencing and metagenomics. Even if the four implemented methodologies themselves do not represent a real novelty in the field, their harmonization by means of a classification algorithm is interesting. Moreover, the idea of deploying the workflow in a container greatly simplifies both installation and usage and ensures the reproducibility of analyses.

      The manuscript is well written and easy to read. All the proposed figures are appropriate and adequately support the data described in the main text. Figure 2 may be improved by using different colors that make it easier to discriminate the paths through the plot.

      The CAMITAX GitHub repository clearly describes how to access and configure the container, but very little information is available about manual installation. The usage section needs improvement.

      I have some minor concerns about the paper:
      - the classification algorithm needs to be described in more depth; a figure may help the readers;
      - regarding the overall drop of CAMITAX recall at mid-range ranks, I was wondering if it may be because CAMITAX seems to be more conservative than the Delmont classification (figure 2). The authors should discuss in how many cases CAMITAX is more conservative than the reference classification;
      - moreover, the authors claim that "Notably, 95% of CAMITAX's predictions were consistent with Delmont et al., i.e. the two assignments were on the same taxonomic lineage and their LCA is either of the two." Does this mean the authors consider a classification consistent when CAMITAX assigns at the kingdom rank while Delmont assigns at the species rank? Please clarify.

      It would be useful to add some information about the technical requirements such as consumed RAM and required CPU time.

    2. Now published in GigaScience doi: 10.1093/gigascience/giz154

      Andreas Bremges (Computational Biology of Infection Research, Helmholtz Centre for Infection Research, 38124 Braunschweig, Germany; German Center for Infection Research (DZIF), partner site Hannover-Braunschweig, 38124 Braunschweig, Germany), Adrian Fritz (Computational Biology of Infection Research, Helmholtz Centre for Infection Research, 38124 Braunschweig, Germany), and Alice C. McHardy (Computational Biology of Infection Research, Helmholtz Centre for Infection Research, 38124 Braunschweig, Germany). For correspondence: andreas.bremges@helmholtz-hzi.de, alice.mchardy@helmholtz-hzi.de

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz154 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102049

    1. Abstract

      A version of this preprint has been published in the journal GigaByte under a CC-BY 4.0 license (see https://doi.org/10.46471/gigabyte.40)

      Reviewer 1. Jianbo Jian

      This submission described a reference genome for the Atlantic chub mackerel (Scomber colias) built using a combination of PacBio HiFi long reads and Illumina short reads. The sequencing data processing, genome assembly, and related bioinformatics analyses are comprehensive and adequate. The reported reference genome is the first for this species and shows good continuity. It is a pity that the genome is not at the chromosome level due to the lack of Hi-C or genetic map data. However, the associated analyses and results make sense. In my opinion, as the first reference genome in the genus Scomber, this reference genome is a valuable genomic resource for population genetics, ecology, physiology, and other future research. I have some concerns that should be addressed before publication in GigaByte.

      1) In the project design, two individuals were used for genomic DNA extraction for genome assembly. Why not use the same individual, to avoid assembly errors due to genetic differences between individuals?
      2) Lines 186-196: I am somewhat confused about the contamination process. Was there some contamination in your sample? In general, most genome projects do not include this step; it is effective only for specific samples where contamination must be avoided.
      3) In the phylogenomics analysis, estimating divergence times is recommended, and the figure should then be updated to make more sense.
      4) Supp. Table 6 is blank.
      5) None of the supplementary tables are referenced in the manuscript.
      6) The genome assembly from Illumina sequencing adds little compared with the HiFi data.
      7) In Supplementary Table 5, N50 (Kb) should be N50 (bp).

      Recommendation: Minor Revision

      Reviewer 2. Rong Huang

      Is there sufficient information for others to reuse this dataset or integrate it with other data? No. It is suggested that the authors make a simple table showing the assembly quality of genomes within the Order Scombriformes.

      Additional Comments: Scomber colias is a valuable marine resource, with a high impact on the fisheries of several countries on the west coast of the Atlantic Ocean and/or the Mediterranean Sea. This study reports the first genome assembly of the Atlantic chub mackerel. This genome is timely and the assembly process is clearly described, which will contribute to the effective conservation, management, and sustainable exploitation of S. colias in the Anthropocene. I still have the following questions.

      The quality of the genome assembly does not seem to be particularly good; for example, the scaffold N50 is not long enough. What is the ploidy of this species? Do heterozygosity and repeat content affect the assembly quality?

      It is suggested that the authors make a simple table showing the assembly quality of genomes within the Order Scombriformes. This would help relevant researchers make use of these genomic resources.

      Is "data validation" meant to serve as the results section? And there are no subheadings in the results part. Is this the required structure for this type of article?

      Recommendation: Major Revision

    1. Abstract

      A version of this preprint has been published in the journal GigaByte under a CC-BY 4.0 license (see paper), and is also part of the Asian citrus psyllid community annotation series of papers that can be viewed here: https://doi.org/10.46471/GIGABYTE_SERIES_0001

      Reviewer 1. Alex Arp

      In "Genomic identification, annotation, and comparative analysis of Vacuolar-type ATP synthase subunits in Diaphorina citri" the authors did just that. The paper is well written, direct, and easy to follow. The rationale for annotating these genes is clearly stated: they are possible targets for RNAi-based control of Diaphorina citri, an economically important pest of citrus. The annotation of the genes utilized genomic and transcriptomic databases, and the gene expression profiles used existing datasets. Figures and tables are clear, and the phylogenetic trees give sufficient supporting evidence that the annotations are correct. Overall this is a good manuscript and needs no major revision for publication.

      Additional Comments: On Page 18 what is meant by "new protein" in "It is a relatively new protein critically associated with the assembly of a certain cell type V-ATPase and is still being studied"?

      Reviewer 2. Mary Ann Tuli

      Are all data available and do they match the descriptions in the paper?

      Yes. As with the other manuscripts, OGS v3 is mentioned, but this is not yet available from the CGEN. The requested data underlying the tables and figures have been uploaded.

      Any Additional Overall Comments to the Author: This manuscript is a comprehensive description of the manual curation of the V-ATPase genes, with clear aims and methodology.

      Recommendation: Accept

    1. Now published in GigaScience doi: 10.1093/gigascience/giab063

      Yilei Fu (Department of Computer Science, Rice University, Houston, TX 77005, USA), Medhat Mahmoud (Human Genome Sequencing Center and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA), Viginesh Vaibhav Muraliraman (Department of Computer Science, Rice University, Houston, TX 77005, USA), Fritz J. Sedlazeck (Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA), and Todd J. Treangen (Department of Computer Science, Rice University, Houston, TX 77005, USA). For correspondence: Fritz.Sedlazeck@bcm.edu, treangen@rice.edu

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab063 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102841 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102842

    1. Now published in GigaScience doi: 10.1093/gigascience/giab062

      Lukas M. Weber (Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA), Ariel A. Hippen (Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, PA, USA), Peter F. Hickey (Advanced Technology & Biology Division, Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia), Kristofer C. Berrett, Jason Gertz, and Jennifer Anne Doherty (Huntsman Cancer Institute and Department of Population Health Sciences, University of Utah, UT, USA), and Stephanie C. Hicks (Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA). For correspondence: shicks19@jhu.edu

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab062 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102826 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102827

    1. Abstract

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab064 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102834 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102835

    1. Now published in GigaScience doi: 10.1093/gigascience/giab056

      Shufang Wu, Zhencheng Fang, and Jie Tan (State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing 100871, China; Center for Quantitative Biology, Peking University, Beijing 100871, China), Mo Li and Chunhui Wang (Peking University-Tsinghua University-National Institute of Biological Sciences (PTN) joint PhD program, School of Life Sciences, Peking University, Beijing 100871, China), Qian Guo and Congmin Xu (Peking University, Beijing 100871, China; Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Georgia 30332, USA), Xiaoqing Jiang (Peking University, Beijing 100871, China), and Huaiqiu Zhu (Peking University, Beijing 100871, China; Georgia Institute of Technology and Emory University, Georgia 30332, USA; Institute of Medical Technology, Peking University Health Science Center, Beijing 100191, China). For correspondence: hqzhu@pku.edu.cn

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab056 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102812 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102813

    1. Now published in GigaScience doi: 10.1093/gigascience/giab004

      Fan Zhang (Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA) and Hyun Min Kang (Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA). For correspondence: fanzhang@umich.edu

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab004 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102627 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102628

    1. Now published in GigaScience doi: 10.1093/gigascience/giz121

      Yun-Ching Chen, Abhilash Suresh, Komudi Singh, Fayaz Seifuddin, and Mehdi Pirooznia (Bioinformatics and Computational Biology Core, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, United States), and Chingiz Underbayev, Clare Sun, and Adrian Wiestner (Hematology Branch, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, United States). For correspondence: mehdi.pirooznia@nih.gov

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz121 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101927 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101928 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101929

    1. Now published in GigaScience doi: 10.1093/gigascience/giz118

      Xiao Hu

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz118 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101954 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101955 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101956

    1. COVID-19 pandemic

      Reviewer 3. Daniel Mietchen

      This review includes supplemental files, videos and hypothes.is annotations of the preprint: https://zenodo.org/record/4909923

      The videos of the review process are also available on YouTube:

      Part 1 (Screen Recording 2021-06-05 at 10.02.02.mov): https://youtu.be/_UnDdE3Oi-4 Part 2 (Screen Recording 2021-06-05 at 10.52.51.mov): https://youtu.be/z5xRK0lg3b4 Part 3 (Screen Recording 2021-06-05 at 11.27.01.mov): https://youtu.be/VnztlEqFW2A Part 4 (Screen Recording 2021-06-07 at 02.51.59.mov): https://youtu.be/IYtLfMcLTvA Part 5 (Screen Recording 2021-06-07 at 06.11.52.mov): https://youtu.be/Jv_AUHCASQw Part 6 (Screen Recording 2021-06-07 at 18.07.45.mov): https://youtu.be/6Y-yA9oahzM Part 7 (Screen Recording 2021-06-07 at 19.07.02.mov): https://youtu.be/LV5whFhfmEU

      First round of review:

      Summary The present manuscript provides an overview of how the English Wikipedia incorporated COVID-19-related information during the first months of the ongoing COVID-19 pandemic.

      It focuses on information supported by academic sources and considers how specific properties of the sources (namely their status with respect to open access and preprints) correlate with their incorporation into Wikipedia, as well as the role of existing content and policies in mediating that incorporation.

      No aspect of the manuscript would justify a rejection, but there are many opportunities for improvement, so "Major revision" appears to be the most appropriate recommendation at this point.

      General comments The main points that need to be addressed better: (1) documentation of the computational workflows; (2) adaptability of the Wikipedia approach to other contexts; (3) descriptions of or references to Wikipedia workflows; (4) linguistic presentation.

      Ad 1: while the code used for the analyses and for the visualizations seems to be shared rather comprehensively, it lacks sufficient documentation as to what was done in what order and what manual steps were involved. This makes it hard to replicate the findings presented here or to extend the analysis beyond the time frame considered by the authors. Ad 2: The authors allude to how pre-existing Wikipedia content and policies - which they nicely frame as Wikipedia's "scientific infrastructure" or "scientific backbone" - "may provide insight into how its unique model may be deployed in other contexts" but that potentially most transferrable part of the manuscript - which would presumably be of interest to many of its readers - is not very well developed, even though that backbone is well described for Wikipedia itself. Ad 3: there is a good number of cases where the Wikipedia workflows are misrepresented (sometimes ever so slightly), and while many of these do not affect the conclusions, some actually do, and overall comprehension is hampered. I highlighted some of these cases, and others have been pointed out in community discussions, notably at https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:WikiProject_COVID- 19&oldid=1028476999#Review_of_Wikipedia's_coverage_of_COVID and http://bluerasberry.com/2021/06/review-of-paper-on-wikipedia-and-covid/ . Some resources particularly relevant to these parts of the manuscript have not been mentioned, be it scholarly ones like https://arxiv.org/abs/2006.08899 and https://doi.org/10.1371/journal.pone.0228786 or Wikimedia ones like https://en.wikipedia.org/wiki/Wikipedia_coverage_of_the_COVID-19_pandemic and https://commons.wikimedia.org/wiki/File:Wikimedia_Policy_Brief_-_COVID-19_- _How_Wikipedia_helps_us_through_uncertain_times.pdf . 
Likewise essentially missing - although this is a common feature in academic articles about Wikipedia - is a discussion of how valid the observations made for the English Wikipedia are in the context of other language versions (e.g. Hebrew). On that basis, it is understandable that no attempt is made to look beyond Wikipedia to see how coverage of the pandemic was handled in other parts of the Wikimedia ecosystem (e.g. Wikinews, Wikisource, Wikivoyage, Wikimedia Commons and Wikidata), but doing so might actually strengthen the above case for deployability of the Wikipedia approach in other contexts. Disclosure: I am closely involved with WikiProject COVID-19 on Wikidata too, e.g. as per https://doi.org/10.5281/zenodo.4028482 . Ad 4: The relatively high number of linguistic errors - e.g. typos, grammar, phrasing and also things like internal references or figure legends - needlessly distracts from the value of the paper. The inclusion of figures - both via the text body and via the supplement - into the narrative is also sometimes confusing and would benefit from streamlining. While GigaScience has technically asked me to review version 3 of the preprint (available via https://www.biorxiv.org/content/10.1101/2021.03.01.433379v3 and also via GigaScience's editorial system), that version was licensed incompatibly with publication in GigaScience, so I pinged the authors on this (via https://twitter.com/EvoMRI/status/1393114202349391872 ), which resulted (with some small additional changes) in the creation of version 4 (available via https://www.biorxiv.org/content/10.1101/2021.03.01.433379v4 ) that I concentrated on in my review.

      Production of that version 4 - of which I eventually used both the PDF and the HTML, which became available to me at different times - took a while, during which I had a first full read of the manuscript in version 3.

      In an effort to explore how to make the peer review process more transparent than simply sharing the correspondence, I recorded myself while reading the manuscript for the second time, commenting on it live. These recordings are available via https://doi.org/10.5281/zenodo.4909923 .

      In terms of specific comments, I annotated version 4 directly using Hypothes.is, and these annotations are available via https://via.hypothes.is/https://www.biorxiv.org/content/10.1101/2021.03.01.433379v4.full .

      Re-review: I welcome the changes the authors have made - both to the manuscript itself (of which I read the bioRxiv version 5) and to the WikiCitationHistoRy repo - in response to the reviewer comments. I also noticed comments they chose not to address, but as stated before, none of these would be ground for rejection. What I am irritated about is whether the proofreading has actually happened before the current version 5 was posted. For instance, reference 44 seems missing (some others are missing in the bioRxiv version, but I suspect that's not the authors' fault), while lots of linguistic issues in phrases like "to provide a comprehensive bibliometric analyses of english Wikipedia's COVID-19 articles" would still benefit from being addressed. At this point, I thus recommend that the authors (a) update the existing Zenodo repository such that there is some more structure in the way the files are shared (b) archive a release of WikiCitationHistoRy on Zenodo

    2. Background

      Reviewer 2. Dean Giustini

      This is a well-written manuscript. The methods are well described. I've confined my comments to improving the reporting of your methods, some comments about the paper's structure, and a few about the readability of the figures and tables (which I think are in general too small and difficult to read). Here are my main comments for your consideration as you work to improve your paper:

      1) Title of manuscript - the title of your paper seems inadequate to me, and doesn't really convey its content. A more descriptive title that includes the idea of the "first wave" might be useful from my point of view as a reader who scans titles to see if I am interested. I'd recommend including words in the title that refer to your methods. What type of research is this - a quantitative analysis of citations? Title words say a lot about the robust nature of your methods. As you consider whether to keep your title as is, keep in mind that title words will aid readers in understanding your research at a glance, and provide impetus to read your abstract (and one hopes the entire manuscript). These words will help researchers find the paper later as well via the Internet's many search engines (e.g., Google Scholar).

      2) Abstract - The abstract is well-written. Could the aims of your research be more obvious? and clearly articulated? How about using a statement such as "This research aims to" or similar? I also don't understand the sentence that begins with "Using references as a readout". What is meant by a "readout" in this context? Do you mean to read a print-out of references later? Lower down, you introduce the concept of Wikipedia's references as a "scientific infrastructure", and place it in quotations. Why is it in quotations? I wondered what the concept was on first reading it. A recurring web of papers in Wikipedia constitutes a set of core references - but would I call them a scientific infrastructure? Not sure; they are a mere sliver of the scientific corpus. Not sure I have any suggestions to clarify the use of this phrase.

      3) Introduction - This is an excellent introduction to your paper, and it provides a lot of useful context and background. You make a case for positioning Wikipedia as a trusted source of information based on the highly selective literature cited by the entries. However, I would only caution that some COVID-19 entries cite excellent research but the content is contested, and vice versa. One suggestion I had for this section was the possibility of tying citizen science (part of open science) to the rise of Wikipedia's medwiki volunteers. Wikipedia provides all kinds of ways for citizens to get involved in science. As an open science researcher, I appreciated all of the open aspects you mention. Clearly, open access to Wikipedia in all languages is a driving force in combatting misinformation generally, and the COVID "infodemic" specifically. I admit I struggled to understand the point of the section that begins, "Here, we asked what role does scientific literature, as opposed to general media, play in supporting the encyclopedia's coverage of the COVID-19 as the pandemic spread." The opening sentence articulates your a priori research question, always welcome for readers. Would some of the information that follows in this section around your methods be better placed in the following section under the "Material and Methods"? I found it jarring to read that "....after the pandemic broke out we observed a drop in the overall percentage of academic references in a given coronavirus article, used here as a metric for gauging scientificness in what we term an article's Scientific Score." These two ideas are introduced again later, but I had no idea on reading them here what they signified or whether they were related to research you were building on. You might consider adding a parenthetical statement that they will be described later, and that the idea of a score is your own.

      4) Material and methods - Your methods section might benefit from writing a preamble to prepare your readers. As already mentioned, consider taking some of the previous section and recasting it as an introduction to your methods. Consider adding some information to orient readers, and elaborating in a sentence or two about why identifying COVID-19 citations / information sources is an important activity.

      By the way, what is meant by this: "To delimit the corpus of Wikipedia articles containing DOIs"? Do you mean "identify" Wikipedia articles with DOIs in their references? As I mentioned (apologies in advance for the repetition), it strikes me as odd that you don't refer to this research as a form of citation analysis (isn't that what it is?). Instead you characterize it as "citation counting". If your use of words has been intentional, is there a distinction you are making that I simply do not understand? Also: bibliometricians and/or scientometricians might wonder why you avoid the phrase citation analysis. Further to your methods which are primarily quantitative and statistical - what are the qualitative methods used throughout the paper to analyze the data? How did you carry out this qualitative work? (On page 10, you state "we set out to examine in a temporal, qualitative and quantitative manner, the role of references in articles linked directly to the pandemic as it broke.") That part of your methods seems to be a bit under-developed, and may be worth reconsidering as you work to improve your reporting in the manuscript.

      5) Table 1. I am not sure what this table adds to the methods given it leads off your visuals. Do you really need it? It doesn't reveal anything to me and could be in a supplemental file. I also have difficulties in properly seeing table 1; perhaps you could make it larger and more readable?

      6) Figure 1. This is the most informative visual in the paper but it is hard to read and crowded. It deserves more space or the information it provides is not fully understood.

      7) Figure 3. This is very bulky as a figure, although informative. Again, I'm not sure all of it needs inclusion. Perhaps select part of it, and include other parts in a supplement.

      8) Limitations - The paper does not adequately address its limitations. A fuller evaluation of limitations would be beneficial to me as a reader, as it would place your work in a larger context. For example, consider asking whether the results are indicative of Wikipedia's other medical or scientific entries? Or are the results not generalizable at all? In other words, are they indicative of something very limited based on the timeframe that you examined? I found myself disagreeing with: "....the mainstream output of scientific work on the virus predated the pandemic's outbreak to a great extent". Is this still true, and what might its significance be now that we are in 2021? Would it be helpful to say that most of the foundational research re: the family of coronaviruses was published pre-2020, but COVID-19 disease and treatment entries now cite distinctly different papers, especially going forward? Wiki editors identify relevant papers over time but, in my experience, are not adept at identifying emerging evidence or at incorporating important papers early; it's strange given that recency is one of Wikipedia's true calling cards. For me, the most confounding aspect of the infodemic is the constant shifts of evidence, and how to respond in a way that is prudent and evidence-based. As you point out, Wikipedia has an 8.7-year latency in citing highly relevant papers, and it seems likely that many important COVID-19 papers were neglected in Wikipedia during the first wave, especially those about the disease. As you note, this will form part of future research, which I hope you and your team will pursue.

      9) Reference 31 lacks a source: Amit Arjun Verma and S. Iyengar. Tracing the factoids: the anatomy of information reorganization in Wikipedia articles. 2021.

      Good luck with the next stages in improving your manuscript for publication. I believe it adds to our understanding of Wikipedia's role in promoting sources of information.

    3. Abstract

      This paper has been published in GigaScience under a CC-BY 4.0 license (see: https://doi.org/10.1093/gigascience/giab095). As the journal carries out open peer review, these reviews have also been published under the same license.

      Reviewer 1. Dariusz Jemielniak This is a very solid article on a timely topic. I also commend you for the thorough and meticulous methodology.

      One thing that I believe you could amplify on is what your proposed solution would be to the "trade-off between timeliness and scientificness". After all, Wikipedia relies on sources that are reliable, verifiable, but above all... available. At a time when there are no academic journal articles published (yet), the chosen modus operandi does not appear to be a trade-off; it is basically the only logical solution. A trade-off would occur if the less valuable sources were not replaced when more academic ones appear, and this is not the case. I believe you should mention the fact that Wikipedia has an agreement with the Cochrane database, which likely affects the popularity of this source.

      Additionally, I think that the literature review needs to be expanded. There are already some publications about Wikipedia and COVID-19, as well as about medical coverage on Wikipedia (some non-exhaustive references added below). Moreover, Wikipedia has been a topic covered in GigaScience and it would be reasonable to reflect on the previous conversations in the journal in your publication.

      Chrzanowski, J., Sołek, J., & Jemielniak, D. (2021). Assessing Public Interest Based on Wikipedia's Most Visited Medical Articles During the SARS-CoV-2 Outbreak: Search Trends Analysis. Journal of Medical Internet Research, 23(4), e26331.

      Colavizza, G. (2020). COVID-19 research in Wikipedia. Quantitative Science Studies, 1-32.

      Jemielniak, D. (2019). Wikipedia: Why is the common knowledge resource still neglected by academics? GigaScience, 8(12), giz139.

      Jemielniak, D., Masukume, G., & Wilamowski, M. (2019). The most influential medical journals according to Wikipedia: quantitative analysis. Journal of Medical Internet Research, 21(1), e11429.

      Kagan, D., Moran-Gilad, J., & Fire, M. (2020). Scientometric trends for coronaviruses and other emerging viral infections. GigaScience, 9(8), giaa085.

  14. Jan 2022
    1. Abstract

      Reviewer 2. Rhonda Bacher It is good to have alternative workflows for single-cell analysis, and I am glad to see the authors have submitted the package to Bioconductor. I hope the authors maintain the package and update with new methods as necessary such as if new normalizations or batch corrections are developed. I only have two comments that I hope the authors try to clarify further:

      1. The statement starting with "Optionally, after batch-to-batch normalisation, we also..." should not be in that location. It seems to suggest to readers that this is the recommended method, whereas later that is not the case. In these sentences the manuscript also claims that this normalization approach is more "robust" without providing any evidence or citation.
      2. It's still not completely clear to me how the authors extension of the sc-qPCR method is different from MAST. The same authors of the qPCR method extended it here: "MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data". MAST is also an LRT, but I am assuming that here you are not using the detection rate as a covariate? That's OK if true, it just needs to be clear to the reader. I imagine this could be a frequently asked question by users down the road, so even a sentence on how it is different from (or similar to) MAST would help. Suggestion only: I may have missed it, but it might be helpful to include a statement that says something like "Statistical methods for single-cell analysis are constantly evolving. Here we have implemented XX. The flexibility of ascend allows it to adapt as future methods are developed and prove useful".
    1. Background

      Reviewer 2. Jianbo Jian In this manuscript, Xi et al report a chromosome-level genome of the common vetch (Vicia sativa), integrating Oxford Nanopore sequencing, Illumina sequencing, Chicago and Hi-C. Gene annotation and evolutionary analyses were then performed based on the reference genome. These genomic resources are valuable for evolutionary research, genetic diversity studies and genomic breeding. I think this manuscript is suitable for publication in GigaByte. Some minor comments and suggestions follow:

      1) Line numbers are missing from this manuscript, which makes giving detailed comments inconvenient. 2) Page 6, “resequenced short-reads” should be “de novo sequencing” or “sequencing”. 3) The 1.93 Gb assembled genome is somewhat larger than the sizes estimated by flow cytometry (1.77 Gb) and GenomeScope (1.61 Gb). There may be duplicated sequences in this version of the assembly; redundancy-removal software such as Purge Haplotigs or purge_dups could address this. 4) For genome evaluation, the LTR Assembly Index (LAI) is suggested as an additional quality assessment. 5) In Table S2, the mapping rate is very good, but the genome coverage is only 76%, which looks a little low. What is the reason? 6) In Table S4, the gene set was combined by AUGUSTUS; however, the Methods state that the annotation software is BRAKER v2.1.6.

      Recommendation Minor Revision

      Re-review: The revised manuscript and response are satisfactory. The additional analyses that the authors have performed are correctly structured. The data presented is clear. In my opinion, I recommend accepting this manuscript.

    2. Abstract

      This paper has been published in GigaByte Journal under a CC-BY 4.0 Open Access license (see: https://doi.org/10.46471/gigabyte.38), and the open peer reviews have also been shared under this license. These reviews were as follows.

      Reviewer 1. Jonathan Kreplak Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. For "Phylogenetic tree construction and divergence time estimation", the 64 selected single-copy orthologs should be included in a supplementary table so that the analysis can be fully reproduced. Also, Supplementary Table S9 should relate to fossil calibrations but instead shows chromosome lengths.

      Recommendation: Minor revision

    1. Now published in GigaScience doi: 10.1093/gigascience/giaa083

      Andre Macedo and Alisson M. Gontijo, Chronic Diseases Research Center (CEDOC), NOVA Medical School | Faculdade de Ciências Médicas, Universidade Nova de Lisboa, Rua do Instituto Bacteriológico 5, 1150-190, Lisbon, Portugal. For correspondence: andre.macedo@nms.unl.pt, alisson.gontijo@nms.unl.pt

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaa083 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102344 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102345

    1. Now published in GigaScience doi: 10.1093/gigascience/giz149

      Michal Stolarczyk (1), Vincent P. Reuter (1), Neal E. Magee (5), Nathan C. Sheffield (1,2,3,4)

      1 Center for Public Health Genomics, University of Virginia; 2 Department of Public Health Sciences, University of Virginia; 3 Department of Biomedical Engineering, University of Virginia; 4 Department of Biochemistry and Molecular Genetics, University of Virginia; 5 Research Computing, University of Virginia. For correspondence: nsheffield@virginia.edu

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz149 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102075 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102076 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102077 Reviewer 4: http://dx.doi.org/10.5524/REVIEW.102078

    1. Now published in GigaScience doi: 10.1093/gigascience/giz115

      Bo Song (1,2,3), Yue Song (1,3,4), Yuan Fu (1,2), Elizabeth Balyejusa Kizito (5), Pamela Nahamya Kabod (5), Huan Liu (1,2,3), Sandra Ndagire Kamenya (5), Samuel Muthemba (6), Robert Kariba (6), Xiuli Li (1,2), Sibo Wang (1,3), Shifeng Cheng (1,2), Alice Muchugi (6), Ramni Jamnadass (6), Howard-Yana Shapiro (6,7,9), Allen Van Deynze (7), Huanming Yang (1,2), Jian Wang (1,2), Xun Xu (1,2,3), Damaris Achieng Odeny (8), Xin Liu (1,2,3)

      1 BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China; 2 China National GeneBank, BGI-Shenzhen, Jinsha Road, Shenzhen 518120, China; 3 State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen 518083, China; 4 BGI-Qingdao, BGI-Shenzhen, Qingdao, 266555, China; 5 Uganda Christian University, Bishop Tucker Road, Box 4, Mukono, Uganda; 6 African Orphan Crops Consortium, World Agroforestry Centre (ICRAF), Nairobi, Kenya; 7 University of California, 1 Shields Ave, Davis, USA; 8 ICRISAT-Nairobi, Nairobi, Kenya; 9 Mars, Incorporated, USA. For correspondence: liuxin@genomics D.Odney@cigar.org

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz115 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101930 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101931 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101932

    1. Now published in GigaScience doi: 10.1093/gigascience/giz119

      Sion C. Bayliss, Harry A. Thorpe, Nicola M. Coyle, Samuel K. Sheppard, Edward J. Feil, The Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY. For correspondence: s.bayliss@bath.ac.uk

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz119 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101935 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101936 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101937

    1. Now published in GigaScience doi: 10.1093/gigascience/giz100

      Elena Bushmanova, Dmitry Antipov, Alla Lapidus, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz100 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101881 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101882 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101883 Reviewer 4: http://dx.doi.org/10.5524/REVIEW.101884

    1. Now published in GigaScience doi: 10.1093/gigascience/giz105

      Luca Alessandrì (2), Marco Beccuti (1), Maddalena Arigoni (2), Martina Olivero (3), Greta Romano (1), Gennaro De Libero (4), Luigia Pace (5), Francesca Cordero (1), Raffaele A Calogero (2)

      1 Department of Computer Sciences, University of Torino, Corso Svizzera 185, Torino, Italy; 2 Department of Molecular Biotechnology and Health Sciences, University of Torino, Via Nizza 52, Torino, Italy; 3 Department of Oncology, University of Torino, SP142, 95, 10060 Candiolo TO, Italy; 4 Department Biomedizin, University of Basel, Hebelstrasse 20, 4031 Basel, Switzerland; 5 IIGM, Via Nizza 52, Torino, Italy

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz105 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101893 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101894

    1. Now published in GigaScience doi: 10.1093/gigascience/giz106

      Yingxin Lin (1), Shila Ghazanfar (1,2), Dario Strbenac (1), Andy Wang (1,3), Ellis Patrick (1,4), Dave Lin (5), Terence Speed (6,7), Jean YH Yang (1,2), Pengyi Yang (1,2)

      1 School of Mathematics and Statistics, University of Sydney, NSW 2006, Australia; 2 Charles Perkins Centre, University of Sydney, NSW 2006, Australia; 3 Sydney Medical School, University of Sydney, NSW 2006, Australia; 4 Westmead Institute for Medical Research, University of Sydney, Westmead, NSW 2145, Australia; 5 Department of Biomedical Sciences, Cornell University, Ithaca, NY, 14853, USA; 6 Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, VIC 3052, Australia; 7 Department of Mathematics and Statistics, University of Melbourne, Melbourne, VIC 3010, Australia

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz106 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101910 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101911

    1. Now published in GigaScience doi: 10.1093/gigascience/giz107

      Thomas P. Quinn (1,2), Ionas Erb (3), Greg Gloor (4), Cedric Notredame (3), Mark F. Richardson (1,5,6), Tamsyn M. Crowley (7)

      1 Bioinformatics Core Research Group, Deakin University, 3220, Geelong, Australia; 2 Centre for Molecular and Medical Research, Deakin University, 3220, Geelong, Australia; 3 Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr Aiguader 88, 08003, Barcelona, Spain; 4 Department of Biochemistry, University of Western Ontario, London, Ontario, Canada; 5 Genomics Centre, School of Life and Environmental Sciences, Deakin University, 3220, Geelong, Australia; 6 Centre for Integrative Ecology, School of Life and Environmental Sciences, Deakin University, 3220, Geelong, Australia; 7 Poultry Hub Australia, University of New England, 2351, Armidale, Australia

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz107 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101917 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101918

    1. Now published in GigaScience doi: 10.1093/gigascience/giz094

      Marek Wiewiórka, Agnieszka Szmurło, Tomasz Gambin, Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz094 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101847 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101848 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101849