  1. Apr 2022
    1. ABSTRACT

      This has been published in GigaByte Journal under a CC-BY Open Access license (see https://doi.org/10.46471/gigabyte.44). The open peer reviews have been published under the same license and are as follows:

      Reviewer 1. Changxu Tian

      In this paper, a high-quality genome of the Roundjaw Bonefish was successfully constructed, and population structure for Albula glossodonta, or any bonefish species, was well investigated with high-resolution genomic data. It serves as a valuable resource for future genomic studies of bonefishes to facilitate their management and conservation. The authors have presented the data in a meaningful way. I recommend the manuscript for publication once the following minor concerns are addressed:

      1. In "Tissue Collection and Preservation", why not use the same individual for all DNA sequencing, rather than using the heart tissue of another individual for long-read sequencing and Hi-C sequencing?

      2. In the Illumina RNA part of "Read Error Correction", why were the original reads used rather than the filtered reads?

      3. I suggest adding a discussion of the genomic results for this species to the Discussion section.

      Reviewer 2. Shengyong Xu

      In the present study, the authors report the genome assembly of the bonefish Albula glossodonta, as well as population genomic analyses using ddRAD-seq. These genomic data should be useful for the management and conservation of this species. Some comments are as follows.

      1. The authors should show line numbers in their manuscript.

      2. In the Abstract and Results, the authors should provide fundamental genomic information such as genome size, heterozygosity ratio and repeat ratio, so that readers can better understand the Albula glossodonta genome.

      3. The authors should also provide information on the final genome assembly of this species, i.e. the total length of the assembly, the number and N50 of scaffolds, and so on.

      4. What is the meaning of NG50, LG50, and auNG in the manuscript? And what is the difference between NG50 and N50? The authors should explain why these statistics are used in the description of the genome assembly.

      5. With an annotated genome assembly as a reference, I suggest that the identified SNPs be annotated using software such as SnpEff or ANNOVAR.

      6. A population genomic approach can uncover population divergence at a fine spatial scale. In this manuscript, relatively high levels of genetic differentiation were detected between Mauritius and the other three groups based on the neutral SNP dataset, suggesting possible local adaptation in the Mauritius population. I suggest that the authors further analyze population structure using an outlier dataset to reveal the influence of local adaptation on population differentiation.
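      For readers unfamiliar with the statistics named in comment 4 of Reviewer 2, the sketch below uses the commonly used definitions rather than anything taken from the manuscript: N50 is measured against the total assembly length, NG50 and LG50 against an estimated genome size, and auNG is the area under the NGx curve. The function names, the toy scaffold lengths, and the genome size are illustrative assumptions.

```python
def nx(lengths, total, fraction=0.5):
    """Walk scaffolds from longest to shortest until `fraction` of `total`
    is covered; return (length of the last scaffold added, scaffolds used)."""
    target = fraction * total
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return 0, 0  # the assembly never covers the target (possible for NG50)


def n50(lengths):
    # Reference size is the assembly itself.
    return nx(lengths, sum(lengths))[0]


def ng50(lengths, genome_size):
    # Reference size is an externally estimated genome size.
    return nx(lengths, genome_size)[0]


def lg50(lengths, genome_size):
    # Number of scaffolds needed to reach half of the estimated genome size.
    return nx(lengths, genome_size)[1]


def aung(lengths, genome_size):
    # Area under the NGx curve: sum of squared lengths over the genome size.
    return sum(length * length for length in lengths) / genome_size


# Toy data (hypothetical): an assembly of 100 units for a 200-unit genome.
scaffolds = [50, 30, 10, 10]
genome_size = 200

print("N50 :", n50(scaffolds))
print("NG50:", ng50(scaffolds, genome_size))
print("LG50:", lg50(scaffolds, genome_size))
print("auNG:", aung(scaffolds, genome_size))
```

      In this toy case N50 looks favourable because it only sees the assembled half of the genome, while NG50 and auNG are penalized for the missing sequence, which is the usual argument for reporting the genome-size-based statistics.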

    1. This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giac007), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 3: Mudra Hegde

      Summary:

      In this manuscript, Poudel et al. present a software, GuideMaker, to rapidly design sgRNAs targeting non-model genomes. Various input parameters such as PAM motif, guide length, length of seed region for off-target searching and so on can be toggled to design a panel of sgRNAs for pooled screening projects. The tool also helps pick control sgRNAs to include in the sgRNA pool. To benchmark the computational performance of their tool, the authors used GuideMaker to design sgRNAs targeting E. coli, P. aeruginosa, Aspergillus fumigatus and Arabidopsis thaliana. They also compared GuideMaker to the existing design tool, CHOPCHOP, and reported that the targets identified by GuideMaker were mostly similar to those identified by CHOPCHOP. This tool can be used as a stand-alone web application, command-line software or in the CyVerse Discovery Environment.

      Overall, the tool is very well documented and easy to use. In the current version of the manuscript, GuideMaker does not show a clear improvement over the state-of-the-art design tool, CHOPCHOP. The authors do not implement any existing on-target scoring methods to determine the targeting efficacy of the picked sgRNAs. This can lead to picking guides that are highly specific but not effective enough.

      Major points:

      1. Implementing on-target scoring methods, at least for the Cas enzymes that have on-target efficacy information, can help improve the process of picking sgRNAs. This tool will probably be used more often with standard Cas enzymes and it will be useful to have on-target efficacy scores attached to the guide RNAs.

      2. The authors do a thorough analysis of the computational performance of GuideMaker with various genomes and Cas enzymes but including a comparison of the computational performance of GuideMaker vs. CHOPCHOP will strengthen the manuscript.

      3. The authors define the PAM sequence of SaCas9 to be NGRRT whereas the canonical PAM sequence of SaCas9 is NNGRRT. This should be modified throughout the manuscript and analyses involving SaCas9 should be redone.

      4. A good addition to the tool would be to output a file with all the sequences that were designed targeting the region of interest with the specific PAM sequence. This gives the user a sense of the universe from which the final guides were picked.

      5. Another useful input parameter would be to specify a target region that the user wants to focus on such as letting the user input genomic coordinates or a gene name or locus tag. For example, CRISPy by Blin et al., 2016 takes a GenBank file as input and allows the user to input features specific to the uploaded genome.

      Minor points:

      1. "CyVerse" is misspelled as "CyCVerse" in multiple places in the manuscript.

      2. Reference Figure 2 in Line 92.

      3. Line 154: "Ratios between tools were calculated by dividing the number of gRNA identified.."

      4. In Supplementary Figure 3 "wit haVX2" should be "with aVX2".

      5. GitHub link in Line 336 does not work.

      6. Line 225-226: "GuideMaker also creates off-target gRNAs for use as negative controls in highthroughput experiments." "Off-target gRNAs" is misleading in this context.
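      Major point 3 of this review can be illustrated with a small sketch. It assumes a 3' PAM, a 20-nt guide taken immediately upstream of the PAM, and a forward-strand search only; the helper names and the toy sequence are hypothetical, and this is not GuideMaker's implementation. With the IUPAC codes N (any base) and R (A or G), the five-base motif NGRRT and the six-base motif NNGRRT anchor at positions one base apart, so the guides reported for the same site differ.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def pam_to_regex(pam):
    """Translate an IUPAC PAM motif such as 'NNGRRT' into a regex string."""
    return "".join(IUPAC[base] for base in pam.upper())


def find_targets(seq, pam, guide_len=20):
    """Return (guide, matched_pam) pairs for a 3' PAM on the forward strand.

    A lookahead is used so that overlapping PAM matches are all reported.
    """
    pattern = re.compile(f"(?=({pam_to_regex(pam)}))")
    hits = []
    for match in pattern.finditer(seq):
        start = match.start()
        if start >= guide_len:  # need a full-length guide upstream of the PAM
            hits.append((seq[start - guide_len:start], match.group(1)))
    return hits


# Toy sequence (hypothetical): a G-free prefix, one SaCas9-like site, a tail.
SEQ = "AC" * 11 + "TTGAGT" + "CCCC"

for motif in ("NNGRRT", "NGRRT"):
    for guide, matched in find_targets(SEQ, motif):
        print(motif, guide, matched)
```

      With the shorter motif, the last base of the reported guide is actually the first N of the canonical six-base PAM, which is one way to see why the reviewer asks for the SaCas9 analyses to be redone.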

    2. This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giac007), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Wen Wang

      The author developed a software, GuideMaker, for designing CRISPR-Cas guide RNA pools in non-model genomes. Three bacterial genomes, a fungal genome, and a plant genome were used in performance benchmarking, which proves that the software supports the design of gRNAs in non-standard Cas enzymes for non-model organisms at the genome-scale. However, the advantages of this software are not well estimated nor presented compared to other tools like CHOPCHOP. Also, the software was mainly evaluated in three bacteria genomes, one fungus and Arabidopsis genome. There are no tests for non-model plant or animal genomes. Therefore, the "non-model genomes" in the title are exaggerated. I list more problems as follows.

      Major comments:

      1. The authors did not compare the computational resources and performance (running time, memory) with existing software like CHOPCHOP. Also, the authors need to compare the score rankings with CHOPCHOP to present the relative power of GuideMaker. Are there any score rankings concerning efficiency or off-target possibilities for the designed guide RNAs?

      2. It would be better to add support for GFF-formatted annotation input files, since many non-model species do not have GenBank annotations.

      3. The authors mentioned GuideMaker can design gRNAs for any small to medium size genome (up to about 500 megabases). The maximum genome used in the article was Arabidopsis thaliana (114.1MB), which is obviously smaller than the described (up to about 500 megabases). We couldn't find the description whether the authors had investigated the larger genomes. Therefore, the detailed analysis or discussion of this problem is needed.

      4. The authors stated GuideMaker to design CRISPR-Cas guide RNA pools in non-model genomes. Arabidopsis thaliana is a model organism and test in a non-model plant genome will be highly valuable.

      5. It is also stated that GuideMaker can design gRNAs for any PAM sequence from any Cas system, but the results for SaCas and StCas were described in only one sentence.

      6. The source of the genomes was missing in the manuscript. In particular, some species have multiple genome versions in the same database. Therefore, to make the results more repeatable, the specific website and version number for each species are needed.

      Minor comments: There are many typos. I give some examples here.

      1. Line 11, "bacteria" should be "bacterias".

      2. Line 38, delete ", including non-model organisms"; prokaryotic and eukaryotic organisms include the non-model organisms.

      3. Line 111, "candidates guides" should be "candidate guides".

      4. Line 154, "gRNA identify with GuideMaker" should be "gRNA identified with GuideMaker".

      5. Line 195, "The second way GuideMaker reduces…" should be "The second way that GuideMaker reduces…".

      6. Line 204, "and", no need for italics.

      7. Line 207, "gRNA's" should be "gRNAs".

      8. Lines 209-210, "we anticipate performance will…" should be "we anticipate that performance will…".

      9. Figure. 1. It seems that the font size of the description of Control gRNAs is inconsistent with others, please check.

      10. Line 22,55,98,159,175,187,219 and 247, "Guidemaker" should be "GuideMaker".

      11. Line 262, "CAS" should be "Cas".

      12. Supplementary Figure 4. Grammar mistake in sentence "the different number of logical cores with or without AVX2 settings are available". It should be "the different number of logical cores with or without AVX2 settings is available".

    3. This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giac007), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Kornel Labun

      In this study, "GuideMaker: Software to design CRISPR-Cas guide RNA pools in non-model genomes", Poudel et al. provide a software for sgRNA design, focusing on genome-wide screens. The tool uses the original strategy of finding off-targets with the use of Hierarchical Navigable Small World graphs, trying to provide fast running times for the all vs all comparison. Additional novelty is introduced with proximity filters towards features of interest, and filters for restriction sites inside the guide RNA. What's more, the tool creates control guide RNAs, which are mandatory for pooled screens. I applaud the selection of the license, as all versions of GuideMaker are available under a Creative Commons CC0 1.0 Universal Public Domain Dedication. Below I list some of my comments and suggestions. Comments & suggestions:

      1. I tested the website and the tool, not finding any bugs and errors. Website is well made, congratulations!

      2. Name of the tool: GuideMaker is not self-explanatory for what it is specialized for, which is pooled design. In the future consider naming your tools more distinctly, as I am afraid that currently the tool will be buried under hundreds of other GuideSomething tools.

      3. Authors also claim to support Cas13 (page 3 line 65), but don't mention anything more specific about it. I mention that because design for RNA is vastly different from design for DNA and it should be explained how the tool designs for RNA.

      4. From my understanding the tool offers highly discriminatory settings towards off-target search for a quick resolution of the all vs all comparison problem. However, the authors ignore that CRISPR off-targets are not defined by the Hamming distance, but by the Levenshtein distance. This was proven already by many studies, e.g. Tsai et al. 2015. I recommend that the authors embrace this issue in the paper and explain why their design may be suitable, and for what kind of studies it would be alright to use Hamming distance vs Levenshtein distance, instead of ignoring the problem.

      5. The study could gain prominence by showing a couple of figures and describing how the grid-optimization parameters were selected. This would be especially important for everyone that wants to use this tool for nonbacterial genomes (page 6, lines 128-131). Although a script for optimization is included, it would be good to see what the tradeoffs are.

      6. I believe that Figure 4 and all other AVX2 vs non-AVX2 comparisons are not interesting enough to include multiple times. AVX2 improvements are nice, but the tool is already plenty fast, and a running time of 250 vs 220 seconds does not matter for normal users. Similarly, the number of cores does not seem to influence tool speed above 8 cores, and one figure should be enough to explain that. The tool claims very fast running times, but is not compared to the running times of other similar tools for the design of pooled screens; this could highlight its superiority.

      7. CHOPCHOP is a general tool for the design of pooled screens while here it is used as a pooled screen tool due to its configurability. Additionally, CHOPCHOP also supports all PAMs and all species, but only in its Python version, available at https://bitbucket.org/valenlab/chopchop/src/master/; the website supports only some genomes due to the slow process of index building for Bowtie.

      8. Comparisons to CHOPCHOP focus on the guides found, but I don't understand why the consensus ratio between the tools should matter. What is more important is whether GuideMaker does indeed not filter any guides that are preferable for each gene (e.g. by CHOPCHOP ranking) and whether its Hamming-based filter is good enough to not cause significant unknown off-target effects (Levenshtein distance off-targets not found by the Hamming distance filter). All it takes is one bulge and the Hamming distance will become large, while the Levenshtein distance can even be as low as 1.

      9. It is not clear to me why the tool can't be used with large genomes; filtering on the 11 bp seed and Hamming distance should be plenty fast even for very large genomes. Could it be that the tool should support other input, not only the GenBank file format?
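      The distinction the reviewer draws between Hamming and Levenshtein distance can be made concrete with a minimal sketch. The sequences below are toy examples chosen for illustration, and the functions are plain textbook implementations, not code from GuideMaker or CHOPCHOP. A single-base bulge (here a one-base deletion in the genomic site) shifts every downstream position, so a position-by-position comparison reports many mismatches even though one edit explains the difference.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions (two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Toy 20-nt guide (hypothetical).
guide = "AACCGGTTAACCGGTTAACC"

# Genomic site identical to the guide except that the second base is missing.
bulged_site = guide[:1] + guide[2:]

# The same site read as a fixed 20-nt window, i.e. padded with the next
# genomic base (assumed here to be G).
window = bulged_site + "G"

print("Hamming(guide, window)          :", hamming(guide, window))
print("Levenshtein(guide, bulged_site) :", levenshtein(guide, bulged_site))
print("Levenshtein(guide, window)      :", levenshtein(guide, window))
```

      A filter that discards candidate off-targets above a small Hamming threshold would not flag this site, whereas an edit-distance search would, which is the gap the reviewer asks the authors to discuss.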

    1. This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giac028), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: C Robin Buell

      This manuscript describes the sequencing, assembly, annotation and analysis of a cassava genome. The cassava genome has already been published but this manuscript describes the genome of a heterozygous cultivar rather than the slightly inbred cultivar published previously. The authors performed the assembly using a number of assembler programs and benchmarked each assembly. Not surprisingly, they found that hifiasm worked the best with HiFi reads. The authors then did annotation of the genome and performed a set of analyses including allele specific expression and pan-genome analyses.

      The manuscript and its genome will be of use to a range of users in the genomics field. I do feel that the manuscript is exceedingly long and reads more like a dissertation than a research article. A significant portion of the text could be deleted without affecting the take-home messages of the manuscript. For example, the analysis of allele specific expression, alternative splice form expression and the pangenome is extremely limited in depth and breadth. If these remain in the manuscript, the authors should perform more extended analyses including examining a wider range of tissues and genomes, as there are extensive genomic resources available for cassava. It would be nice to tie this complete, phased assembly with the diversity analyses done previously with cassava that revealed the bases of genetic load.

      De novo annotation of the assembly was not performed. Instead, the authors projected the reference annotation onto their assembly and then did alignments with transcript data derived from IsoSeq. The authors are misinterpreting the pseudogenes. As shown earlier by Gan et al. (2011) with Arabidopsis, projecting a reference annotation onto other genome assemblies fails to capture alternative splice forms and thus, predictions of pseudogenes from projected annotation are grossly inaccurate. De novo annotation using cognate transcript evidence should be performed to ensure artifacts are not introduced into the annotation. This also would allow the authors to more deeply investigate the dysfunctional/deleterious alleles that are present in cassava, a vegetatively propagated crop.

    2. This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giac028), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Zehong Ding

      In this manuscript, Qi et al. assembled two chromosome-scale haploid genomes in African cassava TME204, validated the structural and phasing accuracy of haplotigs by BACs and a high-density genetic map, revealed extensive chromosome re-arrangements and abundant intra-genomic and inter-genomic divergent sequences, analyzed the allele-specific expression patterns in different tissues, and built a cassava pan-genome and demonstrated its importance in down-stream omics analysis. Overall, this work is of crucial importance and should be sufficient to publish in the GigaScience Journal. However, I found that this manuscript lacks basic logic and some analyses have major flaws. Please see the details below:

      1) According to Supplementary Table 10, there were TME204 Illumina RNA-seq data for at least 9 different tissues. However, when the authors performed the analysis of 'Tissue specific differentially expressed transcripts (Line 393)', why did they compare only leaf and stem and ignore the remaining tissues? This is illogical.

      2) Two cassava haplotypes (H1 and H2) were constructed in this study. In Table 4 and Supplementary Figure 9, why did the authors perform the analysis between 'TME204 H1 vs. AM560' but not mention the comparison between 'TME204 H2 vs. AM560' at all? Similarly, in Fig. 8 and Fig. 10c, the analysis was also performed in 'TME204 H1' but not in 'TME204 H2'.

      3) In Fig. 7C, ASE should be the expression level comparison between H1 and H2, so why are the legends still H1 (red bar) and H2 (blue bar)? I cannot understand. Fig. 7D is also very difficult to understand. E.g., what is the meaning of the labels (e.g., "leaf_H1" and "Stem_H1; Leaf_H1") on the x-axis? Logically, there are "stem_H1; leaf_H1", "stem_H1; leaf_H2", "stem_H2; leaf_H2", so where is "stem_H2; leaf_H1"?

      4) Fig6d, Line 110-111, "The transcriptome comparison between TME204 leaf and stem tissues identified gene loci with associated transcripts that were differentially regulated in one haplotype only." This statement is not true because the comparison between leaf and stem cannot conclude that the transcripts were differentially regulated in one haplotype only. Thus, the sentences in Line 407-408 also need to be revised.

      Other suggestions to the authors:

      • Fig6a, what's the meaning of Het_Uniq, Het_Dup, Hom_Uniq, and Hom_Dup?

      • Fig6d, what's the meaning of legend bar? Log2(leaf/stem) or log2(stem/leaf)?

      • ref30 cannot be cited because it is still under preparation.

      • In 'Conclusions section', the statement "The haplotype-resolved genome allows the first systematic view of the heterozygous diploid genome organization in cassava." is inaccurate, because two haplotypes in heterozygous cassava genome have already been published in Hu et al. (2021, Molecular Plant, 10.1016/j.molp.2021.04.009)

      • The title is also suggested to be changed because it is not attractive.

      • The citations of 'Figure 10b' (Line 497) and 'Figure 10c' (Line 502) are wrong.

  2. Mar 2022
    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac017), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 3: James J Cai

      This manuscript introduces a k-NN based feature selection method, Triku, as one key step to secure informative features in analyzing single cell RNA sequencing datasets. The authors argue that most of the current feature selection methods are biased toward highly expressed genes instead of the actual gene markers defining the cell populations. Instead, they focus on the local signature of gene expression for each gene and compute how each of them deviates from their null distributions. The ranked gene list concerning the deviation will be derived after the median correction. The authors use the Silhouette coefficient to validate their conclusion of better modularity by comparing to other methods. Additionally, the randomness and the robustness of the method are well discussed. In general, this article is well-organized and well-written. The examples of artificial and benchmark datasets showing certain aspects of improvements compared to current methods are illustrative. Triku will be a valuable contribution to the single cell analysis field. The reviewer has some minor comments to help improve the manuscript further:

      1. The authors compare Triku to many other widely-used benchmark methods but exclude Seurat. Although the Seurat method is adopted in Scanpy, as they claim in the "FS methods", the default flavor of Scanpy is "Seurat" instead of "Seurat_v3", the default feature selection method in the latest version of Seurat. It might be good to make this clear. Also, another alternative yet popular method from Seurat, sctransform, is not on the comparison list.

      2. The evidence of "we observed that in certain datasets the Wasserstein distances tend to slightly increase with the mean expression of the genes" could be shown to introduce the necessity of further correction. And the reason why the median correction outperforms other correction methods is left unexplained. For example, Seurat, which also considers binning correction method, uses mean to control the strong relationship between variability and average expression.

      3. Since the authors integrate into the pipeline the k-NN module, which is considered computationally expensive, it would be great to evaluate the time complexity/running speed compared with other methods.

      4. Triku assumes that the local transcriptomic similarity is more likely to define cell types. Apart from clustering, which might be better-quality after Triku, it would be interesting to show any potential effects to other popular downstream analyses in the single cell field, such as trajectory inference, given that Triku is subject to locality.

      5. Triku builds k-NN graph on UMAP all the way around. To validate the robustness of Triku, one could also discuss alternative low embedding methods like t-SNE in the section of "robustness".

      6. Since Triku is likely to identify locally over-expressed genes, it would be interesting to see the overlap between features selected by Triku and the differentially expressed genes, if the setting is possible to arrange to make the two comparable.

      7. In the section of previous work, some claims were made without references. For instance, "Early methods for FS in scRNA-seq data were based on the idea that genes whose expression show a greater dispersion across the dataset are the ones that best capture the biological structure of the dataset". Another example of relevant references missing is https://pubmed.ncbi.nlm.nih.gov/31861624/.

      8. Fig. S2 does not show exact gene names. For artificial data, why those four genes are representative is left unexplained.

      9. The authors classify reference 11, the dropout-based method as "a new generation". As far as I know, the benchmark M3Drop was published in 2018.

      This Reviewer's comments were prepared with assistance from my graduate student Yongjian Yang.

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac017), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Rhonda Bacher

      The manuscript presents a new method, Triku, for feature selection in single-cell RNA-seq data. Feature selection is performed upstream of tasks such as clustering and differential expression to reduce the effect of genes with noisy expression. Triku uses a KNN approach to identify features that are unexpected within cells that are transcriptionally close. Overall, the manuscript is well-written, presented clearly, and is a promising new method for feature selection. The figures are also very nice.

      Major:

      1. In Figure 4, it is not obvious why different methods would rank so differently between the two datasets. What methods did those papers originally use for feature selection (if available)? Does that partially explain the differences?

      2. Figure 6, the left-most plot does not belong? It is not described in the legend.

      3. It would be helpful to note somewhere which category of methods the others belong to (i.e. variance based or distribution based).

      4. Some additional results and discussion on the number of genes selected would be helpful. 250-500 is quite low and may explain the poor overlap between genes selected. In my experience with commonly used methods from the scran or Seurat package, a more typical number of genes selected is around 2,000. What are the typical numbers used/recommended for the other methods compared to here? Does the performance difference remain when expanded to the top 2,000 genes? And is the performance better for Triku on 250 compared to 2,000?

      5. In methods, "By default, the number of features is the one automatically selected by triku." These values should be put into the supplement to get a better idea of how many genes are being selected by default.

      Minor:

      1. In Figure 3, I would label the top and bottom as A and B, I initially misread the legend as top 250 and bottom 500 genes.

      2. What are the approximate run times a user can expect for this method?

    3. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac017), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Christoph Ziegenhain

      In the present manuscript, Ascension and colleagues introduce a new feature selection method for scRNA-seq data to increase the relevance of selected genes for downstream analysis such as clustering. I am happy to see that the tool is deposited as an open-source package that seems easy to install and plugs in seamlessly into the commonly used scanpy workflow / AnnData data structure. The documentation is sufficient to get users started. While the method has merit as a smarter approach to feature selection, the manuscript would benefit from some additional work in terms of both text and analysis.

      Major points

      1) While Triku's strategy is being introduced as superior to preexisting methods, it seems that the strong improvements (at least for the NMI summary statistic) in synthetic data turn rather incremental in the real world datasets of Mereu and Deng et al. The authors should discuss reasons for this difference. In the light of small differences and the fact that the performance is only measured in abstract summarized scores, it would be more convincing if the authors presented concrete cases where the application of Triku yields a difference in clustering or downstream analysis of biological relevance. The currently presented Gene Ontology / Geneset enrichment analyses are too diffuse and do not provide the reader with a feeling of the impact Triku could make on their analysis.

      2) Comparison to other FS methods: Currently, the most widely used method would probably be Seurat's FindVariableFeatures. It would be good to run the presented example data also via Seurat and include it in all comparisons (eg. Fig. 3-6).

      3) Precision of text: There are quite a few statements throughout the text that seem slightly inaccurate and the authors should work in their revision on precision and guiding the reader better through the background & performed work with a bit more clarity. Example: discussion of observed zeroes in UMI-data being well described by the Poisson or NB distributions was not realized by Svensson et al but rather had been described several years before. Compare Vieth et al., 2017 Bioinformatics & Chen et al., 2018 Genome Biology

      4) One of the main assumptions of Triku is that important genes get "switched on", i.e. change their state from rather not expressed to a relatively high expression level. I am wondering if the authors can comment on the performance of Triku in cases where the main difference between cells is a gradual change in already expressed genes and whether such a difference might get lost/masked by the selection performed by Triku.

      Minor points

      5) What is the rationale for selecting the % of zero expression as the descriptive statistic within the knn neighborhood? If a gene occurs in fewer cells but with higher expression, its dispersion would be higher too. It would be needed to justify this more precisely and ideally the authors would add a version of Triku that works on dispersion (to show possible differences).

      6) Three main types of feature selection methods are introduced but not defined/explained further (p. 2)

      7) Since Triku performs more calculations/steps than existing methods for FS, the runtime is presumably higher. The authors should compare and comment on runtime.
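      Minor point 5 can be checked on toy numbers. The sketch below is only an illustration under stated assumptions: the two count vectors are invented, dispersion is taken as variance divided by mean (one common definition, not necessarily the one used by Triku or by the reviewer), and the function names are hypothetical. Both genes have the same mean count, but the one expressed in fewer cells at a higher level has both a higher zero fraction and a higher dispersion.

```python
from statistics import fmean, pvariance


def zero_fraction(counts):
    """Fraction of cells in which the gene has zero counts."""
    return sum(c == 0 for c in counts) / len(counts)


def dispersion(counts):
    """Variance-to-mean ratio (population variance), assumed definition."""
    return pvariance(counts) / fmean(counts)


# Two hypothetical genes measured in the same 10 cells, both with mean 1.
gene_broad = [2, 2, 2, 2, 2, 0, 0, 0, 0, 0]   # 5 cells, moderate expression
gene_sparse = [5, 5, 0, 0, 0, 0, 0, 0, 0, 0]  # 2 cells, high expression

for name, counts in (("broad", gene_broad), ("sparse", gene_sparse)):
    print(name, zero_fraction(counts), dispersion(counts))
```

      On these numbers the two statistics move together, which is the reviewer's reason for asking why the zero percentage was preferred and whether a dispersion-based variant would select different genes.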

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac011), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Gregg Thomas

      This paper presents 17 new insect genomes from the order of caddisflies (Trichoptera). The authors combine these genomes with 9 previously sequenced genomes to analyze genome size evolution across the order. They find that genome size tends to correlate with evolution of repeat elements, specifically expansion of transposable elements (TEs). Interestingly, the authors also notice that TE expansions also correlate with gene copy-number (or gene fragment copy-number), even of highly conserved genes used to assess genome completeness. Overall, I find this paper very well written and easy to follow. The genomic resources and analyses presented provide novel new resources and findings for insects in the order Trichoptera, with potential implications beyond. I have only minor suggestions before publication, outlined below.

      1. Regarding the TE and BUSCO gene fragment associations, while I think this is a really interesting analysis, I found the underlying models a bit difficult to understand. Line 236 reads, "To test whether repetitive fragments were due to TE insertions near or in the BUSCO genes or, conversely, due to the proliferation of 'true' BUSCO protein-coding gene fragments…" Is the idea that a BUSCO gene has been duplicated itself and then one copy is either fragmented by a TE insertion or hitch-hikes with a TE (as mentioned on line 501)? Or are these fragments only of BUSCO genes that didn't match a full BUSCO gene at all, but the fragments that did match had unexpectedly high coverage? I guess I'm just confused as to whether a gene duplication needs to precede the TE insertions/hitch-hiking, which is subsequently pseudogenized either prior to or because of the TE activity, or if these are gene losses. I understand how the TE could inflate the coverage of these fragments, but I guess I'm still not clear on how these fragments arise in the first place. Any clarification would be helpful! Also, if the case is that these are fragments of BUSCO genes that have no full matches in the genome, how might assembly contiguity or quality be affecting these matches?

      2. One thing that I noticed throughout the figures is that branch B1, leading to A. sexmaculata, the branch leading to clade A, and the branch leading to clade B (as labeled in Figures 1 and 2) appear to form a polytomy. I don't find this mentioned in the text and am wondering why this relationship remains unresolved with these data. I don't think this has any bearing on the results, since all analyses are done on the tips of the tree, but I think readers looking at these trees will want to know what is going on at that node.

      3. The authors use custom scripts for their BUSCO-TE correlation analysis and provide a link to a Box folder on line 514. I would request that these scripts be put somewhere more stable and accessible (e.g., github). Not only was I asked to login when clicking the link, but after I had done so that link didn't seem to exist.

      Minor/editorial points

      1. Would the authors be able to report concordance factors for the species tree? I think this should be easy enough with IQ-tree and is something I ask everyone to do. This may also help answer my question about the polytomy.

      2. The authors do a good job of mentioning and citing programs used throughout the manuscript but seem to skip this in the Assembly section (starting on Line 398). "First, we applied a long-read assembly method…" Which one? Same for "de novo hybrid assembly approaches." I see that assembly is covered in detail in the Supplement, but I think naming the main programs used (wtdbg2 and MaSuRCA) should be in the main text.

      3. Line 281-282: I think some of the brackets and parentheses here are mismatched or un-closed.

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac011), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Julie Blommaert

      Summary of the paper and overall impression

      In their paper, "Genome size evolution in the diverse insect order Trichoptera", Heckenhauer et al report a 14-fold variation in genome size in caddis-flies. The authors find evidence for increases in transposable elements associated with larger genomes, and report that in caddis-flies living in less stable environments, some genes are replicated in association with transposable elements. Overall, this paper represents a comprehensive collection of data, however, I have some concerns about some of the reporting of methods, some analyses and conclusions. To support some of the conclusions, namely that WGD or large-scale duplications do not play a role in caddis fly genomes, I believe the authors could perform additional analyses. Further, I was left confused by the descriptions of the methods, especially around the replicated BUSCO gene analyses. Please see my comments below.

      Main comments:

      1. The authors report that their gene-age distribution analyses do not support the hypothesis of a WGD, but given previous suggestions that WGD are important in these species, the authors should conduct additional analyses (e.g smudgeplot, minor allele frequency distributions in single-copy genes) to rule out this possibility. While it can indeed be difficult to find a balance between the evidence of absence and an absence of evidence, more effort should go into resolving the matter of WGD in caddis-flies. Some of the genomescope peaks, and some of the coverage peaks from the backmap approach seem to at least hint at large-scale duplications or variations in copy number. Further analyses should also consider if assembled gene copies may be collapsed duplicates.

      2. I admit I am confused by the terminology around the TE-associated BUSCO genes. Are these cases where BUSCO has reported a high number of duplicates? Or where BUSCO annotated regions have a high coverage? Two things need to be clarified here; what made them stick out in the first place (coverage? Duplications?), and what are they really (TEs that BUSCO mistook for BUSCOs? fragments of real BUSCOs attached to TEs?).

      Minor comments:

      1. Lines 53-57: "Genome size can vary widely among closely and distantly related species, though our knowledge is still scarce for many non-model groups. This is especially true for insects, which account for much of the earth's species diversity. To date 1,345 insect genome size estimates have been published, representing less than 0.15% of all described insect species." While I appreciate the authors' point that there is relatively little data available about genome size and only a small proportion of non-model insects in the Animal Genome Size database, this is the case for all groups, and insects actually represent the largest group of invertebrates in the AGSDb. However, this does not mean insects, or chironomid are a poor system to study this in, so the authors could reframe this first sentence to justify the study system with something more than highlighting how understudied this is in insects.

      2. Line 76: correct to "In insects, the KNOWN ranges of genomic repeat proportion are…"

      3. Lines 89-91: Why are species rich groups a better system to study RE evolution and environmental interactions than e.g. populations, species complexes, recently diverged species, or groups in the process of speciation?

      4. Lines 113-115: The data description does not, in my opinion, need to justify the species selection since this is done in the intro

      5. Genome size estimates- sequencing based estimates can also be impacted by GC-content, especially in libraries which were produced using PCR, this may be a useful point regarding the differences between FCM and sequencing-based estimates

      6. RepMod versions inconsistent: Line 463 says v2, earlier says v1

      7. Line 468-469- What did you use to merge repmask out files?

      8. All read-based analyses: were they run on decontaminated read libraries? If so, please briefly clarify this in the main manuscript. Genome size with GenomeScope: 444-448; RepeatExplorer: Lines 471-479

      9. Why only use dnaPipeTE for repeat divergences and not also abundances? Does dnaPipeTE agree with RepeatExplorer?

      10. Line 495: What is meant by "BUSCO genes showed regions of unexpected high copy number…"? Are these genes reported by BUSCO as duplicated or is this referring to increased coverage?

      11. Lines 506-507: "We used copy number profiles to identify BUSCO genes with repetitive sequences based on coverage profiles" The meaning of this is unclear. The reported copy number from BUSCO? Coverage of mapped reads?

      12. Table 1- please report the full BUSCO summary (e.g. C:39.7%[S:39.2%,D:0.5%],F:35.8%,M:24.5%,n:2442) for each species, lumping complete and fragmented together is unnecessary, and readers are usually interested enough in the full complement of BUSCOs that it should not be in the supplements, but in the main paper

      13. Coverages from the backmap method can and should be compared to genomescope kcov estimates (while correcting for kmer size; see here for a brief explanation https://www.biostars.org/p/221672/), this will validate both approaches and offer further evidence when considering polyploidy.

      14. In the supplementary note about TAGC plots, Figures S31, S36, S38, S44, S45, S46, S47 don't list contaminant exclusion criteria- if contaminants weren't removed this needs to be stated, and in some cases, especially those where there are different "blobs", (e.g. S47) justified

      15. Supplementary note 9: Figure reference is wrong?

      16. Supplementary note 10: Can coverage comparisons using average BUSCO coverage be re-run using corrected kcov estimates? This would validate the BUSCO coverage approach.

      17. Supp Data 1: Coverage estimates would be more accurate if based on FCM measurements and total sequenced bp (before and after decontamination) and can also be compared to corrected kcov estimates

      18. Limnephilus lunatus has too low coverage to get reliable genomescope
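
      The k-mer size correction requested in comments 13, 16 and 17 is commonly written as C = C_k · L / (L − k + 1), where C_k is the k-mer coverage (kcov), L the read length and k the k-mer size. The sketch below assumes a single, uniform read length; the function and parameter names are illustrative and are not taken from the manuscript or from GenomeScope.

```python
def kmer_to_base_coverage(kcov: float, read_length: int, k: int) -> float:
    """Convert k-mer coverage to base (read) coverage.

    Assumes all reads share one length, which is a simplification.
    Each read of length L contributes (L - k + 1) k-mers, so k-mer
    coverage underestimates base coverage by the factor (L - k + 1) / L.
    """
    if not 0 < k <= read_length:
        raise ValueError("k must satisfy 0 < k <= read_length")
    return kcov * read_length / (read_length - k + 1)


if __name__ == "__main__":
    # Hypothetical values: kcov of 20 from 150 bp reads counted with k = 21.
    print(round(kmer_to_base_coverage(20, 150, 21), 2))
```

      The corrected value can then be set side by side with the backmap or BUSCO-based coverage for the same library.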

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac006), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Surya Saha

      The publication describes a useful tool to quickly survey a range of QC metrics for genomes available in NCBI. The a3cat toolkit can be used to set up as well as update the assessment results for public or private assemblies for a user-defined taxon. Overall, the website and the workflow on gitlab are a useful resource for the genomics community to ask a number of comparative genomics questions. I enjoyed reading this manuscript and only have minor comments. I would like to bring some more use cases to the attention of the authors that can enrich the discussion.

      The authors have already presented nuggets from the data mining of results but here are a few thoughts to add to the value of results reported here, as that can be further improved. Given an assembly from an insect with an approximate taxonomic classification based on morphology or genetic markers, can the a3cat results be used to figure out the best reference genome or a set of closely related genomes for comparative analysis of the gene space? One idea could be to use the overlap of lineage specific BUSCO genes found in the new genome with BUSCO genes present in other assemblies to identify related genomes.

      The discussion covers results when the results are filtered by level (contig, scaffold, chromosome) or type (haploid, principal or alternate pseudohaplotype). It might be worthwhile to further segment the results based on input raw data (e.g. short reads, short reads + mate pair, long reads) to explore if the contiguity of the assembly and completeness and duplication of the gene space is impacted by the proportion of indels in the raw reads irrespective of the length of the reads. There are a number of other relevant variables like assembly algorithm and parameters but that can lead to very sparse data. The authors talk about the proportion of repeat content in larger genomes. This might be a valuable resource to add to the a3cat results as initiatives like Ag100Pest and DToL are producing high quality insect genomes >1-2Gbp with a large number of repeats that are going to be better assembled than ever before with high fidelity long reads. Adding the results of a widely used de novo repeat identification tool like RepeatModeler based on the DFAM database will provide a consistent measure of repeat content across all analyzed genomes and add to the value of this toolkit. In case some of this information is already available in NCBI, it can be pulled using the API avoiding the need for this massive compute job.

      This next issue is related to BUSCO but affects the results and conclusions of the a3cat tool. Is it possible that some of the BUSCO marker genes (from OrthoDB9 or 10) are based on short read assemblies with minor errors in gene models? When run on recent assemblies based on high fidelity long reads with the correctly assembled gene model, BUSCO might report the marker as missing or fragmented. I understand this is outside the scope of this paper but if this is possible, it should be mentioned as a potential pitfall.

      A common problem with bioinformatics resources is the lack of a sustainability plan. I know this is difficult to pin down for the mid or long term in the face of unpredictable funding but I would like to encourage the authors to present a plan to manage and update the web resource if at all possible. For future work, it might be a good idea to consider the extension of the a3cat toolkit to include other metrics beyond the current contiguity and gene space completeness measures. Mash or ANI distances are becoming computationally tractable for large data sets. I have already mentioned the repeat content issue. Long range similarity measures based on Hi-C data or nucleotide composition based on kmer analysis might be other items to ponder.

      Minor revisions

      Since the logic and applicability of this work is so straightforward, some of the text can be shortened to reduce duplication. For example, on Pg 4 this paragraph can be shortened, "Using their Complete Proteome…. for selected groups of species from their field of interest." In the same paragraph, I see "(i) aid project design, particularly in the context of comparative genomics analyses; (ii) simplify comparisons of the quality of their own data with that of existing assemblies; and (iii) provide a means to survey accumulating genomics resources of interest to their ongoing research projects." Can the difference between (i) and (iii) be clearly explained?

      Typographical errors

      On Pg 8, the abbreviation CoL- needs an explanation.

      On Pg 12, can the term span be elaborated?

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac006), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Stephen Richards

      We are now entering a period with rapidly increasing numbers of arthropod genome assemblies. Quality has vastly improved because of new high quality long read technologies, but still has a chance to be uneven.

      Comparative genomics requires at least some effort to ensure the datasets are comparable. Here the authors have produced a nice tool to help find sequenced arthropod genomes and compare their quality.

      They use their previous experience with BUSCO to measure quality, and overall I expect I will be using this resource quite a lot.

      I also expect a lot of people will use this resource to identify high quality assemblies for comparative analysis.

      One possible plot that would be useful would be completeness plots - things like number of orders with a representative, families etc, partly to show progress, and partly so missing taxa can be easily identified.

      The manuscript is well written, but more importantly the data and methods are easily accessed, and everything is well written up.

      The tool and website do what they say on the tin, and I can't really see any reason not to publish rapidly.

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac002), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 3: Yaomin Xu

      The authors presented a web tool - NETMAGE that produces an interactive network-based visualization of disease cross-phenotype relationships based on PheWAS summary statistics. NETMAGE provides search functions for various attributes and selecting nodes to view related phenotypes, associated SNPs, and various network statistics. As a use case, authors used NETMAGE to construct a network from UK BioBank (UKBB) PheWAS summary statistic data. The purpose of the tool as claimed by the authors is to provide a holistic, network-based view for an intuitive understanding of the relationships between disease phenotypes and to help analyze the shared genetic etiology.

      Major comments:

      A DDN based on true genetic associations is useful for understanding complex disease comorbidities and their shared genetic etiology (pleiotropy). An interactive web tool to explore such complex networked information could be highly useful for the proposed purposes of this tool. However, the EHR/Biobank PheWAS association data are statistical in nature and commonly have small effect sizes. The reported genetic associations often are not well understood at the mechanistic level, and many genetic associations are spurious. Although certain positive findings can be observed from the disease network generated by NETMAGE, the general usability of the current implementation of the tool is of concern for facilitating novel applications in drug design and personalized medicine, which require the genetic associations to best represent the underlying true causal mechanism. Further work is needed to verify the genetic associations reported from PheWAS to minimize the impact of spurious associations. Network edges based on SNPs without considering the linkage disequilibrium (LD) between SNPs are misleading and could miss a significant portion of associations that should be linked between diseases if the LD correlations are considered. When constructing the network using NETMAGE, the LD correlation between SNPs should be considered.

      For the reported DDN and its statistics to be relevant to true disease - disease relationships, the quality of disease diagnosis using Phecode should be considered. Phecodes are based on ICD codes that are known to be noisy. The accuracy of ICD can be as low as only 50%. Ignoring this limitation and treating disease diagnoses from Phecodes as gold standards or as precise and accurate may result in irrelevant and misleading findings.

      Phecodes are hierarchical. For example, parent codes are three digits (008), and each additional digit after the decimal point indicates a subset of the ICD codes of the parent code (008.5 and 008.52). So here the code 008.52 implies 008.5 and also 008. What's the impact of this hierarchy on the NETMAGE network and the inferences to be made based on the network?
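
      The hierarchy described above can be made concrete with a small sketch that expands a Phecode into the chain of parent codes it implies. It relies only on the string convention given in the comment (three-digit parent, one extra level per decimal digit); the function name is hypothetical and this is not part of NETMAGE.

```python
from typing import List


def phecode_ancestors(code: str) -> List[str]:
    """Return the code followed by every parent it implies, most specific first."""
    parent, _, decimals = code.partition(".")
    # Drop one decimal digit at a time: 008.52 -> 008.5
    chain = [f"{parent}.{decimals[:i]}" for i in range(len(decimals), 0, -1)]
    # The three-digit parent code closes the chain.
    chain.append(parent)
    return chain


if __name__ == "__main__":
    print(phecode_ancestors("008.52"))  # ['008.52', '008.5', '008']
```

      Under this convention, a SNP associated with 008.52 will often also be associated with 008.5 and 008, so parent and child nodes can end up connected by edges that reflect the coding hierarchy rather than distinct diseases.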

      Minor comments:

      On Page 9, you said "Out of the 2189 edges for which phi correlations could be calculated, 1811 (82.73%) appeared in the DDN. This behavior suggests that our genetic associations identified by our PheWAS results serve as a reasonable approximation of disease co-occurrences".

      This is expected because both phi correlation and PheWAS analyses were performed on the same dataset. If a pair of diseases highly co-occur in the dataset, you would expect a strong correlation between their genetic associations analyzed on the same dataset. However, it may not be generalizable that the genetic associations from PheWAS are a reasonable approximation to disease co-occurrences. The disease-SNP relationships from the PheWAS analysis result are bipartite. Even though NETMAGE focuses on the projected disease-disease network, the information about how specific SNPs link to their corresponding disease pairs is important. For example, in your UKBB-based network (https://hdpm.biomedinfolab.com/ddn/ukbb), when a specific disease is selected, a subgraph of the selected disease and the other diseases linked to it is shown, but only a lump of SNPs, without links to their specific disease pairs, is provided. This is not helpful. Also, annotating those SNPs with their genetic context could be very useful for users to quickly grasp the nature of the genetic associations in the subgraph.
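
      For reference, the phi correlation discussed above is the Pearson correlation between two binary variables, computed from the 2x2 table of patients with both, one, or neither diagnosis. The sketch below uses the textbook formula; how the authors actually computed it is not stated here, and the function name and counts are illustrative.

```python
import math
from typing import Optional


def phi_coefficient(n11: int, n10: int, n01: int, n00: int) -> Optional[float]:
    """Phi correlation for a 2x2 co-occurrence table.

    n11: patients with both diseases, n10: only disease A,
    n01: only disease B, n00: neither.
    Returns None when a marginal total is zero and phi is undefined.
    """
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    denom = math.sqrt(row1 * row0 * col1 * col0)
    if denom == 0:
        return None
    return (n11 * n00 - n10 * n01) / denom


if __name__ == "__main__":
    # Hypothetical counts for one disease pair.
    print(phi_coefficient(30, 70, 20, 880))
```

      The undefined case (a zero marginal) is one reason a phi value may be computable for only some disease pairs.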

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac002), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Dongjun Chung

      In this paper, the authors developed the humaN disease phenotype Map Generator (NETMAGE), a web-based tool that produces interactive disease-disease network visualization based on PheWAS summary statistics. The tool proposed in this manuscript has important implications and utility for biological and clinical studies. The manuscript is also overall well-written and clearly describes NETMAGE. However, there are still some aspects I hope the authors will address. I provide my comments in detail below.

      Major comments:

      1. I tried the web interface Human-Disease Phenotype Map (https://hdpm.biomedinfolab.com), which utilizes NETMAGE. I found that sometimes it takes some time for the network to appear. While the network is loading, only the gray empty space with the side panel is shown. I recommend that the authors show a progress bar while loading the network, especially when it is first loaded, so that users do not think that their web browser is frozen.

      2. In the Search bar, it is not always trivial to guess what to enter, especially for Phenotype Name, Associated SNPs, and category. Auto-completion features for these variables will significantly facilitate users' convenience.

      3. Meaning of edges is somewhat unclear to me. Are the existence and the weights of edges purely based on the number of shared SNPs or are they based on any statistical methods?

      4. When the weights of edges are calculated, are the marginal counts taken into account? The same number of shared SNPs can have different meanings when the disease to which this edge is connected has a small number of associated SNPs vs. a large number of associated SNPs. How is this factor considered?

      5. The network generated by the Human-Disease Phenotype Map (https://hdpm.biomedinfolab.com) is usually huge and complex with a large number of edges. As a result, it is often not straightforward to understand the generated network. This is partially relevant to the fact that the network layout is static, i.e., locations of nodes remain the same regardless of which subnetworks are chosen. If the network layout is optimized for each subnetwork, it should be much easier for users to understand the network architecture. Given this, I recommend the authors to consider updating the network layout interactively when a subnetwork is selected.

      6. When a subnetwork is chosen, the "Information Pane" appears. In this pane, it might be helpful for users if the authors provide some quick help link for each network score, e.g., how to interpret PageRank scores, etc.

      7. In the "Information Pane", a long list of SNPs is provided for "Associated SNPs" but it is not easy to use this list. I recommend that the authors make it downloadable as a table so that users can do downstream analysis. In addition, it would be much more convenient for users if selecting an SNP ID brought them to the relevant database, e.g., dbSNP. In this way, users can easily check where it is located in terms of chromosome, gene, exon/intron/promoter/intergenic, etc. Alternatively, the authors can consider using a quick information table (SNP ID, gene name, exon/intron/promoter/intergenic) instead of simply providing a list.
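
      The point raised in comment 4 can be illustrated with a short sketch comparing a raw shared-SNP count with a weight that accounts for how many SNPs each disease has. The Jaccard index is used here purely as one possible normalization; it is an assumption for illustration, not the weighting NETMAGE uses, and the rsIDs are made up.

```python
from typing import Set, Tuple


def edge_weights(snps_a: Set[str], snps_b: Set[str]) -> Tuple[int, float]:
    """Return (shared SNP count, Jaccard index) for two diseases' SNP sets."""
    shared = len(snps_a & snps_b)
    union = len(snps_a | snps_b)
    jaccard = shared / union if union else 0.0
    return shared, jaccard


if __name__ == "__main__":
    # Made-up SNP sets: the partner disease shares two SNPs in both cases.
    small = {"rs1", "rs2", "rs3"}
    large = {f"rs{i}" for i in range(1, 41)}
    print(edge_weights(small, {"rs1", "rs2", "rs9"}))
    print(edge_weights(large, {"rs1", "rs2", "rs100"}))
```

      Both pairs share two SNPs, yet the normalized weight is far lower when one disease has 40 associated SNPs, which is the distinction the comment asks the authors to address.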

    3. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac002), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Sarah Gagliano Taliun

      Sriram et al. introduce an open-source web-based tool NETMAGE to produce interactive disease-disease network (DDN) visualizations of biobank-level phenome-wide association summary statistics. The concept is interesting and relevant, but my major concern is regarding the interpretability of the DDN for researchers and clinicians to draw insights intuitively.

      Comments on the manuscript:

      Generally well written and logical flow. Some minor errors (e.g. "an SNP" rather than "a SNP") and some headers could be improved for readability (e.g. "Testing" is vague; this section really only touches upon Run time).

      Figure 1- Displaying a single Manhattan plot for "PheWAS Summary Statistics" is not very intuitive. It makes me think of a single GWAS rather than a phenome-wide set of GWAS run on a Biobank. Perhaps revise the image.

      Is the disease-disease network only applicable to case/control studies? Could there be an extension to quantitative traits, and if so, would that be pertinent for discoveries?

      The authors refer to "SNPs" throughout to define genetic variation. If the summary statistics contains another type of variation (e.g. indels), are those associations still used? If so, I would suggest using a more generic term to define the genetic variation.

      The discussion seems underdeveloped. Discussion of limitations rather than only future work would be helpful.

      Case study-- The authors could improve the interpretability/discussion of the UKB PheWAS example. This is one of my largest concerns because the authors state that the tool can help researchers and clinicians get insight into the underlying genetic architecture of disease complications; however, the case study part of the manuscript is quite technical and could be challenging to interpret for someone without network experience; e.g. Table 2.

      Additionally, more details should be provided on the underlying summary statistics used (e.g. some details can be found on the About page of the HRC-imputed UKB PheWeb page: https://pheweb.org/UKB-SAIGE/about).

      The authors list additional filtering that they performed on the summary statistics, but it appears that some details are missing. For instance, how many traits remain after the case count filtering is applied? Also, what is used as a reference for the LD-pruning in PLINK?

      Run time-- I am wondering why Table 3 (run time for subsets of the UKBB data) ends at 1000 phenotypes. It would be interesting to see the run time for a number close to that of the case example (e.g. possibly adding a column for the total number of phenotypes used in the UKBB DDN). Additionally, this section gives the impression that run time only depends on the number of phenotypes. I would assume that run time should also depend on the number of variants that were tested.

      Comments on the online tool:

      It is nice that on each page the authors have allowed users to download a pdf of the image and also the data behind the image (e.g. edge-map, node-map, etc.). The zoom-in feature for the visualization is also useful, as is the short video tutorial.

      I think that the search bar would be more user-friendly if suggestions automatically came up when the user begins to type. Additionally, displaying the list of "associated SNPs" in a (sortable and/or searchable) table (with some annotations, such as chr, position, closest gene, consequence, rather than just rsID) could be a neater and more informative way to show these data, rather than simply as it appears currently as a list in the "information pane".

      My comment on interpretability for researchers and clinicians comes up again: I am not sure how useful/interpretable some of the search categories are for users to intuitively draw insights; for instance, number of triangles, PageRank, etc. I think the authors should really focus on the intuitiveness for the target audience so that the tool can have more impact.

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac010), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Juan Alzate

      The present paper entitled "Fully resolved assembly of Cryptosporidium parvum" shows the results of the genomic sequencing of the protozoan parasite C. parvum using both 2nd (Novaseq) and 3rd (ONT) generation NGS technologies. Additionally, they assembled the C. parvum genome and compared their results with the previous C. parvum IOWA II reference. The authors also undertake some QC analysis to validate chromosome models.

      The paper is interesting because there is a need to have a fully resolved Cryptosporidium genome. The sequencing by itself is not much of an achievement; the authors applied commercially available platforms. In the assembly process, they also used already known assemblers and mapper tools. I think BUSCO does not deliver the detailed results expected here. Maybe a more comprehensive analysis, including all the single-copy genes present in C. parvum, can help to better support the quality of the genome.

      One additional recommendation is that the authors present a detailed analysis of single nucleotide variants (SNVs). These data can be extracted from the same BAM files that the authors already generated for the structural variant analysis. This analysis is particularly important because it can show the readers how clonal the C. parvum strain used is.

      I don't know if this is possible. Can you compare your genome model with the one published here BioRxiv - DOI: 10.1101/2021.01.29.428682.?

      Please make public the raw-read data. (Novaseq and ONT raw reads)

      Please explain in more detail in the Methods section how you found and analyzed the structural variants.

      I don't understand why to estimate the genome size. Could you explain it?

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac010), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Matthew Knox

      Overall, Menon et al. present a significant contribution to the field with this work. Their fully resolved assembly of Cryptosporidium parvum is the first to my knowledge to utilize long read sequencing in whole genome sequencing for this group of protozoan parasites and as such provides validation of previously published work while also improving on current reference standards and providing a robust and well described analysis pipeline for future studies.

      In my view, there are only a couple of issues with the paper that should be addressed. The first is a discussion of recent work using metabarcoding (e.g. DOI: 10.1016/j.meegid.2012.08.017, DOI: 10.1016/j.ijpara.2017.03.003), which demonstrates mixed infections in clinical samples of patients infected with Cryptosporidium which were missed with consensus Sanger sequencing. In some cases, mixtures of subtype families can be found, though dominance of a single subtype with a few closely related variants is more common and more likely in the current paper. Nonetheless, this may have implications for sequencing since purity of the "culture" cannot be guaranteed and results from the lack of reliable in vitro culture methods for Cryptosporidium.

      The second issue I have is with the section on comparative genomics. Strictly speaking, calling this a comparative genomics analysis is not correct since the authors do not compare genomes with genomes. Instead, it is based on comparison with a small subset of Sanger-generated sequences and does not add much to the paper in my view. If it is to be included, the text should be rephrased to better reflect the analyses, and the identity (species, subtype, subtype family) of the sequences downloaded from GenBank should be presented in more detail. Also, it is unclear what criteria were used to select these sequences from among the many hundreds available for C. parvum and this should be stated too.

      In addition to significant comments above, I detected a few inconsistencies and typographical errors in the submission and have included minor comments (sticky notes) in the attached pdf document. I hope that the authors find this helpful in improving the manuscript.

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac005), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Peter Horvatovich

      The article GIGA-D-21-00223 entitled "Democratizing Data-Independent Acquisition Proteomics Analysis on Public Cloud Infrastructures Via The Galaxy Framework" describes a targeted DIA LC-MS/MS processing workflow implemented in the Galaxy framework. The paper describes the tools integrated in the Galaxy environment and the workflow steps to process DIA LC-MS/MS data using a targeted spectral library approach. The authors used a HEK cell lysate spiked with E. coli digest at various ratios and used these samples to generate DIA LC-MS/MS data on an Orbitrap QE+ with MS1 scans and 24 50% overlapping DIA windows between 400-1000 m/z in 4 replicates for each condition. The implemented workflow contains the library generation from DDA data with MaxQuant processing, library cleaning, analysis of the DIA data with OpenSWATH and statistical analysis using the MSstats package in R. The authors present identification and quantification of proteins in the example data (differential analysis, volcano plot, CV plot).

      The article has a potential interest to the proteomics community as it serves to promote the use of complex DIA data processing workflows in Galaxy web interface, which would otherwise require considerable programming skills and time to establish such workflow from the user. However, the authors should address some major and minor issues before I suggest the article to be accepted.

      Major concerns:

      1. The tools and the DIA processing workflows are implemented in Galaxy Europe, which provides an amount of resources unknown to me in terms of disk space and computational capacity (CPU, RAM). The authors should describe the limitations of using this online Galaxy server (maximum upload size, CPU time, whether there is any cost to use the service, RAM limits for the tools, etc.).

      2. Some users do not want to use cloud-based services and a public Galaxy server, but would wish to process their data (e.g. clinical samples from humans) on their own local, closed computational infrastructure. For these users the authors should provide a tutorial on how to install Galaxy (just refer to the Galaxy installation documentation), how to get the tools from the Galaxy Tool Shed and how to run the pipeline. Some users may already have a Galaxy server, and adding further tools may interfere with it; I would therefore strongly suggest creating a Docker image in which a single instance of Galaxy is installed with all necessary tools, and which includes the raw data and settings, in order to provide a clean workflow that is sure to work.

      3. I would also like to see data on the actual runtime for the example dataset, especially for the FDR calculation, as the authors mention that subsampling of the data is required for this.

      4. I would also present peptide results, as protein quantities are obtained after protein inference from multiple peptides, while the instrument measures peptides.

      5. The CV distribution of proteins in Figure 4a should be compared with results from other datasets, as it is multimodal and wide, and seems to be independent of the spiking levels. This indicates some artifacts in the data.

      6. The data are only subjected to time alignment using iRT peptides, and no normalization is applied. The authors should check the individual distributions of peptides and proteins in each replicate with box plots or violin plots and, if necessary, apply normalization to avoid "upregulated" human proteins. It would also be useful to color the dots in the volcano plot according to species (human/E. coli). The authors refer to displacement effects, but the text does not explain what this means (maybe ion suppression?).

      7. Please provide the distribution of missing values for each replicate, as DIA should provide data with a low percentage of missing (0) values.

      Minors:

      1. All figures and plots look like low-resolution bitmaps. Please provide high-resolution plots, preferably as vector graphics.

      2. Figure 2B, please restrict R2 numbers to 4 decimals.

      3. Page 15, please explain what the contrast matrix is.

      4. Page 15, I would replace "time consumption" with "required execution time".

      5. The authors mention in several places (e.g. page 19 and the legend of Table 2) that they have "developed tools" for DIA analysis. This is not true, as they did not develop the original tools but integrated them into the Galaxy environment in this study. Please correct this.

      6. In figure 3 and supplementary figures 1-4 "Blot" is written, which I guess should be "Plot".

      7. Page 21, Unix is mentioned as operating system, which I guess is not correct, but rather Linux is used. Please provide the distribution and version number.
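
      The checks requested in major concerns 5 to 7 (per-protein CV, missing values per replicate, and normalization) can be sketched in a few lines. This is only a minimal illustration, not the authors' workflow or an MSstats call: the intensity matrix is made up, zeros are treated as missing values as the reviewer's "(0)" suggests, and median centering on the log2 scale is one assumed choice of normalization.

```python
import numpy as np

# Hypothetical intensity matrix: rows are proteins, columns are replicates.
# A value of 0 stands for a missing measurement (an assumption of this sketch).
intensities = np.array([
    [100.0, 110.0, 90.0, 0.0],
    [200.0, 0.0, 210.0, 190.0],
    [50.0, 55.0, 45.0, 50.0],
])

# Fraction of missing values in each replicate (column).
missing = intensities == 0
missing_fraction = missing.mean(axis=0)

# Coefficient of variation per protein, ignoring missing values.
masked = np.where(missing, np.nan, intensities)
cv = np.nanstd(masked, axis=1, ddof=1) / np.nanmean(masked, axis=1)

# Median centering of each replicate on the log2 scale.
log2_intensities = np.log2(masked)
normalized = log2_intensities - np.nanmedian(log2_intensities, axis=0)

print("missing fraction per replicate:", missing_fraction)
print("CV per protein:", cv)
```

      Box plots or violin plots of the columns of `log2_intensities` before and after centering would then show whether the replicates are comparable.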

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac005), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Paul Stewart

      Fahrner et al have produced a very nice manuscript and corresponding pipeline. They describe a collection of DIA tools in the Galaxy framework for reproducible and version-controlled data processing. These DIA tools are an excellent addition to the growing number of proteomics-centric tools already available in Galaxy. The reviewer could find no major revisions needed and therefore only requests a few minor revisions before this is ready for publication:

      Please include page numbers in the revised manuscript to make referencing the text easier.

      Page 6

      OpenSwath and PyProphet are cited and are also used in the manuscript. Please cite one or two alternatives.

      Please consider citing a tool each time it is used in a new paragraph (e.g. MSstats).

      There is heavy reliance on conjunctive adverbs (However, ...; Thus, ...) on this page and throughout the manuscript. These can make passages a bit hard to read. Please consider rephrasing.

      Page 7

      Why "so-called histories"? Aren't they simply "Histories"?

      Page 14

      'To decrease the analysis time of the semi-supervised learning, the merged OSW results can be first subsampled using the PyProphet subsample tool and subsequently scored using the PyProphet score tool. '

      The reviewer is not familiar with this approach. Can you please give additional justification (maybe under methods?) or provide a citation that this is a reasonable approach?

      Page 15

      Please check your reference software and/or work with the journal to ensure that the web addresses are linked properly. For example, the reviewer tried copying the link "https://training.galaxyproject.org/training- %20material/topics/proteomics/tutorials/DIA_lib_OSW/tutorial.html" but a "%20" (or a space) is inserted into the URL after "training-" so the link as it appears did not work until this was removed. A less technically savvy reader may think the links are broken and will not be able to access the materials.

      Page 16

      'We identified and quantified between 25.000 to 27.000 peptides ...'

      Please be consistent with number formatting (25000 vs 25.000). Other values in the tables did not use this formatting. Please check with journal editor for convention.

      Figures

      Please be consistent with axes labels. Some are upper case and some are lower case.

      Figure 2B

      Please round R2 to 2 or 3 decimals.

      Figure 3

      Please change the red-green color scheme to a more color-blind friendly color scheme (e.g. red blue)

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab093), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Elisabetta Manduchi

      This manuscript presents a workflow for gene-gene epistasis detection which leverages functional annotation resources such as Biofilter (to reduce the search space) and FUMA (to map SNPs to genes) and investigates the results obtained via different SNP-gene mapping criteria (positional, eQTL, Chromatin contacts, and some combinations of these). Moreover, these results are compared with those obtained via a 'standard' analysis where no filtering is applied to pair selection and positional SNP-gene mapping is used. Due to the challenges presented by GWAIS, leveraging functional genomics to focus the search is a valid strategy which has been employed in other recent works in the field. This is a nice work and the paper is generally well written, with sufficiently detailed methodological information. Below are some comments and questions.

      1. As indicated in recent GWAS works aimed at 'solving GWAS loci' (i.e. determining the genes affected by significant SNPs), it is not always the case that the gene affected by a SNP is the one positionally closest to the SNP. Indeed, a SNP may affect a gene not only when it resides in its coding or promoter regions, but also when it resides in a faraway enhancer. This is why epigenetic information such as chromatin loops (referred to by the authors as 'Chromatin') can be useful for SNP-gene mapping. In the presence of chromatin contacts or eQTL information, typically one would use the derived mapping to augment the positional mapping, which is always available. That is, if one had chromatin contacts data, they would use positional + Chromatin to map SNPs to genes. If one had eQTL data, they would use positional + eQTL to map to genes. If one had both, they would use positional + eQTL + Chromatin. From a biological interpretability perspective, there is no reason to exclude the positional information. For example, a SNP in the promoter of a gene could interact with a SNP in a distal enhancer of another gene, affecting a specific trait. In view of this, the statement (lines 326-328) "Since the main objective of this protocol is to increase the biological interpretability of epistasis findings, we have excluded other combinations that mix functional and non-functional information (Positional + eQTL and Positional + Chromatin)" is not quite valid, as positional information is also functional. On the other hand, using eQTL only, Chromatin only, or eQTL + Chromatin, albeit interesting in terms of looking at how this type of reduction in the search space affects results, does not quite reflect a biologically guided approach.

      2. I wonder whether the authors have considered also filtering by markers of relevant chromatin states. Information about open chromatin and other epigenetic marks could help to further filter SNPs, both in enhancers and promoters. This would be particularly useful for SNPs mapped via chromatin contacts, which are likely to contain many irrelevant signals.

      3. The eQTL and chromatin contact data used in this work were from all available tissues. Typically, GWAS related functional filtering is done using data from tissues relevant to the trait under investigation, when available. For IBD, it may help to restrict to intestinal tissues, immune cells (like T cells, macrophages, dendritic cells), and possibly also nervous system cells (which, at least according to some, could also be among the potential 'culprit' IBD tissues).

      4. To adjust for population structure the authors regressed out the first 7 PCs from the phenotype. Given that the PCs are confounders, it would be good to discuss the impact of doing this as opposed to also regressing the confounders out of the SNPs, i.e. testing the response residuals vs the SNP residuals. In the same spirit, it would be good to discuss the impact of the PC-SNP association on the p-value and type-I error results obtained by permuting the response residuals.

      5. Section 2.1 is somewhat too concise and may be unclear to the reader. Later in the Discussion (lines 229-239) the authors explain how their procedure corrects for multiple testing at the SNP model without additional corrections for multiple testing at the gene model (this is also implicitly described in Fig 7CD), and yet their procedure keeps type I error under control. However, it may be beneficial, for ease of reading, to expand section 2.1 (via text and/or figure) so as to clarify better, at the outset, where and when multiple testing corrections are applied.

      6. In the absence of a replication data set, the authors assess the robustness of the gene pair results via 10 repetitions of the workflow using 80% of the discovery data set. It would be useful to include some discussion of how their results could be further assessed in other GWAS data sets (e.g. from UK biobank, etc.), in view of the fact that it is typically hard to reproduce epistasis findings, at least at the SNP level. Certainly one could first check whether the discovered SNP-SNP interactions are reproduced and limiting the analyses to those pairs would require a less severe multiple testing correction. But another approach may be to start with the discovered gene pairs and then analyze all pairs of SNPs mapping to these genes (not necessarily those discovered in this study), etc. Do the authors plan future follow-up studies on this?

      7. In section 2.7 the results of pathway analyses on 3 (eQTL, Positional, and Standard) of the 5 networks presented in Figure 3 are provided. What about the other 2?

      8. For these two points I defer to the editor:

      (i) The format of the manuscript is close to but does not exactly match the specifications at https://academic.oup.com/gigascience//pages/research. I do not know how strict these specifications are and I have no objections to the current format.

      (ii) Data availability is not discussed (as per Data and materials in https://academic.oup.com/gigascience/pages/instructions_to_authors). I imagine that the IIBDGC only makes publicly available the summary statistics. This is, however, common in the GWAS field.

      1. Some minor notes follow:

      (i) In the Author Summary the 'ATPM' acronym is used for the first time without explanation.

      (ii) In section 4.2 it would be helpful to re-iterate that the SNP-gene mapping for the Standard analysis was genomic proximity (this is only mentioned briefly at line 206).

      (iii) Typo at line 168 "the same than" should be "the same as".

      (iv) It should be specified which of the MSigDB collections was used. Later in this section gene sets are referred to as 'pathways' but there is more than one pathway collection in MSigDB.

      (v) In the formula at line 397 doesn't "tested gene sets" refer to "tested gene neighborhoods"? If so, it would be better to use the latter for clarity.

      (vi) There appear to be some typos in the caption for Supplementary Figure 1: "we computed three linear models using the different residuals as response variable and SNP interactions as dependent variables". I guess should be "SNP interactions as independent variables". Also, weren't the two individual SNPs also included as independent variables in these models?
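
      Comment 4 above contrasts regressing the PCs out of the phenotype only with regressing them out of both the phenotype and the SNPs. The difference can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' pipeline: the data are simulated, the "SNP" is a continuous variable correlated with the first PC, and the helper names `residualize` and `slope` are hypothetical.

```python
import numpy as np

def residualize(v, covariates):
    """Residuals of v after least-squares regression on covariates plus an intercept."""
    X = np.column_stack([np.ones(len(v)), covariates])
    beta, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ beta

def slope(x, y):
    """Ordinary least-squares slope of y on x (with intercept)."""
    x_c = x - x.mean()
    return float(x_c @ (y - y.mean()) / (x_c @ x_c))

rng = np.random.default_rng(0)
n = 2000
pcs = rng.normal(size=(n, 7))                      # 7 simulated PCs
snp = 0.8 * pcs[:, 0] + rng.normal(size=n)         # SNP associated with PC1
y = 0.5 * snp + 1.0 * pcs[:, 0] + rng.normal(size=n)

y_res = residualize(y, pcs)
snp_res = residualize(snp, pcs)

b_pheno_only = slope(snp, y_res)   # PCs removed from the phenotype only
b_both = slope(snp_res, y_res)     # PCs removed from phenotype and SNP

# Reference: SNP coefficient in the joint model y ~ SNP + PCs.
X_full = np.column_stack([np.ones(n), snp, pcs])
b_joint = float(np.linalg.lstsq(X_full, y, rcond=None)[0][1])

print(b_pheno_only, b_both, b_joint)
```

      Residualizing both sides reproduces the coefficient of the joint model (the Frisch-Waugh-Lovell result), whereas residualizing only the phenotype shrinks the estimate towards zero whenever the SNP is associated with the PCs.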

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab093), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Shing Wan Choi

      Here, the authors presented a pipeline for the analysis of epistasis effects in GWAS data (GWAIS). This is traditionally a difficult problem due to the large search space involved, which makes the analysis computationally intensive and subject to a heavy multiple testing burden. A new method that restricts the search space can definitely help make GWAIS more feasible and reproducible. I have some questions after reading the current paper; please excuse me if the information is already presented within the paper:

      1. I am not sure what the Standard model comprised. According to the methodology section, the Standard analysis considered all SNP pairs without prior filtering. Does that mean all 14,501,130,150 SNP pairs (C(170301, 2)) were tested? Or were not all SNP pairs considered?

      2. When presenting the number of SNPs linked to each gene based on different criteria (e.g. position, eQTL or Chromatin contact), wouldn't the gene size be a major predictor of the number of SNPs linked? It would most likely be the case for positional mapping, right?

      3. I am curious to see if restricting the eQTL and Chromatin information to disease-specific tissue will help improve the performance of the current model.

      4. Very little information was provided for the PRS analysis. What genome-wide association summary statistics were used? Did the authors perform high-resolution scoring? With the shrinkage / thresholding done in normal PRS analysis, some SNPs' effects might be excluded or "shrinked" away from the PRS model; would that affect the interpretation of the PRS covariate analysis? E.g. maybe SNPs not included in the PRS model were those unaffected? (With PRSice, one can use --print-snp to obtain the list of SNPs that are included in the model.)

      5. It seems like Biofilter provides a SNP-SNP interaction prediction model; how does that compare to what was presented here?

      6. In Figure 3, the results from eQTL + Chromatin and Position + eQTL + Chromatin are identical. Together, it seems like the positional mapping does not contribute to the result at all, which is a bit surprising. Is there any explanation for that? Would it be due to the mapping of the Immunochip array, or a characteristic of IBD?

      7. Given the sample size of the current data, a HWE threshold of 0.001 seems rather stringent. Will the result improve if a less stringent threshold is used (e.g. 1e-6)?
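
      The pair count quoted in question 1 can be checked directly. The sketch below reproduces C(170301, 2) and, purely as an illustration of the multiple testing burden, the per-test threshold that a Bonferroni correction at 0.05 would imply; the choice of Bonferroni and of 0.05 is an assumption of this example, not the correction used in the manuscript.

```python
from math import comb

n_snps = 170_301               # SNP count implied by C(170301, 2)
n_pairs = comb(n_snps, 2)      # unordered SNP pairs in an exhaustive search
print(f"{n_pairs:,}")          # 14,501,130,150

alpha = 0.05                   # assumed family-wise error rate
bonferroni_threshold = alpha / n_pairs
print(f"{bonferroni_threshold:.2e}")
```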

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac001), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 3: Hirak Sarkar

      Producing single-cell count matrix from the raw barcoded read sequences consists of several contributing steps such as whitelisting, correcting cell barcodes, resolving multi-mapped reads, etc. Each step can potentially introduce variability in the resulting count matrix depending on the specific algorithm adapted by the tool used. Bruning et al. attempted to disentangle these effects using the most popular scRNA-seq quantification tools such as Cell Ranger 5, STARsolo, Kallisto, and Alevin. The manuscript is well-written and would add considerable value to the broad single-cell research community. I have a few concerns about the current draft of the manuscript that can be addressed in a revision.

      • The scina tool is used to construct an "artificial ground truth". The consensus of two or more mappers are used to arrive at this reference annotation. In my opinion, the consensus can lead to a biased reference, especially since STARSolo and Cell Ranger5 follow a very similar pipeline; it is expected, by design, that those tools would have highly-overlapping results.

      I suggest that the simulated datasets from the pre-decided clusters might be more appropriate for an unbiased evaluation (The recent paper from Kaminow et al. https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1.full has similar simulations). Having said that, the current consensus-based analysis in my opinion should give a reasonable reference for most of the cells, but a more principled simulation is required to identify the extreme cases where each of the tools might show variable assignments.

      • The Sankey plots (Supp Figure 5) and the heatmaps (Supp Figure 6) represent the mutual agreement from different tools. As the scina clusters are used as ground truth, a more direct qualitative measure such as precision/recall would be more helpful.

      To be more specific, the resolution parameter of FindCluster could be tuned (now set to 0.12/0.15) to produce the same number of clusters present in the ground truth. Each predicted cluster can then be assigned to a ground truth cluster greedily. The number of mismapped cells can be further categorized as false-positive or false-negative.

      • The variability of different tools on the three real datasets is worth exploring in depth. For example, quoting from the paper, "Alevin detected more cells with less genes per cell in the PBMC and Endothelial dataset. However, it detected less cells with more genes per cell in the Cardiac dataset." It would be interesting to understand the origin of these variations and what authors hypothesize, e.g. apart from mapping/alignment there are other additional steps in the quantification pipeline that could potentially lead to variation in the detected cells and respective gene count. The tools can also have underlying algorithmic biases that are worth exploring.

      • "We could show that Alevin often detects unique barcodes, which were not identified by the other tools. These barcodes had very low UMI content and were not listed in the 10X whitelist." The alevin --whitelist option (https://salmon.readthedocs.io/en/develop/alevin.html#whitelist) enables the use of any external filtered whitelist while running alevin. I wonder if using this option would change the behavior mentioned in the manuscript.

      • The manuscript raises the important question of multi-mapped reads across cell-types, it would be interesting to quantify the percentage of reads that are discarded as multi-mapped by different tools (those which discard). If that percentage is substantial, then the difference in handling such ambiguous reads through EM-like algorithm might be promising.
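
      The greedy cluster assignment and precision/recall evaluation suggested above for the Sankey plots and heatmaps could look roughly as follows. This is a minimal sketch with made-up labels: the function names are hypothetical, and the largest-overlap-first rule is one simple reading of "assigned to a ground truth cluster greedily".

```python
from collections import Counter

def greedy_match(pred, truth):
    """Map each predicted cluster to one ground-truth label, largest overlap first."""
    overlap = Counter(zip(pred, truth))
    mapping, used = {}, set()
    for (p, t), _count in overlap.most_common():
        if p not in mapping and t not in used:
            mapping[p] = t
            used.add(t)
    return mapping

def precision_recall(pred, truth, mapping):
    """Per-label (precision, recall) after relabelling predicted clusters."""
    assigned = [mapping.get(p) for p in pred]
    scores = {}
    for label in set(truth):
        tp = sum(a == label and t == label for a, t in zip(assigned, truth))
        fp = sum(a == label and t != label for a, t in zip(assigned, truth))
        fn = sum(a != label and t == label for a, t in zip(assigned, truth))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (precision, recall)
    return scores

# Made-up example: six cells, three reference cell types, three predicted clusters.
truth = ["T", "T", "T", "B", "B", "Mono"]
pred = [0, 0, 1, 1, 1, 2]

mapping = greedy_match(pred, truth)
scores = precision_recall(pred, truth, mapping)
print(mapping)
print(scores)
```

      In this toy case one T cell falls into the cluster matched to B cells, so it counts as a false negative for T and a false positive for B.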

      Plots and Figures

      -Intersection Plots

      The minor differences in the y axis of the intersection plots (Fig. 4, supp fig. 3 etc.) are not pronounced. (log-scale might help)

      -Overview Figure

      The manuscript correctly pointed out how different intermediate steps contribute to the general variance in the downstream results. An overview figure with a flow chart of a typical scRNA-seq quantification pipeline will be beneficial.

      Minor Concerns

      There is a spelling mistake in the abstract celtype -> cell-type

      Possible incomplete sentence : "The recommended annotation from 10X, which only contains genes with the biotypes protein coding and long non-coding, might lead to an overestimation of mitochondrial gene expression respectively the absence of other gene types."

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac001), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Serghei Mangul

      1 -- The abstract contains confusing terminology; for example, "became available" can be replaced by "developed".

      2 -- Also, "analyzed several data sets" can be replaced by "benchmarking" to clearly indicate that this refers to benchmarking rather than analysis. Some terminology needs to be explained; for example, whitelisting should be defined.

      3 -- Kallisto is not an alignment tool in the proper sense, as it does not report the position of the read but only the transcript of origin. This is pseudoalignment. Either alignment needs to be defined, or the word pseudoalignment should be used.

      4 -- How was the ground truth or gold standard defined? Is the assumption of the paper that the tool with the highest number of mapped reads performs the best? This needs to be explained in the introduction.

      5 -- In general, read alignment is an artificial rather than a biological problem, so a molecular gold standard cannot be defined. See for example https://www.nature.com/articles/s41467-019-09406-4. It would be helpful to explain this upfront when talking about the gold standard, and to cite this.

      6 -- It is unclear how the tools were selected. What was the reasoning for selecting only 4 tools, and how do the authors know that those tools are common? For the complete list of RNA-based alignment tools, the authors can refer to https://arxiv.org/abs/2003.00110. A reasonable selection criterion would be to take the tools that are available in, for example, Bioconda, which makes installing them easy. However, randomly selecting tools is not acceptable. For example, why was Salmon not included while Kallisto was?

      7 -- The language of the paper needs to be improved; for example, in the Background section the word "great" was used, which can be replaced by more appropriate scientific wording.

      8 -- More explanation needs to be provided for Cell Ranger. Is it essentially a wrapper around STAR? Does it involve any novel algorithms or software development?

      9 -- The authors need to explain why they chose only 10x Genomics among the available single-cell platforms.

      10 -- The annotations may indeed influence the alignment when they are provided to alignment tools. Is every alignment tool able to take custom annotations? The paper lacks a figure showing which annotation performs best for a given dataset.

      11 -- Datasets and reference genomes section: gold standard datasets are not reported. It is not clear whether the paper has such a dataset or whether it is missing. If it is missing, how are the authors able to say which read alignment tool performs best?

      12 -- The paper contains a single human sample. Is there any particular reason for that? The paper would benefit from having multiple human samples, as was done for the mouse. Did the authors perform a systematic search to identify as many single-cell samples as possible? If not, that would be desirable.

      13 -- Was the 10x human data available only on the 10x website, and not on SRA or GEO?

      14 -- The paper provides a GitHub link with the datasets and the code used for this analysis. Does the GitHub repository also contain the BAM files? If not, those need to be uploaded. Additionally, are the code and summary data behind the figures provided?

      15 -- Results section: the beginning of the Results section would benefit from a short description of the datasets. For example, how many samples were there in total? What was the read length for each sample? What was the number of reads for each sample? Were these different across samples? Providing the mean and the variance would be helpful.

      16 -- In general, the figures need to be improved in terms of visualization. It is very hard to understand what the figures are trying to convey. For example, Figure 2 is absolutely impossible to understand, and its purpose is also unclear. The same holds for Figure 3: it is a very busy figure, and it is hard to know what it is trying to convey.

      17 -- Figure 4 is also very hard to understand. Using a log scale might improve it. It is unclear, for example, what the x axis is. In general, the figures need to be improved.

      18 -- In general, the figures need to be visually understandable and more effective.

    3. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac001), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Bo Li

      Single-cell RNA-seq has revolutionized our abilities of investigating cell heterogeneity in complex tissue. Generating a high-quality gene count matrix is a critical first step for single-cell RNA-seq data analysis. Thus, a detailed comparison and benchmarking of available gene-count matrix generation tools, such as the work described in this manuscript, is a pressing need and has the potential to benefit the general community.

      Although this work has a great potential, the benchmarking efforts described in the manuscript are not comprehensive enough to justify its publication at GigaScience unless the authors address my following major and minor concerns.

      Major concerns:

      1) The authors should discuss related benchmarking efforts and the differences between previous work and this manuscript in the Background section instead of the Discussion section. For example, Du et al. 2020 G3: Genes, Genomes, Genetics. and Booeshaghi & Pachter bioRxiv 2021 should be mentioned and discussed in the Background section. In addition, the STARsolo manuscript (https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1), which contains a comprehensive comparison of CellRanger, STARsolo, Alevin and Kallisto-Bustools, should be cited and discussed. Zakeri et al. 2021 bioRxiv (https://www.biorxiv.org/content/10.1101/2021.02.10.430656v1) should also be included and discussed in the Background section.

      2) Benchmark with the latest versions of the software. The choice of Cell Ranger, STARsolo, Alevin and Kallisto-BUStools is good because they are four major gene count matrix generation tools. However, I urge the authors to also include CellRanger v6 and Alevin-fry (Alevin_sketch/Alevin_partialdecoy/Alevin_full-decoy, see STARsolo manuscript), which are currently lacking, in their benchmarking efforts. The authors may also consider adding STARsolo_sparseSA to the benchmark. Since single-cell RNA-seq tool development is a fast-evolving field, benchmarking the up-to-date versions of tools is super critical for a benchmarking paper.

      3) Conclusions. The authors summarized the observed differences between tools based on the benchmarking results. This is good but very helpful for end-users. I recommend the authors to emphasize their recommendations for end-users more clearly in the discussion/results section. For example, do the authors recommend one tool over the others under certain circumstances? If so, which tool and which circumstance and why? I like Figure 5 a lot and hope the authors can summarize this figure better in the manuscript.

      4) This manuscript concluded that differential expression (DEG) results showed no major differences among the alignment tools (Figure 4). However, the STARsolo manuscript suggested DEG results are strongly influenced by quantification tools (Sec. 2.6, Figure 5). Please explain this discrepancy.

      5) This manuscript suggested simulated data is not as helpful as real data. However, the STARsolo manuscript reported drastic differences between tools using simulated data. Please comment on this discrepancy.

      6) I have big concerns regarding the filtered vs. unfiltered annotation comparison. In particular for pseudogenes, we know that many of them are merely transcribed or lowly transcribed. As a result, many of these pseudogenes would not be captured by the single-cell RNA-seq protocol. At the same time, because these pseudogenes share sequence similarities with functional genes, they would bring trouble for read mapping. This is one of the main reasons for using a carefully filtered annotation. Actually, whether and how to filter annotation is in active debate in big cell atlas consortia such as the Human Cell Atlas. Thus, I would be super careful about describing results comparing filtered vs. unfiltered annotation. For example, in Suppl. Figure 8D, there are 6 mitochondrial genes that have 100% sequence similarity to their corresponding pseudogenes. It is impossible to distinguish whether a read comes from a gene or a pseudogene for these 6 genes, and it is also not necessary --- the transcribed RNA should also be exactly the same. Thus, I encourage the authors to remove these pseudogenes from the annotation, and I suspect the mouse data results should then look similar to the human data in Suppl. Figure 8A.

      7) The endothelial dataset was only run on CellRanger 3 because the UMI sequence is one base shorter. Could the authors augment the UMI sequence with one constant base and run this dataset through CellRanger 4/5/6?

      8) I think it is more appropriate to call the tools benchmarked as "gene count matrix generation tools" instead of "alignment tools".

      Minor concerns:

      1) The Suppl Table 2 mentioned in the main text corresponds to Suppl. Table 3 in the attachment. In addition, there is no reference to Suppl Table 2.

      2) Suppl Table 3 PBMC, why do I see endothelial cell markers in PBMC dataset?

      3) Suppl Figure 7 is never referenced in the main text.

      4) Suppl Figure 8D is never referenced in the main text.
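
      Major concern 7 suggests augmenting the shorter UMI with one constant base so that the endothelial dataset can be run through newer Cell Ranger versions. A rough sketch of that preprocessing step is shown below. It assumes that read 1 holds the cell barcode followed by the UMI, and the function names, the padding base "A", the quality character "I" and the toy lengths are all hypothetical; whether Cell Ranger accepts reads padded in this way is not verified here.

```python
def pad_umi(seq, qual, barcode_len, umi_len, base="A", qual_char="I"):
    """Insert one constant base (and quality character) right after the UMI."""
    cut = barcode_len + umi_len
    if len(seq) != len(qual) or len(seq) < cut:
        raise ValueError("read shorter than barcode + UMI, or quality length mismatch")
    return seq[:cut] + base + seq[cut:], qual[:cut] + qual_char + qual[cut:]


def pad_fastq_records(lines, barcode_len, umi_len):
    """Yield FASTQ lines (4 per record) with every read's UMI padded by one base."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        seq, qual = pad_umi(seq, qual, barcode_len, umi_len)
        yield from (header, seq, plus, qual)


# Toy record with hypothetical lengths (4-base barcode, 3-base UMI).
record = ["@read1", "AAAACCCTT", "+", "FFFFFFFFF"]
padded = list(pad_fastq_records(record, barcode_len=4, umi_len=3))
print(padded)
```

      Because the same base is appended to every UMI, distinct UMIs stay distinct, so the padding should not by itself change UMI deduplication.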

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab101), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Filippos Bantis

      The authors used imaging tools with three types of phenotypic descriptors (dimensions, shape, colour indices) and side- or top-camera views in order to determine non-destructive parameters of seven diverse species (Arabidopsis thaliana, Brachypodium distachyon, Euphorbia peplus, Ocimum basilicum, Oryza sativa, Solanum lycopersicum, and Setaria viridis) growing under different Red/Blue gradients (from 100% Blue to 100% Red). The results are important since they are non-destructive and provide a good basis for the selection of light treatments for specific plants in controlled environment agriculture. The introduction is informative and sufficiently describes the scope of the research. I like the way the authors describe/display the results. Relatively few words (compared to the volume of the obtained measurements) but beautifully built figures which provide all the necessary information. However, I would expect more discussion at the end of each set of parameters results' description, as well as possible comparisons with the literature, even if it is rather scarce. For example, in PDF page 11, subsection "Patterns of change over time", the results are barely discussed. Moreover, the review process would be facilitated if the manuscript had line numbering.

      Specific comments follow:

      • In the title, LED should be written with capital letters, not Led

      • Keywords must not be included in the title. Please remove or substitute LED and light quality.

      Introduction

      • PDF page 4, L3. Controlled environment agriculture must be abbreviated the first time it is written in the text. The same applies with other terms such as RGB.

      • PDF page 5, L23. "Large-scale crops" is more appropriate term.

      • I agree with the active voice in the objectives' part of the introduction. However, you should refrain from beginning most sentences with "we".

      Results

      • PDF page 8. I suggest that "Data description" subsection is moved in the "Methods" section

      • This section should be renamed to "Results and Discussion" since there is also discussion within the results.

      Methods

      • PDF page 14. How many cabinets were used? How many treatments and plants were placed in each cabinet? Apart from figure 1 depiction, you should also describe the experimental design in order for the reader (and me as well) to fully understand it.

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab101), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Yujin Park

      The manuscript presents the result of an experiment investigating the impact of red:blue ratio of light gradient on plant phenotypic traits in seven plant species. The subject of the manuscript is very innovative and interesting, but there are parts of the materials and methods that are less clear. Specific comments:

      • In this study, plant phenotypic traits were evaluated using an imaging platform. Plant biomass (fresh and dry weights of shoot and root) is one of the most important plant growth parameters. Are there any suggestions that plant biomass can be predicted from the plant phenotypic traits quantified by the imaging platform?

      • Growth conditions:

      • Does the irradiance of 130-150 µE∙m-2∙s-1 indicate the PPFD (400-700 nm)? How was it measured?

      • Please be consistent for the unit for photon flux density throughout the manuscript. µEinsteins were interchangeably used along with µmol∙m-2∙s-1 in the past, but the Einstein is not a unit in the SI of units. Thus, please use µmol∙m-2∙s-1 when you quantify the photon flux density. Also, please revise the µmoles∙m-2∙s-1 in Fig. 1 into µmol∙m-2∙s-1.

      • Could you provide the spectral distribution data for white light, red LED, and blue LED used in this study?

      • For the concentration of the slow release fertilizer, do you mean gram per liter? If so, please correct it to 6 g∙L-1.

      • What were the growing conditions (air temperature, relative humidity, photoperiod, etc.) during the red:blue gradient treatment?

      • Did you keep the control plants under white light continuously? Then, did you make sure that the control plants and treatment plants are grown under the same growing condition except for the light quality treatment?

      • It is not clear whether the experiment was replicated. The experimental unit is the physical entity which can be assigned, at random, to a treatment. Here the experiment unit was the experimental plot under each light gradient treatment. A single plant should be treated as an observational unit. So, without replications, the data is less reliable.

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab099), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 3: Idoia Ochoa

      The authors present a novel tool for the compression of collections of bacterial genomes. The authors present sound results that demonstrate the performance gain of their tool, MBGC, with respect to the state-of-the-art. As such, I do not have concerns about the method itself. My main concerns are with respect to the description of the tool, and how the results are presented. Next I list some of my suggestions (in no particular order):

      Main Paper:

      • Analysis section: Before naming MBGC specify that it is the proposed tool.

      • Analysis section: Reference for HRCM. Mention here also that other tools such as iDoComp, GDC2, etc. are discussed in the Supplementary (this way the reader knows more tools were analyzed or at least tried on the data).

      • Analysis section: The paragraph "Our experiments with MBGC show that... " is a little misleading, since it seems that the tool has the capacity to compress a collection and just extract a single genome from it. This becomes clear later in the text when it is discussed how the tool could be used to speed up the download of a collection of genomes from a repository. So maybe explain that in more detail here, or mention that it could be used to compress a bunch of genomes prior to download. And then point to the part of the text where this is discussed in more detail.

      • Analysis section: The results talk about the "stronger MBGC mode", the "MBGC max", but in the tables it reads "MBGC default" or "MBGC -c 3". I assume "MBGC -c 3" refers to "MBGC max", but it is not stated anywhere. Maybe better to call them "MBGC default" and "MBGC max".

      • Analysis section: Although the method is explained later in the text, it would be a good idea to give a sense of the difference between the default and max modes of the tool. Or some hints on the trade-off between the two. Also, the parameter "-c 3" is never explained.

      • Analysis section: Figures: it is difficult to see the trade-off between relative size and relative time. Can you use colored lines, such that the same color refers to the same set of genomes? Also, in the caption, explain whether we want small or high relative size and time. It may be clear, but it is better to state it explicitly.

      • Analysis section: There is a sentence that says "all figures w.r.t. the default mode of MBGC". It would be good also to state that in the caption, so that the reader knows which mode of the tool is being used to generate the presented results, and whether the input files are gzipped or not. For example, for the following paragraph that starts with Fig. 1, it is not clear if the files are gzipped or not.

      • Analysis section: First time GDC2 is mentioned, the first thing that comes to mind is why it was not used for the bacterial experiments. See my previous point on having a couple of sentences about the other tools that were considered, and why they are not included in the main tables/figures.

      • Methods:

      -- Here I am really missing a diagram explaining the main steps of the tool. It seems the paper has been rewritten slightly to fit the format of the journal and some things are not in the correct order. For example, it says the key ideas are already sketched, but I do not think that is true.

      -- (offset, length) I assume refers to the position in REF where the match begins and the length of the match, but again, this is not really explained. A diagram would help. Also, when it is time to compress the pairs, are the offsets delta encoded, or encoded as they are with a general compressor?

      -- How are the produced tokens (offset, length, literals, etc.) finally encoded?

      -- First time parameter "k" is mentioned: what is the default value? Also, how can you do a left extension and "swallow" the previous match? Is it because the previous match could have been at another position? Otherwise, if it was in that position, it would have already been extended to the right, correct? I mean, it would have generated a longer match.

      -- The "skip margin" idea is not well explained. I am not sure why the next position after a match is decreased by m. Please explain better or use a diagram with an example.

      -- When you mention 1/192, maybe already state that this is controlled by the parameter u. Otherwise, when you mention the different parameters, it is difficult to relate them to the explanation of the algorithm.

      Availability of supp...

      -- from from (typo)

      Tables

      -- Specify the number of genomes in each collection.

      -- change MBGC -c 3 to MBGC max or something similar. (see my previous comment -c flag is not explained!)

      Supplementary Material

      -- move table 1 after the text for ease of reading

      -- not clear if the tool has random access or not. The percentage of time (w.r.t. decompressing the whole collection, I believe) that it would take to decompress one of the first genomes vs one of the last ones is discussed. This should be better explained. For example, if we decompress the last genome of the collection we will employ 100% of the time, right? Given that previous genomes are part of REF (potentially). Please explain better and discuss this point in the analysis part, not only in the supplementary. It seems like an important aspect of the algorithm.

      -- I assume this is not possible, but should be discussed as well. can you add a genome to an already compressed collection? this together with the random access capabilities will highlight better the main possible uses of the tool.

      -- section 4.3: here HT is used, and then HT is introduced in the next paragraph. please revise the whole text and make sure everything is in the right order.

      -- parameter m, please explain better.

      -- add colors to figures, it will be easier to read them.

      Overall, as I mentioned before, I believe the tool offers significant improvements with respect to the competitors for bacterial genomes, and performs well on non bacterial genomes as well. What should be improved for publication is the description of the method, since at the end of the day is the main contribution, and how the text is presented.

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab099), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Jinyan Li

      This paper proposed a compression algorithm to compress sets of bacterial genome sequences. The motivation is based on the reason that the existing algorithms and tools are targeted at large, e.g., mammalian, genomes, and their performance on bacteria strains is unknown. The key idea of the proposed method is to detect characteristic features from both the direct and reverse-complemented copies of the reference genome via LZ-matching. The compression ratio is high and the compression speed is fast. Specifically, on a collection of 168,311 bacterial genomes (587 GB in file size), the algorithm achieved a compression ratio around the factor of 1260. The author claimed that the performance is much better than the existing algorithms. Overall, the quality of the paper is quite good.

      I have two suggestions for the author to improve the manuscript:

      1/ it's not clear to me about this sentence that "we focus on the compression of bacterial genomes, for which existing genome collection compressors are not appropriate from algorithmic or technical reasons." More clarifications are needed.

      2/ With my own experience, GDC2 has a better performance on virus genome collections than HRCM. It's strongly suggested for the author to add the performance of GDC2 on the bacterial genome collections.

    3. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab099), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Diogo Pratas

      This article presents a new compressor that uses both direct and reverse-complemented LZ-matches with multi-threaded and cache optimizations.

      Generally, the reported results of this tool are exciting, and once confirmed, they have good applicability in the bioinformatics community.

      However, I could not reproduce the results for lack of instructions, the benchmark is not representative of the state-of-the-art, and there are also several associated questions. The comments are specified below.

      Regarding the experiments:

      1. The experiments could not be reproduced. Unfortunately, the instructions and documentation are not clear (see my attempts below).

      2. The benchmarking is missing several well-known tools (for example, naf, geco3, Deliminate, MFCompress, Leon, ...). See, for example:

      Kryukov, Kirill, et al. "Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences." Bioinformatics 35.19 (2019): 3826-3828.

      Silva, Milton, et al. "Efficient DNA sequence compression with neural networks." GigaScience 9.11 (2020): giaa119.

      Yao, Haichang, et al. "Parallel compression for large collections of genomes." Concurrency and Computation: Practice and Experience (2021): e6339.

      To access more compressors, please see the following benchmark (that is already cited in the article): https://academic.oup.com/gigascience/article/9/7/giaa072/5867695

      Regarding the manuscript:

      1. The State-of-the-art in genomic data compression (or at least in collections of genomes) is brief and does not offer a consistent and diverse description of the already developed tools.

      2. "By the compression ratio we mean the ratio between the original input size and the compressed size. If, for example, the ratio improves from 1000 to 1500, e.g., due to changing some parameters of the compressor, we can say that the compression ratio improves 1.5 times (or by 50%)." This sentence seems a little confusing (at least for me). Please, rephrase.

      3. "The performance of the specialized genome compressor, HRCM [7], is only mediocre, and we refrained from running it on the whole collection, as the compression would take about a week.". The purpose of a data compressor can be very different: to use in machines with lower RAM, for compression-based analysis, for long-term storage, research purposes, among others. The qualification of HRCM without putting it into context seems to be depreciative.
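      The compression-ratio definition quoted in comment 2 above can be made concrete with a small worked example. The following is only a sketch in Python: the function names are hypothetical and the sizes are arbitrary illustrative numbers chosen to reproduce the ratios 1000 and 1500 from the quoted sentence, not measurements from the paper.

```python
def compression_ratio(original_size: float, compressed_size: float) -> float:
    """Ratio between the original input size and the compressed size."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return original_size / compressed_size


def relative_improvement(old_ratio: float, new_ratio: float) -> float:
    """How many times larger the new ratio is than the old one."""
    return new_ratio / old_ratio


# Illustrative sizes only (arbitrary units), picked to give ratios 1000 and 1500.
old = compression_ratio(3000.0, 3.0)
new = compression_ratio(3000.0, 2.0)

# The ratio improves 1.5 times, i.e. by 50%.
factor = relative_improvement(old, new)

# The compressed output itself shrinks from 3.0 to 2.0 units, i.e. by one third.
output_shrink = 1.0 - 2.0 / 3.0
```

      Under these assumptions, a 50% gain in ratio corresponds to a compressed output that is only about 33% smaller, which may be one reason the quoted phrasing can read as confusing.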

      Regarding the tool and documentation:

      1. Although I have downloaded and compiled the tool, I had to dedicate some minutes to a "libdeflate" default version issue. The majority of the bioinformatics community uses conda. In order to minimize installation issues for the users, please, provide a conda installation for the proposed tool. Also, the libdeflate can already be retrieved with conda. Then, with the instructions for the installation of mbgc, please, add this line to mbgc repository:

      conda install -c bioconda libdeflate

      Notice that this "conda" part is a suggestion that will facilitate the usage of mbgc by the bioinformatics community.

      1. Running ./mbgc gives the output:

      ./mbgc: For compression expected 2 arguments after options (found 0)

      try './mbgc -?' for more information

      If the menu appears as default (no arguments besides the program's name), it will be much more helpful.

      1. The program should have a version flag to depict the version of the program (besides the version at the menu). This feature is essential for integration/implementations (e.g., conda) and to differentiate from eventual new versions to the mbgc software.

      2. Please, provide a running example at the help menu (with tiny existing sequences at the repository).

      3. Is this characteristic of mbgc a strict property: "decompresses DNA streams 80 bases per line"? This characteristic may create differences between original files and uncompressed files. Perhaps, having the possibility to have a custom line size would be a valuable feature, at least for data compression scientists to access and compare with other compressors, mainly because it makes the decompressor not completely lossless (although in practice, there is minimal information required to maintain the whole lossless property). Nevertheless, if the program decompresses FASTA data with a unique line size (for DNA bases) of 80 bases, this should also be mentioned in the article (besides what already exists in the repository).

      4. The first impression was that "sequencesListFile" are the IDs of the bacterial genomes, then I found out that they are the URL-suffixes for the FASTA repository. Then, I started to wonder if mbgc could accept directly the FASTA containing the collection of genomes. How can the user provide the FASTA file directly? This feature would greatly simplify the usage of mbgc. Rationale: most of the reconstruction pipelines output multi-FASTA sequences in a single file. Therefore, this feature has direct applicability. Please, add more information about this in the help print and at the README. A higher goal would be to have stdin and stdout in compression/decompression as an option and the style of the argument as POSIX (Program Argument Syntax Conventions). These features are important for building bioinformatics pipelines and performing analyses (especially since the tool seems to be ultra-fast).

      5. Table 1,2,3,4 (and the additional table at supplementary material) have "Compress / decompress times (as "ctime" / "dtime") are given in seconds," but no unit is given in the caption for cmemory and dmemory. Are these values in gigabytes?

      6. The README should provide a small example for testing purposes with the files already available at the repository or by efetch download (see below).

      7. The reproducibility is hard to follow:

      I had to search for the following procedure to test the software:

      wget https://github.com/kowallus/mbgc/releases/download/v1.1/tested_samples_lists.7z

      7z e tested_samples_lists.7z

      After the cere download, also tar -vzxf cere_assemblies.tgz

      Then, I realized that it was missing the sequences, and by the NCBI interface, I lost track. I gave up after a few segmentation faults/combinations without understanding if the program or the settings generated the issue.

      Please provide supplementary material and README with the complete instructions to reproduce the experiments (the exact commands).

      Also, this simple way to download a multi-FASTA file with Escherichia coli sequences may be helpful:

      conda install -y -c conda-forge -c bioconda -c defaults entrez-direct

      esearch -db nucleotide -query "Escherichia coli" | efetch -format fasta > Escherichia.mfa
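      The concern in comment 3 above (output always wrapped at 80 bases per line) can be illustrated without running mbgc at all. The sketch below is plain Python with hypothetical helper names (parse_fasta, write_fasta) and a toy record; it only shows why re-wrapping keeps the sequence content intact while breaking byte-for-byte identity, and how a configurable line width would restore it.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def write_fasta(records: Iterable[Tuple[str, str]], width: int = 80) -> str:
    """Serialize records with a fixed number of bases per line."""
    out: List[str] = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Toy input wrapped at 10 bases per line (hypothetical data).
original = ">seq1 toy record\nACGTACGTAC\nGTACGTACGT\nACGT\n"

# Re-wrapping at 80 bases per line keeps the sequence but changes the bytes.
rewrapped = write_fasta(parse_fasta(original), width=80)
same_content = list(parse_fasta(original)) == list(parse_fasta(rewrapped))
byte_identical = original == rewrapped

# With the original width supplied, the round trip is byte-identical again.
restored = write_fasta(parse_fasta(original), width=10)
```

      A content-level comparison like same_content is what a benchmark would need if compressors differ in output line width; a byte-level checksum comparison would report a mismatch in that case.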

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab092), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Luiz Gadelha

      This manuscript presents a tool called ExTaxsI for management and plotting of molecular and taxonomic data from NCBI. Information can be persisted on a local database as well as FASTA-formatted sequences, which can be used to display the information as scatter or sunburst-pie plots, and maps. The tool uses the Entrez API from NCBI to retrieve data. It also uses the ETE toolkit to manage taxonomic data. Three use cases were presented to demonstrate ExTaxsI: - geospatial distribution and gene data of Atlantic cod and the Gadiformes Order, - exploration of biodiversity data related to the SARS-CoV-2 pandemic.

      Using ExTaxsI from the command-line apparently produces consistent and correct outputs. However, ExTaxsI functionality seems to be available only through this command-line interface. This considerably limits the applicability of the tool since many researchers usually incorporate these routines programmatically into their scripts. It would be more useful if ExTaxsI functions were provided additionally through a library that could be imported in Python scripts. This would enable more use cases and would lead to a wider applicability. Some issues in a previous submission of this manuscript were corrected. A more detailed comparison with related tools is included and the installation instructions for the tool now work correctly. The documentation was also significantly improved.

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab092), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Iddo Friedberg

      The authors have markedly improved the software in terms of usability and documentation. The manuscript could still use some language editing.

    1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab097), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Shanlin Liu

      The authors presented us with an improved genome for Glanville fritillary butterfly. However, there are several issues that need to be addressed before its acceptance.

      Major:

      What the current manuscript lacks the most is the comparison between the improved genome assembly and its former version. Although the authors showed us an improved N50, I failed to find the explanations for several critical differences. For example, (1) the authors stated that ca. 90 MB additional assembly sequences were achieved, but no further information is available for those new sequences: are they redundancies or fragments missed in version 1? (2) the improved genome predicted fewer genes compared to its former version, decreasing from ca. 16,000 genes to ~ 14,000 genes, which is contradictory to the aforementioned longer genome assembly; (3) the former genome version observed unevenly distributed repeat elements across chromosomes, while this improved one does not, which also needs explanation.

      Another important issue of the present manuscript is the confusion introduced by varied genome assembly sizes. Firstly, the authors did not provide this critical information, which can be estimated using several well-known methods, such as C value based on flow cytometry, or estimations based on kmer frequency information. Secondly, the authors first mentioned that they sampled individuals that have low heterozygosity, but later FALCON generated an assembly almost twice the size of the final genome. The authors may want to add extra analysis or words to clarify the genome size uncertainty. Related to the above concern, Haplomerge seems an important step to obtain the final version assembly, and if I understand it correctly, the authors did not use a standardized analysis pipeline. Please consider including a schematic plot of your procedure to help readers better understand your steps and the principle behind them.

      In addition, lots of methods are vaguely described, the authors should provide details for them to make sure the analyses are repeatable, e. g. on Page 6, the authors wrote: "This cut-off was experimentally found to give the best contiguity for the assembly, while minimizing (within a small margin of error) the percentage of possibly erroneous contigs". But I failed to find any details of their experiments. And on the same page, the authors checked putative chimerics manually, saying the error regions are with low coverage or repeat regions, the authors should give demonstration examples and statistics for different kinds of errors. Meanwhile, when they say the error regions were split, the authors should give details about how they determined the split positions since what they found are error regions instead of error bases. Also, on page 7, the authors stated "The contigs orders and orientations were manually fixed when needed", please list the different situations that meet your criteria. The author may want to explain why they choose the 1,232 genes for manual annotation. Random?

      Minor:

      Remove "(e.g. Kahilainen et al. unpubl.)", it provides no useful information.

      Table 1. N(%) of the version 2 genome is zero? The scaffolding step does not introduce any Ns? I doubt that.

      Page 5, please give the location information instead of a citation.

      Page 7, please clarify the assembly version for raw read mapping, is it the one generated by FALCON with a genome size ~ 700 MB?

      Page 9, "the first two step (bath A1 and bath A2)", please provide biological explanations.

      Marey map needs citation and a brief explanation of its debut.

      "In M. cinxia the repeats are placed in single chromosomes whereas in H. melpomene they are present in all chromosomes. " How does it help to show the power of long read assembly? Need explanation.

      Page 10, how does Velvet apply a kmer size of 99 bp when you only have a read length as long as 85 bp?

      Table 2 title: species name should be in italics.

      Please give a full name for BUSCO in its first appearance.

    2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab097), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Annabel Charlotte Whibley

      In this manuscript, Blande, Smolander and colleagues report an improved chromosome-level genome assembly of the important ecological model lepidopteran species Melitaea cinxia. The manuscript would benefit from further language review by a native English speaker to improve readability, but the intentions of the authors are nevertheless clearly articulated throughout, the workflow is logical, and the assembly quality is a clear improvement on the earlier draft release.

      I would suggest revisiting the title to better reflect the work- as it stands it is a little underwhelming. One suggestion would be "Improved chromosome-level genome assembly of the Glanville fritillary butterfly (Melitaea cinxia) integrating PacBio long reads and a high-density linkage map". I would ideally also like to see more discussion of the more unusual aspects of this project- for example, long-read assemblies are commonplace now, but the linkage map approach (and the extent to which there was manual curation of potential chimeric scaffolds) is less frequently employed these days and often superscaffolding and error correction is undertaken with Hi-C methods only. Similarly the extensive manual curation of gene annotations and the impact that this had on the models is likely of more general interest (e.g. how many gene models were corrected, what type of errors were encountered?). Particularly also some mention of some of the specific challenges of this project (e.g. the need to combine multiple individuals to obtain sufficient quantities of gDNA) might be interesting for the readership.

      The absence of line numbers is a little cumbersome for reviewing purposes, so below I refer to specific parts of the text by page number (as printed on pdf document) - paragraph - line (within paragraph).

      3-1-6: suggest changing "…. and included both laboratory and natural environmental conditions" to "…and have included…"

      3-2-1: change "The first M. cinxia genome was released in 2014" to "The first M. cinxia draft genome" or "The first M. cinxia genome assembly"

      Table1: reporting both GC and AT % is unnecessary. There are some discrepancies between the statistics reported for the chromosomal assembly in the Ahola et al (2014) paper vs this table. This may simply be due to different methods for assessing summary statistics (e.g. whether or not gaps are included by default), but warrants investigation/clarification. For example, the largest scaffold reported in the Ahola et al (2014) paper is 14,178,551bp. The description of the generation of a chromosomal build for the previous version indicates >280Mb were assigned to chromosomes, whereas the total assembly size in this table is reported to be only 251Mb.

      6-2-2: What are the units for the cut-off (read length?)? If available, the data exploring the impact of different cut-offs on the assembly error rate could be of interest to others assembling genomes de novo.

      6-2-6: As a specific example of a more general comment on number reporting, perhaps state 24.4 Gb instead of 24,409,505,551 bp? I am not sure that the precision is always necessary and scaling/rounding can help readability.

      6-2-10: Are the alternative contigs extracted by default by the FALCON pipeline? Are there any adjustments that need to be made for an input of >1 individual, for example?

      7-2-2: The raw data for the linkage map crosses, and also the RNAseq data for the transcriptome studies (on ) is described as "unpublished", but I believe public sequence accessions are also being released with this manuscript. Is there additional information that would need to be disclosed for this information to be utilised by others or is the intention to highlight that the data will also be presented in upcoming publications?

      7-2-6 "Part of" should be "Some of"

      7-3-3: Specify "relative humidity" instead of RH. Discuss why different approaches used for different RNAseq experiments.

      8-1-5: Sequencing was "performed" rather than "made". Can you specify which HiSeq model and which sequencing library kit (or at the very least whether it was PCR-free)?

      9-2-6: Presumably "de novo transcripts" refers to both transcriptomes 1 and 2, in which case I think it would be helpful to state this here. I assume the different analysis approaches for datasets 1 and 2 reflect different histories of the two datasets but it would be interesting to see some assessment of the relative performances of these approaches.

      13-2-4: I think that http://butterflygenome.org would be sufficient for the URL here.

      14-1-4: Are there any flow cytometry (or other) estimates of genome size that can be used to set alongside the v1 and v2 assembly sizes?

    1. current

      This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab078), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Stephen Nayfach

      In their manuscript, Ortuno et al. develop a procedure for imputing missing genotypes of SARS-CoV-2. Missing genotypes can arise from fragmented whole genome assemblies, targeted sequencing (e.g. spike protein), or incomplete genotype panels. I really like this idea and thought the paper was conducted quite carefully. I was impressed by the high level of precision across all experiments. I have a few minor comments, questions, and suggestions below:

      Major comments:

      • My understanding is that only SNPs are imputed by the program. Is this correct? If this is the case, can the authors comment on the frequency of other types of variants in the SARS-CoV-2 genome? How common are small indels, large indels, or rearrangements?

      • Can the authors include code for building their reference panel? This would enable the same pipeline to be applied to updated SARS-CoV-2 references or to other kinds of viruses entirely. For example, metagenomic DNA sequencing often yields partial viral genomes, and it would be great to use this same pipeline to impute these genomes (where sufficient references exist).

      • I noticed that several of the PANGOLIN lineages seem especially hard to impute. Can the authors comment on why this might be the case?

      • Regarding the PANGOLIN lineages, how do these correspond to specific variants of interest (e.g. delta variant)? Is this information provided to users? A visual could really help here showing the phylogenetic relationships between PANGOLIN lineages and how they relate to variants of interest.

      • The authors indicate that missing regions of partial genome assemblies must be indicated by Ns. This seems like an artificial constraint that may be a pain point for users. Can the authors modify their program to detect missing regions from FASTA files and automatically fill these regions with Ns prior to imputation?

      Minor comments: For the installation options, please provide an alternative to docker. Would it be feasible to add an installation option using conda? In their methods, could the authors clearly define true positives, true negatives, false positives, and false negatives in the context of their validation experiments? Related to this point, I noticed that the precision is consistently high in the validation experiments, but recall can be quite low. I assume this means that the program will not impute a genotype where there is insufficient evidence, leaving it as a "N". In this case, users should have high confidence in all imputed genotypes. Is this correct? All the figures in the manuscript were of low resolution and difficult to read. The authors should use a consistent tense (present or past) throughout the manuscript. In some places future tense was even used: "Once we have validated the robustness of our imputation against different missing regions scenarios, the validation will focus on the imputation of variants"
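
      The reviewer's reading of high precision with low recall can be made concrete with a small sketch. The definitions used here are an assumption for illustration only, not the manuscript's own: at each masked position, a correctly imputed base counts as a true positive, an incorrectly imputed base as a false positive, and a position left as "N" as a false negative. The function name and the example sequences are hypothetical.

```python
def precision_recall(truth, imputed, missing="N"):
    """Score imputed bases at masked positions against the known truth.

    Assumed definitions (illustrative, not taken from the manuscript):
      TP = base imputed and equal to the truth
      FP = base imputed but different from the truth
      FN = position left as `missing` (no call made)
    """
    if len(truth) != len(imputed):
        raise ValueError("sequences must have equal length")
    tp = fp = fn = 0
    for true_base, called_base in zip(truth, imputed):
        if called_base == missing:
            fn += 1  # abstaining lowers recall but not precision
        elif called_base == true_base:
            tp += 1
        else:
            fp += 1  # a wrong call lowers precision
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Hypothetical example: 8 masked positions, 3 of them left uncalled.
precision, recall = precision_recall("ACGTACGT", "ACNNNCGT")
print(precision, recall)
```

      Under these assumed definitions, a tool that abstains whenever evidence is weak keeps precision high while recall drops, which is the behaviour the reviewer asks the authors to confirm.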

    2. the

      This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab078), which carries out open, named peer-review.

      These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Siyang Liu

      The authors have introduced an imputation pipeline that integrates the software tools minimac 3, minimac 4 and PANGOLIN to impute variants in the missing regions of SARS-CoV-2 sequencing data. The accuracy of the imputation for genotyping assay kits is around 0.9. The idea is interesting and may be helpful in a few limited scenarios. However, given the high mutation rate of SARS-CoV-2, and given that most studies can generate a high-quality (reference-based) SARS-CoV-2 genome assembly, I don't think the method will be widely used in SARS-CoV-2 studies. In addition, the mathematics behind the method somewhat lacks genuine creativity. I think the authors' study may be more suitable for a journal like Bioinformatics.

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab080), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102906

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102907

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102908

      Reviewer 4: http://dx.doi.org/10.5524/REVIEW.102909

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab081), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102910

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102911

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102912

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab079), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102903

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102904

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102905

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab077), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102900

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102901

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102902

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/gix107), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102986

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102988

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.100893

      Reviewer 4: http://dx.doi.org/10.5524/REVIEW.102987

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1186/s13742-016-0150-5), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102985

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz096), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102982

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102983

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102984

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz088), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102978

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102979

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102980

      Reviewer 4: http://dx.doi.org/10.5524/REVIEW.102981

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz144), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102961

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102962

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102963

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz135), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102964

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102965

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz138), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102959

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102960

    1. A version of this preprint has been published in the Open Access journal GigaScience (https://doi.org/10.1093/gigascience/giz143), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102956

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102957

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102958

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz145), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102945

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102946

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102947

      Reviewer 4: http://dx.doi.org/10.5524/REVIEW.102948

      Reviewer 5: http://dx.doi.org/10.5524/REVIEW.102949

    1. Long

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz125), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102935

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102936

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz150), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102942

      Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102943

      Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102944

    1. Abstract

      This paper has been published in GigaByte as part of the Asian citrus psyllid community annotation series. https://doi.org/10.46471/GIGABYTE_SERIES_0001.

      The CC-BY 4.0 peer reviews are as follows:

      Reviewer 1. Mary Ann Tuli

      Are all data available and do they match the descriptions in the paper?

      Yes. As with the other manuscripts, OGS v3 is mentioned, but this is not yet available from the CGEN. The data underlying Fig 4 and Fig 5 are available.

      This manuscript is a comprehensive description of the manual curation of the ubiquitin proteasome pathway genes, with clear aims and methodology.

    2. Ubiquitination

      Reviewer 2. Subhas Hajeri

      The manuscript is well written. Even though the authors could not find a major impact of CLas infection on the annotated subset of ubiquitin-proteasome genes, the negative data are equally important for further understanding of the pathways and for developing better RNAi targets.

      I would like to recommend acceptance of the manuscript as is.

    1. sequencing

      This paper has been published by GigaScience ( https://doi.org/10.1093/gigascience/giab100) and the peer-reviews have been shared under a CC-BY 4.0 license. These are as follows.

      Reviewer 1. Edward Rice

      In this manuscript, the authors present a sophisticated method for closing gaps in assemblies, built around the knowledge that gaps usually occur in repetitive regions. They test their software against similar software with more realistic scenarios than previous studies, through the use of gaps from real assemblies of genomes that have other assemblies with fewer gaps, rather than randomly generated gaps. These tests convincingly demonstrate that this software is more sensitive and accurate than existing gap closers.

      Given this increase in performance over existing software and the novelty of the methods, I recommend this manuscript for publication with some changes. I do have some concerns about the usability and maintainability of the software it describes, noted below, but most of the alternate options have similar issues, and the methodological advancements present in the manuscript merit publication.

      1. The introduction seems to imply that the primary use of this software is for closing gaps in short-read assemblies where high-coverage long reads are not available due to cost. Although I do not have a statistic to back this up, it is my sense from recent genome assembly papers that long-read de novo assembly is much more the norm these days than short-read assembly. In my personal experience I have found that gap closing can sometimes greatly improve long-read assemblies as well, especially CLR assemblies of highly repetitive genomes. I recommend rewriting the introduction somewhat to make it clear that usage of this software is not limited to short-read assemblies, as these are becoming rarer and rarer.

      2. I have some concerns about the maintainability of this code base, considering its size (>40k lines), language (D, which is not a common language in bioinformatics), and sparsity of comments in the code. Further, the use of non-standard dependencies and file formats may make it difficult to adapt the software to future advances in sequencing technology; for example, this package uses daligner to perform alignment, and so far as I can tell, daligner does not produce output in SAM format, so it may be difficult to switch to using another aligner in the future as the types of long reads available change. The fact that many of the dependencies are not maintained on bioconda is also concerning. The presence of integration tests is helpful. I apologize that this is probably not a particularly helpful comment as it's far too late to change any of these things, but still wanted to point them out.

      3. I also have concerns about usability. The availability of a docker file and snakemake workflow for running this software and the thorough and mostly comprehensible documentation alleviate these concerns to some degree, but it still takes a significant amount of work to configure it for a specific cluster. The example run did not work out of the box without fixing some errors (see minor edits). To test on my own assembly, I had to edit one JSON file to choose the parameters for dentist itself, which required reading about the two ways to specify two required coverage parameters; one yaml file to configure the workflow options; and one yaml file to make snakemake work with my cluster. In addition, not all clusters have singularity, so the lack of a conda package may be a problem for some potential users. The singularity image and snakemake workflow make its usability far better than PBJelly, which required actually editing the source code to make it work on my cluster with conda-installable versions of its dependencies, but it is still much worse than TGS-GapCloser, which only takes a single conda command to install with all dependencies and a single command to run, and no editing of configuration files.

      Minor comments: Abstract: - "Here, we developed" -> "Here, we present" - "Highly-accurate" — no hyphen - "Short read assemblies" -> "short-read assemblies" (this occurs in several other places too throughout manuscript) - Replace "right loci" with "correct loci" Introduction: - Page 3: "High contiguity, completeness, and accuracy... is fundamental" — change "is" to "are" - Page 3: avoid parentheses inside other parentheses - Page 3: I'm not sure I've ever heard of GenomicConsensus being used for gap closing, and cannot find any reference to it being used for this purpose with a quick scan of documentation. It must be capable of doing this, though, as you tested it alongside other gap closers. Could you explain this in the manuscript?

      Results: - Page 4: replace "right loci" with "correct loci" - Page 4: say a little more about what makes DENTIST's "state-of-the-art" consensus module better than or different from existing consensus callers - Page 5: "real life" to "real-life" - Page 5: "high quality" to "high-quality" Discussion: - Page 9: "long read data" -> "long-read data" Methods: - Page 11: "genomic regions, where the number" — remove comma - Page 12: "a common conflict are" to "a common conflict is" - Page 12: "less than three reads" to "fewer than three reads" - Page 14: "'copied' gaps from short read assembly" to "copied gaps from the short-read assembly" - Page 14: remove quotation marks around "disassembled"

      Software: - The "small example" does not work out of the box as "dentist_v1.0.2.sif" is hard-coded into snakemake.yml but the image distributed with the example is v2.0.0. - The "read-coverage" and "ploidy" options are listed as required (unless you're using "min-coverage-reads" and "max-coverage-reads"), but they are not among the "important options" listed in the README under the "How to choose DENTIST parameters" subheading. - In the more extensive list of command-line options, the description of the "read-coverage" option is "this is used to provide good default values for -max-coverage-reads or -min-coverage-reads; both options are mutually exclusive." This tells the user how it is used by the program but gives the reader no explanation of how it should be chosen, which is important as it is one of the required options. - The use of comments in dentist.json by putting double slashes in front of attribute strings is confusing and also not supported by the JSON specification. Dentist.json would be better in YAML format because: a) YAML supports comments b) YAML is easier to read by humans c) YAML is used for the other two configuration files necessary to run the pipeline, so for consistency purposes it's best to have them all in the same for

      Re-review: The authors have thoroughly and satisfactorily addressed all of my comments and the comments of the other reviewers. After testing the latest version, I can confidently say ease of use is much improved as it took me less than five minutes to go from zero to successfully starting a run of the example. I am therefore happy to recommend this manuscript for publication in its current format.

    2. reads

      Reviewer 2. Leena Salmela

      Overview: The paper presents a new tool called DENTIST for closing gaps in short read assemblies using PacBio CLR data. Although new assemblies are nowadays most often done with PacBio HiFi data resulting in contiguous and accurate assemblies, closing the gaps of an existing short read assembly with long read data is a cost effective and therefore attractive alternative for species for which short read assemblies are already available. The new tool is shown to be more accurate than previous tools and of comparable sensitivity.

      Suggestions for revision: 1) The authors should clearly indicate in the Introduction that their tool is tested on PacBio CLR reads. It would also be good to specify in the abstract that the reads were CLR reads and not HiFi reads. 2) In the Discussion, the authors recommend "polishing" the final gap-closed assembly with Illumina reads. It would be interesting to see how much this improves the accuracy of gap closing. I would assume that the improvement on the gap sequences would be smaller than on other regions of the assembly because the gap sequences typically cover repetitive regions. 3) Last paragraph of section "Closing the gaps", page 14: DENTIST has three modes. Here it is indicated that the third mode (only use scaffolding information for conflict resolution and freely scaffold the contigs using long reads) would be the best mode for contig-only assemblies. It seems to me that the second mode would also be appropriate for this, as it also closes gaps between scaffolds (or contigs in case of lack of scaffold information). Is this so?

    3. allow

      Reviewer 3. Ian Korf.

      The paper by Ludwig et al demonstrates that DENTIST offers a substantial improvement in closing genomic assembly gaps. The paper is well written with a clear and concise style. I liked the way they approached the experiments with a combination of simulated and real data for both the assemblies and reads. Specifically, I applaud how they generated gaps where they actually happen. The figures are generally effective. The only exception to this is Figure 4 with the black background and inconsistent ordering of competing software. In addition to winning the bake-off against other software, they did a very useful analysis of read depth (figure 6) and resources used (table 2). These help future users plan their projects. From a code perspective, I like that they have put their code on github. I don't think they need to have the supplemental file of command line parameters, as anyone who wants to use the software is going to go to the github anyway, which has a much more comprehensive explanation of usage.

  3. Feb 2022
    1. Abstract

      A version of this preprint has been published in the Open Access journal GigaByte (see paper https://doi.org/10.46471/gigabyte.42), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      Reviewer 1. Jose De Vega

      I think this long-read assembly is a great improvement over the previous short-read version available to the community to date. The assembly metrics are good, the dataset public, and there is good quality control all through the process. The manuscript is well written and the protocols are well explained. The data is public and the new assembly of interest to the community.

      However, I think the assembly is of limited interest to the research and breeding community without a gene annotation, which is not part of the manuscript. Since the authors have the data (e.g. iso-seq) and expertise, I do not understand why it has not been included in the first place.

    2. Red clover

      Reviewer 2. Jianghua Chen

      Red clover is one of the most important forage crops in the world. Its gametophytic self-incompatibility, which results in inherently high heterozygosity, is a major challenge for obtaining a high-quality genome sequence with traditional short-read-based genome assembly. Bickhart et al. used a long-read-based assembly method to obtain a high-quality genome that reduces the number of contigs more than 500-fold and improves the per-base quality, with a genome size of 413.5 Mb that matches the predicted genome size well. This assembly accurately represents the seven main linkage groups, and it will help scientists to understand the origin of the condensed tannin biology pathway in leaf forages and facilitate gene discovery and the application of biotechnology to increase nutritional value.

      I strongly support acceptance of this manuscript for publication.

    1. ABSTRACT

      Reviewer 2. Cory Hirsh

      This manuscript describes the generation of a time-series dataset of conventional and hyperspectral images of commonly known and important maize lines. The authors describe the methods of data collection and how it is useful, especially in conjunction with other already available datasets for the same lines. The authors begin to analyze the dataset generated, focusing on biomass measures and determining heritability. The authors conclude that they believe it is important and necessary to combine controlled environment data with field data to tackle problems facing crop production. I do have several comments about the manuscript in its current form:

      1. My main concern about the manuscript is the amount of data used in the article. The manuscript was submitted as a 'Data Note', but it is not obvious this data is exceptional, rare, or novel as it was collected nearly 2 years ago. One criterion for reviewing this type of article is dataset size. The authors are claiming a dataset size of ~500Gb, but this includes data (thermal infrared and fluorescence images) that was not mentioned in the manuscript except that it was collected. I applaud the authors for the willingness to be so open with their data, but I'm not convinced that one month's worth of images for 32 genotypes is enough for publication.

      2. The manuscript's main point is not to get into conclusions based on their image analysis, but I would have liked to have seen more strenuous ground truthing. The manual measurements were made only at the very last time point. These really should encompass the variation of plants throughout development. How can we determine if the measured traits are accurate at day 9 for example? Nothing can be done for true manual measurements, but digital manual measurements could be made and correlated with image analysis extracted values.

      3. Board sense heritability needs to be corrected throughout the manuscript.

      Re-review:

      This manuscript describes the generation of a time-series dataset of conventional and hyperspectral images of commonly known and important maize lines. The authors describe the methods of data collection and how it is useful, especially in conjunction with other already available datasets for the same lines. The authors begin to analyze the dataset generated, focusing on biomass measures and determining heritability. The authors conclude that they believe it is important and necessary to combine controlled environment data with field data to tackle problems facing crop production.

      Comments: I want to clarify my first review of this manuscript. It was not my intention to make it seem as if the dataset generated for this manuscript is not important, large, or useful for the broader maize and plant phenotyping community. This dataset could be very useful for some research groups, including the corresponding author's group. I totally agree with the authors' response to the question about the age of the dataset, namely that the cycle time from data collection to publication in plant phenomics is generally longer. The authors give numerous examples to back up this point. I'm not disputing this, but the authors should also note the amount of downstream analyses and new biological findings that are in these manuscripts as well. The importance of the presented dataset as outlined by the authors is its ability to link with other already available datasets, which isn't shown in the manuscript. This paper is a data release paper with a valuable, controlled, and well-documented dataset. The real value in the dataset will be shown in subsequent publications that begin to combine the multiple datasets available from these maize lines (field phenotyping, genotyping, controlled environment phenotyping).

    1. Abstract

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/gix103), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      Reviewer 1. Andrew Adey

      In the manuscript by De Maere and Darling, the authors describe their computational simulator for HiC and 3C sequencing that models the 3D arrangement of chromatin and how that arrangement is conveyed via proximity ligation methods. Overall the manuscript is long and does not clearly describe the main goals of the simulator. The detail is appreciated, but not when it obfuscates the main goal of the manuscript. Also the figures could be condensed so that there are fewer figures with more panels. That being said, I do believe the simulator that the authors have developed is very sophisticated and appears to work well with a few exceptions. The major issue is the packaging of the method into a more concise and clear text. Below are some more specific comments:

      My first thought is regarding where this simulator will be particularly useful. The authors mention it is primarily for software tool development and that the cost of generating HiC/3C data is very high and that many of the existing datasets are sparse. However, there are many existing datasets that are extremely rich and deep that would seem more appropriate. While I am not convinced on the utility for software development when abundant real data is publicly available, I do agree that having means to simulate sequence read data may have other valuable applications - primarily in exploring power in deconvolving metagenomic samples. For the eukaryotic simulated data there is a clear stretch of signal that is perpendicular to the diagonal as is typically observed for circular genomes, though this would not be expected for linear chromosomes (e.g. Figure 7). Does the simulator assume all chromosomes are circular? This is odd and needs to be addressed. Also on figure 7, the authors are highlighting that there is a greater inter chromosomal signal when compared to real data - is that a good thing? I can see that it may be desirable if the goal is to generate signal that would be generated under the assumption that there is no chromatin organization in the genome and thus be used as a background model. I can see this as a potential use, but it should be more clearly stated. The authors describe the ability to simulate TADs - however it is not clearly described how the TADs are decided upon - can users specify where TADs should be located (e.g. if they have a callset of TADs and want to create data simulating them that they can then alter - e.g. change one TAD and see how it affects signal nearby so they can know what to expect for an experiment where they may be altering TAD-forming loci)? Or are they only created randomly (which seems the case given page 8 line 212)? This could also be more clearly described by stating broadly what is done then going into the methods of how that is accomplished. Figure 2 is an extremely simple and small diagram; could it not just be added into figure 1? It seems a bit excessive to stand as its own figure. This goes for several other figures. Figure 8 - there is no description for c and d panels. I assume c is real and d is simulated. The strong perpendicular band midway through the chromosome is observed, which is discouraging, as I have commented on for Figure 7.

      Re-review: The major issues I had with the manuscript previously were that it was too long and may have limited interest. The authors have addressed the first point. For the second, I believe that the interest is broad enough to warrant publication.

    2. Background

      Reviewer 2. Ming Hu

      In this paper, the authors developed a software package Sim3C to simulate Hi-C data and other 3C-based data. This work addresses a very important research question, and has the potential to become a useful computational tool in genomics research. However, the authors need to provide more explanations and technical details to further improve the current manuscript.

      Here are my specific comments: Major comments: 1. Figure 3. It is better to plot Figure 3 in log scale for both x-axis and y-axis. In log scale, the slope of contact probability has direct biophysical interpretation, as described by the first Hi-C paper (Lieberman-Aiden et al, Science, 2009). I am very curious to see how the biophysics model contributes to the data generation mechanism. 2. In Rao et al, Cell, 2014 paper, they identified chromatin loops anchored by CTCF motifs. In Sim3C, the authors considered the 1D genomic distance effect and hierarchical TAD structures. It would be great if Sim3C can also take chromatin loops into consideration. 3. Hi-C data can help to detect allele-specific chromatin interactions. Is Sim3C able to simulate allele-specific proximity ligation data? 4. It is very important to rigorously evaluate the data reproducibility. Using Sim3C, if users simulate Hi-C data multiple times with different random seeds, would the reproducibility between two simulated datasets be comparable to the reproducibility between two real biological replicates? 5. The authors showed simulated contact matrices of bacteria (Figure 6) and budding yeast (Figure 7). They also need to simulate both human and mouse genome-wide contact matrices, and compare the simulated contact matrices with real data.
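
      Major comment 1 rests on a simple property: if contact probability decays with genomic distance as a power law, P(s) ∝ s^α, then on log-log axes the curve is a straight line whose slope is the exponent α. The sketch below illustrates only that relationship. The data are synthetic, the exponent of -1.0 is an arbitrary value chosen for the example, and the function name is hypothetical; none of this is taken from Sim3C.

```python
import math


def loglog_slope(distances, contact_probs):
    """Least-squares slope of log10(probability) against log10(distance)."""
    xs = [math.log10(d) for d in distances]
    ys = [math.log10(p) for p in contact_probs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic power-law decay with an assumed exponent of -1.0.
distances = [1e4, 1e5, 1e6, 1e7]  # genomic separation in bp
contact_probs = [0.5 * (d / 1e4) ** -1.0 for d in distances]

slope = loglog_slope(distances, contact_probs)
print(f"estimated exponent: {slope:.3f}")
```

      On linear axes the same curve is dominated by the first few points, which is why the reviewer asks for both axes of Figure 3 to be logarithmic.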

      Minor comments: 1. Please replace all 'HiC' by 'Hi-C'. 2. Page 6, line 116, "sciHiC" should be "scHi-C".

    1. Abstract

      A version of this preprint has been published in the Open Access journal GigaByte (see paper https://doi.org/10.46471/gigabyte.41), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      Reviewer 1. Mary Ann Tuli

      Are all data available and do they match the descriptions in the paper? No. The paper states "The D. citri genome assembly (v3), OGS (v3) and transcriptomes are accessible on the Citrusgreening.org portal". I believe v2 is available, but not v3 yet.

      Additional Comments: The paper states "The gene models will be part of an updated official gene set (OGS) for D. citri that will be submitted to NCBI." Until these models are available in NCBI their reuse is limited.

      Recommendation: Minor Revision

    2. Citrus greening disease

      Reviewer 2. Xinyu Li

      In the paper entitled “Annotation of glycolysis, gluconeogenesis, and trehaloneogenesis pathways provide insight into carbohydrate metabolism in the Asian citrus psyllid”, the authors conducted a high-quality annotation of genes involved in glycolysis, gluconeogenesis, and trehaloneogenesis in the Diaphorina citri genome, which provides the basis for developing gene-targeting therapeutics for this important pest species.

      The MS is well-written, and the analyses are clear and proper. I found some minor concerns that should be addressed.

      In the first paragraph of Page 10, the authors used a cross symbol and an asterisk in the sentence “The number of genes identified in glycolysis….from NCBI, OrthoDB, and Flybase.”. However, the cross symbol and the asterisk are used without any explanation or citation. I suggest citing the Appendix the authors referred to, or adding an explanation to make this clearer.

      In the Conclusion, on Page 15, the authors stated “Expression analysis of the genes annotated in the carbohydrate metabolism pathways identified differences related to life stage, sex and tissue.”. But what these differences are is not mentioned here. I think it would be better to summarize the key/predominant differences in gene expression in the carbohydrate metabolism pathways.

      In addition, it is interesting that the expression of genes related to carbohydrate metabolism differs between the sexes in the Asian citrus psyllid. Is this common in insects, or does it exist only in some specific groups?

    1. Abstract

      Reviewer 2. Bruno Fosso

      The paper by Bremges et al. describes CAMITAX, a workflow designed for the taxonomic classification of microbial genomes obtained through NGS-based methodologies such as single-cell sequencing and metagenomics. Even if the four implemented methodologies do not themselves represent a real novelty in the field, their harmonization through a classification algorithm is interesting. Moreover, deploying the workflow in a container greatly simplifies both installation and usage and ensures the reproducibility of the analysis.

      The manuscript is well written and easy to read. All the proposed figures are appropriate and adequately support the data described in the main text. Figure 2 may be improved by using different colors allowing to easily discriminate the paths through the plot.

      The CAMITAX GitHub repository clearly describes how to access and configure the container, but very little information is available about manual installation. The usage section needs improvement.

      I have some minor concerns about the paper: - the classification algorithm needs to be described in more depth. A figure may help readers; - regarding the overall drop in CAMITAX recall at mid-range ranks, I was wondering whether it may be because CAMITAX seems to be more conservative than the Delmont classification (Figure 2). The authors should discuss in how many cases CAMITAX is more conservative than the reference classification. - Moreover, the authors claim that "Notably, 95% of CAMITAX's predictions were consistent with Delmont et al., i.e. the two assignments were on the same taxonomic lineage and their LCA is either of the two." Does this mean the authors consider a classification consistent when CAMITAX assigns at the kingdom rank while Delmont assigns at the species rank? Please clarify.

      It would be useful to add some information about the technical requirements such as consumed RAM and required CPU time.

    2. Now published in GigaScience doi: 10.1093/gigascience/giz154. Authors: Andreas Bremges (1, 2), Adrian Fritz (1), Alice C. McHardy (1). Affiliations: (1) Computational Biology of Infection Research, Helmholtz Centre for Infection Research, 38124 Braunschweig, Germany; (2) German Center for Infection Research (DZIF), partner site Hannover-Braunschweig, 38124 Braunschweig, Germany. For correspondence: andreas.bremges@helmholtz-hzi.de, alice.mchardy@helmholtz-hzi.de

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz154 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102049

    1. Abstract

      A version of this preprint has been published in the journal GigaByte under a CC-BY 4.0 license (see https://doi.org/10.46471/gigabyte.40)

      Reviewer 1. Jianbo Jian

      This submission describes a reference genome for the Atlantic chub mackerel (Scomber colias) built from a combination of PacBio HiFi long reads and Illumina short reads. The sequencing data processing, genome assembly and related bioinformatics are comprehensive and adequate. The reported reference genome is the first genome reported and shows good contiguity. It is a pity that the genome is not at chromosome level, owing to the lack of Hi-C or genetic map data. However, the associated analyses and results make sense. In my opinion, as the first reference genome in the genus Scomber, this reference genome is a valuable genomic resource for population genetics, ecology, physiology and other future research. I have some concerns that should be addressed before publication in GigaByte.

      1) In the project design, two individuals were used for genomic DNA extraction for the genome assembly. Why not use the same individual, to avoid assembly errors caused by genetic differences between individuals? 2) Lines 186-196: I am somewhat confused about the contamination screening. Was there contamination in your sample? In general, most genome projects do not contain contamination; this step is useful only for specific samples where contamination needs to be avoided. 3) In the phylogenomic analysis, estimating divergence times is recommended; the figure should then be updated to be more informative. 4) Supp. Table 6 is blank. 5) None of the supplementary tables is shown in the manuscript. 6) The genome assembly from Illumina sequencing is of little use compared with the HiFi data. 7) In Supplementary Table 5, N50 (Kb) should be N50 (bp).

      Recommendation: Minor Revision

      Reviewer 2. Rong Huang

      Is there sufficient information for others to reuse this dataset or integrate it with other data? No. It is suggested that the author can make a simple table to show the assembly effects within the Order Scombriformes.

      Additional Comments: Scomber colias is a valuable marine resource, with a high impact on the fisheries of several countries on the west coast of the Atlantic Ocean and/or the Mediterranean Sea. This study reports the first genome assembly of Atlantic chub mackerel. This genome is timely and the assembly process is clearly described, which contributes to the effective conservation, management, and sustainable exploitation of S. colias in the Anthropocene. I still have the following questions.

      The assembly quality of the genome does not seem to be particularly good. For example, the scaffold N50 is not long enough. What is the ploidy of this species? Do heterozygosity and repeat content affect the assembly quality?

      It is suggested that the author can make a simple table to show the assembly effects within the Order Scombriformes. It is helpful for relevant researchers to make use of the genomic resources.

      Is "data validation" followed by the results section? And there is no subtitle in the result part. Is it required by this type of article?

      Recommendation: Major Revision

    1. Abstract

      A version of this preprint has been published in the journal GigaByte under a CC-BY 4.0 license (see paper), and is also part of the Asian citrus psyllid community annotation series of papers that can be viewed here: https://doi.org/10.46471/GIGABYTE_SERIES_0001

      Reviewer 1. Alex Arp In “Genomic identification, annotation, and comparative analysis of Vacuolar-type ATP synthase subunits in Diaphorina citri” the authors did just that. The paper is well written, direct, and easy to follow. The reasoning for annotating these genes is clearly defined: they are possible targets for RNAi-based control of Diaphorina citri, an economically important pest of citrus. The annotation of the genes utilized genomic and transcriptomic databases, and the gene expression profiles used existing datasets. Figures and tables are clear, and the phylogenetic trees give sufficient supporting evidence that the annotations are correct. Overall, this is a good manuscript and needs no major revision for publication.

      Additional Comments: On Page 18 what is meant by "new protein" in "It is a relatively new protein critically associated with the assembly of a certain cell type V-ATPase and is still being studied"?

      Reviewer 2. Mary Ann Tuli Are all data available and do they match the descriptions in the paper?

      Yes. As with the other manuscripts, OGS v3 is mentioned, but this is not yet available from the CGEN. The requested data underlying the tables and figures have been uploaded.

      Any Additional Overall Comments to the Author This manuscript is a comprehensive description of the manual curation of the V-ATPase genes, with clear aims and methodology.

      Recommendation: Accept

    1. Now published in GigaScience doi: 10.1093/gigascience/giab063. Authors: Yilei Fu (1), Medhat Mahmoud (2, 3), Viginesh Vaibhav Muraliraman (1), Fritz J. Sedlazeck (2), Todd J. Treangen (1). Affiliations: (1) Department of Computer Science, Rice University, Houston, TX 77005, USA; (2) Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, United States of America; (3) Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, United States of America. For correspondence: Fritz.Sedlazeck@bcm.edu, treangen@rice.edu

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab063 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102841 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102842

    1. Now published in GigaScience doi: 10.1093/gigascience/giab062. Authors: Lukas M. Weber (1), Ariel A. Hippen (2), Peter F. Hickey (3), Kristofer C. Berrett (4), Jason Gertz (4), Jennifer Anne Doherty (4), Stephanie C. Hicks (1). Affiliations: (1) Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA; (2) Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, PA, USA; (3) Advanced Technology & Biology Division, Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia; (4) Huntsman Cancer Institute and Department of Population Health Sciences, University of Utah, UT, USA. For correspondence: shicks19@jhu.edu

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab062 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102826 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102827

    1. Abstract

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab064 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102834 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102835

    1. Now published in GigaScience doi: 10.1093/gigascience/giab056. Authors: Shufang Wu (1, 2), Zhencheng Fang (1, 2), Jie Tan (1, 2), Mo Li (3), Chunhui Wang (3), Qian Guo (1, 2, 4), Congmin Xu (1, 2, 4), Xiaoqing Jiang (1, 2), Huaiqiu Zhu (1, 2, 4, 5). Affiliations: (1) State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing 100871, China; (2) Center for Quantitative Biology, Peking University, Beijing 100871, China; (3) Peking University-Tsinghua University - National Institute of Biological Sciences (PTN) joint PhD program, School of Life Sciences, Peking University, Beijing 100871, China; (4) Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Georgia 30332, USA; (5) Institute of Medical Technology, Peking University Health Science Center, Beijing 100191, China. For correspondence: hqzhu@pku.edu.cn

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab056 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102812 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102813

    1. Now published in GigaScience doi: 10.1093/gigascience/giab004. Authors: Fan Zhang (1), Hyun Min Kang (2). Affiliations: (1) Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA; (2) Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA. For correspondence: fanzhang@umich.edu

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab004 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102627 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102628

    1. Now published in GigaScience doi: 10.1093/gigascience/giz121. Authors: Yun-Ching Chen (1), Abhilash Suresh (1), Chingiz Underbayev (2), Clare Sun (2), Komudi Singh (1), Fayaz Seifuddin (1), Adrian Wiestner (2), Mehdi Pirooznia (1). Affiliations: (1) Bioinformatics and Computational Biology Core, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, United States; (2) Hematology Branch, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, United States. For correspondence: mehdi.pirooznia@nih.gov

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz121 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101927 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101928 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101929

    1. Now published in GigaScience doi: 10.1093/gigascience/giz118. Author: Xiao Hu

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz118 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101954 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101955 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101956

    1. COVID-19 pandemic

      Reviewer 3. Daniel Mietchen

      This review includes supplemental files, videos and hypothes.is annotations of the preprint!: https://zenodo.org/record/4909923

      The videos of the review process are also available on YouTube:

      Part 1 (Screen Recording 2021-06-05 at 10.02.02.mov): https://youtu.be/_UnDdE3Oi-4 Part 2 (Screen Recording 2021-06-05 at 10.52.51.mov): https://youtu.be/z5xRK0lg3b4 Part 3 (Screen Recording 2021-06-05 at 11.27.01.mov): https://youtu.be/VnztlEqFW2A Part 4 (Screen Recording 2021-06-07 at 02.51.59.mov): https://youtu.be/IYtLfMcLTvA Part 5 (Screen Recording 2021-06-07 at 06.11.52.mov): https://youtu.be/Jv_AUHCASQw Part 6 (Screen Recording 2021-06-07 at 18.07.45.mov): https://youtu.be/6Y-yA9oahzM Part 7 (Screen Recording 2021-06-07 at 19.07.02.mov): https://youtu.be/LV5whFhfmEU

      First round of review:

      Summary The present manuscript provides an overview of how the English Wikipedia incorporated COVID-19-related information during the first months of the ongoing COVID-19 pandemic.

      It focuses on information supported by academic sources and considers how specific properties of the sources (namely their status with respect to open access and preprints) correlate with their incorporation into Wikipedia, as well as the role of existing content and policies in mediating that incorporation.

      No aspect of the manuscript would justify a rejection but there are literally lots of opportunities for improvements, so "Major revision" appears to be the most appropriate recommendation at this point.

      General comments The main points that need to be addressed better: (1) documentation of the computational workflows; (2) adaptability of the Wikipedia approach to other contexts; (3) descriptions of or references to Wikipedia workflows; (4) linguistic presentation.

      Ad 1: While the code used for the analyses and for the visualizations seems to be shared rather comprehensively, it lacks sufficient documentation as to what was done in what order and what manual steps were involved. This makes it hard to replicate the findings presented here or to extend the analysis beyond the time frame considered by the authors.

      Ad 2: The authors allude to how pre-existing Wikipedia content and policies - which they nicely frame as Wikipedia's "scientific infrastructure" or "scientific backbone" - "may provide insight into how its unique model may be deployed in other contexts" but that potentially most transferrable part of the manuscript - which would presumably be of interest to many of its readers - is not very well developed, even though that backbone is well described for Wikipedia itself.

      Ad 3: There is a good number of cases where the Wikipedia workflows are misrepresented (sometimes ever so slightly), and while many of these do not affect the conclusions, some actually do, and overall comprehension is hampered. I highlighted some of these cases, and others have been pointed out in community discussions, notably at https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:WikiProject_COVID-19&oldid=1028476999#Review_of_Wikipedia's_coverage_of_COVID and http://bluerasberry.com/2021/06/review-of-paper-on-wikipedia-and-covid/ . Some resources particularly relevant to these parts of the manuscript have not been mentioned, be it scholarly ones like https://arxiv.org/abs/2006.08899 and https://doi.org/10.1371/journal.pone.0228786 or Wikimedia ones like https://en.wikipedia.org/wiki/Wikipedia_coverage_of_the_COVID-19_pandemic and https://commons.wikimedia.org/wiki/File:Wikimedia_Policy_Brief_-_COVID-19_-_How_Wikipedia_helps_us_through_uncertain_times.pdf . Likewise essentially missing - although this is a common feature in academic articles about Wikipedia - is a discussion of how valid the observations made for the English Wikipedia are in the context of other language versions (e.g. Hebrew). On that basis, it is understandable that no attempt is made to look beyond Wikipedia to see how coverage of the pandemic was handled in other parts of the Wikimedia ecosystem (e.g. Wikinews, Wikisource, Wikivoyage, Wikimedia Commons and Wikidata), but doing so might actually strengthen the above case for deployability of the Wikipedia approach in other contexts. Disclosure: I am closely involved with WikiProject COVID-19 on Wikidata too, e.g. as per https://doi.org/10.5281/zenodo.4028482 .

      Ad 4: The relatively high number of linguistic errors - e.g. typos, grammar, phrasing and also things like internal references or figure legends - needlessly distracts from the value of the paper. The inclusion of figures - both via the text body and via the supplement - into the narrative is also sometimes confusing and would benefit from streamlining.

      While GigaScience has technically asked me to review version 3 of the preprint (available via https://www.biorxiv.org/content/10.1101/2021.03.01.433379v3 and also via GigaScience's editorial system), that version was licensed incompatibly with publication in GigaScience, so I pinged the authors on this (via https://twitter.com/EvoMRI/status/1393114202349391872 ), which resulted (with some small additional changes) in the creation of version 4 (available via https://www.biorxiv.org/content/10.1101/2021.03.01.433379v4 ) that I concentrated on in my review.

      Production of that version 4 - of which I eventually used both the PDF and the HTML, which became available to me at different times - took a while, during which I had a first full read of the manuscript in version 3.

      In an effort to explore how to make the peer review process more transparent than simply sharing the correspondence, I recorded myself while reading the manuscript for the second time, commenting on it live. These recordings are available via https://doi.org/10.5281/zenodo.4909923 .

      In terms of specific comments, I annotated version 4 directly using Hypothes.is, and these annotations are available via https://via.hypothes.is/https://www.biorxiv.org/content/10.1101/2021.03.01.433379v4.full .

      Re-review: I welcome the changes the authors have made - both to the manuscript itself (of which I read the bioRxiv version 5) and to the WikiCitationHistoRy repo - in response to the reviewer comments. I also noticed comments they chose not to address, but as stated before, none of these would be grounds for rejection. What irritates me is that it is unclear whether any proofreading actually happened before the current version 5 was posted. For instance, reference 44 seems to be missing (some others are missing in the bioRxiv version, but I suspect that's not the authors' fault), while lots of linguistic issues in phrases like "to provide a comprehensive bibliometric analyses of english Wikipedia's COVID-19 articles" would still benefit from being addressed. At this point, I thus recommend that the authors (a) update the existing Zenodo repository such that there is some more structure in the way the files are shared and (b) archive a release of WikiCitationHistoRy on Zenodo.

    2. Background

      Reviewer 2. Dean Giustini This is a well-written manuscript. The methods are well-described. I've confined my comments to improving the reporting of your methods, some comments about the paper's structure, and a few about the readability of the figures and tables (which I think in general are too small, and difficult to read). Here are my main comments for your consideration as you work to improve your paper:

      1) Title of manuscript - the title of your paper seems inadequate to me, and doesn't really convey its content. A more descriptive title that includes the idea of the "first wave" might be useful from my point of view as a reader who scans titles to see if I am interested. I'd recommend including words in the title that refer to your methods. What type of research is this - a quantitative analysis of citations? Title words say a lot about the robust nature of your methods. As you consider whether to keep your title as is, keep in mind that title words will aid readers in understanding your research at a glance, and provide impetus to read your abstract (and one hopes the entire manuscript). These words will help researchers find the paper later as well via the Internet's many search engines (e.g., Google Scholar).

      2) Abstract - The abstract is well-written. Could the aims of your research be more obvious and more clearly articulated? How about using a statement such as "This research aims to" or similar? I also don't understand the sentence that begins with "Using references as a readout". What is meant by a "readout" in this context? Do you mean to read a print-out of references later? Lower down, you introduce the concept of Wikipedia's references as a "scientific infrastructure", and place it in quotations. Why is it in quotations? I wondered what the concept was on first reading it. A recurring web of papers in Wikipedia constitutes a set of core references - but would I call them a scientific infrastructure? Not sure; they are a mere sliver of the scientific corpus. Not sure I have any suggestions to clarify the use of this phrase.

      3) Introduction - This is an excellent introduction to your paper, and it provides a lot of useful context and background. You make a case for positioning Wikipedia as a trusted source of information based on the highly selective literature cited by the entries. However, I would only caution that some COVID-19 entries cite excellent research but the content is contested, and vice versa. One suggestion I had for this section was the possibility of tying citizen science (part of open science) to the rise of Wikipedia's medwiki volunteers. Wikipedia provides all kinds of ways for citizens to get involved in science. As an open science researcher, I appreciated all of the open aspects you mention. Clearly, open access to Wikipedia in all languages is a driving force in combatting misinformation generally, and the COVID "infodemic" specifically. I admit I struggled to understand the point of the section that begins, "Here, we asked what role does scientific literature, as opposed to general media, play in supporting the encyclopedia's coverage of the COVID-19 as the pandemic spread." The opening sentence articulates your a priori research question, always welcome for readers. Would some of the information that follows in this section around your methods be better placed in the following section under the "Material and Methods"? I found it jarring to read that "....after the pandemic broke out we observed a drop in the overall percentage of academic references in a given coronavirus article, used here as a metric for gauging scientificness in what we term an article's Scientific Score." These two ideas are introduced again later, but I had no idea on reading them here what they signified or whether they were related to research you were building on. You might consider adding a parenthetical statement that they will be described later, and that the idea of a score is your own.

      4) Material and methods - Your methods section might benefit from writing a preamble to prepare your readers. As already mentioned, consider taking some of the previous section and recasting it as an introduction to your methods. Consider adding some information to orient readers, and elaborating in a sentence or two about why identifying COVID-19 citations / information sources is an important activity.

      By the way, what is meant by this: "To delimit the corpus of Wikipedia articles containing DOIs"? Do you mean "identify" Wikipedia articles with DOIs in their references? As I mentioned (apologies in advance for the repetition), it strikes me as odd that you don't refer to this research as a form of citation analysis (isn't that what it is?). Instead you characterize it as "citation counting". If your use of words has been intentional, is there a distinction you are making that I simply do not understand? Also: bibliometricians and/or scientometricians might wonder why you avoid the phrase citation analysis. Further to your methods which are primarily quantitative and statistical - what are the qualitative methods used throughout the paper to analyze the data? How did you carry out this qualitative work? (On page 10, you state "we set out to examine in a temporal, qualitative and quantitative manner, the role of references in articles linked directly to the pandemic as it broke.") That part of your methods seems to be a bit under-developed, and may be worth reconsidering as you work to improve your reporting in the manuscript.

      5) Table 1. I am not sure what this table adds to the methods given it leads off your visuals. Do you really need it? It doesn't reveal anything to me and could be in a supplemental file. I also have difficulties in properly seeing table 1; perhaps you could make it larger and more readable?

      6) Figure 1. This is the most informative visual in the paper but it is hard to read and crowded. It deserves more space; otherwise the information it provides cannot be fully understood.

      7) Figure 3. This is very bulky as a figure, although informative. Again, I'm not sure all of it needs inclusion. Perhaps select part of it, and include other parts in a supplement.

      7) Limitations - The paper does not adequately address its limitations. A more fulsome evaluation of limitations would be beneficial to me as a reader, as it would place your work in a larger context. For example, consider asking whether the results are indicative of Wikipedia's other medical or scientific entries. Or are the results not generalizable at all? In other words, are they indicative of something very limited based on the timeframe that you examined? I found myself disagreeing with: "....the mainstream output of scientific work on the virus predated the pandemic's outbreak to a great extent". Is this still true? And what might its significance be now that we are in 2021? Would it be helpful to say that most of the foundational research re: the family of coronaviruses was published pre-2020, but that entries about COVID-19 disease and treatment are now distinctly different in terms of papers cited, especially going forward? Wiki editors identify relevant papers over time but, in my experience, are not adept at identifying emerging evidence or at incorporating important papers early; this is strange given that recency is one of Wikipedia's true calling cards. For me, the most confounding aspect of the infodemic is the constant shifts of evidence, and how to respond in a way that is prudent and evidence-based. As you point out, Wikipedia has an 8.7-year latency in citing highly relevant papers, and it seems likely that many important COVID-19 papers, especially about the disease, were neglected in Wikipedia in the first wave. As you point out, this will form part of future research, which I hope you and your team will pursue.

      8) Reference 31 lacks a source: Amit Arjun Verma and S. Iyengar. Tracing the factoids: the anatomy of information reorganization in wikipedia articles. 2021.

      Good luck with the next stages in improving your manuscript for publication. I believe it adds to our understanding of Wikipedia's role in promoting sources of information.

    3. Abstract

      This paper has been published in GigaScience under a CC-BY 4.0 license (see: https://doi.org/10.1093/gigascience/giab095). As the journal carries out open peer review, the reviews have also been published under the same license.

      Reviewer 1. Dariusz Jemielniak This is a very solid article on a timely topic. I also commend you for the thorough and meticulous methodology.

      One thing that I believe you could amplify is what your proposed solution to the "trade off between timeliness and scientificness" would be. After all, Wikipedia relies on sources that are reliable, verifiable, but foremost... available. At a time when no academic journal articles have (yet) been published, the chosen modus operandi does not appear to be a trade-off; it is basically the only logical solution. A trade-off would occur if the less valuable sources were not replaced when more academic ones appear, and this is not the case. I believe you should mention the fact that Wikipedia has an agreement with the Cochrane database, which likely affects the popularity of this source.

      Additionally, I think that the literature review needs to be expanded. There are already some publications about Wikipedia and COVID-19, as well as about medical coverage on Wikipedia (some non-exhaustive references added below). Moreover, Wikipedia has been a topic covered in GigaScience and it would be reasonable to reflect on the previous conversations in the journal in your publication.

      Chrzanowski, J., Sołek, J., & Jemielniak, D. (2021). Assessing Public Interest Based on Wikipedia's Most Visited Medical Articles During the SARS-CoV-2 Outbreak: Search Trends Analysis. Journal of Medical Internet Research, 23(4), e26331.

      Colavizza, G. (2020). COVID-19 research in Wikipedia. Quantitative Science Studies, 1-32.

      Jemielniak, D. (2019). Wikipedia: Why is the common knowledge resource still neglected by academics?. GigaScience, 8(12), giz139.

      Jemielniak, D., Masukume, G., & Wilamowski, M. (2019). The most influential medical journals according to Wikipedia: quantitative analysis. Journal of Medical Internet Research, 21(1), e11429.

      Kagan, D., Moran-Gilad, J., & Fire, M. (2020). Scientometric trends for coronaviruses and other emerging viral infections. GigaScience, 9(8), giaa085.

  4. Jan 2022
    1. Abstract

      Reviewer 2. Rhonda Bacher It is good to have alternative workflows for single-cell analysis, and I am glad to see the authors have submitted the package to Bioconductor. I hope the authors maintain the package and update with new methods as necessary such as if new normalizations or batch corrections are developed. I only have two comments that I hope the authors try to clarify further:

      1. The statement starting with "Optionally, after batch-to-batch normalisation, we also..." should not be in that location. It seems to suggest to readers that this is the recommended method, whereas later that is not the case. In these sentences the manuscript also claims that this normalization approach is more "robust" without providing any evidence or citation.
      2. It's still not completely clear to me how the authors' extension of the sc-qPCR method is different from MAST. The same authors of the qPCR method extended it here: "MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data". MAST is also an LRT, but I am assuming that here you are not using the detection rate as a covariate? That's OK if true; it just needs to be clear to the reader. I imagine this could be a frequently asked question by users down the road, so even a sentence on how it is different from (or similar to) MAST would help. Suggestion only: I may have missed it, but it might be helpful to include a statement that says something like "Statistical methods for single-cell analysis are constantly evolving. Here we have implemented XX. The flexibility of ascend allows it to adapt as future methods are developed and prove useful".
    2. Now published in GigaScience doi: 10.1093/gigascience/giz087

      Authors: Anne Senabouth (1), Samuel W Lukowski (1), Jose Alquicira Hernandez (1), Stacey Andersen (1), Xin Mei (2), Quan H Nguyen (1), Joseph E Powell (1, 3)

      Affiliations: (1) Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia; (2) South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, China; (3) Queensland Brain Institute, University of Queensland, Brisbane, Australia

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz087 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101872

    1. Background

      Reviewer 2. Jianbo Jian In this manuscript, Xi et al. report a chromosome-level genome of the common vetch (Vicia sativa) that integrates Oxford Nanopore sequencing, Illumina sequencing, CHiCAGO and Hi-C. Gene annotation and evolutionary analyses were then performed based on the reference genome. These genomic resources are valuable for evolution research, genetic diversity and genomic breeding. I think this manuscript is suitable for publication in Gigabyte. Some minor comments and suggestions follow:

      1) Line numbers are missing from this manuscript, which makes detailed comments inconvenient.

      2) Page 6: “resequenced short-reads” should be “De novo sequencing” or “sequencing”.

      3) The assembled genome size of 1.93 Gb is a little larger than the estimates from flow cytometry (1.77 Gb) and Genomescope (1.61 Gb). There may be some duplicated sequences in this version of the assembly. Redundancy-removal software such as Haplotigs, Purge_dups and so on can deal with this.

      4) For the evaluation of the genome, the LTR Assembly Index (LAI) is suggested for quality assessment.

      5) In Table S2, the mapping rate is very good but the genome coverage is just 76%, which looks a little low. What is the reason?

      6) In Table S4, the gene set was combined by August. However, in the methods, the annotation software is BRAKER v2.1.6.

      Recommendation Minor Revision

      Re-review: The revised manuscript and response are satisfactory. The additional analyses that the authors have performed are correctly structured. The data presented is clear. In my opinion, I recommend accepting this manuscript.

    2. Abstract

      This paper has been published in GigaByte Journal under a CC-BY 4.0 Open Access license (see: https://doi.org/10.46471/gigabyte.38), and the open peer reviews have also been shared under this license. These reviews were as follows.

      Reviewer 1. Jonathan Kreplak Is there sufficient detail in the methods and data-processing steps to allow reproduction?

      No. For "Phylogenetic tree construction and divergence time estimation", 64 single copy orthologs are selected; they should be included in a supplementary table so that the analysis can be fully reproduced. Also, Supplementary Table S9 should relate to fossil calibrations but instead shows chromosome lengths.

      Recommendation: Minor revision

    1. Now published in GigaScience doi: 10.1093/gigascience/giaa083

      Authors: Andre Macedo, Alisson M. Gontijo

      Affiliation: Chronic Diseases Research Center (CEDOC), NOVA Medical School | Faculdade de Ciências Médicas, Universidade Nova de Lisboa, Rua do Instituto Bacteriológico 5, 1150-190, Lisbon, Portugal

      For correspondence: andre.macedo@nms.unl.pt, alisson.gontijo@nms.unl.pt

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaa083 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102344 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102345

    1. Now published in GigaScience doi: 10.1093/gigascience/giz149

      Authors: Michal Stolarczyk (1), Vincent P. Reuter (1), Neal E. Magee (5), Nathan C. Sheffield (1, 2, 3, 4)

      Affiliations: (1) Center for Public Health Genomics, University of Virginia; (2) Department of Public Health Sciences, University of Virginia; (3) Department of Biomedical Engineering, University of Virginia; (4) Department of Biochemistry and Molecular Genetics, University of Virginia; (5) Research Computing, University of Virginia

      For correspondence: nsheffield@virginia.edu

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz149 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102075 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102076 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102077 Reviewer 4: http://dx.doi.org/10.5524/REVIEW.102078

    1. Now published in GigaScience doi: 10.1093/gigascience/giz115

      Authors: Bo Song (1, 2, 3), Yue Song (1, 3, 4), Yuan Fu (1, 2), Elizabeth Balyejusa Kizito (5), Pamela Nahamya Kabod (5), Huan Liu (1, 2, 3), Sandra Ndagire Kamenya (5), Samuel Muthemba (6), Robert Kariba (6), Xiuli Li (1, 2), Sibo Wang (1, 3), Shifeng Cheng (1, 2), Alice Muchugi (6), Ramni Jamnadass (6), Howard-Yana Shapiro (6, 7, 9), Allen Van Deynze (7), Huanming Yang (1, 2), Jian Wang (1, 2), Xun Xu (1, 2, 3), Damaris Achieng Odeny (8), Xin Liu (1, 2, 3)

      Affiliations: (1) BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China; (2) China National GeneBank, BGI-Shenzhen, Jinsha Road, Shenzhen 518120, China; (3) State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen 518083, China; (4) BGI-Qingdao, BGI-Shenzhen, Qingdao, 266555, China; (5) Uganda Christian University, Bishop Tucker Road, Box 4, Mukono, Uganda; (6) African Orphan Crops Consortium, World Agroforestry Centre (ICRAF), Nairobi, Kenya; (7) University of California, 1 Shields Ave, Davis, USA; (8) ICRISAT-Nairobi, Nairobi, Kenya; (9) Mars, Incorporated, USA

      For correspondence (Damaris Achieng Odeny, Xin Liu): liuxin@genomics D.Odney@cigar.org

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz115 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101930 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101931 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101932

    1. Now published in GigaScience doi: 10.1093/gigascience/giz119

      Authors: Sion C. Bayliss, Harry A. Thorpe, Nicola M. Coyle, Samuel K. Sheppard, Edward J. Feil

      Affiliation: The Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY

      For correspondence: s.bayliss@bath.ac.uk

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz119 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101935 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101936 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101937

    1. Now published in GigaScience doi: 10.1093/gigascience/giz100

      Authors: Elena Bushmanova, Dmitry Antipov, Alla Lapidus

      Affiliation: Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz100 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101881 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101882 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101883 Reviewer 4: http://dx.doi.org/10.5524/REVIEW.101884

    1. Now published in GigaScience doi: 10.1093/gigascience/giz105

      Authors: Luca Alessandrì (2), Marco Beccuti (1), Maddalena Arigoni (2), Martina Olivero (3), Greta Romano (1), Gennaro De Libero (4), Luigia Pace (5), Francesca Cordero (1), Raffaele A Calogero (2)

      Affiliations: (1) Department of Computer Sciences, University of Torino, Corso Svizzera 185, Torino, Italy; (2) Department of Molecular Biotechnology and Health Sciences, University of Torino, Via Nizza 52, Torino, Italy; (3) Department of Oncology, University of Torino, SP142, 95, 10060 Candiolo TO, Italy; (4) Department Biomedizin, University of Basel, Hebelstrasse 20, 4031 Basel, Switzerland; (5) IIGM, Via Nizza 52, Torino, Italy

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz105 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101893 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101894

    1. Now published in GigaScience doi: 10.1093/gigascience/giz106

      Authors: Yingxin Lin (1), Shila Ghazanfar (1, 2), Dario Strbenac (1), Andy Wang (1, 3), Ellis Patrick (1, 4), Dave Lin (5), Terence Speed (6, 7), Jean YH Yang (1, 2), Pengyi Yang (1, 2)

      Affiliations: (1) School of Mathematics and Statistics, University of Sydney, NSW 2006, Australia; (2) Charles Perkins Centre, University of Sydney, NSW 2006, Australia; (3) Sydney Medical School, University of Sydney, NSW 2006, Australia; (4) Westmead Institute for Medical Research, University of Sydney, Westmead, NSW 2145, Australia; (5) Department of Biomedical Sciences, Cornell University, Ithaca, NY, 14853, USA; (6) Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, VIC 3052, Australia; (7) Department of Mathematics and Statistics, University of Melbourne, Melbourne, VIC 3010, Australia

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz106 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101910 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101911

    1. Now published in GigaScience doi: 10.1093/gigascience/giz107

      Authors: Thomas P. Quinn (1, 2), Ionas Erb (3), Greg Gloor (4), Cedric Notredame (3), Mark F. Richardson (1, 5, 6), Tamsyn M. Crowley (7)

      Affiliations: (1) Bioinformatics Core Research Group, Deakin University, 3220, Geelong, Australia; (2) Centre for Molecular and Medical Research, Deakin University, 3220, Geelong, Australia; (3) Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr Aiguader 88, 08003, Barcelona, Spain; (4) Department of Biochemistry, University of Western Ontario, London, Ontario, Canada; (5) Genomics Centre, School of Life and Environmental Sciences, Deakin University, 3220, Geelong, Australia; (6) Centre for Integrative Ecology, School of Life and Environmental Sciences, Deakin University, 3220, Geelong, Australia; (7) Poultry Hub Australia, University of New England, 2351, Armidale, Australia

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz107 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101917 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101918

    1. Now published in GigaScience doi: 10.1093/gigascience/giz094

      Authors: Marek Wiewiórka, Agnieszka Szmurło, Tomasz Gambin

      Affiliation: Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz094 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101847 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101848 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101849

    1. Now published in GigaScience doi: 10.1093/gigascience/giaa097

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaa097 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102383

    1. Now published in GigaScience doi: 10.1093/gigascience/giz104

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz104 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101860 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101861 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101862

    1. Now published in GigaScience doi: 10.1093/gigascience/giz091

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giz091 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101873 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101874 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101875

    1. Now published in GigaScience doi: 10.1093/gigascience/gix084. Authors: Jonathan A. Atkinson (1), Guillaume Lobet (2, 3), Manuel Noll (4), Patrick E. Meyer (4), Marcus Griffiths (1), Darren M. Wells (1). Affiliations: 1 Centre for Plant Integrative Biology, School of Biosciences, University of Nottingham, United Kingdom; 2 Agrosphere, IBG3, Forschungszentrum Jülich, Jülich, Germany; 3 Earth and Life Institute, Université catholique de Louvain, Louvain-la-Neuve, Belgium; 4 InBios, Université de Liège, Liège, Belgium.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/gix084 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.100810 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.100811

    1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/gix100 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.100867 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.100868

    1. Leaf shape among

      Reviewer 2. Tim Dickinson.

      Review of GIGA-D-16-00069, "Morphometric analysis of Passiflora leaves I: the relationship between landmarks of the vasculature and elliptical Fourier descriptors of the blade" by D.H. Chitwood and W.C. Otoni.

      This is a valuable contribution, both in terms of (a) the data collected (leaf outlines and landmarks from a large sample of Passiflora species exhibiting a wide range of leaf shapes), and (b) the well thought out analyses undertaken with care and, for the most part, attention to detail. The paper also showcases, at least for me, the range of tools now available for comparative biologists and others who seek to analyze patterns of morphological variation and ontogenetic change. Until relatively recently it appeared that there was no real replacement for MorphoSys (Meacham, 1993) and the capabilities of this MS-DOS based system for outline capture and measurement. In many ways there still isn't, leaving workers to pick through an assortment of programs with more limited scopes or (better) to learn to use the ImageJ toolbox effectively (Schneider, Rasband, and Eliceiri, 2012), or SHAPE (Iwata and Ukai, 2002) if only outline data are needed. In the meantime, adepts from the morphometrics world have embraced R (R Core Team, 2016) as a data analysis and graphics environment, resulting in packages like shapes (Dryden, 2016) and Momocs (Bonhomme et al., 2014). Others (e.g. Jensen, Ciofani, and Miramontes, 2002), myself included, have compared results from different analytical methods based, variously, on landmark and outline data from a common sample of study objects.

      The authors' correlation analysis (Fig. 7) is especially valuable (and as far as I know entirely original) for the way it enables the authors to infer connections between particular landmarks (their x or y coordinates) and particular elliptic Fourier coefficients (A, B, C, or D) for a given harmonic. Although they tend to contrast their landmark data as reflecting leaf vasculature features (because these provide the locations of several of their landmarks), as against overall shape features provided by their outline data, the authors also emphasize the complementarity of these two aspects of their study objects (p. 5). In their subsequent ms (GIGA-D-16-00070) the authors are more adroit in emphasizing this complementarity. Here, they occasionally sound as if they think these associations are somehow intrinsic in these data sources, as if they would obtain even in objects other than leaves (pp. 2, 3, 6). It may be that what's at issue here is just unnecessarily bringing forward interpretations of their results (as if they described the results) before those results have been fully presented.

      I have three concerns about this ms. First, the authors have been variously careless about details of spelling ("fratal," "heteroblatsy") and the agreement between verb and noun ("data" is a plural noun) and somewhat cavalier about explaining terms when first introduced ("eigenleaf"; "amplification factor," in the caption to Fig. 1).

      Second, basic details of the authors' sampling are unclear, probably because they are so abundantly obvious to them. What is the actual sample size for their study? They refer (pp. 2, 4, 5, 10) to having analyzed more than 3,300 leaves, and (on p. 5) to making available a dataset comprising "...555 scans of leaves from 40 different species of Passiflora..." Presumably, the data released represent a subset of the total sample, but it would help to make this, and their overall sampling strategy, explicit by tabulating how many vines (with what numbers of leaves) were studied for each of the 40 species. These details are important in order to dispel any idea that their discriminant analyses are overfitted because of the large numbers of descriptors (30, 80, or 110) and binary-valued dummy variables for species and leaf position (39, 10). Gittins' monograph on canonical analysis (Gittins, 1985) references simulation studies suggesting that upwards of 20 times as many study objects as descriptors (measured, plus those designating groups) are needed in order for an analysis to be anything other than a deterministic description of a particular dataset. One of the exciting aspects of this study is the refreshingly large sample size that appears to have been used.
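
      The sample-size concern in the preceding paragraph can be made concrete with a little arithmetic. The sketch below is illustrative only. It assumes the rule of thumb is read as a flat minimum of 20 study objects per variable, that each descriptor set (30, 80, or 110) is paired with one set of dummy variables (39 for species or 10 for leaf position), and that the sample is exactly 3,300 leaves. The function name and this pairing are assumptions made for illustration, not details taken from the manuscript.

```python
# Rule-of-thumb check (assumed reading: at least 20 study objects per variable,
# counting both measured descriptors and group dummy variables).
RATIO = 20
N_LEAVES = 3300                      # "more than 3,300 leaves", taken as a floor
DESCRIPTOR_SETS = (30, 80, 110)      # descriptor counts quoted in the review
DUMMY_SETS = {"species": 39, "leaf position": 10}


def required_objects(n_descriptors, n_dummies, ratio=RATIO):
    """Minimum number of study objects under the flat ratio rule."""
    return ratio * (n_descriptors + n_dummies)


# Requirement for every (grouping, descriptor set) pairing.
requirements = {
    (label, d): required_objects(d, k)
    for label, k in DUMMY_SETS.items()
    for d in DESCRIPTOR_SETS
}

for (label, d), needed in sorted(requirements.items()):
    status = "met" if needed <= N_LEAVES else "not met"
    print(f"{label:13s} {d:3d} descriptors: need >= {needed:4d} objects ({status})")
```

      Under these assumptions the most demanding pairing (110 descriptors plus 39 species dummies) calls for 2,980 objects, which a sample of 3,300 leaves would satisfy. Since the rule is stated as "upwards of" 20, this is a lower bound only, and a per-species tabulation would still be needed to confirm the actual counts.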

      The third, related, concern is that the way in which leaf position on individual vines was recorded is unclear, given that nothing seems to be said about whether the same number of leaves was produced on all vines of all species over the period in which the study material was grown. Numbering leaves from the youngest leaf at the tip of the shoot, to the base, suggests that the youngest leaf would be numbered 1, the next 2, and so on to N, the most basal leaf on the vine. If the total number of leaves varies from vine to vine then there won't be an exact homology between leaf positions. Figures 5 and 6 suggest that leaf position 1 (basal-most, left-most, respectively) is in fact the most basal leaf. Would this position in fact correspond to the cotyledons, or to the first post-cotyledonary leaf? Clarification would be helpful. I also wonder about the continued reference to "node position" and "heteroblastic node position" as well as, in one case, "shoot position." I suggest that a more transparent usage would be to refer throughout to leaf position (numbered from the most basal leaf), or to the position of a leaf on a shoot. Reference to "heteroblastic node position" seems completely unnecessary, since heteroblasty is an emergent property of some or all of the shoots studied, as seen from the way in which leaves vary over the course of shoot development in shape (or some other property) from the most basal to the most apical.
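
      The homology problem with tip-first numbering can be shown with a small conversion. This is a minimal sketch: the function name and the two vines with 10 and 7 leaves are hypothetical, chosen only to show that the same tip-first number lands on different base-first positions when leaf counts differ.

```python
def position_from_base(position_from_tip, n_leaves):
    """Convert a tip-first leaf number (1 = youngest leaf at the shoot tip)
    to a base-first number (1 = most basal leaf) for a vine with n_leaves."""
    if not 1 <= position_from_tip <= n_leaves:
        raise ValueError("position must lie between 1 and n_leaves")
    return n_leaves - position_from_tip + 1


# Two hypothetical vines that produced different numbers of leaves.
vines = {"vine_a": 10, "vine_b": 7}

# Base-first position of the leaf numbered 1 from the tip on each vine.
tip_leaf_base_position = {
    name: position_from_base(1, n_leaves) for name, n_leaves in vines.items()
}
print(tip_leaf_base_position)
```

      Here the leaf numbered 1 from the tip is the tenth leaf from the base on one vine and the seventh on the other, so tip-first numbers are comparable across vines only when leaf counts are equal.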

      BONHOMME, V., S. PICQ, C. GAUCHEREL & J. CLAUDE. 2014. Momocs: Outline Analysis Using R. Journal of Statistical Software 56: 1-24.

      DRYDEN, I. L. 2016. shapes package, version 1.1-13. website: https://www.maths.nottingham.ac.uk/personal/ild/shapes/index.html.

      GITTINS, R. 1985. Canonical Analysis. Springer-Verlag, Berlin.

      IWATA, H. & Y. UKAI. 2002. SHAPE: A computer program package for quantitative evaluation of biological shapes based on elliptic Fourier descriptors. Journal of Heredity 93: 384-385.

      JENSEN, R. J., K. M. CIOFANI & L. C. MIRAMONTES. 2002. Lines, outlines, and landmarks: morphometric analyses of leaves of Acer rubrum, Acer saccharinum (Aceraceae) and their hybrid. Taxon 51: 475-492.

      MEACHAM, C. A. 1993. MorphoSys An interactive machine vision program for acquisition of morphometric data. In R. Fortuner [ed.], Advances in computer methods for systematic biology: Artificial intelligence, databases, computer vision, 393-402.

      R CORE TEAM. 2016. R: A language and environment for statistical computing. website: http://www.R-project.org.

      SCHNEIDER, C. A., W. S. RASBAND & K. W. ELICEIRI. 2012. NIH Image to ImageJ: 25 years of image analysis. Nature Methods 9: 671-675.

      Re-review: The authors have responded to my comments, and for the most part have satisfied my concerns. In this regard, the copy of the original ms marked up to show the changes the authors have made is extremely helpful. The material on GitHub consists of the data files and R code used to produce the figures, and does not appear to provide direct access to the original leaf spectra that would show the variation observed in outline and landmark position. For the publication, I trust that the journal will require that the raw data referenced as "44. Giga DB reference" will consist of a series of files, e.g. one for each species, comprising complete leaf spectra, with the leaves labeled acropetally, i.e. in the heteroblastic sequence from base to tip, for each plant referenced in Figure S1A. If these are provided as vector files (outlines plus landmarks), this data repository should not be excessively large. The vector format will also make it possible for the interested reader to zoom in and out without loss of resolution in order to examine the patterns of variation at different scales.

    2. BACKGROUND

      Reviewer 1. Christopher Jiggins. This paper assesses leaf shape variation in Passiflora using two different morphometric methods. I will say from the start that I am not an expert in these techniques and am not in a position to review the methods or statistics in this paper, so my comments are somewhat superficial and the paper should also be reviewed by a morphometrics/statistical expert.

      Background: I'd like to see a bit more justification for the study. Is this aimed at developing automated methods for species identification, or at quantifying ecological communities and their diversity of leaf shape? Or is it purely an assessment of the methods for shape quantification (Passiflora have such dramatic shape variation that it should be an easy test case)? The analysis considers 'species' and developmental stage ('heteroblasty') separately, as independent factors. However, I think this is the wrong approach, because the plasticity of leaves is highly species dependent. In analysing plasticity, species should therefore be included in the model, so that the interaction of stage and species is considered. It is not really surprising that, when considered independently, leaf position is not a very good predictor of leaf shape, because the developmental process is different in each species. Alternatively, the leaf position analysis could be conducted on each species separately, if the data are sufficient.
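
      The point about the stage-by-species interaction can be illustrated with a toy calculation. The sketch below uses entirely synthetic numbers, not data from the manuscript: two hypothetical species whose single "shape score" changes with leaf position in opposite directions. A hand-written least-squares slope is fitted to each species separately and then to the pooled data with species ignored.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic example: leaf positions 1..10 on one shoot per species.
positions = list(range(1, 11))
shape_a = [0.5 * p for p in positions]    # hypothetical species A: score rises
shape_b = [-0.5 * p for p in positions]   # hypothetical species B: score falls

slope_a = ols_slope(positions, shape_a)
slope_b = ols_slope(positions, shape_b)

# Pooled fit that ignores species altogether.
pooled_slope = ols_slope(positions + positions, shape_a + shape_b)

print(f"species A slope: {slope_a:+.2f}")
print(f"species B slope: {slope_b:+.2f}")
print(f"pooled slope:    {pooled_slope:+.2f}")
```

      In this constructed case each species shows a clear position effect, while the pooled fit shows none, which is the pattern one would expect if the effect of leaf position depends on species and the interaction is left out of the model.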

      It is interesting that more juvenile leaves show less species-specificity as detected in the dataset - anecdotally I have noticed the same thing.

      Discussion: Are there any ecological implications of this work? Are the species sympatric, or where do they come from? I'd be interested to know about disparity in leaf shape among sympatric species, and whether this might be greater than expected by chance, as would perhaps be predicted by the Heliconius coevolution hypothesis. It would be interesting to think further about which ecological questions could be addressed with these methods.

      Page 3, Line 21: Refer to the recent paper on shape learning: Dell'Aglio, D. D., Losada, M. E. & Jiggins, C. D. Butterfly Learning and the Diversification of Plant Leaf Shape. Front. Ecol. Evol. 4, 81 (2016).

    3. Abstract

      A version of this preprint has been published in the Open Access journal GigaScience (see paper doi: 10.1093/gigascience/giw008), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      There is also a Q&A with the author here: http://gigasciencejournal.com/blog/a-passion-for-morphing-passiflora-leaves-author-qa-with-daniel-chitwood/

    1. ABSTRACT

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaa147), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows: Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102559 and http://dx.doi.org/10.5524/REVIEW.102558 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102557

    1. Abstract

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab074 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102883 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102884 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102885

    1. Abstract

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab061 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102870 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102871

    1. https://doi.org/10.1101/2020.06.06.138248

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab066 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102862 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102863

    1. Abstract

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab052 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102790 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102791 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102792

    1. ABSTRACT

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab039 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102755 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102753

    1. Now published in GigaScience doi: 10.1093/gigascience/giab033. Authors: Colin Farrell (1), Michael Thompson (2), Anela Tosevska (2), Adewale Oyetunde (2), Matteo Pellegrini (2). Affiliations: 1 Department of Human Genetics, University of California, Los Angeles, CA, USA; 2 Department of Molecular, Cell and Developmental Biology, University of California, Los Angeles, CA, USA. For correspondence: matteop@mcdb.ucla.edu

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab033 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102730 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102731

    1. Now published in GigaScience doi: 10.1093/gigascience/giab029. Authors: Lucie A. Bergeron (1), Søren Besenbacher (2), Jiao Zheng (4, 5), Panyi Li (4), George Pacheco (6), Mikkel-Holger S. Sinding (7, 8), Maria Kamilari (1), M. Thomas P. Gilbert (6, 9), Mikkel H. Schierup (10), Guojie Zhang (1, 4, 11, 12). Affiliations: 1 Section for Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen, Denmark; 2 Department of Molecular Medicine, Aarhus University, Aarhus, Denmark; 4 BGI-Shenzhen, Shenzhen 518083, Guangdong, China; 5 BGI Education Center, University of Chinese Academy of Sciences, Shenzhen 518083, Guangdong, China; 6 Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark; 7 Trinity College Dublin, Dublin, Ireland; 8 Greenland Institute of Natural Resources, Nuuk, Greenland; 9 Department of Natural History, NTNU University Museum, Norwegian University of Science and Technology (NTNU), NO-7491 Trondheim, Norway; 10 Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark; 11 State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; 12 Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China. For correspondence: guojie.zhang@bio.ku.dk lucie.a.bergeron@gmail.com

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab029 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102718 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102719

    1. Now published in GigaScience doi: 10.1093/gigascience/giab032. Authors: Guy M. Hagen (1), Justin Bendesky (1), Rosa Machado (1), Tram-Anh Nguyen (2), Tanmay Kumar (3), Jonathan Ventura (3). Affiliations: 1 UCCS BioFrontiers Center, University of Colorado at Colorado Springs, 1420 Austin Bluffs Parkway, Colorado Springs, Colorado, 80918; 2 George Mason University, 4400 University Drive, Fairfax, Virginia, 22030; 3 Department of Computer Science and Software Engineering, California Polytechnic State University, San Luis Obispo, California, 93407. For correspondence: ghagen@uccs.edu

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab032 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102721 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102722

    1. Now published in GigaScience doi: 10.1093/gigascience/giab022. Authors: Mikhail G. Dozmorov (2, 3), Katarzyna M. Tyc (4), Nathan C. Sheffield (5), David C. Boyd (3, 6), Amy L. Olex (7), Jason Reed (4, 8), J. Chuck Harrell (3). Affiliations: 2 Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, 23298, USA; 3 Department of Pathology, Virginia Commonwealth University, Richmond, VA, 23284, USA; 4 VCU Massey Cancer Center, Richmond, VA, 23298, USA; 5 Center for Public Health Genomics, University of Virginia, Charlottesville, VA, 22908, USA; 6 Integrative Life Sciences Doctoral Program, Virginia Commonwealth University, Richmond, VA, 23298, USA; 7 C. Kenneth and Dianne Wright Center for Clinical and Translational Research, Virginia Commonwealth University, Richmond, VA, 23298, USA; 8 Department of Physics, Virginia Commonwealth University, Richmond, VA, 23220, USA. For correspondence: Mikhail.Dozmorov@vcuhealth.org Joshua.Harrell@vcuhealth.org

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab022 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102709 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102710

    1. Now published in GigaScience doi: 10.1093/gigascience/giab023. Authors: Jeffrey B.S. Gaither, Grant E. Lammi, James L. Li, David M. Gordon, Harkness C. Kuck, Benjamin J. Kelly, James R. Fitch, Peter White. Affiliation (all authors): Computational Genomics Group, The Institute for Genomic Medicine, Nationwide Children’s Hospital, Columbus, Ohio, USA. For correspondence: peter.white@nationwidechildrens.org

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab023 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102700 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102701

    1. Now published in GigaScience doi: 10.1093/gigascience/giab020. Authors: Carolina Peñaloza (1), Alejandro P. Gutierrez (1, 2), Lel Eory (1), Shan Wang (3), Ximing Guo (3), Alan L. Archibald (1), Tim P. Bean (1), Ross D. Houston (1). Affiliations: 1 The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Midlothian EH25 9RG, UK; 2 Institute of Aquaculture, Faculty of Natural Sciences, University of Stirling, Stirling FK9 4LA, UK; 3 Haskin Shellfish Research Laboratory, Department of Marine and Coastal Sciences, Rutgers University, 6959 Miller Avenue, Port Norris, NJ 08349, USA. For correspondence: ross.houston@roslin.ed.ac.uk

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab020 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102693 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102694

    1. Now published in GigaScience doi: 10.1093/gigascience/giab018. Authors: Ben Blamey (1), Salman Toor (1), Martin Dahlö (2, 3), Håkan Wieslander (1), Philip J Harrison (2, 3), Ida-Maria Sintorn (1, 3, 4), Alan Sabirsh (5), Carolina Wählby (1, 3), Ola Spjuth (2, 3), Andreas Hellander (1). Affiliations: 1 Department of Information Technology, Uppsala University, Sweden; 2 Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Sweden; 3 Science for Life Laboratory, Uppsala University, Stockholm, Sweden; 4 Vironova AB, Stockholm, Sweden; 5 Advanced Drug Delivery, Pharmaceutical Sciences, R&D, AstraZeneca, Gothenburg, Sweden. For correspondence: ben.blamey@it.uu.se

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab018 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102684 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102685 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102686

    1. Now published in GigaScience doi: 10.1093/gigascience/giab017. Authors: Anton Zamyatin (1), Pavel Avdeyev (2, 3), Jiangtao Liang (4, 5), Atashi Sharma (5, 6), Chujia Chen (5, 6), Varvara Lukyanchikova (4, 5, 7), Nikita Alexeev (1), Zhijian Tu (5, 6), Max A. Alekseyev (2, 3, 8), Igor V. Sharakhov (4, 5). For correspondence: igor@vt.edu (Igor V. Sharakhov). Affiliations: (1) Computer Technologies Laboratory, ITMO University, Saint Petersburg, Russia; (2) Department of Mathematics, The George Washington University, Washington, DC, USA; (3) Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA; (4) Department of Entomology, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA; (5) Fralin Life Sciences Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA; (6) Department of Biochemistry, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA; (7) Institute of Cytology and Genetics, the Siberian Division of the Russian Academy of Sciences, Novosibirsk, Russia; (8) Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab017 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102680 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102681

    1. Now published in GigaScience doi: 10.1093/gigascience/giab011. Authors: Holly C. Beale (1), Jacquelyn M. Roger (2), Matthew A. Cattle (2), Liam T. McKay (2), Drew K. A. Thomson (2), Katrina Learned (3), A. Geoffrey Lyle (1), Ellen T. Kephart (3), Rob Currie (3), Du Linh Lam (3), Lauren Sanders (1), Jacob Pfeil (3), John Vivian (3), Isabel Bjork (3), Sofie R. Salama (4), David Haussler (4), Olena M. Vaske (1). For correspondence: hcbeale@ucsc.edu (Holly C. Beale). Affiliations: (1) UC Santa Cruz Molecular, Cell and Developmental Biology; UC Santa Cruz Genomics Institute. (2) UC Santa Cruz School of Engineering. (3) UC Santa Cruz Genomics Institute. (4) Dept. of Biomolecular Engineering, UC Santa Cruz Genomics Institute, Howard Hughes Medical Institute.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab011 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102677 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102678

    1. Now published in GigaScience doi: 10.1093/gigascience/giab007. Authors: James K. Bonfield (1), John Marshall (2), Petr Danecek (1), Heng Li (3), Valeriu Ohan (4), Andrew Whitwham (4), Thomas Keane (5), Robert M. Davies (4). For correspondence: aw7@sanger.ac.uk (Andrew Whitwham). Affiliations: (1) Wellcome Trust Sanger Institute; (2) University of Glasgow, Institute of Cancer Sciences; (3) Dana-Farber Cancer Institute and Harvard University; (4) Wellcome Sanger Institute; (5) EMBL-EBI.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab007 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102657 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102658

    1. Now published in GigaScience doi: 10.1093/gigascience/giab005. Authors: Ruichen Rong (1), Shuang Jiang (2), Lin Xu (1), Guanghua Xiao (1, 3, 4), Yang Xie (1, 3, 4), Dajiang J. Liu (5, 6), Qiwei Li (7), Xiaowei Zhan (1, 8). For correspondence: Xiaowei.Zhan@UTSouthwestern.edu (Xiaowei Zhan). Affiliations: (1) Quantitative Biomedical Research Center, Department of Population and Data Sciences, University of Texas Southwestern Medical Center, Dallas, Texas, 75390, USA; (2) Department of Statistical Sciences, Southern Methodist University, Dallas, TX 75275, USA; (3) Harold C. Simmons Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, 75390, USA; (4) Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, 75390, USA; (5) Institute for Personalized Medicine, College of Medicine, Pennsylvania State University, Hershey, Pennsylvania, 17033, USA; (6) Division of Biostatistics, Department of Public Health Sciences, College of Medicine, Pennsylvania State University, Hershey, Pennsylvania, 17033, USA; (7) Department of Mathematical Sciences, The University of Texas at Dallas, Dallas, Texas, 75080, USA; (8) Center for the Genetics of Host Defense, University of Texas Southwestern Medical Center, Dallas, Texas, 75390, USA.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab005 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102648 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102649

    1. Now published in GigaScience doi: 10.1093/gigascience/giaa162. Authors: Xinying Fang (1), Yu Liu (2), Zhijie Ren (3), Yuheng Du (1), Qianhui Huang (1), Lana X. Garmire (2). For correspondence: lgarmire@med.umich.edu (Lana X. Garmire). Affiliations: (1) Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, MI, USA; (2) Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; (3) Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaa162 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102617 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102618

    1. Now published in GigaScience doi: 10.1093/gigascience/giaa155. Authors: Nikhil Bhagwat (1), Amadou Barry (2), Erin W. Dickie (3), Shawn T. Brown (1), Gabriel A. Devenyi (4, 5), Koji Hatano (1), Elizabeth DuPre (1), Alain Dagher (1), M. Mallar Chakravarty (4, 5, 10), Celia M. T. Greenwood (2, 8, 9), Bratislav Misic (1), David N. Kennedy (7), Jean-Baptiste Poline (1, 6, 8). For correspondence (listed for both Nikhil Bhagwat and Jean-Baptiste Poline): nikhil153@gmail.com, jean-baptiste.poline@mcgill.ca. Affiliations: (1) Montreal Neurological Institute Hospital, McGill University, Montreal, QC, Canada. (2) Lady Davis Institute for Medical Research, McGill University, Montreal, QC, Canada. (3) Kimel Family Translational Imaging-Genetics Research Lab, CAMH, Toronto, ON, Canada. (4) Computational Brain Anatomy Laboratory, Douglas Mental Health Institute, Verdun, QC, Canada. (5) Department of Psychiatry, McGill University, Montreal, QC, Canada. (6) Department of Neurology and Neurosurgery, McGill University, Montreal, QC, Canada. (7) Child and Adolescent Neurodevelopment Initiative, University of Massachusetts, Worcester, USA. (8) Ludmer Centre for Neuroinformatics and Mental Health, McGill University, Montreal, QC, Canada. (9) Gerald Bronfman Department of Oncology; Department of Epidemiology, Biostatistics and Occupational Health; Department of Human Genetics, McGill University, Montreal, QC, Canada. (10) Department of Biomedical Engineering, McGill University.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaa155 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102614 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102615

    1. Abstract

      A version of this preprint has been published in the Open Access journal GigaScience (see paper [insert DOI here]), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102603 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102604

    1. Now published in GigaScience doi: 10.1093/gigascience/giaa154. Authors: Satria A. Kautsar (1), Justin J. J. van der Hooft (1), Dick de Ridder (1), Marnix H. Medema (1). For correspondence: marnix.medema@wur.nl (Marnix H. Medema). Affiliations: (1) Bioinformatics Group, Wageningen University, the Netherlands.

      A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaa154 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

      These peer reviews were as follows:

      Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102605 Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102606 Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102607